Open Research Online The Information Geometry of Sparse Goodness-of-Fit Testing

: This paper takes an information-geometric approach to the challenging issue of goodness-of-fit testing in the high dimensional, low sample size context where—potentially—boundary effects dominate. The main contributions of this paper are threefold: ﬁrst, we present and prove two new theorems on the behaviour of commonly used test statistics in this context; second, we investigate—in the novel environment of the extended multinomial model—the links between information geometry-based divergences and standard goodness-of-ﬁt statistics, allowing us to formalise relationships which have been missing in the literature; ﬁnally, we use simulation studies to validate and illustrate our theoretical results and to explore currently open research questions about the way that discretisation effects can dominate sampling distributions near the boundary. Novelly accommodating these discretisation effects contrasts sharply with the essentially continuous approach of skewness and other corrections ﬂowing from standard higher-order asymptotic analysis.


Introduction
We start by emphasising the threefold achievements of this paper, spelled out in detail in terms of the paper's section structure below. First, we present and prove two new theorems on the behaviour of some standard goodness-of-fit statistics in the high dimensional, low sample size context, focusing on behaviour "near the boundary" of the extended multinomial family. We also comment on the methods of proof which allow explicit calculations of higher order moments in this context. Second, working again explicitly in the extended multinomial context, we fill a hole in the literature by linking information-geometric-based divergences and standard goodness-of-fit statistics. Finally, we use simulation studies to explore discretisation effects that can dominate sampling distributions "near the boundary". Indeed, we illustrate and explore how-in the high dimensional, low sample size context-all distributions are affected by boundary effects. We also use these simulation results to explore currently open research questions. As can be seen, the overarching theme is the importance of working in the geometry of the extended exponential family [1], rather than the traditional manifold-based structure of information geometry.
In more detail, the paper extends and builds on the results of [2], and we use notation and definitions consistently across these two papers. Both papers investigate the issue of goodness-of-fit testing in the high dimensional sparse extended multinomial context, using the tools of Computational Information Geometry (CIG) [1]. Section 2 gives formal proofs of two results, Theorems 1 and 2, which were announced in [2]. These results explore the sampling performance of standard goodness-of-fit statistics-Wald, Pearson's χ 2 , score and deviance-in the sparse setting. In particular, they look at the case where the data generation process is "close to the boundary" of the parameter space where one or more cell probabilities vanish. This complements results in much of the literature, where the centre of the parameter space-i.e., the uniform distribution-is often the focus of attention.
Section 3 starts with a review of the links between Information Geometry (IG) [3] and goodness-of-fit testing. In particular, it looks at the power family of Cressie and Read [4,5] in terms of the geometric theory of divergences. In the case of regular exponential families, these links have been well-explored in the literature [6], as has the corresponding sampling behaviour [7]. What is novel here is the exploration of the geometry with respect to the closure of the exponential family; i.e., the extended multinomial model-a key tool in CIG. We illustrate how the boundary can dominate the statistical properties in ways that are surprising compared to standard-and even high-order-analyses, which are asymptotic in sample size.
Through simulation experiments, Section 4 explores the consequences of working in the sparse multinomial setting, with the design of the numerical experiments being inspired by the information geometry.

Sampling Distributions in the Sparse Case
One of the first major impacts that information geometry had on statistical practice was through the geometric analysis of higher order asymptotic theory (e.g., [8,9]). Geometric interpretations and invariant expressions of terms in the higher order corrections to approximations of sampling distributions are a good example, [8] (Chapter 4). Geometric terms are used to correct for skewness and other higher order moment (cumulant) issues in the sampling distributions. However, these correction terms grow very large near the boundary [1,10]. Since this region plays a key role in modelling in the sparse setting-the maximum likelihood estimator (MLE) often being on the boundary-extensions to the classical theory are needed. This paper, together with [2], start such a development. This work is related to similar ideas in categorical, (hierarchical) log-linear, and graphical models [1,[11][12][13]. As stated in [13], "their statistical properties under sparse settings are still very poorly understood. As a result, analysis of such data remains exceptionally difficult".
In this section we show why the Wald-equivalently, the Pearson χ 2 and score statistics-are unworkable when near the boundary of the extended multinomial model, but that the deviance has a simple, accurate, and tractable sampling distribution-even for moderate sample sizes. We also show how the higher moments of the deviance are easily computable, in principle allowing for higher order adjustments. However, we also make some observations about the appropriateness of these classical adjustments in Section 4.
First, we define some notation, consistent with that of [2]. With i ranging over {0, 1, ..., k}, let n = (n i ) ∼ Multinomial (N, (π i )), where here each π i > 0. In this context, the Wald, Pearson's χ 2 , and score statistics all coincide, their common value, W, being Defining π (α) := ∑ i π α i , we note the inequality, for each m ≥ 1, in which equality holds if and only if π i ≡1/(k + 1)-i.e., iff (π i ) is uniform. We then have the following theorem, which establishes that the statistic W is unworkable as π min := min(π i ) → 0 for fixed k and N. Theorem 1. For k > 1 and N ≥ 6, the first three moments of W are: In particular, for fixed k and N, as π min → 0 A detailed proof is found in Appendix A, and we give here an outline of its important features. The machinery developed is capable of delivering much more than a proof of Theorem 1. As indicated there, it provides a generic way to explicitly compute arbitrary moments or mixed moments of multinomial counts, and could in principle be implemented by computer algebra. Overall, there are four stages. First, a key recurrence relation is established; secondly, it is exploited to deliver moments of a single cell count. Third, mixed moments of any order are derived from those of lower order, exploiting a certain functional dependence. Finally, results are combined to find the first three moments of W, higher moments being similarly obtainable.
The practical implication of Theorem 1 is that standard first (and higher-order) asymptotic approximations to the sampling distribution of the Wald, χ 2 , and score statistics break down when the data generation process is "close to" the boundary, where at least one cell probability is zero. This result is qualitatively similar to results in [10], which shows how asymptotic approximations to the distribution of the maximum likelihood estimate fail; for example, in the case of logistic regression, when the boundary is close in terms of distances as defined by the Fisher information.
Unlike statistics considered in Theorem 1, the deviance has a workable distribution in the same limit: that is, for fixed N and k as we approach the boundary of the probability simplex. In sharp contrast to that theorem, we see the very stable and workable behaviour of the k-asymptotic approximation to the distribution of the deviance, in which the number of cells increases without limit.
Define the deviance D via where µ i := E(n i ) = Nπ i . We will exploit the characterisation that the multinomial random vector (n i ) has the same distribution as a vector of independent Poisson random variables conditioned on their sum. Specifically, let the elements of (n * i ) be independently distributed as Poisson Po(µ i ). Then, ). Define the vector where D * is defined implicitly and 0 log 0 := 0. The terms ν, τ, and ρ are defined by the first two moments of S * via the vectors where C i := Cov(n * i , n * i log(n * i /µ i )) and V i := Var(n * i log(n * i /µ i )).
Proof of Theorem 2. In view of (1) and (2), it suffices to show that the first two moments of S * remain bounded as π min → 0. By the Cauchy-Schwarz inequality, this in turn is a direct consequence of the following result.
As a result of Theorem 2, the distribution of the deviance is stable in this limit. Further, as noted in [2], each of ν, τ, and ρ can be easily and accurately approximated by standard truncate and bound methods in the limit as π min → 0. These are detailed in Appendix B.

Divergences and Goodness-of-Fit
The emphasis of this section is the importance of the boundary of the extended multinomial when understanding the links between information geometric divergences and families of goodness-of-fit statistics. For completeness, a set of well-known results linking the Power-Divergence family and information geometry in the manifold sense are surveyed in Sections 3.1-3.3. The extension to the extended multinomial family is discussed in Section 3.4, where we make clear how the global behaviour of divergences is dominated by boundary effects. This complements the usual local analysis, which links divergences with the Fisher information, [8]. Perhaps the key point is that, since counts in the data can be zero, information geometric structures should also allow probabilities to be zero. Hence, closures of exponential families seem to be the correct geometric object to work on.

The Power-Divergence Family
The results of Section 2 concern the boundary behaviour of two important members of a rich class of goodness-of-fit statistics. An important unifying framework which encompasses these and other important statistics can be found in [5] (page 16) with the so-called Power-Divergence statistics. These are defined, for −∞ < λ < ∞, by 2NI λ n N : π := 2 with the cases λ = −1, 0 being defined by taking the appropriate limit to give Important special cases are shown in Table 1 (whose first column is described below in Section 3.3), and we also note the case λ = 2/3, which Read and Cressie recommend [5] (page 79) as a reasonably robust statistic with an easily calculable critical value for small N. In a sense, it lies "between" the Pearson χ 2 and deviance statistics, which we compared in Section 2.
Freeman-Tukey or Hellinger This paper is primarily concerned with the sparse case where many of the n i counts are zero, and we are also interested in letting probabilities, π i , becoming arbitrarily small, or even zero.

Literature Review
Before we look at this, we briefly review the literature on the geometry of goodness-of-fit statistics. A good source for the historical developments (in the discrete context) can be found in [5] (pages 131-153) and [7]. Important examples include the analysis of contingency tables, log-linear, and discrete graphical models. Testing is often used to check the consistency of a parametric model with given data, and to check dependency assumptions such as independence between categorical variables. However, we note an important caveat: as pointed out by [14,15], the fact that a parametric model "passes" a goodness-of-fit test only weakly constrains the resulting inference. The essential point here is that goodness-of-fit is a necessary, but not sufficient, condition for model choice, since-in general-many models will be empirically supported. This issue has recently been explored geometrically in [16] using CIG.
There have been many possible test statistics proposed for goodness-of-fit testing, and one of the attractions of the Power-Divergence family, defined in (11), is that the most important ones are included in the family and indexed by a single scalar λ. Of course, when there is a choice of test statistic, different inferences can result from different choices. One of the main themes of [5] is to give the analyst insight about selecting a particular λ. Key considerations for making the selection of λ include the tractability of the sampling distribution, its power against important alternatives, and interpretation when hypotheses are rejected.
The first order, asymptotic in N, χ 2 -sampling distribution for all members of the Power-Divergence family, which is appropriate when all observed counts are "large enough", is the most commonly used tool, and a very attractive feature of the family. However, this can fail badly in the "sparse" case and when the model is close to the boundary. Elementary, moment based corrections, to improve small sample performance, are discussed in [5] (Chapter 5). More formal asymptotic approaches to these issues include the doubly asymptotic, in N and k, approach of [17], discussed in Section 2 and similar normal approximation ideas in [18]. See also [19]. Extensive simulation experiments have been undertaken to learn in practice what 'large enough' means, see [5,20,21].
When there are nuisance parameters to be estimated (as is common), [22] points out that it is the sampling distribution conditional upon these estimates which needs to be approximated, and proposes higher order methods based on the Edgeworth expansion. Simulation approaches are often used in the conditional context due to the common intractability of the conditional distribution [23,24], and importance sampling methods play an important role-see [25][26][27]. Other approaches used to investigate the sampling distribution include jackknifing [28], the Chen-Stein method [29], and detailed asymptotic analysis in [30][31][32].
In very high dimensional model spaces, considerations of the power of tests rarely generates uniformly best procedures but, we feel, geometry can be an important tool in understanding the choices that need to be made. Further, [5], states the situation is "complicated", showing this through simulation experiments. One of the reasons for Read and Cressie's preferred choice of λ = 2/3 is its good power against some important types of alternative-the so-called bump or dip cases-as well as the relative tractability of its sampling distribution under the null. Other considerations about power can be found in [33] which looks specifically at mixture model based alternatives.

Links with Information Geometry
At the time that the Power-Divergence family was being examined, there was a parallel development in Information Geometry; oddly, however, it seemed to have taken some time before the links between the two areas were fully recognised. A good treatment of these links can be found in [6] (Chapter 9). Since it is important to understand the extreme values of divergence functions, considerations of convexity can clearly play an important role. The general class of Bregman divergences, [6,34] (page 240), and [35] (page 13) is very useful here. For each Bregman divergence, there will exist affine parameters of the exponential family in which the divergence function is convex. In the class of product Poisson models-which are the key building blocks of log-linear models-all members of the Power-Divergence family have the Bregman property. These are then α-divergences, capable of generating the complete Information Geometry of the model [35], with the link between α and λ given in Table 1. The α-representation highlights the duality properties, which are a cornerstone of Information Geometry, but which is rather hidden in the λ representation. The Bregman divergence representation for the Poisson is given in Table 2. The divergence parameter-in which we have convexity-is shown for each λ, as is the so-called potential function, which generates the complete information geometry for these models.

Extended Multinomial Case
In this paper, we are focusing on the class of log-linear models where the multinomial is the underlying class of distributions; that is, we condition on the sample size, N, being fixed in the product Poisson space. In particular, we focus on extended multinomials, which includes the closure of the multinomials, so we have a boundary. Due to the conditioning (which induces curvature), only the cases where λ = 0, −1 remain Bregman divergences, but all are still divergences in the sense of being Csiszár f -divergences [36,37].
The closure of an exponential family (e.g., [11,[38][39][40]), and its application in the theory of log-linear models has been explored in [12,13,41,42]. The key here is understanding the limiting behaviour in the natural-α = 1 in the sense of [8]-parameter space. This can be done by considering the polar dual [43], or, alternatively, the directions of recession- [12] or [42]. The boundary polytope determines key statistical properties of the model, including the behaviour of the sampling distribution of (functions of) the MLE and the shape of level sets of divergence functions. Figures 1 and 2 show level sets of the α = ±1 Power-Divergences in the (+1)-affine and (−1)-affine parameters (Panels (a) and (b), respectively) for the k = 2 extended multinomial model. The boundary polytope in this case is a simple triangle "at infinity", and the shape of this is strongly reflected in the behaviour of the level sets. In Figure 1, we show-in the simplex (π 0 , π 1 , π 2 )| ∑ 2 i=0 π i = 1, π i ≥ 0 -the level sets of the α = −1 divergence, which, in the Csiszár f -divergence form, is The figures show how in Panel (a), the directions of recession dominate the shape of level sets, and in Panel (b) the duals of these directions (i.e., the vertices of the simplex) each have different maximal behaviour. The lack of convexity of the level sets in Panel (a) corresponds to the fact that the natural parameters are not the affine divergence parameters for this divergence, so we do not expect convex behaviour. In Panel (b), we do get non-convex level sets, as expected. Figure 2 shows the same story, but this time for the dual divergence, Now, the affine divergence parameters are shown in Panel (a), the natural parameters. We see that in the limit the shape of the divergence is converging to that of the polar of the boundary polytope.
In general, local behaviour is quadratic, but boundary behaviour is polygonal.

Simulation Studies
In this section, we undertake simulation studies to numerically explore what has been discussed above. Separate sub-sections address three general topics-focusing on one particular instance of each, as follows: 1. The transition as (N, k) varies between discrete and continuous features of the sampling distributions of goodness-of-fit statistics-focusing on the behaviour of the deviance at the uniform discrete distribution; 2. The comparative behaviour of a range of Power-Divergence statistics-focusing on the relative stability of their sampling distributions near the boundary; 3. The lack of uniformity-across the parameter space-of the finite sample adequacy of standard asymptotic sampling distributions, focusing on testing independence in 2 × 2 contingency tables.
For each topic, the results presented invite further investigation.

Transition Between Discrete and Continuous Features of Sampling Distributions
Earlier work [2] used the decomposition: to show that a particularly bad case for the adequacy of any continuous approximation to the sampling distribution of the deviance D := D * |(N * = N) is the uniform discrete distribution: π i = 1/(k + 1). In this case, the Γ * term contributes a constant to the deviance, while the ∆ * term has no contributions from cells with 0 or 1 observations-these being in the vast majority in the N << k situation considered here. In other words, all of the variability in D comes from that between the n i log n i values for the (relatively rare) cell counts above 1. This gives rise to a discreteness phenomenon termed "granularity" in [2], whose meaning was conveyed graphically there in the case N = 30 and k = 200. Work by Holst [19] predicts that continuous (indeed, normal) approximations will improve with larger values of N/k, as is intuitive. Remarkably, simply doubling the sample size to N = 60 was shown in [2] to be sufficient to give a good enough approximation for most goodness-of-fit testing purposes. In other words, N being 30% of k = 200 was found to be good enough for practical purposes.
Here, we illustrate the role of k-asymptotics (Section 2) in this transition between discrete and continuous features by repeating the above analyses for different values of k. Figures 3 and 4 (where k = 100 while N = 20 and 40, respectively) are qualitatively the same as those presented in [2]. The difference here is that the smaller value of k means that a higher value of N/k (40%) is needed in Figure 4 to adequately remove the granularity evident in Figure 3. For k = 400, the figures with N = 50 and N = 100 (omitted here for brevity) are, again, qualitatively the same as in [2]-the larger value of k needing only a smaller value of N/k (25%) for practical purposes. Note the QQ-plots used in these two figures are relative to normal quantiles. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q In fact, the results show that in the high dimensional, low sample size case, all distributions are "close to" the boundary, and that discretisation effects can dominate.

Comparative Behaviour of Power-Divergence Statistics near the Boundary
Here we study the relative stability-near the boundary of the simplex-of the sampling distributions of a range of Power-Divergence statistics indexed by Amari's parameter α. Figure 5 shows histograms for six different values of α, N = 50, k = 200, and exponentially decreasing values of {π i }, as plotted in Figure 6. In it, red lines depict kernel density estimates using the bandwidth suggested in [44].
These sampling distributions differ markedly. The instability for α = 3 expected from Theorem 1 is clearly visible: very large values contribute to high variance and skewness. Analogous instability features (albeit at a lower level) remain with the Cressie-Read recommended value α = 7/3. In contrast (as expected from the discussion around Theorem 2), the distribution of the deviance (α = 1) is stable and roughly normal. Lower values of α retain these same features.

Variation in Finite Sample Adequacy of Asymptotic Distributions across the Parameter Space
Pearson's χ 2 statistic (α = 3) is widely used to test independence in contingency tables, a standard rule-of-thumb for its validity being that each expected cell frequency should be at least 5. For illustrative purposes, we consider 2 × 2 contingency tables, the relevant N-asymptotic null distribution being χ 2 1 . We assess the adequacy of this asymptotic approximation by comparing nominal and actual significance levels of this test, based on 10,000 replications. Particular interest lies in how these actual levels vary across different data generation processes within the same null hypothesis of independence. Figures 7 and 8 show the actual level of the Pearson χ 2 test for nominal levels 0.1 and 0.05 for sample sizes N = 20 and N = 50, with π r and π c denoting row and column probabilities, respectively. The above general rule applies only at the central black dot in Figure 7, and inside the closed black curved region in Figure 8. The actual level was computed for all pairs of values of π r and π c , averaged using the symmetry of the parameter space, and smoothed using the kernel smoother for irregular 2D data (implemented in the package fields in R). In each case, the white tone contains the nominal level, while red tones correspond to liberal and blue tones to conservative actual levels.
The finite sample adequacy of this standard asymptotic test clearly varies across the parameter space. In particular, its nominal and actual levels agree well at some parameter values outside the standard rule-of-thumb region; and, conversely, disagree somewhat at other parameter values inside it. Intriguingly, the agreement between nominal and actual levels does not improve everywhere with sample size. Overall, the clear patterns evident in this lack of uniformity invite further theoretical investigation.

Discussion
This paper has illustrated the key importance of working with the boundary of the closure of exponential families when studying goodness-of-fit testing in the high dimensional, low sample size context. Some of this work is new (Section 2), while some uses the structure of extended exponential families to add insight to standard results in the literature (Section 3). The last section, Section 4, uses simulation studies to start to explore open questions in this area.
One open question-related to the results of Theorems 1 and 2-is to see if a unified theory, for all values of α, and over large classes of extended exponential families, can be developed.
When there is no risk of confusion, we may abbreviate M(t; N) to M and f N,i (t; r) to f N (r), or even to f (r)-so that (A1) becomes M = f (0). Again, we may write ∂ r M(t; N)/∂t r i as M r , ∂ r+s M(t; N)/∂t r i ∂t s j as M r,s and ∂ r+s+u M(t; N)/∂t r i ∂t s j ∂t u l as M r,s,u , with similar conventions for higher order mixed derivatives.
We can now use this to explicitly calculate low order moments of the count vectors. Using E(n r i ) = ∂ r M(t; N)/∂t r i | t=0 , the first N moments of n i now follow from (A1) and repeated use of (A2), noting that m(0) = 1 and a i (0) = π i .
In particular, the first 6 moments of each n i can be obtained as follows, where N ≥ 6 is assumed. Using (A1) and (A2), we have This can be formalised in the following Lemma Lemma A1. The integer coefficients in any expansion can be computed using c r (1) = c r (r) = 1 together, for r ≥ 3, with the update: We note that if M r is required for r > N, we may repeatedly differentiate . Mixed moments of any order can be derived from those of lower order, exploiting the fact that a i depends on t only via t i . We illustrate this by deriving those required for the second and third moments of W.
First consider the mixed moments required for the second moment of W. Of course, Var(W) = 0 if k = 0. Otherwise, k > 0, and computing Var(W) requires E(n 2 i n 2 j ) for i = j. We find this as follows, assuming N ≥ 4.
The relation M 2 = f (2) + f (1) established above gives Repeated use of (A3) now gives We further look at the mixed moments needed for the third moment of W. For the skewness of W, we need E(n 2 i n 4 j ) for i = j and, when k > 1, E(n 2 i n 2 j n 2 l ) for i, j, l distinct. We find these similarly, as follows, assuming k > 1 and N ≥ 6. Equation (A4) above gives from which, using (A3) repeatedly, we have so that E(n 2 i n 2 j n 2 l ) equals N (6) π 2 i π 2 j π 2 l + N (5) π i π j π l {π i π j + π j π l + π l π i } + N (4) π i π j π l {π i + π j + π l } + N (3) π i π j π l .
Combining above results, we obtain here the first three moments of W. Higher moments may be found similarly.
Note that E[{W − E(W)} 3 ] depends on (π i ) only via π (−1) and the larger quantity π (−2) . In particular, for given k and N, the skewness of W tends to +∞ as one or more π i → 0 + .