1. Introduction
“Since there are no a priori arguments for the choice of a particular distribution, one needs to base the choice, and evaluation by statistical means”.
The problem of testing assumptions about the probability distribution underlying random samples of data is an ongoing area of inquiry in statistics and econometrics. While this literature has expanded substantially since Pearson (1900), unified and generally applicable omnibus methodologies that exhibit substantial power for testing a wide range of distributional hypotheses are lacking, although there have been a number of important developments in this regard (e.g., Bowman and Shenton 1975; Epps and Pulley 1983; Zoubir and Arnold 1996; Doornik and Hansen 2008; Meintanis 2011; Wyłomańska et al. 2020). The literature offers a variety of methods for specific parametric families of probability distributions, which sometimes entail idiosyncratic and complicated regularity conditions, non-standard probability distributions under the null and alternative hypotheses, bootstrapping methodologies, and the specification of tuning parameters. These specialized methods thus often impose method-specific requirements for implementing the testing mechanisms.
In this paper, we introduce a flexible and widely applicable nonparametric entropy test (ET) methodology for assessing the validity of simple hypotheses about a specific population distribution from which observed data are randomly sampled. The ET is completely general and attractive in the sense that it can, in principle, be applied to any functional specification of the population distribution, providing a common unified hypothesis testing framework. In particular, the test assesses the null hypothesis $H_0$: the sample observations are iid draws from $F$, against the general alternative $H_a$: not $H_0$, where the distribution $F$ is specified in the statement of the null hypothesis. The alternative may be true because the sample observations are not iid, or perhaps they are iid but drawn from some distribution different from $F$. This test differs from various "runs" or other types of tests that are designed to test the iid assumption without having to specify a particular null distribution $F$. It also differs in motivation from various goodness-of-fit-type tests, where the iid assumption is maintained under the alternative. Moreover, it differs from the tests mentioned above in terms of the specification and calculation of test statistic outcomes, regularity conditions, and/or distribution of the test statistic under the null and alternative hypotheses.
Regardless of the simple null hypothesis being tested, there is a consistent and familiar asymptotic chi-square distribution for the test statistic under relatively straightforward regularity conditions. Moreover, regardless of the probability distribution being tested, the ET is computationally tractable, and there is a straightforward specification process that can be followed to define and implement the test. This facilitates setting critical values of the test and analyzing size and power characteristics. A limitation of the ET procedure, however, is determining the most efficacious moment-type constraints to produce powerful tests of distributional hypotheses across a wide range of distribution families. This test-specification issue is a challenge for all distributional hypothesis testing procedures and is not unique to our ET. We explore one such moment specification in this paper and discuss this issue further in Section 5. For a discussion of entropy and its applications in econometrics, see Ullah (1996).
Our proposed methodology is based on the well-developed statistical theory and sampling properties of Maximum Entropy (ME), subject to moment condition constraints that relate to the characteristic function (CF) of the population probability distribution being tested. Epps (1993) provides geometrical interpretations and important insights about the usefulness of CFs. In cases where a moment generating function (MGF) exists, it can be used in place of the CF in defining the moment condition constraints. The methodology most closely related to our approach focuses on deviations between hypothesized and empirical characteristic functions (ECFs). In that literature, the ECF is defined and based on the classical empirical distribution function probability weights of $1/n$ (e.g., Epps 2005; Epps and Pulley 1983; Fan 1997; Koutrouvelis 1980; Koutrouvelis and Kellermeier 1981). The approach proposed in this paper is unique, in that it utilizes more general entropy-derived sample probability weights to define an ECF.
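To make the mechanics concrete, the following is a minimal sketch of how entropy-derived probability weights can be computed under a single generic moment constraint. It illustrates the general ME principle rather than the paper's exact formulation: the helper name maxent_weights, the constraint function g, and its target value c are all generic placeholders, and the exponential-tilting dual used to solve the problem is a standard convex duality result, not a construction taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def maxent_weights(g, c):
    """Solve: max_w -sum(w*log(w)) s.t. sum(w) = 1 and sum(w*g) = c.

    The solution is exponentially tilted, w_i proportional to exp(lam*g_i),
    where lam minimizes the convex dual objective below.
    """
    g = np.asarray(g, dtype=float)

    def dual(lam):
        # Stable log-mean-exp of lam*(g - c); its minimizer enforces E_w[g] = c.
        z = lam * (g - c)
        zmax = z.max()
        return np.log(np.mean(np.exp(z - zmax))) + zmax

    lam = minimize_scalar(dual, bounds=(-50.0, 50.0), method="bounded").x
    t = lam * g
    u = np.exp(t - t.max())        # tilted (unnormalized) weights, stable
    return u / u.sum()
```

With such weights in hand, the entropy-derived ECF referred to above is simply $\sum_i w_i e^{itX_i}$, replacing the classical EDF weights of $1/n$.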
1.1. Looking Back
Since the seminal work of Pearson (1900), with his chi-square goodness-of-fit test and its variants (e.g., the Cramér–von Mises test and the Kolmogorov–Smirnov test; see D'Agostino and Stephens (1986) for a comprehensive survey of goodness-of-fit measures), many procedures have been proposed for testing distributional functional forms. For an excellent historical review of tests for normality, see Thode (2002). Shapiro and Brain (1982) provide an extensive review of various types of tests of distributional assumptions developed in earlier years. In the econometrics literature, examples include a test for distributional assumptions developed by Lee (1983) for stochastic frontier functions; a testing procedure for the univariate normal distribution suggested by Bera et al. (1984); tests for the bivariate normal distribution devised by Lee (1984); a test for the count data model by Lee (1986); Meddahi and Bontemps's (2011) test of distributional assumptions based on moment conditions; and the normality test of random variables using the quantile-mean covariance function by Bera et al. (2016). More recently, Amengual et al. (2020) introduced goodness-of-fit tests for parametric distributions, based on the difference between the theoretical and empirical CFs and using regularization methods.
One of the earliest and most well-known nonparametric tests of a distributional hypothesis is the test of independence introduced by Wald and Wolfowitz (1940). The WW test is based on the concept of runs in a sequence of sample observations that are in increasing order of magnitude. The WW test has a familiar chi-square asymptotic distribution, but there is no provision in the test for explicitly assessing the identical distribution assumption; this has been treated most often as simply a maintained hypothesis in the way the test has been empirically applied (see, e.g., Fama 1965). Variations of runs-type tests include the Goodman (1958) simplified runs test and the Cho and White (2011) generalized runs test. Other nonparametric tests, such as the Mann–Kendall (Mann 1945; Kendall 1975) and Bartels (1982) rank-based tests, have been used to detect the presence of trends and other specific departures from iid behavior. However, no single nonparametric test has emerged as the most appropriate and powerful across all sampling situations, and the discovery of such a test is unlikely given the myriad of alternative population distributions that are possible. The ME principle discussed later provides a general framework for assessing a wide array of departures from iid sampling from a specified population distribution.
Some recent works use the concept of permutation entropy (PE) to assess the iid null hypothesis (Matilla-García and Marín 2008; Canovas and Guillamon 2009). The PE test requires that the data series have a natural time-dependent order (time causality). In many econometric applications, this is a natural state for the data and is not a substantive restriction. In addition, the test does not presuppose any model-based assumptions, and it is invariant to any monotonic transformation of the data. The approach is designed for a set of time-series observations on scalars, however, with no explicit direction for how the nonparametric independence test might apply in multivariate contexts.
Hong and White (2005) developed a test of the serial independence of a scalar time series that is based on (regular, as opposed to permutation) entropy concepts. It is asymptotically locally more powerful than Robinson's (1991) smoothed nonparametric modified entropy measure of serial dependence. Their entropy-based iid test relies on Kullback–Leibler divergence (Kullback and Leibler 1951) and on the fundamental principle that a joint density of serial observations factors into the product of its marginal densities if and only if the observations are independent. However, these types of implementations have, to date, involved kernel density estimation methods, with all of the attendant arbitrariness related to choosing kernels and nuisance bandwidth parameters. This means that, for different kernel functional forms and different bandwidth choices, different users of these testing mechanisms can obtain different test outcomes, since the quality of the asymptotic approximation can be affected (Hong and White 2005). In addition, the finite sample level of these tests may differ from the asymptotic level. Thus, as acknowledged by Hong and White (2005, p. 850), asymptotic theory may not work well even for relatively large samples when using these tests. There have been several other nonparametric entropy-based testing procedures (see Matilla-García and Marín 2008), but none of them is uniformly most powerful; some lack associated asymptotic distribution theory, others involve the use of stochastic kernels, and still others have nonstandard limiting distributions.
1.2. Looking Ahead
The test introduced in this paper is based on the classical information theoretic (IT) concept of ME, which places computations in a context that is well-developed and understood in the literature, as well as tractable to implement (see, e.g., Golan 2006; Judge and Mittelhammer 2012). The approach can lead to a rejection of a false hypothesis about random sampling from a population distribution, either because the functional form hypothesized for the population distribution is incorrect, or because the iid assumption is false due to the presence of dependence or non-identical distributions, or both. The null distribution of the proposed testing methodology requires little more than the assumption of iid sampling from the population distribution specified under the null hypothesis, and it does not require more complex regularity conditions or complicated derivations and definitions.
1.3. Structure of the Paper
In Section 2, we introduce and develop the ET. Section 2.1 connects fundamental IT results in an entropy context to previous probabilistic nonparametric frameworks. Section 2.2 presents the general form of the moment-constrained ME problem, whose optimized objective function has an asymptotic chi-square distribution. Using the results from Section 2.1 and Section 2.2, Section 2.3 concludes the section by providing a general representation of the sample moment constraints used in defining the ET. Random sampling evidence is provided in Section 3, both to illustrate the finite sample size and power properties of the proposed new ET approach and to provide a benchmark comparison of the ET to the classical Kolmogorov–Smirnov (KS) nonparametric testing approach (Kolmogorov 1933; Smirnov 1933). Section 4 provides additional guidance on how the ET can be implemented in practice, where the statistical context is that of a general linear model and a composite hypothesis is tested using the ET approach via data transformations. Simulated data are used to assess the performance of the approach in that context, and its performance is compared to that of the KS statistic, using the Lilliefors correction to accommodate estimated values of parameters. All simulations were implemented using the GAUSS Version 21 Matrix Programming Language (Aptech Systems Inc., Higley, AZ, USA). Finally, in Section 5, we provide a summary of relevant findings, corresponding implications, and a discussion of ongoing research needed to extend and refine the methodology introduced in this paper.
3. ETφ Finite Sample Behavior
In this section, we present some Monte Carlo simulation results that illustrate the finite sample behavior of the ET. Note that the proposed testing approach is completely general in the sense that it can be applied to any functional specification of the population distribution. For illustration purposes, we first simulate in Section 3.1 the size and power properties of the ET, where the underlying null hypothesis being tested is that the random sample of data, $X_1, X_2, \ldots, X_n$, is generated from a standard normal distribution, versus alternative distributions in the normal family. In Section 3.2, we examine the power of the ETφ statistic in cases where the true underlying population distribution is not a member of the normal family of distributions. In that section, we also compare the performance of the ETφ statistic to that of the classical Kolmogorov–Smirnov (KS) statistic.
3.1. Size and Power of ETφ for Normal Distributions
In this subsection, our first set of simulations investigates the size and power properties of ETφ, given that the null hypothesis is $H_0: N(0, 1)$, for various mean levels $\mu$, while holding the standard deviation constant at $\sigma = 1$. A second set of simulations is then performed for testing the same null hypothesis for various standard deviation levels $\sigma$, while holding the mean level constant at $\mu = 0$. The alternative hypothesis in both sets of simulations is that the sample of data observations did not arise iid from the standard normal distribution.
Recall that the CF of the normal parametric family of distributions, with mean $\mu$ and variance $\sigma^2$, is given by $\varphi(t) = \exp\left(i\mu t - \tfrac{1}{2}\sigma^2 t^2\right)$. Substituting this CF into (5) leads to the ET optimization problem for the standard normal distribution (i.e., for $\mu = 0$ and $\sigma^2 = 1$), in which the hypothesized CF reduces to $\varphi(t) = e^{-t^2/2}$.
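As an illustration of how such a test could be computed, the following sketch reuses maxent_weights from the sketch in the Introduction. It is a hedged stand-in for the paper's formulation: the single constraint equates the entropy-weighted ECF's real part at one fixed argument t0 (t0 = 1 is an arbitrary illustrative choice) with $\mathrm{Re}\,\varphi(t_0) = e^{-t_0^2/2}$, and the statistic's exponential-tilting normalization $2n \cdot KL(w \,\|\, \text{uniform})$ and its $\chi^2(1)$ calibration are assumptions rather than the paper's exact definitions.

```python
import numpy as np
from scipy.stats import chi2

def et_standard_normal(x, t0=1.0):
    """Illustrative ET-type statistic for H0: x_i ~ iid N(0, 1).

    Constraint: sum_i w_i * cos(t0*x_i) = exp(-t0^2/2), i.e., the real part
    of the entropy-weighted ECF matches the N(0,1) CF at the point t0.
    Statistic: 2n * KL(w || uniform), treated as asymptotically chi-square(1).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    w = maxent_weights(np.cos(t0 * x), np.exp(-t0**2 / 2.0))
    return 2.0 * n * np.sum(w * np.log(n * w))

x = np.random.default_rng(0).normal(size=500)
stat = et_standard_normal(x)
pval = chi2.sf(stat, df=1)   # one moment constraint -> 1 degree of freedom
```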
We underscore that the results presented below are, in principle, relevant to any null hypothesis of the form $H_0: N(\mu_0, \sigma_0^2)$, since the standardized variable $(X - \mu_0)/\sigma_0$ is distributed $N(0, 1)$ under $H_0$.
The size and power properties of the ET were assessed by exploring relationships between the empirical power and nominal (target) test size for varying sample sizes n, and for a conventional fixed nominal type I error probability of $\alpha = 0.05$. In the simulations below, m = 100,000 iid random samples of data were taken from various normal population distributions.
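A scaled-down version of such a size experiment, using the illustrative statistic from the previous sketch (and far fewer replications than the paper's m = 100,000, purely for speed), might look as follows.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
m, n, alpha = 2000, 100, 0.05           # m reduced from 100,000 for speed
crit = chi2.ppf(1 - alpha, df=1)
rej = sum(et_standard_normal(rng.normal(size=n)) > crit for _ in range(m))
print("empirical size:", rej / m)       # should approach alpha as n grows
```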
First, by focusing attention on the mean level $\mu = 0$ with $\sigma = 1$, it is apparent in both Figure 1 and Table A1 (see Appendix A) that, as the sample size n increases, the empirical size of the test converges to the nominal size of $\alpha = 0.05$. This result is to be expected, given the asymptotic theory underlying the ETφ statistic. Moreover, as both $|\mu|$ and n increase, with $\sigma$ held at 1, the power of rejecting the false null hypothesis of standard normality (i.e., for $\mu \neq 0$) increases and converges to 1. This is indicative of the test procedure being consistent.
The probabilities of rejecting $H_0: N(0, 1)$ as the standard deviation of the population distribution deviates from 1 are presented in Figure 2 and Table A2 (see Appendix A). As in the case of the power functions for varying values of the mean $\mu$, the power of rejecting the false null hypothesis of standard normality increases and converges to 1 as both $|\sigma - 1|$ and n increase. This is, again, indicative of the test procedure being consistent.
Overall, in the case of testing for a specific normal distribution, the behavior of the ETφ statistic suggests that it is quite sensitive to departures from the null hypothesis and appears to provide a useful and powerful test for moderate-to-large sample sizes.
3.2. Power of ETφ for Some Non-Normal Population Distributions and a Comparison to the KS Statistic
In this subsection, we examine the power of the ETφ-based testing procedure in cases where the true underlying population distribution is not a member of the normal parametric family of distributions. Specifically, we examine a range of distributions that includes two uniform and two exponential distributions, either centered (at zero) or non-centered, to provide contrasts to the standard normal in terms of both location and shape. We also examine two different Cauchy distributions: the standard Cauchy, and the Cauchy that has precisely the same peak density value as the standard normal, $\text{Cauchy}(0, \sqrt{2/\pi})$. Finally, we sample from two t-distributions having 2 and 3 degrees of freedom (dof), which are less peaked and have heavier tails than the standard normal distribution. The latter four simulations facilitate observations on the power of the ETφ-based test against alternative hypotheses that mimic the standard normal in various ways. We also make comparisons to tests based on the KS statistic for the same random samples of data.
The probabilities of the ET rejecting $H_0: N(0, 1)$ for the various non-normal distributions that were sampled are displayed graphically in Figure 3 and presented in Table A3 in Appendix A. Note that all of the finite-variance distributions under examination in this subsection were scaled so that their standard deviations equaled 1, thus matching the standard normal distribution of the null hypothesis. The non-centered uniform distribution (UniformNC) was translated to a mean of 0.5, whereas the centered uniform (UniformC) has a mean of zero. The non-centered exponential (ExpoNC) has both a mean and standard deviation of one, whereas the centered exponential (ExpoC) was translated to a mean of zero while still having a standard deviation of one.
In the case of the Cauchy distributions, the ETφ-based test is notably powerful in detecting that the alternative hypothesis is true. In fact, even for a small sample size of n = 50, the test is virtually certain to detect the alternative for the standard Cauchy case, and for n ≥ 100, it is virtually certain to detect the alternative for the $\text{Cauchy}(0, \sqrt{2/\pi})$ distribution. For both the non-centered exponential and non-centered uniform distributions, the power function increases rapidly as the sample size increases and approaches a rejection probability of 1.0 for moderate sample sizes. The power of the test also increases rapidly with increasing sample size for the two t-distributions, approaching one for relatively small sample sizes. The power function is increasing for the centered exponential and centered uniform distributions as well, but for these distributions, the power increases at a slower rate than for the other distributions sampled, especially in the case of the centered uniform distribution.
Overall, within the scope of the alternative distributions sampled, the ETφ test appears to be relatively powerful, as sample size n increases, for detecting departures from the null hypothesis involving symmetric distributions that mimic the standard normal in various ways and that exhibit differing levels of kurtosis. Power functions for the non-centered uniform and centered exponential rose at slower rates than those for the Cauchy distributions, the t-distributions, and the non-centered exponential; however, for UniformNC, the test was still quite powerful at moderate sample sizes. The rate of increase in power is clearly smallest for the centered uniform, where, even at a sample size of n = 1000, the power is a modest 0.50. These results suggest some areas of future research that are discussed in Section 5.
4. Regression Example Using Simulated Data: ET and KS Sampling Performances
In this section, we provide an additional perspective on how the ET can be implemented in the context of regression analysis. This exercise is informative, for example, if one is seeking support for the use of maximum likelihood estimation based on the normal family of distributions, or if one is seeking to base hypothesis tests of estimated parameters on t or F statistics. We investigate both the size and the power of the ETφ test under a variety of error population distributions (normal, log normal, Cauchy, autocorrelated processes, and a moving average process) across different sample sizes (n = 50, 100, 250, 500, and 1000). The ET results for these simulations are displayed graphically in Figure 5 and presented numerically in Appendix B.
We focus on how one might apply the ET to the null hypothesis that the residuals of a linear model specification arose iid from a zero-mean normal distribution, against the alternative hypothesis that they did not. In order to apply the ET approach directly, we formulate the test in terms of transformed least-squares residuals that asymptotically follow a known population distribution, free of the unknown variance parameter, if the null hypothesis is true. The transformation eliminates the unknown variance parameter, as opposed to estimating it, which provides sample observations that are fully consistent with the preceding ET test theory, so that the distribution of the ET statistic retains its asymptotic validity. Details of the implementation are presented below.
We begin with the familiar specification of the general linear model, under the assumption of normally distributed, homoscedastic, and non-autocorrelated errors:
$$Y = x\beta + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma^2 I_n\right).$$
We assume there are n sample observations, and the dimensionality of $\beta$ is $k \times 1$. It is well-known that the estimated least-squares residuals, $\hat{\varepsilon}$, from a fit of the linear model are as follows:
$$\hat{\varepsilon} = y - x\hat{\beta} = \left(I_n - x(x'x)^{-1}x'\right)\varepsilon.$$
Under the null hypothesis, the finite sample distribution of the estimated residuals is a multivariate singular normal distribution, since the estimated residuals are derived from a non-full-rank linear transformation of the multivariate normal distribution associated with $\varepsilon$. Under general conditions, as $n \rightarrow \infty$, the difference between each estimated residual $\hat{\varepsilon}_j$ and the corresponding true error $\varepsilon_j$ converges in probability to zero, where the jth residual is associated with $x_{j\cdot}$, the jth row of the explanatory variable matrix x. Moreover, $E(\hat{\varepsilon}) = 0$, and the variance of each estimated residual converges to $\sigma^2$. Thus, for large n, the estimated residuals asymptotically emulate iid normal random variables, with mean zero and variance $\sigma^2$.
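The least-squares algebra invoked here is easy to verify numerically. The following sketch, with hypothetical design and coefficient values, confirms that the residuals from a fitted linear model equal $M\varepsilon$, where $M = I_n - x(x'x)^{-1}x'$ is the annihilator matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 2
x = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical design
beta = np.array([1.0, 2.0])                            # hypothetical values
eps = rng.normal(size=n)
y = x @ beta + eps
bhat = np.linalg.lstsq(x, y, rcond=None)[0]            # least-squares fit
ehat = y - x @ bhat                                    # estimated residuals
M = np.eye(n) - x @ np.linalg.inv(x.T @ x) @ x.T       # annihilator matrix
assert np.allclose(ehat, M @ eps)                      # ehat = M @ eps
```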
In order to be able to apply the ET approach directly, consider the issue of transforming the composite null hypothesis of iid normality with unknown $\sigma^2$ into a simple hypothesis. Given that $\varepsilon_i \sim \text{iid } N(0, \sigma^2)$ under the null hypothesis, it follows that $\varepsilon_i/\varepsilon_j \sim \text{Cauchy}(0, 1)$ for $i \neq j$, regardless of the value of $\sigma^2$. Then, letting $j \in \{1, 3, 5, \ldots\}$ represent a sequence of odd integers, it follows that the pairwise ratios $Z_{(j+1)/2} = \varepsilon_j/\varepsilon_{j+1}$ are iid standard Cauchy random variables. At this point, one might rely on the asymptotic properties of the estimated residuals and consider using the analogous ratios of estimated residuals in an ME problem akin to (4) and (5). Thus, the characteristic function would be that of the standard Cauchy, i.e., $\varphi(t) = e^{-|t|}$. However, foreshadowing an issue that will be discussed in the concluding section, we note that the power of the testing procedure can be improved by considering a moment condition that represents an alternative feature of the equality between the sample and population characteristic functions. Moreover, the alternative leads to a functionally simplified moment condition.
To define the alternative moment constraint, note that if the equality between the entropy-weighted ECF and the standard Cauchy CF, $\sum_{i} w_i e^{itZ_i} = e^{-|t|}$, is true for all t, then its derivative with respect to t is as follows:
Moreover, (9) is true if we have the following:
By taking a line integral of (10) over the (−1, 1) interval, the moment constraint ultimately used in defining the ET statistic in this application becomes, simply, the following:
Note that an MGF could also have been used in this example to generate an asymptotically equivalent moment condition for use in defining an ET statistic. In particular, if $Z \sim \text{Cauchy}(0, 1)$, then the arctangent-based transformation $U = \tfrac{1}{2} + \tfrac{1}{\pi}\arctan(Z)$ is distributed according to the standard uniform (0, 1) distribution, for which an MGF exists.
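The two distributional facts used here, namely that ratios of paired iid zero-mean normals are standard Cauchy for any $\sigma^2$, and that the arctangent transform (the standard Cauchy CDF) maps them to Uniform(0, 1), can be checked directly:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(4)
e = rng.normal(scale=3.0, size=1000)   # sigma arbitrary and "unknown"
z = e[0::2] / e[1::2]                  # n/2 ratios: standard Cauchy under H0
u = 0.5 + np.arctan(z) / np.pi         # standard Cauchy CDF -> Uniform(0, 1)
print(kstest(u, "uniform"))            # uniformity should not be rejected
```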
Incorporating (11) into the optimization problem that defines the ET statistic results in problem (12), where the effective sample size in (12) is the number of pairs of observations used in defining the $Z_i$ observations, i.e., the size of the set of paired indices (equivalently, n/2).
The size and power properties of the ET-based test are assessed by exploring relationships between the empirical power and nominal (target) test size for varying sample sizes n and for a conventional fixed nominal type I error probability of $\alpha = 0.05$ (see Figure 5 and Appendix B). We apply the test to simulated linear model data, where the true error variance is held at a fixed value, x is an $n \times 2$ design matrix that consists of a column of 1's and a column obtained by generating iid outcomes from a fixed sampling distribution, and the coefficient vector $\beta$ is held constant across simulations.
Regarding the size of the test, it is apparent that the size is quite accurate even for the smallest sample size of n = 50 (see Appendix B), and it converges to the true target test size of 0.05 as n increases.
To investigate the power of the test, a variety of alternative error distributions was simulated. Four of the simulations remained within the normal family of distributions and used the base level distribution in various ways that violated the iid assumption. In particular, the errors were modeled as an evolving AR(1) autocorrelation process; an autocorrelated nonstationary random walk process (an AR(1) process with $\rho = 1$); a more complex three-period lag autocorrelation process; and a moving average process with two lagged error terms. The power increased rapidly as the sample size increased for all of these alternative hypotheses: the null hypothesis of iid zero-mean normal errors was highly likely to be rejected at moderate sample sizes, and it was virtually certain to be rejected at the largest sample sizes.
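Since the exact coefficient values of the paper's AR and MA alternatives are not reproduced above, the following sketch generates the four process types with illustrative, assumed parameters (the names and all numeric coefficients other than the random walk's $\rho = 1$ are placeholders).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
w = rng.normal(size=n)                  # iid N(0, 1) innovations

def ar(coefs):
    """Autoregression e_t = sum_j coefs[j] * e_{t-1-j} + w_t."""
    e = np.zeros(n)
    for t in range(len(coefs), n):
        e[t] = sum(c * e[t - 1 - j] for j, c in enumerate(coefs)) + w[t]
    return e

e_ar1 = ar([0.5])                       # single-lag AR; illustrative rho
e_rw  = ar([1.0])                       # nonstationary random walk (rho = 1)
e_ar3 = ar([0.3, 0.2, 0.1])             # illustrative three-lag AR process
e_ma2 = w.copy()                        # MA(2) with illustrative thetas
e_ma2[2:] += 0.5 * w[1:-1] + 0.25 * w[:-2]
```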
Two other error distributions were sampled that, in various ways, exhibited departures from the base distribution, while sampling continued to be iid. One of these distributions was a log normal distribution centered to have mean zero, CLogN(0, 0.94062). Its two parameter values are such that, upon translation of the distribution to mean zero, the sampled outcomes matched the mean and variance of the base normal distribution; the log normal distribution is notably skewed to the right. Simulations were also conducted based on a Cauchy distribution parameterized as $\text{Cauchy}(0, \sqrt{2/\pi})$, which has the same peak density value as the standard normal. The power of the test as n increased was substantial, and it was strongest for the CLogN alternative; the power function associated with the Cauchy distribution exhibited the lowest power as the sample size increased.
The same set of population sampling distributions was utilized in applying the KS testing approach to this regression setting. In the case of the KS test, standardized residuals were used as the data observations underlying the test. The standardized residuals were defined by dividing the least-squares residuals by an estimate of their standard deviation, as $\hat{\varepsilon}_j/\hat{\sigma}$, where $\hat{\sigma}^2 = \hat{\varepsilon}'\hat{\varepsilon}/(n-k)$ is the usual unbiased estimator of the residual variance. In applying the KS test, the well-known Lilliefors correction to the critical values of the test was used to account for the estimation of unknown parameters (note that use of the standard uncorrected KS critical values resulted in very poor sampling performance of the test). The empirical size and power properties of the KS test are displayed in Figure 6, and the numerical values underlying the graphs are provided in Appendix B.
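One readily available implementation of the Lilliefors-corrected KS test is statsmodels' lilliefors function. A minimal sketch of this leg of the comparison, reusing ehat, n, and k from the earlier regression sketch, follows.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

s2 = ehat @ ehat / (n - k)              # unbiased residual variance estimate
stat, pval = lilliefors(ehat / np.sqrt(s2), dist="norm")
print(stat, pval)                       # Lilliefors-corrected KS test result
```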
The ET test strongly dominated the performance of the KS test in all non-iid sampling scenarios, including all three AR error processes as well as the MA error process. Except for the nonstationary random walk process (the AR(1) process with $\rho = 1$), for which the KS test exhibited appreciable power at large sample sizes, the power of the KS test was poor in the other cases; for the two single-lag AR processes and the MA process, power was barely larger than the size of the test. On the other hand, the KS test exhibited very substantial power against iid sampling from both the Cauchy and the log normal distributions (the power curves are almost indistinguishable in Figure 6). In comparison, for the log normal case, the power of the ET test increased substantially with increasing sample size, with high power at the larger sample sizes. In the case of iid sampling from the Cauchy distribution, the ET test power also increased with sample size, but at a slower rate, and remained at only 60% of the power of the KS test at the largest sample size simulated.
5. Summary and Concluding Remarks
In this paper, we introduced the idea of basing the design of hypothesis tests for sampling distributions on test statistics that utilize information theoretic methods. The basic context is one of testing simple hypotheses, with test statistics defined on an information theoretic basis via constrained entropy maximization. The simulations presented in this paper utilize single constraints that reflect features of the equality of the sample and hypothesized characteristic functions, where the characteristic functions are uniquely associated with the hypothesized population sampling distributions. The sample characteristic function is based on the probability weights derived via the solution of the constrained entropy maximization problem. The asymptotic distribution of such test statistics relies on standard and relatively non-complex regularity conditions, and the statistics can be used, in principle, to test for any hypothesized sampling distribution.
The sampling distributions of the test statistics are derived from a well-established asymptotic theory relating to maximizing entropy, which applies as well to maximizing any member of the Cressie–Read family of power divergence statistics under moment-type constraints. Using sample observations obtained from a number of alternative sampling distributions and a wide range of sample sizes, we verified that the asymptotic size of the test was correct and illustrated the power of the test under a number of sampling distribution alternatives. The ET exhibits appreciable and increasing power, as sample size increases, for a number of alternative distributions when contrasted with the hypothesized null distributions. This was observed both in a context in which a random sample itself was being tested and in the ubiquitous context in which tests are applied to the residuals of a least-squares regression. However, as for virtually all tests of hypotheses about population sampling distributions, its efficacy was not uniformly strong across all alternative sampling distributions. Moreover, the performance of our entropy-based test was compared to that of the Kolmogorov–Smirnov testing approach, which revealed some relative strengths and weaknesses of both approaches.
Overall, the simulation results suggest that the information theoretic approach for testing simple hypotheses about population sampling distributions has notable promise. The results also suggest areas in which additional research would likely be useful for generating additional insights into the application of the methodology. In particular, the functional specification of the moment constraints used in the definition of the ET statistic deserves further exploration. How many moment constraints to incorporate in the maximum entropy problem, and of what type, is a question worthy of further research. In addition, rather than converting composite hypotheses to simple hypotheses through sample data transformations, as illustrated in this paper, the possibility of formulating an ET statistic that applies to composite hypotheses directly is worth contemplating. Currently, we are examining alternative specifications of moment conditions and their effect on the power of tests, and we are working to extend the methodology more generally to composite hypothesis contexts.