1. Introduction
Goodness-of-fit testing is a fundamental problem in statistical inference with widespread applications in model selection, validation, and diagnostics. In the parametric framework, the classical approach is often based on the likelihood ratio test (LRT), as motivated by the Neyman–Pearson lemma. While the lemma guarantees the optimality of the LRT for testing simple hypotheses, the LRT is also widely applied to composite hypotheses due to its favorable asymptotic properties [1].
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) observations from a continuous distribution $F$ with density $f$. We consider the hypothesis-testing problem
$$H_0 : f = f_0(\cdot\,; \theta) \quad \text{versus} \quad H_1 : f \neq f_0(\cdot\,; \theta),$$
where under the null hypothesis the density $f_0(\cdot\,; \theta)$ is known up to a parameter vector $\theta$, and under the alternative, $f$ is completely unspecified. The classical likelihood ratio test statistic is given by
$$\Lambda_n = \frac{\prod_{i=1}^{n} f(X_i)}{\prod_{i=1}^{n} f_0(X_i; \hat{\theta}_n)}. \tag{1}$$
We reject $H_0$ in favor of $H_1$ for large values of $\Lambda_n$, or equivalently, when the log-likelihood ratio $2 \log \Lambda_n$ exceeds a critical threshold. Under standard regularity conditions and assuming $H_0$ holds, the distribution of $2 \log \Lambda_n$ converges asymptotically to a chi-squared distribution with degrees of freedom equal to the difference in dimensionality between the null and alternative models [2].
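As a quick illustration of this classical setup (our toy example, not part of the paper; the function names and the known-variance simplification are ours), the sketch below computes $2 \log \Lambda_n$ for testing a fixed normal mean against a free mean, a nested pair for which Wilks' theorem gives an asymptotic $\chi^2_1$ null distribution:

```python
import math
import random

def loglik_normal(xs, mu, sigma):
    """Gaussian log-likelihood with known standard deviation sigma."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def lrt_statistic(xs, mu0=0.0, sigma=1.0):
    """2 log Lambda_n for H0: mu = mu0 versus a free mean (sigma known)."""
    mu_hat = sum(xs) / len(xs)  # MLE of mu under the alternative
    return 2.0 * (loglik_normal(xs, mu_hat, sigma)
                  - loglik_normal(xs, mu0, sigma))

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
stat = lrt_statistic(xs)  # approximately chi-squared(1) under H0
```

For this nested normal pair the statistic reduces algebraically to $n(\bar{X} - \mu_0)^2 / \sigma^2$, which provides a simple sanity check.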
However, in many practical situations—particularly in nonparametric settings—the alternative density is unknown, rendering direct application of the LRT infeasible. To address this, nonparametric density-based methods have been developed as flexible alternatives. Among these, empirical likelihood ratio tests and entropy-based statistics have attracted considerable attention.
Vexler and Gurevich [3] introduced an empirical likelihood ratio test in which the unknown density $f$ is estimated nonparametrically using Vasicek's [4] entropy estimator. They defined the empirical likelihood ratio using the spacing-based density estimate
$$\hat{f}_m\left( X_{(i)} \right) = \frac{F_n\left( X_{(i+m)} \right) - F_n\left( X_{(i-m)} \right)}{X_{(i+m)} - X_{(i-m)}} = \frac{2m}{n \left( X_{(i+m)} - X_{(i-m)} \right)},$$
where $F_n$ denotes the empirical cumulative distribution function. Here, $\hat{\theta}_n$ is the maximum likelihood estimate under $H_0$, and $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ are the order statistics. Boundary corrections are made such that $X_{(i-m)} = X_{(1)}$ if $i - m < 1$, and $X_{(i+m)} = X_{(n)}$ if $i + m > n$. The integer $m$, called the window size, is a positive number less than $n/2$. Vexler and Gurevich [3] proposed the test statistic
$$V_{mn} = \prod_{i=1}^{n} \frac{\hat{f}_m\left( X_{(i)} \right)}{f_0\left( X_{(i)}; \hat{\theta}_n \right)}. \tag{2}$$
Since $V_{mn}$ depends on $m$, they further suggested the minimization
$$T_n = \min_{1 \le m < n^{1-\delta}} V_{mn}, \tag{3}$$
where $\delta \in (0, 1)$.
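The spacing-based construction can be sketched in a few lines. This is a minimal illustration on the log scale, not the authors' implementation; the function name, the plug-in log-density argument, and the toy standard normal null are our assumptions:

```python
import math
import random

def vexler_gurevich_log(xs, log_f0, m):
    """log V_mn: Vasicek-type m-spacing density estimate against a
    fitted null log-density log_f0 (a callable).  Boundary spacings are
    clipped to X_(1) and X_(n) as in the construction above."""
    n = len(xs)
    x = sorted(xs)
    log_v = 0.0
    for i in range(1, n + 1):
        lo = x[max(i - m, 1) - 1]         # X_(i-m), set to X_(1) if i - m < 1
        hi = x[min(i + m, n) - 1]         # X_(i+m), set to X_(n) if i + m > n
        f_hat = 2.0 * m / (n * (hi - lo))  # spacing-based density estimate
        log_v += math.log(f_hat) - log_f0(x[i - 1])
    return log_v

# Toy use: standard normal null with no estimated parameters.
log_phi = lambda t: -0.5 * math.log(2.0 * math.pi) - 0.5 * t * t
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
log_v = vexler_gurevich_log(data, log_phi, m=10)
```

The statistic itself would then be obtained by minimizing over the admissible window sizes $m$.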
This general methodology has since been extended to various settings, including logistic distributions [5], skew normality [6], Laplace distributions [7], and Rayleigh distributions [8]. Although the method of Vexler and Gurevich [3] is generally effective, it is known to suffer from boundary bias, particularly near the endpoints, a limitation also discussed by Ebrahimi et al. [9] in the context of entropy estimation.
To address these shortcomings, we propose two test statistics. The first is a corrected version of the Vexler–Gurevich statistic (2) and (3), incorporating a position-dependent correction factor to properly account for boundary effects. The second is a new test statistic based on Correa's [10] local linear entropy estimator, which improves density estimation by locally interpolating the quantile function. Together, these methods aim to enhance the performance and reliability of goodness-of-fit testing.
The remainder of the paper is organized as follows.
Section 2 introduces the two proposed test statistics.
Section 3 presents their theoretical properties.
Section 4 describes the computational implementation using a bootstrap procedure.
Section 5 provides simulation studies and real-data applications to evaluate the tests’ performance. Finally,
Section 6 concludes and outlines potential future research directions.
4. Computational Algorithm
To implement the proposed test statistics $\widetilde{V}_{mn}$ and $V^{C}_{mn}$, it is necessary to select an appropriate window size parameter $m$. A commonly used rule, suggested by Grzegorzewski and Wieczorkowski [12], is
$$m = \left\lfloor \sqrt{n} + 0.5 \right\rfloor,$$
where $\lfloor \cdot \rfloor$ denotes the floor function. This choice effectively balances bias and variance in the entropy-based density estimators and is widely adopted in practice.
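In code, the rule is a one-liner (the function name is ours):

```python
import math

def window_size(n):
    """Grzegorzewski-Wieczorkowski rule: m = floor(sqrt(n) + 0.5)."""
    return int(math.floor(math.sqrt(n) + 0.5))
```

For example, a sample of size 100 yields $m = 10$, while a sample of size 50 yields $m = 7$.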
After computing the test statistic, the next step is to assess whether its value provides sufficient evidence to reject the null hypothesis. Values of the test statistic close to one indicate agreement between the empirical and theoretical densities, whereas large values suggest model misspecification and provide evidence against the null. Since the asymptotic distributions of the proposed statistics are analytically intractable, we employ a bootstrap procedure to approximate the null distribution and obtain the corresponding p-value.
The following algorithm outlines the steps for conducting the bootstrap-based goodness-of-fit test:
- 1.
Given an observed sample $X_1, \ldots, X_n$, compute the test statistic $V^{C}_{mn}$ as defined in Equation (7).
- 2.
Fit the null model $f_0(\cdot\,; \hat{\theta}_n)$ to the data, where $\hat{\theta}_n$ denotes the maximum likelihood estimator under $H_0$.
- 3.
Generate $B$ bootstrap samples $X^{*(b)}_1, \ldots, X^{*(b)}_n$, for $b = 1, \ldots, B$, by sampling from the fitted null model $f_0(\cdot\,; \hat{\theta}_n)$.
- 4.
For each bootstrap sample, compute the corresponding test statistic $V^{C, *(b)}_{mn}$.
- 5.
Estimate the bootstrap
p-value as
$$\hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\left( V^{C, *(b)}_{mn} \ge V^{C}_{mn} \right), \tag{9}$$
where $\mathbb{I}(\cdot)$ denotes the indicator function.
- 6.
Reject the null hypothesis at significance level $\alpha$ if $\hat{p} \le \alpha$.
This resampling approach avoids reliance on the asymptotic distribution of the test statistic, making the procedure suitable for small to moderate sample sizes and for complex models. An identical bootstrap procedure is applied to compute the p-value for the statistic $\widetilde{V}_{mn}$ defined in Equation (5).
Note that, although the test statistics $\widetilde{V}_{mn}$ and $V^{C}_{mn}$ are defined in multiplicative form, their theoretical properties are most naturally expressed on the log scale. When the null hypothesis $H_0$ holds, the entropy-based density estimators consistently estimate the parametric density $f_0(\cdot\,; \theta)$. By Lemma 1(i), the normalized log-statistics $n^{-1} \log \widetilde{V}_{mn}$ and $n^{-1} \log V^{C}_{mn}$ converge in probability to zero. Consequently, the observed values of $\log \widetilde{V}_{mn}$ and $\log V^{C}_{mn}$ fluctuate around zero in finite samples. Since the bootstrap samples are generated from $f_0(\cdot\,; \hat{\theta}_n)$, their corresponding statistics $\log \widetilde{V}^{*}_{mn}$ and $\log V^{C, *}_{mn}$ also concentrate near zero, ensuring that the bootstrap distribution provides a valid approximation to the null distribution. Thus, the bootstrap p-values are approximately uniform under $H_0$, thereby controlling the Type I error.
In contrast, when $H_0$ is false, Lemma 1(ii) shows that
$$\frac{1}{n} \log \widetilde{V}_{mn} \quad \text{and} \quad \frac{1}{n} \log V^{C}_{mn} \;\xrightarrow{P}\; \mathrm{KL}\left( f \,\|\, f_0(\cdot\,; \theta^{*}) \right) > 0,$$
where $\theta^{*}$ minimizes the Kullback–Leibler divergence between $f$ and $f_0(\cdot\,; \theta)$. Thus, the observed statistics become much larger than their bootstrap replicates, which remain centered near zero. This separation forces the bootstrap p-values to converge to zero as $n \to \infty$, thereby guaranteeing the consistency and power of the proposed tests.
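As a self-contained numerical check of the kind of Kullback–Leibler limit invoked here (our toy example, not the paper's setting), one can verify by Monte Carlo that $\mathrm{KL}\left( N(0,1) \,\|\, N(\mu,1) \right) = \mu^2 / 2$:

```python
import random

def kl_normal_mc(mu, n=200000, seed=7):
    """Monte Carlo estimate of KL(N(0,1) || N(mu,1)).

    For equal unit variances the log-density ratio
    log f(x) - log g(x) simplifies to (mu**2 - 2*mu*x) / 2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)            # draw from f = N(0, 1)
        total += (mu * mu - 2.0 * mu * x) / 2.0
    return total / n

est = kl_normal_mc(1.0)  # analytic value: mu**2 / 2 = 0.5
```

A strictly positive limit of this kind is exactly what separates the observed statistic from its bootstrap replicates under a fixed alternative.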
5. Simulation Study
In this section, we evaluate the performance of the proposed test statistics through both simulation studies and real data applications. Due to the complexity and nonparametric nature of the estimators, the exact sampling distributions under the null hypothesis are analytically intractable. Therefore, we employ the bootstrap procedure described in
Section 4 to approximate the null distribution and compute the corresponding
p-values.
For each scenario in Examples 1 and 2, samples of size $n$ are generated from the specified true distributions, as detailed in Table 1 and Table 2. The test statistics $V_{mn}$, $\widetilde{V}_{mn}$, and $V^{C}_{mn}$, as defined in Equations (2), (5), and (7), respectively, are computed for each sample. Corresponding p-values are then estimated using $B$ bootstrap replications from the fitted null model. To ensure reproducibility, the random seed is set in R via set.seed(2025). The R code implementing the proposed methods is available upon request from the corresponding author. In what follows, we denote the bootstrap p-values by $\hat{p}$, as defined in (9).
Example 1. (Testing the Normal Distribution with Unknown Mean as the Null): We begin by testing the null hypothesis $H_0 : f = f_0$, where $f_0$ is the density of the normal distribution with mean μ unknown and estimated by $\hat{\mu} = \bar{X}$ for each sample.
Samples are generated from various true distributions, as listed in
Table 1, to evaluate the sensitivity of the proposed test statistics to departures from normality. These alternatives include normal distributions with different means and variances, a symmetric mixture, as well as heavy-tailed and skewed distributions.
Table 1 displays the resulting p-values. When the data are truly normal, whether matching the null mean or with a shifted mean, the tests do not reject the null hypothesis and p-values are large for all sample sizes, as expected. As the alternative distributions deviate further from normality (e.g., increased variance or heavier tails), the tests become more sensitive. For instance, under the increased-variance normal alternative, the null is not rejected for small samples but is rejected for larger $n$, reflecting the increase in power. For non-normal alternatives like the Cauchy and exponential distributions, the tests exhibit high power, yielding near-zero p-values even for moderate sample sizes.
Notably, $V_{mn}$ and $\widetilde{V}_{mn}$ yield identical results in our simulations (see Table 1), while $V^{C}_{mn}$ can be marginally more sensitive in some cases, especially for moderate sample sizes or challenging alternatives. Overall, the results demonstrate that the proposed methods maintain the correct Type I error rate under the null and reliably detect departures from normality as the sample size increases.
Although $\widetilde{V}_{mn}$ modifies the construction of the statistic by introducing a boundary correction, it is in fact equivalent to the original statistic up to a constant factor. To see the equivalence between $\widetilde{V}_{mn}$ and $V_{mn}$, we fix the window size $m$. By (3) and (5), we obtain
$$\frac{\widetilde{V}_{mn}}{V_{mn}} = \prod_{i=1}^{n} \frac{c_i}{2}.$$
The following lemma provides an explicit form of this product.
Lemma 2.
For each $n$ and $m$, we have
$$\widetilde{V}_{mn} = C_{n,m} \, V_{mn}, \qquad \text{where} \qquad C_{n,m} = \left( \frac{(2m-1)!}{(m-1)! \, (2m)^{m}} \right)^{2}.$$
Proof. From the definition of the correction factor $c_i$ in (4), we can distinguish between interior indices and boundary indices.
First, consider the interior indices, i.e., $m + 1 \le i \le n - m$. For these indices we have $c_i = 2$. Hence, each term contributes
$$\frac{c_i}{2} = 1.$$
Since there are $n - 2m$ such indices, their total contribution to the product is simply 1.
Next, consider the lower boundary indices, i.e., $1 \le i \le m$. For these indices we have
$$c_i = 1 + \frac{i - 1}{m} = \frac{m + i - 1}{m},$$
so that
$$\frac{c_i}{2} = \frac{m + i - 1}{2m}.$$
Reindex the product by letting $k = m + i - 1$. Then, as $i$ runs from 1 to $m$, $k$ runs from $m$ to $2m - 1$. Thus,
$$\prod_{i=1}^{m} \frac{c_i}{2} = \prod_{k=m}^{2m-1} \frac{k}{2m}.$$
Now, consider the upper boundary indices, i.e., $n - m + 1 \le i \le n$. For these indices we have
$$\frac{c_i}{2} = \frac{1}{2} \left( 1 + \frac{n - i}{m} \right) = \frac{m + n - i}{2m}.$$
Let $j = n + 1 - i$. Then, as $i$ runs from $n - m + 1$ to $n$, $j$ runs over $1, \ldots, m$. Substituting gives
$$\prod_{i=n-m+1}^{n} \frac{c_i}{2} = \prod_{j=1}^{m} \frac{m + j - 1}{2m}.$$
Therefore, the set of factors from the upper boundary is
$$\left\{ \frac{m}{2m}, \frac{m + 1}{2m}, \ldots, \frac{2m - 1}{2m} \right\},$$
which is exactly the same as the set from the lower boundary. Thus,
$$\prod_{i=n-m+1}^{n} \frac{c_i}{2} = \prod_{k=m}^{2m-1} \frac{k}{2m}.$$
Combining the lower and upper boundary contributions, we obtain
$$\left( \prod_{k=m}^{2m-1} \frac{k}{2m} \right)^{2}.$$
Since the interior indices contribute 1, the overall constant is
$$C_{n,m} = \left( \prod_{k=m}^{2m-1} \frac{k}{2m} \right)^{2}.$$
To simplify the product, observe that
$$\prod_{k=m}^{2m-1} \frac{k}{2m} = \frac{1}{(2m)^{m}} \prod_{k=m}^{2m-1} k.$$
The first part is $(2m)^{-m}$, while the second part is exactly $\frac{(2m-1)!}{(m-1)!}$. Therefore,
$$\prod_{k=m}^{2m-1} \frac{k}{2m} = \frac{(2m-1)!}{(m-1)! \, (2m)^{m}}.$$
Substituting back yields
$$C_{n,m} = \left( \frac{(2m-1)!}{(m-1)! \, (2m)^{m}} \right)^{2}.$$
□
These constants depend only on the pair $(n, m)$ and not on the observed data. Consequently, the boundary-corrected statistic $\widetilde{V}_{mn}$ differs from the original $V_{mn}$ only by multiplication with $C_{n,m}$. Since the same constant rescales both the observed test statistic and the bootstrap distribution under the null, the resulting p-values remain identical. More explicitly, under the null we have
$$\hat{p}\left( \widetilde{V}_{mn} \right) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\left( C_{n,m} V^{*(b)}_{mn} \ge C_{n,m} V_{mn} \right) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\left( V^{*(b)}_{mn} \ge V_{mn} \right) = \hat{p}\left( V_{mn} \right).$$
Thus, the boundary correction reduces edge bias in the density estimate but introduces only a constant multiplicative factor in the test statistic. For hypothesis testing based on parametric bootstrap calibration, it has no impact on the p-values. This explains why Table 1, Table 2 and Table 3 report identical p-values for $V_{mn}$ and $\widetilde{V}_{mn}$ across all scenarios. Accordingly, in the following example, we omit reporting $\widetilde{V}_{mn}$ separately.
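The closed form in Lemma 2 is easy to check numerically. The sketch below (ours, assuming the Ebrahimi-type correction factors stated in the proof) compares the direct product of the factors $c_i / 2$ against the factorial expression:

```python
import math

def correction_constant_direct(n, m):
    """Product of c_i / 2 over i = 1..n, with the correction factors
    used in the proof: c_i = 1+(i-1)/m, 2, or 1+(n-i)/m."""
    prod = 1.0
    for i in range(1, n + 1):
        if i <= m:
            c = 1.0 + (i - 1) / m          # lower boundary indices
        elif i <= n - m:
            c = 2.0                        # interior indices contribute 1
        else:
            c = 1.0 + (n - i) / m          # upper boundary indices
        prod *= c / 2.0
    return prod

def correction_constant_closed(m):
    """Closed form ((2m-1)! / ((m-1)! (2m)^m))**2 from Lemma 2."""
    return (math.factorial(2 * m - 1)
            / (math.factorial(m - 1) * (2 * m) ** m)) ** 2
```

For $m = 2$, both routes give $(3/8)^2 = 9/64$, matching the hand calculation in the proof.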
Example 2. (Testing the Normal Distribution with Unknown Mean and Variance as the Null): In this simulation, we test the null hypothesis $H_0 : f = f_0$, where $f_0$ is the density of the normal distribution with both mean μ and variance $\sigma^2$ unknown. The maximum likelihood estimators $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ are used to fit the null model.
Samples are generated from the same set of alternative distributions as in Example 1, including normal distributions with different means and variances, symmetric mixtures, as well as heavy-tailed and skewed distributions. For each sample size $n$, the test statistics $V_{mn}$ and $V^{C}_{mn}$ are computed, and bootstrap p-values are estimated using $B$ replications.
Table 2 reports the resulting p-values, which provide insight into the performance of the tests when both location and scale are unknown. When the data are drawn from a normal distribution (either with the same or shifted mean), the tests maintain the nominal Type I error rate, as p-values remain large across all sample sizes. For alternatives with heavier tails, skewness, or mixtures, the p-values decrease rapidly as the sample size increases, demonstrating the ability of the proposed tests to detect deviations from the null model. The tests are especially powerful against the Cauchy and exponential alternatives, with p-values near zero even for small samples. When both location and scale are unknown, the tests show some loss of sensitivity to changes in variance alone, as seen in the increased-variance normal case, where p-values only decrease substantially for larger samples. Overall, $V_{mn}$ and $V^{C}_{mn}$ provide complementary perspectives, with the latter sometimes exhibiting greater sensitivity in challenging cases and with moderate sample sizes. These results highlight the robustness and power of the proposed methods for model assessment under a composite normal null hypothesis.
Example 3. (Real Data–Yarn Strength): We apply the proposed tests to the breaking strength values of 100 yarns, originally reported by Duncan [13].
We assess whether these data are adequately modeled by a Laplace distribution,
$$f_0(x; \mu, \sigma) = \frac{1}{2\sigma} \exp\left( -\frac{|x - \mu|}{\sigma} \right),$$
where $\mu$ and $\sigma$ denote the location and scale parameters, respectively. The maximum likelihood estimates are consistent with the values reported by Alizadeh Noughabi [7]. Using these estimates, we compute the p-values for $V_{mn}$ and $V^{C}_{mn}$ via a bootstrap procedure with $B$ replications. The resulting p-values for $V_{mn}$ and $V^{C}_{mn}$ are both well above the conventional significance level, providing no evidence against the Laplace model. These results support the adequacy of the Laplace distribution for the yarn strength data and further indicate that the Correa-based statistic tends to be more conservative, offering stronger support in this setting.
Example 4. (Real Data–River Flow): We assess the suitability of the three-parameter gamma distribution for modeling river flow measurements. The density is given by
$$f_0(x; \alpha, \beta, \mu) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (x - \mu)^{\alpha - 1} e^{-\beta (x - \mu)}, \qquad x > \mu,$$
where $\alpha$, $\beta$, and $\mu$ are the shape, rate, and location parameters, respectively. The dataset comprises river flow measurements (in millions of cubic feet per second) from the Susquehanna River at Harrisburg, Pennsylvania, recorded over the five-year period 1980–1984.
Following Al-Labadi and Evans [14], we use the maximum likelihood estimates of the three parameters. Using these estimates, we apply the bootstrap algorithm with $B$ replications to compute the p-values for the $V_{mn}$ and $V^{C}_{mn}$ test statistics.
The resulting p-values for $V_{mn}$ and $V^{C}_{mn}$ are both well above conventional significance thresholds, indicating strong agreement between the observed data and the fitted three-parameter gamma model. These results confirm the suitability of the gamma distribution for describing the river flow data and show that the Correa-based statistic tends to be slightly more conservative. This conclusion is consistent with the Bayesian nonparametric test of Al-Labadi and Evans [14], which likewise does not spuriously reject the adequacy of the gamma model.
To further evaluate the effectiveness of the proposed tests in detecting model misspecification, we conduct a power analysis under several alternative distributions, as well as under the null hypothesis. For each scenario, we fix the significance level at $\alpha$ and generate 1000 independent samples for each sample size $n$ considered.
For each sample, the test statistics $\widetilde{V}_{mn}$ and $V^{C}_{mn}$ are computed, and the corresponding p-values are estimated using $B$ bootstrap replications from the null model $f_0(\cdot\,; \hat{\theta}_n)$, as described in Section 4. The empirical rejection rate is calculated as the proportion of samples for which the null hypothesis is rejected.
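The rejection-rate computation can be sketched generically; in this illustration (ours), `p_value_fn` stands in for the bootstrap p-value of either proposed statistic and `draw_alt` for a sampler from the alternative distribution:

```python
import random

def empirical_power(p_value_fn, draw_alt, n, reps=1000, alpha=0.05, seed=1):
    """Empirical rejection rate: the fraction of simulated samples whose
    p-value falls at or below the significance level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = draw_alt(n, rng)       # draw one sample from the alternative
        if p_value_fn(sample) <= alpha:
            rejections += 1
    return rejections / reps
```

When `draw_alt` samples from the null model itself, the same routine estimates the Type I error rate, which should then be close to `alpha`.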
To assess Type I error control, we also report results under the null hypothesis, where samples are drawn from the null distribution itself. The empirical rejection rates in this case should be close to the nominal level $\alpha$, indicating the validity of the bootstrap calibration and the reliability of the testing procedure.
The alternative distributions considered highlight various types of departures from the null, including a mean shift, heavy tails (Cauchy(0, 1)), and a symmetric distribution with different kurtosis (Logistic(0, 1)). This range of alternatives enables a comprehensive assessment of the sensitivity and robustness of the proposed methods.
The results, presented in Table 3, demonstrate that both tests maintain appropriate Type I error rates when $H_0$ is true. Moreover, the power increases rapidly with sample size and is highest when the true distribution deviates substantially from the null. Notably, the test based on Correa's local linear entropy estimator ($V^{C}_{mn}$) consistently outperforms the boundary-corrected $m$-spacing test ($\widetilde{V}_{mn}$), particularly for small to moderate samples and challenging alternatives. These findings confirm the strong performance of the proposed methods, both in maintaining control of Type I error and in delivering high power against a range of alternatives.
Taken together, the simulation results and real-data examples demonstrate that both proposed tests provide reliable inference for model adequacy across a range of scenarios. The tests maintain proper Type I error control under the null, deliver substantial power under diverse alternatives, and do not spuriously reject in well-specified real-data settings. The new test based on Correa’s entropy estimator in particular offers notable advantages for moderate samples and challenging alternatives, while the boundary-corrected m-spacing approach retains strong and consistent performance. These findings illustrate the utility and flexibility of entropy-based density estimation methods for modern goodness-of-fit testing.
6. Conclusions
In this paper, we proposed two bootstrap-based test statistics for assessing the goodness-of-fit of parametric models. The first test is a boundary-corrected version of the empirical likelihood ratio statistic originally introduced by Vexler and Gurevich [3], which incorporates a position-dependent correction factor to improve density estimation near the boundaries. Although the correction modifies the construction of the statistic, in practice it yields results that are identical up to a fixed multiplicative constant and therefore leads to the same p-values as the uncorrected version. The second test is based on Correa's [10] local linear entropy estimator, which provides a flexible and accurate alternative by utilizing local linear regression to approximate the derivative of the quantile function.
We established the theoretical properties of the proposed statistics, demonstrating their consistency and showing that, under fixed alternatives, they converge to the Kullback–Leibler divergence. Since the asymptotic distributions of these statistics are analytically intractable, we developed a bootstrap algorithm for practical implementation.
Comprehensive simulation studies demonstrated that both test statistics perform well in terms of controlling Type I error and detecting model misspecification. In particular, the test based on Correa’s estimator exhibited superior power across a wide range of alternatives, especially in small to moderate sample sizes. Applications to real data further confirmed the utility and flexibility of the proposed methods.
Unlike composite likelihood ratio tests, which typically involve parameter estimation under both the null and alternative models, the proposed approach only requires parameter estimation under the null hypothesis (if at all). In the case of a fully specified null, parameter estimation is entirely avoided. This significantly simplifies implementation and enhances robustness, especially in settings where maximum likelihood estimation is computationally challenging or unreliable.
Future research directions include extending these tests to multivariate distributions, regression models, and right-censored data.