Abstract
This paper studies a goodness-of-fit test for the bivariate Hermite distribution. Specifically, we propose and study a Cramér–von Mises-type test based on the empirical probability generating function. The bootstrap can be used to consistently estimate the null distribution of the test statistic. A simulation study investigates the performance of the bootstrap approach for finite sample sizes.
1. Introduction
Testing the goodness-of-fit (gof) of a probabilistic model to given observations is a crucial aspect of data analysis.
Since Pearson proposed and analyzed the chi-square test in 1900, new gof tests have continued to be constructed and applied to continuous and discrete data. Recent publications include, for example, the works of Ebner and Henze [1], Górecki, Horváth, and Kokoszka [2], Puig and Weiß [3], Arnastauskaitè et al. [4], Dörr, Ebner, and Henze [5], Kolkiewicz, Rice, and Xie [6], Milonas et al. [7], Di Noia et al. [8], and Erlemann and Lindqvist [9].
Because count data arise in many different circumstances, the present investigation focuses on gof in the discrete case, specifically for the bivariate Hermite distribution (BHD).
In the univariate setting, a Hermite random variable is a linear combination of the form $Y = X_1 + 2X_2$, where $X_1$ and $X_2$ are independent Poisson random variables. The distinguishing property of the univariate Hermite distribution (UHD) is its flexibility for modeling count data that are multimodal or that contain an excess of zeros, which is called zero-inflation. It also allows for modeling data with moderate overdispersion, that is, data whose variance is greater than the expected value. McKendrick [10] modeled a phagocytic experiment (bacteria count in leukocytes) through the UHD, obtaining a more satisfactory model than with the Poisson distribution. In practice, however, bivariate count data emerge in several different disciplines, and the BHD plays an important role when such data are superinflated, for example, the number of accidents in two different periods [11].
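The construction of a univariate Hermite variable as $X_1 + 2X_2$, with $X_1$ and $X_2$ independent Poisson variables, can be checked by simulation. The following is a minimal Python sketch (the computations in this paper were done in R; Python is used here only for illustration). It compares the sample mean and variance with the theoretical values $\lambda_1 + 2\lambda_2$ and $\lambda_1 + 4\lambda_2$, which makes the overdispersion visible. The parameter values, the seed, and the helper names `poisson_draw` and `hermite_draw` are illustrative assumptions, not taken from the paper.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def hermite_draw(rng, lam1, lam2):
    """Draw one univariate Hermite variate as X1 + 2 * X2."""
    return poisson_draw(rng, lam1) + 2 * poisson_draw(rng, lam2)


def sample_mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


rng = random.Random(2022)
lam1, lam2 = 0.5, 0.75  # hypothetical parameter values
draws = [hermite_draw(rng, lam1, lam2) for _ in range(20000)]
mean, variance = sample_mean_and_variance(draws)
print(f"sample mean     = {mean:.3f}  (theory: {lam1 + 2 * lam2:.3f})")
print(f"sample variance = {variance:.3f}  (theory: {lam1 + 4 * lam2:.3f})")
```

Whenever $\lambda_2 > 0$, the variance exceeds the mean by $2\lambda_2$, which is the moderate overdispersion mentioned above.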
The only gof test related to the Hermite distribution that we have found so far is the one developed by Meintanis and Bassiakos [12]. However, that test is for univariate data.
To the best of our knowledge, there is no literature on gof tests for the BHD.
The purpose of this paper is to propose and study a consistent gof test for the BHD.
According to Novoa-Muñoz [13], the probability generating function (pgf) characterizes the distribution of a random vector and can be estimated consistently by the empirical probability generating function (epgf); the proposed test is a function of the epgf. The test statistic compares the epgf of the data with an estimator of the pgf of the BHD. As is well known, establishing the rejection region requires knowing the distribution of the test statistic.
The resulting test statistic is of the Cramér–von Mises type, and its distribution under the null hypothesis could not be calculated explicitly for finite sample sizes. We therefore approximate the null distribution of the statistic by means of a parametric bootstrap.
Because the properties of the proposed test are asymptotic (see, for example, [14]), a simulation study was carried out to evaluate the behavior of the test for samples of finite size.
The present work is organized as follows. In Section 2, we present some preliminary results that will be used in the following sections, together with the definition of the BHD and some of its properties. In Section 3, the proposed statistic is presented. Section 4 is devoted to the bootstrap estimator and its approximation to the null distribution of the statistic. Section 5 presents the results of a simulation study of the type I error, a study of the power of the test, and an application to a real data set.
Before ending this section, we introduce some notation: denotes a mixture (compounding) distribution, where represents the original distribution and the mixing distribution (i.e., the distribution of ) [15]; all vectors are row vectors, and is the transposed of the row vector x; for any vector denotes its kth coordinate, and its Euclidean norm; ; denotes the indicator function of the set A; denotes the probability law of the BHD with parameter ; denotes expectation with respect to the probability function ; and denotes the conditional probability law and expectation, given the data , respectively; all limits in this work are taken as denotes convergence in distribution; denotes almost sure convergence; let be a sequence of random variables or random elements and let ; then, means that is bounded in probability, means that and means that and denotes the separable Hilbert space of the measurable functions such that .
2. Preliminaries
Several definitions for the BHD have been given (see, for example, Kocherlakota and Kocherlakota in [16]). In this paper, we will work with the following one, which has received more attention in the statistical literature (see, for example, Papageorgiou et al. in [17]; Kemp et al. in [18]).
Let have the bivariate Poisson distribution with the parameters , , and (for more details of this distribution; see, for example, Johnson et al. in [19]); then, has the BHD. Kocherlakota in [20] obtained its pgf, which is given by
where , , and , .
From the pgf of the BHD, Kocherlakota and Kocherlakota [16] obtained the probability mass function of the BHD, which is given by
where is the moment-generating function of the normal distribution, is a polynomial of degree r in x, and .
Remark 1.
If , then the probability function is reduced to
Remark 2.
If is a random vector that is bivariate Hermite distributed with parameter θ, it will be denoted , where , and the parameter space is
Let be independent and identically distributed (iid) random vectors defined on a probability space and taking values in . In what follows, let
denote the epgf of for some appropriate .
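The epgf is a sample average of products of powers. A minimal Python sketch (an illustration, not the authors' R code), assuming the usual definition $g_n(t) = n^{-1}\sum_{i=1}^{n} t_1^{X_{i1}} t_2^{X_{i2}}$; the data values and the function name `empirical_pgf` are hypothetical:

```python
def empirical_pgf(sample, t1, t2):
    """Empirical pgf: the average of t1**x1 * t2**x2 over the sample."""
    return sum(t1 ** x1 * t2 ** x2 for x1, x2 in sample) / len(sample)


# Hypothetical bivariate count data (pairs of non-negative integers).
sample = [(0, 1), (2, 0), (1, 1), (0, 0), (3, 2)]

print(empirical_pgf(sample, 1.0, 1.0))  # a pgf always equals 1 at (1, 1)
print(empirical_pgf(sample, 0.0, 0.0))  # fraction of (0, 0) observations
print(empirical_pgf(sample, 0.5, 0.5))
```

Evaluating at $(1, 1)$ and $(0, 0)$ gives two quick sanity checks: the first value must be 1, and the second is the observed proportion of double zeros.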
The next section develops the statistic proposed in this study. It relies on the following result, whose proof can be found in [14]:
Proposition 1.
Let be iid from a random vector . Let be the pgf of , defined on . Let , such that ; then,
3. The Test Statistic and Its Asymptotic Null Distribution
Let be iid from a random vector . Based on the sample , the objective is to test the hypothesis
against the alternative
To this end, we resort to some properties of the pgf that allow us to propose the following test.
According to Proposition 1, a consistent estimator of the pgf is the epgf. If is true and is a consistent estimator of , then consistently estimates the population pgf. Since the distribution of is uniquely determined by its pgf, , , a reasonable test for testing should reject the null hypothesis for large values of defined by
where
is a consistent estimator of and is a measurable weight function, such that , and
The assumption (3) on w ensures that the double integral in (2) is finite for each fixed n. Now, to determine what are large values of , we must calculate its null distribution, or at least an approximation to it. Since the null distribution of is unknown, we first try to estimate it by means of its asymptotic null distribution. In order to derive it, we will assume that the estimator satisfies the following regularity condition:
Assumption 1.
Under , if denotes the true parameter value, then
where is such that and .
Assumption 1 is fulfilled by most commonly used estimators; see [16,21].
The next result gives the asymptotic null distribution of .
Theorem 1.
Let be iid from . Suppose that Assumption 1 holds.
Then
where , with
. Moreover,
where are independent chi-square variates with one degree of freedom and the set is the set of non-null eigenvalues of the operator defined on the function space , as follows:
where
Proof.
By definition, . Note that
By Taylor expansion of around ,
where , , for some , is the vector of the first derivatives and is the matrix of the second derivatives of with respect to .
Thus, considering (3) results in
Using the Markov inequality and (8), we have
Then,
where
The asymptotic null distribution of depends on the unknown true value of the parameter ; therefore, in practice, it does not provide a useful solution to the problem of estimating the null distribution of the test statistic. This could be solved by replacing with .
However, a greater difficulty is to determine the sets ; in most cases, calculating the eigenvalues of an operator is not a simple task and, in our case, we must also obtain the expression , which is not easy to find, since it depends on the function ℓ, which usually does not have a simple expression.
Thus, in the next section, we consider another way to approximate the null distribution of the test statistic, the parametric bootstrap method.
4. The Bootstrap Estimator
An alternative way to estimate the null distribution is through the parametric bootstrap method.
Let be iid taking values in . Assume that . Let be iid from a population with distribution , given , and let be the bootstrap version of obtained by replacing and by and , respectively, in the expression of . Let denote the bootstrap conditional probability law, given . In order to show that the bootstrap consistently estimates the null distribution of , we will make the following assumption, which is a bit stronger than Assumption 1.
Assumption 2.
Assumption 1 holds and the functions ℓ and J satisfy
- (1) , as , where is an open neighborhood of θ.
- (2) is continuous as a function of ϑ at , and is finite .
As stated after Assumption 1, Assumption 2 is not restrictive since it is fulfilled by commonly used estimators.
The next theorem shows that the bootstrap distribution of consistently estimates its null distribution.
Theorem 2.
Let be iid from a random vector . Suppose that Assumption 2 holds and that , for some . Then,
Proof.
By definition, , with
and defined in (6).
Following similar steps to those given in the proof of Theorem 1, it can be seen that , where is defined as with and replaced by and , respectively.
To derive the result, first we will check that assumptions (i)–(iii) in Theorem 1.1 of Kundu et al. [23] hold.
Observe that
where
Clearly, and . Let be the covariance kernel of , which by SLLN satisfies
Moreover, let be a zero-mean Gaussian process on whose operator of covariance C is characterized by
From the central limit theorem in Hilbert spaces (see, for example, van der Vaart and Wellner [24]), it follows that on , when the data are iid from the random vector .
Let denote the covariance operator of and let be an orthonormal basis of . Let , by a dominated convergence theorem,
Setting in the aforementioned Theorem 1.1, this proves that condition (i) holds. To verify condition (ii), by the monotone convergence theorem, Parseval’s relation, and the dominated convergence theorem, we obtain
To prove condition (iii), we first notice that
From the above inequality, for each fixed ,
for sufficiently large n. This proves condition (iii). Therefore, in , a.s. Now, the result follows from the continuous mapping theorem. □
From Theorem 2, the test function
or, equivalently, the test that rejects when is asymptotically correct in the sense that, when is true, , where is the upper percentile of the bootstrap distribution of and is the observed value of the test statistic.
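The bootstrap test can be summarized as a generic procedure: estimate the parameter, compute the observed statistic, resample from the fitted null model, re-estimate and recompute the statistic on each resample, and report the proportion of bootstrap statistics at least as large as the observed one. The Python sketch below is a hedged illustration of that scheme, not the authors' R implementation. To stay short and runnable, it uses a univariate Poisson null model with an epgf-based statistic as a stand-in for the BHD; all function names, the seed, and the number of resamples are assumptions made for the example.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bootstrap_p_value(data, estimate, simulate, statistic, n_boot, rng):
    """Parametric bootstrap p-value for a test that rejects for large values.

    estimate(data) returns the fitted parameter, simulate(theta, n, rng)
    returns a sample of size n from the null model, and
    statistic(data, theta) returns the value of the test statistic.
    """
    theta_hat = estimate(data)
    observed = statistic(data, theta_hat)
    exceed = 0
    for _ in range(n_boot):
        boot_sample = simulate(theta_hat, len(data), rng)
        boot_theta = estimate(boot_sample)  # re-estimate on every resample
        if statistic(boot_sample, boot_theta) >= observed:
            exceed += 1
    return exceed / n_boot


# Stand-in null model (NOT the BHD): univariate Poisson.
def poisson_mle(data):
    return sum(data) / len(data)


def simulate_poisson(lam, n, rng):
    return [poisson_draw(rng, lam) for _ in range(n)]


def pgf_distance(data, lam, grid=20):
    """n times the integral over [0, 1] of (epgf - Poisson pgf)^2, midpoint rule."""
    n = len(data)
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        epgf = sum(t ** x for x in data) / n
        total += (epgf - math.exp(lam * (t - 1.0))) ** 2
    return n * total * h


rng = random.Random(7)
data = simulate_poisson(1.5, 50, rng)
p_value = bootstrap_p_value(
    data, poisson_mle, simulate_poisson, pgf_distance, 200, rng
)
print(f"bootstrap p-value: {p_value:.3f}")
```

The null hypothesis is rejected at level α when the returned p-value is at most α. Re-estimating the parameter inside the loop matters, because the statistic is always evaluated at an estimate rather than at the true parameter.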
5. Numerical Results and Discussion
According to Novoa-Muñoz and Jiménez-Gamero [14], the properties of the statistic are asymptotic, that is, they describe the behavior of the proposed test for large samples. To study the performance of the bootstrap approach for samples of finite size, a simulation experiment was carried out. In this section, we describe this experiment and summarize the results obtained.
As mentioned in the Introduction, to the best of our knowledge there is no other goodness-of-fit test for the bivariate Hermite distribution with which a comparison could be made. Therefore, the simulation study is limited to the test presented in this investigation.
All computations in this paper were carried out with code written in the R language [25].
To calculate , it is necessary to give an explicit form to the weight function w. Here, the following is taken into account:
Observe that the only restrictions that have been imposed on the weight function are that w be positive almost everywhere in and that it satisfy the condition established in (3). The function given in (9) meets these conditions whenever , . Hence,
It was not possible to find an explicit form of the statistic , so it was calculated numerically with the curvature package of R [25].
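Without a closed form, the statistic has to be evaluated by numerical integration. The Python sketch below illustrates one way to do this and is not the authors' R code. The formulas for the statistic and for the weight function are not reproduced in this text, so the sketch rests on explicit assumptions: the statistic is taken to be $n \int_0^1 \int_0^1 \{g_n(t) - g(t; \hat{\theta})\}^2 w(t)\,dt$, the weight is taken to be $w(t) = t_1^{a_1} t_2^{a_2}$, and a plain midpoint rule replaces an adaptive integration routine. The model pgf is passed in as a function; the independent-Poisson pgf used in the demonstration is only a placeholder, not the BHD pgf.

```python
import math


def empirical_pgf(sample, t1, t2):
    """Empirical pgf: the average of t1**x1 * t2**x2 over the sample."""
    return sum(t1 ** x1 * t2 ** x2 for x1, x2 in sample) / len(sample)


def cvm_statistic(sample, model_pgf, a1=0.0, a2=0.0, grid=50):
    """Approximate n * integral of (epgf - model pgf)^2 * w over [0, 1]^2.

    Assumptions (not taken from the source): the domain is the unit
    square, the weight is w(t) = t1**a1 * t2**a2, and a midpoint rule on
    a grid-by-grid mesh stands in for an adaptive integration routine.
    """
    n = len(sample)
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t1 = (i + 0.5) * h
        for j in range(grid):
            t2 = (j + 0.5) * h
            diff = empirical_pgf(sample, t1, t2) - model_pgf(t1, t2)
            total += diff * diff * t1 ** a1 * t2 ** a2
    return n * total * h * h


def independent_poisson_pgf(lam1, lam2):
    """Placeholder model pgf (NOT the BHD pgf): two independent Poissons."""
    return lambda t1, t2: math.exp(lam1 * (t1 - 1.0) + lam2 * (t2 - 1.0))


# Hypothetical data; the placeholder model is fitted by the sample means.
sample = [(0, 1), (2, 0), (1, 1), (0, 0), (3, 2), (1, 0), (0, 2), (1, 3)]
lam1_hat = sum(x1 for x1, _ in sample) / len(sample)
lam2_hat = sum(x2 for _, x2 in sample) / len(sample)
fitted_pgf = independent_poisson_pgf(lam1_hat, lam2_hat)
print(cvm_statistic(sample, fitted_pgf, a1=1.0, a2=1.0))
```

Each evaluation costs on the order of `grid * grid * n` operations, and the bootstrap repeats it for every resample, which is consistent with numerical integration dominating the running time of the simulations.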
5.1. Simulated Data
In order to approximate the null distribution of the statistic for finite-size samples of sizes 30, 50, and 70 from a , for , the pgf (1), with , was utilized. The combinations of parameters were chosen in such a way that , .
The selected values of the other parameters were , , and .
The selected values of and were not greater than 1 since the Hermite distribution is characterized as being zero-inflated.
To estimate the parameter , we used the maximum likelihood method given in Kocherlakota and Kocherlakota [16]. Then, we approximated the bootstrap p-values of the proposed test with the weight function given in (9) for , generating bootstrap samples.
The above procedure was repeated 1000 times, and we computed the fraction of the estimated p-values that were less than or equal to 0.05 and 0.10, which are the estimated type I error probabilities for the nominal levels 0.05 and 0.10.
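The last step of this procedure is a simple proportion. A minimal Python sketch (illustrative only; the p-values below are made up and the function names are hypothetical) computes the empirical rejection rate together with the Monte Carlo standard error $\sqrt{\alpha(1-\alpha)/M}$ that a correctly sized test would show over $M$ replications:

```python
import math


def empirical_rejection_rate(p_values, alpha):
    """Fraction of p-values less than or equal to the nominal level alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


def monte_carlo_se(alpha, replications):
    """Standard error of a rejection rate when the true level equals alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / replications)


# Hypothetical p-values from 10 replications (the study used 1000).
p_values = [0.03, 0.47, 0.08, 0.91, 0.05, 0.62, 0.10, 0.26, 0.74, 0.01]
for alpha in (0.05, 0.10):
    rate = empirical_rejection_rate(p_values, alpha)
    print(f"alpha = {alpha:.2f}: rejection rate = {rate:.2f}")

print(f"SE at alpha = 0.05 with 1000 replications: {monte_carlo_se(0.05, 1000):.4f}")
```

With 1000 replications the standard error at the 0.05 level is about 0.007, so estimated levels within roughly 0.014 of 0.05 are compatible with a correctly sized test.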
The results obtained are presented in Tables 1–7 for the different pairs . In each table, the established order was growing in and , and for each new increasing values in , and in each new , increasing values for . From these results, we can conclude that the parametric bootstrap method provides good approximations to the null distribution of the statistic in most of the cases considered.
Table 1. Simulation results for the probability of type I error for and .
Table 2. Simulation results for the probability of type I error for and .
Table 3. Simulation results for the probability of type I error for and .
Table 4. Simulation results for the probability of type I error for and .
Table 5. Simulation results for the probability of type I error for and .
Table 6. Simulation results for the probability of type I error for and .
Table 7. Simulation results for the probability of type I error for and .
The values of and in the weight function affect the bootstrap estimates of the p-values.
From the tables, it is clear that the estimated type I error probabilities approach the nominal value as n increases. These approximations are better when . In particular, when is small (less than 5), the estimates approach the nominal value from below; the opposite happens when are fairly large (greater than or equal to 5). Table 4 shows the best results, that is, the weight function with yields the best estimates.
Unfortunately, we could not find a closed form for our statistic ; to calculate it, we used the curvature package of the software R [25]. This had a serious impact on the computation time, since the execution time of the simulations increased by at least 30%.
5.2. The Power of a Hypothesis Test
To study the power, we repeated the previous experiment for samples of size and, for the weight function, we used the values of and that yielded the best results in the study of type I error. The alternative distributions we used are detailed below:
- bivariate binomial distribution , where , , and ,
- bivariate Poisson distribution , where , ,
- bivariate logarithmic series distribution , where ,
- bivariate negative binomial distribution , where and ,
- bivariate Neyman type A distribution , where ,
- bivariate Poisson distribution mixtures of the form , where , denoted by .
Table 8 displays the alternatives considered and the estimated power for nominal significance level . Analyzing this table, we can conclude that all the considered tests, denoted by , are able to detect the alternatives studied with good power, giving better results in cases where . The best result was achieved for , as expected from the study of type I error.
Table 8. Simulation results for the power. The values are percentages, rounded to the nearest integer.
5.3. Real Data Set
Now, the proposed test will be applied to a real data set. The data set comprises the number of accidents in two different years, presented in [16], where X is the number of accidents in the first period and Y the number of accidents in the second period. Table 9 shows the real data set.
Table 9. Real data: number of accidents X in one period and Y in another period.
The p-value obtained from the statistic of the proposed test, with and applied to the real values, is 0.838; therefore, we do not reject the null hypothesis, that is, the data seem to follow a BHD. This is consistent with the results presented by Kemp and Papageorgiou [26], who performed a goodness-of-fit test and obtained a p-value of 0.3078.
Author Contributions
Conceptualization, F.N.-M.; methodology, F.N.-M. and P.G.-A.; software, F.N.-M. and P.G.-A.; validation, F.N.-M. and P.G.-A.; formal analysis, F.N.-M. and P.G.-A.; investigation, F.N.-M. and P.G.-A.; resources, F.N.-M.; data curation, P.G.-A.; writing—original draft preparation, F.N.-M. and P.G.-A.; writing—review and editing, F.N.-M. and P.G.-A.; visualization, F.N.-M. and P.G.-A. All authors have read and agreed to the published version of the manuscript.
Funding
This publication was supported by Universidad del Bío-Bío, DICREA [2220529 IF/R] and Universidad Adventista de Chile, DI [2021-139 II], Chile.
Acknowledgments
The corresponding author would like to thank research project DIUBB 2220529 IF/R and Fondo de Apoyo a la Participación a Eventos Internacionales (FAPEI) at Universidad del Bío-Bío, Chile. He also thanks the anonymous reviewers and the editor of this journal for their valuable time and their careful comments and suggestions with which the quality of this paper has been improved.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Ebner, B.; Henze, N. Tests for multivariate normality-a critical review with emphasis on weighted L2-statistics. TEST 2020, 29, 845–892.
- Górecki, T.; Horváth, L.; Kokoszka, P. Tests of Normality of Functional Data. Int. Stat. Rev. 2020, 88, 677–697.
- Puig, P.; Weiß, C.H. Some goodness-of-fit tests for the Poisson distribution with applications in biodosimetry. Comput. Stat. Data Anal. 2020, 144, 106878.
- Arnastauskaitè, J.; Ruzgas, T.; Bražènas, M. A New Goodness of Fit Test for Multivariate Normality and Comparative Simulation Study. Mathematics 2021, 9, 3003.
- Dörr, P.; Ebner, B.; Henze, N. A new test of multivariate normality by a double estimation in a characterizing PDE. Metrika 2021, 84, 401–427.
- Kolkiewicz, A.; Rice, G.; Xie, Y. Projection pursuit based tests of normality with functional data. J. Stat. Plan. Inference 2021, 211, 326–339.
- Milonas, D.; Ruzgas, T.; Venclovas, Z.; Jievaltas, M.; Joniau, S. The significance of prostate specific antigen persistence in prostate cancer risk groups on long-term oncological outcomes. Cancers 2021, 13, 2453.
- Di Noia, A.; Barabesi, L.; Marcheselli, M.; Pisani, C.; Pratelli, L. Goodness-of-fit test for count distributions with finite second moment. J. Nonparametric Stat. 2022.
- Erlemann, R.; Lindqvist, B.H. Conditional Goodness-of-Fit Tests for Discrete Distributions. J. Stat. Theory Pract. 2022.
- McKendrick, A.G. Applications of Mathematics to Medical Problems. Proc. Edinb. Math. Soc. 1926, 44, 98–130.
- Cresswell, W.L.; Froggatt, P. The Causation of Bus Driver Accidents; Oxford University Press: Oxford, UK, 1963; p. 316.
- Meintanis, S.; Bassiakos, Y. Goodness-of-fit tests for additively closed count models with an application to the generalized Hermite distribution. Sankhya 2005, 67, 538–552.
- Novoa-Muñoz, F. Goodness-of-fit tests for the bivariate Poisson distribution. Commun. Stat. Simul. Comput. 2019.
- Novoa-Muñoz, F.; Jiménez-Gamero, M.D. Testing for the bivariate Poisson distribution. Metrika 2013, 77, 771–793.
- Johnson, N.L.; Kemp, A.W.; Kotz, S. Univariate Discrete Distributions, 3rd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2005.
- Kocherlakota, S.; Kocherlakota, K. Bivariate Discrete Distributions; John Wiley & Sons: Hoboken, NJ, USA, 1992.
- Papageorgiou, H.; Kemp, C.D.; Loukas, S. Some methods of estimation for the bivariate Hermite distribution. Biometrika 1983, 70, 479–484.
- Kemp, C.D.; Kemp, A.W. Rapid estimation for discrete distributions. Statistician 1988, 37, 243–255.
- Johnson, N.L.; Kotz, S.; Balakrishnan, N. Discrete Multivariate Distributions; Wiley: New York, NY, USA, 1997.
- Kocherlakota, S. On the compounded bivariate Poisson distribution: A unified approach. Ann. Inst. Stat. Math. 1988, 40, 61–76.
- Papageorgiou, H.; Loukas, S. Conditional even point estimation for bivariate discrete distributions. Commun. Stat. Theory Methods 1988, 17, 3403–3412.
- Serfling, R.J. Approximation Theorems of Mathematical Statistics; Wiley: New York, NY, USA, 1980.
- Kundu, S.; Majumdar, S.; Mukherjee, K. Central limits theorems revisited. Stat. Probab. Lett. 2000, 47, 265–275.
- Van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes; Springer: New York, NY, USA, 1996.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. 2021. Available online: https://www.R-project.org/ (accessed on 1 July 2019).
- Kemp, C.D.; Papageorgiou, H. Bivariate Hermite distributions. Sankhya 1982, 44, 269–280.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).