Article

An Over and Underdispersed Biparametric Extension of the Waring Distribution

Department of Statistics and Operations Research, University of Jaén, 23071 Jaén, Spain
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2021, 9(2), 170; https://doi.org/10.3390/math9020170
Submission received: 3 December 2020 / Revised: 4 January 2021 / Accepted: 8 January 2021 / Published: 15 January 2021
(This article belongs to the Special Issue Probability, Statistics and Their Applications)

Abstract

A new discrete distribution for count data, called the extended biparametric Waring (EBW) distribution, is developed. Its name reflects the fact that, for a specific configuration of its parameters, it can be seen as a biparametric version of the univariate generalized Waring (UGW) distribution, a well-known model for decomposing the variance into three components: randomness, liability and proneness. Unlike the UGW distribution, the EBW can model both overdispersed and underdispersed data sets. In fact, the EBW distribution is a particular case of a UGW distribution when its first parameter is positive; otherwise, it is a particular case of a Complex Triparametric Pearson (CTP) distribution. Hence, this new model inherits most of their properties and, moreover, it helps to solve the identification problem in the variance components of the UGW model. We compare the EBW with the UGW by a simulation study, and with other over- and underdispersed distributions through the Kullback-Leibler divergence. Additionally, we have carried out a simulation study in order to analyse the properties of the maximum likelihood parameter estimates. Finally, some application examples are included which show that the proposed model provides similar or even better results than other models, but with fewer parameters.

1. Introduction

The univariate generalized Waring (UGW) distribution is a triparametric model for overdispersed count data that has been studied by [1,2,3,4], among others. The interest of the UGW distribution lies in the decomposition of its variance into three components, randomness, liability and proneness, which gives deeper insight into the nature of data variability, that is, how and why data vary. Whereas the Poisson distribution provides the simplest answer to this question (pure chance), any one-step Poisson mixture distribution assumes that there are only two sources of variability (for example, the negative binomial or NB distribution, which is a Poisson-Gamma mixture).
For this reason, the UGW distribution and the related regression model [5,6,7] have been widely applied for modelling overdispersed count data sets in different fields, such as lexicology [8], the number of authors in scientific articles [9], the evolution of the number of links in the World Wide Web [10], accident theory [11], clustered data [12], sources of variance in motor vehicle crash analysis [13], completeness errors in geographic data sets [14] or agriculture [15].
However, the UGW distribution has a serious drawback related to the variance decomposition. Since its first two parameters are interchangeable in the expression of the probability mass function (pmf), it is difficult to determine which component refers to liability and which to proneness. Some suggestions to avoid this problem are available in the literature. Ref. [1] recommends assigning liability and proneness according to the researcher's experience; Ref. [11] proposes using a bivariate version of the Waring distribution; and Ref. [5] solves the problem using additional information provided by covariates through a regression model. In all cases, resolving the indeterminacy requires external information that is not always available.
Several extensions have been developed, such as the extended Waring distribution or GHD$_I$ [4], the stuttering generalized Waring distribution [16] and the bivariate generalized Waring distribution [11], but none of them solves the aforementioned identification problem.
In this paper, we study a specific biparametric distribution within the Gaussian hypergeometric distributions (GHD) family [17] and we propose it as an extension of the UGW distribution with only two parameters. The proposed model not only performs similarly to the UGW distribution for overdispersed data sets but also solves the identification problem of the variance components. Moreover, the way in which the extension is carried out also allows for modelling underdispersed count data sets, since it can be seen as a particular case of a complex triparametric Pearson (CTP) distribution [18,19]; in this case, however, the variance decomposition does not hold because the model cannot be expressed as a Poisson mixture. Thus, this extension, which we will call the extended biparametric Waring (henceforth EBW) distribution, inherits the good properties of the UGW and CTP distributions.
The rest of the paper is laid out as follows. Section 2 is devoted to defining the EBW distribution and to exploring its main probabilistic properties. In Section 3 some estimation methods are described and the properties of the maximum likelihood estimators are analysed by a simulation study. In Section 4 we compare the EBW distribution with some other biparametric over- and underdispersed distributions. Some examples of application to real over- and underdispersed data that illustrate the versatility of the proposed model are included in Section 5. Finally, in Section 6, some conclusions of the current research are presented.

2. The Extended Biparametric Waring Distribution

2.1. Definition

The GHD family, generated by the ${}_2F_1(\alpha, \beta; \gamma; \lambda)$ function
$$ {}_2F_1(\alpha, \beta; \gamma; \lambda) = \sum_{x=0}^{\infty} \frac{(\alpha)_x (\beta)_x}{(\gamma)_x} \frac{\lambda^x}{x!} $$
with $(\gamma)_x = \Gamma(\gamma + x)/\Gamma(\gamma)$, $\alpha, \beta \in \mathbb{C}$ and $\gamma, \lambda \in \mathbb{R}$, arises as a solution of the difference equation
$$ G(x) f(x+1) - L(x) f(x) = 0, \quad x = 0, 1, 2, \ldots \tag{1} $$
where $G$ and $L$ are quadratic polynomials with real coefficients, $G(x) = (\gamma + x)(x + 1)$ and $L(x) = \lambda (x + \alpha)(x + \beta)$ [20]. When convergence, positivity and normalization conditions are verified, the solution $f(x)$ is the pmf of a discrete distribution, that is
$$ f(x) = P(X = x) = \frac{\Gamma(\gamma)}{\Gamma(\alpha)\Gamma(\beta)\, {}_2F_1(\alpha, \beta; \gamma; \lambda)} \frac{\Gamma(\alpha + x)\Gamma(\beta + x)}{\Gamma(\gamma + x)} \frac{\lambda^x}{x!} \tag{2} $$
with probability generating function (pgf)
$$ G(t) = \frac{{}_2F_1(\alpha, \beta; \gamma; \lambda t)}{{}_2F_1(\alpha, \beta; \gamma; \lambda)}, \quad t \in \mathbb{R}. $$
It is important to point out that the first three parameters of the GHD are, up to sign, the roots of the polynomials $L(x)$ and $G(x)$.
A thorough classification of the GHD family in terms of the parameters can be found in [4]. That paper gives a detailed study of the GHD when $\alpha$, $\beta$ and $\gamma$ are positive real numbers and $0 < \lambda \le 1$ (denoted by type I). The case when $\alpha$ and $\beta$ are complex conjugate numbers, $\gamma > 0$ and $0 < \lambda \le 1$ (denoted by type II distributions) has been studied in [18,19,21]. Type VII distributions, a finite case which may be seen as a generalization of the beta-binomial model, have been addressed by [22]. Likewise, the case $\lambda = 1$ has been analysed by [23,24], among others.
In this paper, we focus on the case in which the GHD of type I and type II with $\lambda = 1$ meet. Thus, $L(x)$ in Equation (1) has a real double root, that is, $\alpha = \beta$. Then, the solution of Equation (1) is given in terms of a ${}_2F_1(\alpha, \alpha; \gamma; 1)$ function, leading to a highly versatile biparametric discrete distribution with infinite range, which is formalized in the following definition. From now on we will call it the EBW distribution, the acronym of Extended Biparametric Waring. Later on we will explain the nomenclature chosen for this new distribution.
Definition 1.
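A random variable $X$ following an $EBW(\alpha, \gamma)$ distribution is defined by the following pmf
$$ P(X = x) = \frac{\Gamma(\gamma - \alpha)^2}{\Gamma(\alpha)^2\, \Gamma(\gamma - 2\alpha)} \frac{\Gamma(\alpha + x)^2}{\Gamma(\gamma + x)} \frac{1}{x!}, \quad x = 0, 1, \ldots \tag{3} $$
where $\alpha \in \mathbb{R}$ and $\gamma > \max(0, 2\alpha)$.
The mean, $\mu$, and variance, $\sigma^2$, of $X$ are
$$ \mu = \frac{\alpha^2}{\gamma - 2\alpha - 1}, \qquad \sigma^2 = \frac{\alpha^2 (\gamma - \alpha - 1)^2}{(\gamma - 2\alpha - 1)^2 (\gamma - 2\alpha - 2)} = \mu\, \frac{\mu + \gamma - 1}{\gamma - 2\alpha - 2} \tag{4} $$
so it is necessary that $\gamma > 2\alpha + 1$ and $\gamma > 2\alpha + 2$ to guarantee the existence of $\mu$ and $\sigma^2$, respectively. In general, it can be proved that $\gamma > 2\alpha + m$ is required to guarantee the existence of the $m$-th raw moment.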
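As a quick numerical sanity check of Definition 1, the following minimal sketch evaluates the pmf on the log scale and compares truncated sums with the closed-form mean and variance. The function names and the truncation point `upper` are illustrative choices made here, not part of the article, and Python is used only for illustration.

```python
import math

def ebw_logpmf(x, alpha, gamma):
    """Log of P(X = x) for EBW(alpha, gamma), following Definition 1."""
    if gamma <= max(0.0, 2.0 * alpha):
        raise ValueError("requires gamma > max(0, 2 * alpha)")
    # math.lgamma returns log|Gamma(.)|. Gamma(alpha) and Gamma(alpha + x)
    # enter squared, so any negative sign (non-integer alpha < 0) cancels.
    log_norm = (2.0 * math.lgamma(gamma - alpha)
                - 2.0 * math.lgamma(alpha)
                - math.lgamma(gamma - 2.0 * alpha))
    return (log_norm + 2.0 * math.lgamma(alpha + x)
            - math.lgamma(gamma + x) - math.lgamma(x + 1.0))

def ebw_moments(alpha, gamma):
    """Closed-form mean and variance; the variance needs gamma > 2 * alpha + 2."""
    mu = alpha ** 2 / (gamma - 2.0 * alpha - 1.0)
    var = mu * (mu + gamma - 1.0) / (gamma - 2.0 * alpha - 2.0)
    return mu, var

def ebw_numeric_summary(alpha, gamma, upper=20000):
    """Total mass, mean and variance from a truncated sum over 0..upper."""
    probs = [math.exp(ebw_logpmf(x, alpha, gamma)) for x in range(upper + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    second = sum(x * x * p for x, p in enumerate(probs))
    return total, mean, second - mean ** 2
```

Because the tail of the pmf decays only polynomially, the truncated sums are accurate only when $\gamma - 2\alpha$ is comfortably above the order of the moment being checked.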

2.2. Properties

To study the properties of the EBW distribution we distinguish between the cases $\alpha > 0$ and $\alpha < 0$ with $\alpha \notin \mathbb{Z}$.

2.2.1. α > 0

It is necessary that $\gamma > 2\alpha$, so we consider another parametrization of the distribution in terms of $\alpha$ and $\rho = \gamma - 2\alpha > 0$. Then, the expression of the pmf given in Equation (3) becomes
$$ P(X = x) = \frac{\Gamma(\alpha + \rho)^2}{\Gamma(\alpha)^2\, \Gamma(\rho)} \frac{\Gamma(\alpha + x)^2}{\Gamma(2\alpha + \rho + x)} \frac{1}{x!}, \quad x = 0, 1, \ldots \tag{5} $$
and the expressions in Equation (4) reduce to
$$ \mu = \frac{\alpha^2}{\rho - 1}, \qquad \sigma^2 = \frac{\alpha^2 (\alpha + \rho - 1)^2}{(\rho - 1)^2 (\rho - 2)} = \mu\, \frac{\mu + 2\alpha + \rho - 1}{\rho - 2}. $$
To guarantee the existence of the $m$-th raw moment it is necessary that $\rho > m$.
Theorem 1.
The $EBW(\alpha, \rho)$ distribution with $\alpha, \rho > 0$ is a $UGW(\alpha, \alpha, \rho)$ distribution.
Proof. 
Considering $\alpha = \beta > 0$ and $\lambda = 1$ in Equation (2) and applying that
$$ {}_2F_1(\alpha, \beta; \gamma; 1) = \frac{\Gamma(\gamma)\, \Gamma(\gamma - \alpha - \beta)}{\Gamma(\gamma - \alpha)\, \Gamma(\gamma - \beta)}, $$
it is easy to see that the pmf given in Equation (5) coincides with that of a $UGW(\alpha, \alpha, \rho)$ distribution. □
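The substitution behind the proof can be written out step by step. The UGW pmf used in the last line is its usual form in the literature (see, e.g., [1]); it is quoted here as an assumption, since the text does not restate it.

```latex
% Gauss summation with \beta = \alpha and \gamma = 2\alpha + \rho:
{}_2F_1(\alpha,\alpha;2\alpha+\rho;1)
  = \frac{\Gamma(2\alpha+\rho)\,\Gamma(\rho)}{\Gamma(\alpha+\rho)^2}.

% Substituting into Equation (2) with \lambda = 1:
P(X=x) = \frac{\Gamma(2\alpha+\rho)}{\Gamma(\alpha)^2}\,
         \frac{\Gamma(\alpha+\rho)^2}{\Gamma(2\alpha+\rho)\,\Gamma(\rho)}\,
         \frac{\Gamma(\alpha+x)^2}{\Gamma(2\alpha+\rho+x)}\,\frac{1}{x!}
       = \frac{\Gamma(\alpha+\rho)^2}{\Gamma(\alpha)^2\,\Gamma(\rho)}\,
         \frac{\Gamma(\alpha+x)^2}{\Gamma(2\alpha+\rho+x)}\,\frac{1}{x!}.

% UGW(a,k,\rho) pmf (assumed standard form):
P(Y=x) = \frac{\Gamma(a+\rho)\,\Gamma(k+\rho)}{\Gamma(a)\,\Gamma(k)\,\Gamma(\rho)}\,
         \frac{\Gamma(a+x)\,\Gamma(k+x)}{\Gamma(a+k+\rho+x)}\,\frac{1}{x!},
% which reduces to the previous expression when a = k = \alpha.
```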
Hence, our model may be seen as a biparametric case of a UGW distribution when $\alpha > 0$. As a consequence, it inherits the properties of the UGW distribution, which are listed below:
  • It can be obtained from a two-step Poisson mixture:
    • $X \mid \lambda \sim P(\lambda)$
    • $\lambda \mid \alpha, v \sim Gamma(\alpha, v)$ with density
      $$ f(\lambda \mid \alpha, v) = \frac{1}{\Gamma(\alpha)\, v^{\alpha}}\, \lambda^{\alpha - 1} e^{-\lambda / v}, \quad \lambda > 0, \ \alpha, v > 0 $$
      Therefore, $X \mid \alpha, v \sim NB(\alpha, v)$ with pmf
      $$ f(x \mid \alpha, v) = \frac{1}{x!} \frac{\Gamma(x + \alpha)}{\Gamma(\alpha)} \left( \frac{1}{1 + v} \right)^{\alpha} \left( \frac{v}{1 + v} \right)^{x}, \quad x = 0, 1, \ldots $$
    • $v \mid \alpha, \rho \sim Beta(\alpha, \rho)$ with density
      $$ f(v \mid \alpha, \rho) = \frac{\Gamma(\alpha + \rho)}{\Gamma(\alpha)\Gamma(\rho)} \frac{v^{\alpha - 1}}{(1 + v)^{\alpha + \rho}}, \quad v > 0, \ \alpha, \rho > 0 $$
  • Since the EBW distribution with $\alpha > 0$ is a Poisson mixture, it is always overdispersed.
  • It converges to $P(\mu)$ when $\rho$ and $\alpha^2$ tend to infinity with the same order of convergence.
  • As a consequence of the mixture, the variance of $X$ can be split into three components known as randomness, liability and proneness, respectively:
    $$ \sigma^2 = \frac{\alpha^2}{\rho - 1} + \frac{\alpha^2 (\alpha + 1)}{(\rho - 1)(\rho - 2)} + \frac{\alpha^3 (\alpha + \rho - 1)}{(\rho - 1)^2 (\rho - 2)}. $$
    Since we have got rid of one of the first two parameters of the UGW distribution, the indetermination problem with regard to the components of the variance [4] disappears in the biparametric model and, therefore, it is not necessary to provide additional information when determining the partition of the variance.
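The two-step mixture listed above also gives a direct way to simulate from the EBW distribution when $\alpha > 0$. The sketch below is one possible implementation using only the Python standard library; the function names are illustrative, the density of $v$ is obtained by transforming a standard Beta draw $B$ through $v = B/(1-B)$, and the Poisson sampler is a simple textbook method rather than anything used by the authors.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for moderate rates;
    exp(-lam) underflows for very large lam, so this is only a sketch)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_ebw(alpha, rho, rng):
    """One draw from EBW(alpha, rho), alpha > 0, via the mixture above."""
    b = rng.betavariate(alpha, rho)          # B ~ Beta(alpha, rho) on (0, 1)
    while not 0.0 < b < 1.0:                 # redraw in the unlikely boundary case
        b = rng.betavariate(alpha, rho)
    v = b / (1.0 - b)                        # density prop. to v^(alpha-1) (1+v)^-(alpha+rho)
    lam = rng.gammavariate(alpha, v)         # Gamma with shape alpha and scale v
    return sample_poisson(lam, rng)          # X | lam ~ Poisson(lam)
```

For example, with `rng = random.Random(2021)`, drawing repeatedly from `sample_ebw(2.0, 5.0, rng)` should give a sample mean near $\alpha^2/(\rho - 1) = 1$.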
In order to assess the effect of each parameter on the variance components of the EBW model, we consider the proportion of variance explained by each one, that is:
$$ 1 = \frac{(\rho - 1)(\rho - 2)}{(\alpha + \rho - 1)^2} + \frac{(\alpha + 1)(\rho - 1)}{(\alpha + \rho - 1)^2} + \frac{\alpha}{\alpha + \rho - 1} $$
Figure 1 shows the evolution of the variance partition percentages for each component, first with $\alpha$ fixed and $\rho$ variable (low and high values) and then with $\rho$ fixed and $\alpha$ variable (low and high values). In the first column ($\alpha$ fixed), we can observe that the greater $\rho$ is, the more important the randomness becomes. In the second column ($\rho$ fixed), the greater $\alpha$ is, the more important the proneness becomes. If $\alpha$ and $\rho$ increase with the same order of convergence, the proneness has a lower limit of 50% of the variance, whereas the other two parts tend to 25% each.
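The three proportions can be evaluated directly from the expression above, which makes the limiting behaviour easy to check numerically. This is a minimal sketch; the function name `variance_shares` is an illustrative choice.

```python
def variance_shares(alpha, rho):
    """Shares of the EBW variance (alpha > 0, rho > 2) attributed to
    randomness, liability and proneness, in that order."""
    if alpha <= 0.0 or rho <= 2.0:
        raise ValueError("requires alpha > 0 and rho > 2")
    denom = (alpha + rho - 1.0) ** 2
    randomness = (rho - 1.0) * (rho - 2.0) / denom
    liability = (alpha + 1.0) * (rho - 1.0) / denom
    proneness = alpha / (alpha + rho - 1.0)
    return randomness, liability, proneness
```

For instance, `variance_shares(2.0, 5.0)` splits the variance into three equal thirds, while very large equal values of both arguments give shares close to 25%, 25% and 50%.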
Due to the structure of the UGW distribution, in which the first two parameters are interchangeable and appear in multiplicative form in the pmf, the moments and the decomposition of the variance, the maximum likelihood estimates of its first two parameters are usually almost equal. In fact, given a $UGW(a, k, \rho)$ distribution, there exists an $EBW(\alpha, \rho)$ distribution with $\alpha = \sqrt{ak}$ and the same parameter $\rho$ that is very close to the former. This can be seen in Figure 2, where the maximum Kullback-Leibler (KL) divergence (for more details see [25]) between the two models has been calculated for several values of $a$, $k$ and $\rho$. We have used the same scale in all the graphs in order to compare them. We observe that: (1) the divergence increases as $k$ moves away from $a$; (2) the difference between $a$ and $k$ is less relevant as $\rho$ increases; (3) in any case, the divergence is very small. Hence, the EBW type I distribution provides in many cases a similar fit but with one more degree of freedom.
In general, the EBW distribution is able to provide acceptable fits for data simulated from a UGW distribution. This implies that, in most cases, there exists an EBW model reasonably similar to the UGW, but with the advantage of having fewer parameters. To show this, we have simulated $M = 1000$ samples of size $N = 100$, $300$ and $500$ from a UGW distribution with several values of its parameters and, for each sample, we have obtained the corresponding EBW and UGW fits. All the estimates have been computed by the maximum likelihood method. We have implemented our own functions in R [26]: the pmf of the EBW and UGW distributions and the fitting function for both models. For the latter, we have used the optim function of the stats package with the L-BFGS-B method [27], since it allows box constraints, taking as initial values the estimates provided by the method of moments (see Section 3 for more details).
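The closeness of the two models can be explored with a truncated-sum approximation of the KL divergence. The sketch below assumes the usual UGW pmf from the literature (it is not restated in the text), takes $\alpha = \sqrt{ak}$ so that both models share the same mean, and returns the larger of the two directed divergences; the names and the truncation point are illustrative.

```python
import math

def ugw_logpmf(x, a, k, rho):
    """Log pmf of UGW(a, k, rho) in its usual form (an assumption here).
    EBW(alpha, rho) with alpha > 0 is the special case a = k = alpha."""
    return (math.lgamma(a + rho) + math.lgamma(k + rho) - math.lgamma(rho)
            - math.lgamma(a) - math.lgamma(k)
            + math.lgamma(a + x) + math.lgamma(k + x)
            - math.lgamma(a + k + rho + x) - math.lgamma(x + 1.0))

def kl_divergence(logp, logq, upper=20000):
    """Truncated sum of p(x) * log(p(x) / q(x)) over x = 0..upper."""
    total = 0.0
    for x in range(upper + 1):
        lp = logp(x)
        total += math.exp(lp) * (lp - logq(x))
    return total

def max_kl_ugw_ebw(a, k, rho):
    """Larger of the two directed KL divergences between UGW(a, k, rho)
    and the EBW(sqrt(a * k), rho) model with the same mean."""
    alpha = math.sqrt(a * k)
    ugw = lambda x: ugw_logpmf(x, a, k, rho)
    ebw = lambda x: ugw_logpmf(x, alpha, alpha, rho)
    return max(kl_divergence(ugw, ebw), kl_divergence(ebw, ugw))
```

Whether the maximum in Figure 2 is taken over the two directions, as done here, is an assumption of this sketch.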
For each group of 1000 samples we have computed the percentage of EBW fits achieved, as well as the percentage of these fits which are better than the corresponding UGW fit in two senses: the AIC value and the $\chi^2$ goodness-of-fit test. Specifically, for the EBW fits achieved we have computed the percentage whose AIC value is less than that of the UGW fit and the percentage of p-values in the $\chi^2$ goodness-of-fit test greater than $0.01$, $0.05$ and $0.1$ (that is, the null hypothesis that the data come from an EBW model cannot be rejected). These results appear in Table 1 and Table 2. We have also carried out the Kolmogorov-Smirnov test for discontinuous distributions [28] using the ks.test function of the dgof package in R, but in all cases the p-values are greater than 0.1.

2.2.2. Case α < 0 but α ∉ ℤ

In this case the model can be seen as a particular case of a CTP distribution [18,21], which arises when the polynomial $L(x)$ in Equation (1) has complex conjugate roots, so that the parameters are $\alpha = a + ib$ and $\beta = a - ib$; specifically, we have the following result.
Theorem 2.
If $\alpha < 0$ the $EBW(\alpha, \gamma)$ distribution with $\gamma > 0$ is a $CTP(\alpha, 0, \gamma)$ distribution.
Proof. 
The proof is straightforward since the pmf of the $CTP(\alpha, 0, \gamma)$ with $\alpha \in \mathbb{R}$ and $\gamma > 0$, see for instance [21], coincides with the pmf of the $EBW(\alpha, \gamma)$ given in Equation (3). □
This result is also true when $\alpha > 0$. So, $UGW(\alpha, \alpha, \rho) \equiv CTP(\alpha, 0, 2\alpha + \rho)$.
At this point we can justify the name chosen for the proposed model. On the one hand, when $\alpha > 0$ the model may be seen as a biparametric case of the UGW distribution, which is always overdispersed, and may replace it using fewer parameters; on the other hand, when $\alpha < 0$ the model can be underdispersed, so it may be considered as an underdispersed extension of a biparametric UGW distribution.
Once again, the proposed distribution inherits the properties of another distribution, in this case the CTP distribution, which we summarize next:
  • If $\frac{(\alpha - 1)^2}{\gamma - 2\alpha + 1} \in \mathbb{Z}$, the distribution has two consecutive modes at the values
    $$ \frac{(\alpha - 1)^2}{\gamma - 2\alpha + 1} - 1 = \frac{\alpha^2 - \gamma}{\gamma - 2\alpha + 1} \quad \text{and} \quad \frac{(\alpha - 1)^2}{\gamma - 2\alpha + 1}. $$
    Otherwise the distribution is unimodal with mode at 0 if $\alpha^2 < \gamma$ or at $\left[ \frac{(\alpha - 1)^2}{\gamma - 2\alpha + 1} \right]$, where $[\,\cdot\,]$ symbolises the integer part. Hence, the pmf is J-shaped or bell-shaped.
  • It may be underdispersed, equidispersed or overdispersed. Specifically:
    • It is underdispersed when $\alpha \le -1$ or when $-1 < \alpha < -0.5$ and $\gamma > \frac{3\alpha^2 + 4\alpha + 1}{2\alpha + 1}$.
    • It is equidispersed when $-1 < \alpha < -0.5$ and $\gamma = \frac{3\alpha^2 + 4\alpha + 1}{2\alpha + 1}$.
    • It is overdispersed when $\alpha \ge -0.5$ or when $-1 < \alpha < -0.5$ and $\gamma < \frac{3\alpha^2 + 4\alpha + 1}{2\alpha + 1}$.
  • A necessary condition to be infinitely divisible (i.d.) is that $\alpha > -0.5$ and $\gamma > \alpha^2 / (1 + 2\alpha)$. So, if $\alpha \le -0.5$ the EBW distribution is not i.d. As a consequence, an underdispersed EBW cannot be i.d., since a necessary condition to be underdispersed is $\alpha < -0.5$.
  • It converges to the:
    • $P(\mu)$ when $\gamma$ and $\alpha^2$ tend to infinity with the same order of convergence.
    • Normal distribution, $N(\mu, \sigma)$, when $\gamma$ and $\alpha$ diverge with the same order of convergence.
The CTP distribution cannot be expressed as a mixture, so for the EBW with $\alpha < 0$ there is no variance decomposition result.
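The mode and dispersion rules in the list above translate into a few lines of code, which can then be checked against the pmf and the moment formulas. This is a sketch with illustrative function names; the integrality test uses exact floating-point comparison, so a tolerance may be preferable in practice.

```python
import math

def ebw_logpmf(x, alpha, gamma):
    """Log pmf of EBW(alpha, gamma) from Definition 1 (alpha not an integer <= 0)."""
    return (2.0 * math.lgamma(gamma - alpha) - 2.0 * math.lgamma(alpha)
            - math.lgamma(gamma - 2.0 * alpha) + 2.0 * math.lgamma(alpha + x)
            - math.lgamma(gamma + x) - math.lgamma(x + 1.0))

def ebw_modes(alpha, gamma):
    """Mode(s) according to the closed-form rule quoted in the list."""
    if alpha ** 2 < gamma:
        return (0,)
    crit = (alpha - 1.0) ** 2 / (gamma - 2.0 * alpha + 1.0)
    if crit == math.floor(crit):             # two consecutive modes
        return (int(crit) - 1, int(crit))
    return (math.floor(crit),)

def ebw_dispersion(alpha, gamma):
    """Return 'under', 'equi' or 'over' according to the rule in the list."""
    if alpha <= -1.0:
        return "under"
    if alpha >= -0.5:
        return "over"
    threshold = (3.0 * alpha ** 2 + 4.0 * alpha + 1.0) / (2.0 * alpha + 1.0)
    if gamma > threshold:
        return "under"
    return "equi" if gamma == threshold else "over"
```

The classification presumes that the variance exists, that is, $\gamma > 2\alpha + 2$.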

3. Estimation

3.1. Methods for Obtaining Estimators

We can estimate the two parameters of the EBW distribution using the method of moments and the maximum likelihood estimation (MLE) method.
To apply the method of moments, we first solve the equations given in Equation (4). To this end we substitute $\gamma - 2\alpha - 1 = \alpha^2 / \mu$ in the equation of $\sigma^2$. Then,
$$ \sigma^2 = \mu\, \frac{\mu^2 + \alpha^2 + 2\alpha\mu}{\alpha^2 - \mu}, $$
which is equivalent to $\alpha^2 (\sigma^2 - \mu) - 2\mu^2 \alpha - \mu(\mu^2 + \sigma^2) = 0$. Then, replacing $\mu$ and $\sigma^2$ by their sample counterparts, $\bar{x}$ and $s^2$, and solving the equation, there are two possible estimates for $\alpha$ by the method of moments:
$$ \hat{\alpha}_1 = \frac{\bar{x}^2 + \sqrt{\bar{x}^4 + \bar{x}(s^2 - \bar{x})(\bar{x}^2 + s^2)}}{s^2 - \bar{x}}, \qquad \hat{\alpha}_2 = \frac{\bar{x}^2 - \sqrt{\bar{x}^4 + \bar{x}(s^2 - \bar{x})(\bar{x}^2 + s^2)}}{s^2 - \bar{x}} $$
It is clear that if data exhibit overdispersion, then $\hat{\alpha}_1 > 0$ and $\hat{\alpha}_2 < 0$. On the other hand, if data are underdispersed both $\hat{\alpha}_1$ and $\hat{\alpha}_2$ are negative. Once $\alpha$ has been estimated, the estimate of $\gamma$ is calculated as $\hat{\gamma} = \hat{\alpha}^2 / \bar{x} + 2\hat{\alpha} + 1$. Hence, there are also two possible estimates for $\gamma$, with the only restriction of being positive, which is true when:
  • $0 < \bar{x} < 1$, or
  • $\bar{x} > 1$ and $\hat{\alpha} < -\bar{x} - \sqrt{\bar{x}(\bar{x} - 1)}$ or $\hat{\alpha} > -\bar{x} + \sqrt{\bar{x}(\bar{x} - 1)}$.
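The method-of-moments recipe above can be sketched as follows. The function name is illustrative, and the handling of the equidispersed case (where the quadratic degenerates) is a choice made here, since the text does not discuss it.

```python
import math

def ebw_moment_estimates(mean, var):
    """Return the (alpha, gamma) pairs that solve the moment equations
    and satisfy gamma > 0; the list may be empty."""
    if var == mean:
        raise ValueError("equidispersed input: the quadratic in alpha degenerates")
    disc = mean ** 4 + mean * (var - mean) * (mean ** 2 + var)
    if disc < 0.0:
        return []                            # no real solution for alpha
    root = math.sqrt(disc)
    solutions = []
    for sign in (1.0, -1.0):                 # alpha_1 first, then alpha_2
        alpha = (mean ** 2 + sign * root) / (var - mean)
        gamma = alpha ** 2 / mean + 2.0 * alpha + 1.0
        if gamma > 0.0:
            solutions.append((alpha, gamma))
    return solutions
```

With mean 1 and variance 3 the only admissible pair is $\alpha = 2$, $\gamma = 9$; the second root, $\alpha = -1$, gives $\gamma = 0$ and is discarded.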
Using the MLE method we have to maximize the log-likelihood function. Thus, if $x = (x_1, \ldots, x_n)$ is a sample of size $n$, the expression of the log-likelihood function is:
$$ \ln L_{x_1, \ldots, x_n}(\alpha, \rho) = \sum_{i=1}^{n} \left[ 2 \ln \Gamma(\alpha + x_i) - \ln \Gamma(\rho + 2\alpha + x_i) \right] - n \left[ 2 \ln \Gamma(\alpha) - 2 \ln \Gamma(\rho + \alpha) + \ln \Gamma(\rho) \right], \tag{8} $$
when $\alpha > 0$, using the parametrization given in Section 2.2.1, or
$$ \ln L_{x_1, \ldots, x_n}(\alpha, \gamma) = \sum_{i=1}^{n} \left[ 2 \ln \Gamma(\alpha + x_i) - \ln \Gamma(\gamma + x_i) \right] - n \left[ 2 \ln \Gamma(\alpha) - 2 \ln \Gamma(\gamma - \alpha) + \ln \Gamma(\gamma - 2\alpha) \right], \tag{9} $$
otherwise. Both expressions can be maximized using numerical methods. In particular, we have used the L-BFGS-B method implemented in the optim function of the stats package in R. This method allows box constraints on the parametric space, so we can impose $\rho > 0$ or $\gamma > 0$ in Equations (8) and (9), respectively. We take the estimates obtained by the method of moments as initial values, in such a way that we maximize Equation (8) if $\hat{\alpha} > 0$ or Equation (9) otherwise.
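The log-likelihood for $\alpha > 0$ is straightforward to code. The following Python sketch evaluates it from a frequency table and, in place of the L-BFGS-B optimizer used by the authors in R, performs a crude grid search purely for illustration. As in the expression above, the term $-\sum_i \ln x_i!$, which does not depend on the parameters, is omitted; the function names and the dictionary input format are assumptions of this example.

```python
import math

def ebw_loglik(alpha, rho, counts):
    """Log-likelihood for alpha > 0 in the (alpha, rho) parametrization,
    up to an additive constant; counts maps each value x to its frequency."""
    n = sum(counts.values())
    data_part = sum(
        freq * (2.0 * math.lgamma(alpha + x) - math.lgamma(rho + 2.0 * alpha + x))
        for x, freq in counts.items()
    )
    norm_part = n * (2.0 * math.lgamma(alpha)
                     - 2.0 * math.lgamma(rho + alpha)
                     + math.lgamma(rho))
    return data_part - norm_part

def grid_search(counts, alphas, rhos):
    """Pick the (alpha, rho) pair on a grid with the largest log-likelihood."""
    return max(((a, r) for a in alphas for r in rhos),
               key=lambda pair: ebw_loglik(pair[0], pair[1], counts))
```

A gradient-based optimizer with box constraints, started at the method-of-moments estimates, would replace `grid_search` in any serious use.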

3.2. Properties of the Estimators

We have carried out a simulation study in order to analyse the performance of the estimates of the model parameters. Specifically, we have simulated $M = 1000$ samples of size $N = 500$ from the EBW distribution and we have fitted the EBW model to each sample using the MLE method described in the previous section.
We have considered two scenarios: $\alpha > 0$, in which case the EBW distribution is always overdispersed, and $\alpha < 0$, in which case the EBW distribution can be under- or overdispersed. In all cases the values of the parameters satisfy the conditions for the existence of $\mu$ and $\sigma^2$.
The results of the simulation procedure are shown in Table 3. Column 1 contains the mean bias and the s.d., in brackets (* indicates a significant bias at the 5% level based on a normal 95% confidence interval, given that there are 1000 observations). Column 2 shows the average mean square error (MSE) of the parameter estimates and Column 3 the percentage of simulations in which the parameter estimate does not differ significantly at the 5% level from the true value, known as coverage.
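The three summary measures can be computed from the simulated estimates roughly as follows. This is only a sketch: the text does not spell out how the coverage intervals were built, so a normal (Wald) interval based on each fit's standard error is assumed here, and all names are illustrative.

```python
import math

def summarize_estimates(estimates, std_errors, true_value, z=1.96):
    """Mean bias, its s.d., MSE and coverage for one parameter across M fits.
    Coverage assumes Wald intervals: estimate +/- z * standard error."""
    m = len(estimates)
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / m
    sd = math.sqrt(sum((err - bias) ** 2 for err in errors) / (m - 1))
    mse = sum(err * err for err in errors) / m
    covered = sum(1 for est, se in zip(estimates, std_errors)
                  if abs(est - true_value) <= z * se)
    return {
        "bias": bias,
        "sd": sd,
        "bias_significant": abs(bias) > z * sd / math.sqrt(m),
        "mse": mse,
        "coverage": covered / m,
    }
```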
We have only included low values of $\rho$ and $\gamma$ because the higher these values are compared with $\alpha$, the lower the mean and the variance are. In fact, if these parameters tend to infinity, holding $\alpha$ fixed, the EBW distribution degenerates at 0. In addition, if both $\alpha$ and $\rho$ (or $\gamma$) are high, the EBW is similar to the Poisson or the Normal distribution.
In general, we can deduce that:
  • If $\alpha > 0$ the estimates are biased to the right, but the bias decreases as $\alpha$ increases, holding $\rho$ fixed. The opposite happens with the bias of $\rho$, which increases as $\rho$ increases.
  • If $\alpha < 0$ the estimates are also biased, those for $\alpha$ to the left and those for $\gamma$ to the right, but the bias disappears as $\alpha$ decreases ($\alpha < -1$). Holding $\gamma$ fixed, the bias decreases as $\alpha$ decreases and the same happens for $\gamma$.
  • The average MSE is low for both parameter estimates, although this measure increases as $\rho$ (or $\gamma$) increases, since the accuracy and precision of the estimates decrease.
  • The coverage approaches 95%, the confidence level considered, which supports the validity of the inference made.

4. Comparison with Other Count Data Distributions

Next we study the differences and similarities between the EBW and other well-known biparametric discrete distributions for count data, using again the KL divergence. Specifically, we consider the NB distribution, the Complex Biparametric Pearson or CBP distribution [19], which is a particular case of the CTP distribution, the Generalized Poisson or GP distribution [29,30], the COM-Poisson or CMP distribution [31,32] and the Hyper-Poisson or HP distribution [33]. The first two are suitable only for overdispersed data, whereas the other three can cope with both underdispersed and overdispersed data, although the GP has finite range in the underdispersed case.
We focus on the overdispersed scenario, since the underdispersed one was already studied in [21], the EBW distribution being a particular case of the CTP distribution.
To compute the KL divergence between the EBW distribution and the above-mentioned distributions (and vice versa), we have considered several values of $\mu$ and $\sigma^2$, with $\sigma^2 > \mu$, and then we have obtained the corresponding values of the parameters of each distribution (see Appendix A). For the CMP and HP distributions it should be taken into account that not all the combinations of $\mu$ and $\sigma^2$ are possible; empirically there seems to be an upper limit for $\sigma^2$ at $\mu(\mu + 1)$. The values of the KL divergence are shown in Figure 3.
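The comparison procedure can be sketched for one competitor, the NB distribution: both models are matched to the same $\mu$ and $\sigma^2$ and the two directed KL divergences are approximated by truncated sums. The NB parametrization (size $r$ and probability $p$, with mean $r(1-p)/p$) is an assumption of this example and need not be the one used in [21]; the function names are illustrative.

```python
import math

def ebw_params_from_moments(mu, var):
    """Overdispersed case (var > mu): positive root alpha_1 and matching gamma."""
    disc = mu ** 4 + mu * (var - mu) * (mu ** 2 + var)
    alpha = (mu ** 2 + math.sqrt(disc)) / (var - mu)
    return alpha, alpha ** 2 / mu + 2.0 * alpha + 1.0

def nb_params_from_moments(mu, var):
    """NB size r and probability p matching the given mean and variance."""
    return mu ** 2 / (var - mu), mu / var

def ebw_logpmf(x, alpha, gamma):
    return (2.0 * math.lgamma(gamma - alpha) - 2.0 * math.lgamma(alpha)
            - math.lgamma(gamma - 2.0 * alpha) + 2.0 * math.lgamma(alpha + x)
            - math.lgamma(gamma + x) - math.lgamma(x + 1.0))

def nb_logpmf(x, r, p):
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1.0)
            + r * math.log(p) + x * math.log(1.0 - p))

def kl_divergence(logp, logq, upper=20000):
    """Truncated sum of p(x) * log(p(x) / q(x)) over x = 0..upper."""
    total = 0.0
    for x in range(upper + 1):
        lp = logp(x)
        total += math.exp(lp) * (lp - logq(x))
    return total

def kl_ebw_nb(mu, var):
    """Return (KL(EBW || NB), KL(NB || EBW)) for models matched on mu and var."""
    alpha, gamma = ebw_params_from_moments(mu, var)
    r, p = nb_params_from_moments(mu, var)
    ebw = lambda x: ebw_logpmf(x, alpha, gamma)
    nb = lambda x: nb_logpmf(x, r, p)
    return kl_divergence(ebw, nb), kl_divergence(nb, ebw)
```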
In general, we can observe that in an overdispersed scenario the models most distant from the EBW distribution are the CBP and HP distributions, and the closest ones are the GP and NB distributions. On the other hand, in an underdispersed scenario the HP distribution, which is very similar to the CMP distribution, is the closest one [21]. Nevertheless, these distances to the EBW distribution are very small, which implies that these distributions perform very similarly.

5. Examples

In this section we use the EBW distribution to fit both over- and underdispersed real data and we compare this fit with those obtained from other discrete distributions.

5.1. Overdispersed Data: Number of Some Sports Facilities by Municipality in Andalusia, 2015

We consider the variable X: number of some sports facilities by municipality in Andalusia, in 2015. Data have been directly obtained from the System of Multiterritorial Information of Andalusia (SIMA) of the Junta de Andalucía [34]. This category includes all sports facilities in the municipality except sports complexes, sports courts, pelota courts (frontons) and pools. A description of these data appears in Table 4, which contains the mean, variance, Aggregation Index (AI), quartiles and maximum.
We model these data with the following distributions: EBW, NB, GP, CBP, UGW and CMP. The AIC values, statistics and p-values corresponding to the $\chi^2$ goodness-of-fit test are shown in Table 5. We can see that the best fit is that provided by the EBW distribution. The Wald test supports this statement, since the null hypothesis $a = k$ (the first two parameters of the UGW distribution are equal) cannot be rejected: the statistic value is $1.31 \times 10^{-6}$ and the corresponding p-value is approximately 1. With the likelihood ratio test (LRT) we come to the same conclusion ($LRT = 2.36 \times 10^{-10}$ and p-value $\approx 1$).
Table 6 shows the observed and expected frequencies for the EBW, NB, GP and CBP fits. Figure 4 shows graphically the frequencies for values between 0 and 10 sports facilities. We can see that, in general, the EBW fit is very accurate (the greatest Pearson residual is 1.55, for the interval 18–21). On the other hand, the remaining distributions considered provide worse fits, especially for the lowest values of the variable (high Pearson residuals at 0, 1 and 2).
Additionally, we can calculate the three components of the variance. The percentage of data variability due to randomness, liability and proneness is 14.25%, 32.83% and 52.92%, respectively. We can observe that randomness does not play a very important role with respect to the total variability of the data and that the most important component is proneness, which refers to specific, internal conditions of the municipality rather than general (external) conditions, although liability is also remarkable. The idiosyncrasy of a municipality explains more than 50% of the variability in the number of these sports facilities, whereas shared conditions have a smaller but still noteworthy influence on this variability.

5.2. Underdispersed Data: Turkish Poem

We consider data about the word length (in terms of number of syllables) in the Turkish poem Gidisat by Ercüment Behzat Lâv, available in [35]. Following these authors, the count for 1 is treated as a count for 0, and in general the count for the response variable $X$ is treated as $X - 1$, as though the data were generated by adding 1 to the distribution. These data exhibit underdispersion, with a variance-mean ratio of 0.74 (see Table 4). Table 7 contains the parameter estimates, their standard errors (in parentheses), the AIC, the observed and expected frequencies and the corresponding Pearson $\chi^2$ test for each of the models that cope with underdispersion, that is, EBW, CTP, CMP and HP (the GP distribution has been excluded because it has finite range).
The CTP and EBW fits provide practically the same results. In fact, the hypothesis $b = 0$ cannot be rejected using the Wald test ($z_{exp} = 2.3 \times 10^{-5}$ and p-value $\approx 1$) or the LRT ($\chi^2_{exp} = 0$ and p-value = 1). Observed and expected frequencies for each fit are represented in Figure 5 (the CTP distribution has been omitted). Although the three fits are very similar and very good, the EBW fit is the most accurate in terms of the expected frequencies.

6. Conclusions

The EBW distribution is a very flexible biparametric discrete distribution that allows for modelling a wide variety of over- and underdispersed count data sets. There are other biparametric distributions that can also cope with over- and underdispersion, such as the GP, CMP or HP distributions, but the EBW distribution is easier to handle because its pmf and moments can be obtained explicitly in terms of the parameters. In this paper we have used this new model to fit data from different fields of knowledge, which shows its versatility with respect to possible applications. In addition, when the first parameter of the EBW distribution is positive, the variance can be split into three uniquely determined components. This property avoids the problem of indeterminacy present in the UGW distribution. Consequently, the EBW distribution is more adequate than other biparametric discrete distributions for modelling overdispersed data in which the non-random part of the variance has two components, neither of them negligible.
Furthermore, when the first parameter is a negative integer the EBW distribution has finite range and is underdispersed. Something similar happens with the GP distribution, which has finite range in the underdispersed case only, whereas the EBW distribution can also be underdispersed with infinite range.

Author Contributions

Data curation, J.R.-A.; Formal analysis, V.C.-L., M.J.O.-J. and J.R.-A.; Investigation, V.C.-L., M.J.O.-J. and J.R.-A.; Methodology, V.C.-L. and M.J.O.-J.; Software, M.J.O.-J. and J.R.-A.; Supervision, J.R.-A.; Writing—original draft, J.R.-A., V.C.-L. and M.J.O.-J.; writing—review and editing, J.R.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data for Example 1 are available at http://www.juntadeandalucia.es/institutodeestadisticaycartografia/sima/index2.htm and the data for Example 2 are included in the article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Obtaining the Parameters in Terms of μ and σ²

For the EBW distribution there is a pair of solutions for $\alpha$ and $\gamma$ from Equation (4):
$$ \alpha_1 = \frac{\mu^2 + \sqrt{\mu^4 + \mu(\sigma^2 - \mu)(\mu^2 + \sigma^2)}}{\sigma^2 - \mu}, \qquad \gamma_1 = \frac{\alpha_1^2}{\mu} + 2\alpha_1 + 1 \tag{A1a} $$
$$ \alpha_2 = \frac{\mu^2 - \sqrt{\mu^4 + \mu(\sigma^2 - \mu)(\mu^2 + \sigma^2)}}{\sigma^2 - \mu}, \qquad \gamma_2 = \frac{\alpha_2^2}{\mu} + 2\alpha_2 + 1 \tag{A1b} $$
It can be shown that if the EBW distribution is overdispersed, then $\alpha_1, \gamma_1 > 0$ and $\alpha_2 < 0$, with $\gamma_2 > 0$ if $\mu < 1$. If the EBW distribution is underdispersed, both $\alpha_1$ and $\alpha_2$ are negative, but:
  • $\gamma_1 > 0$ when $\mu < 1$ and $\sigma^2 > \mu(1 - \mu)$, or when $\mu \ge 1$ and $\sigma^2 > \frac{\mu - \mu^2 + \sqrt{\mu^3 (\mu - 1)}}{2}$
  • $\gamma_2 > 0$ when $\mu < 1$ and $\sigma^2 > \mu(1 - \mu)$.
As a consequence, if $\mu \ge 1$, the only possible solution is that given in Equation (A1a) for both cases (over- and underdispersed).
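The bound $\sigma^2 > \mu(1 - \mu)$ can be traced to the quantity under the square root in the expressions for $\alpha_1$ and $\alpha_2$. The short derivation below uses $D$ as a label introduced here for that quantity.

```latex
D = \mu^4 + \mu(\sigma^2-\mu)(\mu^2+\sigma^2)
  = \mu^4 + \mu\left(\sigma^4 + \mu^2\sigma^2 - \mu\sigma^2 - \mu^3\right)
  = \mu\,\sigma^2\left(\sigma^2 + \mu^2 - \mu\right),
\qquad\text{so}\qquad
D \ge 0 \iff \sigma^2 \ge \mu(1-\mu).
```

For $\mu \ge 1$ the factor $\sigma^2 + \mu^2 - \mu$ is positive for every $\sigma^2 > 0$, so real solutions always exist there; the restriction only matters when $\mu < 1$.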
Regarding the rest of the models, the expressions of their parameters in terms of $\mu$ and $\sigma^2$ can be seen in [21].

References

  1. Irwin, J.O. The generalized Waring distribution applied to accident theory. J. R. Stat. Soc. Ser. A 1968, 131, 205–225.
  2. Xekalaki, E. Infinite divisibility, completeness and regression properties of the univariate generalized Waring distribution. Ann. Inst. Stat. Math. 1983, 35, 279–289.
  3. Xekalaki, E. The univariate generalized Waring distribution in relation to accident theory: Proneness, spells or contagion? Biometrics 1983, 39, 887–895.
  4. Rodríguez-Avi, J.; Conde-Sánchez, A.; Sáez-Castillo, A.J.; Olmo-Jiménez, M.J. A new generalization of the Waring distribution. Comput. Stat. Data Anal. 2007, 51, 6138–6150.
  5. Rodríguez-Avi, J.; Conde-Sánchez, A.; Sáez-Castillo, A.J.; Olmo-Jiménez, M.J.; Martínez-Rodríguez, A.M. A generalized Waring regression model for count data. Comput. Stat. Data Anal. 2009, 53, 3717–3725.
  6. Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011.
  7. Vílchez-López, S.; Sáez-Castillo, A.J.; Olmo-Jiménez, M.J. GWRM: An R Package for Identifying Sources of Variation in Overdispersed Count Data. PLoS ONE 2016, 11, e0167570.
  8. Tesitelova, H. On the role of nouns in the lexical statistics. Prague Stud. Math. Linguist. 1967, 2, 121–131.
  9. Ajiferuke, I. A probabilistic model for the distribution of authorships. J. Am. Soc. Inf. Sci. 1991, 42, 279–289.
  10. Levene, M.; Fenner, T.; Loizou, G.; Wheeldon, R. A stochastic model for the evolution of the web. Comput. Netw. 2002, 39, 277–287.
  11. Xekalaki, E. The bivariate generalized Waring distribution and its application to Accident Theory. J. R. Stat. Soc. Ser. A 1984, 147, 488–498.
  12. Grunwald, G.K.; Bruce, S.L.; Jiang, L.; Strand, M.; Rabinovitch, N. A statistical model for under- or overdispersed clustered and longitudinal count data. Biom. J. 2011, 53, 578–594.
  13. Peng, Y.; Lord, D.; Zou, Y. Applying the generalized Waring model for investigating sources of variance in motor vehicle crash analysis. Accid. Anal. Prev. 2014, 73, 20–26.
  14. Ariza-López, F.J.; Rodríguez-Avi, J. Estimating the count of completeness errors in geographic data sets by means of a generalized Waring regression model. Int. J. Geogr. Inf. Sci. 2015, 29, 1394–1418.
  15. Huete-Morales, M.D.; Marmolejo-Martín, J.A. The Waring Distribution as a Low-Frequency Prediction Model: A Study of Organic Livestock Farms in Andalusia. Mathematics 2020, 8, 2025.
  16. Panaretos, J.; Xekalaki, E. Extension of the Application of Conway-Maxwell-Poisson Models: Analyzing Traffic Crash Data Exhibiting Underdispersion. Risk Anal. 1986, 4, 313–318. [Google Scholar]
  17. Johnson, N.L.; Kemp, A.W.; Kotz, S. Univariate Discrete Distributions, 3rd ed.; Wiley: New York, NY, USA, 2005. [Google Scholar]
  18. Rodríguez-Avi, J.; Conde-Sánchez, A.; Sáez-Castillo, A.J.; Olmo-Jiménez, M.J. A triparametric discrete distribution with complex parameters. Stat. Pap. 2004, 45, 81–95. [Google Scholar] [CrossRef]
  19. Rodríguez-Avi, J.; Olmo-Jiménez, M.J. A regression model for overdispersed data without too many zeros. Stat. Pap. 2017, 58, 749–773. [Google Scholar] [CrossRef]
  20. Jordan, C. Calculus on Finite Differences; Chelsea Publishing Company: London, UK, 1965. [Google Scholar]
  21. Olmo-Jiménez, M.J.; Rodríguez-Avi, J.; Cueva-López, V. A review of the CTP distribution: A comparison with other over- and underdispersed count data models. J. Stat. Comput. Simul. 2018, 88, 2684–2706. [Google Scholar] [CrossRef]
  22. Rodríguez-Avi, J.; Conde-Sánchez, A.; Sáez-Castillo, A.J.; Olmo-Jiménez, M.J. A generalization of the Beta-Binomial distribution. J. R. Stat. Soc. Ser. C 2007, 56, 51–61. [Google Scholar] [CrossRef]
  23. Sibuya, M. Generalized hypergeometric, digamma and trigamma distributions. Ann. Inst. Statist. Math. 1979, 31, 373–390. [Google Scholar] [CrossRef]
  24. Sibuya, M.; Shimizu, R. Classification of the generalized hypergeometric family of distributions. Keio Sci. Technol. Rep. 1981, 34, 1–38. [Google Scholar]
  25. Burnham, K.P.; Anderson, D.R. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, 2nd ed.; Springer: New York, NY, USA, 2002. [Google Scholar]
  26. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020. [Google Scholar]
  27. Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208. [Google Scholar] [CrossRef]
  28. Arnold, T.B.; Emerson, J.W. Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions. R J. 2011, 3, 34–39. [Google Scholar] [CrossRef] [Green Version]
  29. Consul, P.C. Generalized Poisson Distributions: Properties and Applications; Marcel Dekker: New York, NY, USA, 1989. [Google Scholar]
  30. Joe, H.; Zhu, R. Generalized Poisson Distribution: The Property of Mixture of Poisson and Comparison with Negative Binomial Distribution. Biom. J. 2005, 45, 219–229. [Google Scholar] [CrossRef] [PubMed]
  31. Conway, R.W.; Maxwell, W.L. A queuing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136. [Google Scholar]
  32. Sellers, K.F.; Borle, S.; Shmueli, G. The COM-Poisson model for count data: A survey of methods and applications. Appl. Stoch. Model. Bus. Ind. 2012, 28, 104–116. [Google Scholar] [CrossRef]
  33. Bardwell, G.E.; Crow, E.L. A two parameter family of hyper-Poisson distributions. J. Am. Stat. Assoc. 1964, 54, 133–141. [Google Scholar] [CrossRef]
  34. Instituto de Estadística y Cartografía de Andalucía. SIMA: Sistema de Información Multiterritorial de Andalucía. Available online: http://www.juntadeandalucia.es/institutodeestadisticaycartografia/sima/index2.htm (accessed on 2 January 2021).
  35. Wimmer, G.; Köhler, R.; Grotjahn, R.; Altmann, G. Towards a Theory of Word Length Distribution. J. Quant. Linguist. 1994, 1, 98–106. [Google Scholar] [CrossRef]
Figure 1. Percentages of the extended biparametric Waring (EBW) variance components: randomness (green solid line), liability (blue dashed line) and proneness (purple dotted line). Column 1 has values of α = 0.5, 1, 5 and 20, respectively, for ρ from 2.1 to 20. Column 2 has values of ρ = 2.5, 5, 10 and 20, respectively, for α from 0.1 to 20.
Figure 2. Maximum Kullback–Leibler divergence between the univariate generalized Waring $UGW(a, k, \rho)$ distribution and the $EBW(\sqrt{ak}, \rho)$ distribution with $a = 0.5, 1, 5$ and $10$ (from left to right).
Figure 3. Kullback–Leibler divergence between the negative binomial (NB), Complex Biparametric Pearson (CBP), Generalized Poisson (GP), COM-Poisson (CMP), Hyper-Poisson (HP) and EBW distributions (and vice versa) in an overdispersed scenario. Rows 1–3 have μ = 1, μ = 5 and μ = 10, respectively.
Figure 4. Observed and expected frequencies for data about other sports facilities (from 0 to 10).
Figure 5. Observed and expected frequencies for data about the number of syllables of a Turkish poem.
Table 1. Percentage of: (1) EBW fits achieved for UGW-generated data; (2) EBW fits with a lower AIC value than the corresponding UGW fit.

| $UGW(a, k, \rho)$ | Achieved EBW fits, N = 100 | N = 300 | N = 500 | < AIC, N = 100 | N = 300 | N = 500 |
|---|---|---|---|---|---|---|
| (0.5, 1, 2.5) | 94.4 | 93.4 | 95.9 | 92.6 | 89.3 | 86.4 |
| (0.5, 10, 2.5) | 99.8 | 100 | 100 | 53.4 | 25.1 | 9.9 |
| (0.5, 10, 20) | 95.2 | 93.8 | 94.4 | 99.1 | 95.5 | 89 |
| (1.5, 3, 2.5) | 100 | 100 | 100 | 88.3 | 88.8 | 87.2 |
| (1.5, 3, 25) | 97.9 | 99 | 98.5 | 100 | 100 | 99.9 |
| (1.5, 20, 25) | 100 | 100 | 100 | 96.9 | 82.3 | 74.3 |
| (4, 6, 2.5) | 99.9 | 100 | 100 | 90 | 89.8 | 91.1 |
| (4, 6, 10) | 100 | 100 | 100 | 96.3 | 91.1 | 92.3 |
| (4, 6, 50) | 95.8 | 96.5 | 97.6 | 100 | 99.9 | 99.8 |
Table 2. Percentage of samples that come from an EBW model at the 1%, 5% and 10% significance levels according to the $\chi^2$ goodness-of-fit test.

| $UGW(a, k, \rho)$ | p-value > 0.01, N = 100 | N = 300 | N = 500 | p-value > 0.05, N = 100 | N = 300 | N = 500 | p-value > 0.1, N = 100 | N = 300 | N = 500 |
|---|---|---|---|---|---|---|---|---|---|
| (0.5, 1, 2.5) | 98.9 | 97.9 | 97.4 | 95.4 | 94.1 | 93.3 | 91.6 | 91 | 88.8 |
| (0.5, 10, 2.5) | 92.9 | 93 | 95.9 | 86.1 | 86.8 | 90.5 | 80.5 | 81.6 | 86.1 |
| (0.5, 10, 20) | 99.2 | 98.6 | 97.2 | 95.4 | 95.3 | 93.3 | 89.2 | 89.6 | 87.1 |
| (1.5, 3, 2.5) | 95.6 | 92.3 | 91.2 | 91.2 | 87.5 | 84.3 | 86.4 | 83.3 | 77.6 |
| (1.5, 3, 25) | 100 | 97.8 | 98.1 | 95.9 | 94.3 | 93.7 | 88.9 | 87.6 | 87.6 |
| (1.5, 20, 25) | 99 | 98.5 | 98.5 | 95.5 | 94.3 | 92.9 | 90.9 | 89.7 | 88.9 |
| (4, 6, 2.5) | 88.7 | 83.9 | 81.2 | 82.8 | 77 | 73.9 | 78 | 72.7 | 70 |
| (4, 6, 10) | 97.8 | 96.8 | 95.4 | 93.7 | 91.4 | 91.4 | 90.3 | 86.3 | 87.8 |
| (4, 6, 50) | 98.1 | 96.1 | 96.6 | 94 | 91 | 91.8 | 89.8 | 87.3 | 87 |
Table 3. Mean bias and s.d. in brackets (* indicates a statistically significant bias at the 5% level), average mean square error (MSE) and coverage for EBW fits.

| Parameters ($\alpha > 0$) | Bias (s.d.) | MSE | Coverage | Parameters ($\alpha < 0$) | Bias (s.d.) | MSE | Coverage |
|---|---|---|---|---|---|---|---|
| α = 0.5 | 0.02 (0.10) * | 0.02 | 96.3 | α = −0.75 | −0.01 (0.05) * | 0.00 | 95.1 |
| ρ = 2.1 | 0.23 (0.80) * | 1.30 | 94.8 | γ = 0.75 | 0.02 (0.11) * | 0.02 | 95.2 |
| α = 1 | 0.01 (0.10) * | 0.02 | 95.5 | α = −0.75 | −0.01 (0.08) * | 0.01 | 97.1 |
| ρ = 2.1 | 0.07 (0.36) * | 0.27 | 96.3 | γ = 1.5 | 0.07 (0.35) * | 0.26 | 95.3 |
| α = 1.5 | 0.01 (0.13) * | 0.03 | 94.2 | α = −0.75 | −0.02 (0.14) * | 0.05 | 96.8 |
| ρ = 2.1 | 0.04 (0.29) * | 0.16 | 94.5 | γ = 3 | 0.29 (1.34) * | 5.01 | 94 |
| α = 1.5 | 0.01 (0.14) * | 0.04 | 94.4 | α = −1.5 | 0.01 (0.09) | 0.01 | 94.2 |
| ρ = 2.5 | 0.06 (0.39) * | 0.30 | 94.4 | γ = 0.75 | 0.00 (0.13) | 0.03 | 94.1 |
| α = 1.5 | 0.03 (0.29) * | 0.06 | 96.1 | α = −2.5 | −0.00 (0.10) | 0.02 | 94.5 |
| ρ = 3.5 | 0.14 (0.70) * | 1.00 | 96.6 | γ = 0.75 | 0.01 (0.20) | 0.07 | 94.2 |
Table 4. Descriptive statistics for data in examples.

| | $\bar{x}$ | $s^2$ | AI | $Q_1$ | $Q_2$ | $Q_3$ | Max |
|---|---|---|---|---|---|---|---|
| Facilities | 3.57 | 25.89 | 7.26 | 1 | 2 | 4 | 55 |
| Syllables − 1 | 1.58 | 1.17 | 0.74 | 1 | 2 | 2 | 5 |
Table 5. AIC values and $\chi^2$ goodness-of-fit test for data about some sports facilities.

| | EBW | NB | GP | CBP | UGW | CMP |
|---|---|---|---|---|---|---|
| AIC | 3532.7 | 3579.2 | 3548.7 | 3555.1 | 3534.7 | 3579.86 |
| $\chi^2$ | 10.981 | 38.407 | 18.897 | 36.047 | 10.98 | 37.620 |
| d.f. | 15 | 15 | 15 | 16 | 14 | 15 |
| p-value | 0.7540 | 0.0008 | 0.2184 | 0.0029 | 0.6876 | 0.0010 |
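Table 6. Observed and expected frequencies for data about other sports facilities. Empty cells are grouped with the nearest value above in the same column.

| X | Observed | Expected EBW | Expected NB | Expected GP | Expected CBP |
|---|---|---|---|---|---|
| 0 | 140 | 150.41 | 169.71 | 160.56 | 112.72 |
| 1 | 159 | 147.58 | 127.08 | 139.42 | 192.40 |
| 2 | 123 | 114.40 | 97.78 | 105.46 | 139.44 |
| 3 | 82 | 83.53 | 75.90 | 78.25 | 86.61 |
| 4 | 60 | 60.27 | 59.18 | 58.32 | 54.66 |
| 5 | 36 | 43.68 | 46.27 | 43.91 | 36.11 |
| 6 | 36 | 32.02 | 36.23 | 33.43 | 24.96 |
| 7 | 20 | 23.78 | 28.41 | 25.71 | 17.94 |
| 8 | 16 | 17.90 | 22.30 | 19.96 | 13.33 |
| 9 | 14 | 13.66 | 17.51 | 15.63 | 10.18 |
| 10 | 12 | 10.56 | 13.76 | 12.32 | 7.95 |
| 11 | 4 | 8.26 | 10.82 | 9.78 | 6.34 |
| 12 | 9 | 6.53 | 8.51 | 7.81 | 5.14 |
| 13 | 6 | 5.21 | 6.70 | 6.27 | 7.75 |
| 14 | 5 | 7.63 | 5.27 | 5.05 | |
| 15 | 5 | | 7.42 | 7.42 | 5.49 |
| 16 | 4 | 5.12 | | | |
| 17 | 1 | | 6.20 | 6.76 | 5.67 |
| 18 | 0 | 6.07 | | | |
| 19 | 2 | | | | |
| 20 | 1 | | 5.95 | 8.94 | 5.72 |
| 21 | 0 | | | | |
| 22–24 | 2 | 8.40 | | | |
| 25–33 | 4 | | | | 5.11 |
| 34–55 | 4 | | | | 7.48 |
| | | $\hat{\alpha} = 3.147\ (0.181)$ | $\hat{\theta} = 0.948\ (0.064)$ | $\hat{\lambda} = 1.535\ (0.059)$ | $\hat{b} = 1.727\ (0.097)$ |
| | | $\hat{\rho} = 3.800\ (0.383)$ | $\hat{\mu} = 3.566\ (0.151)$ | $\hat{\theta} = 0.570\ (0.018)$ | $\hat{\gamma} = 1.746\ (0.141)$ |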
Table 7. Parameter estimates, standard errors (in parentheses), observed and expected frequencies, AIC and $\chi^2$ test for fits to data about the word length of a Turkish poem.

| X | Observed | Expected EBW | Expected CTP | Expected CMP | Expected HP |
|---|---|---|---|---|---|
| 1 | 64 | 61.24 | 61.24 | 59.69 | 61.20 |
| 2 | 131 | 136.23 | 136.23 | 141.87 | 145.24 |
| 3 | 122 | 121.68 | 121.68 | 118.70 | 112.60 |
| 4 | 61 | 56.93 | 56.93 | 53.92 | 52.17 |
| 5 | 13 | 15.27 | 15.27 | 15.88 | 17.27 |
| ≥6 | 3 | 2.66 | 2.66 | 3.94 | 5.55 |
| | | $\hat{\alpha} = -10.530\ (2.144)$ | $\hat{a} = -10.530\ (2.158)$ | $\hat{\lambda} = 2.377\ (0.276)$ | $\hat{\gamma} = 0.485\ (0.099)$ |
| | | $\hat{\gamma} = 49.843\ (24.257)$ | $\hat{b} = 0.001\ (14.254)$ | $\hat{v} = 1.506\ (0.137)$ | $\hat{\lambda} = 1.151\ (0.104)$ |
| | | | $\hat{\gamma} = 49.843\ (24.416)$ | | |
| AIC | | 1158.3 | 1160.3 | 1160.7 | 1164.4 |
| $\chi^2$ statistic | | 1.000 | 1.000 | 2.914 | 6.014 |
| p-value | | 0.801 | 0.606 | 0.405 | 0.111 |
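As a rough check, the $\chi^2$ statistic and p-values reported in Table 7 for the EBW and CTP fits can be recomputed from the observed and expected frequencies. The sketch below uses only the Python standard library. The helper names `chi2_sf` and `pearson_chi2` are hypothetical, and it assumes six classes with d.f. = classes − 1 − number of estimated parameters (2 for EBW, 3 for CTP), a convention that is not stated explicitly in the table.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1, x >= 0."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        terms = (half**j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df: erfc term plus a finite sum over half-integer powers.
    terms = (half ** (j - 0.5) / math.gamma(j + 0.5)
             for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def pearson_chi2(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Frequencies copied from Table 7 (classes X = 1, ..., 5 and X >= 6).
observed = [64, 131, 122, 61, 13, 3]
expected_ebw = [61.24, 136.23, 121.68, 56.93, 15.27, 2.66]  # same values for CTP

stat = pearson_chi2(observed, expected_ebw)
df_ebw = len(observed) - 1 - 2  # assumption: two estimated parameters
df_ctp = len(observed) - 1 - 3  # assumption: three estimated parameters

print(f"chi2 = {stat:.3f}")
print(f"p-value (EBW, d.f. = {df_ebw}) = {chi2_sf(stat, df_ebw):.3f}")
print(f"p-value (CTP, d.f. = {df_ctp}) = {chi2_sf(stat, df_ctp):.3f}")
```

The results are close to the tabulated 1.000, 0.801 and 0.606. Small differences in the last digit are expected because the expected frequencies in the table are rounded to two decimals.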
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cueva-López, V.; Olmo-Jiménez, M.J.; Rodríguez-Avi, J. An Over and Underdispersed Biparametric Extension of the Waring Distribution. Mathematics 2021, 9, 170. https://doi.org/10.3390/math9020170