A Family of Finite Mixture Distributions for Modelling Dispersion in Count Data

Abstract: This paper considers the construction of a family of discrete distributions with the flexibility to cater for under-, equi- and over-dispersion in count data using a finite mixture model based on standard distributions. We are motivated to introduce this family because its simple finite mixture structure adds flexibility and facilitates application and use in analysis. The family of distributions is exemplified using a mixture of negative binomial and shifted negative binomial distributions. Some basic and probabilistic properties are derived. We perform hypothesis testing for equi-dispersion and simulation studies of their power and consider parameter estimation via maximum likelihood and probability-generating-function-based methods. The utility of the distributions is illustrated via their application to real biological data sets exhibiting under-, equi- and over-dispersion. It is shown that the distribution fits better than the well-known generalized Poisson and COM-Poisson distributions for handling under-, equi- and over-dispersion in count data.


Introduction
In the statistical analysis of count data, an important feature is the under-, equi- or over-dispersion of the data. There are many models that cater for a particular type of dispersion. In contrast, there are not many distributions with the flexibility to model under-, equi- and over-dispersion. Examples of these distributions are the hyper-Poisson [1], double Poisson [2] and weighted Poisson [3] distributions. Among these distributions, two popular distributions are the generalized Poisson distribution (GPD) [4] and the COM-Poisson distribution [5][6][7][8].
The COM-Poisson distribution arises from a single-queue, single-server system with state-dependent service rates [5]. Recently, this distribution has attracted a lot of interest and has been applied in many areas [6]. Refs. [11,12] considered the COM-Poisson distribution in the context of modelling time series of counts. The COM-Poisson distribution is a member of the exponential family of distributions, which confers several desirable properties for statistical inference. The COM-Poisson distribution has a pmf given by
$$P(X = k) = \frac{\lambda^{k}}{(k!)^{\nu}\, Z(\lambda, \nu)},$$
where $Z(\lambda, \nu) = \sum_{j=0}^{\infty} \lambda^{j}/(j!)^{\nu}$ for $\lambda > 0$ and $\nu \ge 0$. The COM-Poisson distribution is under-dispersed or over-dispersed when $\nu > 1$ or $\nu < 1$, respectively. The case $\nu = 1$ gives the Poisson distribution. The COM-Poisson distribution cannot be directly expressed in terms of its mean. To facilitate regression modelling, ref. [13] considered a mean-parametrized COM-Poisson model based on an iterative solution of a mean equation. Compared with over-dispersion, there is relatively less discussion in the literature about under-dispersion in count data. Ref. [14] reviewed under-dispersed count models and various sources of data under-dispersion. Ref. [15] discussed count models that are arbitrarily under-dispersed and showed that the mean-parametrized COM-Poisson distribution can handle arbitrarily small under-dispersion.
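As a minimal numerical sketch of the COM-Poisson pmf described above (not code from the paper), the normalizing series can be approximated by truncation. The function names, the argument names `lam` and `nu` for the two parameters, and the truncation length `terms` are assumptions made for illustration only.

```python
import math

def com_poisson_pmf(k, lam, nu, terms=200):
    """Approximate P(X = k) for a COM-Poisson(lam, nu) variable.

    The infinite normalizing series is truncated after `terms` terms
    (an assumption of this sketch; extreme parameters may need more).
    """
    # Work in log space so that lam**j / (j!)**nu cannot overflow.
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(v - top) for v in log_w))
    return math.exp(k * math.log(lam) - nu * math.lgamma(k + 1) - log_z)

def index_of_dispersion(lam, nu, terms=200):
    """Variance-to-mean ratio computed from the truncated pmf."""
    probs = [com_poisson_pmf(k, lam, nu, terms) for k in range(terms)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return var / mean

id_under = index_of_dispersion(2.0, 2.0)  # nu > 1
id_equi = index_of_dispersion(2.0, 1.0)   # nu = 1, the Poisson case
id_over = index_of_dispersion(2.0, 0.5)   # nu < 1
```

With these illustrative values the three indices fall below, at and above 1, matching the roles of the dispersion parameter stated in the text. The need to truncate an infinite series is the kind of numerical overhead that motivates the simpler mixture construction.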
It is to be noted that the distributions mentioned above have a restricted parameter range (GPD) or normalizing constants in the form of infinite series (double Poisson, COM-Poisson), which may add to the complexity of a statistical analysis. Recently, motivated by these issues, a particular case of the generalized shifted inverse trinomial (GIT) distribution [16], designated GIT$_{3,1}$ [17], has been proposed in order to cater for under-, equi- or over-dispersion in count data. It is noted that when the GIT is equi-dispersed, the distribution is non-Poisson. The GIT has a simple probabilistic structure as the convolution of negative binomial and binomial distributions. This is apparent from its probability-generating function (pgf), which is the product of a negative binomial pgf and a binomial pgf. It is a member of a general family of distributions formed via the convolution of binomial and pseudo-binomial variables [18], and Kemp gave conditions for the pgf of this family to be a valid pgf. Some members of this family of distributions have been considered in the literature. The GIT distribution has support $k = 0, 1, 2, \ldots$ and parameters $n$, a positive integer, and $p_i \ge 0$ for $i = 1, 2, 3$, with $\sum_{i=1}^{3} p_i = 1$.
The GIT pmf may also be expressed in terms of the Gauss hypergeometric function ${}_2F_1$. Other recent efforts to develop distributions that can handle under-, equi- or over-dispersion have been made through COM-Poisson-type extensions of binomial and negative binomial distributions [19][20][21][22][23][24]. Ref. [25] examined the flexible weighted Poisson distribution that extends the weighted Poisson of [3] and contains the COM-Poisson and hyper-Poisson distributions as special cases.
Even though the GIT has a simple probabilistic structure as a convolution of binomial and negative binomial random variables and is free from some of the limitations of the other popular distributions, the positive integer index parameter n can be an impediment in parameter estimation. Section 2 proposes a finite mixture model for the construction of a family of distributions that can cater for under-, equi- or over-dispersion. This is exemplified by a finite mixture of negative binomial and shifted negative binomial distributions. The negative binomial distribution is over-dispersed, but the finite mixture exhibits under-, equi- and over-dispersion. Section 2 also presents some basic properties of the finite mixture of negative binomial distributions, while Section 3 discusses parameter estimation and tests of equi-dispersion. Empirical modelling of biological count data is considered in Section 4, with concluding remarks presented in Section 5.

Finite Mixture of Distributions
Finite mixture models have been widely employed due to their flexibility and computationally convenient representation of complex distributions. Mixture formulations lead to simple techniques in statistical analysis; for example, ref. [26] gave a simple method for generating bivariate binomial samples with given marginals and correlation. A good review of this topic is found in [27]. A general family of two-component finite mixture distributions is first established and then a particular case of interest is examined.

Finite Mixture of a Distribution and Its Shifted Distribution
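Let $f(k; \Theta)$ be a probability mass function (pmf), where $\Theta$ is the vector of parameters, $k = 0, 1, 2, 3, \ldots$, and $f(k-1; \Theta)$ is the shifted pmf. Define the pmf $g(k; \Theta, \omega)$ of a random variable $X$ as
$$g(k; \Theta, \omega) = (1-\omega)\, f(k; \Theta) + \omega\, f(k-1; \Theta), \quad k = 0, 1, 2, 3, \ldots, \tag{1}$$
where $f(-1; \Theta) = 0$ and $0 < \omega < 1$. It is easy to check that $\sum_{k=0}^{\infty} g(k; \Theta, \omega) = 1$. If $G(t)$ is the probability-generating function (pgf) of $f(k; \Theta)$, the pgf of $X$ is
$$G_X(t) = (1 - \omega + \omega t)\, G(t). \tag{2}$$
Moments of the finite mixture may be obtained from this pgf. Let $\mu$ and $\sigma^2$ be the mean and variance of the component distribution $f(k; \Theta)$. The mean and variance of $X$ with pmf given by Equation (1) are
$$E(X) = \mu + \omega, \qquad \mathrm{Var}(X) = \sigma^2 + \omega(1-\omega).$$
The index of dispersion $ID$ is given by
$$ID = \frac{\sigma^2 + \omega(1-\omega)}{\mu + \omega}.$$
The distribution of finite mixture (1) is under-, equi- or over-dispersed according to $\sigma^2 + \omega(1-\omega) <, =, > \mu + \omega$, that is, $\sigma^2 - \mu <, =, > \omega^2$. The quantity $\sigma^2 - \mu <, =, > 0$ depends upon whether the component distribution is under-, equi- or over-dispersed. Since $\omega^2 > 0$, an under- or equi-dispersed component always gives an under-dispersed mixture, whereas an over-dispersed component can make the finite mixture (1) under-, equi- or over-dispersed, depending on $\omega$. This is illustrated in the next section by using a mixture of negative binomial (NB) distributions.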
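The moment results for the two-component mixture can be checked numerically. The sketch below is an illustration rather than the authors' code: it builds the mixture of a pmf and its shift by one for an arbitrary component and uses a Poisson component as an example of an equi-dispersed choice. The names `shifted_mixture_pmf`, `mean_var`, `omega` and the truncation point `kmax` are assumptions of this sketch.

```python
import math

def shifted_mixture_pmf(component_pmf, omega):
    """Return the pmf k -> (1 - omega) f(k) + omega f(k - 1), with f(-1) = 0."""
    def pmf(k):
        shifted = component_pmf(k - 1) if k >= 1 else 0.0
        return (1.0 - omega) * component_pmf(k) + omega * shifted
    return pmf

def mean_var(pmf, kmax=400):
    """Mean and variance from the pmf truncated at kmax (assumed large enough)."""
    probs = [pmf(k) for k in range(kmax)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var

def poisson_pmf(lam):
    """Poisson pmf, used here only as an equi-dispersed component."""
    return lambda k: math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

mu, omega = 3.0, 0.4
mix_mean, mix_var = mean_var(shifted_mixture_pmf(poisson_pmf(mu), omega))
# Expected from the shift-by-Bernoulli structure:
#   mean = mu + omega, variance = mu + omega * (1 - omega)
```

For this Poisson component the mixture variance is smaller than the mixture mean, so an equi-dispersed component leads to an under-dispersed mixture; an over-dispersed component is needed to reach all three dispersion regimes.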

Finite Mixture of Negative Binomial Distributions
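The negative binomial distribution is a popular model that only caters for over-dispersion. However, when used as the component in the finite mixture model (1), the resulting mixture model can handle under-, equi- or over-dispersion. The mixture of NB distributions has pmf
$$P(X = k) = (1-\omega)\,\frac{(r)_k}{k!}\, p^{k} (1-p)^{r} + \omega\,\frac{(r)_{k-1}}{(k-1)!}\, p^{k-1} (1-p)^{r}, \quad k = 0, 1, 2, \ldots, \tag{3}$$
where $0 < \omega < 1$, $0 < p < 1$, $r > 0$ and $(r)_k = \Gamma(r+k)/\Gamma(r)$; the second term is zero when $k = 0$.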
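The pgf of pmf (3) is given by
$$G_X(t) = (1-\omega)\left(\frac{1-p}{1-pt}\right)^{r} + \omega\, t \left(\frac{1-p}{1-pt}\right)^{r}. \tag{5}$$
The form of the pgf given by (5) shows that this finite mixture of NB distributions with pmf (3) can be called an NB-shifted NB distribution. The pgf (5) may also be written as
$$G_X(t) = (1 - \omega + \omega t)\left(\frac{1-p}{1-pt}\right)^{r}.$$
This is seen to be the convolution of Bernoulli and NB distributions. The mean and variance of this special case are, respectively,
$$E(X) = \omega + \frac{rp}{1-p}, \qquad \mathrm{Var}(X) = \omega(1-\omega) + \frac{rp}{(1-p)^2}.$$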
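A short sketch of the NB-shifted NB pmf may help make the construction concrete. It assumes the NB parametrization with pmf proportional to $p^k (1-p)^r$ (the one consistent with the dispersion conditions stated later in the paper); the function names, the argument names `omega`, `r`, `p` and the truncation point `kmax` are illustrative assumptions, not the authors' implementation.

```python
import math

def nb_pmf(k, r, p):
    """NB pmf Gamma(r + k) / (Gamma(r) k!) * p**k * (1 - p)**r; zero for k < 0."""
    if k < 0:
        return 0.0
    log_pmf = (math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
               + k * math.log(p) + r * math.log(1.0 - p))
    return math.exp(log_pmf)

def nb_shifted_nb_pmf(k, omega, r, p):
    """Mixture of an NB pmf (weight 1 - omega) and its shift by one (weight omega)."""
    return (1.0 - omega) * nb_pmf(k, r, p) + omega * nb_pmf(k - 1, r, p)

def moments(omega, r, p, kmax=2000):
    """Total mass, mean and variance from the pmf truncated at kmax."""
    probs = [nb_shifted_nb_pmf(k, omega, r, p) for k in range(kmax)]
    total = sum(probs)
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return total, mean, var

omega, r, p = 0.3, 2.5, 0.4
total, mean, var = moments(omega, r, p)
closed_mean = omega + r * p / (1.0 - p)
closed_var = omega * (1.0 - omega) + r * p / (1.0 - p) ** 2
```

The numerically computed mean and variance can be compared with the closed forms that follow from the Bernoulli-plus-NB convolution, which gives a quick consistency check on any implementation of the pmf.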

Weighted Negative Binomial Distribution
Let $X$ be a random variable with pmf $p(k)$ and assume that the probability of ascertaining the event $X = k$ has a weighting factor $w(k)$. The weighted distribution [28,29] with weight $w(k)$ has a pmf
$$p_w(k) = \frac{w(k)\, p(k)}{E[w(X)]}. \tag{6}$$
If $w(k) = k$, the distribution with pmf (6) is known as the size (length)-biased distribution.
By comparing Equations (4) and (7), it is seen that the finite mixture of NB can be regarded as a weighted NB distribution with weight $w(k) = (1-\omega) + \omega k/\{p\,(r+k-1)\}$ for $k \ge 1$ and $w(0) = 1-\omega$.
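The weighted-distribution view can be verified numerically. In the sketch below the weight is obtained by dividing the shifted NB term by the NB term, a derivation made for this illustration rather than quoted from the source; the NB parametrization, the function names and the truncation point `kmax` are likewise assumptions.

```python
import math

def nb_pmf(k, r, p):
    """NB pmf Gamma(r + k) / (Gamma(r) k!) * p**k * (1 - p)**r; zero for k < 0."""
    if k < 0:
        return 0.0
    return math.exp(math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
                    + k * math.log(p) + r * math.log(1.0 - p))

def mixture_pmf(k, omega, r, p):
    """NB-shifted NB pmf written as a two-component mixture."""
    return (1.0 - omega) * nb_pmf(k, r, p) + omega * nb_pmf(k - 1, r, p)

def weight(k, omega, r, p):
    """Weight w(k) = (1 - omega) + omega * k / (p * (r + k - 1)), with w(0) = 1 - omega."""
    if k == 0:
        return 1.0 - omega
    return (1.0 - omega) + omega * k / (p * (r + k - 1.0))

def weighted_nb_pmf_table(omega, r, p, kmax=2000):
    """Weighted NB pmf w(k) p(k) / E[w(X)], with E[w(X)] summed up to kmax."""
    raw = [weight(k, omega, r, p) * nb_pmf(k, r, p) for k in range(kmax)]
    norm = sum(raw)
    return [v / norm for v in raw], norm

omega, r, p = 0.3, 1.5, 0.4
table, norm = weighted_nb_pmf_table(omega, r, p)
max_gap = max(abs(table[k] - mixture_pmf(k, omega, r, p)) for k in range(30))
```

Under these assumptions the normalizing expectation equals 1, so the weighted pmf and the mixture pmf agree term by term.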

Conditions for Under-, Equi-and Over-Dispersion
The index of dispersion is
$$ID = \frac{\omega(1-\omega) + rp/(1-p)^2}{\omega + rp/(1-p)}.$$
The distribution is under-, equi- or over-dispersed according to whether $ID$ is less than, equal to or more than 1. Thus, the conditions for under-, equi- and over-dispersion are as follows: if $\sqrt{r}\,p/(1-p) < \omega$, the distribution is under-dispersed; it is equi-dispersed if $\sqrt{r}\,p/(1-p) = \omega$ and over-dispersed if $\sqrt{r}\,p/(1-p) > \omega$.
Remark 1. It should be noted that for the special case of equi-dispersion, the distribution does not reduce to the Poisson distribution. With the substitution $\omega = \sqrt{r}\,p/(1-p)$ into Equation (4), a weighted negative binomial distribution is obtained.
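The dispersion conditions reduce to comparing the mixing weight with a single threshold, which the following sketch makes explicit. The function names, the argument names `omega`, `r`, `p` and the tolerance `tol` used to detect equality in floating point are assumptions for illustration.

```python
import math

def dispersion_threshold(r, p):
    """Value of sqrt(r) * p / (1 - p) against which the mixing weight is compared."""
    return math.sqrt(r) * p / (1.0 - p)

def index_of_dispersion(omega, r, p):
    """Variance-to-mean ratio of the NB-shifted NB distribution."""
    mean = omega + r * p / (1.0 - p)
    var = omega * (1.0 - omega) + r * p / (1.0 - p) ** 2
    return var / mean

def classify_dispersion(omega, r, p, tol=1e-12):
    """Return 'under', 'equi' or 'over' from the threshold comparison."""
    gap = dispersion_threshold(r, p) - omega
    if abs(gap) <= tol:
        return "equi"
    return "under" if gap < 0 else "over"

# Illustrative values: r = 1 and p = 0.2 give a threshold of 0.25.
labels = [classify_dispersion(w, 1.0, 0.2) for w in (0.5, 0.25, 0.1)]
ratios = [index_of_dispersion(w, 1.0, 0.2) for w in (0.5, 0.25, 0.1)]
```

For the three illustrative mixing weights, the labels and the variance-to-mean ratios agree: a weight above the threshold gives a ratio below 1, a weight at the threshold gives exactly 1, and a weight below it gives a ratio above 1.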

Shapes of the Distribution
To examine the shapes of the NB-shifted NB distribution, the probabilities are computed for the different combinations of parameters in Table 1. Figure 1i shows that increasing the parameter $r$ decreases the probability at zero count dramatically but increases the index of dispersion mildly. The computed probability at zero drops drastically from 0.81 (case (a)) to almost zero (case (c)) when $r$ increases from 1 to 50. For case (a), the computed index of dispersion is close to 1 and the distribution can be considered equi-dispersed. When $\omega$ and p are fixed and $r$ (>1) is increased as shown in Table 1, the computed index of dispersion is always larger than 1, meaning that the distribution is over-dispersed. In addition, the modes shift to the right as $r$ is increased. The distribution also becomes longer and is always right-skewed, with a heavier right tail.
Figure 1. The pmf plots of the mixture distribution for the parameter combinations given in Table 1.
On the other hand, Table 1 shows that, when $r$ is fixed and $\omega$ and p are varied, either over- or under-dispersion is obtained. The distribution is longer when $\omega$ is larger than p, as presented in Figure 1ii. Meanwhile, case (f) has the highest computed index of dispersion. However, the distribution appears flattened when compared with the other cases.
From the results of [30] (p. 388), we can assert that the NB-shifted NB distribution is log-concave, since both the binomial (Bernoulli) and negative binomial distributions are log-concave (the latter when its index parameter $r \ge 1$) and log-concavity is preserved under convolution.

From Theorem 3 of [30] (p. 386), which states that a necessary and sufficient condition for a pmf $\{p_k\}$ to be strongly unimodal is that $\{p_k\}$ be log-concave, that is, $p_k^2 \ge p_{k-1}\,p_{k+1}$ for all k, it follows that the finite mixture of NB distributions is strongly unimodal.
The failure rate of a distribution with pmf $\{p_k\}$, $p_k > 0$, $k = 0, 1, 2, 3, \ldots$, is defined by
$$h_k = \frac{p_k}{\sum_{j=k}^{\infty} p_j}.$$
From the log-concavity property, the NB-shifted NB distribution has an increasing failure rate (IFR) [31]. Furthermore, the following implications hold:
$$\text{IFR} \Rightarrow \text{IFRA} \Rightarrow \text{NBU} \Rightarrow \text{NBUE} \Rightarrow \text{HNBUE},$$
where IFRA is 'increasing failure rate average', NBU is 'new better than used', NBUE is 'new better than used in expectation' and HNBUE is 'harmonic new better than used in expectation'. Thus, the NB-shifted NB distribution is IFR, IFRA, NBU, NBUE and HNBUE.
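Log-concavity and the increasing failure rate can be checked numerically for a given parameter set. The sketch below uses illustrative values with NB index `r = 2`, because the negative binomial pmf is log-concave only when its index is at least 1; the function names, the relative tolerance and the truncation at 400 terms are assumptions of this sketch.

```python
import math

def nb_pmf(k, r, p):
    """NB pmf Gamma(r + k) / (Gamma(r) k!) * p**k * (1 - p)**r; zero for k < 0."""
    if k < 0:
        return 0.0
    return math.exp(math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
                    + k * math.log(p) + r * math.log(1.0 - p))

def mixture_pmf(k, omega, r, p):
    """NB-shifted NB pmf."""
    return (1.0 - omega) * nb_pmf(k, r, p) + omega * nb_pmf(k - 1, r, p)

def is_log_concave(probs, rel_tol=1e-12):
    """Check p_k**2 >= p_{k-1} * p_{k+1} for all interior k, up to rounding."""
    return all(probs[k] ** 2 >= probs[k - 1] * probs[k + 1] * (1.0 - rel_tol)
               for k in range(1, len(probs) - 1))

def failure_rates(probs):
    """h_k = p_k / sum_{j >= k} p_j, using a reverse cumulative sum for the tail."""
    tail = 0.0
    rates = [0.0] * len(probs)
    for k in range(len(probs) - 1, -1, -1):
        tail += probs[k]
        rates[k] = probs[k] / tail
    return rates

probs = [mixture_pmf(k, 0.3, 2.0, 0.5) for k in range(400)]
rates = failure_rates(probs)[:60]      # keep the numerically reliable range
log_concave = is_log_concave(probs[:60])
ifr = all(b >= a - 1e-12 for a, b in zip(rates, rates[1:]))
```

Such a check is no substitute for the convolution argument, but it is a convenient safeguard when experimenting with other component distributions.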

Parameter Estimation
In this section, maximum likelihood (ML) estimation and probability-generating function-based estimation (pgf-estimator) are employed.
ML estimation is an efficient method for the estimation of unknown parameters, but the score equations involved can be complicated and difficult to solve. To overcome this numerical complexity, numerical optimization via the simulated annealing algorithm [32] is used.
An alternative estimation method based on the pgf is suggested because the NB-shifted NB pgf has a simple form. This method has been shown to provide quick and consistent estimates for discrete distributions and is robust to outliers [33]. The pgf-based statistic (8) considered here is an integral over $0 < t < 1$ built from $G_n(t) = \frac{1}{n}\sum_{i=1}^{n} t^{X_i}$ and $G(t) = E[t^{X}]$, which are, respectively, the empirical pgf and the theoretical pgf of the distribution. Ref. [33] demonstrated that the pgf estimator outperformed the ML estimator in terms of achieving low values of mean-square error and bias.
Remark 2. In the statistic (8), the integral over (0, 1) may be interpreted as averaging over the auxiliary variable t with respect to a uniform distribution. A non-uniform distribution can also be used.
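The ingredients of a pgf-based criterion can be sketched as follows. The empirical and theoretical pgfs follow the definitions in the text, but the integrand used here, the squared difference between the two pgfs, is an assumption of this sketch and may differ from the statistic (8) used in the paper; the midpoint rule, the grid size and the toy sample are also illustrative choices.

```python
def empirical_pgf(sample, t):
    """Empirical pgf: the average of t**x over the observed counts."""
    return sum(t ** x for x in sample) / len(sample)

def nb_shifted_nb_pgf(t, omega, r, p):
    """Theoretical pgf (1 - omega + omega t) * ((1 - p) / (1 - p t))**r."""
    return (1.0 - omega + omega * t) * ((1.0 - p) / (1.0 - p * t)) ** r

def pgf_distance(sample, omega, r, p, grid=200):
    """Midpoint-rule approximation of the integral over (0, 1) of the squared
    difference between the empirical and theoretical pgfs (assumed integrand)."""
    step = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * step
        diff = empirical_pgf(sample, t) - nb_shifted_nb_pgf(t, omega, r, p)
        total += diff * diff
    return total * step

toy_sample = [0, 1, 1, 2, 2, 3, 1, 0, 2, 1]   # hypothetical counts, not from the paper
d_near = pgf_distance(toy_sample, 0.5, 1.0, 0.3)
d_far = pgf_distance(toy_sample, 0.5, 1.0, 0.9)
```

A pgf-based estimate would be the parameter vector that minimizes such a distance; any general-purpose optimizer could be used for that step, which is omitted here.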
The test of equi-dispersion considered here has null hypothesis $H_0: \omega = \sqrt{r}\,p/(1-p)$. The power of this test is studied in the next subsection, and the results are provided in Table 2. The log-likelihood function can be written as $\ln L = \sum_{i=1}^{n} \ln P(X = x_i)$, where $x_1, x_2, \ldots, x_n$ is a random sample. Rao's score test statistic is given by $T = V^{\top} I^{-1} V$, where the score vector $V$ and information matrix $I$ are evaluated at the restricted ML estimates.
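The score vector $V$ is expressed in terms of $\psi_0$ and $\psi_1$, where $\psi_0$ is the digamma function and $\psi_1$ is the polygamma function of order 1.
The generalized likelihood ratio (GLR) test statistic is $-2 \ln\{L(\hat{\beta}^{*}; x)/L(\hat{\beta}; x)\}$, where $\hat{\beta}^{*}$ is the vector of restricted ML estimators under $H_0$ and $\hat{\beta}$ is the vector of unrestricted ML estimators under $H_1$. Under $H_0$, the test statistic is asymptotically $\chi^2$ distributed with 1 degree of freedom.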
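Once the restricted and unrestricted maximized log-likelihoods are available, the likelihood ratio statistic and its p value are simple to compute. The sketch below assumes the NB parametrization with pmf proportional to $p^k (1-p)^r$ and uses the identity that the upper tail of a chi-square variable with 1 degree of freedom equals `erfc(sqrt(x / 2))`; the function names and the two log-likelihood values are hypothetical and chosen only for illustration.

```python
import math

def nb_pmf(k, r, p):
    """NB pmf Gamma(r + k) / (Gamma(r) k!) * p**k * (1 - p)**r; zero for k < 0."""
    if k < 0:
        return 0.0
    return math.exp(math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
                    + k * math.log(p) + r * math.log(1.0 - p))

def log_likelihood(sample, omega, r, p):
    """Sum of log pmf values of the NB-shifted NB distribution over the sample."""
    return sum(math.log((1.0 - omega) * nb_pmf(x, r, p) + omega * nb_pmf(x - 1, r, p))
               for x in sample)

def glr_statistic(loglik_restricted, loglik_unrestricted):
    """-2 ln(L_restricted / L_unrestricted) from maximized log-likelihoods."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical maximized log-likelihoods, not results from the paper.
loglik_h0, loglik_h1 = -100.0, -98.0
stat = glr_statistic(loglik_h0, loglik_h1)
p_value = chi2_1df_pvalue(stat)
```

In practice the two inputs would come from maximizing `log_likelihood` with and without the equi-dispersion restriction on the mixing weight.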

Statistical Power Analysis of the Rao's Score and Generalized Likelihood Ratio Tests
In this section, Table 2 displays the power of the score and GLR tests from a Monte Carlo simulation study conducted with 1000 repetitions. In the simulation study, the significance level $\alpha$ is set at 5% and 10%. The sample sizes considered are N = 100 (small), 500 (moderate) and 1000 (large), and different levels of under-, equi- and over-dispersion are incorporated. Rao's score test and the GLR test are asymptotically equivalent.
Table 2 demonstrates that, in the scenario of equi-dispersion ($\omega = \sqrt{r}\,p/(1-p)$), the estimated empirical levels of Rao's score test and the GLR test approach each other as the sample size N increases. In the case of under-dispersion, the GLR test consistently exhibits a higher power than Rao's score test, even with small sample sizes. Meanwhile, a 100% detection rate is achieved for an effect size of 0.61 when the sample size $N \ge 500$. For over-dispersion, when the effect size is 0.25, Rao's score test outperforms the GLR test marginally. However, as the effect size increases to 0.60, a slightly stronger detection is observed with the GLR test, particularly when performed with larger sample sizes ($N = 1000$). It is evident that, as the deviation from $\omega = \sqrt{r}\,p/(1-p)$ increases, the power of the test also increases.
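The data-generation step of such a Monte Carlo study can be sketched using the convolution representation: an NB-shifted NB variate is an NB variate plus an independent Bernoulli variate. The inversion sampler, the function names, the seed and the parameter values below are assumptions for illustration and do not reproduce the simulation settings of Table 2.

```python
import random

def sample_nb(rng, r, p):
    """Draw one NB variate by inversion, using the pmf recurrence
    P(k + 1) = P(k) * p * (r + k) / (k + 1) with P(0) = (1 - p)**r."""
    u = rng.random()
    k = 0
    prob = (1.0 - p) ** r
    cum = prob
    while u > cum and k < 100000:   # the cap guards against rounding in the tail
        prob *= p * (r + k) / (k + 1)
        k += 1
        cum += prob
    return k

def sample_nb_shifted_nb(rng, omega, r, p):
    """NB variate plus an independent Bernoulli(omega) variate."""
    return sample_nb(rng, r, p) + (1 if rng.random() < omega else 0)

rng = random.Random(12345)
omega, r, p = 0.6, 1.0, 0.2          # under-dispersed: sqrt(r) p / (1 - p) = 0.25 < omega
draws = [sample_nb_shifted_nb(rng, omega, r, p) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
theory_mean = omega + r * p / (1.0 - p)
theory_var = omega * (1.0 - omega) + r * p / (1.0 - p) ** 2
```

Repeating this generation for each parameter setting and sample size, applying a test to each simulated sample and recording the rejection proportion gives the empirical power defined in the caption of Table 2.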

Modeling of Biological Count Data
The NB-shifted NB distribution has been fitted to count data with different indexes of dispersion. The three real-life data sets are as follows: (1) Frequency and distribution of chromosome aberrations (dicentrics plus ring) in peripheral blood lymphocytes irradiated in vitro with γ-rays (dose 10 in Table 3 and dose 6 in Table 4) to represent under- and equi-dispersed data [35].

Table 5. Fetal movement [36].
The performance of the NB-shifted NB distribution is compared with the GPD, GIT and COM-Poisson distributions. The parameter n of the GIT distribution is chosen to give the lowest chi-square value. For the data set with an index of dispersion (ID) of 0.97 (a value very close to 1), the Poisson distribution is also fitted as a comparison for equi-dispersed data. For the NB-shifted NB distribution, both ML estimation (MLE) and pgf-based estimation are used.

For Table 3, the fits obtained by the NB-shifted NB distribution are comparable with those of the GPD, GIT and COM-Poisson distributions, based on the p value. Both estimation methods work well, achieving similar chi-square values. Meanwhile, Table 4 shows that, based upon the p value and chi-square value, the fits by the NB-shifted NB and GIT distributions are significantly better than the others. For this data set, pgf-based estimation is slightly better than MLE, with a lower chi-square value. For the over-dispersed data, the NB-shifted NB distribution achieves the highest p value (for MLE) and the lowest chi-square values compared with the others. The MLE method slightly outperforms pgf-based estimation, as presented in Table 5. In addition, the tests of equi-dispersion (Rao's score and GLR tests) indicate that the null hypothesis should be rejected in Table 5, which agrees with the computed index of dispersion. For the two data sets in Tables 3 and 4, the simulation results presented in Table 6 show that the bias for one of the estimated parameters is relatively large.
To evaluate the performance of the MLE and the pgf-estimator in fitting the NB-shifted NB distribution to real-life data sets, a non-parametric bootstrap simulation [37] was conducted with 1000 repetitions. The simulation involved re-sampling, with replacement, from the original data set to generate the bootstrap samples. For each of these samples, the parameter estimates were computed. The non-parametric bootstrap provides these summaries without making specific assumptions about the underlying population distribution.
The results of this simulation are summarized in Table 6, which provides key statistics (the mean, standard error and confidence interval) from the bootstrap analysis for both the MLE and the pgf-estimator. Additionally, the bias of the MLE and pgf-estimator was assessed by comparing the average of the bootstrap estimates with the corresponding parameter estimates from the original data. The findings presented in Table 6 indicate that the pgf-estimator has a lower bias than the MLE. The mean, standard error, confidence intervals and bias computed for both estimators are nevertheless in close agreement. For the two data sets in Tables 3 and 4, the standard errors for one of the estimated parameters are relatively large, indicating low precision in the estimates.
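The bootstrap summaries described above (mean, standard error, percentile confidence interval and bias) can be sketched generically. To keep the example short, the sample index of dispersion stands in for the MLE or pgf-estimator; that substitution, the function names, the seed, the percentile interval and the toy data are all assumptions of this sketch.

```python
import random
import statistics

def sample_index_of_dispersion(data):
    """Sample variance divided by sample mean (stand-in for a fitted estimator)."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
    return var / mean

def bootstrap_summary(data, estimator, n_boot=1000, seed=0, level=0.95):
    """Non-parametric bootstrap: resample with replacement and summarize."""
    rng = random.Random(seed)
    estimate = estimator(data)
    boot = sorted(estimator(rng.choices(data, k=len(data))) for _ in range(n_boot))
    boot_mean = statistics.fmean(boot)
    lower = boot[int((1.0 - level) / 2.0 * n_boot)]
    upper = boot[int((1.0 + level) / 2.0 * n_boot) - 1]
    return {
        "estimate": estimate,
        "mean": boot_mean,
        "se": statistics.stdev(boot),
        "ci": (lower, upper),
        "bias": boot_mean - estimate,   # bootstrap mean minus original estimate
        "n_boot": len(boot),
    }

# Hypothetical counts used only to exercise the code; not data from the paper.
toy_counts = [0, 1, 1, 2, 1, 0, 2, 3, 1, 1, 2, 1, 0, 1, 2, 1, 1, 3, 2, 1,
              0, 1, 2, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 3, 1, 1, 0, 2, 1, 1]
summary = bootstrap_summary(toy_counts, sample_index_of_dispersion)
```

Replacing `sample_index_of_dispersion` with a function that returns a fitted parameter gives the corresponding summary for that parameter.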

Concluding Remarks
A two-component mixture model of a distribution with its shifted counterpart has been proposed in order to generate models that can cater for under-, equi- and over-dispersion. As an example, the paper considers the finite mixture of NB and shifted NB distributions. Unlike the generalized Poisson distribution, this distribution has no problem with the range of the parameters. Even though the COM-Poisson distribution has a simple pmf, the computation of the normalizing constant could encounter problems for extreme values of the parameters. The GIT distribution does not face these issues. However, in comparison, the finite mixture of NB distributions (NB-shifted NB) is much simpler. The proposed distribution has been fitted to a variety of biological data sets. The good fits shown by the distribution relative to the GPD, COM-Poisson distribution and GIT distribution indicate that it is a viable and simple alternative model for under-, equi- and over-dispersion. The use of components other than NB and shifted NB for the finite mixture will be considered elsewhere.

Table 1. Different combinations of parameters under the NB-shifted NB distribution.

Table 2. Simulated power of the score and GLR tests for the NB-shifted NB distribution (power = number of rejections divided by number of repetitions).

Table 6. Summary statistics for the MLE and pgf-estimator based on bootstrap simulation.