Abstract
Multivariate count data are often modeled via a multivariate Poisson distribution, but this model carries a constraining assumption of equi-dispersion (the variance equals the mean). Real data are often over-dispersed and, as such, are commonly modeled with various forms of a negative binomial structure. While over-dispersion is more prevalent than under-dispersion in real data, examples of under-dispersed data are surfacing with greater frequency. Thus, there is a demonstrated need for a flexible model that can accommodate both types of dispersion. We develop a multivariate Conway–Maxwell–Poisson (MCMP) distribution to serve as a flexible alternative for correlated count data that exhibit dispersion. This structure contains the multivariate Poisson, multivariate geometric, and multivariate Bernoulli distributions as special cases, and serves as a bridge distribution across these three classical models to address other levels of over- or under-dispersion. In this work, we derive the distributional form and statistical properties of this model, address parameter estimation, establish hypothesis tests to detect statistically significant data dispersion and aid in model parsimony, and illustrate the distribution’s flexibility through several simulated and real-world data examples. These examples demonstrate that the MCMP distribution performs on par with the multivariate negative binomial distribution for over-dispersed data, and proves particularly beneficial in representing under-dispersed data. Thus, the MCMP distribution offers an effective, unifying framework for modeling over- or under-dispersed multivariate correlated count data that do not necessarily adhere to Poisson assumptions.
1. Introduction
There exists a rich history of research regarding multivariate discrete distributions [1]. Krishnamoorthy [2] introduced a multivariate binomial (MB) distribution for the d- dimensional vector from a table with a factorial moment generating function (fmgf)
where is the probability of , ; denotes the probability of ; and so on. Utilizing this form, Krishnamoorthy [2] further introduced the multivariate Poisson (MP) distribution as the limiting distribution of the multivariate binomial distribution wherein all of the probabilities appearing in Equation (1) have order and as , where ∘ denotes the corresponding probability subscripts in Equation (1). Accordingly, the fmgf of the MP distribution for a random vector is
Mahamunulu [3] noted that the MP distribution can likewise be derived by defining as the sum of independent Poisson() random variables , where ∗ denotes all subscripts involving with distributed as Poisson(), and as Poisson() where denotes the sum of the associated parameters with subsets , for such that . The corresponding joint probability generating function (pgf) has the form
where [3,4]. From (3), it is evident that the variables , , ..., have marginal Poisson distributions, and it can be further shown that all pairs of variables ’s are positively correlated.
While the MP distribution is a popular model for describing correlated discrete random variables, it is well known that Poisson models are constrained by their underlying assumption of equi-dispersion; analogous negative binomial (NB) models serve as a popular alternative due to their ability to address data over-dispersion [5]. Doss [6] discussed a multivariate negative binomial (MNB) distribution with joint pgf
for . From (4), it is evident that the variables , , ..., have marginal NB distributions, which are known to be over-dispersed. For this reason, the MNB distribution can only accommodate data over-dispersion; accordingly, correlated under-dispersed data are at best fitted by an MP model, whose parameter estimates will nonetheless be biased. Therefore, in this work, we review the Conway–Maxwell–Poisson (CMP) distribution and develop a multivariate CMP (MCMP) distribution as a flexible alternative for modeling correlated discrete count data. Section 2 reviews the CMP distribution and its bivariate analog as motivation. Section 3 develops the MCMP distribution, discusses its associated properties, and introduces approaches for parameter estimation and hypothesis testing. Section 4 demonstrates the model flexibility by means of simulated and real data examples. Finally, Section 5 concludes the manuscript with discussion, while the appendices contain more detailed derivations and the datasets referenced in this work.
2. Conway–Maxwell–Poisson Distribution
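The CMP distribution [7] has the probability mass function (pmf)
$$P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)}, \qquad y = 0, 1, 2, \ldots,$$
for a random variable Y, where $\nu \ge 0$ is the dispersion parameter, $\lambda > 0$ generalizes the Poisson rate parameter, and $Z(\lambda, \nu) = \sum_{j=0}^{\infty} \lambda^{j}/(j!)^{\nu}$ denotes the normalizing constant. Equi-dispersion relative to the Poisson distribution is represented when $\nu = 1$, while data over-dispersion (under-dispersion) occurs when $\nu < 1$ ($\nu > 1$). The CMP($\lambda, \nu$) distribution contains three well-known distributions as special cases: Poisson with rate parameter $\lambda$ when $\nu = 1$; geometric with success probability $1 - \lambda$ when $\nu = 0$ and $\lambda < 1$; and Bernoulli with success probability $\lambda/(1+\lambda)$ when $\nu \to \infty$ [8].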
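The special cases can be checked numerically. The following is a minimal sketch, assuming the standard CMP pmf $P(Y=y) = \lambda^{y}/\{(y!)^{\nu} Z(\lambda,\nu)\}$ and approximating the infinite sum $Z(\lambda,\nu)$ by a finite truncation; the function name `cmp_pmf` and the truncation length `max_terms` are illustrative choices, not part of the original work.

```python
import math

def cmp_pmf(lam, nu, max_terms=500):
    """Truncated CMP pmf, returned as a list indexed by y = 0..max_terms-1.

    The normalizing constant Z(lam, nu) is approximated by a finite sum,
    and weights are computed on the log scale to avoid overflow.
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_terms)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# nu = 1: Poisson(lam); nu = 0 with lam < 1: geometric with success
# probability 1 - lam; very large nu: approximately Bernoulli(lam / (1 + lam)).
poisson_like = cmp_pmf(2.5, 1.0)
geometric_like = cmp_pmf(0.4, 0.0)
bernoulli_like = cmp_pmf(1.5, 50.0)
print(poisson_like[3], geometric_like[2], bernoulli_like[1])
```

For $\nu = 0$ the truncation is only reasonable when $\lambda$ is well below 1, since the weights then decay geometrically rather than factorially.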
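The distribution’s moments can be represented recursively as
$$E(Y^{r+1}) = \begin{cases} \lambda\, E\big[(Y+1)^{1-\nu}\big], & r = 0, \\ \lambda \dfrac{\partial}{\partial \lambda} E(Y^{r}) + E(Y)\, E(Y^{r}), & r > 0; \end{cases}$$
in particular, its expected value and variance are
$$E(Y) = \frac{\partial \log Z(\lambda, \nu)}{\partial \log \lambda} \approx \lambda^{1/\nu} - \frac{\nu - 1}{2\nu}, \tag{5}$$
$$\mathrm{Var}(Y) = \frac{\partial E(Y)}{\partial \log \lambda} \approx \frac{1}{\nu}\, \lambda^{1/\nu}, \tag{6}$$
where the approximations provided in Equations (5) and (6) hold for $\nu \le 1$ or $\lambda > 10^{\nu}$ [9,10]. Further, the CMP has the moment generating function (mgf) $E(e^{tY}) = Z(\lambda e^{t}, \nu)/Z(\lambda, \nu)$ and pgf $E(t^{Y}) = Z(\lambda t, \nu)/Z(\lambda, \nu)$.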
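As an illustration of how the mean and variance respond to the dispersion parameter, the sketch below computes both by direct (truncated) summation and alongside the commonly cited closed-form approximations $\lambda^{1/\nu} - (\nu-1)/(2\nu)$ and $\lambda^{1/\nu}/\nu$. The function names and the truncation length are assumptions made for this example only.

```python
import math

def cmp_moments(lam, nu, max_terms=1000):
    """Mean and variance of CMP(lam, nu) by truncated summation."""
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_terms)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    mean = sum(y * v for y, v in enumerate(w)) / z
    second = sum(y * y * v for y, v in enumerate(w)) / z
    return mean, second - mean ** 2

def cmp_moment_approx(lam, nu):
    """Closed-form approximations to the CMP mean and variance."""
    root = lam ** (1.0 / nu)
    return root - (nu - 1.0) / (2.0 * nu), root / nu

# Compare the summation-based moments with the approximations for an
# over-dispersed, an equi-dispersed, and an under-dispersed setting.
for lam, nu in [(3.0, 0.5), (4.0, 1.0), (3.0, 2.0)]:
    print((lam, nu), cmp_moments(lam, nu), cmp_moment_approx(lam, nu))
```

The variance exceeds the mean when $\nu < 1$ and falls below it when $\nu > 1$, which is the behavior the dispersion parameter is meant to capture.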
Sellers et al. [11] construct a bivariate CMP model by means of the compounding method, wherein the joint conditional distribution of has a bivariate binomial distribution and the number of trials n is CMP() distributed. The pmf of is
where is the multinomial coefficient, and it has the joint pgf
for some parameters, , and probabilities such that , for , and for . This bivariate CMP distribution yields the three special bivariate cases that are achieved in their univariate analogs: for , the bivariate CMP distribution reduces to the bivariate Poisson [12,13]; when , we obtain the bivariate Bernoulli distribution [14]; and, for , , and , the bivariate CMP distribution reduces to a bivariate geometric model [11].
3. Multivariate Conway–Maxwell–Poisson Distribution
Generalizing the compounding approach in [11], we develop a convenient form for the MCMP distribution. Consider d random variables that, given some number of trials n, jointly have a conditional MB distribution with pgf
(Equation (37.71) of [1]), where n is a CMP random variable. The compounding technique that is formulated as a CMP-stopped MB (i.e., where the MB index parameter is CMP distributed) can then be applied, resulting in the corresponding MCMP distribution’s pgf as
where for some . Equation (8) contains parameters, but its degrees of freedom equals due to the restriction, ; this adds difficulty in the determination of model parameter maximum likelihood estimates (MLEs). We circumvent this issue through the reparametrization, . Each variable is independent under this parameterization, where
and . For simplicity, we use to denote but recognize that is no longer an independent parameter in the ensuing discussion. The pgf of the MCMP distribution can now be parameterized as
As is the case of the univariate and bivariate CMP, this MCMP includes the MP, multivariate geometric, and multivariate Bernoulli distributions all as special cases, where maintains representation as the dispersion parameter. When , this MCMP pgf reduces to the form of the MP joint pgf (see Equation (3)). When and , its pgf becomes
where ; where denotes a 1 in the ith position, ; where denote 1s in the locations; . This is the pgf of a multivariate geometric distribution (i.e., the MNB distribution pgf in Equation (4) with ). Finally, when , this MCMP becomes a multivariate Bernoulli (i.e., the MB in [2] with ) with and all remaining probabilities are where at least one of equals 1, . More broadly, denotes the equi-dispersion case while reflects data over-dispersion (under-dispersion), both for the joint distribution and the respective marginal distributions.
We derive the MCMP pmf by taking partial derivatives of the pgf, i.e.,
see Appendix A for pertinent details. Moving forward, we shall illustrate the MCMP results using the trivariate case as motivation, where the joint fmgf reduces to from which we can obtain the moments and product moments, respectively; see Appendix B for all relevant details. These results confirm that the dispersion parameter denotes the type of data-dispersion for the joint and marginal distributions, and the correlations between any two random variables is non-negative with for any variables .
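The compounding construction described in this section (a multivariate binomial whose number of trials is CMP distributed) also suggests a direct way to simulate MCMP data. The sketch below is one possible implementation under stated assumptions: the number of trials is drawn by inverse-CDF sampling on a truncated support, and each trial falls into one cell of a $2^d$ table of binary outcomes with user-supplied probabilities. The names `sample_cmp`, `sample_mcmp`, and `cell_probs` are hypothetical.

```python
import itertools
import math
import random

def sample_cmp(lam, nu, rng, max_terms=500):
    """Draw n ~ CMP(lam, nu) by inverse-CDF sampling on a truncated support."""
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_terms)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    u = rng.random() * sum(w)
    acc = 0.0
    for y, v in enumerate(w):
        acc += v
        if u < acc:
            return y
    return max_terms - 1

def sample_mcmp(lam, nu, cell_probs, rng):
    """Draw one MCMP vector: n ~ CMP, then n trials over the binary cells.

    cell_probs maps each binary tuple of length d to its probability.
    Returns the count vector together with the latent number of trials n.
    """
    cells = list(cell_probs)
    weights = [cell_probs[c] for c in cells]
    d = len(cells[0])
    n = sample_cmp(lam, nu, rng)
    counts = [0] * d
    for _ in range(n):
        cell = rng.choices(cells, weights=weights)[0]
        for i, bit in enumerate(cell):
            counts[i] += bit
    return tuple(counts), n

# Trivariate example with equal cell probabilities (an arbitrary choice).
rng = random.Random(2021)
probs = {cell: 1 / 8 for cell in itertools.product((0, 1), repeat=3)}
print([sample_mcmp(2.0, 0.7, probs, rng)[0] for _ in range(5)])
```

Because all components share the same latent number of trials, pairs of components cannot be negatively correlated under this construction.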
3.1. Parameter Estimation
We perform parameter estimation by the method of maximum likelihood (ML). Considering the trivariate case of Equations (9) and (12), there are nine parameters required to specify a trivariate CMP distribution, namely, for ; ; and . Accordingly, the log-likelihood has the form
where denotes the jth observation in the ith data dimension, denotes the vector of the entire data set in the ith dimension; the precise form of is provided in Equation (A4) in Appendix A. The resulting score equations, however, do not have a closed form solution. For this reason, we carry out the statistical computations by using optimizing routines in R [15].
To perform the parameter estimation, we use the optim function where the negated form of the log-likelihood (Equation (13)) serves as the function to be optimized, and the L-BFGS-B method and its default convergence criteria are applied. Additionally, we approximate the standard errors of the estimated parameters by calculating the square root of the diagonal of the inverse Hessian matrix based on the approximate form obtained from optim. The complexity of the MCMP distribution, however, brings with it some computational difficulties when applying optim. The resulting MLE can vary considerably depending on the choice of starting values. To avert this, we consider several starting points including an exhaustive search in order to potentially improve the estimation result. Meanwhile, the resulting Hessian matrix provided from optim sometimes produces an inverse matrix containing negative diagonal elements; this violates the presumed positive semidefinite form of the Fisher information matrix. For these reasons, we recommend utilizing a parametric bootstrap method as an alternative approach for quantifying variability in the parameter estimates.
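The parametric bootstrap recommended above can be sketched generically: fit the model, simulate new data sets from the fitted model, refit each one, and use the spread of the refitted estimates as a standard error. The example below is written in Python rather than R and, to stay short and self-contained, uses a univariate Poisson model (whose MLE is the sample mean) as a stand-in for the MCMP fit; the functions `fit_poisson`, `simulate_poisson`, and `parametric_bootstrap_se` are illustrative and are not the authors' code.

```python
import math
import random
import statistics

def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    limit = math.exp(-rate)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def fit_poisson(data):
    """MLE of the Poisson rate (stand-in for a numerical MCMP fit)."""
    return sum(data) / len(data)

def simulate_poisson(rate, n, rng):
    """Simulate n observations from the fitted stand-in model."""
    return [sample_poisson(rate, rng) for _ in range(n)]

def parametric_bootstrap_se(data, fit, simulate, n_boot, rng):
    """Return the estimate and its parametric-bootstrap standard error."""
    estimate = fit(data)
    replicates = [fit(simulate(estimate, len(data), rng)) for _ in range(n_boot)]
    return estimate, statistics.stdev(replicates)

rng = random.Random(7)
observed = simulate_poisson(4.0, 100, rng)  # synthetic data for illustration
print(parametric_bootstrap_se(observed, fit_poisson, simulate_poisson, 300, rng))
```

For a vector-valued fit, the same loop would be applied component by component; the approach sidesteps the Hessian inversion entirely, at the price of refitting the model `n_boot` times.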
3.2. Hypothesis Testing
To check if a multivariate count data set suffers from any statistically significant data dispersion such that the MP distribution is unsuitable (favoring the MCMP distribution), we conduct the hypothesis test, : versus : . We do not concern ourselves with the direction of the data dispersion because the MCMP distribution can accommodate both over- and under-dispersion. Nonetheless, the resulting statistical inference, along with the estimate for , offers guidance regarding the type of dispersion present in the data. We use the likelihood ratio test (LRT) statistic, , where and , respectively, denote the maximum log-likelihoods associated with the MP and MCMP models. Theoretically, follows a distribution and thus can be used to assess whether the data are reasonably distributed as a MP distribution, or if statistically significant dispersion exists such that it warrants using the MCMP model. In a similar vein, one can consider hypothesis tests, or (versus otherwise) to determine whether the multivariate data satisfy a multivariate geometric or multivariate Bernoulli distribution, respectively; their associated LRTs have adjusted distributional forms based on a mixture involving to account for being at the respective boundaries for [16].
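A small numerical sketch of the dispersion test follows. It assumes the usual LRT form, minus twice the difference between the null (MP) and alternative (MCMP) maximized log-likelihoods, and a chi-square reference distribution with one degree of freedom; both are assumptions of this example. The log-likelihood values are those reported for the NBA All-Star data in Section 4.3.

```python
import math

def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic: -2 * (null log-lik - alternative log-lik)."""
    return -2.0 * (loglik_null - loglik_alt)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Maximized log-likelihoods reported in Section 4.3 (NBA All-Star data):
# trivariate Poisson -83.3, trivariate CMP -68.2.
stat = lrt_statistic(-83.3, -68.2)
print(stat, chi2_sf_1df(stat))
```

The tests at the boundary values of the dispersion parameter require the adjusted mixture reference distribution mentioned above, so `chi2_sf_1df` alone would not be appropriate for them.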
4. Examples
This section considers various simulated and real data examples to illustrate the flexibility of the MCMP model. For the real data sets, we compare model performance via the respective log-likelihood and Akaike Information Criterion (AIC) values. We particularly consider as introduced in Burnham and Anderson [17], where denotes the AIC associated with Model i, and is the minimum AIC among the considered models. [17] provides model support levels based on recommended ranges; see Table 1 for details.
Table 1.
Model support levels based on AIC difference values, , for Model i [17].
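The AIC comparison can be reproduced with a few lines. The sketch below assumes the standard definition AIC = 2k − 2 log L (k free parameters) and uses the rounded log-likelihoods and parameter counts reported in this paper for the NBA All-Star data, so the results may differ from the published AIC values in the last digit.

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: 2 * (number of parameters) - 2 * log-lik."""
    return 2.0 * n_params - 2.0 * loglik

def aic_differences(aic_by_model):
    """Difference between each model's AIC and the minimum AIC."""
    best = min(aic_by_model.values())
    return {name: value - best for name, value in aic_by_model.items()}

# (log-likelihood, number of free parameters) as reported for the NBA data.
fits = {"CMP": (-68.2, 9), "Poisson": (-83.3, 7), "NB": (-83.3, 8)}
aics = {name: aic(ll, k) for name, (ll, k) in fits.items()}
print(aics)
print(aic_differences(aics))
```

A difference of zero marks the best-supported model; larger differences are then read against the support ranges of [17].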
4.1. Simulated Data
Here, we provide simulated data examples to illustrate the MCMP model’s ability to correctly distinguish the MP distribution. Without loss of generality, we proceed with the use of the trivariate case. To evaluate the robustness of the simulation process, we consider data simulations of size {100, 250, 500, 1000} and simulate data 500 times at each size level.
We first consider a simulated trivariate Poisson distribution where the joint pgf (Equation (3)) is defined with , , , , , and , and obtain the MLEs for the trivariate CMP under two conditions: the unconstrained case, and the restricted case where . The latter case serves to reflect the trivariate Poisson model with , , , , , , ; thus, the value of no longer affects the model. Table 2 displays the proportion of statistics that fall within the respective 95% or 99% confidence bounds across the simulations. As expected, the proportion of values that are within the respective bounds, and , is quite close to their respective nominal levels, regardless of size level.
Table 2.
Proportion of values that lie within the and bounds, respectively, given various sample sizes, {100, 250, 500, 1000}.
To assess the power of the test, : versus : , we further generate data from the trivariate geometric (, , , , , , ), the trivariate Bernoulli (, , , , , , , ), and the trivariate CMP expressing over-dispersion (, ) and under-dispersion (, , ), respectively. All simulation results obtained are presented in Table 3. As the generating distribution has a measure of dispersion that moves away from 1 (i.e., the data deviate from the Poisson), we see the power increase in both directions. Meanwhile, the power likewise increases with the sample size in association with all of the respective distributions.
Table 3.
Power of the likelihood ratio test (at 5% level) when data are generated from the trivariate geometric, the trivariate CMP with , the trivariate CMP with and the trivariate Bernoulli, respectively, for various sample sizes, {100, 250, 500, 1000}.
4.2. Real Data: Corporación Favorita Grocery Sales
The Corporación Favorita grocery sales data [18] record the daily unit sales of more than 4000 items in 35 different stores over a five-year period. To illustrate the MCMP distribution’s flexibility for describing real count data, we consider the unit sales of a particular item (Item ID: 103665) over 100 days in each of three stores (Stores 1, 2 and 3, respectively); the data are provided in Table A2 in Appendix D. This dataset is over-dispersed due to weekly and monthly periodic fluctuation; sales tend to be high at the beginning of each month as well as on weekends. Table 4 summarizes the results from fitting various trivariate models to the data, namely, the trivariate Poisson, trivariate NB [6], trivariate geometric, and trivariate CMP. For each of the assumed models, this table provides the respective MLEs, resulting log-likelihood, and AIC values.
Table 4.
Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), trivariate Poisson, trivariate geometric and trivariate negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided, along with the number of free parameters for AIC determination.
Although the trivariate Poisson distribution has the fewest parameters (i.e., 7) among the models considered, it has the largest AIC (1748.5), suggesting its unsuitability for these data [17]. Meanwhile, the MLEs of the trivariate CMP model include , implying that the data are over-dispersed. The trivariate CMP produces a considerably smaller AIC (1627.9) relative to the trivariate Poisson; this further demonstrates the apparent data over-dispersion that should be addressed, but with relative to the AIC from the trivariate NB, the trivariate CMP (while second best among the four considered models) still has model support that is “considerably less” than that of the trivariate NB (); this result is still substantially better than the difference between the trivariate NB and Poisson models (), which clearly indicates no support for the trivariate Poisson. Further, applying the trivariate CMP model introduces consideration of the trivariate geometric and NB models, respectively, as possible parsimonious models. The respective LRT statistics, for the test and for , both have p-values smaller than 0.005, which indicates that neither the trivariate Poisson nor the trivariate geometric fits the data well. Even so, serves as an indication of data over-dispersion, which motivates consideration of the general MNB distribution as a possible model.
Table 4 further shows that , , and for the CMP model are all 0; a similar situation appears in the estimation of the geometric and NB models, where , , and are also all 0. This suggests that there is no significant correlation within the data, which is consistent with the small sample correlation coefficients between Stores 1 and 2, Stores 1 and 3, and Stores 2 and 3 (0.15, 0.02, and 0.21, respectively).
Figure 1 compares the marginal pmfs associated with each of the four models with the marginal relative frequencies associated with the number of unit sales for each of the three stores (Stores 1, 2, 3). These images show that the trivariate CMP and NB models produce very similar estimated marginal distributions with modes that are close to the observed mode, and have sufficiently wide tails to reflect the observed marginal frequencies, particularly for Store 2. Goodness-of-fit tests are likewise performed for comparing the aforementioned models to assess how well their marginal pmfs fit the marginal data frequencies. Following [19], we modify our observed frequencies by grouping observations greater than 8 on Store 1, greater than 9 on Store 2, and observations greater than 21 on Store 3. This allows the respective tail bins associated with each store to have a sufficiently large observed frequency to allow for the goodness-of-fit test to be conducted and the associated asymptotic chi-square distribution to be used. As a result, resulting statistics for the goodness-of-fit tests are expected to follow the chi-square distribution with 10, 11 and 23 degrees of freedom, respectively, for Stores 1, 2 and 3.
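The tail-grouping step described above can be sketched as follows: observations above a cutoff are pooled into a single bin, and a Pearson-type statistic compares observed and expected bin counts. This is a minimal illustration under stated assumptions; the function name `grouped_chi_square`, the Poisson stand-in for a fitted marginal pmf, and the small data vector are all hypothetical and are not the fitted models or data of this paper.

```python
import math
from collections import Counter

def grouped_chi_square(observations, pmf, cutoff):
    """Pearson statistic with all values above `cutoff` pooled into one bin.

    pmf is a function returning the fitted probability of a single count.
    Returns the statistic and the number of bins used (cutoff + 2).
    """
    n = len(observations)
    counts = Counter(min(x, cutoff + 1) for x in observations)
    stat = 0.0
    for y in range(cutoff + 1):
        expected = n * pmf(y)
        stat += (counts.get(y, 0) - expected) ** 2 / expected
    tail_expected = n * (1.0 - sum(pmf(y) for y in range(cutoff + 1)))
    stat += (counts.get(cutoff + 1, 0) - tail_expected) ** 2 / tail_expected
    return stat, cutoff + 2

def poisson_pmf(rate):
    """Stand-in marginal pmf; a fitted CMP or NB marginal would be used instead."""
    return lambda y: math.exp(-rate) * rate ** y / math.factorial(y)

hypothetical_sales = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 12]  # illustrative only
print(grouped_chi_square(hypothetical_sales, poisson_pmf(3.0), cutoff=8))
```

Pooling the tail keeps every expected count away from zero, which is what makes the asymptotic chi-square reference distribution usable.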
Figure 1.
Estimated marginal distributions associated with the trivariate CMP (blue/square), the trivariate geometric (purple/circle), the trivariate negative binomial (red/triangle), and the trivariate Poisson (green/diamond) compared with the original data relative frequencies (histogram) regarding the number of unit sales for: (a) Store 1, (b) Store 2, (c) Store 3.
Table 5 summarizes the goodness-of-fit test statistics for each of the stores and models. While the trivariate geometric model best fits the Store 1 marginal distribution, the goodness-of-fit scores for the trivariate CMP and NB models are considerably better and outperform their peers for Stores 2 and 3. Table 5 confirms these assertions with , and , respectively, for Stores 1, 2 and 3; we again see that the geometric model fits the data better for Store 1, and the trivariate CMP and NB models produce closer fits for Stores 2 and 3.
Table 5.
The goodness-of-fit test measures to compare the considered trivariate model (i.e., trivariate CMP, trivariate negative binomial (NB), trivariate Poisson, and trivariate geometric) marginal pmfs to the marginal data regarding the number of unit sales for Stores 1, 2 and 3, respectively.
4.3. Real Data: NBA All-Star
To demonstrate that the trivariate CMP can also be suitable for under-dispersed data, we consider data from the National Basketball Association (NBA) All-Star game rosters from 2000 to 2016 and seek to model the distribution of the number of players selected for the All-Star game each year in various positions [20]. For simplicity, we focus on the number of players that can play as Center (C), Forward (F), or Forward-center (FC); the data are provided in Table A3 of Appendix D. We again consider the trivariate CMP, the trivariate Poisson, and the trivariate NB distributions as possible models to describe this dataset. Table 6 contains a summary of the results including the respective MLEs, the resulting maximized log-likelihood, the number of free parameters, and the associated AIC for each of the three considered models.
Table 6.
Estimation results associated with the NBA All-Star data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), Poisson, and negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
The trivariate CMP model performs the best among the considered models, attaining a maximized log-likelihood equaling −68.2 and . The trivariate Poisson and NB models meanwhile produce respective AICs equaling 180.6 and 182.7 such that both respective difference values as defined in [17] (Table 1) are greater than 26, indicating no empirical support in favor of either model. The difference between the respective AIC values for the trivariate Poisson and NB models stems solely from the difference in the number of free parameters, since both attain the same maximized log-likelihood value (−83.3). Neither of these models can accommodate data under-dispersion, and consequently the optimal trivariate NB distribution is that model which converges to the trivariate Poisson as . Accordingly, the trivariate NB MLEs that best address data under-dispersion are those under the constraint of data equi-dispersion.
The trivariate CMP model successfully detects the data under-dispersion (). In fact, such a large suggests that we should consider modeling the data via a trivariate Bernoulli model. This would normally be true because the resulting CMP denominator includes which becomes considerably large for given large . This makes vanish for any such that at least one of the random variables exceeds 1. This is not the case here, however, because this data example likewise produces extremely large estimates. Reviewing the raw data likewise suggests clearly that the trivariate Bernoulli distribution is not appropriate because there exist count data that are larger than 1, thus violating the multivariate Bernoulli structure. Therefore, the use of the trivariate CMP to analyze these data is duly justified.
5. Discussion
In this paper, we present an MCMP model that is developed via the compounding method. The distribution is established as a CMP-stopped multivariate binomial distribution, i.e., a multivariate binomial distribution where the associated index parameter is CMP distributed. Along with an introduction to this resulting distribution, we discuss its statistical properties, which aid in model interpretation. The MCMP model can flexibly accommodate both over- and under-dispersed count data, and it includes the multivariate Poisson, Bernoulli, and geometric distributions all as special cases. Accordingly, the MCMP model serves as a useful tool for model determination because it can recognize these three multivariate special cases and serve as an overarching distributional structure connecting them. One can determine if significant data dispersion exists by calculating the LRT statistic discussed in Section 3.2, and analogous tests can be considered to determine whether the data effectively approximate either of the other two special case distributions (i.e., $\nu = 0$ or $\nu \to \infty$, respectively). The MCMP distribution is particularly useful for modeling under-dispersed count data, as demonstrated through the simulated and real data examples.
A limitation of the MCMP model is that the correlation between any two of the d random variables comprising the MCMP is constrained to be non-negative, and so it may not be appropriate to use this model to analyze multivariate count data containing negative correlations. This is true, however, of several multivariate discrete distributions, e.g., the multivariate Poisson distribution of [3]. Meanwhile, this MCMP construct involves only one parameter ($\nu$) to describe data dispersion. Hence, this MCMP model is suitable only for data with similar levels of dispersion in each dimension; however, the model can be broadened to allow for dynamic dispersion. Future work will seek to define a broader generalization of the MP distribution (or modification of this MCMP model) that allows for a broader range of correlation and possesses greater flexibility with regard to data dispersion. One proposed approach, for example, is to consider using copulas to develop a multivariate CMP distribution, as described in [21]. Though this is a standard method for multivariate continuous variables, its use for modeling multivariate count data has its own limitations, most importantly that copulas for discrete outcomes are not identifiable, especially when those discrete outcomes follow count distributions [21,22,23].
Table 4 demonstrates another limitation of the MCMP model, namely that it cannot accommodate as much data over-dispersion as the MNB; the MCMP distribution at best contains the multivariate geometric distribution (which is a special case of MNB). The MNB model, however, can be viewed as the convolution of independent and identically distributed (iid) multivariate geometric distributions. This convolution structure will then be able to capture greater over-dispersion. More broadly, the same idea can be used to consider a multivariate version of the sum of CMPs (MSCMP) model [24] as a generalization to accommodate broader dispersion, and use its trivariate form to revisit the Corporación Favorita grocery sales dataset. Unfortunately, due to computational issues, we were only able to perform parameter estimations for the trivariate SCMP model under the restriction . Future work will further study the MSCMP model, for example, to determine how to optimally and directly compute the MSCMP pmf, and more efficiently determine the MLEs of model parameters. See Appendix C for more information about the MSCMP.
Author Contributions
Conceptualization (K.F.S.); Methodology (K.F.S., T.L., Y.W., N.B.); Formal analysis (K.F.S., T.L., Y.W., N.B.); Investigation (K.F.S., T.L., Y.W., N.B.); Software (K.F.S., T.L., Y.W.); Supervision (K.F.S.); Writing—original draft preparation (K.F.S., T.L., Y.W., N.B.); Writing—review and editing (K.F.S., T.L., Y.W., N.B.). All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available in the appendix.
Acknowledgments
The authors thank Richard Sellers for insightful discussions that aided in better comprehension and understanding of the NBA data set. This paper is released to inform interested parties of research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MB | multivariate binomial |
| fmgf | factorial moment generating function |
| MP | multivariate Poisson |
| pgf | probability generating function |
| NB | negative binomial |
| MNB | multivariate negative binomial |
| CMP | Conway–Maxwell–Poisson |
| MCMP | multivariate Conway–Maxwell–Poisson |
| mgf | moment generating function |
| MLEs | maximum likelihood estimates |
| pmf | probability mass function |
| ML | maximum likelihood |
| LRT | likelihood ratio test |
| AIC | Akaike Information Criterion |
| MLE | maximum likelihood estimate |
| NBA | National Basketball Association |
| C | Center |
| F | Forward |
| FC | Forward-center |
| sCMP | sum of CMPs |
| MSCMP | multivariate version of the sum of CMPs |
Appendix A. Deriving the Probability Mass Function
In order to derive the general form of the pmf, we first introduce some notation. Assuming a MCMP distribution with d dimensions, there exist distinct probabilities in the pgf, where for all . The first derivation relies on the identity
where and is the multinomial coefficient. We can express
where , , , ⋯, , and denotes the number of times out of n trials where we obtain exactly i of the required elements, . Observe that the number of respective elements is , , , , ⋯, . Then,
where we use the simplified notation, with 1 being in the th position, and . Similarly,
where , etc.; finally,
To illustrate this, consider the case when . In this case,
where
and . We then find that
where
The second derivation utilizes the direct approach of differentiating the pgf to obtain the pmf. To simplify the notation, let as before and let , , , …, . As a result, the general joint pmf is given by
where and . In the trivariate case, the model simplifies to
where and .
Appendix B. Derivations of Moments
Let for ; and and for ; , , , are similarly defined. By differentiating the fmgf with respect to , , and then setting , we obtain the joint factorial moments of the trivariate form, . Accordingly, letting , the initial marginal and product moments are obtained as
The above moments demonstrate that dispersion definitions are maintained via for the marginal distributions, where denotes equi-dispersion and capture over- (under-) dispersion. For example, when , and , thus one can see that for (i.e., marginal equi-dispersion holds), while for when .
Using the notation introduced in Section 3, we derive the expression below. Let
We then obtain the derivatives,
Then, the expected value of is
similarly, , and . Meanwhile,
similarly, and . Consequently, we obtain the covariances to be
similarly,
Finally, noting that , we obtain the variances as
similarly,
Appendix C. Introduction to the Multivariate sCMP Model
A multivariate form of the sum of CMPs (MSCMP) distribution is an extension of the sCMP model in [24]. This section applies various MSCMP forms as alternative models to analyze the Corporación Favorita grocery sales data. The MSCMP distribution is defined as follows: given iid random variables that are MCMP distributed, has a MSCMP distribution with pgf
Though the sCMP is defined as the sum of multiple CMPs, m need not be integer-valued since the pgf is valid for all . This MSCMP distribution includes the MNB (when ), the MP (), and the MB () distributions all as special cases, and serves as an over-arching distribution that connects the three special cases.
Given the difficulties associated with calculating the pmf for the SCMP model, we pursue an alternative approach to compute its pmf for a positive integer, m. For simplicity, we illustrate this approach in the trivariate case and consider : let be iid trivariate CMP , and let ; then has a trivariate SCMP distribution, and
where are non-negative integers such that , , and ; similarly, we can determine the pmf of the trivariate SCMP distribution.
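The convolution idea for $m = 2$ can be illustrated in a few lines: the pmf of the sum at a point s is obtained by summing, over all componentwise splits of s, the product of the two joint pmfs. In the sketch below the joint pmf is passed in as a function; since the MCMP pmf of Appendix A is not reproduced here, a trivariate pmf with independent Poisson components is used as a stand-in, chosen because the result can be checked against a known answer. The function names are hypothetical.

```python
import math
from itertools import product

def convolve_pmf(pmf, s):
    """P(X1 + X2 = s) for iid random vectors X1, X2 with joint pmf `pmf`.

    s is a tuple of non-negative integers; the sum runs over every tuple a
    with 0 <= a_i <= s_i, pairing a with its complement s - a.
    """
    ranges = [range(si + 1) for si in s]
    total = 0.0
    for a in product(*ranges):
        rest = tuple(si - ai for si, ai in zip(s, a))
        total += pmf(a) * pmf(rest)
    return total

def independent_poisson_pmf(rates):
    """Stand-in joint pmf with independent Poisson components."""
    def pmf(x):
        return math.prod(
            math.exp(-r) * r ** xi / math.factorial(xi) for r, xi in zip(rates, x)
        )
    return pmf

# The sum of two iid copies should again be independent Poisson with doubled rates.
single = independent_poisson_pmf((1.0, 2.0, 0.5))
doubled = independent_poisson_pmf((2.0, 4.0, 1.0))
print(convolve_pmf(single, (2, 1, 3)), doubled((2, 1, 3)))
```

The number of terms grows with the product of the $(s_i + 1)$ values, which hints at why direct evaluation becomes costly for larger m and larger counts.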
We fitted the Corporación Favorita grocery sales dataset with the trivariate SCMP model to demonstrate its capability of handling trivariate count data. Table A1 provides the resulting trivariate SCMP estimates with and where, for comparison, we also include the results of the trivariate NB and CMP models. We find that the trivariate SCMP models fit the data better than the trivariate CMP, with the improvement growing with m: as m increases, the log-likelihood increases while the AIC decreases. In particular, likewise decreases toward 0 (which results in the trivariate NB model) as m increases. Unfortunately, current computational issues prevent us from providing SCMP results for , but these results do illustrate that the SCMP model will produce a log-likelihood no worse than that from the trivariate NB, as the latter is a special case of the trivariate SCMP model.
Table A1.
Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: CMP, sCMP (), sCMP (), and NB. Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
| Model | Estimated Parameters | Log-Likelihood | No. of Parameters | AIC |
|---|---|---|---|---|
| CMP | | −804.9 | 9 | 1627.9 |
| sCMP () | | −804.0 | 9 | 1626.0 |
| sCMP () | | −803.5 | 9 | 1625.0 |
| NB | | −802.8 | 8 | 1621.7 |
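The AIC column can be cross-checked against the log-likelihoods and parameter counts using the standard definition AIC = 2k − 2 log L. The snippet below is a small sketch of that check; the row labels for the two sCMP fits are placeholders, since their m values are not shown in the table as printed here. Differences of about 0.1 from the reported values are expected because the log-likelihoods are rounded to one decimal place.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood


# (log-likelihood, number of parameters, reported AIC) as printed in Table A1;
# the two sCMP rows are labeled by table order (placeholder labels)
table_a1 = {
    "CMP": (-804.9, 9, 1627.9),
    "sCMP (first row)": (-804.0, 9, 1626.0),
    "sCMP (second row)": (-803.5, 9, 1625.0),
    "NB": (-802.8, 8, 1621.7),
}

for model, (loglik, k, reported) in table_a1.items():
    recomputed = aic(loglik, k)
    print(f"{model}: recomputed AIC = {recomputed:.1f}, reported = {reported}")
```

The NB model attains the lowest AIC partly because it uses one fewer parameter than the CMP and sCMP fits.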
Appendix D. Real Datasets
Table A2.
Corporación Favorita sales data: Unit sales of an item in each of three stores over 100 days.
| Day | Store 1 | Store 2 | Store 3 | Day | Store 1 | Store 2 | Store 3 | Day | Store 1 | Store 2 | Store 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 5 | 6 | 35 | 2 | 3 | 3 | 68 | 1 | 3 | 5 |
| 2 | 3 | 8 | 23 | 36 | 4 | 2 | 6 | 69 | 1 | 2 | 3 |
| 3 | 2 | 8 | 21 | 37 | 3 | 2 | 16 | 70 | 0 | 4 | 4 |
| 4 | 4 | 5 | 8 | 38 | 7 | 12 | 12 | 71 | 1 | 4 | 10 |
| 5 | 2 | 7 | 13 | 39 | 6 | 0 | 13 | 72 | 1 | 0 | 14 |
| 6 | 1 | 4 | 15 | 40 | 2 | 2 | 4 | 73 | 0 | 1 | 14 |
| 7 | 0 | 0 | 11 | 41 | 1 | 0 | 13 | 74 | 0 | 2 | 19 |
| 8 | 1 | 6 | 2 | 42 | 3 | 6 | 4 | 75 | 0 | 4 | 6 |
| 9 | 6 | 6 | 7 | 43 | 2 | 0 | 3 | 76 | 7 | 2 | 12 |
| 10 | 3 | 8 | 13 | 44 | 7 | 2 | 4 | 77 | 1 | 4 | 7 |
| 11 | 3 | 8 | 16 | 45 | 5 | 7 | 3 | 78 | 1 | 0 | 7 |
| 12 | 0 | 1 | 7 | 46 | 8 | 1 | 19 | 79 | 3 | 1 | 8 |
| 13 | 0 | 5 | 11 | 47 | 1 | 6 | 17 | 80 | 4 | 3 | 17 |
| 14 | 6 | 8 | 7 | 48 | 1 | 2 | 5 | 81 | 4 | 7 | 7 |
| 15 | 1 | 2 | 4 | 49 | 3 | 6 | 5 | 82 | 3 | 3 | 13 |
| 16 | 8 | 4 | 4 | 50 | 5 | 2 | 9 | 83 | 1 | 0 | 9 |
| 17 | 3 | 10 | 20 | 51 | 10 | 1 | 0 | 84 | 1 | 1 | 11 |
| 18 | 6 | 6 | 12 | 52 | 4 | 4 | 11 | 85 | 0 | 2 | 8 |
| 19 | 0 | 1 | 6 | 53 | 1 | 5 | 25 | 86 | 6 | 5 | 11 |
| 20 | 3 | 7 | 10 | 54 | 3 | 5 | 4 | 87 | 0 | 0 | 9 |
| 21 | 0 | 2 | 5 | 55 | 1 | 7 | 3 | 88 | 0 | 4 | 8 |
| 22 | 3 | 4 | 1 | 56 | 2 | 1 | 4 | 89 | 1 | 2 | 15 |
| 23 | 3 | 4 | 3 | 57 | 1 | 5 | 2 | 90 | 1 | 6 | 4 |
| 24 | 7 | 17 | 13 | 58 | 0 | 3 | 7 | 91 | 0 | 2 | 11 |
| 25 | 2 | 4 | 11 | 59 | 6 | 4 | 12 | 92 | 0 | 2 | 5 |
| 26 | 0 | 5 | 9 | 60 | 8 | 1 | 11 | 93 | 8 | 2 | 11 |
| 27 | 3 | 1 | 2 | 61 | 3 | 4 | 12 | 94 | 3 | 3 | 21 |
| 28 | 2 | 3 | 1 | 62 | 1 | 2 | 5 | 95 | 3 | 9 | 21 |
| 29 | 2 | 9 | 4 | 63 | 5 | 0 | 3 | 96 | 0 | 8 | 8 |
| 30 | 4 | 2 | 7 | 64 | 2 | 1 | 3 | 97 | 4 | 5 | 27 |
| 31 | 6 | 4 | 12 | 65 | 0 | 3 | 7 | 98 | 5 | 3 | 15 |
| 32 | 1 | 6 | 18 | 66 | 1 | 2 | 26 | 99 | 1 | 2 | 4 |
| 33 | 0 | 11 | 15 | 67 | 0 | 1 | 17 | 100 | 4 | 3 | 6 |
| 34 | 2 | 7 | 12 |
Table A3.
NBA All-Star data: Number of players selected for the NBA all-star game each year in each position (C = Center; F = Forward; FC = Forward-center).
| Year | C | F | FC | Year | C | F | FC | Year | C | F | FC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 6 | 1 | 4 | 2006 | 3 | 3 | 4 | 2012 | 3 | 2 | 4 |
| 2001 | 3 | 4 | 3 | 2007 | 2 | 3 | 3 | 2013 | 2 | 2 | 4 |
| 2002 | 4 | 3 | 4 | 2008 | 3 | 2 | 3 | 2014 | 2 | 2 | 5 |
| 2003 | 4 | 2 | 3 | 2009 | 2 | 3 | 5 | 2015 | 2 | 4 | 4 |
| 2004 | 3 | 4 | 4 | 2010 | 2 | 2 | 5 | 2016 | 3 | 4 | 2 |
| 2005 | 2 | 2 | 5 | 2011 | 4 | 2 | 2 |
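As a quick descriptive check of dispersion in the Table A3 data, one can compare each position's sample variance with its sample mean. The sketch below transcribes the counts from the table in year order (2000 to 2016) and computes the ratio of variance to mean; the helper name `dispersion_index` is introduced here for illustration only. For these data, each ratio falls below 1, which indicates marginal under-dispersion.

```python
from statistics import mean, variance

# Counts transcribed from Table A3, ordered by year 2000-2016
all_star = {
    "C":  [6, 3, 4, 4, 3, 2, 3, 2, 3, 2, 2, 4, 3, 2, 2, 2, 3],
    "F":  [1, 4, 3, 2, 4, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 4, 4],
    "FC": [4, 3, 4, 3, 4, 5, 4, 3, 3, 5, 5, 2, 4, 4, 5, 4, 2],
}


def dispersion_index(counts):
    """Sample variance divided by sample mean; 1 corresponds to equi-dispersion."""
    return variance(counts) / mean(counts)


for position, counts in all_star.items():
    print(position, round(mean(counts), 3), round(variance(counts), 3),
          round(dispersion_index(counts), 3))
```

This marginal summary ignores the correlation between positions, so it complements rather than replaces the model-based dispersion tests.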
References
1. Johnson, N.; Kotz, S.; Balakrishnan, N. Discrete Multivariate Distributions; John Wiley & Sons: New York, NY, USA, 1997.
2. Krishnamoorthy, A.S. Multivariate binomial and Poisson distributions. Sankhyā Indian J. Stat. 1951, 11, 117–124.
3. Mahamunulu, D.M. A note on regression in the multivariate Poisson distribution. J. Am. Stat. Assoc. 1967, 62, 251–258.
4. Teicher, H. On the Multivariate Poisson distribution. Skand. Aktuarietidskr. 1954, 37, 1–9.
5. Hilbe, J.M. Modeling Count Data; Cambridge University Press: New York, NY, USA, 2014.
6. Doss, D.C. Definition and characterization of multivariate negative binomial distribution. J. Multivar. Anal. 1979, 9, 460–464.
7. Conway, R.W.; Maxwell, W.L. A queuing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136.
8. Sellers, K.F.; Shmueli, G.; Borle, S. The COM-Poisson model for count data: A survey of methods and applications. Appl. Stoch. Model. Bus. Ind. 2011, 28, 104–116.
9. Shmueli, G.; Minka, T.P.; Kadane, J.B.; Borle, S.; Boatwright, P. A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Appl. Stat. 2005, 54, 127–142.
10. Guikema, S.D.; Coffelt, J.P. A Flexible Count Data Regression Model for Risk Analysis. Risk Anal. 2008, 28, 213–223.
11. Sellers, K.F.; Morris, D.S.; Balakrishnan, N. Bivariate Conway-Maxwell-Poisson distribution: Formulation, properties, and inference. J. Multivar. Anal. 2016, 150, 152–168.
12. Kocherlakota, S.; Kocherlakota, K. Bivariate Discrete Distributions; Marcel Dekker: New York, NY, USA, 1992.
13. Lai, C.D. Constructions of discrete bivariate distributions. In Advances in Distribution Theory, Order Statistics and Inference, Part I; Balakrishnan, N., Sarabia, J.M., Castillo, E., Eds.; Birkhauser: Boston, MA, USA, 2006; pp. 29–58.
14. Marshall, A.W.; Olkin, I. A family of bivariate distributions generated by the bivariate Bernoulli distribution. J. Am. Stat. Assoc. 1985, 80, 332–338.
15. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017.
16. Balakrishnan, N.; Pal, S. Lognormal lifetimes and likelihood-based inference for flexible cure rate models based on COM-Poisson family. Comput. Stat. Data Anal. 2013, 67, 41–67.
17. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference; Springer: New York, NY, USA, 2002.
18. Corporación Favorita. Grocery Sales Data. 2018. Available online: https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data (accessed on 26 April 2020).
19. Voinov, V.; Nikulin, M.; Balakrishnan, N. Chi-Squared Goodness of Fit Tests with Applications; Academic Press: Boston, MA, USA, 2013.
20. NBA. NBA All-Star Game, 2000–2016. Available online: https://www.kaggle.com/fmejia21/nba-all-star-game-20002016? (accessed on 22 April 2020).
21. Inouye, D.I.; Yang, E.; Allen, G.I.; Ravikumar, P. A review of multivariate distributions for count data derived from the Poisson distribution. WIREs Comput. Stat. 2017, 9, e1398.
22. Genest, C.; Nešlehová, J. A Primer on Copulas for Count Data. ASTIN Bull. 2007, 37, 475–515.
23. Trivedi, P.; Zimmer, D. A Note on Identification of Bivariate Copulas for Discrete Count Data. Econometrics 2017, 5, 10.
24. Sellers, K.F.; Swift, A.W.; Weems, K.S. A flexible distribution class for count data. J. Stat. Distrib. Appl. 2017, 4, 1–21.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
