A Flexible Multivariate Distribution for Correlated Count Data

: Multivariate count data are often modeled via a multivariate Poisson distribution, but it contains an underlying, constraining assumption of data equi-dispersion (where its variance equals its mean). Real data are oftentimes over-dispersed and, as such, consider various advancements of a negative binomial structure. While data over-dispersion is more prevalent than under-dispersion in real data, however, examples containing under-dispersed data are surfacing with greater frequency. Thus, there is a demonstrated need for a ﬂexible model that can accommodate both data types. We develop a multivariate Conway–Maxwell–Poisson (MCMP) distribution to serve as a ﬂexible alternative for correlated count data that contain data dispersion. This structure contains the multivariate Poisson, multivariate geometric, and the multivariate Bernoulli distributions as special cases, and serves as a bridge distribution across these three classical models to address other levels of over- or under-dispersion. In this work, we not only derive the distributional form and statistical properties of this model, but we further address parameter estimation, establish informative hypothesis tests to detect statistically signiﬁcant data dispersion and aid in model parsimony, and illustrate the distribution’s ﬂexibility through several simulated and real-world data examples. These examples demonstrate that the MCMP distribution performs on par with the multivariate negative binomial distribution for over-dispersed data, and proves particularly beneﬁcial in effectively representing under-dispersed data. Thus, the MCMP distribution offers an effective, unifying framework for modeling over- or under-dispersed multivariate correlated count data that do not necessarily adhere to Poisson assumptions.


Introduction
There exists a rich history of research regarding multivariate discrete distributions [1]. Krishnamoorthy [2] introduced a multivariate binomial (MB) distribution for the ddimensional vector B = (B 1 , B 2 , . . . , B d ) from a 2 d table with a factorial moment generating function (fmgf) where p * i is the probability of B i , i = 1, . . . , d; p * ij denotes the probability of B i B j ; and so on. Utilizing this form, Krishnamoorthy [2] further introduced the multivariate Poisson (MP) distribution as the limiting distribution of the multivariate binomial distribution wherein all of the probabilities appearing in Equation (1)  as n → ∞, where • denotes the corresponding probability subscripts in Equation (1). Accordingly, the fmgf of the MP distribution for a random vector X = (X 1 , X 2 , . . . , X d ) is Mahamunulu [3] noted that the MP distribution can likewise be derived by defining X = (X 1 , X 2 , ..., X d ) as the sum of independent Poisson(A * ) random variables Y * , where * denotes all subscripts involving * = i ∈ {1, 2, . . . , d} with Y * distributed as Poisson(A * ), and X * as Poisson(µ * ) where µ * denotes the sum of the associated A * parameters with subsets j 1 , j 2 , . . . , j s for s ∈ {1, 2, . . . , d} such that j 1 < j 2 < . . . < j s . The corresponding joint probability generating function (pgf) has the form where [3,4]. From (3), it is evident that the variables X 1 , X 2 , ..., X d have marginal Poisson distributions, and it can be further shown that all pairs of variables X i 's are positively correlated.
While the MP distribution is a popular model for describing correlated discrete random variables, it is well known that Poisson models are constrained by their underlying assumption of equi-dispersion; analogous negative binomial (NB) models serve as a popular alternative due to their ability to address data over-dispersion [5]. Doss [6] for k > 0. From (4), it is evident that the variables X 1 , X 2 , ..., X d have marginal NB distributions which are known to be over-dispersed. For this reason, the MNB distribution can only accommodate data over-dispersion; accordingly, correlated under-dispersed data structures are only at best fitted by a MP model where the associated model parameters will still be biased. Therefore, in this work, we introduce the reader to the Conway-Maxwell-Poisson (CMP) distribution and develop a multivariate CMP (MCMP) distribution as a flexible alternative distribution for modeling correlated discrete count data. Section 2 introduces the reader to the CMP distribution and its bivariate analog as motivation. Section 3 develops the MCMP distribution and discusses its associated properties, and also introduces approaches for parameter estimation and hypothesis testing. Section 4 demonstrates the model flexibility by means of simulated and real data examples. Finally, Section 5 concludes the manuscript with discussion, while the appendices contain more detailed derivations and the datasets referenced in this work.

Multivariate Conway-Maxwell-Poisson Distribution
Generalizing the compounding approach in [11], we develop a convenient form for the MCMP distribution. Consider d random variables X =(X 1 , X 2 , . . . , X d ) that, given some number of trials n, jointly have a conditional MB distribution with pgf where Z(ψ, ν) = ∑ ∞ s=0 ψ s (s!) ν for some ψ > 0. Equation (8) contains 2 d + 2 parameters, but its degrees of freedom equals 2 d + 1 due to the restriction, ∑ 1 ; this adds difficulty in the determination of model parameter maximum likelihood estimates (MLEs). We circumvent this issue through the reparametrization, λ Each variable is independent under this parameterization, where . For simplicity, we use λ to denote ∑ 1 x 1 =0 · · · ∑ 1 x d =0 λ x 1 x 2 ···x d but recognize that λ is no longer an independent parameter in the ensuing discussion. The pgf of the MCMP distribution can now be parameterized as As is the case of the univariate and bivariate CMP, this MCMP includes the MP, multivariate geometric, and multivariate Bernoulli distributions all as special cases, where ν maintains representation as the dispersion parameter. When ν = 1, this MCMP pgf reduces to the form of the MP joint pgf (see Equation (3)). When ν = 0 and λ < 1, its pgf becomes where 1 i denotes a 1 in the ith position, i = 1, 2, . . . , d; . . . ; a 12···d = −λ 11···1 1−λ . This is the pgf of a multivariate geometric distribution (i.e., the MNB distribution pgf in Equation (4) with k = 1). Finally, when ν → ∞, this MCMP becomes a multivariate Bernoulli (i.e., the MB in [2] with n = 1) with p * 00···0 = 1+λ 00···0 1+λ and all remaining probabilities are p * where at least one of x i equals 1, i = 1, 2, . . . , d. More broadly, ν = 1 denotes the equi-dispersion case while ν < (>)1 reflects data over-dispersion (under-dispersion), both for the joint distribution and the respective marginal distributions.
Given the joint pgf in Equation (8), this MCMP model has the joint mgf M(t 1 , t 2 , · · · , t d ) = Π(e t 1 , e t 2 , · · · , e t d ) and the joint fmgf as G(t 1 , t 2 , ..., t d ) = Π(t 1 + 1, t 2 + 1, ..., t d + 1) We derive the MCMP pmf by taking partial derivatives of the pgf, i.e., see Appendix A for pertinent details. Moving forward, we shall illustrate the MCMP results using the trivariate case as motivation, where the joint fmgf reduces to G(t 1 , t 2 , t 3 ) from which we can obtain the moments and product moments, respectively; see Appendix B for all relevant details. These results confirm that the dispersion parameter ν denotes the type of data-dispersion for the joint and marginal distributions, and the correlations between any two random variables is non-negative with 0 ≤ ρ X i X j ≤ 1 for any variables i, j = 1, 2, 3.

Parameter Estimation
We perform parameter estimation by the method of maximum likelihood (ML). Considering the trivariate case of Equations (9) and (12), there are nine parameters required to specify a trivariate CMP distribution, namely, λ x 1 x 2 x 3 for x i = 0, 1; i = 1, 2, 3; and ν. Accordingly, the log-likelihood has the form where x ij denotes the jth observation in the ith data dimension, x i denotes the vector of the entire data set in the ith dimension; the precise form of The resulting score equations, however, do not have a closed form solution. For this reason, we carry out the statistical computations by using optimizing routines in R [15].
To perform the parameter estimation, we use the optim function where the negated form of the log-likelihood (Equation (13)) serves as the function to be optimized, and the L-BFGS-B method and its default convergence criteria are applied. Additionally, we approximate the standard errors of the estimated parameters by calculating the square root of the diagonal of the inverse Hessian matrix based on the approximate form obtained from optim. The complexity of the MCMP distribution, however, brings with it some computational difficulties when applying optim. The resulting MLE can vary considerably depending on the choice of starting values. To avert this, we consider several starting points including an exhaustive search in order to potentially improve the estimation result. Meanwhile, the resulting Hessian matrix provided from optim sometimes produces an inverse matrix containing negative diagonal elements; this violates the presumed positive semidefinite form of the Fisher information matrix. For these reasons, we recommend utilizing a parametric bootstrap method as an alternative approach for quantifying variability in the parameter estimates.

Hypothesis Testing
To check if a multivariate count data set suffers from any statistically significant data dispersion such that the MP distribution is unsuitable (favoring the MCMP distribution), we conduct the hypothesis test, H 0 : ν = 1 versus H 1 : ν = 1. We do not concern ourselves with the direction of the data dispersion because the MCMP distribution can accommodate both over-and under-dispersion. Nonetheless, the resulting statistical inference, along with the estimate for ν, offers guidance regarding the type of dispersion present in the data. We use the likelihood ratio test (LRT) statistic, Λ ν=1 = sup ν=1 ln L sup ln L , where sup ν=1 ln L and sup ln L, respectively, denote the maximum log-likelihoods associated with the MP and MCMP models. Theoretically, −2 log(Λ ν=1 ) follows a χ 2 1 distribution and thus can be used to assess whether the data are reasonably distributed as a MP distribution, or if statistically significant dispersion exists such that it warrants using the MCMP model. In a similar vein, one can consider hypothesis tests, H 0 : ν = 0 or H 0 : ν → ∞ (versus H 1 : otherwise) to determine whether the multivariate data satisfy a multivariate geometric or multivariate Bernoulli distribution, respectively; their associated LRTs have adjusted distributional forms based on a mixture involving χ 2 1 to account for being at the respective boundaries for ν [16].

Examples
This section considers various simulated and real data examples to illustrate the flexibility of the MCMP model. For the real data sets, we compare model performance via the respective log-likelihood and Akaike Information Criterion (AIC) values. We particularly consider ∆ i = AIC i − AIC min as introduced in Burnham and Anderson [17], where AIC i denotes the AIC associated with Model i, and AIC min is the minimum AIC among the considered models. [17] provides model support levels based on recommended ∆ i ranges; see Table 1 for details.  [4,7] Considerably less (10, ∞) Essentially none

Simulated Data
Here, we provide simulated data examples to illustrate the MCMP model's ability to correctly distinguish the MP distribution. Without loss of generality, we proceed with the use of the trivariate case. To evaluate the robustness of the simulation process, we consider data simulations of size {100, 250, 500, 1000} and simulate data 500 times at each size level.

Real Data: Corporación Favorita Grocery Sales
The Corporación Favorita grocery sales data [18] include information regarding the number of unit items sold daily to more than 4000 items in 35 different stores over a five-year period. To illustrate the MCMP distribution's flexibility for describing real count data, we consider the unit sales regarding a particular item (Item ID:103665) over 100 days in each of three stores (Stores 1, 2 and 3, respectively); the data are provided in Table A2 in Appendix D. This dataset is over-dispersed due to the weekly and monthly periodic fluctuation; the number of sales often tend to be high at the beginning of each month as well as on weekends. Table 4 summarizes the results that stem from considering various trivariate models to describe the data, namely, the trivariate Poisson, trivariate NB [6], trivariate geometric, and trivariate CMP. For each of the assumed models, this table provides the respective MLEs, resulting log-likelihood, and AIC values.
Although the trivariate Poisson distribution has the least number of parameters (i.e., 7) among the models considered, it has the largest AIC (1748.5), suggesting its unsuitability for these data [17]. Meanwhile, the MLEs of the trivariate CMP model includeν ≈ 0.25, implying that the data are over-dispersed. The trivariate CMP produces a considerably smaller AIC (1627.9) relative to the trivariate Poisson; this further demonstrates the apparent data over-dispersion that should be addressed, but with ∆ = 6.2 relative to the AIC from the trivariate NB, the trivariate CMP (while second best among the four considered models) still has model support that is "considerably less" than that of the trivariate NB (AIC = 1621.7); this result is still substantially better than the difference between the trivariate NB and Poisson models (∆ = 126.8), clearly inferring no support for the trivariate Poisson. Further, applying the trivariate CMP model introduces consideration of the trivariate geometric and NB models, respectively, as possible parsimonious models. The respective LRT statistics, −2 log Λ ν=1 = 124.6 for the test H 0 : ν = 1 and −2 log Λ ν=0 = 79.4 for H 0 : ν = 0, both have p-values smaller than 0.005 which indicate that neither the trivariate Poisson nor the trivariate geometric fits the data well. Even still,ν ≈ 0.25 serves as an indication of data over-dispersion, hence consideration of the general MNB distribution as a possible model. λ 000 = 0.00λ 111 = 0.00λ 100 = 0.32 CMPλ 010 = 0.47λ 001 = 1.17λ 110 = 0.00 −804.9 9 1627.9 λ 101 = 0.00λ 011 = 0.00ν = 0.25  Table 4 further shows thatλ 110 ,λ 101 ,λ 011 andλ 111 for the CMP model are all 0; a similar situation appears on the estimation of the geometric and NB models, whereâ 12 ,â 13 , a 23 andâ 123 are also all 0. This indicates that there is no significant correlation within the data; this is true because the correlation coefficients between Stores 1 and 2, Stores 1 and 3, and Stores 2 and 3 are 0.15, 0.02, 0.21, respectively. Figure 1 compares the marginal pmfs associated with each of the four models with the marginal relative frequencies associated with the number of unit sales for each of the three stores (Stores 1, 2, 3). These images show that the trivariate CMP and NB models produce very similar estimated marginal distributions with modes that are close to the observed mode, and have sufficiently wide tails to reflect the observed marginal frequencies, particularly for Store 2. Goodness-of-fit tests are likewise performed for comparing the aforementioned models to assess how well their marginal pmfs fit the marginal data frequencies. Following [19], we modify our observed frequencies by grouping observations greater than 8 on Store 1, greater than 9 on Store 2, and observations greater than 21 on Store 3. This allows the respective tail bins associated with each store to have a sufficiently large observed frequency to allow for the goodness-of-fit test to be conducted and the associated asymptotic chi-square distribution to be used. As a result, resulting statistics for the goodness-of-fit tests are expected to follow the chi-square distribution with 10, 11 and 23 degrees of freedom, respectively, for Stores 1, 2 and 3. Table 5 summarizes the goodness-of-fit test statistics for each of the stores and models. While the trivariate geometric model best fits the Store 1 marginal distribution, the goodness-of-fit scores for the trivariate CMP and NB models are considerably better and outperform their peers for Stores 2 and 3. Table 5 confirms these assertions with χ 2 10 (0.95) = 18.3, χ 2 11 (0.95) = 19.7 and χ 2 23 (0.95) = 35.2, respectively, for Stores 1, 2 and 3; we again see that the geometric model fits the data better for Store 1, and the trivariate CMP and NB models produce closer fits for Stores 2 and 3.

Real Data: NBA All-Star
To demonstrate that the trivariate CMP can also be suitable for under-dispersed data, we consider data from the National Basketball Association (NBA) All-Star game rosters from 2000 to 2016 and seek to model the distribution of the number of players selected for the All-Star game each year in various positions [20]. For simplicity, we focus on the number of players that can play as Center (C), Forward (F), or Forward-center (FC); the data are provided in Table A3 of Appendix D. We again consider the trivariate CMP, the trivariate Poisson, and the trivariate NB distributions as possible models to describe this dataset. Table 6 contains a summary of the results including the respective MLEs, the resulting maximized log-likelihood, the number of free parameters, and the associated AIC for each of the three considered models. The trivariate CMP model performs the best among the considered models, attaining a maximum likelihood equaling −68.2 and AIC min = 154.4. The trivariate Poisson and NB models meanwhile produce respective AICs equaling 180.6 and 182.7 such that both respective difference values as defined in [17] (Table 1) are greater than 26, indicating no empirical support in favor of either model. The difference between the respective AIC values for the trivariate Poisson and NB models stems from the difference in the number of free parameters while they attain the same maximized log-likelihood value (−83.3). Neither of these models can accommodate data under-dispersion, and consequently the optimal trivariate NB distribution is that model which converges to the trivariate Poisson as k → ∞. Accordingly, the trivariate NB MLEs that best address data under-dispersion are those under the constraint of data equi-dispersion. The trivariate CMP model successfully detects the data under-dispersion (ν = 38. 4 1). In fact, such a largeν suggests that we should consider modeling the data via a trivariate Bernoulli model. This would normally be true because the resulting CMP denominator includes (m!) ν which becomes considerably large for m > 1 given largeν. This makes p (x,y,z) vanish for any (x, y, z) such that at least one of the random variables exceeds 1. This is not the case here, however, because this data example likewise produces extremely largeλ estimates. Reviewing the raw data likewise suggests clearly that the trivariate Bernoulli distribution is not appropriate because there exist count data that are larger than 1, thus violating the multivariate Bernoulli structure. Therefore, the use of the trivariate CMP to analyze these data is duly justified.

Discussion
In this paper, we present a MCMP model that is developed via the compounding method. The distribution is established as a CMP-stopped multivariate binomial distribution, i.e., a multivariate binomial distribution where the associated index parameter is CMP distributed. Along with an introduction to this resulting distribution, we discussed its statistical properties which aid in better model interpretation. The CMP model can flexibly accommodate both over-and under-dispersed count data, and it includes the Poisson, Bernoulli, and geometric distributions all as special cases. Accordingly, the MCMP model serves as a reliable tool for model determination because it can successfully recognize these three multivariate special cases, and serve as an overarching distributional structure connecting them. One can determine if significant data dispersion exists by calculating the LRT statistic Λ discussed in Section 3.2, and analogous tests can be considered to determine whether the data effectively approximate either of the other two special case distributions (i.e., H 0 : ν = 0 or H 0 : ν → ∞, respectively). The MCMP distribution is particularly useful for modeling under-dispersed count data, as demonstrated through the simulated and real data examples.
A limitation of the MCMP model is that the correlation between any two of the d random variables comprising the MCMP is constrained to be non-negative, and so it may not be appropriate to consider this model to analyze multivariate count data containing negative correlations. This is true, however, of several multivariate discrete distributions, e.g., the [3] multivariate Poisson distribution. Meanwhile, this MCMP construct involves only one parameter (ν) to describe data dispersion. Hence, this MCMP model is suitable only for data with similar levels of data dispersion in each dimension, however the model can be broadened to allow for dynamic dispersion. Future work will seek to define a broader generalization of the MP distribution (or modification of this MCMP model) that allows for a broader range of correlation and possesses greater flexibility with regard to data dispersion. One proposed approach, for example, is to consider using copulas to develop a multivariate CMP distribution, as described in [21]. Though this is a standard method for multivariate continuous variables, its use for modeling multivariate count data has its own limitations, most importantly that copulas for discrete outcomes are not identifiable, especially when those discrete outcomes follow count distributions [21][22][23]. Table 4 demonstrates another limitation of the MCMP model, namely that it cannot accommodate as much data over-dispersion as the MNB; the MCMP distribution at best contains the multivariate geometric distribution (which is a special case of MNB). The MNB model, however, can be viewed as the convolution of independent and identically distributed (iid) multivariate geometric distributions. This convolution structure will then be able to capture greater over-dispersion. More broadly, the same idea can be used to consider a multivariate version of the sum of CMPs (MSCMP) model [24] as a generalization to accommodate broader dispersion, and use its trivariate form to revisit the Corporación Favorita grocery sales dataset. Unfortunately, due to computational issues, we were only able to perform parameter estimations for the trivariate SCMP model under the restriction m = 1, 2, 3. Future work will further study the MSCMP model, for example, to determine how to optimally and directly compute the MSCMP pmf, and more efficiently determine the MLEs of model parameters. See Appendix C for more information about the MSCMP.

Data Availability Statement:
The data presented in this study are available in the appendix.

Acknowledgments:
The authors thank Richard Sellers for insightful discussions that aided in better comprehension and understanding of the NBA data set. This paper is released to inform interested parties of research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
We fitted the Corporación Favorita grocery sales dataset with the trivariate SCMP model so as to demonstrate its capability of dealing with trivariate count data. Table A1 provides the resulting trivariate SCMP estimates with m = 2 and m = 3 where, for comparison, we also include the results of the trivariate NB and CMP models. Accordingly, we find that the trivariate SCMP models fit the data better than the trivariate CMP, with improvement growing with m. More precisely, we note that, as m increases, the log-likelihood increases while the AIC decreases. In particular,ν likewise decreases toward 0 (which results in the trivariate NB model) as m increases. Unfortunately, current computational issues prevent us from providing SCMP results for m > 3, but these results do illustrate that the SCMP model will produce a log-likelihood no worse than that from the trivariate NB as it is a special case of bivariate SCMP model. Table A1. Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: CMP, sCMP (m = 2), sCMP (m = 3), and NB. Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.