On the Discretization of Continuous Probability Distributions for Flexible Count Regression

Abstract

Most existing flexible count regression models allow only approximate inference. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under- and overdispersion). The new method, referred to as "balanced discretization", consists in discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo-random variates from the resulting balanced discrete distribution since it has a simple stochastic representation in terms of the continuous distribution. For illustrative purposes, we have developed the family of balanced discrete gamma distributions, which can model equi-, under- and overdispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we have shown that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersion, a count regression model based on the balanced discrete gamma distribution will allow recovering a near-Poisson model fit when the data are Poisson distributed.


Introduction
The regression analysis of count responses mostly relies on the Poisson model. However, the equidispersion (variance equals mean) assumption of the Poisson distribution makes Poisson regression inappropriate in many situations where data show overdispersion (variance greater than mean) or underdispersion (variance less than mean). Moreover, it has been observed that many data analysed using overdispersion models (e.g. negative binomial [1]), which are as popular as the Poisson regression model, may be mixtures of overdispersed and underdispersed or equidispersed counts [2]. The implication is that appropriate alternatives to the Poisson model should allow variable dispersion, i.e. full dispersion flexibility [3]. Existing count regression models associated with variable dispersion exhibit some drawbacks. The first is improperly normalized probability mass functions in underdispersion situations (quasi-Poisson [4], Consul's generalized Poisson [5] and extended Poisson–Tweedie regressions [6]), which makes inference approximate with quasi-models. Another drawback is the lack of a simple expression for the model mean value (Conway–Maxwell–Poisson [7], double Poisson [8,9], gamma count [10], semi-nonparametric Poisson polynomial [11] and discrete Weibull regressions). [...] The discretization approach of [17] links the means of the discrete and the related continuous variable, but it provides only an approximate solution at the cost of a tuning parameter. Proposals in [3,18] and [19] offer solutions for constructing count variables with fixed mean value and variable dispersion, but they lack a physical basis, i.e. a generating mechanism to motivate their use in practice. This work describes a discretization approach which modifies the "discrete concentration" method, i.e. "methodology IV" in [14], to preserve the expectation of the continuous distribution.
Our proposal, referred to as "balanced discretization", is based on a probabilistic rounding mechanism which provides a generating mechanism with a simple interpretation. Interestingly, balanced discretization is suited for regression analysis, where estimation of covariate effects on the mean count is of the highest interest. The rest of the paper is organized as follows. Section 2 motivates and presents the balanced discretization method. [...] We shall denote by Z the set of integers (Z = {..., −1, 0, 1, ...}), by N the set of non-negative integers (N = {0, 1, ...}) and by N+ the set of natural numbers (N+ = {1, 2, ...}).

First, we recall the discrete concentration method and the mean-preserving approach of [3]. Let CD(θ) be a continuous probability distribution of interest. The discrete concentration DC(θ) of X ∼ CD(θ) is the count variable Y with pmf P(Y = y|θ) = F_X(y + 1|θ) − F_X(y|θ) and suf S_Y(y|θ) = S_X(y|θ) (Eq (2)), so that the survival function of the continuous distribution is preserved at integers. Clearly, the discrete concentration of X is simply Y = ⌊X⌋, the integer part of X [20].
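For a quick numerical illustration (the gamma shape/rate values below are arbitrary, not from the paper), the discrete concentration amounts to flooring continuous draws; the survival function at integer points is preserved exactly, while the mean is not:

```python
import math
import random

random.seed(0)

# Draw from a gamma distribution (shape b = 2.5, rate a = 1.0) and take
# the discrete concentration Y = floor(X), i.e. "methodology IV" of [14].
b, a = 2.5, 1.0
xs = [random.gammavariate(b, 1.0 / a) for _ in range(200_000)]
ys = [math.floor(x) for x in xs]

# At any integer y, floor(X) >= y holds exactly when X >= y, so the
# empirical survival functions of Y and X coincide at integers.
y = 3
s_y = sum(v >= y for v in ys) / len(ys)   # empirical suf of Y at y
s_x = sum(v >= y for v in xs) / len(xs)   # empirical suf of X at y
print(s_y, s_x)                            # identical values

# The mean, however, shrinks: E[floor(X)] < E[X].
print(sum(ys) / len(ys), sum(xs) / len(xs))
```

This illustrates why a mean-preserving alternative is needed for regression on the mean count.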

The mean-preserved discrete version Y of X is the variable with the cdf given by Eq (4). In practice, measurements of continuous quantities are discretized, since only a finite number of decimal places are reported [14,21]. Assume for instance an operator measuring tree diameters X in a forest inventory frame, using a measurement device scaled in millimetres (mm). Since X is a continuous variable, the probability of observing X = x mm is zero. When the true value x of the diameter of a tree actually falls between two consecutive graduations z and z + 1, the operator reports either y = z mm or y = z + 1 mm, i.e. only a discretized version Y of X is observed.

Beyond this example, when direct measures are taken, only the number of an arbitrary unit is actually counted. Clearly, the closer x is to z, the higher the probability of reporting y = z and, conversely, the closer x is to z + 1, the higher the probability of reporting y = z + 1. Balanced discretization results from assuming that, given z ≤ x < z + 1, the probability of reporting y = z + 1 is exactly x − z. Let us consider an absolutely continuous probability distribution CD(θ) of interest.

A count random variable Y is said to be distributed as the balanced discrete counterpart BD(θ) of X ∼ CD(θ) if it satisfies the stochastic representation

Y = ⌊X⌋ + U with U|X ∼ Ber(X − ⌊X⌋),   (5)

where Ber(π) denotes the Bernoulli distribution with success probability π. For r ∈ N and y ∈ Z, let

E_X(r, y|θ) = ∫_y^{y+1} x^r f_X(x|θ) dx   (6)

denote the rth partial moment of X over (y, y + 1), and set H_X(y|θ) = F_X(y + 1|θ) − F_X(y|θ).
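The stochastic representation in Eq (5) makes simulation immediate: draw X, then round up with probability equal to its fractional part. A minimal Python sketch (the gamma shape/rate values are illustrative assumptions, not taken from the paper):

```python
import math
import random

random.seed(42)

def balanced_discretize(x, rng=random):
    """Probabilistic rounding of Eq (5): Y = floor(X) + U, U|X ~ Ber(X - floor(X))."""
    z = math.floor(x)
    u = 1 if rng.random() < x - z else 0
    return z + u

# Balanced discrete counterpart of a gamma variable (shape b = 2.0, rate a = 0.8).
b, a = 2.0, 0.8
xs = [random.gammavariate(b, 1.0 / a) for _ in range(200_000)]
ys = [balanced_discretize(x) for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
print(mean_x, mean_y)  # both close to b/a = 2.5: the expectation is preserved
```

Unlike plain flooring, the Bernoulli correction cancels the downward bias in expectation, which is the point of the construction.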
The balanced discretization mechanism in Eq (5) preserves the partial expectations E_X(1, y|θ) of the continuous variable, as shown by Eq (10) of the following lemma.
Lemma 1. Let X and Y be defined as in Eq (5). Then, for any y ∈ Z,

P(Y = y, y ≤ X < y + 1|θ) = (y + 1)H_X(y|θ) − E_X(1, y|θ),   (8)

P(Y = y + 1, y ≤ X < y + 1|θ) = E_X(1, y|θ) − yH_X(y|θ),   (9)

E_Y[Y|y ≤ X < y + 1] = E_X(1, y|θ),   (10)

where E_Y[Y|X ∈ A] is the partial mean of Y for X ∈ A. The pmf, cdf and suf of Y are then given for y ∈ Z by

p_Y(y|θ) = E_X(1, y − 1|θ) − (y − 1)H_X(y − 1|θ) + (y + 1)H_X(y|θ) − E_X(1, y|θ),   (11)

F_Y(y|θ) = F_X(y|θ) + (y + 1)H_X(y|θ) − E_X(1, y|θ),   (12)

S_Y(y|θ) = S_X(y + 1|θ) + E_X(1, y|θ) − yH_X(y|θ),   (13)

and, for 0 ≤ u ≤ 1, the quantiles of Y and X satisfy F_X^{-1}(u|θ) − 1 ≤ F_Y^{-1}(u|θ) ≤ F_X^{-1}(u|θ).   (14)

Note from Eq (11) that BD(θ) assigns less probability mass to zero than the discrete concentration of X ∼ CD(θ) if X has support R+ or (0, M) for M ∈ R+. Eq (13) emphasizes that the balanced discretization method does not preserve the suf of the continuous distribution, unlike discrete concentration (see Eq (2)). Nevertheless, the balanced discrete cdf and suf satisfy the inequalities F_X(y|θ) ≤ F_Y(y|θ) ≤ F_X(y + 1|θ)

(with equalities when the support of X is upper bounded by y) and S_X(y|θ) ≤ S_Y(y − 1|θ) ≤ S_X(y − 1|θ) (with equalities when the support of X is lower bounded by y).

By Eq (14), balanced discretization somehow preserves the median of the continuous distribution. Indeed, if X has an integral median m_X, then Y has median m_Y = m_X − 1/2.

More generally, Eq (14) brackets every quantile of Y within one unit of the corresponding quantile of X.

This section presents expressions for the moments of balanced discrete distributions. We start with the first two moments, since they are the most important in a count regression context.

Proposition 2. The balanced discrete counterpart of X, Y ∼ BD(θ), has mean µ_Y(θ) = µ_X(θ) and variance

σ_Y²(θ) = σ_X²(θ) + ζ_0(θ), where ζ_0(θ) = Σ_{y∈Z} ∫_y^{y+1} (x − y)(y + 1 − x) f_X(x|θ) dx.

From Proposition 2, it appears that the index of dispersion of a balanced discrete variable can be obtained as σ_Y²(θ)/µ_Y(θ), µ_Y(θ) being the mean of Y (exact or estimate). The following corollary infers the index of dispersion of a balanced discrete distribution from Proposition 2. Furthermore, ζ_0(θ) can be approximated with a tolerance α ∈ (0, 1) by the truncated sum given in Eq (19).
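Since Eq (5) gives E[Y|X] = X and Var(Y|X) = {X}(1 − {X}) with {x} = x − ⌊x⌋, the law of total variance yields Var(Y) = Var(X) + E[{X}(1 − {X})]. A Monte Carlo sketch of this decomposition (the gamma parameters are an arbitrary illustration):

```python
import math
import random

random.seed(7)

# Simulate X ~ gamma (shape b = 3.0, rate a = 1.5) and its balanced
# discrete counterpart Y = floor(X) + Ber(frac(X)).
b, a = 3.0, 1.5
n = 300_000
xs = [random.gammavariate(b, 1.0 / a) for _ in range(n)]
ys = [math.floor(x) + (1 if random.random() < x - math.floor(x) else 0) for x in xs]

def var(vs):
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

# zeta_0 = E[{X}(1 - {X})]: the extra variance induced by probabilistic rounding.
zeta0 = sum((x - math.floor(x)) * (1 - (x - math.floor(x))) for x in xs) / n
print(var(ys), var(xs) + zeta0)  # the two sides of the decomposition agree
```

Note that {X}(1 − {X}) ≤ 1/4 pointwise, so balanced discretization inflates the variance by at most 1/4.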

The next proposition shows the relation between the moments of balanced discrete distributions and those of discrete concentrations.

Proposition 3. Let X, U and Y be defined as in Eq (5), and let r ∈ N+. Then

µ_Y^(r)(θ) = µ_Z^(r)(θ) + Σ_{i=0}^{r−1} C(r, i) µ_{ZU}^(i)(θ),

where µ_Z^(r)(θ) is the rth moment of the discrete concentration Z = ⌊X⌋ (Eq (3)) and µ_{ZU}^(i)(θ) is the expectation of the product of Z^i and U with U|X ∼ Ber(X − Z), and is given by

µ_{ZU}^(i)(θ) = Σ_{y∈Z} y^i [E_X(1, y|θ) − yH_X(y|θ)].

Proposition 4. Let X, U and Y be defined as in Eq (5). Then, for y ∈ Z with probability mass p_Y(y|θ) > 0, the conditional density of X given Y = y is

f_{X|Y}(x|y, θ) = [(x − y + 1)I_{(y−1,y)}(x) + (y + 1 − x)I_{(y,y+1)}(x)] f_X(x|θ)/p_Y(y|θ),

and for r ∈ R such that X^r is well defined in both (y − 1, y) and (y, y + 1),

E[X^r|Y = y] = [E_X(r + 1, y − 1|θ) − (y − 1)E_X(r, y − 1|θ) + (y + 1)E_X(r, y|θ) − E_X(r + 1, y|θ)]/p_Y(y|θ).

Therefore, in the Expectation-Maximization algorithm framework, the maximization of the joint likelihood of Y and X reduces to the maximization of the likelihood f_X(x|θ) of the continuous variable X. Hence, the Expectation-Maximization algorithm will be appropriate for fitting a balanced discrete distribution whenever fitting the underlying continuous distribution is easy.

Recall from Eq (4) that the cdf of the mean-preserved count variable of [3] coincides with the balanced discrete cdf. Hence, balanced discretization as defined in Eq (5) provides a generating mechanism for the mean-preserving method of [3].

The gamma distribution G(b, a) with shape b > 0 and rate a > 0 has expectation b/a, variance b/a², and rth-order partial moment E_g(r, y|b, a) = E_X[X^r 1(y ≤ X < y + 1)] given by

E_g(r, y|b, a) = [Γ(b + r)/(a^r Γ(b))] [F_g(y + 1|b + r, a) − F_g(y|b + r, a)],

where F_g(·|b, a) denotes the cdf of G(b, a). The one-parameter gamma distribution is obtained for a = 1 (equidispersion) and is denoted G(b). The balanced discrete gamma distribution BDG(µ, a) is the balanced discrete counterpart of X ∼ G(µa, a), so that its mean is µ. Then, the pmf and cdf for y ∈ N, the variance and the index of dispersion of Y follow by plugging the gamma partial moments into Lemma 1 and Proposition 2. Note that ζ_0^g(µ, a) can be approximated via the truncation mechanism in Eq (19).
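Assuming the parametrization BDG(µ, a) with latent variable X ∼ G(µa, a) (consistent with the stated mean µ), the pmf can be evaluated directly from the probabilistic-rounding weights. A sketch using midpoint quadrature for the partial moments rather than closed-form incomplete-gamma expressions:

```python
import math

def gamma_pdf(x, b, a):
    """Density of the gamma distribution with shape b and rate a."""
    return (a ** b) * x ** (b - 1) * math.exp(-a * x) / math.gamma(b)

def pmf_bdg(y, mu, a, steps=2000):
    """pmf of the balanced discrete gamma BDG(mu, a), taking X ~ G(mu*a, a).
    p(y) = int_{y-1}^{y} (x - y + 1) f(x) dx + int_{y}^{y+1} (y + 1 - x) f(x) dx,
    i.e. the probabilistic-rounding weights, evaluated by midpoint quadrature."""
    b = mu * a
    total = 0.0
    for lo, w in ((y - 1, lambda x: x - y + 1), (y, lambda x: y + 1 - x)):
        if lo + 1 <= 0:          # interval entirely outside the gamma support
            continue
        h = 1.0 / steps
        for k in range(steps):
            x = lo + (k + 0.5) * h
            if x > 0:
                total += w(x) * gamma_pdf(x, b, a) * h
    return total

mu, a = 2.5, 1.2
probs = [pmf_bdg(y, mu, a) for y in range(60)]
print(sum(probs))                              # close to 1: a proper pmf
print(sum(y * p for y, p in enumerate(probs)))  # close to mu = 2.5
```

The mean check illustrates why the family suits regression: the expectation is the parameter µ itself, with no normalizing-constant correction.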
The one-parameter balanced discrete gamma distribution BDG(µ, 1) corresponds to a latent equidispersion mechanism and is marginally slightly overdispersed, as indicated by Eq (30) with a = 1. Setting a = µ^{-1} produces the balanced discrete exponential distribution BE(µ), which is close to the geometric distribution since the latter corresponds to the discrete concentration of the exponential distribution [25].

The Jensen–Shannon divergence between two pmfs p and q on Z is symmetric, bounded, and is zero only if p(y) = q(y) for all y ∈ Z. [...] It appears that a count regression model based on the one-parameter BDG distribution will be an effective parsimonious (few parameters and more tractable) model [28] that can be fitted to observed data to check the appropriateness of an equidispersion model.
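The closeness of the one-parameter family to the Poisson distribution can be probed numerically with the Jensen–Shannon divergence between a simulated BDG(µ, 1) pmf (latent X ∼ G(µ) with unit rate, an assumed parametrization) and the exact Poisson(µ) pmf; a Monte Carlo sketch:

```python
import math
import random

random.seed(1)

mu, n = 5.0, 300_000
counts = {}
for _ in range(n):
    x = random.gammavariate(mu, 1.0)                      # latent X ~ G(mu), rate 1
    z = math.floor(x)
    y = z + (1 if random.random() < x - z else 0)          # balanced discretization
    counts[y] = counts.get(y, 0) + 1

support = range(max(counts) + 1)
p = [counts.get(y, 0) / n for y in support]                # estimated BDG(mu, 1)
q = [math.exp(-mu) * mu ** y / math.factorial(y) for y in support]  # Poisson(mu)

def kl(p, m):
    """Kullback-Leibler divergence, with 0 * log(0) = 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print(jsd)  # near 0: BDG(mu, 1) is close to Poisson(mu)
```

A small divergence here is consistent with the paper's conjecture that the latent-equidispersion BDG model recovers a near-Poisson fit on Poisson data.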

Therefore, while a BDG regression model will allow exact inference in flexible count modeling, testing for latent equidispersion will allow recovering a near-Poisson regression model fit when the data are Poisson distributed.

Conclusion

With a view to allowing exact inference in flexible count regression models, this work describes balanced discretization, a method for simulating and modeling integer-valued data starting from a continuous random variable, through the use of a probabilistic rounding mechanism. Most existing alternatives were built to conserve a specific characteristic of the continuous variable, e.g. the failure rate [29] and the survival [14] functions for modeling reliability data. Our proposal preserves the expectation and is thus appropriate for count regression. The method is very close to the discretizing approach of [17], which also preserves the mean value but requires an a priori double truncation of the continuous variable and introduces a tuning parameter. Physical interpretation is an important selection criterion for choosing an appropriate discretization method [30]. As such, our proposal was motivated by a real-world generating mechanism and provides a physical interpretation for the mean-preserving method of [3]. Although balanced discrete distributions can model any count data, they may not be appropriate for ageing data, for which the integer part or the ceiling is generally used [14], so that discrete concentrations are a better choice.

The flexibility of the balanced discrete gamma family developed from the continuous gamma distribution covers equi-, under- and overdispersed count data. [...]

Appendix

Proof of Lemma 1. By Eq (5), the probability of observing Y = y given y ≤ X < y + 1 and X = x is 1 + y − x. Thus, the probability of observing Y = y and y ≤ X < y + 1 is the integral of (1 + y − x) f_X(x|θ) with respect to x over (y, y + 1), which proves Eq (8). Using the same argument on the probability of observing Y = y + 1 given that y ≤ X < y + 1 leads to P(Y = y + 1, y ≤ X < y + 1) equaling the integral of (x − y) f_X(x|θ) with respect to x over (y, y + 1), which yields Eq (9). Next, since Y is discrete and takes one of the two values y and y + 1 when y ≤ X < y + 1, the partial mean of Y writes yP(Y = y, y ≤ X < y + 1) + (y + 1)P(Y = y + 1, y ≤ X < y + 1). Replacing P(Y = y, y ≤ X < y + 1) + P(Y = y + 1, y ≤ X < y + 1) by the equivalent probability P(y ≤ X < y + 1) = F_X(y + 1|θ) − F_X(y|θ) and using Eq (9) to obtain P(Y = y + 1, y ≤ X < y + 1) results in Eq (10). The two ways to obtain Y = y are (U = 1 and y − 1 ≤ X < y) and (U = 0 and y ≤ X < y + 1).

In other words, Y = y occurs either with y − 1 ≤ X < y or with y ≤ X < y + 1. Since the two instances are mutually exclusive, this gives

p_Y(y|θ) = P(Y = y, y − 1 ≤ X < y) + P(Y = y, y ≤ X < y + 1) = E_X(1, y − 1|θ) − (y − 1)H_X(y − 1|θ) + (y + 1)H_X(y|θ) − E_X(1, y|θ),

where the second equality follows from replacing y by y − 1 in Eq (9) to compute the probability P(Y = y, y − 1 ≤ X < y) and using Eq (8) to obtain P(Y = y, y ≤ X < y + 1).
which follows by Eq (9) on using y − 1 instead of y. The expression of ρ_y then follows as given in Eq (21). From Eq (5), the joint density of X, U and Y at (x, u, y), for u ∈ {0, 1}, is

f(x, u, y|θ) = (x − y + u)^u (1 + y − u − x)^{1−u} f_X(x|θ) I_{(y−u, y−u+1)}(x),

where the first equality follows by Bayes' rule and the last follows on using y = ⌊x⌋ + u, i.e. ⌊x⌋ = y − u.