Distributions You Can Count On . . . But What’s the Point? †

Abstract: The Poisson regression model remains an important tool in the econometric analysis of count data. In a pioneering contribution to the econometric analysis of such models, Lung-Fei Lee presented a specification test for a Poisson model against a broad class of discrete distributions sometimes called the Katz family. Two members of this alternative class are the binomial and negative binomial distributions, which are commonly used with count data to allow for under- and over-dispersion, respectively. In this paper we explore the structure of other distributions within the class and their suitability as alternatives to the Poisson model. Potential difficulties with the Katz likelihood lead us to investigate a class of point optimal tests of the Poisson assumption against the alternative of over-dispersion in both the regression and intercept only cases. In a simulation study, we compare score tests of 'Poisson-ness' with various point optimal tests, based on the Katz family, and conclude that it is possible to choose a point optimal test which is better in the intercept only case, although the nuisance parameters arising in the regression case are problematic. One possible cause is poor choice of the point at which to optimize. Consequently, we explore the use of Hellinger distance to aid this choice. Ultimately we conclude that score tests remain the most practical approach to testing for over-dispersion in this context.


Introduction
The well-known Pearson family of continuous distributions, originally explored by Pearson (1895), comprises the solutions to a particular differential equation. In his PhD thesis, Katz (1945) explored a family of discrete distributions that are solutions to a difference equation analogous to the Pearson differential equation. 1 The Pearson family is a collection of four-parameter distributions and specializations thereof. Katz (1965, p. 175) observes that certain specializations 'produce simpler and more manageable classes' and restricts attention to a set of one- and two-parameter distributions. In particular, his restrictions result in a family of distributions that nest the two-parameter binomial and negative binomial (or Pascal) distributions, together with the one-parameter Poisson distribution. 2 A defining characteristic of these distributions is that they arise when certain parameters, or parameter ratios, take integer values and so represent a set of measure zero in respect of the set of family members, which are defined in terms of real-valued parameters. The Katz family of distributions has proved important in the analysis of count data. It provides a framework within which practitioners can extend simple Poisson models to models that allow for individual heterogeneity, using the Poisson regression model (PRM). The PRM can, in turn, be extended to models that allow for either over-dispersion, using the negative binomial regression model (NBRM), or under-dispersion, using the binomial regression model. We shall, for the most part, defer consideration of under-dispersion to another time.
The problems of modelling and testing for over-dispersion have proved important in the count data literature. Essentially concurrently, papers by Cameron and Trivedi (1986), Lee (1986), and Lawless (1987a, 1987b) made substantial contributions to the literature on inference in the PRM, the NBRM, and testing for over-dispersion, with both Cameron and Trivedi (1986) and Lee (1986), in particular, couching substantial parts of their analysis within the context of the Katz family of distributions. This class of distributions is interesting because the binomial and negative binomial distributions are alternative specifications to the Poisson that allow under- and over-dispersion, respectively. Subsequent contributions to this literature include Dean and Lawless (1989); Dean (1992); Qu et al. (1990) and Fang (2003). Collectively they have explored likelihood ratio (LR), Lagrange multiplier (LM), and Wald tests, together with tests based on generalised method of moments (GMM) estimators, for over-dispersion in the PRM. Fang (2003) concludes that his preferred GMM test is that based on the fewest over-identifying restrictions, offering essentially the same power as tests based on more over-identifying restrictions but having the greatest ease of calculation. 3 Interestingly, this preferred test is that originally proposed by Katz (1965), on an ad hoc basis.
In this paper, we investigate a new family of tests for over-dispersion in the PRM by exploring point optimal tests where the alternative hypothesis lies in the Katz family of distributions. An analysis of the Katz likelihood reveals that maximum likelihood estimation may be problematic in the over-dispersed case, suggesting that the use of point optimal tests may have value. For overviews of the use of point optimal tests in econometrics see King (1987) and King and Sriananthakumar (2015). To the best of our knowledge they have not previously been used in the context of testing for over-dispersion. This paper can be thought of as comprising three main parts. The first part provides a very brief description of the family of distributions introduced by Katz (1965), the second explores the role that these distributions can play in extending the PRM to allow for over-dispersion and, finally, we introduce a new class of point optimal tests for over-dispersion. Specifically, in Section 2 we explore the Katz family of distributions, although most of the analysis is relegated to the Appendix, while Section 3 explores the PRM and NBRM. In particular, we highlight that the typical treatments of the NBRM really have little to do with what might be thought of as the canonical negative binomial distribution. Section 4 then focuses on the problem of testing for over-dispersion and the structure of the Katz likelihood. It is here that we introduce our family of point optimal tests and explore their small sample characteristics relative to some existing tests via a simulation study. We find that it is possible to choose a point optimal test which is better in the intercept only case, although the regression case proves problematic. One possible source of weakness in our point optimal tests is the choice of 'point' at which to optimize. In Section 5 we explore the use of Hellinger distance as a device to assist in choice of point. Although exact calculation of the Hellinger distance in this context is not tractable, it is straightforward to obtain bounds on the distance. Using the upper bound, we find that the implied optimal points are extremely close to zero, implying that use of the score test is close to the optimal strategy in this context and so our advice to practitioners is to continue to use score tests to test for over-dispersion in this context. Section 6 concludes.
2 This family of distributions, and extensions to it, have proved important in the actuarial modelling of claims; see, for example, Hess et al. (2002); Panjer (1981); Sundt and Jewell (1981); Willmot (1988), and Pestana and Velosa (2004). Johnson et al. (1993, Chapter 2) provides an extensive discussion of both the Katz family and various other, often related, families of discrete distributions, although, in respect of the Katz family of distributions alone, the treatment in Johnson and Kotz (1969, Chapter 2.4) is more complete; see also Gurland (2006) for a more recent treatment.
3 The one caveat to this observation is that the use of higher order moments may provide some power against models which share low order moments, thereby creating a class of implicit null hypotheses (Davidson and MacKinnon 1987).

The Katz Family of Distributions
Among his many and varied interests, Karl Pearson was concerned with the problem of modelling (possibly) asymmetric empirical distributions. To this end, he developed a four-parameter family of skewed continuous distributions as solutions to a particular differential equation (Pearson 1895), the idea being that the distributions might be fitted to any data set using the method of moments approach that he had developed earlier (Pearson 1894). Perhaps surprisingly, the motivation for the choice of differential equation came from a difference equation that could be used to generate the hypergeometric distribution (Pearson 1895, pp. 360-361); that is, from a discrete distribution. If we let p(y) ≡ P[Y = y] denote the probability that the discrete random variable Y takes a value $y \in \mathcal{Y}$, where $\mathcal{Y}$ denotes the support of the distribution of Y, then the form of this difference equation was given by (1), with $a$, $b_0$, $b_1$, and $b_2$ denoting the various parameters of the distribution. We note that this expression is of the form p(y)/p(y − 1) = P(y)/Q(y), where P and Q are polynomials in y, and remark in passing that the sequence of probabilities so defined is hypergeometric in that the ratio of adjacent terms in the sequence can be expressed as a ratio of polynomials in the index y.
Pearson did not pursue a discrete analogue to his family of distributions. Indeed, apart from some incidental investigations along these lines, Katz (1945) provided the first detailed analysis of the family of distributions arising from (1) although, apart from some abstracts (Katz 1946, 1948), it was not until Katz (1965) that this material was published. In the event, Katz (1965) focussed on a two-parameter special case of (1), 4 which he expressed in the form
$$\frac{p(y+1)}{p(y)} = \frac{\lambda + \gamma y}{1 + y}, \tag{2}$$
with $y \in \mathcal{Y} \subseteq \mathbb{Z}_0$, where $\mathbb{Z}_0$ denotes the set of non-negative integers, and subject to the usual axiomatic properties of probability:
$$p(y) \geq 0 \ \text{ for all } y \in \mathcal{Y} \quad \text{and} \quad \sum_{y \in \mathcal{Y}} p(y) = 1. \tag{3}$$
As we demonstrate in the appendix, there are circumstances where both Λ and Γ, the sets of admissible values of λ and γ respectively, may include values that are positive, negative, or zero.
4 Numerous extensions soon followed; see, for example, Bardwell and Crow (1964); Crow and Bardwell (1965); Ord (1967a, 1967b); Staff (1964, 1967) and Kemp (1968). Here we only briefly sketch some key ideas. For a more complete treatment of such families of distributions see, for example, any of Johnson et al. (1993, Chapter 2.3), Ord (1972, Chapter 5), or Dacey (1972).
Although Katz himself included zero in $\mathcal{Y}$, subsequent literature has not always done so, choosing instead to focus on the difference equation (2), whilst still referring to the resulting distributions as members of the Katz family; see, for example, Sundt and Jewell (1981); Willmot (1988) and Miller (1998). We too shall proceed in this latter manner, focussing on what we call left-truncated Katz distributions, which include the original definition of Katz (1965) as the special case of no left-truncation. We relegate the technical analysis of these distributions to the appendix, which also gathers a number of other properties of this family of distributions.
Two members of this family that will be of particular interest to us are the Poisson and negative binomial distributions, which are commonly encountered in the modelling of counts and the possibility of over-dispersion. The probability mass functions (pmfs) of these distributions are of the form
$$p(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad \gamma = 0, \tag{4}$$
and
$$p(y) = \frac{(\lambda/\gamma)_y\, \gamma^{y} (1-\gamma)^{\lambda/\gamma}}{y!}, \qquad 0 < \gamma < 1, \tag{5}$$
respectively. 5 Evidently, when γ = 0, p(y) is the pmf of the Poisson distribution. When γ > 0, if λ/γ = r is integer then p(y) yields a standard representation of the negative binomial pmf, where the probability of success in any given trial is π = 1 − γ. Even if λ/γ = τ is not integer, p(y) is still the pmf of a negative binomial distribution (see, for example, (9)), although the interpretation of λ/γ differs between the two cases. 6 We shall, hereafter, denote the Poisson distribution with parameter λ by P(λ), and the negative binomial with parameters τ and π by NB(τ, π). Before moving on, let us consider the well-known Poisson approximation to the negative binomial. A common statement of this result is NB(τ, π) → P(λ) as τ → ∞ provided λ = τ(1 − π) remains fixed. That is, π → 1 at the same rate as τ diverges. One advantage of the parameterization adopted in (5) is that the somewhat convoluted requirement on how the parameters evolve in the approximation readily reduces to $\gamma \to 0^{+}$ for fixed λ. 7
5 Observe that the Pochhammer symbol $(r)_y = \Gamma(y + r)/\Gamma(r)$, where y is a non-negative integer. Note that r can be negative. If r is a negative integer then $(r)_y = 0$ for all y > −r. If r is a positive integer then $(r)_y = (y + r − 1)!/(r − 1)!$.
6 When λ/γ is integer the resulting pmfs are sometimes referred to as those of Pascal distributions, with the term negative binomial reserved for the more general case of λ/γ not necessarily integer.
7 Similarly, the Poisson approximation to the Binomial reduces to $\gamma \to 0^{-}$ for fixed λ, which is also a more intuitive statement of how parameters must evolve for the approximation to work than is typically encountered.
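As a quick numerical check on the relationships just described, the following Python sketch builds the pmf from the recurrence p(y + 1) = (λ + γy)/(y + 1) p(y), started from p(0) = (1 − γ)^(λ/γ), and compares it with the Pascal and Poisson pmfs. The function names, the truncation point `y_max`, and the parameter values are illustrative assumptions rather than part of the original analysis.

```python
import math

def katz_pmf(lam, gamma, y_max):
    """Return [p(0), ..., p(y_max)] for the Katz family with 0 <= gamma < 1."""
    if gamma == 0.0:
        p = math.exp(-lam)                    # Poisson case
    else:
        p = (1.0 - gamma) ** (lam / gamma)    # p(0) = (1 - gamma)^(lam / gamma)
    probs = [p]
    for y in range(y_max):
        p *= (lam + gamma * y) / (y + 1)      # p(y + 1) obtained from p(y)
        probs.append(p)
    return probs

def pascal_pmf(r, pi, y):
    """Negative binomial (Pascal) pmf with integer r and success probability pi."""
    return math.comb(y + r - 1, y) * pi ** r * (1.0 - pi) ** y

def poisson_pmf(lam, y):
    return math.exp(-lam) * lam ** y / math.factorial(y)

# lam / gamma = 4 is an integer, so K(2, 0.5) should match the Pascal pmf with pi = 0.5.
katz = katz_pmf(2.0, 0.5, 10)
max_gap_pascal = max(abs(katz[y] - pascal_pmf(4, 0.5, y)) for y in range(11))

# Letting gamma approach zero from above with lam fixed should approach the Poisson pmf.
near_poisson = katz_pmf(2.0, 1e-6, 10)
max_gap_poisson = max(abs(near_poisson[y] - poisson_pmf(2.0, y)) for y in range(11))
```

Both gaps should be negligible, which illustrates why a single parameter γ is enough to index the move from the Poisson to the negative binomial.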

The Poisson Regression Model
The PRM extends the Poisson distribution to allow for individual heterogeneity. It has played an important role in the analysis of count data in both econometrics and statistics; early references include Gart (1964), Jorgenson (1961), and Haight (1967, Chapter 5). It is readily available in standard software such as MATLAB, Stata, and R. The use of the PRM in econometrics became increasingly widespread following the significant contributions of Gilbert (1979, 1982) and Hausman et al. (1984). Recent summaries can be found in Greene (2007); Winkelmann (2008) and Cameron and Trivedi (2013).
The PRM is obtained from the Poisson distribution by replacing the fixed parameter λ with a function, denoted $\lambda_i$ say, of the k-vector of characteristics $x_i$ that can vary across individuals. Specifically, in the language of generalized linear models (GLIMs), we have the link function
$$\lambda_i = \exp\{x_i \beta\}, \tag{6}$$
with regression coefficients β. The work of Nelder and Wedderburn (1972) and Frome et al. (1973) shows how iterated least squares methods can be used to obtain maximum likelihood estimates of β; see also McCullagh and Nelder (1989). One shortcoming of the PRM is the implied equality of mean and variance that is characteristic of the Poisson distribution. Specifically, on replacing λ with $\lambda_i$ in (A8), we obtain 8
$$E[Y_i \mid \lambda_i] = V[Y_i \mid \lambda_i] = \lambda_i. \tag{7}$$
This is at odds with the observation that variability typically exceeds location in real world data, a feature known as over-dispersion. A common response to concerns about over-dispersion has been to explore extensions to the Poisson model that allow for different means and variances. To the extent that the Poisson regression model can be nested in such generalizations, this approach provides a framework within which one might test for either over-dispersion or under-dispersion, although we will not explore this latter case here.
The fundamental characteristic of the PRM is that it is a function of the linear index, $x_i \beta$, only through the 'parameter' $\lambda_i$, as per (7). In the next two sub-sections we will consider different extensions to this model, the first being the classical NBRM and the second being what we dub the Katz regression model (KRM). Both models extend the PRM by nesting it within a richer model with an additional 'parameter'. An important distinction between the NBRM and the KRM is the role of $\lambda_i$. In the case of the NBRM, $\lambda_i$ remains the conditional mean of the count $Y_i$, whereas this is not the case in the KRM. A second distinction between the models is that the additional 'parameter' is typically treated as being a function of the linear index in the NBRM whereas in our treatment of the KRM it is not; it is a genuine parameter, although it is easy to envisage extensions where that requirement is relaxed.
8 We shall persist with the abuse of notation inherent in expressions like $E[Y_i \mid \lambda_i]$ rather than, say, a more complete notation along the lines of $E[Y_i \mid \beta; x_i]$, for the sake of the notational economy it affords.

The Classical Negative Binomial Regression Model
There are numerous paths leading to what might reasonably be called a negative binomial regression model. 9 This is due, at least in part, to the variety of ways in which one might generate a negative binomial distribution. For example, Boswell and Patil (1970) provide 15 different derivations and, of course, there is a variety of parameterizations of the negative binomial distribution that can also lead to differences. Below we explore a fairly commonly adopted approach and consider some of its implications.
Our starting point is the following observation, originally due to Greenwood and Yule (1920). Suppose that Y | θ ∼ P(θ), where θ is a random variable whose distribution is gamma with shape (τ) and rate (η) parameters, written θ ∼ G(η, τ), so that the corresponding density function is, 10
$$f(\theta) = \frac{\eta^{\tau}}{\Gamma(\tau)}\, \theta^{\tau - 1} e^{-\eta \theta}, \qquad \theta > 0, \tag{8}$$
with E[θ] = τ/η and V[θ] = τ/η². 11 Then, we obtain an unconditional distribution for Y on averaging with respect to θ > 0, so that
$$p(y) = \int_0^{\infty} \frac{e^{-\theta}\theta^{y}}{y!}\, f(\theta)\, d\theta = \frac{\Gamma(y + \tau)}{\Gamma(\tau)\, y!} \left(\frac{\eta}{1 + \eta}\right)^{\tau} \left(\frac{1}{1 + \eta}\right)^{y}. \tag{9}$$
If one imposes the restriction $\tau \equiv r \in \mathbb{N}_{+}$, where $\mathbb{N}_{+}$ denotes the set of positive integers (or the natural numbers), then this is simply a form of the negative binomial (Pascal) pmf, with π = η/(1 + η), see (A5). Note that E[Y] = τ/η = λ (say), the same as for the gamma distribution (8), and that
$$V[Y] = \frac{\tau}{\eta} + \frac{\tau}{\eta^{2}} = \lambda \left(1 + \frac{1}{\eta}\right).$$
One possible path to a NBRM is to extend the analysis of Greenwood and Yule (1920) to allow for individual heterogeneity; we follow the treatment of Cameron and Trivedi (1986, p. 32). Specifically, we replace θ by $\theta_i$, where
$$\theta_i = \exp\{x_i \beta + \varepsilon_i\} = \delta_i \exp\{\varepsilon_i\}, \qquad \delta_i = \exp\{x_i \beta\}, \tag{10}$$
with $\varepsilon_i$ a disturbance term reflecting unobservables. Cameron and Trivedi (1986) then assume that either $\varepsilon_i$, or 'equivalently' $\theta_i$, have a gamma distribution, conditional on the regressors. Their analysis then proceeds under the latter assumption, which is completely analogous to the developments of Greenwood and Yule (1920). Specifically, letting $\theta_i \mid x_i \sim G(\eta_i, \tau_i)$, we have
$$p(y_i \mid x_i) = \frac{\Gamma(y_i + \tau_i)}{\Gamma(\tau_i)\, y_i!} \left(\frac{\eta_i}{1 + \eta_i}\right)^{\tau_i} \left(\frac{1}{1 + \eta_i}\right)^{y_i}. \tag{11}$$
Moreover,
$$E[Y_i \mid x_i] = \frac{\tau_i}{\eta_i} \tag{12}$$
and
$$V[Y_i \mid x_i] = \frac{\tau_i}{\eta_i}\left(1 + \frac{1}{\eta_i}\right). \tag{13}$$
It is immediately obvious that, in the final analysis, the functional form of (10) is a complete irrelevance, with only the parameters of the mixing Gamma distribution of any importance, and we have made no assumptions about them beyond allowing the possibility of varying at the individual level. From here, Cameron and Trivedi (1986) argue that a variety of models are available on defining
$$\tau_i = \frac{1}{\alpha} \left(E[Y_i \mid x_i]\right)^{k} \tag{14}$$
for α > 0 and arbitrary constant k, so that
$$V[Y_i \mid x_i] = E[Y_i \mid x_i] + \alpha \left(E[Y_i \mid x_i]\right)^{2 - k}.$$
Special cases of importance are then the Negbin I model (obtained when k = 1) and the Negbin II model (k = 0), 12 of which the latter is probably the more popular in the literature. This model nests the PRM as a limiting case where α → 0 from above, a testable proposition, which is equivalent to $\tau_i$ diverging to ∞ for all i.
9 The NBRM was explored in Adamidis (1999); Greene (2008); Lawless (1987a, 1987b), and Raschke and Greene (2010). Hilbe (2011) and Hilbe (2014) provide useful recent surveys of the NBRM.
10 Common variants of this argument include: (i) Lee (1986), who specifies the gamma distribution in terms of the shape and scale (or inverse rate) (ξ = 1/η) parameters, that is, θ ∼ G(1/ξ, τ), and (ii) Cameron and Trivedi (1986), who use the so-called index form of the gamma distribution, which is specified in terms of the shape and mean (φ = τ/η) parameters, that is, θ ∼ G(τ/φ, τ). Cameron and Trivedi (1986) call the shape parameter (τ) the index or precision parameter.
11 Moments for the gamma distribution specifications given in Footnote 10 follow immediately on making the appropriate substitution for η.
The specification (10) becomes more relevant if, instead, we assume that
$$\exp\{\varepsilon_i\} \mid x_i \sim G(\eta_i, \tau_i).$$
Thus, conditional on $x_i$, $\theta_i$ is a scaled Gamma random variate which is, itself, a Gamma random variate. From the properties of the Gamma distribution we have immediately that $\theta_i \mid x_i \sim G(\eta_i/\delta_i, \tau_i)$. Moreover, analogs of results (11)-(13) are immediately available in this case on replacing $\eta_i$ by $\eta_i/\delta_i$. In short, the differing distributional assumptions are 'equivalent' in that the structure of the results is the same in both cases; however, it is only in this latter case that (10) has any relevance, through the presence of $\delta_i$ in the various expressions.
The attraction of the formulation (11)-(13) of Cameron and Trivedi (1986) is its close resemblance to a GLIM, which simplifies estimation. 13 As noted above, the null of a PRM obtains as $\tau_i \to \infty$; however, this results in a relatively odd PRM with a potentially unbounded mean, unless $\eta_i$ is diverging to infinity at the same rate as is $\tau_i$. Moreover, there is a Davies-type problem relating to the separate identification of both τ and η when the null is true. Greene (2008) camouflages this difference by imposing the restriction η = τ. 14 He refers to this restriction as being mean preserving, by which is meant that $E[Y_i \mid x_i] = E[\theta_i \mid x_i] = \delta_i$ when η = τ, as would be the case in the PRM. We should note that, in order to generate the same class of models as do Cameron and Trivedi (1986), Greene (2008) also allows τ to be replaced by $\tau_i$, as defined by (14), but this means that η must be replaced by $\tau_i$ too. 15 Of course, the restriction that $\tau_i = \eta_i$ reduces the two parameter mixing gamma distribution to a single parameter distribution with the loss of modelling flexibility that implies. However, without this restriction, the conditional mean of $Y_i$ is other than $\delta_i$.
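The mixing argument of Greenwood and Yule (1920) that opens this sub-section can be checked numerically. The sketch below averages the Poisson pmf over a gamma density with rate η and shape τ by a simple midpoint rule and compares the result with the closed-form negative binomial pmf with π = η/(1 + η). The parameter values, the integration grid, and the function names are illustrative assumptions.

```python
import math

def gamma_density(theta, rate, shape):
    """Gamma density in the shape/rate parameterization."""
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1.0) * math.log(theta) - rate * theta)

def mixed_pmf_numeric(y, rate, shape, upper=40.0, steps=20000):
    """Midpoint-rule approximation to the integral of Poisson(y | theta) times the gamma density."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        theta = (j + 0.5) * h
        poisson = math.exp(-theta + y * math.log(theta) - math.lgamma(y + 1))
        total += poisson * gamma_density(theta, rate, shape)
    return total * h

def negbin_pmf(y, rate, shape):
    """Closed form of the mixture: negative binomial with pi = rate / (1 + rate)."""
    pi = rate / (1.0 + rate)
    return math.exp(math.lgamma(y + shape) - math.lgamma(shape) - math.lgamma(y + 1)
                    + shape * math.log(pi) + y * math.log(1.0 - pi))

rate, shape = 1.5, 2.5      # illustrative values; the shape need not be an integer
gaps = [abs(mixed_pmf_numeric(y, rate, shape) - negbin_pmf(y, rate, shape)) for y in range(6)]
mean = sum(y * negbin_pmf(y, rate, shape) for y in range(400))   # should be close to shape / rate
```

The mean of the mixture equals that of the mixing distribution, τ/η, which is the quantity that the 'mean preserving' restriction discussed above is designed to control.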

The Katz Regression Model
The fundamental difference between the NBRM, as described in the above, and what we refer to as the Katz regression model (KRM) lies in the generation of the underlying distribution. Specifically, the Katz family of distributions is not generated via a mixing argument and so, in contrast to the NBRM, the probabilistic quantities of interest (pmfs and moments) are not functions of the parameters of the mixing distribution; see (11)-(13). In this sense, the parameterization of the Katz family is more natural than that of the NBRM. Directly analogously with our earlier development of the PRM, the KRM can be generated from (A6) simply by replacing λ by $\lambda_i$, as per (6). 16 Equally, one might explore models that see γ replaced by functions of regressors, $\gamma_i$ say, although we will not. Note that the conditional mean and variance of this distribution are given by (A8), with λ replaced by $\lambda_i$. Contrast this structure with that for the NBRM described above. There we saw that the conditional mean of the dependent variable was not varying with γ, being a function of the linear index $x_i \beta$ alone. Similarly, by construction, the variance exceeded the mean of the dependent variable, but the reduction to a Poisson model requires the shape parameter τ of the mixing gamma distribution to be unbounded, which yields a degenerate distribution for given rate parameter η.
It is clear that it is not necessarily desirable to preserve the mean, in Greene's sense of equating $\tau_i$ and $\eta_i$ (Greene 2008), because, as γ increases, the mean for both the NBRM and the KRM should be decreasing relative to that of the PRM. We note in passing that this is the model that underlies the generalized event count (GEC) model of King (1989); this model was also considered by Ghahfarokhi et al. (2008). That they obtained more complicated models than that proposed here, resulting in the models being less popular than the NBRM in practice, stems from the fact that they did not have (A6) as the pmf implied by (2), which in part is due to working with (2) rather than (A1).
15 Specifically, Greene (2008) discusses the broader class of models obtained when k is allowed to take values other than 0 or 1 in (14). He dubs this broad model the NBP model, seemingly because his notation uses p rather than the k used by Cameron and Trivedi (1986) (and here).
16 Alternatively, using similar averaging arguments to those seen previously for the NBRM, if we average P(θ) with respect to $G(\theta; \pi/(1 - \pi), n)$, where π = 1 − γ and n = λ/γ, then we obtain a more common form of the negative binomial pmf.
Note that the mean and variance of this distribution are given by (A8). In contrast with the developments of (11), there is nothing in this model that requires that both the parameters of the mixing gamma distribution vary with the index i. Nor need they be linked in any restrictive way. Specifically, if we were to follow the developments of Greene (2008), who equates the parameters of the mixing distribution, we find that
$$\frac{\pi}{1 - \pi} = n \quad \Longleftrightarrow \quad \frac{1 - \gamma}{\gamma} = \frac{\lambda}{\gamma} \quad \Longleftrightarrow \quad \lambda = 1 - \gamma,$$
which constrains 0 < λ < 1 and, as $\lambda = \exp\{x_i \beta\}$, this implies that $x_i \beta < 0$. As a general statement, this would appear to be a very odd restriction to want to impose.

Testing for Over-Dispersion in the Poisson Regression Model
There is a vast literature addressing the problem of over-dispersion and how to test for it. We will not attempt to provide a comprehensive survey of this literature, focussing instead on a few key contributions, although it should be noted that most of the references cited so far will have some discussion of the problem. We shall break our discussion into two parts. First, we shall restrict attention to the case where the only regressor in the model is an intercept, so that λ i = λ is a constant. Then we will extend the analysis to allow for additional regressors. In each case, the null hypothesis will be that the data have been generated by the PRM. We investigate the performance of tests whose preferred alternative is, variously, that the data have come from one of the Negbin I, Negbin II, or Katz regression models. 17

The Katz Likelihood
For any positive real number n = λ/γ, with λ > 0 and 0 < γ < 1, the Katz pmf of the sample $y_1, \ldots, y_N$ is given by
$$p(y_1, \ldots, y_N) = \left[\prod_{i=1}^{N} \prod_{s=0}^{y_i - 1} (n + s)\right] \gamma^{\sum_{i=1}^{N} y_i}\, (1 - \gamma)^{N n} \left[\prod_{i=1}^{N} \frac{1}{y_i!}\right], \tag{15}$$
where products of the form $\prod_{s=0}^{y_i - 1}$ are set equal to 1 for $y_i = 0$. In textbook cases, where n is known, the first and last terms in (15) are functions of the data only and $\sum_{i=1}^{N} y_i$ is sufficient by the factorization theorem. But in the current context, when n is not known, there is no reduction to a fixed dimensional sufficient statistic. Even if λ is known there is no sufficiency reduction; the ratio n is required. Only the entire sample (or the order statistics) is sufficient, but even they are not complete, so that different parameter configurations may give rise to the same data. We may surmise this from the likelihood (15) since any combination of λ and γ that preserves n gives the same likelihood. Adopting the convention that $\sum_{s=0}^{y_i - 1} \log(n + s) = 0$ when $y_i = 0$, the log likelihood is
$$\begin{aligned} \ell(\lambda, \gamma) &= \sum_{i=1}^{N} \sum_{s=0}^{y_i - 1} \log(n + s) + \log \gamma \sum_{i=1}^{N} y_i + N n \log(1 - \gamma) - \sum_{i=1}^{N} \log y_i! \\ &= \sum_{i=1}^{N} \sum_{s=0}^{y_i - 1} \log(\lambda + \gamma s) + \frac{N \lambda}{\gamma} \log(1 - \gamma) - \sum_{i=1}^{N} \log y_i!, \end{aligned} \tag{16}$$
where the second line follows by substituting for n and simplifying. In fact, (nonlinear) maximum likelihood estimators for n do not exist when the sample variance does not exceed the mean, that is, $s_y^2 \leq \bar{y}$ (see Al-Khasawneh (2010) and the references therein). Note that, even when drawing from a negative binomial, which by definition is over-dispersed, many individual samples will exhibit under-dispersion. We can see the difficulty explicitly by looking at simple moment estimates for λ and γ, that is, solve (A8) to get
$$\hat{\lambda} = \bar{y}\,(1 - \hat{\gamma}) \quad \text{and} \quad \hat{\gamma} = 1 - \frac{\bar{y}}{s_y^2}.$$
Hence, problems arise when $s_y^2 \leq \bar{y}$ since then $\hat{\gamma} \leq 0$, which is illegitimate when investigating over-dispersion. Even if $s_y^2 > \bar{y}$, convergence issues arise if the difference is not great. This suggests that test procedures that use maximum likelihood estimates, such as Wald or Likelihood Ratio tests, can be problematic and that there may be a role for point optimal approaches.
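To make the estimation difficulty concrete, the following Python sketch evaluates the Katz log likelihood, written as the sum over observations of Σ_{s<y_i} log(λ + γs) + (λ/γ) log(1 − γ) − log y_i!, and computes the moment estimates for two small artificial samples. The samples, the function names, and the use of the divisor N − 1 in the sample variance are assumptions made purely for illustration.

```python
import math

def katz_loglik(ys, lam, gamma):
    """Katz log likelihood for lam > 0 and 0 < gamma < 1."""
    total = 0.0
    for y in ys:
        total += sum(math.log(lam + gamma * s) for s in range(y))  # empty sum when y == 0
        total += (lam / gamma) * math.log1p(-gamma) - math.lgamma(y + 1)
    return total

def katz_moment_estimates(ys):
    """Moment estimates (lam_hat, gamma_hat); assumes the sample variance is positive."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)   # divisor n - 1 is an assumption
    gamma_hat = 1.0 - ybar / s2
    lam_hat = ybar * (1.0 - gamma_hat)
    return lam_hat, gamma_hat

over = [0, 0, 1, 7, 2, 8]      # sample variance 12.8 exceeds the mean 3
under = [2, 2, 3, 3, 2, 3]     # sample variance 0.3 is below the mean 2.5
lam_over, gamma_over = katz_moment_estimates(over)
lam_under, gamma_under = katz_moment_estimates(under)
```

For the first sample the estimate of γ lies inside (0, 1), whereas for the second it is negative, which is exactly the situation in which a likelihood-based fit of the over-dispersed model breaks down.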

Point Optimal Tests
Point optimal tests have had a long and varied career in econometrics; see King (1987) and King and Sriananthakumar (2015) for an overview. These tests optimize power at a particular parameter value under the alternative, the idea being to have good power at a point where incorrectly accepting the null really matters. This is in contrast to, say, a score test that is locally best, in that it has the steepest power function local to the null hypothesis. Although not an undesirable property in any way, the practical difference between a null model and some other model local to the null is often, although not always, vanishingly small. So, optimizing the ability to distinguish between such a null model and another local to it is not necessarily all that desirable a property. Moreover, there is implicit in such an approach the notion that the power function will be monotonically increasing, which ideally it should be, and that it will remain near the power envelope as the data generating process diverges from the null. In many cases this is indeed what happens, although we know that power functions are likely to cross, as otherwise the test would be uniformly most powerful, which is a very rare property indeed. The divergence between the power function of a score test and the power envelope is then something that requires exploration on a case by case basis and we will explore this below.
The log likelihood ratio of the Katz-NB alternative to the Poisson (P) null is written
$$LLR(\lambda_0, \lambda_1, \gamma) = \sum_{i=1}^{N} \sum_{s=0}^{y_i - 1} \log(\lambda_1 + \gamma s) + \frac{N \lambda_1}{\gamma} \log(1 - \gamma) + N \lambda_0 - \log \lambda_0 \sum_{i=1}^{N} y_i.$$
Assuming that both distributions are fully specified (with $\lambda = \lambda_0 = \lambda_1$), the Neyman-Pearson Lemma states that the UMP test of γ = 0 versus $\gamma = \gamma_1$ is given by $LLR(\lambda, \lambda, \gamma_1)$. Hence, assuming λ known, the so-called power envelope is determined by computing $LLR(\lambda, \lambda, \gamma)$ over a range of values of γ ∈ (0, 1). A PO test is constructed by choosing a fixed $\gamma = \gamma_{PO}$ to be a 'representative' value under the alternative Katz-NB distribution, giving $LLR(\lambda, \lambda, \gamma_{PO})$. It is desirable that $\gamma_{PO}$ be chosen so that the power of the test $LLR(\lambda, \lambda, \gamma_{PO})$ is as close as possible to the power of the family of tests $LLR(\lambda, \lambda, \gamma)$, γ ∈ (0, 1). That is, ideally, $\gamma_{PO}$ is chosen so that the power function of the resulting test is as close to the power envelope as possible.

Score Test
A common alternative to likelihood ratio approaches, which does not require maximum likelihood estimation of the parameter of interest, is to construct optimal tests local to the null γ = 0. The so-called efficient score tests, or simply score tests, are derived by differentiating the log likelihood with respect to γ and then setting γ = 0. Such score tests are easily found for the Katz family using (16); see, for example, Katz (1965) and Lee (1986). Specifically, the score test is given by (17). This test was originally proposed by Katz (1965) on heuristic grounds and by Lee (1986) as a formal score test. 18,19

Simulation Experiments

The Unconditional Model
In this section we simulate the powers of the point optimal and score tests and compare them to the benchmark power envelope. First we give details for the power envelope and this is followed by a description of the operational tests. We present results for a sample size of N = 50 throughout.
In Figures 1-3 below, we consider a range of values for λ ∈ (0, 8) and γ ∈ (0, 1) and look at the relative performance of the PO and S tests and the power envelope. For each (λ, γ) pair we generate samples from the Katz family, K(λ, γ). This is efficiently accomplished using $p(0) = (1 - \gamma)^{\lambda/\gamma}$ along with the defining recurrence $p(y + 1) = \frac{\lambda + \gamma y}{y + 1}\, p(y)$ and then sampling from the inverse cumulative distribution function. Setting γ = ε, for some very small positive ε, effectively simulates from the Poisson null. The null critical values are computed by simulation to avoid asymptotic approximations. This means that the sizes of tests are accurate (up to simulation error) and hence that the power comparisons are meaningful in smaller sample sizes.
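The sampling scheme just described can be sketched in Python as follows. The cumulative probabilities are tabulated once from the recurrence and then inverted by binary search; the tolerance `tol`, the seed, and the function names are illustrative assumptions rather than details taken from the simulation study.

```python
import bisect
import random

def katz_cdf_table(lam, gamma, tol=1e-12):
    """Cumulative probabilities of K(lam, gamma), 0 < gamma < 1, up to mass 1 - tol."""
    p = (1.0 - gamma) ** (lam / gamma)        # p(0)
    cdf, total, y = [p], p, 0
    while total < 1.0 - tol and p > 0.0:
        p *= (lam + gamma * y) / (y + 1)      # p(y + 1) = (lam + gamma * y) / (y + 1) * p(y)
        y += 1
        total += p
        cdf.append(total)
    return cdf

def katz_sample(lam, gamma, size, rng):
    """Inverse-CDF sampling: the smallest y whose cumulative probability is at least u."""
    cdf = katz_cdf_table(lam, gamma)
    return [min(bisect.bisect_left(cdf, rng.random()), len(cdf) - 1) for _ in range(size)]

rng = random.Random(12345)
draws = katz_sample(2.0, 0.5, 20000, rng)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
```

With λ = 2 and γ = 0.5 the sample mean and variance should be close to λ/(1 − γ) = 4 and λ/(1 − γ)² = 8, respectively. Calling `katz_sample` with a very small positive `gamma` gives draws that are effectively Poisson.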
For the power envelope, we simulate from the null P(λ) = K(λ, ε) distribution, compute 10,000 values of LLR(λ, λ, γ) and extract the 95% quantile as a critical value, cv. The DGP is the Katz family K(λ, γ) and we simulate 10,000 replicates from the DGP and count the percentage of times the PO statistic, LLR(λ, λ, γ), exceeded the cv to calculate the power envelope.
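A minimal Python sketch of this power envelope calculation is given below. It assumes the log likelihood ratio takes the form Σ_i Σ_{s<y_i} log(λ_1 + γs) + N(λ_1/γ) log(1 − γ) + Nλ_0 − log(λ_0) Σ_i y_i, in which the log y_i! terms cancel; the value of ε, the seed, the reduced number of replications in the usage line, and all function names are assumptions made for illustration.

```python
import math
import random

def katz_draws(lam, gamma, size, rng):
    """Inverse-CDF draws from K(lam, gamma) using the defining recurrence."""
    p0 = (1.0 - gamma) ** (lam / gamma)
    out = []
    for _ in range(size):
        u, y, p, cdf = rng.random(), 0, p0, p0
        while u > cdf and p > 0.0:
            p *= (lam + gamma * y) / (y + 1)
            y += 1
            cdf += p
        out.append(y)
    return out

def llr(ys, lam0, lam1, gamma):
    """Log likelihood ratio of K(lam1, gamma) against Poisson(lam0)."""
    n = len(ys)
    katz = sum(math.log(lam1 + gamma * s) for y in ys for s in range(y))
    katz += n * (lam1 / gamma) * math.log1p(-gamma)
    poisson = -n * lam0 + sum(ys) * math.log(lam0)
    return katz - poisson          # the log(y!) terms cancel

def envelope_power(lam, gamma, n=50, reps=10000, eps=1e-8, seed=0):
    """Simulated power of the test LLR(lam, lam, gamma) at the 5% level with lam known."""
    rng = random.Random(seed)
    null_stats = sorted(llr(katz_draws(lam, eps, n, rng), lam, lam, gamma) for _ in range(reps))
    cv = null_stats[min(int(0.95 * reps), reps - 1)]   # simulated 95% quantile
    rejections = sum(llr(katz_draws(lam, gamma, n, rng), lam, lam, gamma) > cv for _ in range(reps))
    return rejections / reps

power = envelope_power(lam=5.0, gamma=0.5, reps=500)   # fewer replications, for speed only
```

Repeating the call over a grid of γ values for fixed λ traces out one cross-section of the envelope. Because the critical value is itself simulated, the nominal size is accurate only up to simulation error.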
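The operational PO test estimates $\lambda_0$, $\lambda_1$ and fixes γ at $\gamma_{PO}$. To do this, we use simple moment based estimators, that is, compute $\hat{\lambda}_{NB} = \hat{\lambda}_{P}(1 - \hat{\gamma}_{NB})$, where $\hat{\lambda}_{P} = \bar{y}$ and $\hat{\gamma}_{NB} = 1 - \bar{y}/s_y^2$, using the mean and variance of the data at hand. Should $\hat{\gamma}_{NB}$ stray negative, we truncate and set $\hat{\gamma}_{NB} = \varepsilon$.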
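The plug-in step just described, together with the two operational statistics, might be coded along the following lines. This is a sketch under stated assumptions: the sample variance uses the divisor N − 1, the log likelihood ratio has the same form as the Katz log likelihood less the Poisson log likelihood, and the score statistic is written in a commonly used standardized form, Σ_i[(y_i − ȳ)² − y_i]/(ȳ√(2N)). That scaling is an assumption here; since critical values are obtained by simulation, a constant positive rescaling would not change the test.

```python
import math

def moment_inputs(ys, eps=1e-8):
    """Return (lam_p, lam_nb, gamma_nb); assumes a positive mean and sample variance."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)   # divisor n - 1 is an assumption
    lam_p = ybar
    gamma_nb = max(1.0 - ybar / s2, eps)              # truncate at eps if it strays negative
    lam_nb = lam_p * (1.0 - gamma_nb)
    return lam_p, lam_nb, gamma_nb

def po_statistic(ys, gamma_po, eps=1e-8):
    """LLR(lam_p, lam_nb, gamma_po): Katz at the fixed point against the Poisson null."""
    lam_p, lam_nb, _ = moment_inputs(ys, eps)
    n = len(ys)
    katz = sum(math.log(lam_nb + gamma_po * s) for y in ys for s in range(y))
    katz += n * (lam_nb / gamma_po) * math.log1p(-gamma_po)
    poisson = -n * lam_p + sum(ys) * math.log(lam_p)
    return katz - poisson

def score_statistic(ys):
    """Standardized (variance minus mean) statistic; the scaling is an assumption."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** 2 - y for y in ys) / (ybar * math.sqrt(2.0 * n))

over = [0, 0, 1, 7, 2, 8]      # artificial over-dispersed sample
under = [2, 2, 3, 3, 2, 3]     # artificial under-dispersed sample
po_over = po_statistic(over, gamma_po=0.1)
s_over, s_under = score_statistic(over), score_statistic(under)
```

Both statistics are then referred to critical values simulated under the Poisson null, in the same way as for the power envelope.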
The PO test is computed as $LLR(\hat{\lambda}_{P}, \hat{\lambda}_{NB}, \gamma_{PO})$ while the score test, using (17), is $S(\hat{\lambda}_{P})$. The null is Poisson P(λ) = K(λ, ε) and we simulate from the null, computing $LLR(\hat{\lambda}_{P}, \hat{\lambda}_{NB}, \gamma_{PO})$ and $S(\hat{\lambda}_{P})$ for each realization, to get 5% cv's. We calculate the power by simulating from the DGP K(λ, γ).
18 Strictly, Katz (1965) adopted an approach more in keeping with a method of moments test. Specifically, he looked at the difference between estimators for the mean and variance, which should be equal under the null, and then scaled this difference appropriately to obtain a distribution under the null. In any event, the statistic so obtained is the same as the one proposed by Lee (1986) that we consider here.
19 Lee (1986) proposed other tests than the one considered here, although he did not compare them numerically. The results recorded in Miller (1998) suggest that those involving third order moments may have better power properties. For now we are primarily concerned with proof of concept and do not explore these other tests in light of the simplicity of (17).
It is helpful to view Figures 1-3 in the light of the cross sections displayed in Figure 4. It is clear that, for small values of γ, no test can be expected to perform well close to the null. Equally, for large values of γ, with high degrees of over-dispersion, all reasonable tests can be expected to be powerful. Thus, for small and large degrees of over-dispersion we expect to see little difference in the performance of the envelope and the PO test, as the left panel of Figure 1 attests. For moderate values of γ, the power of the PO test can be significantly smaller than the envelope, as the coloured scale suggests. In the right panel the difference in power between the PO and score tests is plotted for $\gamma_{PO} = 0.1$. The differences are not large but the PO test uniformly dominates, as the scale indicates.
Figure 2 plots the same surfaces for $\gamma_{PO} = 0.3$ and the interesting feature is that both tests perform similarly but neither dominates the other. In Figure 3, $\gamma_{PO} = 0.5$ and the score test dominates by a small margin except where the DGP corresponds to $\gamma_{PO}$.
Since the shapes of the surfaces are quite smooth over $\lambda$, we plot a cross-section at $\lambda = 5$ to look at absolute performance. There are three panels in Figure 4, one for each of $\gamma_{PO} = 0.1, 0.3, 0.5$. The power envelope is shown in red, with the power of the PO test in blue and that of the score test in orange. Also shown (as a vertical line) is the PO point $\gamma_{PO}$.
The power envelope reaches unity at around $\gamma = 0.2$. This corresponds to a degree of over-dispersion, in the Katz-NB distribution, of $\sigma^2/\mu = 1/(1 - \gamma) = 1.25$. For the tests to reach equivalent power requires $\gamma = 0.6$, with $\sigma^2/\mu = 2.5$, roughly, and $\gamma = 0.9$ is required at $\sigma^2/\mu = 10$. So, neither test can match the envelope unless the degree of over-dispersion is quite large. For $\gamma_{PO}$ less than 0.3 the PO test performs uniformly better, at 0.3 the two tests perform equally well, and for $\gamma_{PO} > 0.3$ the score test is better. Thus, a PO test based on a small value of $\gamma_{PO}$ uniformly dominates the score test.
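The simulation of critical values described in this subsection can be sketched as follows. The sketch is illustrative only: the sampler, the function names, the default number of replications and the stand-in statistic (the variance-to-mean ratio, which is not the score statistic (17)) are all assumptions made here, not details taken from the experiments.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulated_critical_value(statistic, lam, n, reps=2000, level=0.05, seed=0):
    """Upper-tail critical value of `statistic` under the Poisson(lam) null."""
    rng = random.Random(seed)
    draws = sorted(
        statistic([poisson_draw(lam, rng) for _ in range(n)]) for _ in range(reps)
    )
    # Order statistic at the (1 - level) empirical quantile.
    return draws[min(reps - 1, math.ceil((1.0 - level) * reps) - 1)]


def dispersion_index(ys):
    """Hypothetical stand-in statistic: sample variance over sample mean."""
    return variance(ys) / mean(ys)
```

Power at a given alternative would then be estimated as the proportion of samples drawn from $K(\lambda, \gamma)$ for which the statistic exceeds this critical value.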

The Katz Regression
In practice, the analysis of over-dispersion often takes place when covariates need to be taken into account. As explained in Section 3, there are many ways in which this may be approached. We work directly from the definition of the Katz family rather than mix over a kernel Poisson distribution. The log of the likelihood takes the form $\ell(\boldsymbol\lambda, \gamma) = \sum_i \ln p(y_i \mid \lambda_i, \gamma)$, where $p(\cdot \mid \lambda_i, \gamma)$ denotes the Katz probability mass function. The PO test, using $\gamma_{PO}$, is based on the log likelihood ratio, $LLR(\boldsymbol\lambda_0, \boldsymbol\lambda_1, \gamma_{PO})$, and the PO test needs to estimate the parameters $\boldsymbol\lambda_0$ and $\boldsymbol\lambda_1$. Estimating $\lambda_{1,i} = \exp(\beta_1 x_i)$ may be problematic, as trying to fit a NB regression when the data are Poisson can lead to identification and convergence problems, exacerbated by the fact that fitting these types of regressions requires nonlinear maximum likelihood estimation. We avoided this issue in the last sub-section by using Katz moment estimators. Here, we use a regression version of the same idea. First, estimate a Poisson regression $P(\mu^*_i(x_i))$ (including a constant), which will return the mean estimate $\exp(b_0 + b_1 x_i)$. Noting that we can write $\mu^*_i = \lambda_{1,i}/(1 - \gamma) = \exp(-\ln(1 - \gamma) + \beta_1 x_i)$, we can set $\hat\beta_1 = b_1$ and hence $\hat\lambda_{1,i} = \exp(\hat\beta_1 x_i)$ to give the vector $\hat{\boldsymbol\lambda}_1$. To get $\hat\lambda_{0,i}$ we fit the Poisson regression without the constant term, which returns the estimate $\exp(b_2 x_i)$. This gives $\hat\lambda_{0,i} = \exp(b_2 x_i)$, and hence $LLR(\hat{\boldsymbol\lambda}_0, \hat{\boldsymbol\lambda}_1, \gamma_{PO})$.
As a comparator to the PO statistic in the regression setting, we use the score test of Dean and Lawless (1989), which avoids the potential difficulties associated with maximum likelihood estimation. Thus, we estimate the Poisson regression $P(\mu(x_i))$ (with a constant) to get the vector of predictors $\hat\mu_i$ and, using $z_i = (y_i - \hat\mu_i)^2 - y_i$, the test $S(\hat{\boldsymbol\mu})$ is computed as the t-statistic in the regression of $z_i$ on $\hat\mu_i$. 20 Again, critical values are computed by simulation. The null is generated as $P(\lambda_i)$, with $\lambda_i = \exp(\beta_1 x_i)$ used to keep the means of the counts low. We used $x_i \overset{iid}{\sim} \log(P(2) + 1)$ and the $x_i$ are kept fixed under replication. We generate simulated critical values, based on 10,000 replications, for the tests $S(\hat{\boldsymbol\mu})$ and $LLR(\hat{\boldsymbol\lambda}_0, \hat{\boldsymbol\lambda}_1, \gamma_{PO})$. To compute powers, the DGP $K(\lambda_i, \gamma) = NB(n_i, p)$ is used, where $n_i = \lambda_i/\gamma$ and $\gamma$ takes a selection of values in $(0, 1)$. As usual, $\pi = 1 - \gamma$. The results are presented in Figure 5.
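The auxiliary-regression form of the score statistic described above can be illustrated as follows. This is a sketch under stated assumptions: a single covariate, a Newton-Raphson fit written out by hand, and an auxiliary regression of $z_i$ on $\hat\mu_i$ without an intercept (the text does not say whether a constant is included). The function names are hypothetical.

```python
import math


def fit_poisson(xs, ys, iters=50):
    """Newton-Raphson fit of log mu_i = b0 + b1 * x_i; returns (b0, b1)."""
    b0, b1 = math.log(max(sum(ys) / len(ys), 1e-8)), 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector of the Poisson log likelihood.
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Information matrix entries (2 x 2, inverted in closed form below).
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(xs, mu))
        h11 = sum(m * x * x for x, m in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def overdispersion_t(xs, ys):
    """t-statistic from regressing z_i = (y_i - mu_i)^2 - y_i on mu_i (no intercept)."""
    b0, b1 = fit_poisson(xs, ys)
    mu = [math.exp(b0 + b1 * x) for x in xs]
    z = [(y - m) ** 2 - y for y, m in zip(ys, mu)]
    smm = sum(m * m for m in mu)
    slope = sum(m * zi for m, zi in zip(mu, z)) / smm
    resid = [zi - slope * m for m, zi in zip(mu, z)]
    s2 = sum(e * e for e in resid) / (len(ys) - 1)  # one estimated slope
    return slope / math.sqrt(s2 / smm)
```

Large positive values of the statistic point towards over-dispersion; in keeping with the text, critical values would be obtained by simulation rather than from a reference distribution.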
The PO test performs badly for very high degrees of over-dispersion when $\gamma_{PO}$ is less than approximately 0.5, and is dominated by the score test when $\gamma_{PO}$ is greater than 0.5. However, the choice $\gamma_{PO} = 0.5$ does lead to superior PO performance, albeit not by a great margin.

Summary
Our experimental results are mixed. In the unconditional model, the point optimal tests appeared to work best when $\gamma$ was small, with their performance deteriorating relative to the score test as $\gamma$ increased. In the regression model, the score test outperformed the point optimal tests, suggesting that the null distribution of the score test was more robust to the presence of nuisance parameters than was that of the point optimal tests, with none of the test statistics being pivotal. However, these rankings were also sensitive to the choice of point. This raises the question of whether or not we are choosing the 'point' for the point optimal tests in a sensible way. It is to this question that we turn in the next section.

Hellinger Distance
The reasoning behind the use of point optimal tests is to put power where it is of greatest practical use. The immediate problem facing the use of point optimal tests is where to place the 'point'. Sometimes the testing problem suggests a solution. Other times the choice is less clear and is often based on the outcome of a simulation study 'run-off', making the results somewhat ad hoc. The attraction of point optimal tests in the context of testing for over-dispersion is that the parameter space of interest, namely that of $\gamma$, is bounded and so there is some hope of finding an appropriate point. One way of defining appropriate in this context is as the point where the distribution under the alternative starts to depart from that under the null in some substantial way. The question then reduces to one of how we might measure such a departure. In this section we explore the use of Hellinger distance ($H$) for this purpose. We think that this is a novel use of such a distance measure and is of independent interest. We do not, however, assert that Hellinger distance is the only choice or even the best choice in this context, but it does yield some interesting results.
To begin, various definitions of Hellinger distance are available. 21 The distance was originally proposed in an integral form by Hellinger (1909); we will work with the following discrete variant:

Definition 1 (Hellinger Distance for Discrete Random Variables). The squared Hellinger distance between two discrete distributions $P$ and $Q$ is

$$H^2(P, Q) = \frac{1}{2}\sum_{i=1}^{k}\left(\sqrt{p_i} - \sqrt{q_i}\right)^2 = 1 - \sum_{i=1}^{k}\sqrt{p_i q_i},$$

where $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_k)$.
We note in passing that the Hellinger distance is bounded, $0 \le H \le 1 \implies 0 \le H^2 \le 1$. $H = 1$ iff $P$ assigns zero probability to anywhere that $Q$ assigns positive probability, and $H = 0$ iff $P = Q$.
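As a small illustration of the discrete Hellinger distance and its bounds, the following sketch computes $H$ for two probability vectors on a common finite support. It assumes the normalization under which $0 \le H \le 1$, that is, $H^2 = 1 - \sum_i \sqrt{p_i q_i}$; the function name is hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two pmfs given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same support")
    # Bhattacharyya coefficient: sum of sqrt(p_i * q_i).
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bc))
```

Identical distributions give a distance of zero, while distributions with disjoint supports, such as `[1.0, 0.0]` and `[0.0, 1.0]`, give a distance of one.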

The Poisson Distribution
By way of example, to illustrate the basic idea and to help calibrate the procedure, suppose that we choose as our base case a Poisson distribution with parameter $\lambda_0$, so that the implied standard deviation is $\sqrt{\lambda_0}$. Writing $\mathrm{Prob}(X = x \mid \lambda) \equiv P(\lambda)$, we are going to explore the behaviour of $H$ as we compare $P(\lambda_0)$ with $P(\lambda_1)$ for various $(\lambda_0, \lambda_1)$. When comparing Poisson distributions, the squared Hellinger distance is readily shown to be

$$H^2 = 1 - \exp\left\{-\tfrac{1}{2}\left(\sqrt{\lambda_0} - \sqrt{\lambda_1}\right)^2\right\}.$$

Figure 6 provides some insight into the sensitivity of Poisson pmfs to changes in parameter values when the parameters are small and includes examples that are variously skewed to the right, (roughly) symmetric, and skewed to the left. Observe that here we have used $\lambda = 1$ as the base case and that, as $\lambda$ increases, it does so by one standard deviation each time, and so these changes are quite dramatic.
In Figure 7 we present values of $H$ for various $\lambda_0$ and $\lambda_1$. The dashed and dotted lines correspond to Hellinger distances of 0.1 and 0.05, respectively. We observe that $H$ is asymmetric in $\lambda$ for all $\lambda_0$ considered, which reflects the skewed nature of Poisson distributions. Note that, as $\lambda_0$ increases, so too does the standard deviation of the base distribution. As this happens, a given value of $H$ will admit greater differences between $\lambda_0$ and $\lambda_1$. For example, when $\lambda_0 = 1$, a Hellinger distance of 0.1 or less is achieved for any $0.8 \approx L \le \lambda_1 \le U \approx 1.3$. In contrast, when $\lambda_0 = 9$, $H \le 0.1$ for all $8.2 \approx L \le \lambda_1 \le U \approx 9.9$, which is a much wider interval than in the previous case. Given that we are seeking to construct point optimal tests that compete with locally best tests, these results suggest that we need to be looking at points for which the Hellinger distance is quite small.
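The intervals $[L, U]$ quoted above can be approximated directly by inverting the closed form for two Poisson distributions, $H^2 = 1 - \exp\{-(\sqrt{\lambda_0} - \sqrt{\lambda_1})^2/2\}$. The sketch below does this; the function names are hypothetical and the values in the text, which appear to be read from Figure 7, need not agree with it to more than one decimal place.

```python
import math


def poisson_hellinger(lam0, lam1):
    """Hellinger distance between Poisson(lam0) and Poisson(lam1)."""
    half_sq = 0.5 * (math.sqrt(lam0) - math.sqrt(lam1)) ** 2
    return math.sqrt(-math.expm1(-half_sq))  # sqrt(1 - exp(-half_sq))


def lambda_interval(lam0, h):
    """Values of lam1 with Hellinger distance at most h from Poisson(lam0)."""
    d = math.sqrt(-2.0 * math.log(1.0 - h * h))  # allowed gap in sqrt(lambda)
    lower = max(0.0, math.sqrt(lam0) - d) ** 2
    upper = (math.sqrt(lam0) + d) ** 2
    return lower, upper
```

With $\lambda_0 = 9$ and $h = 0.1$, `lambda_interval(9.0, 0.1)` returns approximately (8.17, 9.87).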

The Katz Distribution
We will take the Poisson ($\gamma = 0$) as our base model. Moreover, as Poisson-ness, or otherwise, is completely determined by the value of $\gamma$, we will hold $\lambda$ fixed across models. Here the support under both null (equi-dispersion) and alternative (over-dispersion) is $y \in \mathcal{Y} = \{0, 1, 2, \ldots\}$ and so Although not amenable to direct solution we notice that Therefore, We can solve this non-linear equation for $\gamma_L$ numerically for given $\lambda$ and $h_L$. Some results are reported in Table 1.

Table 1. Solutions to (21) for given $h_L$ and $\lambda$ (scaled by a factor of $10^{12}$).

We see that all values of $\gamma_L$ are positive, albeit extremely close to zero. Alternatively, from (20) we also have the result Solutions to (22) for various $\lambda$ and $h_U$ are given in Table 2. We see that $\gamma_U$ is monotonically increasing in $h_U$ but monotonically decreasing in $\lambda$. That is, once $\lambda$ becomes sufficiently large, even small departures of the Hellinger distance from zero are consistent with $\gamma > 0$.
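For readers who wish to experiment, the Hellinger distance between $P(\lambda)$ and $K(\lambda, \gamma)$ can also be evaluated by brute force and inverted numerically. The sketch below does this by bisection. It solves $H(\lambda, \gamma) = h$ directly and is therefore not a reproduction of the equations (21) and (22) behind Tables 1 and 2; the function names, the truncation point `ymax` and the bracketing interval are assumptions of the sketch.

```python
import math


def katz_pmf_list(lam, gamma, ymax):
    """p(0), ..., p(ymax) for K(lam, gamma), 0 <= gamma < 1, via the Katz recursion."""
    if gamma == 0.0:
        p = [math.exp(-lam)]
    else:
        p = [math.exp((lam / gamma) * math.log1p(-gamma))]
    for y in range(ymax):
        p.append(p[-1] * (lam + gamma * y) / (y + 1))
    return p


def hellinger_to_poisson(lam, gamma, ymax=2000):
    """Hellinger distance between Poisson(lam) and K(lam, gamma), truncated at ymax."""
    pois = katz_pmf_list(lam, 0.0, ymax)
    katz = katz_pmf_list(lam, gamma, ymax)
    bc = sum(math.sqrt(a * b) for a, b in zip(pois, katz))
    return math.sqrt(max(0.0, 1.0 - bc))


def gamma_for_distance(lam, h, lo=1e-9, hi=0.9, iters=80):
    """Bisection for a gamma in [lo, hi] with H(lam, gamma) = h."""
    if not hellinger_to_poisson(lam, lo) < h < hellinger_to_poisson(lam, hi):
        raise ValueError("h is not bracketed on [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if hellinger_to_poisson(lam, mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A call such as `gamma_for_distance(5.0, 0.1)` returns a value of $\gamma$ at which the Katz distribution sits at Hellinger distance 0.1 from the Poisson with the same $\lambda$.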
All of the above said, however, the over-riding conclusion is that the optimal 'points' are going to be sufficiently close to zero that it is not clear that there is much benefit over just using the score test, which is essentially point optimal at $\gamma = 0$. The main reason for such a conclusion lies in our earlier results, which indicate that, in the regression context, the score test is much less subject to the influence of the regression coefficients, which are nuisance parameters in this testing problem.

Conclusions
At a fundamental level, this paper explores the use of point optimal tests in the problem of testing for over-dispersion. Our basis of comparison is the score test of Lee (1986), which is the same as the earlier method of moments test proposed by Katz (1965). Our findings are somewhat disappointing and we are unable to recommend that practitioners change their current practices, as the performance of the point optimal tests is, at best, mixed. It may be possible to improve the performance of the point optimal tests by a more refined analysis of (i) the problem of nuisance parameters and (ii) the construction of p-values, along the lines suggested by King and Sriananthakumar (2015). This we leave for further work.
Along the way, the paper has made two other contributions. First, in the appendix we have provided a reasonably exhaustive treatment of the family of distributions consistent with the difference equation of Katz (1965). To the best of our knowledge this treatment extends all known earlier results by allowing for arbitrary points of left truncation. This expands the class of distributions originally considered by Katz (1965), which can be characterized as including zero in the support of the count variable. The treatment is closest to that of Willmot (1988), although there are differences in the mode of analysis and he restricts attention to extensions where only zero is omitted from the support of the count variable. We note in passing that right truncation is a much easier problem to deal with as it neither expands nor contracts the members in the family, in the way that left-truncation does. Its only consequence is the introduction of a scale factor equal to 1 − R, where R denotes the upper tail probability that has been truncated.
The other contribution that we have made is to introduce the use of Hellinger distance as a metric by which one might settle on the 'points' characterizing point optimal tests. This is novel and allows a more systematic treatment than the grid searches that have characterized such choices in the past.
Author Contributions: Both authors have contributed equally to all aspects of the preparation and writing of this paper. They have both read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflicts of interest.

Appendix A. On Left-Truncated Katz Distributions
Certain properties of the family of distributions defined by (2)-(4) are available on inspection. In particular, the support of Y is, in certain circumstances, parameter dependent. Here we characterize those circumstances.
To begin, let us establish some notation. Our count variable is $Y \in \mathcal{Y}_L \equiv \{L, L+1, L+2, \ldots, n\} \subseteq \mathbb{Z}_0$, where $n$ may be either infinitely large or some finite integer and $L$ is a non-negative integer. We will restrict $L < n$ because the case where $L = n$ yields a probability mass function degenerate at $L$, which is statistically uninteresting and shall, hereafter, be ignored. Next, write (2) as

$$p(y + 1) = \frac{\lambda + y\gamma}{y + 1}\, p(y). \qquad \text{(A1)}$$

This latter formulation has the advantage of being untroubled by the prospect of $p(y) = 0$. It will also prove convenient to be able to express all probabilities in terms of $p(L)$, which we can do via back substitution in (A1). Thus, for all $y \in \{L + 1, \ldots, n\}$, Moving forward we shall break up our observations into three categories: (i) those relating to the support of the random variable and the parameter space of the associated distributions, (ii) statements of the probability mass functions belonging to the family, and (iii) certain properties of the various distributions. The results are ultimately the same as those of Willmot (1988) in the special case where $L = 1$, although our mode of analysis is different and we extend his results by allowing for arbitrary $L > 0$. 22

Appendix A.1. Support and Parameter Spaces

From (A2), we see that the sequence of probabilities generated by the difference equation (A1) is governed by $p(L)$ and by the terms $(\lambda + j\gamma)$, $j = L, L + 1, \ldots, y$, as the ratio of factorials $L!/(y + 1)!$ is a scale factor in the interval $(0, 1]$. Our subsequent analysis revolves around the behaviour of these quantities and the implications of this behaviour for $p(y + 1)$. With the exception of [1], we shall hereafter assume that $p(y) > 0$.
[1] $0 < p(L) < 1$
From (A1) we see that if $p(y) = 0$ for any $y \in \mathcal{Y}_L$ then $p(y + r) = 0$ for all $r \in \mathbb{N}$. In particular, if $p(L) = 0$ then $p(y) = 0$ for all $y \in \mathcal{Y}_L$. But this leads to a violation of (4), that is, the probabilities do not sum to unity, and so we exclude $p(L) = 0$ from further consideration. Equally, if $p(L) = 1$ then the pmf of $Y$ is degenerate at $L$, which is a case that we have already excluded from further analysis.
[2] $\gamma = 0 \implies \lambda > 0$ and $n = \infty$
If $p(y) > 0$ and $\gamma = 0$ then we have a pmf degenerate at $L$ unless $\lambda > 0$, which will be assumed hereafter. In this case there is no implied restriction on the upper bound of $\mathcal{Y}_L$, that is, $n = \infty$. Of course, the concern when generating an infinite sequence of probabilities is to ensure that the associated series, $\sum_{y \in \mathcal{Y}} p(y)$, converges. This can be examined by considering the quantity

$$r_y(\gamma) = \frac{p(y + 1)}{p(y)} = \frac{\lambda + y\gamma}{y + 1} = \frac{\gamma + \lambda/y}{1 + 1/y}$$

and noting that $R(0) = \lim_{y \to \infty} r_y(0) = 0$. From the limit version of d'Alembert's ratio test we see that the series converges because $R(0) < 1$.
[3] $L = 0 \implies \lambda > 0$
Similar in effect to the previous case, if $L = 0$ then $\lambda + L\gamma = \lambda$. Given $p(L) > 0$, as assumed above, $p(L + 1) > 0$ if and only if $\lambda > 0$, which will be assumed, hereafter, for all cases where $L = 0$.
[4] $\lambda > 0$, $\gamma > 0 \implies 0 < \gamma < 1$ and $n = \infty$
Because $y \ge L \ge 0$, if $\lambda > 0$ and $\gamma > 0$ we see that $\lambda + y\gamma > 0$ for all $y \in \{L, L + 1, \ldots\}$ and so here the support of the pmf of $Y$ is unbounded from above and independent of the values taken by $\lambda$ and $\gamma$. Again, we can establish convergence of the corresponding series. Here $R(\gamma) = \lim_{y \to \infty} r_y(\gamma) = \gamma$. Appealing again to the limit version of d'Alembert's ratio test, we see that the series converges if $\gamma < 1$ and diverges if $\gamma > 1$, but the test is inconclusive if $\gamma = 1$. Expanding the denominator of $r_y(1)$ in power series yields

$$r_y(1) = \frac{1 + \lambda/y}{1 + 1/y} = (1 + \lambda/y)\left(1 - \frac{1}{y} + O(y^{-2})\right) = 1 - \frac{1 - \lambda}{y} + O(y^{-2}).$$

Applying Gauss's test, 23 we see that the series will converge absolutely if and only if $1 - \lambda > 1$ but will otherwise diverge. Here we have assumed that $\lambda > 0$ and so $1 - \lambda < 1$. Hence, the series is divergent for $\gamma \ge 1$.
[5] $\lambda < 0$, $\gamma < 0$
In this case there is no value of $y$ that satisfies $\lambda + y\gamma > 0$ and so $y + 1$ cannot belong to $\mathcal{Y}_L$. Moreover, this statement remains true even if $y = L$. Consequently, in this case, the pmf of $Y$ is degenerate at $L$, a situation that we have chosen to exclude from further consideration.

[6] λ and γ of different sign
In this case we see that $\lambda + y\gamma$ can change sign as $y$ increases, unlike the situation of the previous two cases. Let $n$ denote the smallest value of $y$ such that $\lambda + y\gamma \le 0$. Then $n$ is the largest value in $\mathcal{Y}_L$. There are only two cases to consider here (having treated that of $\gamma = 0$ above): (i) $\lambda \ge 0$, $\gamma < 0$, and (ii) $\lambda \le 0$, $\gamma > 0$.
(a) $\lambda \ge 0$, $\gamma < 0 \implies n = \lceil -\lambda/\gamma \rceil$
If $\lambda \le -L\gamma$ then the pmf of $Y$ will be degenerate at $L$ which, as explained above, is statistically uninteresting and a situation that we will assume away. That is, if $\gamma < 0$ then we will assume that $\lambda > -L\gamma$. In particular, if $L = 0$ then this requirement reduces to $\lambda > 0$. As $y$ increases, $\lambda + y\gamma$ will approach zero from above. That value of $y$ for which $\lambda + y\gamma$ is first less than or equal to zero is the largest value of $y$ in $\mathcal{Y}_L$ and shall be denoted by $n$, so that $p(n)$ is well-defined but $p(n + 1)$ is not. 24 That is, $n$ is the smallest integer greater than or equal to $-\lambda/\gamma$. This is the definition of the so-called ceiling function, written $n = \lceil -\lambda/\gamma \rceil$. In summary, if $\gamma < 0$ then we see that the upper bound on the support of the pmf of $Y$ is a function of the parameters $\lambda$ and $\gamma$, with the space of $\lambda$ subject to the constraint $\lambda > -L\gamma$.
(b) $\lambda \le 0$, $\gamma > 0$
Here $\lambda + y\gamma$ is an increasing function of $y$ but the pmf of $Y$ is non-degenerate at $L$ if and only if $\lambda > -L\gamma$. As we have already excluded from further consideration pmfs degenerate at $L$, we here assume this to be the case. In particular, when $L = 0$ we have a contradiction, as we are assuming both $\lambda > 0$, which is required when $L = 0$ (see [3]), and $\lambda \le 0$; we conclude that $\lambda \le 0$ and $\gamma > 0$ can only arise when $L \ge 1$. As $\lambda + y\gamma > 0$ for all $y \in \mathcal{Y}_L$, $\mathcal{Y}_L$ will be unbounded from above provided that the series of probabilities so formed is convergent. Using the analysis outlined in [4] and applying the ratio test, we find convergence for all $-L\gamma < \lambda \le 0$ provided that $0 < \gamma < 1$. Moreover, if $\gamma = 1$, Gauss's test gives convergence provided that $\lambda$ is strictly negative, that is, $-L\gamma < \lambda < 0$.
We summarize these findings in Table A1 and note in passing that, when $L = 0$, the only valid parameter configurations are those found in the row $\lambda > 0$.

Table A1. Parameter Configurations When $L > 0$.
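The way in which the parameters determine the support can be seen by generating the probabilities directly from the recursion $(y + 1)\,p(y + 1) = (\lambda + y\gamma)\,p(y)$, starting at $y = L$ and normalizing at the end. The following sketch does so for $\gamma < 1$ only (the boundary case $\gamma = 1$ is not handled); the function name, the tolerance and the stopping rule for unbounded supports are assumptions of the sketch.

```python
import math


def katz_pmf(lam, gamma, L=0, tol=1e-16, max_terms=100000):
    """Probabilities p(L), p(L+1), ... of the left-truncated Katz family."""
    if gamma >= 1.0 or lam + L * gamma <= 0.0:
        raise ValueError("this sketch needs gamma < 1 and lam + L * gamma > 0")
    weights, total, y = [1.0], 1.0, L  # unnormalized, with weight 1 at y = L
    while len(weights) < max_terms:
        num = lam + gamma * y
        if num <= 0.0:  # y is the upper end n of the support
            break
        w = weights[-1] * num / (y + 1)
        y += 1
        weights.append(w)
        total += w
        # Unbounded support: stop once past the mode and numerically negligible.
        if num < y and w < tol * total:
            break
    return [w / total for w in weights]
```

With $\lambda = 2$ and $\gamma = -0.5$ the recursion stops at $n = \lceil -\lambda/\gamma \rceil = 4$ and returns the Binomial(4, 1/3) probabilities, while `katz_pmf(0.0, 0.5, L=1)` returns (a numerically truncated version of) the logarithmic distribution.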

Appendix A.2. Probability Mass Functions and Their Properties
Having established the various restrictions on the parameter space and the support for the family of distributions generated by (A1), we now turn attention to the resulting pmfs and their properties. To begin, we will distinguish between two classes of distributions: (i) $L = 0$, the class originally explored by Katz (1965), and (ii) $L > 0$, which has subsequently been explored by others. In order to explore these pmfs, our first task is to evaluate $p(L)$, which forms part of the normalizing constant in (A2).

24 In essence, this is the same as adopting the convention that any negative probabilities are set to zero. It might be argued that this is at odds with Katz's original assumptions and should be excluded. Our justification for the inclusion in our analysis of these distributions, where $\lambda/\gamma$ is non-integer, is that Katz himself included them:
"The class of distributions so defined includes the Poisson distributions, the two-parameter binomial (Bernoulli) distributions, and the two-parameter negative binomial (Pascal) distributions. Aside from these, the class contains only the mild generalizations obtained for the latter two of these types by permitting the parameter n (number of "trials" in direct sampling) and the parameter r (number of failures in inverse sampling) to take any positive real values." (Katz 1965, p.175)

Appendix A.2.1. L = 0

In the previous section we established that, when $L = 0$, we require $\lambda > 0$. Moreover, we also required that $\gamma < 1$, with $\mathcal{Y}_L$ unbounded from above if $0 \le \gamma < 1$ but with an upper bound of $n = \lceil -\lambda/\gamma \rceil$ if $\gamma < 0$. Summing the right-most side of (A2) over all $y \in \mathcal{Y}_L$ and adding $p(0)$ yields where we have adopted the convention of Recall that if $\gamma \ge 0$ then $n \equiv \infty$, otherwise $n = \lceil -\lambda/\gamma \rceil$. Thus, where we have used the Pochhammer symbol $(a)_n$ to denote the rising factorial function $(a)_n = a(a + 1)(a + 2) \cdots (a + n - 1) = \Gamma(a + n)/\Gamma(a)$, a polynomial of order $n$ ($n$ a non-negative integer) in $a$, with $(a)_0 = 1$ (including $(0)_0 = 1$), and where $\Gamma(a)$ denotes the usual Gamma function. 25 Note that the argument of the Pochhammer symbol can be negative and is in certain cases considered below. In the event that $a$ is a negative integer, the Pochhammer symbol will equal zero for all $n > -a$. The resulting pmfs are There are two simplifications that arise when $\lambda/\gamma$ is integer. First, if one restricts attention to the case where $\lambda/\gamma$ is integer, $r$ say, and $0 < \gamma < 1$ then

$$\frac{(r)_y}{y!} = \frac{(r + y - 1)!}{(r - 1)!\, y!} = \binom{r + y - 1}{y}.$$
On setting $\pi = 1 - \gamma$, the pmf reduces to

$$p(y) = \binom{r + y - 1}{y} \pi^r (1 - \pi)^y.$$

This form of the negative binomial distribution, also known as the Pascal distribution, admits an inverse sampling interpretation. Specifically, $Y$ can be interpreted as a count of the number of failures in a sequence of independent Bernoulli trials, each with probability of success $\pi$, before the $r$th success is observed. Interestingly, we note that

$$\lim_{\gamma \to 0^+} \frac{(\lambda/\gamma)_y}{y!}\, \gamma^y (1 - \gamma)^{\lambda/\gamma} = \frac{\lambda^y e^{-\lambda}}{y!}$$

and so the negative binomial representation in (A4) can be thought of as valid for all cases $\gamma \ge 0$, recognizing that the case $\gamma = 0$ must be thought of as a limit. Finally, when $\lambda/\gamma$ is non-integer, the pmf in (A4) still gives the probability that $Y = y$ given the parameters $\lambda$ and $\gamma$; it just no longer admits the inverse sampling interpretation usually ascribed to a count variable with a negative binomial distribution.
Second, if $\gamma < 0$ and $\lambda/\gamma$ is a negative integer, so that $n = \lceil -\lambda/\gamma \rceil = -\lambda/\gamma$, then On setting $\pi = -\gamma/(1 - \gamma)$, so that $\gamma = -\pi/(1 - \pi)$, we can recognize the resulting pmf

$$p(y) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}$$

as that of a binomial random variable where, again, $\pi$ denotes the probability of success in a single Bernoulli trial and $p(y)$ gives the probability of $y$ successes in a sequence of $n$ independent Bernoulli trials. That is, $Y \sim \mathrm{Binomial}(n, \pi)$. These findings are summarized in Figure A1.
The Poisson, Pascal, and Binomial distributions, being those cases where $\lambda/\gamma$ is integer, were the cases originally explored in Katz (1965). Figure A2, which is a variant of Katz (1965, Figure 1), provides a graphical representation of these distributions.

Kemp (1968) observed that the family of distributions depicted in Figure A2 could all be expressed in terms of hypergeometric functions on noting that and that, specifically, subject to the requirement that $\lambda/\gamma$ is integer if $\gamma < 0$. This characterization of the probability function makes two things clear. First, the restriction that $\gamma < 1$ follows immediately from the standard convergence criteria for hypergeometric functions; see, inter alios, Abadir (1999, p.292). Second, it is clear that, for $\gamma > 0$, the restriction that $\lambda/\gamma$ be integer is completely unnecessary, as the probability function is perfectly well defined for non-integer values of this ratio. 26

It is straightforward to show that the probability generating function for this family of distributions is of the form In the special cases where either $\gamma \ge 0$ or where $-\lambda/\gamma$ is a positive integer, $G(t)$ reduces to

Moments for all members of the family can be calculated directly from (A1), without reference to the exact form of the pmf. A slight re-arrangement of (A1) allows us to sum over $\mathcal{Y}$, the support of $Y$, thus

$$\sum_{y \in \mathcal{Y}} (y + 1)\, p(y + 1) = \sum_{y \in \mathcal{Y}} (\lambda + y\gamma)\, p(y). \qquad \text{(A7)}$$

The left-hand side of (A7) can be written

$$\sum_{y \in \mathcal{Y}} (y + 1)\, p(y + 1) = 0 \times p(0) + \sum_{y \in \mathcal{Y}} (y + 1)\, p(y + 1) = \sum_{y \in \mathcal{Y}} y\, p(y) = \mu.$$

The right-hand side becomes $\lambda + \gamma\mu$. Solving for $\mu$ yields $\mu = \lambda/(1 - \gamma)$ and similar arguments lead to $\sigma^2 = \lambda/(1 - \gamma)^2$. From Katz (1965, p.176) we have the following inverse parametric relationships

$$\gamma = 1 - \frac{\mu}{\sigma^2}, \qquad \lambda = \frac{\mu^2}{\sigma^2},$$

which yield a potentially useful alternative parameterization of the distributions in terms of mean and variance rather than the somewhat more nebulous $\lambda$ and $\gamma$. Observe that if $\gamma < 0$ then $\sigma^2 < \mu$, which is called under-dispersion.
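The mapping between $(\lambda, \gamma)$ and the first two moments in the $L = 0$ case can be illustrated with a short sketch. It assumes the relations $\mu = \lambda/(1 - \gamma)$ and $\sigma^2 = \mu/(1 - \gamma)$, which are consistent with the dispersion ratio $\sigma^2/\mu = 1/(1 - \gamma)$ and with the moment estimators used in the main text; the function names are hypothetical.

```python
def katz_moments(lam, gamma):
    """Mean and variance implied by (lam, gamma) when L = 0 and gamma < 1."""
    if gamma >= 1.0:
        raise ValueError("gamma must be less than 1")
    mu = lam / (1.0 - gamma)
    return mu, mu / (1.0 - gamma)


def katz_params(mu, var):
    """Inverse map: (lam, gamma) from a mean and a variance."""
    gamma = 1.0 - mu / var
    return mu * (1.0 - gamma), gamma  # lam = mu * (1 - gamma) = mu**2 / var
```

For instance, $(\lambda, \gamma) = (2, -0.5)$ gives mean $4/3$ and variance $8/9$, the moments of a Binomial(4, 1/3) variable, so the variance falls below the mean as expected under under-dispersion.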
Importantly, if we consider the ratio $\sigma^2/\mu = 1/(1 - \gamma)$ then we see that under-dispersion, equi-dispersion, and over-dispersion are determined by the value of $\gamma$ alone, and so $\lambda$ is a nuisance parameter for the testing problems of interest in this paper. Finally, observe that $E[Y]$ is an increasing function of $\gamma$. Specifically, $\partial E[Y]/\partial\gamma = \lambda/(1 - \gamma)^2 > 0$.

Appendix A.2.2. L > 0

This case differs from that of $L = 0$ in two key ways: (i) there are three more cases to consider, all related to $\lambda \le 0$ and, obviously, (ii) zero is no longer in $\mathcal{Y}_L$. To begin the analysis, let us first determine $p(L)$ by summing over (A2). Noting that $n$ may be infinite (depending on parameter configuration) and adopting the convention (A3), we see that (ii) if $0 < \gamma < 1$, $\lambda > 0$ then, on noting that $(\lambda/\gamma + L)_{y - L} = (\lambda/\gamma)_y / (\lambda/\gamma)_L$, (iii) if $\gamma < 0$, $\lambda > 0$ then These first three results correspond to those examined in the $L = 0$ case and they have the same simplifications for $\lambda/\gamma$ integer as mentioned in that case. 27 The structure of the result is clear, with the normalizing constant scaled by a factor of $1 - \mathrm{Prob}(Y < L)$, so that the resulting probabilities are simply left-truncated versions of those encountered previously. In particular, we see that, for $L > 0$ and $\lambda > 0$, Before moving on, it is worth reminding ourselves of cases that we need not consider further. If $L > 0$ and $\gamma \le 0$ then the only cases leading to valid, non-degenerate distributions are those where $\lambda > 0$. The next three cases have no corresponding result when $L = 0$.
In the special case $L = 1$, the quantity in the square brackets reduces to unity and

$$p(y) = -\frac{\gamma^y}{y \ln(1 - \gamma)},$$

which is the pmf of a logarithmic distribution. If $L > 1$ then (A10) is recognizable as a left-truncated logarithmic distribution. (v) If $0 < \gamma < 1$, $-L\gamma < \lambda < 0$ then A comparison of this expression with that at (A9) reveals a remarkable similarity to the case where $\gamma < 0$ and $\lambda > 0$. As in the earlier case we see that (i) the ratio $\lambda/\gamma$ is negative and (ii) there is a scale factor reflecting left-truncation, with the only substantial difference being that, whereas here we have a series reducing to the term $(1 - \gamma)^{-\lambda/\gamma}$, in the earlier case we had a sum that only offers a similar simplification when $\lambda/\gamma$ is integer. (vi) The final case to consider is that where $\gamma = 1$ and $-L < \lambda < 0$. Here where the third equality is valid because $\lambda < 0$, and

$$p(y) = -\frac{(\lambda)_y / y!}{\sum_{j=0}^{L-1} (\lambda)_j / j!}, \qquad y = 1, 2, 3, \ldots$$

This somewhat surprising result reduces to that of Willmot (1988) when $L = 1$, in which case the denominator reduces to unity.
We will not go through all the properties considered in the case $L = 0$, although we note in passing that $y = 0$ contributes nothing to any of the expectations used to calculate either the mean or variance of $Y$ and so the expressions provided remain valid, except in the special case of $\gamma = 1$ where finite moments do not appear to exist. We can, however, update Figure A2 to reflect what we have learned in these cases where $\{0\} \notin \mathcal{Y}$; see Figure A3. 28 In essence, the major change is that the parameter space now admits non-positive values of $\lambda$, provided that they exceed $-L\gamma$ and $0 < \gamma < 1$, but only when $\{0\} \notin \mathcal{Y}$.