Article

Application of Mixture Models for Doubly Inflated Count Data

1 Department of Mathematics, Indraprastha Institute of Information Technology, Delhi 110020, India
2 Department of Mathematics and Statistics, Old Dominion University, Norfolk, VA 23529-0077, USA
* Author to whom correspondence should be addressed.
Analytics 2023, 2(1), 265-283; https://doi.org/10.3390/analytics2010014
Submission received: 18 August 2022 / Revised: 13 February 2023 / Accepted: 6 March 2023 / Published: 11 March 2023

Abstract
In health and social science and other fields where count data analysis is important, zero-inflated models have been employed when the frequency of the zero count is high (inflated). Due to multiple reasons, there are scenarios in which an additional count value of k > 0 also occurs with high frequency. The zero- and k-inflated Poisson distribution model (ZkIP) is more appropriate for such situations. The ZkIP model is a mixture distribution with three components: degenerate distributions at 0 and at the count k, and a Poisson distribution. In this article, we propose an alternative and computationally fast expectation–maximization (EM) algorithm to obtain the parameter estimates for grouped zero- and k-inflated count data. The asymptotic standard errors are derived using the complete data approach. We compare the zero- and k-inflated Poisson model with its zero-inflated and non-inflated counterparts. The best model is selected based on commonly used criteria. The theoretical results are supplemented with the analysis of two real-life datasets from health sciences.

1. Introduction

A categorical variable deals with a set of categories which could be based on a measurement scale. When there is natural ordering in the measurement scale, the categorical variable is ordinal; otherwise, it is known as a nominal categorical variable. Categorical variables arise not only in medical and social science but also in many other studies such as travel, agriculture, education, finance, ecology, and others. A categorical random variable with the number of counts as its categories is usually modeled by a Poisson distribution. The Poisson distribution has one unknown parameter, λ > 0 . This parameter is also the mean and variance of the distribution. This property of Poisson distribution is known as equi-dispersion. In real-life applications, the count data are often not equi-dispersed; instead, they could be over- or under-dispersed. There could be different reasons for over-dispersion, and one such reason is an excess number of zeros in the data. An appropriate model for such count datasets is the zero-inflated Poisson (ZIP) distribution. In a seminal paper, Lambert [1] studied the ZIP regression model. ZIP models and their applications have been studied extensively in the literature. Ghosh et al. [2] and Agarwal et al. [3] studied ZIP models using a Bayesian approach. Random effects ZIP models were studied by Min and Agresti [4] and Yau and Lee [5]. Furthermore, Saffari and Adnan [6], Yang and Simpson [7], and Nguyen and Dupuy [8] have applied ZIP models for censored data. Recently, a review of various ZIP models was presented in [9,10]. Applications of the ZIP model and its variations can be found in health science [11,12], manufacturing [1,2], and transportation [13,14]. The models have made their mark in biology [15], ecology [16], psychology [17,18], education [19], economics [20,21,22], and social networks [23].
In count data, besides zero, there could be another count k > 0 that is inflated. The inflation could be due to various reasons such as the design of the study or the types of responses. For example, the number of pap smear tests performed on women had zero and six inflated [24,25]. Similarly, data on the number of cigarettes smoked have zero (non-smokers) and 20 (a pack of cigarettes) inflated [25], while in a survey on the number of divorces, the counts zero and one are likely to be inflated. Arora and Chaganty [24] described such situations where, besides zero, another count k > 0 is also inflated, and applied zero- and k-inflated Poisson (ZkIP) models. The ZkIP model for count data is defined by a mixture of three distributions, which assumes that each count observation is a draw from a degenerate distribution at zero with probability π₁, from a degenerate distribution at value k > 0 with probability π₂, or from a Poisson distribution with probability (1 − π₁ − π₂). The probabilities π₁ and π₂ can also be viewed as mixing weights. Lin and Tsai [25] studied ZkIP models using the maximum likelihood estimation method. Sheth et al. [26] presented two forms of the ZkIP model. A special case of ZkIP is when k = 1, known as the zero- and one-inflated Poisson (ZOIP) model. The other special case is π₁ = 0; that is, only k > 0 is inflated, and the corresponding model is a k-inflated Poisson (kIP) model, which can also be regarded as an extension of the ZIP model. Recently, Arora et al. [27] studied kIP models using traditional and data science approaches. For doubly censored data, ref. [28] studied zero and one inflation using the power-normal distribution.
Most of these aforementioned articles deal with data that contain covariates besides the response variable and study regression models. However, at times the data have missing observations for the covariates. To build a regression model, a list-wise deletion is performed or missing observations are imputed. The deletion of observations could significantly reduce the sample size. On the other hand, the imputed observations could lead to misleading inferences. There is a need to develop inferential methods for data without covariates. These methods without covariates allow us to estimate the inflated proportion for count 0 and k categories. Furthermore, they could also be used as a preliminary step to detect the inflation before using the regression models. Models without covariates are easier to apply and more efficient for large datasets. The proposed model captures the double inflation and is parsimonious. The parameters are simple to interpret and the corresponding analysis is straightforward.
In this article, we deal with the situation where the covariate data are absent and develop an EM algorithm to obtain the estimates for grouped count data with inflation at zero and at k > 0. The EM algorithm provides maximum likelihood (ML) estimates when some data are missing or when latent variables are present. For the ZkIP data, the latent variables are the zeros and the counts k > 0 coming from the degenerate distributions, as opposed to the Poisson distribution. The EM algorithm takes into account the missing information and allows us to obtain the ML estimates of the unknown parameters of the ZkIP model. The standard errors are obtained using the method described by Louis [29]. Our methods include the ZOIP and kIP as special cases. We compare the ZkIP model with the ZIP and Poisson models. We apply our methods to two real-life applications in health sciences. The outline of the article is as follows. Section 2 describes the distributions involved in detail. This includes the ZkIP distribution and its properties. For the grouped data, we present the likelihood function for the ZkIP model in Section 3. In Section 3.1, we present the mathematical details for the expectation–maximization (EM) method [30] to obtain the maximum likelihood estimates. In Section 3.2, we describe the method first described by Louis [29] on how to find the standard errors for the EM estimates for the ZkIP model. Section 4 describes the hypothesis tests for the unknown parameters. It also explains the methods used for model selection and measures to find the model that fits best. In Section 5, we perform two simulation studies. We compare the ZkIP model to the ZIP and Poisson models using standardized bias and standardized mean squared error criteria. We also evaluate the coverage probabilities for various confidence levels. Finally, Section 6 contains the analysis of two real-life datasets.

2. Distributions

The Poisson distribution is normally used as a model for count data. The probability mass function of a random variable Y distributed as Poisson with mean λ > 0 is given by
P(Y = y) = \frac{e^{-\lambda}\,\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots
The probability mass function of a random variable Y following a zero-inflated Poisson (ZIP) distribution with parameters λ > 0 and 0 < π 1 < 1 is given by
P(Y = y) = \begin{cases} \pi_1 + (1 - \pi_1)\, e^{-\lambda} & \text{when } y = 0 \\[4pt] (1 - \pi_1)\, \dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{when } y = 1, 2, \ldots \end{cases}
A generalization of the ZIP model is the ZkIP model which accounts for inflated frequencies at zero and at k > 0 . The ZkIP distribution is also a mixture model, similar to the ZIP. It is composed of mixing two degenerate distributions with a Poisson distribution. The probability mass function of Y distributed as ZkIP ( λ , π 1 , π 2 ) is given by
P(Y = y) = \begin{cases} \pi_1 + \pi_3\, e^{-\lambda} & \text{when } y = 0 \\[4pt] \pi_2 + \pi_3\, \dfrac{\lambda^{k} e^{-\lambda}}{k!} & \text{when } y = k \\[4pt] \pi_3\, \dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{when } y = 1, 2, \ldots,\ y \neq k, \end{cases}
where \pi_3 = (1 - \pi_1 - \pi_2), \lambda > 0, and 0 < \pi_1 + \pi_2 < 1. The corresponding cumulative distribution function (CDF) is given by
F_Y(y) = \begin{cases} 0 & \text{when } y < 0 \\[4pt] \pi_1 + \pi_3 \displaystyle\sum_{u=0}^{\lfloor y \rfloor} \frac{\lambda^{u} e^{-\lambda}}{u!} & \text{when } 0 \le y < k \\[4pt] \pi_1 + \pi_2 + \pi_3 \displaystyle\sum_{u=0}^{\lfloor y \rfloor} \frac{\lambda^{u} e^{-\lambda}}{u!} & \text{when } y \ge k, \end{cases}
where \lfloor y \rfloor denotes the floor of y. Using (3), we can show that the probability generating function of Y is G_Y(z) = E(z^{Y}) = \pi_1 + \pi_2 z^{k} + \pi_3 e^{\lambda(z-1)}. The moment generating function is given by M_Y(t) = E(e^{tY}) = \pi_1 + \pi_2 e^{tk} + \pi_3 e^{\lambda(e^{t}-1)}. The mean E(Y) = k\pi_2 + \pi_3\lambda and the variance \mathrm{Var}(Y) = k^2 \pi_2 (1 - \pi_2) + \pi_3 \lambda (1 + \pi_1 \lambda + \pi_2 \lambda - 2 k \pi_2) can be obtained by taking derivatives of M_Y(t) with respect to t at t = 0.
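As a quick numerical illustration of the pmf in (3), the following R sketch evaluates the ZkIP probabilities by adding the two point masses to the Poisson component; the function name dzkip and its arguments are illustrative and not taken from the paper. Summing the probabilities and the values y·P(Y = y) over a long grid recovers a total of one and the mean kπ₂ + π₃λ.

```r
# Minimal sketch of the ZkIP probability mass function (3); illustrative names.
dzkip <- function(y, lambda, pi1, pi2, k) {
  pi3 <- 1 - pi1 - pi2                    # weight of the Poisson component
  p   <- pi3 * dpois(y, lambda)           # Poisson part applies to every count
  p[y == 0] <- p[y == 0] + pi1            # point mass at zero
  p[y == k] <- p[y == k] + pi2            # point mass at k
  p
}

# Quick checks: probabilities sum to ~1 and the mean is k*pi2 + pi3*lambda.
y <- 0:100
sum(dzkip(y, lambda = 2, pi1 = 0.2, pi2 = 0.4, k = 2))      # ~1
sum(y * dzkip(y, lambda = 2, pi1 = 0.2, pi2 = 0.4, k = 2))  # ~ 2*0.4 + 0.4*2 = 1.6
```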
The unknown parameters in a ZkIP distribution are λ , π 1 , and π 2 with λ > 0 and 0 < π 1 + π 2 < 1 . There are various methods for estimating the parameters and drawing inferences. In the next section, we develop the expectation–maximization (EM) algorithm to obtain the maximum likelihood estimates of the ZkIP model parameters and the corresponding standard errors.

3. Methodology

Suppose that we have a vector y = ( y 1 , y 2 , . . . , y n ) consisting of a random sample of n observations potentially from a ZkIP distribution. The frequency distribution of the sample can be organized in a table as
j                    0     1     ...   k     ...   K     Total
Observed frequency   n_0   n_1   ...   n_k   ...   n_K   n
Here, n_j = number of y_i's that are equal to j and K = max{y_i}. If the observations are truly from the ZkIP distribution, the values of n_0 and n_k will be large compared to the rest of the frequencies. The vector of observed frequencies (n_0, n_1, …, n_K) can be regarded as incomplete data in the sense that n_0 is actually n_a + n_b and n_k = n_c + n_d, where the unknown n_a and n_c are frequencies from the degenerate distributions at 0 and k, respectively. Using (3), we can write the likelihood function of the observed frequencies n_j as
L_{obs}(\pi_1, \pi_2, \lambda \mid \mathbf{y}) \propto (\pi_1 + \pi_3 e^{-\lambda})^{n_0} \left[\pi_2 + \pi_3 \frac{\lambda^{k} e^{-\lambda}}{k!}\right]^{n_k} \prod_{j \neq 0,k}^{K} \left[\pi_3 \frac{\lambda^{j} e^{-\lambda}}{j!}\right]^{n_j} \propto (\pi_1 + \pi_3 p_0)^{n_0} \, (\pi_2 + \pi_3 p_k)^{n_k} \prod_{j \neq 0,k}^{K} (\pi_3 p_j)^{n_j},
where p_j = \lambda^{j} e^{-\lambda}/j! and \pi_3 = (1 - \pi_1 - \pi_2). Note that when \pi_2 = 0, the ZkIP reduces to the ZIP. Thus, the likelihood for the ZIP model is
L_{obs}(\pi_1, \lambda \mid \mathbf{y}) \propto \left(\pi_1 + (1 - \pi_1) e^{-\lambda}\right)^{n_0} \prod_{j \neq 0}^{K} \left[(1 - \pi_1) \frac{\lambda^{j} e^{-\lambda}}{j!}\right]^{n_j}.
If \pi_1 = \pi_2 = 0, the likelihood (4) becomes the likelihood function of the Poisson distribution given by
L_{obs}(\lambda \mid \mathbf{y}) = \prod_{j=0}^{K} \left[\frac{\lambda^{j} e^{-\lambda}}{j!}\right]^{n_j}.
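The grouped-data likelihoods above translate directly into short R functions. The sketch below (illustrative names, not the authors' code) evaluates the observed log-likelihood of the ZkIP model up to an additive constant; setting π₂ = 0, or π₁ = π₂ = 0, gives the ZIP and Poisson versions.

```r
# Observed log-likelihood (up to an additive constant) for grouped counts,
# following (4). 'counts' holds the distinct values 0, ..., K and 'freq' the
# corresponding frequencies n_j. Illustrative sketch.
loglik_zkip <- function(counts, freq, lambda, pi1, pi2, k) {
  pi3 <- 1 - pi1 - pi2
  p <- pi3 * dpois(counts, lambda)         # Poisson component
  p[counts == 0] <- p[counts == 0] + pi1   # point mass at zero
  p[counts == k] <- p[counts == k] + pi2   # point mass at k
  sum(freq * log(p))
}
# ZIP and Poisson are the special cases pi2 = 0 and pi1 = pi2 = 0, respectively.
```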

3.1. EM Estimation

For the likelihood (4), the unknown parameter θ = (π₁, π₂, λ) can be estimated using the maximum likelihood (ML) approach. A computationally simple way to obtain an ML estimate of θ is the expectation–maximization (EM) method, which was introduced by [30] in a seminal paper. The EM algorithm is a simple modification of maximum likelihood estimation and has become a popular alternative for ML estimation in cases where data are missing or incomplete. We describe the EM approach to study the ZkIP model for grouped data.
The frequency vector (n_0, n_1, …, n_k, …, n_K) is the observed data. It can be viewed as partially incomplete data, in the sense that n_0 = n_a + n_b and n_k = n_c + n_d, since the number, n_a, of zeros and the number, n_c, of ks coming from the degenerate distributions are unknown. Thus, the complete data vector including the missing frequencies is (n_a, n_b, n_1, …, n_c, n_d, …, n_K). The likelihood function of this complete data vector is
L_{comp}(\pi_1, \pi_2, \lambda \mid \mathbf{y}) \propto \pi_1^{n_a} \, \pi_2^{n_c} \, \pi_3^{(n - n_a - n_c)} \, p_0^{n_b} \, p_k^{n_d} \prod_{j \neq 0,k}^{K} p_j^{n_j},
where p_j = \lambda^{j} e^{-\lambda}/j! and \pi_3 = (1 - \pi_1 - \pi_2). Our interest is to maximize the likelihood (5) or, equivalently, minimize the negative of the log-likelihood. The log-likelihood, \ell_{comp} = \log L_{comp}, can be written as
\ell_{comp}(\pi_1, \pi_2, \lambda \mid \mathbf{y}) \propto n_a \log(\pi_1) + n_c \log(\pi_2) + (n - n_a - n_c) \log \pi_3 + n_b \log p_0 + n_d \log p_k + \sum_{j \neq 0,k}^{K} n_j \log p_j \propto n_a \log(\pi_1) + n_c \log(\pi_2) + (n - n_a - n_c) \log \pi_3 - n_b \lambda + n_d (-\lambda + k \log \lambda) + \sum_{j \neq 0,k}^{K} n_j (-\lambda + j \log \lambda),
where the frequencies n a and n c are unknown. The expectation step in the EM algorithm replaces these frequencies with their expected values. To obtain the expected values, we assume there is a latent variable z = ( z 1 , z 2 , z 3 ) distributed as a multinomial with parameter vector ( 1 , π 1 , π 2 , π 3 ) , where the number of trials is one. Here, z takes values ( 1 , 0 , 0 ) with probability π 1 , ( 0 , 1 , 0 ) with probability π 2 , and ( 0 , 0 , 1 ) with probability π 3 . That is,
P(\mathbf{z} = (z_1, z_2, z_3)) = \begin{cases} \pi_1 & \text{if } z_1 = 1,\ z_2 = 0,\ z_3 = 0 \\ \pi_2 & \text{if } z_1 = 0,\ z_2 = 1,\ z_3 = 0 \\ \pi_3 & \text{if } z_1 = 0,\ z_2 = 0,\ z_3 = 1. \end{cases}
Furthermore, assume the conditional distribution of Y | z is
P(Y = y \mid \mathbf{z} = (z_1, z_2, z_3)) = \begin{cases} 1 & \text{for } z_1 = 1,\ y = 0 \\ 1 & \text{for } z_2 = 1,\ y = k \\ \dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{for } z_3 = 1,\ y = 0, 1, \ldots \end{cases}
Now, the joint distribution of ( Y , z ) obtained by multiplying (7) and (8) is
P(Y = y, \mathbf{z} = (z_1, z_2, z_3)) = \begin{cases} \pi_1 & \text{for } z_1 = 1,\ y = 0 \\ \pi_2 & \text{for } z_2 = 1,\ y = k \\ \pi_3 \dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{for } z_3 = 1,\ y = 0, 1, \ldots \end{cases}
The marginal of Y can be obtained from (9) by summing over the three possible values of z . Thus, we obtain
P(Y = 0) = P(Y = 0, z_1 = 1) + P(Y = 0, z_2 = 1) + P(Y = 0, z_3 = 1) = \pi_1 + \pi_3 e^{-\lambda},
P(Y = k) = P(Y = k, z_1 = 1) + P(Y = k, z_2 = 1) + P(Y = k, z_3 = 1) = \pi_2 + \pi_3 \frac{\lambda^{k} e^{-\lambda}}{k!},
and
P(Y = y) = P(Y = y, z_1 = 1) + P(Y = y, z_2 = 1) + P(Y = y, z_3 = 1) = \pi_3 \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 1, 2, \ldots,\ y \neq k,
which is equivalent to the ZkIP distribution defined by (3). Now, the conditional expected values can be computed by the posterior probabilities given in Table 1.
The three latent variables z i s are the indicator variables for the three distributions in the ZkIP mixture model. More specifically,
\hat{n}_a = n_0 \, E(z_1 \mid y = 0) = n_0 \, P(z_1 = 1 \mid y = 0) = \frac{n_0 \pi_1}{\pi_1 + \pi_3 p_0}, \qquad \hat{n}_c = n_k \, E(z_2 \mid y = k) = n_k \, P(z_2 = 1 \mid y = k) = \frac{n_k \pi_2}{\pi_2 + \pi_3 p_k}.
The maximization step or the M-step in the EM algorithm involves maximizing the log-likelihood (6) after substituting these estimates for n a and n c . However, this maximization is easy since the score equations have closed-form solutions. Indeed, equating partial derivatives with respect to the three parameters of (6) to zero we obtain,
\frac{\partial \ell_{comp}(\pi_1, \pi_2, \lambda)}{\partial \pi_1} = 0 \;\Longrightarrow\; \hat{\pi}_1 = \frac{n_a (1 - \pi_2)}{n - n_c},
\frac{\partial \ell_{comp}(\pi_1, \pi_2, \lambda)}{\partial \pi_2} = 0 \;\Longrightarrow\; \hat{\pi}_2 = \frac{n_c (1 - \pi_1)}{n - n_a},
\frac{\partial \ell_{comp}(\pi_1, \pi_2, \lambda)}{\partial \lambda} = 0 \;\Longrightarrow\; \hat{\lambda} = \frac{\sum_{j=0}^{K} j \, n_j}{n - n_a - n_c}.
We summarize the steps of the EM algorithm as follows:
  • Choose initial values π₁⁰, π₂⁰, and λ⁰ for π₁, π₂, and λ, respectively.
  • E-step: Calculate n̂_a and n̂_c using (10), and set n̂_b = n_0 − n̂_a and n̂_d = n_k − n̂_c.
  • M-step: Update the estimates of π₁, π₂, and λ using the formulas in (11), (12), and (13).
  • Iterate the E-step and M-step until the estimates π ^ 1 , π ^ 2 , and λ ^ converge.
We have developed an R code for this algorithm and used it for the two data analysis examples in Section 6.
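The authors' implementation is not reproduced here; the following is a minimal R sketch of the iteration just described, assuming the grouped data are supplied as distinct counts and their frequencies (the function and argument names are ours). Applied, for instance, to the grouped sunburn frequencies of Table 11 with k = 1, the iteration should converge to estimates close to those reported for the ZOIP model in Table 10.

```r
# Illustrative EM iteration for grouped ZkIP data (Section 3.1); assumes the
# vector 'counts' contains both 0 and k among the distinct observed values.
em_zkip <- function(counts, freq, k,
                    pi1 = 0.1, pi2 = 0.1, lambda = mean(rep(counts, freq)),
                    tol = 1e-8, maxit = 1000) {
  n    <- sum(freq)
  n0   <- freq[counts == 0]
  nk   <- freq[counts == k]
  sumy <- sum(counts * freq)                 # sum of all observations
  for (it in seq_len(maxit)) {
    pi3 <- 1 - pi1 - pi2
    p0  <- dpois(0, lambda)
    pk  <- dpois(k, lambda)
    # E-step: expected frequencies from the degenerate components, Eq. (10)
    na <- n0 * pi1 / (pi1 + pi3 * p0)
    nc <- nk * pi2 / (pi2 + pi3 * pk)
    # M-step: closed-form updates, Eqs. (11)-(13)
    pi1_new    <- na * (1 - pi2) / (n - nc)
    pi2_new    <- nc * (1 - pi1) / (n - na)
    lambda_new <- sumy / (n - na - nc)
    converged  <- max(abs(c(pi1_new - pi1, pi2_new - pi2, lambda_new - lambda))) < tol
    pi1 <- pi1_new; pi2 <- pi2_new; lambda <- lambda_new
    if (converged) break
  }
  list(pi1 = pi1, pi2 = pi2, lambda = lambda, iterations = it)
}
```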

3.2. Standard Errors of EM Estimates

Optimization routines typically output a numerically computed Hessian matrix for the function being optimized. However, the calculation of the standard errors is more accurate if analytical expressions are available. To compute the standard errors of the estimates obtained by the EM algorithm, we follow the approach described by Louis [29]. The relation between the likelihoods of the complete, observed, and missing data is given as
L_{comp}(\theta \mid \mathbf{y}, \mathbf{z}) = L_{obs}(\theta \mid \mathbf{y}) \, L_{miss}(\theta \mid (\mathbf{z} \mid \mathbf{y})),
where y and z stand for the observed and missing data, respectively. From (14), taking logs we obtain
\ell_{comp}(\theta \mid \mathbf{y}, \mathbf{z}) = \ell_{obs}(\theta \mid \mathbf{y}) + \ell_{miss}(\theta \mid (\mathbf{z} \mid \mathbf{y})).
Taking second-order partial derivatives in Equation (15), we can see that the information matrices for the complete, observed, and missing data satisfy the identity
I_{comp} = I_{obs} + I_{miss}
or
I_{obs} = I_{comp} - I_{miss}.
Since the right-hand side of Equation (16) depends on the missing data, Louis [29] suggested taking its expected value given the observed data. This gives us the identity
I_{obs} = E(I_{obs} \mid \mathbf{y}) = E(I_{comp} \mid \mathbf{y}) - E(I_{miss} \mid \mathbf{y}).
In other words, an estimate of the observed information matrix is given by
\widehat{I}_{obs} = E(I_{comp} \mid \mathbf{y}) - E(I_{miss} \mid \mathbf{y}).
Regularity conditions under which these information matrices are non-singular are given in [31]. Without going into technical details, the salient regularity conditions are (1) the ranges of the random variables do not depend on the parameters; (2) the partial derivatives of the pdf with respect to the parameters exist; and (3) the integrals of the partial derivatives are the same as the partial derivatives of the integrals with respect to the parameters. Under these conditions, the standard errors of the parameter estimates can be obtained by taking the square roots of the diagonal elements of the inverse of the observed information matrix (18). Note that
I_{comp} = -\begin{pmatrix} \dfrac{\partial^2 \ell_{comp}}{\partial \pi_1^2} & \dfrac{\partial^2 \ell_{comp}}{\partial \pi_1 \, \partial \pi_2} & \dfrac{\partial^2 \ell_{comp}}{\partial \pi_1 \, \partial \lambda} \\[8pt] \dfrac{\partial^2 \ell_{comp}}{\partial \pi_2 \, \partial \pi_1} & \dfrac{\partial^2 \ell_{comp}}{\partial \pi_2^2} & \dfrac{\partial^2 \ell_{comp}}{\partial \pi_2 \, \partial \lambda} \\[8pt] \dfrac{\partial^2 \ell_{comp}}{\partial \lambda \, \partial \pi_1} & \dfrac{\partial^2 \ell_{comp}}{\partial \lambda \, \partial \pi_2} & \dfrac{\partial^2 \ell_{comp}}{\partial \lambda^2} \end{pmatrix}.
From (6), the elements of the information matrix I c o m p are
-\frac{\partial^2 \ell_{comp}}{\partial \pi_1^2} = \frac{n_a}{\pi_1^2} + \frac{n - n_a - n_c}{\pi_3^2}, \qquad -\frac{\partial^2 \ell_{comp}}{\partial \pi_1 \, \partial \pi_2} = -\frac{\partial^2 \ell_{comp}}{\partial \pi_2 \, \partial \pi_1} = \frac{n - n_a - n_c}{\pi_3^2},
-\frac{\partial^2 \ell_{comp}}{\partial \pi_2^2} = \frac{n_c}{\pi_2^2} + \frac{n - n_a - n_c}{\pi_3^2}, \qquad -\frac{\partial^2 \ell_{comp}}{\partial \lambda^2} = \frac{n_d k}{\lambda^2} + \sum_{j \neq 0,k}^{K} \frac{j \, n_j}{\lambda^2}.
The other elements, \partial^2 \ell_{comp}/\partial \pi_1 \partial \lambda and \partial^2 \ell_{comp}/\partial \pi_2 \partial \lambda, are equal to zero. Since n_a and n_c are missing, we replace them by their expected values
E(n_a \mid n_0) = \frac{n_0 \pi_1}{\pi_1 + \pi_3 p_0} \quad \text{and} \quad E(n_c \mid n_k) = \frac{n_k \pi_2}{\pi_2 + \pi_3 p_k}.
Thus, the nonzero elements of E ( I c o m p | y ) = E ( I c o m p | n 0 , n k ) are
E\!\left[-\frac{\partial^2 \ell_{comp}}{\partial \pi_1^2}\right] = \frac{n}{\pi_3^2} + \frac{n_0}{\pi_1 (\pi_1 + \pi_3 p_0)} - \frac{n_0 \pi_1}{\pi_3^2 (\pi_1 + \pi_3 p_0)} - \frac{n_k \pi_2}{\pi_3^2 (\pi_2 + \pi_3 p_k)}
and
E\!\left[-\frac{\partial^2 \ell_{comp}}{\partial \pi_1 \, \partial \pi_2}\right] = E\!\left[-\frac{\partial^2 \ell_{comp}}{\partial \pi_2 \, \partial \pi_1}\right] = \frac{n}{\pi_3^2} - \frac{n_0 \pi_1}{\pi_3^2 (\pi_1 + \pi_3 p_0)} - \frac{n_k \pi_2}{\pi_3^2 (\pi_2 + \pi_3 p_k)},
E\!\left[-\frac{\partial^2 \ell_{comp}}{\partial \pi_2^2}\right] = \frac{n}{\pi_3^2} - \frac{n_0 \pi_1}{\pi_3^2 (\pi_1 + \pi_3 p_0)} + \frac{n_k}{\pi_2 (\pi_2 + \pi_3 p_k)} - \frac{n_k \pi_2}{\pi_3^2 (\pi_2 + \pi_3 p_k)},
E\!\left[-\frac{\partial^2 \ell_{comp}}{\partial \lambda^2}\right] = \frac{n_k k}{\lambda^2} - \frac{n_k k \pi_2}{\lambda^2 (\pi_2 + \pi_3 p_k)} + \frac{1}{\lambda^2} \sum_{j \neq 0,k}^{K} j \, n_j.
Next, to compute the second term, E ( I m i s s | y ) , in Equation (16), we proceed as follows. The likelihood of the observed and complete data are given in (4) and (5), respectively. Hence, the likelihood of the missing data is obtained taking the ratio of these likelihoods and it is given by
L_{miss}(\pi_1, \pi_2, \lambda \mid \mathbf{z}) \propto \pi_1^{n_a} \, \pi_2^{n_c} \, (p_0 \pi_3)^{n_b} \, (p_k \pi_3)^{n_d} \left[\frac{1}{\pi_1 + \pi_3 p_0}\right]^{n_0} \left[\frac{1}{\pi_2 + \pi_3 p_k}\right]^{n_k}.
Thus, the log-likelihood of the missing data is
\ell_{miss}(\pi_1, \pi_2, \lambda \mid \mathbf{y}) \propto n_a \log(\pi_1) + n_c \log(\pi_2) - n_0 \log(\pi_1 + \pi_3 p_0) - n_k \log(\pi_2 + \pi_3 p_k) + (n_b + n_d) \log(\pi_3) - (n_b + n_d) \lambda + n_d k \log(\lambda).
We can easily check that the first-order partial derivatives are
\frac{\partial \ell_{miss}}{\partial \pi_1} = \frac{n_a}{\pi_1} - \frac{n_0 (1 - p_0)}{\pi_1 + \pi_3 p_0} - \frac{n_b + n_d}{\pi_3} + \frac{n_k p_k}{\pi_2 + \pi_3 p_k},
\frac{\partial \ell_{miss}}{\partial \pi_2} = \frac{n_c}{\pi_2} + \frac{n_0 p_0}{\pi_1 + \pi_3 p_0} - \frac{n_b + n_d}{\pi_3} - \frac{n_k (1 - p_k)}{\pi_2 + \pi_3 p_k},
\frac{\partial \ell_{miss}}{\partial \lambda} = \frac{n_d k}{\lambda} - (n_b + n_d) + \frac{n_0 \pi_3 p_0}{\pi_1 + \pi_3 p_0} - \frac{n_k \pi_3 p_k}{\pi_2 + \pi_3 p_k} \left(\frac{k}{\lambda} - 1\right),
and the negative of the second-order partial derivatives are
-\frac{\partial^2 \ell_{miss}}{\partial \pi_1^2} = \frac{n_a}{\pi_1^2} - \frac{n_0 (1 - p_0)^2}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k p_k^2}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_b + n_d}{\pi_3^2},
-\frac{\partial^2 \ell_{miss}}{\partial \pi_1 \, \partial \pi_2} = \frac{n_0 p_0 (1 - p_0)}{(\pi_1 + \pi_3 p_0)^2} + \frac{n_k p_k (1 - p_k)}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_b + n_d}{\pi_3^2},
-\frac{\partial^2 \ell_{miss}}{\partial \pi_1 \, \partial \lambda} = \frac{n_0 (1 - \pi_2) p_0}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k \pi_2 p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right),
-\frac{\partial^2 \ell_{miss}}{\partial \pi_2^2} = \frac{n_c}{\pi_2^2} - \frac{n_0 p_0^2}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k (1 - p_k)^2}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_b + n_d}{\pi_3^2},
-\frac{\partial^2 \ell_{miss}}{\partial \pi_2 \, \partial \lambda} = \frac{n_0 \pi_1 p_0}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k (1 - \pi_1) p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right),
-\frac{\partial^2 \ell_{miss}}{\partial \lambda^2} = \frac{n_0 \pi_1 \pi_3 p_0}{(\pi_1 + \pi_3 p_0)^2} + \frac{n_k \pi_2 \pi_3 p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right)^2 - \frac{k}{\lambda^2} \frac{n_k \pi_3 p_k}{\pi_2 + \pi_3 p_k} + \frac{k n_d}{\lambda^2}.
Once again, using the expected values
E(n_a \mid n_0) = \frac{n_0 \pi_1}{\pi_1 + \pi_3 p_0}, \quad E(n_c \mid n_k) = \frac{n_k \pi_2}{\pi_2 + \pi_3 p_k}, \quad E(n_b \mid n_0) = \frac{n_0 \pi_3 p_0}{\pi_1 + \pi_3 p_0}, \quad E(n_d \mid n_k) = \frac{n_k \pi_3 p_k}{\pi_2 + \pi_3 p_k},
we obtain the elements of E ( I m i s s | y ) = E ( I m i s s | n 0 , n k ) as follows
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \pi_1^2}\right] = \frac{n_0}{\pi_1 (\pi_1 + \pi_3 p_0)} - \frac{n_0 (1 - p_0)^2}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k p_k^2}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_0 p_0}{\pi_3 (\pi_1 + \pi_3 p_0)} + \frac{n_k p_k}{\pi_3 (\pi_2 + \pi_3 p_k)},
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \pi_1 \, \partial \pi_2}\right] = \frac{n_0 p_0 (1 - p_0)}{(\pi_1 + \pi_3 p_0)^2} + \frac{n_k p_k (1 - p_k)}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_0 p_0}{\pi_3 (\pi_1 + \pi_3 p_0)} + \frac{n_k p_k}{\pi_3 (\pi_2 + \pi_3 p_k)}
and
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \pi_1 \, \partial \lambda}\right] = \frac{n_0 (1 - \pi_2) p_0}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k \pi_2 p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right),
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \pi_2^2}\right] = \frac{n_k}{\pi_2 (\pi_2 + \pi_3 p_k)} - \frac{n_0 p_0^2}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k (1 - p_k)^2}{(\pi_2 + \pi_3 p_k)^2} + \frac{n_0 p_0}{\pi_3 (\pi_1 + \pi_3 p_0)} + \frac{n_k p_k}{\pi_3 (\pi_2 + \pi_3 p_k)},
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \pi_2 \, \partial \lambda}\right] = \frac{n_0 \pi_1 p_0}{(\pi_1 + \pi_3 p_0)^2} - \frac{n_k (1 - \pi_1) p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right),
E\!\left[-\frac{\partial^2 \ell_{miss}}{\partial \lambda^2}\right] = \frac{n_0 \pi_1 \pi_3 p_0}{(\pi_1 + \pi_3 p_0)^2} + \frac{n_k \pi_2 \pi_3 p_k}{(\pi_2 + \pi_3 p_k)^2} \left(\frac{k}{\lambda} - 1\right)^2.
The remaining elements follow by symmetry. Matrices E ( I c o m p | y ) and E ( I m i s s | y ) are positive definite and so they are non-singular [32,33].
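A sketch of this computation in R is given below, assembling E(I_comp | y) and E(I_miss | y) from the closed-form elements above and inverting their difference; the function name and interface are ours, and the inputs are the grouped data together with the EM estimates at convergence.

```r
# Illustrative standard errors via Louis' method (Section 3.2) for (pi1, pi2, lambda).
se_zkip <- function(counts, freq, k, pi1, pi2, lambda) {
  pi3 <- 1 - pi1 - pi2
  n   <- sum(freq)
  n0  <- freq[counts == 0]
  nk  <- freq[counts == k]
  p0  <- dpois(0, lambda); pk <- dpois(k, lambda)
  d0  <- pi1 + pi3 * p0;   dk <- pi2 + pi3 * pk
  S   <- sum(counts * freq) - k * nk            # sum_{j != 0, k} j * n_j
  r   <- k / lambda - 1
  # E(I_comp | y): nonzero elements
  Ic <- matrix(0, 3, 3)
  Ic[1, 1] <- n / pi3^2 + n0 / (pi1 * d0) - n0 * pi1 / (pi3^2 * d0) - nk * pi2 / (pi3^2 * dk)
  Ic[1, 2] <- Ic[2, 1] <- n / pi3^2 - n0 * pi1 / (pi3^2 * d0) - nk * pi2 / (pi3^2 * dk)
  Ic[2, 2] <- n / pi3^2 - n0 * pi1 / (pi3^2 * d0) + nk / (pi2 * dk) - nk * pi2 / (pi3^2 * dk)
  Ic[3, 3] <- nk * k / lambda^2 - nk * k * pi2 / (lambda^2 * dk) + S / lambda^2
  # E(I_miss | y)
  Im <- matrix(0, 3, 3)
  Im[1, 1] <- n0 / (pi1 * d0) - n0 * (1 - p0)^2 / d0^2 - nk * pk^2 / dk^2 +
              n0 * p0 / (pi3 * d0) + nk * pk / (pi3 * dk)
  Im[1, 2] <- Im[2, 1] <- n0 * p0 * (1 - p0) / d0^2 + nk * pk * (1 - pk) / dk^2 +
              n0 * p0 / (pi3 * d0) + nk * pk / (pi3 * dk)
  Im[1, 3] <- Im[3, 1] <- n0 * (1 - pi2) * p0 / d0^2 - nk * pi2 * pk * r / dk^2
  Im[2, 2] <- nk / (pi2 * dk) - n0 * p0^2 / d0^2 - nk * (1 - pk)^2 / dk^2 +
              n0 * p0 / (pi3 * d0) + nk * pk / (pi3 * dk)
  Im[2, 3] <- Im[3, 2] <- n0 * pi1 * p0 / d0^2 - nk * (1 - pi1) * pk * r / dk^2
  Im[3, 3] <- n0 * pi1 * pi3 * p0 / d0^2 + nk * pi2 * pi3 * pk * r^2 / dk^2
  Iobs <- Ic - Im                               # estimated observed information, Eq. (18)
  sqrt(diag(solve(Iobs)))                       # standard errors of (pi1, pi2, lambda)
}
```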

4. Goodness of Fit and Model Selection

Hypothesis testing usually follows parameter estimation to check the significance of the parameters. In the presence of competing models, there is a need to compare and find the best model. There are various measures useful for model comparisons, the most popular being the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). It is also important to check the goodness of fit of the models. This is accomplished using the Pearson chi-square statistic and the sum of absolute error. In this section, we will discuss the aforementioned topics, namely hypothesis testing, model selection, and goodness of fit.

4.1. Hypothesis Testing and Measures of Model Selection

The unknown parameter in the ZkIP distribution is θ = (π₁, π₂, λ). The parameter λ is the rate parameter of the Poisson distribution, and thus λ > 0, while π₁ represents the proportion of zeros and π₂ the proportion of the count k > 0 coming from the degenerate distributions, with 0 < π₁ + π₂ < 1. To study the statistical significance of the unknown parameter θ, we can perform various hypothesis tests. Under standard regularity conditions, the EM estimate θ̂ of θ is asymptotically normal with mean θ₀ and covariance matrix given by the inverse of the estimated observed information matrix (18), where θ₀ is the true value, which we assume lies in the interior of the parameter space. We can use this result to construct a Wald test of the null hypothesis H₀: λ = λ₀ versus the alternative H₁: λ ≠ λ₀; the test statistic (λ̂ − λ₀)/SE(λ̂) is asymptotically standard normal. Similarly, the Wald test could be used to test whether a specified proportion π₂ = π₂⁰ of observations comes from the degenerate distribution at k, or whether a specified proportion π₁ = π₁⁰ comes from the degenerate distribution at zero.
The FMM and Countreg procedures in SAS use the parameters γ = log ( π 1 / π 3 ) and δ = log ( π 2 / π 3 ) and test for the hypothesis H 0 : ( γ , δ ) = ( 0 , 0 ) . This hypothesis is equivalent to testing H 0 : ( π 1 , π 2 ) = ( π 1 0 = 1 / 3 , π 2 0 = 1 / 3 ) , which we could do using Wald’s test because π 1 0 = 1 / 3 and π 2 0 = 1 / 3 are values in the interior of the respective parameter spaces.
As mentioned in Section 3, when π₂ = 0, ZkIP reduces to ZIP, and if additionally π₁ = 0, the model is simply a Poisson distribution. Thus, ZkIP, ZIP, and Poisson are nested models, and we can perform likelihood ratio tests (LRT) for model reduction. To test the significance of inflation at k > 0, the appropriate hypothesis is H₀: π₂ = 0 versus H₁: π₂ > 0. Similarly, the significance of inflation at zero can be tested by H₀: π₁ = 0 versus H₁: π₁ > 0. The problem is that in both scenarios the null hypothesis lies on the boundary of the parameter space, and therefore the usual regularity conditions are not satisfied. However, Chant [34] and Shapiro [35] have shown that the likelihood ratio test statistic is asymptotically distributed under the null hypothesis as 0.5χ₀² + 0.5χ₁², a mixture of chi-square distributions in which χ₀² denotes a point mass at zero. We could use this result to test the hypotheses.
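In R, the p-value under this boundary mixture can be computed as half the usual chi-square(1) tail probability, for example:

```r
# p-value for the boundary LRT under the 0.5*chisq(0) + 0.5*chisq(1) mixture;
# valid for a strictly positive likelihood ratio statistic 'lrt'.
pval_boundary <- function(lrt) 0.5 * pchisq(lrt, df = 1, lower.tail = FALSE)
pval_boundary(3.84)   # about 0.025, half of the usual chi-square(1) p-value
```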
To find the best model we could use various criteria, the most popular being the Akaike information criterion (AIC). It selects the best model based on the expected discrepancy between the hypothesized model and the observed data; the model with the minimum AIC is considered the best among the analyzed models. The AIC is given by −2 log L(θ̂) + 2m, where log L(θ̂) is the log-likelihood of the model at the ML estimates and m is the number of parameters in the model. Recall that the Poisson model has only one parameter, λ; the ZIP has two parameters, λ and π₁; and the ZkIP has the additional parameter π₂. Akaike [36] suggested that it is not just the minimum value of the AIC that is relevant, but also the differences between the AICs of the various models. A rule of thumb for selecting the best model from a set of competing models, based on the differences between the AICs, is given in Table 2.
The other popularly used measure for selecting the best model is the Bayesian information criterion (BIC), given by −2 log L(θ̂) + m log n, where n is the sample size. As with the AIC, the model with the minimum BIC among the competing models is the best. Both the AIC and the BIC penalize the addition of parameters to the model. Rules of thumb for interpreting the differences between BICs are given in Table 3. To choose the best model, we select the model with the minimum BIC; when the difference Δᵢ = BICᵢ − BIC_min is large, there is strong evidence against the competing models and the model with the minimum BIC is the best.
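Both criteria are straightforward to compute from the maximized log-likelihood; a small helper sketch:

```r
# AIC and BIC from the maximized log-likelihood logL, number of estimated
# parameters m (1 for Poisson, 2 for ZIP, 3 for ZkIP), and sample size n.
aic <- function(logL, m)    -2 * logL + 2 * m
bic <- function(logL, m, n) -2 * logL + m * log(n)
# Differences Delta_i = criterion_i - min(criterion) are then read against
# the rules of thumb in Tables 2 and 3.
```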

4.2. Model Checking

There is also a need to check how well the best model among the competing models fits the data. The goodness of fit of a model is studied using various measures. A commonly used measure is the Pearson statistic \chi^2 = \sum_i (O_i - E_i)^2 / E_i, where O_i is the observed frequency and E_i is the expected frequency of the i-th count. Under the null hypothesis, this statistic asymptotically follows a chi-square distribution with (\kappa - 1) degrees of freedom, where \kappa is the total number of categories. Large values of the test statistic lead to rejection of the null hypothesis. For inflated data, the \chi^2 values are usually high and thus tend to reject the null hypothesis. In such scenarios, a better measure is the sum of absolute errors (ABE), given by
\text{sum ABE} = \sum_i |O_i - E_i|.
We employ these model checking criteria for the analysis of two real-life datasets in Section 6.
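Both measures reduce to one-line computations once the observed and model-expected frequencies are tabulated; a sketch:

```r
# Pearson chi-square and sum of absolute errors over the count categories,
# given observed (obs) and expected (expt) frequency vectors of equal length.
pearson_chisq <- function(obs, expt) sum((obs - expt)^2 / expt)
sum_abe       <- function(obs, expt) sum(abs(obs - expt))
```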

5. Simulations

To study the performance of the proposed EM algorithm, we have conducted some simulation studies. Data Y = ( Y 1 , , Y n ) of sample size n are generated from the ZkIP distribution with parameter vector θ = ( π 1 , π 2 , λ ) . For varying values of θ and values of n = 200 , 500 , 1000 and 2000, we simulated N = 10,000 datasets. Using these simulated data, we compare the performance of the ZkIP model to ZIP and ordinary Poisson using the standardized bias (SBias), standardized mean squared error (SMSE), and coverage probability criteria. The SBias and SMSE are more informative than Bias and MSE, respectively, and thus are preferable [38]. The standardized bias is given by
\mathrm{SBias}(\theta) = \frac{E(\hat{\theta} - \theta)}{\theta} \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\hat{\theta}_i - \theta}{\theta}.
The standardized mean squared error is given by
\mathrm{SMSE}(\theta) = \frac{E(\hat{\theta} - \theta)^2}{\theta^2} \approx \frac{1}{N} \sum_{i=1}^{N} \frac{(\hat{\theta}_i - \theta)^2}{\theta^2}.
The coverage probability of the parameters θ is the proportion of times the confidence interval contains the true value of the parameter. We considered 90 % , 95 % , and 99 % confidence intervals for all of the parameters and various sample sizes.
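A sketch of the simulation ingredients, with illustrative names: a ZkIP random-number generator built from the three-component mixture, and the SBias, SMSE, and coverage summaries applied to the N replicate estimates.

```r
# Draw n observations from ZkIP(lambda, pi1, pi2) with inflation at 0 and k.
rzkip <- function(n, lambda, pi1, pi2, k) {
  comp <- sample(1:3, n, replace = TRUE, prob = c(pi1, pi2, 1 - pi1 - pi2))
  ifelse(comp == 1, 0, ifelse(comp == 2, k, rpois(n, lambda)))
}
# Simulation summaries over a vector 'est' of N replicate estimates (with
# standard errors 'se') of a parameter whose true value is 'true'.
sbias    <- function(est, true) mean((est - true) / true)
smse     <- function(est, true) mean((est - true)^2 / true^2)
coverage <- function(est, se, true, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  mean(true >= est - z * se & true <= est + z * se)
}
```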

5.1. Simulation I

In our first simulation study, we generated data from ZkIP with λ = 2 and a probability at zero of π 1 = 0.2 , and assumed that k = 2 is inflated with probability π 2 = 0.4 . The data were independently generated N = 10,000 times for each value of n = 200, 500, 1000, and 2000. For ZIP and Poisson models, the standardized bias was negative for each sample size. This indicates that the models underestimate the parameters. The SBias for ZkIP was close to zero for all the parameters and all the sample sizes. Similarly, the SMSE was the smallest for the parameters of the ZkIP model, irrespective of the sample size. As expected, the SMSE decreased as the sample size increased. In this simulation exercise, we observed that the mean estimated values of the ZkIP parameters are close to the true values, and the SBias and SMSE are very close to zero (see Table 4 and Table 5). Thus, we conclude that the performance of the proposed EM algorithm is precise and accurate in this case.
To obtain the confidence intervals, we evaluated the EM estimates and SE of the parameters at each iteration using the methods proposed in Section 3.1 and Section 3.2, respectively. Table 6 shows that for all confidence levels ( 90 % , 95 % , and 99 % ), the coverage probabilities are close to the nominal levels for all the parameters irrespective of the sample size.

5.2. Simulation II

In our second simulation study, we generated data from ZkIP(λ = 5, π₁ = 0.4, π₂ = 0.1), with inflation points zero and k = 3. For each sample size n = 200, 500, 1000, and 2000 we generated N = 10,000 datasets. The average estimated value of λ over the N replications using our method for the ZkIP model satisfies 4.9950 ≤ λ̂ ≤ 5.0024 for all the sample sizes. Similarly, the ranges of the average estimated values of π₁ and π₂ over the N replications are 0.3995 ≤ π̂₁ ≤ 0.4003 and 0.0994 ≤ π̂₂ ≤ 0.1004, respectively. These results clearly demonstrate that our method of estimation is very precise and accurate. Table 7 contains the standardized bias (SBias) calculated from the simulated data. The SBias is minimal and close to zero for all the parameters of the ZkIP model and for all the sample sizes. The SMSE values are also smaller for the ZkIP model than for the ZIP and Poisson models for all the parameters and all the sample sizes, as shown in Table 8. Thus, the proposed EM algorithm efficiently estimates the true parameters in this second simulation study as well. This conclusion is also supported by the coverage probabilities, which are close to the nominal levels, especially for large sample sizes (see Table 9).

6. Applications

In this section, we illustrate the application of the zero- and k-inflated Poisson (ZkIP) model by analyzing two real-life datasets. The first example (sunburn data) has inflated counts at zero and one, and the second example (off days data) has inflated frequencies at zero and at the count k = 2. Both datasets were extracted from the National Health Interview Survey (NHIS) conducted by the National Center for Health Statistics (NCHS) in 2010. The NHIS has questionnaires and sampling designs for collecting data from US residents. NHIS collects data annually on topics related to health such as immunizations, depression, hepatitis, cancer, use of tobacco, and other variables related to the health and demographics of the subjects.

6.1. Sunburn Data

In this example, we study the prevalence of sunburn in adults in the US. It has been established that sunburn is one of the leading causes of skin cancer. Here, the response variable is the number of times sunburn has occurred in the last 12 months. The sample data were collected on 3917 subjects. The mean and variance of the sample are 0.69 and 1.60, respectively. Zeros make up 64.05% of the observations and ones 19.35%. The zeros exceed 50%, which strongly indicates the existence of inflation at zero, while one is probably also inflated. We first fit the Poisson model to the data and then its inflated extensions. The results are shown in Table 10. All the unknown parameters (π₁, π₂, λ) are significant in all of the models. The estimated proportion of inflation at zero is 0.54 for ZIP and 0.61 for ZOIP. From the ZOIP, the inflation at k = 1 is 0.13. The LRT statistic comparing ZIP with Poisson, that is, the difference in the −2 log L values, is 977.16, with a p-value < 0.0001. Thus, we reject the null hypothesis and conclude that the inflation at zero is significant, or that the ZIP model fits significantly better than the simple Poisson distribution. Similarly, comparing the ZOIP model to the ZIP model, the LRT statistic is 155.78 with a p-value < 0.0001. The AIC is 8982.41 and the BIC is 9026.05 for the ZOIP model. The AIC difference between ZOIP and ZIP is Δ_ZIP = 153.78, while that between ZOIP and Poisson is Δ_Poisson = 1128.94. Similarly, the BIC differences are Δ_ZIP = 122.68 and Δ_Poisson = 1099.84. According to the AIC and BIC rules of thumb in Table 2 and Table 3, the ZOIP model gives the best fit when compared to the ZIP and Poisson models.
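The comparisons above amount to simple differences of the values reported in Table 10; a short numerical check in R:

```r
# LRT statistics and AIC differences for the sunburn data, using the
# -2 log L and AIC values reported in Table 10.
m2ll <- c(ZOIP = 8976.41, ZIP = 9132.19, Poisson = 10109.35)
unname(m2ll["ZIP"] - m2ll["ZOIP"])       # 155.78: ZOIP vs ZIP
unname(m2ll["Poisson"] - m2ll["ZIP"])    # 977.16: ZIP vs Poisson
aic_vals <- c(ZOIP = 8982.41, ZIP = 9136.19, Poisson = 10111.35)
aic_vals - min(aic_vals)                 # Delta AIC: 0, 153.78, 1128.94
```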
A comparison between observed and expected frequencies from the ZOIP, ZIP, and Poisson models is shown in Table 11. The expected frequencies from the Poisson model are not close to the observed frequencies and thus the sum of absolute error and χ 2 values are very high. The ZIP model shows an improvement. It perfectly captures the inflation at zero but it does not provide a good fit for counts 1 to 8. The ZOIP model captures the inflations both at zero and at count one. The sum of absolute error is equal to 274.92 and χ 2 = 309.22 ; both these numbers are smaller when compared to the other two models. For these data, the ZOIP model seems to be the best based on LRT, AIC, and BIC criteria. It also fits the data best based on absolute error and chi-square goodness of fit measures. The estimated inflation at zero is about 61% and at one is about 13%, and clearly both are significant.

6.2. Off Days Data

Back pain is a chronic condition among adults, and occasionally it can be severe, forcing many to take days off from work. In these off days data, the count variable is the number of days off taken due to back pain. The number of people surveyed was 2548. The sample mean and variance are 0.37 and 0.88, respectively. In the data, 83% of the counts are zero and 10% are equal to 2. Both these proportions are indicators of inflation and suggest that ZkIP with k = 2 may be an appropriate model. We first fitted the simpler Poisson model for comparison purposes. Due to the high proportion of zeros, we then implemented the zero-inflated Poisson (ZIP) model. Furthermore, to test the significance of inflation at two, we fitted the zero- and k-inflated Poisson (ZkIP) model. The estimates and standard errors of the parameters are in Table 12. The rate parameter, λ, is significant in all three models. The ZIP and ZkIP models have a significant π₁. The ZkIP model also has a significant π₂, indicating that along with the significant inflation at zero there is significant inflation at count 2. Table 12 also lists the −2 log L, AIC, and BIC values for the models.
The comparison between the ZIP and ZkIP models based on the LRT criterion gives a p-value less than 0.0001. Thus, the ZkIP model is significantly better than the ZIP model. The p-value for comparing Poisson and ZIP is also very small (< 0.0001); thus, ZIP is significantly better than Poisson. Since the models are nested, we can conclude that ZkIP outperforms both the ZIP and the Poisson models. Using the AIC and BIC criteria, the best model also turns out to be ZkIP. Furthermore, the AIC difference between the ZIP and ZkIP models is Δ_ZIP = 166.29, while Δ_Poisson = 1335.28. This clearly indicates that empirically there is essentially no support for the Poisson or ZIP models. Similarly, when comparing the models using the BIC criterion, Δ_Poisson = 1339.28 ≫ 10 and Δ_ZIP = 168.29 ≫ 10.
The goodness of fit measures are shown in Table 13. The expected frequencies from the Poisson model are nowhere close to the observed frequencies, resulting in a high sum of absolute error and χ 2 statistic. The ZIP model is able to capture the inflation at zero and thus has a relatively low error when compared to the Poisson model. The ZkIP model, along with inflation at zero, also captures the inflation at count two, thus it gives a good fit to all the counts. The sum of absolute error and χ 2 statistic is at a minimum for the ZkIP model. Thus, the statistical significance of inflation parameters, π 1 and π 2 , in Table 12, and the minimum value for the sum of absolute errors indicates that the ZkIP model is a good model for this off days dataset.

7. Discussion

This article proposes a mixture model, ZkIP, for grouped count data with high frequencies for zero and another count of k > 0 . The ZIP and kIP models are special cases of the ZkIP model. The ZkIP model has just one more parameter than ZIP and kIP, and it captures both the inflation at zero and k. Hence, the ZkIP model is a parsimonious model for studying doubly inflated count data. The model provides more accurate estimates of the probabilities when compared to the model for ungrouped data. The estimated probabilities at zero and k give the estimated count of zeros and ks in excess. The ZkIP model has applications in manufacturing, transportation, econometrics, ecology, and other disciplines. An algorithm is developed using the expectation–maximization (EM) approach to obtain the ML estimates for the ZkIP model. This is a computationally fast approach and extends the estimation method first proposed by Lambert [1] to study the ZIP model. To obtain the standard errors, instead of using the Hessian matrix, we implement the method given by Louis [29] that is based on complete data. We illustrate our algorithm and methodologies on two simulated and two real-life examples from health science. Using various criteria, we show that the ZkIP model is the most appropriate model for the sample data. We are currently extending our methods for zero- and k-inflated Conway–Maxwell–Poisson distributions for grouped and ungrouped data.

Author Contributions

Conceptualization, N.R.C.; data curation, M.A. and N.R.C.; writing—original draft, M.A. and N.R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available in “Data Files” at https://www.cdc.gov/nchs/nhis/1997-2018.htm (accessed on 27 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14.
  2. Ghosh, S.K.; Mukhopadhyay, P.; Lu, J.C. Bayesian analysis of zero-inflated regression models. J. Stat. Plan. Inference 2006, 136, 1360–1375.
  3. Agarwal, D.K.; Gelfand, A.E.; Citron-Pousty, S. Zero-inflated models with application to spatial count data. Environ. Ecol. Stat. 2002, 9, 341–355.
  4. Min, Y.; Agresti, A. Random effect models for repeated measures of zero-inflated count data. Stat. Model. 2005, 5, 1–19.
  5. Yau, K.; Lee, A. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Stat. Med. 2001, 20, 2907–2920.
  6. Saffari, S.E.; Adnan, R. Zero-inflated Poisson regression models with right censored count data. Matematika 2011, 27, 21–29.
  7. Yang, Y.; Simpson, D.G. Conditional decomposition diagnostics for regression analysis of zero-inflated and left-censored data. Stat. Methods Med. Res. 2012, 21, 393–408.
  8. Nguyen, V.T.; Dupuy, J.F. Asymptotic results in censored zero-inflated Poisson regression. Commun. Stat. Theory Methods 2021, 50, 2759–2779.
  9. Altun, E. A new zero-inflated regression model with application. J. Stat. Stat. Actuar. Sci. 2018, 2, 73–80.
  10. Bakouch, H.; Chesneau, C.; Karakaya, K.; Kuş, C. The Cos–Poisson model with a novel count regression analysis. Hacet. J. Math. Stat. 2021, 50, 559–578.
  11. Gupta, P.L.; Gupta, R.C.; Tripathi, R.C. Analysis of zero-adjusted count data. Comput. Stat. Data Anal. 1996, 23, 207–218.
  12. Umbach, D. On inference for a mixture of a Poisson and a degenerate distribution. Commun. Stat. Theory Methods 1981, 10, 299–306.
  13. Lord, D.; Washington, S.P.; Ivan, J.N. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: Balancing statistical fit and theory. Accid. Anal. Prev. 2005, 37, 35–46.
  14. Qin, X.; Ivan, J.N.; Ravishanker, N. Selecting exposure measures in crash rate prediction for two-lane highway segments. Accid. Anal. Prev. 2004, 36, 183–191.
  15. Ridout, M.; Demetrio, C.; Hinde, J. Models for count data with many zeros. In Proceedings of the International Biometric Conference, Cape Town, South Africa, 14–18 December 1998.
  16. Welsh, A.; Cunningham, R.; Donnelly, C.; Lindenmayer, D. Modelling the abundance of rare species: Statistical models for counts with extra zeros. Ecol. Model. 1996, 88, 297–308.
  17. Atkins, D.; Gallop, R. Rethinking how family researchers model infrequent outcomes: A tutorial on count regression and zero-inflated models. J. Fam. Psychol. 2007, 21, 726–735.
  18. Loeys, T.; Moerkerke, B.; De Smet, O.; Buysse, A. The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression. Br. J. Math. Stat. Psychol. 2012, 65, 163–180.
  19. Salehi, M.; Roudbari, M. Zero-inflated Poisson and negative binomial regression models: Application in education. Med. J. Islam. Repub. Iran 2015, 29, 297.
  20. Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge Press: London, UK, 2013.
  21. Greene, W. Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models; Working Papers; New York University: New York, NY, USA, 1994.
  22. Gurmu, S.; Trivedi, P. Excess zeros in count models for recreational trips. J. Bus. Econ. Stat. 1996, 14, 469–477.
  23. Motalebi, N.; Owlia, M.S.; Amiri, A.; Fallahnezhad, M.S. Monitoring social networks based on zero-inflated Poisson regression model. Commun. Stat. Theory Methods 2023, 52, 2099–2115.
  24. Arora, M.; Chaganty, N.R. EM estimation for zero- and k-inflated Poisson regression model. Computation 2021, 9, 94.
  25. Lin, T.H.; Tsai, M.H. Modeling health survey data with excessive zero and k responses. Stat. Med. 2012, 32, 1572–1583.
  26. Sheth-Chandra, M.; Chaganty, N.R.; Sabo, R.T. A Doubly Inflated Poisson Distribution and Regression Model; Springer International Publishing: Berlin, Germany, 2019; pp. 131–145.
  27. Arora, M.; Kalyani, Y.; Shanker, S. A comparative study on inflated and dispersed count data. In Proceedings of the 10th International Conference on Data Science, Technology and Applications (DATA 2021), Online, 6–8 July 2021; Volume 1, pp. 29–38.
  28. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Doubly censored power-normal regression models with inflation. TEST 2015, 24, 265–286.
  29. Louis, T.A. Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1982, 44, 226–233.
  30. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
  31. Schervish, M.J. Theory of Statistics; Springer: New York, NY, USA, 1995.
  32. Rao, C.R. Linear Statistical Inference and Its Applications; John Wiley and Sons Inc.: New York, NY, USA, 1965.
  33. Wald, A. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 1943, 54, 426–482.
  34. Chant, D. On asymptotic tests of composite hypotheses in nonstandard conditions. Biometrika 1974, 61, 291–298.
  35. Shapiro, A. Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints. Biometrika 1985, 72, 133–144.
  36. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  37. Kass, R.E.; Raftery, A.E. Bayes factors. J. Am. Stat. Assoc. 1995, 90, 773–795.
  38. Mallick, A.; Joshi, R. Parameter estimation and application of generalized inflated geometric distribution. J. Stat. Theory Appl. 2018, 17, 491.
Table 1. P(z = (z₁, z₂, z₃) | Y = y) for the ZkIP model.
z = (z₁, z₂, z₃)     y = 0                 y = k                 y ≠ 0, k
(1, 0, 0)            π₁/(π₁ + π₃p₀)        0                     0
(0, 1, 0)            0                     π₂/(π₂ + π₃p_k)       0
(0, 0, 1)            π₃p₀/(π₁ + π₃p₀)      π₃p_k/(π₂ + π₃p_k)    1
Table 2. Rules of thumb [36] for Δᵢ = AICᵢ − AIC_min.
Δᵢ       Level of empirical support for model i
0–2      Substantial
4–7      Considerably less
>10      Essentially none
Table 3. Rules of thumb [37] for Δᵢ = BICᵢ − BIC_min.
Δᵢ              Evidence against a candidate model being the best model
0 ≤ Δᵢ ≤ 2      Not worth more than a bare mention
2 < Δᵢ ≤ 6      Positive
6 < Δᵢ ≤ 10     Strong
Δᵢ > 10         Very strong
Table 4. Comparison of standardized bias (SBias) of the simulated data. True values λ = 2, π₁ = 0.2, π₂ = 0.4, and k = 2.
n      Parameter   ZkIP       ZIP        Poisson
2000   λ̂           −0.0002    −0.1076    −0.2007
       π̂₁          0.0020     −0.4785    —
       π̂₂          <0.0001    —          —
1000   λ̂           −0.0015    −0.1081    −0.2006
       π̂₁          −0.0016    −0.4825    —
       π̂₂          <0.0001    —          —
500    λ̂           0.0014     −0.1070    −0.2002
       π̂₁          0.0009     −0.4798    —
       π̂₂          0.0021     —          —
200    λ̂           0.0025     −0.1063    −0.1987
       π̂₁          −0.0085    −0.4865    —
       π̂₂          0.0025     —          —
Table 5. Comparison of standardized MSE (SMSE) of the simulated data. True values λ = 2, π₁ = 0.2, π₂ = 0.4, and k = 2.
n      Parameter   ZkIP      ZIP       Poisson
2000   λ̂           0.0009    0.0118    0.0405
       π̂₁          0.0031    0.2329    —
       π̂₂          0.0013    —         —
1000   λ̂           0.0019    0.0121    0.0406
       π̂₁          0.0063    0.2408    —
       π̂₂          0.0025    —         —
500    λ̂           0.0037    0.0123    0.0408
       π̂₁          0.0122    0.2453    —
       π̂₂          0.0053    —         —
200    λ̂           0.0095    0.0135    0.0413
       π̂₁          0.0328    0.2769    —
       π̂₂          0.0125    —         —
Table 6. Comparison of coverage probabilities of the simulated data. True values λ = 2, π₁ = 0.2, π₂ = 0.4, and k = 2.
n      Parameter   90%       95%       99%
2000   λ̂           0.8920    0.9440    0.9890
       π̂₁          0.8970    0.9590    0.9930
       π̂₂          0.8910    0.9530    0.9930
1000   λ̂           0.8950    0.9430    0.9860
       π̂₁          0.9070    0.9500    0.9890
       π̂₂          0.9070    0.9550    0.9880
500    λ̂           0.9030    0.9600    0.9900
       π̂₁          0.9080    0.9630    0.9920
       π̂₂          0.8930    0.9540    0.9880
200    λ̂           0.9090    0.9520    0.9850
       π̂₁          0.9110    0.9590    0.9940
       π̂₂          0.8970    0.9550    0.9950
Table 7. Comparison of standardized bias (SBias) of the simulated data. True values λ = 5, π₁ = 0.4, π₂ = 0.1, and k = 3.
n      Parameter   ZkIP       ZIP        Poisson
2000   λ̂           <0.0001    −0.0707    −0.4397
       π̂₁          −0.0012    −0.0072    —
       π̂₂          0.0044     —          —
1000   λ̂           0.0005     −0.0698    −0.4399
       π̂₁          0.0008     −0.0052    —
       π̂₂          −0.0062    —          —
500    λ̂           −0.0010    −0.0712    −0.4399
       π̂₁          −0.0013    −0.0073    —
       π̂₂          −0.0042    —          —
200    λ̂           <0.0001    −0.0700    −0.4397
       π̂₁          <0.0001    −0.0061    —
       π̂₂          −0.0096    —          —
Table 8. Comparison of standardized MSE (SMSE) of the simulated data. True values λ = 5, π₁ = 0.4, π₂ = 0.1, and k = 3.
n      Parameter   ZkIP      ZIP       Poisson
2000   λ̂           0.0002    0.0052    0.1935
       π̂₁          0.0007    0.0008    —
       π̂₂          0.0094    —         —
1000   λ̂           0.0005    0.0052    0.1938
       π̂₁          0.0016    0.0016    —
       π̂₂          0.0188    —         —
500    λ̂           0.0009    0.0057    0.1942
       π̂₁          0.0031    0.0031    —
       π̂₂          0.0352    —         —
200    λ̂           0.0024    0.0066    0.1950
       π̂₁          0.0073    0.0074    —
       π̂₂          0.0937    —         —
Table 9. Comparison of coverage probabilities of the simulated data. True values λ = 5, π₁ = 0.4, π₂ = 0.1, and k = 3.
n      Parameter   90%       95%       99%
2000   λ̂           0.9010    0.9560    0.9930
       π̂₁          0.9110    0.9500    0.9920
       π̂₂          0.9020    0.9510    0.9920
1000   λ̂           0.9100    0.9500    0.9880
       π̂₁          0.8990    0.9510    0.9870
       π̂₂          0.9150    0.9610    0.9870
500    λ̂           0.9080    0.9540    0.9900
       π̂₁          0.8950    0.9540    0.9920
       π̂₂          0.9260    0.9680    0.9900
200    λ̂           0.9061    0.9566    0.9899
       π̂₁          0.9162    0.9626    0.9949
       π̂₂          0.9263    0.9636    0.9869
Table 10. Parameter estimates (standard errors in parentheses) for the sunburn data.
Parameter    ZOIP         ZIP          Poisson
λ̂            2.1415 *     1.4868 *     0.6906 *
             (0.0739)     (0.0357)     (0.0133)
π̂₁           0.6096 *     0.5355 *     —
             (0.0093)     (0.0104)
π̂₂           0.1273 *     —            —
             (0.0167)
−2 log L     8976.41      9132.19      10,109.35
AIC          8982.41      9136.19      10,111.35
BIC          9026.05      9148.73      10,125.89
* These estimates are significant.
Table 11. Observed and expected frequencies of the sunburn data.
Count   Observed Frequency   ZOIP       ZIP        Poisson
0       2509                 2508.87    2508.94    1963.53
1       758                  757.90     611.63     1355.98
2       374                  277.61     454.67     468.20
3       127                  198.17     225.33     107.78
4       40                   106.09     83.75      18.61
5       47                   45.44      24.90      2.57
6       27                   16.22      6.17       0.30
7       19                   4.96       1.31       0.03
8       16                   1.33       0.24       0.003
ABE                          274.92     445.56     1384.36
χ²                           309.22     1462.95    117,569.90
Table 12. Parameter estimates (standard errors in parentheses) for the off days data.
Parameter    ZkIP         ZIP          Poisson
λ̂            2.0674 *     1.8569 *     0.3662 *
             (0.1075)     (0.0869)     (0.0120)
π̂₁           0.8204 *     0.8028 *     —
             (0.0075)     (0.0089)
π̂₂           0.0755 *     —            —
             (0.0121)
−2 log L     3321.44      3489.73      4660.72
AIC          3327.44      3493.73      4662.72
BIC          3337.13      3505.42      4676.41
* These estimates are significant.
Table 13. Observed and expected frequencies of the off days dataset.
Count   Observed Frequency   ZkIP       ZIP       Poisson
0       2124                 2123.94    2123.99   1766.75
1       84                   69.38      145.70    646.93
2       264                  264.01     135.27    118.44
3       25                   49.42      83.73     14.46
4       23                   25.54      38.87     1.32
5       14                   10.56      14.43     0.10
>5      14                   3.64       4.47      0.06
ABE                          55.54      274.99    1125.86
χ²                           46.02      216.65    36,207.74