1. Introduction
Data that count the number of occurrences of certain events, or the number of subjects or items that fall into certain categories, arise in many scientific investigations, including medical and social science research. The most commonly used models for such data are based on the Poisson probability distribution. The Poisson distribution possesses the equi-dispersion property: its mean and variance are equal. In real-life examples, however, the data are most often over-dispersed or under-dispersed, with over-dispersion being the more common of the two. In the absence of equi-dispersion, the most commonly used alternative to the Poisson distribution is the negative binomial distribution.
There could be several reasons that lead to over-dispersion in the data. A primary cause of over-dispersion in the counts is an inflated number of zeros in excess of the number expected under the Poisson distribution. In such cases, an appropriate model is the zero-inflated Poisson (ZIP). The ZIP models are extensively studied in the literature. The earliest paper on the ZIP model was by Cohen [
5]. In a seminal paper, Lambert [
3] introduced and studied the ZIP regression model using the Expectation–Maximization (EM) approach [
6]. Lambert [
3] applied the ZIP model to count data where the response variable was the number of defects in a manufacturing process along with covariates masking, soldering, etc. The ZIP model with random effects has been studied by Min and Agresti [
7] and Yau and Lee [
8]. Ghosh et al. [
9] explored the Bayesian approach for small to moderate sample sizes. The ZIP models using the Bayesian approach for spatial data were studied by Agarwal et al. [
10]. Furthermore, ZIP models for censored data were studied by Saffari and Adnan [
11], Yang and Simpson [
12], and Nguyen and Dupuy [
13]. Altun [
14] introduced a zero-inflated Poisson–Lindley regression model, while, recently, Bakouch et al. [
15] introduced the COS-Poisson distribution and the corresponding regression model for zero-inflated count data.
In health science research, zero-inflated count models have been shown to perform better than traditional count models [
16,
17]. The ZIP models have been applied across a wide spectrum of academic disciplines, including biology [
18], ecology [
19], psychology [
20,
21], and education [
22]. The ZIP models have also been studied in economics [
23,
24,
25]. In industry, the ZIP models have been applied in manufacturing [
3,
9], transportation [
26,
27], and insurance [
28]. Recently, Motalebi et al. [
29] applied the ZIP models for monitoring social networks. Good reviews and applications of ZIP models are given in Bohning and Seidel [
30] and Ridout et al. [
18]. The other ZIP-like models are zero-inflated negative binomial (ZINB), zero-inflated geometric (ZIG), and zero-inflated Binomial (ZIB). For example, Hall [
31] illustrated the use of ZIP and ZIB in horticulture.
The zero-inflated models can be fitted easily using available packages in SAS and R. In SAS, two procedures deal with zero-inflated models: the finite mixture model (FMM) procedure and the count regression (COUNTREG) procedure. Both provide estimates, standard errors, and AIC values similar to those reported by the glm procedure. The high-performance count regression procedure (HPCOUNTREG) in SAS can handle big data. In R, the package 'pscl' includes functions for handling zero-inflated discrete distributions with various link options. Inflated count models are also available in the 'VGAM' package.
In addition to zero, some data sets may have an inflated count of an additional value k > 0 due to multiple effects, including the design of the study. Research questionnaire studies are examples in which zero- and k-inflated count data sets typically result either from the way the questions were asked or from the way the responses were provided. For example, one study investigated the frequency of pap smear tests in women over the last six years. The survey had a large number of women who never had a pap smear and many who had pap smears on an annual basis. Thus, the survey resulted in large frequencies of zero and six. The other source of inflation is the nature of the response. For example, Arora et al. [
32] considered a study that counts the number of days a subject exercised per week. The reply of the non-exercising subjects was zero, and the reply of regularly exercising subjects was 5. Hence, the data have 0 and 5 counts inflated. Lin and Tsai [
1] describe a survey where adults were asked about the number of cigarettes they consume daily. The responses tend to be none or a pack. Since a pack consists of 20 cigarettes, the data result in inflated frequencies for 0 and 20. Lin and Tsai [
1] proposed a zero- and
k-inflated Poisson regression model (ZkIP) to analyze such data. Sheth et al. [
2] also introduced two forms of ZkIP models, known as doubly inflated Poisson (DIP) models. In this article, we study the ZkIP form given by Lin and Tsai [
1]. It is the same as the second DIP model proposed by Sheth et al. [
2].
The ZkIP is a finite mixture model with three components. The first is degenerate at zero with probability $\pi_1$. The second is degenerate at k with probability $\pi_2$, and the third is Poisson with mean $\lambda$ with probability $1 - \pi_1 - \pi_2$. The mixture leads to heterogeneity in the data, which is not captured by the Poisson model. These components can also be interpreted as three groups of the population. A special case of the ZkIP model is the zero- and one-inflated Poisson model (ZOIP). Zhang et al. [
33] studied the properties and inference on the parameters of the ZOIP distribution without covariates. The inference of ZOIP without covariates was described by Alshkaki [
34]. A Bayesian approach for the ZOIP model was examined by Tang et al. [
35]. The ZOIP regression model using maximum likelihood and the Bayesian approach was also studied by [
36]. The zero- and one-inflated count data using truncated Poisson was studied by [
37]. Lin and Tsai [
1] introduced the ZkIP regression model and used the nonlinear optimization method to obtain the maximum likelihood (ML) estimates and standard errors. The ZkIP has also been studied by Finkelman et al. [
38] for grouped psychological data. In this article, we study the ZkIP model using the Expectation–Maximization (EM) approach. Furthermore, we pursue the method outlined by Louis [
4] to obtain the standard errors for the EM parameter estimates.
The outline of the article is as follows: we present the derivation of the zero- and
k-inflated Poisson (ZkIP) distribution in
Section 2.
Section 3 contains the corresponding ZkIP regression model that incorporates observed covariates on each subject. We describe the EM algorithm steps to estimate the regression and mixing parameters in
Section 3.1. Computational details of the standard errors for the regression estimates using the method described by Louis [
4] are presented in
Section 3.2. The criteria for model selection and goodness of fit are described in
Section 4. We illustrate our methods on two real-life data sets in
Section 5, including identification of significant covariates.
2. Zero- and k-Inflated Poisson Distribution
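The Poisson distribution is widely used to model nonnegative integer count data. The zero-inflated Poisson (ZIP) distribution is a popular model for count data containing excessive zeros. The ZIP distribution is a mixture of a degenerate distribution at zero with probability $\pi_1$ and a Poisson distribution with probability $1 - \pi_1$. If, in addition, another count value $k > 0$ in the data is also inflated, a suitable model is the Poisson distribution mixed with two point masses $\pi_1$ and $\pi_2$ at 0 and k, respectively. The probability mass function of a random variable Y with this mixture distribution is given by
$$P(Y = y) = \begin{cases} \pi_1 + \pi_3 e^{-\lambda}, & y = 0, \\ \pi_2 + \pi_3 \dfrac{e^{-\lambda}\lambda^{k}}{k!}, & y = k, \\ \pi_3 \dfrac{e^{-\lambda}\lambda^{y}}{y!}, & y = 1, 2, \ldots;\ y \neq k, \end{cases} \qquad (1)$$
where $0 \le \pi_1 < 1$, $0 \le \pi_2 < 1$, $\pi_3 = 1 - \pi_1 - \pi_2 > 0$, and $\lambda > 0$. The distribution (1) is known as the zero- and k-inflated Poisson (ZkIP) distribution [1]. The moment generating function of the ZkIP distribution is
$$M_Y(t) = \pi_1 + \pi_2 e^{kt} + \pi_3 \exp\{\lambda(e^{t} - 1)\},$$
and the probability generating function is $G_Y(s) = \pi_1 + \pi_2 s^{k} + \pi_3 \exp\{\lambda(s - 1)\}$. Using $G_Y(s)$, it is easy to show that the mean and variance of the ZkIP distribution are
$$E(Y) = k\pi_2 + \lambda\pi_3, \qquad \mathrm{Var}(Y) = k^{2}\pi_2 + \lambda(1 + \lambda)\pi_3 - (k\pi_2 + \lambda\pi_3)^{2}.$$
Since the ZkIP distribution is essentially a mixture of Poisson and two degenerate distributions at zero and k with probabilities $\pi_1$ and $\pi_2$, respectively, it reduces to ZIP when $\pi_2 = 0$ and becomes the Poisson distribution if $\pi_1 = \pi_2 = 0$. The following stochastic representation is instrumental in elucidating the properties of the ZkIP distribution. Consider a latent variable $Z = (Z_1, Z_2, Z_3)$ distributed as multinomial with parameters $(1; \pi_1, \pi_2, \pi_3)$. Note that $Z$ takes values $(1, 0, 0)$ with probability $\pi_1$, $(0, 1, 0)$ with probability $\pi_2$, and $(0, 0, 1)$ with probability $\pi_3$. That is,
$$P(Z = z) = \pi_1^{z_1}\pi_2^{z_2}\pi_3^{z_3}, \qquad z_j \in \{0, 1\},\ z_1 + z_2 + z_3 = 1. \qquad (2)$$
Furthermore, let us assume the conditional distribution of $Y$ given $Z = z$ is
$$P(Y = y \mid Z = z) = [I(y = 0)]^{z_1}\,[I(y = k)]^{z_2}\left[\frac{e^{-\lambda}\lambda^{y}}{y!}\right]^{z_3}, \qquad (3)$$
where $I(\cdot)$ is the indicator function and $0^{0}$ is read as 1. Thus, the joint distribution of $(Y, Z)$ obtained by multiplying (2) and (3) is
$$P(Y = y, Z = z) = [\pi_1 I(y = 0)]^{z_1}\,[\pi_2 I(y = k)]^{z_2}\left[\pi_3\frac{e^{-\lambda}\lambda^{y}}{y!}\right]^{z_3}. \qquad (4)$$
The marginal of Y can be obtained from (4) by summing over the three possible values of $Z$. Thus, we get
$$P(Y = 0) = \pi_1 + \pi_3 e^{-\lambda}, \qquad P(Y = k) = \pi_2 + \pi_3\frac{e^{-\lambda}\lambda^{k}}{k!},$$
and
$$P(Y = y) = \pi_3\frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y \neq 0, k,$$
which is equivalent to the ZkIP distribution defined by (1). Furthermore, the posterior distribution of $Z$ given $Y = y$ can be summarized as in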
Table 1.
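As a quick numerical check of the mixture structure described above, the following minimal Python sketch evaluates the ZkIP probability mass function and its first two moments. The function names and the symbols `pi1`, `pi2`, `lam` (point masses at 0 and k, and the Poisson mean) are illustrative assumptions rather than notation confirmed by the source; the moments are the standard ones for a three-component mixture.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of y events with mean lam (lam > 0)."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def zkip_pmf(y, k, pi1, pi2, lam):
    """ZkIP probability: point masses pi1 at 0 and pi2 at k, plus a Poisson part."""
    pi3 = 1.0 - pi1 - pi2          # weight of the Poisson component
    p = pi3 * poisson_pmf(y, lam)
    if y == 0:
        p += pi1                   # extra mass at zero
    if y == k:
        p += pi2                   # extra mass at k
    return p


def zkip_mean_var(k, pi1, pi2, lam):
    """Mean and variance of the three-component mixture."""
    pi3 = 1.0 - pi1 - pi2
    mean = k * pi2 + lam * pi3
    second_moment = k ** 2 * pi2 + lam * (1.0 + lam) * pi3
    return mean, second_moment - mean ** 2
```

For instance, `zkip_mean_var(6, 0.1, 0.2, 2.5)` can be compared against the mean and variance obtained by summing `zkip_pmf` over a long range of counts.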
In
Section 3, we build a ZkIP regression model using (
1), and, in
Section 3.1, we use the conditional probabilities in
Table 1 to develop the EM algorithm for estimation of the ZkIP parameters from the data.
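The conditional probabilities that drive the EM algorithm follow from Bayes' rule applied to the three-component mixture. Table 1 is not reproduced in this text, so the sketch below is an assumption-based illustration of what such posterior membership probabilities look like; the function name and the symbols `pi1`, `pi2`, `lam` are hypothetical.

```python
import math


def posterior_membership(y, k, pi1, pi2, lam):
    """Posterior probabilities that an observed count y came from the
    degenerate-at-0, degenerate-at-k, or Poisson component."""
    pi3 = 1.0 - pi1 - pi2
    # Joint weight of (Poisson component, y)
    pois = pi3 * math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    # The degenerate components can only produce y = 0 or y = k
    w1 = pi1 if y == 0 else 0.0
    w2 = pi2 if y == k else 0.0
    total = w1 + w2 + pois
    return w1 / total, w2 / total, pois / total
```

A count other than 0 or k is attributed to the Poisson component with probability one, while a zero is split between the first and third components.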
3. Zero- and k-Inflated Poisson Regression Model
Let $Y = (Y_1, \ldots, Y_n)'$ be a vector of n independent count responses. We assume that the number of $Y_i$'s that are equal to 0 (or k) is high and that, corresponding to each $Y_i$, a vector $x_i$ of covariates has been observed. A reasonable model for the distribution of each $Y_i$ is given by (1) with different parameters $\lambda_i$ but the same mixing parameters $\pi_1$ and $\pi_2$. In this case, the likelihood function of the observed data is
$$L = \prod_{i=1}^{n}\left[\pi_1 + \pi_3 e^{-\lambda_i}\right]^{I(y_i = 0)}\left[\pi_2 + \pi_3\frac{e^{-\lambda_i}\lambda_i^{k}}{k!}\right]^{I(y_i = k)}\left[\pi_3\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}\right]^{1 - I(y_i = 0) - I(y_i = k)}, \qquad (5)$$
where $\pi_3 = 1 - \pi_1 - \pi_2$ and $\lambda_i > 0$ for $i = 1, \ldots, n$. To incorporate the covariates into the model, we follow the standard generalized linear model (GLM) framework for the multinomial distribution. The three mixing distributions can be viewed as three nominal categories (degenerate(0), degenerate(k), and Poisson) with probabilities $\pi_1$, $\pi_2$, and $\pi_3$, respectively. Following the GLM baseline category logit models for the multinomial, we re-parametrize and set
$$\log\left(\frac{\pi_1}{\pi_3}\right) = \gamma, \qquad \log\left(\frac{\pi_2}{\pi_3}\right) = \delta. \qquad (6)$$
We treat the Poisson distribution as the baseline category, leading to two equations for the other two categories. As in loglinear models, the ZkIP regression model assumes the Poisson parameter $\lambda_i$ is a loglinear function of the covariates, and it is given by
$$\log \lambda_i = x_i'\beta, \qquad i = 1, \ldots, n,$$
where $\beta$ is a $p$-dimensional unknown regression parameter vector. For simplicity, we assume that the parameters $\gamma$ and $\delta$ are constants. The generalization to the case where $\gamma$ and $\delta$ are functions of the covariates is straightforward. Thus, the parameters of our ZkIP regression model are $\gamma$, $\delta$, and $\beta$. In the next section, we consider estimating the parameters $\gamma$, $\delta$, and $\beta$
using the observed data.
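The baseline-category logit reparametrization and the loglinear link can be illustrated with a few lines of Python. This is a sketch under assumed notation: `gamma` and `delta` denote the log-odds of the two degenerate components relative to the Poisson baseline, and `beta` the regression coefficients; the function names are hypothetical.

```python
import math


def to_logits(pi1, pi2):
    """Map mixing probabilities to baseline-category log-odds (Poisson as baseline)."""
    pi3 = 1.0 - pi1 - pi2
    return math.log(pi1 / pi3), math.log(pi2 / pi3)


def to_probs(gamma, delta):
    """Inverse map: log-odds back to the three mixing probabilities."""
    denom = 1.0 + math.exp(gamma) + math.exp(delta)
    return math.exp(gamma) / denom, math.exp(delta) / denom, 1.0 / denom


def poisson_mean(x, beta):
    """Loglinear link: lambda_i = exp(x_i' beta)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
```

Because `gamma` and `delta` range over the whole real line, an optimizer can work on them without constraints, while `to_probs` always returns valid probabilities.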
3.1. Estimation of the Regression Parameters
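In this section, we study methods for estimating the parameters of the ZkIP regression model. Two popular approaches are direct maximum likelihood (ML) and the Expectation–Maximization (EM) algorithm. The ML technique involves optimizing the likelihood or the log-likelihood function with respect to the unknown parameters $\gamma$, $\delta$, and $\beta$. Substituting the reparametrizations (6) in the likelihood function (5) and taking logarithms, we get
$$\ell(\gamma, \delta, \beta) = \sum_{y_i = 0}\log\left(e^{\gamma} + e^{-\lambda_i}\right) + \sum_{y_i = k}\log\left(e^{\delta} + \frac{e^{-\lambda_i}\lambda_i^{k}}{k!}\right) + \sum_{y_i \notin \{0, k\}}\left(y_i\log\lambda_i - \lambda_i - \log y_i!\right) - n\log\left(1 + e^{\gamma} + e^{\delta}\right), \qquad (7)$$
where $\lambda_i = \exp(x_i'\beta)$. The ML estimates can be obtained by maximizing the log-likelihood (7) directly with respect to the parameters or by taking the partial derivatives and solving the three score equations:
$$\frac{\partial \ell}{\partial \gamma} = 0, \qquad \frac{\partial \ell}{\partial \delta} = 0, \qquad \frac{\partial \ell}{\partial \beta} = 0. \qquad (8)$$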
Equation (
8) can be solved iteratively using the Newton–Raphson method. In theory, this seems fine, but, in practice, there are convergence issues with ML estimation. An alternative to ML and a popular method for parameter estimation is the Expectation–Maximization (EM) approach, used by Lambert [
3] for the ZIP model. Here, we extend her ideas for the ZkIP model.
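For readers who want to try direct ML, the observed-data log-likelihood can be coded in a few lines and passed to any general-purpose optimizer. The sketch below is an illustration under assumed notation (`gamma`, `delta` for the log-odds of the two point masses against the Poisson baseline, `beta` for the regression coefficients, and `X` as a list of covariate rows that include the intercept); it is not the authors' implementation.

```python
import math


def zkip_loglik(gamma, delta, beta, y, X, k):
    """Observed-data log-likelihood of a ZkIP regression model with
    constant mixing log-odds gamma, delta and log(lambda_i) = x_i' beta."""
    log_denom = math.log1p(math.exp(gamma) + math.exp(delta))
    total = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))          # linear predictor
        lam = math.exp(eta)
        log_pois = yi * eta - lam - math.lgamma(yi + 1)     # log Poisson pmf
        if yi == 0:
            total += math.log(math.exp(gamma) + math.exp(log_pois))
        elif yi == k:
            total += math.log(math.exp(delta) + math.exp(log_pois))
        else:
            total += log_pois
        total -= log_denom   # common normalizing term of the mixing weights
    return total
```

Maximizing this function over `(gamma, delta, beta)` gives the ML estimates; the EM route described next avoids optimizing it directly.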
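The EM approach treats the observed data $y = (y_1, \ldots, y_n)'$ as part of a complete data set that includes a latent vector $z = (z_1, \ldots, z_n)$, which is regarded as missing. Here, each $z_i = (z_{i1}, z_{i2}, z_{i3})$ is a three-component vector with a probability distribution given by (2), and the conditional distribution of $y_i$ given $z_i$ is given by (3). Then, the joint distribution of the observed and missing data is given by
$$P(y_i, z_i) = [\pi_1 I(y_i = 0)]^{z_{i1}}\,[\pi_2 I(y_i = k)]^{z_{i2}}\,[\pi_3 f(y_i; \lambda_i)]^{z_{i3}},$$
where $f(y_i; \lambda_i)$ is the Poisson probability mass function with mean $\lambda_i$. Therefore, the complete data likelihood function of the ZkIP model is given by
$$L_c = \prod_{i=1}^{n}[\pi_1 I(y_i = 0)]^{z_{i1}}\,[\pi_2 I(y_i = k)]^{z_{i2}}\,[\pi_3 f(y_i; \lambda_i)]^{z_{i3}},$$
and the log-likelihood of the complete data, $\ell_c$, for the ZkIP model is
$$\ell_c = \sum_{i=1}^{n}\left[z_{i1}\log\pi_1 + z_{i2}\log\pi_2 + z_{i3}\log\pi_3 + z_{i3}\log f(y_i; \lambda_i)\right], \qquad (9)$$
since the indicator factors equal one on the support of the complete data. Using the reparametrization given in (6), we can write the log-likelihood of the complete data as
$$\ell_c(\gamma, \delta, \beta) = \sum_{i=1}^{n}\left[z_{i1}\gamma + z_{i2}\delta - \log\left(1 + e^{\gamma} + e^{\delta}\right)\right] + \sum_{i=1}^{n} z_{i3}\left(y_i x_i'\beta - e^{x_i'\beta} - \log y_i!\right). \qquad (10)$$
Note that, when $\pi_2 = 0$, the ZkIP reduces to the ZIP model. Thus, from (9), the log-likelihood of the ZIP for the complete data is
$$\sum_{i=1}^{n}\left[z_{i1}\log\pi_1 + (1 - z_{i1})\log(1 - \pi_1) + (1 - z_{i1})\log f(y_i; \lambda_i)\right].$$
From (10), the log-likelihood of the ZIP for the complete data can be written as
$$\sum_{i=1}^{n}\left[z_{i1}\gamma - \log\left(1 + e^{\gamma}\right)\right] + \sum_{i=1}^{n}(1 - z_{i1})\left(y_i x_i'\beta - e^{x_i'\beta} - \log y_i!\right). \qquad (11)$$
Lambert [3] used Equation (11) as the complete data log-likelihood for the ZIP model to get the EM estimates.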
We now describe in detail the EM algorithm for the ZkIP model. The first step in the EM algorithm involves selecting initial values for the unknown parameters. The choice of the initial values is important for the convergence of the algorithm: a poor choice could result in slow convergence or breakdown of the algorithm. We recommend using the proportions of zeros and k's from the observed data as initial values for the parameters $\pi_1$ and $\pi_2$, respectively. Then, we use the relations (6) to get initial values $\gamma^{(0)}$ and $\delta^{(0)}$ for the parameters $\gamma$ and $\delta$, respectively. The initial values of $\beta$ can be obtained by fitting a Poisson model to the data. The fitted coefficients of the covariates can then be used as the initial values $\beta^{(0)}$.
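The next step, the E-step, involves replacing the latent values $z_i$ by their expectations. We use the conditional expected values of $z_i$ given in Table 2 to generate the $z_i$'s. Note that Table 2 is a reparametrized version of Table 1.
We use Table 2 to estimate the missing values in the expectation step of the EM algorithm as follows:
$$\hat z_{i1} = \frac{e^{\gamma}}{e^{\gamma} + e^{-\lambda_i}}\,I(y_i = 0), \qquad \hat z_{i2} = \frac{e^{\delta}}{e^{\delta} + e^{-\lambda_i}\lambda_i^{k}/k!}\,I(y_i = k), \qquad \hat z_{i3} = 1 - \hat z_{i1} - \hat z_{i2}. \qquad (12)$$
For the maximization step in the EM algorithm, instead of maximizing the complete likelihood directly, we solve the score equations
$$\frac{\partial \ell_c}{\partial \theta} = 0, \qquad (13)$$
where $\theta = (\gamma, \delta, \beta')'$ and $\ell_c$ is defined in (10). In summary, the EM algorithm to estimate the parameters $\gamma$, $\delta$, and the regression parameter $\beta$ for the ZkIP regression model is as follows.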
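Select initial values $\gamma^{(0)}$, $\delta^{(0)}$, $\beta^{(0)}$ for the parameters $\gamma$, $\delta$, and $\beta$, respectively.
E-step: Estimate $z_i$, $i = 1, \ldots, n$, using Equation (12).
M-step: Solve the score Equation (13) and obtain updated estimates $\hat\gamma$, $\hat\delta$, $\hat\beta$.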
Repeat the E-step and the M-step until the parameter estimates converge.
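The steps above can be sketched in Python for the simplest case. The example below is deliberately restricted to an intercept-only model (a single Poisson mean `lam`, no covariates), which is an assumption made here so that the M-step has a closed form; the regression model in the text would instead need a weighted Poisson fit in the M-step. The function name and the parametrization in terms of `pi1`, `pi2`, `lam` are illustrative, not taken from the source, and the sketch assumes k > 0 and that some observations differ from 0 and k.

```python
import math


def em_zkip_intercept_only(y, k, max_iter=1000, tol=1e-10):
    """EM estimates (pi1, pi2, lam) for a ZkIP model without covariates."""
    n = len(y)
    n0 = sum(1 for v in y if v == 0)                      # observed zeros
    nk = sum(1 for v in y if v == k)                      # observed k's
    sum_other = sum(v for v in y if v != 0 and v != k)

    # Initial values: observed proportions of 0 and k, and the sample mean
    pi1, pi2, lam = n0 / n, nk / n, sum(y) / n

    for _ in range(max_iter):
        pi3 = 1.0 - pi1 - pi2
        # E-step: posterior probability that a 0 (or a k) is structural
        p0 = pi3 * math.exp(-lam)
        pk = pi3 * math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        w0 = pi1 / (pi1 + p0)
        wk = pi2 / (pi2 + pk)
        s1, s2 = w0 * n0, wk * nk          # expected component sizes
        s3 = n - s1 - s2

        # M-step: closed-form updates in the intercept-only case
        new_pi1, new_pi2 = s1 / n, s2 / n
        new_lam = ((1.0 - wk) * k * nk + sum_other) / s3

        change = max(abs(new_pi1 - pi1), abs(new_pi2 - pi2), abs(new_lam - lam))
        pi1, pi2, lam = new_pi1, new_pi2, new_lam
        if change < tol:
            break
    return pi1, pi2, lam
```

Working with the aggregated counts `n0`, `nk`, and `sum_other` keeps each iteration constant-time, which is possible only because no covariates are involved.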
In the next section, we will discuss how to obtain standard errors of the EM estimates.
3.2. Standard Errors for EM Estimates
The most commonly used method to get the standard errors in the mixture models is to compute the matrix of partial derivatives of the log-likelihood for the observed data, that is, to calculate the information matrix from the observed data. The optimization algorithms routinely output a numerically computed Hessian matrix for the functions that are being optimized. Lambert [
3] used this method for computing the standard errors for the ZIP regression model. Lin and Tsai [
1] used the Hessian matrix to get the standard errors for the ZkIP model without actually computing second-order partial derivatives of the log-likelihood.
However, for the EM framework, an appropriate and easier approach for obtaining the standard errors is the method outlined by Louis [
4]. The method is based on the complete and missing data log-likelihoods. The relation between the likelihood of the complete, observed and missing data is given by
where
and
stand for the observed and missing data, respectively. Taking log on both sides of (
14), we get
We can see from Equation (
15) that the information matrices for the complete, observed, and missing data satisfy the following equation:
where
is obtained using (
15) as
Equation (
16) can be re-written as
Since the right-hand side of Equation (
18) depends on the missing data, Louis [
4] has recommended taking the conditional expected value of the missing data given the observed data. Therefore, we have
Thus, the estimate of the observed information matrix is given by
The elements of the expected information matrix
are given in
Appendix A.1. Using (
7), (
10), and (
15), the log-likelihood of the missing data for the ZkIP regression model is given by
The elements of the matrix
are the negative of the expected value of second-order derivatives of (
21), and these are given in
Appendix A.2. Using these second-order derivatives, we can compute
given in Equation (
20). The square root of diagonal elements of
gives the standard errors of the EM estimates.
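The chain of identities behind this approach can be summarized compactly. The display below is a sketch of the standard missing-information decomposition in generic notation; the symbols $\ell_c$, $\ell_o$, $\ell_m$ (complete, observed, and missing-data log-likelihoods) and $I_c$, $I_o$, $I_m$ (the corresponding information matrices) are assumptions made here and need not match the numbered equations of the source term by term.

```latex
\begin{align*}
L_c(\theta; y, z) &= L_o(\theta; y)\, L_m(\theta; z \mid y), \\
\ell_c(\theta) &= \ell_o(\theta) + \ell_m(\theta), \\
I_o(\theta) &= I_c(\theta) - I_m(\theta),
  \qquad I_{\bullet}(\theta) = -\frac{\partial^2 \ell_{\bullet}(\theta)}{\partial\theta\,\partial\theta'}, \\
\hat I_o &= E\left[ I_c(\theta) \mid y \right]_{\theta = \hat\theta}
          - E\left[ I_m(\theta) \mid y \right]_{\theta = \hat\theta}, \\
\operatorname{se}(\hat\theta_j) &= \sqrt{\left[ \hat I_o^{-1} \right]_{jj}} .
\end{align*}
```

In words, the information available from the observed data is the complete-data information minus the information lost because the component labels are unobserved.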
4. Model Selection and Model Fit
In statistical inference, estimation of the parameters is usually followed by testing the significance of the parameters and selecting the best model for the data. Hence, in this section, we discuss the hypothesis testing to the significance of inflation at zero and
k—in other words, whether ZkIP significantly fits the data better than the ZIP or the Poisson model. There are various criteria to select the best model. We use the Akaike Information Criterion (AIC) and the likelihood-ratio method to arrive at the best model that fits the data. These details will be illustrated with a couple of real-life data analyses in
Section 5.
4.1. Hypothesis Testing
Here, we discuss hypothesis testing to determine significant parameters and covariates. In the ZkIP model, the parameters and represent the proportion of observations that come from degenerate distributions, and the parameter determines the effects of the covariates in the model. Let denote the EM estimates of these parameters. Assume that the true value is in the interior of the parameter space, that is, and . Under usual regularity conditions, is asymptotically normal with mean and covariance matrix is given by . We can use this result to construct a Wald’s test for testing the hypotheses that a specified proportion of observations come from a degenerate distribution at k or a specified proportion come from the degenerate distribution at zero. Similarly, the hypothesis could be tested for significance using Wald’s test.
The FMM and Countreg procedures in SAS use the parameters and and test for the hypothesis . This hypothesis is equivalent to testing , which can be done using Wald’s test because and are values in the interior of the parameter space.
As we discussed in Section 2, the ZkIP, ZIP, and Poisson models form a group of three nested models in the sense that Poisson is a special case of ZIP, which is a special case of ZkIP. Thus, one could use the likelihood ratio test (LRT) to test the significance of the nested models, that is, whether the ZkIP model could be replaced by the ZIP model or whether the ZIP model could be replaced by the Poisson model. We need to test the null hypothesis $H_0: \pi_2 = 0$ to see whether the inflated frequency at count k is significant. Failure to reject the null hypothesis implies that we can replace the ZkIP model with the ZIP model. Similarly, failure to reject $H_0: \pi_1 = 0$ implies that inflation at zero is insignificant, and the ZIP model can be replaced by the Poisson model. Since $\pi_2 \ge 0$, the null hypothesis $H_0: \pi_2 = 0$ corresponds to testing a parameter value on the boundary of the parameter space. Therefore, the standard asymptotic theory for the likelihood ratio statistic is not applicable. The asymptotic distribution of the likelihood ratio statistic is not a $\chi^2$ distribution, but a mixture of $\chi^2$ distributions [39, 40]. In fact, the test statistic $-2\log\Lambda$, where $\Lambda$ is the likelihood ratio, approaches a 50:50 mixture of $\chi^2_0$ (a point mass at zero) and $\chi^2_1$.
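A p-value under the 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom is easy to compute with the standard library. In the sketch below, the function names are hypothetical and the inputs are the maximized log-likelihoods of the null and alternative models; the chi-square tail probability with one degree of freedom is obtained from the complementary error function.

```python
import math


def chi2_1_sf(t):
    """Upper tail probability P(X > t) for a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))


def boundary_lrt_pvalue(loglik_null, loglik_alt):
    """LRT statistic and p-value under a 50:50 mixture of chi2_0 and chi2_1."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    if stat == 0.0:
        return stat, 1.0               # the point mass at zero carries all the tail
    return stat, 0.5 * chi2_1_sf(stat)
```

Compared with the usual chi-square reference distribution with one degree of freedom, the mixture halves the p-value for any positive statistic.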
4.2. Model Selection
We can use several criteria for selecting the appropriate model among the three competing models, Poisson, ZIP, and ZkIP. A popular criterion is the Akaike Information Criterion (AIC). The AIC was introduced by Akaike [41], and it is calculated as $\mathrm{AIC} = -2\ell + 2m$, where ℓ is the maximum value of the log-likelihood and m is the number of parameters for the model under consideration. The log-likelihood tends to increase as we move from a simpler model to a more complex one. The term $2m$ penalizes the complex model since it has more parameters than the simple model, which guards against overfitting. To select the best model, we use the minimum AIC criterion and apply Burnham and Anderson’s approach [42]. An AIC value is meaningful only in comparison with the AIC values of other models; it is the relative value, not the absolute value, that matters. The approach given in [42] is based on the AIC differences $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$, where $\mathrm{AIC}_i$ is the AIC of the i-th model and $\mathrm{AIC}_{\min}$ is the minimum AIC among the models in the study. Lower values of $\Delta_i$ imply that there is not much difference between model i and the model with minimum AIC, while higher values of $\Delta_i$ indicate that the model with minimum AIC is better than model i (Table 3).
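The AIC and the AIC differences are simple to compute once the maximized log-likelihoods are available. The helper names below are hypothetical; the usage example applies them to the AIC values reported for the pap smear data in Section 5.1.

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * n_params


def aic_differences(aic_by_model):
    """Difference between each model's AIC and the smallest AIC in the set."""
    best = min(aic_by_model.values())
    return {name: value - best for name, value in aic_by_model.items()}


# AIC values reported in Section 5.1 for the pap smear data
pap_smear = {"ZkIP": 46523.89, "ZIP": 52205.70, "Poisson": 56061.88}
deltas = aic_differences(pap_smear)
```

Here `deltas` holds a difference of zero for the ZkIP model and differences of roughly 5682 and 9538 for the ZIP and Poisson models.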
4.3. Goodness of Fit
For count data, the most commonly used statistic for testing goodness of fit is the Pearson chi-square statistic $\chi^2 = \sum_{i=1}^{c}(O_i - E_i)^2/E_i$, where $O_i$ is the observed frequency and $E_i$ is the expected frequency of the i-th category, and c is the total number of categories. Asymptotically, the $\chi^2$
statistic follows a chi-square distribution with
degrees of freedom. The test is not the best when there are inflated frequencies. An alternative and simple measure for checking the goodness of fit among competing models is the sum of Absolute Error (ABE), which is defined as
$$\mathrm{ABE} = \sum_{i=1}^{c}\left|O_i - E_i\right|,$$
where $O_i$ and $E_i$ are the observed and expected frequencies of the i-th category. The model that has the minimum sum of ABE has the least deviation between the observed and expected frequencies. Hence, the model with the minimum error fits the data best.
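Both goodness-of-fit measures reduce to one-line sums over the observed and expected frequencies. The function names below are hypothetical, and the frequencies in the usage example are made up purely for illustration.

```python
def pearson_chi_square(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def sum_absolute_error(observed, expected):
    """Sum of absolute differences between observed and expected frequencies."""
    return sum(abs(o - e) for o, e in zip(observed, expected))


# Illustrative (made-up) frequencies for three categories
obs = [10, 20, 30]
exp_freq = [12.0, 18.0, 30.0]
chi2 = pearson_chi_square(obs, exp_freq)
abe = sum_absolute_error(obs, exp_freq)
```

Unlike the Pearson statistic, the absolute-error sum does not divide by the expected frequency, so sparse tail categories do not dominate the comparison.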
5. Applications
In this section, we illustrate the results presented in Section 3.1 and Section 3.2 on two real-life data sets. These data sets were obtained from the National Health Interview Survey (NHIS) conducted by the National Center for Health Statistics (NCHS). Since 1957, NCHS has been collecting and archiving data on US residents. The data are collected annually on various health topics, including immunizations, depression, hepatitis, cancer, tobacco use, and other variables related to health. For our illustrations, we took a subset of the data collected in the year 2015. We fit the zero- and k-inflated Poisson (ZkIP) model to both data sets and compare it to the zero-inflated Poisson (ZIP) and Poisson models. The first example illustrates a ZkIP model with inflations at 0 and $k = 6$, while the second example demonstrates a zero- and one-inflated Poisson (ZOIP) model with inflations at 0 and $k = 1$.
5.1. Pap Smear Data
Cervical cancer is a major concern for the female population. A common preventive and early detection screening procedure for cervical cancer is the pap smear test. In this example, the data consist of the number of pap smear tests taken in the last six years by females older than 18 years. The count variable represents responses to two questions in the survey: (1) Have you ever had a Pap smear or Pap test? and (2) How many Pap tests have you had in the last 6 years? If the reply to the first question is ‘No’, then the number of tests is recorded as zero, while, if the reply is ‘Yes’, then the number of tests is the reply to the second question. The data also include the age of the female respondent and her answer to the question, “Have you ever received an HPV shot or vaccine?”. Here, age is a continuous variable, whereas the response to the HPV shot/vaccine question is dichotomous. Both of these variables could be treated as covariates in the model.
There were 33,672 females interviewed in the survey, out of which about chose not to answer, or their response was not recorded. We performed a list-wise deletion to clean the data and ended up with a data set consisting of 12,014 independent observations. The mean number of the pap smear tests for the data thus obtained is and the variance is . The percentage (count) of females who never took a pap smear test was , and the percentage (count) of females who had one pap smear each year for a total of six in the last six years was . The proportions of zero and six in the data set are inflated, and both of these proportions are more than what we would expect under a Poisson model. Therefore, an appropriate model for these data is the zero- and six-inflated Poisson model, that is, the ZkIP model with $k = 6$.
Using the methods described in
Section 3.1 and
Section 3.2, we fitted ZkIP, ZIP, and the Poisson models for this pap smear data. We tested the significance of age and
HPV shot covariates in the models using Wald’s test. The variable age was not significant in the ZkIP and ZIP models. Thus, age was removed in subsequent analysis, and we reran the models with only
HPV shot as the covariate.
The regression parameter is significant for all the models at
. The estimates obtained by the EM algorithm and the corresponding standard errors for the EM estimates described in
Section 3.2 are presented in
Table 4. For the ZkIP model, the mixing parameter estimates $\hat\pi_1$ and $\hat\pi_2$ imply that about 12.6% of the observations come from the degenerate distribution at zero and about 26% come from the degenerate distribution at six. The table also reports the AIC value and the maximum value of the log-likelihood function for each model. The AIC values of the ZkIP, ZIP, and Poisson models are 46,523.89, 52,205.70, and 56,061.88, respectively. The ZkIP model has the minimum AIC, and the AIC differences $\Delta_{\mathrm{ZIP}} = 5681.81$ and $\Delta_{\mathrm{Poisson}} = 9537.99$ are greater than 5000 and 9000, respectively. Thus, according to Table 3, the empirical support for both the Poisson and ZIP models is “essentially none”. For these data, therefore, the ZkIP model with $k = 6$, which adds one more component degenerate at six, is a better model than the ZIP or Poisson model.
Recall that the three models Poisson, ZIP, and ZkIP, are nested models, and we could use the likelihood ratio criterion described in
Section 4 to decide whether the complex model could be reduced to the simpler model. The LRT statistic, which compares the Poisson model with the ZIP, is
and the
p-value is computed using the limiting distribution, which is a mixture of two
’s with equal weights, is less than
. This implies that the inflation at zero is significant, and the ZIP model is significantly better than the Poisson model. Similarly, we use LRT to compare ZkIP with the ZIP model. The value of the test statistic is
, which is again highly significant with a
p-value less than
. Hence, ZkIP is significantly better than the ZIP model.
Furthermore, we check the goodness-of-fit of the models by comparing the observed frequencies and the expected frequencies. The observed and predicted frequencies are in
Table 5.
Table 5 shows that the Poisson model has the highest sum of absolute error (ABE) and does not provide a good fit to the data. The error 5685.69 of the ZIP model is lower than that of the Poisson model 8086.16, while the sum of the absolute difference between the observed and expected frequency is minimum (1130.93) for the ZkIP model. Therefore, the ZkIP, which is able to capture inflated frequencies at both zero and 6, is a superior model for these data compared with ZIP and the Poisson model.
5.2. Emergency Room Data
The data for this example were taken from the NHIS 2015 database on children aged less than 18 years. The count variable is the number of visits a child made to an emergency room (ER) in a year. We chose age (0–17) and gender (Male/Female) as the covariates. We removed the cases where the response or the covariates were missing and ended up with a clean data set of n = 12,223 children. The average number of visits to the ER in our sample is , and the variance is . In the data, the count values 0 and 1 have frequencies 10,046 and 1466, respectively. These frequencies are high: they account for about 82.2% and 12.0% of the total sample, respectively.
We fit the zero- and one-inflated Poisson (ZOIP), zero-inflated Poisson (ZIP), and Poisson models to these data. The significance of the regression variables is tested using Wald’s test. In the first iteration, the gender variable was insignificant in all three models, so it was removed from the models. The analysis was then repeated with only age as the covariate. The model estimates and standard errors are presented in
Table 6.
The AIC values of the ZOIP, ZIP, and Poisson models are 15,481.24, 15,488.39, and 16,594.00, respectively. On calculating the AIC differences, we obtain $\Delta_{\mathrm{ZIP}} = 15{,}488.39 - 15{,}481.24 = 7.15$. The AIC difference between the ZIP and Poisson models is $16{,}594.00 - 15{,}488.39 = 1105.61$. Clearly, the ZIP model performs better than the Poisson model. Furthermore, the ZIP model has “considerably less” support than the ZOIP model (
Table 3). We also performed the likelihood ratio test for model selection. The LRT statistic for testing the Poisson model over ZIP is given by
, which is highly significant. The LRT statistic
shows that the ZOIP model is significantly better than the ZIP. Hence, both the AIC and LRT criteria show that ZOIP fits best for these data.
The observed and expected frequencies of the ZOIP, ZIP, and Poisson models are in
Table 7. The ZIP model is able to capture the inflation at count zero. However, the ZOIP model is able to capture the inflation at count zero and one as well. The conclusions are also supported by the sum of ABE measure. In conclusion, the ZOIP model gives a good fit to the observed data.
6. Discussion
In this article, we developed the Expectation–Maximization (EM) algorithm for the ZkIP model generalizing the results of seminal work by Lambert [
3]. The EM algorithm is a computationally simpler approach to get the estimates of the unknown model parameters. However, unlike Lambert [
3], we obtain the standard errors of the parameters using the approach developed specifically for the EM algorithm by Louis [
4], which we believe is the right approach. We demonstrate our methods on two real-life data sets, showing that, in count data, if there is inflation at the two points zero and k, then ZkIP outperforms the simpler ZIP and Poisson models according to the AIC and LRT criteria.
In our regression model, for simplicity, we assumed that only the Poisson parameter depends on the covariates; the obvious extension is to link the covariates to the inflation parameters $\pi_1$ and $\pi_2$ as well. In that case, Equation (6) becomes $\log(\pi_{1i}/\pi_{3i}) = u_i'\gamma$ and $\log(\pi_{2i}/\pi_{3i}) = v_i'\delta$. The covariate vectors $u_i$ and $v_i$ may or may not be the same. This will result in higher dimensionality of the information matrix. The variable selection methodologies in this article could be used to obtain a simpler model. Other possible extensions are obtained by replacing the Poisson distribution with the generalized Poisson or Conway–Maxwell Poisson (CMP) distribution. In particular, we could implement the EM algorithm for the ZkICMP model studied by [
32]. These extensions are currently our work in progress.