Generalized Poisson Hurdle Model for Count Data and Its Application in Ear Disease

For count data, though a zero-inflated model can work perfectly well with an excess of zeroes and the generalized Poisson model can tackle over- or under-dispersion, most models cannot simultaneously deal with both zero-inflated or zero-deflated data and over- or under-dispersion. Ear diseases are important in healthcare, and falls into this kind of count data. This paper introduces a generalized Poisson Hurdle model that work with count data of both too many/few zeroes and a sample variance not equal to the mean. To estimate parameters, we use the generalized method of moments. In addition, the asymptotic normality and efficiency of these estimators are established. Moreover, this model is applied to ear disease using data gained from the New South Wales Health Research Council in 1990. This model performs better than both the generalized Poisson model and the Hurdle model.


Introduction
Count data are common in various areas, such as public health, insurance, traffic, and epidemiology. The Poisson model and negative binomial model are usually applied to handle count data. When the sample variance is either larger or smaller than the sample mean, this means that it is over-dispersed or under-dispersed, respectively. The Poisson model and negative binomial model cannot handle over-or under-dispersed count data. Hence, the use of generalized Poisson distribution is proposed for over-or under-dispersed count data [1]. The generalized Poisson regression (GPR) model based on generalized Poisson distribution has been widely studied [2,3]. Additionally, the generalized Poisson Regression model was estimated by the maximum likelihood method and the method of moments [4]. Some measures, such as the Pearson chi-squared test, deviance, the likelihood ratio test, Akaike Information Criteria (AIC), and Bayesian Information Criteria (BIC), have been proposed for testing the goodness of fit of a model. Generalized Poisson Regression performs better than other regression models [5]. Zero inflated Waring distribution (ZIW) has been proposed to solve the problem of overdispersion in the data and the reduction of its mean [6].
For count data with an excess of zeros, zero-inflated regression models, such as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB), have been proposed [7,8]. The zero-inflated Poisson (ZIP) model deals with count data with an excess of zeros [9], while the zero-inflated negative binomial (ZINB) model deals with over-dispersed count data with excess zeros, for example, the zero-inflated generalized Poisson (ZIGP) model [10][11][12][13][14][15], and the Zero-inflated Bell regression model [16,17]. These models have been used to analyze the relationship between the number of mothers that receive antenatal care visits and other factors, such as maternal education, partner education level, age of mothers, religion of mothers, and wealth index. In particular, these models are compared using AIC and BIC. In addition, the Zero-Inflated Hierarchical Poisson Model has been applied to data with an excess of zeros to analyze maternal mortality data from 2010 to 2013 in health facilities in four regions of Ghana [18].
Although these models deal with zero-inflated data, they cannot simultaneously deal with both zero-deflated and over-/under-dispersed data. An alternative model that accounts for zero-inflated or zero-deflated is the Hurdle model [19]. This model specifies two processes that generate zero counts and positive counts. Feng compared the zeroinflated and Hurdle models to determine the differences and performances of both models in a simulation [20]. The extended negative binomial Hurdle (ENBH) model deals with zeroinflation in addition to under-dispersion in non-zero counts [21]. Bocci et al. generalized the usual Hurdle regression model by specifying a multiple inflated truncated negative binomial distribution for the positive responses and applied it to the tourism behavior of Italian residents [22]. Park and Kim proposed a tree-structured hierarchical model for count data of both excessive zeros and over-dispersion [23]. Hasanah et al. estimated a Hurdle model using the Bayesian method with the non-closed form of posterior distributions by specified non-information priors for parameters [24].
Parameter estimation is important. Estimation methods include the maximum likelihood (ML), the method of moments, the generalized method of moments (GMM), and Bayesian estimation. The GMM method is widely used in parameter estimation due to its good statistical properties. Chen and Cheng proposed a partially linear additive spatial error model (PLASEM), estimated its parameters using GMM, and derived consistency and asymptotic normality for some estimators [25]. Muris investigated missing data using GMM, derived a set of moment conditions, and obtained the efficiency [26]. For count data, Sarvi et al. studied the use of the GEE-based generalized Poisson Regression model for over-and under-dispersed clustered count data with excess zeros [27]. Mahpolah et al. used GMM to estimate a Poisson regression model; the result gained with the use of GMM is better than that gained by the use of ML [28]. Allo et al. estimated the parameters of the generalized Poisson regression model using GMM and applied it to diarrhea in infants in Pasuruan Regency, East Java [29]. Yogita and Kirtee estimated parameters of the zero-inflated model using a probability estimated method based on a moment estimator of the mean parameter [30].
Ear diseases are important in healthcare. For ear disease, Lee et al. used a metaanalysis method for the incidence of ear diseases caused by swimming without ear protection [31]. Sanchez et al. investigated the impact of swimming in water containing saline chloride on the occurrence of ear disease [32]. Subtil et al. investigated whether water precautions reduce the rate of ear diseases [33]. Sanchez et al. measured the impact of 4 weeks of daily swimming on rates of ear discharge with a tympanic membrane perforation and on the middle ear, and found swimming not to be associated with increased risk of ear disease [34]. Count data, for example in ear disease, usually contains too many or too few zeros and exhibits dispersion characteristics meanwhile. This paper proposes the generalized Poisson Hurdle model (GPHR), which simultaneously deals with count data that are both zero-inflated/zero-deflated and over-/under-dispersed, and utilizes GMM to estimate the parameters. Furthermore, the asymptotic normality and efficiency for GMM estimators are established for the GPHR model under certain conditions. As high-order moments are not easy to calculate, the bootstrap method is used to estimate the variance of the GMM estimator in real data. In its application to ear disease using the data gained from New South Wales Health Research Council in 1990, it can be seen that the GMM estimators perform better than the ML estimators of the GPHR model.
The paper is organized as follows. Section 2 introduces the generalized Poisson Hurdle model. Section 3 carries out GMM estimation and establishes its asymptotic normality and efficiency. Section 4 introduces the Nelder mean algorithm in detail. Section 5 discusses the application to real data in ear disease. Section 6 concludes the paper. Technical proofs are given in Appendix A.

Basic Model
For count data, the generalized Poisson regression (GPR) model performs better with over-dispersion or under-dispersion, while the Hurdle regression model performs better with zero excess data. This paper introduces the generalized Poisson Hurdle regression (GPHR) model to simultaneously deal with both over-or under-dispersion and zeroinflated data.

Generalized Poisson Regression
To handle data with over-or under-dispersion upon generalized Poisson distribution, two main versions of generalized Poisson distribution are used, denoted by GP 1 and GP 2 .
Suppose that random variable Y follows a GP 1 distribution [1]. The probability density function of Y is: where y = 0, 1, 2, . . . and λ 1 and λ 2 are unknown parameters. Another example is generalized Poisson distribution GP 2 [35]. For random variable Y, the probability density function is: where α is a dispersion parameter. If α > 0, then the variance is greater than the mean, which is known as overdispersion; if α < 0, then the variance is less than the mean, which is known as under-dispersion; and if α = 0, then the generalized Poisson distribution degenerates to a Poisson distribution. Hence, the dispersion characteristics of data depends on dispersion parameter α.

Hurdle Model
The Hurdle regression model, also named the two-part structure, is an effective model in dealing with zero-inflated and zero-deflated data [19]. This model separates the generation of zero data from that of positive data and regards these as two separate processes. The first process judges whether the zero events happen; this is denoted by 0 when zero events happen with probability w and denoted by 1 when zero events do not happen with probability 1 − w. When the studied event has happened, we enter into the second process, i.e., how many times the event happens. In this process, the occurrence of the event conforms to, for example, a Poisson distribution, negative binomial distribution, or general Poisson distribution. However, in this process, the event's value must take a positive value and the events must occur at least once, which is based on the first process. Therefore, the event is conditionally distributed and zero truncated.
This leads us to the Hurdle model: where f 1 (y) and f 2 (y) are the density functions of the first process and second process, respectively. f 1 (0) = w is the probability that zero occurs and f 2 (y) The moments of the Hurdle model can be easily calculated. The mean is: where υ 2 is the untruncated mean for the density f 2 (y). Similarly, the 2-nd moment is: where σ 2 2 is the untruncated variance for the density f 2 (y). Hence, the variance of the Hurdle model is:

Generalized Poisson Hurdle Regression Model
This subsection combines the advantage of the generalized Poisson regression model for dispersed data and the hurdle model for zero-inflated and zero-deflated data. According to the basic principle of the Hurdle model, if the second process is in a zero-truncated generalized Poisson distribution, then the Generalized Poisson Hurdle Regression model can be proposed as follows: Using Equations (2), (3), (6) and (7), the mean and variance are: And: Obviously, the GPHR model changes to a simple Poisson Hurdle regression model when α = 0. Then, the GPHR model can be followed by Equations (11) and (12).
log it(w) = z T δ. (12) where x and z are (p + 1)-dimension vector as predictor variables (with a 1 in the first element), and β and δ are (p + 1)-dimension vector as regression parameters. In general, x and z may be the same in the model, the reason why the estimated coefficients of explanatory variables appear twice in real data analysis.

Estimation
In this section, we estimate the parameters using the generalized method of moments (GMM) and establish their asymptotically normal properties and efficiencies [36,37].

Parameter Estimation
In simple notation, let X = x T , z T T , ζ = β T , δ T T , where T indicates the transposition for matrix or vector; then, Equations (11) and (12) are simplified to: For the GPHR model, the 1st and 2nd moments can be easily obtained from Equations (9) and (10): where θ = α, ζ T T is an unknown parameter. Hence, when X is non-random, the moment condition for the GPHR model can be followed by Equation (14): As can easily be seen, E[h(Y, X, θ 0 )] = 0, where θ 0 is the vector of the true parameters. Using Equation (14), the sample moment condition can be obtained.
Then, we can obtain the objective function: where W(θ) is a weight matrix. There are several common forms for the weight matrix. for example, an identity matrix and the covariance matrix of the estimating equation vector. Without ambiguity, let Q n (θ) = Q n (Y, X, θ),h n (θ) = h n (Y, X, θ) and W = W(θ). Letθ be a minimizer of Equation (16); then,θ is called the GMM estimator.

Asymptotic Property and Efficiency
In this subsection, we investigate the efficiency and asymptotic normality of the GMM estimator. Let the true value of the parameter θ 0 ∈ Θ, where Θ is a compact set. For a matrix A, let A = ∑ a ij 2 1 2 . Some assumptions are required.

Assumption 1.
(i) The covariate X is a non-random variable; (ii) The weight matrix W is positive definite matrix; Assumption 1 ensures the identification of the GMM estimator by Lemma 1 and this is a condition for the existence of the GMM estimator [38].
To rigorously establish the consistency and asymptotic normality of the estimator, the following regular assumptions are required.
As can easily be seen, Assumption 2 is equivalent to the statement that h(Y, X, θ) is continuously differentiable in neighborhood N (θ 0 ) of θ 0 . Assumption 3. There exist C 1 > 0 and C 2 > 0 such that Assumptio 4. Var Y k , k = 1, 2 are bounded away from both zero and infinity uniformly.
The above assumptions ensure the consistency of the estimator. Hence, the following theorem holds.
To establish the asymptotic normality of the proposed estimator, some strict conditions are required on the moment and on the parameter space Θ.
Assumptions 5 and 6 are required for the asymptotic variance and its asymptotic normality. Similar to the conditions of theorem 3.4 in [38], asymptotic normality holds.

Theorem 2 (Asymptotic normality).
Let the observed data {(Y i , X i ), i = 1, 2, . . . , n} be i.i.d. If assumptions 1-6 hold, then: As can be seen from Theorem 2, the asymptotic variance can be simplified to In some cases, the efficiency of the estimator is established-that is, an estimator has minimum variance. For an unbiased estimator, an unbiased estimator (MVUE) with a minimum variance can be easily defined. In fact, if the estimator is unbiased, then it is asymptotically unbiased; hence, we concentrate on asymptotically unbiased estimators with minimum variance. For matrix P and Q, P > Q and P ≥ Q, where P − Q is a positive definite matrix and a positive semi-definite matrix. As a special case, the following theorem holds: Let the conditions of Theorem 2 hold. If is nonsingular matrix and W = Σ −1 , then: Hence, theθ is the estimator of minimum variance.
As can be seen from Theorem 3, the GMM estimator is most efficient when the weight matrix is identical to the inverse of the covariance of estimating function h(Y, X, θ) in all families of GMM.

Algorithm
Indeed, it is difficult to calculate the first derivative (16). However, GMM estimation can be calculated by the Nelder-Mead algorithm [39], which is widely known as the Nelder-Mead simplex algorithm, to find a minimum of the function of several variables. In particular, for complex functions, this algorithm is a good choice, since it does not require differentiation.
The Nelder-Mead algorithm runs as follows.
Usually, the iteration stops when m −1 < ε, or when the number of iterations reaches a fixed value.
In practice, the Nelder-Mead algorithm is widely used to optimize target functions, since it simply and easily achieves its minimization. It does not require continuous or differentiable target functions. This algorithm is significantly improved in the first few iterations and quickly provides satisfactory results. However, there are two disadvantages of the Nelder-Mead algorithm. The algorithm is very sensitive to the initial values. Additionally, the convergence of the algorithm is difficult to guarantee globally, even for smooth and well-behaved functions. See, for example, for special conditions [40][41][42][43]. The Nelder-Mead algorithm is modified to improve the worst-case performance of the algorithm in terms of convergence, but retains some or most of its efficiency in best-case scenarios [44,45].

Real Data Analysis
As a healthy sport, swimming can improve human energy metabolism and maintain respiratory health. However, frequent swimming may lead to excessive moisture in the ear and inflammation of the external auditory canal. Moisture can cause ear eczema. Skin damage caused by repeated scratching of eczema can make bacteria or fungi invade the ear canal tissue and cause infection. However, swimming in bacterially-contaminated water is a common cause of swimming ear disease. Therefore, in health care, it is significant in practice to explore the relationship between the frequency of ear disease and other factors such as swimming. On the other hand, people who often swim may not be infected with ear diseases.
In this section, we analyze real data in ear disease using the method proposed above, and compare results between different methods.

Data Description
The real data in this application relate to the incidence of ear diseases and come from the investigation carried out by the New South Wales Health Research Council in 1990. The data gathered include the frequency of swimming, the place of swimming, and the frequency of ear diseases, from a total of 190 observation data points. The number of ear diseases is recorded by a type of count data-i.e., how many times the self-diagnosis of ear infection has occurred. The frequency of swimming is a categorical variable-i.e., how often the swimmer swims in the ocean-and takes two values: "Often" or "Occas"; quantitative values of "Often" and "Occas" are 1 and 2, respectively. The place of swimming is also a categorical variable, relating to the usual swimming location, and takes the values "Beach" or "Non-beach"; quantitative values of "Beach" and "Nonbeach" are 1 and 4, respectively. In this paper, we use the frequency of swimming and place of swimming as explanatory variable x in our model and the number of ear diseases as response variable Y. Figure 1 is a histogram of the frequency distribution of ear disease, where the value interval of the number of ear disease is (0, 17) and 92 cases occurred in 0, accounting for 48.4% of the total; 27 occurred in 1, accounting for 14.2% of the total; 26 occurred in 2, accounting for 13.7% of the total; 21 occurred in 3, accounting for 11.1% of the total; and 12 occurred in 4, accounting for 6.3% of the total. As the number of occurrences increases, the proportion of the number of ear disease becomes smaller and smaller. As shown in Figure 1, the number of cases of ear disease has a larger probability accumulation at zero. Therefore, if there is zero excess in this set of data, the use of the Hurdle model in this paper may provide a better fit with the data. Information relating to the number characters for variable Y is shown in Table 1. As can be seen in Table 1, the sample data contain more zeros, the expected number of occurrences is 1.6, and the variance is 6.5. The variance is larger than expected-that is, the sample data suffer from over-dispersion. Therefore, the generalized Poisson Hurdle regression model is applied in this real data.

Empirical Analysis
In this subsection, we utilize the generalized Poisson (GP) regression model, the Poisson Hurdle (PH) regression model, and the GPHR model for ear disease. We estimate parameters using the ML method and GMM. For GMM, the initial value used is ML estimation. The Akaike Information Criteria (AIC) are introduced to compare the fitting effects of these three models, where the form of the AIC statistics is AIC = −2 + 2 , is the log-likelihood value, and is the number of free parameters. In general, the smaller the AIC values, the better the fitting effect of the model.
In order to make statistical inferences, the variance of the regression coefficients is estimated. It is natural to attempt to derive a consistent estimator, but as seen from the proof of the theorem, such an estimator is too complex and not practical. Hence, we use the bootstrap method to estimate variance. As shown in Figure 2, the basic schematic of bootstrapping proceeds. We denote the training set by = , , ⋯ , , randomly draw data sets with replacements from training set , and name these bootstrap data sets. The size of the bootstrap data sets may be not equal to that of the original training set. We repeat times and produce bootstrap data sets. Then, we refit the model to each of the bootstrap data sets and examine the behavior of the fits over the -th bootstrap data set.

Empirical Analysis
In this subsection, we utilize the generalized Poisson (GP) regression model, the Poisson Hurdle (PH) regression model, and the GPHR model for ear disease. We estimate parameters using the ML method and GMM. For GMM, the initial value used is ML estimation. The Akaike Information Criteria (AIC) are introduced to compare the fitting effects of these three models, where the form of the AIC statistics is AIC = −2l + 2p, l is the log-likelihood value, and p is the number of free parameters. In general, the smaller the AIC values, the better the fitting effect of the model.
In order to make statistical inferences, the variance of the regression coefficients is estimated. It is natural to attempt to derive a consistent estimator, but as seen from the proof of the theorem, such an estimator is too complex and not practical. Hence, we use the bootstrap method to estimate variance. As shown in Figure 2, the basic schematic of bootstrapping proceeds. We denote the training set by Z = (z 1 , z 2 , . . . , z N ), randomly draw data sets with replacements from training set Z, and name these bootstrap data sets. The size of the bootstrap data sets may be not equal to that of the original training set. We repeat B times and produce B bootstrap data sets. Then, we refit the model to each of the bootstrap data sets and examine the behavior of the fits over the i-th bootstrap data set. In Figure 2, represents any statistics computed from the data set . Using bootstrap sampling, we can estimate any aspect of the distribution of , for example, its variance: where * = ∑ * . For the bootstrap method used, see, for example, Efron and Tibshirani [46]. In this paper, all programs and algorithms are calculated by R in version 3.6.3 and Rstudio in version 1.2.5042. Table 2 shows the results for three models. For the generalized Poisson Hurdle regression model and the Poisson hurdle model, each model includes two parameters related to the explanatory variables, and the coefficients of explanatory variable are estimated twice, i.e., intercept, frequency and place appear twice in the results. The Wald confidence intervals at the significance of 5% and -values are also shown in Table 2. To estimate the variance of GMM, bootstrap methods are used where = 50 times. The GMM estimation and ML estimation produce similar results for the three models. The PH model obtains the worst results and the results of the AIC are relatively large, which may be due to the over-dispersion. The GP regression model and the GPHR model both have over-dispersed characteristics, meaning that the fitting effect is improved. On the other hand, the GPHR model improves the results in the case of zero excess, as can be seen from the AIC value, compared with the GP regression model. The variances of the GMM estimator are less than the variances of the ML estimator for three models. Additionally, it can be seen that people who often swim are more likely to suffer from ear disease. Furthermore, it is evident that swimming at the beach has a negative impact on ear disease. Under the null hypothesis, this statistic is asymptotically normally distributed. At the 5% significance level, the GPHR model fits the data for ear disease better than both the GP regression model and the PH regression model. In Figure 2, θ(Z) represents any statistics computed from the data set Z. Using bootstrap sampling, we can estimate any aspect F(θ) of the distribution of θ(Z), for example, its variance:V θ Z * i . For the bootstrap method used, see, for example, Efron and Tibshirani [46]. In this paper, all programs and algorithms are calculated by R in version 3.6.3 and Rstudio in version 1.2.5042. Table 2 shows the results for three models. For the generalized Poisson Hurdle regression model and the Poisson hurdle model, each model includes two parameters related to the explanatory variables, and the coefficients of explanatory variable are estimated twice, i.e., intercept, frequency and place appear twice in the results. The Wald confidence intervals at the significance of 5% and p-values are also shown in Table 2. To estimate the variance of GMM, bootstrap methods are used where B = 50 times. The GMM estimation and ML estimation produce similar results for the three models. The PH model obtains the worst results and the results of the AIC are relatively large, which may be due to the over-dispersion. The GP regression model and the GPHR model both have over-dispersed characteristics, meaning that the fitting effect is improved. On the other hand, the GPHR model improves the results in the case of zero excess, as can be seen from the AIC value, compared with the GP regression model. The variances of the GMM estimator are less than the variances of the ML estimator for three models. Additionally, it can be seen that people who often swim are more likely to suffer from ear disease. Furthermore, it is evident that swimming at the beach has a negative impact on ear disease. Under the null hypothesis, this statistic is asymptotically normally distributed. At the 5% significance level, the GPHR model fits the data for ear disease better than both the GP regression model and the PH regression model.

Conclusions
This paper introduces the GPHR model for count data. It combines the generalized Poisson Regression (GPR) model with the Hurdle Regression model. Compared with other models, such as the PH regression model and the GP regression model, the GPHR model can simultaneously deal with both zero-inflated or zero-deflated data and over-or under-dispersion in count data. GMM is used to estimate the parameters of the models. In some cases, the GMM performs better than ML. Since it is difficult to calculate the first derivative of the objective function, the Nelder-Mead method is used to estimate the parameters.
Ear diseases are important in healthcare. The proposed model is applied to real data on ear diseases, which is another purpose of this paper. Compared with other zero-inflated models, the GPHR model has a minimum AIC value. Given a significant level (5%), the Wald test is used to measure the significance of the factors incorporated in each model. As shown, the GPHR model fits well with the real data. In fact, one count data is not enough to validate advantages and disadvantages of the purposed model for both zero-inflated or zero-deflated data and over-or under-dispersion. Although this model performs well in the data for ear disease, other count data are better at checking this model. Indeed, we check consumer health information for medical treatment, and this model also performs well. We do not report in detail, since this paper aimed to purpose the GPHR model for count data of both zero-inflated or zero-deflated data and over-or under-dispersion, and apply this to ear disease data.
In this paper, we only studied data that were zero-inflated. The model may be extended for other types of count data, such as zero-deflated data. In addition, for the first process of the Hurdle model, the occurrence probability w is used. In fact, this may be extended to the Poisson distribution at zero when the first process occurs, with the same parameter µ used as one in the second process. For the second process of the Hurdle model, this may be extended to the case of truncation.
a function B(X) such that G(X, θ) ≤ B(X) for all θ ∈ Θ, and E[B(X)] < ∞, then E[G(X, θ)] is continuous and Proof of Theorem 1. Using lemma 1 and assumption 1, we can determine that Q 0 (θ) has a unique minimum value θ 0 . Next, we prove that the Q n (θ) uniformly converges to Q 0 (θ) with a high probability.
Based on assumptions 1-4, Theorem 1 holds-i.e.,θ P → θ 0 . In addition, since X is nonrandom: G n θ P → G, and G n θ P → G. Hence: Based on the Slutsky theorem and assumption 5, the asymptotic variance can be established. Thus, theorem 2 holds. Proof of Theorem 3. Based on Theorem 2, the covariance matrix of the estimator is G T Σ −1 G −1 when W = Σ −1 . Hence, √ n θ − θ 0 This is also a positive semi-definite matrix-i.e.,: Hence,θ is the estimator of minimum variance.