Poisson Extended Exponential Distribution with Associated INAR(1) Process and Applications

Abstract: The significance of count data modeling and its applications to real-world phenomena has been highlighted in several research studies. The present study focuses on a two-parameter discrete distribution obtained by compounding the Poisson and extended exponential distributions. Its statistical properties have tractable, explicit forms. The unknown parameters are estimated by the maximum likelihood method, and an extensive simulation study is performed. The usefulness of the proposed distribution is demonstrated in a count regression model and in a first-order integer-valued autoregressive process, referred to as the INAR(1) process. In addition, the empirical importance of the proposed model is illustrated through three real-data applications, and the empirical findings indicate that the proposed INAR(1) model provides better results than other competitive models for time series of counts that display overdispersion.


Introduction
In many fields of applied science, such as engineering, medicine, insurance, economics, and marketing, the analysis of count data plays a significant role. Count data sets are often modeled using a Poisson distribution. However, the Poisson distribution cannot handle overdispersed data sets, that is, data sets in which the variance exceeds the mean. As a consequence, many researchers have developed mixed-Poisson distributions as alternative models for overdispersed count data, including [1][2][3][4]. Recent studies in this area are [5][6][7], among others.
When the response variable is a count, Poisson regression is a popular model. The Poisson regression model assumes that the mean and variance of the dependent variable are equal. Count data sets, however, very often exhibit overdispersion, so this assumption is violated in practice. Initially, negative binomial (NB) regression was employed to model overdispersion in the context of count regression. The Poisson-transmuted exponential linear model was introduced by [2] and applied to healthcare data sets. The generalized Poisson-Lindley linear model was introduced by [8], who showed that it provides better modeling ability than the Poisson and NB regression models when the data are overdispersed.
There are many instances of integer-valued time series in the real world, such as the number of births at a hospital in successive months, the number of accidents, the number of patients, the number of chromosome exchanges in cells, and so on. As an inaugural approach, refs. [9][10][11] proposed a stochastic model for integer-valued time series called INAR(1)P, a first-order non-negative integer-valued autoregressive process with Poisson innovations. As time series of counts mostly exhibit overdispersion, Poisson innovations are often inadequate for the INAR(1) process. To overcome this issue, researchers have proposed different INAR(1) processes with flexible innovation distributions. Aghababaei Jazi et al. [12] proposed an INAR(1) process with geometric innovations (INAR(1)G), Altun E. [5] presented an INAR(1) process with the new Poisson weighted exponential innovation distribution (INAR(1)PWE), Altun et al. [13] introduced an INAR(1) process with Poisson quasi-xgamma innovations (INAR(1)PQX), and so on. Although these models handle overdispersed count time series well, none of them is the best choice for every real data set. Adding further INAR(1) models therefore widens the set of candidates from which the most appropriate model can be chosen for each situation.
Therefore, this paper provides new results on a two-parameter mixed-Poisson distribution, namely the Poisson extended exponential (PEE) distribution, obtained by compounding the Poisson distribution with the extended exponential (EE) distribution proposed by [14]. The EE distribution is obtained by mixing exponential and gamma distributions. The probability density function (pdf) of the EE distribution is given by
$$f(x; \alpha, \beta) = \frac{\alpha^{2}(1 + \beta x)e^{-\alpha x}}{\alpha + \beta}, \quad x > 0,$$
where $\alpha > 0$ and $\beta \ge 0$. It is sometimes denoted as EE(α, β) to specify the parameters. This distribution also appears in a different form in [15], presented as a two-parameter Lindley distribution. The recent statistical literature has paid a lot of attention to the EE distribution. An EE regression model, based on a reparameterization of the EE model in terms of its mean, was proposed by [16]. In addition, de Andrade et al. [17] proposed the exponentiated generalized EE distribution. Refs. [18,19] also showed the novelty and potential of the EE distribution through their study of different generalizations of the EE model. The PEE distribution appears in [20] as a discrete two-parameter Poisson-Lindley distribution. However, to the best of our knowledge, several of its aspects are understudied, and the goal of this research is to revisit them from an applied perspective. In particular, the appealing applicability of the EE regression model inspired us to present a two-parameter mixed-Poisson distribution created by compounding the Poisson distribution with the EE distribution, and to study its regression model and associated INAR(1) process.
In the rest of the paper, the sections are arranged as follows. Section 2 presents the PEE distribution and explores some of its statistical properties. The finite sample performance of the estimation method is examined in Section 3 with a simulation study for the maximum likelihood estimation of the model parameters. A regression model is discussed in Section 4. The INAR(1)PEE process is developed in Section 5 using PEE innovations. An empirical analysis of three real data sets is conducted in Section 6 to prove that the proposed model is useful when compared to some existing models. In Section 7, a few concluding remarks are presented.

The Poisson Extended Exponential Distribution
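In the new formulation, the Poisson distribution is compounded with the EE distribution to produce a mixed-Poisson distribution, which is called the PEE distribution. Let the random variable X follow the PEE distribution, with the following stochastic representation: $X \mid \lambda \sim P(\lambda)$ and $\lambda \mid \alpha, \beta \sim EE(\alpha, \beta)$, where $\lambda > 0$, $\alpha > 0$ and $\beta \ge 0$. Then the unconditional probability mass function (pmf) of X has the following form:
$$P(X = x) = \frac{\alpha^{2}(1 + \alpha + \beta + \beta x)}{(\alpha + \beta)(1 + \alpha)^{x+2}}, \quad x = 0, 1, 2, \ldots$$
Indeed, by construction, the random variable X has the Poisson distribution with parameter λ, and λ is itself a random variable with the EE(α, β) distribution. The unconditional distribution of X is then obtained by the classical method of compounding, which gives
$$P(X = x) = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{x}}{x!}\,\frac{\alpha^{2}(1 + \beta\lambda)e^{-\alpha\lambda}}{\alpha + \beta}\,d\lambda = \frac{\alpha^{2}}{(\alpha + \beta)\,x!}\left[\frac{\Gamma(x+1)}{(1+\alpha)^{x+1}} + \frac{\beta\,\Gamma(x+2)}{(1+\alpha)^{x+2}}\right] = \frac{\alpha^{2}(1 + \alpha + \beta + \beta x)}{(\alpha + \beta)(1 + \alpha)^{x+2}}.$$
Here we used the gamma function $\Gamma(x) = \int_{0}^{\infty} u^{x-1}e^{-u}\,du$ and the relation $\Gamma(m + 1) = m!$ for any positive integer m.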
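As a quick numerical check of the compounding argument, the following minimal sketch evaluates the closed-form pmf and compares it with a midpoint-rule approximation of the Poisson-EE mixing integral. It assumes the EE density $\alpha^{2}(1+\beta\lambda)e^{-\alpha\lambda}/(\alpha+\beta)$; the function names, the integration limits, and the parameter values are illustrative choices, not part of the original study.

```python
import math

def pee_pmf(x, alpha, beta):
    """Closed-form PEE pmf at x = 0, 1, 2, ... (alpha > 0, beta >= 0)."""
    return (alpha**2 * (1 + alpha + beta + beta * x)
            / ((alpha + beta) * (1 + alpha) ** (x + 2)))

def ee_pdf(lam, alpha, beta):
    """Extended exponential density at lam > 0 (assumed form, see lead-in)."""
    return alpha**2 * (1 + beta * lam) * math.exp(-alpha * lam) / (alpha + beta)

def compound_pmf(x, alpha, beta, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the Poisson-EE mixing integral."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        # Poisson pmf computed on the log scale for numerical stability
        poisson = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
        total += poisson * ee_pdf(lam, alpha, beta)
    return total * h

alpha, beta = 1.5, 0.8  # illustrative values only
total_mass = sum(pee_pmf(x, alpha, beta) for x in range(400))
max_gap = max(abs(pee_pmf(x, alpha, beta) - compound_pmf(x, alpha, beta))
              for x in range(6))
```

Here `total_mass` should be numerically indistinguishable from one, and `max_gap` should be at the level of the quadrature error.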
The discrete two-parameter Poisson-Lindley distribution proposed by [20] has the same pmf but a different parameter space, namely α + β > 0, and [20] explored only its distributional characteristics. In contrast to [20], our applied work focuses on the count regression model and the accompanying INAR(1) process, which are of current interest, and our theoretical work adds further aspects to the aforementioned study. Different pmf shapes are presented in Figure 1 for several parameter combinations of the PEE distribution. The figure shows that the PEE distribution is right skewed for the parameter values considered.

Moments, Skewness and Kurtosis
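Some results that can be derived from [20] are now presented. The probability-generating function of a random variable X with the PEE distribution is
$$G_X(s) = \frac{\alpha^{2}(\alpha + \beta + 1 - s)}{(\alpha + \beta)(\alpha + 1 - s)^{2}}$$
for $|s| < \alpha + 1$. Correspondingly, the moment-generating function of X is given by
$$M_X(t) = \frac{\alpha^{2}(\alpha + \beta + 1 - e^{t})}{(\alpha + \beta)(\alpha + 1 - e^{t})^{2}}$$
for $t < \log(\alpha + 1)$. Let r be a positive integer. The rth factorial moment of a random variable X with the PEE distribution is given by
$$\mu'_{(r)} = \frac{r!\,[\alpha + (r + 1)\beta]}{\alpha^{r}(\alpha + \beta)}. \tag{5}$$
Indeed, in accordance with the definition of the rth factorial moment, and since the rth factorial moment of the Poisson distribution with parameter λ is $\lambda^{r}$, we have
$$\mu'_{(r)} = E[X(X-1)\cdots(X-r+1)] = E(\lambda^{r}) = \int_{0}^{\infty} \lambda^{r}\,\frac{\alpha^{2}(1 + \beta\lambda)e^{-\alpha\lambda}}{\alpha + \beta}\,d\lambda = \frac{\alpha^{2}}{\alpha + \beta}\left[\frac{\Gamma(r+1)}{\alpha^{r+1}} + \frac{\beta\,\Gamma(r+2)}{\alpha^{r+2}}\right].$$
From the last equality, (5) follows by applying the relation $\Gamma(m + 1) = m!$, r being a positive integer. The first four non-central moments are derived as
$$E(X) = \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)}, \qquad E(X^{2}) = \frac{\alpha(\alpha + 2\beta) + 2(\alpha + 3\beta)}{\alpha^{2}(\alpha + \beta)},$$
$$E(X^{3}) = \frac{\alpha^{2}(\alpha + 2\beta) + 6\alpha(\alpha + 3\beta) + 6(\alpha + 4\beta)}{\alpha^{3}(\alpha + \beta)}, \qquad E(X^{4}) = \frac{\alpha^{3}(\alpha + 2\beta) + 14\alpha^{2}(\alpha + 3\beta) + 36\alpha(\alpha + 4\beta) + 24(\alpha + 5\beta)}{\alpha^{4}(\alpha + \beta)}.$$
The variance of X is given by
$$Var(X) = \frac{\alpha^{3} + 3\alpha^{2}\beta + 2\alpha\beta^{2} + \alpha^{2} + 4\alpha\beta + 2\beta^{2}}{\alpha^{2}(\alpha + \beta)^{2}}.$$
Explicit expressions for the skewness and kurtosis of X can be found using the following formulas:
$$\text{Skewness}(X) = \frac{E(X^{3}) - 3E(X^{2})E(X) + 2[E(X)]^{3}}{[Var(X)]^{3/2}}$$
and
$$\text{Kurtosis}(X) = \frac{E(X^{4}) - 4E(X^{3})E(X) + 6E(X^{2})[E(X)]^{2} - 3[E(X)]^{4}}{[Var(X)]^{2}},$$
respectively.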
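The moment formulas can be cross-checked numerically. The sketch below computes the factorial moments $r!\,[\alpha + (r+1)\beta]/(\alpha^{r}(\alpha+\beta))$, converts them to non-central moments through the standard Stirling-number relations, and derives the variance, skewness, and kurtosis. The function names and the parameter values in the usage lines are illustrative assumptions.

```python
import math

def pee_pmf(x, alpha, beta):
    """PEE pmf, used here only to cross-check the closed-form moments."""
    return (alpha**2 * (1 + alpha + beta + beta * x)
            / ((alpha + beta) * (1 + alpha) ** (x + 2)))

def factorial_moment(r, alpha, beta):
    """r-th factorial moment E[X(X-1)...(X-r+1)] of the PEE distribution."""
    return (math.factorial(r) * (alpha + (r + 1) * beta)
            / (alpha**r * (alpha + beta)))

def raw_moments(alpha, beta):
    """First four non-central moments from the factorial moments."""
    f1, f2, f3, f4 = (factorial_moment(r, alpha, beta) for r in (1, 2, 3, 4))
    m1 = f1
    m2 = f2 + f1
    m3 = f3 + 3 * f2 + f1
    m4 = f4 + 6 * f3 + 7 * f2 + f1
    return m1, m2, m3, m4

def variance_skewness_kurtosis(alpha, beta):
    """Variance, skewness and kurtosis from the non-central moments."""
    m1, m2, m3, m4 = raw_moments(alpha, beta)
    var = m2 - m1**2
    skew = (m3 - 3 * m2 * m1 + 2 * m1**3) / var**1.5
    kurt = (m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4) / var**2
    return var, skew, kurt

# Illustrative parameter values, not taken from the paper's tables
var, skew, kurt = variance_skewness_kurtosis(1.5, 0.8)
```

Comparing `raw_moments` with truncated sums of `x**k * pee_pmf(x, ...)` gives a simple consistency check of the closed forms.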

Dispersion Index and Coefficient of Variation
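The dispersion index (DI) of the PEE distribution is given by
$$DI(X) = \frac{Var(X)}{E(X)} = 1 + \frac{\alpha^{2} + 4\alpha\beta + 2\beta^{2}}{\alpha(\alpha + \beta)(\alpha + 2\beta)}.$$
As a complementary measure, the coefficient of variation (CV) of the PEE distribution is given by
$$CV(X) = \frac{\sqrt{Var(X)}}{E(X)} = \frac{\sqrt{\alpha^{3} + 3\alpha^{2}\beta + 2\alpha\beta^{2} + \alpha^{2} + 4\alpha\beta + 2\beta^{2}}}{\alpha + 2\beta}.$$
Tables 1 and 2 provide some numerical values of the mean, variance, and DI of the PEE distribution for a variety of parameter configurations. For all the values considered, the DI of the PEE distribution is greater than one. This agrees with the expression above, whose second term is positive for α > 0 and β ≥ 0, so the PEE distribution is always overdispersed.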
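The dispersion measures are straightforward to evaluate. The following sketch, with illustrative function names, computes the mean, variance, DI, and CV from α and β and can be used to reproduce the kind of values reported in parameter tables.

```python
import math

def pee_mean(alpha, beta):
    return (alpha + 2 * beta) / (alpha * (alpha + beta))

def pee_variance(alpha, beta):
    num = (alpha**3 + 3 * alpha**2 * beta + 2 * alpha * beta**2
           + alpha**2 + 4 * alpha * beta + 2 * beta**2)
    return num / (alpha**2 * (alpha + beta) ** 2)

def pee_di(alpha, beta):
    """Dispersion index: always above one for alpha > 0, beta >= 0."""
    return 1 + ((alpha**2 + 4 * alpha * beta + 2 * beta**2)
                / (alpha * (alpha + beta) * (alpha + 2 * beta)))

def pee_cv(alpha, beta):
    """Coefficient of variation: standard deviation over mean."""
    return math.sqrt(pee_variance(alpha, beta)) / pee_mean(alpha, beta)

# beta = 0 reduces the PEE distribution to a geometric law with mean 1/alpha
di_geometric = pee_di(1.0, 0.0)
```

With β = 0 the DI reduces to 1 + 1/α, the familiar value for the geometric case.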

Maximum Likelihood Estimation
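Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from the PEE distribution with unknown parameters α and β, and let $x_1, x_2, \ldots, x_n$ be the corresponding observations. Then the likelihood function is given by the following finite product:
$$L = \prod_{i=1}^{n} \frac{\alpha^{2}(1 + \alpha + \beta + \beta x_i)}{(\alpha + \beta)(1 + \alpha)^{x_i + 2}}.$$
The maximum likelihood estimates (MLEs) of the parameters α and β, say $\hat{\alpha}$ and $\hat{\beta}$, are obtained by $(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha, \beta)} L$ or, equivalently, $(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha, \beta)} \log L$, where
$$\log L = 2n\log\alpha - n\log(\alpha + \beta) - \left(\sum_{i=1}^{n} x_i + 2n\right)\log(1 + \alpha) + \sum_{i=1}^{n}\log(1 + \alpha + \beta + \beta x_i).$$
In practical terms, the normal equations are given by
$$\frac{\partial \log L}{\partial \alpha} = \frac{2n}{\alpha} - \frac{n}{\alpha + \beta} - \frac{\sum_{i=1}^{n} x_i + 2n}{1 + \alpha} + \sum_{i=1}^{n}\frac{1}{1 + \alpha + \beta + \beta x_i} = 0$$
and
$$\frac{\partial \log L}{\partial \beta} = -\frac{n}{\alpha + \beta} + \sum_{i=1}^{n}\frac{1 + x_i}{1 + \alpha + \beta + \beta x_i} = 0.$$
Then $\hat{\alpha}$ and $\hat{\beta}$ are obtained by solving these two equations simultaneously, provided the solution is indeed a maximum. The equations have no closed-form solution, so a numerical optimization technique is required, using software such as R, Mathematica or Python.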
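To illustrate the numerical maximization, here is a minimal, dependency-free sketch. It is not the optimizer used in the study: it swaps in a simple compass (pattern) search on (log α, log β), which keeps both parameters positive, and it draws an artificial sample by inverse-cdf sampling. All names, starting values, the seed, and the true parameters are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def pee_pmf(x, alpha, beta):
    return (alpha**2 * (1 + alpha + beta + beta * x)
            / ((alpha + beta) * (1 + alpha) ** (x + 2)))

def pee_loglik(alpha, beta, counts):
    """PEE log-likelihood; counts maps each observed value to its frequency."""
    n = sum(counts.values())
    total = sum(x * c for x, c in counts.items())
    return (2 * n * math.log(alpha) - n * math.log(alpha + beta)
            - (total + 2 * n) * math.log(1 + alpha)
            + sum(c * math.log(1 + alpha + beta + beta * x)
                  for x, c in counts.items()))

def fit_pee(data, start=(1.0, 1.0), step=0.5, tol=1e-6, max_iter=10000):
    """Compass search on (log alpha, log beta); a sketch, not a robust solver."""
    counts = Counter(data)
    u, v = math.log(start[0]), math.log(start[1])
    best = pee_loglik(math.exp(u), math.exp(v), counts)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for du, dv in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = pee_loglik(math.exp(u + du), math.exp(v + dv), counts)
            if cand > best:
                u, v, best, improved = u + du, v + dv, cand, True
        if not improved:
            step /= 2.0  # shrink the search pattern when no move helps
    return math.exp(u), math.exp(v), best

def sample_pee(alpha, beta, rng, x_max=1000):
    """Inverse-cdf draw from the PEE pmf (x_max guards against rounding)."""
    u, x = rng.random(), 0
    cdf = pee_pmf(0, alpha, beta)
    while u > cdf and x < x_max:
        x += 1
        cdf += pee_pmf(x, alpha, beta)
    return x

rng = random.Random(2022)
data = [sample_pee(1.5, 0.8, rng) for _ in range(1000)]
alpha_hat, beta_hat, ll_hat = fit_pee(data)
```

In practice a quasi-Newton routine with analytic gradients (the normal equations above) would usually be preferred; the pattern search is used here only because it needs no external library.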

Simulation Study
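A Monte Carlo simulation was performed to assess the finite-sample performance of the maximum likelihood method. The estimates were calculated, for given true parameter values, from N = 1000 samples of sizes 50, 75, 200, 500, 750, and 1000. The following formulas are used to calculate the mean estimates, biases, mean square errors (MSEs), and the coverage probabilities (CPs) and average lengths (ALs) of the confidence intervals (CIs):
$$\text{Mean}_h = \frac{1}{N}\sum_{j=1}^{N}\hat{h}_j, \qquad \text{Bias}_h = \frac{1}{N}\sum_{j=1}^{N}(\hat{h}_j - h), \qquad \text{MSE}_h = \frac{1}{N}\sum_{j=1}^{N}(\hat{h}_j - h)^{2},$$
$$\text{CP}_h = \frac{1}{N}\sum_{j=1}^{N} I\{\hat{h}_j - 1.96\,s_{j,\hat{h}} \le h \le \hat{h}_j + 1.96\,s_{j,\hat{h}}\}, \qquad \text{AL}_h = \frac{3.92}{N}\sum_{j=1}^{N} s_{j,\hat{h}}.$$
Here, h = α or β, $\hat{h}_j$ is the MLE of h from the jth sample, and $s_{j,\hat{h}}$ and $I\{\cdot\}$ denote the standard errors (SEs) of the MLEs and the indicator function, respectively. Tables 3 and 4 show the simulation results for two sets of parameter values. The MSEs and the ALs of the CIs decrease with increasing sample size. The CPs of the CIs for each parameter are relatively close to the nominal 95% level.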
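The summary indices of such a simulation can be computed with a small helper. The sketch below assumes the usual Wald-type 95% intervals (estimate ± 1.96 × SE); the function name and the toy inputs are hypothetical and are not results from the study.

```python
def summarize_replications(estimates, std_errors, true_value, z=1.96):
    """Mean estimate, bias, MSE, coverage probability and average CI length."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = sum(e - true_value for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    # A replication "covers" when the Wald interval contains the true value
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if e - z * s <= true_value <= e + z * s)
    cp = covered / n
    al = 2 * z * sum(std_errors) / n
    return {"mean": mean_est, "bias": bias, "mse": mse, "cp": cp, "al": al}

# Toy input: three replications of an estimator whose true value is 1.0
summary = summarize_replications([0.9, 1.1, 1.3], [0.1, 0.1, 0.1], 1.0)
```

In a full study, `estimates` and `std_errors` would hold the N replicated MLEs and their SEs for one parameter and one sample size.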

PEE Regression Model
The previous sections show that the PEE distribution can accommodate overdispersed data, which is important since most real-life count data display overdispersion. This section uses the PEE distribution to build a count regression model for overdispersed data sets.

Model Construction
Let Y be the response variable, representing the number of occurrences of an event, and assume that Y follows the PEE distribution. To begin, let us consider the following reparametrization, obtained by solving $E(Y) = \mu$ for β:
$$\beta = \frac{\alpha(1 - \alpha\mu)}{\alpha\mu - 2}.$$
With this configuration, we obtain the pmf of the PEE distribution in terms of the mean E(Y) = µ > 0 and α > 0. The corresponding pmf is
$$P(Y = y) = \frac{\alpha\left[(1 + \alpha)(2 - \alpha\mu) + \alpha(\alpha\mu - 1)(1 + y)\right]}{(1 + \alpha)^{y+2}}, \tag{7}$$
where y = 0, 1, 2, ... With an appropriate link function, explanatory variables can be used to model the mean of the random variable Y. Here the covariates and the mean of the dependent variable are linked using the log-link function. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n from Y. Using the log-link function, the mean of $Y_i$ is linked to the covariate vector $x_i^{T} = (x_{i1}, x_{i2}, \ldots, x_{ik})$ by the following equation:
$$\mu_i = \exp(\gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{i2} + \cdots + \gamma_k x_{ik}),$$
where $\gamma = (\gamma_0, \gamma_1, \gamma_2, \ldots, \gamma_k)$ is the vector of unknown regression coefficients. Based on (7), the pmf of $Y_i \mid X_i^{T} = x_i^{T}$, which follows the PEE distribution with parameters $\mu_i$ and α, is obtained as
$$P(Y_i = y_i \mid x_i) = \frac{\alpha\left[(1 + \alpha)(2 - \alpha\mu_i) + \alpha(\alpha\mu_i - 1)(1 + y_i)\right]}{(1 + \alpha)^{y_i+2}},$$
where $y_i$ is the ith observation of Y.
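The mean parametrization can be checked against the original (α, β) form. The sketch below assumes the mapping β = α(1 − αµ)/(αµ − 2), which follows from solving E(Y) = µ for β; under the stated parameter space (β ≥ 0) this requires 1 ≤ αµ < 2. Function names and numerical values are illustrative.

```python
def pee_pmf(y, alpha, beta):
    """PEE pmf in the original (alpha, beta) parametrization."""
    return (alpha**2 * (1 + alpha + beta + beta * y)
            / ((alpha + beta) * (1 + alpha) ** (y + 2)))

def beta_from_mean(mu, alpha):
    """Solve E(Y) = mu for beta (valid when 1 <= alpha * mu < 2)."""
    return alpha * (1 - alpha * mu) / (alpha * mu - 2)

def pee_pmf_mean(y, mu, alpha):
    """PEE pmf written in terms of the mean mu and alpha."""
    bracket = (1 + alpha) * (2 - alpha * mu) + alpha * (alpha * mu - 1) * (1 + y)
    return alpha * bracket / (1 + alpha) ** (y + 2)

mu, alpha = 1.0, 1.5               # illustrative: alpha * mu = 1.5
beta = beta_from_mean(mu, alpha)   # gives beta = 1.5
gap = max(abs(pee_pmf(y, alpha, beta) - pee_pmf_mean(y, mu, alpha))
          for y in range(30))
implied_mean = sum(y * pee_pmf_mean(y, mu, alpha) for y in range(500))
```

Both parametrizations should agree to rounding error, and the implied mean should reproduce µ.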

Estimation of the Model
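To estimate the regression coefficients γ, together with α, the maximum likelihood method is used. The logarithm of the likelihood function of the PEE count regression model is given by
$$\ell(\alpha, \gamma) = n\log\alpha + \sum_{i=1}^{n}\log\left[(1 + \alpha)(2 - \alpha\mu_i) + \alpha(\alpha\mu_i - 1)(1 + y_i)\right] - \log(1 + \alpha)\sum_{i=1}^{n}(y_i + 2), \tag{9}$$
where $\mu_i = \exp(\gamma_0 + \gamma_1 x_{i1} + \cdots + \gamma_k x_{ik})$. The estimates of the unknown parameters are obtained by maximizing (9). To accomplish this, we employ the optim function of R software. In addition, the SEs of these estimates are calculated using the fdHess function in R software.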
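For readers who prefer to see the objective function spelled out, the following sketch evaluates a PEE regression log-likelihood with a log link. It assumes the mean parametrization β = α(1 − αµ)/(αµ − 2) and returns minus infinity whenever the implied probability is not positive; the function name, the design matrix, and the coefficient values are hypothetical.

```python
import math

def pee_reg_loglik(alpha, gamma, X, y):
    """Log-likelihood of a PEE regression with log link.

    gamma[0] is the intercept; X is a list of covariate rows; y the counts.
    """
    ll = 0.0
    for row, yi in zip(X, y):
        mu = math.exp(gamma[0] + sum(g * x for g, x in zip(gamma[1:], row)))
        bracket = ((1 + alpha) * (2 - alpha * mu)
                   + alpha * (alpha * mu - 1) * (1 + yi))
        if bracket <= 0:
            return -math.inf  # outside the region where the pmf is positive
        ll += math.log(alpha) + math.log(bracket) - (yi + 2) * math.log(1 + alpha)
    return ll

# Toy data: one covariate, three observations (values are illustrative)
X = [[0.0], [0.5], [1.0]]
y = [0, 2, 1]
ll = pee_reg_loglik(1.5, (-0.2, 0.3), X, y)
```

This function could be passed, with its sign flipped, to any general-purpose minimizer in the same way the log-likelihood is passed to an optimizer in other software.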

Simulation of the PEE Regression Model
In this part, the performance of the maximum likelihood method for estimating the unknown regression parameters is analysed through a simulation study. The parameter combinations $(\alpha = 1.5, \gamma_0 = 0.6, \gamma_1 = 0.2, \gamma_2 = 0.3)$ and $(\alpha = 1.2, \gamma_0 = 0.7, \gamma_1 = 0.3, \gamma_2 = 0.4)$ are used to generate N = 1000 samples of sizes n = 50, 100, 200, and 500 from the following model: $\log(\mu_i) = \gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{i2}$. We assume that $x_{i1}$ and $x_{i2}$ are generated from the uniform distribution on (0, 1), denoted by U(0, 1). The mean estimates, biases, and MSEs are used to assess the asymptotic behaviour of the MLEs. Table 5 reports the simulation results. Table 5 shows that the biases and MSEs decrease as the sample size increases, which is in line with the consistency of the MLEs of the regression parameters.

INAR(1) Model with PEE Innovations
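The INAR(1) process is widely used to model time series of counts in several scientific disciplines, including actuarial science, finance, and medicine. It differs from the first-order autoregressive (AR(1)) process in that it is built on the binomial thinning operator. The INAR(1) process is given by
$$X_t = p \circ X_{t-1} + \varepsilon_t, \quad t \in \mathbb{Z},$$
where $0 \le p < 1$, and the innovation process $\{\varepsilon_t\}_{t \in \mathbb{Z}}$ consists of independent and identically distributed (iid) integer-valued random variables with mean $E(\varepsilon_t) = \mu_\varepsilon$ and variance $Var(\varepsilon_t) = \sigma_\varepsilon^{2}$. The binomial thinning operator is denoted by the symbol $\circ$ and is defined as
$$p \circ X_{t-1} = \sum_{j=1}^{X_{t-1}} G_j,$$
where $\{G_j\}_{j \ge 1}$ is a sequence of independent Bernoulli random variables with probability $\Pr(G_j = 1) = p$. For the INAR(1) process, the one-step transition probabilities are given by
$$\Pr(X_t = k \mid X_{t-1} = l) = \sum_{i=0}^{\min(k, l)} \binom{l}{i} p^{i}(1 - p)^{l-i}\,\Pr(\varepsilon_t = k - i), \quad k, l \ge 0,$$
where 0 < p < 1. There are many examples in real life where these types of stochastic processes play a role, including the number of passengers each year, the growth of bacteria each day, the number of scientific books cited, and many more. Here, a new INAR(1) process is introduced by assuming that the innovations $\{\varepsilon_t\}$ follow a PEE distribution. The one-step transition probability of the INAR(1)PEE model is given by
$$\Pr(X_t = k \mid X_{t-1} = l) = \sum_{i=0}^{\min(k, l)} \binom{l}{i} p^{i}(1 - p)^{l-i}\,\frac{\alpha^{2}\left[1 + \alpha + \beta + \beta(k - i)\right]}{(\alpha + \beta)(1 + \alpha)^{k-i+2}}. \tag{13}$$
Hereafter, the described process will be called the INAR(1)PEE process. Weiss C.H. [21] provides the mean, variance, and DI of $\{X_t\}_{t \in \mathbb{Z}}$ in terms of the mean, variance, and DI of the innovation distribution. For the INAR(1)PEE process, they are
$$E(X_t) = \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)(1 - p)},$$
$$Var(X_t) = \frac{\alpha^{2}(\alpha + \alpha p + 1) + 2\beta^{2}(\alpha + \alpha p + 1) + \alpha\beta(3\alpha(p + 1) + 4)}{\alpha^{2}(\alpha + \beta)^{2}(1 - p^{2})}$$
and
$$DI(X_t) = 1 + \frac{\alpha^{2} + 4\alpha\beta + 2\beta^{2}}{\alpha(\alpha + \beta)(\alpha + 2\beta)(1 + p)}. \tag{16}$$
According to [21,22], the conditional expectation and variance of the INAR(1)PEE process are given by
$$E(X_t \mid X_{t-1}) = pX_{t-1} + \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)}$$
and
$$Var(X_t \mid X_{t-1}) = p(1 - p)X_{t-1} + \frac{\alpha^{3} + 3\alpha^{2}\beta + 2\alpha\beta^{2} + \alpha^{2} + 4\alpha\beta + 2\beta^{2}}{\alpha^{2}(\alpha + \beta)^{2}},$$
respectively.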
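The mechanics of binomial thinning and of the transition probabilities can be made concrete with a short simulation. The sketch below is a minimal illustration under assumed names and parameter values: it computes the one-step transition probability as a binomial-PEE convolution and simulates a path by thinning the previous count and adding a PEE innovation drawn by inverse-cdf sampling.

```python
import math
import random

def pee_pmf(x, alpha, beta):
    return (alpha**2 * (1 + alpha + beta + beta * x)
            / ((alpha + beta) * (1 + alpha) ** (x + 2)))

def transition_prob(k, l, p, alpha, beta):
    """Pr(X_t = k | X_{t-1} = l): binomial survivors plus a PEE innovation."""
    return sum(math.comb(l, i) * p**i * (1 - p) ** (l - i)
               * pee_pmf(k - i, alpha, beta)
               for i in range(min(k, l) + 1))

def sample_pee(alpha, beta, rng, x_max=1000):
    """Inverse-cdf draw from the PEE pmf (x_max guards against rounding)."""
    u, x = rng.random(), 0
    cdf = pee_pmf(0, alpha, beta)
    while u > cdf and x < x_max:
        x += 1
        cdf += pee_pmf(x, alpha, beta)
    return x

def simulate_inar1_pee(T, p, alpha, beta, rng, x0=0):
    """Simulate T steps of X_t = p o X_{t-1} + eps_t with PEE innovations."""
    path, x = [], x0
    for _ in range(T):
        # Binomial thinning: each of the x units survives with probability p
        survivors = sum(1 for _ in range(x) if rng.random() < p)
        x = survivors + sample_pee(alpha, beta, rng)
        path.append(x)
    return path

p, alpha, beta = 0.5, 0.7, 1.0  # illustrative parameter values
path = simulate_inar1_pee(20000, p, alpha, beta, random.Random(1))
sample_mean = sum(path) / len(path)
```

For a long path, `sample_mean` should be close to the stationary mean (α + 2β)/(α(α + β)(1 − p)).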

Estimation
The conditional maximum likelihood (CML), conditional least squares (CLS), and Yule-Walker (YW) methods are used to estimate the unknown parameters of the INAR(1) process.

Conditional Maximum Likelihood
The complicated form of the full likelihood function motivates the use of the CML method instead of the usual maximum likelihood method. In the CML technique, conditioning on the first observation leads to a simple form of the likelihood, for which knowledge of the transition probabilities is sufficient; there is no such conditioning in the traditional maximum likelihood approach. The conditional log-likelihood function for the INAR(1)PEE process of the random sample $X_1, X_2, \ldots, X_T$, based on the associated observations $x_1, x_2, \ldots, x_T$, is given by
$$\ell(p, \alpha, \beta) = \sum_{t=2}^{T}\log\Pr(X_t = x_t \mid X_{t-1} = x_{t-1}), \tag{19}$$
where $X_1$ is fixed, and $\Pr(X_t = x_t \mid X_{t-1} = x_{t-1})$ is given by (13). The CML estimates are obtained by maximizing (19) with the constrOptim function of R.
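The conditional log-likelihood is simply a sum of log transition probabilities, as the following sketch shows. The function names and the toy series are illustrative assumptions; the maximization step itself (done with a constrained optimizer in the study) is not reproduced here.

```python
import math

def pee_pmf(x, alpha, beta):
    return (alpha**2 * (1 + alpha + beta + beta * x)
            / ((alpha + beta) * (1 + alpha) ** (x + 2)))

def transition_prob(k, l, p, alpha, beta):
    """Pr(X_t = k | X_{t-1} = l) for an INAR(1) process with PEE innovations."""
    return sum(math.comb(l, i) * p**i * (1 - p) ** (l - i)
               * pee_pmf(k - i, alpha, beta)
               for i in range(min(k, l) + 1))

def conditional_loglik(p, alpha, beta, series):
    """Conditional log-likelihood, conditioning on the first observation."""
    return sum(math.log(transition_prob(series[t], series[t - 1],
                                        p, alpha, beta))
               for t in range(1, len(series)))

series = [3, 5, 4, 2, 6, 3]  # toy count series, not real data
cll = conditional_loglik(0.4, 0.7, 1.0, series)
```

A useful sanity check is that with p = 0 the thinning term vanishes and the function reduces to the iid PEE log-likelihood of the observations from the second one onward.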

Conditional Least Squares
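The CLS estimates of the parameters of the INAR(1) process are obtained by minimizing the sum of squared deviations of the observations from their conditional expectations,
$$S(p, \alpha, \beta) = \sum_{t=2}^{T}\left[x_t - E(X_t \mid X_{t-1} = x_{t-1})\right]^{2} = \sum_{t=2}^{T}\left[x_t - p\,x_{t-1} - \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)}\right]^{2}.$$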
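Because the conditional mean of an INAR(1) process is linear in the previous observation, minimizing the squared one-step prediction errors over p and the innovation mean is an ordinary least-squares problem with a closed-form solution. The sketch below computes that solution; it returns p and the innovation mean only, since the squared-error criterion depends on α and β only through that mean. The function name and the toy series are illustrative.

```python
def cls_estimates(series):
    """Closed-form CLS estimates of p and the innovation mean mu_eps."""
    x_prev, x_curr = series[:-1], series[1:]
    n = len(x_curr)
    s_prev, s_curr = sum(x_prev), sum(x_curr)
    s_cross = sum(a * b for a, b in zip(x_prev, x_curr))
    s_sq = sum(a * a for a in x_prev)
    # Least-squares slope and intercept of x_t regressed on x_{t-1}
    p_hat = (n * s_cross - s_prev * s_curr) / (n * s_sq - s_prev**2)
    mu_hat = (s_curr - p_hat * s_prev) / n
    return p_hat, mu_hat

# Toy series that satisfies x_t = 0.5 * x_{t-1} + 2 exactly
p_hat, mu_hat = cls_estimates([8, 6, 5, 4.5])
```

Separating α and β from the estimated innovation mean requires additional information, for example a second-moment condition; that step is left out of this sketch.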

Yule-Walker
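In the YW approach, the theoretical moments are equated with their empirical counterparts and the resulting equations are solved simultaneously. Given that the autocorrelation function (ACF) of the INAR(1) process at lag η is $\rho_x(\eta) = p^{\eta}$, the YW estimate of p is given by
$$\hat{p}_{YW} = \frac{\sum_{t=2}^{T}(x_t - \bar{x})(x_{t-1} - \bar{x})}{\sum_{t=1}^{T}(x_t - \bar{x})^{2}},$$
where $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$. Next, the theoretical mean and dispersion are equated with their empirical equivalents to derive the YW estimates of α and β. More precisely, when the theoretical mean is equated with the empirical mean, we obtain
$$\hat{\beta}_{YW} = \frac{\hat{\alpha}_{YW}\left[\hat{\alpha}_{YW}\,\bar{x}(1 - \hat{p}_{YW}) - 1\right]}{2 - \hat{\alpha}_{YW}\,\bar{x}(1 - \hat{p}_{YW})}. \tag{21}$$
By substituting (21) in (16) and equating it with the sample dispersion, $\hat{\alpha}_{YW}$ is obtained.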
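The YW steps can be sketched as follows. The code estimates p by the lag-one sample autocorrelation, expresses β as a function of α through the mean equation, and then finds α by bisection on the dispersion equation. The bisection relies on an assumption made for this sketch: under β ≥ 0, the admissible α lies between 1/m and 2/m with m = x̄(1 − p̂), and the theoretical DI is decreasing in α on that interval. The function returns None when the sample dispersion is not attainable. All names are hypothetical.

```python
def yw_p(series):
    """Lag-one sample autocorrelation, the YW estimate of p."""
    T = len(series)
    xbar = sum(series) / T
    num = sum((series[t] - xbar) * (series[t - 1] - xbar) for t in range(1, T))
    den = sum((x - xbar) ** 2 for x in series)
    return num / den

def beta_from_alpha(alpha, m):
    """Mean equation solved for beta; m is xbar * (1 - p_hat)."""
    return alpha * (alpha * m - 1) / (2 - alpha * m)

def di_xt(p, alpha, beta):
    """Theoretical dispersion index of the INAR(1)PEE process."""
    return 1 + ((alpha**2 + 4 * alpha * beta + 2 * beta**2)
                / (alpha * (alpha + beta) * (alpha + 2 * beta) * (1 + p)))

def yw_alpha_beta(xbar, sample_di, p_hat, iters=200):
    """Bisection for alpha on (1/m, 2/m); returns (alpha, beta) or None."""
    m = xbar * (1 - p_hat)
    lo, hi = 1.0 / m, (2.0 / m) * (1 - 1e-9)

    def gap(a):
        return di_xt(p_hat, a, beta_from_alpha(a, m)) - sample_di

    if gap(lo) < 0 or gap(hi) > 0:
        return None  # sample dispersion outside the attainable range
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid  # DI still too large: move towards larger alpha
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return alpha, beta_from_alpha(alpha, m)

# Round trip with illustrative values p = 0.5, alpha = 0.7, beta = 1
true_mean = (0.7 + 2 * 1.0) / (0.7 * (0.7 + 1.0) * (1 - 0.5))
alpha_yw, beta_yw = yw_alpha_beta(true_mean, di_xt(0.5, 0.7, 1.0), 0.5)
```

With real data, `xbar` and `sample_di` would be the sample mean and the ratio of sample variance to sample mean, and `p_hat` would come from `yw_p`.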

Simulation
A simulation study was performed to check the finite sample performance of the CML, CLS, and YW estimates. The number of replications is N = 1001 for each of the sample sizes n = 50, 100, 200, 300, and 500. The two parameter vectors used here are (p = 0.5, α = 0.7, β = 1) and (p = 0.7, α = 0.5, β = 0.8). The simulation results are interpreted based on the biases and MSEs. The R-code is given in Appendix A. Tables 6 and 7 show the results. Among the three estimation methods, the CML estimates have the smallest biases and MSEs, so the CML approach outperforms the others. The CML estimation approach is therefore used in the application.

Empirical Studies
With the help of three real-life data sets, the superiority of the PEE model is illustrated.

Corn Borer Data
The first data set is from [23]. The data come from a biological experiment and represent the number of larvae of the European corn borer (ECB, Pyrausta) in the field.
The Hessian, and hence the observed Fisher information matrix, is evaluated using the optim function of R. The SE of each parameter is computed as the square root of the corresponding diagonal element of the inverse of the Fisher information matrix. Table 8 provides the MLEs with their corresponding SEs and confidence intervals (CIs) (lower bound of CI, upper bound of CI) for the corn borer data set. Table 9 shows that the PEE distribution is the best among the considered competitive models, since it has the lowest AIC, BIC, and χ² values together with the highest log L and p-value. The fitted PEE distribution is overdispersed, since its mean and variance for the corn borer data are 1.375 and 2.2131, respectively. Figure 2 presents the estimated pmfs of all the considered models, from which the adequacy of the PEE model is clearly seen.

Length of Hospital Stay
The effectiveness of the count regression model under the PEE distribution is assessed using the second data set. The data consist of 3589 observations from the 1991 Arizona cardiovascular patient files, available in the COUNT package of the R programming language. The PEE regression model is used to model the length of stay ($y_i$) by using the covariates: cardiovascular procedure ($x_{1i}$) (1 = CABG, 0 = PTCA), sex ($x_{2i}$) (1 = male, 0 = female), type of admission ($x_{3i}$) (1 = urgent, 0 = elective), and age ($x_{4i}$) (1 = age > 75, 0 = age ≤ 75). The following regression structure is fitted with the PEE regression model, the new Poisson generalized Lindley (NPGL) regression model (see [29]), the Poisson-xgamma (PX) regression model (see [7]), the Poisson-Lindley (PL) regression model and the basic Poisson regression model:
$$\mu_i = \exp(\gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \gamma_3 x_{3i} + \gamma_4 x_{4i}).$$
The mean and variance of the dependent variable are 8.831 and 47.973, respectively, indicating clear overdispersion. Table 10 gives the parameter estimates and the values of the information criteria. Altun E. [29] used this data set to demonstrate the better fit of the NPGL regression model. Table 10 shows that the PEE regression model attains the smallest -log L, AIC, and BIC among the competing models. We thus conclude that it is a more appropriate model than the others for this data set. The estimated coefficients indicate that the length of hospital stay increases when patients have CABG cardiovascular surgery, are admitted urgently, and are over the age of 75. Additionally, female individuals have a longer hospital stay than male individuals.

Weekly Number of Syphilis Cases Data
Here, the performance of the INAR(1)PEE process is compared with that of other well-known INAR(1) processes such as the INAR(1)P process (see [10]), the INAR(1)G process (see [12]), the INAR(1)PTE process (see [30]), and the INAR(1)PWE process (see [5]). The data set used here is the weekly number of syphilis cases in New York, United States, from 2007 to 2010. The ZIM package of the R software contains the data. The mean, variance, and DI of the data set are 24.6316, 105.6761, and 4.2903, respectively. The data have statistically significant overdispersion according to the test presented in [31], which results in a p-value of less than 0.001. Figure 3 depicts the basic plots of the data set, including the ACF, the partial ACF (PACF), the histogram, and the time series plot. Since only the first lag is significant in the PACF plot, an INAR(1) process is a plausible model for this data set. Table 11 reports, for the INAR(1) processes with PEE innovations and with the other innovation distributions, the parameter estimates along with their SEs, the AIC, the BIC, and the theoretical mean, variance, and DI. The INAR(1)PEE process has the minimum AIC and BIC values, which indicates that it offers a better fit than the other INAR(1) processes. The theoretical DI value for the INAR(1)PEE process is also relatively close to the empirical one. In light of this, the INAR(1)PEE process provides a very good description of the properties of the data set.

Concluding Remarks
This paper focuses on a two-parameter discrete distribution, called the PEE distribution, obtained by compounding the Poisson and EE distributions. Its properties, including the factorial moments, the moment-generating function, and the probability-generating function, were derived in explicit form and discussed. The article thus highlights the PEE distribution and, for the first time, its regression model and associated INAR(1) model. In the three applications considered, the PEE model outperforms the other compared models according to the fit criteria used. The proposed model is expected to find wider use in the modelling of non-negative integer-valued data sets from various fields of study.