A Statistical Model for Count Data Analysis and Population Size Estimation: Introducing a Mixed Poisson–Lindley Distribution and Its Zero Truncation

Abstract: Count data consist of both observed and unobserved events. The analysis of count data often encounters overdispersion, for which traditional Poisson models may not be adequate. In this paper, we introduce a tractable one-parameter mixed Poisson distribution, which combines the Poisson distribution with the improved second-degree Lindley distribution. This distribution, called the Poisson-improved second-degree Lindley distribution, is capable of effectively modeling standard count data with overdispersion. However, if the frequency of the unobserved events is unknown, the proposed distribution cannot be directly used to describe the events. To address this limitation, we propose a modification by truncating the distribution at zero. This results in a tractable zero-truncated distribution that encompasses all types of dispersion. Because the frequency of unobserved events is unknown, the population size as a whole is unknown and requires estimation. To estimate the population size, we develop a Horvitz–Thompson-like estimator utilizing the truncated distribution. Both the untruncated and truncated distributions exhibit desirable statistical properties. The estimators for both distributions, as well as the population size, are asymptotically unbiased and consistent. The current study demonstrates that both the truncated and untruncated distributions adequately explain the considered medical datasets, which are the number of dicentric chromosomes after exposure to different doses of radiation and the number of positive Salmonella samples. Moreover, the proposed population size estimator yields reliable estimates.


Introduction
Unobserved events in count data are events that were not recorded. For example, the unobserved events in insurance claims are the cases where policyholders do not claim. In some cases, even these events cannot be identified. An example is the number of times a motorist is stopped by the police: the number of motorists who were never stopped cannot be identified, and these constitute the unobserved events. Modeling count data by examining the observed events, or both the observed and unobserved events, is typical in statistical modeling. This study explores this area by introducing a mixed Poisson-Lindley distribution and extending it to its zero-truncated version, as well as to population size estimation.
The Lindley distribution was originally introduced in Bayesian analysis as a mixture of an exponential distribution and a gamma distribution [1]. Ghitany et al. [2] conducted a comprehensive examination of its statistical characteristics. Subsequent research has expanded the Lindley distribution into both two-parameter [3–11] and three-parameter variants [12–14], each offering innovative enhancements and broader applications. These distributions have been particularly useful in the fields of survival analysis and reliability assessments.
Sankaran [15] pioneered the use of the Lindley distribution as a mixing distribution with the Poisson distribution, thus creating the Poisson-Lindley (PL) distribution. This model, and the methods to estimate its parameters, were studied extensively in later research [16]. Various mixed PL distributions have been proposed as alternatives to the traditional Poisson and negative binomial models for fitting count data [17–19]. These new distributions share the overdispersion characteristic common to mixed Poisson distributions [20]. Other notable mixed Poisson models include the Poisson Inverse Pareto distribution [21] and the Poisson-transmuted record-type exponential distribution [22], among others.
When count data exclusively contain positive numbers, zero truncation is a common technique used to adjust distributions accordingly. Examples include the zero-truncated Poisson [23], zero-truncated negative binomial [24], and zero-truncated PL [25] distributions. In fields such as criminology, accurately estimating the size of a population when the frequency of non-events is not observable is a significant challenge [26–33]. Rossmo and Routledge [30] highlighted the necessity of understanding the size of a criminal population to inform the creation of effective laws and policies. Such estimations are still relatively rare in medical research but are equally needed.
The current study aims to introduce a new, practical mixed PL distribution by combining an improved second-degree Lindley (ISDL) distribution [7] with a Poisson variable, resulting in what we name the Poisson-improved second-degree Lindley (PISDL) distribution. This constitutes the first aim of this research. Given the ISDL distribution's superior modeling performance [7], we anticipate that the PISDL distribution may outperform the original PL distribution. To accommodate strictly positive data, we also introduce a zero-truncated version of the PISDL (ZTPISDL) distribution, which represents the second aim of this research. Furthermore, we propose an estimator for population size based on the ZTPISDL distribution, developed in Section 4 and applied in Section 5.3, marking the third and most original objective of the paper. Understanding the population size is critical for developing comprehensive policies. This study uses data from epidemiology and cytogenetics to demonstrate the applications of the proposed distributions and estimator.
This paper is structured as follows. Section 1 outlines the study's objectives, building upon previous research on mixed PL distributions, their truncations, and population size estimation. Section 2 presents the PISDL distribution, its statistical properties, and estimation techniques. Section 3 introduces and investigates the zero-truncated PISDL distribution. Section 4 describes the development of a new population size estimator under the ZTPISDL distribution. Simulation studies in Sections 2-4 evaluate the performance of the estimators. Section 5 applies the proposed models to medical datasets. Finally, Section 6 concludes with a discussion of the implications, limitations, and future research directions of this study.

Probability Mass Function of the PISDL Distribution
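Definition 1. A random variable $X$ is said to follow a Poisson-improved second-degree Lindley (PISDL) distribution with parameter $\lambda$ if it obeys $X \mid \theta \sim \mathrm{Poisson}(\theta)$ and $\theta \sim \mathrm{ISDL}(\lambda)$, where $\theta, \lambda > 0$.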
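Theorem 1. Let $X$ be a random variable that follows the PISDL distribution with parameter $\lambda$; then, the probability mass function (pmf) of $X$ is given as

$$p_X(x) = \frac{\lambda^3\left[x^2 + (2\lambda + 5)x + \lambda^2 + 4\lambda + 5\right]}{(\lambda^2 + 2\lambda + 2)(\lambda + 1)^{x+3}}, \quad x = 0, 1, 2, \ldots \qquad (1)$$

Proof. Using Definition 1, the pmf of $X \mid \theta$ is $p(x \mid \theta) = e^{-\theta}\theta^x / x!$, and the ISDL density of $\theta$ [7] is $g(\theta) = \lambda^3 (1 + \theta)^2 e^{-\lambda\theta} / (\lambda^2 + 2\lambda + 2)$. The resulting marginal distribution $p_X(x)$ for the PISDL distribution with parameter $\lambda$ is obtained as follows:

$$p_X(x) = \int_0^\infty \frac{e^{-\theta}\theta^x}{x!}\, g(\theta)\, d\theta = \frac{\lambda^3}{(\lambda^2 + 2\lambda + 2)\, x!}\left[\frac{x!}{(\lambda+1)^{x+1}} + \frac{2\,(x+1)!}{(\lambda+1)^{x+2}} + \frac{(x+2)!}{(\lambda+1)^{x+3}}\right],$$

which simplifies to Equation (1). □

Figure 1 shows the pmf plot for the PISDL distribution. In Figure 1, the distribution is skewed to the right, unimodal, and decreasing, which is further supported by the decreasing ratio of probability given as

$$\frac{p_X(x+1)}{p_X(x)} = \frac{(x+1)^2 + (2\lambda+5)(x+1) + \lambda^2 + 4\lambda + 5}{(\lambda+1)\left[x^2 + (2\lambda+5)x + \lambda^2 + 4\lambda + 5\right]}.$$

It is worth noting that the PISDL distribution is actually a three-component mixture distribution that can be written as $p_X(x) = \omega_1 f_1(x) + \omega_2 f_2(x) + \omega_3 f_3(x)$, where $f_i(x)$ is the pmf of the negative binomial distribution with number of successes $i$ and proportion $\lambda/(\lambda+1)$. When $i = 1$, $f_1(x)$ is the pmf of the geometric distribution, which is a special case of the negative binomial distribution. The formulae for $\omega_i$ and $f_i(x)$ for $i = 1, 2, 3$ are given as

$$\omega_1 = \frac{\lambda^2}{\lambda^2+2\lambda+2}, \quad \omega_2 = \frac{2\lambda}{\lambda^2+2\lambda+2}, \quad \omega_3 = \frac{2}{\lambda^2+2\lambda+2}, \quad f_i(x) = \binom{x+i-1}{x}\left(\frac{\lambda}{\lambda+1}\right)^{i}\left(\frac{1}{\lambda+1}\right)^{x}.$$

Even though the PISDL distribution is a three-component mixture of negative binomial distributions, three modes cannot be seen in any of the plots in Figure 1 for the selected values of $\lambda$. This suggests that the three modes, which come from the three sub-populations, must be located very close to each other. As mentioned in [34], if the modes of the sub-populations are located very close to each other, then the population will have a single mode. As such, if the data can be assumed to arise from sub-populations with very close modes, then this distribution can be considered as one of the candidates for model fitting.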
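The pmf and its mixture representation can be checked numerically. The sketch below is illustrative only: it assumes the closed form obtained by integrating the Poisson pmf against the ISDL density $\lambda^3(1+\theta)^2 e^{-\lambda\theta}/(\lambda^2+2\lambda+2)$, and the function names are hypothetical. It verifies that the pmf sums to one and agrees with the three-component negative binomial mixture.

```python
import math

def pisdl_pmf(x, lam):
    """PISDL pmf (assumed closed form from the Poisson-ISDL mixture)."""
    q = x * x + (2 * lam + 5) * x + lam * lam + 4 * lam + 5
    d = lam * lam + 2 * lam + 2
    # exp/log1p avoids overflow of (lam + 1) ** (x + 3) for large x
    return lam ** 3 * q / d * math.exp(-(x + 3) * math.log1p(lam))

def nb_pmf(x, r, p):
    """Negative binomial pmf: x failures before the r-th success."""
    return math.comb(x + r - 1, x) * p ** r * (1 - p) ** x

def pisdl_mixture_pmf(x, lam):
    """Same pmf written as a mixture of NB(1), NB(2), NB(3) components."""
    d = lam * lam + 2 * lam + 2
    weights = (lam * lam / d, 2 * lam / d, 2 / d)
    p = lam / (lam + 1)
    return sum(w * nb_pmf(x, i, p) for i, w in enumerate(weights, start=1))

lam = 0.5
total = sum(pisdl_pmf(x, lam) for x in range(400))
max_gap = max(abs(pisdl_pmf(x, lam) - pisdl_mixture_pmf(x, lam))
              for x in range(50))
print(total, max_gap)
```

Any disagreement between the two forms would indicate an error in the weights or in the closed form, so the comparison doubles as a consistency check on the derivation.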
The cumulative distribution function (cdf) for the PISDL distribution is given in Equation (2) and visualized in Figure 2. In Figure 2, it is clear that the PISDL distribution has a valid cdf since $F_X(x) \to 1$ as $x \to \infty$.
The plot for the survival function of the PISDL distribution is given in Figure 3. The hazard rate function $h_X(x)$ is defined by taking the ratio of the pmf to the survival function, i.e., $h_X(x) = p_X(x)/S_X(x)$. The hazard rate function plot is given in Figure 4. In Figure 4, it can be noted that the hazard rate functions show an increasing pattern with a limiting value of $\lambda$, meaning $\lim_{x \to \infty} h_X(x) = \lambda$.
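The limiting behavior of the hazard rate can be illustrated numerically. This is a minimal sketch under two assumptions that are not confirmed by the text: the pmf has the closed form coded below, and the survival function is taken as $S_X(x) = P(X > x)$, which is the convention under which the limit equals $\lambda$. The tail is summed directly rather than through a closed-form cdf.

```python
import math

def pisdl_pmf(x, lam):
    """PISDL pmf (assumed closed form from the Poisson-ISDL mixture)."""
    q = x * x + (2 * lam + 5) * x + lam * lam + 4 * lam + 5
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * q / d * math.exp(-(x + 3) * math.log1p(lam))

def survival(x, lam, terms=2000):
    """S(x) = P(X > x), approximated by summing the pmf tail."""
    return sum(pisdl_pmf(k, lam) for k in range(x + 1, x + 1 + terms))

def hazard(x, lam):
    """Hazard rate h(x) = p(x) / S(x)."""
    return pisdl_pmf(x, lam) / survival(x, lam)

lam = 1.0
values = [hazard(x, lam) for x in (0, 5, 20, 80)]
print(values)  # increasing, approaching lam from below
```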

Some Statistical Properties of the PISDL Distribution
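The $r$-th moment about the origin of the PISDL distribution can be written in a generic form, Equation (5). In particular, the first two moments about the origin using Equation (5) are obtained, respectively, as

$$E(X) = \frac{\lambda^2 + 4\lambda + 6}{\lambda(\lambda^2 + 2\lambda + 2)}, \qquad E(X^2) = \frac{\lambda^3 + 6\lambda^2 + 18\lambda + 24}{\lambda^2(\lambda^2 + 2\lambda + 2)}.$$

Hence, the index of dispersion ($ID_X$) can be written as

$$ID_X = \frac{\mathrm{Var}(X)}{E(X)} = 1 + \frac{\lambda^4 + 8\lambda^3 + 24\lambda^2 + 24\lambda + 12}{\lambda(\lambda^2 + 2\lambda + 2)(\lambda^2 + 4\lambda + 6)}.$$

Since $ID_X > 1$, the PISDL distribution is overdispersed for all $\lambda$.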
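A quick numerical check of the first two moments and the index of dispersion may help here. The sketch assumes the closed forms $E(X) = (\lambda^2+4\lambda+6)/[\lambda(\lambda^2+2\lambda+2)]$ and $E(X^2) = (\lambda^3+6\lambda^2+18\lambda+24)/[\lambda^2(\lambda^2+2\lambda+2)]$, which follow from the mixed Poisson identities $E(X) = E(\theta)$ and $E(X^2) = E(\theta) + E(\theta^2)$; the function names are hypothetical.

```python
import math

def pisdl_pmf(x, lam):
    """PISDL pmf (assumed closed form from the Poisson-ISDL mixture)."""
    q = x * x + (2 * lam + 5) * x + lam * lam + 4 * lam + 5
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * q / d * math.exp(-(x + 3) * math.log1p(lam))

def pisdl_moments(lam):
    """Closed-form E(X) and E(X^2) (assumed, derived from the mixture)."""
    d = lam * lam + 2 * lam + 2
    mean = (lam ** 2 + 4 * lam + 6) / (lam * d)
    second = (lam ** 3 + 6 * lam ** 2 + 18 * lam + 24) / (lam ** 2 * d)
    return mean, second

def index_of_dispersion(lam):
    """Variance-to-mean ratio of the PISDL distribution."""
    mean, second = pisdl_moments(lam)
    return (second - mean ** 2) / mean

lam = 1.0
mean, second = pisdl_moments(lam)
# Direct summation over the pmf as an independent check
numeric_mean = sum(x * pisdl_pmf(x, lam) for x in range(500))
numeric_second = sum(x * x * pisdl_pmf(x, lam) for x in range(500))
dispersion = [index_of_dispersion(v) for v in (0.1, 0.5, 1.0, 5.0, 50.0)]
print(mean, numeric_mean, second, numeric_second, dispersion)
```

The dispersion values shrink toward one as $\lambda$ grows but stay above it, in line with the overdispersion claim.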
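The mode of the PISDL distribution can be obtained by maximizing the log pmf of the PISDL distribution or, equivalently, by solving the quadratic equation $a x_m^2 + b x_m + c = 0$, where $x_m \ge 0$ is the mode of the distribution, $a = \ln(\lambda + 1)$, $b = (2\lambda + 5)\ln(\lambda + 1) - 2$, and $c = (\lambda^2 + 4\lambda + 5)\ln(\lambda + 1) - 2\lambda - 5$. As a result, the solution for the equation is

$$x_m = \frac{-b + \sqrt{b^2 - 4ac}}{2a}.$$

It can be seen from Figure 1 that $x_m > 0$ for $0 < \lambda < 1$, and when $\lambda \ge 1$, then $x_m = 0$. The moment, the probability, and the cumulant generating functions are given, respectively, as

$$M_X(t) = \frac{\lambda^3\left[(\lambda + 1 - e^t)^2 + 2(\lambda + 1 - e^t) + 2\right]}{(\lambda^2 + 2\lambda + 2)(\lambda + 1 - e^t)^3}, \quad G_X(s) = \frac{\lambda^3\left[(\lambda + 1 - s)^2 + 2(\lambda + 1 - s) + 2\right]}{(\lambda^2 + 2\lambda + 2)(\lambda + 1 - s)^3}, \quad K_X(t) = \ln M_X(t),$$

for $e^t < \lambda + 1$ and $s < \lambda + 1$.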
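The quadratic for the mode can be solved directly and compared with a brute-force search over the pmf. In this sketch the coefficients are taken as $a = \ln(\lambda+1)$, $b = (2\lambda+5)\ln(\lambda+1) - 2$, and $c = (\lambda^2+4\lambda+5)\ln(\lambda+1) - 2\lambda - 5$, the larger root is used (an assumption about which root is intended), and the result is a real-valued maximizer, so the integer mode is one of its neighbors.

```python
import math

def pisdl_pmf(x, lam):
    """PISDL pmf (assumed closed form from the Poisson-ISDL mixture)."""
    q = x * x + (2 * lam + 5) * x + lam * lam + 4 * lam + 5
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * q / d * math.exp(-(x + 3) * math.log1p(lam))

def mode_root(lam):
    """Larger root of a*x^2 + b*x + c = 0, floored at zero."""
    log_term = math.log1p(lam)
    a = log_term
    b = (2 * lam + 5) * log_term - 2
    c = (lam * lam + 4 * lam + 5) * log_term - 2 * lam - 5
    disc = b * b - 4 * a * c
    if disc < 0:
        return 0.0  # log pmf is decreasing everywhere on x >= 0
    return max(0.0, (-b + math.sqrt(disc)) / (2 * a))

def discrete_mode(lam, upper=200):
    """Integer mode found by scanning the pmf on 0..upper-1."""
    return max(range(upper), key=lambda x: pisdl_pmf(x, lam))

for lam in (0.5, 1.0, 2.0):
    print(lam, mode_root(lam), discrete_mode(lam))
```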

Parameter Estimation of the PISDL Distribution
The parameter of the PISDL distribution needs to be estimated before modeling real-world datasets. Here, parameter estimation is based on two commonly used estimation methods, namely the method of moments and maximum likelihood.

Method of Moments Estimator
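The method of moments estimator can be obtained by equating the sample mean to the population mean. Therefore, the moment estimator of $\lambda$, hereby denoted as $\hat{\lambda}_{MM}$, can be obtained by solving the following equation:

$$\bar{x} = \frac{\lambda^2 + 4\lambda + 6}{\lambda(\lambda^2 + 2\lambda + 2)},$$

or equivalently solving the cubic equation $a\lambda^3 + b\lambda^2 + c\lambda + d = 0$, where $a = \bar{x}$, $b = 2\bar{x} - 1$, $c = 2\bar{x} - 4$, and $d = -6$.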
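The cubic can be solved numerically without a general root finder. The following sketch uses bisection, relying on the observation that the cubic equals $-6$ at $\lambda = 0$ and, by Descartes' rule of signs, has exactly one positive root whenever $\bar{x} > 0$. The sample and the function names are hypothetical.

```python
def pisdl_mean(lam):
    """Population mean of the PISDL distribution (assumed closed form)."""
    return (lam ** 2 + 4 * lam + 6) / (lam * (lam ** 2 + 2 * lam + 2))

def moment_estimate(xbar, iters=200):
    """Positive root of xbar*l^3 + (2xbar-1)*l^2 + (2xbar-4)*l - 6 = 0."""
    if xbar <= 0:
        raise ValueError("the sample mean must be positive")

    def cubic(lam):
        return (xbar * lam ** 3 + (2 * xbar - 1) * lam ** 2
                + (2 * xbar - 4) * lam - 6)

    lo, hi = 0.0, 1.0
    while cubic(hi) < 0:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iters):    # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if cubic(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [0, 1, 1, 2, 3, 0, 5, 2, 1, 4]   # hypothetical counts
xbar = sum(sample) / len(sample)
lam_mm = moment_estimate(xbar)
print(xbar, lam_mm, pisdl_mean(lam_mm))
```

Plugging the estimate back into the mean formula should reproduce the sample mean, which gives a simple self-check.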

Maximum Likelihood Estimator
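Note that the log-likelihood function can be written as $l = \ln L = \sum_{i=1}^{n} \ln p_X(x_i) = \sum_{x} f_x \ln p_X(x)$, where $f_x$ is the frequency of x-valued data and $n = \sum_x f_x$. Hence, $l$ is given as

$$l = 3n\ln\lambda - n\ln(\lambda^2 + 2\lambda + 2) - n(\bar{x} + 3)\ln(\lambda + 1) + \sum_{x} f_x \ln\left[x^2 + (2\lambda + 5)x + \lambda^2 + 4\lambda + 5\right].$$

By differentiating $l$ with respect to $\lambda$, we obtain

$$\frac{dl}{d\lambda} = \frac{3n}{\lambda} - \frac{2n(\lambda + 1)}{\lambda^2 + 2\lambda + 2} - \frac{n(\bar{x} + 3)}{\lambda + 1} + \sum_{x} f_x \frac{2(x + \lambda + 2)}{x^2 + (2\lambda + 5)x + \lambda^2 + 4\lambda + 5},$$

in which equating it to zero and solving it yields the maximum likelihood estimator (MLE) for $\lambda$, which is denoted as $\hat{\lambda}$. Equivalently, one can directly maximize $l$ to obtain a similar result. To estimate the variance of $\hat{\lambda}$, Fisher's information about $\lambda$, $I(\lambda)$, needs to be obtained. The summation term in $I(\lambda)$ cannot be written in a closed form, but one may use several Lerch transcendent [35] functions for it. However, it is only practical to leave the summation term as is. The variance of $\hat{\lambda}$ is then obtained from the reciprocal of the Fisher information, and, subsequently, the $(1 - \alpha)100\%$ confidence interval can be written as $\hat{\lambda} \mp z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\lambda})}$.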
If $\alpha = 0.05$, so that $\alpha/2 = 0.025$, then the 95% confidence interval is obtained. The summation term reduces to a constant when $\hat{\lambda}$ is substituted, which eventually gives an estimated variance of $\hat{\lambda}$. For example, for $n = 100$ and $\hat{\lambda} = 1.8793$, the summation equates to 1.4284, which then results in an estimated variance of 0.0145.
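As an illustration of the estimation step, the sketch below maximizes the log-likelihood on a frequency table and builds a Wald-type interval. Several choices here are assumptions rather than the method described above: a golden-section search stands in for R's optim and presumes a unimodal log-likelihood, the variance uses the observed information (a finite-difference second derivative) instead of the expected Fisher information, and the frequency table is hypothetical.

```python
import math

def log_pmf(x, lam):
    """Log of the PISDL pmf (assumed closed form)."""
    q = x * x + (2 * lam + 5) * x + lam * lam + 4 * lam + 5
    d = lam * lam + 2 * lam + 2
    return (3 * math.log(lam) + math.log(q) - math.log(d)
            - (x + 3) * math.log1p(lam))

def log_likelihood(lam, freq):
    """l(lam) = sum over x of f_x * ln p(x)."""
    return sum(f * log_pmf(x, lam) for x, f in freq.items())

def maximize(fn, lo=1e-6, hi=20.0, iters=200):
    """Golden-section search; assumes fn is unimodal on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        a = hi - g * (hi - lo)
        b = lo + g * (hi - lo)
        if fn(a) < fn(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

def observed_information(lam, freq, h=1e-3):
    """Negative second derivative of l at lam (central difference)."""
    l = lambda v: log_likelihood(v, freq)
    return -(l(lam + h) - 2 * l(lam) + l(lam - h)) / (h * h)

freq = {0: 40, 1: 30, 2: 15, 3: 9, 4: 4, 5: 2}   # hypothetical data
lam_hat = maximize(lambda v: log_likelihood(v, freq))
var_hat = 1.0 / observed_information(lam_hat, freq)
half_width = 1.96 * math.sqrt(var_hat)
ci = (lam_hat - half_width, lam_hat + half_width)
print(lam_hat, var_hat, ci)
```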

Simulation Study
A simulation study is conducted to assess the performance of the two earlier estimators in estimating the parameter of the PISDL distribution. The algorithm for the simulation study is as follows:
Step 1: Generate $N = 1000, 2000, \ldots, 10{,}000$ random data that follow the PISDL distribution with $\lambda = 0.5$.
Generally, if the MAD and MSE approach zero as $N$ increases, then the estimate is asymptotically unbiased and consistent. For the simulation study, R software version 3.0.2 is used, and the estimated parameter using MLE is obtained using the optim command to optimize the log-likelihood value. The results of the simulation study are presented in Figure 5. In Figure 5, for any value of $\lambda$, as $N$ increases, the MAD and the MSE of the MLE and the moment estimator decrease, suggesting that both estimates are asymptotically unbiased and consistent.
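A scaled-down version of such a simulation can be sketched in a few lines. This is not the R code used in the study: it draws PISDL variates through the three-component negative binomial mixture (each component being a sum of geometric variates), evaluates only the moment estimator, and uses far fewer replications than the 2000 described for the study. All names are hypothetical.

```python
import math
import random

def sample_pisdl(lam, rng):
    """Draw one PISDL variate via the NB(1), NB(2), NB(3) mixture."""
    d = lam * lam + 2 * lam + 2
    u = rng.random() * d
    # mixing weights lam^2/d, 2*lam/d, 2/d select the number of successes
    r = 1 if u < lam * lam else (2 if u < lam * lam + 2 * lam else 3)
    log_fail = -math.log1p(lam)          # ln(1 - p) with p = lam/(lam+1)
    # sum of r geometric variates (failures before a success)
    return sum(int(math.log(1.0 - rng.random()) / log_fail)
               for _ in range(r))

def moment_estimate(xbar, iters=200):
    """Positive root of the moment cubic, found by bisection."""
    def cubic(lam):
        return (xbar * lam ** 3 + (2 * xbar - 1) * lam ** 2
                + (2 * xbar - 4) * lam - 6)
    lo, hi = 0.0, 1.0
    while cubic(hi) < 0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cubic(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def simulate(lam, n_obs, reps, seed=0):
    """Return (MAD, MSE) of the moment estimator over reps replications."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        xbar = sum(sample_pisdl(lam, rng) for _ in range(n_obs)) / n_obs
        errors.append(moment_estimate(xbar) - lam)
    mad = sum(abs(e) for e in errors) / reps
    mse = sum(e * e for e in errors) / reps
    return mad, mse

for n_obs in (250, 1000):
    print(n_obs, simulate(0.5, n_obs, reps=200))
```

Sampling through the mixture avoids needing a Poisson generator, which the Python standard library does not provide.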

Zero-Truncated Poisson-Improved Second-Degree Lindley Distribution
Truncation arises widely in real-world situations in a variety of domains, including industry [23–25], medicine [23,24], and many more. The progression of a disease that is not an increasing function but will stabilize after a certain period is an example of truncation. Therefore, a flexible truncated count data distribution is introduced by truncating the PISDL distribution at zero, yielding a zero-truncated Poisson-improved second-degree Lindley (ZTPISDL) distribution. It was observed that the PISDL and the PL distributions are equally competent based on the two datasets considered in Section 5. Therefore, the ZTPISDL distribution is expected to be as competent as the zero-truncated PL (ZTPL) distribution. The development and the statistical properties of the ZTPISDL distribution are discussed in the following sections.

Probability Mass Function of the ZTPISDL Distribution
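Theorem 2. Let $Y$ be a random variable that follows the ZTPISDL distribution with parameter $\lambda$; then, with $p_X$ denoting the pmf of the PISDL distribution, the pmf of $Y$ is given as

$$p_Y(y) = \frac{p_X(y)}{1 - p_X(0)} = \frac{\lambda^3\left[y^2 + (2\lambda + 5)y + \lambda^2 + 4\lambda + 5\right]}{(\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2)(\lambda + 1)^{y}}, \quad y = 1, 2, 3, \ldots \qquad (10)$$

Proof. Using Definition 2, the pmf of $Y$ is obtained. □

Figure 6 shows the pmf plot for the ZTPISDL distribution, which follows similar shapes as the pmf plot for the PISDL distribution: skewed to the right, unimodal, and decreasing. Using the log of $p_Y(y)$ defined in (10), the mode $y_m$ of the distribution can be obtained by solving the following formula:

$$\frac{2y_m + 2\lambda + 5}{y_m^2 + (2\lambda + 5)y_m + \lambda^2 + 4\lambda + 5} = \ln(\lambda + 1).$$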

Some Statistical Properties of the ZTPISDL Distribution
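If $Y \sim \mathrm{ZTPISDL}(\lambda)$ for $y = 1, 2, 3, \ldots$, then the $r$-th moment for $Y$ can be easily obtained because it satisfies $E(Y^r) = E(X^r)/[1 - p_X(0)]$. The first two moments about the origin are obtained and can be, respectively, written as

$$E(Y) = \frac{(\lambda^2 + 4\lambda + 6)(\lambda + 1)^3}{\lambda(\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2)}, \qquad E(Y^2) = \frac{(\lambda^3 + 6\lambda^2 + 18\lambda + 24)(\lambda + 1)^3}{\lambda^2(\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2)}.$$

Using a similar approach as the index of dispersion for $X$, the index of dispersion for $Y$ ($ID_Y$) can be written as

$$ID_Y = \frac{E(Y^2)}{E(Y)} - E(Y).$$

Based on $ID_Y$, the ZTPISDL distribution is underdispersed (overdispersed) when $\lambda > (<)\ 1.51494$. The ZTPISDL distribution is only equidispersed when $\lambda = 1.51494$. The recurrence probability for $Y$ is similar to that for $X$, except that $x$ is substituted with $y$. The generating functions for both ZTPISDL and PISDL distributions can be related since their pmfs are related as well. Therefore, the moment generating, the cumulant generating, and the probability generating functions for $Y$ can be worked out as follows:

$$M_Y(t) = \frac{M_X(t) - p_X(0)}{1 - p_X(0)}, \qquad K_Y(t) = \ln M_Y(t), \qquad G_Y(s) = \frac{G_X(s) - p_X(0)}{1 - p_X(0)}.$$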
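The equidispersion point can be located numerically from the moment relation $E(Y^r) = E(X^r)/[1 - p_X(0)]$. The sketch below assumes the closed forms for $E(X)$, $E(X^2)$, and $p_X(0)$ implied by the Poisson-ISDL mixture and uses bisection on $ID_Y - 1$ over the bracket $[1, 2]$, which is a choice made here for illustration.

```python
def pisdl_p0(lam):
    """P(X = 0) under the PISDL distribution (assumed closed form)."""
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * (lam * lam + 4 * lam + 5) / (d * (lam + 1) ** 3)

def zt_index_of_dispersion(lam):
    """Variance-to-mean ratio of the zero-truncated distribution."""
    d = lam * lam + 2 * lam + 2
    mean_x = (lam ** 2 + 4 * lam + 6) / (lam * d)
    second_x = (lam ** 3 + 6 * lam ** 2 + 18 * lam + 24) / (lam ** 2 * d)
    positive = 1.0 - pisdl_p0(lam)
    mean_y = mean_x / positive        # E(Y)   = E(X)   / (1 - p0)
    second_y = second_x / positive    # E(Y^2) = E(X^2) / (1 - p0)
    return (second_y - mean_y ** 2) / mean_y

def equidispersion_point(lo=1.0, hi=2.0, iters=100):
    """Bisection for ID_Y = 1; assumes ID_Y > 1 at lo and < 1 at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if zt_index_of_dispersion(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(zt_index_of_dispersion(1.0), zt_index_of_dispersion(2.0))
print(equidispersion_point())
```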

Parameter Estimation of the ZTPISDL Distribution
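The parameter of the ZTPISDL distribution can be estimated using the moment and MLE techniques. The moment estimator of $\lambda$, hereby denoted as $\hat{\lambda}_{MM}$, can be obtained by solving the following equation:

$$\bar{y} = \frac{(\lambda^2 + 4\lambda + 6)(\lambda + 1)^3}{\lambda(\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2)}.$$

For the MLE, the log-likelihood function is

$$l = 3n\ln\lambda - n\ln(\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2) - n\bar{y}\ln(\lambda + 1) + \sum_{y} f_y \ln\left[y^2 + (2\lambda + 5)y + \lambda^2 + 4\lambda + 5\right],$$

and by differentiating $l$ with respect to $\lambda$, we obtain

$$\frac{dl}{d\lambda} = \frac{3n}{\lambda} - \frac{n(4\lambda^3 + 18\lambda^2 + 26\lambda + 8)}{\lambda^4 + 6\lambda^3 + 13\lambda^2 + 8\lambda + 2} - \frac{n\bar{y}}{\lambda + 1} + \sum_{y} f_y \frac{2(y + \lambda + 2)}{y^2 + (2\lambda + 5)y + \lambda^2 + 4\lambda + 5},$$

in which equating it to zero and solving it gives the MLE for $\lambda$, which is denoted as $\hat{\lambda}$. Equivalently, one can directly maximize $l$ to obtain a similar result.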
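Both estimators for the truncated model can be sketched as follows. The code assumes the zero-truncated pmf $\lambda^3[y^2+(2\lambda+5)y+\lambda^2+4\lambda+5]/[(\lambda^4+6\lambda^3+13\lambda^2+8\lambda+2)(\lambda+1)^y]$, treats the truncated mean as decreasing in $\lambda$ (assumed, not proven here) so that bisection applies, and uses a golden-section search in place of R's optim. The frequency table is hypothetical.

```python
import math

def zt_log_pmf(y, lam):
    """Log pmf of the zero-truncated distribution (assumed closed form)."""
    q = y * y + (2 * lam + 5) * y + lam * lam + 4 * lam + 5
    t = lam ** 4 + 6 * lam ** 3 + 13 * lam ** 2 + 8 * lam + 2
    return 3 * math.log(lam) + math.log(q) - math.log(t) - y * math.log1p(lam)

def zt_mean(lam):
    """E(Y) for the zero-truncated distribution."""
    t = lam ** 4 + 6 * lam ** 3 + 13 * lam ** 2 + 8 * lam + 2
    return (lam ** 2 + 4 * lam + 6) * (lam + 1) ** 3 / (lam * t)

def zt_moment_estimate(ybar, iters=200):
    """Solve E(Y) = ybar by bisection; requires ybar > 1."""
    if ybar <= 1:
        raise ValueError("the truncated sample mean must exceed 1")
    lo, hi = 1e-8, 1.0
    while zt_mean(hi) > ybar:   # E(Y) falls toward 1 as lam grows
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if zt_mean(mid) > ybar else (lo, mid)
    return 0.5 * (lo + hi)

def zt_mle(freq, lo=1e-6, hi=20.0, iters=200):
    """Golden-section maximization of the truncated log-likelihood."""
    fn = lambda v: sum(f * zt_log_pmf(y, v) for y, f in freq.items())
    g = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        a, b = hi - g * (hi - lo), lo + g * (hi - lo)
        if fn(a) < fn(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

freq = {1: 25, 2: 15, 3: 8, 4: 4, 5: 2, 6: 1}   # hypothetical positive counts
n = sum(freq.values())
ybar = sum(y * f for y, f in freq.items()) / n
print(zt_moment_estimate(ybar), zt_mle(freq))
```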

Simulation Study
A simulation study is conducted to assess the performance of the obtained estimators for the parameter of the ZTPISDL distribution. The algorithm for the simulation study is similar to the one in Section 2.4, except that the data are generated using the ZTPISDL distribution. Similarly, R software is used, and the estimated parameter using MLE is obtained using the optim command to optimize the log-likelihood value. The findings of the simulation study are shown in Figure 7. Figure 7 shows that for any value of $\lambda$, as $N$ increases, the MAD and the MSE of both estimates fall, indicating that they are both asymptotically unbiased and consistent.
When dealing with a truncated distribution, the population size is usually unknown and needs to be estimated. Assuming that a population follows the ZTPISDL distribution, a population size estimator is developed and studied. The discussion on the population size is provided in the next section.

Horvitz-Thompson Estimator under ZTPISDL Distribution (HT-ZTPISDL)
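A popular estimator for the population size is the Horvitz-Thompson estimator [36], which includes information from both truncated and untruncated distributions. A Horvitz-Thompson estimator has the following form:

$$\hat{N} = \frac{n}{1 - p(0; \hat{\omega})},$$

where $\hat{\omega}$ is the estimator for $\omega$ for a distribution, $p(0; \omega)$ is the probability of a zero count under the untruncated distribution, and $n$ is the sample size. Basically, the estimated parameter of the truncated distribution is substituted into the probability mass function of the untruncated distribution for the unobserved events. The estimator for the population size $N$, which follows a ZTPISDL distribution in the form of the Horvitz-Thompson estimator (HT-ZTPISDL), is given as

$$\hat{N} = \frac{n(\hat{\lambda}^2 + 2\hat{\lambda} + 2)(\hat{\lambda} + 1)^3}{\hat{\lambda}^4 + 6\hat{\lambda}^3 + 13\hat{\lambda}^2 + 8\hat{\lambda} + 2},$$

where $\hat{\lambda}$ is the MLE for $\lambda$ in the ZTPISDL distribution.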
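The plug-in step is simple enough to show directly. The sketch assumes the zero probability $p_X(0) = \lambda^3(\lambda^2+4\lambda+5)/[(\lambda^2+2\lambda+2)(\lambda+1)^3]$ implied by the Poisson-ISDL mixture and checks that the generic form $n/[1 - p_X(0)]$ agrees with the expanded ratio. The inputs are hypothetical, not values from the paper's datasets.

```python
def pisdl_p0(lam):
    """P(X = 0) under the untruncated PISDL distribution (assumed form)."""
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * (lam * lam + 4 * lam + 5) / (d * (lam + 1) ** 3)

def ht_population_size(n, lam_hat):
    """Horvitz-Thompson-type estimate: n divided by P(X > 0)."""
    return n / (1.0 - pisdl_p0(lam_hat))

def ht_closed_form(n, lam_hat):
    """Same estimate written as an explicit ratio of polynomials."""
    numerator = (lam_hat ** 2 + 2 * lam_hat + 2) * (lam_hat + 1) ** 3
    denominator = (lam_hat ** 4 + 6 * lam_hat ** 3 + 13 * lam_hat ** 2
                   + 8 * lam_hat + 2)
    return n * numerator / denominator

n, lam_hat = 75, 1.0            # hypothetical sample size and estimate
print(ht_population_size(n, lam_hat), ht_closed_form(n, lam_hat))
```

With $\hat{\lambda} = 1$ the zero probability is 0.25, so 75 observed units correspond to an estimated population of 100.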

Variance and Confidence Interval for HT-ZTPISDL
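Böhning [37] has provided a simple yet informative method for obtaining a variance for any population size estimator using the conditional expectation technique. The variance of HT-ZTPISDL can be written as

$$\mathrm{Var}(\hat{N}) = \mathrm{Var}_n\left[E(\hat{N} \mid n)\right] + E_n\left[\mathrm{Var}(\hat{N} \mid n)\right], \qquad (14)$$

where $\pi = 1 - p_X(0)$ denotes the probability of a non-zero count. Observe that the variation in estimating the population size comes from two sources. The first term in Equation (14) explains the binomial variation involved in sampling $n$ units of data with population size $N$ and probability $\pi$ [37]. The second term in Equation (14) explains the variation that occurs when estimating parameter $\lambda$ using $n$ observed data [37]. Using the delta method for the first term of Equation (14), we obtain

$$\mathrm{Var}_n\left[E(\hat{N} \mid n)\right] \approx \mathrm{Var}_n\left(\frac{n}{\pi}\right) = \frac{\mathrm{Var}(n)}{\pi^2}.$$

As $n \sim \mathrm{Bin}(N, \pi)$, we obtain

$$\frac{\mathrm{Var}(n)}{\pi^2} = \frac{N\pi(1 - \pi)}{\pi^2} = \frac{N(1 - \pi)}{\pi}.$$

The equation above is further estimated by substituting $N\pi$ with $n$ and $\lambda$ with $\hat{\lambda}$, yielding $n(1 - \hat{\pi})/\hat{\pi}^2$. For the ZTPISDL distribution, we obtain

$$\frac{n(1 - \hat{\pi})}{\hat{\pi}^2} = \frac{n\hat{\lambda}^3(\hat{\lambda}^2 + 4\hat{\lambda} + 5)(\hat{\lambda}^2 + 2\hat{\lambda} + 2)(\hat{\lambda} + 1)^3}{(\hat{\lambda}^4 + 6\hat{\lambda}^3 + 13\hat{\lambda}^2 + 8\hat{\lambda} + 2)^2}. \qquad (15)$$

Now, consider the second term of Equation (14), and assume that $\hat{N} = n\,g(\hat{\lambda})$ with $g(\lambda) = 1/\pi(\lambda)$. Using the delta method, we obtain

$$\mathrm{Var}(\hat{N} \mid n) \approx n^2\left[g'(\lambda)\right]^2 \mathrm{Var}(\hat{\lambda}),$$

where $\mathrm{Var}(\hat{\lambda})$ is obtained from the Fisher information $I(\lambda)$ given in Equation (9). By substituting $\lambda$ with $\hat{\lambda}$, the second term of Equation (14) can be estimated as

$$n^2\left[g'(\hat{\lambda})\right]^2 \widehat{\mathrm{Var}}(\hat{\lambda}). \qquad (16)$$

By combining Equations (15) and (16), the variance of the HT-ZTPISDL can be estimated as their sum, $\widehat{\mathrm{Var}}(\hat{N})$. Therefore, the 95% confidence interval for the estimator can be written as $\hat{N} \mp z_{0.025}\sqrt{\widehat{\mathrm{Var}}(\hat{N})}$, where $z_{0.025} = 1.96$.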
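The two-source decomposition can be sketched numerically. This is an illustration under stated assumptions rather than the paper's closed-form expressions: the binomial part is taken as $n\,p_0/(1-p_0)^2$ with $p_0 = p_X(0;\hat{\lambda})$, the estimation part as $[n\,g'(\hat{\lambda})]^2\,\widehat{\mathrm{Var}}(\hat{\lambda})$ with $g(\lambda) = 1/(1-p_0)$, the derivative is approximated by a central difference, and the variance of $\hat{\lambda}$ is supplied by the caller. All inputs are hypothetical.

```python
import math

def pisdl_p0(lam):
    """P(X = 0) under the untruncated PISDL distribution (assumed form)."""
    d = lam * lam + 2 * lam + 2
    return lam ** 3 * (lam * lam + 4 * lam + 5) / (d * (lam + 1) ** 3)

def inverse_detection(lam):
    """g(lam) = 1 / P(X > 0), the factor multiplying n in the estimator."""
    return 1.0 / (1.0 - pisdl_p0(lam))

def ht_variance(n, lam_hat, var_lam, h=1e-5):
    """Binomial-sampling part plus parameter-estimation part."""
    p0 = pisdl_p0(lam_hat)
    binomial_part = n * p0 / (1.0 - p0) ** 2
    # central-difference approximation of g'(lam_hat)
    slope = (inverse_detection(lam_hat + h)
             - inverse_detection(lam_hat - h)) / (2 * h)
    estimation_part = (n * slope) ** 2 * var_lam
    return binomial_part + estimation_part

def ht_confidence_interval(n, lam_hat, var_lam, z=1.96):
    """Wald-type interval around the point estimate n * g(lam_hat)."""
    n_hat = n * inverse_detection(lam_hat)
    half = z * math.sqrt(ht_variance(n, lam_hat, var_lam))
    return n_hat - half, n_hat + half

n, lam_hat, var_lam = 75, 1.0, 0.01   # hypothetical inputs
print(ht_variance(n, lam_hat, var_lam))
print(ht_confidence_interval(n, lam_hat, var_lam))
```

Setting the variance of $\hat{\lambda}$ to zero isolates the binomial part, which is a convenient way to see how much of the interval width comes from parameter uncertainty.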

Simulation Study
A simulation study is conducted to assess the performance of the HT-ZTPISDL estimator in estimating the population size when the data are generated from the ZTPISDL distribution. The algorithm for the simulation study is as follows:
Step 1: Generate $N = 1000, 2000, \ldots, 10{,}000$ random data, which follow the ZTPISDL distribution with $\lambda = 0.5$.
R software is used to obtain $\hat{\lambda}$ using the optim command, and $\hat{N}$ is obtained by plugging in $\hat{\lambda}$ in Equation (10). The results of the simulation study are presented in Figure 8. In this figure, for any value of $\lambda$, as $N$ increases, the MAD and the MSE of $\hat{N}$ decrease, suggesting that $\hat{N}$ is asymptotically unbiased and consistent.

Medical Data Applications
The applications in the medical datasets are segregated into three subsections with respect to the PISDL distribution, the ZTPISDL distribution, and the estimation of the population size.For a comparison of the model fittings, Akaike's information criterion, AIC [38], and Bayesian information criterion, BIC [39], are used.

Model Fittings Using the PISDL Distribution
Two datasets on the number of dicentric chromosomes after being exposed to different doses of radiation (0.405 and 0.600) that were studied by Puig and Barquinero [40] are considered in fitting using the PISDL distribution.The two datasets are overdispersed with dispersion values of 1.2704 and 1.2178, respectively.Since the PISDL distribution is closely related to the Poisson and the PL distributions, the model fittings from the PISDL distribution are compared with those from the Poisson and the PL distributions.The results of the model fittings for the two datasets using the three distributions are summarized in Tables 1 and 2.
In both tables, the Poisson distribution does not fit the data based on the p-value.On the other hand, the model fittings based on both PL and PISDL distributions gave similar AIC and BIC values, as well as non-significant p-values, indicating that both distributions can be used for describing the number of dicentric chromosomes after being exposed to radiation of different doses.However, the first dataset was fitted better by the PL distribution, whereas the second dataset was fitted better by the PISDL distribution.Therefore, it is reasonable to suggest that both PL and PISDL distributions are equally competent and can be selected as the best distributions in explaining the number of dicentric chromosomes after being exposed to two different doses of radiation.The comparison between the empirical plots of the data and the fitted values based on Equation (1) in Theorem 1 for the considered datasets, (i) the number of dicentric chromosomes after being exposed to a 0.405 radiation dose, and (ii) the number of dicentric chromosomes after being exposed to a 0.600 radiation dose are presented in Figure 9.

Model Fittings Using ZTPISDL Distribution
A dataset on the number of positive samples of Salmonella data, which was initially given in a survey study by Snow et al. [41] and later summarized by Arnold et al. [42], is considered in model fitting using the zero-truncated Poisson (ZTP), the zero-truncated PL (ZTPL), and the ZTPISDL distributions.The data refer to the number of farms with at least one positive sample of Salmonella.The dataset is overdispersed with a dispersion value of 1.4381.The results of the model fittings are given in Table 3.In Table 3, the ZTP distribution does not provide a good fit to the data based on the p-value.On the contrary, the ZTPISDL and the ZTPL distributions provide a good fit to the data based on the AIC and BIC values, as well as the non-significant p-value.However, the dataset was fitted better by the ZTPISDL distribution.Therefore, this suggests that the ZTPISDL is the best distribution in describing the number of positive samples of Salmonella data.The comparison between the empirical plots of the data and the fitted value based on Equation (10) in Theorem 2 for the above dataset is presented in Figure 10.

Estimating Population Size
The dataset studied in Section 5.2 on the number of positive samples of Salmonella is considered to estimate the population size. The Horvitz-Thompson estimator based on the proposed ZTPISDL distribution is compared with those based on the ZTP and the ZTPL distributions. The Horvitz-Thompson estimators based on the ZTP [43,44] and the ZTPL [32] distributions are, respectively, given as

$$\hat{N}_{ZTP} = \frac{n}{1 - e^{-\hat{\lambda}}}, \qquad \hat{N}_{ZTPL} = \frac{n(\hat{\theta} + 1)^3}{\hat{\theta}^2 + 3\hat{\theta} + 1},$$

where $\hat{\lambda}$ refers to the MLE of $\lambda$ for the ZTP distribution and $\hat{\theta}$ refers to the MLE of $\theta$ for the ZTPL distribution. The estimated population sizes and their corresponding standard deviations, as well as the lower and upper limits for the 95% confidence interval, are presented in Table 4. Based on Table 4, the estimated population size based on the ZTPISDL distribution using the MLE is 66.64 with a 95% confidence interval between 57.04 and 76.24. Since the ZTPISDL distribution based on the MLE provides the best fit for the data (refer to Table 3), the resulting estimated population size is acceptable.

Conclusions, Limitations, and Future Research
A tractable one-parameter Poisson-improved second-degree Lindley (PISDL) distribution has been proposed to address the need for modeling count data exhibiting overdispersion.This distribution is composed of three negative binomial distributions, each with fixed mixing proportions and parameters, allowing for the fitting of datasets originating from three sub-populations whose modes are proximal.However, if the three modes are far from each other and clearly visible from the plots of the datasets, the PISDL distribution may not be a good candidate for model fittings.The hazard rate function for the PISDL distribution showed an increasing shape.Parameters of the PISDL distribution have been estimated using maximum likelihood estimation (MLE) and moment methods, and both were found to be asymptotically unbiased and consistent.It has been observed from model fittings that the PISDL distribution performs on par with the PL distribution and surpasses the standard Poisson distribution in describing the number of dicentric chromosomes post-exposure to various radiation doses.
Given that data may not always present the frequency of unobserved events and may exhibit dispersion, methodologies like zero truncation or the size-biased approach are employed. Zero truncation is a commonly favored method for handling datasets lacking frequencies of non-observed events. Hence, a zero-truncated version of the proposed distribution, named the zero-truncated PISDL (ZTPISDL) distribution, has been proposed to accommodate data exhibiting either over- or underdispersion. Parameters of the ZTPISDL distribution estimated by the MLE and moment methods have also been shown to be asymptotically unbiased and consistent. When applied to the dataset considered, the ZTPISDL distribution with parameters estimated by the MLE has provided the best fit in comparison with the zero-truncated Poisson and zero-truncated PL distributions.
When the population size is unknown due to the absence of frequencies for non-observed events in positive count data, estimation has been conducted using the Horvitz-Thompson estimator based on the ZTPISDL distribution. Since the ZTPISDL distribution has provided the best fit for the dataset considered in this study, the acceptance of the resulting estimated population size for the number of positive Salmonella samples is justified. It is suggested that this population size estimate may serve as a lower bound to the actual population size, especially when the ZTPISDL distribution is extended to linear models with the inclusion of relevant covariates. Moreover, the derivation of the variance and confidence interval for the population size estimator is intended to assist policymakers in revising rules and guidelines pertaining to the population, which will, in turn, benefit the population at large. A limitation of the PISDL distribution is that its mixing proportions depend solely on λ, which restricts its flexibility. Introducing a new parameter (denoted as α) to influence the mixing proportions alongside λ would yield a more adaptable PISDL distribution. This would, in turn, lead to a more versatile ZTPISDL distribution and improved estimates for population size.
Further research is anticipated to explore additional modifications and applications of the PISDL distribution. These include, but are not limited to, actuarial measures, such as value-at-risk and tail value-at-risk; reliability measures, like hazard rate and entropy; various forms of inflated models, including zero-inflated, k-inflated, and zero-one-inflated distributions; and weighted models, like size-biased and area-biased distributions. These enhancements are expected to broaden the utility of the PISDL distribution, making it a competitive model in the statistical literature.

Figure 2. The cdf plot for the PISDL distribution when $\lambda = 0.5, 1.0, 2.5, 5.0$.
Based on Equation (2), the survival function $S_X(x)$ can be obtained as $S_X(x) = 1 - F_X(x)$.

Step 3: Repeat Steps 1-2 for a total of 2000 iterations and obtain the estimates.
Step 4: Calculate the mean absolute deviation (MAD) and the mean squared error (MSE), given, respectively, as $MAD = \sum_{i=1}^{2000} |\hat{\lambda}_i - \lambda| / 2000$ and $MSE = \sum_{i=1}^{2000} (\hat{\lambda}_i - \lambda)^2 / 2000$, where $\hat{\lambda}$ can be the MLE or the moment estimator for $\lambda$.

Step 3: Repeat Steps 1-2 for a total of 2000 iterations and obtain the estimates.
Step 4: Calculate the relative absolute error (RAB) and the relative standard deviation (RSd), given, respectively, as $RAB = |\hat{N} - N| / N$ and

Figure 8. The RAB and the RSd of the estimated population size when $\lambda = 0.5, 1.0, 2.0$.

Figure 9. Plots of the empirical (vertical black line) and the fitted (blue line) values for (i) the number of dicentric chromosomes after being exposed to a 0.405 radiation dose and (ii) the number of dicentric chromosomes after being exposed to a 0.600 radiation dose.

Figure 10. A plot of the empirical (vertical black line) and the fitted values (blue line) for the number of positive samples of Salmonella.

Table 1. Model fittings of the number of dicentric chromosomes after being exposed to a 0.405 radiation dose using Poisson, PL, and PISDL distributions.

Table 2. Model fittings of the number of dicentric chromosomes after being exposed to a 0.600 radiation dose using Poisson, PL, and PISDL distributions.

Table 3. Model fittings of the number of positive samples of Salmonella using the ZTP, ZTPL, and ZTPISDL distributions.

Table 4. The estimated population size, the standard deviation, and the lower and upper limits for the 95% confidence interval of the population size estimator based on the ZTP, ZTPL, and ZTPISDL distributions.