The Applications of Generalized Poisson Regression Models to Insurance Claim Data

: Predictive modeling has been widely used for insurance rate making. In this paper, we focus on insurance claim count data and address their common issues with more ﬂexible modeling techniques. In particular, we study the zero-inﬂated and hurdle-generalized Poisson and negative binomial distributions in a functional form for modeling insurance claim count data. It is shown that these models are useful in addressing the problem of excess zeros and over-dispersion of the claim count variable. In addition, we show that including the exposure as a covariate in both the zero and the count part of the model is an effective approach to incorporating exposure information in zero-inﬂated and hurdle models. We illustrate the effectiveness and versatility of the introduced models using three real datasets. The results suggest their promising applications in insurance risk classiﬁcation and beyond.


Introduction
In a priori risk classification, actuaries group risks with similar risk characteristics in order to set insurance premiums.Accurate risk classification is extremely important for maintaining a financially sound and equitable system, assuring the availability of needed insurance coverage to the public.
The individual risk characteristics used in risk classification are called rating variables.For example, in automobile insurance, commonly used rating variables include geography, driver characteristics such as age, gender, and marital status, and vehicle characteristics such as the make and value of the vehicle insured.
Risk classification systems are generally based, whenever possible, on statistical analysis.Naturally, statistical methods such as generalized linear models and generalized additive models provide useful tools.Numerous books and papers discuss the application of statistical methods in insurance rate making, see, e.g., Renshaw (1994), Denuit et al. (2007), Frees (2009), Frees et al. (2014), and the references therein.
This paper studies claim frequency modeling.It is well known that the Poisson regression model is not always suitable because real-world claim frequency data usually exhibit over-dispersion.Alternative models have been proposed in the literature.Notably, negative binomial regression models were discussed by Dionne and Vanasse (1989), Frees and Valdez (2008), and Wüthrich and Merz (2008).Inverse Gaussian models were studied by Dean et al. (1989) and Wang et al. (2023).Consul (1993) compared the generalized Poisson (GP) distribution with several well-known distributions and concluded that the GP distribution is a plausible model for claim frequency data.
Insurance claim data usually have an excessive number of zeros.Zero-inflated models, studied by Lambert (1992), have been used to deal with such problems in the literature.
For example, Yip and Yau (2005) applied several parametric zero-inflated count distributions, including zero-inflated Poisson (ZIP), zero-inflated generalize negative binomial, zero-inflated generalized Poisson, and zero-inflated double Poisson distributions, to accommodate the excess zeros in insurance claim count data.Famoye and Singh (2006) applied the zero-inflated generalized Poisson regression model to fit a domestic violence dataset.Czado et al. (2007) extended the zero-inflated generalized Poisson regression model by including explanatory regression parameters in both the zero-inflation and the dispersion parameters and applied the extended model to patent outsourcing rate data.
The hurdle model, which was introduced by Cragg (1971) and later refined by Mullahy (1986), can also be applied to model data with an excessive number of zeros.For instance, Saffari et al. (2013) and Zuo et al. (2021) studied the hurdle-generalized Poisson distribution, whereas Bhaktha (2018) employed the hurdle negative binomial approach.Additionally, using an insurance claim number dataset, Boucher et al. (2007) compared various zeroinflated and hurdle models.
Another issue with insurance claim datasets lies in the fact that different observations may have different risk exposures, but only the total number of claims for all exposures is recorded.For example, some policyholders stay longer in the policy than others.An "offset" term is often utilized to account for the varying exposure scale.In the case of the log link function, this is equivalent to including the log of exposure as an explanatory variable with a fixed coefficient of one (Agresti 2015).For zero-inflated and hurdle models, the offset is usually only included in the count part of the models, see, e.g., Lee et al. (2001), Loquiha et al. (2013), Zhen et al. (2018), andDai et al. (2018).However, as pointed out by Feng (2022), varying exposure can also influence the probability of observing excessive zeros.
The paper's main contributions are as follows.First, we delineate several forms of hurdle-generalized Poisson (HGP) and hurdle-generalized negative binomial (HNB) regression models.It is shown that these models are useful in addressing the problem of excess zeros and over-dispersion of claim count datasets.Second, through a detailed analysis, we show that including exposure in both the zero and the count parts as a covariate is an effective approach to incorporating exposure information into zero-inflated and hurdle models.Lastly, from a practical point of view, we illustrate the effectiveness and versatility of the introduced models using real datasets and compare the results with other commonly used models.
We organize the rest of the paper as follows.Section 2 provides the mathematical background, specifically highlighting several forms of HGP and HNB regression models.Section 3 studies how to include exposure in zero-inflated and hurdle regression models.Section 4 presents real-world applications, analyzing various models using data from a Malaysian auto insurance dataset, the US National Medical Expenditure Survey, and a French auto insurance dataset.Section 5 explores the variable selection problem in the HGP and HNB models by applying the Lasso shrinkage methodology.Section 6 concludes the paper.

Mathematical Models
In this section, we first provide the mathematical background of generalized Poisson and generalized negative binomial models and then introduce their hurdle functional forms.

Various Forms of Generalized Poisson and Generalized Negative Binomial Random Variables
From a probability point view, the GP distribution was introduced by Consul and Jain (1973) as a limiting form of a generalized negative binomial distribution.Consul and Shoukri (1988) showed that a GP distribution can be viewed as the distribution of the number of served customers in a busy period of a queue with Poisson arrival and a constant service time.GP distribution can also be considered as the distribution of the total progeny in a Galton branching process, where both the initial number of a species and the number of offspring an individual produces follow a Poisson distribution.From a statistical point view, the GP distribution and its related distributions are flexible and can be used to model over-dispersed or under-dispersed data.
The GP distribution has been applied in actuarial science.For instance, Gerber (1990) showed that the number of jumps it takes for a classical Poisson risk process with a constant claim size to reach a certain level follows a GP distribution.Consul (1993) compared the GP distribution with several well-known distributions and concluded that the GP distribution is a plausible model for claim frequency data.Calderín-Ojeda et al. (2019) proposed a special GP distribution, and tested the performance of their GP regression model using French Motor Personal Line datasets, which are available in the R package "CASdatasets".Scollnik (1995) presented a Bayesian analysis of GP distribution using two datasets; one was the number of injuries in automobile accidents, and the other was the ship damage incident data from Lloyd's Register of Shipping.
Different forms of GP random variables have been proposed in the literature.The classical GP-1 distribution has a probability mass function (pmf) of where µ i is the mean parameter and a is the dispersion parameter.The variance of GP-1 is µ i (1 + a) 2 .Thus, a > 0 implies over-dispersion, while a < 0 implies under-dispersion.When a = 0, GP-1 reduces to a Poisson distribution.
A slightly different parameterization gives the so-called GP-2 distribution with the pmf The mean and variance of the GP-2 distribution are µ i and µ i (1 + aµ i ) 2 , respectively.While the GP-1 distribution has a linear mean-variance relationship, the GP-2 distribution has a cubic mean-variance relationship.The applications of the GP-2 distribution have been discussed in, e.g., Wang and Famoye (1997) and Ismail and Jemain (2007).
Another parameterization of the GP distribution, GP-P, which was studied in, e.g., Zamani and Ismail (2012), has the pmf The additional parameter, P, provides more flexibility in modeling the variance function.It reduces to GP-1 and GP-2 regressions with P = 1 and P = 2, respectively.
The generalized negative binomial (NB-P) distribution, which was introduced in Greene ( 2008) and discussed in Cameron and Trivedi (2013), Hilbe (2011) and Ismail and Zamani (2013), has a parameter set (a, µ i , P) and the pmf

Hurdle Functional Form of the Generalized Poisson Regression Model
A hurdle model involves the application of two different models to analyze data that fall either above or below a specific threshold, which is typically set at zero.Therefore, it is sometimes called a two-part model.Following Mullahy (1986), the distribution of the claim counts according to a hurdle model is given by . ., where f 1 and f 2 are two probability functions that describe the distribution of the zero and non-zero parts of Y i .In insurance applications, the quantity Φ can be interpreted as the probability of reporting at least one claim.As argued in Boucher et al. (2007), in auto insurance, policyholders' behavior may change after a claim has been made; therefore, it is natural to apply hurdle models to describe the two parts (zero claim and non-zero claims) of the claim process.An advantage of the hurdle model is that the parameters for each part can be estimated separately.
In what follows, we assume that f 1 is Bernoulli-distributed.Then, the hurdle functional form of a generalized Poisson (HGP-P) regression model is given as where g P (y i ) is defined in Equation ( 1).Note that the term g P (y i ) 1−g P (0) is usually referred to as the zero-truncated GP distribution.In addition, we assume that µ i is related to covariates x i by a log link function log where β is the vector of regression parameters, and ω i is related to covariates z i by a logit link function The HGP-P model reduces to the HGP-1 and HGP-2 models when P = 1 and P = 2, respectively.Therefore, the likelihood ratio test (LRT) can be applied for testing the HGP-1 model (or HGP-2 model) against the HGP-P model.
The loglikelihood function for the HGP-P regression model is given by log . Note that the regression parameters β and γ are included in the loglikelihood function through the link functions for µ i and ω i .
The two components of the loglikelihood function, log L 1 (γ) and log L 2 (β, a, P), can be maximized separately.In particular, the parameter γ can be estimated using a simple logistic regression.The system of normal equations for estimating β is obtained by taking the partial derivative of log L 2 (β, a, P).Since these partial derivative equations cannot be simplified, the Newton-Raphson method is applied to solve them.The standard errors of the parameter estimates are given by the square root of the diagonal elements of the inverse of the Hessian matrix.The estimated parameters from the truncated Poisson fit are used as starting values for faster convergence.
We note that the two-part structure of the hurdle model greatly simplifies the optimization procedure.

Hurdle Functional Form of the Generalized Negative Binomial Regression Model
The hurdle functional form of the generalized negative binomial (HNB-P) regression model is defined as where h P (•) is the NB-P pmf defined in Equation ( 2), ω i is related to covariates z i with a logit link function (4), and µ i is related to covariates x i via a log link function (3).
The loglikelihood function for the HNB-P regression model is given by . The estimation of the regression parameters for HNB-P is similar to that for the HGP-P model.

Incorporating Exposure in Zero-Inflated and Hurdle Regression Models
In many insurance loss datasets, different policyholders (observations) may have different risk exposures, yet only the total number of claims is reported.For example, a dataset could report the total number of claims made by a policyholder during the whole policy period, but different policyholders may stay in the policy for different periods of time.An offset term in the regression is a commonly used strategy for enclosing a population size at risk or the amount of exposure time.Particularly, if a log link function is used, the model can be defined as log or equivalently , where E i is the exposure for policyholder i.This approach of considering exposure makes sense because, intuitively, the mean number of events should be proportional to the size of the exposure.
For zero-inflated and hurdle models, the offset is usually only included in the count part of the models, see, e.g., Lee et al. (2001), Loquiha et al. (2013), Zhen et al. (2018), Dai et al. (2018).However, as pointed out by Feng (2022), the probability of observing excessive zeros can also be impacted by exposure in many situations.One might directly impose exposure in the zero-inflated part of the model in the same way as in the count model.For example, if the logit model is used for the zero part, we might write However, this may not be plausible because it indicates that the probability of zero inflation ω i increases with the exposure size, which is counter-intuitive.Feng (2022) then proposed the model logit where ξ 1 and ξ 2 are the regression coefficients for the logarithm transformed E i .Model (5) allows risk exposures to be included in the analysis as a regular covariate in both the binary and count parts of the zero-inflated and hurdle models.We next provide a simulation study to illustrate the benefit of such a method.

A Simulation Study
In this subsection, we implement a simulation study to compare several approaches to incorporate risk exposure in zero-inflation models.
We generate 100 observations as follows.Each observation i is associated with an exposure size E i , which is uniformly distributed among one to ten.The number of events, N i , for the ith observation is then the summation of E i independent and identically distributed ZIP-distributed random variables Y (i) with parameters (ω i , µ i ), where ω i is the zero-inflation probability and µ i is the Poisson count mean.That is, Furthermore, assume that there are two covariates: x 1,i , which can take values 0 or 1, and x 2,i , which is a realization of a normal (1, 1) random variable.The distribution parameters are related to the covariates by: The mean and standard deviation of the simulated number of events are 18.4 and 32.02, respectively, where 34% of the claims are zero.
We next fit the simulated data to ZIP and ZIGP-P regression models that handle the exposures differently, as described in Equations ( 6)-(10).

ZIP
logit The parameter estimates, including absolute t-ratios, log likelihood (LL), Akaike information criterion (AIC), and Bayesian information criterion (BIC), for the ZIP model are presented in Table 1, and those for the ZIGP-P model are shown in Table 2. Table 1 shows that the ZIP ee model has the lowest AIC and BIC values.The worst model is ZIP 11 , which includes an offset term in the binary component and the positive count.Notice that the parameter value ξ 1 for the binary part is negative, expressing the fact that when the exposure increases, one should expect a smaller value for the zero-inflation parameter ω i ; on the other hand, the value of parameter ξ 2 for the count part is positive, expressing the fact that when the exposure increases, one should expect a greater value for the expected count µ i .This finding highlights the importance of having exposure in both the binary and count parts of the model.
Notice that in the true model, the distribution of N i is no longer ZIP; it is rather a summation of some random number of ZIP distributions.Therefore, there is no reason that one has to fit the data with exposure with a ZIP model.
Table 2 shows that the ZIGP-P ee model fits the data better than the competing models based on AIC and BIC criteria.In addition, comparing Tables 1 and 2, we see that the ZIGP-P models perform better than the ZIP models.This is because, as discussed above, the distribution of N i s is no longer ZIP.The ZIGP-P model, which includes two additional parameters compared to the ZIP model, presents a more flexible option for fitting the data.
Table 3 presents the average AIC and BIC values obtained from analyzing 100 simulated datasets for model comparison when the number of simulated data is 1000, 5000, and 10,000.The results demonstrate that the ZIGP-P ee model performs better than the other models in all scenarios, consistently producing the smallest AIC and BIC values.These findings indicate that the ZIGP-P ee model is a robust and reliable model for analyzing the simulated dataset.Table 4 shows the results of fitting the HGP-P model to the simulated data in this section under different treatments of exposure.According to the AIC and BIC, the HGP-P ee model outperforms other competing models, which is consistent with previous findings.This observation emphasizes the importance of including exposure (log(Exposures)) in the binary part of the HGP-P model.The estimated effect of log(Exposures) on the binary component of the HGP-P ee model, ξ 1 = −1.58(t-ratio = 40.38)reflects the negative association between exposure and the probability of observing an excess zero count.This finding is consistent with the ZIGP-P ee model's ξ 1 estimation.In contrast, the estimated effect of log(Exposures) on the count component of the HGP-P ee model is positive, with an associated effect size of ξ 2 = 0.46 (t-ratio = 36.68).Notably, this effect size is close to that of the ZIGP-P ee model (ξ 2 = 0.47).Furthermore, the functional parameter for the HGP-P ee model was estimated to be P = 1.66, which is very close to the value of P = 1.67 for the ZIGP-P ee model.We remark that none of the models in Equations ( 6)-( 10) "correctly" describe the underlying simulation model.Our analysis shows that models with observations with different exposures and are zero-inflated, including exposure in both the zero and the count parts as covariates (model ee), perform the best.

Model Fitting Results
In this section, we apply our proposed regression models to three datasets: the Malaysian Motor Insurance Data, the 1987/88 US National Medical Expenditure Survey data, and a French auto insurance dataset, freMTPL2freq, which is available in the R "CASdatasets" package.

Malaysian Motor Insurance Data
This dataset from Insurance Services Malaysia includes 1.01 million private car policies from ten Malaysian insurance companies in 2002.It includes information on exposures measured by the number of cars per year, claim counts for own damage and third party property damage, and four rating factors: vehicle year, vehicle make, vehicle cc, and location.The first three rating variables describe vehicle properties, whereas the last one (location) gives the location where the vehicle was operated.This dataset has been studied by Fuzi et al. (2016).As detailed therein, each of the four rating factors has five levels, amounting to 5 4 = 625 cross-classified rating classes.Excluding 73 rating classes with zero exposure, we used 552 rating classes in this study.The response variable is the number of own damage claims in this study.
We fitted the dataset to the HP, HGP-P, and HNB-P regression models.The zero part was fit using logistic regression, and the none-zero part by maximizing the likelihood using the "nlm" function in R.This separation of the estimation of zero and non-zero parts greatly simplifies the computation.
The parameter estimates and the absolute values of the t-ratio for the models are reported in Table 5.It is seen that the over-dispersion and functional parameters (a and P) in the non-zero parts of the GP-P and NB-P models are both significant.In addition, in all models, the coefficients ξ 1 and ξ 2 for the log exposures of the zero (logistic) and the non-zero parts, respectively, are significant.
For comparison purposes, we fitted the Poisson, GP-1, GP-2, GP-P, NB-1, NB-2, NB-P, and corresponding zero-inflated and hurdle models to this dataset.The LL, AIC, and BIC for these models are provided in Table 6.Based on AIC and BIC, the HGP and HNB models are obviously better than the HP model.Further, the HGP-P, HGP-1, and HNB-P models are the top three best models, followed by HNB-1.The best functional parameters in the HGP-P and HNB-P models are P = 1.09 and P = 1.12, respectively, which are close to 1.In particular, the HNB-P model has a much lower AIC/BIC than the HNB-2 model, confirming that it is more flexible than the latter, which is accessible in the "pscl" package in R.Moreover, the coefficient for log exposure in the count part is positive, and in the logistic part it is negative.They are both significant; this verifies our simulation results in Section 3.

The US National Medical Expenditure Survey Data
We now consider the US National Medical Expenditure Survey 1987/88 data studied by Deb and Trivedi (1997).This dataset contains a subsample of 4406 observations of individuals aged 66 and over who were covered by Medicare, a public insurance program.The dataset is available from the R package accompanying Kleiber and Zeileis (2008) and is also known as "DebTrivedi".The number of physician office visits (ofp), with a mean and variance of 5.77 and 45.69, respectively, is the response variable.We fitted the data to the HP, HGP-P, and HNB-P regression models.The parameter estimates and absolute value of t-ratios are provided in Table 7.Based on the Wald test, both over-dispersion and functional parameters (a and P) are significant.Table 8 presents the LL, AIC, and BIC for the Poisson, GP-1, GP-2, GP-P, NB-1, NB-2, NB-P, and their related zero-inflated and hurdle models.It also shows the results for some popular models used to fit the data, which include the constrained two-point finite mixture of negative binomials (CFMNB-2), the two-point finite mixture of negative binomials (FMNB-2), and the constrained three-point finite mixture of negative binomials (CFMNB-3) that were introduced by Deb and Trivedi (1997), as well as the two-point negative binomial mixture (NBM2) used by Park and Kim (2021).
Overall, the HGP-P and CFMNB-3 models, and the FMNB-2 model, which are based on NB-1 specifications, are among the preferred models according to AIC and BIC.

The freMTPL2freq Dataset
The freMTPL2freq dataset, which is included in the "CASdatasets" package, provides information on the number of claims and risk-related features for 677,991 third party motor liability policies.Table 9 provides a summary of the covariates that were included in the analysis.The mean and variance of the number of claims are reported as 0.0532 and 0.0577, respectively.Moreover, it was observed that 94.98% of observations have zero claims.
Table 9.The description of the coviates in the French dataset.

VehPower
The power of the car.

VehAge
The vehicle age in years

DriveAge
The driver age in years.

Log(density)
The log of the number of residents per square kilometer of the city where the car driver lives.

BonusMalus
Zero indicate a bonus, while one indicates a malus.

VehGas
The car's fuel equals zero for regular fuel and one for diesel.
Log(exposure) The log of the period of exposure for a policy in years.
Table 10 compares our models with several commonly used regression models.It shows that the ZIGP-P model exhibited the lowest AIC and BIC values, indicating its superiority in fitting the data.The ZINB-P and HGP-P models rank second and third, respectively.Furthermore, it is worth noting that the running time for the HGP-P model, thanks to its two-part model setting, is much shorter than the ZIGP-P and ZINB-P models.Table 11 displays the estimated coefficients and absolute t-ratios for four models: ZIGP-P, ZINB-P, HGP-P, and HNB-P.In all models, both the over-dispersion parameter a and functional parameters P are statistically significant.Further, we compared the AIC values for two situations in Table 12; the first column shows models that include exposure as an offset in the count part, and the second column shows those that include exposure as a covariate in the count and zero-inflation parts.The results indicate that the models with exposure included as a covariate in both parts have a lower AIC, suggesting that they fit the data better than those with exposure included only as an offset.
Likelihood ratio tests for various statistical models applied to the French dataset are presented in Table 13.

The Lasso Regression
In this section, we briefly study the variable selection problem associated with the HGP-P and HNB-P regression models discussed in the paper by using the US National Medical Expenditure Survey 1987/88 data.Variable selection is important because it may simplify the regression model as well as reduce the out-of-sample prediction error.
Lasso regression, introduced in Tibshirani (1996), has been proven to be an effective method for variable selection.Park and Hastie (2007) expanded Lasso regression to a generalized linear model to handle count data.Related to this paper's context, Tang et al. (2014) proposed an EM adaptive Lasso method to select risk factors (covariates) for an auto insurance claim dataset.Wang et al. (2015) employed it to address the issue of variable selection for a model with zero inflation and over-dispersion.
In this study, we apply a simplified version of the Lasso shrinkage method, which aims to maximize the penalized log likelihood function where λ ≥ 0 is the tuning parameter and β i are the parameters of interest.The intercept β 0 and the model parameters a and P are excluded from the penalty.
When λ increases, the estimates of the coefficient values deviate from maximum likelihood estimates, resulting in lower in-sample goodness-of-fit.However, the model is simplified, potentially improving the out-of-sample performance.
Since the LL of the logistic and truncated parts of hurdle models can be separated, we may perform Lasso regression separately for the two parts.Lasso regression for the logistic part can be executed in R utilizing the "glmnet" package.Lasso regression for the truncated

Table 1 .
Parameter estimates, t-ratios, and model fit measures for simulated data using the ZIP model under various exposure scenarios.Est.|t.ratio| Est.|t.ratio| Est.|t.ratio| Est.|t.ratio|

Table 2 .
Parameter estimates, t-ratios, and model fit measures for simulated data using the ZIGP-P model under various exposure scenarios.

Table 3 .
Comparing the model fitness of ZIGP-P ee , ZIGP-P e , ZIGP-P 11 , and ZIGP-P 1 based on the mean values of AIC and BIC over 100 simulated datasets created from the ZIP model.

Table 4 .
Parameter estimates, t-ratios, and model fit measures for simulated data using the HGP-P model under various exposure scenarios.

Table 5 .
Parameter estimates and absolute t-ratios for the Malaysian Motor Insurance Data.

Table 6 .
The number of parameters, LL, AIC, and BIC of different models for Malaysian Motor Insurance Data.

Table 7 .
Parameter estimates and absolute t-ratios for the US National Medical Expenditure Survey dataset.

Table 8 .
Number of parameters, LL, AIC, and BIC for different models for the US National Medical Expenditure Survey dataset.

Table 10 .
LL, AIC, BIC, and computational time (CT) for various statistical models applied to the French dataset.

Table 11 .
Parameter estimation and absolute t-ratio for ZIGP-P, ZINB-P, HGP-P, and HNB-P models for the French dataset.

Table 12 .
Comparison of AIC values of various models with exposure included as an offset in the count part and as a covariate based on the French dataset.

Table 13 .
Likelihood ratio tests for various statistical models applied to the French dataset.