Article

The Applications of Generalized Poisson Regression Models to Insurance Claim Data

1
School of Mathematical and Computational Sciences, University of Prince Edward Island, Charlottetown, PE C1A 4P3, Canada
2
Department of Statistical and Actuarial Sciences, Western University, London, ON N6A 5B7, Canada
*
Author to whom correspondence should be addressed.
Risks 2023, 11(12), 213; https://doi.org/10.3390/risks11120213
Submission received: 7 November 2023 / Revised: 26 November 2023 / Accepted: 4 December 2023 / Published: 7 December 2023
(This article belongs to the Special Issue Risks Journal: A Decade of Advancing Knowledge and Shaping the Future)

Abstract

Predictive modeling has been widely used for insurance rate making. In this paper, we focus on insurance claim count data and address their common issues with more flexible modeling techniques. In particular, we study the zero-inflated and hurdle-generalized Poisson and negative binomial distributions in a functional form for modeling insurance claim count data. It is shown that these models are useful in addressing the problem of excess zeros and over-dispersion of the claim count variable. In addition, we show that including the exposure as a covariate in both the zero and the count part of the model is an effective approach to incorporating exposure information in zero-inflated and hurdle models. We illustrate the effectiveness and versatility of the introduced models using three real datasets. The results suggest their promising applications in insurance risk classification and beyond.

1. Introduction

In a priori risk classification, actuaries group risks with similar risk characteristics in order to set insurance premiums. Accurate risk classification is extremely important for maintaining a financially sound and equitable system, assuring the availability of needed insurance coverage to the public.
The individual risk characteristics used in risk classification are called rating variables. For example, in automobile insurance, commonly used rating variables include geography, driver characteristics such as age, gender, and marital status, and vehicle characteristics such as the make and value of the vehicle insured.
Risk classification systems are generally based, whenever possible, on statistical analysis. Naturally, statistical methods such as generalized linear models and generalized additive models provide useful tools. Numerous books and papers discuss the application of statistical methods in insurance rate making, see, e.g., Renshaw (1994), Denuit et al. (2007), Frees (2009), Frees et al. (2014), and the references therein.
This paper studies claim frequency modeling. It is well known that the Poisson regression model is not always suitable because real-world claim frequency data usually exhibit over-dispersion. Alternative models have been proposed in the literature. Notably, negative binomial regression models were discussed by Dionne and Vanasse (1989), Frees and Valdez (2008), and Wüthrich and Merz (2008). Inverse Gaussian models were studied by Dean et al. (1989) and Wang et al. (2023). Consul (1993) compared the generalized Poisson (GP) distribution with several well-known distributions and concluded that the GP distribution is a plausible model for claim frequency data.
Insurance claim data usually have an excessive number of zeros. Zero-inflated models, studied by Lambert (1992), have been used to deal with such problems in the literature. For example, Yip and Yau (2005) applied several parametric zero-inflated count distributions, including zero-inflated Poisson (ZIP), zero-inflated generalized negative binomial, zero-inflated generalized Poisson, and zero-inflated double Poisson distributions, to accommodate the excess zeros in insurance claim count data. Famoye and Singh (2006) applied the zero-inflated generalized Poisson regression model to fit a domestic violence dataset. Czado et al. (2007) extended the zero-inflated generalized Poisson regression model by including explanatory regression parameters in both the zero-inflation and the dispersion parameters and applied the extended model to patent outsourcing rate data.
The hurdle model, which was introduced by Cragg (1971) and later refined by Mullahy (1986), can also be applied to model data with an excessive number of zeros. For instance, Saffari et al. (2013) and Zuo et al. (2021) studied the hurdle-generalized Poisson distribution, whereas Bhaktha (2018) employed the hurdle negative binomial approach. Additionally, using an insurance claim number dataset, Boucher et al. (2007) compared various zero-inflated and hurdle models.
Another issue with insurance claim datasets lies in the fact that different observations may have different risk exposures, but only the total number of claims for all exposures is recorded. For example, some policyholders stay longer in the policy than others. An “offset” term is often utilized to account for the varying exposure scale. In the case of the log link function, this is equivalent to including the log of exposure as an explanatory variable with a fixed coefficient of one (Agresti 2015). For zero-inflated and hurdle models, the offset is usually only included in the count part of the models, see, e.g., Lee et al. (2001), Loquiha et al. (2013), Zhen et al. (2018), and Dai et al. (2018). However, as pointed out by Feng (2022), varying exposure can also influence the probability of observing excessive zeros.
The paper’s main contributions are as follows. First, we delineate several forms of hurdle-generalized Poisson (HGP) and hurdle-generalized negative binomial (HNB) regression models. It is shown that these models are useful in addressing the problem of excess zeros and over-dispersion of claim count datasets. Second, through a detailed analysis, we show that including exposure in both the zero and the count parts as a covariate is an effective approach to incorporating exposure information into zero-inflated and hurdle models. Lastly, from a practical point of view, we illustrate the effectiveness and versatility of the introduced models using real datasets and compare the results with other commonly used models.
We organize the rest of the paper as follows. Section 2 provides the mathematical background, specifically highlighting several forms of HGP and HNB regression models. Section 3 studies how to include exposure in zero-inflated and hurdle regression models. Section 4 presents real-world applications, analyzing various models using data from a Malaysian auto insurance dataset, the US National Medical Expenditure Survey, and a French auto insurance dataset. Section 5 explores the variable selection problem in the HGP and HNB models by applying the Lasso shrinkage methodology. Section 6 concludes the paper.

2. Mathematical Models

In this section, we first provide the mathematical background of generalized Poisson and generalized negative binomial models and then introduce their hurdle functional forms.

2.1. Various Forms of Generalized Poisson and Generalized Negative Binomial Random Variables

From a probability point of view, the GP distribution was introduced by Consul and Jain (1973) as a limiting form of a generalized negative binomial distribution. Consul and Shoukri (1988) showed that a GP distribution can be viewed as the distribution of the number of customers served in a busy period of a queue with Poisson arrivals and a constant service time. The GP distribution can also be considered as the distribution of the total progeny in a Galton–Watson branching process, where both the initial number of individuals and the number of offspring an individual produces follow a Poisson distribution. From a statistical point of view, the GP distribution and its related distributions are flexible and can be used to model over-dispersed or under-dispersed data.
The GP distribution has been applied in actuarial science. For instance, Gerber (1990) showed that the number of jumps it takes for a classical Poisson risk process with a constant claim size to reach a certain level follows a GP distribution. Consul (1993) compared the GP distribution with several well-known distributions and concluded that the GP distribution is a plausible model for claim frequency data. Calderín-Ojeda et al. (2019) proposed a special GP distribution and tested the performance of their GP regression model using French Motor Personal Line datasets, which are available in the R package “CASdatasets”. Scollnik (1995) presented a Bayesian analysis of the GP distribution using two datasets; one was the number of injuries in automobile accidents, and the other was the ship damage incident data from Lloyd’s Register of Shipping.
Different forms of GP random variables have been proposed in the literature. The classical GP-1 distribution has a probability mass function (pmf) of
$$g_1(y_i) = P(Y_i = y_i \mid \mu_i, a) = \frac{\mu_i \left(\mu_i + a y_i\right)^{y_i - 1} (1 + a)^{-y_i}}{y_i!} \exp\left(-\frac{\mu_i + a y_i}{1 + a}\right), \quad y_i = 0, 1, 2, \ldots,$$
where $\mu_i$ is the mean parameter and $a$ is the dispersion parameter. The variance of GP-1 is $\mu_i (1 + a)^2$. Thus, $a > 0$ implies over-dispersion, while $a < 0$ implies under-dispersion. When $a = 0$, GP-1 reduces to a Poisson distribution.
A slightly different parameterization gives the so-called GP-2 distribution with the pmf
$$g_2(y_i) = P(Y_i = y_i \mid \mu_i, a) = \frac{\mu_i \left(\mu_i + a \mu_i y_i\right)^{y_i - 1} (1 + a \mu_i)^{-y_i}}{y_i!} \exp\left(-\frac{\mu_i + a \mu_i y_i}{1 + a \mu_i}\right), \quad y_i = 0, 1, 2, \ldots.$$
The mean and variance of the GP-2 distribution are $\mu_i$ and $\mu_i (1 + a \mu_i)^2$, respectively. While the GP-1 distribution has a linear mean–variance relationship, the GP-2 distribution has a cubic mean–variance relationship. The applications of the GP-2 distribution have been discussed in, e.g., Wang and Famoye (1997) and Ismail and Jemain (2007).
Another parameterization of the GP distribution, GP-P, which was studied in, e.g., Zamani and Ismail (2012), has the pmf
$$g_P(y_i) = P(Y_i = y_i \mid \mu_i, a, P) = \frac{\mu_i \left(\mu_i + a \mu_i^{P-1} y_i\right)^{y_i - 1} \left(1 + a \mu_i^{P-1}\right)^{-y_i}}{y_i!} \exp\left(-\frac{\mu_i + a \mu_i^{P-1} y_i}{1 + a \mu_i^{P-1}}\right), \quad y_i = 0, 1, 2, \ldots. \tag{1}$$
A GP-P random variable $Y_i$ has mean $E(Y_i) = \mu_i$ and variance $\mathrm{Var}(Y_i) = \mu_i \left(1 + a \mu_i^{P-1}\right)^2$. The additional parameter, $P$, provides more flexibility in modeling the variance function. The model reduces to the GP-1 and GP-2 regressions when $P = 1$ and $P = 2$, respectively.
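The stated mean and variance of the GP-P distribution can be checked numerically. The following is a minimal Python sketch, assuming $a \ge 0$ (so that the support is unbounded) and using illustrative parameter values that are not taken from the paper; the function names `gp_p_logpmf` and `gp_p_summary` are hypothetical helpers introduced here.

```python
import math

def gp_p_logpmf(y, mu, a, P):
    """Log pmf of GP-P at count y, for mean mu > 0 and dispersion a >= 0."""
    c = a * mu ** (P - 1)            # the recurring term a * mu^(P-1)
    return (math.log(mu)
            + (y - 1) * math.log(mu + c * y)
            - y * math.log(1.0 + c)
            - math.lgamma(y + 1)     # log(y!)
            - (mu + c * y) / (1.0 + c))

def gp_p_summary(mu, a, P, y_max=500):
    """Total probability, mean and variance from a sum truncated at y_max."""
    probs = [math.exp(gp_p_logpmf(y, mu, a, P)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

mu, a, P = 2.0, 0.3, 1.5             # illustrative values only
total, mean, var = gp_p_summary(mu, a, P)
theoretical_var = mu * (1.0 + a * mu ** (P - 1)) ** 2
print(total, mean, var, theoretical_var)
```

Setting `P = 1` or `P = 2` in the same function gives the GP-1 and GP-2 pmfs, which is a convenient way to compare the three variance functions on a common grid of means.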
The generalized negative binomial (NB-P) distribution, which was introduced in Greene (2008) and discussed in Cameron and Trivedi (2013), Hilbe (2011) and Ismail and Zamani (2013), has a parameter set $(a, \mu_i, P)$ and the pmf
$$h_P(y_i) = P(Y_i = y_i \mid \mu_i, a, P) = \frac{\Gamma\left(y_i + a^{-1}\mu_i^{2-P}\right)}{y_i!\, \Gamma\left(a^{-1}\mu_i^{2-P}\right)} \times \left(\frac{a^{-1}\mu_i^{2-P}}{a^{-1}\mu_i^{2-P} + \mu_i}\right)^{a^{-1}\mu_i^{2-P}} \left(\frac{\mu_i}{a^{-1}\mu_i^{2-P} + \mu_i}\right)^{y_i}, \quad y_i = 0, 1, \ldots. \tag{2}$$
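The NB-P pmf is a negative binomial pmf whose size parameter $a^{-1}\mu_i^{2-P}$ depends on the mean, which gives mean $\mu_i$ and variance $\mu_i + a\mu_i^{P}$. A minimal Python sketch of the pmf, with illustrative parameter values and a hypothetical function name `nb_p_logpmf`, is given below; it checks these two moments by a truncated sum.

```python
import math

def nb_p_logpmf(y, mu, a, P):
    """Log pmf of NB-P at count y, for mean mu > 0 and dispersion a > 0."""
    b = mu ** (2 - P) / a            # size parameter a^{-1} * mu^(2-P)
    return (math.lgamma(y + b) - math.lgamma(y + 1) - math.lgamma(b)
            + b * math.log(b / (b + mu))
            + y * math.log(mu / (b + mu)))

mu, a, P = 2.0, 0.5, 1.5             # illustrative values only
probs = [math.exp(nb_p_logpmf(y, mu, a, P)) for y in range(400)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(total, mean, var, mu + a * mu ** P)
```

With `P = 1` the variance is linear in the mean (NB-1), and with `P = 2` it is quadratic (NB-2).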

2.2. Hurdle Functional Form of the Generalized Poisson Regression Model

A hurdle model involves the application of two different models to analyze data that fall either above or below a specific threshold, which is typically set at zero. Therefore, it is sometimes called a two-part model. Following Mullahy (1986), the distribution of the claim counts according to a hurdle model is given by
$$P(Y_i = y_i) = \begin{cases} f_1(0), & y_i = 0, \\[4pt] \dfrac{1 - f_1(0)}{1 - f_2(0)}\, f_2(y_i) := \Phi f_2(y_i), & y_i = 1, 2, \ldots, \end{cases}$$
where $f_1$ and $f_2$ are two probability functions that describe the distribution of the zero and non-zero parts of $Y_i$. In insurance applications, the quantity $\Phi$ can be interpreted as the probability of reporting at least one claim. As argued in Boucher et al. (2007), in auto insurance, policyholders’ behavior may change after a claim has been made; therefore, it is natural to apply hurdle models to describe the two parts (zero claim and non-zero claims) of the claim process. An advantage of the hurdle model is that the parameters for each part can be estimated separately.
In what follows, we assume that $f_1$ is Bernoulli-distributed. Then, the hurdle functional form of a generalized Poisson (HGP-P) regression model is given as
$$P(Y_i = y_i) = \begin{cases} \omega_i, & y_i = 0, \\[4pt] (1 - \omega_i)\, \dfrac{g_P(y_i)}{1 - g_P(0)}, & y_i = 1, 2, 3, \ldots, \end{cases}$$
where $g_P(y_i)$ is defined in Equation (1). Note that the term $g_P(y_i) / (1 - g_P(0))$ is usually referred to as the zero-truncated GP distribution. In addition, we assume that $\mu_i$ is related to covariates $x_i$ by a log link function
$$\log \mu_i = x_i^T \beta, \tag{3}$$
where $\beta$ is the vector of regression parameters, and $\omega_i$ is related to covariates $z_i$ by a logit link function
$$\log \frac{\omega_i}{1 - \omega_i} = z_i^T \gamma. \tag{4}$$
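To make the construction concrete, the HGP-P probabilities can be assembled from the two link functions and the zero-truncated GP-P pmf. The sketch below is illustrative only: the covariate vectors, the coefficient values, and the function names `gp_p_pmf` and `hgp_p_pmf` are assumptions introduced for the example, not quantities from the paper.

```python
import math

def gp_p_pmf(y, mu, a, P):
    """GP-P pmf of Equation (1), for mu > 0 and a >= 0."""
    c = a * mu ** (P - 1)
    log_p = (math.log(mu) + (y - 1) * math.log(mu + c * y)
             - y * math.log(1.0 + c) - math.lgamma(y + 1)
             - (mu + c * y) / (1.0 + c))
    return math.exp(log_p)

def hgp_p_pmf(y, x, z, beta, gamma, a, P):
    """HGP-P probability of count y given covariate vectors x and z."""
    mu = math.exp(sum(b * v for b, v in zip(beta, x)))    # log link
    eta = sum(g * v for g, v in zip(gamma, z))
    omega = 1.0 / (1.0 + math.exp(-eta))                  # logit link
    if y == 0:
        return omega
    # positive counts follow the zero-truncated GP-P, scaled by 1 - omega
    return (1.0 - omega) * gp_p_pmf(y, mu, a, P) / (1.0 - gp_p_pmf(0, mu, a, P))

# Hypothetical inputs: an intercept plus one covariate in each part.
x = z = (1.0, 0.4)
beta, gamma = (0.2, 0.5), (0.8, -0.3)
a, P = 0.3, 1.5
total = sum(hgp_p_pmf(y, x, z, beta, gamma, a, P) for y in range(400))
print(hgp_p_pmf(0, x, z, beta, gamma, a, P), total)
```

Because the zero probability is $\omega_i$ regardless of the count distribution, the same wrapper works for any count pmf that is substituted for `gp_p_pmf`.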
The HGP-P model reduces to the HGP-1 and HGP-2 models when $P = 1$ and $P = 2$, respectively. Therefore, the likelihood ratio test (LRT) can be applied for testing the HGP-1 model (or HGP-2 model) against the HGP-P model.
The loglikelihood function for the HGP-P regression model is given by
$$\log L(\gamma, \beta, a, P) = \log L_1(\gamma) + \log L_2(\beta, a, P),$$
where
$$\log L_1(\gamma) = \sum_{i=1}^{n} \left[ I_{y_i = 0} \log \omega_i + \left(1 - I_{y_i = 0}\right) \log\left(1 - \omega_i\right) \right],$$
and
$$\log L_2(\beta, a, P) = \sum_{i=1}^{n} \left(1 - I_{y_i = 0}\right) \Big\{ -\log\big(1 - g_P(0)\big) + (y_i - 1) \log\left(\mu_i + a \mu_i^{P-1} y_i\right) + \log \mu_i - y_i \log\left(1 + a \mu_i^{P-1}\right) - \log y_i! - A_i \Big\},$$
with $A_i = \dfrac{\mu_i + a \mu_i^{P-1} y_i}{1 + a \mu_i^{P-1}}$ and $g_P(0) = \exp\left\{-\mu_i / \left(1 + a \mu_i^{P-1}\right)\right\}$, the probability of a zero under Equation (1). Note that the regression parameters $\beta$ and $\gamma$ are included in the loglikelihood function through the link functions for $\mu_i$ and $\omega_i$.
The two components of the loglikelihood function, $\log L_1(\gamma)$ and $\log L_2(\beta, a, P)$, can be maximized separately. In particular, the parameter $\gamma$ can be estimated using a simple logistic regression. The system of estimating equations for $\beta$ is obtained by setting the partial derivatives of $\log L_2(\beta, a, P)$ to zero. Since these equations have no closed-form solution, the Newton–Raphson method is applied to solve them. The standard errors of the parameter estimates are given by the square roots of the diagonal elements of the inverse of the negative Hessian matrix of the loglikelihood. The estimated parameters from the truncated Poisson fit are used as starting values for faster convergence.
We note that the two-part structure of the hurdle model greatly simplifies the optimization procedure.
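The separation of the loglikelihood into a binary part and a truncated-count part can be illustrated with a short Python sketch. The data, the fitted values `omegas` and `mus`, and the function names are hypothetical; the sketch evaluates the two components for given parameter values and does not perform the maximization.

```python
import math

def gp_p_logpmf(y, mu, a, P):
    """Log pmf of GP-P (Equation (1)), for mu > 0 and a >= 0."""
    c = a * mu ** (P - 1)
    return (math.log(mu) + (y - 1) * math.log(mu + c * y)
            - y * math.log(1.0 + c) - math.lgamma(y + 1)
            - (mu + c * y) / (1.0 + c))

def log_l1(ys, omegas):
    """Binary part: depends on the data only through the zero indicator."""
    return sum(math.log(w) if y == 0 else math.log(1.0 - w)
               for y, w in zip(ys, omegas))

def log_l2(ys, mus, a, P):
    """Zero-truncated GP-P part: only positive counts contribute."""
    total = 0.0
    for y, mu in zip(ys, mus):
        if y > 0:
            g0 = math.exp(gp_p_logpmf(0, mu, a, P))
            total += gp_p_logpmf(y, mu, a, P) - math.log(1.0 - g0)
    return total

# Toy inputs, chosen only to exercise the functions.
ys = [0, 0, 1, 3, 0, 2]
omegas = [0.6, 0.5, 0.4, 0.3, 0.7, 0.5]
mus = [1.0, 1.5, 2.0, 2.5, 0.8, 1.2]
a, P = 0.3, 1.5
joint = log_l1(ys, omegas) + log_l2(ys, mus, a, P)
print(log_l1(ys, omegas), log_l2(ys, mus, a, P), joint)
```

Since `log_l1` involves only $\gamma$ (through `omegas`) and `log_l2` involves only $(\beta, a, P)$, each function can be handed to a separate optimizer.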

2.3. Hurdle Functional Form of the Generalized Negative Binomial Regression Model

The hurdle functional form of the generalized negative binomial (HNB-P) regression model is defined as
$$P(Y_i = y_i) = \begin{cases} \omega_i, & y_i = 0, \\[4pt] (1 - \omega_i)\, \dfrac{h_P(y_i)}{1 - h_P(0)}, & y_i = 1, 2, 3, \ldots, \end{cases}$$
where $h_P(\cdot)$ is the NB-P pmf defined in Equation (2), $\omega_i$ is related to covariates $z_i$ with a logit link function (4), and $\mu_i$ is related to covariates $x_i$ via a log link function (3).
The loglikelihood function for the HNB-P regression model is given by
$$\log L(\gamma, \beta, a, P) = \log L_1(\gamma) + \log L_2(\beta, a, P),$$
where
$$\log L_1(\gamma) = \sum_{i=1}^{n} \left[ I_{y_i = 0} \log \omega_i + \left(1 - I_{y_i = 0}\right) \log\left(1 - \omega_i\right) \right],$$
and
$$\log L_2(\beta, a, P) = \sum_{i=1}^{n} \left(1 - I_{y_i = 0}\right) \Big\{ B_i \log B_i - y_i \log\left(B_i + \mu_i\right) - B_i \log\left(B_i + \mu_i\right) + \sum_{j=0}^{y_i - 1} \log\left(B_i + j\right) + y_i \log \mu_i - \log\big(1 - h_P(0)\big) \Big\},$$
with $B_i = a^{-1} \mu_i^{2-P}$. The estimation of the regression parameters for HNB-P is similar to that for the HGP-P model.
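The sum $\sum_{j=0}^{y_i-1}\log(B_i + j)$ equals $\log\Gamma(y_i + B_i) - \log\Gamma(B_i)$, which links this loglikelihood to the NB-P pmf in Equation (2). The following Python sketch evaluates the contribution of a single positive count; it is an illustration under assumed parameter values, the function name is hypothetical, and it keeps the $-\log y_i!$ term from Equation (2), which does not depend on the parameters and therefore does not affect the maximizer.

```python
import math

def hnb_p_loglik_term(y, mu, a, P):
    """Log of the zero-truncated NB-P probability of one count y >= 1."""
    b = mu ** (2 - P) / a                                  # B_i
    log_h0 = b * math.log(b / (b + mu))                    # log h_P(0)
    log_rising = sum(math.log(b + j) for j in range(y))    # sum_j log(B_i + j)
    log_h = (log_rising - math.lgamma(y + 1)
             + b * math.log(b) - (b + y) * math.log(b + mu)
             + y * math.log(mu))
    return log_h - math.log(1.0 - math.exp(log_h0))

mu, a, P = 2.0, 0.5, 1.5                                   # illustrative values
truncated_total = sum(math.exp(hnb_p_loglik_term(y, mu, a, P))
                      for y in range(1, 300))
print(truncated_total)
```

For large counts, replacing the explicit sum with `math.lgamma(y + b) - math.lgamma(b)` gives the same value at lower cost.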

3. Incorporating Exposure in Zero-Inflated and Hurdle Regression Models

In many insurance loss datasets, different policyholders (observations) may have different risk exposures, yet only the total number of claims is reported. For example, a dataset could report the total number of claims made by a policyholder during the whole policy period, but different policyholders may stay in the policy for different periods of time. Including an offset term in the regression is a common way to account for the size of the population at risk or the length of the exposure period. In particular, if a log link function is used, the model can be defined as
$$\log \mu_i = x_i^T \beta + \log E_i,$$
or equivalently $\mu_i = E_i e^{x_i^T \beta}$, where $E_i$ is the exposure for policyholder $i$. This approach of considering exposure makes sense because, intuitively, the mean number of events should be proportional to the size of the exposure.
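As a hypothetical numerical illustration (the value $x_i^T \beta = -2$ is an assumption chosen for the arithmetic, not an estimate from the paper), a policyholder observed for half a year and an otherwise identical policyholder observed for a full year have expected claim counts
$$\mu_i = 0.5\, e^{-2} \approx 0.068 \quad \text{and} \quad \mu_i = 1 \cdot e^{-2} \approx 0.135,$$
so doubling the exposure doubles the expected count, while the claim rate per unit of exposure, $e^{-2}$, is unchanged.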
For zero-inflated and hurdle models, the offset is usually only included in the count part of the models, see, e.g., Lee et al. (2001), Loquiha et al. (2013), Zhen et al. (2018), Dai et al. (2018). However, as pointed out by Feng (2022), the probability of observing excessive zeros can also be impacted by exposure in many situations. One might directly impose exposure in the zero-inflated part of the model in the same way as in the count model. For example, if the logit model is used for the zero part, we might write
$$\operatorname{logit} \omega_i = \log \frac{\omega_i}{1 - \omega_i} = z_i^T \gamma + \log E_i.$$
However, this may not be plausible because it indicates that the probability of zero inflation $\omega_i$ increases with the exposure size, which is counter-intuitive. Feng (2022) then proposed the model
$$\operatorname{logit} \omega_i = z_i^T \gamma + \xi_1 \log E_i, \qquad \log \mu_i = x_i^T \beta + \xi_2 \log E_i, \tag{5}$$
where $\xi_1$ and $\xi_2$ are the regression coefficients for the logarithm-transformed $E_i$. Model (5) allows risk exposures to be included in the analysis as a regular covariate in both the binary and count parts of the zero-inflated and hurdle models. We next provide a simulation study to illustrate the benefit of such a method.
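A small Python sketch may help to see how Model (5) behaves as exposure changes. The linear predictors and the values of $\xi_1$ and $\xi_2$ below are hypothetical, and `exposure_model` is a helper name introduced for this illustration. Note that the usual offset specification is the special case $\xi_1 = 0$, $\xi_2 = 1$.

```python
import math

def exposure_model(z_gamma, x_beta, exposure, xi1, xi2):
    """Return (omega, mu) under Model (5) for given linear predictors."""
    log_e = math.log(exposure)
    omega = 1.0 / (1.0 + math.exp(-(z_gamma + xi1 * log_e)))   # logit link
    mu = math.exp(x_beta + xi2 * log_e)                        # log link
    return omega, mu

# Hypothetical values: a negative xi1 lowers the zero probability as
# exposure grows, and a positive xi2 raises the expected count.
for e in (0.5, 1.0, 2.0):
    omega, mu = exposure_model(z_gamma=1.0, x_beta=-2.0, exposure=e,
                               xi1=-1.5, xi2=0.5)
    print(e, round(omega, 4), round(mu, 4))

# Offset special case: xi1 = 0 and xi2 = 1 make mu proportional to exposure.
_, mu_half = exposure_model(1.0, -2.0, 0.5, xi1=0.0, xi2=1.0)
_, mu_full = exposure_model(1.0, -2.0, 1.0, xi1=0.0, xi2=1.0)
```

Treating $\xi_1$ and $\xi_2$ as free coefficients lets the data decide both the direction and the strength of the exposure effect in each part.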

A Simulation Study

In this subsection, we implement a simulation study to compare several approaches to incorporate risk exposure in zero-inflation models.
We generate 100 observations as follows. Each observation $i$ is associated with an exposure size $E_i$, which is uniformly distributed on the integers from one to ten. The number of events, $N_i$, for the $i$th observation is then the sum of $E_i$ independent and identically distributed ZIP random variables $Y_j^{(i)}$ with parameters $(\omega_i, \mu_i)$, where $\omega_i$ is the zero-inflation probability and $\mu_i$ is the Poisson count mean. That is,
$$N_i = \sum_{j=1}^{E_i} Y_j^{(i)}.$$
Furthermore, assume that there are two covariates: $x_{1,i}$, which can take values 0 or 1, and $x_{2,i}$, which is a realization of a normal$(1, 1)$ random variable. The distribution parameters are related to the covariates by:
$$\operatorname{logit}(\omega_i) = \gamma_0 + \gamma_1 x_{1,i} + \gamma_2 x_{2,i},$$
with $\gamma_0 = \gamma_1 = \gamma_2 = 1$, and
$$\log(\mu_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i},$$
with $\beta_0 = \beta_1 = \beta_2 = 0.5$.
The mean and standard deviation of the simulated number of events are 18.4 and 32.02, respectively, and 34% of the simulated counts are zero.
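The data-generating mechanism can be sketched in Python as follows. Several details are assumptions made for this illustration: $x_{1,i}$ is drawn as 0 or 1 with equal probability (the text does not state the probability), the exposure is an integer from one to ten, the coefficient values are used as printed in the text, and the Poisson sampler is Knuth's simple multiplication method. No attempt is made to reproduce the reported summary statistics.

```python
import math
import random

def rpois(rng, mu):
    """Poisson draw by Knuth's multiplication method (fine for moderate mu)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n, rng, gamma=(1.0, 1.0, 1.0), beta=(0.5, 0.5, 0.5)):
    """Return rows (E_i, x1_i, x2_i, N_i), where N_i sums E_i ZIP draws."""
    rows = []
    for _ in range(n):
        e = rng.randint(1, 10)            # exposure, assumed integer-valued
        x1 = rng.randint(0, 1)            # assumed Bernoulli(0.5)
        x2 = rng.gauss(1.0, 1.0)
        eta = gamma[0] + gamma[1] * x1 + gamma[2] * x2
        omega = 1.0 / (1.0 + math.exp(-eta))
        mu = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2)
        # each ZIP draw is a structural zero with probability omega
        n_i = sum(0 if rng.random() < omega else rpois(rng, mu)
                  for _ in range(e))
        rows.append((e, x1, x2, n_i))
    return rows

rows = simulate(100, random.Random(2023))
counts = [r[3] for r in rows]
print(sum(counts) / len(counts), sum(c == 0 for c in counts) / len(counts))
```

Because $N_i$ is a sum of $E_i$ ZIP variables, a larger exposure makes an all-zero outcome less likely and raises the expected count, which is the pattern the fitted models below need to capture.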
We next fit the simulated data to ZIP and ZIGP-P regression models that handle the exposures differently, as described in Equations (6)–(10).
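$$\text{ZIP}: \quad \operatorname{logit} \omega_i = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2, \qquad \log \mu_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \tag{6}$$
$$\text{ZIP}_{ee}: \quad \operatorname{logit} \omega_i = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \xi_1 \log E_i, \qquad \log \mu_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \xi_2 \log E_i, \tag{7}$$
$$\text{ZIP}_{e}: \quad \operatorname{logit} \omega_i = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2, \qquad \log \mu_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \xi_2 \log E_i, \tag{8}$$
$$\text{ZIP}_{11}: \quad \operatorname{logit} \omega_i = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \log E_i, \qquad \log \mu_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \log E_i, \tag{9}$$
$$\text{ZIP}_{1}: \quad \operatorname{logit} \omega_i = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2, \qquad \log \mu_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \log E_i. \tag{10}$$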
The parameter estimates, including absolute t-ratios, log likelihood (LL), Akaike information criterion (AIC), and Bayesian information criterion (BIC), for the ZIP model are presented in Table 1, and those for the ZIGP-P model are shown in Table 2.
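For reference, the two information criteria used for model comparison are $\mathrm{AIC} = 2k - 2\,\mathrm{LL}$ and $\mathrm{BIC} = k \log n - 2\,\mathrm{LL}$, where $k$ is the number of estimated parameters and $n$ is the number of observations. The short Python sketch below computes both; the loglikelihood values and parameter counts in the example are hypothetical and are not taken from Table 1 or Table 2.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for a model with k parameters."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2.0 * loglik

# Hypothetical comparison of two fits on the same n = 100 observations.
fits = {"model_A": (-100.0, 6), "model_B": (-97.5, 8)}
for name, (ll, k) in fits.items():
    print(name, aic(ll, k), round(bic(ll, k, 100), 2))
```

Lower values are preferred under both criteria; BIC penalizes additional parameters more heavily than AIC whenever $\log n > 2$.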
Table 1 shows that the ZIP$_{ee}$ model has the lowest AIC and BIC values. The worst model is ZIP$_{11}$, which includes an offset term in both the binary component and the count component. Notice that the parameter value $\xi_1$ for the binary part is negative, expressing the fact that when the exposure increases, one should expect a smaller value for the zero-inflation parameter $\omega_i$; on the other hand, the value of parameter $\xi_2$ for the count part is positive, expressing the fact that when the exposure increases, one should expect a greater value for the expected count $\mu_i$. This finding highlights the importance of having exposure in both the binary and count parts of the model.
Notice that in the true model, the distribution of $N_i$ is no longer ZIP; it is rather the distribution of a sum of a random number of ZIP random variables. Therefore, there is no reason that one has to fit the data with exposure with a ZIP model.
Table 2 shows that the ZIGP-P$_{ee}$ model fits the data better than the competing models based on the AIC and BIC criteria. In addition, comparing Table 1 and Table 2, we see that the ZIGP-P models perform better than the ZIP models. This is because, as discussed above, the distribution of the $N_i$ is no longer ZIP. The ZIGP-P model, which includes two additional parameters compared to the ZIP model, presents a more flexible option for fitting the data.
Table 3 presents the average AIC and BIC values obtained from analyzing 100 simulated datasets for model comparison when the sample size is 1000, 5000, and 10,000. The results demonstrate that the ZIGP-P$_{ee}$ model performs better than the other models in all scenarios, consistently producing the smallest AIC and BIC values. These findings indicate that the ZIGP-P$_{ee}$ model is a robust and reliable model for analyzing the simulated datasets.
Table 4 shows the results of fitting the HGP-P model to the simulated data in this section under different treatments of exposure. According to the AIC and BIC, the HGP-P$_{ee}$ model outperforms the other competing models, which is consistent with previous findings. This observation emphasizes the importance of including exposure (log(Exposures)) in the binary part of the HGP-P model. The estimated effect of log(Exposures) on the binary component of the HGP-P$_{ee}$ model, $\xi_1 = -1.58$ (t-ratio = 40.38), reflects the negative association between exposure and the probability of observing an excess zero count. This finding is consistent with the ZIGP-P$_{ee}$ model’s estimate of $\xi_1$. In contrast, the estimated effect of log(Exposures) on the count component of the HGP-P$_{ee}$ model is positive, with an associated effect size of $\xi_2 = 0.46$ (t-ratio = 36.68). Notably, this effect size is close to that of the ZIGP-P$_{ee}$ model ($\xi_2 = 0.47$). Furthermore, the functional parameter for the HGP-P$_{ee}$ model was estimated to be $P = 1.66$, which is very close to the value of $P = 1.67$ for the ZIGP-P$_{ee}$ model.
We remark that none of the models in Equations (6)–(10) “correctly” describe the underlying simulation model. Our analysis shows that, when observations have different exposures and the counts are zero-inflated, models that include exposure as a covariate in both the zero and the count parts (the $ee$ models) perform the best.

4. Model Fitting Results

In this section, we apply our proposed regression models to three datasets: the Malaysian Motor Insurance Data, the 1987/88 US National Medical Expenditure Survey data, and a French auto insurance dataset, freMTPL2freq, which is available in the R “CASdatasets” package.

4.1. Malaysian Motor Insurance Data

This dataset from Insurance Services Malaysia includes 1.01 million private car policies from ten Malaysian insurance companies in 2002. It includes information on exposures measured by the number of cars per year, claim counts for own damage and third party property damage, and four rating factors: vehicle year, vehicle make, vehicle cc, and location. The first three rating variables describe vehicle properties, whereas the last one (location) gives the location where the vehicle was operated. This dataset has been studied by Fuzi et al. (2016). As detailed therein, each of the four rating factors has five levels, amounting to $5^4 = 625$ cross-classified rating classes. Excluding 73 rating classes with zero exposure, we used 552 rating classes in this study. The response variable is the number of own damage claims.
We fitted the dataset to the HP, HGP-P, and HNB-P regression models. The zero part was fitted using logistic regression, and the non-zero part was fitted by maximizing the likelihood using the “nlm” function in R. This separation of the estimation of the zero and non-zero parts greatly simplifies the computation.
The parameter estimates and the absolute values of the t-ratios for the models are reported in Table 5. It is seen that the over-dispersion and functional parameters ($a$ and $P$) in the non-zero parts of the GP-P and NB-P models are both significant. In addition, in all models, the coefficients $\xi_1$ and $\xi_2$ for the log exposures of the zero (logistic) and the non-zero parts, respectively, are significant.
For comparison purposes, we fitted the Poisson, GP-1, GP-2, GP-P, NB-1, NB-2, NB-P, and corresponding zero-inflated and hurdle models to this dataset. The LL, AIC, and BIC for these models are provided in Table 6.
Based on AIC and BIC, the HGP and HNB models are clearly better than the HP model. Further, the HGP-P, HGP-1, and HNB-P models are the top three models, followed by HNB-1. The estimated functional parameters in the HGP-P and HNB-P models are $P = 1.09$ and $P = 1.12$, respectively, which are close to 1. In particular, the HNB-P model has a much lower AIC/BIC than the HNB-2 model, confirming that it is more flexible than the latter, which is accessible in the “pscl” package in R.
Moreover, the coefficient for log exposure in the count part is positive, and in the logistic part it is negative. They are both significant; this verifies our simulation results in Section 3.

4.2. The US National Medical Expenditure Survey Data

We now consider the US National Medical Expenditure Survey 1987/88 data studied by Deb and Trivedi (1997). This dataset contains a subsample of 4406 observations of individuals aged 66 and over who were covered by Medicare, a public insurance program. The dataset is available from the R package accompanying Kleiber and Zeileis (2008) and is also known as “DebTrivedi”. The number of physician office visits (ofp), with a mean and variance of 5.77 and 45.69, respectively, is the response variable. We fitted the data to the HP, HGP-P, and HNB-P regression models. The parameter estimates and absolute value of t-ratios are provided in Table 7. Based on the Wald test, both over-dispersion and functional parameters (a and P) are significant.
Table 8 presents the LL, AIC, and BIC for the Poisson, GP-1, GP-2, GP-P, NB-1, NB-2, NB-P, and their related zero-inflated and hurdle models. It also shows the results for some popular models used to fit the data, which include the constrained two-point finite mixture of negative binomials (CFMNB-2), the two-point finite mixture of negative binomials (FMNB-2), and the constrained three-point finite mixture of negative binomials (CFMNB-3) that were introduced by Deb and Trivedi (1997), as well as the two-point negative binomial mixture (NBM2) used by Park and Kim (2021).
Overall, the HGP-P and CFMNB-3 models, and the FMNB-2 model, which are based on NB-1 specifications, are among the preferred models according to AIC and BIC.

4.3. The freMTPL2freq Dataset

The freMTPL2freq dataset, which is included in the “CASdatasets” package, provides information on the number of claims and risk-related features for 677,991 third party motor liability policies. Table 9 provides a summary of the covariates that were included in the analysis. The mean and variance of the number of claims are 0.0532 and 0.0577, respectively. Moreover, 94.98% of observations have zero claims.
Table 10 compares our models with several commonly used regression models. It shows that the ZIGP-P model has the lowest AIC and BIC values, indicating that it fits the data best. The ZINB-P and HGP-P models rank second and third, respectively. Furthermore, it is worth noting that the running time for the HGP-P model, thanks to its two-part model setting, is much shorter than that of the ZIGP-P and ZINB-P models.
Table 11 displays the estimated coefficients and absolute t-ratios for four models: ZIGP-P, ZINB-P, HGP-P, and HNB-P. In all models, both the over-dispersion parameter $a$ and the functional parameter $P$ are statistically significant.
Further, we compared the AIC values for two situations in Table 12; the first column shows models that include exposure as an offset in the count part, and the second column shows those that include exposure as a covariate in the count and zero-inflation parts. The results indicate that the models with exposure included as a covariate in both parts have a lower AIC, suggesting that they fit the data better than those with exposure included only as an offset.
Likelihood ratio tests for various statistical models applied to the French dataset are presented in Table 13.
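For nested models that differ by a single restriction, such as HGP-1 (with $P = 1$) against HGP-P, the LRT statistic is twice the difference in maximized loglikelihoods and is compared with a chi-square distribution with one degree of freedom. The Python sketch below is a minimal illustration with hypothetical loglikelihood values (not those of Table 13); it uses the identity $P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})$, so it applies only to the one-degree-of-freedom case.

```python
import math

def lrt_one_df(loglik_restricted, loglik_full):
    """LRT statistic and p-value for one restriction (chi-square, 1 df)."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    stat = max(stat, 0.0)                 # guard against optimizer noise
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical maximized loglikelihoods for a restricted and a full model.
stat, p_value = lrt_one_df(loglik_restricted=-1002.0, loglik_full=-1000.0)
print(stat, round(p_value, 4))
```

A small p-value indicates that the restriction (for example, fixing $P$) is rejected in favor of the more general model.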

5. The Lasso Regression

In this section, we briefly study the variable selection problem associated with the HGP-P and HNB-P regression models discussed in the paper by using the US National Medical Expenditure Survey 1987/88 data. Variable selection is important because it may simplify the regression model as well as reduce the out-of-sample prediction error.
Lasso regression, introduced in Tibshirani (1996), has been proven to be an effective method for variable selection. Park and Hastie (2007) expanded Lasso regression to a generalized linear model to handle count data. Related to this paper’s context, Tang et al. (2014) proposed an EM adaptive Lasso method to select risk factors (covariates) for an auto insurance claim dataset. Wang et al. (2015) employed it to address the issue of variable selection for a model with zero inflation and over-dispersion.
In this study, we apply a simplified version of the Lasso shrinkage method, which aims to maximize the penalized log likelihood function
$$\log L - \lambda \sum_{i \ge 1} |\beta_i|,$$
where $\lambda \ge 0$ is the tuning parameter and $\beta_i$ are the parameters of interest. The intercept $\beta_0$ and the model parameters $a$ and $P$ are excluded from the penalty.
When λ increases, the estimates of the coefficient values deviate from maximum likelihood estimates, resulting in lower in-sample goodness-of-fit. However, the model is simplified, potentially improving the out-of-sample performance.
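The penalized objective can be written as a thin wrapper around any loglikelihood value. The Python sketch below is illustrative: the coefficient vector and loglikelihood values are hypothetical, the function name is introduced here, and the first element of `beta` is taken to be the intercept, which is left unpenalized as described above.

```python
def penalized_loglik(loglik_value, beta, lam):
    """Lasso-penalized loglikelihood; beta[0] is the unpenalized intercept."""
    if lam < 0:
        raise ValueError("the tuning parameter must be non-negative")
    return loglik_value - lam * sum(abs(b) for b in beta[1:])

# Hypothetical example: a sparser coefficient vector with a slightly lower
# loglikelihood can still have the higher penalized objective.
dense = penalized_loglik(-500.0, [1.2, 0.40, -0.05, 0.02], lam=10.0)
sparse = penalized_loglik(-500.3, [1.2, 0.38, 0.0, 0.0], lam=10.0)
print(dense, sparse)
```

With `lam = 0` the objective reduces to the ordinary loglikelihood, so the maximum likelihood fit is recovered as a special case.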
Since the LL of the logistic and truncated parts of hurdle models can be separated, we may perform Lasso regression separately for the two parts. Lasso regression for the logistic part can be executed in R utilizing the “glmnet” package. To our knowledge, Lasso regression for the truncated functional form of generalized Poisson (TGP-P) regression and the truncated functional form of generalized negative binomial (TNB-P) regression has not been implemented in the literature. Therefore, we implemented it using our own R code.
To obtain the optimal value of λ that leads to the most accurate out-of-sample prediction, we applied five-fold cross-validation.
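The cross-validation loop can be sketched in Python as follows. This is a generic outline rather than the authors' R implementation: `fit_and_score` is a hypothetical callable, supplied by the user, that fits the penalized model on the training indices for a given $\lambda$ and returns the deviance on the held-out indices.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Split indices 0..n-1 into k disjoint folds after a random shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_lambda(lambdas, n, fit_and_score, k=5, seed=0):
    """Return the lambda with the smallest total held-out deviance."""
    folds = k_fold_indices(n, k, seed)
    best_lam, best_score = None, float("inf")
    for lam in lambdas:
        score = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n) if i not in held]
            score += fit_and_score(lam, train, held_out)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam

# Toy scoring rule standing in for a real model fit (assumption).
def toy_score(lam, train, held_out):
    return (lam - 3.0) ** 2 * len(held_out)

best = select_lambda([0.0, 1.0, 3.0, 10.0], n=50, fit_and_score=toy_score)
print(best)
```

Using the same folds for every candidate $\lambda$ keeps the comparison between tuning values from being affected by the random split.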
As shown in Table 14, for the logistic part, we find that the optimal tuning parameter is 10; four variables, “regionnortheast”, “age”, “faminc” and “employedyes”, are removed from the model. This results in a decrease in the out-of-sample deviance from 746.0 to 740.4.
The results of the Lasso regression with TNB-P and TGP-P models are shown in Table 15. At the optimal value of the tuning parameter λ (18.95 and 10.77 for TNB-P and TGP-P, respectively), shrunken models lead to lower out-of-sample deviances and thus perform better than the full models. Furthermore, the out-of-sample prediction accuracy of the TNB-P model is lower than that of the TGP-P model.
Considering both Table 14 and Table 15, we can see that the variable “employedyes” should be removed from both the zero and non-zero parts of the model. However, the other variables that are candidates for removal differ between the zero and non-zero parts.

6. Discussion and Conclusions

In this paper, we explored the zero-inflated and hurdle-generalized Poisson/negative binomial models for analyzing count data. It was shown that such models can effectively tackle the common challenges of excess zeros and over-dispersion in analyzing insurance claim data. The nested structure of these models allows for the use of a likelihood ratio test to select the most appropriate model.
Further, we provided a detailed study of how to include exposure information in zero-inflated and hurdle models. We find that including exposure as a covariate in both the zero and non-zero parts can provide better results than just including it in the non-zero part as an offset.
Finally, we showed that Lasso regression can be applied to HGP-P and HNB-P regression models for variable selection.
There are several directions to be explored for future research. One is to apply Bayesian methods to GP regression models, focusing on modeling over-dispersed count data. An earlier study in this direction was presented by Scollnik (1995). Another direction is to investigate the variable selection of zero-inflated or hurdle models. Techniques such as linear shrinkage, pretest, shrinkage pretest, Stein-type, and positive Stein-type Liu estimators, see, e.g., Stein (1981), Ledoit and Wolf (2003), and Månsson et al. (2012), could be considered in the context of the ZIGP-P or HGP-P models.

Author Contributions

Conceptualization, P.F., S.L. and J.R.; methodology, P.F., S.L. and J.R.; software, P.F.; validation, P.F., S.L. and J.R.; formal analysis, P.F., S.L. and J.R.; investigation, P.F., S.L. and J.R.; resources, P.F.; data curation, P.F.; writing – original draft preparation, P.F.; writing – review and editing, P.F., S.L. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

Shu Li was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), grant number RGPIN-2019-06219. Jiandong Ren was funded by NSERC grant number RGPIN-2019-06561.

Data Availability Statement

In support of the results reported in this paper, the French Motor Third Party Liability Claims dataset, freMTPL2freq, was used; it is available through the “CASdatasets” package in R and can be loaded with the commands library(CASdatasets) and data(freMTPL2freq). The US National Medical Expenditure Survey dataset was also used; it is available through the “MixAll” package in R and can be loaded with the commands library(MixAll) and data(DebTrivedi). Due to privacy and ethical constraints, the Malaysian dataset referenced in this study is not available for public sharing. We adhere strictly to MDPI’s data-sharing policies, ensuring that all data supporting our findings, except those restricted, are readily accessible. Detailed guidelines and policies regarding data availability are available on the MDPI Ethics website.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

A full list of the abbreviations used in this manuscript (ordered alphabetically):
AIC: Akaike information criterion
BIC: Bayesian information criterion
CFMNB-2: Constrained two-point finite mixture of negative binomials
CFMNB-3: Constrained three-point finite mixture of negative binomials
FMNB-2: Two-point finite mixture of negative binomials
LL: Log likelihood
LRT: Likelihood ratio test
GP: Generalized Poisson
GP-P: Functional form of generalized Poisson
HGP: Hurdle-generalized Poisson
HGP-P: Hurdle functional form of generalized Poisson
HNB: Hurdle negative binomial
HNB-P: Hurdle functional form of negative binomial
HP: Hurdle Poisson
NB-P: Functional form of negative binomial
NBM2: Two-point negative binomial mixture
TGP-P: Truncated functional form of generalized Poisson
TNB-P: Truncated functional form of negative binomial
ZIGP-P: Zero-inflated functional form of generalized Poisson
ZINB-P: Zero-inflated functional form of negative binomial
ZIP: Zero-inflated Poisson

References

  1. Agresti, Alan. 2015. Foundations of Linear and Generalized Linear Models. Hoboken: John Wiley & Sons.
  2. Bhaktha, Nivedita. 2018. Properties of Hurdle Negative Binomial Models for Zero-Inflated and Overdispersed Count Data. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA.
  3. Boucher, Jean-Philippe, Michel Denuit, and Montserrat Guillén. 2007. Risk classification for claim counts: A comparative analysis of various zero inflated mixed Poisson and hurdle models. North American Actuarial Journal 11: 110–31.
  4. Calderín-Ojeda, Enrique, Emilio Gómez-Déniz, and Inmaculada Barranco-Chamorro. 2019. Modelling zero-inflated count data with a special case of the generalised Poisson distribution. ASTIN Bulletin: The Journal of the IAA 49: 689–707.
  5. Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. Cambridge: Cambridge University Press, vol. 53.
  6. Consul, Prem C. 1993. A model for distributions of injuries in auto-accidents. Insurance: Mathematics and Economics 13: 147.
  7. Consul, Prem C., and Mohamed M. Shoukri. 1988. Some chance mechanisms related to a generalized Poisson probability model. American Journal of Mathematical and Management Sciences 8: 181–202.
  8. Consul, Prem C., and Gaurav C. Jain. 1973. A generalization of the Poisson distribution. Technometrics 15: 791–9.
  9. Cragg, John G. 1971. Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica: Journal of the Econometric Society 39: 829–44.
  10. Czado, Claudia, Vinzenz Erhardt, Aleksey Min, and Stefan Wagner. 2007. Zero-inflated generalized Poisson models with regression effects on the mean, dispersion and zero-inflation level applied to patent outsourcing rates. Statistical Modelling 7: 125–53.
  11. Dai, Lin, Michael D. Sweat, and Mulugeta Gebregziabher. 2018. Modeling excess zeros and heterogeneity in count data from a complex survey design with application to the demographic health survey in sub-Saharan Africa. Statistical Methods in Medical Research 27: 208–20.
  12. Dean, Charmaine, Jerry F. Lawless, and Gord E. Willmot. 1989. A mixed Poisson–inverse-Gaussian regression model. Canadian Journal of Statistics 17: 171–81.
  13. Deb, Partha, and Pravin K. Trivedi. 1997. Demand for medical care by the elderly: A finite mixture approach. Journal of Applied Econometrics 12: 313–36.
  14. Denuit, Michel, Xavier Maréchal, Sandra Pitrebois, and Jean-François Walhin. 2007. Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. Hoboken: John Wiley & Sons.
  15. Dionne, Georges, and Charles Vanasse. 1989. A generalization of automobile insurance rating models: The negative binomial distribution with a regression component. ASTIN Bulletin: The Journal of the IAA 19: 199–212.
  16. Famoye, Felix, and Karan P. Singh. 2006. Zero-inflated generalized Poisson regression model with an application to domestic violence data. Journal of Data Science 4: 117–30.
  17. Feng, Cindy. 2022. Zero-inflated models for adjusting varying exposures: A cautionary note on the pitfalls of using offset. Journal of Applied Statistics 49: 1–23.
  18. Frees, Edward W. 2009. Regression Modeling with Actuarial and Financial Applications. Cambridge: Cambridge University Press.
  19. Frees, Edward W., Richard A. Derrig, and Glenn Meyers. 2014. Predictive Modeling Applications in Actuarial Science. Cambridge: Cambridge University Press, vol. 1.
  20. Frees, Edward W., and Emiliano A. Valdez. 2008. Hierarchical insurance claims modeling. Journal of the American Statistical Association 103: 1457–69.
  21. Fuzi, Mohd Fadzli Mohd, Abdul Aziz Jemain, and Noriszura Ismail. 2016. Bayesian quantile regression model for claim count data. Insurance: Mathematics and Economics 66: 124–37.
  22. Gerber, Hans U. 1990. When does the surplus reach a given target? Insurance: Mathematics and Economics 9: 115–9.
  23. Greene, William. 2008. Functional forms for the negative binomial model for count data. Economics Letters 99: 585–90.
  24. Hilbe, Joseph M. 2011. Negative Binomial Regression. Cambridge: Cambridge University Press.
  25. Ismail, Noriszura, and Abdul Aziz Jemain. 2007. Handling overdispersion with negative binomial and generalized Poisson regression models. In Casualty Actuarial Society Forum. Arlington County: Casualty Actuarial Society, vol. 2007, pp. 103–58.
  26. Ismail, Noriszura, and Hossein Zamani. 2013. Estimation of claim count data using negative binomial, generalized Poisson, zero-inflated negative binomial and zero-inflated generalized Poisson regression models. In Casualty Actuarial Society E-Forum. Arlington County: Casualty Actuarial Society, vol. 41, pp. 1–28.
  27. Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. Berlin: Springer Science & Business Media.
  28. Lambert, Diane. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.
  29. Ledoit, Olivier, and Michael Wolf. 2003. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance 10: 603–21.
  30. Lee, Andy H., Kui Wang, and Kelvin K. W. Yau. 2001. Analysis of zero-inflated Poisson data incorporating extent of exposure. Biometrical Journal 43: 963–75.
  31. Loquiha, Osvaldo, Niel Hens, Leonardo Chavane, Marleen Temmerman, and Marc Aerts. 2013. Modeling heterogeneity for count data: A study of maternal mortality in health facilities in Mozambique. Biometrical Journal 55: 647–60.
  32. Månsson, Kristofer, B. M. Golam Kibria, and Ghazi Shukur. 2012. On Liu estimators for the logit regression model. Economic Modelling 29: 1483–88.
  33. Mullahy, John. 1986. Specification and testing of some modified count data models. Journal of Econometrics 33: 341–65.
  34. Park, Myung Hyun, and Joseph H. T. Kim. 2021. Modelling healthcare demand count data with excessive zeros and overdispersion. Global Economic Review 50: 358–81.
  35. Park, Mee Young, and Trevor Hastie. 2007. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69: 659–77.
  36. Renshaw, Arthur E. 1994. Modelling the claims process in the presence of covariates. ASTIN Bulletin: The Journal of the IAA 24: 265–85.
  37. Saffari, Seyed Ehsan, Robiah Adnan, and William Greene. 2013. Investigating the impact of excess zeros on hurdle-generalized Poisson regression model with right censored count data. Statistica Neerlandica 67: 67–80.
  38. Scollnik, David P. M. 1995. Bayesian analysis of two overdispersed Poisson models. Biometrics 51: 1117–26.
  39. Stein, Charles M. 1981. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9: 1135–51.
  40. Tang, Yanlin, Liya Xiang, and Zhongyi Zhu. 2014. Risk factor selection in rate making: EM adaptive LASSO for zero-inflated Poisson regression models. Risk Analysis 34: 1112–27.
  41. Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58: 267–88.
  42. Wang, Shuo, Wangxue Chen, Meng Chen, and Yawen Zhou. 2023. Maximum likelihood estimation of the parameters of the inverse Gaussian distribution using maximum rank set sampling with unequal samples. Mathematical Population Studies 30: 1–21.
  43. Wang, Weiren, and Felix Famoye. 1997. Modeling household fertility decisions with generalized Poisson regression. Journal of Population Economics 10: 273–83.
  44. Wang, Zhu, Shuangge Ma, and Ching-Yun Wang. 2015. Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany. Biometrical Journal 57: 867–84.
  45. Wüthrich, Mario V., and Michael Merz. 2008. Stochastic Claims Reserving Methods in Insurance. Hoboken: John Wiley & Sons.
  46. Yip, Karen C. H., and Kelvin K. W. Yau. 2005. On modeling claim frequency data in general insurance with extra zeros. Insurance: Mathematics and Economics 36: 153–63.
  47. Zamani, Hossein, and Noriszura Ismail. 2012. Functional form for the generalized Poisson regression model. Communications in Statistics-Theory and Methods 41: 3666–75.
  48. Zhen, Zhen, Liyang Shao, and Lianjun Zhang. 2018. Spatial hurdle models for predicting the number of children with lead poisoning. International Journal of Environmental Research and Public Health 15: 1792.
  49. Zuo, Guoxin, Kang Fu, Xianhua Dai, and Liwei Zhang. 2021. Generalized Poisson hurdle model for count data and its application in ear disease. Entropy 23: 1206.

Table 1. Parameter estimates, t-ratios, and model fit measures for simulated data using the ZIP model under various exposure scenarios.

| Par | ZIP Est. | ZIP abs. t | ZIPee Est. | ZIPee abs. t | ZIPe Est. | ZIPe abs. t | ZIP11 Est. | ZIP11 abs. t | ZIP1 Est. | ZIP1 abs. t |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic proportion of models | | | | | | | | | | |
| γ0 | −1.53 | 34.92 | 0.49 | 7.20 | −1.60 | 35.20 | −2.92 | 72.02 | −1.85 | 36.64 |
| γ1 | 0.55 | 12.52 | 0.71 | 14.00 | 0.58 | 12.95 | 0.69 | 17.43 | 0.68 | 14.34 |
| γ2 | 0.53 | 22.25 | 0.70 | 25.07 | 0.56 | 22.93 | 0.66 | 29.60 | 0.66 | 25.23 |
| ξ1 | | | −1.59 | 39.98 | | | | | | |
| Count proportion of models | | | | | | | | | | |
| β0 | 1.86 | 336.0 | 1.16 | 106.9 | 1.16 | 105.6 | 0.04 | 6.75 | 0.04 | 7.51 |
| β1 | 0.80 | 164.4 | 0.79 | 162.5 | 0.79 | 162.5 | 0.79 | 160.9 | 0.79 | 160.8 |
| β2 | 0.83 | 341.6 | 0.82 | 334.4 | 0.82 | 334.4 | 0.81 | 329.1 | 0.81 | 328.9 |
| ξ2 | | | 0.41 | 78.01 | 0.41 | 78.17 | | | | |
| LL | −45,067 | | −40,652 | | −41,671 | | −50,271 | | −47,284 | |
| AIC | 90,147 | | 81,319 | | 83,356 | | 100,553 | | 94,579 | |
| BIC | 90,190 | | 81,377 | | 83,406 | | 100,597 | | 94,623 | |

Table 2. Parameter estimates, t-ratios, and model fit measures for simulated data using the ZIGP-P model under various exposure scenarios.

| Par | ZIGP-P Est. | ZIGP-P abs. t | ZIGP-Pee Est. | ZIGP-Pee abs. t | ZIGP-Pe Est. | ZIGP-Pe abs. t | ZIGP-P11 Est. | ZIGP-P11 abs. t | ZIGP-P1 Est. | ZIGP-P1 abs. t |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic part of the models | | | | | | | | | | |
| γ0 | −1.65 | 35.03 | 0.40 | 5.75 | −1.74 | 35.16 | −3.23 | 66.73 | 0.46 | 33.39 |
| γ1 | 0.60 | 13.24 | 0.74 | 14.40 | 0.64 | 13.72 | 0.81 | 18.94 | −1.47 | 31.21 |
| γ2 | 0.58 | 23.24 | 0.74 | 25.52 | 0.62 | 23.94 | 0.77 | 31.09 | 0.61 | 13.49 |
| ξ1 | | | −1.58 | 39.43 | | | | | | |
| Non-zero part of the models | | | | | | | | | | |
| β0 | 1.88 | 138.5 | 1.06 | 41.95 | 0.99 | 37.48 | 0.07 | 4.69 | 0.61 | 13.49 |
| β1 | 0.79 | 57.80 | 0.79 | 61.5 | 0.79 | 61.21 | 0.80 | 51.87 | 0.78 | 60.03 |
| β2 | 0.82 | 111.7 | 0.81 | 115.9 | 0.81 | 115.8 | 0.81 | 95.22 | 0.80 | 115.9 |
| ξ2 | | | 0.47 | 37.30 | 0.50 | 38.09 | | | | |
| a | 0.39 | 15.31 | 0.17 | 13.41 | 0.19 | 13.28 | 0.45 | 14.50 | 0.39 | 16.60 |
| P | 1.46 | 75.21 | 1.67 | 77.16 | 1.64 | 75.51 | 1.46 | 72.39 | 1.43 | 82.04 |
| LL | −30,270 | | −28,606 | | −29,582 | | −33,088 | | −29,685 | |
| AIC | 60,556 | | 57,233 | | 59,181 | | 66,193 | | 59,386 | |
| BIC | 60,614 | | 57,304 | | 59,246 | | 66,250 | | 59,443 | |

Table 3. Comparing the model fitness of ZIGP-Pee, ZIGP-Pe, ZIGP-P11, and ZIGP-P1 based on the mean values of AIC and BIC over 100 simulated datasets created from the ZIP model.

| Criterion | n | ZIGP-Pee | ZIGP-Pe | ZIGP-P11 | ZIGP-P1 |
|---|---|---|---|---|---|
| AIC | 1000 | 5721.15 | 5915.20 | 6633.57 | 6037.35 |
| AIC | 5000 | 28,575.05 | 29,503.79 | 33,097.63 | 30,157.15 |
| AIC | 10,000 | 57,194.15 | 59,054.38 | 66,010.11 | 60,329.38 |
| BIC | 1000 | 5770.23 | 5959.37 | 6672.83 | 6076.62 |
| BIC | 5000 | 28,640.22 | 29,562.44 | 33,149.77 | 30,209.29 |
| BIC | 10,000 | 57,266.26 | 59,119.27 | 66,067.79 | 60,387.06 |

Table 4. Parameter estimates, t-ratios, and model fit measures for simulated data using the HGP-P model under various exposure scenarios.

| Par | HGP-P Est. | HGP-P abs. t | HGP-Pee Est. | HGP-Pee abs. t | HGP-Pe Est. | HGP-Pe abs. t | HGP-P11 Est. | HGP-P11 abs. t | HGP-P1 Est. | HGP-P1 abs. t |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic part of the models | | | | | | | | | | |
| γ0 | −1.46 | 34.72 | 0.59 | 9.06 | −1.46 | 34.72 | −2.54 | 73.88 | −1.46 | 34.72 |
| γ1 | 0.52 | 11.99 | 0.66 | 13.29 | 0.52 | 11.99 | 0.54 | 14.44 | 0.52 | 11.99 |
| γ2 | 0.49 | 21.40 | 0.64 | 24.12 | 0.49 | 21.40 | 0.48 | 24.93 | 0.49 | 21.40 |
| ξ1 | | | −1.58 | 40.38 | | | | | | |
| Non-zero part of the models | | | | | | | | | | |
| β0 | 1.88 | 135.3 | 1.07 | 41.75 | 1.07 | 41.75 | 0.10 | 7.29 | 0.10 | 7.29 |
| β1 | 0.80 | 57.70 | 0.79 | 61.35 | 0.79 | 61.35 | 0.81 | 52.75 | 0.81 | 52.75 |
| β2 | 0.82 | 110.5 | 0.81 | 114.6 | 0.81 | 114.7 | 0.82 | 95.28 | 0.82 | 95.28 |
| ξ2 | | | 0.46 | 36.68 | 0.46 | 36.68 | | | | |
| a | 0.41 | 14.94 | 0.17 | 13.22 | 0.17 | 13.22 | 0.26 | 15.71 | 0.26 | 15.71 |
| P | 1.45 | 73.14 | 1.66 | 75.85 | 1.66 | 75.85 | 1.62 | 86.18 | 1.62 | 86.18 |
| LL | −30,287 | | −28,633 | | −29,678 | | −33,564 | | −30,472 | |
| AIC | 60,589 | | 57,286 | | 59,374 | | 67,144 | | 60,960 | |
| BIC | 60,647 | | 57,358 | | 59,439 | | 67,201 | | 61,018 | |

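Table 5. Parameter estimates and absolute t-ratios for the Malaysian Motor Insurance Data.

Poisson, GP-P, and NB-P columns: coefficients for the non-zero part of the models. Logistic columns: logistic coefficients.

| Parameter | Poisson Est. | Poisson abs. t | GP-P Est. | GP-P abs. t | NB-P Est. | NB-P abs. t | Logistic Est. | Logistic abs. t |
|---|---|---|---|---|---|---|---|---|
| Intercept | −2.59 | 49.29 | −2.75 | 17.14 | −2.81 | 16.94 | 3.00 | 3.14 |
| 2–3 year | 0.50 | 39.68 | 0.54 | 12.30 | 0.53 | 12.24 | −2.14 | 2.92 |
| 4–5 year | 0.48 | 36.46 | 0.49 | 10.78 | 0.49 | 10.99 | −0.99 | 1.66 |
| 6–7 year | 0.41 | 31.15 | 0.44 | 9.85 | 0.43 | 9.78 | −1.31 | 2.02 |
| above 8 | 0.26 | 20.33 | 0.27 | 6.11 | 0.27 | 6.10 | 0.15 | 0.25 |
| 1001–1300 cc | −0.10 | 4.40 | −0.10 | 1.45 | −0.10 | 1.51 | −0.49 | 0.78 |
| 1301–1500 cc | 0.10 | 4.26 | 0.07 | 1.12 | 0.08 | 1.07 | −1.72 | 2.04 |
| 1501–1800 cc | 0.30 | 12.56 | 0.27 | 3.93 | 0.28 | 3.91 | −1.51 | 1.66 |
| above 1800 cc | 0.38 | 16.12 | 0.37 | 5.30 | 0.37 | 5.13 | −1.47 | 1.49 |
| Local type 2 | −0.26 | 12.01 | −0.33 | 5.31 | −0.31 | 4.76 | −0.09 | 0.09 |
| Foreign type 1 | −0.28 | 23.55 | −0.25 | 6.14 | −0.25 | 6.33 | 1.27 | 1.43 |
| Foreign type 2 | 0.00 | 0.15 | 0.06 | 1.03 | 0.06 | 1.01 | 0.12 | 0.15 |
| Foreign type 3 | −0.16 | 7.69 | −0.13 | 1.87 | −0.13 | 1.90 | 2.03 | 2.03 |
| East | 0.24 | 13.27 | 0.30 | 5.13 | 0.29 | 4.92 | −0.47 | 0.73 |
| Central | 0.35 | 30.02 | 0.33 | 8.24 | 0.33 | 8.27 | −1.54 | 1.77 |
| South | 0.23 | 18.15 | 0.26 | 5.89 | 0.25 | 5.79 | 0.36 | 0.56 |
| East Malaysia | 0.08 | 5.48 | 0.07 | 1.42 | 0.08 | 1.53 | −0.02 | 0.04 |
| log(Exposure) | 0.93 | 187.48 | 0.95 | 59.64 | 0.95 | 59.21 | −1.15 | 6.19 |
| a | - | - | 1.51 | 8.01 | 5.34 | 6.64 | - | - |
| P | - | - | 1.09 | 42.39 | 1.12 | 34.66 | - | - |
| LL | −3809.43 | | −2028.35 | | −2036.86 | | −82.35 | |
| AIC | 7654.85 | | 4096.70 | | 4113.73 | | 200.71 | |
| BIC | 7730.61 | | 4180.87 | | 4197.90 | | 278.35 | |
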
Table 6. The number of parameters, LL, AIC, and BIC of different models for Malaysian Motor Insurance Data.

| Models | No. of Parameters | LL | AIC | BIC |
|---|---|---|---|---|
| Poisson | 18 | −3917.5 | 7871.1 | 7948.7 |
| GP-1 | 19 | −2166.1 | 4370.1 | 4452.1 |
| GP-2 | 19 | −2441.6 | 4921.2 | 5003.1 |
| GP-P | 20 | −2146.4 | 4332.8 | 4419.1 |
| NB-1 | 19 | −2191.0 | 4419.9 | 4501.9 |
| NB-2 | 19 | −2324.1 | 4686.2 | 4768.2 |
| NB-P | 20 | −2173.6 | 4387.3 | 4473.5 |
| ZIP | 36 | −3899.9 | 7871.9 | 8027.2 |
| ZIGP-1 | 37 | −2281.4 | 4636.9 | 4796.5 |
| ZIGP-2 | 37 | −2659.2 | 5392.3 | 5551.9 |
| ZIGP-P | 38 | −2167.1 | 4410.3 | 4574.2 |
| ZINB-1 | 37 | −2695.5 | 5464.9 | 5624.5 |
| ZINB-2 | 37 | −2356.8 | 4787.5 | 4947.1 |
| ZINB-P | 38 | −2153.8 | 4383.6 | 4547.5 |
| HP | 36 | −3891.8 | 7855.6 | 8008.9 |
| HGP-1 | 37 | −2116.2 | 4306.5 | 4464.1 |
| HGP-2 | 37 | −2420.6 | 4915.1 | 5072.7 |
| HGP-P | 38 | −2110.7 | 4297.4 | 4459.2 |
| HNB-1 | 37 | −2125.9 | 4325.8 | 4483.4 |
| HNB-2 | 37 | −2321.9 | 4717.8 | 4875.4 |
| HNB-P | 38 | −2119.2 | 4314.4 | 4476.2 |

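Table 7. Parameter estimates and absolute t-ratios for the US National Medical Expenditure Survey dataset.

Poisson, GP-P, and NB-P columns: coefficients for the non-zero part of the models. Logistic columns: logistic coefficients.

| Parameter | Poisson Est. | Poisson abs. t | GP-P Est. | GP-P abs. t | NB-P Est. | NB-P abs. t | Logistic Est. | Logistic abs. t |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1.84 | 20.46 | 1.62 | 7.01 | 1.60 | 6.63 | −1.37 | 2.27 |
| Poorhlth | 0.28 | 15.19 | 0.31 | 6.18 | 0.31 | 6.18 | 0.07 | 0.42 |
| Exclhlth | −0.34 | 10.67 | −0.37 | 4.86 | −0.39 | 4.81 | −0.32 | 2.27 |
| Numchron | 0.12 | 24.80 | 0.15 | 12.40 | 0.15 | 11.92 | 0.55 | 12.14 |
| Adldiff | 0.12 | 7.19 | 0.10 | 2.26 | 0.12 | 2.71 | −0.18 | −1.44 |
| Noreast | 0.11 | 5.92 | 0.10 | 2.09 | 0.12 | 2.32 | 0.03 | 0.21 |
| Other regions | 0.02 | 1.24 | 0.01 | 0.33 | 0.02 | 0.46 | −0.10 | −0.89 |
| Midwest | 0.12 | 6.06 | 0.12 | 2.45 | 0.14 | 2.62 | 0.10 | 0.71 |
| Age | −0.08 | 6.92 | −0.07 | 2.47 | −0.08 | 2.68 | 0.19 | 2.51 |
| Black | 0.00 | 0.03 | −0.03 | 0.53 | −0.03 | 0.50 | −0.32 | 2.52 |
| Male | −0.01 | 0.71 | −0.02 | 0.65 | −0.02 | 0.61 | −0.46 | 4.82 |
| Married | −0.07 | 4.60 | −0.06 | 1.66 | −0.07 | 1.80 | 0.25 | 2.41 |
| School | 0.02 | 9.58 | 0.02 | 3.77 | 0.02 | 3.82 | 0.05 | 4.24 |
| Faminc | 0.00 | 1.31 | 0.00 | 0.35 | 0.00 | 0.46 | 0.01 | 0.36 |
| Employed | 0.06 | 2.69 | −0.01 | 0.10 | 0.03 | 0.56 | −0.01 | 0.09 |
| Private health | 0.19 | 9.51 | 0.24 | 4.58 | 0.27 | 4.90 | 0.76 | 6.85 |
| Medicaid | 0.19 | 7.35 | 0.25 | 3.63 | 0.27 | 3.75 | 0.55 | 3.21 |
| a | - | - | 0.60 | 5.43 | 1.67 | 4.20 | - | - |
| P | - | - | 1.45 | 14.95 | 1.56 | 12.28 | - | - |
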
Table 8. Number of parameters, LL, AIC, and BIC for different models for the US National Medical Expenditure Survey dataset.

| Models | No. of Parameters | LL | AIC | BIC |
|---|---|---|---|---|
| Poisson | 17 | −18,134 | 36,303 | 36,412 |
| GP-1 | 18 | −12,147 | 24,330 | 24,445 |
| GP-2 | 18 | −12,237 | 24,510 | 24,625 |
| GP-P | 19 | −12,147 | 24,332 | 24,453 |
| NB-1 | 18 | −12,156 | 24,348 | 24,463 |
| NB-2 | 18 | −12,202 | 24,440 | 24,555 |
| NB-P | 19 | −12,155 | 24,348 | 24,470 |
| ZIP | 34 | −16,290 | 32,648 | 32,862 |
| ZIGP-1 | 35 | −12,096 | 24,261 | 24,485 |
| ZIGP-2 | 35 | −12,095 | 24,259 | 24,483 |
| ZIGP-P | 36 | −12,085 | 24,242 | 24,472 |
| ZINB-1 | 35 | −12,133 | 24,336 | 24,560 |
| ZINB-2 | 35 | −12,117 | 24,304 | 24,528 |
| ZINB-P | 36 | −12,114 | 24,301 | 24,531 |
| HP | 34 | −16,290 | 32,648 | 32,862 |
| HGP-1 | 35 | −12,085 | 24,240 | 24,460 |
| HGP-2 | 35 | −12,096 | 24,262 | 24,482 |
| HGP-P | 36 | −12,077 | 24,227 | 24,453 |
| HNB-1 | 35 | −12,113 | 24,296 | 24,517 |
| HNB-2 | 35 | −12,110 | 24,291 | 24,511 |
| HNB-P | 36 | −12,104 | 24,280 | 24,507 |
| NBM2 | 33 | −12,139 | 24,343 | 24,554 |
| CFMNB-2 * | 21 | −12,098 | 24,238 | 24,372 |
| FMNB-2 * | 37 | −12,073 | 24,220 | 24,456 |
| CFMNB-3 * | 24 | −12,098 | 24,244 | 24,397 |
| CFMNB-2 ** | 21 | −12,149 | 24,340 | 24,474 |
| FMNB-2 ** | 37 | −12,134 | 24,342 | 24,579 |
| CFMNB-3 ** | 24 | −12,149 | 24,346 | 24,499 |

* Based on the NB-1. ** Based on the NB-2.

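As a reading aid for the model-comparison tables, the sketch below computes the two information criteria from a log-likelihood and a parameter count, using the standard definitions AIC = 2k − 2LL and BIC = k ln(n) − 2LL. The function names are illustrative, and the sample size `n_obs` is left as an argument because it is not restated with the table.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*LL."""
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*LL."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# GP-P row of Table 8: 19 parameters, LL = -12,147.
print(aic(-12147, 19))  # 24332.0, matching the AIC reported for GP-P
```

Since the tables report rounded log-likelihoods, a recomputed criterion can differ from the tabulated one by about a unit for some rows.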
Table 9. The description of the covariates in the French dataset.

| Variable | Description |
|---|---|
| VehPower | The power of the car. |
| VehAge | The vehicle age in years. |
| DrivAge | The driver age in years. |
| Log(density) | The log of the number of residents per square kilometer of the city where the car driver lives. |
| BonusMalus | Zero indicates a bonus, while one indicates a malus. |
| VehGas | The car’s fuel: zero for regular fuel and one for diesel. |
| Log(exposure) | The log of the period of exposure for a policy in years. |

Table 10. LL, AIC, BIC, and computational time (CT) for various statistical models applied to the French dataset.

| Models | LL | AIC | BIC | CT (Seconds) |
|---|---|---|---|---|
| Poisson | −140,092 | 280,201 | 280,292 | 69 |
| GP-1 | −139,593 | 279,205 | 279,308 | 236 |
| GP-2 | −139,694 | 279,407 | 279,510 | 471 |
| GP-P | −139,586 | 279,191 | 279,305 | 1562 |
| NB-1 | −139,602 | 279,222 | 279,325 | 898 |
| NB-2 | −139,700 | 279,419 | 279,521 | 401 |
| NB-P | −139,596 | 279,212 | 279,327 | 1292 |
| ZIP | −139,709 | 279,450 | 279,632 | 1850 |
| ZIGP-1 | −139,573 | 279,180 | 279,374 | 1711 |
| ZIGP-2 | −139,653 | 279,340 | 279,534 | 1331 |
| ZIGP-P | −139,474 | 278,984 | 279,190 | 915 |
| ZINB-1 | −139,490 | 279,014 | 279,209 | 741 |
| ZINB-2 | −139,593 | 279,220 | 279,414 | 700 |
| ZINB-P | −139,481 | 278,997 | 279,203 | 1339 |
| HP | −139,665 | 279,361 | 279,521 | 125 |
| HGP-1 | −139,565 | 279,163 | 279,331 | 132 |
| HGP-2 | −139,572 | 279,177 | 279,345 | 157 |
| HGP-P | −139,562 | 279,160 | 279,336 | 211 |
| HNB-1 | −139,573 | 279,180 | 279,347 | 139 |
| HNB-2 | −139,578 | 279,190 | 279,357 | 135 |
| HNB-P | −139,571 | 279,178 | 279,354 | 269 |

Table 11. Parameter estimation and absolute t-ratio for ZIGP-P, ZINB-P, HGP-P, and HNB-P models for the French dataset.

| Parameter | ZIGP-P Est. | ZIGP-P abs. t | ZINB-P Est. | ZINB-P abs. t | HGP-P Est. | HGP-P abs. t | HNB-P Est. | HNB-P abs. t |
|---|---|---|---|---|---|---|---|---|
| Logistic proportion of models | | | | | | | | |
| Intercept | −0.14 | 0.39 | −0.14 | 0.43 | 2.77 | 78.12 | 2.77 | 78.12 |
| VehPower | 0.10 | 2.51 | 0.10 | 2.67 | −0.01 | 1.25 | −0.01 | 1.25 |
| VehAge | −0.06 | 7.61 | −0.06 | 7.97 | 0.01 | 27.85 | 0.01 | 27.85 |
| DrivAge | 0.01 | 3.18 | 0.01 | 3.57 | −0.01 | 7.83 | −0.01 | 7.83 |
| Log(density) | 0.06 | 1.97 | 0.06 | 1.95 | −0.03 | 10.56 | −0.03 | 10.56 |
| BonusMalus | 5.93 | 2.54 | 5.93 | 2.38 | −1.04 | 47.42 | −1.04 | 47.42 |
| VehGas | 5.38 | 3.92 | 5.37 | 3.93 | 0.10 | 8.28 | 0.10 | 8.28 |
| Log(Exposure) | −0.54 | 6.21 | −0.54 | 6.69 | −0.38 | 60.67 | −0.38 | 60.67 |
| Count proportion of models | | | | | | | | |
| Intercept | −2.45 | 49.70 | −2.44 | 50.11 | −5.20 | 8.64 | −5.35 | 6.63 |
| VehPower | −0.01 | 1.25 | 0.00 | 1.18 | 0.12 | 3.49 | 0.12 | 3.38 |
| VehAge | −0.02 | 13.54 | −0.02 | 13.79 | −0.05 | 2.97 | −0.05 | 2.81 |
| DrivAge | 0.00 | 3.96 | 0.00 | 3.70 | 0.01 | 1.00 | 0.01 | 1.04 |
| Log(density) | 0.03 | 7.07 | 0.03 | 7.32 | 0.16 | 3.56 | 0.17 | 2.98 |
| BonusMalus | 0.80 | 24.45 | 0.80 | 26.29 | 1.68 | 7.66 | 1.74 | 6.01 |
| VehGas | −0.32 | 8.37 | −0.32 | 9.74 | 0.31 | 2.05 | 0.31 | 1.96 |
| Log(Exposure) | 0.41 | 50.46 | 0.41 | 51.96 | 0.54 | 4.05 | 0.55 | 3.45 |
| a | 0.01 | 2.36 | 0.02 | 3.09 | 0.02 | 3.80 | 0.05 | 2.09 |
| P | 0.72 | 5.08 | 0.71 | 6.52 | 0.83 | 13.03 | 0.84 | 7.48 |

Table 12. Comparison of AIC values of various models with exposure included as an offset in the count part and as a covariate based on the French dataset.

| Models | Exposure as an Offset in the Count Part | Exposure as a Covariate |
|---|---|---|
| Poisson | 288,718 | 280,201 |
| GP-P | 287,192 | 279,191 |
| NB-P | 287,212 | 279,212 |
| ZIP | 287,774 | 279,450 |
| ZIGP-P | 287,102 | 278,984 |
| ZINB-P | 287,120 | 278,997 |
| HP | 284,806 | 279,361 |
| HGP-P | 283,571 | 279,160 |
| HNB-P | 283,588 | 279,178 |

Table 13. Likelihood ratio tests for various statistical models applied to the French dataset.

| Models Compared | LRT Value | p-Value |
|---|---|---|
| GP-1 vs. Poisson | 997.8 | < 0.001 |
| GP-2 vs. Poisson | 795.8 | < 0.001 |
| GP-P vs. GP-1 | 15.6 | 0.0001 |
| GP-P vs. GP-2 | 217.6 | < 0.001 |
| NB-1 vs. Poisson | 980.6 | < 0.001 |
| NB-2 vs. Poisson | 784.2 | < 0.001 |
| NB-P vs. NB-1 | 11.6 | 0.0007 |
| NB-P vs. NB-2 | 208 | < 0.001 |
| ZIGP-1 vs. ZIP | 272.4 | < 0.001 |
| ZIGP-2 vs. ZIP | 111.6 | < 0.001 |
| ZIGP-P vs. ZIGP-1 | 196.8 | < 0.001 |
| ZIGP-P vs. ZIGP-2 | 357.6 | < 0.001 |
| ZINB-1 vs. ZIP | 437.4 | < 0.001 |
| ZINB-2 vs. ZIP | 231.4 | < 0.001 |
| ZINB-P vs. ZINB-1 | 19 | < 0.001 |
| ZINB-P vs. ZINB-2 | 225 | < 0.001 |
| HGP-1 vs. HP | 200.3 | < 0.001 |
| HGP-2 vs. HP | 186.4 | < 0.001 |
| HGP-P vs. HGP-1 | 5.1 | 0.024 |
| HGP-P vs. HGP-2 | 19 | < 0.001 |
| HNB-1 vs. HP | 183.9 | < 0.001 |
| HNB-2 vs. HP | 173.9 | < 0.001 |
| HNB-P vs. HNB-1 | 2.8 | 0.093 |
| HNB-P vs. HNB-2 | 12.8 | < 0.001 |

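The p-values in Table 13 can be checked with a short sketch. It assumes each pair of nested models differs by a single parameter, so the LRT statistic is referred to a chi-square distribution with one degree of freedom, whose survival function has the closed form erfc(sqrt(x/2)). The function names are illustrative and are not taken from the paper's code.

```python
import math


def lrt_statistic(ll_restricted, ll_full):
    """Likelihood ratio statistic: 2 * (LL of full model - LL of restricted model)."""
    return 2.0 * (ll_full - ll_restricted)


def chi2_sf_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# LRT values taken from Table 13.
for label, stat in [("GP-P vs. GP-1", 15.6),
                    ("NB-P vs. NB-1", 11.6),
                    ("HGP-P vs. HGP-1", 5.1)]:
    print(label, chi2_sf_df1(stat))
```

Under this one-degree-of-freedom assumption, the three statistics give p-values that round to the 0.0001, 0.0007, and 0.024 reported in the table.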
Table 14. Modeling results for the original full logistic regression model and shrunken model applied to the US National Medical Expenditure Survey dataset.

| Variables | Full Model Est. | Full Model p-Value | Lasso Regression Est. | Lasso Regression p-Value |
|---|---|---|---|---|
| Intercept | −1.04 | 0.00 | −1.24 | 0.00 |
| healthpoor | −0.53 | 0.00 | −0.28 | 0.09 |
| healthexcellent | 0.62 | 0.00 | 0.36 | 0.03 |
| numchron | 0.09 | 0.10 | 0.02 | 0.76 |
| adldiffyes | −0.20 | 0.17 | −0.08 | 0.56 |
| regionnoreast | −0.04 | 0.80 | 0.00 | 1.00 |
| regionother | 0.12 | 0.35 | 0.07 | 0.53 |
| regionwest | −0.30 | 0.06 | −0.13 | 0.33 |
| age | 0.01 | 0.78 | 0.00 | 1.00 |
| blackyes | 0.57 | 0.00 | 0.40 | 0.00 |
| gendermale | 0.52 | 0.00 | 0.39 | 0.00 |
| marriedyes | −0.24 | 0.03 | −0.08 | 0.46 |
| school | 0.14 | 0.01 | 0.10 | 0.05 |
| faminc | 0.02 | 0.71 | 0.00 | 1.00 |
| employedyes | 0.04 | 0.80 | 0.00 | 1.00 |
| privinsyes | −1.02 | 0.00 | −0.82 | 0.00 |
| medicaidyes | −0.57 | 0.00 | −0.22 | 0.23 |
| In-sample LL | −1436.9 | | −1446.7 | |
| Out-of-sample LL | −373.0 | | −370.2 | |
| In-sample deviance | 2873.8 | | 2893.4 | |
| Out-of-sample deviance | 746.0 | | 740.4 | |

Table 15. Modeling results for the original full TGP-P and TNB-P regression and shrunken models applied to the US National Medical Expenditure Survey dataset.

| Variables | TGP-P Full Est. | TGP-P Full p-Val | TGP-P Lasso Est. | TGP-P Lasso p-Val | TNB-P Full Est. | TNB-P Full p-Val | TNB-P Lasso Est. | TNB-P Lasso p-Val |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1.27 | 0.00 | 1.39 | 0.00 | 1.19 | 0.00 | 1.30 | 0.00 |
| healthpoor | 0.31 | 0.00 | 0.26 | 0.00 | 0.32 | 0.00 | 0.30 | 0.00 |
| healthexcellent | −0.42 | 0.00 | −0.28 | 0.00 | −0.40 | 0.00 | −0.31 | 0.00 |
| numchron | 0.15 | 0.00 | 0.15 | 0.00 | 0.14 | 0.00 | 0.14 | 0.00 |
| adldiffyes | 0.10 | 0.04 | 0.07 | 0.14 | 0.12 | 0.02 | 0.11 | 0.04 |
| regionnoreast ^a | 0.13 | 0.02 | 0.06 | 0.23 | 0.11 | 0.06 | 0.04 | 0.46 |
| regionother ^b,c | 0.04 | 0.45 | 0.00 | 1.00 | 0.06 | 0.23 | 0.00 | 1.00 |
| regionwest | 0.12 | 0.03 | 0.05 | 0.32 | 0.14 | 0.02 | 0.07 | 0.20 |
| age ^a | −0.04 | 0.04 | −0.03 | 0.14 | −0.04 | 0.05 | −0.04 | 0.09 |
| blackyes ^b,c | −0.03 | 0.50 | 0.00 | 1.00 | 0.01 | 0.92 | 0.00 | 1.00 |
| gendermale | −0.03 | 0.53 | −0.01 | 0.72 | −0.05 | 0.27 | −0.04 | 0.34 |
| marriedyes | −0.08 | 0.08 | −0.05 | 0.27 | −0.05 | 0.26 | −0.04 | 0.41 |
| school | 0.07 | 0.00 | 0.06 | 0.00 | 0.09 | 0.00 | 0.08 | 0.00 |
| faminc ^a,b | −0.01 | 0.76 | 0.00 | 1.00 | −0.02 | 0.34 | −0.02 | 0.43 |
| employedyes ^a,b,c | 0.10 | 0.12 | 0.00 | 1.00 | 0.07 | 0.32 | 0.00 | 1.00 |
| privinsyes | 0.24 | 0.00 | 0.15 | 0.01 | 0.29 | 0.00 | 0.22 | 0.00 |
| medicaidyes | 0.27 | 0.00 | 0.16 | 0.04 | 0.30 | 0.00 | 0.24 | 0.00 |
| a | 0.57 | 0.00 | 0.67 | 0.00 | 1.85 | 0.00 | 2.00 | 0.00 |
| P | 1.48 | 0.00 | 1.40 | 0.00 | 1.49 | 0.00 | 1.45 | 0.00 |
| In-sample LL | −8290.6 | | −8297.7 | | −8311.3 | | −8314.3 | |
| Out-of-sample LL | −2108.3 | | −2105.1 | | −2106.7 | | −2088.2 | |
| In-sample deviance | 2825.5 | | 2852.8 | | 2734.2 | | 2748.4 | |
| Out-of-sample deviance | 762.3 | | 759.0 | | 755.9 | | 720.8 | |

^a removed variable based on logistic Lasso. ^b removed variable based on TGP-P Lasso. ^c removed variable based on TNB-P Lasso.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Faroughi, P.; Li, S.; Ren, J. The Applications of Generalized Poisson Regression Models to Insurance Claim Data. Risks 2023, 11, 213. https://doi.org/10.3390/risks11120213
