Mixed Poisson Regression Models with Varying Dispersion Arising from Non-Conjugate Mixing Distributions

Abstract: In this article we present a class of mixed Poisson regression models with varying dispersion arising from non-conjugate to the Poisson mixing distributions for modelling overdispersed claim counts in non-life insurance. The proposed family of models combined with the adopted modelling framework can provide sufficient flexibility for dealing with different levels of overdispersion. For illustrative purposes, the Poisson-lognormal regression model with regression structures on both its mean and dispersion parameters is employed for modelling claim count data from a motor insurance portfolio. Maximum likelihood estimation is carried out via an expectation-maximization type algorithm, which is developed for the proposed family of models and is demonstrated to perform satisfactorily.


Introduction
During the last three decades, mixed Poisson regression models have been applied in various fields of studies, including non-life insurance for modelling overdispersed claim count data. The members of this family of models, which can be constructed based on a mixing distribution which is conjugate to the Poisson distribution, such as the negative binomial and the Poisson-inverse-Gaussian, have been the most popular choices due to the simplicity of their log-likelihood functions, which can be easily maximized using the standard maximum likelihood (ML) estimation approach. See, for example, Refs. [1][2][3] for the former and [4][5][6] for the latter, among many others. However, it should be noted that the assumption of conjugancy can be very restrictive for constructing a mixed Poisson model that will be able to efficiently capture different levels of overdispersion in real claim count data sets. In particular, as is well known, overdispersion is a direct consequence of unobserved heterogeneity due to systematic effects in the data. For example, in motor insurance, the driving skills, preferences, habits, and driving experience of policyholders, which differ, may lead to extra variation in the claim count data, the degree of which is controlled by the value of the dispersion parameter of the mixed Poisson model. Furthermore, overdispersion can either be caused by a large presence of zeros or a heavy tail in the data. Regarding the latter case, as is well known, the tails of mixed Poisson distributions in the case of continuous mixing distributions are similar to the tails of their mixing distributions (see, for instance, Ref. [7]). Therefore, restricting attention only to mixed Poisson models which are derived based on conjugate to the Poisson mixing densities may result in biased parameter estimates because of their inability to always efficiently model the tail of claim count distribution. This, in turn, may have a profound impact on many tasks which are carried out by the actuaries, such as risk management and pricing of (re-)insurance contracts. Thus, it becomes clear that an important task of actuaries is to be able to design more representative probabilistic models for the number of claims with good prediction accuracy. This procedure depends on the reliability of the statistical method which will be used to construct them.
The aim of this article is to present a general class of mixed Poisson regression models with varying dispersion stemming from non-conjugate C 2 densities with a continuous first derivative and a continuous second derivative. The class of mixed Poisson models we consider is very wide and the flexibility it provides in (i) the distributional choice for the mixing density and (ii) modelling jointly the mean and dispersion parameters as parametric functions of risk factors allows us to add the required amount of weight to the right tail area of the claim count distribution for accommodating different levels of overdispersion, thus resulting in an improved risk evaluation. At this point, it should be noted that, with the exception of very few articles, such as those by [8][9][10], modelling jointly all the parameters of mixed Poisson models in terms of explanatory variables has not been explored in depth. Nevertheless, mean regression models often cannot adequately account for the heteroscedasticity of the claim count distribution or its possible dependence on risk factors. In addition, note that in [9], the exponential family distribution assumption for the univariate response variable is relaxed and replaced by a general distribution family, including distributions based on Box-Cox transformations (such as the Box-Cox t-distribution or the Box-Cox power exponential distribution) and zero adjusteddistributions. However, to the best of our knowledge, this is the first study to consider regression structures on the mean and dispersion parameters of univariate mixed Poisson models that have a probability mass function (pmf) which cannot be written in closedform expressions. For demonstration purposes, the Poisson-lognormal (PLN) regression model with varying dispersion is fitted on a motor third-party liability (MTPL) insurance claim count data set using an expectation-maximization (EM) type algorithm, which takes advantage of the stochastic mixture representation of the proposed family of models which have a density that cannot be written in closed form in an easy and efficient manner. Moreover, it is worth noting that the development of stochastic algorithms, such as the EM and stochastic gradient descent algorithms, is of particular importance in machine learning and artificial intelligence applications, as they can be employed for efficiently calibrating various statistical models and deep networks. Two very interesting recent articles are those by [11,12].
The rest of this article is structured as follows. In Section 2, we present the derivation of the proposed mixed Poisson regression model with varying dispersion for claim counts. Section 3 deals with the ML estimation procedure for the PLN regression model with varying dispersion based on the proposed EM-type algorithm. Section 4 contains an application to the MTPL claim count data, and we fit the PLN claim count regression model with varying dispersion. In addition, the negative binomial regression model with varying dispersion and the zero-inflated (ZIP) Poisson regression model are used as benchmarks for comparison. Finally, concluding remarks are provided in Section 5.

Modelling Framework
Consider a non-life insurance portfolio with n policyholder contracts each involving a particular claim type, and assume that the individual claim frequencies, k i , arising from each insured i, for i = 1, . . . , n, are independent. In addition, suppose that given a continuous random variable, z i > 0, k i |z i is distributed according to a Poisson distribution with probability mass function (pmf) given by for k i = 0, 1, 2, . . . , and where µ i > 0. The mean and variance of k i |z i are E(k i |z i ) = Var(k i |z i ) = µ i z i . Furthermore, consider that z i is distributed according to a C 2 mixing distribution, with probability density function (pdf) g(z i ; φ i ), with φ i > 0, which is not conjugate to the Poisson distribution given by Equation (1). Additionally, we assume that z i has a unit mean, that is, E(z i ) = 1, to ensure that the model is identifiable.
Considering the previous assumptions, we see that the resulting distribution of k i is a mixed Poisson distribution with pmf where g(z i ; φ i ) is the pdf of z i . In addition, the mean and the variance of k i are given by and Finally, under the proposed modelling framework, the mean and dispersion parameters, µ i and φ i , are modelled as functions of risk factors and where x 1,i and x 2,i are the, potentially different, vectors of explanatory variables with dimensions p 1 × 1 and p 2 × 1, respectively, and where β 1 = β 1,1 , . . . , β 1,p 1 and β 2 = β 2,1 , . . . , β 2,p 2 are vectors of regression coefficients, where we consider that the matrices X 1 and X 2 are of full rank and are composed of the rows given by x 1,i and x 2,i , respectively.

Model Specification: The Poisson-Lognormal Regression Model with Varying Dispersion
For expository purposes, we specify the lognormal distribution as the mixing distribution of z i with the following pdf where z i > 0 and φ i > 0, with mean E(z i ) = 1 and variance Var(z i ) = exp φ 2 i − 1. Then, based on Equations (1) and (7), it is easy to see that the resulting distribution of k i is the Poisson-lognormal (PLN) distribution with pmf where µ i > 0 and φ i > 0 are given by Equations (5) and (6), respectively. Unfortunately, since g(z i ; φ i ) is not conjugate to the Poisson, the integral in Equation (8) is mathematically intractable, but it can be easily calculated using numerical integration.
Using the results in Equations (3) and (4), we calculate the mean and the variance of the PLN regression model with varying dispersion and

Statistical Inference: The EM-Type Algorithm
Let (k i , x 1,i , x 2,i ), i = 1, . . . , n, be a sample of independent observations, where k i is the response variable, and x 1,i and x 2,i are the vectors of explanatory variables with dimensions p 1 × 1 and p 2 × 1, respectively. In addition, suppose that the data are produced according to the mixed Poisson model with varying dispersion. Then, the log-likelihood of the model can be written as log(P(k i )), (11) where θ = β 1 , β 2 is the vector of the parameters, and where P(k i ) is the pmf of the model which is given by Equation (2). It should be noted that the likelihood given by Equation (11) is cumbersome to maximize since it is not usually tractable. Moreover, when the mean and dispersion parameters are modelled in terms of risk factors, additional computational challenges can be encountered.
However, ML estimation can be accomplished relatively easily via an EM-type algorithm. In particular, if the unobserved data z i are augmented to the observed data (k i , x 1,i , x 2,i ), for i = 1, . . . , n, then the complete data log-likelihood factorizes into two parts log(g(z i ; φ i )) (12) where g(z i ; φi) is the pdf of the mixing distribution which is not conjugate to the Poisson, and where µ i and φ i are given by Equations (5) and (6) respectively. The EM-type algorithm for the mixed Poisson regression model with varying dispersion can be described as follows • E-Step: The Q-function, which is the conditional expectation of the complete data log-likelihood, is given by Q(θ; θ (r) ) = E z i l c (θ)|K, θ (r) ∝ (13) where θ (r) is the estimate of θ at the rth iteration in the E-step of our EM algorithm. Then, using the estimates θ (r) , calculate the pseudo-values w 1 i = E z i (z i |k i ; θ (r) ) and w k i = E z i (s k (Z i )|k i ; θ (r) ) for i = 1, . . . , n and k = 1, . . . , ν, where s k (.) are certain functions which are involved in the terms needed for maximizing the part of the Q-function which corresponds to the conditional expectation of the log-likelihood of the mixing distribution g(z i ; φ i ). -Firstly, taking the necessary derivatives of the Q-function with respect to β 1 , we obtain the following results and for i = 1, . . . , n and j = 1, . . . , p 1 , and where Then, the iterative procedure for the Newton-Raphson algorithm for β 1 goes as follows -Secondly, differentiating the Q-function with respect to β 2 gives and where for computing h 1 (β 2 ) and H 2 (β 2 ), we need to use the pseudo-values w k i for i = 1, . . . , n and k = 1, . . . , ν, because in this case, the maximization of the Q-function reduces to the maximization of the conditional expectation of the log-likelihood of the mixing distribution g(z i ; φ i ). Then, the Newton-Raphson iterative algorithm for β 2 is as follows for i = 1, . . . , n and l = 1, . . . , p 2 . • Finally, iterate between the E-and the M-Steps until some convergence criterion is satisfied, for instance where l (r) is the value of the log-likelihood after the r-th iteration, and where tol is a small number usually of the form 10 −m , where m ∈ Z + . The stopping criterion refers to the progress of the likelihood function (i.e., its convergence). If the stopping criterion is satisfied, the EM algorithm stops iterating, and the estimate of θ is θ (r+1) . Otherwise, θ is updated by θ (r+1) , and the algorithm returns to the E-step.

EM Estimation for the PLN Regression Model with Varying Dispersion
In this section, we implement the EM algorithm for finding the ML estimates of the parameters of the PLN regression model with varying dispersion (Algorithm 1). The complete data log-likelihood of the model is given by for i = 1, . . . , n. Thus, the posterior expectations needed for the E-step are E z i z i |k i ; θ (r) and E z i (log(z i )) 2 |k i ; θ (r) , while at the M-step one needs to maximize the expected value of l c (θ) with respect to θ. In particular, more formally, the EM-type algorithm can be written as follows.

E-Step:
Calculate, for all i = 1, . . . , n, and where µ Note that the expectations in Equations (24) and (25) can be evaluated numerically. Alternatively, a Monte Carlo approach can be used based on a rejection algorithm, leading to variants of the EM algorithm, such as the Monte Carlo EM (MCEM) algorithm, which do not rely on the pdf g(k i |z i ), that cannot be written in closed form, but it is sufficent to simulate from the posterior distribution g(z i |k i , -Firstly, the regression parameters β 1 are updated using the pseudo-values w 1 i , which are given by Equation (24), and the Newton-Raphson algorithm, which is given in Equations (16)-(18).

-
Secondly, the regression parameters β 2 are updated using the pseudo-values w 1 i and w 2 i , which are given by Equations (24) and (25), respectively, and the Newton-Raphson algorithm, which, in the case of the lognormal mixing distribution, is as follows and for i = 1, . . . , n and l = 1, . . . , p 2 , and where Then, we can obtain the updated estimates of β (r) 2 as follows (28)

Numerical Illustration
This study was based on a subset of claim frequency data from a pool of MTPL insurance policies observed for 3.5 years from a major Greek insurance company. A total of 14,143 observations with complete records (i.e., with availability of all the explanatory variables) were taken for our analysis. The response variable is the number of claims at fault registered for each insured vehicle. In addition, a subset of explanatory variables with the highest predictive power for the response variable was chosen based on exploratory analysis. In particular, we considered the following covariates: the age of the driver (AD), the horsepower (HP) of their car, and the age of their car (AC). Additionally, we grouped the levels of each a priori rating variable with respect to risk profiles with similar claim frequency in order to balance homogeneity as well as sufficiency of the volume of data in each cell.
The summary of the explanatory variables and their corresponding groupings with the number of observations in each category along with the descriptive statistics for claim counts are shown in Table 1. In the following subsection, we fit the Poisson-lognormal (PLN) regression model on the number of claims. Moreover, we will compare its fit with those of the classic negative binomial type I (NBI) distribution, which has been used in an abundance of actuarial settings for approximating claim counts, for the case when regression components are introduced on its mean and dispersion parameters. Finally, the high presence of zeros in the MTPL data set motivates the use of zero-inflated models, which can provide a parsimonious yet powerful way to handle data sets that contain a large number of zeros. In this study, the zero-inflated Poisson (ZIP) regression model will be used as a benchmark for comparison.

•
The NBI regression model with varying dispersion is derived as follows. Consider policyholder i, i = 1, . . . , n, whose number of claims, denoted as k i , with k i = 0, 1, 2, 3, . . . , are independent. In addition, assume that k i |, z i follows a Poisson distribution with pmf given by Equation (1), and z i follows a Gamma distribution with pdf given by where φ i > 0. Parameterization (29) ensures that E(z i ) = 1, and hence the model is identifiable. Then, the unconditional distribution of k i becomes an NBI distribution, with pmf given by The mean and the variance of the NBI distribution are given by and • The mean and dispersion parameters of the NBI distribution are modelled in terms of covariates µ i = exp x 1,i β 1 and (33) where x 1,i and x 2,i are covariate vectors with dimensions p 1 × 1 and p 2 × 1, respectively, with β 1 = β 1,1 , . . . , β 1,p 1 and β 2 = β 2,1 , . . . , β 2,p 2 the corresponding parameter vectors, and where it is assumed that the matrices X 1 and X 2 , with rows given by x 1,i and x 2,i , respectively, are of full rank. • The pmf of the ZIP regression model is given by The mean and the variance of the ZIP distribution are given by and where µ i = exp(x 1,i β 1 ), and where x 1,i is a covariate vector with dimension p 1 × 1, with β 1 = β 1,1 , . . . , β 1,p 1 the corresponding parameter vector, and where it is assumed that the matrix X 1 with rows given by x 1,i , respectively, are of full rank (note that π can also be modelled in terms of covariates using the logit link function. However, we refrain from doing this in this paper since this approach did not lead to better fitting performances for the ZIP model for the MTPL data).

Modelling Results
The ML estimates of the parameters (all the parameters were statistically significant at a 5% threshold) for the NBI and PLN regression models with varying dispersion and the ZIP regression model are presented in Table 2. Note that variable selection can be performed for all the models by selecting the best predictor for parameter µ i using backward elimination. This can be done by including all available explanatory variables present in the data set and testing whether the exclusion of each variable will result in lower global deviance (DEV), Akaike information criterion (AIC), and Schwartz Bayesian criterion (SBC) values. Subsequently, in the case of the NBI and PLN models, we can take all the variables selected for the parameter µ i and continue variable selection for the parameter φ i by performing forward selection, where we can test which explanatory variable would lead to a further decrease of the DEV, AIC, and SBC values when added to parameter φ i . Additionally, if different subsets of explanatory variables result in very similar values of DEV, AIC, and SBC, we should chose the simpler model with less predictors to avoid overfitting. Regarding our data, as we can see from Table 2, the explanatory variables AD, HP, and AC were chosen for µ i , and only the variable AD was chosen for φ i . From the results in Table 2, we observe that the values of the estimated regression coefficients of the variables AD, HP, and AC have a similar effect (positive and/or negative) on parameter µ i in the case of all the models, and the same observation can be made for parameter φ i in the case of the NBI and PLN models.
Finally, we rely on normalized quantile residuals [13] as an exploratory graphical tool to help us evaluate the adequacy of the fit of the NBI, ZIP, and PLN models. For these discrete response distributions, the normalized (randomized) quantile residuals are defined asr i = Φ −1 (u i ), where Φ −1 is the inverse cumulative distribution function of a standard normal distribution and where u i is defined as a random value from the uniform distribution on the interval F i (k i − 1|θ (r+1) ), F i (k i |θ (r+1) ) , where F i is the cumulative distribution function estimated for the ith policyholder, and where θ (r+1) contains the estimated model parameters after the EM algorithm has reached the global maximum, and k i is the corresponding observation. The fit of the claim count model can be evaluated by means of the usual quantile-quantile plots. Specifically, if the data indeed follow the assumed distribution, then the residual on the quantile-quantile plot will fall approximately on a straight line. Figure 1 shows the normalized (random) quantiles for the ZIP regression model and the NBI and PLN claim frequency regression models with varying dispersion.  From Figure 1, we observe that the residuals indicate that the NBI and PLN are better assumptions than the ZIP model since the residuals of the former two are close to the right tail of the claim frequency distribution. Furthermore, the PLN model seems to fit the claim count data slightly better than the NBI model, since, as was previously mentioned, the tail of mixed Poisson models is equivalent to the tail of their mixing distributions [7], and in this case the lognormal mixing density has a thicker right tail than the Gamma mixing density. Therefore, overall it is reasonable to suggest the employment of the PLN model for modelling claim counts in our data set. As we are going to observe in what follows, the PLN model also provides better fitting performances than the NBI and PIG models in terms of the DEV, AIC, and SBC values.

Models Comparison
In this subsection, we compare the fit of the ZIP regression model and the NBI and PLN regression models with varying dispersion based on DEV, AIC, and SBC, which are classic hypothesis/specification criteria.
The DEV is defined as wherel is the maximum of the log-likelihood, andθ is the estimated parameter vector of the model. Furthermore, the AIC and the SBC are given by and where d f are the degrees of freedom, and n is the number of observations in the sample.
The resulting DEV, AIC, and SBC values for the competing models are presented in Table 3. We observe that the PLN regression model provides the best fit with respect to all three criteria.

Computational Aspects
All computing was made using the programming language R. The PLN regression model with varying dispersion was estimated using the EM algorithm, which was presented in Section 3. In addition, the ML estimates of the parameters of the NBI regression model with varying dispersion and the ZIP regression model were obtained using the generalized additive models for the location, scale, and shape (GAMLSS) package in R [14].
Note that a rather strict criterion was used, and it took the algorithm quite a large number of iterations to converge. In particular, the stopping criterion was set as tol = 10 −12 . Note also that the M-step involves two Newton-Raphson iterations, and hence it is important to identify the choice of meaningful initial values for the vectors β 1 and β 2 , as this can increase increase the computational time requirements for the EM algorithm and make it more difficult to locate the global maximum. We obtained good initial values for β 1 by fitting the simple Poisson regression. Additionally, we obtained good initial values for β 2 by (i) calculating Var(k i ) for the eight different risk classes which can be formed using all available risk factors and the observations i = 1, . . . , n and (ii) calculating E(k i ) for the eight different risk classes and using the log-link function for φ i (see Equation (6)), so we solve Equation (4) with respect to β 2 . However, we also checked with many other starting values for β 2 in order to ensure that the global maximum had been obtained. For all cases, the EM algorithm converged to a similar solution. The standard errors were computed by using the standard approach of [15].
Finally, as was anticipated, in terms of CPU time, it took the NBI regression model with varying dispersion and the ZIP regression model less than one minute, and they both compared significantly more favorably to the PLN regression model with varying dispersion, which exceeded 30 min of CPU time. However, it should be taken into account that the PLN model has a density which does not exist in closed form, and that there were 14,143 observations in the sample of MTPL data that was examined in this article. For larger data sets with more features, the computing effort can be reduced if the E-and M-steps are executed in parallel across multiple threads to exploit the processing power of modern-day multicore machines.

Conclusions, Limitations, and Future Research
In this article, we considered a family of mixed Poisson claim count regression models with varying dispersion and dependence parameters arising from non-conjugate mixing distributions for approximating overdispersed claim frequencies in non-life insurance. The flexibility in the choice of the mixing distribution combined with the proposed approach, which assumes that the mean and dispersion parameters of the model can be modelled in terms of risk factors, can provide an advantage relative to the majority of previous approaches in the literature, which have concentrated on mixing densities conjugate to the Poisson mixing and assumed that only the mean parameter can vary through covariates. From a practical business standpoint, the proposed modelling framework is beneficial for the insurance company, as it will result in an improved risk evaluation of policyholders who are more likely to have accidents, since the tail behaviour of mixed Poisson models is similar to that of the mixing density and the majority of heavy-tailed mixing distributions are not conjugate to the Poisson. The PLN regression model with regression specifications on its mean and dispersion parameters was considered for expository purposes. Furthermore, we developed an efficient EM algorithm for maximum likelihood estimation of the parameters of the model. The implementation of the algorithm was illustrated by fitting the model to a real MTPL insurance data set. An interesting line for further research would be to extend the model to the multivariate case to permit inferences about the dependence structure between different types of overdispersed claim counts from the same and/or different types of coverage. However, it should be noted that the PLN model becomes more complicated in the two-dimensional setting due to algebraic intractability, which is a problem that is inherited from the univariate case. Moreover, modelling all the parameters of the PLN model in terms of covariates can further increase the computational burden in the highdimensional setting. Finally, another fruitful future research direction is to include time series components to take into account both cross-dependence between different types of claims and time dependence. Data Availability Statement: Please note that we cannot make the data supporting reported results publicly available due to data use agreement with the company which provided the data set. However, a toy data set along with the code for estimating the parameters of the PLN regression model with varying dispersion can be made available upon request.

Acknowledgments:
We would like to thank the handling editor and the two anonymous referees for their very helpful comments and suggestions that have significantly improved this article. Furthermore, we would like to thank the participants of the 13th International Conference on Computational and Methodological Statistics. Finally, we would like to thank the undergraduate student Shahzeb Khan for his interest in the research that was undertaken in this article and for his help in citing previous works.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: