1. Introduction
In the automobile insurance sector, it is natural to calculate the a priori premium taking into account the number of claims and the individual characteristics of each insured, such as gender, age, years of validity of the policy, etc. The a priori premium is usually computed via parametric count models rather than the ordinary regression model, which can predict negative values for the number of claims. For this purpose, parametric models based on the Poisson, negative binomial, and Poisson-inverse Gaussian distributions, among others, are the standard choices in the univariate case. Today, most insurance companies distinguish, apart from the total number of claims, individualized claims for different coverages, such as windscreen claims, theft and fire claims, etc. So far, most actuarial models differentiate between only two types of coverage when computing a premium based on different coverages. One possible reason is the lack of models capable of describing more than two coverages. The most commonly considered approach to this problem is based on the bivariate Poisson distribution (see
Bermúdez 2009;
Bermúdez and Karlis 2017, among others). See also
Gómez-Déniz (
2016);
Gómez-Déniz and Calderín-Ojeda (
2018);
Gómez-Déniz and Calderín-Ojeda (
2020);
Denuit et al. (
2009), and
Frees (
2010) for more details related to this topic. Alternative references for a review of count regression are
Cameron and Trivedi (
1986,
1998);
Winkelmann (
2003), and
Boucher et al. (
2007). A copula-based correlated random effects model that accommodates dependence between claim frequency and severity was examined in
Oh et al. (
2020).
Traditionally, the business associated with insurance consists of selling risk coverage to buyers. In particular, in automobile insurance, the insurer provides financial protection against physical damage or bodily injury resulting from an incident (see
Frees et al. 2016). However, it is common today, mainly due to the existing competition, that the insurance companies offer coverage of different claims within the same product not only to gain in competitiveness, but also to benefit from risk diversification and volatility. In this paper, we consider a motor vehicle insurance portfolio with policies observed during some time period that contain, apart from other known factors (gender, age, years of validity of the driver’s license, etc.), information about the claims number concerning different coverages that are considered as response variables. This includes windscreen, parking, theft and fire, etc.
Therefore, it is assumed that the insurance company collects information on the claims for these coverages and the total number of claims given by the sum of the claims in all the coverages. Thus, every policyholder generates a sequence of claims numbers for each coverage; one of them is the total claims number, which includes the sum of the coverages’ claims. Then, based on a conditional specification, a multivariate model that allows a simple way to describe the use of a finite but sufficiently large number of coverages is proposed. The resulting multivariate discrete distribution obtained enables us to study the dependence structure of a limited number of coverages in automobile insurance and include covariates such as gender, age, etc. We start by using a Poisson model for the random variable total claims number, and then by conditioning, we introduce the remaining variables in a branch architecture structure. Finally, closed-form expressions are given for parameter estimates, and a priori premiums are provided when different premium principles are used.
The purpose of this paper is to introduce a novel methodology based on a multivariate distribution obtained via a conditional specification, which accounts for the numbers of claims in different coverages and also for the total claims frequency. This approach enables us to examine the dependence structure of a finite number of coverages in motor vehicle insurance and also to incorporate heterogeneity in the model through explanatory variables. Then, we use this procedure to calculate premiums based only on the claims frequency. Next, we show that the amount of claims can be incorporated into this multivariate model to derive multivariate claims’ severity distributions. For this, we assume that the claims size in the joint coverages follows a multivariate Erlang distribution. As multivariate probability distributions are complex, it has been argued that analytical solutions are far less likely than in the univariate and bivariate cases (see
Cummins and Wiltbank 1983,
1984); nevertheless, in this work, we derive a multivariate model where the total number of claims that affect the portfolio is the result of the interaction of multivariate processes. The main advantage of the modeling approach presented in this work is that it avoids working with copulas (see, for instance,
Balakrishnan and Lai 2009, chp. 1, p. 59). Although the copula approach for modeling multivariate models has been proven to be very useful, it has also been criticized due to the difficulty of choosing an appropriate copula structure and the complication of estimating the parameters that control the dependency. In addition, a multivariate zero-inflated model to account for the excess of common zeros in the empirical distribution is developed. Finally, these two multivariate distributions can be reparameterized to incorporate covariates to determine which factors and explanatory variables have an influence on the mean of the corresponding coverage. As an illustration, in this work, we use the French Motor Personal Line datasets available in the package “CASdatasets” in
R, which include five response variables.
Although the modeling proposed here was developed ad hoc for the auto insurance market, other lines of general insurance could also benefit from it. For example, in home insurance, the whole premium could be split into different coverages such as moisture damage, theft, pipe repairs, locksmiths, and even protection against tenant rent default.
The rest of the paper is structured as follows.
Section 2 describes the primary model and some of its properties. Then, premium calculations based on this basic model are discussed. Finally, a multivariate zero-inflated model and multivariate regression procedures are shown. Some methods of estimation are provided in
Section 3. Next, a numerical application pertaining to a private motor French insurer is developed in
Section 4. Finally, conclusions are drawn in
Section 5.
2. The Branch Architecture Model
Let us consider a portfolio with N policies observed during T periods of time, and assume that the insurance company gathers information on the number of claims related to several types of coverage. Examples of these coverages include windscreen, fire and theft, etc. Therefore, the insurer collects information about these coverages, as well as the total number of claims for each policyholder, given by the sum of the claims in all the different coverages. For the ith policyholder, we consider a multivariate random variable given by the sequence of claims numbers for each coverage j, assuming that one of them, i.e., the first one, is the total number of claims, which includes the sum of the claims for all types of coverages purchased by this policyholder.
Furthermore, we assume that
, the total number of claims recorded in the auto insurance portfolio, follows a Poisson distribution with mean
for
, where
N is the total number of policyholders. Now, let us suppose that the policyholders have purchased some of the types of coverages, such as windscreen protection, fire and theft, parking, etc. That is, once the policyholder has made a claim, this can be of any of these types. Let us denote by
,
, a random variable associated with the number of claims corresponding to the first type of coverage and policyholder
i, resulting from the
th claim of the total claims reported by the
ith policyholder assumed to be independent and identically distributed, following also a Poisson distribution with mean
. Then, the conditional distribution of
given
,
, the total number of claims of this first coverage, among the
total claims is a Poisson distribution with parameter
and the joint distribution of
has a probability function given by,
for
and
and with the convention that
for
and 1 otherwise. This bivariate distribution appears in
Leiter and Hamdan (
1973) (see also
Cacoullos and Papageorgiou 1980;
Johnson et al. 1996, chp. 37, p. 136 in the context of accident analysis).
Let
,
now be a random variable associated with the total number of claims corresponding to the second type of coverage and policyholder
i, resulting from the
th claim of the total claims reported by the
ith policyholder assumed to be independent and identically distributed Poisson distribution with mean
and conditionally independent of
. Then, the conditional distribution of
given
,
, the total number of claims of this second coverage, among the
total claims is a Poisson distribution with parameter
, and now, the joint distribution of
has a probability function given by,
where the hypotheses of conditional independence between the two types of coverages were assumed.
Following the same argument, it is easy to see that if we have
types of coverages, then the joint probability function of
is given by:
where
. For this multivariate distribution, it is allowed that
takes larger or smaller values than
; however, in the proposed model, it is verified that
is larger than or equal to
for all
. In this case, it is obvious that
must be larger than
,
. The latter statement is confirmed in the numerical application section.
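As a rough numerical illustration of the branch structure described above, the following sketch evaluates a joint probability function of this type and checks that it sums to one over a truncated grid. The symbols are assumptions, since the original notation is not reproduced here: `lam` stands for the Poisson mean of the total count and `thetas` for the per-claim Poisson means of the remaining coverages, which are taken to be conditionally independent given the total.

```python
import math
from itertools import product

def joint_pmf(x, lam, thetas):
    """P(X1 = x[0], ..., Xk = x[k-1]) under the assumed branch structure.

    Assumed notation: lam is the Poisson mean of the total count X1, and
    thetas[j] is the mean number of claims per reported claim in the
    (j + 2)-th component.
    """
    total, coverages = x[0], x[1:]
    prob = math.exp(-lam) * lam**total / math.factorial(total)
    for x_j, theta in zip(coverages, thetas):
        mean_j = total * theta  # conditional Poisson mean given X1 = total
        # Python evaluates 0.0**0 as 1.0, so X1 = 0 forces x_j = 0.
        prob *= math.exp(-mean_j) * mean_j**x_j / math.factorial(x_j)
    return prob

# Hypothetical parameter values chosen only for illustration.
lam, thetas = 0.5, (0.4, 0.3)
grid = range(21)
total_mass = sum(joint_pmf(x, lam, thetas) for x in product(grid, repeat=3))
print(round(total_mass, 8))
```

The truncation at 20 claims per component is arbitrary; for small means such as these, the probability mass left outside the grid is negligible.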
The ordinary probability-generating function of
with the probability mass function (pmf) given in (
1) is given by:
for
,
.
From here, it is easy to see that the marginal distribution of
is Poisson with parameter
, while
,
have a Neyman Type A distribution with parameters
and
. Recall that the probability function of the Neyman Type A distribution (see
Neyman 1939;
Douglas 1955;
Kemp 1967;
Johnson et al. 2005, chp. 8, among others) is given by:
for
.
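The Neyman Type A marginal can be checked numerically by writing it as a Poisson mixture of Poisson laws, which is how it arises in the construction above. The sketch below is only an illustration: the names `lam` and `phi` are assumed labels for the two parameters, and the values are hypothetical. It also compares the numerical mean and variance with the standard Neyman Type A expressions, lam * phi and lam * phi * (1 + phi).

```python
import math

def neyman_a_pmf(x, lam, phi, terms=60):
    """Neyman Type A pmf as a Poisson(lam) mixture of Poisson(n * phi) laws."""
    total = 0.0
    for n in range(terms):
        weight = math.exp(-lam) * lam**n / math.factorial(n)
        mean = n * phi
        total += weight * math.exp(-mean) * mean**x / math.factorial(x)
    return total

lam, phi = 0.5, 0.4  # hypothetical values
probs = [neyman_a_pmf(x, lam, phi) for x in range(40)]
mean = sum(x * p for x, p in enumerate(probs))
var = sum(x * x * p for x, p in enumerate(probs)) - mean**2
print(round(mean, 6), round(var, 6))
```

The variance exceeds the mean whenever `phi` is positive, which is the overdispersion property of the coverage counts.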
Some computations provide the marginal and cross-moments, which are given by:
from which it is simple to see that:
and therefore, the model admits only a positive correlation between pairs of random variables. The marginal variances are given by:
Observe that, using (
5) and (6) together with (
3) and (4), the model is equidispersed (variance equal to the mean) for
and overdispersed (variance larger than the mean) for the rest of the coverages.
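The dispersion and correlation properties can also be explored by simulation. The following sketch draws a total count and one coverage count under the assumed conditional Poisson structure and compares the sample correlation with the value sqrt(theta / (1 + theta)), which follows from that assumption by the law of total covariance. The names `lam` and `theta`, the parameter values, and the sample size are all hypothetical, and the closed form is derived here rather than copied from the paper's equations.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n, lam, theta, seed=0):
    """Draw n pairs (total claims, claims in one coverage)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        total = poisson_draw(lam, rng)
        coverage = poisson_draw(total * theta, rng)
        sample.append((total, coverage))
    return sample

def corr(pairs):
    """Sample Pearson correlation of a list of pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

lam, theta = 0.5, 0.4
data = simulate(50000, lam, theta)
empirical = corr(data)
implied = math.sqrt(theta / (1 + theta))
print(round(empirical, 3), round(implied, 3))
```

Because the implied correlation is a square root of a positive ratio, a construction of this kind cannot produce negative dependence, in line with the remark above.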
Finally, the correlation can easily be computed as:
One can be interested also in the distribution of
given
. The probability-generating function of this conditional distribution is given by:
where
and
are the Bell numbers given by:
with:
being the Stirling number of the second kind.
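For readers who want to evaluate these quantities, the Stirling numbers of the second kind and the Bell numbers can be computed with the usual recurrence. This is a minimal sketch using the standard definitions; it does not attempt to reproduce the exact expression of the conditional probability-generating function.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Bell number as the sum of S(n, k) over k = 0, ..., n."""
    return sum(stirling2(n, k) for k in range(n + 1))

print([bell(n) for n in range(6)])  # [1, 1, 2, 5, 15, 52]
```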
Now, the conditional mean of
given by
can be written as:
2.1. Some Results in Risk Theory
Observe that due to the model construction, we have that
, i.e., every claim in coverage
j is a proportion of the total claims
. Then, if the actuary decides to use the net premium principle, i.e.,
, to compute the premium, then for the
ith policyholder and coverage
s with
, the premium results
, where
is the net premium for the total coverage, that is the sum of the premiums in each of the coverages purchased. A similar result is obtained by using the expected value principle. A catalog of premium principles can be found in
Young (
2006).
Let us now consider that the actuary decides to use the variance premium principle, i.e., with , to calculate the premium. Then, in this case, we obtain that and . However, in this case, we have that , which is different from , except for the case in which and no coverages exist.
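The contrast between the two premium principles can be illustrated with a few lines of arithmetic. The sketch below assumes a Poisson total with mean `lam`, coverage counts with mean `lam * theta` and variance `lam * theta * (1 + theta)`, proportions `thetas` that sum to one, and a loading factor `alpha` for the variance principle. All names and values are hypothetical.

```python
def net_premium(mean):
    """Net premium principle: the premium equals the expected value."""
    return mean

def variance_premium(mean, variance, alpha):
    """Variance premium principle: expected value plus alpha times the variance."""
    return mean + alpha * variance

lam = 0.5
thetas = (0.6, 0.3, 0.1)  # hypothetical proportions summing to one
alpha = 0.2               # hypothetical loading factor

total_net = net_premium(lam)
coverage_net = [net_premium(lam * t) for t in thetas]

total_var = variance_premium(lam, lam, alpha)
coverage_var = [variance_premium(lam * t, lam * t * (1 + t), alpha) for t in thetas]

print(sum(coverage_net), total_net)
print(sum(coverage_var), total_var)
```

Under these assumptions, the coverage net premiums add up to the total net premium, whereas the variance-principle premiums of the coverages add up to more than the premium of the total, because each coverage count is overdispersed.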
However, a model solely based on the number of claims is not realistic. In risk theory, it is common to incorporate the amount associated with each of the claims to build the compound model. That is, the property and/or casualty ratemaking are generally based on a claim frequency distribution and a loss distribution. Due to the complex derivation of this multivariate compound model, the subscript
i is removed from the text in the remainder of this section. For this purpose, let us now assume that
,
, where
is the random variable denoting the size or amount of the
ith claim, following an exponential distribution with probability density function (pdf)
. Furthermore, we assume that
, are independent and identically distributed random variables and also independent of the number of claims
. It is well known (see for example
Rolski et al. 1999) that
follows a piecewise distribution with pdf given by
,
, and
. Then, by following the methodology given in
Lee and Lin (
2012), we have that
follows a multivariate Erlang distribution with scale parameter
and shape parameter
,
. Their marginal distributions are a univariate Erlang mixture.
Then, simple computations provide,
Here,
represents the modified Bessel function of the first kind, which admits the following series representation,
The distribution for the coverages can be computed by using (
2) in the following way,
Now, taking into account that:
we finally obtain the aggregate claim size pdf for the different coverages given by,
for
. Thus, they are also given as a piecewise distribution. For practical purposes, the infinite sum that appears in this expression can be replaced by a finite sum from one to
k, where
k can take values around one-hundred. From the assumption of the independence between the number of claims and the claims size, we have that:
which can be considered as the net premium when both the number and size are considered at the same time.
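For the total claims amount, the truncated Erlang mixture and the Bessel representation can be compared numerically. The sketch below assumes exponential claim sizes with rate `beta` (mean 1/beta) and a Poisson claim count with mean `lam`; this parameterization and the values are assumptions made for illustration, and only the total, not the individual coverages, is treated.

```python
import math

def aggregate_density(s, lam, beta, k=100):
    """Density of the total claim amount at s > 0, truncated after k Erlang terms."""
    total = 0.0
    for n in range(1, k + 1):
        log_poisson = -lam + n * math.log(lam) - math.lgamma(n + 1)
        log_erlang = n * math.log(beta) + (n - 1) * math.log(s) - beta * s - math.lgamma(n)
        total += math.exp(log_poisson + log_erlang)
    return total

def bessel_i1(z, terms=60):
    """Series for the modified Bessel function of the first kind, order one."""
    return sum((z / 2) ** (2 * m + 1) / (math.factorial(m) * math.factorial(m + 1))
               for m in range(terms))

def aggregate_density_bessel(s, lam, beta):
    """The same density written with the Bessel function."""
    root = math.sqrt(lam * beta * s)
    return math.exp(-lam - beta * s) * math.sqrt(lam * beta / s) * bessel_i1(2 * root)

lam, beta = 0.5, 1.0  # hypothetical values
step = 0.01
# Point mass at zero plus a midpoint-rule integral of the continuous part.
mass = math.exp(-lam) + sum(
    aggregate_density((i + 0.5) * step, lam, beta) * step for i in range(4000)
)
print(round(mass, 4))
print(aggregate_density(2.0, lam, beta), aggregate_density_bessel(2.0, lam, beta))
```

Working on the log scale avoids overflow in the factorials when the truncation point is around one hundred, as suggested above.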
2.2. Multivariate Zero-Inflated Model
In many automobile insurance portfolios, the claims are rarely observed as compared to the no-claims situation. Univariate and bivariate zero-inflated models have been introduced in the statistical literature in many fields. In the setting of auto insurance, we refer to
Boucher et al. (
2007) and
Frees et al. (
2016) for the univariate case and
Bermúdez (
2009) and
Bermúdez and Karlis (
2017) for the bivariate case. Multivariate ones are scarce in the general statistical literature. References in the statistical literature are
Li et al. (
1999) and
Liu and Tian (
2015). In the actuarial literature, there are no references of models of this nature that go beyond the two variables. However, multivariate zero-truncated models were considered in
Zhang et al. (
2020).
A multivariate zero-inflated model can be constructed as a mixture of the multivariate distribution given in (
1) and a point mass at
in the following way,
where
is an inflation parameter. Obviously, this model reduces to (
1) for
. Under this model, the marginal means and cross-moments are given by:
from which the covariance between pairs of marginal random variables can be obtained. They are given by:
where
.
Again, if the actuary computes the premium by using the net premium principle, then for each coverage, the premiums are not affected by the inflation parameter . A complete model would allow inflating each coverage with inflation parameters ; however, they are not included in this work due to the computational cost of estimating a large number of parameters.
The marginal variances are given by:
Finally, the correlations are:
for
,
.
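The mixture construction of this subsection can be sketched as follows. The code assumes the branch structure with a Poisson total of mean `lam`, conditional Poisson coverages with per-claim means `thetas`, and an inflation parameter `p` placed on the origin; the names and values are hypothetical.

```python
import math
from itertools import product

def base_pmf(x, lam, thetas):
    """Joint pmf of (total, coverages) under the assumed branch structure."""
    total = x[0]
    prob = math.exp(-lam) * lam**total / math.factorial(total)
    for x_j, theta in zip(x[1:], thetas):
        mean_j = total * theta
        prob *= math.exp(-mean_j) * mean_j**x_j / math.factorial(x_j)
    return prob

def zero_inflated_pmf(x, lam, thetas, p):
    """Mixture of a point mass at the origin (weight p) and the base pmf."""
    at_origin = all(v == 0 for v in x)
    return (p if at_origin else 0.0) + (1.0 - p) * base_pmf(x, lam, thetas)

lam, thetas, p = 0.5, (0.4,), 0.2
grid = list(product(range(25), repeat=2))
mass = sum(zero_inflated_pmf(x, lam, thetas, p) for x in grid)
mean_total = sum(x[0] * zero_inflated_pmf(x, lam, thetas, p) for x in grid)
mean_cov = sum(x[1] * zero_inflated_pmf(x, lam, thetas, p) for x in grid)
print(round(mass, 8), round(mean_total, 6), round(mean_cov, 6))
```

Setting `p` to zero recovers the base model, and both marginal means are scaled by the same factor `1 - p`, so the ratio between a coverage mean and the total mean is unchanged.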
2.3. A Regression Model
For the sake of convenience, the model (
1) can be rewritten in a different way to facilitate the implementation of covariates to determine which factors and explanatory variables have an influence on the mean of the corresponding coverage. Then, by equating
to
and
to
,
,
, we obtain the normalized joint distribution, which can be expressed as,
for
,
.
The probability function (
12) satisfies the condition that the marginal means are given by
,
, assuming that
, for all
j. Thus, it is suitable for including covariates. Then, to carry out this regression model, we suppose that the observed counts
have independent distributions given by (
12) with
,
. Now, it is assumed that a set of observable covariates useful to subdivide the portfolio into classes of risks with homogeneous characteristics are included in the linear predictor,
. To guarantee a positive expected value of the response variables, it is reasonable and common to use a logarithmic link for this function and therefore express the mean as:
where
is a vector of
m covariates for the
ith observation
and
denotes the corresponding vector of regression coefficients to be estimated, which usually includes a constant term. Without loss of generality, it is assumed that for each
,
is related to the same set of covariates. In addition, one of the covariates may be identified as an exposure term to calibrate the size of a potential outcome variable by assuming that the mean varies proportionally with the exposure
(see
Frees 2010;
Frees et al. 2016),
Similarly, covariates can be introduced in the multivariate zero-inflated regression model by simply regressing the mean value of the different coverages. Although it is not considered here, the inflation parameter could also be allowed to depend on certain regressors. This does not appear to be feasible within the present framework, and thus, it is a subject that merits further investigation in future research.
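The log link with an exposure term described in this subsection amounts to the following computation. The covariates, coefficients, and exposure below are purely illustrative and are not estimates from the paper.

```python
import math

def expected_claims(covariates, coefficients, exposure=1.0):
    """Mean under a log link with an exposure offset: exposure * exp(x' beta)."""
    linear_predictor = sum(x * b for x, b in zip(covariates, coefficients))
    return exposure * math.exp(linear_predictor)

# Hypothetical policy: intercept, driver age in decades, urban risk-area flag.
x_i = (1.0, 3.5, 1.0)
beta_total = (-1.2, -0.05, 0.30)  # illustrative coefficients, not estimates
mu_half_year = expected_claims(x_i, beta_total, exposure=0.5)
mu_full_year = expected_claims(x_i, beta_total, exposure=1.0)
print(mu_half_year, mu_full_year)
```

The exponential guarantees a positive mean for any coefficient vector, and the mean scales proportionally with the exposure.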
3. Estimation of the Parameters
In this section, we first describe the methodology for the maximum likelihood estimation and derivation of the entries of Fisher’s information matrix for the basic model. Next, the same development is illustrated for the associated regression model. Finally, the expression of the log-likelihood function, score equations, and the second derivative of the log-likelihood function with respect to the parameters for the zero-inflated model are exhibited.
In general, the statistical inference for multivariate models is not trivial, and the computational procedure is often expensive (see, for instance,
Selch and Scherer 2010). Nevertheless, the estimation procedure for the model proposed here is straightforward. To see this, we first consider the case without covariates. Let us assume that a sample
that includes
n independent observations in each one of the
types of coverage is collected. The log-likelihood function is proportional to:
where
. After differentiating the latter expression, it is possible to obtain in closed-form the maximum likelihood estimators of the parameters. They are given by:
where
,
, represents the sample mean, i.e.,
.
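A minimal sketch of closed-form estimation through sample means is given below. It assumes the parameterization used in the earlier illustrations, with a Poisson total of mean `lam` and conditional Poisson coverages with per-claim means `theta`; under that assumption, setting the score equations to zero gives the sample mean of the total for `lam` and the ratio of each coverage mean to the total mean for `theta`. The data are hypothetical.

```python
import math

def fit_branch_model(rows):
    """rows: tuples (total, coverage_2, ..., coverage_k) of counts per policy."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    lam_hat = means[0]
    theta_hat = [m / lam_hat for m in means[1:]]
    return lam_hat, theta_hat

def log_likelihood(rows, lam, thetas):
    """Log-likelihood under the assumed branch structure."""
    value = 0.0
    for total, *coverages in rows:
        value += -lam + total * math.log(lam) - math.lgamma(total + 1)
        if total == 0:
            continue  # coverages must be zero here and contribute log(1) = 0
        for x_j, theta in zip(coverages, thetas):
            mean_j = total * theta
            value += -mean_j + x_j * math.log(mean_j) - math.lgamma(x_j + 1)
    return value

rows = [(2, 1, 0), (0, 0, 0), (1, 0, 1), (1, 1, 0)]  # hypothetical counts
lam_hat, theta_hat = fit_branch_model(rows)
print(lam_hat, theta_hat)  # 1.0 [0.5, 0.25]
```

Evaluating `log_likelihood` at nearby parameter values gives a quick check that the closed-form solution is indeed a maximum for the data at hand.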
The elements that provide the entries of Fisher’s information matrix are as follows:
For the regression model, the log-likelihood function contains
parameters, and it is proportional to:
where
and
.
The score equations are given by:
with
and
.
Fisher’s information matrix is made up of four blocks, as can be seen below:
where:
where
;
, and
is the zero matrix with dimension
.
For the zero-inflated model, the log-likelihood is proportional to:
where
is the number of zeroes of the random variable
. The normal equations that provide the maximum likelihood estimates are given by,
From (
13), we obtain that
, which can be substituted into (15) to obtain the estimator of
, say
. This value is then substituted into (
13) to obtain the estimator of the inflation parameter,
. Finally, from (15), the estimator of
,
is obtained in the closed-form expression given by
.
The second partial derivatives are as follows,
Observe that the analytic expressions of and are not feasible. For computational reasons, for large values of n, this is evaluated by ignoring the expectation operator and replacing it by and . The asymptotic variance–covariance matrix is approximated by inverting the observed information matrix.
When covariates are introduced under the inflated model, we proceed first by replacing in (
9) the pmf
f by its corresponding (
12), where again,
, and
. In practice, as shown in the numerical applications below, the parameter estimation and computation of standard errors were carried out by the method of maximum likelihood using Mathematica
v.12.0. We directly maximized the log-likelihood function by using the different maximum search methods available in the FindMaximum built-in function of the Mathematica software package. This software package also provides at least two methods of obtaining the elements of the Hessian matrix. The first one consists of retrieving them from the Cholesky factors (this package is available on the web upon request). The second one, which is faster, derives them by finite differences. Results were also confirmed with WinRATS v.7.0.
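The finite-difference route to standard errors can be illustrated outside Mathematica. The sketch below is not the authors' implementation: it assumes a two-parameter version of the model (`lam` for the total and `theta` for a single coverage), uses hypothetical data, approximates the Hessian of the log-likelihood by central differences at the closed-form estimates, and inverts the observed information matrix.

```python
import math

rows = [(2, 1), (0, 0), (1, 0), (1, 1)]  # hypothetical (total, coverage) counts

def log_likelihood(params):
    lam, theta = params
    value = 0.0
    for total, x in rows:
        value += -lam + total * math.log(lam) - math.lgamma(total + 1)
        if total > 0:
            mean = total * theta
            value += -mean + x * math.log(mean) - math.lgamma(x + 1)
    return value

def hessian(f, point, h=1e-3):
    """Central finite-difference approximation of the Hessian of f at point."""
    dim = len(point)

    def shifted(i, di, j, dj):
        p = list(point)
        p[i] += di
        p[j] += dj
        return f(p)

    return [[(shifted(i, h, j, h) - shifted(i, h, j, -h)
              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
             for j in range(dim)] for i in range(dim)]

mle = (1.0, 0.5)  # sample mean of totals and ratio of sample means
info = [[-v for v in row] for row in hessian(log_likelihood, mle)]
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
cov = [[info[1][1] / det, -info[0][1] / det],
       [-info[1][0] / det, info[0][0] / det]]
std_errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
print(std_errors)
```

With only four observations, the resulting standard errors are large; the example is meant to show the mechanics rather than realistic magnitudes.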
4. Numerical Application
For our empirical analysis, we used the French Motor Personal Line datasets available in the package “CASdatasets” in
R. This is a collection of ten datasets that comes from a private motor French insurer. Each dataset includes risk features such as claim amount, risk area, gender of the policyholder, number of claims for different coverages, etc. In particular, we chose the freMPL10 dataset, which includes 32,100 policies for the year 2004. In our study, we considered six response variables, which are shown in
Table 1. Note that the dependent variable Claims for each policyholder comprises the sum of the individual claims in all other variables. The details of the joint claims frequency for all types of coverage and the total number of claims are illustrated in
Appendix A (
Table A1). Note that the maximum number of claims reported by an insured is six. The number of policyholders who did not report a claim is 12,257 (38.18%), and the number of customers that only declared a claim in any of the coverages is 10,803 (33.65%).
Together with all the responses, this dataset includes a set of explanatory variables.
Table 2 below describes the factors and explanatory variables used in the investigation. We also considered an offset variable, exposure (the time exposed to risk during the investigation period), when modeling the claims frequency.
In
Table 3, the parameter estimates and their corresponding
p-values are provided for the basic and zero-inflated models without covariates. Some measures of model selection are also provided in the bottom part of this table. For comparison purposes, we used the multivariate negative binomial distribution (MNB) provided in (
Johnson et al. 1996, chp. 36, p. 94) with the pmf given by:
where
,
,
, and
. As can be seen in
Table 3, the multivariate Poisson distribution studied in this paper performs better than the MNB for this dataset. Furthermore, the zero-inflated model improves on the basic one due to the high frequency of zeros. We also tried to fit the two multivariate Poisson distributions provided in
Bermúdez and Karlis (
2011); however, we were unable to derive the maximum likelihood estimates of this model for this dataset.
Table 4 below exhibits the correlations between the total number of claims and each one of the different coverages: the empirical Pearson correlation (first row) and the correlations implied by the basic model (second row) and the zero-inflated model (third row), computed by using (
7) and (
10), respectively. There exists a weak positive correlation between Claims and the rest of the dependent variables for each one of the coverages, and the empirical values are near the theoretical values. These figures were calculated before incorporating the effect of the explanatory variables for the different coverages. We also calculated the correlation coefficient between the rest of the response variables. Again, there is a weak positive correlation, ranging from 0.0035 between Parking and Windscreen to 0.0480 between Responsible and Nonresponsible.
Empirical marginal distributions and fitted marginals under the basic model (Fit 1) and zero-inflated model (Fit 2) are illustrated below in
Table 5 using the estimates computed in the previous section. Note that the total number of observations equals 32,100.
We fit the multivariate regression model in (
12) and the zero-inflated regression model described in the second section. Parameter estimates and their corresponding
p-values are displayed in
Table 6 and
Table 7, respectively. It is observable that for the former regression model, the explanatory variables private 1 and risk area are statistically significant at the 5% significance level. This is also verified by the intercept term of the model, i.e., constant. Furthermore, some other covariates (private 2, profession and has km limit) are significant at the same level for all the responses except for Fire and Theft; similarly, driver age is not significant for the dependent variable Responsible. On the other hand, with respect to the zero-inflated regression model, the explanatory variables risk area and constant are statistically significant at the same significance level for all responses; moreover, the covariate private 1 is not significant for the response variable Parking and has km limit for Fire and Theft. In terms of the four measures of model selection considered, the zero-inflated regression model is preferable over the model (
12).
Now, we are interested in comparing the six mixed random variables’ aggregate claims amount for Claims
and the different coverages, i.e., Nonresponsible, Responsible, Parking, Windscreen, and Fire and Theft,
. In order to estimate the scale parameter
of the exponential distribution, we considered the variable ClaimAmount available in the dataset freMPL10. The estimate of this parameter is
. The pdf/pmf associated with the mixed random variables is displayed in
Figure 1. As expected, the density of the random variable Claims decays to zero more slowly than those of the different coverages. Among the coverages, the Responsible variable is the one that approaches zero fastest.
5. Final Comments and Future Research
It is common today, mainly due to the existing competition, that insurers offer coverage of different claims within the same product not only to gain in competitiveness, but also to benefit from risk diversification and volatility. To date, most insurance companies differentiate, apart from the total number of claims, individualized claims for different coverages, such as windscreen claims and theft and fire claims, among others. Therefore, it seems reasonable to assume that every policyholder generates a sequence of claims numbers for each coverage; one of them is the total claims number, which includes the sum of the claims in all the coverages. In this work, we introduced a new methodology based on a multivariate discrete distribution via conditional specification to explain the claims frequency in different coverages and the total claims number. This procedure allows us to analyze the dependence structure of a finite number of coverages in motor vehicle insurance and also to include heterogeneity via explanatory variables. Closed-form expressions were given for model parameter estimates, and a priori premiums were provided when different premium principles were used. Numerical applications revealed that specific covariates are statistically significant in some coverages, yet they are not so for others. In this way, the model allows us to discern how the different explanatory variables affect each coverage when calculating the corresponding premiums.
The approach introduced in this work avoids the use of copula-based modeling. The latter methodology has been very useful, but at the same time, heavily criticized in the statistical literature when modeling multivariate data. Although there exists a wide catalog of copulas, it has been mentioned that a weakness of the copula approach lies in choosing an appropriate copula structure for the model at hand (
Balakrishnan and Lai 2009, chp. 1, p. 59). Furthermore, any copula includes a parameter that controls the dependence structure, and this parameter is sometimes difficult to estimate since it must fall into the admissible support. As explored in the second section of this work, the model depends heavily on the parameter
, and for that reason, a more flexible dependence structure based on multivariate subordination is an issue that deserves to be studied. In this regard, it would be interesting to compare this family of distributions with the multivariate regression model based on the multivariate Sarmanov distribution, similar to the models derived in
Bolancé and Vernic (
2019). This model could be used to explain situations where the policyholder wishes to extend the third-party motor vehicle insurance to account for different coverages that adapt to their personal needs. Alternatively, it could be feasible to implement a multivariate version with elliptical copula-based models to accommodate a wide range of dependence. It is essential to mention that the properties of the copula are not the same as for continuous random variables since the probability of ties in the data is positive. Thus, the estimation cannot be directly carried out, and a continuous extension of integer-valued random variables is needed by using the approach proposed by
Denuit and Lambert (
2005).
The purpose of this work is not to compare against other models since, to our knowledge, models of this nature do not exist in the actuarial literature. However, the case with two coverages was discussed via the bivariate Poisson distribution (see
Bermúdez and Karlis 2017) and the case with all the coverages using the multivariate negative binomial distribution in (
Johnson et al. 1996, chp. 36, p. 94). Admittedly, the fit obtained with the proposed model is not entirely satisfactory (as judged by the chi-squared test statistic, which is not shown in the paper). The model could therefore be improved by using a similar construction, but assuming that the total number of claims and all the coverages follow a negative binomial distribution instead. It would also be possible to zero-inflate all the different coverages. These issues could be the subject of future research.