Open Access

*Int. J. Financial Stud.* **2018**, *6*(4), 84; https://doi.org/10.3390/ijfs6040084

Article

Estimating Major Risk Factor Relativities in Rate Filings Using Generalized Linear Models

^{1} Ted Rogers School of Management, Ryerson University, Toronto, ON M5B 2K3, Canada

^{2} Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada

^{*} Author to whom correspondence should be addressed.

Received: 26 July 2018 / Accepted: 2 October 2018 / Published: 11 October 2018

## Abstract


Predictive modeling is a key technique in auto insurance rate-making and the decision-making involved in the review of rate filings. Unlike an approach based on hypothesis testing, the results from predictive modeling not only serve as statistical evidence for decision-making, they also discover relationships between a response variable and predictors. In this work, we study the use of predictive modeling in auto insurance rate filings. This is a typical area of actuarial practice involving decision-making using industry loss data. The aim of this study was to offer some general guidelines for using predictive modeling in regulating insurance rates. Our study demonstrates that predictive modeling techniques based on generalized linear models (GLMs) are suitable in auto insurance rate filings review. The GLM relativities of major risk factors can serve as the benchmark of the same risk factors considered in auto insurance pricing.

Keywords: rate filings; auto insurance regulation; generalized linear models; rate-making; predictive modeling

JEL Classification: G22

## 1. Introduction

Modeling aggregate loss (Duan 2018; Frees 2014; Meyers 2007; Shi 2016) using insurance risk factors is a key aspect of decision-making in the review of rate change applications. In Duan (2018), a logistic regression model was proposed to classify the loss data of a Chinese company into different risk levels using the burden index. Multivariate negative binomial models for insurance claim counts were proposed by Shi (2014) to capture the potential dependence structure among different claim types. In David (2015), Poisson regression and negative binomial models were applied to a French auto insurance portfolio to investigate the degree of risk using claim counts data. In Najafabadi (2017), a k-inflated negative binomial mixture regression model was proposed to model the frequency and severity of reported claims in an Iranian third-party insurance data set. Tang (2014) proposed a new risk factor selection approach for a zero-inflated Poisson regression model, based on the EM algorithm and the LASSO penalty. An analysis of car insurance data from the SAS Enterprise Miner database was then used to show the usefulness of the proposed method in rate-making. However, all of those works focused on the loss experience of an individual company rather than on the total loss at the industry level. In auto insurance regulation, approval of a rate change application submitted by an insurance company requires significant statistical evidence from the company’s past loss experience (Chen 2009). Thus, a study of total loss behavior at the industry level becomes important in providing a constructive review of a company’s rate change applications.

Auto insurance companies often consider various risk factors based on the available loss experience, such as age, number of years licensed, gender, number of years without accidents, territory, and vehicle rate group. In addition, various socio-demographic factors are used to differentiate the risk levels of drivers. More recently, usage-based insurance (UBI) programs have been offered by many major insurance companies. Data associated with UBI (e.g., mileage, traffic conditions, and driving habits) constitute new types of rating factors (Cripe 2012; Husnjak 2015; Ma 2018). However, this is not the case for insurance regulators, because detailed individual loss information is not available for the review of a rate change. Of course, insurance regulators are not interested in reproducing company results when reviewing a rate change application, but they do need an analysis based on the aggregated loss experience of major insurance companies or the whole industry. A review decision made by a regulator must be supported by key findings on the rating factors at the industry level. Since insurance companies use many risk factors in pricing, the question arises as to whether one can focus only on the major rating factors in the review process. Answering this question calls for the use of suitable predictive modeling techniques that can be applied in rate filings review, particularly in the classification of risk (Hamid 2011; Watson 2004). The aim of such modeling techniques is to provide a better understanding of the characteristics of major risk factors. In Canada, among many other risk factors, Class and Driving Record (DR) are the most important. DR is a factor with levels 0, 1, 2, …, 6+, where each level represents the accumulated non-accident years of an insured, while Class represents a combination of driver age, gender, and use of vehicle. The information used for carrying out relativity estimates is either loss costs or loss ratios. None of the current predictive models of insurance loss use industry aggregate loss data and focus on major risk factors. This motivated us to investigate whether major risk factors are able to capture most of the variation of total loss data at the industry level.

In predictive modeling, to better capture the major variation of loss data, considerable effort has been spent on finding the optimal solution in terms of the minimum overall bias (Zhao 2011; David 2015; Frees 2015; Garrido 2016). Due to its significant impact on insurance rate regulation, modeling of loss costs or loss ratios became popular (Harrington 1992; Tennyson 2009). When loss cost is used, the bias is defined as the difference between the predicted loss cost and its actual observed value. The minimum overall bias considers both the bias of estimating the pure premium for each class of insured and its associated number of exposures. To estimate the risk relativities, the minimum bias procedure (MBP), introduced by Bailey (1963), has often been used, and it has become a traditional approach in non-life insurance pricing (Feldblum 2003). Recent research on actuarial practice has demonstrated that generalized linear models (GLMs), an advanced statistical method, have become prevalent in the field of rate-making, risk modeling, and forecasting in most European and Asian countries (Ohlsson 2010). In Zhao (2011), GLMs were used to model the loss frequency and loss severity of automobile insurance in China in order to analyze the impacts of automobile, human, and rating territory factors on loss frequency and severity. In David (2015), an overview of GLMs in the calculation of the pure premium given the observable characteristics of the policyholders was presented. In Garrido (2016), GLMs were fitted to the marginal frequency and the conditional severity components of the total claim cost, introducing dependence between the number of claims and the average claim size. The main reason for the prevalence of GLMs is that they enable a simultaneous modeling of all possible risk factors as well as determination of which risk factors to retain in the model.

In the risk classification system of auto insurance companies, rate-making is one of the most important aspects among many others, such as underwriting and marketing strategies (Outreville 1990). Its major goal is to develop a set of risk factor relativities that can be further used for pricing an insurance policy. Also, rate-making that uses industry-level data is a typical task in a rate filings review. Even though an insurance company has access to transactional loss and claim data, the rate-making process is based on the aggregate level of the company data, because it is not in the company’s interest to evaluate the risk of a single exposure. Rate-making is done based on the average loss costs of each combination of the levels of risk factors. The average loss cost (or simply the loss cost in the following discussion) is defined as the total loss divided by the total exposures within each possible combination of the levels of risk factors, where the total loss and the total exposures are the aggregate measures of loss and exposures for that particular combination. Traditionally, within actuarial practice, the estimates of rating factor relativities are obtained by using MBP (Feldblum 2003). In rate filings reviews, often the rating variables of Class and DR are considered. The relativities are estimated by MBP separately for each data set, that is, for each respective combination of different years, territories, and coverages. The problem with this approach as a rate-making technique is that the potential interactions among rating variables are not considered. Thus, the results may not be comparable. Also, MBP is a numerical approach that is unable to statistically evaluate the difference between relativities.

In this paper, we present the results of a comparative study employing MBP and GLM as modeling tools. We focus on the study of industry-level data used in rate filings reviews to decide if major data variation could be retained in the models, and how it is affected by the risk factors under consideration. We investigate the consistency of the results obtained from both MBP and GLM methods. It is expected that one could replicate the results obtained by MBP by applying GLM to the same data set. For ease of understanding the interaction of rating variables, we mainly focus on the predictive modeling of loss cost using DR, Class, and accident year as the major rating variables. The significance of this work is in providing general guidelines for the use of predictive modeling in an insurance rate filings review for auto insurance regulators. In particular, this work demonstrates the usefulness of GLM in rate-making for rate filing purposes. This approach can help us to understand how a decision is being made when focusing only on major risk factors. This paper is organized as follows. In Section 2, we summarize the data used in this work, and discuss the methodologies, including MBP and GLM, used for producing the results presented in this paper. In Section 3, comparative results obtained from MBP and GLM under different model settings are analyzed. Finally, in Section 4, we conclude our findings and provide summary remarks.

## 2. Methods

#### 2.1. Minimum Bias Procedure

Let ${r}_{ij}$ be a collection of observed values. It can be loss severity, claim counts, average loss cost, or loss ratio, depending on the interest of modeling, and how we define a response variable. Let ${x}_{1i}$ be the value contributed by the ith DR level (i.e., DR takes values 0, 1, 2, 3, 4, 5, 6+), and let ${x}_{2j}$ be the value contributed by the jth level of Class. Also, let ${w}_{ij}$ be the number of exposures of risk in the (i,j)th combination of levels of the underlying risk factors. The objective of MBP is to find optimal solutions for ${x}_{1i}$ and ${x}_{2j}$, subject to the following two sets of equations as constraints (Feldblum 2003):

$$\begin{aligned}
\sum_{j} w_{ij}\big(r_{ij} - (x_{1i} + x_{2j})\big) &= 0, \quad \forall i, \ \text{where } i = 1, 2, \dots, 7,\\
\sum_{i} w_{ij}\big(r_{ij} - (x_{1i} + x_{2j})\big) &= 0, \quad \forall j, \ \text{where } j = 1, 2, \dots, 14.
\end{aligned}$$

Numerically, MBP iteratively solves for ${x}_{1i}$ and ${x}_{2j}$ using the following two equations, until ${x}_{1i}^{\left(l\right)}$ and ${x}_{2j}^{\left(l\right)}$ converge at the lth step, where

$$\begin{aligned}
x_{1i}^{(l+1)} &= \sum_{j} w_{ij}\big(r_{ij} - x_{2j}^{(l)}\big) \Big/ \sum_{j} w_{ij},\\
x_{2j}^{(l+1)} &= \sum_{i} w_{ij}\big(r_{ij} - x_{1i}^{(l)}\big) \Big/ \sum_{i} w_{ij}.
\end{aligned}$$

The relativities are obtained by selecting a kth level as the base; they are defined as $\frac{x_{1i}}{x_{1k}}$, $\forall i$ for DR, and $\frac{x_{2j}}{x_{2k}}$, $\forall j$ for Class.

#### 2.2. Generalized Linear Models for Rate-Making

The main idea of generalized linear models (GLMs) (De Jong 2001; Haberman 1996; Ohlsson 2010) in rate-making is to model a transformation of the expected value of a response variable, which in our case is ${r}_{ij}$. The transformation function $g\left({\mu}_{ij}\right)$ is called the link function, where ${\mu}_{ij}=E\left({r}_{ij}\right)$. The model used to explain the transformation $g$ is linear, and it can be written as follows:

$$g\left(\mu_{ij}\right) = \mathbf{Y}^{\top}\beta, \quad \mathrm{or} \quad g\left(r_{ij}\right) = \mathbf{Y}^{\top}\beta + \epsilon_{ij},$$

where ${\mathbf{Y}}^{\top}=[{y}_{1i},{y}_{2j}]$ and ${\beta}^{\top}=[{\beta}_{1i},{\beta}_{2j}]$. Here, ${y}_{1i}$ and ${y}_{2j}$ are dummy variables: ${y}_{1i}$ takes the value 1 when i corresponds to the ith level of DR, and zero otherwise; ${y}_{2j}$ takes the value 1 when j corresponds to the jth level of Class, and zero otherwise. ${\epsilon}_{ij}$ is the error random variable, assumed to follow a certain probability distribution. ${\mathbf{Y}}^{\top}\beta$ is referred to as the systematic component. The variance of ${\epsilon}_{ij}$ is assumed to have the following functional relationship with the mean response:

$$Var\left(\epsilon_{ij}\right) = \frac{\varphi V\left(\mu_{ij}\right)}{\omega_{ij}},$$

where $V\left(x\right)$ is called the variance function. The parameter $\varphi$ scales the variance function $V\left(x\right)$, and ${\omega}_{ij}$ is a constant assigning a weight. This result comes from a family of distributions called the exponential family (McCullagh 1989), which determines the parameters $\varphi$ and ${\omega}_{ij}$. The case $V\left(x\right)=1$ implies a normal distribution. If $V\left(x\right)=x$, the distribution is Poisson. If $V\left(x\right)={x}^{2}$, it is a gamma distribution, and if $V\left(x\right)={x}^{3}$, it is an inverse Gaussian distribution (Mildenhall 1999). These are the distributions used in this work. In general, the following relationship between the variance function and the mean response is considered:

$$V\left(\mu_{ij}\right) \sim \mu_{ij}^{p}.$$

We consider the cases p = 0, 1, 2, and 3. For $p<0$ or $p>2$, the distributions are called Tweedie distributions (note that the inverse Gaussian belongs to this class of distributions) (Dunn 2001; Tweedie 1957).

In the case when g is the identity function, the GLM reduces to a general linear model. Under the log link, the relativities are obtained by computing the exponential transformation of the model coefficients, denoted by $exp\left\{{\beta}_{1i}\right\}$ and $exp\left\{{\beta}_{2j}\right\}$, $\forall \phantom{\rule{0.166667em}{0ex}}i$ and j (Ohlsson 2010). In this work, the following model specifications are investigated:

- Systematic component: $\mu $ $\left(Loss\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}Cost\right)$∼$Class+DR$;
- Error distribution functions: Gaussian, Poisson, inverse Gaussian, and gamma;
- The multiplicative model, resulting in a log link function for GLM, that is, $g=log$;
- The number of exposures is used as a weight value.

From the way the systematic component for the given data is modeled, one can see that for each combination of risk factor levels, the loss cost has a unique distribution, and the model specifies a common variation captured by the error distribution function. Because the standard deviation estimate is available for each level of risk factors, one can easily construct their confidence intervals based on the normal approximation approach. To validate the choice of log-scale, a Box-Cox transform (Osborne 2010) can be applied. If the obtained parameter value is close to zero, this implies that a selection of the log link function is appropriate for the given data.

#### 2.3. Handling Weights in GLM

In general, it is expected that GLM will be applied to either an industry-level data summary or a company’s transactional data summary. There are no fundamental differences between these two cases, as they are only subject to different levels of data aggregation. In transactional data summary, the loss costs are calculated in exactly the same way as in the industry data summary. It is important to consider a suitable weight function for each cell data, so that the variance function of the pre-defined error distribution in GLM is appropriately modeled. The number of exposures associated with each cell of the two-way data summary (by DR and Class) is used as a weight for the loss cost approach. The main purpose is to offset the different levels of data variation among the cells. Under the definition of the weight function in GLM, the assumption of having an identical error distribution for each cell of loss cost becomes reasonable.

In practice, there is flexibility in specifying the weight function, depending on how each cell of the data summary is defined. In addition to using the number of exposures at each combination as a weight, a new type of weight function can be defined by

$$w_{ij} = \frac{E_i \times E_j}{E_{total}^2},$$

where ${E}_{i}$ is the one-way summary of the number of exposures for the ith level of DR, ${E}_{j}$ is the one-way summary of the number of exposures for the jth level of Class, and ${E}_{total}$ is the total number of exposures. Another weight function,

$$w_{ij} = \frac{E_i}{E_{total}},$$

can be used when one wants to emphasize the importance of DR by ignoring the effect of the number of exposures from Class. Similarly, a third weight function can be defined by

$$w_{ij} = \frac{E_j}{E_{total}}.$$
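The alternative weight functions are straightforward to compute from the one-way exposure summaries. A small sketch with a made-up exposure matrix (the values are illustrative only):

```python
import numpy as np

# Made-up exposure matrix: rows = DR levels, columns = Class levels.
E = np.array([[100.0,  50.0, 30.0],
              [ 80.0, 120.0, 40.0]])
E_i = E.sum(axis=1)   # one-way exposures by DR level
E_j = E.sum(axis=0)   # one-way exposures by Class level
E_total = E.sum()

# w_ij = E_i * E_j / E_total^2  (product of one-way shares)
w_product = np.outer(E_i, E_j) / E_total**2
# w_ij = E_i / E_total  (emphasize DR; constant across Class)
w_dr = np.repeat((E_i / E_total)[:, None], E.shape[1], axis=1)
# w_ij = E_j / E_total  (emphasize Class; constant across DR)
w_class = np.repeat((E_j / E_total)[None, :], E.shape[0], axis=0)
```

Note that the product weights sum to one over all cells, while the one-way weights are constant along the ignored factor, which is what "ignoring the effect of the number of exposures" amounts to.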

#### 2.4. Inclusion of Year as a Major Risk Factor

The estimates of relativities obtained from multiple-year loss experience are often considered to be more robust. One approach is to combine the multiple-year data together and ignore their year labels. In this case, the credibility of the loss cost for each combination is improved, as the corresponding number of exposures increases with the number of years of data. Thus, the number of exposures is now based on multiple years of aggregation. However, the disadvantage of this approach is that the effect of different years on relativity estimates is transferred to the relativities of other risk factors. A better approach is to add $Year$ as a major risk factor to the GLM, in which case the systematic component is re-defined as

$$\begin{array}{c}\hfill log\left(Loss\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}Cost\right)\sim Class+DR+Year.\end{array}$$

Since we added $Year$ as another variable, ${Y}^{\top}$=$[{y}_{1i},{y}_{2j},{y}_{3k}]$ and ${\beta}^{\top}$=$[{\beta}_{1i},{\beta}_{2j},{\beta}_{3k}]$, where ${y}_{3k}$ is a dummy variable taking the value 1 when k corresponds to the kth level of $Year$. The ${\beta}_{3k}$ is the associated coefficient. In this approach, the effect from the factor $Year$ can be estimated. Also, the data variation captured by the model is improved by the consideration of an additional risk factor. Under this approach, the estimate of relativity is less dependent on the choice of an error distribution function. This implies the robustness of estimating relativity using GLM. The relativity of $Year$ reflects the level of loss cost for that particular year. Since loss costs of later years have not yet fully developed to the ultimate loss level, a smaller value of relativity of $Year$ may only mean a lower loss cost for that particular year, based on the available reported loss amounts.
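To make the dummy coding concrete, the extended systematic component can be checked with a plain least-squares fit of log loss cost, on made-up loss costs that are exactly multiplicative (hence exactly log-linear). All relativity values here are invented for illustration; the first level of each factor is the base.

```python
import numpy as np

# Invented true relativities; base levels have relativity 1.
base = 100.0
dr_rel = np.array([1.0, 0.8])         # DR0, DR1
cl_rel = np.array([1.0, 1.5])         # Class1, Class2
yr_rel = np.array([1.0, 1.1, 0.95])   # e.g., years 1, 2, 3

X, y = [], []
for i in range(2):
    for j in range(2):
        for k in range(3):
            # dummy variables y_1i, y_2j, y_3k with base levels dropped
            X.append([1.0, float(i == 1), float(j == 1),
                      float(k == 1), float(k == 2)])
            y.append(np.log(base * dr_rel[i] * cl_rel[j] * yr_rel[k]))
X, y = np.array(X), np.array(y)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
relativities = np.exp(beta)  # [base, DR1, Class2, Year2, Year3]
```

Since the data lie exactly in the column space of the design matrix, the fit recovers the invented base rate and relativities, including the year effects, exactly.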

#### 2.5. Fundamental Difference between MBP and GLM

The use of loss cost as a response variable is a common approach when applying MBP to industry-level data (Brown 1988; Garrido 2016; Ismail 2009). The loss cost approach considers both the relativities caused by loss frequency and those caused by loss severity. Within the loss cost approach, the relativities are obtained by a conditional iterative process, which implies that the output of the lth step of the algorithm becomes the input of the (l+1)th step (Brown 1988). The minimization of the total bias is based on the marginal balance principle applied to each column and each row of the loss cost data matrix, respectively, where the rows and columns correspond to DR and Class. Essentially, MBP treats the bias within each cell of the data matrix as being independently and identically distributed, without actually measuring its distribution. This method is less practical when the total number of risk factors is large, due to the nature of the numerical approach. This approach also does not give an optimal estimate of the base rate (Brown 1988), which is important for calculating the pure premium. In terms of numerical convergence, the initial value condition plays an important role. The initial values are assigned based on the one-way summary of loss costs. Due to the nature of the iterative process, the final solution is often affected by the choice of initial values, since each value is adjusted based on the result from the previous step.

Unlike MBP, GLM specifies an error distribution function for the bias of loss costs, and transforms observed values by link function to improve the linearity between the transformed response variable and the risk factors. The transformation is introduced to model a potential non-linear relationship via an appropriate mapping from the original space of the mean value of observations to a linear feature space. The selected error distribution function mainly captures the true distribution of underlying loss cost. Instead of placing a balance constraint for each row and column of data, GLM uses the assumption that within each row and each column, the error function is common to the bias. This constraint has a similar effect on determining a unique solution, as it was observed in MBP. However, in MBP, an iterative process is used to solve a non-linear optimization problem. The idea of using the GLM method is to fit the bias to a certain distribution and use a statistical approach to estimate the distribution parameters. In this case, an interval estimate of relativity becomes possible. Because of this, the use of GLM is more powerful than using MBP in terms of the statistical validation of test data. When specifying an equivalent constraint for both MBP and GLM, one can recover the same estimates of relativities from both MBP and GLM (Brown 1988; Mildenhall 1999).

With regards to making a prediction of loss costs, both MBP and GLM require a common base rate for all rating variables. The common base estimated using GLM corresponds to the intercept of the model. For MBP, it is calculated from the total loss divided by the total exposures for all risk factors. Therefore, the performances of MBP and GLM differ in minimizing overall bias: the base rate from GLM is a model estimate, but the base rate for MBP is just an empirical estimate. However, the estimated relativities can be the same for certain cases for both methods.

#### 2.6. Some Discussions

In general, the use of GLM enables us to select variables according to their statistical significance. This is likely why GLM has been considered to be the most powerful tool for non-life insurance pricing (Ohlsson 2010). One of the important tasks of the pricing problem is ensuring that risk factors and their levels are included in the model. As the dimension of the risk factors and their associated number of levels increase, it becomes challenging to interpret the impact of particular combinations of the rating variables on the loss cost. Therefore, focusing on the most significant rating variables may be superior. However, we do not encounter this difficulty within the scope of this work, as we only consider a few important risk factors that appear in the rate filings review. The relativities estimates of major risk factors become a general guideline for rate filings reviewers to better understand the nature of insurance rate fluctuations when comparing the results from different companies. Thus, the selection of a proper approach makes rate regulation statistically more sound, as the results are less affected by reviewer experience and judgement.

Recently, the generalized additive model (GAM) has been proposed in predictive modeling for rate-making (Denuit 2004; Fuzi 2016; Klein 2014). From a technical perspective, GAM is an extension of GLM. In statistics, data are categorized into two types, numerical or categorical. When modeling data in practice, there is often no clear-cut rule about the data type. In most cases, it depends on expert opinion and the purpose of modeling. This means that for some of the variables within rate-making, we can have different types of treatments. For instance, the DR takes values 0, 1, 2, 3, 4, 5, 6+. If our goal is to estimate the average differential for each driving record, we assume that the collected data are categorical. However, since there is a natural ordering of the data, we expect that with the increase of the value of driving record, the differential decreases. It is reasonable to expect that there is a monotonic relationship within different driving records. Therefore, one can impose this relationship on a model by specifying the functional form of the Driving Record. This is the idea behind the extension of GLM to GAM.

## 3. Results

In this study, we considered an industry-level data set for accident years 2007 to 2009, with third party liability (TPL) coverage and urban territory. The relativities of DR and Class at various levels were obtained for the accident year 2007 data set by both MBP and GLM. Additionally, a data set comprising these three accident years as a whole was used for relativity estimation, both with and without $Year$ as a rating factor. The modeling and analysis were based on the aggregate measure of each combination of levels among all factors that we considered. The summary of average loss costs is reported in Table 1. The value of each cell represents the loss cost of a given combination of levels of DR and Class. Notice that the NA values in the table mean that the loss costs were not available for these cells, which requires handling missing values in the later computation. In some cases, zero values of the loss cost appeared in the summary data table; when this happened, the zero values were simply replaced by NAs. Ideally, we would have full information in the summary data table. Since we used industry-level data, the impact of missing values for some combinations of DR and Class was reduced compared to cases that deal with company-level data. Thus, it makes sense to use the estimated relativities as a benchmark in rate filings review.

The obtained model coefficients and relativities are presented in Figure 1 for the case of a Gaussian distribution as the error function. The results include both the general linear model and the generalized linear models. From Figure 1, we can see that there were no large differences among the estimates of relativities. However, the general linear model gave slightly higher overall average relativities, which implies that the pure premium using results from the general linear model would be overestimated when compared to the generalized linear models. When GLM results were compared to the ones from MBP, a similar pattern was observed. This confirms our expectation that GLM would perform better in terms of reducing the effects of the interaction of risk factors, leading to lower estimates of relativities. However, this is only applicable for a Gaussian error function, which implies that the underlying loss cost must have a lighter-tailed distribution. This suggests that when modeling loss cost using GLM with a Gaussian error function, large losses, corresponding to catastrophic events, must be removed before fitting the models to the data, because loss data that include large losses often follow a heavy-tailed distribution (Bradley 2003; McNeil 1997).

The results obtained by using other error distributions, including gamma and inverse Gaussian, are presented in Figure 2. The estimates of relativities for both DR and Class were larger than the ones obtained from MBP or the Poisson distribution. The MBP and Poisson distribution led to the same theoretical results (see Brown 1988). This may mean that the use of a heavy-tailed distribution as the error distribution function gives larger estimates of relativities. Risk increased when the tail of the loss distribution was heavier, resulting in larger relativities for the rating factors. When the relativities obtained from GLM or MBP were compared to one-way relativities, we saw that they were mostly underestimated. This suggests that more risk factors may need to be included in order to better capture data variation. Rate filings reviews do not aim at pricing, but focus instead on producing results that help to better understand the nature of the major risk factors used by companies.

By looking at the loss cost or the one-way relativities of loss cost for DR1 and DR2 (i.e., the last column in both Figure 1 and Figure 2), we realized that there was a reversal for the relativity of DR1 and DR2. We expected a higher relativity for the insured who had less experience. Unfortunately, the results from the one-way summary were not as good as the ones obtained by GLM or MBP because the one-way relativity based on a two-way summary loss cost ignores the differences of the number of exposures among cells. Therefore, in order to better model loss costs using GLM, a suitable choice of a weight function needs to be specified, reflecting the difference of data value in each cell caused by the different levels of loss aggregation.

The results of relativity estimates and their comparisons under different model assumptions are reported in Figure 3, Figure 4, Figure 5 and Figure 6. In Figure 3 and Figure 4, the overall pattern for DR was consistent for all cases. Relativity decreased with the increase of the number of non-accident years. This means that prior driving record proved to have validity in predicting a driver’s accident risk. This result coincides with the results of Peck (1983), who demonstrated the significant predictive power of DR as a rating factor. In all cases, the relativities of DR obtained from GLM were close to the actual relativities calculated from the one-way summary. For Classes 1 to 7, which contained the majority of the risk exposures, the estimated relativities for all considered models were close to the actual one-way estimates. The model estimates’ departure from the actual one-way relativities became significant only for those Classes that contained a smaller number of risk exposures. The values of the coefficients and relativities of the considered methods are displayed in Figure 7. Overall, these values coincided with the actual values, but GLM underestimated the relativities at some levels of Class—namely, Class8, Class10, Class11, and Class18. We believe that this is due to both the missing values and the level of credibility of data from those classes. Combining Class with DR, we were able to see that Class was successful in explaining the loss cost data. A similar research outcome was reported in Lourens (1999); however, unlike the single Class factor with multiple characteristics used in our work, drivers’ sex, age, and level of education were used in Lourens (1999).

The coefficients and relativities obtained by the GLM for the data with three accident years combined are reported in Figure 8 and Figure 9. Recall that for the one-way relativities we observed a reversal of the values for DR1 and DR2 in both the 2007 and 2009 accident years (see Figure 1 and Figure 2). After combining three years’ data, this reversal disappeared in the one-way relativities, which implies that using the three accident years combined was better at recovering the desired pattern. The results obtained by GLM, both with and without $Year$ as a risk factor, displayed no reversal. Moreover, including $Year$ as a risk factor allowed us to capture its effect on loss cost; the results in Figure 8 and Figure 9 show that this effect differed significantly among years.
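The stabilizing effect of pooling can be seen even in a simple one-way calculation. The sketch below (toy numbers, not the filing data) computes exposure-weighted one-way relativities for DR0–DR2 from a single noisy year and from three years pooled; the single year exhibits the DR1/DR2 reversal, which disappears once the years are combined.

```python
import numpy as np

# Toy cell data: rows = accident years, columns = DR0, DR1, DR2.
loss_cost = np.array([[560., 430., 470.],    # year 1: DR2 above DR1 (reversal)
                      [545., 475., 410.],
                      [550., 465., 405.]])
exposure  = np.array([[900., 250., 120.],
                      [950., 260., 130.],
                      [940., 255., 125.]])

def one_way_relativities(cost, expo):
    """Exposure-weighted average loss cost per DR level, scaled to DR0."""
    avg = (cost * expo).sum(axis=0) / expo.sum(axis=0)
    return avg / avg[0]

rel_year1 = one_way_relativities(loss_cost[:1], exposure[:1])
rel_pooled = one_way_relativities(loss_cost, exposure)
```

Because the thinly populated DR2 cell is volatile in any single year, pooling accident years raises its effective exposure and restores the expected monotone pattern.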

From the results displayed in Figure 8 and Figure 9, we conclude that the model with $Year$ as a risk factor is a good choice from the rate filings review perspective, as it avoids the extra uncertainty that comes with single-year data. The results from the three years of data combined were more reliable and thus more suitable as a benchmark for the rate filings review. The model can easily be extended to include territory as an additional risk factor. However, from the rate filing review perspective we consider this unnecessary, because it would significantly decrease the number of exposures in each combination of levels. We also found that the definition of territory presents another difficulty, as it may change from time to time, whereas the definitions of DR and Class are relatively stable.

## 4. Conclusions

The GLM procedure with Poisson error distribution function is equivalent to MBP in estimating relativities. However, due to the different methods of estimating the base rate, the GLM with Poisson error function outperformed MBP in terms of overall bias. For the data considered, the gamma error function was the most appropriate error distribution: under multiple-year loss experience, it was the only error distribution whose estimated relativities decreased monotonically with increasing driving experience. Because a monotonic decrease is desired for DR in auto insurance pricing, this suggests that the gamma error distribution function should be used. When GLM is applied to aggregate data, a weight function is needed if one believes that there is bias from estimating the average loss cost for each combination of risk factors, or if the level of data variation in each cell needs to be taken into account; the commonly used weight function is the number of exposures for each combination of the levels of the risk factors. Moreover, GLM is able to reduce the effect of potential interactions among risk factors; because of this, the obtained relativities for major risk factors are usually lower than those from the one-way analysis. GLM can be further expanded by including additional rating variables, conducting statistical tests of significance, and evaluating the predictive power of the model. As an efficient and reliable predictive tool, GLM is also popular in other areas of predictive modeling.
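The equivalence between the Poisson log-link GLM and MBP can be made concrete. Bailey’s multiplicative minimum bias procedure alternates between the two rating factors, at each step setting one factor’s relativities so that exposure-weighted losses balance at every level; its fixed point coincides with the Poisson log-link GLM estimates (Mildenhall 1999). A minimal sketch on a hypothetical two-way table (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical two-way aggregate data: rows = DR levels, cols = Class levels.
loss = np.array([[540., 820.],
                 [470., 690.],
                 [400., 600.]])          # average loss cost per cell
expo = np.array([[900., 700.],
                 [400., 300.],
                 [150., 100.]])          # exposures per cell

# Bailey's multiplicative minimum bias iteration: fitted cell value is
# x[i] * y[j]; alternate between factors, balancing exposure-weighted
# losses at each level of the factor being updated.
x = np.ones(loss.shape[0])               # DR relativities
y = np.ones(loss.shape[1])               # Class relativities
for _ in range(200):
    x = (expo * loss).sum(axis=1) / (expo * y).sum(axis=1)
    y = (expo * loss).sum(axis=0) / (expo.T @ x)

# Normalize so the base levels (DR0, Class1) have relativity 1;
# the base rate absorbs the overall scale.
base_rate = x[0] * y[0]
dr_rel, class_rel = x / x[0], y / y[0]
```

At convergence the exposure-weighted residuals sum to zero within every row and every column, which is the "minimum bias" balance condition and exactly the Poisson GLM score equations.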

The overall implication of the presented findings is that the pricing issues in a complex insurance system are better understood from the rate filings perspective. The estimates of the major risk factors captured the major pricing patterns, and the results explain the natural variation in the process when reviewing a company’s rate change application. This enhances decision-making in operations that use only major risk factors, because such decision-making is less affected by the details of an individual company’s results. Understanding how advanced statistical techniques work may also lead to better communication with insurance companies in the process of discussing rate changes. Our study demonstrated that GLM is suitable as a predictive modeling technique in auto insurance rate filings review, and the GLM relativities of major risk factors within that review can be used as the benchmark for the same risk factors used in auto insurance pricing.

## Author Contributions

Conceptualization, S.X.; Data curation, S.X.; Formal analysis, S.X.; Investigation, S.X. and A.L.; Methodology, S.X.; Resources, A.L.; Software, S.X.; Validation, S.X.; Visualization, S.X.; Writing – original draft, S.X.; Writing – review & editing, S.X. and A.L.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

- Bailey, Robert A. 1963. Insurance Rates With Minimum Bias. Arlington County: Casualty Actuarial Society, vol. 50, pp. 4–11. [Google Scholar]
- Bradley, Brendan O., and Murad S. Taqqu. 2003. Financial risk and heavy tails. In Handbook of Heavy-Tailed Distributions in Finance. Edited by S. T. Rachev. Amsterdam: Elsevier, pp. 35–103. [Google Scholar]
- Brown, Robert L. 1988. Minimum Bias with Generalized Linear Models. Arlington County: Casualty Actuarial Society, vol. 75, pp. 187–217. [Google Scholar]
- Chen, Shu-Ching, Min Chen, Na Zhao, Shahid Hamid, Kasturi Chatterjee, and Michael Armella. 2009. Florida public hurricane loss model: Research in multi-disciplinary system integration assisting government policy making. Government Information Quarterly 26: 285–94. [Google Scholar] [CrossRef]
- Cripe, Frederick F., and Stephen Fiete. 2012. Usage-Based Insurance Cost Determination System and Method. U.S. Patent 8,280,752, October 2. [Google Scholar]
- David, Mihaela, and Dănuţ-Vasile Jemna. 2015. Modeling the frequency of auto insurance claims by means of Poisson and negative binomial models. Annals of the Alexandru Ioan Cuza University-Economics 62: 151–68. [Google Scholar] [CrossRef]
- David, Mihaela. 2015. Auto insurance premium calculation using generalized linear models. Procedia Economics and Finance 20: 147–56. [Google Scholar] [CrossRef]
- De Jong, Piet, and Gillian Z. Heller. 2008. Generalized Linear Models for Insurance Data. Cambridge: Cambridge Books. [Google Scholar]
- Denuit, Michel, and Stefan Lang. 2004. Non-life rate-making with Bayesian GAMs. Insurance: Mathematics and Economics 35: 627–47. [Google Scholar] [CrossRef]
- Duan, Zhengmin, Yonglian Chang, Qi Wang, Tianyao Chen, and Qing Zhao. 2018. A Logistic Regression Based Auto Insurance Rate-Making Model Designed for the Insurance Rate Reform. International Journal of Financial Studies 6: 18. [Google Scholar] [CrossRef]
- Dunn, Peter K., and Gordon K. Smyth. 2001. Tweedie family densities: Methods of evaluation. Paper presented at 16th International Workshop on Statistical Modelling, Odense, Denmark, July 2–6. [Google Scholar]
- Feldblum, Sholom, and J. Eric Brosius. 2003. The Minimum Bias Procedure: A Practitioner’s Guide. Arlington County: Casualty Actuarial Society, vol. 90, pp. 196–273. [Google Scholar]
- Frees, Edward W., and Gee Lee. 2015. Rating endorsements using generalized linear models. Variance 10: 51–74. [Google Scholar]
- Frees, Edward W. 2014. Frequency and severity models. Predictive Modeling Applications in Actuarial Science 1: 138–64. [Google Scholar]
- Fuzi, Mohd Fadzli Mohd, Abdul Aziz Jemain, and Noriszura Ismail. 2016. Bayesian quantile regression model for claim count data. Insurance: Mathematics and Economics 66: 124–37. [Google Scholar] [CrossRef]
- Garrido, José, Christian Genest, and Juliana Schulz. 2016. Generalized linear models for dependent frequency and severity of insurance claims. Insurance: Mathematics and Economics 70: 205–15. [Google Scholar] [CrossRef]
- Haberman, Steven, and Arthur E. Renshaw. 1996. Generalized linear models and actuarial science. The Statistician 45: 407–36. [Google Scholar] [CrossRef]
- Hamid, Shahid S., Jean-Paul Pinelli, Shu-Ching Chen, and Kurt Gurley. 2011. Catastrophe model-based assessment of hurricane risk and estimates of potential insured losses for the state of Florida. Natural Hazards Review 12: 171–76. [Google Scholar] [CrossRef]
- Harrington, Scott E. 1992. A Note on the Impact of Auto Insurance Rate Regulation. In Foundations of Insurance Economics. Dordrecht: Springer, pp. 724–28. [Google Scholar]
- Husnjak, Siniša, Dragan Peraković, Ivan Forenbacher, and Marijan Mumdziev. 2015. Telematics system in usage based motor insurance. Procedia Engineering 100: 816–25. [Google Scholar] [CrossRef]
- Ismail, Noriszura, and Abdul Aziz Jemain. 2009. Comparison of Minimum Bias and Maximum Likelihood Methods for Claim Severity. Arlington County: Casualty Actuarial Society, p. 243. [Google Scholar]
- Klein, Nadja, Michel Denuit, Stefan Lang, and Thomas Kneib. 2014. Nonlife ratemaking and risk management with Bayesian generalized additive models for location, scale, and shape. Insurance: Mathematics and Economics 55: 225–49. [Google Scholar] [CrossRef]
- Lourens, Peter F., Jan AMM Vissers, and Maaike Jessurun. 1999. Annual mileage, driving violations, and accident involvement in relation to drivers’ sex, age, and level of education. Accident Analysis & Prevention 31: 593–97. [Google Scholar]
- Ma, Yu-Luen, Xiaoyu Zhu, Xianbiao Hu, and Yi-Chang Chiu. 2018. The use of context-sensitive insurance telematics data in auto insurance rate making. Transportation Research Part A: Policy and Practice 113: 243–58. [Google Scholar] [CrossRef]
- McCullagh, Peter, and John A. Nelder. 1989. Generalized Linear Models. Boca Raton: CRC Press, vol. 37. [Google Scholar]
- McNeil, Alexander J. 1997. Estimating the tails of loss severity distributions using extreme value theory. ASTIN Bulletin: The Journal of the IAA 27: 117–37. [Google Scholar] [CrossRef]
- Meyers, Glenn G. 2007. Estimating predictive distributions for loss reserve models. Variance 1: 248–72. [Google Scholar]
- Mildenhall, Stephen J. 1999. A Systematic Relationship Between Minimum Bias and Generalized Linear Models. Arlington County: Casualty Actuarial Society, vol. 86, pp. 393–487. [Google Scholar]
- Najafabadi, Amir T. Payandeh, and Saeed Mohammad Pour. 2017. A k-Inflated Negative Binomial Mixture Regression Model: Application to Rate-Making Systems. Asia-Pacific Journal of Risk and Insurance 12. [Google Scholar] [CrossRef]
- Ohlsson, Esbjörn, and Björn Johansson. 2010. Non-Life Insurance Pricing With Generalized Linear Models. Berlin: Springer, vol. 2. [Google Scholar]
- Osborne, Jason W. 2010. Improving your data transformations: Applying the Box-Cox transformation. Practical Assessment, Research & Evaluation 15: 2. [Google Scholar]
- Outreville, J. Francois. 1990. Underwriting Cycles And Rate Regulation In Automobile Insurance Markets. Journal of Insurance Regulation 9: 274. [Google Scholar]
- Peck, Raymond C., and Jensen Kuan. 1983. A statistical model of individual accident risk prediction using driver record, territory and other biographical factors. Accident Analysis & Prevention 15: 371–93. [Google Scholar]
- Shi, Peng, and Emiliano A. Valdez. 2014. Multivariate negative binomial models for insurance claim counts. Insurance: Mathematics and Economics 55: 18–29. [Google Scholar] [CrossRef]
- Shi, Peng, and James Guszcza. 2016. Frameworks for General Insurance Ratemaking. In Predictive Modeling Applications in Actuarial Science. Case Studies in Insurance. Cambridge: Cambridge University Press, Vol. 2, p. 100. [Google Scholar]
- Tang, Yanlin, Liya Xiang, and Zhongyi Zhu. 2014. Risk factor selection in rate making: EM adaptive LASSO for zero-inflated Poisson regression models. Risk Analysis 34: 1112–27. [Google Scholar] [CrossRef] [PubMed]
- Tennyson, Sharon. 2009. The Relationship Between Auto Insurance Rate Regulation and Insured Loss Costs: An Empirical Analysis. Available online: https://ecommons.cornell.edu/handle/1813/15118 (accessed on 11 October 2018).
- Tweedie, Maurice C. K. 1957. Statistical properties of inverse Gaussian distributions. II. The Annals of Mathematical Statistics 28: 696–705. [Google Scholar] [CrossRef]
- Watson, Charles C., Mark E. Johnson, and Martin Simons. 2004. Insurance rate filings and hurricane loss estimation models. Journal of Insurance Regulation 22: 39–64. [Google Scholar]
- Zhao, Hui-qing, and Han-zhang Wang. 2011. An Empirical Study of Rate Making of Automobile Insurance in China—Analysis Based on Generalized Linear Models. Journal of Tianjin University of Commerce 5: 8–12. [Google Scholar]

**Figure 1.**Model coefficients (i.e., ${V}_{1}$ columns) and relativities (i.e., EXP(${V}_{1}$) columns) obtained, respectively, under a generalized linear model (GLM) with Gaussian error distribution function and log link function, and under a GLM with Gaussian error distribution function and identity link function. The loss cost data from the accident year 2007 were used. MBP: minimum bias procedure.

**Figure 2.**Model coefficients (i.e., ${V}_{1}$ columns) and relativities (i.e., EXP(${V}_{1}$) columns) obtained under a GLM with log link function and various error distribution functions, including gamma, Poisson, and inverse Gaussian. The loss cost data from the accident year 2007 were used.

**Figure 7.**The predicted loss costs associated with different levels of Class under GLM with log link function and gamma error distribution function.

**Figure 8.**Relativities obtained under GLM with log link function and various error functions including gamma, Poisson, and inverse Gaussian. The combined data of three accident years’ loss cost were used.

**Figure 9.**Relativities obtained under GLM with log link function and various error functions including gamma, Poisson, and inverse Gaussian. Three accident years’ loss cost data were used, and the model included $Year$ as a rating variable. DR and Class relativities obtained from MBP and the one-way analysis are also reported.

**Table 1.**Two-way summary of loss cost by Driving Record (DR) and Class. The values represent the dollar amount of average loss cost for the given combination of Class and DR. Note that Class levels 4 and 14–17 do not appear because they are not defined in the given data set.

| | Class1 | Class2 | Class3 | Class5 | Class6 | Class7 | Class8 | Class9 | Class10 | Class11 | Class12 | Class13 | Class18 | Class19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DR0 | 559 | 824 | 675 | 372 | 517 | 2006 | 2359 | 2513 | 3087 | 1109 | 1507 | 397 | 946 | 461 |
| DR1 | 510 | 644 | 508 | 146 | 433 | 660 | 2398 | 1435 | 1300 | 995 | 1113 | 1951 | 644 | 436 |
| DR2 | 395 | 709 | 580 | 90 | 149 | 697 | 290 | 1010 | 1533 | 992 | 1151 | 1068 | 558 | 827 |
| DR3 | 563 | 573 | 648 | 137 | 187 | 470 | 509 | 544 | 2201 | 1218 | 881 | 1167 | 848 | 471 |
| DR4 | 458 | 450 | 686 | 145 | 78 | 1288 | 147 | 787 | 936 | 849 | 679 | 578 | 532 | 776 |
| DR5 | 388 | 444 | 585 | 62 | 224 | 329 | 708 | 214 | NA | 152 | 629 | 659 | 320 | 476 |
| DR6 | 275 | 301 | 374 | 78 | 90 | 345 | NA | 657 | NA | NA | 295 | 479 | NA | 274 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).