1. Introduction
The multivariate classification ratemaking has made significant advances in the past decade with the introduction of different types of statistical methods for pricing individual risks. These new techniques bring important benefits to insurers such as more equatable pricing of the individual risks, better competitive advantage, and protection against adverse selection and support informed decisions about the type of risk that the insurer is willing to write.
Werner and Modlin (
2016) introduced many of these techniques with a primary focus on generalized linear models (GLMs), which are used in pricing and reserving.
According to the Actuarial Review (
Baribeau 2016) “the predictive modeling is advancing far beyond its general linear model (GLM)-based roots due to the explosion of new data sources, technological innovation and advanced analytics.” Recent literature on extensions of GLMs with mixture modeling has been a part of this innovation process, and this is proposed by several researchers. A finite mixture of Poisson regression models with an application to insurance ratemaking was studied by
Bermúdez (
2012) for count data. The authors recognized that unobserved heterogeneity in this type of data requires more structure in the modeling techniques and that the current model fitting can be improved by using finite mixture of bivariate Poisson regressions. The unobserved heterogeneity in the zero-inflated type of data is attributed to the fact that variance is often larger than the mean. In 2015, another Poisson mixture model for count data was considered by
Brown and Buckley (
2015) for modeling heterogeneity in a Group Life insurance portfolio. The authors showed violation of heterogeneity across groups and suggested that putting similar groups together is necessary for further analysis. Later, a non-parametric Bayesian approach was considered in modeling this mixture distribution by incorporating a Dirichlet process prior and using reversible-jump Markov chain Monte Carlo (
Green 1995) to estimate the number of components.
Garrido et al. (
2016) and
Shi et al. (
2015) consider relaxing the GLM assumption of independence between the number and the size of claims. The multi-modality of univariate insurance losses was recently modeled using mixtures by
Miljkovic and Grün (
2016),
Verbelen et al. (
2014),
Lee and Lin (
2010), and
Klugman and Rioux (
2006). The findings of these studies indicate a need to explore the modeling of unobserved heterogeneity in a regression setting where the amount of claims is linked to several other covariates and the goal is to find a finite number of sub-populations of policyholders in an insurance portfolio.
In predictive modeling with a GLM setting, unobserved heterogeneity may occur when important covariates have been omitted and their influence is not accounted for in the analysis (
Grün and Leisch 2008). The unobserved heterogeneity may not be fully captured when a single component GLM is used to model the data set. While these techniques have been explored in other fields, they have not been fully adopted in the actuarial field. Actuaries may consider relaxing the assumption that the regression coefficients and dispersion parameters are the same for all observations. In this case, a goal is to find groups of policy holders with similar regression coefficients.
Multiple techniques have been developed which deal with the grouping of heterogeneous data such as hierarchical clustering (
Johnson 1967;
Kaufman and Rousseeuw 1990), association analysis (
Manly 2005), and
k-means clustering algorithm (
Jobson 1992;
Lewis et al. 2003;
McCune and Grace 2002). There are a number of clustering methods based on mathematical techniques such as association indices (
Chen et al. (
2011);
Wu et al. (
2008)), distance metrics (
Everitt et al. (
2001)), and matrix decomposition (
Manly (
2005);
Quinn and Keough (
2002);
Wu et al. (
2007)). However, these algorithms do not have a likelihood-based formulation and therefore do not provide a reliable method of model selection or assessment. A particularly powerful likelihood-based approach to one-dimensional clustering based on finite mixtures, with the variables in the columns being utilized to group the objects in the rows, is provided by
McLachlan and Basford (
1988),
McLachlan and Peel (
2004),
Everitt et al. (
2001),
Böhning et al. (
2007),
Wu et al. (
2008), and
Melnykov and Maitra (
2010).
The objective of this article it to review two recently proposed mixture-based clustering approaches for modeling unobserved heterogeneity in an insurance portfolio, which, to the best of our knowledge, have not yet been exploited by the practitioner in the actuarial science area. Moreover, we focus on modeling severity of losses as a function of several covariates arising from different sub-populations. These sub-populations are grouped into clusters based on a similar historical experience or well-defined similarity rules and the results of each group can be considered in underwriting and ratemaking. We particularly consider two different data modeling frameworks: mixture modeling of ordinal variables (for classified data based on the level of risk) and the mixture of GLMs for a mixed type of covariates (for individual data). As an illustration, we show how these two mixture models can be used to model an insurance portfolio and have complementary properties. Similarities and differences of both approaches are discussed in the context of the data structure available for the selected insurance portfolio.
In an insurance context, ordinal variables are often defined based on a risk classification with intrinsic order. For example, driver age can be treated as an ordinal outcome with the youngest drivers being associated with the highest risk propensity. Similarly, losses can be treated on an ordinal scale based on the intensity of claims. There are a variety of approaches to the modeling of ordinal data that properly respect the ordinal nature of the data.
Liu and Agresti (
2005) and
Agresti (
2010) described various proportional odds version models using adjacent–categories logits, cumulative logits
McCullagh (
1980), and continuation–ratio logits
McCullagh and Nelder (
1989). Our article focuses on the ordered stereotype model (OSM) introduced by
Anderson (
1984), which is more flexible than the most common models as a result of adding additional score parameters associated with the distance among ordinal categories.
The generalized linear mixed cluster-weighted model (known as CWM) was recently proposed by
Ingrassia et al. (
2015) as a flexible family of mixture models for fitting the joint distribution of a random vector composed of a response variable and a set of mixed-type covariates with the assumption that continuous covariates come from Gaussian distribution. The CWM method does not consider ordinal data; thus, we are also interested in the mixture-based clustering for OSM built to handle ordinal data previously proposed by
Fernández et al. (
2016). The ordinal data in insurance setting can be considered if the policyholders have been previously classified based on the level of risk (e.g., 1 is the lowest level of risk and 5 is the highest). This classification is often obtained for underwriting and risk management purposes. Both methods assume that an inherent clustering effect is present in the data; therefore, each sub-population of the variable of interest, such as claims, can be modeled separately. One of the main advantages of these approaches over other non model-based clustering techniques is their likelihood-based foundation because maximum likelihood theory provides estimators and model selection. Another advantage of these two methods is that they complement each other, i.e., they allow for flexibility in terms of data collection and categorization. If the collected data is all organized based on the ordinal levels corresponding to several risk classifications, then a mixture-based clustering method for ordered stereotype model would be a suitable tool to further test the heterogeneity in the data. For insurance data sets that are currently analyzed using existing GLMs, the CWM approach would allow for detecting an unobserved heterogeneity in the data by testing if more than a single component GLM fits the data better. If the CWM model provides a better fit to the data, then this model should replace a single component GLM, where applicable.
The remainder of this paper is organized as follows.
Section 2 introduces the formulation and model estimation for both mixture-based methods as well as the model selection criteria.
Section 3 is devoted to the presentation of the data set used in this study and the results obtained using these two approaches. Conclusions are summarized in
Section 4. The appendix provides additional results with respect to the analysis conducted as part of
Section 3.
3. Application
3.1. Data
The French motor claims data set, on 413,169 motor third-party liability policies, was accessed from the CASdatasets ((
Dutang and Charpentier 2016);
Miljkovic (
2017)) in R, for the purpose of our analysis. Both claim number and corresponding losses are available from the same portfolio of policyholders.
Charpentier (
2014) used the same data set to illustrate the modeling of frequency and severity of claims using various single component GLMs. In order to illustrate how OSM and CWM models are applied, we focus our analysis on Region 24 (R24), engine power
, and car brand category “Renault, Nissan, or Citroen” with the sample size of 1269 claims. The R24 has about 39% of the total French policies written, with engine power type
and “Renault, Nissan, or Citroen” being the most popular cars.
The variables of interest for our analysis are loss amount, driver age, car age, density, and exposure. When the CWM method is employed, loss amount, density, and exposure are treated as numerical continuous variables, while driver age and car age are modeled as categorical variables. The driver age and car age, coded as 99 (unknown), are excluded from the analysis.
Table 1 provides the summary of all variables used in both models.
We categorized the numerical variables into ordinal variables with the aim of applying the OSM model over the same data set. The ordinal variables correspond to level of risk, where 1 is the lowest level of risk and 5 is the highest. We assume that this classification has already been determined by the underwriting practices and our goal is to test if there is a need for further classification due to unobserved heterogeneity in the data. The OSM method is sufficiently flexible that it can be adopted to the different variables and a different number of ordinal levels, depending on the data.
In the following subsections, we present the analysis and the results based on the two modeling approaches presented in this paper.
3.2. OSM Results
An array of row (claim) clustering models (
1) with different numbers of clusters
G = 1:5 were fitted, and the information criteria measures AIC and BIC were computed. The results are summarized in
Table 2.
Both the AIC and the BIC indicate that the best model is the OSM version, including row (claim) clustering with
clusters with AIC = 23,584 and BIC = 23,685. Each row is allocated to the group to which the claim belongs with the highest posterior probability, where the mixing probabilities are
.
Figure 1 displays the resultant
clustering structures and their profiles. The scatter plot (left) displays the average fitted scores over the 5 variables, using a weighted average which accounts for the fitted spacings.
Appendix B explains the details of the calculation of these average scores.
Different color and shape points and color bars represent the resultant
claim clustering settings. Three groups are clearly distinguished in the scatter plot. This graph illustrates the conclusion from AIC/BIC that, among the row clustering models, the model with three claim groups is the best for our data. The bar plot (right) displays the profiles of the different clusters according to the estimated probability of the data response
being equal to the category
k in Equation (
2). We might conclude that the claims classified in the first group correspond to those with the lowest risk regarding the five categories, the ones in the second group have a more low-to-moderate risk, and the claims in the third group are those with more risk. An attractive feature of the stereotype regression model is that it allows one to test whether two adjacent categories are distinguishable. Since each ordinal response category
k is associated with a score parameter
, the spacing between adjacent
values shows us how similar or different the categories are in terms of the effect of claims (see
Agresti (
2010) (Section 4.3.5) and
Fernández et al. (
2016) (Section 1.2.2)).
Table A1 in
Appendix A shows the parameter estimates and their uncertainty for the model with
clusters. For this data set, the fitted score parameters were
,
,
,
, and
. Therefore, the distance between ordinal categories “1” and “2” (2.636) is greater than that between other adjacent categories. However, there is almost no spacing between categories “4” and “5”. This implies that the relative frequencies in these two categories are independent of the clustering structure. Therefore, retaining the distinction between “4” and “5” is not informative about the clustering structure. In that case, the model still holds with the same scores if the ordinal scale is collapsed by combining those two adjacent categories into one single response category. This spacing setting is illustrated in the spaced mosaic plot (
Fernández et al. 2014) in
Figure 2.
Common location measures in continuous response variables such as the mean or the median are not directly applicable in ordinal responses. However, the estimates of the scores parameters
,
provide a continuous scale in
. Re-scaling
conveniently as
,
, and
gives us a continuous scale in the range
, which we can use as a new numerical value in our data set, replacing the original ordinal response values
with the new adjusted spacing
. For instance, if initially
, the replacement would be
. We can then calculate average over this new data
to determine the different profiles of each cluster from our original heterogeneous data.
Table 3 shows the mean in the data set.
We can see that the cluster is composed by the claims with the largest losses, which corresponds to the youngest drivers, oldest cars, and largest density. The clusters and have very similar values apart from the car age, which makes them different.
3.3. CWM Results
Tools for generalized linear CWM are available in the statistical R package
flexCWM, developed by
Mazza et al. (
2017). The log-normal CWM was fitted to the following covariates: driver age, car age, density, and exposure. The model selection procedure based on the AIC and the BIC found three mixture components with their corresponding mixing probabilities as follows:
,
, and
.
Table 4 shows the summary results for log-likelihood, AIC, and BIC. The CWM function selects the best model based on the minimum value of BIC. In our analysis, the best model is detected when
and these results are shown in bold in
Table 4. The number of selected components is consistent with the OSM approach.
This analysis is done for Region 24, engine power , and car brand category “Renault, Nissan, or Citroen” with the sample size of 1269 claims. Within these settings, we can observe that the effect of clustering is present in the portfolio and three subpopulation of drivers are found to have similar characteristics. As a result 3-component GLM seems to provide better fit than a single component GLM based on BIC criteria.
The left display of
Figure 3 shows the distribution of losses by group and driver age while the right display shows the distribution of losses by group and car age. It is apparent that the highest losses are generated by the youngest drivers (age group 18–22), classified in Group 1. Additionally, the second highest losses are reported by drivers aged 43–74 in Group 1. Old cars, 10–15 years old, are associated with the largest losses in Group 1. A combination of young drivers (less than 23 years old) and old cars (10–15 years) are indicators for the largest losses in this portfolio. A classification of drivers in R24 can be done based on a level of risk to which they are exposed.
Table A2 in
Appendix A shows the results for the estimated coefficients in each component of the CWM model. Driver age and car age are coded as factors with Level 1 being omitted. Thus, the coefficients shown in the output for driver’s age by category are estimated relative to the youngest drivers (less than 23 years old). We can observe that the significance of the coefficients in each cluster vary greatly across these clusters. The significance is coded as
(***), 0.001 (**), 0.01 (*), and 0.05 (.), corresponding to the
p-value of the specific coefficient. Coefficients in Cluster 3 are all significant, while most of the coefficients associated with driver age and car age are not significant in the first two clusters. Coefficients associated with driver age in Cluster 3 are all positive and increase with age showing that losses increase with age. The 75+ age group generate on average more losses compared to the <23 age group. In Clusters 1 & 2, we observe the negative coefficients for drivers age, indicating a reverse trend in losses for these two groups. Here, we can see that, in these groups, younger drivers generate the highest losses, and, as age increases, losses tend to decrease on average, holding other variables constant. Similarly, the coefficients for car’s age by category are estimated relative to the newest cars (less than 1 year old). These coefficients are all significant in Cluster 3 and their values are positive. We observe that older cars generate more losses on average, compared to newer cars in Cluster 3. In Cluster 1, the car age coefficients are mostly negative while in Cluster 2 they are positive. The sign and magnitude of the coefficients change depending on the relative center of each cluster. Density variable is highly significant in all three clusters. Exposure variable is significant in Cluster 1 only. Here, the finite mixture modeling allows us to model a natural representation of heterogeneity in three latent groups of policyholders in this insurance portfolio. Based on the BIC, the model fit is improved by using three component GLM rather than one component. The interpretation of the coefficients can be done for each cluster separately to account for the “observed” heterogeneity within each subpopulation of the corresponding cluster.
4. Conclusions
This article reviews two recently developed mixture-based clustering approaches applied to an insurance portfolio in order to find the optimal number of clusters and test the assumption of heterogeneity. While these approaches have been used in different fields, to the best of our knowledge, they have so far not been applied in the actuarial field for testing the unobserved heterogeneity in the insurance portfolio. These two methods allow for flexibility in terms of data collection and categorization and allow for further assessment of the underwriting risk. If all the collected data are organized based on the ordinal levels corresponding to several risk classifications, then a mixture-based clustering method for OSM would be a suitable tool to further test heterogeneity in the data. For the insurance data sets that are currently analyzed using existing GLMs, the CWM approach would allow for finding multiple GLM components when modeling frequency and severity of claims. The following family of distributions are supported by CWM with their link functions: binomial, poisson, t-distribution, log-normal, gamma, and inverse gaussian. If the multi-component CWM model provides a better fit to the data then a single component GLM, we recommend that this clustering effect is considered in practice. The OSM and CWM approaches can be used by practitioners and academics in the actuarial area. We have shown that they have complementary properties.
In a mixture-based clustering method for OSM, when the data is organized based on ordinal levels, the effect of a few extreme losses needs to be tempered down. Allocating these extreme losses in an ordinal, “high-risk” category makes them less influential in the model fitting, giving broad categories which enable us to detect major overall patterns. This enables the inclusion of all of the different levels of dispersion across ordinal categories. Additionally, assigning scores to ordinal categories provides an easy way of showing the descriptive statistics. If practitioners have knowledge about the score for each of the ordered categories, assigning scores might be the best way to analyze the data, because ordinary linear models can be applied. However, if practitioners do not have any predetermined idea about the spacing between adjacent categories, the use of an ordinal stereotype model is convenient, as the data dictates non-equally spaced scores among ordinal outcomes. The estimation of the spacing among ordinal responses is an improvement over other ordinal data models, such as the proportional odds model and continuation-ratio model, although more research in performance comparison with other equivalent methods is needed.
For illustration of these methods, we analyzed a small data set of French motor insurance claims in Region 24 using the methods discussed in this paper. We found three clusters in the data using both the CWM and OSM approaches and therefore show the presence of unobserved heterogeneity. Due to the nature of an insurance portfolio, unobserved heterogeneity is most likely to exist in almost all insurance portfolios. Thus, current modeling techniques should be extended to account for mixture-based clustering that will allow practitioners to detect additional sub-populations of the policy holders based on the same set of characteristics within an existing portfolio, and make the appropriate pricing and risk management decisions for each group. These two methods can also assist practitioners with the underwriting selection process as they have good complementary properties.
Finally, the future research may focus on comparing clustering structures resulting from different methodologies. Over the same data set, one can apply three measures in common that are used to compare clusterings: the Adjusted Rand Index (ARI), the Variation of Information (VI), and the Normalized Information Distance (NID) proposed by
Hubert and Arabie (
1985),
Meila (
2005), and
Kraskov et al. (
2005), respectively.