5.1. Definition of Data Structure and Sample
A very time-consuming phase was the definition of the basic statistical unit and of the data structure to be analyzed. Companies that collect telematics data generally record all the information related to individual trips (hereafter referred to as “trips”) for each policyholder. Each trip begins with the ignition of the vehicle and ends with its shutdown. Therefore, the black box records all the information related to the vehicle between these two instants. Such information is manifold and is also synthesized according to the needs of the company or its clients (insurance companies).
For the analyses reported below, a data structure was defined in which each record is represented by all the information regarding trips, aggregated for each policyholder, for each day, for each time band considered, and for each province crossed during the day.
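The record structure described above can be illustrated with a minimal sketch. The field names, the sample trips, and the two aggregated measures are hypothetical; the actual data contain many more measures, and a trip crossing several provinces is assumed here to have already been split by province.

```python
from collections import defaultdict

# Hypothetical trip-level rows:
# (policyholder_id, date, time_band, province, km, harsh_brakings)
trips = [
    ("P001", "2020-03-02", "day", "MI", 12.5, 1),
    ("P001", "2020-03-02", "day", "MI", 7.5, 0),
    ("P001", "2020-03-02", "evening", "MI", 20.0, 2),
    ("P001", "2020-03-02", "evening", "BG", 15.0, 0),
    ("P002", "2020-03-02", "day", "RM", 30.0, 3),
]

def aggregate_trips(trips):
    """Sum trip-level measures by (policyholder, day, time band, province)."""
    records = defaultdict(lambda: {"km": 0.0, "harsh_brakings": 0, "n_trips": 0})
    for policyholder, day, band, province, km, harsh in trips:
        rec = records[(policyholder, day, band, province)]
        rec["km"] += km
        rec["harsh_brakings"] += harsh
        rec["n_trips"] += 1
    return dict(records)

records = aggregate_trips(trips)
for key, rec in sorted(records.items()):
    print(key, rec)
```

Each key of `records` corresponds to one record of the data structure, so the two morning trips of the first policyholder in the same province collapse into a single row.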
For each crash, all the information regarding the “Characteristics of the policyholder” and the “Characteristics of the vehicle” is available, making it possible to associate each crash with the individual policyholder/vehicle as well as with the corresponding record in the data structure.
It should be noted that the crash detected does not exactly coincide with the insurance claim. The crash is automatically detected by the black box based on certain predefined parameters. In this specific case, a crash was defined as any event registering a variation, measured in g, greater than a fixed threshold value set at 3. Therefore, it often occurs that a crash is not associated with any claim, or that a claim is not detected as a crash (for example, claims occurring with the vehicle switched off or with impacts not exceeding the threshold). However, for a smaller sample, information regarding actual insurance claims was also provided in order to verify the relationship between the two events.
In the proposed application, it was decided not to use all the traditional information normally employed by insurance companies for rating purposes, which is usually easily available, such as engine displacement, fuel type, fiscal horsepower, or the policyholder’s occupation.
The definition of the sample was mainly influenced by the information available to the company and by the constraints on its dissemination and confidentiality.
Therefore, all the information available for this study was analyzed, despite some heterogeneity in terms of numerical size and time span. In particular, the sample related to the analyzed portfolio consists of all trips recorded for 100,000 Italian policyholders over a period of five years. With respect to claim costs, the analyzed sample consists of data regarding claims (both caused and incurred) for approximately 11,000 policyholders. The dataset pertains to Italy and was provided by a major telematics service provider collaborating with leading national insurance companies. The randomly selected sample is broadly representative of the Italian motor insurance market. The analyses properly account for each policyholder’s actual exposure, in terms of coverage duration and distance traveled, as well as the specific regulatory context of the period, during which black-box installation was mainly promoted for fraud-prevention purposes. Furthermore, telematics data are processed in compliance with Italian privacy regulations, using aggregated or anonymized information for insurance premium calculations.
5.2. Exploratory Analyses
Before proceeding with the evaluation of the rating coefficients used in the model, as well as the selection of the rating variables, an exploratory analysis of the available data and variables was carried out. All variables were analyzed considering as response variables both the claims frequency and the average cost of a single claim.
For some variables, such as Year of Registration, Age of the Policyholder, and Territorial Area, the categories were grouped into classes. In particular, the classes related to the Territorial Areas were derived through a cluster analysis. The classification was carried out using the k-means method: since the number of provinces was greater than 100, the output of a hierarchical procedure would have been difficult to read and interpret.
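As a rough illustration of the k-means idea, the sketch below runs Lloyd's algorithm on a single scalar per province. The clustering features actually used are not stated here, so the values (hypothetical province-level claims frequencies), the number of classes, and the deterministic initialization are all assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: centroids at evenly spaced positions of the sorted data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical claims frequencies (%) for nine provinces.
province_freq = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2, 9.9, 10.0, 10.1]
centroids, labels = kmeans_1d(province_freq, k=3)
print(labels)
```

The returned labels play the role of the territorial classes; with real data, several features per province and several random restarts would normally be used.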
One element that emerged from the initial exploratory analyses is the strong correlation among the variables considered, as can be observed from the graphs reported below in Figure 1, Figure 2 and Figure 3.
Figure 1: Km_urb, Km_ext and Km_hig denote the number of kilometers traveled on urban roads, extra-urban roads, and highways, respectively; n_over_urb, n_over_ext and n_over_hig denote the number of speed limit violations recorded on urban roads, extra-urban roads, and highways, respectively; and Km_over_urb, Km_over_ext and Km_over_hig denote the number of kilometers driven above the speed limits on urban roads, extra-urban roads, and highways, respectively.
Figure 2: Km_band1, Km_band2, Km_band3, Km_band4, Km_band5 and Km_band6 denote the number of kilometers traveled in the following time bands, respectively: 2–5, 6–9, 10–13, 14–17, 18–21, and 22–01.
Figure 3: Tot_SA_urb, Tot_SB_urb and Tot_SC_urb represent, respectively, the number of mild accelerations, the number of mild brakings, and the number of mild turns recorded on urban roads; Tot_HA_urb, Tot_HB_urb and Tot_HC_urb represent, respectively, the number of harsh accelerations, the number of harsh brakings, and the number of harsh turns recorded on urban roads.
Correlation structures (Figure 4) very similar to those reported previously in Figure 2 also emerge from the analysis carried out on the events recorded on extra-urban roads and highways, although their graphical representation is omitted here.
Figure 4: Km_MON, Km_TUE, Km_WED, Km_THU, Km_FRI, Km_SAT and Km_SUN represent, respectively, the number of kilometers traveled by the policyholders on Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
It should be noted that the correlations observed in the previous graphs must be regarded as spurious, since they are dependent on a third common variable. The third variable influencing all the others can be identified as the total kilometers traveled by the policyholder/vehicle. In fact, as one would logically expect, as the number of kilometers traveled annually by the policyholder increases, the exposure to risk also increases, and consequently all the values of those variables expressed in kilometers traveled or in the number of related recorded events (for example, the number of brakings performed in a year) also increase.
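The mechanism can be reproduced on synthetic data. In the sketch below, which does not use the study data, the urban share of mileage and the braking rate per kilometer are drawn independently, yet the resulting totals are strongly correlated because both scale with total kilometers; all distributions and ranges are arbitrary assumptions.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n = 2000
# Synthetic policyholders: total annual km is the common third variable.
total_km = [rng.uniform(2_000, 40_000) for _ in range(n)]
urban_share = [rng.uniform(0.2, 0.6) for _ in range(n)]          # independent of the rest
brakings_per_km = [rng.uniform(0.01, 0.03) for _ in range(n)]    # independent of the rest

km_urb = [s * t for s, t in zip(urban_share, total_km)]
n_brakings = [r * t for r, t in zip(brakings_per_km, total_km)]

raw_corr = pearson(km_urb, n_brakings)                   # driven by total_km
normalized_corr = pearson(urban_share, brakings_per_km)  # common driver removed
print(round(raw_corr, 2), round(normalized_corr, 2))
```

Dividing each variable by total kilometers removes the common driver, which is the rationale for working with shares and per-kilometer rates in the multivariate models.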
Figure 5 below shows the crash frequency recorded for the different classes of annual kilometers traveled. The annual claim frequency increases as the annual number of kilometers traveled increases, but less than proportionally. This phenomenon might suggest that driving experience, and thus traveling many kilometers per year, has a relatively positive effect on claims frequency. In reality, as will be shown in the multivariate analyses, it could be due to the kilometers traveled on highways, since those who drive many kilometers per year typically do so on highways.
In addition to this type of correlation, correlations were also found among similar variables related to the policyholder’s driving behavior: for example, those who brake frequently exhibit high values both for the variable “mild brakings” and for “harsh brakings,” and the same applies to accelerations and turns. The correlation among brakings, accelerations, and turns across different road types, however, is less evident.
These aspects have been duly taken into account in the multivariate analyses, applying transformations of the variables in order to avoid multicollinearity problems.
In order to pursue the primary objective of the rating phase, namely to determine a premium (in particular the fair premium) aligned as closely as possible with the specific claims experience of each assumed risk, univariate analyses of the variables were first carried out. These analyses made it possible to investigate the significance of the variables with respect to the claims experience of the risks under consideration.
Moreover, the variables were studied by considering separately, as response variables, the claims frequency and the average cost per claim.
Below are reported some of the variables examined for each model adopted:
5.2.1. Claims Frequency
For each variable, three indicators were determined: the claims frequency with risk exposure measured in vehicle-years, the claims frequency per 100,000 km traveled, and the claims frequency per 100 h of driving.
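The three indicators differ only in the exposure used as denominator. A minimal sketch, with hypothetical function and key names and made-up figures for one class of policyholders:

```python
def claims_frequencies(n_claims, vehicle_years, km, hours):
    """Claims frequency under three exposure measures."""
    return {
        "per_vehicle_year": n_claims / vehicle_years,
        "per_100k_km": n_claims / (km / 100_000),
        "per_100_h": n_claims / (hours / 100),
    }

# Hypothetical class: 5 claims, 100 vehicle-years, 1,000,000 km, 25,000 driving hours.
freq = claims_frequencies(n_claims=5, vehicle_years=100, km=1_000_000, hours=25_000)
print(freq)
```

Because kilometers and hours are closely related through average speed, the second and third indicators tend to move together within a class.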
The latter indicator was subsequently disregarded, as it does not provide additional information compared to the frequency per kilometer, being strongly correlated with it.
The following three tables (Table 1, Table 2 and Table 3) highlight a higher claims frequency for young drivers. Moreover, while the claims frequency determined using vehicle-years shows a sharp decrease for age classes above 30 years, this trend is less pronounced when considering the values reported per kilometer traveled and per hour of driving.
Finally, the data related to company cars present contrasting results when comparing the claims frequency calculated in the traditional way with those calculated on the basis of kilometers and hours of driving. In fact, the claims frequency of class G is in line with the average claims frequency of the portfolio, whereas the other two frequencies are well below the average.
One could conclude that the higher riskiness of company cars depends exclusively on their greater usage and, from the perspective of a per-kilometer tariff, this category should pay a lower premium than the portfolio average, disregarding for the moment the average cost per claim.
Below, we report the distribution of claims recorded in our sample, as well as the claims frequencies for each road type.
From Table 4, it can be observed that the claims frequency on urban roads is higher than on other road types, since more driving hours are spent on this type of road. It should be noted that the claims frequency per kilometer on highways is very low, but it tends to increase compared to other road types when exposure is measured in driving hours. This trend is easily explained by the fact that on highways more kilometers are traveled in less time.
In this case, a low claims frequency is observed for the classes of policyholders who drive few kilometers per year. However, these policyholders prove to be very risky when the kilometers traveled are taken into account, showing claims frequencies per kilometer much higher than the portfolio average (Table 5).
The same considerations made for urban roads also apply to extra-urban roads and highways, as reported in the following two tables (Table 6 and Table 7).
With regard to the analysis of the days of the week, after an initial examination of the individual days and based on the indications provided by the corresponding correlations, it was considered appropriate to group them into only two classes: “weekdays” (Monday to Friday) and “weekends” (Saturday and Sunday).
Below, the claims frequencies calculated for each class of annual kilometers traveled are reported separately for weekdays (Table 8) and weekends (Table 9).
Also in this case, after an initial analysis of the individual time bands and based on the indications provided by the corresponding correlations, they were grouped into three classes: “day” (from 06:00 to 13:59), “evening” (from 14:00 to 21:59), and “night” (from 22:00 to 05:59).
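The two groupings can be expressed as simple mappings from a timestamp. The function names are hypothetical; the class boundaries are those stated above.

```python
from datetime import datetime

def day_class(ts):
    """'weekday' for Monday to Friday, 'weekend' for Saturday and Sunday."""
    return "weekend" if ts.weekday() >= 5 else "weekday"

def time_class(ts):
    """'day' 06:00-13:59, 'evening' 14:00-21:59, 'night' 22:00-05:59."""
    if 6 <= ts.hour < 14:
        return "day"
    if 14 <= ts.hour < 22:
        return "evening"
    return "night"

ts = datetime(2021, 5, 8, 23, 30)  # a Saturday, late evening
print(day_class(ts), time_class(ts))
```

The "night" class wraps around midnight, which is why it is handled as the fall-through case rather than as a single hour interval.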
Below, the claims frequencies calculated for each class of annual kilometers traveled are reported separately for the three variables “day” (Table 10), “evening” (Table 11), and “night” (Table 12).
In general, exceeding the legally imposed speed limits is considered to be among the main causes of road accidents. In the following two tables (Table 13 and Table 14), the claims frequencies are reported, calculated on the basis of the annual kilometers traveled above the limits and on the number of speed limit violations recorded in a year.
Both variables show a claims frequency that increases with the number of kilometers traveled above the speed limit and with the number of speed limit violations, whereas the claims frequency per kilometer decreases as the annual kilometers and the annual number of speed limit violations increase. This latter result is strongly influenced by the fact that policyholders who drive more kilometers in a year exhibit a lower claims frequency per kilometer. This phenomenon may depend on the greater driving experience of the policyholder or on the fact that many of these kilometers are traveled on highways.
Below (Table 15), the claims frequencies calculated for different classes of annual kilometers traveled are reported.
For each of the other factors aimed at capturing the policyholder’s driving style (accelerations, brakings, and turns), two variables were created based on the severity of the recorded event. Accordingly, the variables analyzed are: mild and harsh accelerations (Table 16), mild and harsh brakings (Table 17), and mild and harsh turns (Table 18).
Below, the univariate analyses carried out on these variables defined as “harsh” are reported.
For these latter variables as well, a dependency on the number of kilometers traveled can be observed. In particular, for most variables it is noted that the claims frequency increases with the number of recorded DB events, but once the frequency is adjusted for the effect of kilometers traveled, the relationship between claims frequency and the number of DB events is reversed.
For each individual variable previously examined, the Wald test was performed in order to verify its statistical significance. Specifically, each variable was related to the response variables (claims frequency and claims frequency per kilometer) within a univariate model. Below (Table 19), the p-values resulting from the tests are reported.
All the variables considered individually, with the exception of some DB variables, turn out to be significant. It should be emphasized once again that this result is strongly influenced by the spurious correlation existing among the variables. With regard to DB variables, the only significant one is “harsh brakings”.
5.2.2. Average Claim Cost
Starting from the sample data on claim costs, the total cost borne by the company for each claim was calculated. In the analysis of average costs, the direct compensation system was taken into account.
For all the variables studied, the Kruskal–Wallis test was performed in order to verify their statistical significance. The Kruskal–Wallis test, also known as the nonparametric analysis of variance for a single classification factor, is a rank-based test that can be regarded as an extension of the Wilcoxon–Mann–Whitney test to more than two groups.
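For illustration, the H statistic of the test can be computed from pooled mid-ranks as in the sketch below. This is a plain standard-library implementation with the usual tie correction, not the software used in the study, and the claim-cost figures are hypothetical. Under the null hypothesis, H is approximately chi-square distributed with (number of groups minus 1) degrees of freedom.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Mid-ranks: tied values share the average of the positions they occupy.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                             # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    rank_sums = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    # If every observation is identical the correction is zero; return 0 then.
    return h / correction if correction else 0.0

# Hypothetical claim costs for three classes of a rating variable.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h)
```

A large H relative to the chi-square reference indicates that the distribution of claim costs differs across the classes of the variable.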
Below (Table 20), the results of the Kruskal–Wallis test performed on each individual variable are reported. In addition, for the variables found to be significant, the corresponding box plots (Figure 6, Figure 7, Figure 8, Figure 9) are provided.
Classes 1–5 (Figure 6) shown in the previous chart are decoded as follows: 0–400; 400–2500; 2500–5000; 5000–10,000; and greater than 10,000.
Figure 7.
Box Plot—Age Class.
Classes 1–4 (Figure 8) shown in the following chart are decoded as follows: 0–150; 150–700; 700–2000; and greater than 2000.
Figure 8.
Box Plot—No. Over Limit.
Classes 1–5 (Figure 9) shown in the following chart are decoded as follows: 0–500; 500–2000; 2000–4000; 4000–6000; and greater than 6000.
Figure 9.
Box Plot—Kilometers Traveled on Weekends.
5.3. Tariff Modeling Analysis
Two separate statistical models were constructed for claims frequency and for average claim cost. The models employed fall within the framework of Generalized Linear Models (GLMs) and, depending on the scope of the analysis, provide nonlinear regressions for the two phenomena.
In particular, for claims frequency three different distributions were examined:
Poisson Model:
GLM with logarithmic link function and Poisson probability structure:
Negative Binomial Model:
GLM with logarithmic link function and Negative Binomial probability structure:
Zero-Inflated Poisson Model:
GLM with logarithmic link function and probability structure given by the mixture of a degenerate distribution at zero and a Poisson distribution:
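As a reference sketch, the three specifications can be written in their standard textbook forms. The notation is an assumption and not necessarily the authors' parameterization: $Y_i$ is the number of claims of policyholder $i$, $\mu_i$ its mean, $x_i$ and $z_i$ are covariate vectors, $\beta$ and $\gamma$ are coefficient vectors, $k > 0$ is the Negative Binomial dispersion parameter, and $\pi_i$ is the zero-inflation probability, for which a logit link is assumed.

```latex
\begin{align*}
% Poisson model, logarithmic link
\Pr(Y_i = y) &= \frac{e^{-\mu_i}\,\mu_i^{y}}{y!},
  & \log \mu_i &= x_i^{\top}\beta \\[4pt]
% Negative Binomial model (NB2 form), logarithmic link
\Pr(Y_i = y) &= \frac{\Gamma(y + 1/k)}{\Gamma(y + 1)\,\Gamma(1/k)}
  \,\frac{(k\mu_i)^{y}}{(1 + k\mu_i)^{\,y + 1/k}},
  & \operatorname{Var}(Y_i) &= \mu_i + k\,\mu_i^{2} \\[4pt]
% Zero-Inflated Poisson model: point mass at zero mixed with a Poisson
\Pr(Y_i = 0) &= \pi_i + (1 - \pi_i)\,e^{-\mu_i},
  & \operatorname{logit}(\pi_i) &= z_i^{\top}\gamma \\
\Pr(Y_i = y) &= (1 - \pi_i)\,\frac{e^{-\mu_i}\,\mu_i^{y}}{y!},
  & y &\ge 1
\end{align*}
```

In the Negative Binomial form the variance exceeds the mean whenever $k > 0$, and the Poisson model is recovered as $k \to 0$; in the ZIP form, $\pi_i = 0$ gives back the Poisson model.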
Moreover, in order to account not only for exposure expressed in vehicles/year but also for that expressed in kilometers traveled, the following offset variables were introduced for each GLM: vehicles/year and kilometers traveled (divided by 100,000).
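With a logarithmic link, an exposure measure usually enters the linear predictor as the logarithm of the exposure with its coefficient fixed at 1. Assuming the offsets were specified in this way, and writing $E_i$ for the exposure of policyholder $i$ (a symbol introduced here for illustration), the two variants can be sketched as:

```latex
\begin{align*}
\log \mu_i &= \log E_i + x_i^{\top}\beta
  \quad\Longleftrightarrow\quad
  \frac{\mu_i}{E_i} = \exp\!\left(x_i^{\top}\beta\right), \\[4pt]
E_i &= \text{vehicle-years}_i
  \quad\text{or}\quad
  E_i = \frac{\text{km}_i}{100{,}000}.
\end{align*}
```

Under the first choice $\exp(x_i^{\top}\beta)$ is an expected claims frequency per vehicle-year; under the second it is an expected claims frequency per 100,000 km.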
All baseline models were constructed considering the following variables:
Age class;
Vehicle registration year;
Total kilometers traveled in a year by each policyholder;
Geographical area, constructed through a cluster analysis on the provinces included in the dataset;
Number of times the speed limit is exceeded per 1000 km on the three different types of roads (urban, extra-urban, highway);
Percentages of kilometers traveled on the three different types of roads relative to total kilometers traveled;
Percentages of kilometers traveled above the speed limit relative to total kilometers traveled on the three different types of roads;
Percentages of kilometers traveled on weekdays relative to the total kilometers traveled during the week;
Percentages of kilometers traveled on weekends relative to the total kilometers traveled during the week;
Percentages of kilometers traveled during daytime relative to the total kilometers traveled over 24 h;
Percentages of kilometers traveled during evening relative to the total kilometers traveled over 24 h;
Percentages of kilometers traveled during nighttime relative to the total kilometers traveled over 24 h.
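The relative covariates in the list above are ratios of the raw telematics totals. The sketch below derives the road-type ones for a single policyholder; the input keys follow the variable names used in the Figure 1 legend, while the output names, the function name, and the sample values are hypothetical.

```python
ROAD_TYPES = ("urb", "ext", "hig")  # urban, extra-urban, highway

def relative_covariates(row):
    """Turn annual totals into shares and per-1000-km rates by road type."""
    total_km = sum(row[f"Km_{r}"] for r in ROAD_TYPES)
    out = {}
    for r in ROAD_TYPES:
        km = row[f"Km_{r}"]
        # Share of total mileage driven on this road type.
        out[f"pct_km_{r}"] = km / total_km if total_km else 0.0
        # Speed limit violations per 1000 km on this road type.
        out[f"n_over_per_1000km_{r}"] = 1000 * row[f"n_over_{r}"] / km if km else 0.0
        # Share of this road type's mileage driven above the limit.
        out[f"pct_km_over_{r}"] = row[f"Km_over_{r}"] / km if km else 0.0
    return out

# Hypothetical annual totals for one policyholder.
row = {
    "Km_urb": 6000, "Km_ext": 3000, "Km_hig": 1000,
    "n_over_urb": 12, "n_over_ext": 9, "n_over_hig": 5,
    "Km_over_urb": 300, "Km_over_ext": 150, "Km_over_hig": 100,
}
covariates = relative_covariates(row)
print(covariates)
```

The three mileage shares always sum to 1, so only two of them carry independent information once an intercept is present in the model.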
The selection procedure used to determine a subset of significant explanatory variables was the backward method. Therefore, starting from the full model, variables whose inclusion did not improve, or even reduced, the model’s ability to explain the variability of the phenomenon were eliminated.
Below (Table 21 and Table 22), the Wald tests performed for all models constructed with the final variables are presented.
With regard to the models based on the Poisson and the Negative Binomial distributions, it should be noted that the final models are very similar to each other, both in terms of the variables selected and the p-values obtained for each variable. In both cases, among the most significant variables, as expected, are the total annual kilometers, age, and geographical area.
The model based on the ZIP distribution also highlights the same variables found to be significant in the other two models. It should be noted, in the tables reported below (Table 23 and Table 24), that the variable “total annual kilometers” is significant only for the “zero-inflation” component, i.e., for explaining the probability of having zero claims, whereas the “geographical area” is significant only for the Poisson component.
The goodness-of-fit measures calculated for the purpose of comparing the models are reported above (Table 24). Based on these measures, it can be concluded that the model based on the ZIP distribution is better able to explain the variability of the phenomenon, presenting a lower value of both the AIC and the Scaled Pearson statistic, and a higher value of the log-likelihood.
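For reference, the AIC combines the maximized log-likelihood with a penalty on the number of estimated parameters, AIC = 2p − 2 log L, so that a lower value is preferred. In the sketch below, the log-likelihoods and parameter counts are hypothetical placeholders chosen only to show the comparison; they are not the values reported in the tables.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical (log-likelihood, number of parameters) pairs, NOT study results.
candidates = {
    "Poisson": (-1250.0, 8),
    "Negative Binomial": (-1240.0, 9),
    "ZIP": (-1225.0, 12),
}
scores = {name: aic(ll, p) for name, (ll, p) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

A model with more parameters is preferred only if its gain in log-likelihood outweighs the penalty of 2 per additional parameter.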
In the following tables (Table 25 and Table 26), the tests are reported using kilometers traveled as the exposure, in order to identify the set of significant variables for claims frequency per kilometer.
As was highlighted for the models with exposure in vehicles/year, these latter two models (Poisson and Negative Binomial) also present the same variables in the final model, with very similar p-values.
In this latter model (Table 27), for the zero-inflation component, variables widely used in the personalization techniques adopted by insurance companies, namely age and geographical area, were not found to be significant. Age is significant only for the Poisson component.
It should be emphasized, however, that this model is less capable of explaining portfolio variability compared to the Poisson and Negative Binomial models, as can be seen from the following table (Table 28).
The Wald test performed for the model on average claim costs is reported below (Table 29).
Since the percentages of kilometers traveled on urban roads, extra-urban roads, and highways sum to 1, one of the three was removed from the model in order to avoid multicollinearity issues. From the test results, it can be concluded that the only significant information is that related to mileage on the different types of roads.