Abstract
This study introduces a novel method for driving risk assessment based on the analysis of near-miss events captured in telematics data. Near-miss events, which are highly correlated with accidents, are employed as proxies for accident prediction. Histogram-based gradient boosting regressors (HGBRs) are applied to telematics data, and the results are compared across datasets from China and Spain. The results demonstrate that HGBR outperforms conventional generalized linear models, such as Poisson regression and negative binomial regression, in predicting driving risks. Furthermore, the findings suggest that near-miss events could serve as a substitute for traditional claims in calculating insurance premiums. Machine learning algorithms therefore offer the prospect of more accurate risk assessment and insurance pricing.
1. Introduction
A “near miss” is an event that causes no property damage or personal injury but could have done so with a slight shift in time or location. In the field of traffic safety, a near miss, also known as a near accident or near collision, refers to an operation that involves no personal or property loss but carries high risk. It does not turn into an accident, but from a probabilistic point of view, the more near-miss events there are, the more likely it is that an accident occurs []. Near-miss events, which are closely tied to traffic safety, are therefore a subject of increasing interest []. Near-miss events can be collected directly through telematics, but modeling them additionally allows us to evaluate risk dynamically and to predict potential accidents before they occur, thus enhancing preemptive safety measures. For example, a driver with an aggressive driving style may record no near misses for a period of time, yet a regression on that driver’s data may still yield a positive expected number of near-miss events. In other words, regression analysis of telematics data can uncover risks that the raw counts do not reveal. In this study, a new driving risk assessment method is proposed, which uses the results of the near-miss event estimation model to calculate the driving risk of each driver. This method has been verified on datasets from China and Spain.
Vehicles and drivers are the main participants in traffic accidents, and their daily travel is inseparable from auto insurance, which provides security. Motor insurance is compulsory in most countries to protect those who should be compensated for losses caused by vehicles in motion. The premium calculation of auto insurance, which is based on the determination of claim risks and leads to insurance pricing, has always been the issue of greatest concern to both policyholders and insurers. Historically, insurers have relied on the number of accidents during the insured period. In fact, insurers only observe the number of claims, because policyholders may not report an accident when claiming would trigger a penalty on the next term’s price. In practice, if a minor accident occurs, most drivers avoid paying higher premiums by not claiming []. Policies with no claims for several years tend to be the majority in most portfolios, making it difficult to obtain useful information that truly reflects the risks of driving []. With the development of telematics, usage-based insurance (UBI), a new type of vehicle insurance, can draw on more data attributes to assess driving risk and then determine the auto insurance premium []. The method of scoring driving risk based on expected near-miss events proposed in this study is a new attempt in the field of UBI. It is crucial to develop UBI systems that can provide personalized insurance premiums based on real-time driving behavior.
Research on UBI has been ongoing for years. Initially, researchers included driving mileage alongside traditional auto insurance factors to assess driving risks and set premiums. This early mileage-based approach was known as pay-as-you-drive (PAYD) [,,]. Later, as data from the Internet of Vehicles became available, factors related to driving behavior and driving conditions were introduced into the rate-making model, which was called pay-how-you-drive (PHYD) [,,]. In the future, if 5G communication technology and in-car processors become widely used, UBI schemes will be called manage-how-you-drive (MHYD), which will enable drivers to analyze their competence at the wheel and allow for calculating insurance premiums in real-time []. The driving risk assessment in this study is based on the near-miss event prediction model supported by telematics data. Still, the utilization rate of vehicles is also considered, so it should be regarded as a combination of PAYD and PHYD.
Driving risk scores are used in numerous contexts, primarily for pricing and risk analysis. This latter function can inform drivers about their performance or help insurance companies classify customers, though it is typically used internally or for marketing purposes. There are various methods for assessing driving risk. Many researchers advocate for methods based on the traditional generalized linear model (GLM), which is the most widely used in the insurance field. Linear regression [], logistic regression [,], quantile regression [], Poisson regression [,], zero-inflated Poisson regression [,], negative binomial regression [,], zero-inflated negative binomial regression [,], panel data regression [], generalized additive model [,], etc., are widely used in driving risk assessment due to their good interpretability. On the other hand, black-box algorithms in machine learning have also been used in UBI research, such as cluster analysis [,], decision tree [], support vector machine [], neural network [], gradient boosting method [], and other relevant models [,]. In recent years, scholars have combined the boosted generalized linear model with machine learning to study UBI by taking advantage of both [,,]. Nevertheless, the advent of these novel methodologies may encounter obstacles due to regulatory constraints in certain jurisdictions where the application of black-box predictive analysis is prohibited [].
The role of near-miss events in driving risk assessment and premium pricing is quite flexible and can be approached from multiple perspectives. Guillen et al. [] analyzed three types of near-miss events, i.e., cornering, braking, and accelerating, as independent variables and proved that both traditional and telematics variables are relevant to risk factors. Sun et al. [] employed Poisson regression and negative binomial regression to analyze four types of near-miss events as dependent variables in summary datasets and panel datasets, respectively. This not only confirmed the significant influence of certain driving risk variables but also identified specific driving risk factors for each driver. Guillen et al. [] proposed a new method for determining auto insurance rates, using historical claims as the dependent variable and near-miss events as the independent variable. They calculated influence coefficients via the log-link function and incorporated these into the pricing model to complete rate-making. Guillen et al. [] utilized telematics data from 19,214 drivers over 55 weeks to develop predictive models for weekly accident frequency. They demonstrated that incorporating behavioral and contextual factors significantly enhances risk assessment. This approach also highlights potential usage-based insurance schemes through Poisson regression-derived driving scores and personalized safety alerts. This paper builds on previous studies but uses data that lacks information on accidents or claims. Here, the frequency of each near-miss event is treated as a dependent variable to evaluate driving risk.
The rest of this paper is organized as follows. The data structure, data description, and variable availability of the datasets from China and Spain are presented in Section 2. Section 3 presents the generalized linear model and machine learning algorithm used in this research. Section 4 presents the results of the two types of models on the two datasets. The similarities and differences are compared and discussed. Section 5 summarizes the findings and shortcomings of this study.
2. Data Description
Two telematics datasets, one from China and one from Spain, are included in this study. Although the two datasets come from different sources, their data attributes overlap in part (see details in Table 1). The two near-miss counts available in both datasets, Harshacceleration and Harshdeceleration in the dataset from China and nearmiss_accel and nearmiss_brake in the dataset from Spain, are selected as the dependent variables of this study. It is worth noting that there are no common attributes in the driving behavior category, but they are all taken into account because they carry important information about driving risk. Similarly, in the driving duration category, while and in the dataset from China and and in the dataset from Spain do not appear in the other dataset, they should not be ignored because they contain key information. By the same token, in the driving distance category, in the dataset from China and , kmh and kmh in the dataset from Spain are retained.
Table 1.
Types, names, and definitions of the data attributes from the China telematics dataset and the Spain telematics dataset.
The two datasets have their own data structures. Their main shared feature is that neither contains accidents or claims; instead, frequent near-miss events serve as indicators of driving risk. Unlike accidents and claims, which occur only a few times a year, near-miss events occur more frequently and are easily captured by telematics sensors. The dataset from China contains telematics data from 261 vehicles over six days, as shown in Table 2. Notice that the values of and are both non-negative integers, which suggests that Poisson regression models could be used. In contrast, the dataset from Spain consists of 285 connected vehicles observed over periods ranging from 1 to 194 days (see Table 3). Notice that and are also non-negative integers, but their variances are much higher than their means, which suggests that negative binomial regression may model the frequency of near misses better than Poisson regression. In general, negative binomial regression is more appropriate when the count data are overdispersed, but if the overdispersion is mainly caused by a few outliers, the Poisson regression model may be more robust. In other words, the negative binomial regression model may be affected by outliers, leading to biased parameter estimates and, thus, weaker predictive performance. Note that the median, 75% quantile, and maximum value of the variable are equal, which might indicate repeated values caused by sensor errors. Consequently, this variable has been ignored in this study.
Table 2.
The data description for the dataset from China includes the count, mean, standard deviation, and quartiles.
Table 3.
The data description for the dataset from Spain includes the count, mean, standard deviation, and quartiles.
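The comparison of mean and variance described above can be sketched as a simple dispersion check. The example below is only an illustration: the function name `dispersion_index` and the two count series are hypothetical and are not taken from either dataset.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the variance-to-mean ratio of a sample of counts.

    A ratio close to 1 is compatible with a Poisson model, while a ratio
    well above 1 points to overdispersion.
    """
    values = [float(c) for c in counts]
    m = mean(values)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(values) / m  # sample variance (n - 1 denominator)


# Hypothetical near-miss counts, not taken from either dataset.
steady_counts = [2, 2, 2, 2]
bursty_counts = [0, 0, 0, 8]

print(dispersion_index(steady_counts))
print(dispersion_index(bursty_counts))
```

Both series have the same mean, but only the second is overdispersed, which is the situation in which a negative binomial model is usually considered.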
Given the large number of data attributes, a correlation analysis of the variables was conducted prior to modeling. In the dataset from China, as shown in Figure 1, each type of variable exhibits a certain degree of internal correlation. Notably, the driving distance variables demonstrate the strongest correlation. There is a positive correlation between driving behavior variables and driving distance variables, whereas the correlation between driving duration variables and the other two types of variables is weak or even negative. In the dataset from Spain (see Figure 2), driving distance variables and driving duration variables show an obvious positive correlation, while driving behavior variables have no obvious correlation with other variables. Interestingly, the linear correlation between dependent variables and independent variables is weak in the dataset from China but strong in the dataset from Spain. Since the correlation between independent variables can affect model parameter identification and the assessment of causality, it is necessary to conduct multicollinearity tests and make trade-offs before modeling. Alternatively, regularization terms can be added to mitigate the effects of multicollinearity during modeling. As shown in Table 4 and Table 5, after eliminating variables with an excessive variance inflation factor (VIF), the remaining variables passed the multicollinearity test. This work is not limited to generalized linear models, however, and the number of variables involved in this study can be easily handled by a machine learning algorithm, which is well suited to multidimensional data.
Figure 1.
Correlation of variables from the dataset from China. Red represents linear positive correlation and blue represents linear negative correlation; the darker the color, the stronger the linear correlation.
Figure 2.
Correlation of variables from the dataset from Spain. Red represents linear positive correlation and blue represents linear negative correlation; the darker the color, the stronger the linear correlation.
Table 4.
Variance inflation factor and inverse variance inflation factor for the dataset from China.
Table 5.
Variance inflation factor and inverse variance inflation factor for the dataset from Spain.
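As an illustration of the multicollinearity screening, the VIF of each predictor can be read from the diagonal of the inverse correlation matrix of the predictors. The sketch below uses only the standard library; the function names and the two toy columns (`trips`, `distance`) are hypothetical, and the elimination threshold used in this study is not reproduced here.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = sum(col) / n
        centered.append([v - m for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF_j is the j-th diagonal entry of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictor columns with a pairwise correlation of 0.8.
trips = [1.0, 2.0, 3.0, 4.0]
distance = [1.0, 3.0, 2.0, 4.0]
vifs = variance_inflation_factors([trips, distance])
inverse_vifs = [1.0 / v for v in vifs]  # the "inverse VIF" (tolerance)
print(vifs, inverse_vifs)
```

With two predictors the VIF reduces to 1 / (1 - r^2), so a correlation of 0.8 gives a VIF of about 2.78 for both columns.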
3. Methods
3.1. Modeling
Conventional GLMs are preferred and assignable in the UBI scenario []. In the context of a Poisson regression model, the conditional expectation function is a non-negative function of a vector of explanatory variables. This is analogous to the negative binomial regression model, where the conditional expectation function is also defined by a log-link function, as follows:
where denotes the expectation of , i denotes the number of the observation, denotes the risk exposure variables, ... represent the independent variables that have passed the VIF multicollinearity test, and constant and ... are unknown parameters that need to be estimated. It should be noted that, to facilitate comparative analysis of the results, the exposure variables for both datasets in this study are measured in days. In practice, however, the value of the exposure variable can vary depending on the actual situation [].
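For reference, the log-link specification described above can be written out as a sketch. The symbols used here ($Y_i$ for the near-miss count, $t_i$ for the exposure in days, $x_{i1},\dots,x_{ik}$ for the retained independent variables, and $\beta_0,\dots,\beta_k$ for the parameters) are assumed for illustration, as is the treatment of exposure as an offset.

```latex
\begin{equation*}
\ln E(Y_i) = \ln t_i + \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
\quad\Longleftrightarrow\quad
E(Y_i) = t_i \exp\!\Bigl(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\Bigr).
\end{equation*}
```

Under this form, Poisson regression and negative binomial regression share the same mean function and differ only in the variance they assume around it.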
In contrast to the preceding studies, a gradient boosting method is introduced to deal with the potentially large data volume in the practical work of UBI. The HGBR algorithm is an improved gradient boosting regression tree (GBRT) algorithm based on histograms []. The basic idea of HGBR is to discretize continuous floating-point feature values into integer-valued bins and construct a histogram of width k (that is, with k bins). When traversing the data, the histogram accumulates statistics using the discretized values as indices. After one pass over the data, the histogram holds the required statistics, and the algorithm then scans the discrete histogram values to find the optimal split point and build the gradient boosting tree. Compared with relying on sorted continuous values, this greatly reduces the number of split points to consider. Hence, the HGBR algorithm offers advantages such as low memory consumption, reduced computational costs, high cache utilization rate, and a clear construction process. This is why similar iterative processes are found in algorithms like XGBoost and LightGBM []. It is well-known that the auto insurance industry is characterized by high-dimensional and large-volume data, and the future expansion of telematics is expected to lead to even more explosive growth in data volume and complexity. Such data characteristics fall within the scope of application of the HGBR algorithm.
It is important to note that the HGBR loss function adds an L2 regularization term based on the traditional GBRT loss function:
where n represents the sample size, represents the output value of the tth decision tree; m represents the number of leaf nodes of the tth decision tree; denotes the optimal value of the jth leaf node; and represents the regularization coefficient. Since the output of this study must be non-negative, half the Poisson deviance is selected as the base loss function; see Equation (3).
Equation (2) can be obtained through a series of approximate derivations, such as the Taylor expansion, as follows:
where represents the first-order derivative of Equation (3), and represents the second-order derivative of Equation (3). When each leaf node region takes the optimal solution , the minimum loss function is as follows:
Then, the optimal split point of each leaf node region can be determined by analyzing the change in Equation (5) before and after the split. Assume that the sum of the first-order derivatives of the loss function for the parent leaf node before the split is , and the sum of the second-order derivatives is . After the split, the sum of the first-order derivatives for the left leaf node is , and the sum of the second-order derivatives is . The sum of the first-order derivatives of the right leaf node after the split is , and the sum of the second-order derivatives is . The loss gain before and after the node split is defined as follows:
The split point that maximizes Equation (6) before and after the leaf node region split is the optimal split point. The pseudocode of HGBR is shown in Algorithm 1. The HGBR process in this study is implemented with the help of sklearn.ensemble.HistGradientBoostingRegressor.
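The displayed forms of Equations (2) to (6) are not reproduced in this text. A standard derivation that is consistent with the definitions given above is sketched below; the symbols ($f_t$, $w_{tj}$, $\lambda$, $g_i$, $h_i$, $G$, $H$) and the factor $\tfrac{1}{2}$ on the regularization term are assumptions, and the original equations may differ in detail.

```latex
\begin{align*}
% (2) regularized loss at boosting round t
\mathcal{L}_t &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr)
               + \frac{\lambda}{2}\sum_{j=1}^{m} w_{tj}^{2} \\
% (3) half Poisson deviance as the base loss
l(y, \hat{y}) &= y \ln\frac{y}{\hat{y}} - y + \hat{y} \\
% (4) second-order Taylor expansion, constants dropped, grouped by leaf
\mathcal{L}_t &\approx \sum_{i=1}^{n}\Bigl[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2}\Bigr]
               + \frac{\lambda}{2}\sum_{j=1}^{m} w_{tj}^{2}
               = \sum_{j=1}^{m}\Bigl[G_j w_{tj} + \tfrac{1}{2}\,(H_j + \lambda)\, w_{tj}^{2}\Bigr] \\
% optimal leaf value and (5) the resulting minimum loss
w_{tj}^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad
\mathcal{L}_t^{*} = -\frac{1}{2}\sum_{j=1}^{m} \frac{G_j^{2}}{H_j + \lambda} \\
% (6) gain of splitting a parent leaf (G, H) into left and right children
\mathrm{Gain} &= \frac{1}{2}\left[\frac{G_L^{2}}{H_L + \lambda}
               + \frac{G_R^{2}}{H_R + \lambda}
               - \frac{G^{2}}{H + \lambda}\right]
\end{align*}
```

Here $g_i$ and $h_i$ are the first- and second-order derivatives of the base loss with respect to the current model output for observation $i$, and $G_j$ and $H_j$ are their sums over the observations that fall in leaf $j$.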
Algorithm 1.
HGBR on near-miss.
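The histogram-based split search described in this section can be illustrated with a small sketch. It is neither the authors' Algorithm 1 nor the scikit-learn implementation: the function names are hypothetical, equal-width binning is used for simplicity, the data are invented, and the gradients assume a Poisson loss with a log link and a constant initial prediction.

```python
def bin_feature(values, n_bins):
    """Map each continuous value to an integer bin index (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant features
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def best_histogram_split(bins, grads, hess, n_bins, lam=1.0):
    """Find the bin boundary with the largest regularized loss gain."""
    # One pass over the data: accumulate gradient statistics per bin.
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for b, g, h in zip(bins, grads, hess):
        g_hist[b] += g
        h_hist[b] += h

    g_total, h_total = sum(g_hist), sum(h_hist)
    parent_score = g_total ** 2 / (h_total + lam)

    # Second pass over the bins only: at most n_bins - 1 candidate splits.
    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0.0 or h_right == 0.0:
            continue  # one side would be empty
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam)
                      - parent_score)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Hypothetical feature values and near-miss counts.
x = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
y = [0, 0, 1, 0, 4, 5, 3, 4]

# Poisson loss with log link: gradient = y_hat - y, hessian = y_hat.
y_hat = sum(y) / len(y)
grads = [y_hat - yi for yi in y]
hess = [y_hat for _ in y]

bins = bin_feature(x, n_bins=4)
split_bin, gain = best_histogram_split(bins, grads, hess, n_bins=4, lam=1.0)
print(bins, split_bin, gain)
```

Only the bin boundaries are scanned as candidate split points, rather than every distinct sorted feature value, which is the source of the computational savings discussed above.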
3.2. Driving Risk Factor
According to Chinese regulations, China’s auto insurance premium rate-making model is as follows:
where the benchmark premium is based on static characteristics such as the vehicle brand, engine capacity, driver age, ..., and the additional fee rate r is based on the additional items purchased by the policyholder. Only the rate adjustment factor can be adjusted by the insurance company. The rate adjustment factor comprises several components, such as a traffic violation factor, a no-claim discount, and a driving risk score factor, which is the focus of this study. The idea of the PAYD mode is that the more a vehicle is used, the greater the probability that accidents or near-miss events occur. Using this idea, a risk factor based on usage can be obtained as follows:
where represents a variable that measures the vehicle utilization rate, which is positively correlated with near-miss events, denotes the estimated value obtained by the predictive model (note that we assume that ), goes from 0 to 1. In this study, denotes the number of driving trips per day.
The relationship between originally observed and predicted risk events requires special attention. The predicted expected frequency is the Poisson expectation computed from the regression model. If the original value is greater than the predicted value, the model underestimates the driving risk, and the driving risk factor will naturally be higher. Conversely, if the original value is less than the predicted value, the model overestimates the driving risk, and the corresponding risk factor is lower. Given the above, and assuming that the original values obey the Poisson distribution, the near-miss event risk factor can be obtained from the Poisson cumulative distribution function, as follows:
where y represents the original value of near-miss events, represents the predicted value of near-miss events, and the goes from 0 to 1. By adding the above two factors, the driving risk factor—considering both vehicle utilization rate and near-miss event probability—can be obtained as follows:
where , ranging from 0 to 1 (0.5 in this study), represents the proportion of near-miss events that are taken into account in the driving risk factor, goes from 0 to 1, the higher the value, the greater the driving risk. In practical applications, each type of risk factor can be weighted and averaged, which will not be discussed in this study.
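A minimal sketch of the near-miss risk factor and its combination with a usage factor is given below. Several points are assumptions rather than the authors' formulas: the near-miss factor is taken as the Poisson cumulative probability of the original count given the predicted mean, the usage-based factor is passed in as a value already scaled to the interval from 0 to 1 because its formula is not reproduced here, and the two factors are combined as a weighted sum with weight `alpha` on the near-miss factor.

```python
import math


def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu), with y a non-negative integer."""
    if y < 0 or mu < 0:
        raise ValueError("y and mu must be non-negative")
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for k in range(1, y + 1):
        term *= mu / k    # P(Y = k) from P(Y = k - 1)
        total += term
    return min(total, 1.0)


def driving_risk_factor(observed, predicted, usage_factor, alpha=0.5):
    """Weighted sum of the near-miss factor and a usage factor in [0, 1].

    alpha is the share given to the near-miss factor (0.5 in this study);
    the weighted-sum form itself is an assumption of this sketch.
    """
    if not 0.0 <= usage_factor <= 1.0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("usage_factor and alpha must lie in [0, 1]")
    near_miss_factor = poisson_cdf(observed, predicted)
    return alpha * near_miss_factor + (1.0 - alpha) * usage_factor


# Hypothetical driver: 3 observed near misses, 1.5 predicted, usage factor 0.4.
print(driving_risk_factor(observed=3, predicted=1.5, usage_factor=0.4))
```

For a fixed predicted mean, a larger observed count gives a larger cumulative probability, so a driver whose observed near misses exceed the prediction receives a higher factor, in line with the reasoning above.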
4. Results and Discussions
Assuming that each near-miss event (as a dependent variable) obeys the Poisson distribution, the Kolmogorov–Smirnov test was conducted on each of them, and the test results (see Table 6) showed that none of the four near-miss events conformed to the standard Poisson distribution. This may bias the estimation results of Poisson regression. In order to compare the two regressions, Poisson regression and negative binomial regression were estimated on the dataset from China and the dataset from Spain, respectively, and their coefficient estimates and significance results are shown in Table 7 and Table 8. It is worth noting that, prior to the regressions, the independent variables were standardized. Totals such as brakes, trips, and distance were divided by the exposure period, thereby converting them into average daily rates (with the “” suffix). This modification ensures that the values of the independent variables are comparable across insureds with varying exposure periods, thereby enhancing the robustness and interpretability of the model. From the regression results, whether in the dataset from China or Spain, most of the independent variables in the Poisson regression demonstrate significant effects, regardless of which near-miss event is used as the dependent variable. However, this widespread significance probably indicates that the variances of the estimators are understated. Furthermore, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) of the negative binomial regressions are smaller than those of the Poisson regressions for the same variables, the log-likelihood of the negative binomial regression is higher than that of the Poisson regression, and the dispersion parameter is significantly greater than zero. These all imply that the negative binomial regression is more convincing than Poisson regression at the parameter estimation level. Since this study focuses on obtaining an accurate prediction model, further validation is needed to compare the predictive accuracy of the two models.
Table 6.
Kolmogorov–Smirnov test for four near-miss events.
Table 7.
Results of Poisson regression and negative binomial regression for the variables Harshacceleration and Harshdeceleration from the dataset from China.
Table 8.
Results of Poisson regression and negative binomial regression for the variables nearmiss_accel and nearmiss_brake from the dataset from Spain.
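The AIC and BIC comparison between the two count models can be illustrated with an intercept-only sketch. The counts and the dispersion value `alpha` are hypothetical, and the dispersion is fixed by hand rather than estimated, so this shows only how the criteria respond to overdispersion, not the fitted results in Table 7 and Table 8.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of counts under a Poisson model with constant mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def negbin_loglik(counts, mu, alpha):
    """Log-likelihood under a negative binomial (NB2) model.

    The variance is mu + alpha * mu**2, so alpha > 0 measures overdispersion.
    """
    r = 1.0 / alpha
    p = r / (r + mu)
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log(1.0 - p)
        for y in counts
    )


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical, strongly overdispersed near-miss counts.
counts = [0, 0, 0, 0, 0, 0, 0, 20]
mu = sum(counts) / len(counts)

ll_poisson = poisson_loglik(counts, mu)
ll_negbin = negbin_loglik(counts, mu, alpha=5.0)  # alpha chosen by hand

aic_poisson, aic_negbin = aic(ll_poisson, 1), aic(ll_negbin, 2)
bic_poisson = bic(ll_poisson, 1, len(counts))
bic_negbin = bic(ll_negbin, 2, len(counts))
print(aic_poisson, aic_negbin, bic_poisson, bic_negbin)
```

Even though the negative binomial model pays a penalty for its extra parameter, its much higher log-likelihood on overdispersed counts yields smaller AIC and BIC values.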
According to the negative binomial regression results for the dataset from China (see Table 7), driving behavior variables, especially , have significant positive effects on near-miss events. The positive impacts of and the on show that the more aggressive the driving behavior, the more near-miss events will be observed, which is consistent with the common sense of daily driving. The positive effect of shows that the more frequent the driving, the more near-miss events are generated, while the negative impacts of on and on indicate that driving in a specific environment will reduce dangerous driving. For example, driving on weekends can make drivers more cautious due to changes in the driving environment, both inside and outside the vehicle. Another example is the driver is less frequently involved in other vehicles’ trajectories due to low traffic flow at night. The strong negative effects of (under 15 min) and (between 15 min and half an hour) indicate that short-term driving will produce less dangerous driving. Correspondingly, the negative effects of (between one hour and two hours) and (over 2 h) are believed to be caused by the fact that driving fatigue leads to less intense driving.
The negative binomial regression results for the dataset from Spain are, in many respects, similar to those for the dataset from China, though not identical. As Table 8 shows, the biggest similarity is that shows a strong positive effect on near-miss events. In a similar vein, , , and , have a negative effect on both types of near-miss events. However, trips between 30 min and 1 h also show significant negative effects on near-miss events, which is different from the results of the above model for China. The biggest difference is that and do not show significant effects on near-miss events in the dataset from Spain, which is probably related to the different driving conditions and different driving habits of drivers. In addition, driving behavior variables such as do not show a significant effect on near-miss events. This lack of significance is likely related to differences in variable definition methods, data collection methods, and data collection channels between the two datasets [,].
In the prediction process, factors such as model complexity, generalization capability, and adaptability to data characteristics need to be considered to select the most suitable model. To validate the prediction capability, the evaluation of the GLMs and the HGBR model is presented in Table 9. Here, all three models were run with the default maximum number of iterations set to 100, and each was evaluated on both a training set and a test set so that overfitting would not distort the comparison. The results show that the Poisson regression model has the poorest performance on the test set, indicating weak generalization. Conversely, while the negative binomial regression did not perform exceptionally well on the training set, it outperformed Poisson regression on the test set. This suggests that it adapts better to the overdispersed data in this study, which aligns with the earlier evaluation of parameter estimation and model comparison. Therefore, negative binomial regression is used as the comparison group for the HGBR in the subsequent analysis. The HGBR consistently achieves better evaluation metrics than the GLMs across all datasets and dependent variables. Although most of the hyperparameters remain default without tuning (loss = ’poisson’, max_iter = 100, max_depth = 2, random_state = 0, and other defaults), the HGBR’s performance is exceptional. In practice, the model hyperparameters can be further optimized using cross-validation and grid search.
Table 9.
The prediction evaluation of three regressions on training and test datasets; indices include mean Poisson deviance, root mean square error, mean absolute error, and explained variance score.
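The four evaluation indices can be computed directly from observed and predicted counts. The sketch below uses plain Python and hypothetical values; the function names mirror common usage but are defined here and are not taken from the study's code.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean of 2 * (y * ln(y / mu) - y + mu), with y * ln(y / mu) = 0 when y = 0."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predictions must be strictly positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - y + mu)
    return total / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def _population_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y)."""
    var_y = _population_variance(y_true)
    if var_y == 0:
        raise ValueError("explained variance is undefined for constant targets")
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    return 1.0 - _population_variance(residuals) / var_y


# Hypothetical observed counts and a constant-mean prediction.
observed = [1, 2, 3]
predicted = [2.0, 2.0, 2.0]
print(mean_poisson_deviance(observed, predicted), rmse(observed, predicted),
      mae(observed, predicted), explained_variance(observed, predicted))
```

A constant prediction equal to the mean explains none of the variance, whereas a perfect prediction drives the deviance and both error measures to zero.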
As illustrated in Figure 3, the HGBR exhibits superior predictive capabilities and more accurately simulates the data distribution of the two types of near-miss events in comparison to the negative binomial model across both datasets (from China and Spain). It has been observed that the negative binomial model’s predictions for zero values sometimes do not align with reality, unlike the HGBR, which demonstrates a good fit to actual data. This discrepancy may be attributed to the failure of the GLM to account for non-linear relationships among variables. There is also the fact that the prediction process of the GLM is actually an approximation to a hypothetical distribution, and once the actual situation is not a standard distribution, its predictive power is weakened. In contrast, the HGBR’s strength lies in its capacity to incorporate diverse data types and variable dependencies, which contributes to its exceptional predictive power.
Figure 3.
Cumulative frequency comparison of the original value (left), negative binomial regression model predicted value (middle), HGBR model predicted value (right) on (a) Harshacceleration from the dataset from China, (b) Harshdeceleration from the dataset from China, (c) nearmiss_accel from the dataset from Spain, and (d) nearmiss_brake from the dataset from Spain.
Permuted feature importance, one of the explanation tools of machine learning, is derived from the change in model prediction error after the values of a feature are randomly permuted []. To understand the contribution of features to model prediction, permuted feature importance tests are performed on the negative binomial regression model and the HGBR model, respectively. It can be seen from Figure 4 that, for both near-miss event labels in the dataset from China, the two models are more sensitive to driving behavior variables. The same is true for the dataset from Spain, except that the models for the dataset from Spain are more sensitive to distance and duration variables, as shown in Figure 5. However, the features of a given model do not contribute equally to the prediction of different labels. Because different models select different features, comparing feature contributions across models is not meaningful. It is worth mentioning that the permuted feature importance of the negative binomial regression model is partially consistent with the variable significance shown in Table 7 and Table 8, but there are also contradictions. This means that this method can only be used as an aid to understanding the contribution of variables, and its reliability and interpretability need to be improved.
Figure 4.
Feature importance ranking for the dataset from China; the upper left subgraph presents the results of negative binomial regression on Harshacceleration, the upper right subgraph presents the results of negative binomial regression on Harshdeceleration, the lower left subgraph presents the results of HGBR on Harshacceleration, and the lower right subgraph presents the results of HGBR on Harshdeceleration.
Figure 5.
Feature importance ranking for the dataset from Spain; the upper left subgraph presents the results of negative binomial regression on nearmiss_accel, the upper right subgraph presents the results of negative binomial regression on nearmiss_brake, the lower left subgraph presents the results of HGBR on nearmiss_accel, and the lower right subgraph presents the results of HGBR on nearmiss_brake.
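The idea behind permuted feature importance can be sketched in a few lines: shuffle one feature column at a time and record how much the prediction error grows. The toy model, data, and function names below are hypothetical and serve only to illustrate the procedure, not to reproduce Figure 4 or Figure 5.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Average increase in error after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mean_absolute_error(y, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that uses only the first feature."""
    return 2.0 * row[0]


# Feature 0 drives the target; feature 1 is ignored by the model.
rows = [[float(i), float((i * 3) % 5)] for i in range(8)]
target = [2.0 * r[0] for r in rows]

importances = permutation_importance(toy_model, rows, target)
print(importances)
```

A feature that the model does not use receives an importance of exactly zero, while shuffling an informative feature raises the error.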
Once the aforementioned risk factor calculation method was employed to derive driving risks, the predicted outcomes from the HGBR were combined with the actual values of near-miss events from the two datasets. This process yielded four groups of driving risk factors, as detailed in Table 10. Due to the limited number of drivers, all these metrics were derived from the full sample. The distribution maps in Figure 6 illustrate the distribution of driving risks for each near-miss event. In each group, a point represents an observation’s driving risk factor, while the shaded areas and box plots represent the kernel density and distribution of observed driving risks. In the dataset from China, either in the group or in the group, the driving risk for each near-miss event group is primarily concentrated into two clusters. The first cluster, comprising observations with a value above 0.6, indicates a high-risk group, while the second cluster, comprising observations with a value below 0.6, indicates a low-risk group. In the dataset from Spain, both the acceleration and braking groups indicate that the majority of drivers’ risks are distributed approximately normally around 0.3, whereas a minority are distributed approximately normally around 0.8.
Table 10.
Driving risk factor description of different near-miss events on different datasets.
Figure 6.
Driving risk distribution for the dataset from China, which shows the Harshacceleration group (first pink on the left) and Harshdeceleration group (second pink on the left), and the dataset from Spain, which shows the nearmiss_accel group (first green on the right) and nearmiss_brake group (second green on the right).
While there is no inherent correlation between the four sets of driving risk factor outcomes, they do exhibit certain similarities. These include the magnitude of individual driving risks and how they are aggregated. The results demonstrate that—regardless of the dimension used to assess a driver’s risk-taking behavior—two distinct groups emerge: one with lower driving risk and another with higher driving risk. These groups exhibit a clear divergence, with the majority of ambiguous drivers occupying a relatively minor position. In general, the lower-risk drivers constitute the majority, yet it is not implausible that the majority of drivers may engage in more aggressive driving under certain circumstances. This result also proves that the driving risk scoring algorithm proposed in this study can effectively distinguish the driving risk levels. In addition, although laws and regulations in Spain differ from those in China, and neither auto insurance market currently offers a near-miss-based premium calculation product, this method demonstrates a straightforward way to show that near-miss events can be used as corrections to the price of subsequent periods. Furthermore, the same approach can be applied across different countries.
5. Conclusions
Near-miss events have been shown to be effective for assessing driving risks. Most independent variables in the GLMs are statistically significant, indicating that the models capture the distribution pattern of near-miss events as dependent variables. The HGBR model shows strong predictive capacity, producing accurate predictions of near-miss events from the input variables. The values and distribution of the driving risk factors are consistent with prior expectations. The findings suggest that near-miss events can also serve as independent variables, providing valuable information for driving risk regression analysis. Furthermore, near-miss events may be used as substitutes for accidents or claims when scoring driving risks []. Driving risks can, in turn, be used to adjust premiums. However, the actuarial implications of such adjustments for insurance companies require further analysis. In particular, issues such as the equilibrium of premiums and the distribution of payouts are beyond the scope of this paper.
This study raises the question of what is gained by using driving risk predictors instead of near-miss frequencies directly. We contend that predictive models offer an effective way to generate a risk score that incorporates contextual data beyond near-miss information. Observed near-miss counts can be influenced by external factors unrelated to the driver, such as the hazardous actions of other drivers. Consequently, the risk score or predicted value provides a more accurate approximation of the expected number of near-miss events and, ultimately, of the projected number of accidents.
Conventional generalized linear models and machine learning algorithms each have merits and limitations. The results of Poisson regression and negative binomial regression indicate the effect size and statistical significance of each independent variable, and thus show how telematics attributes are associated with near-miss events. The high accuracy of the HGBR in predicting near-miss events demonstrates its ability to handle telematics data with multiple driving-related variables. When driving risk is applied to rate-making, the interpretability of the calculation method is highly valuable to policyholders, while the efficiency and precision of the algorithm enable insurers to process large volumes of driver data quickly and accurately.
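As an illustration of how predictive accuracy for count outcomes such as near-miss events might be compared across models, the following sketch computes the mean Poisson deviance (MPD, listed in the Abbreviations) in plain Python. The observed counts and both sets of predictions are made-up values rather than results from this study, and the function is a stand-alone illustration, not the code used by the authors.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: (2 / n) * sum(y * log(y / mu) - (y - mu)).

    The term y * log(y / mu) is taken as 0 when y == 0.
    Predictions (mu) must be strictly positive.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predicted means must be strictly positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total / len(y_true)


# Made-up near-miss counts and predictions, for illustration only.
observed = [0, 2, 1, 5, 0, 3]
model_predictions = [0.2, 1.8, 1.1, 4.5, 0.3, 2.9]
# Baseline that predicts the sample mean for every driver.
baseline_predictions = [sum(observed) / len(observed)] * len(observed)

mpd_model = mean_poisson_deviance(observed, model_predictions)
mpd_baseline = mean_poisson_deviance(observed, baseline_predictions)
print(f"MPD (model):    {mpd_model:.4f}")
print(f"MPD (baseline): {mpd_baseline:.4f}")
```

A lower MPD indicates predictions that are closer to the observed counts; in this toy example, the hypothetical model scores lower than the constant baseline.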
The findings of this study are subject to several limitations related to data availability. First, the dataset from China covers fewer than seven days, which precludes analyses such as comparisons between weekdays and weekends. Such a short period may not represent annual driving behavior, which can be influenced by seasonal variation. The restricted temporal range could also contribute to inconsistencies in the reported significance of variables. Second, the dataset from Spain is less comprehensive: it contains fewer variables and fewer variable types than the dataset from China, which makes the results less interpretable. Third, the lack of traditional insurance data and driving condition data is a substantial limitation. Including factors that capture driver characteristics is crucial for a comprehensive evaluation of driving risk.
Given the good performance of the ensemble learning algorithms used in this study, future research will explore more interpretable machine learning algorithms for modeling and predicting large volumes of telematics data and for car insurance pricing. While artificial neural networks are not inherently interpretable, they can nonetheless be explained using secondary tools or methods [,]. Future research will therefore explore how state-of-the-art artificial neural networks can be applied to auto insurance, in order to change the persistent perception among insurers and administrators that such algorithms are completely black-box systems. Exploring and analyzing telematics data with AI methods is also important as perceptions and decision-making processes shift from non-autonomous to semi-autonomous and, eventually, fully autonomous driving.
Author Contributions
Conceptualization, S.S. and M.G.; methodology, S.S. and M.G.; software, S.S.; validation, M.G. and L.N.; formal analysis, S.S.; investigation, S.S.; resources, M.G. and A.M.P.-M.; data curation, S.S. and M.G.; writing—original draft preparation, S.S.; writing—review and editing, M.G. and L.N.; visualization, S.S.; supervision, M.G.; project administration, M.G.; funding acquisition, L.N. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
This research is based on the original telematics data from a Chinese company and a Spanish company, which could not be made public due to confidentiality agreements.
Acknowledgments
The authors acknowledge support from Beijing Wuzi University, the Departament de Recerca i Universitats, the Departament d’Acció Climàtica, Alimentació i Agenda Rural, and the Fons Climàtic of the Generalitat de Catalunya (2023 CLIMA 00012).
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
| UBI | usage-based insurance |
| IoV | Internet of Vehicles |
| PAYD | pay-as-you-drive |
| PHYD | pay-how-you-drive |
| MHYD | manage-how-you-drive |
| VIF | variance inflation factor |
| GLM | generalized linear model |
| HGBR | histogram-based gradient boosting regressor |
| GBRT | gradient boosting regression tree |
| AIC | Akaike information criterion |
| BIC | Bayesian information criterion |
| MPD | mean Poisson deviance |
| RMSE | root mean square error |
| MAE | mean absolute error |
| EVS | explained variance score |
References
- Arai, Y.; Nishimoto, T.; Ezaka, Y.; Yoshimoto, K. Accidents and Near-Misses Analysis by Using Video Drive-Recorders in a Fleet Test; Technical Report; SAE Technical Paper; SAE: Warrendale, PA, USA, 2001.
- Verma, R.; Gupta, S.; Sharma, A.K.; Sahni, V.; Goyal, K. Role of Telematics in Motor Insurance: A Way Forward. Acad. Mark. Stud. J. 2021, 25, 1–8.
- Boucher, J.P.; Denuit, M.; Guillen, M. Number of accidents or number of claims? An approach with zero-inflated Poisson models for panel data. J. Risk Insur. 2009, 76, 821–846.
- Guillen, M.; Nielsen, J.P.; Pérez-Marín, A.M. Near-miss telematics in motor insurance. J. Risk Insur. 2021, 88, 569–589.
- Bian, Y.; Yang, C.; Zhao, J.L.; Liang, L. Good drivers pay less: A study of usage-based vehicle insurance models. Transp. Res. Part A Policy Pract. 2018, 107, 20–34.
- Litman, T. Distance-based vehicle insurance feasibility, costs and benefits. Victoria 2007, 11.
- Paefgen, J.; Staake, T.; Thiesse, F. Evaluation and aggregation of pay-as-you-drive insurance rate factors: A classification analysis approach. Decis. Support Syst. 2013, 56, 192–201.
- Paefgen, J.; Staake, T.; Fleisch, E. Multivariate exposure modeling of accident risk: Insights from Pay-as-you-drive insurance data. Transp. Res. Part A Policy Pract. 2014, 61, 27–40.
- Boquete, L.; Rodríguez-Ascariz, J.M.; Barea, R.; Cantos, J.; Miguel-Jiménez, J.M.; Ortega, S. Data acquisition, analysis and transmission platform for a pay-as-you-drive system. Sensors 2010, 10, 5395–5408.
- Tselentis, D.I.; Yannis, G.; Vlahogianni, E.I. Innovative insurance schemes: Pay as/how you drive. Transp. Res. Procedia 2016, 14, 362–371.
- Tselentis, D.I.; Yannis, G.; Vlahogianni, E.I. Innovative motor insurance schemes: A review of current practices and emerging challenges. Accid. Anal. Prev. 2017, 98, 139–148.
- Sun, S.; Bi, J.; Guillen, M.; Pérez-Marín, A.M. Assessing driving risk using internet of vehicles data: An analysis based on generalized linear models. Sensors 2020, 20, 2712.
- Jin, W.; Deng, Y.; Jiang, H.; Xie, Q.; Shen, W.; Han, W. Latent class analysis of accident risks in usage-based insurance: Evidence from Beijing. Accid. Anal. Prev. 2018, 115, 79–88.
- Pérez-Marín, A.M.; Guillen, M.; Alcañiz, M.; Bermúdez, L. Quantile regression with telematics information to assess the risk of driving above the posted speed limit. Risks 2019, 7, 80.
- Boucher, J.P.; Pérez-Marín, A.M.; Santolino, M. Pay-as-you-drive insurance: The effect of the kilometers on the risk of accident. In Anales del Instituto de Actuarios Españoles; Instituto de Actuarios Españoles: Madrid, Spain, 2013; Volume 19, pp. 135–154.
- Gao, G.; Wüthrich, M.V.; Yang, H. Evaluation of driving risk at different speeds. Insur. Math. Econ. 2019, 88, 108–119.
- Guillen, M.; Nielsen, J.P.; Ayuso, M.; Pérez-Marín, A.M. The use of telematics devices to improve automobile insurance rates. Risk Anal. 2019, 39, 662–672.
- Guillen, M.; Nielsen, J.P.; Pérez-Marín, A.M.; Elpidorou, V. Can automobile insurance telematics predict the risk of near-miss events? N. Am. Actuar. J. 2020, 24, 141–152.
- Sun, S.; Bi, J.; Guillen, M.; Pérez-Marín, A.M. Driving risk assessment using near-miss events based on panel Poisson regression and panel negative binomial regression. Entropy 2021, 23, 829.
- Boucher, J.P.; Côté, S.; Guillen, M. Exposure as duration and distance in telematics motor insurance using generalized additive models. Risks 2017, 5, 54.
- Verbelen, R.; Antonio, K.; Claeskens, G. Unravelling the predictive power of telematics data in car insurance pricing. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2018, 67, 1275–1304.
- Guo, F.; Fang, Y. Individual driver risk assessment using naturalistic driving data. Accid. Anal. Prev. 2013, 61, 3–9.
- Carfora, M.F.; Martinelli, F.; Mercaldo, F.; Nardone, V.; Orlando, A.; Santone, A.; Vaglini, G. A “pay-how-you-drive” car insurance approach through cluster analysis. Soft Comput. 2019, 23, 2863–2875.
- Burton, A.; Parikh, T.; Mascarenhas, S.; Zhang, J.; Voris, J.; Artan, N.S.; Li, W. Driver identification and authentication with active behavior modeling. In Proceedings of the 2016 IEEE 12th International Conference on Network and Service Management (CNSM), Montreal, QC, Canada, 31 October–4 November 2016; pp. 388–393.
- Baecke, P.; Bocca, L. The value of vehicle telematics data in insurance risk selection processes. Decis. Support Syst. 2017, 98, 69–79.
- Guelman, L. Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert Syst. Appl. 2012, 39, 3659–3667.
- So, B.; Boucher, J.P.; Valdez, E.A. Cost-Sensitive Multi-Class Adaboost For Understanding Driving Behavior Based on Telematics. ASTIN Bull. J. IAA 2021, 51, 719–751.
- Gao, G.; Wang, H.; Wüthrich, M.V. Boosting Poisson regression models with telematics car driving data. Mach. Learn. 2021, 111, 243–272.
- Henckaerts, R.; Côté, M.P.; Antonio, K.; Verbelen, R. Boosting insights in insurance tariff plans with tree-based machine learning methods. N. Am. Actuar. J. 2021, 25, 255–285.
- Lee, S.C. Addressing imbalanced insurance data through zero-inflated Poisson regression with boosting. ASTIN Bull. J. IAA 2021, 51, 27–55.
- McDonnell, K.; Murphy, F.; Sheehan, B.; Masello, L.; Castignani, G.; Ryan, C. Regulatory and Technical Constraints: An Overview of the Technical Possibilities and Regulatory Limitations of Vehicle Telematic Data. Sensors 2021, 21, 3517.
- Guillen, M.; Pérez-Marín, A.M.; Nielsen, J.P. Pricing weekly motor insurance drivers’ with behavioral and contextual telematics data. Heliyon 2024, 10, e36501.
- Tamim Kashifi, M.; Ahmad, I. Efficient histogram-based gradient boosting approach for accident severity prediction with multisource data. Transp. Res. Rec. 2022, 2676, 236–258.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Yanez, J.S.; Guillén, M.; Nielsen, J.P. Weekly dynamic motor insurance ratemaking with a telematics signals bonus-malus score. ASTIN Bull. J. IAA 2024, 1–28.
- Masello, L.; Castignani, G.; Sheehan, B.; Guillen, M.; Murphy, F. Using contextual data to predict risky driving events: A novel methodology from explainable artificial intelligence. Accid. Anal. Prev. 2023, 184, 106997.
- McDonnell, K.; Murphy, F.; Sheehan, B.; Masello, L.; Castignani, G. Deep learning in insurance: Accuracy and model interpretability using TabNet. Expert Syst. Appl. 2023, 217, 119543.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).