Abstract
This article proposes a novel application of the Bayesian variable selection framework to driver telematics data. Unlike the traditional LASSO, the Bayesian variable selection framework allows us to incorporate the importance of certain features into the variable selection procedure in advance, so that the traditional features are more likely to remain in the ratemaking models. The applicability of the proposed framework in ratemaking practice is also validated via synthetic telematics data.
MSC:
62P05
1. Introduction
Driver telematics is a field that combines telecommunications and informatics to monitor and analyze the behavior and performance of drivers. It involves the use of devices/apps and technologies to collect data on various aspects of driving, which can be used for improving safety, efficiency, and overall driving experience. More specifically, the device or app collects various types of data including vehicle speed and acceleration, braking patterns, steering behavior, and idle time to capture the inherent risk associated with the behaviors of the driver. Driver telematics has been actively used by insurers in the format of pay-as-you-drive (PAYD) or pay-how-you-drive (PHYD), so there have been many research outputs on the effective usage of driver telematics data for auto insurance pricing, such as [1,2].
However, while driver telematics data contain rich information for more precise risk classification, most of the time the data are high-dimensional and require appropriate dimension reduction and/or regularization in order to be compatible with GLMs, which are among the main benchmarks for auto insurance pricing. For example, ref. [3] utilized a speed–acceleration heatmap to effectively use high-dimensional telematics features in auto insurance pricing. Ref. [4] discussed possible dimension reduction techniques for driver telematics data, such as territorial embedding and PCA. Recently, ref. [5] developed a methodology to summarize telematics characteristics into a one-dimensional safety score, which can be used as input for a regulatory-compliant GLM rating. Note that all the aforementioned dimension reduction approaches preserve the traditional features, whereas only the telematics features are pre-processed due to the non-ignorable importance of traditional features (such as vehicle capacity or primary use of the vehicle) in auto insurance ratemaking practices [6,7,8,9,10] (although UBI contracts have been available to policyholders for more than twenty years, their market share remains relatively small at approximately 5% [11,12]). However, the literature on variable selection with actuarial applications is still scarce, and the importance of traditional features has not yet been incorporated into the variable selection procedure [13,14,15,16]. In this regard, we propose a novel variable selection approach in a Bayesian framework for driver telematics data that enables us to incorporate traditional features as well as effectively select significant telematics features, which are often high-dimensional.
1.1. Driver Telematics Data
This section elaborates on the synthetic telematics dataset employed in our empirical analysis. The dataset originated from the work of [17], who obtained the original telematics data from a Canadian insurance company, covering auto insurance policies and claims in Ontario from 2013 to 2016. To protect sensitive information and adhere to data-sharing policies, ref. [17] utilized an algorithm known as “Extended SMOTE” to generate a synthetic version of the dataset. Note that for this paper we also utilized a further pre-processed version of the dataset in which the territorial code (with 55 categories) was converted to a real-valued variable TerritoryEmb, as in [4].
The dataset comprises 3864 entries with positive claim amounts and includes 10 traditional variables along with 39 telematics-related variables. The outcome variable analyzed in this paper is the size of the insurance claim on a natural logarithmic scale, and descriptions of all variables are provided in Table 1.
Table 1.
Description of the traditional and telematics features in the dataset.
Note that Annual.miles.drive is categorized as a traditional variable in Table 1 because it is based on self-reported data rather than telematics. In contrast, Total.miles.driven reflects actual mileage recorded by telematics devices or mobile applications.
1.2. Variable Selection in a Linear Regression Model
Here we use a linear regression model for analyzing insurance claims on a natural log scale, which is given as follows:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_n). \tag{1}$$

Let the observable data consist of $\mathbf{y} = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$, where $y_i$ represents the aggregated claim amount of the $i$-th policyholder on a log scale, and $X$ is a given deterministic design matrix, which contains variables for driver telematics as well as traditional variables such as age and region. The vector $\boldsymbol{\varepsilon} \in \mathbb{R}^n$ denotes additive Gaussian errors, and $\boldsymbol{\beta} \in \mathbb{R}^p$ is the vector of regression coefficients to be estimated. Note that the assumption of Gaussian errors in insurance claim modeling is widely adopted in the actuarial literature [18,19,20,21]. As illustrated by the histogram of the response variable (Figure 1), its empirical distribution exhibits a strong resemblance to a normal distribution.
Figure 1.
Histogram of the response variable.
In the standard linear model, the regression coefficients are estimated by the ordinary least-squares estimator (LSE), which minimizes the residual sum of squares:

$$\hat{\boldsymbol{\beta}}_{\mathrm{LSE}} = \operatorname*{arg\,min}_{\boldsymbol{\beta} \in \mathbb{R}^p} \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2,$$

where $\|\mathbf{v}\|_2 = \left(\sum_j v_j^2\right)^{1/2}$ for a vector $\mathbf{v}$. In R, we use the function lm() to compute the LSE.
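As a concrete illustration, the least-squares computation can also be sketched outside R. The following self-contained Python snippet (with numpy standing in for R's lm(), and simulated data that are purely illustrative) computes the same estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (an assumption of this sketch): n policyholders,
# p features, and a log-claim response generated from known coefficients.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: minimize ||y - X beta||_2^2,
# the analogue of R's lm(), via a numerically stable solver.
beta_lse, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Like lm(), which fits via a QR decomposition, np.linalg.lstsq avoids explicitly forming the normal equations.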
Alternatives to the standard linear model can be employed with the least absolute shrinkage and selection operator (LASSO) and the elastic net (EN) for simultaneous variable selection and parameter estimation. Using a popular R package called glmnet, which provides a unified framework for fitting penalized regression models, we compute the LASSO and EN estimators. glmnet solves the following problem, which contains two key tuning parameters $\lambda \geq 0$ and $\alpha \in [0, 1]$:

$$\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta} \in \mathbb{R}^p} \; \frac{1}{2n}\|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1 - \alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right).$$

Setting $\alpha = 1$, we can obtain the LASSO estimator. When $0 < \alpha < 1$, the resulting estimator becomes an EN estimator. The tuning parameter $\lambda$ controls the overall strength of the regularization. We select the optimal value of $\lambda$ via grid search based on cross-validation, as implemented in the cv.glmnet() function.
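To make the penalized objective concrete, the following Python sketch implements coordinate descent for the same glmnet-style objective via the standard soft-thresholding update; it is a minimal illustration on simulated data, not a substitute for the glmnet package:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha=1.0, n_iter=200):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam*(alpha*||b||_1
    + (1 - alpha)/2 * ||b||_2^2); alpha = 1 is the LASSO, 0 < alpha < 1 the EN."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual without feature j
            z = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
    return beta

# Simulated sparse problem: only the first three coefficients are nonzero.
rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_lasso = elastic_net_cd(X, y, lam=0.1, alpha=1.0)   # LASSO fit
beta_en = elastic_net_cd(X, y, lam=0.1, alpha=0.5)      # elastic net fit
```

The soft threshold sets small coordinates exactly to zero, which is what gives the LASSO and EN their variable-selection behavior.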
To highlight the importance of variable selection, we compare the root mean squared errors (RMSEs) of the LASSO, EN, and linear models. We determine the optimal value of $\lambda$ by performing cross-validation over a grid ranging from 0.01 to 2.00 in increments of 0.01, and we use a fixed value of $\alpha$ for the EN models. The original dataset described in Section 1.1 is treated as the population from which a sample is drawn. We generate 50 samples, each of size $S$, and compute the RMSEs for each sample using 5-fold cross-validation. The use of cross-validation provides a more reliable assessment of predictive performance by reducing the variability associated with a single data split [22]. The sample sizes considered are $S \in \{600, 1000, 1400\}$. The results of the 5-fold cross-validation across the 50 samples are summarized in Table 2 and Figure 2.
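The 5-fold cross-validated RMSE computation can be sketched as follows. This is a generic Python illustration with ordinary least squares as the fitting rule; the function names and simulated data are ours, not from the original analysis:

```python
import numpy as np

def kfold_rmse(X, y, fit, predict, k=5, seed=0):
    """Generic k-fold cross-validated RMSE: fit on k-1 folds,
    score on the held-out fold, and average over the k splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        resid = y[test] - predict(model, X[test])
        errs.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errs))

# Illustration with ordinary least squares as the fitting rule on simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.3 * rng.normal(size=500)

ols_fit = lambda X_tr, y_tr: np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
ols_pred = lambda b, X_te: X_te @ b
cv_rmse = kfold_rmse(X, y, ols_fit, ols_pred)   # close to the noise s.d. of 0.3
```

Passing the fitting rule as a callable lets the same routine score the LASSO, EN, and linear models on identical fold splits.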
Table 2.
Summary statistics for the RMSEs of LASSO, EN, and linear models across 50 samples.
Figure 2.
Boxplots of RMSEs for LASSO, EN, and linear models (LMs) across different sample sizes. The horizontal axis indicates sample sizes $S \in \{600, 1000, 1400\}$, and the vertical axis shows the RMSEs.
Notably, the LASSO and EN models significantly reduce the RMSEs for all values of $S$. However, they do not differentiate between traditional variables and telematics variables during selection. Given that insurance companies may prioritize traditional variables, we analyze the proportions of traditional variables included by the LASSO and EN models. Table 3 presents the inclusion rates for the five traditional variables, Insured.age, Car.age, Car.use, Credit.score, and Region, that are considered important by insurance companies. Indeed, we can see that the inclusion rates for the four traditional variables other than Credit.score are very low, especially as the sample size decreases. To overcome this limitation, we propose a Bayesian method that allows us to control the probabilities that specific variables are included in the model. While the naïve use of LASSO and EN regularizations cannot handle the distinction between traditional variables and telematics variables, the proposed Bayesian approach allows us to incorporate the importance of the features in advance by adjusting the values of the hyperparameters.
Table 3.
Inclusion rates of the five traditional variables—Insured.age, Car.age, Car.use, Credit.score, and Region—using the LASSO and EN models.
The remainder of this article is organized as follows: Section 2 explains the theoretical foundation of the proposed Bayesian variable selection method. Section 3 presents the applicability of the proposed method for driver telematics data. We conclude the article in Section 4 with a brief discussion of possible future research directions.
2. Variable Selection in a Bayesian Framework
2.1. Prior Distributions and Likelihood
For simultaneous variable selection and estimation, we utilize a Bayesian method, aiming to identify significant predictors (or covariates) that should be included in the model while estimating unknown parameters. First, to impose sparsity on $\boldsymbol{\beta}$, we employ a latent binary vector $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p) \in \{0, 1\}^p$ and use the popular spike-and-slab prior with a Gaussian slab as follows [23]:

$$\pi(\beta_j \mid \gamma_j, \sigma^2) = \mathbb{1}(\gamma_j = 0)\, \delta_0(\beta_j) + \mathbb{1}(\gamma_j = 1)\, \mathcal{N}(\beta_j; 0, \sigma^2 \tau^2), \quad j = 1, \ldots, p, \tag{2}$$

where $\mathbb{1}(E)$ denotes an indicator function of an event $E$ such that $\mathbb{1}(E) = 1$ if $E$ is true; otherwise, $\mathbb{1}(E) = 0$. Here, $\delta_0$ denotes a point mass at zero and $\tau^2 > 0$ is a hyperparameter. Given the latent vector $\boldsymbol{\gamma}$, the spike-and-slab prior in (2) indicates that the regression coefficients are independent, with each following one of two distributions. Specifically, $\beta_j$ has a point mass at 0 if $\gamma_j = 0$ and follows a normal distribution with mean zero and variance $\sigma^2\tau^2$ if $\gamma_j = 1$. Let $\pi(\boldsymbol{\beta} \mid \boldsymbol{\gamma}, \sigma^2)$ be the prior density function of $\boldsymbol{\beta}$ given $\boldsymbol{\gamma}$ and $\sigma^2$. Then,

$$\pi(\boldsymbol{\beta} \mid \boldsymbol{\gamma}, \sigma^2) = \prod_{j=1}^{p} \pi(\beta_j \mid \gamma_j, \sigma^2),$$

where $\pi(\beta_j \mid \gamma_j, \sigma^2)$ is the density function of $\beta_j$ given $\gamma_j$ and $\sigma^2$ corresponding to (2).
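For intuition, drawing coefficients from the spike-and-slab prior given a latent vector is straightforward. The following Python sketch illustrates it with an illustrative slab-variance value (the hyperparameter settings are assumptions, not values from the article):

```python
import numpy as np

def sample_spike_slab(gamma, sigma2, tau2, rng):
    """Draw beta from the spike-and-slab prior: beta_j = 0 when gamma_j = 0,
    and beta_j ~ N(0, sigma2 * tau2) when gamma_j = 1."""
    gamma = np.asarray(gamma)
    slab = rng.normal(0.0, np.sqrt(sigma2 * tau2), size=gamma.shape)
    return np.where(gamma == 1, slab, 0.0)

# tau2 = 4.0 is an illustrative hyperparameter value, not one from the article.
rng = np.random.default_rng(3)
gamma = np.array([1, 0, 1, 0, 0])
beta = sample_spike_slab(gamma, sigma2=1.0, tau2=4.0, rng=rng)
```

Coefficients with an inactive latent indicator are exactly zero, which is how the prior encodes variable exclusion.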
For the latent binary vector $\boldsymbol{\gamma}$, ref. [24] employed independent Bernoulli prior distributions for each $\gamma_j$, and proved the selection consistency of the posterior distribution of $\boldsymbol{\gamma}$. In contrast, ref. [25] considered a two-parameter Ising prior to capture the structural dependence among covariates, assuming that the latent variables lie in an undirected graph. Let $J \in \mathbb{R}^{p \times p}$ represent a matrix that determines the connectivity among $\gamma_1, \ldots, \gamma_p$, called a coupling matrix. To construct $J$, ref. [25] used an adjacency matrix $A$ for the underlying graph such that the diagonal entries of $J$ are all zeros and the $(j,k)$-th entry of $J$ is

$$J_{jk} = b\, A_{jk}, \quad j \neq k,$$

where $b > 0$ and $A_{jk} = 1$ if $\gamma_j$ and $\gamma_k$ are connected in the graph and $A_{jk} = 0$ otherwise. Thus, the prior probability mass function (pmf) of $\boldsymbol{\gamma}$ in [25] is as follows:

$$\pi(\boldsymbol{\gamma}) = \frac{1}{Z(a, J)} \exp\left( a \sum_{j=1}^{p} \gamma_j + \sum_{j < k} J_{jk}\, \gamma_j \gamma_k \right), \tag{3}$$

where $a \in \mathbb{R}$ and $Z(a, J)$ is a normalizing constant. The Ising prior described in (3) effectively accounts for the dependence among covariates, except for the special case where $J = O$, which corresponds to independent Bernoulli priors. In order to use the Ising prior in (3), $J$ must be pre-specified. In their simulation studies, ref. [25] assumed a linear chain dependence such that each element of $\boldsymbol{\gamma}$ was connected to two neighboring elements, one step ahead and one step behind, i.e., $\gamma_j$ was connected to $\gamma_{j-1}$ and $\gamma_{j+1}$ for $j = 2, \ldots, p-1$, with $\gamma_1$ and $\gamma_p$ connected only to $\gamma_2$ and $\gamma_{p-1}$, respectively.
We construct the coupling matrix $J$ in a more data-driven manner by taking into account that (1) the inclusion probabilities of traditional variables should be higher than those of telematics variables, and (2) the correlations among telematics variables should be appropriately modeled. Let $\mathcal{T}$ and $\mathcal{U}$ denote index sets for traditional and telematics variables, respectively, such that $\mathcal{T} \cup \mathcal{U} = \{1, \ldots, p\}$ and $\mathcal{T} \cap \mathcal{U} = \emptyset$. For (1), we assign a large value $h > 0$ to an off-diagonal entry $J_{jk}$ if the corresponding indices are in $\mathcal{T}$. Next, for (2), we define the connectivity between $\gamma_j$ and $\gamma_k$ for $j, k \in \mathcal{U}$ using the sample correlation between the $j$-th and $k$-th covariates, denoted by $\rho_{jk}$. The off-diagonal entries of $J$ are then defined as $\rho_{jk}$ if $|\rho_{jk}| > r$, and $0$ otherwise. Here, a threshold $r > 0$ controls the sparsity of $J$. The diagonal entries of $J$ are all zeros, and the off-diagonal entries are summarized in (4):

$$J_{jk} = \begin{cases} h, & j, k \in \mathcal{T}, \\ \rho_{jk}\, \mathbb{1}(|\rho_{jk}| > r), & j, k \in \mathcal{U}, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where $j \neq k$. We emphasize that the prior pmf of $\boldsymbol{\gamma}$ remains the same as in (3), with the only difference being in $J$. Furthermore, as the value of $h$ increases, the probabilities that traditional variables are included in the model also increase. Finally, for the variance in the error terms, we set the prior on $\sigma^2$ in the form $\pi(\sigma^2) \propto 1/\sigma^2$, as in [25]. Once all the necessary random quantities $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, and $\sigma^2$ are given, the likelihood function is as follows:

$$L(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{\|\mathbf{y} - X\boldsymbol{\beta}\|_2^2}{2\sigma^2} \right).$$
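The construction of the coupling matrix can be sketched as follows. Here the symbol names h (the coupling strength for traditional pairs) and r (the correlation threshold) are our reconstruction of the notation in (4), and the data are simulated:

```python
import numpy as np

def coupling_matrix(X, trad_idx, h, r):
    """Build the coupling matrix: entry h for pairs of traditional variables,
    the sample correlation rho_jk for telematics pairs with |rho_jk| > r,
    and 0 otherwise; the diagonal is set to zero."""
    p = X.shape[1]
    rho = np.corrcoef(X, rowvar=False)
    trad = np.zeros(p, dtype=bool)
    trad[list(trad_idx)] = True
    tele = ~trad
    J = np.zeros((p, p))
    mask_tele = np.outer(tele, tele) & (np.abs(rho) > r)
    J[mask_tele] = rho[mask_tele]        # thresholded correlations for telematics pairs
    J[np.outer(trad, trad)] = h          # fixed boost for traditional pairs
    np.fill_diagonal(J, 0.0)
    return J

# Simulated design: columns 0-1 play the role of traditional variables,
# and columns 4-5 are two highly correlated telematics features.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
X[:, 5] = X[:, 4] + 0.1 * rng.normal(size=500)
J = coupling_matrix(X, trad_idx=[0, 1], h=5.0, r=0.5)
```

Mixed traditional-telematics pairs receive zero coupling, so the boost applies only within the traditional block.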
It should be noted that the coupling matrix $J$ is specified from the given dataset at the beginning and is not updated throughout the Gibbs sampling procedure. Using the prior distributions and likelihood specified in this section, we describe the procedures for variable selection and parameter estimation in Section 2.2 and Section 2.3.
2.2. Posterior Inference for $\boldsymbol{\gamma}$
We introduce the Gibbs sampling method suggested by [25] for variable selection under the Ising prior in (3) with the coupling matrix in (4). Significant predictor variables are identified based on the posterior inclusion probabilities (PIPs) $P(\gamma_j = 1 \mid \mathbf{y})$, $j = 1, \ldots, p$. However, the PIPs cannot be computed in a closed form. To approximate the PIPs, ref. [25] defined $(p-1)$-dimensional vectors $\boldsymbol{\gamma}_{-j}$, which are constructed by removing $\gamma_j$ from $\boldsymbol{\gamma}$, and the corresponding index sets, for $j = 1, \ldots, p$. Then, the prior conditional probabilities $\pi(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j})$ and $\pi(\gamma_j = 0 \mid \boldsymbol{\gamma}_{-j})$ are straightforward to compute using the fact that $\gamma_j$ is binary:

$$\pi(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j}) = \frac{\exp\left( a + \sum_{k \neq j} J_{jk} \gamma_k \right)}{1 + \exp\left( a + \sum_{k \neq j} J_{jk} \gamma_k \right)}, \qquad \pi(\gamma_j = 0 \mid \boldsymbol{\gamma}_{-j}) = 1 - \pi(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j}).$$

It is worth noting that the conditional probabilities above no longer involve intractable normalizing constants. By Bayes' rule, the posterior conditional probability that $\gamma_j = 1$ given $\boldsymbol{\gamma}_{-j}$ and $\mathbf{y}$ is

$$P(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j}, \mathbf{y}) = \frac{\pi(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j})\, \mathrm{BF}_j}{\pi(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j})\, \mathrm{BF}_j + \pi(\gamma_j = 0 \mid \boldsymbol{\gamma}_{-j})}, \tag{5}$$

where $\mathrm{BF}_j$ is the Bayes factor comparing the models with $\gamma_j = 1$ and $\gamma_j = 0$. We can explicitly compute the Bayes factors by integrating out $\boldsymbol{\beta}$ and $\sigma^2$ under the prior distributions specified in Section 2.1. The detailed derivation of $\mathrm{BF}_j$ is described in [25]. We draw Gibbs samples of $\boldsymbol{\gamma}$ iteratively using (5) with an initial vector $\boldsymbol{\gamma}^{(0)}$ and compute the approximated PIPs, denoted by $\widehat{P}(\gamma_j = 1 \mid \mathbf{y})$ for $j = 1, \ldots, p$, based on the empirical distribution of the Gibbs samples. Specifically, we divide the number of Gibbs samples where $\gamma_j = 1$ by the total number of the samples excluding the burn-ins. Ref. [25] demonstrated that the Ising prior for $\boldsymbol{\gamma}$ is effective in capturing the dependence among covariates and outperforms the independent Bernoulli prior in terms of variable selection. However, they did not address the estimation of $\boldsymbol{\beta}$ or $\sigma^2$. To tackle this, we propose the Gibbs sampling method for simultaneous variable selection and parameter estimation in the following subsection.
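To illustrate the single-site Gibbs updates, the following Python sketch samples from the Ising prior alone; the data-dependent Bayes factor weighting in (5) is deliberately omitted, so this toy approximates prior (not posterior) inclusion probabilities:

```python
import numpy as np

def gibbs_ising(J, a, n_iter=3000, burn=1000, seed=0):
    """Single-site Gibbs sampler for the Ising prior
    pi(gamma) proportional to exp(a * sum_j gamma_j + sum_{j<k} J_jk gamma_j gamma_k),
    returning Monte Carlo estimates of the inclusion probabilities.
    In the full method each update is additionally weighted by a
    data-dependent Bayes factor; that weighting is omitted here."""
    p = J.shape[0]
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=int)   # initial vector: all variables excluded
    counts = np.zeros(p)
    for t in range(n_iter):
        for j in range(p):
            logit = a + J[j] @ gamma - J[j, j] * gamma[j]
            prob = 1.0 / (1.0 + np.exp(-logit))
            gamma[j] = rng.random() < prob
        if t >= burn:
            counts += gamma
    return counts / (n_iter - burn)

# Strong positive coupling between variables 0 and 1 raises both of their
# inclusion probabilities relative to the uncoupled variable 2.
J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = 2.0
pip = gibbs_ising(J, a=-1.0)
```

The example shows the mechanism we exploit: a large coupling entry between two indicators pushes both toward inclusion, which is exactly how the traditional-variable block of the coupling matrix acts.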
2.3. Posterior Inference for $\boldsymbol{\beta}$ and $\sigma^2$
At the current iteration $t$ of the Gibbs sampling introduced in the previous subsection, the current binary vector is denoted as $\boldsymbol{\gamma}^{(t)}$. Given the current vector $\boldsymbol{\gamma}^{(t)}$ with activation set $A_t = \{j : \gamma_j^{(t)} = 1\}$, we can derive the full posterior of the subvector of active coefficients $\boldsymbol{\beta}_{A_t}$ as follows:

$$\pi(\boldsymbol{\beta}_{A_t} \mid \boldsymbol{\gamma}^{(t)}, \sigma^2, \mathbf{y}) \propto \exp\left( -\frac{\|\mathbf{y} - X_{A_t}\boldsymbol{\beta}_{A_t}\|_2^2}{2\sigma^2} \right) \exp\left( -\frac{\|\boldsymbol{\beta}_{A_t}\|_2^2}{2\sigma^2\tau^2} \right) \propto \exp\left( -\frac{(\boldsymbol{\beta}_{A_t} - \boldsymbol{\mu}_t)^\top \Sigma_t^{-1} (\boldsymbol{\beta}_{A_t} - \boldsymbol{\mu}_t)}{2\sigma^2} \right), \tag{6}$$

where $\Sigma_t = (X_{A_t}^\top X_{A_t} + \tau^{-2} I)^{-1}$ and $\boldsymbol{\mu}_t = \Sigma_t X_{A_t}^\top \mathbf{y}$. Observing the last term in (6), which takes the form of a Gaussian density function, we can infer $\boldsymbol{\beta}_{A_t} \mid \boldsymbol{\gamma}^{(t)}, \sigma^2, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_t, \sigma^2 \Sigma_t)$, from which Gibbs samples for $\boldsymbol{\beta}$ are generated.
For the derivation of the full posterior of $\sigma^2$, we define a sparse vector $\widetilde{\boldsymbol{\beta}} = \boldsymbol{\beta} \odot \boldsymbol{\gamma}$, where the symbol ⊙ denotes a Hadamard product, and an activation set $A = \{j : \gamma_j = 1\}$. Then, the full posterior of $\sigma^2$ is

$$\pi(\sigma^2 \mid \boldsymbol{\beta}, \boldsymbol{\gamma}, \mathbf{y}) \propto (\sigma^2)^{-\frac{n + |A|}{2} - 1} \exp\left( -\frac{1}{\sigma^2} \cdot \frac{\|\mathbf{y} - X\widetilde{\boldsymbol{\beta}}\|_2^2 + \tau^{-2}\|\widetilde{\boldsymbol{\beta}}\|_2^2}{2} \right), \tag{7}$$

where $|A|$ denotes the cardinality of $A$. The last form in (7) indicates that $\sigma^2$ follows an inverse gamma distribution with a shape parameter $(n + |A|)/2$ and a scale parameter:

$$\frac{\|\mathbf{y} - X\widetilde{\boldsymbol{\beta}}\|_2^2 + \tau^{-2}\|\widetilde{\boldsymbol{\beta}}\|_2^2}{2}.$$
Using the full posteriors (6) and (7), we can iteratively generate Gibbs samples, and the sample means serve as the Bayesian estimators for $\boldsymbol{\beta}$ and $\sigma^2$.
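One full Gibbs step for the two conditionals above can be sketched as follows. The Gaussian and inverse-gamma forms follow our reconstruction of the derivation; in particular, the improper prior on the error variance is an assumption of this sketch:

```python
import numpy as np

def draw_beta_sigma2(X, y, gamma, sigma2, tau2, rng):
    """One Gibbs step for (beta, sigma^2) given the current gamma:
    beta_A | . ~ N(mu, sigma2 * Sigma) with Sigma = (X_A'X_A + I/tau2)^{-1}
    and mu = Sigma X_A' y; sigma^2 | . ~ InvGamma((n + |A|)/2, scale)
    under the improper prior pi(sigma^2) ~ 1/sigma^2 (an assumption here)."""
    n, p = X.shape
    A = np.flatnonzero(gamma)            # activation set
    beta = np.zeros(p)
    if A.size > 0:
        XA = X[:, A]
        Sigma = np.linalg.inv(XA.T @ XA + np.eye(A.size) / tau2)
        mu = Sigma @ XA.T @ y
        beta[A] = rng.multivariate_normal(mu, sigma2 * Sigma)
    resid = y - X @ beta
    shape = 0.5 * (n + A.size)
    scale = 0.5 * (resid @ resid + beta[A] @ beta[A] / tau2)
    sigma2_new = scale / rng.gamma(shape)   # inverse-gamma draw via a gamma draw
    return beta, sigma2_new

# Simulated data with gamma fixed at its true support for illustration.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, 0.0, -1.0, 0.0]) + 0.3 * rng.normal(size=200)
gamma = np.array([1, 0, 1, 0])
beta, s2 = draw_beta_sigma2(X, y, gamma, sigma2=1.0, tau2=10.0, rng=rng)
```

Coefficients outside the activation set stay exactly zero, so averaging these draws over the Gibbs iterations yields the sparse Bayesian estimator described above.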
3. Application for Telematics Feature Selection
Now we assess the applicability of the proposed method for variable selection with driver telematics data. In Section 3.1, we examine how the inclusion probabilities of traditional variables vary with the value of the coupling strength $h$ assigned to traditional variables, demonstrating that our method effectively adjusts these probabilities in a flexible way.
3.1. Inclusion Probabilities of Traditional Variables
The coupling strength $h$ is a user-specified value that can be chosen according to the purpose of the analysis. A larger value of $h$ increases the probability that the traditional variables are included in the model. Using five values of $h$, we follow the steps below to compute the average of the inclusion probabilities with its standard deviations:
- Step 0.
- We choose the values of the hyperparameters.
- Step 1.
- We generate a sample of size S from the population.
- Step 2.
- Given a sample, we implement the Gibbs procedure described in Section 2.2 to draw the samples of $\boldsymbol{\gamma}$. The first 1000 iterations are considered burn-ins, and the last 1000 are used.
- Step 3.
- We compute each marginal PIP as the proportion of the retained Gibbs samples with $\gamma_j = 1$:

$$\widehat{P}(\gamma_j = 1 \mid \mathbf{y}) = \frac{1}{1000} \sum_{t=1001}^{2000} \gamma_j^{(t)}, \qquad j = 1, \ldots, p.$$
- Step 4.
- We perform Steps 1, 2, and 3 fifty times and compute the average and standard deviation of PIPs across the fifty repetitions.
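Steps 1 through 4 amount to a simple outer loop over repetitions. The sketch below aggregates PIP vectors returned by any sampler; the stand-in sampler is hypothetical, since the full Gibbs machinery is described in Section 2:

```python
import numpy as np

def average_pips(run_once, n_rep=50, seed=0):
    """Steps 1-4: repeat the sample-and-Gibbs procedure n_rep times and
    report the mean and standard deviation of the marginal PIPs.
    run_once stands in for Steps 1-2 (draw a sample of size S, run the
    Gibbs sampler) and must return one vector of PIPs per call."""
    rng = np.random.default_rng(seed)
    pips = np.array([run_once(rng) for _ in range(n_rep)])
    return pips.mean(axis=0), pips.std(axis=0)

# Hypothetical stand-in sampler: PIPs fluctuating around (0.9, 0.5, 0.1).
fake_run = lambda rng: np.clip(np.array([0.9, 0.5, 0.1]) + 0.05 * rng.normal(size=3), 0.0, 1.0)
mean_pip, sd_pip = average_pips(fake_run)
```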
Table 4 shows the PIPs for the five traditional variables, Insured.age, Car.age, Car.use, Credit.score, and Region. The observed increase in PIPs with increasing values of $h$ implies that the inclusion rates of traditional variables can be effectively controlled through an appropriate choice of $h$. Additional results on the inclusion rates of the remaining traditional and telematics variables are summarized in Appendix A.
Table 4.
Inclusion rates of the five traditional variables—Insured.age, Car.age, Car.use, Credit.score, and Region—using the proposed Bayesian model.
3.2. RMSE Comparison
In the previous subsection, we demonstrated that the proposed Bayesian approach allows flexible control over the inclusion probabilities of traditional variables in the model. However, such flexibility would be meaningless if it came at the cost of poor accuracy. In this subsection, we compare the RMSEs of the proposed method with those of the LASSO model. As in Section 1.2, we compute the RMSEs using 5-fold cross-validation with the 50 generated samples, each of size $S$. As shown in Table 4, setting $h = 10$ ensures the inclusion of all five traditional variables. Therefore, we fix $h$ at 10. As summarized in Table 5, the proposed Bayesian method outperforms LASSO, showing that variable inclusion can be flexibly controlled without compromising, and even while enhancing, accuracy.
Table 5.
Summary statistics for the RMSEs of LASSO and the proposed Bayesian method with $h = 10$ across the 50 samples.
4. Conclusions
In this article, it was shown that the proposed Bayesian variable selection framework can effectively address the predetermined importance of available covariates, maintaining the traditional variables in the ratemaking scheme without losing much prediction accuracy. While these results are promising, we acknowledge that the proposed Bayesian variable selection framework is constrained by the assumption of Gaussian white noise. In this regard, we expect that this research can be extended to incorporate the proposed Bayesian variable selection scheme for response variables with different distributional assumptions, such as the Poisson distribution. It should also be noted that, due to the limited availability of publicly available telematics data, all the analyses are based on synthetic telematics data (which originated from actual data) provided by So et al. [17]. That being said, we recommend that practitioners test the external validity of the proposed framework, provided that an actual telematics dataset is available.
Author Contributions
Conceptualization, M.K. and H.J.; methodology, M.K.; software, M.K.; validation, M.K.; formal analysis, M.K.; investigation, M.K.; data curation, H.J.; writing—original draft preparation, M.K. and H.J.; writing—review and editing, M.K. and H.J.; visualization, M.K.; supervision, H.J.; project administration, M.K. and H.J.; funding acquisition, M.K. and H.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by PNU-RENovation (2024–2025).
Data Availability Statement
The synthetic dataset and R codes used in this article are available at https://github.com/ssauljin/telematics_Bayes (accessed on 17 October 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| PAYD | Pay-as-you-drive |
| PHYD | Pay-how-you-drive |
| GLM | Generalized Linear Model |
| PCA | Principal Component Analysis |
| SMOTE | Synthetic Minority Oversampling Technique |
| RMSE | Root Mean Squared Error |
| LASSO | Least Absolute Shrinkage and Selection Operator |
| PIP | Posterior Inclusion Probability |
Appendix A
Table A1 presents the inclusion probabilities of the remaining traditional variables, excluding the five variables already summarized in Table 3 and Table 4. Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8 and Table A9 present the inclusion probabilities of the 39 telematics variables.
Table A1.
Inclusion rates of the remaining five traditional variables—Insured.sex, Marital, Annual.miles.drive, Years.noclaims, and TerritoryEmb—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Insured.sex | Marital | Annual.miles.drive | Years.noclaims | TerritoryEmb |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.12 | 0.20 | 0.14 | 0.88 | 0.18 |
| 1000 | 0.20 | 0.24 | 0.28 | 1.00 | 0.28 | |
| 1400 | 0.14 | 0.28 | 0.34 | 1.00 | 0.24 | |
| EN | 600 | 0.12 | 0.14 | 0.12 | 0.98 | 0.20 |
| 1000 | 0.18 | 0.28 | 0.32 | 1.00 | 0.28 | |
| 1400 | 0.20 | 0.34 | 0.38 | 1.00 | 0.32 | |
| Bayesian method () | 600 | 0.39 | 0.41 | 0.42 | 0.84 | 0.40 |
| 1000 | 0.31 | 0.35 | 0.38 | 0.92 | 0.36 | |
| 1400 | 0.27 | 0.32 | 0.36 | 0.97 | 0.31 | |
| Bayesian method () | 600 | 0.61 | 0.62 | 0.63 | 0.91 | 0.61 |
| 1000 | 0.51 | 0.55 | 0.57 | 0.95 | 0.55 | |
| 1400 | 0.46 | 0.51 | 0.53 | 0.98 | 0.49 | |
| Bayesian method () | 600 | 0.94 | 0.94 | 0.94 | 0.99 | 0.94 |
| 1000 | 0.91 | 0.92 | 0.93 | 0.99 | 0.92 | |
| 1400 | 0.90 | 0.91 | 0.91 | 1.00 | 0.90 | |
| Bayesian method () | 600 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 |
| 1000 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | |
| 1400 | 0.98 | 0.99 | 0.99 | 1.00 | 0.99 | |
| Bayesian method () | 600 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
| 1400 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Table A2.
Inclusion rates of five telematics variables—Annual.pct.driven, Total.miles.driven, Pct.drive.mon, Pct.drive.tue, and Pct.drive.wed—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Annual.pct.driven | Total.miles.driven | Pct.drive.mon | Pct.drive.tue | Pct.drive.wed |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.20 | 0.30 | 0.20 | 0.26 | 0.34 |
| 1000 | 0.12 | 0.58 | 0.32 | 0.38 | 0.62 | |
| 1400 | 0.26 | 0.80 | 0.36 | 0.48 | 0.78 | |
| EN | 600 | 0.20 | 0.36 | 0.20 | 0.30 | 0.42 |
| 1000 | 0.18 | 0.62 | 0.38 | 0.42 | 0.64 | |
| 1400 | 0.22 | 0.84 | 0.42 | 0.54 | 0.78 | |
| Bayesian method () | 600 | 0.24 | 0.57 | 0.22 | 0.27 | 0.27 |
| 1000 | 0.20 | 0.64 | 0.19 | 0.24 | 0.29 | |
| 1400 | 0.20 | 0.70 | 0.17 | 0.24 | 0.31 | |
| Bayesian method () | 600 | 0.25 | 0.58 | 0.22 | 0.27 | 0.27 |
| 1000 | 0.20 | 0.65 | 0.19 | 0.24 | 0.29 | |
| 1400 | 0.21 | 0.71 | 0.16 | 0.24 | 0.30 | |
| Bayesian method () | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27 |
| 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29 | |
| 1400 | 0.21 | 0.74 | 0.16 | 0.24 | 0.30 | |
| Bayesian method () | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27 |
| 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29 | |
| 1400 | 0.21 | 0.73 | 0.17 | 0.24 | 0.30 | |
| Bayesian method () | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27 |
| 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29 | |
| 1400 | 0.21 | 0.73 | 0.17 | 0.24 | 0.30 |
Table A3.
Inclusion rates of five telematics variables—Pct.drive.thr, Pct.drive.fri, Pct.drive.sat, Pct.drive.sun, and Pct.drive.2hrs—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Pct.drive.thr | Pct.drive.fri | Pct.drive.sat | Pct.drive.sun | Pct.drive.2hrs |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.18 | 0.26 | 0.08 | 0.22 | 0.40 |
| 1000 | 0.12 | 0.32 | 0.16 | 0.42 | 0.64 | |
| 1400 | 0.10 | 0.44 | 0.16 | 0.52 | 0.72 | |
| EN | 600 | 0.18 | 0.30 | 0.08 | 0.28 | 0.42 |
| 1000 | 0.14 | 0.34 | 0.20 | 0.48 | 0.66 | |
| 1400 | 0.10 | 0.52 | 0.24 | 0.62 | 0.82 | |
| Bayesian method () | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38 |
| 1000 | 0.18 | 0.20 | 0.30 | 0.34 | 0.37 | |
| 1400 | 0.16 | 0.21 | 0.29 | 0.36 | 0.37 | |
| Bayesian method () | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38 |
| 1000 | 0.18 | 0.21 | 0.29 | 0.34 | 0.37 | |
| 1400 | 0.15 | 0.21 | 0.29 | 0.35 | 0.37 | |
| Bayesian method () | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38 |
| 1000 | 0.18 | 0.21 | 0.29 | 0.34 | 0.37 | |
| 1400 | 0.16 | 0.21 | 0.29 | 0.36 | 0.36 | |
| Bayesian method () | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38 |
| 1000 | 0.17 | 0.21 | 0.29 | 0.34 | 0.37 | |
| 1400 | 0.15 | 0.21 | 0.29 | 0.36 | 0.37 | |
| Bayesian method () | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38 |
| 1000 | 0.17 | 0.21 | 0.29 | 0.34 | 0.37 | |
| 1400 | 0.15 | 0.21 | 0.29 | 0.36 | 0.36 |
Table A4.
Inclusion rates of five telematics variables—Pct.drive.3hrs, Pct.drive.4hrs, Pct.drive.wkday, Pct.drive.wkend, and Pct.drive.rush.am—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Pct.drive.3hrs | Pct.drive.4hrs | Pct.drive.wkday | Pct.drive.wkend | Pct.drive.rush.am |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.32 | 0.22 | 0.02 | 0.02 | 0.22 |
| 1000 | 0.56 | 0.32 | 0.0 | 0.0 | 0.36 | |
| 1400 | 0.66 | 0.38 | 0.02 | 0.02 | 0.52 | |
| EN | 600 | 0.40 | 0.24 | 0.02 | 0.02 | 0.16 |
| 1000 | 0.58 | 0.34 | 0.0 | 0.0 | 0.44 | |
| 1400 | 0.70 | 0.38 | 0.04 | 0.04 | 0.54 | |
| Bayesian method () | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.26 |
| 1000 | 0.43 | 0.31 | 0.34 | 0.34 | 0.26 | |
| 1400 | 0.45 | 0.35 | 0.33 | 0.34 | 0.24 | |
| Bayesian method () | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.26 |
| 1000 | 0.43 | 0.31 | 0.34 | 0.34 | 0.27 | |
| 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.25 | |
| Bayesian method () | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27 |
| 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.27 | |
| 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.26 | |
| Bayesian method () | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27 |
| 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.28 | |
| 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.27 | |
| Bayesian method () | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27 |
| 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.28 | |
| 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.27 |
Table A5.
Inclusion rates of five telematics variables—Pct.drive.rush.pm, Avgdays.week, Accel.06miles, Accel.08miles, and Accel.09miles—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Pct.drive.rush.pm | Avgdays.week | Accel.06miles | Accel.08miles | Accel.09miles |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.16 | 0.34 | 0.46 | 0.12 | 0.02 |
| 1000 | 0.22 | 0.58 | 0.78 | 0.06 | 0.02 | |
| 1400 | 0.24 | 0.80 | 0.80 | 0.16 | 0.0 | |
| EN | 600 | 0.16 | 0.32 | 0.50 | 0.14 | 0.04 |
| 1000 | 0.22 | 0.64 | 0.80 | 0.12 | 0.02 | |
| 1400 | 0.26 | 0.82 | 0.84 | 0.18 | 0.0 | |
| Bayesian method () | 600 | 0.22 | 0.39 | 0.35 | 0.41 | 0.44 |
| 1000 | 0.18 | 0.50 | 0.34 | 0.35 | 0.39 | |
| 1400 | 0.15 | 0.59 | 0.29 | 0.31 | 0.35 | |
| Bayesian method () | 600 | 0.22 | 0.39 | 0.35 | 0.41 | 0.44 |
| 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39 | |
| 1400 | 0.15 | 0.59 | 0.29 | 0.32 | 0.35 | |
| Bayesian method () | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44 |
| 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39 | |
| 1400 | 0.15 | 0.60 | 0.29 | 0.32 | 0.35 | |
| Bayesian method () | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44 |
| 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39 | |
| 1400 | 0.15 | 0.60 | 0.29 | 0.31 | 0.35 | |
| Bayesian method () | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44 |
| 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39 | |
| 1400 | 0.15 | 0.60 | 0.30 | 0.32 | 0.35 |
Table A6.
Inclusion rates of five telematics variables—Accel.11miles, Accel.12miles, Accel.14miles, Brake.06miles, and Brake.08miles—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Accel.11miles | Accel.12miles | Accel.14miles | Brake.06miles | Brake.08miles |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.02 | 0.04 | 0.02 | 0.42 | 0.58 |
| 1000 | 0.04 | 0.04 | 0.06 | 0.74 | 0.64 | |
| 1400 | 0.0 | 0.04 | 0.06 | 0.78 | 0.74 | |
| EN | 600 | 0.04 | 0.04 | 0.08 | 0.62 | 0.72 |
| 1000 | 0.04 | 0.08 | 0.08 | 0.80 | 0.74 | |
| 1400 | 0.04 | 0.04 | 0.08 | 0.86 | 0.84 | |
| Bayesian method () | 600 | 0.46 | 0.47 | 0.48 | 0.42 | 0.47 |
| 1000 | 0.41 | 0.43 | 0.41 | 0.46 | 0.46 | |
| 1400 | 0.38 | 0.41 | 0.37 | 0.56 | 0.48 | |
| Bayesian method () | 600 | 0.46 | 0.47 | 0.48 | 0.42 | 0.47 |
| 1000 | 0.41 | 0.43 | 0.41 | 0.46 | 0.46 | |
| 1400 | 0.38 | 0.41 | 0.38 | 0.56 | 0.48 | |
| Bayesian method () | 600 | 0.46 | 0.48 | 0.48 | 0.41 | 0.47 |
| 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.46 | |
| 1400 | 0.38 | 0.41 | 0.38 | 0.55 | 0.48 | |
| Bayesian method () | 600 | 0.46 | 0.47 | 0.48 | 0.41 | 0.47 |
| 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.47 | |
| 1400 | 0.38 | 0.40 | 0.37 | 0.55 | 0.48 | |
| Bayesian method () | 600 | 0.46 | 0.47 | 0.48 | 0.41 | 0.47 |
| 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.46 | |
| 1400 | 0.38 | 0.40 | 0.37 | 0.55 | 0.48 |
Table A7.
Inclusion rates of five telematics variables—Brake.09miles, Brake.11miles, Brake.12miles, Brake.14miles, and Left.turn.intensity08—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Brake.09miles | Brake.11miles | Brake.12miles | Brake.14miles | Left.turn.intensity08 |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.28 | 0.50 | 0.04 | 0.08 | 0.12 |
| 1000 | 0.30 | 0.66 | 0.12 | 0.10 | 0.20 | |
| 1400 | 0.16 | 0.90 | 0.08 | 0.02 | 0.30 | |
| EN | 600 | 0.42 | 0.56 | 0.14 | 0.04 | 0.12 |
| 1000 | 0.38 | 0.82 | 0.16 | 0.08 | 0.28 | |
| 1400 | 0.22 | 0.92 | 0.14 | 0.06 | 0.30 | |
| Bayesian method () | 600 | 0.51 | 0.63 | 0.49 | 0.45 | 0.40 |
| 1000 | 0.51 | 0.68 | 0.45 | 0.42 | 0.35 | |
| 1400 | 0.56 | 0.72 | 0.39 | 0.35 | 0.33 | |
| Bayesian method () | 600 | 0.51 | 0.62 | 0.49 | 0.45 | 0.40 |
| 1000 | 0.52 | 0.68 | 0.44 | 0.42 | 0.35 | |
| 1400 | 0.56 | 0.72 | 0.40 | 0.35 | 0.34 | |
| Bayesian method () | 600 | 0.51 | 0.63 | 0.49 | 0.45 | 0.40 |
| 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35 | |
| 1400 | 0.56 | 0.71 | 0.40 | 0.35 | 0.34 | |
| Bayesian method () | 600 | 0.52 | 0.63 | 0.49 | 0.46 | 0.40 |
| 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35 | |
| 1400 | 0.56 | 0.72 | 0.39 | 0.35 | 0.33 | |
| Bayesian method () | 600 | 0.52 | 0.63 | 0.49 | 0.46 | 0.40 |
| 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35 | |
| 1400 | 0.56 | 0.71 | 0.39 | 0.35 | 0.33 |
Table A8.
Inclusion rates of five telematics variables—Left.turn.intensity09, Left.turn.intensity10, Left.turn.intensity11, Left.turn.intensity12, and Right.turn.intensity08—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Left.turn.intensity09 | Left.turn.intensity10 | Left.turn.intensity11 | Left.turn.intensity12 | Right.turn.intensity08 |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.02 | 0.04 | 0.02 | 0.10 | 0.20 |
| | 1000 | 0.06 | 0.00 | 0.04 | 0.10 | 0.36 |
| | 1400 | 0.04 | 0.00 | 0.04 | 0.12 | 0.40 |
| EN | 600 | 0.06 | 0.02 | 0.10 | 0.00 | 0.20 |
| | 1000 | 0.12 | 0.02 | 0.04 | 0.14 | 0.30 |
| | 1400 | 0.04 | 0.00 | 0.08 | 0.18 | 0.48 |
| Bayesian method () | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44 |
| | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42 |
| | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40 |
| Bayesian method () | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44 |
| | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42 |
| | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40 |
| Bayesian method () | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44 |
| | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42 |
| | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40 |
| Bayesian method () | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44 |
| | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42 |
| | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40 |
| Bayesian method () | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44 |
| | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42 |
| | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40 |
Table A9.
Inclusion rates of four telematics variables—Right.turn.intensity09, Right.turn.intensity10, Right.turn.intensity11, and Right.turn.intensity12—using the LASSO, EN, and proposed Bayesian models.
| Method | S | Right.turn.intensity09 | Right.turn.intensity10 | Right.turn.intensity11 | Right.turn.intensity12 |
|---|---|---|---|---|---|
| LASSO | 600 | 0.04 | 0.02 | 0.06 | 0.08 |
| | 1000 | 0.06 | 0.4 | 0.04 | 0.26 |
| | 1400 | 0.02 | 0.04 | 0.06 | 0.32 |
| EN | 600 | 0.10 | 0.04 | 0.04 | 0.14 |
| | 1000 | 0.18 | 0.08 | 0.08 | 0.30 |
| | 1400 | 0.08 | 0.06 | 0.16 | 0.34 |
| Bayesian method () | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.41 | 0.41 | 0.36 |
| Bayesian method () | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.41 | 0.41 | 0.36 |
| Bayesian method () | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.36 |
| Bayesian method () | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.36 |
| Bayesian method () | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.37 |
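The inclusion rates reported in Tables A7–A9 follow a common recipe: draw a subsample of size S, refit a penalized regression, and record which coefficients survive, then average over replications. The sketch below illustrates that recipe for the LASSO and elastic net (EN) baselines only; the Gaussian response, penalty strengths, feature dimension, and replication count are illustrative assumptions, not the paper's actual simulation design, which uses synthetic telematics data and Bayesian spike-and-slab selection.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)

# Illustrative stand-in data: n observations, p telematics-style features,
# of which only the first three carry signal.
n, p = 2000, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [0.8, -0.5, 0.3]
y = X @ beta + rng.normal(size=n)

def inclusion_rates(model, S=600, n_rep=50):
    """Fraction of size-S subsamples in which each coefficient is nonzero."""
    counts = np.zeros(p)
    for _ in range(n_rep):
        idx = rng.choice(n, size=S, replace=False)
        model.fit(X[idx], y[idx])
        counts += (model.coef_ != 0).astype(float)
    return counts / n_rep

# Penalty values are arbitrary here; in practice they would be tuned.
lasso_rates = inclusion_rates(Lasso(alpha=0.1))
en_rates = inclusion_rates(ElasticNet(alpha=0.1, l1_ratio=0.5))
```

Varying `S` over 600, 1000, and 1400 reproduces the sensitivity analysis structure of the tables; a strong signal feature should be selected in nearly every subsample, while noise features appear only sporadically.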
References
- Ayuso, M.; Guillén, M.; Pérez-Marín, A.M. Time and distance to first accident and driving patterns of young drivers with pay-as-you-drive insurance. Accid. Anal. Prev. 2014, 73, 125–131.
- Cheng, J.; Feng, F.Y.; Zeng, X. Pay-as-you-drive insurance: Modeling and implications. N. Am. Actuar. J. 2023, 27, 303–321.
- Gao, G.; Meng, S.; Wüthrich, M.V. Claims frequency modeling using telematics car driving data. Scand. Actuar. J. 2019, 2019, 143–162.
- Jeong, H. Dimension reduction techniques for summarized telematics data. J. Risk Manag. 2022, 33, 1–24.
- Peiris, H.; Jeong, H.; Zou, B. Development of Telematics Risk Scores in Accordance with Regulatory Compliance. SSRN 2024. Available online: https://ssrn.com/abstract=5049191 (accessed on 31 January 2025).
- Guillen, M.; Nielsen, J.P.; Ayuso, M.; Pérez-Marín, A.M. The use of telematics devices to improve automobile insurance rates. Risk Anal. 2019, 39, 662–672.
- Jiang, Q.; Shi, T. Auto insurance pricing using telematics data: Application of a hidden Markov model. N. Am. Actuar. J. 2024, 28, 822–839.
- Chan, I.W.; Tseung, S.C.; Badescu, A.L.; Lin, X.S. Data mining of telematics data: Unveiling the hidden patterns in driving behavior. N. Am. Actuar. J. 2025, 29, 275–309.
- Guillen, M.; Nielsen, J.P.; Pérez-Marín, A.M. Near-miss telematics in motor insurance. J. Risk Insur. 2021, 88, 569–589.
- So, B.; Jeong, H. Simulation engine for adaptive telematics data. Variance 2025, 18.
- Holzapfel, J.; Peter, R.; Richter, A. Mitigating moral hazard with usage-based insurance. J. Risk Insur. 2024, 91, 813–839.
- Peiris, H.; Jeong, H.; Kim, J.K.; Lee, H. Integration of traditional and telematics data for efficient insurance claims prediction. ASTIN Bull. J. IAA 2024, 54, 263–279.
- Williams, B.; Hansen, G.; Baraban, A.; Santoni, A. A practical approach to variable selection—A comparison of various techniques. In Proceedings of the Casualty Actuarial Society E-Forum, Philadelphia, PA, USA, 9–11 March 2015.
- Devriendt, S.; Antonio, K.; Reynkens, T.; Verbelen, R. Sparse regression with multi-type regularized feature modeling. Insur. Math. Econ. 2021, 96, 248–261.
- McGuire, G.; Taylor, G.; Miller, H. Self-assembling insurance claim models using regularized regression and machine learning. Variance 2021, 14, 1–22.
- Jeong, H.; Chang, H.; Valdez, E.A. A non-convex regularization approach for stable estimation of loss development factors. Scand. Actuar. J. 2021, 2021, 779–803.
- So, B.; Boucher, J.P.; Valdez, E.A. Synthetic dataset generation of driver telematics. Risks 2021, 9, 58.
- Brazauskas, V.; Jones, B.L.; Zitikis, R. Robust fitting of claim severity distributions and the method of trimmed moments. J. Stat. Plan. Inference 2009, 139, 2028–2043.
- Shi, P.; Basu, S.; Meyers, G.G. A Bayesian log-normal model for multivariate loss reserving. N. Am. Actuar. J. 2012, 16, 29–51.
- Jeong, H.; Dey, D. Application of a vine copula for multi-line insurance reserving. Risks 2020, 8, 111.
- Bae, T.; Miljkovic, T. Loss modeling with the size-biased lognormal mixture and the entropy regularized EM algorithm. Insur. Math. Econ. 2024, 117, 182–195.
- Abu-Mostafa, Y.S.; Magdon-Ismail, M.; Lin, H.T. Learning from Data; AMLBook: New York, NY, USA, 2012; Volume 4.
- Ishwaran, H.; Rao, J.S. Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Stat. 2005, 33, 730–773.
- Narisetty, N.N.; He, X. Bayesian variable selection with shrinking and diffusing priors. Ann. Stat. 2014, 42, 789–817.
- Li, F.; Zhang, N.R. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J. Am. Stat. Assoc. 2010, 105, 1202–1214.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).