Article

Bayesian Approach to Simultaneous Variable Selection and Estimation in a Linear Regression Model with Applications in Driver Telematics

1 Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
2 Department of Statistics, Pusan National University, Busan 46241, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(20), 3341; https://doi.org/10.3390/math13203341
Submission received: 19 August 2025 / Revised: 13 October 2025 / Accepted: 14 October 2025 / Published: 20 October 2025
(This article belongs to the Special Issue Actuarial Statistical Modeling and Applications)

Abstract

This article proposes a novel application of the Bayesian variable selection framework to driver telematics data. Unlike the traditional LASSO, the Bayesian variable selection framework allows us to incorporate the importance of certain features into the variable selection procedure in advance, so that traditional features are more likely to remain in the ratemaking models. The applicability of the proposed framework in ratemaking practice is also validated via synthetic telematics data.

1. Introduction

Driver telematics is a field that combines telecommunications and informatics to monitor and analyze the behavior and performance of drivers. It involves the use of devices/apps and technologies to collect data on various aspects of driving, which can be used for improving safety, efficiency, and overall driving experience. More specifically, the device or app collects various types of data, including vehicle speed and acceleration, braking patterns, steering behavior, and idle time, to capture the inherent risk associated with the behaviors of the driver. Driver telematics has been actively used by insurers in the form of pay-as-you-drive (PAYD) or pay-how-you-drive (PHYD) products, and there have been many studies on the effective use of driver telematics data for auto insurance pricing, such as [1,2].
However, while driver telematics data contain rich information for more precise risk classification, the data are usually high-dimensional and require appropriate dimension reduction and/or regularization in order to be compatible with generalized linear models (GLMs), which are among the main benchmarks for auto insurance pricing. For example, ref. [3] utilized a speed–acceleration heatmap to effectively use high-dimensional telematics features in auto insurance pricing. Ref. [4] discussed possible dimension reduction techniques for driver telematics data, such as territorial embedding and PCA. Recently, ref. [5] developed a methodology to summarize telematics characteristics into a one-dimensional safety score, which can be used as input for a regulatory-compliant GLM rating. Note that all the aforementioned dimension reduction approaches preserve the traditional features and pre-process only the telematics features, owing to the non-ignorable importance of traditional features (such as vehicle capacity or primary use of the vehicle) in auto insurance ratemaking practices [6,7,8,9,10] (although usage-based insurance (UBI) contracts have been available to policyholders for more than twenty years, their market share remains relatively small at approximately 5% [11,12]). However, the literature on variable selection with actuarial applications is still scarce, and the importance of traditional features has not yet been incorporated into the variable selection procedure [13,14,15,16]. In this regard, we propose a novel variable selection approach in a Bayesian framework for driver telematics data that enables us to incorporate traditional features as well as effectively select significant telematics features, which are often high-dimensional.

1.1. Driver Telematics Data

This section elaborates on the synthetic telematics dataset employed in our empirical analysis. The dataset originated from the work of [17], who obtained the original telematics data from a Canadian insurance company, covering auto insurance policies and claims in Ontario from 2013 to 2016. To protect sensitive information and adhere to data-sharing policies, ref. [17] utilized an algorithm known as "Extended SMOTE" to generate a synthetic version of the dataset. Note that, for this paper, we used a further pre-processed version of the dataset in which the territorial code (with 55 categories) was converted to a real-valued variable TerritoryEmb, as in [4].
The dataset comprises 3864 entries with positive claim amounts and includes 10 traditional variables along with 39 telematics-related variables. The outcome variable analyzed in this paper is the size of the insurance claim on a natural logarithmic scale, and descriptions of all variables are provided in Table 1.
Note that Annual.miles.drive is categorized as a traditional variable in Table 1 because it is based on self-reported data rather than telematics. In contrast, Total.miles.driven reflects the actual mileage recorded by telematics devices or mobile applications.

1.2. Variable Selection in a Linear Regression Model

Here we use a linear regression model for analyzing insurance claims on a natural log scale, which is given as follows:
\[ y = X\beta + \epsilon. \tag{1} \]
The observable data consist of y = (y_1, …, y_n)^⊤ ∈ ℝ^n and X ∈ ℝ^{n×p}, where y_i represents the aggregated claim amount of the i-th policyholder on a log scale, and X is a given deterministic design matrix, which contains variables for driver telematics as well as traditional variables such as age and region. The vector ϵ ~ N_n(0, σ²I_n) denotes additive Gaussian errors, and β ∈ ℝ^p is the vector of regression coefficients to be estimated. Note that the assumption of Gaussian errors in insurance claim modeling is widely adopted in the actuarial literature [18,19,20,21]. As illustrated by the histogram of the response variable listed in Table 1, its empirical distribution strongly resembles a normal distribution (Figure 1).
In the standard linear model, the regression coefficients β are estimated by the ordinary least-squares estimator (LSE), which minimizes the residual sum of squares:
\[ \hat{\beta}_{\mathrm{LSE}} = \arg\min_{\beta} \|y - X\beta\|^2 = (X^\top X)^{-1} X^\top y, \]
where ‖z‖² = z^⊤z for a vector z. In R, we use the function lm() to compute the LSE.
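As a quick illustration (with simulated data rather than the telematics dataset), the closed-form LSE agrees with the coefficients returned by lm():

```r
set.seed(1)

# Simulate a small design matrix and responses: y = X beta + eps
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # intercept + 2 covariates
beta_true <- c(2, -1, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.3))

# Closed-form least-squares estimator: (X'X)^{-1} X'y
beta_lse <- solve(t(X) %*% X, t(X) %*% y)

# The same estimate via lm(); "-1" drops lm()'s own intercept since X has one
fit <- lm(y ~ X - 1)

print(max(abs(beta_lse - coef(fit))))  # agreement up to numerical error
```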
As alternatives to the standard linear model, the least absolute shrinkage and selection operator (LASSO) and the elastic net (EN) allow simultaneous variable selection and parameter estimation. We compute the LASSO and EN estimators using the popular R package glmnet, which provides a unified framework for fitting penalized regression models. glmnet solves the following problem, which contains two key tuning parameters α and λ:
\[ \min_{\beta} \; \frac{1}{n}\|y - X\beta\|^2 + \lambda\Big[ (1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1 \Big]. \]
Setting α = 1 yields the LASSO estimator; when α ∈ (0, 1), the resulting estimator is an EN estimator. The tuning parameter λ > 0 controls the overall strength of the regularization. We select the optimal value of λ via a grid search based on cross-validation, as implemented in the cv.glmnet() function.
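For intuition about the objective above, the sketch below implements cyclic coordinate descent with soft-thresholding for the elastic-net problem in base R. It assumes standardized, intercept-free columns and uses the conventional 1/(2n) scaling of the residual sum of squares; it is an illustrative sketch, not the glmnet implementation:

```r
set.seed(2)

# Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Cyclic coordinate descent for the elastic-net objective, assuming each
# column of X satisfies mean 0 and (1/n) * sum(x_j^2) = 1.
enet_cd <- function(X, y, lambda, alpha, n_iter = 200) {
  p <- ncol(X); n <- nrow(X)
  beta <- rep(0, p)
  r <- y                                       # residual y - X %*% beta
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      zj <- sum(X[, j] * r) / n + beta[j]      # partial residual correlation
      bj_new <- soft(zj, lambda * alpha) / (1 + lambda * (1 - alpha))
      r <- r - X[, j] * (bj_new - beta[j])     # update residual in place
      beta[j] <- bj_new
    }
  }
  beta
}

# Toy data: standardize columns; only the first two predictors matter
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p)
X <- scale(X) * sqrt(n / (n - 1))              # enforce (1/n) column norms of 1
y <- as.vector(X[, 1] - 0.5 * X[, 2] + rnorm(n, sd = 0.5))
y <- y - mean(y)

b_lasso <- enet_cd(X, y, lambda = 0.1, alpha = 1)    # alpha = 1: LASSO
b_enet  <- enet_cd(X, y, lambda = 0.1, alpha = 0.5)  # alpha in (0,1): EN
print(round(b_lasso, 3))  # noise coefficients shrunk toward (or exactly) zero
```

In practice, cv.glmnet() performs the cross-validated choice of λ over such fits automatically.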
To highlight the importance of variable selection, we compare the root mean squared errors (RMSEs) of the LASSO, EN, and linear models. We determine the optimal value of λ by performing cross-validation over a grid ranging from 0.01 to 2.00 in increments of 0.01, and we use α = 0.5 for the EN models. The original dataset described in Section 1.1 is treated as the population from which samples are drawn. We generate 50 samples, each of size S, and compute the RMSEs for each sample using 5-fold cross-validation. The use of cross-validation provides a more reliable assessment of predictive performance by reducing the variability associated with a single data split [22]. The sample sizes considered are S ∈ {600, 1000, 1400}. The results of the 5-fold cross-validation across the 50 samples are summarized in Table 2 and Figure 2.
Notably, the LASSO and EN models significantly reduce the RMSEs for all values of S. However, they do not differentiate between traditional variables and telematics variables during selection. Given that insurance companies may prioritize traditional variables, we analyze the proportions of traditional variables included by the LASSO and EN models. Table 3 presents the inclusion rates for five traditional variables, Insured.age, Car.age, Car.use, Credit.score, and Region, that are considered important by insurance companies. The inclusion rates for the four traditional variables other than Credit.score are very low, especially as the sample size decreases. To overcome this limitation, we propose a Bayesian method that allows us to control the probabilities that specific variables are included in the model. While the naïve use of LASSO and EN regularization cannot distinguish between traditional and telematics variables, the proposed Bayesian approach incorporates the importance of the features in advance by adjusting the values of the hyperparameters.
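The resampling-plus-cross-validation protocol above can be sketched in base R; here a simulated population and a plain OLS fit stand in for the telematics data and the penalized models:

```r
set.seed(3)

# Simulated "population" standing in for the telematics dataset
N <- 4000; p <- 4
pop_X <- cbind(1, matrix(rnorm(N * (p - 1)), N, p - 1))
pop_y <- as.vector(pop_X %*% c(1, 0.5, -0.5, 0) + rnorm(N))

# 5-fold cross-validated RMSE of an OLS fit on one sample
cv_rmse <- function(X, y, K = 5) {
  n <- length(y)
  fold <- sample(rep(seq_len(K), length.out = n))   # random fold labels
  errs <- numeric(K)
  for (k in seq_len(K)) {
    test <- fold == k
    b <- solve(t(X[!test, ]) %*% X[!test, ], t(X[!test, ]) %*% y[!test])
    errs[k] <- sqrt(mean((y[test] - X[test, ] %*% b)^2))
  }
  mean(errs)
}

# Draw 50 samples of size S = 600 and average the 5-fold RMSEs
rmses <- replicate(50, {
  idx <- sample(N, 600)
  cv_rmse(pop_X[idx, ], pop_y[idx])
})
print(c(mean = mean(rmses), sd = sd(rmses)))
```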
The remainder of this article is organized as follows: Section 2 explains the theoretical foundation of the proposed Bayesian variable selection method. Section 3 presents the applicability of the proposed method for driver telematics data. We conclude the article in Section 4 with a brief discussion of possible future research directions.

2. Variable Selection in a Bayesian Framework

2.1. Prior Distributions and Likelihood

For simultaneous variable selection and estimation, we utilize a Bayesian method, aiming to identify significant predictors (or covariates) that should be included in the model while estimating the unknown parameters. First, to impose sparsity on β, we employ a latent binary vector γ = (γ_1, …, γ_p) ∈ {0, 1}^p and use the popular spike-and-slab prior with a Gaussian slab as follows [23]:
\[ \beta_j \mid \gamma_j, \sigma^2 \;\overset{\mathrm{indep}}{\sim}\; (1-\gamma_j)\,\mathbb{1}(\beta_j = 0) + \gamma_j\, N(0, \sigma^2\nu^2), \qquad j = 1, \ldots, p, \tag{2} \]
where 𝟙(E) denotes the indicator function of an event E, with 𝟙(E) = 1 if E is true and 𝟙(E) = 0 otherwise, and ν² is a hyperparameter. Given the latent vector γ, the spike-and-slab prior in (2) indicates that the regression coefficients are independent, with each β_j following one of two distributions: β_j has a point mass at 0 if γ_j = 0 and follows a normal distribution with mean zero and variance σ²ν² if γ_j = 1. Let p(β | γ, σ²) be the prior density function of β given γ and σ². Then,
\[ p(\beta \mid \gamma, \sigma^2) = \prod_{j=1}^{p} p(\beta_j \mid \gamma_j, \sigma^2), \]
where p(β_j | γ_j, σ²) is the density function of β_j given γ_j and σ² corresponding to (2).
For the latent binary vector γ, ref. [24] employed independent Bernoulli prior distributions for each γ_j, j = 1, …, p, and proved the selection consistency of the posterior distribution of γ. In contrast, ref. [25] considered a two-parameter Ising prior to capture the structural dependence among covariates, assuming that the latent variables γ_1, …, γ_p lie on an undirected graph. Let J ∈ ℝ^{p×p} represent the matrix that determines the connectivity among the γ_j, called a coupling matrix. To construct J, ref. [25] used the adjacency matrix of the underlying graph, so that the diagonal entries of J are all zeros and the (j, k)-th entry of J is
\[ J(j,k) = \begin{cases} 1, & \text{if } \gamma_j \text{ and } \gamma_k \text{ are connected}, \\ 0, & \text{otherwise}, \end{cases} \]
where j, k ∈ {1, 2, …, p} and j ≠ k. Thus, the prior probability mass function (pmf) of γ in [25] is as follows:
\[ p(\gamma) = \frac{1}{Z(a,b)} \exp\Big( a \sum_{j=1}^{p} \gamma_j + b\,\gamma^\top J \gamma \Big), \tag{3} \]
where Z(a, b) = Σ_γ exp( a Σ_{j=1}^p γ_j + b γ^⊤Jγ ) is a normalizing constant, with the sum taken over all γ ∈ {0, 1}^p. The Ising prior described in (3) effectively accounts for the dependence among covariates, except in the special case b = 0, which corresponds to independent Bernoulli priors. In order to use the Ising prior in (3), J must be pre-specified. In their simulation studies, ref. [25] assumed a linear-chain dependence in which each element of γ is connected to its two neighboring elements, one step ahead and one step behind; that is, γ_j is connected to γ_{j−1} and γ_{j+1} for j = 1, …, p, with the boundary conditions γ_0 = γ_p and γ_{p+1} = γ_1.
We construct the coupling matrix J in a more data-driven manner by taking into account that (1) the inclusion probabilities of traditional variables should be higher than those of telematics variables, and (2) the correlations among telematics variables should be appropriately modeled. Let I_trad and I_tele denote the index sets of traditional and telematics variables, respectively, such that I_trad ∪ I_tele = {1, 2, …, p} and I_trad ∩ I_tele = ∅. For (1), we assign a large value to an off-diagonal entry if both corresponding indices are in I_trad. Next, for (2), we define the connectivity between γ_j and γ_k for j, k ∈ I_tele using the sample correlation r_jk between the j-th and k-th covariates: the off-diagonal entries are J(j, k) = 1 if |r_jk| > τ and J(j, k) = 0 otherwise, where the threshold τ ∈ (0, 1) controls the sparsity of J. The diagonal entries of J are all zeros, and the off-diagonal entries are summarized in (4):
\[ J(j,k) = \begin{cases} J^{*}, & \text{if } j, k \in I_{\mathrm{trad}}, \\ 1, & \text{if } j, k \in I_{\mathrm{tele}} \text{ and } |r_{jk}| > \tau, \\ 0, & \text{if } j, k \in I_{\mathrm{tele}} \text{ and } |r_{jk}| \le \tau, \end{cases} \tag{4} \]
where J* > 1. We emphasize that the prior pmf of γ remains the same as in (3), the only difference being J. Furthermore, as the value of J* increases, the probabilities that traditional variables are included in the model also increase. Finally, for the variance of the error terms, we set the prior on σ² in the form p(σ²) ∝ 1/σ², as in [25]. Given all the random quantities β, γ, and σ², the likelihood function is as follows:
\[ p(y \mid \beta, \gamma, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \Big). \]
It should be noted that J is specified from the given dataset at the outset and is not updated throughout the Gibbs sampling procedure. Using the prior distributions and likelihood specified in this section, we describe the procedures for variable selection and parameter estimation in Section 2.2 and Section 2.3.
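The construction of J in (4) can be sketched in base R; the split into traditional and telematics indices and the data below are illustrative placeholders:

```r
set.seed(4)

# Illustrative design matrix: columns 1-3 "traditional", 4-8 "telematics"
n <- 500
X <- matrix(rnorm(n * 8), n, 8)
X[, 5] <- X[, 4] + rnorm(n, sd = 0.2)   # make two telematics columns correlated
idx_trad <- 1:3
idx_tele <- 4:8

build_J <- function(X, idx_trad, idx_tele, J_star = 10, tau = 0.5) {
  p <- ncol(X)
  R <- cor(X)                              # sample correlations r_jk
  J <- matrix(0, p, p)
  J[idx_trad, idx_trad] <- J_star          # traditional pairs: large coupling J*
  tele_block <- abs(R[idx_tele, idx_tele]) > tau
  J[idx_tele, idx_tele] <- tele_block * 1  # telematics pairs: connect if |r| > tau
  diag(J) <- 0                             # diagonal entries are all zero
  J
}

J <- build_J(X, idx_trad, idx_tele)
print(J[4:5, 4:5])   # the strongly correlated telematics pair is connected
```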

2.2. Posterior Inference for γ

We introduce the Gibbs sampling method suggested by [25] for variable selection under the Ising prior in (3) with the coupling matrix in (4). Significant predictor variables are identified based on the posterior inclusion probabilities (PIPs) Pr(γ_j = 1 | y), j = 1, …, p. However, the PIPs cannot be computed in closed form. To approximate the PIPs, ref. [25] defined (p−1)-dimensional vectors γ_{−j} ∈ {0, 1}^{p−1}, constructed by removing γ_j from γ, and index sets I_{−j} = {k : γ_k = 1, k ≠ j} for j = 1, …, p. Then, the prior conditional probabilities Pr(γ_j = 0 | γ_{−j}) and Pr(γ_j = 1 | γ_{−j}) are straightforward to compute using the fact that γ_j is binary:
\[ \Pr(\gamma_j = 0 \mid \gamma_{-j}) = \frac{1}{1 + \exp\big( a + b \sum_{k \in I_{-j}} \gamma_k \big)}, \qquad \Pr(\gamma_j = 1 \mid \gamma_{-j}) = \frac{\exp\big( a + b \sum_{k \in I_{-j}} \gamma_k \big)}{1 + \exp\big( a + b \sum_{k \in I_{-j}} \gamma_k \big)}. \]
It is worth noting that the conditional probabilities above no longer involve the intractable normalizing constant. By the Bayes rule, the posterior conditional probability that γ_j = 1 given γ_{−j} and y is
\[ \Pr(\gamma_j = 1 \mid \gamma_{-j}, y) = \frac{\Pr(\gamma_j = 1 \mid \gamma_{-j})}{\Pr(\gamma_j = 1 \mid \gamma_{-j}) + BF_j^{-1} \Pr(\gamma_j = 0 \mid \gamma_{-j})}, \tag{5} \]
where BF_j = p(y | γ_j = 1, γ_{−j}) / p(y | γ_j = 0, γ_{−j}) is the Bayes factor. We can explicitly compute the Bayes factors by integrating out β and σ² under the prior distributions specified in Section 2.1; the detailed derivation of BF_j is described in [25]. We draw Gibbs samples of γ iteratively using (5) from an initial vector γ^{(0)} and compute the approximated PIPs, denoted by P̂r(γ_j = 1 | y) for j = 1, …, p, based on the empirical distribution of the Gibbs samples. Specifically, we divide the number of Gibbs samples with γ_j = 1 by the total number of samples excluding the burn-ins. Ref. [25] demonstrated that the Ising prior for γ is effective in capturing the dependence among covariates and outperforms the independent Bernoulli prior in terms of variable selection. However, they did not address the estimation of β or σ². To tackle this, we propose a Gibbs sampling method for simultaneous variable selection and parameter estimation in the following subsection.
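A minimal base-R sketch of one sweep of this Gibbs sampler follows. The values of a and b and the Bayes factor function bf_toy are illustrative placeholders; the closed form of BF_j, derived in [25], is not reproduced here:

```r
set.seed(5)

# Prior conditional probability Pr(gamma_j = 1 | gamma_{-j}) under the Ising
# prior: a is the sparsity parameter, b the coupling strength, and the sum
# runs over the active coordinates among gamma_{-j}.
prior_p1 <- function(j, gamma, a, b) {
  e <- exp(a + b * sum(gamma[-j]))
  e / (1 + e)
}

# One full sweep of the Gibbs sampler for gamma using Equation (5);
# bf_fun(j, gamma) returns the Bayes factor BF_j.
gibbs_sweep <- function(gamma, a, b, bf_fun) {
  for (j in seq_along(gamma)) {
    p1 <- prior_p1(j, gamma, a, b)
    post_p1 <- p1 / (p1 + (1 / bf_fun(j, gamma)) * (1 - p1))
    gamma[j] <- rbinom(1, 1, post_p1)
  }
  gamma
}

# Toy Bayes factor strongly favoring the first two coordinates (illustration only)
bf_toy <- function(j, gamma) if (j <= 2) 20 else 0.05

gamma <- rep(0, 5)
draws <- matrix(0L, nrow = 2000, ncol = 5)
for (t in 1:2000) {
  gamma <- gibbs_sweep(gamma, a = -1, b = 0.1, bf_fun = bf_toy)
  draws[t, ] <- gamma
}
print(colMeans(draws))  # coordinates 1-2 are included far more often
```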

2.3. Posterior Inference for β and σ 2

At the current iteration t of the Gibbs sampling introduced in the previous subsection, the current binary vector is denoted as γ^{(t)} = (γ_1^{(t)}, …, γ_p^{(t)}). Given the current vector γ^{(t)} with γ_j^{(t)} = 1, we can derive the full conditional posterior of β_j as follows:
\[ p(\beta_j \mid \beta_{-j}, \gamma^{(t)}, \sigma^2, y) \;\propto\; \exp\Big\{ -\frac{1}{2}(\tau_j^2)^{-1}\big( \beta_j^2 - 2\mu_j \beta_j \big) \Big\} \;\propto\; \exp\Big\{ -\frac{1}{2\tau_j^2} (\beta_j - \mu_j)^2 \Big\}, \tag{6} \]
where
\[ (\tau_j^2)^{-1} := \frac{1}{\sigma^2}\Big( \sum_{i=1}^{n} x_{ij}^2 + \frac{1}{\nu^2} \Big), \qquad \mu_j := \Big( \sum_{i=1}^{n} x_{ij}^2 + \frac{1}{\nu^2} \Big)^{-1} \sum_{i=1}^{n} x_{ij}\Big( y_i - \sum_{k \ne j} x_{ik}\beta_k \Big). \]
Observing that the last term in (6) takes the form of a Gaussian density function, we can infer β_j | β_{−j}, γ^{(t)}, σ², y ~ N(μ_j, τ_j²), from which the Gibbs samples for β are generated.
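Drawing β_j from this Gaussian full conditional can be sketched in base R, using simulated data and the illustrative values σ² = ν² = 1:

```r
set.seed(7)

# Toy data and current Gibbs state
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, -0.5, 0)
y <- as.vector(X %*% beta + rnorm(n))
sigma2 <- 1; nu2 <- 1

# Draw beta_j from N(mu_j, tau_j^2) given the other coefficients, as in (6)
draw_beta_j <- function(j, beta, X, y, sigma2, nu2) {
  prec_term <- sum(X[, j]^2) + 1 / nu2                # sum_i x_ij^2 + 1/nu^2
  tau2 <- sigma2 / prec_term                          # tau_j^2
  resid_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # y_i - sum_{k != j} x_ik beta_k
  mu <- sum(X[, j] * resid_j) / prec_term             # mu_j
  rnorm(1, mean = mu, sd = sqrt(tau2))
}

draws <- replicate(5000, draw_beta_j(1, beta, X, y, sigma2, nu2))
print(round(c(mean(draws), sd(draws)), 2))  # centered near the true coefficient
```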
For the derivation of the full conditional posterior of σ², we define the sparse vector β̃ = γ ⊙ β, where ⊙ denotes the Hadamard product, and the active set A^{(t)} = {j : γ_j^{(t)} = 1}. Then, the full conditional posterior of σ² is
\[ p(\sigma^2 \mid \beta, \gamma^{(t)}, y) \;\propto\; (\sigma^2)^{-\frac{n + |A^{(t)}|}{2} - 1} \exp\Big\{ -\frac{1}{\sigma^2}\Big( \frac{1}{2\nu^2}\tilde{\beta}^\top\tilde{\beta} + \frac{1}{2}\big( y - X\tilde{\beta} \big)^\top \big( y - X\tilde{\beta} \big) \Big) \Big\}, \tag{7} \]
where |A^{(t)}| denotes the cardinality of A^{(t)}. The form of (7) indicates that σ² | β, γ^{(t)}, y follows an inverse gamma distribution with shape parameter a_{σ²} := (n + |A^{(t)}|)/2 and scale parameter
\[ b_{\sigma^2} := \frac{1}{2\nu^2}\tilde{\beta}^\top\tilde{\beta} + \frac{1}{2}\big( y - X\tilde{\beta} \big)^\top \big( y - X\tilde{\beta} \big). \]
Using the full conditional posteriors p(β_j | β_{−j}, γ^{(t)}, σ², y) and p(σ² | β, γ^{(t)}, y), we can iteratively generate Gibbs samples, and the sample means serve as the Bayesian estimators of β and σ².
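Since an inverse gamma variate is the reciprocal of a gamma variate, the σ² update can be sketched in base R as follows. The data are illustrative; the shape combines the n likelihood terms with the |A^{(t)}| active slab terms:

```r
set.seed(8)

# Toy quantities standing in for the current Gibbs state
n <- 200; nu2 <- 1
X <- matrix(rnorm(n * 3), n, 3)
beta_tilde <- c(1.5, 0, -0.8)            # gamma ⊙ beta: second coordinate inactive
y <- as.vector(X %*% beta_tilde + rnorm(n, sd = 2))
A_size <- sum(beta_tilde != 0)           # |A^(t)|, the number of active coordinates

# Inverse-gamma draw for sigma^2 with shape (n + |A|)/2 and scale b_{sigma^2}
resid <- y - X %*% beta_tilde
shape <- (n + A_size) / 2
scale_b <- sum(beta_tilde^2) / (2 * nu2) + sum(resid^2) / 2
sigma2_draw <- 1 / rgamma(1, shape = shape, rate = scale_b)

draws <- 1 / rgamma(5000, shape = shape, rate = scale_b)
print(round(mean(draws), 2))             # concentrates near the residual variance
```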

3. Application for Telematics Feature Selection

Now we assess the applicability of the proposed method for variable selection with driver telematics data. In Section 3.1, we examine how the inclusion probabilities of traditional variables vary with the values of J * , demonstrating that our method effectively adjusts these probabilities in a flexible way.

3.1. Inclusion Probabilities of Traditional Variables

J* is a user-specified value that can be chosen according to the purpose of the analysis; a larger value of J* increases the probability that the traditional variables are included in the model. Using the five values J* ∈ {2, 3, 5, 7, 10}, we follow the steps below to compute the average of the inclusion probabilities along with their standard deviations:
Step 0.
We choose J* ∈ {2, 3, 5, 7, 10} and S ∈ {600, 1000, 1400}.
Step 1.
We generate a sample of size S from the population.
Step 2.
Given a sample, we implement the Gibbs procedure described in Section 2.2 to draw γ^{(1)}, …, γ^{(2000)}. The first 1000 iterations are discarded as burn-in, and the last 1000 are used.
Step 3.
We compute each marginal PIP as follows:
\[ \hat{\Pr}(\gamma_j = 1 \mid y) = \frac{1}{1000} \sum_{t=1001}^{2000} \gamma_j^{(t)}, \qquad j = 1, \ldots, p. \]
Step 4.
We perform Steps 1, 2, and 3 fifty times and compute the average and standard deviation of PIPs across the fifty repetitions.
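The averaging logic of Steps 3 and 4 can be sketched in base R; here independent Bernoulli draws stand in for the Gibbs draws of Step 2, purely as an illustrative placeholder for the sampler of Section 2.2:

```r
set.seed(6)

# Stand-in for Step 2: a 2000 x p matrix of draws of gamma, one row per
# iteration (here random Bernoulli draws with fixed inclusion probabilities)
p <- 4
one_rep_pip <- function() {
  draws <- matrix(rbinom(2000 * p, 1, prob = c(0.9, 0.5, 0.2, 0.7)),
                  nrow = 2000, ncol = p, byrow = TRUE)
  colMeans(draws[1001:2000, ])            # Step 3: PIPs over post-burn-in draws
}

# Step 4: repeat fifty times, then average and take standard deviations
pips <- replicate(50, one_rep_pip())      # p x 50 matrix of PIPs
print(round(rowMeans(pips), 2))           # average PIP per variable
print(round(apply(pips, 1, sd), 3))       # standard deviation across repetitions
```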
Table 4 shows the PIPs for the five traditional variables, Insured.age, Car.age, Car.use, Credit.score, and Region. The observed increase in PIPs with increasing values of J * implies that the inclusion rates of traditional variables can be effectively controlled through appropriate choices of J * . Additional results on the inclusion rates of the remaining traditional and telematics variables are summarized in Appendix A.

3.2. RMSE Comparison

In the previous subsection, we demonstrated that the proposed Bayesian approach allows flexible control over the inclusion probabilities of traditional variables in the model. However, such flexibility would be meaningless if it came at the cost of poor predictive accuracy. In this subsection, we compare the RMSEs of the proposed method with those of the LASSO model. As in Section 1.2, we compute the RMSEs using 5-fold cross-validation on the 50 generated samples, each of size S ∈ {600, 1000, 1400}. As shown in Table 4, setting J* = 10 ensures the inclusion of all five traditional variables; we therefore fix J* at 10. As summarized in Table 5, the proposed Bayesian method outperforms LASSO, showing that variable inclusion can be flexibly controlled without compromising, and even while enhancing, accuracy.

4. Conclusions

In this article, we showed that the proposed Bayesian variable selection can effectively incorporate the predetermined importance of available covariates, keeping the traditional variables in the ratemaking scheme without losing much prediction accuracy. While these results are promising, we acknowledge that the proposed Bayesian variable selection framework is constrained by the assumption of Gaussian white noise. In this regard, we expect this research to be extended to response variables with different distributional assumptions, such as the Poisson distribution. It should also be noted that, due to the limited availability of publicly available telematics data, all the analyses are based on synthetic telematics data (originating from actual data) provided by So et al. [17]. That being said, we recommend that practitioners test the external validity of the proposed framework whenever an actual telematics dataset is available.

Author Contributions

Conceptualization, M.K. and H.J.; methodology, M.K.; software, M.K.; validation, M.K.; formal analysis, M.K.; investigation, M.K.; data curation, H.J.; writing—original draft preparation, M.K. and H.J.; writing—review and editing, M.K. and H.J.; visualization, M.K.; supervision, H.J.; project administration, M.K. and H.J.; funding acquisition, M.K. and H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by PNU-RENovation (2024–2025).

Data Availability Statement

The synthetic dataset and R codes used in this article are available at https://github.com/ssauljin/telematics_Bayes (accessed on 17 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
PAYD: Pay-as-you-drive
PHYD: Pay-how-you-drive
GLM: Generalized Linear Model
PCA: Principal Component Analysis
SMOTE: Synthetic Minority Oversampling Technique
RMSE: Root Mean Squared Error
LASSO: Least Absolute Shrinkage and Selection Operator
PIP: Posterior Inclusion Probability

Appendix A

Table A1 presents the inclusion probabilities of the remaining traditional variables, excluding the five variables already summarized in Table 3 and Table 4. Tables A2–A9 present the inclusion probabilities of the 39 telematics variables.
Table A1. Inclusion rates of the remaining five traditional variables—Insured.sex, Marital, Annual.miles.drive, Years.noclaims, and TerritoryEmb—using the LASSO, EN, and proposed Bayesian models.

Method | S | Insured.sex | Marital | Annual.miles.drive | Years.noclaims | TerritoryEmb
LASSO | 600 | 0.12 | 0.20 | 0.14 | 0.88 | 0.18
LASSO | 1000 | 0.20 | 0.24 | 0.28 | 1.00 | 0.28
LASSO | 1400 | 0.14 | 0.28 | 0.34 | 1.00 | 0.24
EN | 600 | 0.12 | 0.14 | 0.12 | 0.98 | 0.20
EN | 1000 | 0.18 | 0.28 | 0.32 | 1.00 | 0.28
EN | 1400 | 0.20 | 0.34 | 0.38 | 1.00 | 0.32
Bayesian method (J* = 2) | 600 | 0.39 | 0.41 | 0.42 | 0.84 | 0.40
Bayesian method (J* = 2) | 1000 | 0.31 | 0.35 | 0.38 | 0.92 | 0.36
Bayesian method (J* = 2) | 1400 | 0.27 | 0.32 | 0.36 | 0.97 | 0.31
Bayesian method (J* = 3) | 600 | 0.61 | 0.62 | 0.63 | 0.91 | 0.61
Bayesian method (J* = 3) | 1000 | 0.51 | 0.55 | 0.57 | 0.95 | 0.55
Bayesian method (J* = 3) | 1400 | 0.46 | 0.51 | 0.53 | 0.98 | 0.49
Bayesian method (J* = 5) | 600 | 0.94 | 0.94 | 0.94 | 0.99 | 0.94
Bayesian method (J* = 5) | 1000 | 0.91 | 0.92 | 0.93 | 0.99 | 0.92
Bayesian method (J* = 5) | 1400 | 0.90 | 0.91 | 0.91 | 1.00 | 0.90
Bayesian method (J* = 7) | 600 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99
Bayesian method (J* = 7) | 1000 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99
Bayesian method (J* = 7) | 1400 | 0.98 | 0.99 | 0.99 | 1.00 | 0.99
Bayesian method (J* = 10) | 600 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Bayesian method (J* = 10) | 1000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Bayesian method (J* = 10) | 1400 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Table A2. Inclusion rates of five telematics variables—Annual.pct.driven, Total.miles.driven, Pct.drive.mon, Pct.drive.tue, and Pct.drive.wed—using the LASSO, EN, and proposed Bayesian models.

Method | S | Annual.pct.driven | Total.miles.driven | Pct.drive.mon | Pct.drive.tue | Pct.drive.wed
LASSO | 600 | 0.20 | 0.30 | 0.20 | 0.26 | 0.34
LASSO | 1000 | 0.12 | 0.58 | 0.32 | 0.38 | 0.62
LASSO | 1400 | 0.26 | 0.80 | 0.36 | 0.48 | 0.78
EN | 600 | 0.20 | 0.36 | 0.20 | 0.30 | 0.42
EN | 1000 | 0.18 | 0.62 | 0.38 | 0.42 | 0.64
EN | 1400 | 0.22 | 0.84 | 0.42 | 0.54 | 0.78
Bayesian method (J* = 2) | 600 | 0.24 | 0.57 | 0.22 | 0.27 | 0.27
Bayesian method (J* = 2) | 1000 | 0.20 | 0.64 | 0.19 | 0.24 | 0.29
Bayesian method (J* = 2) | 1400 | 0.20 | 0.70 | 0.17 | 0.24 | 0.31
Bayesian method (J* = 3) | 600 | 0.25 | 0.58 | 0.22 | 0.27 | 0.27
Bayesian method (J* = 3) | 1000 | 0.20 | 0.65 | 0.19 | 0.24 | 0.29
Bayesian method (J* = 3) | 1400 | 0.21 | 0.71 | 0.16 | 0.24 | 0.30
Bayesian method (J* = 5) | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27
Bayesian method (J* = 5) | 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29
Bayesian method (J* = 5) | 1400 | 0.21 | 0.74 | 0.16 | 0.24 | 0.30
Bayesian method (J* = 7) | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27
Bayesian method (J* = 7) | 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29
Bayesian method (J* = 7) | 1400 | 0.21 | 0.73 | 0.17 | 0.24 | 0.30
Bayesian method (J* = 10) | 600 | 0.25 | 0.60 | 0.22 | 0.27 | 0.27
Bayesian method (J* = 10) | 1000 | 0.20 | 0.67 | 0.19 | 0.24 | 0.29
Bayesian method (J* = 10) | 1400 | 0.21 | 0.73 | 0.17 | 0.24 | 0.30
Table A3. Inclusion rates of five telematics variables—Pct.drive.thr, Pct.drive.fri, Pct.drive.sat, Pct.drive.sun, and Pct.drive.2hrs—using the LASSO, EN, and proposed Bayesian models.

Method | S | Pct.drive.thr | Pct.drive.fri | Pct.drive.sat | Pct.drive.sun | Pct.drive.2hrs
LASSO | 600 | 0.18 | 0.26 | 0.08 | 0.22 | 0.40
LASSO | 1000 | 0.12 | 0.32 | 0.16 | 0.42 | 0.64
LASSO | 1400 | 0.10 | 0.44 | 0.16 | 0.52 | 0.72
EN | 600 | 0.18 | 0.30 | 0.08 | 0.28 | 0.42
EN | 1000 | 0.14 | 0.34 | 0.20 | 0.48 | 0.66
EN | 1400 | 0.10 | 0.52 | 0.24 | 0.62 | 0.82
Bayesian method (J* = 2) | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38
Bayesian method (J* = 2) | 1000 | 0.18 | 0.20 | 0.30 | 0.34 | 0.37
Bayesian method (J* = 2) | 1400 | 0.16 | 0.21 | 0.29 | 0.36 | 0.37
Bayesian method (J* = 3) | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38
Bayesian method (J* = 3) | 1000 | 0.18 | 0.21 | 0.29 | 0.34 | 0.37
Bayesian method (J* = 3) | 1400 | 0.15 | 0.21 | 0.29 | 0.35 | 0.37
Bayesian method (J* = 5) | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38
Bayesian method (J* = 5) | 1000 | 0.18 | 0.21 | 0.29 | 0.34 | 0.37
Bayesian method (J* = 5) | 1400 | 0.16 | 0.21 | 0.29 | 0.36 | 0.36
Bayesian method (J* = 7) | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38
Bayesian method (J* = 7) | 1000 | 0.17 | 0.21 | 0.29 | 0.34 | 0.37
Bayesian method (J* = 7) | 1400 | 0.15 | 0.21 | 0.29 | 0.36 | 0.37
Bayesian method (J* = 10) | 600 | 0.23 | 0.23 | 0.32 | 0.34 | 0.38
Bayesian method (J* = 10) | 1000 | 0.17 | 0.21 | 0.29 | 0.34 | 0.37
Bayesian method (J* = 10) | 1400 | 0.15 | 0.21 | 0.29 | 0.36 | 0.36
Table A4. Inclusion rates of five telematics variables—Pct.drive.3hrs, Pct.drive.4hrs, Pct.drive.wkday, Pct.drive.wkend, and Pct.drive.rush.am—using the LASSO, EN, and proposed Bayesian models.

Method | S | Pct.drive.3hrs | Pct.drive.4hrs | Pct.drive.wkday | Pct.drive.wkend | Pct.drive.rush.am
LASSO | 600 | 0.32 | 0.22 | 0.02 | 0.02 | 0.22
LASSO | 1000 | 0.56 | 0.32 | 0.00 | 0.00 | 0.36
LASSO | 1400 | 0.66 | 0.38 | 0.02 | 0.02 | 0.52
EN | 600 | 0.40 | 0.24 | 0.02 | 0.02 | 0.16
EN | 1000 | 0.58 | 0.34 | 0.00 | 0.00 | 0.44
EN | 1400 | 0.70 | 0.38 | 0.04 | 0.04 | 0.54
Bayesian method (J* = 2) | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.26
Bayesian method (J* = 2) | 1000 | 0.43 | 0.31 | 0.34 | 0.34 | 0.26
Bayesian method (J* = 2) | 1400 | 0.45 | 0.35 | 0.33 | 0.34 | 0.24
Bayesian method (J* = 3) | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.26
Bayesian method (J* = 3) | 1000 | 0.43 | 0.31 | 0.34 | 0.34 | 0.27
Bayesian method (J* = 3) | 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.25
Bayesian method (J* = 5) | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27
Bayesian method (J* = 5) | 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.27
Bayesian method (J* = 5) | 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.26
Bayesian method (J* = 7) | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27
Bayesian method (J* = 7) | 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.28
Bayesian method (J* = 7) | 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.27
Bayesian method (J* = 10) | 600 | 0.45 | 0.35 | 0.37 | 0.37 | 0.27
Bayesian method (J* = 10) | 1000 | 0.43 | 0.32 | 0.34 | 0.34 | 0.28
Bayesian method (J* = 10) | 1400 | 0.46 | 0.35 | 0.33 | 0.34 | 0.27
Table A5. Inclusion rates of five telematics variables—Pct.drive.rush.pm, Avgdays.week, Accel.06miles, Accel.08miles, and Accel.09miles—using the LASSO, EN, and proposed Bayesian models.

Method | S | Pct.drive.rush.pm | Avgdays.week | Accel.06miles | Accel.08miles | Accel.09miles
LASSO | 600 | 0.16 | 0.34 | 0.46 | 0.12 | 0.02
LASSO | 1000 | 0.22 | 0.58 | 0.78 | 0.06 | 0.02
LASSO | 1400 | 0.24 | 0.80 | 0.80 | 0.16 | 0.00
EN | 600 | 0.16 | 0.32 | 0.50 | 0.14 | 0.04
EN | 1000 | 0.22 | 0.64 | 0.80 | 0.12 | 0.02
EN | 1400 | 0.26 | 0.82 | 0.84 | 0.18 | 0.00
Bayesian method (J* = 2) | 600 | 0.22 | 0.39 | 0.35 | 0.41 | 0.44
Bayesian method (J* = 2) | 1000 | 0.18 | 0.50 | 0.34 | 0.35 | 0.39
Bayesian method (J* = 2) | 1400 | 0.15 | 0.59 | 0.29 | 0.31 | 0.35
Bayesian method (J* = 3) | 600 | 0.22 | 0.39 | 0.35 | 0.41 | 0.44
Bayesian method (J* = 3) | 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39
Bayesian method (J* = 3) | 1400 | 0.15 | 0.59 | 0.29 | 0.32 | 0.35
Bayesian method (J* = 5) | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44
Bayesian method (J* = 5) | 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39
Bayesian method (J* = 5) | 1400 | 0.15 | 0.60 | 0.29 | 0.32 | 0.35
Bayesian method (J* = 7) | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44
Bayesian method (J* = 7) | 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39
Bayesian method (J* = 7) | 1400 | 0.15 | 0.60 | 0.29 | 0.31 | 0.35
Bayesian method (J* = 10) | 600 | 0.23 | 0.40 | 0.35 | 0.41 | 0.44
Bayesian method (J* = 10) | 1000 | 0.18 | 0.51 | 0.35 | 0.36 | 0.39
Bayesian method (J* = 10) | 1400 | 0.15 | 0.60 | 0.30 | 0.32 | 0.35
Table A6. Inclusion rates of five telematics variables—Accel.11miles, Accel.12miles, Accel.14miles, Brake.06miles, and Brake.08miles—using the LASSO, EN, and proposed Bayesian models.

Method | S | Accel.11miles | Accel.12miles | Accel.14miles | Brake.06miles | Brake.08miles
LASSO | 600 | 0.02 | 0.04 | 0.02 | 0.42 | 0.58
LASSO | 1000 | 0.04 | 0.04 | 0.06 | 0.74 | 0.64
LASSO | 1400 | 0.00 | 0.04 | 0.06 | 0.78 | 0.74
EN | 600 | 0.04 | 0.04 | 0.08 | 0.62 | 0.72
EN | 1000 | 0.04 | 0.08 | 0.08 | 0.80 | 0.74
EN | 1400 | 0.04 | 0.04 | 0.08 | 0.86 | 0.84
Bayesian method (J* = 2) | 600 | 0.46 | 0.47 | 0.48 | 0.42 | 0.47
Bayesian method (J* = 2) | 1000 | 0.41 | 0.43 | 0.41 | 0.46 | 0.46
Bayesian method (J* = 2) | 1400 | 0.38 | 0.41 | 0.37 | 0.56 | 0.48
Bayesian method (J* = 3) | 600 | 0.46 | 0.47 | 0.48 | 0.42 | 0.47
Bayesian method (J* = 3) | 1000 | 0.41 | 0.43 | 0.41 | 0.46 | 0.46
Bayesian method (J* = 3) | 1400 | 0.38 | 0.41 | 0.38 | 0.56 | 0.48
Bayesian method (J* = 5) | 600 | 0.46 | 0.48 | 0.48 | 0.41 | 0.47
Bayesian method (J* = 5) | 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.46
Bayesian method (J* = 5) | 1400 | 0.38 | 0.41 | 0.38 | 0.55 | 0.48
Bayesian method (J* = 7) | 600 | 0.46 | 0.47 | 0.48 | 0.41 | 0.47
Bayesian method (J* = 7) | 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.47
Bayesian method (J* = 7) | 1400 | 0.38 | 0.40 | 0.37 | 0.55 | 0.48
Bayesian method (J* = 10) | 600 | 0.46 | 0.47 | 0.48 | 0.41 | 0.47
Bayesian method (J* = 10) | 1000 | 0.41 | 0.43 | 0.41 | 0.45 | 0.46
Bayesian method (J* = 10) | 1400 | 0.38 | 0.40 | 0.37 | 0.55 | 0.48
Table A7. Inclusion rates of five telematics variables—Brake.09miles, Brake.11miles, Brake.12miles, Brake.14miles, and Left.turn.intensity08—using the LASSO, EN, and proposed Bayesian models.

Method | S | Brake.09miles | Brake.11miles | Brake.12miles | Brake.14miles | Left.turn.intensity08
LASSO | 600 | 0.28 | 0.50 | 0.04 | 0.08 | 0.12
LASSO | 1000 | 0.30 | 0.66 | 0.12 | 0.10 | 0.20
LASSO | 1400 | 0.16 | 0.90 | 0.08 | 0.02 | 0.30
EN | 600 | 0.42 | 0.56 | 0.14 | 0.04 | 0.12
EN | 1000 | 0.38 | 0.82 | 0.16 | 0.08 | 0.28
EN | 1400 | 0.22 | 0.92 | 0.14 | 0.06 | 0.30
Bayesian method (J* = 2) | 600 | 0.51 | 0.63 | 0.49 | 0.45 | 0.40
Bayesian method (J* = 2) | 1000 | 0.51 | 0.68 | 0.45 | 0.42 | 0.35
Bayesian method (J* = 2) | 1400 | 0.56 | 0.72 | 0.39 | 0.35 | 0.33
Bayesian method (J* = 3) | 600 | 0.51 | 0.62 | 0.49 | 0.45 | 0.40
Bayesian method (J* = 3) | 1000 | 0.52 | 0.68 | 0.44 | 0.42 | 0.35
Bayesian method (J* = 3) | 1400 | 0.56 | 0.72 | 0.40 | 0.35 | 0.34
Bayesian method (J* = 5) | 600 | 0.51 | 0.63 | 0.49 | 0.45 | 0.40
Bayesian method (J* = 5) | 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35
Bayesian method (J* = 5) | 1400 | 0.56 | 0.71 | 0.40 | 0.35 | 0.34
Bayesian method (J* = 7) | 600 | 0.52 | 0.63 | 0.49 | 0.46 | 0.40
Bayesian method (J* = 7) | 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35
Bayesian method (J* = 7) | 1400 | 0.56 | 0.72 | 0.39 | 0.35 | 0.33
Bayesian method (J* = 10) | 600 | 0.52 | 0.63 | 0.49 | 0.46 | 0.40
Bayesian method (J* = 10) | 1000 | 0.52 | 0.68 | 0.45 | 0.42 | 0.35
Bayesian method (J* = 10) | 1400 | 0.56 | 0.71 | 0.39 | 0.35 | 0.33
Table A8. Inclusion rates of five telematics variables—Left.turn.intensity09, Left.turn.intensity10, Left.turn.intensity11, Left.turn.intensity12, and Right.turn.intensity08—using the LASSO, EN, and proposed Bayesian models.

Method | S | Left.turn.intensity09 | Left.turn.intensity10 | Left.turn.intensity11 | Left.turn.intensity12 | Right.turn.intensity08
LASSO | 600 | 0.02 | 0.04 | 0.02 | 0.10 | 0.20
LASSO | 1000 | 0.06 | 0.00 | 0.04 | 0.10 | 0.36
LASSO | 1400 | 0.04 | 0.00 | 0.04 | 0.12 | 0.40
EN | 600 | 0.06 | 0.02 | 0.10 | 0.00 | 0.20
EN | 1000 | 0.12 | 0.02 | 0.04 | 0.14 | 0.30
EN | 1400 | 0.04 | 0.00 | 0.08 | 0.18 | 0.48
Bayesian method (J* = 2) | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44
Bayesian method (J* = 2) | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42
Bayesian method (J* = 2) | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40
Bayesian method (J* = 3) | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44
Bayesian method (J* = 3) | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42
Bayesian method (J* = 3) | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40
Bayesian method (J* = 5) | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44
Bayesian method (J* = 5) | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42
Bayesian method (J* = 5) | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40
Bayesian method (J* = 7) | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44
Bayesian method (J* = 7) | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42
Bayesian method (J* = 7) | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40
Bayesian method (J* = 10) | 600 | 0.42 | 0.44 | 0.44 | 0.44 | 0.44
Bayesian method (J* = 10) | 1000 | 0.38 | 0.38 | 0.38 | 0.37 | 0.42
Bayesian method (J* = 10) | 1400 | 0.36 | 0.36 | 0.35 | 0.34 | 0.40
Table A9. Inclusion rates of four telematics variables—Right.turn.intensity09, Right.turn.intensity10, Right.turn.intensity11, and Right.turn.intensity12—using the LASSO, EN, and proposed Bayesian models.

| Method | S | Right.turn.intensity09 | Right.turn.intensity10 | Right.turn.intensity11 | Right.turn.intensity12 |
|---|---|---|---|---|---|
| LASSO | 600 | 0.04 | 0.02 | 0.06 | 0.08 |
| | 1000 | 0.06 | 0.40 | 0.04 | 0.26 |
| | 1400 | 0.02 | 0.04 | 0.06 | 0.32 |
| EN | 600 | 0.10 | 0.04 | 0.04 | 0.14 |
| | 1000 | 0.18 | 0.08 | 0.08 | 0.30 |
| | 1400 | 0.08 | 0.06 | 0.16 | 0.34 |
| Bayesian method (J* = 2) | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.41 | 0.41 | 0.36 |
| Bayesian method (J* = 3) | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.41 | 0.41 | 0.36 |
| Bayesian method (J* = 5) | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.36 |
| Bayesian method (J* = 7) | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.36 |
| Bayesian method (J* = 10) | 600 | 0.46 | 0.46 | 0.46 | 0.45 |
| | 1000 | 0.43 | 0.43 | 0.43 | 0.40 |
| | 1400 | 0.41 | 0.42 | 0.41 | 0.37 |

References

1. Ayuso, M.; Guillén, M.; Pérez-Marín, A.M. Time and distance to first accident and driving patterns of young drivers with pay-as-you-drive insurance. Accid. Anal. Prev. 2014, 73, 125–131.
2. Cheng, J.; Feng, F.Y.; Zeng, X. Pay-as-you-drive insurance: Modeling and implications. N. Am. Actuar. J. 2023, 27, 303–321.
3. Gao, G.; Meng, S.; Wüthrich, M.V. Claims frequency modeling using telematics car driving data. Scand. Actuar. J. 2019, 2019, 143–162.
4. Jeong, H. Dimension reduction techniques for summarized telematics data. J. Risk Manag. 2022, 33, 1–24.
5. Peiris, H.; Jeong, H.; Zou, B. Development of Telematics Risk Scores in Accordance with Regulatory Compliance. SSRN 5049191, 2024. Available online: https://ssrn.com/abstract=5049191 (accessed on 31 January 2025).
6. Guillen, M.; Nielsen, J.P.; Ayuso, M.; Pérez-Marín, A.M. The use of telematics devices to improve automobile insurance rates. Risk Anal. 2019, 39, 662–672.
7. Jiang, Q.; Shi, T. Auto insurance pricing using telematics data: Application of a hidden Markov model. N. Am. Actuar. J. 2024, 28, 822–839.
8. Chan, I.W.; Tseung, S.C.; Badescu, A.L.; Lin, X.S. Data mining of telematics data: Unveiling the hidden patterns in driving behavior. N. Am. Actuar. J. 2025, 29, 275–309.
9. Guillen, M.; Nielsen, J.P.; Pérez-Marín, A.M. Near-miss telematics in motor insurance. J. Risk Insur. 2021, 88, 569–589.
10. So, B.; Jeong, H. Simulation engine for adaptive telematics data. Variance 2025, 18.
11. Holzapfel, J.; Peter, R.; Richter, A. Mitigating moral hazard with usage-based insurance. J. Risk Insur. 2024, 91, 813–839.
12. Peiris, H.; Jeong, H.; Kim, J.K.; Lee, H. Integration of traditional and telematics data for efficient insurance claims prediction. ASTIN Bull. J. IAA 2024, 54, 263–279.
13. Williams, B.; Hansen, G.; Baraban, A.; Santoni, A. A practical approach to variable selection—A comparison of various techniques. In Proceedings of the Casualty Actuarial Society E-Forum, Philadelphia, PA, USA, 9–11 March 2015.
14. Devriendt, S.; Antonio, K.; Reynkens, T.; Verbelen, R. Sparse regression with multi-type regularized feature modeling. Insur. Math. Econ. 2021, 96, 248–261.
15. McGuire, G.; Taylor, G.; Miller, H. Self-assembling insurance claim models using regularized regression and machine learning. Variance 2021, 14, 1–22.
16. Jeong, H.; Chang, H.; Valdez, E.A. A non-convex regularization approach for stable estimation of loss development factors. Scand. Actuar. J. 2021, 2021, 779–803.
17. So, B.; Boucher, J.P.; Valdez, E.A. Synthetic dataset generation of driver telematics. Risks 2021, 9, 58.
18. Brazauskas, V.; Jones, B.L.; Zitikis, R. Robust fitting of claim severity distributions and the method of trimmed moments. J. Stat. Plan. Inference 2009, 139, 2028–2043.
19. Shi, P.; Basu, S.; Meyers, G.G. A Bayesian log-normal model for multivariate loss reserving. N. Am. Actuar. J. 2012, 16, 29–51.
20. Jeong, H.; Dey, D. Application of a vine copula for multi-line insurance reserving. Risks 2020, 8, 111.
21. Bae, T.; Miljkovic, T. Loss modeling with the size-biased lognormal mixture and the entropy regularized EM algorithm. Insur. Math. Econ. 2024, 117, 182–195.
22. Abu-Mostafa, Y.S.; Magdon-Ismail, M.; Lin, H.T. Learning from Data; AMLBook: New York, NY, USA, 2012; Volume 4.
23. Ishwaran, H.; Rao, J.S. Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Stat. 2005, 33, 730–773.
24. Narisetty, N.N.; He, X. Bayesian variable selection with shrinking and diffusing priors. Ann. Stat. 2014, 42, 789–817.
25. Li, F.; Zhang, N.R. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J. Am. Stat. Assoc. 2010, 105, 1202–1214.
Figure 1. Histogram of the response variable.
Figure 2. Boxplots of RMSEs for LASSO, EN, and linear models (LMs) across different sample sizes. The horizontal axis indicates sample sizes S ∈ {600, 1000, 1400}, and the vertical axis shows the RMSEs.
Table 1. Description of the traditional and telematics features in the dataset.

| Type | Variable | Description |
|---|---|---|
| Traditional | Insured.age | Age of insured driver, in years |
| | Insured.sex | Sex of insured driver (Male/Female) |
| | Car.age | Age of vehicle, in years |
| | Marital | Marital status (single/married) |
| | Car.use | Use of vehicle: private, commute, farmer, commercial |
| | Credit.score | Credit score of insured driver |
| | Region | Type of region where driver lives: rural, urban |
| | Annual.miles.drive | Annual miles expected to be driven, as declared by driver |
| | Years.noclaims | Number of years without any claims |
| | TerritoryEmb | Embedded value from the territorial location of vehicle |
| Telematics | Annual.pct.driven | Annualized percentage of time on the road |
| | Total.miles.driven | Total distance driven, in miles |
| | Pct.drive.xxx | Percent of driving on day xxx of the week: mon/tue/…/sun |
| | Pct.drive.xhrs | Percent of vehicle driven within x hrs: 2/3/4 h |
| | Pct.drive.wkday | Percent of vehicle driven during weekdays |
| | Pct.drive.wkend | Percent of vehicle driven during weekends |
| | Pct.drive.rush.am | Percent of driving during a.m. rush hours |
| | Pct.drive.rush.pm | Percent of driving during p.m. rush hours |
| | Avgdays.week | Mean number of days the vehicle is used per week |
| | Accel.xxmiles | Number of sudden accelerations of 6/8/9/11/12/14 mph/s per 1000 miles |
| | Brake.xxmiles | Number of sudden brakes of 6/8/9/11/12/14 mph/s per 1000 miles |
| | Left.turn.intensityxx | Number of left turns per 1000 miles with intensity 08/09/10/11/12 |
| | Right.turn.intensityxx | Number of right turns per 1000 miles with intensity 08/09/10/11/12 |
| Response | log(50 + AMT_Claim) | Natural logarithm of the total amount of observed claims, shifted by 50 |
Table 2. Summary statistics for the RMSEs of LASSO, EN, and linear models across 50 samples.

| Method | S | Min | 1st Quartile | Median | Mean | 3rd Quartile | Max |
|---|---|---|---|---|---|---|---|
| LASSO | 600 | 0.923 | 0.942 | 0.953 | 1.122 | 0.965 | 6.343 |
| | 1000 | 0.927 | 0.940 | 0.948 | 1.021 | 0.958 | 3.771 |
| | 1400 | 0.925 | 0.936 | 0.942 | 0.945 | 0.948 | 1.079 |
| EN | 600 | 0.918 | 0.943 | 0.956 | 1.156 | 0.968 | 5.838 |
| | 1000 | 0.928 | 0.941 | 0.947 | 0.999 | 0.957 | 2.704 |
| | 1400 | 0.926 | 0.938 | 0.942 | 0.959 | 0.949 | 1.745 |
| Linear model | 600 | 1.023 | 4.359 | 7.981 | 10.915 | 14.320 | 45.505 |
| | 1000 | 0.971 | 2.128 | 4.268 | 7.253 | 10.418 | 41.810 |
| | 1400 | 0.977 | 1.452 | 2.326 | 2.897 | 3.250 | 9.854 |
Table 3. Inclusion rates of the five traditional variables—Insured.age, Car.age, Car.use, Credit.score, and Region—using the LASSO and EN models.

| Method | S | Insured.age | Car.age | Car.use | Credit.score | Region |
|---|---|---|---|---|---|---|
| LASSO | 600 | 0.46 | 0.54 | 0.30 | 0.92 | 0.20 |
| | 1000 | 0.68 | 0.72 | 0.40 | 1.00 | 0.26 |
| | 1400 | 0.74 | 0.90 | 0.64 | 1.00 | 0.14 |
| EN | 600 | 0.68 | 0.50 | 0.30 | 1.00 | 0.24 |
| | 1000 | 0.74 | 0.80 | 0.48 | 1.00 | 0.24 |
| | 1400 | 0.84 | 0.92 | 0.68 | 1.00 | 0.26 |
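The inclusion rates above are the fraction of the 50 repeated samples in which a method retains a nonzero coefficient for a given feature. The following is an illustrative sketch only, using synthetic Gaussian data and scikit-learn's `LassoCV` rather than the paper's simulation design or telematics dataset:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)

def inclusion_rates(n_repeats=50, S=600, p=10, n_informative=3):
    """Fraction of repeated samples of size S in which each coefficient
    survives LASSO selection (i.e., is estimated as nonzero)."""
    beta = np.zeros(p)
    beta[:n_informative] = 1.0          # truly informative features
    kept = np.zeros(p)
    for _ in range(n_repeats):
        X = rng.standard_normal((S, p))
        y = X @ beta + rng.standard_normal(S)
        fit = LassoCV(cv=5).fit(X, y)   # penalty chosen by cross-validation
        kept += fit.coef_ != 0
    return kept / n_repeats

rates = inclusion_rates()
```

In such a sketch, strongly informative features are retained in nearly every repeat while noise features enter only occasionally, the same qualitative pattern the LASSO and EN rows display for strong versus weak predictors.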
Table 4. Inclusion rates of the five traditional variables—Insured.age, Car.age, Car.use, Credit.score, and Region—using the proposed Bayesian model.

| J* | S | Insured.age | Car.age | Car.use | Credit.score | Region |
|---|---|---|---|---|---|---|
| 2 | 600 | 0.56 | 0.69 | 0.50 | 0.99 | 0.40 |
| | 1000 | 0.49 | 0.76 | 0.49 | 1.00 | 0.33 |
| | 1400 | 0.43 | 0.86 | 0.54 | 1.00 | 0.29 |
| 3 | 600 | 0.73 | 0.81 | 0.69 | 1.00 | 0.62 |
| | 1000 | 0.66 | 0.86 | 0.67 | 1.00 | 0.53 |
| | 1400 | 0.61 | 0.92 | 0.69 | 1.00 | 0.48 |
| 5 | 600 | 0.96 | 0.98 | 0.96 | 1.00 | 0.94 |
| | 1000 | 0.95 | 0.98 | 0.95 | 1.00 | 0.92 |
| | 1400 | 0.93 | 0.99 | 0.95 | 1.00 | 0.90 |
| 7 | 600 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 |
| | 1000 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 |
| | 1400 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 |
| 10 | 600 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 1000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 1400 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Table 5. Summary statistics for the RMSEs of LASSO, EN, and the proposed Bayesian method with J* = 10 across the 50 samples.

| Method | S | Min | 1st Quartile | Median | Mean | 3rd Quartile | Max |
|---|---|---|---|---|---|---|---|
| LASSO | 600 | 0.923 | 0.942 | 0.953 | 1.122 | 0.965 | 6.343 |
| | 1000 | 0.927 | 0.940 | 0.948 | 1.021 | 0.958 | 3.771 |
| | 1400 | 0.925 | 0.936 | 0.942 | 0.945 | 0.948 | 1.079 |
| EN | 600 | 0.918 | 0.943 | 0.956 | 1.156 | 0.968 | 5.838 |
| | 1000 | 0.928 | 0.941 | 0.947 | 0.999 | 0.957 | 2.704 |
| | 1400 | 0.926 | 0.938 | 0.942 | 0.959 | 0.949 | 1.745 |
| Bayesian method | 600 | 0.919 | 0.948 | 0.959 | 0.960 | 0.969 | 1.021 |
| | 1000 | 0.923 | 0.941 | 0.951 | 0.949 | 0.958 | 0.975 |
| | 1400 | 0.926 | 0.934 | 0.941 | 0.941 | 0.948 | 0.969 |
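Each row of Tables 2 and 5 reports order statistics of the RMSEs over the 50 hold-out samples for one method and sample size. A minimal sketch of how such a row can be computed, using hypothetical RMSE values rather than the paper's results:

```python
import numpy as np

def summarize_rmses(rmses):
    """Min / quartiles / mean / max, matching the columns of Tables 2 and 5."""
    q1, med, q3 = np.percentile(rmses, [25, 50, 75])
    return {"Min": rmses.min(), "1st Quartile": q1, "Median": med,
            "Mean": rmses.mean(), "3rd Quartile": q3, "Max": rmses.max()}

# Hypothetical RMSEs from 50 evaluation samples.
rng = np.random.default_rng(0)
rmses = rng.normal(loc=0.95, scale=0.02, size=50)
row = {k: round(float(v), 3) for k, v in summarize_rmses(rmses).items()}
print(row)
```

A large gap between the median and the mean (as in the LASSO and EN rows) signals a few samples with very large RMSEs, which the max column confirms.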
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Jeong, H.; Kim, M. Bayesian Approach to Simultaneous Variable Selection and Estimation in a Linear Regression Model with Applications in Driver Telematics. Mathematics 2025, 13, 3341. https://doi.org/10.3390/math13203341