Synthetic Dataset Generation of Driver Telematics

This article describes techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset. The synthetic dataset has 100,000 policies that include observations about drivers' claims experience together with associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models for assessing risks in usage-based insurance. It follows a three-stage process using machine learning algorithms. The first stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The second stage simulates values for the aggregated amount of claims as a regression using feedforward neural networks, with the number of claims included in the set of feature variables. In the final stage, a synthetic portfolio of the space of feature variables is generated by applying an extended $\texttt{SMOTE}$ algorithm. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summaries show remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work valuable.


Background
Usage-based insurance (UBI) is a recent innovative product in the insurance industry that exploits the use and access of improved technology. It is a type of automobile insurance policy where the cost of insurance is directly linked to the usage of the automobile. With the help of a telematics device or mobile app, auto insurers are able to track and monitor mileage, speed, acceleration, and other driving-related data. This data transmission allows insurers to store information for monitoring driving behavior and, subsequently, for risk assessment purposes.
According to the Oxford dictionary, telematics refers to "the use or study of technology that allows information to be sent over long distances using computers." Its origin can be traced back to the French word télématique, which combines the words "telecommunications" and "computing science." There is a growing list of applications of telematics in various industries, and it is most prominently used in the insurance industry. The infrastructure offered by health telematics allows for access to healthcare that helps reduce costs while optimizing quality of patient care. The installation of a smart home system with alarms that remotely monitor home security can drastically reduce the cost of homeowners insurance. In auto insurance, a plug-in device, integrated equipment installed by car manufacturers, or a mobile application can be used to directly monitor cars, thereby allowing insurers to more closely align driving behaviors with insurance premium rates through UBI. According to Karapiperis et al. (2015), Progressive Insurance Company, in collaboration with General Motors, offered the first such UBI in the early 2000s, with premium discounts linked to monitoring of driving activities and behavior. With the agreement of the driver, a tracking device was installed in the vehicle to collect information through GPS technology. Subsequently, with further advances in technology, different forms of UBI have emerged that include, for example, Pay-as-you-Drive (PAYD), Pay-how-you-Drive (PHYD), Pay-as-you-Drive-as-you-Save (PAYDAYS), Pay-per-mile, and Pay-as-you-Go (PASG).
The variations in UBI programs generally fall into two broad categories: how you drive and how far you drive. In the first category, insurers track data that reflect your driving maneuvering behavior, such as the changes in your speed, how fast you are driving as you make a left or right turn, the day of the week you drive, and the time of day you drive. In the second category, insurers track data related to your driving mileage, essentially the distance you travel in miles or kilometers. It is interesting to note that, even prior to the development of telematics, Butler (1993) had suggested the use of cents-per-mile premium rating for auto insurance. See also Denuit et al. (2007) for an early discussion of the development of PAYD auto pricing.

Literature
The actuarial implications of usage-based insurance for fair risk classification and a more equitable premium rating are relevant; this is reflected in the growth of the literature on telematics in actuarial science and insurance. Much of the research on telematics has found that the additional information derived from telematics improves claims predictions, risk classification, and premium assessments. Husnjak et al. (2015) provide a very nice overview of the architecture and pricing paradigms employed by various telematics programs around the world. Table 1 provides an overview of the literature in actuarial science and insurance, with an outline of the work describing the data source, the period of observation with sample size, the analytical techniques employed, and a brief summary of the research findings. For example, the early work of Ayuso et al. (2014) compares the driving behaviors of novice and experienced young drivers, those aged below 30, with PAYD policies. The analysis is based on a sample of 15,940 young drivers with PAYD policies in 2009 drawn from a leading Spanish insurance company. The work of Guillen et al. (2020) demonstrates how the additional information drawn from telematics can help predict near-miss events. The analysis is based on a pilot study of drivers from Greece in 2017 who agreed to participate in a telematics program.

Motivation
In this article, we provide the details of the procedures employed in the production of a synthetic dataset of driver telematics. This synthetic dataset was generated to imitate the intricate characteristics of a similar real insurance dataset; the intent is neither to reproduce nor to replicate the original data, so that the privacy of the original source is preserved. In the final synthetic dataset, we produced 100,000 policies that include observations about drivers' information and claims experience (number of claims and aggregated amount of claims) together with associated classical risk variables and telematics-related variables. As previously discussed, an increasingly popular auto insurance product innovation is usage-based insurance (UBI), where a tracking device or a mobile app is installed to monitor insured driving behaviors. Such monitoring is an attempt by the industry to link the risk premiums assessed with observable variables that are more directly tied to driving behaviors. While such monitoring may be engineered more frequently than that reproduced or implied in our synthetic dataset, the dataset is in aggregated or summarized form, assumed to be observed over a certain period of time, and can be used for research purposes of performing risk analysis of UBI products. For the academic researcher, the dataset can be used to calibrate advances in actuarial and risk assessment modeling. The practitioner, on the other hand, may find the data useful for market research purposes where, for instance, an insurer intends to penetrate the UBI market.
In the actuarial and insurance community, driven by industry need and facilitated by advances in computing technology, there is a continually growing need for data analytics to perform risk assessment with high accuracy and efficiency. Such an exercise involves the construction, calibration, and testing of statistical learning models, which in turn requires access to big and diverse data with meaningful information. Access to such data can be prohibitively difficult, understandably so because several insurers are reluctant to provide data to researchers out of concern for privacy.
This drives a continuing interest in and demand for synthetic data that can be used to perform data and predictive analytics. This growth is being addressed in the academic community. To illustrate, the work of Gan and Valdez (2007) and Gan and Valdez (2018) created synthetic datasets of large portfolios of variable annuity products so that different metamodeling techniques can be constructed and tested. Such techniques have the potential benefit of addressing the intensive computational issues associated with the Monte Carlo techniques commonly used in practice. Metamodels have the added benefit of drastically reducing computational times, thereby providing a more rapid response to risk management when market forces drive the values of these portfolios. Gabrielli and Wüthrich (2018) developed a stochastic simulation machinery to produce a synthetic dataset that is "realistic" and reflects a real insurance claims dataset; the intention is for analysts and researchers to have access to large datasets in order to develop and test individual claims reserving models. Our paper intends to continue this trend of supporting researchers by providing them with a synthetic dataset that allows them to calibrate advancing models. More specifically, we build the data generating process to produce an imitation of the real telematics data. The procedure initially constructs two neural networks, which emulate the number of claims and the aggregated amount of claims that can be drawn from real data. We then generate 100,000 synthetic observations with features using an extended version of SMOTE. Feeding the synthetic observations into the two neural networks, we are able to produce the complete portfolio with the synthetic number of claims and aggregated amount of claims.
The rest of this paper has been structured as follows. Section 2 describes the machine learning algorithms used to perform the data generation. Section 3 provides a description of all the variables included in the synthetic datafile. Section 4 provides the details of the data generation process using the feedforward neural networks and the extended SMOTE. This section also provides the comparison of the real data and the synthetically generated data when Poisson and gamma regression models are used. We conclude in Section 5.

Related work
This section briefly explains two popular machine learning algorithms that we employed to generate the telematics synthetic dataset. The first algorithm is the extended SMOTE, Synthetic Minority Oversampling Technique. This procedure is used to generate the classical and telematics predictor variables in the dataset. The second algorithm is the feedforward neural network. This is used to generate the corresponding response variables that describe number of claims and the aggregated amount of claims.

Extended SMOTE
Developed by Chawla et al. (2002), the Synthetic Minority Oversampling Technique (SMOTE) was originally intended to address classification datasets with severe class imbalances. The procedure augments the data by oversampling observations for the minority class, and this is accomplished by selecting samples that are within the neighborhood in the feature space. First, we choose a minority class instance and then we obtain its K-nearest neighbors, where K is typically set to 5. All K neighbors should be minority instances. Then, one of these K neighbor instances is randomly chosen to compute new instances by interpolation. The interpolation is performed by computing the difference between the minority class instance under consideration and the selected neighbor. This difference is multiplied by a random number uniformly drawn between 0 and 1, and the result is added to the minority class instance under consideration. In effect, this procedure does not duplicate observations; instead, the interpolation selects a random point along the "line segment" between the two feature vectors (Fernández et al. (2018)).
This principle of SMOTE for creating synthetic data points from the minority class is adopted in this paper with a minor adjustment. In our data generation, we applied it to generate predictor variables based on the entire feature space of the original or real dataset. The one minor adjustment is to tweak the interpolation by randomly drawing a number from a U-shaped distribution, rather than a uniform distribution, between 0 and 1. This mechanism has the effect of maintaining the characteristics of the original or real dataset with a small possibility of duplication. In particular, we are able to capture characteristics of observations that may be considered unusual or outliers. Further description of the synthetically generated portfolio is given in Section 4.1.3.
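The interpolation step of the original SMOTE can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `smote_sample` and the toy minority points are hypothetical, and the code is meant as an illustration of the idea rather than a reference implementation.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # K nearest minority neighbors of the base point (the base itself is excluded)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # uniform draw on [0, 1)
    # A point on the line segment between the base point and the chosen neighbor
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Hypothetical toy minority class in a two-dimensional feature space
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
rng = random.Random(1)
synthetic_points = [smote_sample(minority_points, k=5, rng=rng) for _ in range(100)]
```

Because each synthetic point is a convex combination of two existing points, it always stays inside the range spanned by the minority class.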

Feedforward neural network
Loosely modeled after the idea of neurons that form the human brain, a neural network consists of a set of algorithms for doing machine learning in order to recognize patterns. Neural networks are indeed very versatile, as they can be used for addressing problems in either supervised or unsupervised learning; this set of algorithms has grown in popularity as the method continues to provide strong evidence of its ability to produce predictions with high accuracy. A number of studies using neural networks have been published in the actuarial and insurance literature. Wüthrich (2019) showed that the biased estimation issue resulting from the use of neural networks with an early stopping rule can be diminished using a shrinkage version of regularization. Yan et al. (2020) used a backpropagation (BP) neural network optimized by an improved adaptive genetic algorithm to build a car insurance fraud detection model. Additional research has revealed the benefits and advantages of neural networks applied to various models for insurance pricing, fraud detection, and underwriting. These include, but are not limited to, Viaene et al. (2005), Dalkilic et al. (2009), Ibiwoye et al. (2012), and Kiermayer and Weiß (2020).
The idea of neural networks can be attributed to the early work of McCulloch and Pitts (1943). A neural network (NN) consists of several processing nodes, referred to as neurons, considered to be simple yet densely interconnected. Each neuron produces a sequence of real-valued activations triggered by a so-called activation function, and these neurons are organized into layers to form a network. The activation function plays a crucial role in the output of the model, affecting its predictive accuracy, the computational efficiency of learning a model, and convergence. There are several types of neural network activation functions, and we choose just a few of them for our purpose.
Neural network algorithms have the tendency to be complex and to overfit the training dataset. Because of this model complexity, they are often referred to as black-box as it becomes difficult sometimes to draw practical insights into the learning mechanisms employed. Part of this problem has to do with the large number of parameters and the resulting non-linearity of the activation functions. However, these disadvantageous features of the model may be beneficial for the purpose of our data generation. For instance, the overfitting may help us build a model with high accuracy and precision so that we produce a synthetic portfolio that mimics the characteristics of the portfolio derived from the real dataset.
For feedforward neural networks, signals are more straightforward because they are allowed to go in one direction only: from input to output (Goodfellow et al. (2016)). In effect, the output from any layer does not directly affect that same layer, so that there are no resulting feedback loops. In contrast, for recurrent neural networks, signals can travel in both directions so that feedback loops may be introduced in the network. Although considered more powerful, computations within recurrent neural networks are much more complicated than those within feedforward neural networks. As later described in the paper, we fit two simulations using the feedforward neural network. The network illustrated in Figure 1 has three feature variables as the input, one hidden layer, two nodes for the hidden layer, and the response variable y as the resulting output. The activation function (f) is responsible for converting the weighted sum of previous node values into a node value of that layer. Representative activation functions are the sigmoid and Rectified Linear Unit (ReLU) functions, as seen in the bottom left of Figure 1. The sigmoid is used as an activation function that converts any real-valued input to a value between 0 and 1, which can be interpreted as a probability. It is because of this property that the neural network can be used as a binary classifier. On the other hand, the ReLU function is a piecewise linear function that gives the input directly as output, if positive, and zero as output, otherwise. This function is often the default for many neural network algorithms because it is believed to train the model with ease and with outstanding performance.
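To make the forward pass concrete, the following minimal sketch evaluates a network with three inputs, one hidden layer of two ReLU nodes, and a sigmoid output. The weights, the bias terms, and the input vector are made-up values chosen for illustration only; they are not taken from the fitted models in this paper.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Return the input if positive, and zero otherwise."""
    return max(0.0, z)

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One forward pass: input -> hidden layer (ReLU) -> output node (sigmoid)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical parameters: two hidden nodes, each with three incoming weights
W_hidden = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.7]
b_out = -0.1

p = forward([1.0, 2.0, 3.0], W_hidden, b_hidden, w_out, b_out)
label = 1 if p > 0.5 else 0  # binary classification with a 0.5 threshold
```

Signals only flow from the input to the output, so evaluating the network is a single pass through the layers with no feedback loop.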
In the feedforward neural network, parameters are the weights ($w_i$) of connections between layers. Hyperparameters are the values that determine the architecture of the neural network model, which include, among others, the number of layers, the number of nodes in each layer, activation functions, and parameters used for the optimizer (e.g., the Stochastic Gradient Descent (SGD) learning rate). Parameters can be learned from the data using a loss optimizer. However, hyperparameters must be predetermined prior to the learning process and, in many cases, these decisions depend on the judgment of the analyst or the user. The work of Hornik et al. (1989) proved that standard multi-layer feedforward networks are capable of approximating any measurable function, and they are thus called universal approximators. This implies that any lack of success in applications must arise from inadequate learning, insufficient numbers of hidden units, or the lack of a deterministic relationship between input and target. Hyperparameters may be more essential in deep learning to be able to yield satisfactory output.
We found that a number of studies on neural networks focused on introducing algorithms for optimizing hyperparameter values. Some of the frequently used search strategies are grid search, random search (Bergstra and Bengio (2012)), and sequential model-based optimization (Bergstra et al. (2011)). This line of work on hyperparameters is presently a very active field of research that includes, for example, hyperparameters in the parameter learning process (e.g., Thiede and Parlitz (2019), Franceschi et al. (2017), and Maclaurin et al. (2015)). However, the methods proposed in the current literature are relatively new and not mature enough to be used in practical real-world problems. The simple and widely used optimization algorithms are the grid search and the random search. The grid search, on one hand, discretizes the search space of each hyperparameter and, based on the Cartesian product, discretizes the total search space of hyperparameters. Then, after learning for each set of hyperparameters, we select the best at the end. It is intuitive and easy to apply, but it does not take into account relative feature importance, and therefore is considered ineffective and extremely time-consuming. This method is also severely influenced by the curse of dimensionality as the number of hyperparameters increases. In the random search, on the other hand, hyperparameters are randomly sampled. Bergstra and Bengio (2012) showed that the random search, as compared to the grid search, is particularly effective, especially when dealing with relative feature importance. However, since the next trial set of hyperparameters is not chosen based on previous results, it is also time-consuming, especially when it involves a large number of hyperparameters, thereby suffering from the same curse of dimensionality as the grid search.
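The difference between the two simple strategies can be sketched in a few lines. The search space, the ranges, and the `toy_loss` function below are hypothetical stand-ins; in practice the loss would come from training and validating a network for each configuration.

```python
import itertools
import random

# Hypothetical discretized search space for three hyperparameters
space = {
    "n_hidden_layers": [1, 2, 4],
    "n_nodes": [16, 64, 256],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

def grid_search(space):
    """Enumerate the Cartesian product of the discretized hyperparameter values."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(space[n] for n in names))]

def random_search(n_trials, rng):
    """Sample each hyperparameter independently from its (assumed) range."""
    return [{
        "n_hidden_layers": rng.randint(1, 4),
        "n_nodes": rng.randint(16, 256),
        "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform draw
    } for _ in range(n_trials)]

def toy_loss(cfg):
    """Made-up stand-in for a validation loss; not a real training run."""
    return abs(cfg["n_hidden_layers"] - 2) + abs(cfg["learning_rate"] - 1e-3)

grid_configs = grid_search(space)
random_configs = random_search(10, random.Random(0))
best_grid = min(grid_configs, key=toy_loss)
```

The grid grows multiplicatively with every additional hyperparameter (here 3 x 3 x 3 = 27 configurations), which is the curse of dimensionality mentioned above, while the cost of the random search is fixed by the number of trials.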
To optimize hyperparameters, we find that one of the most powerful strategies is sequential model-based optimization, also sometimes referred to as Bayesian optimization. Here, the next set of hyperparameters is determined based on the results of the previous sets of hyperparameters. Bergstra et al. (2011) and Snoek et al. (2012) showed that sequential model-based optimization outperforms both grid and random searches. Sequential model-based optimization constructs a probabilistic surrogate model to define the posterior distribution over the unknown black-box function (the loss function). The posterior distribution is developed by conditioning on the previous evaluations, and a proxy optimization is performed to seek the next location to evaluate. For the proxy optimization, the acquisition function is computed based on the posterior distribution and has the highest value at the location having the highest probability of the lowest loss; this point becomes the next location. Most commonly, Gaussian processes are used as the surrogate model because of their flexibility, well-calibrated uncertainty, and analytic properties (Murugan (2017)). Thus, we use the Gaussian process as the hyperparameter tuning algorithm.
Another important decision, which may affect the time efficiency and performance of the neural network model, is the choice of optimizer. The optimizer refers to an algorithm used to update the parameters of the model in order to reduce the losses. Training a neural network is not a convex optimization problem. For this reason, the training process could become trapped in a local minimum, and the convergence rate could be so slow that the learning process remains unfinished for days (Li et al. (2012)). To address this issue, diverse optimizers have been suggested: Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Momentum, AdaGrad (Duchi et al. (2011)), RMSProp (Hinton et al. (2012)), Adam (Kingma and Ba (2014)) and others (Ruder (2016)). Adam is an efficient stochastic optimization method that combines the advantages of two popular methods: AdaGrad, which works well with sparse gradients, and RMSProp, which has excellent performance in on-line and non-stationary settings. Recent works by Zhang et al. (2019), Peng et al. (2018), Bansal et al. (2016) and Arik et al. (2017) have shown that the Adam optimizer performs better than others from both theoretical and practical perspectives. Therefore in this paper, we use Adam as the optimizer in our neural network simulations.
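The loop described above can be sketched for a single hyperparameter using only the standard library. Everything specific here is an assumption made for illustration: the squared-exponential kernel and its length scale, the probability-of-improvement acquisition function with margin `xi`, the candidate grid, and the `toy_loss` function standing in for an expensive validation loss. It is not the tuning code used for the networks in this paper.

```python
import math
import random

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalars (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def bayes_opt(loss, lo, hi, n_init=3, n_iter=8, n_grid=51, xi=0.01, noise=1e-4, seed=0):
    """Minimize `loss` on [lo, hi] with a Gaussian process surrogate."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]
    ys = [loss(x) for x in xs]
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    for _ in range(n_iter):
        mu0 = sum(ys) / len(ys)  # constant prior mean
        K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
        alpha = solve(K, [y - mu0 for y in ys])
        best, scores = min(ys), []
        for g in grid:
            k = [rbf(a, g) for a in xs]
            mean = mu0 + sum(ki * ai for ki, ai in zip(k, alpha))
            var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
            std = math.sqrt(max(var, 1e-12))
            # Posterior probability that the loss at g falls below the best value so far
            z = (best - xi - mean) / std
            scores.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
        x_next = grid[scores.index(max(scores))]
        xs.append(x_next)
        ys.append(loss(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i], list(zip(xs, ys))

def toy_loss(x):
    """Made-up one-dimensional loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2

best_x, best_y, history = bayes_opt(toy_loss, 0.0, 1.0)
```

Unlike the grid and random searches, each new trial is placed where the surrogate model considers an improvement most likely, so the information from earlier evaluations is reused.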
The synthetic output: file description
For our portfolio emulation, we based it on a real dataset acquired from a Canadian-based insurer, which launched a UBI program for its automobile insurance policyholders in 2013. The observation period covers the years between 2013 and 2016, with over 70,000 policies observed, for which the dataset drawn is pre-engineered for training a statistical model for predictive purposes. See also So et al. (2020). We generated a synthetic portfolio of 100,000 policies. Table 2 provides the types, names, and definitions or brief descriptions of the various variables in the resulting datafile, which can be found at http://www2.math.uconn.edu/~valdez/data.html.
The synthetic datafile contains a total of 52 variables, which can be categorized into three main groups: (a) 11 traditional features such as policy duration, age and sex of driver, (b) 39 telematics features including total miles driven and number of sudden brakes or sudden accelerations, and (c) 2 response variables describing number of claims and aggregated amount of claims.
Additional specific information on the variables in the datafile is presented below:
• Duration is the period that the policyholder is insured in days, with values in [22,366].
• Insured.age is the age of the insured driver in integral years, with values in [16,103].
• Car.age is the age of the vehicle, with values in [-2,20]. Negative values are rare but are possible, as a newer model can be bought up to two years in advance.
• Years.noclaims is the number of years without any claims, with values in [0, 79] and always less than Insured.age.
• Annual.pct.driven is the number of days a policyholder uses the vehicle divided by 365, with values in [0,1.1].
• Pct.drive.mon, ..., Pct.drive.sun are compositional variables, meaning that the sum of the seven (days of the week) variables is 100%.
• AMT Claim is the aggregated amount of claims, with values in [0, 138766.5].
Summary statistics of the synthetic and real data are shown in Table 3. Table 3 provides an interesting comparison of the summary statistics of the aggregated amount of claims derived from the synthetic datafile and from the real dataset, broken down by the number of claims. First, we observe that we do not exactly replicate the statistics, a good indication that we have reconstructed a portfolio based on the real dataset with very little indication of reproducing or replicating the exact data. Second, these statistics show that we are able to preserve much of the characteristics of the original dataset according to the spread and depth of observations described in this table. To illustrate, among those with exactly 2 claims, the average amount of claim in the synthetic file is 8960 and it is 8643 in the real dataset; the median is 7034 in the synthetic file while it is 5148 in the real data. The respective standard deviations, which give a sense of how dispersed the values are from the mean, are 9554 and 10924. We compare more of these intricacies when we evaluate the quality of the reproduction by giving more details of this type of comparison.
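The value ranges and constraints listed for the variables above can be checked on a record with a short sketch like the following. The function name `check_policy`, the way the seven weekday shares are passed in, and the convention that they sum to 1.0 (rather than 100) are assumptions made for illustration; the sample records are made up.

```python
def check_policy(row, weekday_shares, total=1.0, tol=1e-6):
    """Return a list of violated constraints for one policy record."""
    problems = []
    ranges = {
        "Duration": (22, 366),
        "Insured.age": (16, 103),
        "Car.age": (-2, 20),
        "Years.noclaims": (0, 79),
        "Annual.pct.driven": (0, 1.1),
        "AMT Claim": (0, 138766.5),
    }
    for name, (lo, hi) in ranges.items():
        if not lo <= row[name] <= hi:
            problems.append(f"{name} outside [{lo}, {hi}]")
    # Years.noclaims is always less than Insured.age
    if not row["Years.noclaims"] < row["Insured.age"]:
        problems.append("Years.noclaims not less than Insured.age")
    # Compositional constraint on the seven day-of-week variables
    if abs(sum(weekday_shares) - total) > tol:
        problems.append("weekday shares do not sum to the total")
    return problems

# Hypothetical records for illustration
good = {"Duration": 366, "Insured.age": 45, "Car.age": 3, "Years.noclaims": 20,
        "Annual.pct.driven": 0.8, "AMT Claim": 0.0}
bad = dict(good, **{"Insured.age": 30, "Years.noclaims": 40})

good_problems = check_policy(good, [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1])
bad_problems = check_policy(bad, [0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
```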
As noted earlier, we reproduced 52 variables; their data types are summarized in Table 4. The NB Claim variable can be treated as integer-valued or as a categorical variable, with category 0 corresponding to those considered to be the least risky drivers, who thus far have zero

The data generating process
The data generation of the synthetic portfolio of 100,000 drivers is a three-stage process that uses feedforward neural networks to perform the two simulations and extended SMOTE to reproduce the feature space. The first stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The second stage simulates values for the amount of claims as a regression using a feedforward neural network, with the number of claims treated as one of the feature variables. In the final stage, a synthetic portfolio of the space of feature variables is generated by applying an extended SMOTE algorithm. The final synthetic data is created by combining the synthetic number of claims, the synthetic amount of claims, and finally, the synthetic portfolio. The resulting data generation is evaluated with a comparison between the synthetic data and the real data when Poisson and gamma regression models are fitted to the respective data. Note that the response variables were generated with an extremely complex and nonparametric procedure, so these comparisons do not necessarily reflect the true nature of the data generation. We also provide other visualizations and data summaries to demonstrate the remarkably similar statistics between the two datasets.

The detailed simulation procedures
Synthetic telematics data is generated based on two feedforward neural network simulations and extended SMOTE. For convenience, we will use the notation $x_i \in X = \{X_1, X_2, \ldots, X_{50}\}$, $i = 1, 2, \ldots, M$, which describes the portfolio having 50 feature variables, where $x_i$ is the $i$-th observation (the policy). $Y_1$ is NB Claim and $Y_2$ is AMT Claim. Superscript $r$ means real data and $s$ means synthetic data.

The simulation of number of claims
To mimic the real telematics data, the first step is to build the simulation generating $Y_1^s$, with four categorical values. It is a multi-class classification problem. However, we converted it into multiple binary classifications to make each process simple and simultaneously improve the accuracy of the simulation.
1. Sub-simulation 1: The data is given as the following:
$$D_1 = \{(x^r_1, z^r_{11}), (x^r_2, z^r_{12}), \ldots, (x^r_M, z^r_{1M})\}$$
2. Sub-simulation 2: The observations are re-indexed as $\{1^{(2)}, 2^{(2)}, \ldots, M^{(2)}\}$. The data is given as the following:
$$D_2 = \{(x^r_{1^{(2)}}, z^r_{21^{(2)}}), (x^r_{2^{(2)}}, z^r_{22^{(2)}}), \ldots, (x^r_{M^{(2)}}, z^r_{2M^{(2)}})\}$$
3. Sub-simulation 3: The data is given as the following:
$$D_3 = \{(x^r_{1^{(3)}}, z^r_{31^{(3)}}), (x^r_{2^{(3)}}, z^r_{32^{(3)}}), \ldots, (x^r_{M^{(3)}}, z^r_{3M^{(3)}})\}$$
A feedforward neural network simulation is learned from each $D_k$. The following hyperparameters are tuned via the Gaussian Process (GP) algorithm as detailed in the previous section: the number of hidden layers, the number of nodes for the first hidden layer, the number of nodes for the rest of the hidden layers, the activation functions, the batch size, and the learning rate. The resulting architecture of the network is given in Table 5. We use the sigmoid activation function for the output layer since this is a binary problem; its value lies between 0 and 1. The threshold is 0.5 and the cross entropy loss function is used. The weights of the neural network are optimized using the Adam optimizer.
In the Adam optimizer, as input values, we need $\alpha$ (learning rate), $\beta_1$, $\beta_2$, and $\epsilon$. See Algorithm 1. In practice, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ are commonly used and no further tuning is usually done. Thus, we only tuned the learning rate via GP.
Table 5: The architecture of the three sub-simulations for number of claims.
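For reference, the Adam update of Kingma and Ba (2014) with the default values quoted above can be written as a minimal sketch for a single parameter. The quadratic objective, the starting point, the learning rate of 0.1, and the number of steps are made-up choices for illustration.

```python
import math

def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Run Adam on a single scalar parameter, given a gradient function."""
    m = v = 0.0
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical objective f(theta) = (theta - 3)^2 with gradient 2 (theta - 3)
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Only `alpha` is varied here, which mirrors the choice of tuning the learning rate while leaving the remaining inputs at their commonly used values.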
The accuracy of the three sub-simulations is shown in Figure 2. When the real portfolio is plugged in, the prediction coincides 100% with the real number of claims. This implies that when we plug a realistic portfolio into this combined frequency simulation, we are able to arrive at a realistic number of claims. After building the three sub-simulations, we plug the synthetically generated portfolio, $X^s$, into sub-simulation 1 to get $Z_1^s$. Then we extract $X^s | Z_1^s = 1$ and plug it into sub-simulation 2 to get $Z_2^s$. Likewise, plugging $X^s | Z_2^s = 1$ into sub-simulation 3, we obtain the final one, $Z_3^s$. By combining these three results, we finally generate the synthetic number of claims, $Y_1^s$.
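The chaining of the three sub-simulations can be sketched as follows. The sketch assumes that each sub-simulation returns a probability, that a policy is passed on to the next sub-simulation only when its prediction exceeds the 0.5 threshold, and that the final count is the number of sub-simulations passed; the toy classifiers and the single "risk score" feature are hypothetical stand-ins for the fitted networks.

```python
def simulate_claim_counts(X, classifiers, threshold=0.5):
    """Combine sequential binary classifiers into a claim count per policy."""
    counts = [0] * len(X)
    active = list(range(len(X)))  # policies still passed on to the next sub-simulation
    for clf in classifiers:
        next_active = []
        for i in active:
            if clf(X[i]) > threshold:  # predicted Z = 1
                counts[i] += 1
                next_active.append(i)
        active = next_active
    return counts

# Hypothetical stand-ins for the three fitted sub-simulations
toy_classifiers = [
    lambda x: x[0],        # sub-simulation 1
    lambda x: x[0] - 0.2,  # sub-simulation 2
    lambda x: x[0] - 0.4,  # sub-simulation 3
]
X_toy = [[0.1], [0.6], [0.8], [0.95]]
claim_counts = simulate_claim_counts(X_toy, toy_classifiers)
```

Under these assumptions the result takes one of four values, 0 to 3, matching the four categories of the number of claims.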

The simulation of aggregated amount of claims
We produce the subset of the portfolio that satisfies the condition $Y_1^r > 0$. The corresponding new index of the subset is defined as $\{1^{(sev)}, 2^{(sev)}, \ldots, M^{(sev)}\}$. The number and amount of claims are not treated as independent of each other; rather, the number of claims, $Y_1^r$, is also considered as one of the feature variables. Therefore, we use the following data to train the aggregated amount of claims simulation:
$$\{((x^r_{1^{(sev)}}, y^r_{11^{(sev)}}), y^r_{21^{(sev)}}), ((x^r_{2^{(sev)}}, y^r_{12^{(sev)}}), y^r_{22^{(sev)}}), \ldots, ((x^r_{M^{(sev)}}, y^r_{1M^{(sev)}}), y^r_{2M^{(sev)}})\}$$
$Y_2^r$ is a non-negative continuous value. Thus, in the second simulation, we use ReLU as the activation function and MSE as the loss function. The Adam optimizer is used, with the hyperparameters selected in the same manner as described in Section 4.1.1. These are further described in Table 6.

| N. hidden layers | N. nodes, first hidden layer | N. nodes, rest of hidden layers | Activation | Batch size | Learning rate |
|---|---|---|---|---|---|
| 6 | 344 | 67 | ReLU | 3 | 0.000526 |

Table 6: The architecture of simulation for the aggregated amount of claims.
To generate $Y_2^s$, we use $Y_1^s$ obtained from Section 4.1.1 and we extract the subset of the synthetic portfolio with the condition $Y_1^s > 0$. This subset of the synthetic portfolio and the corresponding $Y_1^s$ are the input of the simulation to get $Y_2^s$.
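The way the two simulations fit together can be sketched as follows. The sketch assumes that policies with no simulated claims receive an amount of zero, and it clamps the prediction at zero as a stand-in for the non-negative ReLU output; the `toy_severity` function is a hypothetical placeholder for the fitted network.

```python
def simulate_claim_amounts(X, counts, severity_model):
    """Apply the severity model only to policies with at least one simulated claim."""
    amounts = []
    for x, n in zip(X, counts):
        if n > 0:
            # The simulated number of claims is appended as an extra feature
            amounts.append(max(0.0, severity_model(list(x) + [n])))
        else:
            amounts.append(0.0)  # assumed: no claims means no claim amount
    return amounts

def toy_severity(z):
    """Made-up stand-in for the fitted severity network."""
    return 1000.0 * z[-1] + 50.0 * z[0]

X_toy = [[1.0], [2.0], [3.0]]
counts_toy = [0, 1, 2]
amounts_toy = simulate_claim_amounts(X_toy, counts_toy, toy_severity)
```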

Synthetic portfolio generation
As described in Section 2.1, we propose extended version of SMOTE to generate the final synthetic portfolio, X s . Extended SMOTE is primarily different from the original SMOTE in just a single step: the interpolation step. The detailed procedure is the following: for each feature vector (observation, x r i ), the distance between x r i and the other feature vectors in X r are computed based on the Euclidean distance and one-nearest neighbor is obtained. Difference between x r i and this neighbor is multiplied by a random number drawn from the U -shape distribution as shown in Figure 4. Adding the random number to the x r i , we create a synthetic feature vector, x s i . 100,000 synthetic observations are generated, which consisted of the synthetic portfolio, X s . After applying the extended SMOTE, the following considerations had also been reflected in the synthetic portfolio generation.
• Integer features are rounded up;
• For categorical features, only Car.use is multi-class. Car.use is converted by one-hot coding before applying extended SMOTE so that every categorical feature variable has the value 0 or 1. After the generation, they are rounded up.
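The interpolation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extended_smote` is hypothetical, Beta(0.5, 0.5) is used only as a stand-in for the U-shaped distribution of Figure 4 (whose exact form is not given here), base observations are sampled with replacement to reach the requested size, and "rounded up" is read as rounding to the nearest integer.

```python
import numpy as np

def extended_smote(X, n_synthetic, integer_cols=(), rng=None):
    """Sketch of SMOTE-style interpolation towards the single nearest neighbour."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances (O(n^2) memory, fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)       # a point is not its own neighbour
    nearest = dist.argmin(axis=1)        # index of the one-nearest neighbour

    # ASSUMPTION: base observations are drawn with replacement.
    base = rng.integers(0, n, size=n_synthetic)
    # ASSUMPTION: Beta(0.5, 0.5) stands in for the U-shaped distribution.
    w = rng.beta(0.5, 0.5, size=(n_synthetic, 1))

    # Synthetic point = observation + weight * (neighbour - observation).
    X_syn = X[base] + w * (X[nearest[base]] - X[base])

    # ASSUMPTION: integer and one-hot columns are rounded to the nearest integer.
    cols = list(integer_cols)
    if cols:
        X_syn[:, cols] = np.rint(X_syn[:, cols])
    return X_syn

# Toy portfolio (hypothetical): one continuous and one binary feature.
rng = np.random.default_rng(0)
X_real = np.column_stack([rng.normal(size=50), rng.integers(0, 2, size=50)])
X_syn = extended_smote(X_real, n_synthetic=200, integer_cols=[1], rng=rng)
print(X_syn.shape)  # (200, 2)
```

Because the weight lies in [0, 1], each synthetic point sits on the segment between an observation and its nearest neighbor; a U-shaped weight concentrates the synthetic points near one of the two endpoints rather than near the midpoint.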

Comparison: Poisson and gamma regression
Combining all the outputs ($X^s$, $Y^s_1$, $Y^s_2$) obtained from Section 4.1, the data with telematics features is complete. Any statistical or machine learning algorithm can now be applied to this completed synthetic datafile. To assess how well the synthetic datafile reconstructs the real dataset, one simple approach is to compare the resulting outputs when a Poisson regression model is calibrated on the number of claims (frequency) and a gamma regression model is calibrated on the amount of claims (severity), using the real dataset and the synthetic datafile, respectively. Both models are relatively standard benchmark models in practice. To be more specific, we fitted both Poisson and gamma regression models to the real and synthetic data to predict the claim frequency ($\texttt{NB\_Claim}/\texttt{Duration}$) and the average amount of claims ($\texttt{AMT\_Claim}/\texttt{NB\_Claim} \mid \texttt{NB\_Claim} > 0$). A net premium can be calculated by taking the product of the number of claims and the average amount of claims. The purpose of this exercise is not to evaluate the quality of the models or the relative importance of the feature variables, but rather to compare the resulting outputs between the two datasets. The models are trained on all the feature variables, without variable selection.

Figure 5 shows the average claim frequency for the real telematics data on the left side and the synthetic telematics data on the right side. For simplicity, we only show the behavior of the claim frequency for three feature variables: Annual.pct.drive, Credit.score, and Pct.drive.tue. For both datasets, observed values are colored blue and predicted values are colored orange. As expected, the distributions of the average claim frequency for these feature variables, that is, the patterns of blue and orange, are very similar between the real and the synthetic datasets.
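For reference, the two benchmark models and the resulting premium can be written compactly. The notation below is introduced only for illustration: $N_i$ denotes NB_Claim, $d_i$ denotes Duration, $S_i$ denotes AMT_Claim, and $x_i$ is the vector of feature variables. The log links and the use of duration as an exposure are assumptions of this sketch; the text does not state the link functions that were used.

```latex
\begin{aligned}
N_i &\sim \mathrm{Poisson}(d_i \lambda_i), \qquad \log \lambda_i = x_i^{\top} \beta, \\
\left. \frac{S_i}{N_i} \,\right|\, N_i > 0 &\sim \mathrm{Gamma}\ \text{with mean } \mu_i, \qquad \log \mu_i = x_i^{\top} \gamma, \\
\hat{\pi}_i &= \hat{\lambda}_i \, \hat{\mu}_i .
\end{aligned}
```

Here $\hat{\pi}_i$ is the predicted premium per unit of duration, obtained as the product of the predicted frequency and the predicted average severity.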
As was done for frequency, Figure 6 depicts the average claim severity for the real and the synthetic telematics data. For our purpose, we examine these comparisons based on two feature variables: Yrs.noclaims and Total.miles.driven. Neither of these feature variables seems to produce much variation in the predicted values, which may indicate that they are relatively less important predictors of claim severity. It may also reflect the fact that we do not necessarily have an exceptionally good predictive model here; building one, however, is not the purpose of this exercise.
Still, from Figures 5 and 6, there is some information we can draw. First, the patterns of blue dots are similar between the real and synthetic data for every feature variable considered here. Although we do not include the graphs for the other features, they all show similar dispersion. The features included here are those identified as important variables in the classification model introduced in So et al. (2020). This suggests that the real and synthetic data have similar frequency and feature distributions for all variables, which implies that the synthetic datafile behaves as realistically as the real data; it mimics the real dataset exceptionally well. Second, the patterns of orange dots are also similar between the real and synthetic data. In more detail, the predicted frequency (Figure 5) and severity (Figure 6) from the models fitted on the real data have a dispersion similar to those from the models fitted on the synthetic data. This suggests that results obtained from synthetic data may differ little from results obtained from real data, and that synthetic data can be used in place of real data to train statistical models. These conclusions are further supported by Figure 7, which shows a quantile-quantile (QQ) plot of the predicted pure premium for the real data against the synthetic data. We do, however, observe that the pure premium tends to be overestimated for the synthetic datafile at high quantiles. This may be a result of the randomness introduced throughout the data generation process, and it is not an alarming concern.
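A QQ comparison of this kind can be computed in a few lines. The sketch below is illustrative only: the helper `qq_points` is hypothetical, and the two premium vectors are simulated placeholders, not the predictions reported in this paper.

```python
import numpy as np

def qq_points(pred_real, pred_syn, n_quantiles=99):
    """Return matching quantiles of two samples, ready to be plotted against each other."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return probs, np.quantile(pred_real, probs), np.quantile(pred_syn, probs)

# Placeholder predictions (hypothetical): two samples from the same gamma distribution.
rng = np.random.default_rng(1)
premium_real = rng.gamma(shape=2.0, scale=150.0, size=5000)
premium_syn = rng.gamma(shape=2.0, scale=150.0, size=5000)

probs, q_real, q_syn = qq_points(premium_real, premium_syn)

# Ratios above 1 in the upper tail would indicate higher synthetic premiums there.
ratio_upper = q_syn[-5:] / q_real[-5:]
print(ratio_upper.round(3))
```

Plotting `q_syn` against `q_real` together with the 45-degree line gives the QQ plot; points that drift above the line at the largest quantiles correspond to the overestimation noted above.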

Concluding remarks
Vehicle telematics is perceived to have a positive social effect: it encourages careful driving behavior. Indeed, UBI programs can have many potential benefits for insurers, consumers, and society in general. Insurers are able to set a price that is linked more directly to the habits of insured drivers. As a consequence, this helps insurance companies increase the predictability of their profit margin and gives customers the opportunity for more affordable premiums. Consumers, in turn, may be able to control the level of their premium costs by maintaining safer driving habits or, if at all possible, by reducing the frequency of driving. Furthermore, UBI may benefit society because safer driving and fewer drivers on the road may reduce the frequency of accidents, traffic congestion, and car emissions. To obtain the optimal benefits of UBI for both insurers and their policyholders, it becomes crucial to identify the more significant telematics variables that truly affect the occurrence of car accidents. These perceived benefits motivated us to provide the research community with a synthetic datafile that has the intricacies and characteristics of real data and that may be used to examine, construct, build, and test better predictive models that can immediately be put into practice. For additional details on the benefits of UBI, see Husnjak et al. (2015).
In summary, this paper describes the process used to generate a synthetic datafile of driver telematics that is largely based on, and emulated from, a similar real insurance dataset. The final synthetic dataset has 100,000 policies that include observations about drivers' claims experience, together with associated classical risk variables and telematics-related variables. One primary motivation for producing it is to encourage the research community to develop innovative and more powerful predictive models; this synthetic datafile can be used to train and test such models so that better techniques become available in practice to assess UBI products. As described throughout this paper, the data generation is a three-stage process that uses feedforward neural networks to simulate the number and aggregated amount of claims and then applies an extended SMOTE algorithm to generate the portfolio of feature variables. The resulting data is evaluated by comparing the synthetic data and the real data when Poisson and gamma regression models are fitted to the respective datasets. Data summarization and visualization of the fitted models produce remarkably similar statistics and patterns between the two datasets. We are hopeful that researchers interested in obtaining driver telematics datasets to calibrate statistical models or machine learning algorithms will find the output of this research helpful for their purpose. We encourage the research community to build better predictive models and to test these models with our synthetic datafile.