A Sarmanov Distribution with Beta Marginals: An Application to Motor Insurance Pricing

: Background: The Beta distribution is useful for fitting variables that measure a probability or a relative frequency. Methods: We propose a Sarmanov distribution with Beta marginals specified as generalised linear models. We analyse its theoretical properties and its dependence limits. Results: We use a real motor insurance sample of drivers and analyse the percentage of kilometres driven above the posted speed limit and the percentage of kilometres driven at night, together with some additional covariates. We fit a Beta model for the marginals of the bivariate Sarmanov distribution. Conclusions: We find negative dependence in the high quantiles indicating that excess speed and night-time driving are not uniformly correlated.


Introduction
We analyse a bivariate model based on the Sarmanov distribution with marginal Beta distributions. These marginals are specified based on a generalized linear model (Beta-GLM) or Beta regression as defined by Ferrari and Cribari-Neto [1]. The objective is to fit data defined in the (0, 1) interval.
Many authors have analysed bivariate Beta distributions [see, for example, [2][3][4][5]]. However, these distributions pose several difficult challenges: their generalization to higher dimensions and their specification as a generalized linear model are not straightforward. The Sarmanov distribution provides a way to address these challenges.
Originally, the Sarmanov distribution in its bivariate form was introduced by Sarmanov [6], its multivariate version was suggested by Lee [7] and was generalized by Bairamov et al. [8]. Its use to model the bivariate behaviour of random variables with a marginal Beta(α, β) distribution was proposed by Gupta and Wong [3]. These authors defined the five parameter bivariate Beta distribution from what is known as Morgenstern's distribution [9] with marginal Beta, which is a particular case of the Sarmanov distribution.
The bivariate Sarmanov distribution is characterized by its flexibility in the marginal distributions and, furthermore, given that its functional form establishes that the marginals are clearly separated from the dependency model, the specification in terms of a bivariate generalized linear model turns out to be natural. Generalizing from two dimensions to higher dimensions is simple-(see [10] for an example of a trivariate Sarmanov distribution specified as a generalized linear model with Negative Binomial marginals).
In this work, we show an application of the bivariate Sarmanov distribution with Beta marginals generalised linear model to predict two of the most relevant telematics variables in motor insurance [11]. Telematics variables are obtained from GPS/inertial devices installed in vehicles and they provide an abundant source of information to motor insurers. In our case study, a bivariate model is specified, for the proportion of kilometres driven above the posted speed limit and the proportion of kilometres driven at night. These two variables seem to be related, but researchers have not yet been able to find a good way to understand their association. The explanatory variables are the characteristics of the insured policyholder and the vehicle. The database used in our application has already been analysed in various works published in statistical, transport and risk analysis journals (see [11][12][13][14][15][16][17]). In all previous studies, the two telematics variables that we analyse here were used as predictors of the accident rate, and they were assumed to be uncorrelated.
In Section 2, the new bivariate Sarmanov model is specified and the particular case with marginal Beta-GLM with a domain in the (0, 1) interval is analysed; the estimation method is also discussed. The results of our case study are shown in Section 3. Finally, Section 4 contains the conclusions.

The Models
Let (Y 1 , Y 2 ) be a bivariate random vector that, for convenience, is defined in (0, 1) 2 . Its distribution depends on a set of k quantitative or binary covariates, whose values are represented by the vector x j = x 1j , ..., x kj , j = 1, 2, where x 1j = 1 is a constant term. The relationship between Y j and the covariates is given by the linear predictor x j β j , where β j = β j 1 , ..., β j k , j = 1, 2, are vectors of parameters to be estimated. To simplify the notation, the covariates are assumed to be the same for j = 1 and j = 2, and so the vector of explanatory variables is denoted as x. The bivariate probability density function (pdf) associated with the Sarmanov distribution is: where ω is the dependence parameter and φ j , j = 1, 2, are bounded kernel functions. For the function defined in (1) to be a pdf, the following conditions must hold: and 1 + ωφ 1 (y 1 |xβ 1 )φ 2 (y 2 |xβ 2 ) ≥ 0, ∀ (y 1 , y 2 ) ∈ (0, 1) 2 .
For given values of x β j , j = 1, 2, we define: Taking into account the condition defined in (3), bounds can be defined for the dependency parameter ω. However, as this parameter does not depend on the linear predictor, new extreme values are defined as: , so that the bounds of the dependency parameter are: The previous condition holds for every vector of covariates x, which implies that the dependency parameter must be located within the narrowest bounds. In practice, we will assume that the vectors observed in the sample dataset lead to the entire domain of values of linear predictors x β j , j = 1, 2.
In the insurance context, where we will discuss our illustration, we assume that all possible risk profiles that can be insured by the company are already present in the portfolio.
For each vector of covariate observations x, we can also obtain the covariance between the dependent variables as: where v j (x) = 1 0 y j φ j (y j |xβ j ) f j (y j |x β j ) dy j , j = 1, 2. The correlation is obtained by dividing by the product of standard deviations.
There exist many possible specifications for the kernel functions φ j , j = 1, 2 (see [18] [for a description of kernel functions proposed in the literature). When fitting the bivariate Beta distribution without covariates, Gupta and Wong [3] propose a kernel function such as φ j = 2F j − 1, where F j is the cumulative distribution function (cdf). This specification has the advantage that the bounds for the dependency parameter are given by −1 ≤ ω ≤ 1 for any vector x. However, the previous model does not allow obtaining closed expressions for some magnitudes of interest, such as the conditioned moments. In this work, we propose to use kernels φ j = y r j − E(Y r j ), where r is a value to be determined by the analyst. Next, some results obtained for the particular case of the Sarmanov distribution with marginal Beta(α, β) distribution with r = 1 are analyzed. These cases intuitively correspond to a situation of linear dependency, controlled by the dependence parameter ω.

The Bivariate Beta GLM Model
The pdf of a random variable Y with Beta(α, β) distribution, with parameters α, β > 0, is: and its cdf is: where Γ(·) and B(·, ·) are the Gamma and Beta functions, respectively, and B(y, ·, ·) is the incomplete Beta function. The Beta regression was proposed by Ferrari and Cribari-Neto [1], with the reparametrization µ = α α+β and ψ = α + β, so that: where ψ −1 is the scale parameter. We note that, given the values of µ and ψ, it holds that V(Y) < 0.25. The specification as GLM is defined as (note that we use µ(x) to emphasize that µ depends on the linear predictor): is a link function that can be defined in different ways, in this work, we use the logit link, 1−µ(x) . To simplify the notation from now on, we eliminate the linear predictors in the conditioned part. The pdf associated with the bivariate random vector (Y 1 , Y 2 ) with a Sarmanov distribution and Beta GLM marginals that will be called the Sarmanov-Beta-GLM is (): For the previous expression to be a pdf, the dependency parameter must be located within the bounds defined in (4), which, for the kernel functions that we propose, are: The bivariate cdf associated with a Sarmanov-Beta-GLM is obtained directly from the double integral of the bivariate pdf defined in (6): where y 1 , y 2 ∈ (0, 1).

Proposition 1.
The conditioned pdf is: and similarly for f Y 2 |Y 1 (y 2 |Y 1 = y 1 ). Integrating the previous expression, the conditional cdf is obtained as Proof. The conditioned pdf is obtained directly as Integrating the result of f Y 1 |Y 2 (y 1 |Y 2 = y 2 ) in (9), we obtain: In addition, since then, by substituting the previous expression in (11), then (10) follows directly.
The conditioned quantile is obtained from the inverse of expression (10), for which a numerical method (such as Newton's method) can be used.

Proposition 2.
The conditional expectation is: where is the variance, which also depends on the vector of covariates. Similarly, E (Y 2 |Y 1 = y 1 ) can be found.
Proof. The conditional expectation is obtained directly by solving the integral: Likewise, the corresponding result is obtained for E (Y 2 |Y 1 = y 1 ). Proposition 3. From (5), the conditional covariance which depends on the vector of covariates x is: and the correlation is: Proof. Note that the covariance and the correlation are calculated directly if, in expression (5), we see that: The dependence parameter of the model proposed in Gupta and Wong [3], which uses kernel functions φ j = 2F j − 1, j = 1, 2, is located in the interval −1 ≤ ω ≤ 1 and is the same for all x. Our proposal bounds the dependence parameter to the narrowest interval among those obtained from all x. However, the advantage of our proposal is that our model allows for obtaining closed expressions for some magnitudes of interest such as bivariate moments (covariance) and conditional moments. In the numerical analysis section, we also compare the correlations estimated from our model and that of Gupta and Wong [3].

Estimation
In practice, we start from a bivariate sample of n observations. Let us denote the sample information as (Y i1 , Y i2 ), i = 1, ..., n, where for each i we know the values of the covariates X i = (X i1 , ..., X ik ) . Our objective is to estimate the parameter vectors β j , the scale parameters, ψ j and j = 1, 2, and the dependency parameter ω, from the maximization of the logarithm of the likelihood function associated with the Sarmanov distribution: The maximization of (15) cannot be carried out directly without considering that the parametric space is restricted and, in addition, as it was shown in expression (4), the bounds of the dependence parameter are closely related to the parameters of the marginals. Thus, in the maximization process, infeasible solutions will often be reached unless a careful numerical procedure is specifically designed. One way to address these difficulties is to rely on the IFM (Inference from Margin) method that has been widely used in the estimation of copulas see [19] [for a review]. For the estimation of the Sarmanov distribution, the IFM was already used by Bolancé and Vernic [10] for the case of GLM marginals with Negative Binomial distributions.
The IFM method is implemented as follows: Inicialization. The parameters for the marginals are estimated as: For the initial estimation, function betareg() of betareg R package is used. With these parameters of the marginals, we start the iterative process in the two steps described below.
Step 1. Given the estimated marginal parameters in iteration m − 1 and taking into account the limits of the dependence parameter ω defined in (4), with function optim() and the L-BFGS-B method using R, we estimate ω from the maximization of the likelihood function given fixed values of the marginal parameters, which is: where l ω|12 is the likelihood as a function of ω given the estimated parameters for the marginals in iteration m − 1.
Step 2. Given the estimated dependency parameterω (m) in step 1, the marginal parameters are re-estimated in iteration m as: where l 12|ω is the likelihood as a function of the marginal parameters given the dependence parameter estimated in step 1. The above maximization is also performed with function optim() and the L-BFGS-B method of R. Steps 1 and 2 described above are repeated until reaching the convergence criterion based on the differences between parameter estimates obtained in two consecutive iterations.
In practice, the algorithm described above is based on the optimization of conditional likelihood functions and not on the likelihood function defined in (15). However, in the last stage, the parameters estimated with the IFM method can be used as initial parameters in the process of maximizing the full likelihood function defined in (15). For this purpose, function optim() and method L-BFGS-B of R are used again.

Remark 2.
To estimate the Sarmanov model proposed by Gupta and Wong [3], it is not necessary to use the two-step process, since the bounds of the dependence parameter do not depend on the parameters of the marginal distributions.

Numerical Analysis
We analyse a database corresponding to a car insurance portfolio, in which part of the variables have been measured via a telematic system. The objective of our analysis is to model the joint behaviour of the percentage of kilometres driven above the posted speed limits (Y 1 ) and percentage of kilometres driven at night (Y 2 ). It is well known that both variables are related to the risk of having an accident. In Table 1, we show the main descriptive statistics of the dependent variables and the covariates used in the modelling process. For the estimation of the Sarmanov-Beta-GLM, the dependent variables have been transformed as indicated in Remark 1 in Section 2.2. Furthermore, to avoid very low coefficient values due to the scale of some covariates, variables age (X 1 ), age of driving license (X 2 ) and age of the vehicle (X 5 ) have been divided by 10; the vehicle power variable (X 6 ) is divided by 100 and the total annual distance driven in kilometres (X 7 ) is divided by 1000. In addition, note that, in this study, we have included a variable denoting the driver's gender (X 3 ) and an indicator of private garage (X 4 ) as covariates.
The last row of Table 1 shows the Pearson correlation between the two dependent variables. This correlation is compared with the corresponding parameter estimate obtained from the Sarmanov model with marginal Beta proposed here and with the one proposed by Gupta and Wong [3], from now on the GW model. With this objective, Table 2 shows the dependence parameters estimated with both models, and the AIC and BIC statistics without including the covariates and including them. Using expression (14) and without covariates, from the dependence parametersω = 14.883, it can be deduced that the estimated correlation is 0.0601, which is within the confidence interval of the Pearson correlation as shown in the last row of Table 1. On the contrary, if we use the five parameter Beta distribution, the (residual) correlation that is obtained from the numerical calculation of expression (5) is practically zero. This means that the association is captured by the bivariate model. Comparing both models, with and without covariates, using the AIC and BIC statistics, the results of Table 2 show that the fit is better for the model proposed here than it is for the GW model.  Table 3 shows the results of our Sarmanov-Beta-GLM using different vectors of covariates. Model I includes all the explanatory variables, among which we have the age (X 1 ), the age of the driving license (X 2 ) and the total distance driven annually (X 7 ), these three variables are associated with driving experience. To analyze the robustness of the results, in Model II, age (X 1 ) is eliminated, and, in addition, in Model III, the age of a driver's license (X 2 ) is also eliminated. The results of Model I show that the effect of age is negative on both Y 1 and Y 2 that the effect of the driver's license age is positive on Y 1 and negative on Y 2 and the effect of total distance, X 7 , is positive on both dependent variables. By eliminating age (X 1 ) in Model II, the signs of the parameters associated with X 2 and X 7 are maintained, although the value is smaller in the case of X 2 and remains practically the same for X 7 . After eliminating variables X 1 and X 2 , we see that the effect of the total annual distance driven remains practically the same. If we observe the effects of the rest of covariates, these are practically the same in models I, II, and III. A man driver (X 3 ) with a powerful vehicle (X 6 ) would have larger Y 1 and Y 2 than the rest, all other characteristics being the same. However, using parking at night (X 4 ) has a positive effect on the percentage of speeding distance (Y 1 ) and a negative effect on the percentage of night-time driving (Y 2 ); the opposite happens with the age of the vehicle (X 5 ). The effect of X 5 indicates that, when the vehicle is older, drivers tends to diminish the percent of speed driving, while night-time driving is larger.
To visualize the dependence between Y 1 and Y 2 in different quantiles, the following three examples of insured drivers are graphically analysed:  Profile 1 represents the average insured individual of the portfolio; Profile 2 is a younger man driver, less experienced than Profile 1 and with a newer and less powerful vehicle; finally, Profile 3 is an older man driver, more experienced than Profile 1 and an older and more powerful vehicle. Figure 1 represents different quantiles of the variable kilometres driven above the speed limit (Y 1 ) in the y-axis given the values of the percentage of kilometres driven at night (Y 2 ) for Profile 1 in the x-axis. Quantiles have been obtained from the expression (10). Note that, if the dependence parameter was zero, all the curves would remain constant. The adjusted dependence structure results in the represented conditional quantiles having a negative nonlinear relationship and, furthermore, the curves for the different quantile levels are non-parallel. Figure 1 indicates that, for Profile 1, the higher the percentage of kilometres driven at night (Y 2 ), the greater the caution in driving and, therefore, the lower the percentage of distance driven above the speed limits (Y 1 ). The same quantiles at 75% (plot on the left) and 95% (plot on the right) confidence levels are represented in Figure 2. These plots show that the curves are non-parallel and that Profile 3 is the most risky, followed by Profiles 1 and 2.

Conclusions
We have developed a bivariate model based on the Sarmanov distribution with marginal Beta GLM which has allowed us to model two important variables in modern motor insurance telematics databases. Our model is an alternative to a proposal previously made by Gupta and Wong [3] based on what is known as Morgenstern's distribution, which is a particular case of the Sarmanov distribution. Our proposal allows for obtaining closed expressions for some magnitudes of interest, such as the bivariate cdf and conditioned moments, covariance and correlation, which are fundamental in risk analysis. We have shown that our Sarmanov-Beta-GLM model presents better fits than previous proposals also based on the Sarmanov distribution.
The results of our case study have shown that, for a specific example, although the dependence parameter is positive, which directly implies that, in the mean, the relationship between the conditioned mean and the values of the variable that conditions is positive, the conditional quantiles show that the relationship between the conditioned quantile, and the value of the conditioning variable may be negative for high quantile levels, a result that is consistent with the expected behaviour of drivers.