Frequency and Severity Dependence in the Collective Risk Model: An Approach Based on Sarmanov Distribution

: In actuarial mathematics, the claims of an insurance portfolio are often modeled using the collective risk model, which consists of a random number of claims of independent, identically distributed (i.i.d.) random variables (r.v.s) that represent cost per claim. To facilitate computations, there is a classical assumption of independence between the random number of such random variables (i.e., the claims frequency) and the random variables themselves (i.e., the claim severities). However, recent studies showed that, in practice, this assumption does not always hold, hence, introducing dependence in the collective model becomes a necessity. In this sense, one trend consists of assuming dependence between the number of claims and their average severity. Alternatively, we can consider heterogeneity between the individual cost of claims associated with a given number of claims. Using the Sarmanov distribution, in this paper we aim at introducing dependence between the number of claims and the individual claim severities. As marginal models, we use the Poisson and Negative Binomial (NB) distributions for the number of claims, and the Gamma and Lognormal distributions for the cost of claims. The maximum likelihood estimation of the proposed Sarmanov distribution is discussed. We present a numerical study using a real data set from a Spanish insurance portfolio.


Introduction
The collective risk model is a basic classical actuarial risk model consisting of the sum of a random number of independent, identically distributed (i.i.d.) random variables (r.v.s) that represent costs. To facilitate computations related to this model, there is a classical assumption of independence between the random number of such random variables (i.e., the claim frequency) and the random variables themselves (i.e., the claim severities). However, studies on real data emphasized in several cases the existence of a certain dependence that should be taken into account because it can affect important actuarial quantities like premiums and ruin probabilities. Therefore, alternative approaches incorporate dependence between the number of claims and their average severity, see, for example, Erhardt and Czado [1], Czado et al. [2], Krämer et al. [3], Lee and Shi [4], or Oh et al. [5]. In this paper, we propose the bivariate Sarmanov distribution to analyze the joint behavior of the number of claims and of each one of the individual claim amounts, instead of their average; i.e., we consider heterogeneity

Introducing Sarmanov Dependence
If N denotes the r.v. number of claims from a certain portfolio and X 1 , X 2 , ..., X N the corresponding claim amounts, then the resulting aggregate claims can be represented by the collective model as where S = 0 when N = 0. The usual assumptions under which this model is considered are X 1 , X 2 , ..., X N i.i.d. positive r.v.s, independent of the r.v. N. If, in particular, we assume that X 1 = X 2 = ... = X N =X, whereX is the mean cost per policyholder, we obtain a simpler representation of the collective risk model, for which S = NX. Let X denote the generic r.v. claim amount whose distribution is assumed to be absolutely continuous with probability density function (pdf) denoted by f X , let p denote the probability mass function (pmf) of N and let f S be pdf of S. The cumulative distribution function (cdf) of an r.v. will be denoted by F indexed with the r.v.'s name. It is well-known that for model (1), where f * n is the n-fold convolution of the function f , iteratively defined by In order to relax the independence condition between the number of claims and the claim amounts, we replace the above assumptions with the following ones: Hypothesis 1. Given N = n, the r.v.s X 1 , X 2 , ... are assumed to be i.i.d.

Hypothesis 2.
Assume a Sarmanov type dependence between each X i and N, i.e., (X i , N) i≥1 are identically distributed with the mixed Sarmanov pdf where ψ and φ are bounded non-constant kernel functions, ω ∈ R and f is a pdf; for simplicity, we denote by Y an r.v. having pdf f and representing X > 0. Note that the pdf (2) is mixed because it joins the continuous pdf f and the discrete pmf p. In order for (2) to define a proper pdf, we impose the conditions With L Y denoting the Laplace transform of the r.v. Y, we shall use the exponential kernels: . Then, letting from condition (4), ω is restricted to the following interval Note that under the assumption H2, the distribution of the r.v. X will have both an absolutely continuous component (with pdf f X ) and a probability mass at 0; hence, the distribution of S also has a probability mass at 0 and the pdf f S .
For the mixed Sarmanov pdf (2), it can be easily deduced that: Pr (X = 0 |N = n ) = 1, n = 0 0, n ≥ 1 ; In the following proposition, we present the distribution of S. Its proof is given in the Appendix A.

Proposition 1.
Under the assumptions H1-H2, it holds that To evaluate the pdf of S based on formula (8), we shall need the following result, which gives a formula for the conditional convolutions.

Proposition 2.
Under the assumptions H1-H2, for x > 0 it holds that Next, we present in terms of Y the first two moments of the aggregate claims S and the correlation coefficient between X and N.

Proposition 3.
Under the assumptions H1-H2, the expected value and variance of S are given by while the correlation coefficient between X and N is

Simulation from the Collective Model
To simulate values from the distribution of S under the assumptions H1-H2, we use the inversion method for the conditional cdf of X given N = n, i.e., we use Therefore, we first simulate the value n from the distribution of N. If n = 0 then s = 0; otherwise, we generate n uniform U (0, 1) values (u i ) n i=1 and, by solving the equation F X|N=n (x i ) = u i for x i , we obtain the vector (x i ) n i=1 . Then, s = ∑ n i=1 x i is a value generated from S according to model (1).

Parameters Estimation
be a random sample of the number of claims and individual claim amounts. Let 0 be the 0 vector, let θ and ν be, respectively, the parameters vectors of the marginal distributions of N and Y, while ω is the dependence parameter of the Sarmanov distribution. From (2), the corresponding log-likelihood function is where L (n i ) K i=1 ; θ is the likelihood function corresponding to the marginal r.v. N and L x i x ij > 0, i = 1, ..., K, j = 1, ..., n i ; ν the one corresponding to Y.
Since maximizing the log-likelihood expressed in (10) is very difficult, specifically due to the close relationship that exists between the dependency parameter and the parameters associated with the marginal distributions, we define l (n i , (θ; ν|ω) to be the log-likelihood function corresponding to the marginal parameters given ω, and l (n i , (ω|θ; ν) the log-likelihood function of the dependence parameter given the marginal parameters θ, ν, and we determine the Maximum Likelihood Estimation (MLE) of the parameters in two phases: Phase 1 By MLE, find initial values for the parameters of the marginal distributions. Then, iterate the following two steps: Step 1 (iteration j) Given the parameters for the marginal distributions, findω j within the interval defined in (5) for this dependence parameter by maximizing the log-likelihood Step 2 Givenω j , obtain new values for the parameters of the marginals by maximizing the log-likelihood function l (n i , Repeat steps 1 and 2 until convergence. If the dependence parameter is located at an extreme of the interval, recalculate these intervals using the parameters for marginals obtained in Step 2.

Phase 2
Starting with the initial parameters estimated in Phase 1, perform full MLE.
As a result that our parametric space is bounded, we used the optim() function of R with the method L-BFGS-B to perform the optimizations.
The method proposed in Phase 1 is known as the Inference From Margin (IFM) method, and has been widely used in the estimation of Copulas, see [14] for a review. Regarding the estimation of the Sarmanov distribution, the method in two phases, IFM and full MLE, has already been used with excellent results by Bolancé and Vernic [8] for the case of NB marginals.

Particular Severity Distributions
For our particular dataset, we considered two severity distributions: Lognormal and Gamma.
Under the Gamma severity distribution assumption, Y ∼ Ga (α, β) , α, β > 0, with β the rate parameter, we recall that Furthermore, given that We approach the Lognormal severity distribution in a different way. We recall that if Y follows a Lognormal distribution LN µ, σ 2 , σ > 0, then Z = ln Y follows a normal distribution N µ, σ 2 . Therefore, it is easier to estimate the model having the same counting distribution, but normal severity distribution, using logarithmized claim amounts. So, we first consider the bivariate Sarmanov distribution f Z,N with Z ∼ N µ, σ 2 and exponential kernel φ. However, since the domain of the normal distribution is the entire real line and the kernel function φ must be bounded, we shall work with a left truncated normal distribution with left truncation point a, i.e., Z ∼ LTN µ, σ 2 , a , and we select this truncation point such that Pr(Z <= a) is almost zero; note that a good choice is a = −3σ + µ (another simple choice for a would be the minimum of the log-data). Hence, where Here , while ϕ and Φ denote the pdf and, respectively, cdf of the standard normal distribution. Then, the limits for the interval (5) of ω are Now, to obtain the bivariate Sarmanov distribution with a truncated Lognormal marginal, we change variable Y = e Z in (11) and have where f LTLN(µ,σ 2 ,a) (y) = 1 Kσy ϕ ln y−µ σ is the pdf of the left truncated Lognormal distribution LTLN µ, σ 2 , a . It follows that the bivariate Sarmanov distribution with a Lognormal marginal is given by where the kernel functionφ ( The following proposition gives the needed ingredients to calculate the expected value and variance of S under the simplifying assumption γ = 1.

Particular Counting Distributions
As counting distributions, we consider the Poisson and Negative Binomial distributions. The following result holds.
In the following result, we present formulas needed to evaluate the expected value and variance of S given in Proposition 3, for our particular distributions.
be the exponential kernel.
In the numerical study that we present below, we combine the two discussed counting distributions with the two severity distributions and obtain the following particular compound distributions: compound Poisson-Gamma, compound Negative Binomial-Gamma, compound Poisson-Lognormal, and compound Negative Binomial-Lognormal.

Numerical Study
We analyzed a dataset containing a sample of K = 99,972 Spanish insureds with a total of 8872 claims. We assumed that they have a homogeneous risk profile. For each individual, we obtained information on the number of claims and on the individual cost of each claim notified by each insured. Our aim was to fit the bivariate Sarmanov distribution and to check the effect of dependence between frequency and severity on the risk premium. We compared the results obtained when considering that S = NX, whereX represents the cost per policyholder (calculated as the mean of individual claim amounts, see comment after model (1)), and when considering that S = ∑ N j=1 X j , where X j represents the j-th claim amount of the policyholder or cost per claim; i.e., we compared the results obtained by taking into account the heterogeneity of the claim costs for each insured and by considering that this heterogeneity does not exist.
In Table 1, we display the results of the initial analysis of the number of claims, consisting in the basic descriptives and initially estimated parameters (by MLE) for the marginal distribution associated with this variable. From the values of the Chi-square statistic, we can see that the best fit is obtained with the NB distribution.
In Table 2, we show the basic descriptive statistics for the cost per policyholder and for the cost per claim, together with the MLE parameters of the Gamma and Lognormal distributions for these variables. The main difference between the two variables is in the scale and shape parameters. As expected, the cost per claim r.v. has larger variance and more pronounced right skewness than the cost per policyholder. We compared the goodness of fit Akaike Information Criterion (AIC) for the Gamma and Lognormal distributions, and obtained that the best fit is provided by the Lognormal distribution. The results of the estimated bivariate Sarmanov distributions using the procedure described in Section 2.3 are shown in Table 3. In both cases, i.e., cost per claim and cost per policyholder, the best fit was obtained with the NB-Lognormal model. The correlations ρ calculated for this type of model are consistent with the empirical correlations given in Table 2.
In Table 4, we present the values of the estimated mean and variance of S obtained when using each estimated model presented in Table 3. These values are the ones used to calculate the risk premium. If we compare the results obtained using the estimated dependence parameter ω > 0 with the results obtained by assuming ω = 0 (i.e., independence), we observe that, as we expected, the positive values estimated for the dependence case increase the mean and the variance. Furthermore, the largest values are obtained with the NB-Lognormal Sarmanov distribution.
To see the effect on risk premiums, in Table 5 we calculated the risk premium according to the standard deviation principle, i.e., π R = ES + δ √ VarS, where δ is the loading constant. We display the pure (δ = 0) and risk (δ > 0) premiums, assuming that δ = 1; we also considered the case when N and X are independent (i.e., ω = 0), and when N and X are Sarmanov distributed for both estimated NB-Lognormal models, with cost per claim and cost per policyholder. We can see that, by using the NB-Lognormal distribution, the risk premiums are larger for the cost per policyholder model. Specifically, in the dependent case, using the individual claim cost information allows the company to reduce its risk premium with approximately 55 Euros. Table 3.

Conclusions
In this paper, we have shown how the flexibility of the Sarmanov model allows us to introduce dependence between different types of variables, discrete and continuous. We have used the Sarmanov distribution to model the bivariate distribution joining the number of claims and the individual claim amounts, which is needed to estimate, e.g., the moments of the aggregate claims in the collective risk model considering dependence between its variables. We numerically compared the results obtained using the cost per claim and also using the cost per policyholder; in both cases, the NB-Lognormal Sarmanov model proved to provide the best fit to our real dataset.
We have also analyzed the differences between the expectations and the variances of the aggregate claims in the collective models obtained using alternative estimated distributions. We note that, as expected, these values are larger for the models in which a Lognormal marginal distribution is considered for the cost.
Author Contributions: Both authors contributed equally to this work in terms of conceptualization, methodology, validation, writing; software, C.B. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
We thank Fundación BBVA, Equipos de Investigación Científica en Big Data 2018 for support. Also, we are very grateful to the three referees for their nice reviews and valuable comments that helped to improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: Appendix A Proof of Proposition 1. Since S = 0 when N = 0, we clearly obtain that Pr (S = 0) = p (0). Then, for x > 0, we have Differentiating with respect to x easily yields the formula of f S .
Proof of Proposition 2. Let x > 0. We prove the result by induction: when m = 1, based on the definition of f * 0 , we have Assuming that the result holds for m ≥ 1 and taking m + 1, we obtain f * (m+1) where D is the integration domain. Changing the index in the last sum, we have Inserting this into the above formula of f * (m+1) X|N=n and rearranging gives f * (m+1) which immediately yields the result. This completes the proof.
Proof of Proposition 3. We prove the expected value and variance formulas in the usual way. Using (7), we have which easily yields the formula of ES. In what concerns the variance, we use We shall need the following formulas. From (7), we have On the other hand, and after inserting all these into (A1), we obtain the variance formula. From formula (6) it is easy to check that On the other hand, we have that which, together with the above formula of VarX, immediately yields the stated formula of corr (X, N). This completes the proof.
Proof of Proposition 4. We note that yielding the first two formulas. The other two formulas result from and the result is immediate.
Proof of Lemma 1. The formulas (12) and (13) can be obtained directly. Formula (14) easily results from which is also the case with formula (15).