A Copula-Based Bivariate Composite Model for Modelling Claim Costs

: This paper aims to develop a new family of bivariate distributions for modelling different types of claims and their associated costs jointly in a flexible manner. The proposed bivariate distributions can be viewed as a continuous copula distribution paired with two marginals based on composite distributions. For expository purposes, the details of one of the proposed bivarite composite distributions is provided. The dependence measures for the resulting bivariate copula-based composite distribution are studied, and its fitting is compared with other bivariate composite distributions and existing bivariate distributions. The parameters of the proposed bivariate composite model are estimated via the inference functions for margins (IFM) method. The suitability of the proposed bivariate distribution is examined using two real-world insurance datasets, namely the motor third-party liability (MTPL) insurance dataset and Danish fire insurance dataset


Introduction
During the last few decades, there has been a notable surge in actuarial research focused on modelling the costs of different types of claims in non-life insurance, employing a diverse range of claim severity approaches.This is due to the peculiar characteristics of the claim severity distribution, which pose several challenges.The distribution often spans several magnitudes, encompassing small and moderate claim sizes with high frequency, as well as a few major ones with low frequency.Additionally, claim size data are unimodal and heavily skewed to the right (see, for instance, Ref. [1]).As can be clearly understood, when the data span a wide range of magnitudes, selecting a probability distribution that can efficiently fit small, moderate, and large claims becomes crucial for insurance pricing, reserving, and risk management.The method of composing distributions (see, for example, Refs.[2][3][4][5][6]) provides a reasonable fit for such datasets.
Regarding the most recent studies on composite models, which are the main research focus of this work, it is worth noting that Ref. [6] proposed different composite models by considering Burr, Loglogistic, Paralogistic, and Generalized Pareto distributions for the tail of the data and truncated densities before and after the threshold point.Ref. [7], instead of creating single composite model, considered 256 composite models derived from sixteen parametric distributions frequently used in actuarial science.Ref. [8] placed special emphasis on modelling extreme claims using a variety of composite models and threshold selection techniques, including heuristic methods, the Minimum AMSE of the Hill estimator, the exponentiality test, and the Gertensgarbe plot.A new composite distribution known as the composite Rayleigh-Pareto (CRP) distribution was presented by [9].This article also discussed parameter estimation methods, including the method of moment and maximum likelihood estimation method.Ref. [10] introduced a novel single-parameter composite length-biased exponential Pareto (CLBEP) distribution to model insurance losses.The distributional properties and other statistical characteristics of the CLBEP distribution are mathematically determined.Authors provided a comparison study of the new CLBEP with other composite and conventional distributions.
Ref. [11] introduced a mixture composite claim severity regression model extending the setup of Ref. [12], which used a finite mixture distribution for the body and a Paretotype distribution for the tail of the distribution.This extension incorporated explanatory variables on all three parts of the claim size distribution: clustering probabilities, body part, and tail part.It should be noted that, even though the literature concerning composite models in the univariate setting is abundant, their bivariate extensions have not been investigated so far.However, in non-life insurance, it is common for the actuary to observe the existence of dependence structures between different types of claims and their associated costs, either from the same type of coverage or from multiple types of coverage, such as motor and home insurance bundled into one single policy.
Examining interdependent hazards is a pivotal undertaking for insurance companies.The presence of dependence is manifested in the ability to predict the distribution of one random variable based on the knowledge of another.Particularly in the realm of insurance, it is imperative for companies to delve into the interconnections between various lines of business and assess the repercussions of a catastrophic event, such as an earthquake or hurricane, impacting multiple lines simultaneously.Copulas emerge as a prominent model-based approach for scrutinizing dependencies in risks, and the tail dependence coefficient serves as a metric for measuring dependence concerning extreme losses.The rising popularity of utilizing copulas to compute joint distributions for two or more random variables, depicting extreme losses, and capturing tail dependence is evident in the fields of actuarial science and finance Refs.( [13,14]).The copula technique is distinctive for partitioning the joint distribution into two components: individual marginal distributions for each random variable and a copula function that amalgamates the marginals to form a joint distribution explaining the dependence structure.Though direct modeling with multivariate distributions is generally preferred for joint risks, complexity in specification and limited applications can pose challenges.Copulas provide an alternative, allowing a focus on effective independent marginal risk modeling before uniting them using a copula framework.This approach gains prominence due to the wealth of literature available for univariate modeling, making it a practical choice in various scenarios.
Copula functions offer an efficient method to quantify dependence between hazards by fully specifying the dependence structure among random variables.Understanding tail dependence across diverse business lines is vital for insurance and reinsurance.Unlike multivariate distributions, copulas simplify summarizing tail dependence with diverse dependency structures achievable through various parameter settings.Archimedean copulas like Gumbel and Joe focus on modeling right tail extremes, Clayton on left tail extremes, and symmetric copulas like Frank, Gaussian, and t copulas distribute equal weight to both tails.Considering this fact, in the present study, we introduce a Gumbel copula-based bivariate distribution having composite marginal distributions for modelling heavy righttailed claim severity jointly from multiple types of claims, which are often encountered in high-dimensional non-life insurance datasets.The motivation behind the development of the bivariate composite models mainly lies in the following factors: 1. Insurance portfolios often involve multiple risks that may be correlated.A bivariate distribution allows for modeling the joint behavior of two variables, capturing the dependence structure between them.This is crucial for accurately assessing the overall risk in a portfolio.2. In insurance, the occurrence of extreme events is of particular interest.Bivariate composite distributions can better capture tail dependence, which is essential for estimating the joint probability of simultaneous extreme events.
Bivariate composite distributions are being developed in order to describe insurance heavy-tailed datasets.This is because more realistic and accurate representations of joint risk structures in complicated insurance portfolios are required.In particular, we make the following contributions: 1.The bivariate distributions are constructed using a suitable copula distribution consisting of two marginal composite distributions.2. A general framework for constructing the copula-based bivariate composite distribution is discussed.For expository purposes, a specific bivariate composite distribution is presented.3.This distribution aims to describe the behavior of bivariate data with a predominance of small and medium values but includes a few extreme values.4. Dependence measures associated with the proposed bivariate composite distribution based on the copula distribution are derived.5.The parameters of the distributions are estimated using the IFM method, which consists of estimating univariate parameters by separately maximizing the marginal composite distribution and then estimating the dependence parameters from the bivariate likelihoods derived based on the copula.
The rest of the paper is structured as follows.In Section 2, we derive alternative marginal composite distributions based on the Classical Composition (CC) technique.Section 3 presents the construction of the bivariate composite distributions based on the copula distribution.Parameter estimation of the bivariate composite distribution via the IFM method is discussed.The computational aspects of fitting the bivariate composite distributions are discussed in Section 4. Section 5 demonstrates a data generation algorithm and results based on a simulation experiment.In Section 6, we describe the MTPL and Danish fire loss datasets used for our empirical analysis and comparison of the bivariate composite distributions with existing bivarite distributions.Finally, concluding remarks can be found in Section 7.

The Univariate Composite Distribution Obtained through the Traditional Composition Method
Ref. [1] proposed various composite distributions using an unrestricted mixing weight (r), the right-truncated and left-truncated densities truncated at thresholds (θ) for the head and tail distributions, respectively.The resulting probability density function (pdf) of the composite distribution can be written as where Ξ 1 and Ξ 2 are the parameter spaces for the head and tail parts of the composite distribution, and the mixing weight r ∈ [0, 1] and θ > 0. The functions are the adequate truncations of the pdfs f 1 and f 2 up to and after an unknown threshold value θ, respectively.

•
The value of the weight parameter r is obtained by the continuity condition imposed at the threshold θ, i.e., r f * 1 (θ|Ξ • Further, imposing the differentiability condition at the threshold value θ, i.e., r f , makes the density smooth. These above conditions reduce the number of parameters and make the resulting density continuous and differentiable.We henceforth refer to this technique as the Classical Composition (CC) technique.
The dependence of Y i among the claim types is modelled using the copula, with the joint distribution of Y i given by where i , y i ) are the realizations of the Y = (Y (1) , Y (2) ), C ϕ is a copula function, and ϕ := {ϕ (j,j ′ ) } j,j ′ =1,2 is the copula parameter that explains the association between the two random variables, say (Y (1) , Y (2) ).

Modelling Dependence via the Gumbel Copula
Many insurance datasets exhibit strong correlation at high values of claim amounts but less correlation at low values of claim amounts.Hence, to jointly model the two types of claims having high tail dependence, the Gumbel copula will be an appropriate choice to model such a dataset (see Ref. [17]).The dependency between two types of claims y i can be modelled through the Gumbel copula as where ϕ := {ϕ (j,j ′ ) } j,j ′ =1,2 ∈ [1, ∞) is the asymmetric copula parameter influencing the correlations among (Y (1) , Y (2) ).F j (y j ) is the cdf of a univariate r.v.Y j for j = 1, 2. We present two dependence measures associated with the Gumbel copula as: • Kendall's tau: Kendall's tau, denoted by τ, is a bivariate measure of dependence for continuous variables that measures the amount of concordance present in a bivariate distribution.Kendall's tau for the Gumbel copula in terms of the copula parameter ϕ can be written as: • Tail dependence property: The amount of dependency in the upper or lower quadrant tail of a bivariate distribution is referred to as bivariate tail dependence.The expression for the upper tail dependence parameter λ U for the Gumbel copula is given by:

The Bivariate Composite H-Inverse Weibull Distribution
Let Y = (Y (1) , Y (2) ) and y i = (y i , y i ), respectively, be the claims vector and its corresponding realizations of the two types of claims.Suppose that Y (j) i , j = 1, 2 follows the composite H-Inverse Weibull (IW) distribution, where H stands for the head part distribution of the bivariate composite distribution with pdf given by The cdf of composite H-Inverse Weibull distribution may be written as where i = 1, 2, • • • , n and j = 1, 2. Ξ (j) and r H,IW ∈ [0, 1] are the parameter space and mixing weight for the head part of the j th marginal composite H-Inverse Weibull distribution, α (j) > 0, threshold point associated with j th marginal composite distribution θ (j) > 0, and scale parameter γ (j) > 0. F H (y (j) ) and F H (θ (j) ) are the cdf of the H (head) distribution at y (j) and the threshold point θ (j) , respectively.f H (.) is the density of the head part of the marginal composite H-Inverse Weibull distribution.f * H (y (j) ; Ξ (j) ) = f H (y (j) ;Ξ (j) ) is the adequate right-truncated density of the head part of the marginal composite H-Inverse Weibull distribution truncated at the threshold point θ (j) .The expression for the mixing weight r H,IW can be obtained using (2).The dependency among the two types of claims, say (y i ), can be studied using the Gumbel copula as follows: , (7) where y (1) i and y (2) i are the realizations for the two types of claims Y (1) and Y (2) .F 1 y (1) i and F 2 y (2) i are the cdfs of composite H-Inverse Weibull models evaluated at y (1) i and y (2) i , respectively.ϕ is the asymmetric parameter controlling the correlations among (Y (1) , Y (2) ).We use three different parametric distributions for the H (head) part of the bivariate composite models, namely the Weibull distribution, the Paralogistic distribution, and the inverse Burr distribution.In this paper, for expository purposes, the development of the bivariate composite Weibull-Inverse Weibull (W-IW) distribution is presented.
i be a random variable (r.v.) obtained by considering the Weibull distribution for the head and the Inverse Weibull distribution for the tail part of the marginal composite model.The pdf of the r.v.Y (j) i can be written as For i = 1, 2, • • • , n and j = 1, 2, where µ (j) > 0, σ (j) > 0, the scale parameter γ (j) > 0, α (j) > 0, threshold point θ (j) > 0, and r is the mixing weight of the composite model.The analytical expression for the mixing weight r (j) W,IW can be easily obtained using (2).
The cdf of the composite W-IW distribution is The joint cdf of the bivariate composite W-IW distribution can be obtained by coupling two marginal W-IW cdfs using the Gumbel copula, as shown in (4).Similarly, the joint cdf of the bivariate composite Paralogistic-Inverse Weibull (P-IW) distribution and the bivariate composite Inverse Burr-Inverse Weibull (IB-IW) distribution can also be obtained by coupling two marginal P-IW and IB-IW cdfs using the Gumbel copula given in (4).

Parameter Estimation via the IFM Method
The objective of this section is to explain how to use the maximum likelihood (ML) method to estimate the parameters related to both the marginals and the copula.Ref. [15] discussed a method known as the IFM for parameter estimation of the copula density, which is dependent on the knowledge of the marginals.This method involves two steps; first, the parameters of the marginal cdfs are estimated, and then the copula parameters are obtained by maximizing the likelihood function of the copula, with the marginal parameters replaced by estimators obtained in the first step.The steps involved in the IFM method for the estimation of the marginal model parameters as well as the copula parameter are given below.

•
Step 1: Let Y n for j = 1, 2 be a random sample of two types of claims from the marginal composite H-Inverse Weibull distribution described in (5).Here, Θ is the parameter vector for the two marginal composite H-Inverse Weibull models.We utilize the ML estimation procedure to estimate the marginal distribution parameters.The goal of the ML estimation procedure is to find the parameter values that maximize: For j = 1, 2, the non-overlapping density parts between the body and tail of a composite distribution make parameter estimation typically straightforward, allowing one to factor out the likelihood function and independently estimate the three components of the distribution-mixing weight, body, and tail.All parameters of the two marginal composite H-Inverse Weibull distributions are estimated using the numerical optimization tool optim(), included in the stats package of the R programming language.

•
Step 2: The estimated parameters of the marginal df's are used to estimate the Gumbel copula parameter and to compute the value of the likelihood function associated with the dependence structure.To ease the process of finding the estimate of the copula parameter, we use the rvinecopula package of the R software 4.1.2by passing the argument gumbel as the family.

Simulation Study
In this section, we present a simulation study for the proposed bivariate composite W-IW distribution using the Gumbel copula.Initially, we detail the generation of random samples from the bivariate composite W-IW distribution.The process involves employing the conditional procedure for random sample generation, as outlined in [16].We consider Y 1 , Y 2 be the random sample from bivariate composite W-IW distribution derived using Gumbel cupula C(u 1 , u 2 ) (see ( 4)).The conditional distribution of the (U, V) is . Utilizing this conditional distribution method, we gener- ate the random numbers (y 1 , y 2 ) from the proposed bivariate composite W-IW distribution.The detailed algorithm to generate the random numbers from the bivariate composite W-IW distribution is as follows: Step Random Numbers from Bivariate Composite W-IW Distribution 1.
Generate two independent samples, say u 1 and s from U(0, 1).

Numerical Illustration 6.1. Dataset 1: Greece MTPL Dataset
In this section, we illustrate the proposed methodology using motor third-party liability (MTPL) insurance policies with non-zero property claims from the years 2012 to 2019.A major insurance firm in Greece generously provided the dataset for this study.The dataset contains 7263 motor vehicle insurance policies collected over the period from 2012 to 2019, all of which have complete records.The following section provides a detailed description of the MTPL dataset.The variables associated with the MTPL dataset are tcost bi and tcost pd, representing the cost of bodily injury claims and the cost of property damage claims, respectively.They are numeric vectors showing the total amount of bodily injury claims and property damage claims.Graphical representations of both types of claims, namely bodily injury claims and property damage claims, are presented in Figures 3-5.Both types of claims exhibit various peculiarities of insurance data, including positive skewness, unimodality, and tail heaviness.Figure 6, which displays the scatter plot of the MTPL dataset in natural logarithm scales, illustrates that the dependency is not linear.These data reveal extreme value dependence, i.e., the heavier the costs, the stronger the dependence.Additionally, it can be seen from Figure 6 that for small values of both claims, the dependence is almost zero, but the shape suggests positive dependence for larger costs of both claims.Clearly, this denotes a change in the joint distribution between the smaller and the larger claim costs.Here, both claims are known to be strongly correlated with high values but less correlated at low values; hence, the Gumbel copula will be an appropriate choice.

Dataset 2: Danish Fire Insurance Dataset
The provided dataset describes fire insurance claims in Denmark, gathered from the Copenhagen Reinsurance Company for the period between 1980 and 1990.The dataset is available on the following website: https://search.r-project.org/CRAN/refmans/fitdistrplus/html/danish.html (accessed on 18 January 2024).It consists of three main components: loss to buildings, loss to contents, and loss to profit.However, our specific focus in this case revolves around modeling the interdependence between the first two components.The dataset comprises a total of 1501 observations.Our analysis specifi-cally concentrates on cases where both losses have non-null values.In Figures 7-9, graphical representations of these claim types are provided.Notably, both loss to buildings and loss to contents exhibit several notable features commonly observed in insurance data, including a positively skewed distribution, a single mode (unimodality), and heavy tails indicating the presence of extreme values.The visual representation in Figure 10, which showcases a scatter plot of the Danish fire loss dataset using natural logarithmic scales, provides valuable insights into the relationship between these claims.It becomes apparent that the dependency between these claim types is not linear.Instead, the data suggest a non-linear relationship, and they reveal the existence of extreme value dependence.In other words, as claim costs increase, the strength of their interdependence becomes more pronounced.Moreover, a closer examination of Figure 10 reveals a fascinating pattern; for smaller values of both loss to buildings and loss to contents, the interdependence is nearly negligible, almost approaching zero.However, as losses grow larger, the figure clearly indicates a positive dependence emerging.This shift in the joint distribution, from near-zero dependence for smaller losses to a positive relationship for larger losses, represents a significant and noteworthy change in the nature of the joint distribution between these two categories of losses.

Model Fitting
The fitting of the bivariate composite distributions, namely bivariate composite Weibull-Inverse Weibull (W-IW), the bivariate composite Paralogistic-Inverse Weibull (P-IW) distribution, and the bivariate composite Inverse Burr-Inverse Weibull (IB-IW) distribution, as well as other existing bivariate distributions, such as bivariate Lomax (BL), bivariate Mardia's Pareto Type I (BMPI), and bivariate Burr (BB) model, are examined based on the MTPl dataset in Section 6.1 and Danish fire loss dataset in Section 6.2.In this section, we present the results based on two model selection criteria for the proposed distribution.Tables 2 and 3 provides the values of negative log likelihood (NLL), Akaike's information criterion (AIC), and Bayesian information criterion (BIC) for the MTPL and Danish fire loss datasets, respectively.The formulas involved in the computation of the above-mentioned model selection criteria, namely AIC and BIC, are: where L( Θ) is the maximum log-likelihood and Θ is the vector of the estimated model parameters.
where n is the sample size of the dataset and df is the number of fitted parameters of the distribution.Note that for NLL, AIC, and BIC, smaller values indicate a better fit of the distribution to the empirical data.Tables 2 and 3 show that the bivariate composite W-IW distribution performs better than the remaining bivariate composite distributions and other existing bivariate distributions for the MTPL dataset and bivariate composite IB-IW distribution performs better than the remaining bivariate composite distributions and other existing bivariate distributions for the Danish fire loss dataset.Tables 4 and 5 present the parameter estimates (marginal model parameters and Gumbel copula parameter) of the bivariate composite distributions as well as existing bivariate distributions for the MTPL dataset and Danish fire loss dataset, respectively.

Analysis of Dependence
To examine the goodness of fit of the fitted distribution in terms of dependence modelling, we provide the values of Kendall's tau (τ), the upper-tail dependence parameter λ U for the bivariate composite distributions generated using the Gumbel copula.In the empirical examination, the calculated τ stands at 0.099 for the MTPL dataset and 0.085 for the Danish fire loss dataset, indicating the observed ordinal correlation in each respective dataset.In case of the MTPL dataset, Table 6 indicates that the Kendall's tau for the W-IW bivariate composite distributions matches the empirical τ reasonably well, implying that the suggested Gumbel copula model can properly reflect the dependence structure of both types of claims in the MTPL dataset.In case of Danish fire loss dataset, the fitted values of the Kendall's tau for the proposed bivariate composite distributions are far away from the empirical value of the Kendall's tau.
Table 6 also presents the values of the upper-tail dependence parameter (λ U ) for both the datasets.The empirical values of the upper-tail dependence for the MTPL dataset and Danish fire loss dataset are 0.133 and 0.198, respectively.From Table 6, it is observed that for the proposed distributions, upper-tail dependence parameter λ U ∈ (0, 1], indicating that bivariate composite distributions exhibit upper-tail dependence for both the datasets.Table 6 shows the empirical upper-tail dependence for the MTPL dataset coincides with the upper-tail dependence of the proposed bivariate composite W-IW distribution, indicating the adequate fit of the bivariate composite W-IW distribution to the extremely large values of the MTPL dataset.In case of Danish fire loss dataset, the fitted value of Kendall's tau based on proposed bivariate composite distributions lies far away from its empirical counterpart, but the value of fitted upper-tail dependence matches the empirical value of upper-tail dependence.It suggests a situation where the proposed bivariate distributions captures certain aspects of dependence (specifically, tail dependence) but fails to accurately represent the overall Kendall's tau.

Conclusions
In this paper, we presented a novel family of bivariate composite distributions for simultaneously modelling small and/or moderate and large costs from different types of claims.The detailed methodology involved in the development of bivariate composite distribution is presented.A Gumbel copula is utilized to specify the correlation structure between two types of random variables.For expository purposes, the genesis of bivariate composite Weibull-Inverse Weibull (W-IW) distribution via Gumbel copula is exhibited.Application of the proposed new class of distribution is illustrated using the MTPL bodily injury and property damage dataset and Danish fire loss dataset.The numerical results obtained show reasonably good results for the bivariate composite distributions as compared to other existing bivariate distributions.This suggests that the proposed distributions offers a competitive and potentially more accurate representation of the joint distribution of insurance claim costs.A possible limitation of the proposed study is the choice of the copula.The choice of the Gumbel copula may impact the results, and its appropriateness depends on the underlying dependence structure.Sensitivity analyses or comparisons with alternative copulas could provide insights into the robustness of the proposed distributions.A potentially fruitful line of further research is to consider the influence of individual and coverage-specific covariates on the parameter(s) of the proposed class of distribution by introducing regression components into the composite marginal models.

Figure 1 .
Figure 1.Box plot of bias of the parameter estimates of bivariate composite W-IW distribution with ϕ = 1.5 obtained for different sample sizes n = 50, 100, 200, 500.

Figure 5 .
Figure 5. Histogram of bodily injury claims and property damage claims for the MTPL dataset.

Figure 6 .
Figure 6.Scatter plot of bodily injury claims vs. property damage claims for the MTPL dataset.

Figure 7 .Figure 8 .
Figure 7. Histogram of loss to buildings for the Danish dataset.

Figure 9 .
Figure 9. Histogram of loss to buildings and loss to contents for the Danish dataset.

Figure 10 .
Figure 10.Scatter plot of loss to buildings vs. loss to contents for the Danish dataset.

Table 1 .
Average Absolute Bias (AAB) and Root Mean Square Error (RMSE) of parameter estimates for bivariate composite W-IW distribution.

Table 2 .
Values of NLL, AIC, and BIC for the MTPL dataset for competing distributions.

Table 3 .
Values of NLL, AIC, and BIC for the Danish dataset for competing distributions.

Table 4 .
Parameter estimates of competing distributions for the MTPL dataset.

Table 5 .
Parameter estimates of competing distributions for the Danish fire loss dataset.

Table 6 .
Dependence measures of the bivariate composite distributions for the MTPL dataset and Danish fire loss data set dataset.