Concordance Probability for Insurance Pricing Models

The concordance probability, also called the C-index, is a popular measure of the discriminatory ability of a predictive model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity models typically used during the technical pricing of a non-life insurance product. For the frequency model, the requirement of two distinct groups is addressed by defining three new types of the concordance probability. In addition, these adapted definitions deal with the concept of exposure, which is the duration of a policy or insurance contract. Frequency data typically have a large sample size, and we therefore present two fast and accurate estimation procedures for big data. Their good performance is illustrated on two real-life datasets. Using these examples, we also estimate the concordance probability developed for severity models.


Introduction
One of the main tasks of an insurer is to determine the expected number of claims that will be received for a certain line of business and how much the average claim will cost. The former is typically predicted using a frequency model, whereas the latter is obtained by a severity model. The multiplication of these expected values then yields the technical premium (for more information, we refer to Frees (2009); Ohlsson and Johansson (2010)). Alternatively, one can also model the frequency and severity jointly (Shi et al. 2015). Predictive analytics is a key tool to develop both frequency and severity models in a data-driven way. Note that insurers also use a variety of predictive analytic tools in many other applications such as underwriting, marketing, fraud detection and claims reserving (Frees et al. 2014, 2016; Wuthrich and Buser 2020). The main goal of predictive analytics is typically to capture the predictive ability of the model of interest. Important aspects of the predictive ability of a model are its calibration and its discriminatory ability. Calibration expresses how close the predictions are to the actual outcome, while discrimination quantifies how well the predictions separate the higher-risk observations from the lower-risk observations (Steyerberg et al. 2010). Even though both calibration and discrimination are of utmost importance when constructing predictive models in general, discrimination is arguably slightly more important in the context of non-life insurance pricing. The technical premium should first and foremost capture the difference in risk that is present in the portfolio, which is exactly what discriminatory measures capture. The concordance probability is the most popular and widely used measure to gauge the discriminatory ability of a predictive model.
In case we have a discrete response variable Y, the concordance probability equals the probability that a randomly selected subject with outcome Y = 0 has a lower predicted probability than a randomly selected subject with outcome Y = 1 (Pencina and D'Agostino 2004). Here, π(X) equals P(Y = 1|X), with X corresponding to the vector of predictors. In other words, the concordance probability C can be formulated as:

C = P(π(X_i) < π(X_j) | Y_i = 0, Y_j = 1).   (1)

Furthermore, in a discrete setting and in the absence of ties in the predictions, this concordance probability equals the Area Under the ROC Curve (AUC) (Reddy and Aggarwal 2015). This ROC curve is the Receiver Operating Characteristic curve, suggested by Bamber (1975). It represents the true positive rate against the false positive rate at several threshold settings. The AUC is a popular performance measure to check the discriminatory ability of a binary classifier, as can be seen in the work of Liu et al. (2008) for example.
Even if definition (1) looks very promising to assess the discriminatory ability of frequency models, it assumes that the outcome variable is a binary rather than a count random variable. Moreover, since the policy runtime or exposure of an insurance contract typically is included as an offset variable in the frequency model, definition (1) needs to be extended to accommodate the presence of such an offset variable.
When dealing with a continuous outcome Y, this basic definition is typically adapted as:

C = P(sgn(π(X_i) − π(X_j)) = sgn(Y_i − Y_j) | Y_i ≠ Y_j).   (2)

We say that the pairs (π(X_i), Y_i) and (π(X_j), Y_j) are concordant when sgn(π(X_i) − π(X_j)) = sgn(Y_i − Y_j). Hence, the probability that a randomly selected comparable pair of observations with their predictions is a concordant pair is another way of formulating the definition of the concordance probability. Note that definition (2) is a very popular measure in the field of survival analysis, where the continuous outcome corresponds to the time-to-event variable (Legrand 2021). For the severity model, it can be argued whether it is important to discriminate claims for which the observed cost hardly differs, hence an extension of definition (2) will be considered. Since the estimation of any definition of the concordance probability is time-consuming for larger datasets, we will also consider time-efficient and accurate estimation procedures.
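As a brute-force illustration of this continuous-outcome definition, the sign condition can be checked over all pairs. This is a minimal sketch (the function name and toy data are ours), not the time-efficient estimator discussed later:

```python
from itertools import combinations

def concordance_continuous(y, pred):
    """Brute-force estimate of definition (2): the fraction of comparable
    pairs (y_i != y_j and pred_i != pred_j) for which
    sgn(pred_i - pred_j) == sgn(y_i - y_j)."""
    conc = comp = 0
    for (yi, pi), (yj, pj) in combinations(zip(y, pred), 2):
        if yi == yj or pi == pj:
            continue  # not a comparable pair
        comp += 1
        conc += (pi - pj) * (yi - yj) > 0
    return conc / comp

# toy example: one of the six pairs is discordant, so C = 5/6
c = concordance_continuous([1, 2, 3, 4], [10, 30, 20, 40])
```

The quadratic number of pairs in this naive computation is exactly why faster estimation procedures are needed for larger datasets.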
In this paper, we will focus on the concordance probability applied to frequency and severity models used to construct a technical premium P for an insurance contract. This technical premium typically corresponds to the product of the expected frequency of occurrence of the event (E(Y_N)) times the expected cost of the event (E(Y_S)). Note that these expectations are often conditional on some variables, such that the technical premium corresponds to

P = E(Y_N | X_N) × E(Y_S | X_S),

with X_N and X_S the sets of variables that are used to model each random variable. From here on, (Y_N, X_N) will be referred to as the frequency data and (Y_S, X_S) as the severity data.
First, we introduce in Section 2 the real datasets that will be used throughout this article, together with the frequency and severity models based on them. Section 3 covers the required changes of the general concordance probabilities (1) and (2), such that they can be applied in an insurance context. Next, we develop several algorithms that calculate these new definitions in an accurate and time-efficient way. These algorithms will be introduced in Section 4, where they are immediately applied to the introduced models. Finally, the conclusion is given in Section 5.

Datasets and Models
In this section, we first introduce some real datasets. Next, we explain the frequency and severity models using these datasets.

Datasets
The datasets explained in this section are all obtained from the pricing games of the French Institute of Actuaries, a game that can be played by both students and practitioners. First, we discuss the dataset of the 2015 pricing game and next we consider those of the 2016 pricing game. Both datasets are publicly available in the R-package CASdatasets and contain data on which both a frequency and severity model can be applied.

2015 Pricing Game
The pg15training dataset was used for the 2015 pricing game of the French Institute of Actuaries, organized on 5 November 2015, and contains 100,021 third-party liability (TPL) policies for private motor insurance. Each observation pertains to a different policy, and a set of variables has been collected on the policyholder and the insured vehicle.
For reasons of confidentiality, most categorical levels have an unknown meaning. This dataset can be used for the frequency and severity model, and the selected and renamed variables are explained in Appendix A. The two most important ones are claimNumb and claimCharge, which will be the dependent variables of the frequency and severity analysis respectively. The variable claimNumb shows the number of third-party bodily injury claims. For policies for which more than two claims were filed during the considered exposure, the value was set to 2. This adaptation is needed for the measures that are presented in Section 3. The variable claimCharge represents the total cost of third-party bodily injury claims, in euro. Finally, exposure will be used as an offset variable during the analysis of the frequency data. It is the percentage of a full policy year, corresponding to the run time of the respective policy. Note that 72.58% of the observations have an exposure equal to one.
2016 Pricing Game
pg16trainpol and pg16trainclaim are two datasets that were used for the same pricing game of the French Institute of Actuaries one year later, in 2016. Both of them can be found in the R-package CASdatasets. The first dataset contains 87,226 policies for private motor insurance and can be used for the frequency model. The pg16trainclaim dataset contains 4568 claims of those 87,226 TPL policies and, combined with the pg16trainpol dataset, the severity model can be constructed. Policies are guaranteed for all kinds of material damage, but not bodily injuries.
Once again, most categorical levels have an unknown meaning for reasons of confidentiality. The selected and renamed variables of the pg16trainpol and pg16trainclaim datasets are explained in Appendix A. The two most important ones are claimNumb and claimCharge, which will be the dependent variables of the frequency and severity analysis respectively. The variable claimNumb shows the number of claims. For policies for which more than two claims were filed during the considered exposure, the value was once again set to 2. This adaptation is needed for the measures that are presented in Section 3. The variable claimCharge represents the claim size. Moreover, exposure will be used as an offset variable during the analysis of the frequency data. It is the percentage of a full policy year, corresponding to the run time of the respective policy. In this dataset, 14.16% of the observations have an exposure equal to one.
Note that, to construct the severity model, we only selected the 3969 observations with a strictly positive claim amount. Finally, we could merge the pg16trainclaim and pg16trainpol datasets based on their policy number, begin date, end date and license number.

Models
In this subsection, we construct the frequency and severity models based on the aforementioned datasets. It is important to note that the interest of this paper is not really in the construction of the models, but in the calculation of the concordance probability of the models once the predictions are available. For both models, we first split the required dataset into a training and a test set. The training set is obtained by selecting 60% of the observations of the entire dataset; the remaining 40% of the observations represent the test set.

Frequency
In order to obtain predictions of the frequency model, we consider a basic Poisson model where the variable claimNumb is the response variable. The exposure is used as an offset variable, and all other variables of the training set, apart from claimCharge, are considered as predictor variables. Applying the frequency model on the test set of the 2015 (2016) pricing game, we obtain 40,008 (34,890) pairs of observations and their corresponding predictions. However, the goal of this paper is to calculate the concordance probability of these frequency models for big datasets. Therefore, we will also consider a bootstrap of these pairs of observations and predictions, resulting in 1,000,000 pairs for each dataset.

Severity
In order to obtain predictions for the severity, we consider a gamma model where the ratio of claimCharge over claimNumb is the response variable, and the weights are equal to the variable claimNumb. This is a popular approach for severity models, as explained in Appendix B, based on the book of Denuit et al. (2007). All other variables of the training set, apart from exposure and claimNumb, are considered as predictors. Applying the severity model on the test set of the 2015 (2016) pricing game, we obtain 1837 (1588) pairs of observations and their corresponding predictions. However, the goal of this paper is to calculate the concordance probability of these severity models for big datasets. Therefore, we will also consider a bootstrap of these pairs of observations and predictions, resulting in 1,000,000 pairs for each dataset.

Concordance Probability in an Insurance Setting
In this section, the general definitions (1) and (2) of the concordance probability will be modified for use with frequency and severity models.

Frequency Models
The general definition of the concordance probability will in this section be modified into a concordance probability that can be used for frequency models. The basic definition (1) requires the definition of two groups, based on the number of events that occurred during the duration of the policy. However, non-life insurance contracts typically have an exposure of maximum one year. Hence, it is unlikely that more than two events will take place during this (short) period. Therefore, three groups will be defined: policies that experienced zero events, one event, and two events or more, respectively represented by the 0-, 1- and 2-group. These groups result in the following three definitions of the concordance probability for frequency models:

C_{0,1+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 1),
C_{0,2+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 2),   (3)
C_{1,2+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 1, Y_{N,j} ≥ 2),

where π_N(·) refers to the predicted frequency of the frequency model and Y_N to the observed claim number. The set of definitions (3) has several interesting interpretations. First of all, C_{0,1+} (C_{0,2+}) evaluates the ability of the model to discriminate policies that did not encounter accidents from policies that encountered at least one (two) accident(s). Furthermore, C_{1,2+} quantifies the ability of the model to discriminate policies that encountered one accident from policies that encountered multiple accidents. In other words, C_{1,2+} quantifies the ability of the model to discriminate clients that could just have been unfortunate from clients that are (probably) accident-prone. However, these concordance probabilities do not take the concept of exposure into account. The exposure is the duration of a policy or insurance contract, and it plays a pivotal role in frequency models. In order to make sure that a pair is comparable, the definition of the concordance probability needs to be extended to deal with the concept of exposure as well. As such, two main possibilities can be imagined which ensure comparability of a given pair.
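To make the three definitions of (3) concrete, the following sketch estimates them by brute force on toy data (the function name and data are illustrative; exposure is ignored here, as in the set of definitions (3)):

```python
from itertools import product

def concordance_freq(y, pred, low, high):
    """C_{low,high+}: probability that a policy with claim count equal to
    `low` gets a lower prediction than a policy with count >= `high`
    (prediction ties are excluded from the comparable pairs)."""
    g_low  = [p for yv, p in zip(y, pred) if yv == low]
    g_high = [p for yv, p in zip(y, pred) if yv >= high]
    conc = comp = 0
    for pl, ph in product(g_low, g_high):
        if pl != ph:
            comp += 1
            conc += pl < ph
    return conc / comp

y    = [0, 0, 1, 1, 2]                  # capped claim counts
pred = [0.05, 0.10, 0.20, 0.08, 0.30]   # predicted frequencies
c01 = concordance_freq(y, pred, 0, 1)   # C_{0,1+}
c12 = concordance_freq(y, pred, 1, 2)   # C_{1,2+}
```

On this toy portfolio, C_{0,1+} is 5/6 because exactly one of the six comparable pairs is discordant.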
For the first possibility, the member of the pair that experienced the most accidents needs to have an exposure that is equal to or lower than the exposure of the other member of the pair. These pairs are sort of comparable, since the member of the pair that experienced the most accidents did not have a longer policy duration than the member of the pair that experienced the fewest accidents. The set of definitions (3) can then be altered as:

C_{0,1+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 1, λ_j ≤ λ_i),
C_{0,2+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 2, λ_j ≤ λ_i),   (4)
C_{1,2+} = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 1, Y_{N,j} ≥ 2, λ_j ≤ λ_i),

where λ_i corresponds to the exposure of observation i. However, the above set of definitions (4) runs into trouble for pairs with a considerable difference in exposure. In order to understand why this is the case, we need to have a look at the structure of the predictions of a Poisson regression model, which corresponds for observation i to π_N(X_i) = λ_i exp(βX_i). This reveals that the prediction is mainly determined by the exposure λ_i and the linear predictor βX_i. Therefore, when the predictions of a Poisson regression model of a pair of observations are compared, two possibilities can occur when the pair is comparable according to the above set of definitions (4). One member of the pair can have a higher prediction than the other member due to a difference in risk, as expressed by the linear predictor and as is desirable, or due to a mere difference in exposure, which would obscure the analysis. A possible solution would be to set the exposure values of all observations equal to 1 when making predictions, such that one only focuses on the difference in risk between the different observations. However, this is undesirable, as we would like to evaluate the predictions of the Poisson model that are used to compute the expected cost of the insurance policy, and for this the exposure is a key ingredient. In other words, the set of definitions (4) is of little practical use within the domain of insurance and will no longer be considered.
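The exposure effect described above is easy to see numerically. In this small sketch (the function name and numbers are ours), two policyholders share the same linear predictor, so any difference in their Poisson predictions comes from exposure alone:

```python
import math

def poisson_pred(beta, x, exposure):
    """Prediction of a Poisson model with log link and exposure offset:
    pi_N(X_i) = lambda_i * exp(beta * X_i)."""
    return exposure * math.exp(beta * x)

# identical risk profile, different exposures: the full-year policy gets a
# prediction four times larger, purely because of the exposure ratio
full_year = poisson_pred(0.5, 1.0, 1.0)
quarter   = poisson_pred(0.5, 1.0, 0.25)
```

Comparing these two predictions would say nothing about the difference in risk, which is exactly why definitions (4) are of little practical use.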
For the second possibility, the exposures λ of both members of the pair need to be more or less the same, in order to ensure their comparability. Incorporated in the set of definitions (3), we get:

C≈_{0,1+}(γ) = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 1, |λ_i − λ_j| ≤ γ),
C≈_{0,2+}(γ) = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 2, |λ_i − λ_j| ≤ γ),   (5)
C≈_{1,2+}(γ) = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 1, Y_{N,j} ≥ 2, |λ_i − λ_j| ≤ γ).

Here, γ is a tuning parameter representing the maximal difference in exposure between both members of a pair that is considered to be negligible.
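A brute-force sketch of this exposure-matched version restricts the comparable pairs to those whose exposures differ by at most γ (the function name and data are illustrative):

```python
from itertools import product

def concordance_freq_exposure(y, pred, expo, low, high, gamma):
    """Exposure-matched concordance: a pair is comparable only when the two
    exposures differ by at most gamma and the predictions are not tied."""
    g_low  = [(p, e) for yv, p, e in zip(y, pred, expo) if yv == low]
    g_high = [(p, e) for yv, p, e in zip(y, pred, expo) if yv >= high]
    conc = comp = 0
    for (pl, el), (ph, eh) in product(g_low, g_high):
        if abs(el - eh) <= gamma and pl != ph:
            comp += 1
            conc += pl < ph
    return conc / comp

# with gamma = 0.05 only the two equal-exposure pairs remain comparable
c = concordance_freq_exposure(
    [0, 0, 1, 1], [0.1, 0.2, 0.3, 0.15], [1.0, 0.5, 1.0, 0.5],
    low=0, high=1, gamma=0.05)
```

Note how pairs with very different exposures simply drop out of the computation instead of obscuring the comparison.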
All former definitions are global measures, meaning that the concordance probability is computed over all observations of the dataset, with comparability as the sole exclusion criterion for a given pair. The following definitions give a local concordance probability, by taking a subset of the complete dataset based on the exposure:

C≈_{0,1+}(λ, γ) = P(π_N(X_i) < π_N(X_j) | Y_{N,i} = 0, Y_{N,j} ≥ 1, λ_j = λ, |λ_i − λ_j| ≤ γ),   (6)

and analogously for C≈_{0,2+}(λ, γ) and C≈_{1,2+}(λ, γ). In the above set of definitions, λ is the parameter corresponding to the exposure value for which the local concordance probability needs to be computed. In practice, C≈_{.,.+}(γ) ≈ C≈_{.,.+}(1, γ), because the main mass of the data is located at a full exposure. The appealing aspect of this set of definitions is that it allows the construction of a (λ, C(λ, γ)) plot, i.e., an evolution of the local concordance probabilities as a function of the exposure. However, the disadvantage of this plot is that one has to choose the values of λ and γ. Assume one takes γ equal to 0.05 and λ ∈ {0.05, 0.15, . . . , 0.95}. In this case, observations with for example exposure 0.49 and 0.51 will not be comparable, although their exposures are very close to each other. To eliminate this issue, we first define two groups:
• O-group: the group with the largest number of elements, hence the group with the smallest number of events;
• 1-group: the group with the smallest number of elements, hence the group containing the largest number of events.
When we consider for example C≈_{1,2+}(λ, γ), the O-group consists of the elements with Y_N = 1 and the 1-group of the elements with Y_N ≥ 2. Next, we apply the following steps to construct a better (λ, C(λ, γ)) plot:
1. Determine the pairs of observations and predictions belonging to the O-group and the ones belonging to the 1-group.
2. Loop over the ñ_1 unique exposures λ_i within the 1-group:
• Select the elements in the 1-group with exposure λ_i.
• Select the elements in the O-group with an exposure that differs at most γ from λ_i.
• Compute C(λ_i, γ), the concordance probability on these two subsets.
• Define m_i, the number of comparable pairs used to calculate C(λ_i, γ).
3. The global concordance probability C(γ) can be rewritten as:

C(γ) = Σ_{i=1}^{ñ_1} w_i C(λ_i, γ),  with  w_i = m_i / Σ_{j=1}^{ñ_1} m_j,   (7)

where n equals the number of observations, n_0 (n_1) the number of observations in the O-group (1-group), and ñ_1 the number of unique exposures in the 1-group.
Since the loop iterates over all unique exposures in the 1-group, which is the smallest one, the x-axis can have a rather rough grid. Therefore, one can also easily adapt the previous steps by looping over the unique exposures in the O-group, resulting in a plot with a possibly finer grid on the x-axis. In Figures 1 and 2, both the rough and the fine version of the (λ, C≈_{0,1+}(λ, γ)) plot are constructed for the test sets of the 2015 and 2016 pricing game respectively. We choose γ to be 0.05, which is approximately equal to the length of one month. For the test set of the 2015 (2016) pricing game, the maximal weight w_i is 0.96 (0.32) for the observations with exposure 1.
However, the plots are hard to interpret, since there are large differences depending on which group is iterated over. Especially in Figure 2, we see that for example C(0.08, 0.05) is much larger when iterating over the O-group (fine grid) than when iterating over the 1-group (rough grid). For the fine grid version, we use the elements of the O-group with exposure equal to 0.08, together with the elements of the 1-group with an exposure between 0.08 and 0.13. This subset leads to a high value for C(0.08, 0.05), meaning that the selected elements of the 1-group in general have a higher prediction than the ones of the O-group. However, for the rough grid version, we use the elements of the O-group with an exposure between 0.08 and 0.13, together with the elements of the 1-group with an exposure equal to 0.08. This is yet another subset, and this time we often see higher predictions for the elements in the O-group, leading to a small value for C(0.08, 0.05). Considering different subgroups leads to a difficult interpretation of these plots. However, it is important to know that both versions of this local plot lead to the same global concordance probability, based on equality (7). A solution to the lack of interpretability of both local plots (fine and rough grid) is to consider a weighted mean of them, with the weights based on the number of comparable pairs. This weighted-mean-plot is constructed for both datasets and can be seen in Figure 3. For the interpretability, it is important to see that the weighted-mean-plot is equivalent to applying the following two steps:
1. For every observation i, construct C(λ_i, γ), with λ_i the exposure of the considered element.
2. For every considered exposure λ_i, determine the weighted mean of C(λ_i, γ), where the weights are based on the total number of comparable pairs.
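The per-exposure loop and the weighted recombination of equality (7) can be sketched as follows; the helper names are ours, and a direct pairwise computation is included to check that both routes give the same global value:

```python
from collections import defaultdict
from itertools import product

def concordance_global(o_group, one_group, gamma):
    """Direct estimate over all comparable pairs; each group is a list of
    (prediction, exposure) tuples."""
    conc = comp = 0
    for (p0, e0), (p1, e1) in product(o_group, one_group):
        if abs(e0 - e1) <= gamma and p0 != p1:
            comp += 1
            conc += p0 < p1
    return conc / comp

def concordance_by_exposure(o_group, one_group, gamma):
    """Rough grid: loop over the unique exposures of the 1-group, compute a
    local C(lam_i, gamma) together with the number of comparable pairs m_i,
    then recombine with weights w_i = m_i / sum_j m_j, as in (7)."""
    by_expo = defaultdict(list)
    for p, e in one_group:
        by_expo[e].append(p)
    local, m = {}, {}
    for lam, preds1 in by_expo.items():
        preds0 = [p for p, e in o_group if abs(e - lam) <= gamma]
        conc = comp = 0
        for p0, p1 in product(preds0, preds1):
            if p0 != p1:
                comp += 1
                conc += p0 < p1
        if comp:
            local[lam], m[lam] = conc / comp, comp
    total = sum(m.values())
    return sum(m[lam] / total * local[lam] for lam in local)

o_grp   = [(0.10, 1.0), (0.20, 0.5), (0.25, 0.9)]
one_grp = [(0.30, 1.0), (0.15, 0.5), (0.40, 0.9)]
```

Looping over the unique exposures of the O-group instead yields the fine grid version; by equality (7), both recombinations return the same global value as the direct computation.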

Severity Models
The general definition (2) of the concordance probability will in this section be modified into a concordance probability that can be used for severity models. Since it might be of little practical importance to distinguish claims from one another that only slightly differ in claim cost, the basic definition can be extended to a version introduced by Van Oirbeek et al. (2021):

C(ν) = P(sgn(π_S(X_i) − π_S(X_j)) = sgn(Y_{S,i} − Y_{S,j}) | |Y_{S,i} − Y_{S,j}| ≥ ν),   (8)

where ν ≥ 0. Furthermore, π_S(·) refers to the predicted claim size of the severity model and Y_S to the observed claim size. In other words, the claims that are to be considered are those of which the claim sizes differ by at least a value ν. Hereby, pairs of claims that make more sense from a business point of view are selected. Also, a (ν, C(ν)) plot can be constructed where different values for the threshold ν are chosen, so as to investigate the influence of ν on (8). Interestingly, C(0) corresponds to a global version of the concordance probability (as expressed by definition (2)), while any value of ν > 0 results in a more local version of the concordance probability. Focusing on the datasets introduced in Section 2, we determine the value of ν such that x% of the pairwise absolute differences of the observed values is smaller than ν, with x ∈ {0, 20, 40}. Note that ν equal to zero is not a popular choice in business, since practitioners are not interested in comparing claims that are nearly identical. The size of the considered test sets still allows considering all possible pairs between the observations in order to determine the absolute differences between observations belonging to the same pair. However, this is no longer the case for the bootstrapped versions, since this would result in 499,999,500,000 pairs and corresponding differences. Since the observations are all sampled from the original test sets, we know that the number of unique values is much lower than 1,000,000.
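A brute-force sketch of definition (8) (the function name and claim amounts are ours): only pairs whose observed claim sizes differ by at least ν are retained, and ν = 0 recovers the global version of definition (2):

```python
from itertools import combinations

def concordance_severity(y, pred, nu):
    """C(nu): concordance over pairs of claims whose observed sizes differ
    by at least nu; ties in outcomes or predictions are excluded."""
    conc = comp = 0
    for (yi, pi), (yj, pj) in combinations(zip(y, pred), 2):
        if abs(yi - yj) < nu or yi == yj or pi == pj:
            continue
        comp += 1
        conc += (pi - pj) * (yi - yj) > 0
    return conc / comp

claims = [100, 200, 500, 1000]   # observed claim sizes
preds  = [150, 120, 400, 900]    # predicted claim sizes
```

Raising ν here removes the single discordant pair (the two smallest, nearly equal claims), which is exactly the business motivation for a positive threshold.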
Hence, we can use the technique discussed in Van Oirbeek et al. (2021), resulting in a fast calculation of the values of ν represented in Table 1. As can be seen, the difference between the values for ν determined on the original test set and on the bootstrapped dataset is very small. Therefore, we will from here on only focus on the bootstrapped versions of the test sets.

Time-Efficient Computation
For a sample of size n, the general concordance probability is typically estimated as:

Ĉ = π̂_c / (π̂_c + π̂_d) = n_c / n_t = Σ_{i,j} I(π(x_i) < π(x_j), Y_i = 0, Y_j = 1) / Σ_{i,j} I(π(x_i) ≠ π(x_j), Y_i = 0, Y_j = 1),   (9)

corresponding to the ratio of the number of concordant pairs n_c over the total number of comparable pairs n_t. The value π̂_c (π̂_d) refers to the estimated probability that a comparable pair is concordant (discordant) respectively, and I(·) to the indicator function. Note that the extra condition π(x_i) ≠ π(x_j) is added to the denominator to ensure that no ties in the predictions are taken into account (Yan and Greene 2008).
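For the binary case, the estimator above amounts to counting concordant pairs among the pairs without prediction ties. A minimal sketch (the function name is ours) that returns the two counts n_c and n_t explicitly:

```python
from itertools import product

def concordance_counts(y, pred):
    """Return (n_c, n_t): the concordant and comparable pair counts for a
    binary outcome, excluding pairs with tied predictions as in (9)."""
    p0 = [p for yv, p in zip(y, pred) if yv == 0]
    p1 = [p for yv, p in zip(y, pred) if yv == 1]
    n_c = sum(a < b for a, b in product(p0, p1))
    n_t = sum(a != b for a, b in product(p0, p1))
    return n_c, n_t

# one of the four pairs is a prediction tie and is dropped from n_t
n_c, n_t = concordance_counts([0, 0, 1, 1], [0.2, 0.4, 0.4, 0.9])
```

The estimate is then simply n_c / n_t; note how the tied pair lowers the denominator rather than being counted as discordant.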
Since this estimation method is not feasible for large datasets, Van Oirbeek et al. (2021) introduced several algorithms to approximate the concordance probability in an accurate and time-efficient way. We also refer to that article for detailed information and an extensive simulation study. However, new algorithms need to be developed for the frequency setting to approximate the concordance probability while dealing with the exposure, and this will be the subject of Section 4.1. For completeness, we apply the original algorithms of Van Oirbeek et al. (2021) to the severity models in Section 4.2.
In this section, the approximations will be applied to the concordance probability for the models discussed in Section 2.2. More specifically, we will use the bootstrap version such that we have 1,000,000 pairs of observations and predictions to consider.

Frequency
The goal of this section is to approximate the concordance probability C ≈ 0,1+ (0.05), as defined in (5), in a fast and accurate way. This will be done for the frequency models of Section 2.2, using the 1,000,000 bootstrapped pairs of observations and predictions. Note that the same reasoning can be used for the other concordance probabilities defined in (5).
Before we can determine the bias of the concordance probability estimates, we need to know the exact value. This can be determined by first splitting the considered dataset into the O-group and the 1-group, as defined in Section 3.1. For the rough grid approach, we iterate over the elements of the 1-group. In each iteration, we count the number of predictions in the O-group that are smaller than the prediction of the considered element of the 1-group. Summing up all these counts, divided by the number of considered pairs, results in the exact concordance probability. Contrarily, for the fine grid approach we iterate over the elements of the O-group. In each iteration, we count the number of predictions in the 1-group that are larger than the prediction of the considered element of the O-group.
Summing up all these counts, divided by the number of considered pairs, results in the exact concordance probability.
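The counting step described above can be done in O(n log n) with sorting and binary search rather than by comparing all pairs. This sketch uses our own helper and omits the exposure matching for brevity; tied predictions are dropped from the comparable pairs, as required by (9):

```python
from bisect import bisect_left, bisect_right

def concordance_sorted(preds_o, preds_1):
    """Exact concordance between two groups of predictions: for each
    1-group prediction, count the strictly smaller O-group predictions via
    binary search; tied predictions are removed from the comparable pairs."""
    preds_o = sorted(preds_o)
    conc = ties = 0
    for p in preds_1:
        lo = bisect_left(preds_o, p)   # strictly smaller O-group predictions
        conc += lo
        ties += bisect_right(preds_o, p) - lo
    comparable = len(preds_o) * len(preds_1) - ties
    return conc / comparable

c = concordance_sorted([0.1, 0.2, 0.4], [0.3, 0.8, 0.2])
```

Sorting once and counting with `bisect` avoids the quadratic pass over all pairs, which is what makes the exact computation feasible at all on the bootstrapped datasets.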
In Table 2, one can see the timings that were necessary to calculate the exact value of C≈_{0,1+}(0.05), which is 0.6670 (0.5905) for the bootstrap version of the 2015 (2016) pricing game test set. The same was done for C≈_{0,1+}(0.10), and hence, we can compare both to see the effect of the parameter γ on the run times. We cannot draw a precise conclusion on the effect of γ on the exact value of the concordance probability, since the exact value of C≈_{0,1+}(0.10) equals 0.6658 (0.5925) for the bootstrap version of the 2015 (2016) pricing game test set. However, we clearly see larger run times when γ is 0.10. This can be explained by the fact that a larger value for γ implies that more pairs are allowed to be compared. Moreover, the run times for the dataset of the 2015 pricing game are clearly larger than the ones for the dataset of 2016. This can be explained by the fact that 73% of the 2015 dataset are observations with an exposure equal to 1. Hence, these observations belong to many comparable pairs. For comparison, only 14% of the observations of the 2016 pricing game dataset have an exposure equal to 1. This is confirmed by Table 3, which shows the number of comparable pairs. From this table, one can also see that the numbers of comparable pairs for the rough and fine grid approach are equal to each other. This was expected, since both approaches result in the exact same global concordance probability. A final note on Table 2 is that it also contains the time to construct the weighted-mean-plot for C≈_{0,1+}(γ). Since this plot is constructed as the weighted mean of the fine and the rough grid plot, the time to construct it equals the time to construct both the fine and rough grid plot.

Marginal Approximation
A first approximation of C≈_{0,1+}(0.05) is based on the marginal approximation of Van Oirbeek et al. (2021). More specifically, focusing for example on the fine grid version, we approximate each local concordance probability C≈_{0,1+}(λ_i, 0.05) by its marginal approximation Ĉ≈_{M,0,1+}(λ_i, 0.05), such that an approximation of the global concordance probability is obtained by Ĉ≈_{M,0,1+}(0.05) = Σ_i w_i Ĉ≈_{M,0,1+}(λ_i, 0.05), with w_i representing the same weights as used in (7). A similar reasoning can be used to obtain a marginal approximation for the rough grid approach. Hence, combining both as explained in Section 3.1 results in the weighted-mean-plot approach.
Such a marginal approximation Ĉ≈_{M,0,1+}(0.05) takes advantage of the fact that the bivariate distribution of the predictions for the considered elements of the O-group and the 1-group, F_{π_O,π_1}(π_O, π_1), is equal to the product of F_{π_O}(π_O) and F_{π_1}(π_1). Hence, when a grid with the same q boundary values τ = (τ_0 ≡ −∞, τ_1, . . . , τ_q, τ_{q+1} ≡ +∞) for the marginal distribution of both groups is placed on top of the latter bivariate distribution, the probability that a pair belongs to any of the delineated regions only depends on the marginal distributions F_{π_O}(π_O) and F_{π_1}(π_1). Important to note is that Van Oirbeek et al. (2021) took the same q boundary values for each group. These boundary values were a set of evenly spaced quantiles of the empirical distribution of the predictions of both the O-group and the 1-group jointly. An extension of this idea is to allow different boundary values for each group. Hence, the boundary values of the O-group (1-group) equal the quantiles of the empirical distribution of its own predictions. This way of working allows considering the distribution of each group separately, but the disadvantage is that it will increase the run time. The reason for this increase is that it will be more difficult to determine which regions of the grid contain concordant pairs, as can be seen in Figure 4. Therefore, we will compare the original and the extended marginal approximation of the concordance probability C≈_{0,1+}(0.05) for the frequency models of Section 2.2, using the 1,000,000 bootstrapped pairs of observations and predictions.

Table 4 shows the results of the original marginal approximation, hence using the same boundary values for the considered O- and 1-group when calculating Ĉ≈_{M,0,1+}(0.05). The bias clearly decreases for a higher number of boundary values, but, of course, this coincides with a larger run time. Remarkably, the bias and run time for the marginal approximation of C≈_{0,1+}(0.05) on the bootstrap of the predictions and observations of the 2016 pricing game dataset are lower than the ones on the 2015 pricing game dataset. A final conclusion on the run times is that, compared to the results in Table 2, the original marginal approximation reduces the run time by at least 50%. Table 5 shows the results of the extended marginal approximation (weighted-mean-plot approach), hence allowing different boundary values for each group. In Appendix C, we see similar results in Tables A1 and A2 for the fine and rough grid approach respectively.
A first conclusion is that when each group has the same number of boundary values, the biases are higher than the ones of the original marginal method. Figure 4 reveals a possible cause, since we clearly see an increase in the number of regions containing incomparable pairs for the extended approach. As a result, the concordance probability is based on fewer comparable pairs, which is confirmed in Table 6. In this situation, we also notice that the run times for the extended marginal approach are comparable with the ones for the original marginal approach, as long as the number of boundary values is smaller than 5000. For a larger number of boundary values, the extended marginal approximation has a higher run time than the original one. In general, we may conclude from Tables 5, A1 and A2 that the bias decreases for a higher number of boundaries, which coincides with a higher run time. Finally, we also construct an approximation of the weighted-mean-plot for C≈_{0,1+}(λ, 0.05) based on the original and extended marginal approximation, respectively shown in Figures A1 and A2, while using the number of boundary values that resulted in the lowest bias (in case of multiple scenarios, the one with the lowest run time). Comparing these plots with the original ones shown in Figure 3, we see that both the original and the extended marginal approximation give a weighted-mean-plot that is almost the same as the exact one. Based on these plots, the bias and the run time, we have a slight preference for the original marginal approximation, where we use the same boundary values for the O-group and the 1-group.
Table 5. Bias and run time (s), the latter between brackets, for the extended marginal approximation of C≈_{0,1+}(0.05) on the 2015 and 2016 pricing game dataset. This is given for the weighted-mean-plot approach and for several different numbers of boundary values for the O- and 1-group.
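The grid idea behind the marginal approximation can be sketched as follows: place the same quantile boundaries on both marginals, count predictions per cell, and classify pairs by cell order. This is a simplified illustration under our own assumptions (same-cell pairs are simply dropped, which is one source of bias), not the exact algorithm of Van Oirbeek et al. (2021):

```python
def marginal_approx(preds_o, preds_1, q):
    """Approximate the concordance probability from q quantile cells shared
    by both groups: a pair counts as concordant when the O-group cell lies
    strictly below the 1-group cell; same-cell pairs are dropped."""
    pooled = sorted(preds_o + preds_1)
    bounds = [pooled[len(pooled) * k // q] for k in range(1, q)]
    def cell(p):
        return sum(p >= b for b in bounds)
    n0 = [0] * q
    n1 = [0] * q
    for p in preds_o:
        n0[cell(p)] += 1
    for p in preds_1:
        n1[cell(p)] += 1
    conc = sum(n0[a] * n1[b] for a in range(q) for b in range(q) if a < b)
    comp = sum(n0[a] * n1[b] for a in range(q) for b in range(q) if a != b)
    return conc / comp

# two overlapping groups; the exact brute-force value here is about 0.923
approx = marginal_approx(list(range(50)), [v + 30 for v in range(50)], 10)
```

With more boundary values the cells shrink, fewer pairs are dropped, and the bias decreases, matching the pattern reported in Table 4.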

k-Means Approximation
Another approximation of $C^{\approx}_{0,1+}(0.05)$ is based on the k-means approximation for discrete variables. More specifically, focusing for example on the fine grid version, we approximate each local concordance probability $C^{\approx}_{0,1+}(\lambda_i, 0.05)$ by its k-means approximation, with $\lambda_i$ representing the unique exposures of the 0-group. These local approximations are denoted by $\hat{C}^{\approx}_{kM,0,1+}(\lambda_i, 0.05)$, such that a first approximation of the global concordance probability $C^{\approx}_{0,1+}(0.05)$ is obtained by $\hat{C}^{\approx}_{kM,0,1+}(0.05) = \sum_i w_i \hat{C}^{\approx}_{kM,0,1+}(\lambda_i, 0.05)$, with $w_i$ representing the same weights as used in (7). A similar reasoning can be used to obtain a k-means approximation for the rough grid version. Combining both as explained in Section 3.1 results in the weighted-mean-plot approach.
Such a k-means approximation $\hat{C}^{\approx}_{kM,0,1+}(0.05)$ applies a k-means clustering algorithm to the predictions within each group. Once the clustering algorithms have been applied, only the cluster centroids are used to determine $\hat{C}^{\approx}_{kM,0,1+}(0.05)$. Hence, a more precise estimate is obtained as k increases. It is important to note that the original proposal took the same number of clusters for each group; we extend this idea by allowing a different number of clusters for each group. The results of this extended approximation can be found in Table 7 for the weighted-mean-plot approach. In Appendix D, Tables A3 and A4 show the results for the fine and rough grid approach respectively. A first conclusion regarding the bias is that it is very low for all considered numbers of clusters, since a maximum bias of 0.14% was observed over all considered scenarios. This is clearly lower than the comparable bias of the original marginal approximation. However, due to the randomness and the very small values, we do not always see a lower bias for a higher number of clusters. The run time, however, clearly increases with the number of clusters. Moreover, these run times are much higher than those of the original marginal approximation; sometimes, they are even higher than the run times needed to calculate the concordance probability exactly. Despite the rather high run times, the weighted-mean-plots are very close to the exact ones, as can be seen in Figures 7 and A3, the latter in Appendix D. A final approximation of $C^{\approx}_{0,1+}(0.05)$ is denoted by $\hat{C}^{\approx}_{ep,kM,0,1+}(0.05)$ and is constructed to have an approximation based on the k-means approximation for discrete variables, without the high run times of $\hat{C}^{\approx}_{kM,0,1+}(0.05)$. These high run times were the result of applying two k-means clustering algorithms for each considered exposure $\lambda_i$.
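The centroid-based estimate can be sketched as follows (a minimal Python illustration under our own assumptions, not the article's implementation; we hand-roll a one-dimensional Lloyd's algorithm to keep the sketch self-contained, where a library routine could equally be used). Each group's predictions are clustered, and the concordance probability is then computed on the weighted centroids only:

```python
import numpy as np

def kmeans_1d(x, k, iters=50):
    # plain Lloyd's algorithm for one-dimensional data, quantile-initialised
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        lab = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(lab == j):
                centers[j] = x[lab == j].mean()
    weights = np.bincount(lab, minlength=k) / len(x)
    return centers, weights

def kmeans_concordance(pred0, pred1, k=50):
    # cluster the predictions within each group, then compare only centroids
    c0, w0 = kmeans_1d(pred0, k)
    c1, w1 = kmeans_1d(pred1, k)
    W = w1[:, None] * w0[None, :]                   # joint weight of a centroid pair
    conc = np.sum((c1[:, None] > c0[None, :]) * W)  # 1-group centroid higher
    disc = np.sum((c1[:, None] < c0[None, :]) * W)  # 1-group centroid lower
    return conc / (conc + disc)
```

Since only k centroids per group remain, the pairwise comparison step is cheap; the cost has moved to the clustering itself, which matches the run-time behaviour reported above.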
To determine this new approximation $\hat{C}^{\approx}_{ep,kM,0,1+}(0.05)$, a k-means clustering algorithm is only applied twice within each group: first on the exposures and afterwards on the predictions. Hence, only four k-means clustering algorithms are applied in total. Finally, $\hat{C}^{\approx}_{ep,kM,0,1+}(0.05)$ is obtained by applying Equation (7) to the cluster centroids instead of to the exact exposures and predictions. The results of this third approximation can be found in Table 8 for the weighted-mean-plot approach. In Appendix D, Tables A5 and A6 show the results for the fine and rough grid approach respectively. Table 7. Bias and run time (s), the latter between brackets, for the approximation $\hat{C}^{\approx}_{kM,0,1+}(0.05)$ on the 2015 and 2016 pricing game dataset. This is given for the weighted-mean-plot approach and for several different numbers of clusters for the 0- and 1-group. A first important remark is that there are only 275 (93) unique exposures in the 2015 (2016) pricing game dataset. Hence, for a larger number of clusters on the exposures, there is no gain in run time, since we are again looping over all unique exposures. Due to the randomness of selecting the clusters, a larger number of clusters does not always yield a lower bias. Nevertheless, the bias of all considered approximations is very low. More specifically, it is slightly higher than the bias of the corresponding $\hat{C}^{\approx}_{kM,0,1+}(0.05)$ approximation, but still smaller than that of the original marginal approximation. Finally, we do see an increase in run time for a larger number of clusters. These run times are clearly smaller than those of the corresponding $\hat{C}^{\approx}_{kM,0,1+}(0.05)$ approximation, but still larger than those of the original marginal approximation. The weighted-mean-plots are shown in Figures 8 and A4, the latter in Appendix D. Most of these approximations are very close to the exact weighted-mean-plot, apart from the one shown in Figure 8a.
There, we see that the values around an exposure of 0.8 are estimated somewhat higher than they should be.
Since the bias of the original marginal approximation is already very low, we do not recommend the k-means approximations: they result in a lower bias, but at the cost of a larger run time. Another important reason for this recommendation is that more boundary values imply a lower bias for the original marginal approximation, whereas a larger number of clusters does not guarantee a lower bias for the k-means approximation.

Severity
The goal of this section is to approximate the concordance probability (8) in a fast and accurate way for the severity model of Section 2.2, using the 1,000,000 bootstrapped pairs of observations and predictions.
Before we can determine the bias of the concordance probability estimates, we need to know its exact value. This can be determined by looping over all observations and each time selecting the rows with an observation strictly larger than the considered observation plus ν. In each iteration, we store the number of selected rows in u, while v represents the number of predictions in this selection that are larger than the prediction of the considered element. Finally, the exact concordance probability is obtained by dividing the sum of all v values by the sum of all u values. An important note on this way of working is that we can no longer take advantage of the small number of unique values in the observations, since their predictions can differ.
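The exact computation just described amounts to a short loop (a Python sketch; the function name and the toy arrays are ours, and real severity data would replace them):

```python
import numpy as np

def exact_severity_concordance(y, pred, nu):
    # loop over all observations; row j is comparable with row i when y[j] > y[i] + nu
    u_total = v_total = 0
    for yi, pi in zip(y, pred):
        sel = y > yi + nu                  # comparable rows for this observation (u)
        u_total += np.sum(sel)
        v_total += np.sum(pred[sel] > pi)  # concordant rows among them (v)
    return v_total / u_total
```

The loop runs over every row, which is exactly why the run times in the next paragraph grow so quickly with the sample size.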
For all considered values of ν, the exact concordance probability is calculated and reported in Table 9, together with its run time. For larger values of ν, the concordance probability increases, but the run time decreases. The latter can be explained by the fact that a larger value of ν coincides with fewer comparable pairs. A general conclusion is that it takes a tremendous amount of time to calculate the concordance probability exactly, which is why we will approximate these values in a faster way.

Marginal Approximation
A first approximation is the marginal approximation, where a grid is placed on the $(Y_S, \pi(X))$ space. The q boundary values $\tau = (\tau_0 \equiv -\infty, \tau_1, \ldots, \tau_q, \tau_{q+1} \equiv +\infty)$ are evenly spaced percentiles of the empirical distribution of the observed values of $Y_S$, and the same set of boundary values is used for the dimension $\pi(X)$. The marginal approximation of the concordance probability (8) can then be computed as
$$\hat{C}^{\approx}_{M}(\nu) = \frac{\sum_{\tau_{ij}} \sum_{\tau_{kl}} c_{\tau_{ij},\tau_{kl}} \, n_{\tau_{ij},\tau_{kl}}}{\sum_{\tau_{ij}} \sum_{\tau_{kl}} \left( c_{\tau_{ij},\tau_{kl}} + d_{\tau_{ij},\tau_{kl}} \right) n_{\tau_{ij},\tau_{kl}}},$$
where $c_{\tau_{ij},\tau_{kl}}$ ($d_{\tau_{ij},\tau_{kl}}$) indicates whether the comparison of regions $\tau_{ij}$ and $\tau_{kl}$ is concordant (discordant), and $n_{\tau_{ij},\tau_{kl}}$ is the product of the number of elements in regions $\tau_{ij}$ and $\tau_{kl}$.
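A minimal Python sketch of such a grid-based approximation follows (our own reading, not the article's code: we count a pair of grid regions as comparable only when every observation in one region exceeds every observation in the other by more than ν, and as concordant or discordant according to the ordering of their prediction bins):

```python
import numpy as np

def marginal_severity_concordance(y, pred, nu, q=50):
    # evenly spaced quantile boundary values, one grid dimension per variable
    by = np.quantile(y, np.linspace(0, 1, q + 1)[1:-1])
    bp = np.quantile(pred, np.linspace(0, 1, q + 1)[1:-1])
    n = np.zeros((q, q))
    np.add.at(n, (np.searchsorted(by, y), np.searchsorted(bp, pred)), 1)
    lo = np.concatenate([[-np.inf], by])   # lower edges of the observation bins
    hi = np.concatenate([by, [np.inf]])    # upper edges of the observation bins
    cum = np.cumsum(n, axis=1)             # cumulative counts over prediction bins
    tot = n.sum(axis=1)
    conc = disc = 0.0
    for i in range(q):
        for k in range(q):
            # pair of obs bins (i, k) comparable only if every observation in
            # bin k exceeds every observation in bin i by more than nu
            if lo[k] >= hi[i] + nu:
                conc += np.sum(n[i] * (tot[k] - cum[k]))  # pred bin strictly higher
                disc += np.sum(n[i] * (cum[k] - n[k]))    # pred bin strictly lower
    return conc / (conc + disc)
```

Only the q × q table of region counts is touched in the double loop, so the cost no longer grows with the number of observation pairs.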

k-Means Approximation
Another approximation is the k-means approximation. For this approximation, the dataset is reduced to a smaller set of clusters that are jointly constructed based on the observed outcomes and the predictions. As a result, (8) can be approximated as
$$\hat{C}^{\approx}_{kM}(\nu) = \frac{\sum_{l} \sum_{m} w_l w_m \, \mathbb{1}\{y_{S,m} > y_{S,l} + \nu\} \, \mathbb{1}\{\pi_m > \pi_l\}}{\sum_{l} \sum_{m} w_l w_m \, \mathbb{1}\{y_{S,m} > y_{S,l} + \nu\}},$$
where $y_{S,l}$ and $\pi_l$ are respectively the observed outcome and the prediction of the representation of the l-th cluster, which is the centroid in the case of k-means, and $w_l$ is the weight of the l-th cluster, determined by the percentage of observations that pertain to the l-th cluster.
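A joint clustering of outcomes and predictions can be sketched as follows (a minimal illustration under our own assumptions, again with a hand-rolled Lloyd's algorithm for self-containedness; a library clustering routine could be substituted). The data are standardised before clustering so that both dimensions contribute comparably:

```python
import numpy as np

def kmeans_2d(X, k, iters=30, seed=0):
    # plain Lloyd's algorithm, initialised on a random subset of the rows
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        lab = d.argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                centers[j] = X[lab == j].mean(axis=0)
    weights = np.bincount(lab, minlength=k) / len(X)
    return centers, weights

def kmeans_severity_concordance(y, pred, nu, k=100):
    # jointly cluster (outcome, prediction), then evaluate on the centroids
    X = np.column_stack([y, pred])
    mu, sd = X.mean(axis=0), X.std(axis=0)
    centers, w = kmeans_2d((X - mu) / sd, k)
    cy = centers[:, 0] * sd[0] + mu[0]             # centroid outcomes
    cp = centers[:, 1] * sd[1] + mu[1]             # centroid predictions
    comp = cy[None, :] > cy[:, None] + nu          # comparable centroid pairs
    conc = comp & (cp[None, :] > cp[:, None])      # concordant among them
    W = w[:, None] * w[None, :]
    return (conc * W).sum() / (comp * W).sum()
```

The exact pairwise loop over all observations is replaced by a loop over k centroids, which is what drives the much smaller run times reported for this method.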
The results of the aforementioned approximations can be found in Table 10. There is clearly a smaller bias for a larger number of boundary values or clusters; the disadvantage is that this coincides with a larger run time. There is no clear relation between the bias and the chosen value of ν. Nevertheless, we do see a shorter run time for higher values of ν, which was already noticed during the exact calculation of the concordance probability and can be explained by the smaller number of comparable pairs. For severity models, we prefer the k-means approximation due to its much smaller run time, combined with a very small bias.

Conclusions
Various discrepancy measures and extensions thereof have already been presented in the actuarial literature (Denuit et al. 2019). However, the concordance probability is seldom used in actuarial science, although it is very popular in the machine learning and statistical literature. In this article, we extend the concordance probability to the needs of frequency and severity data in an insurance context. Both are typically used to calculate the technical premium of a non-life insurance product. For the frequency model, we adapt the concordance probability with respect to the exposure and the fact that the number of claims is not a binary variable. For the severity model, we make sure that claims that are nearly identical in claim cost are not taken into account. The concordance probability measures a model's discriminatory power and expresses its ability to distinguish risks from each other, a property that is particularly important in non-life insurance. Since it is very time consuming to estimate the above measures for the sizes of frequency and severity data typically encountered in practice, several approximations based on computationally efficient algorithms are applied. For the frequency models, we prefer the so-called original marginal approximation, since it has the smallest run time. For these frequency models, it is also possible to visualize the introduced concordance probability as a function of the exposure in the so-called weighted-mean-plot. For the severity models, we prefer the k-means approximation due to its small run time combined with a very small bias.

Table A3. Bias and run time (s), the latter between brackets, for the approximation $\hat{C}^{\approx}_{kM,0,1+}(0.05)$ on the 2015 and 2016 pricing game dataset. This is given for the fine grid approach and for several different numbers of clusters for the 0- and 1-group.

Table A6. Bias and run time (s), the latter between brackets, for the approximation $\hat{C}^{\approx}_{ep,kM,0,1+}(0.05)$ on the 2015 and 2016 pricing game dataset. This is given for the rough grid approach and for several different numbers of clusters for the 0- and 1-group.