1. Introduction
The protection of policyholder privacy, covering a wide array of stakeholders from individuals to small businesses, privately owned companies, and local government funds, is a critical issue in the contemporary digital era. With the advent of digital data collection and storage, insurance companies, which traditionally rely on detailed individual claim information, are increasingly recognizing the importance of also understanding market-level trends and severities through grouped data analysis. Data vendors and public databases attempt to address this concern by providing data in a summarized or grouped format. Such data treatment necessitates viewing them as independent and identically distributed (i.i.d.) realizations of a random variable, subjected to interval censoring over multiple contiguous intervals. The analysis of grouped sample data has largely depended on
maximum likelihood estimation (MLE). However, MLE can result in models that are overly sensitive to anomalies in the data distribution, such as contamination, Tukey (1960), or the presence of disproportionately heavy point masses at specific values, a scenario frequently encountered in the actuarial field, particularly within payment-per-payment and payment-per-loss data contexts, Poudyal et al. (2023).
The drive for robustness against the sensitivity of MLE has led to the exploration and establishment of various robust estimation methods across different data scenarios, with the notable exception of grouped data. This gap underlines the fundamental motivation of this scholarly work: to introduce an innovative robust estimation approach specifically designed to tackle the distinct challenges associated with the analysis of grouped data.
Within the domain of robust statistical estimation literature, the broad category of L-statistics, Chernoff et al. (1967), stands out as a comprehensive toolkit, spanning a wide array of robust estimators along with their inferential justification, such as the methods of trimmed moments (MTM) and winsorized moments (MWM). MTM and MWM approaches have been effectively applied in actuarial loss data scenarios, contingent on the existence of the quantile function for the assumed underlying distribution. Studies by Brazauskas et al. (2009) and Zhao et al. (2018) have, respectively, applied MTM and MWM to datasets with completely observed ground-up actuarial loss severity. In contexts of incomplete actuarial loss data, particularly for payment-per-payment and payment-per-loss, Poudyal (2021a) and Poudyal et al. (2023) have, respectively, implemented MTM and MWM. These investigations have further established that trimming and winsorizing serve as effective strategies for enhancing the robustness of moment estimation in the presence of extreme claims, Gatti and Wüthrich (2023). However, the adaptability and applicability of MTM/MWM to situations involving grouped data, especially where the quantile function may be undefined within certain intervals of interest, remain open for investigation. This scholarly work aims to explore this potential, particularly in light of the Method of Truncated Moments (MTuM), a novel approach introduced by Poudyal (2021b) for completely observed ground-up loss datasets, which implements predetermined lower and upper truncation points to effectively manage tail sample observations.
Robust statistical methods are particularly important in the insurance sector, especially for their effects on pricing strategies and rate regulation. Traditional estimation methods, like MLE, that are sensitive to data anomalies can lead to inaccuracies in risk assessment, thus affecting the fairness and reliability of insurance premiums. This has direct implications for regulatory compliance and the development of insurance products. Methodologies that ensure accuracy and fairness in premium setting are crucial, as premiums influence both insurer profitability and policyholder satisfaction. Although MTM/MWM offer robust alternatives to MLE, they are not directly applicable in the context of grouped data. The proposed MTuM approach therefore aims to address these industry challenges by offering a more stable and fair estimation framework, which could contribute significantly to the improvement of insurance pricing models and regulatory practices.
Regarding the analysis of grouped data, Aigner and Goldberger (1970) explored the estimation of the scale parameter of the single-parameter Pareto distribution via MLE and four variants of least squares. As a robust alternative to MLE for grouped data, Lin and He (2006) examined the approximate minimum Hellinger distance estimator (Beran 1977a, 1977b), which can be asymptotically as efficient as the MLE. Additionally, Victoria-Feser and Ronchetti (1997) demonstrated that, in the presence of minor model contaminations, optimal bounded influence function estimators offer greater robustness than MLE for grouped data. The strategy of optimal grouping, in the sense of minimizing the loss of information, was introduced by Schader and Schmid (1986); however, this method remains within the likelihood estimation framework, Kleiber and Kotz (2003).
Therefore, the fundamental objective of this manuscript is to investigate the robustness of the MTuM estimator, specifically for the tail index of grouped single-parameter Pareto distributions, and to evaluate its performance against the corresponding MLE. Asymptotic distributional properties, such as normality, consistency, and the asymptotic relative efficiency in relation to the MLE, are established for the purpose of inferential justification. In addition, the paper strengthens its theoretical results with extensive simulation studies. It is noteworthy that the moments, when subject to threshold truncation and/or censoring, are always finite, irrespective of the underlying true distribution.
The structure of the remainder of this manuscript is outlined as follows. Section 2 offers a succinct summary of the scenarios involving grouped data, encompassing a variety of probability functions. Section 3 concentrates on the elaboration of the Method of Truncated Moments (MTuM) procedures specifically designed for grouped data, along with a discussion of the justification for their inferential application. An extensive simulation study is undertaken in Section 4 to augment the theoretical results across diverse scenarios. The manuscript concludes in Section 5, presenting our closing remarks and outlining possible paths for further research.
2. Pareto Grouped Data
Due to the complexity of the involved theory, we only investigate the single-parameter Pareto distribution in this scholarly work. As considered by Poudyal (2021b, sct. 3), let $Y \sim \mathrm{Pareto\,I}(\alpha, x_0)$ with the distribution function $F_Y(y) = 1 - (x_0/y)^{\alpha}$ for $y \geq x_0$, and zero elsewhere. Here, $\alpha > 0$ represents the shape parameter, often referred to as the tail index, and $x_0 > 0$ is the known lower bound threshold. Consequently, if we define $X := \log(Y/x_0)$, then $X$ follows an exponential distribution, $X \sim \mathrm{Exp}(\theta)$ with $\theta = 1/\alpha$, with its distribution function given by $F(x) = 1 - e^{-x/\theta}$ for $x \geq 0$. Hence, estimating $\alpha$ is equivalent to estimating the exponential parameter $\theta$. Thus, for the purpose of analytic simplicity, we investigate $\mathrm{Exp}(\theta)$ rather than $\mathrm{Pareto\,I}(\alpha, x_0)$. The development and asymptotic behavior of MTuM estimators will be explored for a grouped sample drawn from an exponential distribution.
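To make the Pareto-to-exponential reduction concrete, the following sketch (with illustrative values of $\alpha$ and $x_0$; the variable names are ours) simulates single-parameter Pareto losses by inverse transform and verifies numerically that $X = \log(Y/x_0)$ behaves as an exponential sample with mean $\theta = 1/\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, x0 = 2.5, 100.0   # illustrative tail index and known lower bound
n = 100_000

# Single-parameter Pareto via inverse transform: F(y) = 1 - (x0/y)^alpha, y >= x0
u = rng.uniform(size=n)
y = x0 * (1.0 - u) ** (-1.0 / alpha)

# The log-transform X = log(Y/x0) is Exp(theta) with theta = 1/alpha
x = np.log(y / x0)
theta_hat = x.mean()

print(round(theta_hat, 3))  # close to 1/alpha = 0.4
```

Any estimator of the exponential mean $\theta$ therefore immediately yields an estimator $\hat{\alpha} = 1/\hat{\theta}$ of the Pareto tail index.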
Let $c_0 < c_1 < \cdots < c_m$ be the group boundaries for the grouped data, where we define $c_0 = 0$ and $c_m = \infty$. Let $X \sim \mathrm{Exp}(\theta)$, where $X$ has pdf $f$ and cdf $F$. The computation of the empirical distribution function at the group boundaries is clear, but inside the intervals, the linearly interpolated empirical cdf as defined in Klugman et al. (2019, sct. 14.2) is the most common one. The linearly interpolated empirical cdf, called “ogive” and denoted by $F_n$, is defined as
$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}}\,F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}}\,F_n(c_j), \quad c_{j-1} \leq x \leq c_j, \quad 1 \leq j \leq m-1. \tag{1}$$
In the complete data case, we observe the empirical frequencies $n_1, \ldots, n_m$ of $X$, where $n_j$ is the number of observations falling in the interval $(c_{j-1}, c_j]$, giving $F_n(c_j) = \frac{1}{n}\sum_{i=1}^{j} n_i$, where $n = \sum_{j=1}^{m} n_j$ is the sample size.
Clearly, the empirical distribution is not defined in the interval $(c_{m-1}, c_m) = (c_{m-1}, \infty)$, as it is impossible to draw a straight line joining the two points $(c_{m-1}, F_n(c_{m-1}))$ and $(\infty, 1)$ unless $F_n(c_{m-1}) = 1$.
The corresponding linearized population cdf $F^{*}$ is defined by
$$F^{*}(x) = \frac{c_j - x}{c_j - c_{j-1}}\,F(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}}\,F(c_j), \quad c_{j-1} \leq x \leq c_j, \quad 1 \leq j \leq m-1. \tag{2}$$
The corresponding density function $f_n$, called the histogram, is defined as
$$f_n(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} = \frac{n_j}{n\,(c_j - c_{j-1})}, \quad c_{j-1} < x \leq c_j, \quad 1 \leq j \leq m-1. \tag{3}$$
The empirical quantile function (the inverse of $F_n$) is then computed as
$$F_n^{-1}(s) = c_{j-1} + \frac{s - F_n(c_{j-1})}{F_n(c_j) - F_n(c_{j-1})}\,(c_j - c_{j-1}), \quad F_n(c_{j-1}) \leq s \leq F_n(c_j).$$
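As an illustration of these constructions, the sketch below (the function names are ours, not from the paper) computes the ogive by linear interpolation of the empirical cdf at the finite group boundaries, and the histogram density obtained as its derivative; neither is defined beyond $c_{m-1}$ when the last group is unbounded:

```python
import numpy as np

def ogive(x, c, counts, n):
    """Ogive F_n(x): linear interpolation of the empirical cdf at the
    finite group boundaries c[0] < ... < c[-1]."""
    Fc = np.concatenate([[0.0], np.cumsum(counts) / n])  # F_n at each boundary
    return np.interp(x, np.asarray(c, dtype=float), Fc)

def histogram_density(x, c, counts, n):
    """Histogram f_n(x) = n_j / (n (c_j - c_{j-1})): derivative of the ogive."""
    c = np.asarray(c, dtype=float)
    dens = np.asarray(counts, dtype=float) / (n * np.diff(c))
    j = np.clip(np.searchsorted(c, x, side="right") - 1, 0, len(dens) - 1)
    return dens[j]

# Example: 100 observations grouped into (0,1], (1,2], (2,3]
c, counts = [0.0, 1.0, 2.0, 3.0], [50, 30, 20]
print(ogive(0.5, c, counts, 100))              # 0.25
print(histogram_density(0.5, c, counts, 100))  # 0.5
```

Note that the piecewise-constant histogram is exactly the density whose cdf is the ogive, which is why the two functions share the same boundary inputs.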
If individual claim losses $X$, when grouped, are subjected to further changes, like truncation, interval censoring, or coverage adjustments, then the underlying distribution function requires suitable modifications. For example, if $m$ groups ($n$ observations in total) are provided and it is known that only data above a deductible $d$ appeared, then the distributional assumption is that we observe the conditional random variable $X \mid X > d$, with the group boundaries satisfying $c_0 = d$.
3. MTuM for Grouped Data
For both MTM and MWM, if the right trimming/winsorizing proportion $b$ is such that $1 - b > F_n(c_{m-1})$, then $F_n^{-1}(1-b)$ does not exist, as the linearized empirical distribution $F_n$ is not defined in the interval $(c_{m-1}, \infty)$; see Equation (1). As a consequence, the corresponding trimmed/winsorized sample moment is not defined on that interval. Thus, in order to apply the MTM/MWM approach for a grouped sample, we always need to make sure that $1 - b \leq F_n(c_{m-1})$, but this is problematic across different samples with a fixed right trimming/winsorizing proportion $b$. With this fact in consideration, the asymptotic distributional properties of MTM and MWM estimators from grouped data are very complicated to derive analytically, if not intractable. With the MTuM, however, we can always choose the right truncation threshold $T$ such that $F_n$ is defined at $T$. Therefore, we proceed with the MTuM approach for grouped data in the rest of this section. Let $t$ and $T$, with $t < T$, be the left and right truncation points, respectively.
Let us introduce the following notations:
Proposition 1. Suppose . Then, .
Proof. Clearly,
and
Therefore,
□
The following corollary is an immediate consequence of Proposition 1.
Corollary 1. Let be the vector of empirical distribution function values evaluated at the group boundaries vector . Then, is , where , , with for all .
Assume that
. Then,
where
also, consider
We now define the Method of Truncated Moments (MTuM) estimator for grouped data. By using the empirical cdf, Equation (1), and pdf, Equation (3), the sample truncated moment for grouped data, as defined by Poudyal (2021b), is given by
By using Equation (2), the corresponding linearized/grouped population mean is
The truncated estimator of $\theta$ is determined by equating the sample truncated moment, as specified in Equation (6), with the population truncated moment, as presented in Equation (7). This equation is then solved for $\theta$. The solutions obtained are defined as the MTuM estimator of $\theta$, provided that such a solution exists.
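A numerical sketch of this moment-matching step is given below. It computes the truncated first moment of a piecewise-uniform (histogram-type) density, normalized over $(t, T]$, for both the sample frequencies and the exponential model; the function names and the bracketing interval are our own choices, and the exact forms of Equations (6) and (7) in the paper may differ in detail:

```python
import numpy as np
from scipy.optimize import brentq

def trunc_mean_piecewise(c, probs, t, T):
    """Truncated first moment E[X 1{t<X<=T}] / P(t<X<=T) for the
    piecewise-uniform (histogram-type) density with mass probs[j]
    on the interval (c[j], c[j+1])."""
    num, den = 0.0, 0.0
    for j in range(len(probs)):
        lo, hi = max(c[j], t), min(c[j + 1], T)
        if hi <= lo:
            continue  # no overlap of this cell with (t, T]
        dens = probs[j] / (c[j + 1] - c[j])
        num += dens * (hi**2 - lo**2) / 2.0  # integral of x over the overlap
        den += dens * (hi - lo)              # probability of the overlap
    return num / den

def mtum_estimate(c, counts, t, T, bracket=(1e-3, 1e3)):
    """MTuM-type estimate of theta: match the sample and model
    truncated means over (t, T] and solve for theta."""
    c = np.asarray(c, dtype=float)
    n = float(np.sum(counts))
    sample_mean = trunc_mean_piecewise(c, np.asarray(counts) / n, t, T)

    def gap(theta):
        # exponential cell probabilities over the finite boundaries
        model_probs = np.diff(1.0 - np.exp(-c / theta))
        return trunc_mean_piecewise(c, model_probs, t, T) - sample_mean

    return brentq(gap, *bracket)
```

Because the same truncated-mean functional is applied to the observed cell frequencies and to the exponential cell probabilities, the root of `gap` is the MTuM-type estimate; under Conjecture 1 (monotonicity), the root is unique whenever it is bracketed.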
Assuming
and after some computation, we obtain
Note that
. Thus, by the delta method (see, e.g.,
Serfling 1980, Theorem A, p. 122), we have
where
and
. Consider
. Clearly, if
then
and if
,
Due to the involved nature of the function , it is difficult to establish analytically whether it is increasing or decreasing. However, at least for the cases displayed in Figure 1, it appears to be an increasing function. Generally, we summarize the result in the following conjecture.
Conjecture 1. The function is strictly increasing.
Proposition 2. The function has the following limiting values.
Proof. These limits can be established by using L’Hôpital’s rule. □
Now, assuming Conjecture 1 is true, then together with Proposition 2, we have
Theorem 1. The equation has a unique solution provided that
Solve the equation for $\theta$, say $\hat{\theta}$. Then, again by the delta method, we conclude that $\hat{\theta}$ is asymptotically normal. Note that if both the left- and right-truncation points lie in the same interval, then the parameter to be estimated disappears from the equation, and hence we do not consider this case for further investigation. Define
Then, we obtain a fixed point function as , where
However, we need to consider the condition . Therefore, we need to be careful about the initialization, as the right truncation point $T$ cannot be a boundary point because, if it were, the corresponding denominator in the fixed point function would vanish and we would not be able to divide by it.
Now, let us compute the derivative of the fixed point function with respect to $\theta$ using implicit differentiation.
Case 1: Assume that the two truncation points are in two consecutive intervals, i.e., assume that
. Then,
, where
Case 2: The other case is that the two truncation points are not in two consecutive intervals, i.e., assume that . Then, , where and , and are defined above.
To obtain the exponential grouped MLE, consider
Then, following Xue and Song (2002), we have , where .
Note that after finding the derivative,
can be expressed as
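For reference, the grouped MLE can also be obtained by direct numerical maximization of the multinomial log-likelihood $\sum_j n_j \log p_j(\theta)$ rather than through the closed-form derivative above; the following sketch (our own naming, not the paper's notation) does this for the exponential model:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def grouped_mle_exponential(c, counts):
    """Grouped MLE of the exponential mean theta: maximize the multinomial
    log-likelihood sum_j n_j log p_j(theta), p_j(theta) = F(c_j) - F(c_{j-1})."""
    c = np.asarray(c, dtype=float)            # finite boundaries c_0 < ... < c_{m-1}
    counts = np.asarray(counts, dtype=float)  # n_1, ..., n_m (last group unbounded)

    def negloglik(theta):
        Fc = 1.0 - np.exp(-c / theta)
        p = np.diff(np.concatenate([Fc, [1.0]]))  # append F(infinity) = 1
        return -np.sum(counts * np.log(np.clip(p, 1e-300, None)))

    res = minimize_scalar(negloglik, bounds=(1e-6, 1e6), method="bounded")
    return res.x
```

With expected cell counts plugged in, the maximizer recovers the generating parameter, which provides a quick sanity check of the implementation.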
The asymptotic performance of the MTuM estimator is measured through the asymptotic relative efficiency (ARE) in comparison to the grouped MLE. The ARE (see, e.g., Serfling 1980; van der Vaart 1998) is defined as
The primary justification for employing the MLE as a benchmark for comparison lies in its optimal asymptotic behavior in terms of variance, though this comes with the typical proviso of holding only “under certain regularity conditions”. Therefore, the desired ARE as given by Equation (11) is computed as
The numerical values of Equation (12), from with group boundaries vector , are summarized in Table 1. As shown in Table 1, greater robustness is achieved with wider truncation thresholds, that is, as the distance between $t$ and $T$ increases.
4. Simulation Study
This section augments the theoretical findings established in Section 3 with simulations. The primary objective is to determine the sample size required for the estimators to be practically unbiased (acknowledging that they are asymptotically unbiased), to validate the asymptotic normality, and to confirm that their finite-sample relative efficiencies (REs) converge towards the respective AREs. For calculating the RE, the MLE is utilized as the reference point. Consequently, the concept of asymptotic relative efficiency outlined in Equation (11) is adapted for finite sample analysis as follows:
The design of the simulation is detailed below, covering both the generation of data and the computation of various statistics as described:
- (i)
The underlying ground-up distribution is assumed to be exponential with a mean parameter .
- (ii)
Different sample sizes are explored: .
- (iii)
A total of 1000 samples are generated for each scenario.
- (iv)
The sample data are grouped according to the specified group boundaries: .
- (v)
We consider the following vectors of group boundaries:
- (vi)
For each grouping, 1000 estimates of are computed under the Method of Truncated Moments (MTuM) with varying truncation points for grouped data, denoted as .
- (vii)
The average estimated , denoted , is calculated as .
- (viii)
This process is repeated 10 times, yielding averages .
- (ix)
The overall mean,
, and the standard deviation,
, of these averages are computed as follows:
- (x)
- (xi)
Similarly, the finite-sample relative efficiency (RE) of the MTuM with respect to the grouped MLE is calculated as . The mean and standard deviations of these RE values are reported for different vectors of group boundaries.
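The simulation loop described above can be sketched compactly as follows (with illustrative boundaries, a single sample size, and far fewer replications than in the study; the grouped MLE is used as the estimator here purely to keep the sketch self-contained):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta_true, n, reps = 1.0, 500, 200
c = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])  # illustrative finite boundaries

def grouped_mle(counts):
    """Grouped MLE of theta from the cell counts (last cell is (3, inf))."""
    def nll(theta):
        Fc = 1.0 - np.exp(-c / theta)
        p = np.diff(np.concatenate([Fc, [1.0]]))
        return -np.sum(counts * np.log(np.clip(p, 1e-300, None)))
    return minimize_scalar(nll, bounds=(1e-3, 1e3), method="bounded").x

estimates = []
for _ in range(reps):
    x = rng.exponential(theta_true, size=n)                          # steps (i)-(iii)
    counts, _ = np.histogram(x, bins=np.concatenate([c, [np.inf]]))  # step (iv)
    estimates.append(grouped_mle(counts))                            # step (vi)

print(round(np.mean(estimates) / theta_true, 3))  # analogue of the ratio in item (x)
```

The same loop, with `grouped_mle` replaced by an MTuM solver and with the boundary vectors of item (v), would produce analogues of the quantities reported in the tables below.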
The outcomes of the simulations are documented in Table 2, Table 3, Table 4, Table 5 and Table 6. The entries represent the ratios of the mean estimated values to the true parameter based on 1000 samples repeated 10 times; that is, the ratio of the estimated and the true , as described in item (x) above. The corresponding standard errors are presented in parentheses. In all tables, the last three columns (with ∞) represent analytic results, not results from simulations. The third to last column gives the asymptotic relative efficiency of the MTuM with respect to the grouped MLE, coming from Equation (12) and Table 1. For example, the group boundary vectors considered in Table 1 and Table 4 are exactly the same and given by
Then, for , the corresponding entries in those two tables should match, and those matching entries, i.e., , are boxed in both tables. Similarly, the second to last column gives the asymptotic relative efficiency of the MTuM with respect to the un-grouped MLE, and the very last column represents the asymptotic relative efficiency of the grouped MLE with respect to the un-grouped MLE.
If both the truncation points are in the same interval, say , then we have . Therefore, the parameter to be estimated disappears, and hence the four corresponding rows in Table 6 are reported as . As we move in sequence from Table 2 through Table 6, it becomes noticeable that the ratio of the estimated to the true , i.e., , approaches the true asymptotic value of 1 at a more gradual pace. This is because the length of the intervals increases going from Table 2 to Table 6. More specifically, both our intuition and the data presented in the tables suggest that when there is a wider gap between the thresholds (namely, $t$ and $T$), the estimators tend to approach the true values at a slower rate.
In Table 2, Table 3, Table 4 and Table 5, it is interesting to observe that, even for the sample size , the estimator successfully estimates its corresponding parameter , with less than of relative bias, with one exception for . As seen in Table 6, the relative bias is within only for the sample of size and for . Similarly, as observed in those tables, it is clear that all the REs converge towards their corresponding AREs.
5. Concluding Remarks
In this scholarly work, we have introduced a novel Method of Truncated Moments (MTuM) estimator designed to estimate the tail index from grouped Pareto loss severity data, offering a robust alternative to maximum likelihood estimation (MLE). We have established theoretical justifications for the existence and asymptotic normality of the designed estimators. Additionally, we have conducted a detailed investigation into the finite sample performance across various sample sizes and group boundary vectors through a comprehensive simulation study.
Looking ahead, this paper predominantly addressed the estimation of the mean parameter of an exponential distribution (or, equivalently, the tail index of a single-parameter Pareto distribution) using grouped sample data. A natural avenue for future research is therefore the extension of the proposed methodology to more complex scenarios and models. However, particularly for distributions with multiple parameters, examining the nature of the function , as presented in Equation (7) and Conjecture 1, can be highly challenging, if not infeasible. The task of providing asymptotic inferential justification for the MTuM methodology when applied to multi-parameter distributions presents similar difficulties. In this context, a potential direction for future research involves adopting an algorithmic approach (i.e., designing simulation-based estimators for complex models) rather than focusing solely on inferential justification, Efron and Hastie (2016, p. xvi).
Moreover, evaluating the performance of this novel MTuM estimator in diverse practical risk analysis scenarios remains an important area for further assessment. Attempting to apply the designed MTuM methodology to real grouped insurance data revealed the challenge of finding publicly available data that fit the single-parameter Pareto (or equivalently, an exponential) model well. This issue, along with the often poor fit of the data to the Pareto model, highlighted the importance of adapting the MTuM to many other distributions as well. Such adaptation will provide the necessary flexibility to select the most suitable underlying models based on the initial diagnostic tests of the datasets. Therefore, there is a need to broaden the theoretical development of the MTuM approach to at least include the location-scale family of distributions.