1. Introduction
This paper presents a simple method for estimating the tail index and the Gini coefficient G using only the first quartile, the median M, and the third quartile. Our approach uses the mean absolute deviation (MAD)
Bloomfield and Steiger (
1983) and truncation to estimate the parameters and to interpret them in terms of the corresponding sub-areas, computed as integrals of the underlying quantile function. Our approximations directly yield simple closed-form formulas relating the tail index and the Gini coefficient to the quartiles. The proposed MAD-based method is robust (50% breakdown point), making it well suited to heavy-tailed data Hampel (1974) with the outliers often present in income and wealth distribution data
Bennett et al. (
2019);
Brzezinski (
2014).
One of our main results is the approximation of the tail index
and the Gini coefficient
G using the median and weighted average of the other two quartiles. By considering different quadrature approximation methods, we obtain different weights and a number of approximations to the parameters. For example, in one such approximation, the tail index
is shown to be half the inverse Bowley–Galton skewness coefficient
Bowley (
1920), and the Gini index
G is one-half of the ratio of the quartile difference to the lower quartile difference. These approximations allow us to connect classical quantile statistics
Huber (
2009) and Pareto parameters. Our approach offers several advantages:
Computational simplicity—order statistics readily available from standard software;
Applicability to truncated distributions (e.g., income surveys with top-coding);
Elegant geometric and economic interpretation via areas under the quantile function and income groups.
This paper is organized as follows. In
Section 2, we introduce the Pareto distribution and discuss its quantile function. In
Section 3, we derive and interpret mean absolute deviation for a continuous distribution in terms of the subareas of the quantile functions. In
Section 4, we apply these ideas to the Pareto distribution. In
Section 5, we interpret the indices in terms of the ratio of the appropriate subarea integrals. In
Section 6, we provide an economic interpretation of these indices. In
Section 7, we consider the truncated Pareto distribution and derive its mean absolute deviation. This will allow us to express the tail index
and the Gini coefficient in terms of quartiles and, at the same time, address the practical issue of dealing with outliers in real data sets. In
Section 8, we derive an approximation to indices using the mean absolute deviation of a truncated distribution. By contrast, in
Section 9, we derive several approximations to these indices using the mean of the truncated distributions and several integral quadrature methods. The proposed approximations have simple geometric and economic interpretations. The obtained formulas are expressed in terms of very simple expressions involving the median and weighted average of the other two quartiles. In
Section 10, we provide a summary of the proposed approximation formulas. In
Section 11, we summarize our proposed estimation formula and compare it with maximum likelihood estimation. In
Section 12, we present simulation results to illustrate our approach. These simulations show that the proposed estimators achieve relative errors within a few percent of true values for even small sample sizes. In
Section 13, we illustrate our approximations on several real datasets. Finally, in
Section 14, we present some concluding remarks and discuss future directions.
In this paper, we focus on income inequality. Many economists focus on other components of income, such as wages and earnings, or on some utility functions like consumption and leisure
Attanasio and Pistaferri (
2016). We do not advocate a particular viewpoint on established theories in finance and economics (permanent income hypothesis, lifecycle models, wealth accumulation, and others). Our objective is to show that the proposed formulas for the Pareto distribution allow for the derivation of simple economic interpretations in terms of quantile-based metrics. We emphasize that these approximations do not require full knowledge of the underlying distributions. Since the obtained metrics are available in closed form in terms of simple quantile measures, one can study the effect of changes by examining the differences in quantiles over time. This allows for the application of the proposed methodology to time-varying or dynamic quantile models. Therefore, our methodology and formulas are general and can be used to interpret other models described by the Pareto distribution.
3. Mean Absolute Deviation
Let us start by outlining our methodology. It uses mean absolute deviations to derive simple expressions for the tail index and the Gini coefficient, as an alternative to the traditional approach, mainly maximum likelihood estimation. Estimating parameters by minimizing a mean absolute deviation is a standard statistical technique. Specifically, we derive mean absolute deviations in terms of integrals of the quantile function. With simple quadrature approximations to these integrals, we express the metrics of interest as simple expressions involving the quartiles. This allows a simple explanation of the results in terms of quantile metrics.
The proposed approximations do not require the full knowledge of the underlying distribution. The main attractiveness of the proposed method is its simplicity—we express our results in terms of the weighted averages of the quartiles.
Let us start with mean absolute deviations. Suppose X is a real-valued random variable with density $f(x)$ and cumulative distribution function $F(x)$. Let $Q(p)$ be the quantile function defined by $Q(p) = \inf\{x : F(x) \ge p\}$ with $0 < p < 1$. Let $Q_1 = Q(1/4)$, $M = Q(1/2)$, and $Q_3 = Q(3/4)$ denote the first quartile, median, and third quartile, respectively.
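As a concrete illustration of these definitions, the following minimal Python sketch evaluates a quantile function and its quartiles. It assumes the Pareto type I form with scale `x_min` and tail index `alpha`, for which the quantile function is x_min (1 - p)^(-1/alpha). The function and parameter names are illustrative and are not taken from the paper.

```python
def pareto_quantile(p, alpha, x_min=1.0):
    """Quantile function of a Pareto type I law (assumed form):
    Q(p) = x_min * (1 - p) ** (-1 / alpha), for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def pareto_cdf(x, alpha, x_min=1.0):
    """Cumulative distribution function F(x) = 1 - (x_min / x) ** alpha."""
    return 0.0 if x < x_min else 1.0 - (x_min / x) ** alpha


def quartiles(alpha, x_min=1.0):
    """Return the first quartile, the median, and the third quartile."""
    return tuple(pareto_quantile(p, alpha, x_min) for p in (0.25, 0.5, 0.75))


# Hypothetical example: tail index 2 and minimum income 1.
q1, m, q3 = quartiles(alpha=2.0)
```

Applying `pareto_cdf` to each returned quartile recovers 1/4, 1/2, and 3/4, which is a quick way to confirm that the quantile function inverts the CDF.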
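If X has a finite first moment $\mu$, then for any value a we can define the mean absolute deviation H of X around a, following Bloomfield and Steiger (1983), by
$$H(X, a) = E|X - a| = \int_{-\infty}^{a} (a - x)\, dF(x) + \int_{a}^{\infty} (x - a)\, dF(x). \qquad (6)$$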
This is defined in the sense of Lebesgue–Stieltjes integration
Convertito and Cruz-Uribe (
2023);
Rudin (
1976) and applies to continuous and discrete distributions. If $a = \mu$, then $H(X, \mu)$ denotes the mean absolute deviation around the mean. If $a = M$, then $H(X, M)$ denotes the mean absolute deviation around the median M.
It is well-known Bloomfield and Steiger (1983); E. Elsayed (2022); Schwertman et al. (1990); Shad (1969) that for any distribution with finite variance $\sigma^2$, $H(X, M) \le H(X, \mu) \le \sigma$. On the other hand, mean absolute deviations require only the existence of the first-order moment. Throughout this paper, we focus on the mean absolute deviation around the median, and we will use the notation $H = H(X, M)$.
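The ordering of the three deviation measures can be checked on any finite sample. The sketch below is only an illustration: the sample values are hypothetical, and the population standard deviation (`statistics.pstdev`) is used so that the comparison holds exactly for the empirical distribution.

```python
import statistics


def mean_abs_dev(data, center):
    """Mean absolute deviation of the sample around a given center."""
    return sum(abs(x - center) for x in data) / len(data)


# Hypothetical right-skewed sample with one large value.
sample = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5, 30.0]

mu = statistics.fmean(sample)
med = statistics.median(sample)

h_mean = mean_abs_dev(sample, mu)     # MAD around the mean
h_median = mean_abs_dev(sample, med)  # MAD around the median
sigma = statistics.pstdev(sample)     # population standard deviation

# The median minimizes the mean absolute deviation, and the MAD around
# the mean never exceeds the standard deviation.
ordering_holds = h_median <= h_mean <= sigma
```

The single large value inflates the standard deviation far more than either absolute deviation, which is the sensitivity to outliers discussed in the text.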
Since the well-known work of
Fisher (
1920), standard deviation has been used as the primary metric for deviation. The convenience of differentiation and optimization, and the summation of variances for independent variables, has contributed to the widespread use of the standard deviation in estimation and hypothesis testing. On the other hand, using MAD offers a direct measure of deviation and is more resilient to outliers. In computing standard deviation, we square the differences between values and the central point, such as the mean or median. As a result, standard deviation emphasizes larger deviations. The computation of mean absolute deviation offers a direct and easily interpretable measure of deviation.
Usually, one computes mean absolute deviations from the density functions
Kenney and Keeping (
1962) as shown in Equation (
6). It is also possible to derive mean absolute deviations from cumulative distribution functions (CDF) instead of density functions (PDF). For many well-known distributions, the CDFs are often expressed in terms of special functions, and indefinite integrals and sums for these functions are well-known, allowing one to derive expressions for MAD in closed form
Pinsky (
2025).
Alternatively, it is possible to derive mean absolute deviations in terms of the quantile functions. For many distributions, such as Pareto, quantile functions are available in closed form. Computing mean absolute deviations involves computing some integrals of these quantile functions. Using truncation, we derive the parametric solutions through integration of the quantile function between the quartiles. This ensures that the integrals are finite even for distributions exhibiting heavy tails or infinite moments
Bock et al. (
2013);
Rousseeuw and Leroy (
2005).
Our approach eliminates the computational burden of iterative optimization procedures while maintaining accuracy comparable to maximum likelihood estimation. As a result, our approach gives closed-form solutions for parameters that are resistant to outliers and achieves substantial reductions in complexity without sacrificing precision. We have applied this approach to a number of distributions with heavy tails and obtained very good approximations.
Our main idea is the following. For the Pareto distribution considered in this paper, the quantile function has a simple form. We truncate the distribution at the first and third quartiles, which ensures that all integrals are finite. The MAD of the truncated distribution can be computed from the corresponding integrals of the quantile function and expressed in terms of the tail index and the quartiles. On the other hand, using simple quadrature methods to evaluate these integrals, we express the MAD in terms of the quartiles only. Equating these two expressions for the MAD, we derive simple approximations for the tail index and for the Gini index G. The accuracy of the proposed approach depends on the form of the quantile function and on the quadrature rule. We show by extensive simulations that, for the Pareto distribution, the proposed approximations to the tail index and G are accurate, easy to compute, and easy to interpret.
The proposed approach of using the quantile functions and truncation offers many advantages. For many distributions, the quantile function is available in closed form even if the CDF or PDF is not (for example, the Levy or the Cauchy distribution). It is therefore possible to design parameter estimation procedures for these distributions (after truncation) and derive simple closed-form approximations. This allows us to interpret the derived approximations for the tail index and G in terms of widely used quantile metrics such as the interquartile range and Galton skewness.
One of the interesting applications of our method is to study volatility in financial asset prices and investment returns. The standard approach is to use the standard deviation of returns as a measure of risk and to compare investments in terms of Sharpe ratios (returns per unit of risk). Mean absolute deviation is a robust measure of risk and provides an alternative to variance-based risk measures. It has been suggested as an alternative for risk management for non-normal return distributions
Lam et al. (
2021).
Let us elaborate on this connection with risk and volatility. First, consider the Pareto distributions with
. These distributions have finite mean
, and from Equation (
2), we have
. We can interpret
as a measure of deviation from the mean and interpret it as volatility. Then, if we interpret
as some measure of average returns, then the above expression is analogous to the Sharpe ratio widely used in finance to measure risk-adjusted returns. In the more general case for any
, we can consider truncation and show that the mean absolute deviations can be expressed via the integrals of the quantile functions, and in turn, these integrals can be approximately evaluated via quartiles using a number of quadrature approximation methods. For example, in one such method (“midpoint”) applied for the Pareto distribution in
Section 9.2, we obtain
and
. If we think of
X as measuring returns, then the term
for
H can be interpreted as risk, whereas the term
can be interpreted as some measure of average returns. Therefore, with this analogy, the tail index
can again be interpreted as an analogue of the Sharpe ratio. Higher volatility would translate into higher values of the tail index.
Finally, for distributions with finite mean, we have the following result
Pham-Gia and Hung (
2001) for confidence intervals:
where
. With truncation at the quartiles considered in this paper, the median
M of the truncated distribution
is unchanged, the mean
of
is always finite, and the above equation can be rewritten as
where
and
denote the mean absolute deviations of
around the mean
and the median
M, respectively. This is somewhat analogous to the well-known Chebyshev’s inequality for distributions with finite variance
Feller (
1956):
For many cases of interest with Pareto distribution
, resulting in infinite variance. As a result, Chebyshev’s inequality in Equation (
9) cannot be used. However, the bounds with mean absolute deviations in Equation (
7) are applicable, giving us additional inequalities relating mean absolute deviations (“volatility”) and confidence intervals.
On the other hand, our method has limitations. Although the quantile function is available in closed form for many well-known distributions (including Pareto, Weibull, log-normal, and exponential), for many distributions, such as the beta distribution, quantile functions are not available in closed form. For such distributions, we can compute MAD from PDF or CDF (if they are available in closed form) or from some estimated quantile function. One approach that we think may be promising is to consider a convex sum of well-known quantile functions to model a distribution, as suggested in
Gilchrist (
2000). In this case, the resulting mean absolute deviation can be computed in terms of the mean absolute deviations of these quantile functions, and the resulting parameters can be expressed in terms of the quantiles of these functions. We hope to address this in future work.
Another limitation of the proposed method concerns small sample sizes. With truncation, we discard 50% of the points and derive approximations to the parameters from the “central” part of the distribution. Our preliminary results from applying this method to a number of distributions seem promising, even for relatively small sample sizes.
4. Mean Absolute Deviation for Pareto Distribution
For the Pareto distribution, the mean is finite for
. Most of the data sets on income inequality have the tail index
satisfying
(
Brzezinski (
2014);
Catalano et al. (
2009)). Throughout this paper, we assume
and use
H to denote the mean absolute deviation around the median
M. From the above Equation (
6) for
, we immediately obtain
We can express
H in terms of the integrals of the quantile function
as follows. For continuous distributions, we have
,
, and
. For any
, let
and
. Then, we have
We will use the following result for the Pareto distribution: for any
, we have
Olver (
1974)
Let us define two subarea integrals
and
as
Therefore, we can rewrite Equation (
10) for MAD as
If we re-write this as
, then
H has a simple geometrical interpretation as the shaded area shown in
Figure 3b.
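The representation of H as a difference of two areas under the quantile function can be checked numerically. The sketch below assumes the Pareto type I quantile function x_min (1 - p)^(-1/alpha) with alpha > 1, and assumes that the two subarea integrals are the integrals of the quantile function over [0, 1/2] and [1/2, 1]. The helper names are illustrative only.

```python
def pareto_quantile(p, alpha, x_min=1.0):
    """Assumed Pareto type I quantile function."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def subarea(a, b, alpha, x_min=1.0):
    """Closed-form integral of Q(p) over [a, b] for alpha > 1.

    Uses the antiderivative -alpha / (alpha - 1) * (1 - p) * Q(p);
    the product (1 - p) * Q(p) vanishes at p = 1 when alpha > 1.
    """
    coef = alpha / (alpha - 1.0)
    tail = lambda p: x_min * (1.0 - p) ** (1.0 - 1.0 / alpha)
    return coef * (tail(a) - tail(b))


def midpoint_integral(func, a, b, n=20_000):
    """Composite midpoint rule, used here only as an independent check."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


alpha = 3.0
lower = subarea(0.0, 0.5, alpha)   # area under Q below the median rank
upper = subarea(0.5, 1.0, alpha)   # area under Q above the median rank
mad_median = upper - lower         # H as a difference of the two areas

median = pareto_quantile(0.5, alpha)
lower_numeric = midpoint_integral(lambda p: pareto_quantile(p, alpha), 0.0, 0.5)
```

Under these assumptions the two areas add up to the mean alpha x_min / (alpha - 1), and their difference simplifies to alpha (M - x_min) / (alpha - 1).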
Unlike the standard deviation, which is based on the
-norm, the mean absolute deviation is an
-metric and offers a direct and more explainable metric of deviations
Dodge (
1987);
K. Elsayed (
2015);
Pham-Gia and Hung (
2001);
Rousseeuw and Croux (
1993). When applied to the Pareto distribution, the mean absolute deviation
H also has a simple “economic” interpretation. We can interpret
as one-half of the average income of the (0–50%) lower-half income households, whereas
is one-half of the average income of the (50–100%) income households. The mean absolute deviation from the median
H in Equation (
14) can therefore be interpreted as one-half of the difference in mean incomes between the upper- and lower-half income groups.
Finally, let us now relate the mean absolute deviation to skewness. The skewness coefficient S of Groeneveld and Meeden (1984) is defined as
$$S = \frac{\mu - M}{H},$$
where $\mu$ denotes the mean of X.
In economic terms, the numerator is the difference between the mean income and the median income M. The denominator is one-half of the difference between the average income of the upper half (50–100%) income bracket and the lower half (0–50%) income bracket.
5. Interpreting the Tail and the Gini Index
One of the metrics for this distribution is the Gini index
G that measures the deviation from equitable distribution. Formally, it is defined in terms of the so-called Lorenz curve
The Lorenz curve shows the percentage of income earned by people below the value
x. For the Pareto distribution, the Lorenz curve
and the Gini index
G are
This index can be computed as the ratio of area
under the Lorenz curve. This is illustrated in
Figure 4a.
The coefficient G tells us how far the Lorenz curve is from the line of equality. The Gini coefficient G is equal to the area $A$ below the line of perfect equality minus the area $B$ below the Lorenz curve, divided by the area $A$ below the line of perfect equality. In other words, $G = (A - B)/A$.
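The area interpretation of the Gini coefficient can be reproduced numerically. The sketch below assumes the standard Pareto type I results for alpha > 1, namely the Lorenz curve L(p) = 1 - (1 - p)^(1 - 1/alpha) and the Gini index 1 / (2 alpha - 1). These are taken as assumptions about the equations referenced above, and the function names are illustrative.

```python
def pareto_lorenz(p, alpha):
    """Assumed Lorenz curve of a Pareto type I law with alpha > 1."""
    return 1.0 - (1.0 - p) ** (1.0 - 1.0 / alpha)


def gini_from_lorenz(lorenz, n=100_000):
    """Gini coefficient as (A - B) / A, where A = 1/2 is the area below
    the line of perfect equality and B is the area below the Lorenz curve
    (composite midpoint rule on [0, 1])."""
    h = 1.0 / n
    area_b = h * sum(lorenz((i + 0.5) * h) for i in range(n))
    area_a = 0.5
    return (area_a - area_b) / area_a


alpha = 2.0
gini_numeric = gini_from_lorenz(lambda p: pareto_lorenz(p, alpha))
gini_closed_form = 1.0 / (2.0 * alpha - 1.0)
```

The numerical area ratio and the closed form agree closely for alpha = 2, where both equal one-third up to quadrature error.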
We can derive a similar interpretation in terms of the quantile functions. Consider two random variables, X and Y, both with the same but with tail indices and respectively. As before, we assume . Note that a larger represents a more equitable distribution.
Let
and
denote their corresponding quantile functions. The means
and
can be computed as areas under their quantile functions:
On the other hand, for Pareto distributions with
, these means are
and
respectively. Then
If the means of X and Y represent the areas $A$ and $B$ under the corresponding quantile functions, we can interpret the Gini coefficient as follows: The Gini coefficient G is equal to the area $A$ below the quantile function of X minus the area $B$ below the quantile function of Y, divided by the area $A$ below the quantile function of X. In other words, $G = (A - B)/A$. This is illustrated in
Figure 4b. This explanation is analogous to the explanation of
G in terms of the Lorenz curve shown in
Figure 4a.
Finally, throughout this paper, we will use the following: for any two constants
A and
B such that
we have
For example, if we apply this to the formula for the mean of the Pareto distribution with
and
, then we obtain
This has the following geometric explanation in terms of the quantile function shown in
Figure 5.
In the case of perfect inequality (one person has all the wealth), the tail index
and the Gini coefficient is 1. The term
in the expression for
and the term
in Equation (
22) measure deviation from such a perfect inequality. On the other extreme, in the case of perfect equality (everyone has the same income
), the tail index
is infinite and the Gini coefficient
. In such a case, all income would be equal, and the mean income
would be the same as the minimum income
. Therefore, the function
represents the quantile function of “total equality” with the area under
for
equal to
. The mean
for the Pareto distribution is the area under the quantile curve shown in
Figure 5a. The term
is the average minus the minimum income
and represents deviation from the equality case. Therefore, the term
in Equation (
22) represents deviation from perfect inequality for
, whereas the term
represents the deviation of the Gini coefficient from the perfect inequality. This is shown in
Figure 5b.
This interpretation of
and
G in terms of the ratio of areas under the quantile function in
Figure 5b in Equation (
22) is analogous to the interpretation of
G in terms of the ratios of areas under the Lorenz curve in
Figure 4b in Equation (
17).
Interpretation of the Tail Index and Gini Coefficient in Terms of Mean Absolute Deviation
Alternatively, we can interpret the tail index and G in terms of the mean absolute deviations. First, we provide a geometric interpretation for the tail index.
For the Pareto distribution, from Equation (
12) we have
Since for the Pareto distribution with
, the mean
is finite and
, we can easily evaluate
H from Equation (
14) as follows:
After some simple algebra, we obtain
Finally, let us relate the mean absolute deviation
H, the mean
, and the Lorenz curve. Since
, we have
6. “Economic” Interpretation of Tail Index, Gini Index, and MAD
Let us interpret our results in economic terms. If the distribution of income
X follows a Pareto distribution, then we can define any subgroup with income in the range
and compute the average income in this group by considering the corresponding truncated distribution of
X as follows. For any two values,
let
and
. Let
denote the corresponding average income for that group. Then, using the truncation, we have
For example, if we define a low-income group as households with income
, then we have
and
. From Equation (
28) we obtain
In the same manner, we can define other income groups and explicitly compute their average income in terms of
and quartiles. This is summarized in
Table 1.
From this table, we can interpret
as one-half of the average income of the (25–50%) lower-half income households, whereas
is one-half of the average income of the (50–75%) income households. The mean absolute deviation from the median
H in Equation (
14) can be interpreted as one-half of the difference in mean incomes between the upper- and lower-half income groups.
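The average income of an income group defined by two ranks can be sketched as follows. The code assumes a Pareto type I distribution with alpha > 1 and computes the group average as the integral of the quantile function between the two ranks divided by the width of the rank interval. The function name and the choice of four quarter groups are illustrative, not the layout of the paper's table.

```python
def group_mean_income(p_low, p_high, alpha, x_min=1.0):
    """Average income of households ranked between p_low and p_high.

    Assumes Q(p) = x_min * (1 - p) ** (-1 / alpha) with alpha > 1, so the
    integral of Q over [p_low, p_high] has a closed form.
    """
    coef = alpha / (alpha - 1.0)
    tail = lambda p: x_min * (1.0 - p) ** (1.0 - 1.0 / alpha)
    return coef * (tail(p_low) - tail(p_high)) / (p_high - p_low)


alpha = 2.0
groups = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
means = [group_mean_income(lo, hi, alpha) for lo, hi in groups]

# Each quarter holds 25% of households, so the overall mean is the
# equally weighted average of the four group means.
overall_mean = sum(0.25 * m for m in means)
```

With alpha = 2 and x_min = 1 the overall mean equals alpha / (alpha - 1) = 2, and the top quarter alone has an average income of 4, which shows how strongly the upper tail drives the mean.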
The tail index
in Equation (
25) can be written as
. This gives us the following simple interpretation: The tail index is the ratio of the average income
of the (50–75%) income group divided by the differences between this average and the median income, namely
.
The Gini index
in Equation (
25) can be written as
. This gives us the following simple interpretation: The Gini index is the ratio of the difference between the average income and the median
of the (50–75%) income divided by the sum of this average and the median income, namely
.
Finally, we note that for the Pareto distribution, for any average income above some
, we have
Therefore, the ratio of the average income above any threshold to the corresponding rank is independent of x and is just . The term b is referred to as the inverted Pareto coefficient. It represents a measure of income at the top of the distribution and describes the tail of the distribution. In this paper, we will continue to express our formulas in terms of the tail index .
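The threshold independence of this ratio can be illustrated numerically. The sketch below assumes a Pareto type I distribution with alpha > 1 and the standard identification of the inverted Pareto coefficient as alpha / (alpha - 1). It estimates the average income above a threshold by integrating the quantile function over the ranks above that threshold; all names are illustrative.

```python
def pareto_quantile(p, alpha, x_min=1.0):
    """Assumed Pareto type I quantile function."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def mean_income_above(x, alpha, x_min=1.0, n=100_000):
    """Numerical estimate of E[X | X > x] via the composite midpoint rule
    applied to Q(p) on the rank interval [F(x), 1)."""
    p = 1.0 - (x_min / x) ** alpha      # rank of the threshold
    h = (1.0 - p) / n
    total = h * sum(pareto_quantile(p + (i + 0.5) * h, alpha, x_min)
                    for i in range(n))
    return total / (1.0 - p)


alpha = 2.5
inverted_coefficient = alpha / (alpha - 1.0)

# Ratio of the average income above x to x, for several thresholds.
ratios = {x: mean_income_above(x, alpha) / x for x in (1.5, 3.0, 10.0)}
```

Up to quadrature error, every entry of `ratios` is close to `inverted_coefficient`, whatever the threshold.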
7. Truncated Pareto Distribution
If
X does not have a finite first moment (case
), then at least one of the integrals on the right-hand side of Equation (
6) does not converge. Moreover, in practice, we often have extreme values (“outliers”) that could have a significant effect on the results. The mean absolute deviation is not as sensitive to outliers as the standard deviation, especially if one uses mean absolute deviation around the median (as we do) and not around the mean
Leys et al. (
2013).
Many models in statistical analysis assume a finite mean or variance. Many models with a Pareto distribution assume a finite mean (a tail index greater than 1). However, there are many models for datasets with heavy tails for which a finite first moment does not exist. Such models are used to describe catastrophic losses in risk management.
To address this, we can consider a truncated random variable
restricted to
. When applied to real datasets, truncation allows the removal of extreme values in the distribution. For this truncated distribution, its density
and its cumulative distribution
are
Papoulis (
1984)
The median of the truncated random variable
is
and coincides with the median
M of the original distribution
X.
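The effect of truncation at the quartiles can be sketched as follows. The code assumes a Pareto type I distribution and truncation to the interval between the first and third quartiles, so that the truncated density is twice the original density on that interval. The tail index 0.8 is a hypothetical value chosen so that the untruncated mean is infinite.

```python
def pareto_quantile(p, alpha, x_min=1.0):
    """Assumed Pareto type I quantile function."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def pareto_cdf(x, alpha, x_min=1.0):
    return 0.0 if x < x_min else 1.0 - (x_min / x) ** alpha


def pareto_pdf(x, alpha, x_min=1.0):
    return 0.0 if x < x_min else alpha * x_min ** alpha / x ** (alpha + 1.0)


def truncated_cdf(x, alpha, x_min=1.0):
    """CDF of X restricted to [Q1, Q3]: (F(x) - 1/4) / (1/2) inside the interval."""
    q1 = pareto_quantile(0.25, alpha, x_min)
    q3 = pareto_quantile(0.75, alpha, x_min)
    if x <= q1:
        return 0.0
    if x >= q3:
        return 1.0
    return (pareto_cdf(x, alpha, x_min) - 0.25) / 0.5


def truncated_pdf(x, alpha, x_min=1.0):
    """Density of the truncated variable: 2 f(x) on [Q1, Q3], zero elsewhere."""
    q1 = pareto_quantile(0.25, alpha, x_min)
    q3 = pareto_quantile(0.75, alpha, x_min)
    return 2.0 * pareto_pdf(x, alpha, x_min) if q1 <= x <= q3 else 0.0


alpha = 0.8                                 # infinite mean before truncation
median = pareto_quantile(0.5, alpha)
median_rank_after_truncation = truncated_cdf(median, alpha)
```

The original median still sits at rank 1/2 after truncation, and the truncated variable is bounded, so its mean and mean absolute deviation are finite even for this tail index.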
We now address the issue of computing the mean absolute deviation
of a truncated random variable
around its median
M using the original quantile function
for
X. To that end, we will find it convenient to introduce the following two “subarea” integrals:
Then, we can write the equation for the mean absolute deviation
of the truncated distribution
around its median
as follows:
The mean absolute deviation
of the truncated variable
around the median
M is twice the difference between the right and left sub-means. Note that the mean
of the truncated distribution
The subareas
and
correspond to left and right sub-means for
between
and
and between
and
, respectively. This is illustrated in
Figure 7.
The subarea
represents the average income in the 25–50% income group, whereas the subarea
income group, whereas the subarea
represents the average income in the 50–75% income group. It is interesting to compare the above expression for the mean absolute deviation of
of the truncated random variable
in Equation (
34) with the equation for the mean absolute deviation
of
X in Equation (
24). As we discussed in
Section 6, the mean absolute deviation
H can be interpreted as one-half of the difference in average incomes of the top (50–100%) and bottom (0–50%) income groups. By contrast, the mean absolute deviation
can be interpreted as twice the difference in average incomes of the top (50–75%) and bottom (25–50%) income groups in the truncated distribution.
We now compute the subareas
and
explicitly. For the left sub-area
, from Equation (
12), we have
Similarly, for the right sub-area
from Equation (
12), we have
For the mean absolute deviation
, we obtain
whereas for mean
we obtain
We will use the above result in Equation (
38) and consider different approximations to subarea integrals. The problem of estimating such integrals is called numerical integration or numerical quadrature and is a classical and well-researched problem in numerical analysis
Conte and De Boor (
1972). This problem arises when the integration cannot be carried out or when the function is known only at a finite number of points.
When the subarea integrals are estimated with numerical quadrature, they are approximated using the values of the quantile function at the three quartiles. This allows us to derive explicit closed-form approximations for the tail index and G in terms of quartiles. This is addressed in
Section 9.
We can rewrite this in terms of the Lorenz curve. From Equations (
3) and (
17) we have:
and therefore, we can rewrite Equations (
35) and (
36) for subarea integrals
and
as:
This immediately gives us
Note that the expression for the mean
in the above Equation (
41) can be written in terms of the mean
of the original distribution as
For income distributions that follow the Pareto distribution,
represents the average income of all individuals, whereas
represents the average income of individuals with income in the range
. The term in brackets represents the fraction of total wealth held by such individuals. For
(equal distribution), we have
as expected.
We can provide a quantile interpretation of income inequality. The average income in the (25–75%) income bracket is
, shown in
Figure 8a. If everyone had the same (minimum) income
, then we would have
. The difference
between the average (truncated) income
and the minimum income
represents the deviation from the equality. This is illustrated in
Figure 8b.
This is analogous to the interpretation of inequality in terms of the overall mean
and
shown in
Figure 6.
The approach outlined above of truncating the distribution at the quartiles and estimating the parameters by computing the subarea integrals of the quantile functions between
and
is quite general. It can be applied to distributions with or without finite moments or variances if the quantile function is available. Such distributions arise in many areas of risk management and finance
Chen and Wang (
2025).
We can consider such an approach and obtain closed-form approximations for parameters for a number of important distributions, including the Levy distribution, which does not have any moments, and the Gumbel extreme-value distribution. These distributions are widely used in risk management and finance
Rachev (
2003).
In particular, for the Levy distribution, the quantile function can be expressed in terms of the inverse of the Gaussian CDF
and is
and using the approach outlined above, we were able to derive simple approximations for the parameters by truncating the distribution at the quartiles and expressing the subarea integrals in terms of the corresponding quartiles and octiles of the normal distribution.
For the Gumbel distribution with location
and scale
, the quantile function is
Using our approach, we should be able to compute the parameters in terms of quartiles and logarithmic integrals (constants).
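For reference, the quartiles of these two distributions can be computed directly from their quantile functions. The sketch below uses the standard textbook forms, which are assumed to match the equations referenced above: for the Levy distribution with location `loc` and scale `scale`, Q(p) = loc + scale / (Phi^{-1}(1 - p/2))^2, and for the Gumbel distribution, Q(p) = loc - scale ln(-ln p). The parameter names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian, used for its inverse CDF


def levy_quantile(p, loc=0.0, scale=1.0):
    """Levy quantile function expressed through the inverse Gaussian CDF."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return loc + scale / (z * z)


def levy_cdf(x, loc=0.0, scale=1.0):
    """Levy CDF, F(x) = erfc(sqrt(scale / (2 (x - loc)))) for x > loc."""
    return math.erfc(math.sqrt(scale / (2.0 * (x - loc))))


def gumbel_quantile(p, loc=0.0, scale=1.0):
    """Gumbel (maximum) quantile function."""
    return loc - scale * math.log(-math.log(p))


def gumbel_cdf(x, loc=0.0, scale=1.0):
    return math.exp(-math.exp(-(x - loc) / scale))


ranks = (0.25, 0.5, 0.75)
levy_quartiles = [levy_quantile(p) for p in ranks]
gumbel_quartiles = [gumbel_quantile(p) for p in ranks]
```

Because both quantile functions are explicit, the quartiles are available without any moments, which is what the truncation approach relies on.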
When applied to Pareto, one immediate extension is the Generalized Pareto IV distribution with the quantile function
This includes Pareto distribution ( and ), exponential (), Lomax distribution (), and many others. When applied to financial markets, this distribution can be used to model extreme events and estimate the Value at Risk (VaR) and the expected shortfall (ES). Our method would allow a simple estimation of these quantities via the quartiles. This would allow us to see the effect of the upper-tail behavior on volatility and returns.
Unlike the Pareto distribution considered in this paper, the generalized Pareto distribution has multiple parameters. To estimate these, we can consider an MAD-based approach: As before, we consider a truncated distribution, but we estimate three mean absolute deviations, , , , around the quartiles as well as the truncated mean . This gives us four equations in four unknowns that can be solved numerically. For many other distributions, such as log-normal, Cauchy, and Weibull, we can derive closed-form solutions for the parameters in terms of simple formulas involving the quartiles. We hope to address the analysis of the generalized Pareto distribution in our future work.
8. Approximating Tail and Gini Index Using the Mean Absolute Deviation of the Truncated Distribution
In this section, we derive approximations to the tail index and Gini index using the mean absolute deviation
of the truncated distribution. We will call this method
-based. We proceed as follows. From Equation (
37), we obtain
In terms of subarea integrals, we can rewrite these expressions as
To compute
and
G, we propose to approximate subarea integrals
and
by numeric quadrature. In a quadrature approximation, the integral of a function is approximated by a weighted sum of its values at selected points, the so-called “abscissas” or “quadrature points”
Rudin (
1976). When the function in question is a quantile function, its integral is approximated by a weighted sum of quantiles. To illustrate this with a very simple example, consider the integral of the quantile function between points
and
. This integral is approximated by the area of a rectangle with base length
and height
. In this paper, we will confine ourselves to very simple rules, such as the “midpoint”, trapezoid, and Simpson rules
Olver et al. (
2010). We will show that even these simple rules provide valuable insights into interpreting the shape metrics.
Let us consider the so-called trapezoid rule
Rudin (
1976):
and
. From
Table 1 this can be interpreted as setting the average income
of the (25–50%) group to
and setting the average income
of the (50–75%) group to
. The difference of subarea integrals
is then
. This is illustrated in
Figure 9.
With this approximation for
, we obtain
These equations have a simple interpretation in terms of well-known metrics used in quantile statistics
Gilchrist (
2000) summarized in
Table 2.
We can interpret the results in Equation (48) in terms of these metrics. We obtain the following simple interpretations:
The mean absolute deviation of the truncated distribution around its median is approximately a quarter of the interquartile range;
The tail index is approximately one-half of the inverse of the Galton skewness;
The Gini index G is one-half of the ratio of the quartile difference to the lower quartile difference.
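The first of these interpretations can be checked with a short numerical sketch. It assumes a Pareto type I distribution, takes the two subarea integrals to be the integrals of the quantile function over the rank intervals [1/4, 1/2] and [1/2, 3/4], and takes the truncated mean absolute deviation to be twice their difference, as described in the text. The helper names are illustrative.

```python
def pareto_quantile(p, alpha, x_min=1.0):
    """Assumed Pareto type I quantile function."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def subarea(a, b, alpha, x_min=1.0):
    """Exact integral of Q(p) over [a, b], for alpha != 1 and b < 1."""
    coef = alpha / (alpha - 1.0)
    tail = lambda p: x_min * (1.0 - p) ** (1.0 - 1.0 / alpha)
    return coef * (tail(a) - tail(b))


def trapezoid(q_left, q_right, width):
    """Trapezoid rule using only the quantiles at the two end ranks."""
    return 0.5 * width * (q_left + q_right)


alpha = 2.0
q1, m, q3 = (pareto_quantile(p, alpha) for p in (0.25, 0.5, 0.75))

# Exact truncated MAD from the two subarea integrals.
left_exact = subarea(0.25, 0.5, alpha)
right_exact = subarea(0.5, 0.75, alpha)
mad_trunc_exact = 2.0 * (right_exact - left_exact)

# Trapezoid approximation: the median terms cancel in the difference.
left_trap = trapezoid(q1, m, 0.25)
right_trap = trapezoid(m, q3, 0.25)
mad_trunc_trap = 2.0 * (right_trap - left_trap)

rel_error = abs(mad_trunc_trap - mad_trunc_exact) / mad_trunc_exact
```

The trapezoid value equals one quarter of the interquartile range exactly, and `rel_error` measures how far that is from the exact truncated deviation for the chosen tail index.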
We will find it convenient to consider the weighted quartile difference
with
, defined as
This quantile metric
measures the spread of data with different weights
w and
assigned to lower and upper quartile differences
and
, respectively. With this metric, we can express our results in Equation (
48) as
10. Summary of Methods, Estimation Procedure, and Error Analysis
All of the proposed methods can be summarized by their approach to estimating the truncated mean
. These methods are summarized in Table 3, and the formulas obtained are summarized in Table 4.
We can compare the values for
by re-writing the results in terms of the median
M and the weighted quartile difference
from
Table 2. These are summarized in
Table 5.
We can compute the exact values of the relative errors for different
in the range
. These values are summarized in
Table 6.
As can be seen from Table 6, the H*-based method gives the worst results, similar to those of the One-Trapezoid method, with a typical relative error of around 6–7%. The best results are obtained by the Simpson 1/8 method, with relative errors of about 0.25%. We note that the error increases for larger values of $\alpha$
. This can be explained by noting from Equation (
5) that for larger
, the quantile function increases faster. Asymptotically, as
, the quantile function gets steeper (as in
Figure 4b), resulting in less accuracy in the
region.
Similarly, for the Gini index, we can compute the exact values of the relative errors for different
in the range
. These values are summarized in
Table 7.
We get similar results: the Simpson 1/8 rule again gives the best approximation, with average relative errors below 0.5%. The results in Table 6 and Table 7 suggest that the Simpson 1/8 method is the most accurate. The One-Trapezoid method (estimating the truncated mean by the average of the first and third quartiles) gives the worst result but provides a simpler interpretation. The H*-based method has accuracy similar to that of the One-Trapezoid method and has a very simple and intuitive explanation.
The closed-form approximations for the tail index
and the Gini index
G make it easier to incorporate income mobility to analyze changes in these metrics. For example, consider the case where both
and
are increased by the same amount
. Let
denote the resulting tail index under this change. Then, with trapezoid approximation, we obtain
In particular, if for some C, then .
12. Numerical Results
We have conducted numerical evaluations to compare the performance of several of our proposed estimation methods against the traditional Maximum Likelihood Estimation (MLE), which utilizes the entire dataset, and to assess resilience to outliers. For this purpose, we generated 100 sample datasets with points. Each dataset was generated from a Pareto distribution, with the tail index parameter varying from to (the most interesting range for practical applications). The results of these simulations are summarized to compare the accuracy of each method.
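A single replication of this kind of experiment can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the sample size, seed, and scale parameter x_min = 1 are assumptions made here (the values used in the paper are not reproduced), and the function names are hypothetical. The sketch draws a Pareto sample by inverse-transform sampling, computes the standard maximum likelihood estimate of the tail index for a known x_min, and extracts the three sample quartiles that the quartile-based methods take as input.

```python
import math
import random
import statistics

def sample_pareto(n, alpha, x_min=1.0, rng=None):
    """Draw n Pareto variates by inverse-transform sampling."""
    rng = rng or random.Random()
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def pareto_mle_alpha(data, x_min=1.0):
    """Standard MLE of the tail index when x_min is known: n / sum(log(x / x_min))."""
    return len(data) / sum(math.log(x / x_min) for x in data)

def sample_quartiles(data):
    """Return (Q1, M, Q3) using the standard library's default quantile method."""
    q1, m, q3 = statistics.quantiles(data, n=4)
    return q1, m, q3

true_alpha = 1.5     # illustrative value (assumption)
n_points = 20000     # hypothetical sample size, not the paper's
data = sample_pareto(n_points, true_alpha, rng=random.Random(0))

alpha_mle = pareto_mle_alpha(data)
q1, m, q3 = sample_quartiles(data)
print(f"MLE tail index: {alpha_mle:.4f}")
print(f"Sample quartiles: Q1={q1:.4f}, M={m:.4f}, Q3={q3:.4f}")
```

Note that the MLE uses every observation, whereas any quartile-based estimator only needs the three numbers returned by `sample_quartiles`.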
In our analysis of the tail index $\alpha$, as detailed in Table 8, a clear performance hierarchy emerges. The Maximum Likelihood Estimation (MLE) method has the highest accuracy, and the Simpson 1/8 method is a close second. The Two-Trapezoids method also performs reliably, showing consistent results across the range of $\alpha$ values. In contrast, the H*-based method yields the highest error. Notably, the performance of the One-Trapezoid method varies considerably with $\alpha$, unlike the more stable Two-Trapezoids method.
A similar pattern is evident in the estimation of the Gini index G, as shown in Table 9. Once again, the Simpson 1/8 and Two-Trapezoids methods prove to be the most accurate, with error rates significantly lower than those of the other estimators.
Our simulation reveals that the Simpson 1/8 and MLE methods provide the most accurate estimates of both the tail index ($\alpha$) and the Gini index (G), consistently outperforming the H*-based and One-Trapezoid approaches, especially at smaller sample sizes. However, as illustrated in Figure 15 for the tail index and Figure 16 for the Gini index, the performance of all methods improves as the sample size increases. Notably, the plots show that when the sample size is sufficiently large, such as
, the accuracy of the various estimators converges, with most methods yielding errors within 1–2% of each other. Note that we use 50% fewer points in computing $\alpha$ and G than Maximum Likelihood Estimation does.
To assess the robustness of each method, the dataset was “contaminated” with outliers by artificially inflating the largest 1% of data points by a factor of
. As illustrated in Figure 17 and Figure 18, MLE proved highly sensitive to outliers, with relative errors for both the $\alpha$ parameter and the Gini coefficient increasing by as much as 15%. In contrast, the proposed estimators, which are grounded in robust statistics and truncation, such as the Simpson 1/8 and Two-Trapezoids methods, maintained low error rates across all sample sizes.
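The mechanism behind this robustness can be illustrated with a short sketch. It is not the authors' experiment: the inflation factor of 10, the sample size, the seed, and x_min = 1 are all hypothetical choices made here, and the function names are illustrative. The point is that inflating the largest 1% of observations leaves the order statistics below the 99th percentile untouched, so the quartiles do not move at all, while the MLE, which sums the logarithms of every observation, shifts.

```python
import math
import random
import statistics

def sample_pareto(n, alpha, x_min=1.0, rng=None):
    """Inverse-transform sampling from a Pareto distribution."""
    rng = rng or random.Random()
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def pareto_mle_alpha(data, x_min=1.0):
    """Standard MLE of the tail index for known x_min."""
    return len(data) / sum(math.log(x / x_min) for x in data)

def contaminate_top(data, fraction=0.01, factor=10.0):
    """Return a sorted copy with the largest `fraction` of points multiplied by `factor`.

    `factor` is a hypothetical value; the factor used in the paper is not reproduced here.
    """
    ordered = sorted(data)
    n_out = int(len(ordered) * fraction)
    if n_out == 0:
        return ordered
    return ordered[:-n_out] + [x * factor for x in ordered[-n_out:]]

clean = sample_pareto(10000, 1.5, rng=random.Random(1))
dirty = contaminate_top(clean)

quartiles_clean = statistics.quantiles(clean, n=4)
quartiles_dirty = statistics.quantiles(dirty, n=4)
alpha_clean = pareto_mle_alpha(clean)
alpha_dirty = pareto_mle_alpha(dirty)

print("Quartiles unchanged:", quartiles_clean == quartiles_dirty)
print(f"MLE before: {alpha_clean:.4f}, after contamination: {alpha_dirty:.4f}")
```

Any estimator that is a function of the three quartiles alone therefore returns exactly the same value on the clean and the contaminated sample in this setup.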
13. Case Study
To evaluate the practical performance of our proposed closed-form approximation methods, we conduct an analysis using five real-world wealth distribution datasets. Our study includes the following:
A synthetic Pareto dataset (n = 10,000, $\alpha$ = 1.5) generated for validation purposes. For this dataset, the quartiles are , , and .
The Asia Fortune dataset (n = 11,008), available at https://corgis-edu.github.io/corgis/csv/billionaires/ (last accessed on 8 February 2025). This dataset contains wealth information on affluent individuals across Asian markets. For this dataset, the quartiles are
,
, and
.
The Gender Money dataset (n = 26,609), available at https://www.kaggle.com/datasets/fedesoriano/gender-pay-gap-dataset (last accessed on 8 February 2025). This dataset captures wealth distributions segmented by gender across a 14-year period (2010–2023), enabling analysis of inequality patterns across demographic groups. For this dataset, the quartiles are
,
, and
.
These datasets collectively span different scales of wealth (from hundreds of thousands to hundreds of billions), geographical regions, and time periods, providing a robust testbed for evaluating how well our quartile-based approximation methods (H*, Midpoint, One Trapezoid, Two Trapezoids, Quartiles Average, and Simpson 1/8) estimate the tail index $\alpha$ and the Gini coefficient G compared with maximum likelihood estimation and exact calculations.
For example, if we wish to calculate the tail index
for the Global Billionaire dataset with the midpoint method, we can use the formula from
Table 4. Plugging in quartile values
,
and
, we get:
Table 10 presents the actual Gini coefficient values computed using the exact formula and six quartile-based approximation methods across all datasets, showing that true Gini values range from 0.46 to 0.56, while approximation methods exhibit varying degrees of accuracy.
Table 11 displays the tail index $\alpha$ estimates obtained using maximum likelihood estimation (MLE) as the benchmark and the six approximation methods. The MLE estimates range from 1.00 to 1.53 across datasets, and our proposed approximation methods produce reasonable estimates.
Table 12 demonstrates the relative percentage errors for Gini coefficient approximations, where the midpoint method achieves low error (0.09%) for the Synthetic Pareto dataset, while the H* method generally outperforms other methods with errors typically below 25%.
Finally, Table 13 shows the relative percentage errors for tail index $\alpha$ estimation, where the Simpson 1/8 and Two-Trapezoids methods show errors generally below 5% across most datasets, while the H* method exhibits the highest errors, ranging from 11% to 39%.
As can be seen from these results, the proposed methods give a simple and fairly accurate approximation to the tail index $\alpha$ and the Gini index G.
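The benchmark quantities used in such a comparison can be computed with a few lines of code. The sketch below is an assumption-laden illustration: the text does not specify which exact Gini formula was used, so the snippet uses one standard sample formula based on sorted data, together with the well-known Pareto identity G = 1/(2α − 1) (valid for α > 1) and a plain relative percentage error. All function names are hypothetical.

```python
def sample_gini(data):
    """Sample Gini coefficient from sorted data.

    Uses G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n with x_1 <= ... <= x_n.
    Assumes non-negative values with a positive total.
    """
    ordered = sorted(data)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need non-negative data with a positive sum")
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def pareto_gini(alpha):
    """Theoretical Gini index of a Pareto distribution; requires alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("the Pareto Gini index requires alpha > 1")
    return 1.0 / (2.0 * alpha - 1.0)

def relative_error_pct(estimate, reference):
    """Relative error of an estimate with respect to a reference value, in percent."""
    return 100.0 * abs(estimate - reference) / abs(reference)

# Small hand-checkable examples.
print(sample_gini([1, 1, 1, 1]))        # perfectly equal incomes
print(sample_gini([0, 0, 0, 1]))        # one person holds everything
print(pareto_gini(1.5))                 # Pareto with tail index 1.5
print(relative_error_pct(0.55, 0.50))   # an estimate 10% above the reference
```

For heavy-tailed samples the sample Gini converges slowly, so a gap between `sample_gini` on data and `pareto_gini` at the fitted tail index is not by itself evidence against a Pareto model.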