The Kaniadakis Distribution for the Analysis of Income and Wealth Data

The paper reviews the “κ-generalized distribution”, a statistical model for the analysis of income data. Basic analytical properties, interrelationships with other distributions, and standard measures of inequality such as the Gini index and the Lorenz curve are covered. An extension of the basic model that best fits wealth data is also discussed. The new and old empirical evidence presented in the article shows that the κ-generalized model of income/wealth is often in very good agreement with the observed data.


Introduction
The past two decades have seen a resurgence of interest in the study of income and wealth distribution in both the physics [1][2][3][4] and economics [5][6][7][8][9] communities. Scholars have focused particularly on the empirical analysis of large data sets to infer the shape of income and wealth distributions and to develop theoretical models that can reproduce them.
Pareto's observation that the number of people in a population whose income exceeds x is often well approximated by $Cx^{-\alpha}$ was a natural starting point for this field of analysis [10][11][12][13]. However, empirical research has shown that the Pareto distribution accurately models only high income levels, while it does a poor job of describing the lower end of distributions.
As research has continued, new models have been proposed to better describe the data, using either a combination of known statistical distributions [14][15][16][17][18][19][20][21][22] or parametric functional forms for the distribution as a whole. Among these, the two-parameter lognormal [23] and gamma [24] distributions were proposed as models for the size distributions of income and wealth, but later evidence showed that these models tend to exaggerate skewness and perform poorly at the upper end of the empirical distributions [25][26][27][28]. Three-parameter models such as the generalized gamma [29][30][31][32], Singh-Maddala [33], and Dagum Type I [34] provide better fits. These models converge to the Pareto model for large values of income/wealth and accurately describe lower and middle ranges.
Finally, models with more than three parameters have also been suggested to fit income and wealth data. For example, the generalized beta distribution of the second kind (GB2) is a four-parameter distribution that was first described by [35]. It fits the data very well and also includes some of the two- and three-parameter models mentioned above as special or limiting cases. (The generalized beta distribution of the first kind (GB1) [35] and the double Pareto-lognormal distribution [36] are other four-parameter models that fit the data well. Ref. [37] also developed the five-parameter generalized beta distribution family, which includes the GB1 and GB2 as special cases, together with all of the two- and three-parameter distributions nested inside them. In turn, the double Pareto-lognormal distribution has been generalized into a five-parameter family of distributions called the generalized double Pareto-lognormal distribution [38]. However, closed-form expressions for probability density and/or cumulative distribution functions do not always exist for

The κ-Generalized Model for Income Distribution
The κ-generalized statistical model, named after [47], is based on the use of κ-deformed exponential and logarithmic functions introduced by Kaniadakis [41][42][43] in the context of special relativity. Within this framework, the ordinary exponential function exp(x) deforms into the generalized exponential function $\exp_\kappa(x)$ given by:
$$\exp_\kappa(x) = \left(\sqrt{1+\kappa^2x^2}+\kappa x\right)^{1/\kappa}, \quad x \in \mathbb{R},\ \kappa \in [0,1). \tag{1}$$
The deformed logarithmic function $\ln_\kappa(x)$, which is defined as the inverse of (1), can be written as:
$$\ln_\kappa(x) = \frac{x^{\kappa}-x^{-\kappa}}{2\kappa}, \quad x > 0. \tag{2}$$
Kaniadakis' deformed functions have also been successfully used to analyze nonphysical systems. In economics, the κ-deformation has been used to study differentiated product markets [48,49], finance [50][51][52][53][54][55], and the distribution of income by size [47][56][57][58][59][60][61][62]. In the latter case, such deformed functions are attractive because they can statistically describe the entire spectrum of incomes, from the low to the middle range and up to the Pareto tail.
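The two deformed functions are simple to evaluate numerically. The following is a minimal Python sketch, not taken from the original article; the function names `exp_kappa` and `ln_kappa` are illustrative, and the κ-exponential is written through `asinh`, which is algebraically the same expression as (1) but avoids cancellation for large negative arguments.

```python
import math

def exp_kappa(x: float, kappa: float) -> float:
    """Kaniadakis kappa-exponential; reduces to exp(x) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(x)
    # (sqrt(1 + k^2 x^2) + k x)^(1/k) == exp(asinh(k x) / k)
    return math.exp(math.asinh(kappa * x) / kappa)

def ln_kappa(x: float, kappa: float) -> float:
    """Kaniadakis kappa-logarithm (inverse of exp_kappa); needs x > 0."""
    if kappa == 0.0:
        return math.log(x)
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)
```

Two properties worth checking with such a sketch are that `ln_kappa(exp_kappa(x, k), k)` returns `x` and that `exp_kappa(x, k) * exp_kappa(-x, k)` equals 1, as for the ordinary exponential.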

Definitions and Basic Properties
A random variable X is said to have a κ-generalized distribution, and we write X ∼ κ-gen(α, β, κ), if it has a probability density function (PDF) given by:
$$f(x;\alpha,\beta,\kappa) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}\frac{\exp_\kappa\left[-(x/\beta)^{\alpha}\right]}{\sqrt{1+\kappa^2 (x/\beta)^{2\alpha}}}, \quad x \geq 0,\ \alpha,\beta > 0,\ \kappa \in [0,1). \tag{3}$$
Its cumulative distribution function (CDF) can be expressed as:
$$F(x;\alpha,\beta,\kappa) = 1-\exp_\kappa\left[-(x/\beta)^{\alpha}\right]. \tag{4}$$
(For a complete description of the κ-generalized distributional properties, the reader is referred to [60] and the references cited therein. A heuristic derivation of the κ-generalized density, showing how this probability distribution emerges naturally within the field of κ-deformed analysis, is given in [61,63].) Figure 1 illustrates the behavior of the κ-generalized PDF and the complementary CDF, 1 − F(x; α, β, κ), for various parameter values.
Each of the three graph pairs holds two parameters constant and varies the remaining one.
The constant β is a characteristic scale that has the same dimension as income. For this reason, it takes into account the monetary unit and can be used to adjust for inflation and facilitate cross-country comparisons of income distributions expressed in different monetary units. Increases in the monetary unit result in a global increase in individual income and average income.
The α and κ parameters are scale-free parameters that affect the distribution's shape. The region around the origin of the κ-generalized distribution is dominated by α, while the upper tail is dominated by both α and κ. Increasing κ leads to a thicker upper tail, while increasing α tapers both tails and increases the concentration of probability mass around the peak of the distribution.
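To make the roles of α, β and κ concrete, the density and distribution function of the κ-generalized model can be coded directly from their standard definitions, f = (α/β)(x/β)^(α−1) exp_κ[−(x/β)^α] / sqrt(1 + κ²(x/β)^(2α)) and F = 1 − exp_κ[−(x/β)^α]. The sketch below is only illustrative: the function names are hypothetical, and it assumes κ > 0 and x ≥ 0 (with α ≥ 1 if the density is evaluated at x = 0).

```python
import math

def exp_kappa(x, kappa):
    """kappa-exponential written through asinh; exp(x) when kappa == 0."""
    return math.exp(math.asinh(kappa * x) / kappa) if kappa > 0 else math.exp(x)

def kgen_sf(x, alpha, beta, kappa):
    """Complementary CDF, 1 - F(x)."""
    return exp_kappa(-((x / beta) ** alpha), kappa)

def kgen_cdf(x, alpha, beta, kappa):
    """Cumulative distribution function F(x)."""
    return 1.0 - kgen_sf(x, alpha, beta, kappa)

def kgen_pdf(x, alpha, beta, kappa):
    """Probability density function f(x)."""
    u = (x / beta) ** alpha
    return ((alpha / beta) * (x / beta) ** (alpha - 1.0)
            * exp_kappa(-u, kappa) / math.sqrt(1.0 + (kappa * u) ** 2))
```

Evaluating `kgen_sf` on a grid for a few values of κ with α and β fixed reproduces the qualitative behavior described above: a larger κ gives a heavier upper tail.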
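As κ approaches 0, the distribution converges to the Weibull distribution; it is easy to verify that:
$$\lim_{\kappa\to 0} f(x;\alpha,\beta,\kappa) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}\exp\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right] \tag{5}$$
and:
$$\lim_{\kappa\to 0} F(x;\alpha,\beta,\kappa) = 1-\exp\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right]. \tag{6}$$
(The Weibull distribution is primarily studied in the engineering literature. In physics, it is known as the stretched exponential distribution when α < 1. In economics, it has potential for income data, although it has only been used sporadically; some applications can be found in Refs. [29,35][64][65][66][67][68][69].)
The distribution behaves similarly to the Weibull model for $x \to 0^{+}$, while for large x it approaches a Pareto distribution of the first kind with scale $k = \beta(2\kappa)^{-1/\alpha}$ and shape $a = \alpha/\kappa$, i.e.:
$$f(x;\alpha,\beta,\kappa) \underset{x\to+\infty}{\sim} \frac{a\,k^{a}}{x^{a+1}} \tag{7}$$
and:
$$F(x;\alpha,\beta,\kappa) \underset{x\to+\infty}{\sim} 1-\left(\frac{k}{x}\right)^{a}. \tag{8}$$
([…] $= -\alpha/\kappa = -a$, the κ-generalized distribution also obeys these alternative versions of the weak Pareto law.)
Equation (4) implies that the quantile function is available in closed form:
$$F^{-1}(u;\alpha,\beta,\kappa) = \beta\left[\ln_\kappa\left(\frac{1}{1-u}\right)\right]^{1/\alpha}, \quad 0 < u < 1, \tag{9}$$
an attractive feature for generating random numbers from a κ-generalized distribution. The median of the distribution is:
$$x_{\mathrm{med}} = \beta\left[\ln_\kappa(2)\right]^{1/\alpha} \tag{10}$$
and the mode occurs at:
$$x_{\mathrm{mode}} = \beta\left[\frac{\alpha^2+2\kappa^2(\alpha-1)}{2\kappa^2(\alpha^2-\kappa^2)}\right]^{\frac{1}{2\alpha}}\left[\sqrt{1+\frac{4\kappa^2(\alpha^2-\kappa^2)(\alpha-1)^2}{\left[\alpha^2+2\kappa^2(\alpha-1)\right]^2}}-1\right]^{\frac{1}{2\alpha}} \tag{11}$$
if α > 1; otherwise, the distribution is zero-modal with a pole at the origin. Finally, the rth raw moment of the κ-generalized distribution is equal to:
$$\langle x^{r}\rangle = \beta^{r}\,\frac{(2\kappa)^{-r/\alpha}}{1+\frac{r}{\alpha}\kappa}\,\frac{\Gamma\left(1+\frac{r}{\alpha}\right)\Gamma\left(\frac{1}{2\kappa}-\frac{r}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{r}{2\alpha}\right)}, \tag{12}$$
where Γ(·) denotes the gamma function, and exists for $-\alpha < r < \alpha/\kappa$. Specifically:
$$\langle x\rangle = \beta\,\frac{(2\kappa)^{-1/\alpha}}{1+\frac{\kappa}{\alpha}}\,\frac{\Gamma\left(1+\frac{1}{\alpha}\right)\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)} \tag{13}$$
is the mean of the distribution and:
$$\sigma^{2} = \beta^{2}(2\kappa)^{-2/\alpha}\left\{\frac{\Gamma\left(1+\frac{2}{\alpha}\right)}{1+\frac{2\kappa}{\alpha}}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{\alpha}\right)}-\left[\frac{\Gamma\left(1+\frac{1}{\alpha}\right)}{1+\frac{\kappa}{\alpha}}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}\right]^{2}\right\} \tag{14}$$
is the variance.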
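Because the quantile function has a closed form, random numbers can be drawn by inverse-transform sampling, and the raw moments can be evaluated with log-gamma functions. The sketch below is an illustration under stated assumptions rather than the author's code: the function names are hypothetical, κ is assumed to lie strictly between 0 and 1, the quantile is taken to be β[ln_κ(1/(1 − u))]^(1/α), and the moment is the standard gamma-function expression valid for −α < r < α/κ.

```python
import math
import random

def ln_kappa(x, kappa):
    """kappa-logarithm for kappa > 0 and x > 0."""
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)

def kgen_quantile(u, alpha, beta, kappa):
    """Quantile function F^{-1}(u) for 0 <= u < 1."""
    return beta * ln_kappa(1.0 / (1.0 - u), kappa) ** (1.0 / alpha)

def kgen_sample(n, alpha, beta, kappa, seed=0):
    """Inverse-transform sampling: plug uniform draws into the quantile."""
    rng = random.Random(seed)
    return [kgen_quantile(rng.random(), alpha, beta, kappa) for _ in range(n)]

def kgen_moment(r, alpha, beta, kappa):
    """r-th raw moment; exists only for -alpha < r < alpha / kappa."""
    if not (-alpha < r < alpha / kappa):
        raise ValueError("moment of order r does not exist")
    a, b = 1.0 / (2.0 * kappa), r / (2.0 * alpha)
    log_m = (r * math.log(beta) - (r / alpha) * math.log(2.0 * kappa)
             - math.log1p(r * kappa / alpha)
             + math.lgamma(1.0 + r / alpha)
             + math.lgamma(a - b) - math.lgamma(a + b))
    return math.exp(log_m)
```

Here `kgen_quantile(0.5, ...)` gives the median and `kgen_moment(1, ...)` the mean; the variance follows as `kgen_moment(2, ...) - kgen_moment(1, ...) ** 2` whenever α/κ > 2.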

Measuring Income Inequality Using the κ-Generalized Distribution
The concept of inequality in economics dates back to Pareto's early work [10][11][12][13], which showed that the top 20% of the population held about 80% of total income/wealth. Later, the American economist Lorenz [72] introduced the Lorenz curve, a widely used tool for measuring income/wealth inequality. This curve shows how far the observed distribution of income or wealth departs from an equal distribution. Under perfect equality, the Lorenz curve coincides with the diagonal of the unit square, while greater inequality moves the curve away from the diagonal.
The Lorenz curve for a random variable X with CDF F(x) and finite mean $\langle x\rangle = \int x\,\mathrm{d}F(x)$ is defined as [73]:
$$L(u) = \frac{1}{\langle x\rangle}\int_{0}^{u}F^{-1}(t)\,\mathrm{d}t, \quad u \in [0,1]. \tag{15}$$
Using the closed form of the quantile function $F^{-1}(u)$ of the κ-generalized distribution, the Lorenz curve can be reformulated as follows [74]:
$$L(u) = I_{X}\left(1+\frac{1}{\alpha},\ \frac{1}{2\kappa}-\frac{1}{2\alpha}\right), \quad X = 1-(1-u)^{2\kappa}, \tag{16}$$
where $I_x(\cdot,\cdot)$ is the regularized incomplete beta function defined in terms of the incomplete beta function and the complete beta function, that is, $I_x(\cdot,\cdot) = B_x(\cdot,\cdot)/B(\cdot,\cdot)$. The curve (16) exists if and only if $\alpha/\kappa > 1$. In particular, if $X_i \sim$ κ-gen$(\alpha_i,\beta_i,\kappa_i)$, i = 1, 2, the necessary and sufficient conditions for the Lorenz curves of $X_1$ and $X_2$ not to intersect (otherwise, it would be impossible to determine which distribution has more inequality) are [58]:
$$\alpha_1 \geq \alpha_2 \quad \text{and} \quad \frac{\alpha_1}{\kappa_1} \geq \frac{\alpha_2}{\kappa_2}. \tag{17}$$
The Lorenz curves of two κ-generalized distributions $X_1$ and $X_2$ with parameters chosen according to (17) are illustrated in Figure 2. The depicted curves indicate that $X_1$ exhibits lower inequality than $X_2$, as the Lorenz curve of $X_1$ does not intersect or fall below that of $X_2$.
Economists have employed statistical metrics to quantify income and wealth inequality. The Gini coefficient, developed in 1914 by the Italian statistician Gini [75], is one of the best known. From the general definition $G = 1-\frac{1}{\langle x\rangle}\int_{0}^{\infty}[1-F(x)]^{2}\,\mathrm{d}x$ [76], the Gini coefficient associated with the κ-generalized distribution is:
$$G = 1-\frac{2\alpha+2\kappa}{2\alpha+\kappa}\,\frac{\Gamma\left(\frac{1}{\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{\kappa}+\frac{1}{2\alpha}\right)}\,\frac{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}. \tag{18}$$
Using the Stirling approximation for the gamma function, $\Gamma(z) \approx \sqrt{2\pi}\,z^{z-\frac{1}{2}}\exp(-z)$, and taking the limit as κ → 0 in Equation (18), after some simplification one arrives at $G = 1-2^{-1/\alpha}$, which is the explicit form of the Gini coefficient for the Weibull distribution (see e.g., [77], p. 177). Since the exponential distribution is a special case of the Weibull distribution with a shape parameter of 1, it follows directly that for κ → 0 and α = 1, the exponential law is also a special limiting case of the κ-generalized distribution with a true Gini coefficient of one half [16].
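As a numerical companion to the Lorenz curve and Gini coefficient discussed above, the sketch below evaluates the closed-form κ-generalized Gini coefficient with log-gamma functions and approximates the Lorenz curve by midpoint integration of the quantile function. This is an assumption-laden illustration: the function names are hypothetical, the Lorenz curve is integrated numerically instead of using the regularized incomplete beta function (which the Python standard library does not provide), and β is omitted because both quantities are scale-free.

```python
import math

def ln_kappa(x, kappa):
    """kappa-logarithm for kappa > 0 and x > 0."""
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)

def kgen_gini(alpha, kappa):
    """Closed-form Gini coefficient; requires alpha / kappa > 1."""
    if alpha / kappa <= 1.0:
        raise ValueError("the mean, hence the Gini, does not exist")
    a, b = 1.0 / (2.0 * kappa), 1.0 / (2.0 * alpha)
    log_ratio = (math.lgamma(2.0 * a - b) + math.lgamma(a + b)
                 - math.lgamma(2.0 * a + b) - math.lgamma(a - b))
    return 1.0 - (2.0 * alpha + 2.0 * kappa) / (2.0 * alpha + kappa) * math.exp(log_ratio)

def kgen_lorenz(u, alpha, kappa, steps=20000):
    """Numerical Lorenz ordinate: integral of the quantile over [0, u]
    divided by its integral over [0, 1] (midpoint rule, beta cancels)."""
    def q(t):
        return ln_kappa(1.0 / (1.0 - t), kappa) ** (1.0 / alpha)

    def integral(upper):
        h = upper / steps
        return h * sum(q((i + 0.5) * h) for i in range(steps))

    return integral(u) / integral(1.0)
```

For small κ, `kgen_gini(alpha, kappa)` approaches the Weibull value 1 − 2^(−1/α), which offers a quick sanity check of the implementation.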
The Gini coefficient is a widely used measure of inequality, but it makes specific assumptions about income differences in different parts of the distribution. It is most sensitive to transfers around the middle of the income distribution and least sensitive to transfers among the very rich or very poor [78]. By contrast, the generalized entropy class of inequality measures [79][80][81][82][83] provides a range of bottom-to-top sensitive indices used by analysts to assess inequality in different parts of the income distribution. The expression for this class of inequality indices in terms of the κ-generalized parameters is [57]:
$$GE(\theta) = \frac{1}{\theta^{2}-\theta}\left[\left(\frac{\beta}{m}\right)^{\theta}\frac{(2\kappa)^{-\theta/\alpha}}{1+\frac{\theta}{\alpha}\kappa}\,\frac{\Gamma\left(1+\frac{\theta}{\alpha}\right)\Gamma\left(\frac{1}{2\kappa}-\frac{\theta}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{\theta}{2\alpha}\right)}-1\right], \quad \theta \neq 0, 1, \tag{19}$$
where $m = \langle x\rangle$ denotes the mean of the distribution given by Equation (13). Formula (19) defines a class because GE(θ) takes different forms depending on the value given to θ, the parameter that describes the sensitivity of the index to income differences in different parts of the income distribution: the more positive or negative θ is, the more sensitive GE(θ) is to income differences at the top or bottom of the distribution. Two limiting cases of (19), obtained when the parameter θ is set to 0 and 1, have gained attention in practical work for the purpose of measuring inequality; these are the mean logarithmic deviation index:
$$MLD = \lim_{\theta\to 0}GE(\theta) = \frac{1}{\alpha}\left[\gamma+\psi\left(\frac{1}{2\kappa}\right)+\ln(2\kappa)+\alpha\ln\left(\frac{m}{\beta}\right)+\kappa\right], \tag{20}$$
where $\gamma = -\psi(1)$ is the Euler-Mascheroni constant and $\psi(z) = \Gamma'(z)/\Gamma(z)$ is the digamma function, and the Theil index [84]:
$$T = \lim_{\theta\to 1}GE(\theta) = \frac{1}{\alpha}\left\{\psi\left(1+\frac{1}{\alpha}\right)-\frac{1}{2}\left[\psi\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)+\psi\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)\right]-\ln(2\kappa)-\alpha\ln\left(\frac{m}{\beta}\right)-\frac{\alpha\kappa}{\alpha+\kappa}\right\}, \tag{21}$$
where the former is more sensitive to variations in the lower tail, while the latter is more sensitive to variations in the upper tail [85]. (Equation (19) is not defined for θ = 0 and θ = 1, as $\theta^{2}-\theta = 0$ in both cases. Expressions for these values of θ are therefore derived using l'Hôpital's rule, which allows evaluating limits of indeterminate forms using derivatives. Expressions for any GE(θ) index other than the cases θ = 0, 1 can be derived by simple substitution; see for example [60].)
Finally, the class of inequality measures introduced by Atkinson [86] can be derived from (19) by exploiting the relationship [87,88]:
$$A(\epsilon) = 1-\left[\epsilon(\epsilon-1)\,GE(1-\epsilon)+1\right]^{\frac{1}{1-\epsilon}}, \quad \epsilon \neq 1, \tag{22}$$
where $\epsilon = 1-\theta$ is the inequality aversion parameter. As ε increases, A(ε) becomes more sensitive to transfers among lower incomes and less sensitive to transfers among top incomes [78]. The limiting form of (22) as ε → 1 is $A(1) = 1-\exp(-MLD)$. ([…] $\int_{0}^{\infty}x f(x;\alpha,\beta,\kappa)\,\mathrm{d}x$ converges, which is true if and only if $\alpha/\kappa > 1$. According to [89], parametric income distribution models share the existence problem of popular inequality measures.)
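The generalized entropy and Atkinson indices can be computed from the moments of the distribution alone. The following sketch is illustrative and rests on stated assumptions: the function names are hypothetical; GE(θ) is evaluated from its general definition, [E(X^θ)/m^θ − 1]/(θ² − θ), using the κ-generalized raw moments with β set to 1 (the indices are scale-free); and the Atkinson index uses A(ε) = 1 − [ε(ε − 1)GE(1 − ε) + 1]^(1/(1 − ε)). The limiting cases θ = 0, 1 and ε = 1 are deliberately left out because they need the digamma function, which is not in the Python standard library.

```python
import math

def _log_scaled_moment(r, alpha, kappa):
    """log E[(X / beta)^r]; exists for -alpha < r < alpha / kappa."""
    if not (-alpha < r < alpha / kappa):
        raise ValueError("moment of order r does not exist")
    a, b = 1.0 / (2.0 * kappa), r / (2.0 * alpha)
    return (-(r / alpha) * math.log(2.0 * kappa)
            - math.log1p(r * kappa / alpha)
            + math.lgamma(1.0 + r / alpha)
            + math.lgamma(a - b) - math.lgamma(a + b))

def kgen_ge(theta, alpha, kappa):
    """Generalized entropy index GE(theta) for theta not in {0, 1}."""
    if theta in (0.0, 1.0):
        raise ValueError("use the MLD (theta=0) or Theil (theta=1) limits")
    log_ratio = (_log_scaled_moment(theta, alpha, kappa)
                 - theta * _log_scaled_moment(1.0, alpha, kappa))
    # expm1 keeps precision when E[X^theta] / m^theta is close to 1
    return math.expm1(log_ratio) / (theta * theta - theta)

def kgen_atkinson(eps, alpha, kappa):
    """Atkinson index A(eps) for eps > 0, eps != 1, obtained from GE(1 - eps)."""
    if eps == 1.0:
        raise ValueError("eps = 1 is the limit 1 - exp(-MLD)")
    ge = kgen_ge(1.0 - eps, alpha, kappa)
    return 1.0 - (eps * (eps - 1.0) * ge + 1.0) ** (1.0 / (1.0 - eps))
```

Evaluating `kgen_atkinson` at values of `eps` very close to 1 gives a practical numerical approximation to A(1) when a digamma implementation is not at hand.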

Estimation
The parameters of the κ-generalized distribution can be estimated by maximum likelihood, which produces estimators with good statistical properties [90,91]. If the sample observations $\mathbf{x} = \{x_1,\ldots,x_n\}$ are independent, the likelihood function is as follows:
$$L(\mathbf{x};\boldsymbol{\theta}) = \prod_{i=1}^{n} f(x_i;\boldsymbol{\theta})^{w_i},$$
where $f(x_i;\boldsymbol{\theta})$ denotes the PDF, $\boldsymbol{\theta} = \{\alpha,\beta,\kappa\}$ the vector of unknown parameters, $w_i$ the weight of the ith observation, and n the sample size. Maximizing the likelihood amounts to setting to zero the partial derivatives with respect to α, β and κ of the log-likelihood function:
$$l(\mathbf{x};\boldsymbol{\theta}) = \sum_{i=1}^{n} w_i \ln f(x_i;\boldsymbol{\theta}),$$
which is the same as finding the solution to the following nonlinear system of equations:
$$\sum_{i=1}^{n} w_i\frac{\partial}{\partial\alpha}\ln f(x_i;\boldsymbol{\theta}) = 0, \qquad \sum_{i=1}^{n} w_i\frac{\partial}{\partial\beta}\ln f(x_i;\boldsymbol{\theta}) = 0, \qquad \sum_{i=1}^{n} w_i\frac{\partial}{\partial\kappa}\ln f(x_i;\boldsymbol{\theta}) = 0.$$
However, this system has no closed-form solution, so explicit expressions for the maximum likelihood estimators of the three κ-generalized parameters cannot be derived. Numerical optimization algorithms are therefore required to solve the maximum likelihood estimation problem.
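The objective that a numerical optimizer would maximize is the weighted log-likelihood. The sketch below is not the estimation code used in the article (which relied on Stata); it only illustrates, under the assumption that the weights enter as multipliers of the individual log-densities, how that objective could be written in Python. The function names are hypothetical, and the log-density is coded in log form for numerical stability.

```python
import math

def kgen_logpdf(x, alpha, beta, kappa):
    """Log-density of the kappa-generalized distribution for x > 0."""
    z = x / beta
    u = z ** alpha
    return (math.log(alpha / beta) + (alpha - 1.0) * math.log(z)
            - math.asinh(kappa * u) / kappa        # log of exp_kappa(-u)
            - 0.5 * math.log1p((kappa * u) ** 2))  # log of 1/sqrt(1 + k^2 u^2)

def weighted_loglik(params, xs, ws):
    """Weighted log-likelihood; -inf outside the admissible parameter region."""
    alpha, beta, kappa = params
    if alpha <= 0.0 or beta <= 0.0 or not (0.0 < kappa < 1.0):
        return -math.inf
    return sum(w * kgen_logpdf(x, alpha, beta, kappa) for x, w in zip(xs, ws))
```

Any general-purpose optimizer can then be applied to the negative of `weighted_loglik`; returning `-inf` outside the parameter space is one simple way to keep a derivative-free search inside the admissible region.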

Application to the Income Distribution in Greece
To celebrate 20 years of Kaniadakis' contribution, it seems appropriate to consider the income distribution in his native Greece to demonstrate the κ-generalized model's capacity to fit real-world data. First, income data for parameter estimation are briefly described. Next, the κ-generalized distribution is fitted to Greek household income data. Finally, using the same income microdata, different income size distribution models are compared.

Description of the Income Data
Income distribution data for Greece were obtained from the Luxembourg Income Study (LIS) database, which provides public access to household-level data files for various countries, including both developed and developing economies. The data are accessible only remotely: program code must be sent to LIS rather than being run directly by the user. The definition of income is "household disposable income", which is the income available to households to support consumption expenditure and saving during the reference period. The measure includes income from work, wealth, and direct government benefits, but subtracts direct taxes paid. It does not include sales taxes or noncash benefits, such as healthcare provided by a government or employer. Additionally, the income definition excludes income from capital gains, a significant source of nonwage income for wealthy individuals. As a result, many top incomes are likely to be underestimated.
Household disposable income is expressed in euro and "equivalized", i.e., divided by the square root of household size to adjust for differences in household demographics. Prior to equivalization, top and bottom coding is applied to set limits for extreme values. We also exclude all households with missing disposable income and use person-adjusted weights (the product of the household weights and the number of household members) when generating income indicators for the total population and estimating model parameters.

Results of Fitting
Figure 3 shows the results of fitting the κ-generalized distribution to empirical income data corresponding to the distribution of household income in Greece for the year 2016. The best-fitting parameter values were determined using maximum likelihood estimation, resulting in estimates of α = 2.233 ± 0.017, β = 10,667 ± 46, and κ = 0.630 ± 0.014. The small errors indicate that the parameters are precisely estimated, and the comparison between the observed and fitted probabilities in panels (a) and (b) of Figure 3 suggests that the κ-generalized distribution has great potential for describing data across the range of low-to-middle-income to high-income power-law regimes, including the intermediate region where Weibull and Pareto distributions show clear departures. (In Figure 3, the curves for the Pareto and Weibull distributions have been drawn by expressing their parameters in terms of the estimated κ-generalized parameters; see Section 2.1.)
Panel (c) of the same figure displays data points for the empirical Lorenz curve superimposed on the theoretical curve given by Equation (16) with estimates replacing α and κ as necessary. This formula, represented by the red solid line in the plot, matches the data exceptionally well. In addition, the plot contrasts the empirical Lorenz curve with the theoretical curves associated with the Weibull and Pareto distributions, respectively, given by:
$$L(u) = P\left(1+\frac{1}{\alpha},\ -\ln(1-u)\right),$$
where P(·, ·) is the lower regularized incomplete gamma function, and:
$$L(u) = 1-(1-u)^{1-\frac{\kappa}{\alpha}}.$$
As one can easily see, these curves tell only a small part of the story.
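The Pareto limiting curve used in this comparison is easy to reproduce. The sketch below is illustrative: the function names are hypothetical, the Pareto scale and shape are expressed through the κ-generalized parameters as k = β(2κ)^(−1/α) and a = α/κ, and the Lorenz curve is the standard Pareto type I form 1 − (1 − u)^(1 − 1/a). The Weibull counterpart is omitted because it needs the regularized incomplete gamma function, which the Python standard library lacks.

```python
def pareto_params_from_kgen(alpha, beta, kappa):
    """Scale k and shape a of the Pareto tail implied by (alpha, beta, kappa)."""
    k = beta * (2.0 * kappa) ** (-1.0 / alpha)
    a = alpha / kappa
    return k, a

def pareto_lorenz(u, a):
    """Lorenz curve of a Pareto type I distribution with shape a > 1."""
    if a <= 1.0:
        raise ValueError("the Pareto mean does not exist for a <= 1")
    return 1.0 - (1.0 - u) ** (1.0 - 1.0 / a)
```

With the estimates reported for Greece, `pareto_params_from_kgen(2.233, 10667.0, 0.630)` gives the tail parameters used to draw the Pareto curve.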
To provide an indirect check on the validity of the parameter estimation, we have also computed predicted values for median and mean household disposable income, as well as the Gini and Atkinson coefficients, the latter with the inequality aversion parameter equal to 1. The results, obtained by substituting the estimated parameters into the relevant expressions, are presented in Table 1. The κ-generalized predictions are all covered by the asymptotic normal 95% confidence intervals, confirming excellent agreement between the model and sample observations.
The linear behavior of the quantile-quantile (Q-Q) plot of sample percentiles against the fitted κ-generalized distribution and its limiting cases, shown in panel (d) of Figure 3, confirms the model's validity as well as the fact that the Weibull and Pareto distributions provide partial and incomplete data descriptions.

Comparisons of Alternative Distributions
This section compares the κ-generalized distribution's performance with other parametric models, including the three-parameter generalized gamma [93], Singh-Maddala [33], and Dagum type I [34] distributions, which have the following PDFs, respectively:
$$f(x;a,\beta,p) = \frac{a\,x^{ap-1}}{\beta^{ap}\,\Gamma(p)}\exp\left[-\left(\frac{x}{\beta}\right)^{a}\right], \quad x > 0,$$
$$f(x;a,b,q) = \frac{a\,q\,x^{a-1}}{b^{a}\left[1+(x/b)^{a}\right]^{1+q}}, \quad x > 0,$$
$$f(x;a,b,p) = \frac{a\,p\,x^{ap-1}}{b^{ap}\left[1+(x/b)^{a}\right]^{p+1}}, \quad x > 0,$$
where all parameters are positive. Ref. [77] provides analytical expressions for distribution functions, moments, and tools for inequality measurement, including the Lorenz curve and Gini coefficient. Refs. [87,94] provide formulas for generalized entropy measures of the GB2 distribution, from which the Singh-Maddala and Dagum versions are easily obtained. For the generalized gamma distribution, closed expressions for the Theil entropy index and the mean logarithmic deviation are given in Refs. [85,95]. (Let X be a random variable following the generalized beta distribution of the second kind (GB2) with parameters a, b, p, and q, i.e., X ∼ GB2(a, b, p, q). The Singh-Maddala distribution is the special case of the GB2 distribution when p = 1; the Dagum type I distribution is the special case when q = 1. For a discussion of other special cases, see [35,77].)
Table 2 displays maximum likelihood estimates for the models under consideration. The κ-generalized model offers the best results, with parameter standard errors derived from the inverse Hessian matrix being the lowest among competing income distribution models.
The root mean square error and mean absolute error between observed and predicted probabilities were used to determine which distribution best fits the data. These goodness-of-fit measures are, respectively, defined by:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\hat{F}_{W}(x_i)-F(x_i;\hat{\boldsymbol{\theta}})\right]^{2}}$$
and:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{F}_{W}(x_i)-F(x_i;\hat{\boldsymbol{\theta}})\right|,$$
where $\hat{F}_{W}(t) = \sum_{i=1}^{n} w_i\,\mathbb{1}\{x_i \leq t\}\big/\sum_{i=1}^{n} w_i$ denotes the weighted empirical cumulative distribution function (equal to the sum of the income weights where x ≤ t divided by the total sum of weights) and $\hat{\boldsymbol{\theta}}$ is the vector of estimated parameters. (In the formulas above, 1{·} is an indicator function that takes the value 1 if the condition in {·} is true, 0 otherwise.) The RMSE and MAE between the observed and estimated Lorenz curves have also been used as goodness-of-fit criteria, as they are expected to better reflect the accuracy of the inequality estimates. These additional measures are given by:
$$RMSE_{L} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[L_i-L(\lambda_i;\hat{\boldsymbol{\theta}})\right]^{2}}$$
and:
$$MAE_{L} = \frac{1}{n}\sum_{i=1}^{n}\left|L_i-L(\lambda_i;\hat{\boldsymbol{\theta}})\right|,$$
where $\lambda_i = \hat{F}_{W}(t)$ and $L_i$ denote the cumulative share of population and income, respectively, up to percentile i, i.e., $(\lambda_i, L_i)$ is a point on the empirical Lorenz curve. Based on the above goodness-of-fit criteria, the κ-generalized model is clearly the best fit. As shown in the last three columns of Table 2, the generalized gamma, Singh-Maddala, and Dagum type I have larger RMSE and MAE values for both probabilities and Lorenz curves, suggesting that these models perform worse than the κ-generalized distribution.
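The probability-based goodness-of-fit measures can be sketched in a few lines. This is an illustrative implementation, not the article's code: the function names are hypothetical, the errors are averaged over the sample points, and the weighted empirical CDF is computed naively (quadratic time), which is adequate for a small example but not for large microdata files.

```python
import math

def weighted_ecdf(t, xs, ws):
    """Weighted empirical CDF: share of total weight with income <= t."""
    total = sum(ws)
    return sum(w for x, w in zip(xs, ws) if x <= t) / total

def rmse_mae(xs, ws, model_cdf):
    """RMSE and MAE between empirical and model probabilities at each x."""
    errors = [weighted_ecdf(x, xs, ws) - model_cdf(x) for x in xs]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse, mae
```

Here `model_cdf` is any callable returning the fitted CDF at a point, so the same function can score each competing distribution on identical data.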
The performance of the four models is further evaluated by considering the accuracy of selected distributional statistics implied by parameter estimates. Table 3 presents the predicted values for the median, mean, and several inequality measures derived from estimates in Table 2. (The Gini coefficient of the generalized gamma distribution is available in [35] as a long expression involving the Gaussian hypergeometric function $_2F_1$, which is not currently available in the online statistical evaluator provided by the LIS web-based interface. An estimate of the Gini index for the generalized gamma distribution was therefore obtained by numerically integrating the area between the predicted Lorenz curve and the line of hypothetical equality. Ref. [96] reviews various methods for numerically estimating the Gini.) For each of the models examined, the accuracy of the implied statistics is evaluated by calculating the absolute percentage error:
$$APE = \left|\frac{P-A}{A}\right| \times 100$$
between the predicted values (P) and the actual sample estimates (A) given in Table 3.
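The absolute percentage error is a one-line computation; the small Python helper below (a hypothetical name, shown only for illustration) makes the convention explicit, including the fact that it is undefined when the sample estimate is zero.

```python
def absolute_percentage_error(predicted, actual):
    """Absolute percentage error between a model prediction and a sample estimate."""
    if actual == 0:
        raise ValueError("undefined for a zero sample estimate")
    return 100.0 * abs(predicted - actual) / abs(actual)
```

For instance, a predicted Gini of 0.33 against a sample value of 0.30 corresponds to an error of about 10%.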
The results are summarized in Figure 4. Except for the median, the κ-generalized distribution has more accurate implied estimates of selected distributional statistics than the Singh-Maddala and Dagum type I models, with the Gini coefficient being significantly more accurate. This implies that the κ-generalized estimation procedure preserves the mean characteristic of the analyzed data and accurately models intra- and/or inter-group variation. Additionally, when considering income differences in different parts of the income distribution, the κ-generalized model provides more accurate estimates of the MLD index, the Theil index T, and the Atkinson inequality measure A(1) than the two competitors. The Gini is an inequality index sensitive to the middle, while the other indices are more sensitive to the top and bottom of the income distribution. These results confirm that, among these models, the κ-generalized distribution provides the closest approximation to the observed income distribution.

The κ-generalized distribution also outperforms the generalized gamma in predicting the Gini coefficient and Theil index, while the generalized gamma provides more accurate estimates for the MLD index, the A(1) measure, the median, and the mean. The generalized gamma's advantage on these statistics is due to its better fit in the lower part of the observed distribution, while its disadvantage on the others arises from a poorer fit in the upper-middle range, especially at the top end. This is demonstrated by the double-logarithmic plot in Figure 5, known as the Zipf plot, which shows the relationship between income and the complementary CDF of income for the data under study.
The Zipf plot is a natural tool for examining the upper part of the distribution because it puts more emphasis on the upper tail and makes it easier to detect deviations in that part of the distribution from what a model would predict [97]. The lines show the Zipf plots predicted by the fitted generalized gamma and κ-generalized models. As the graph shows, both are close to the actual data in the lower part of the income distribution. However, the empirical observations in the upper tail depart markedly from the generalized gamma prediction, while the theoretical Zipf plot for the κ-generalized distribution is much closer to the empirical one in the same part of the observed income distribution.
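The coordinates of an empirical Zipf plot can be obtained directly from weighted microdata. The sketch below is a hypothetical helper, not the article's plotting code (the article's graphs were produced in R): it returns the base-10 logarithms of income and of the weighted complementary CDF, merging tied incomes and dropping points where either logarithm is undefined.

```python
import math

def zipf_points(xs, ws):
    """(log10 income, log10 weighted complementary CDF) pairs for a Zipf plot."""
    pairs = sorted(zip(xs, ws))
    total = sum(w for _, w in pairs)
    points, cum = [], 0.0
    for i, (x, w) in enumerate(pairs):
        cum += w
        # emit one point per distinct income, after all its ties are accumulated
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != x
        ccdf = 1.0 - cum / total
        if last_of_tie and x > 0 and ccdf > 0:
            points.append((math.log10(x), math.log10(ccdf)))
    return points
```

Plotting these points together with the log of a fitted model's complementary CDF on the same axes gives the kind of tail comparison described above.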

Applications of κ-Generalized Models to Income and Wealth Data
Apart from the one considered in this review, there have been numerous applications of the κ-generalized model to real-world income data over the past two decades.
The first study was conducted by [47], who analyzed 2001-2002 household incomes in Germany, Italy, and the United Kingdom. They found excellent agreement between the model and the empirical distributions across the full spectrum of incomes, including the intermediate income range where clear deviation was found when the Weibull model and pure Pareto law were used for interpolation.
The κ-generalized distribution was later applied to Australian household incomes in 2002-2003 [56] and US family incomes in 2003 [56,57]. The model again described the entire income range well and accurately estimated the inequality level in both countries using the Lorenz curve and Gini measure.
Comparative studies that fit multiple distributions to the same data are crucial for comparing performance. For example, Ref. [58], which examined the distribution of household income in Italy from 1989 to 2006, showed that the κ-generalized model outperforms three-parameter competitors such as the Singh-Maddala and Dagum type I distributions, though not the GB2, which has an extra parameter. The model has also been used to analyze household income data for Germany between 1984 and 2007, the United Kingdom between 1991 and 2004, and the United States between 1980 and 2005. In many cases, the distribution of household income is observed to conform to the κ-generalized model, rather than the Singh-Maddala or Dagum type I distributions. In particular, the κ-generalized distribution is found to outperform competitors in the right tail of the data. The three-parameter κ-generalized model provides superior income inequality estimates even when the fit is worse than distributions belonging to the GB2 family, as obtained by [98] when comparing US and Italian income data for the 2000s. Finally, Ref. [60] finds that the κ-generalized distribution offers a superior fit to the data and, in many cases, estimates income inequality more accurately than alternatives using household income data for 45 countries from Wave IV to Wave IX of the LIS database. (Four-parameter extensions of the κ-generalized distribution, called extended κ-generalized distributions of the first and second kind (EκG1 and EκG2, respectively), were introduced by [74]. These two extensions are not discussed here, but Refs. [60,61,74] provide formulas for the moments, Lorenz curve, Gini index, coefficient of variation, mean logarithmic deviation, and Theil index for both models. The new variants of the κ-generalized distribution outperform other four-parameter models in almost all cases, especially in estimating inequality indices with greater precision. In addition, a κ-deformation of the generalized gamma distribution with a power-law tail has recently been proposed by [99], to which the reader is referred for further details.)
The κ-generalized distribution has also been used to analyze the singularities of survey data on net wealth, which is gross wealth minus total debt [60,61,100]. These data show highly significant frequencies of households or individuals with wealth that is either null or negative. The κ-generalized model of wealth distribution is a mixture of an atomic and two continuous distributions. The atomic distribution accounts for economic units with no net worth, while a Weibull function accounts for negative net worth data. Positive net worth values, on the other hand, are represented by the κ-generalized model (3). The κ-generalized mixture model for wealth distribution was used to model US net worth data from 1984 to 2011 [100]. The model was generally accurate and its performance was superior to that of finite mixture models based on the Singh-Maddala and Dagum type I distributions for positive net worth values. Similar results were later obtained by Ref. [60] when analyzing net wealth data for nine countries selected from the Luxembourg Wealth Study (LWS) database. (The Luxembourg Wealth Study database, see https://www.lisdatacenter.org/our-data/lws-database/ (accessed on 29 June 2023), is a collaborative project to assemble existing microdata on household wealth into a coherent database, aiming to do for wealth what the LIS database has achieved for income. The LWS was officially launched in 2004 and currently provides wealth data sets for several countries and years.)

Concluding Remarks
The κ-generalized distribution, a statistical model developed over several years of collaborative, multidisciplinary research, is a valuable tool for studying income and wealth distributions. This article discussed its basic properties, relationships with other distributions, and important extensions. It also discussed common inequality measures such as the Lorenz curve and Gini index, and how they can be computed from κ-generalized parameter estimates. A review of empirical applications showed excellent agreement with observed data. It is hoped that the collection of all these results in a single source will facilitate and promote the use of the κ-generalized distribution.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The Luxembourg Income Study Database (https://www.lisdatacenter.org/, accessed on 26 June 2023) provides remote access to the microdata through a web-based Job Submission Interface (LISSY). Users have to register on the platform and submit their statistical programs, written in R, SAS, SPSS or Stata, through the LISSY interface. Data analysis was performed using Stata software version 17 [101], while graphs were generated using R software version 4.3.1 [102]. To allow reproduction of the analysis, software code used in this article is available from the author on request.

Acknowledgments:
The author would like to thank the anonymous referees whose comments and suggestions made it possible to improve this paper greatly.

Conflicts of Interest:
The author declares no conflict of interest.