1. Introduction
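Among the various measures of economic inequality that have been proposed, one of the most popular is the Theil index, named after Theil [1, 2], and generally expressed as
$$T = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i}{\mu}\ln\frac{x_i}{\mu} \tag{1}$$
where $X_n = (x_1, \ldots, x_n)$ is the set of incomes of n individuals or income intervals and $\mu = \sum_{i=1}^{n} x_i / n$ is the mean income (e.g., [3, 4, 5, 6, 7]). In terms of proportions (income shares) $p_i = x_i / \sum_{j=1}^{n} x_j$, Theil’s index can be expressed as
$$T = \ln n - H(P_n), \qquad H(P_n) = -\sum_{i=1}^{n} p_i \ln p_i \tag{2}$$
where $H(P_n)$ is recognized as the entropy of Shannon [8] for the distribution $P_n = (p_1, \ldots, p_n)$ with $p_i \ge 0$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} p_i = 1$. The natural (base-e) logarithm is used in (1) and (2).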
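As a small numerical illustration, the following sketch computes Theil's T in two ways: from raw incomes, as the mean of (x/μ) ln(x/μ), and from income shares, as ln n minus the Shannon entropy. The function names and the two-person income vector are assumptions made for this example only and do not come from the paper.

```python
import math

def theil_from_incomes(incomes):
    """Theil's T from raw incomes: mean of (x/mu) * ln(x/mu).

    Zero incomes contribute 0, following the convention 0 * ln 0 = 0.
    """
    n = len(incomes)
    mu = sum(incomes) / n
    return sum((x / mu) * math.log(x / mu) for x in incomes if x > 0) / n

def shannon_entropy(shares):
    """Shannon entropy (natural logarithm) of a share distribution."""
    return -sum(p * math.log(p) for p in shares if p > 0)

def theil_from_shares(shares):
    """Theil's T from income shares: ln n minus the entropy of the shares."""
    return math.log(len(shares)) - shannon_entropy(shares)

# Hypothetical data: two earners with incomes 30,000 and 10,000.
incomes = [30_000, 10_000]
shares = [x / sum(incomes) for x in incomes]
print(theil_from_incomes(incomes), theil_from_shares(shares))
```

Both calls should agree up to floating-point rounding (about 0.131 for this example), which is the sense in which the two expressions are equivalent.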
While this index can certainly be criticized for its lack of intuitive sense [9], its popularity is probably due to its useful decomposition property: T can be decomposed additively into the inequality “between” and “within” different subgroups (e.g., [1, 2, 10, 11]). This decomposability property is useful in empirical studies and can be used by policymakers when trying to identify sources of economic inequality (e.g., [6, 10, 12, 13, 14]). The U.S. Census Bureau produces estimates of the Theil index.
A recognized practical disadvantage of Theil’s T is that its values are not always comparable across different units (such as countries) since, although its lower bound is 0, the upper bound $\ln n$ of T is not fixed, but depends on n. Another limitation of T, which has not so far been reported or discussed, relates specifically to the values taken on by T. While various basically mathematical properties relevant to T have been widely discussed (symmetry, scale invariance, population replication, Pigou–Dalton transfer principle; e.g., [5]), the concern that T lacks the value-validity property has not so far been raised, and it is the subject of this paper. This property, first introduced by Kvålseth [15], ensures that an inequality index takes on values throughout its range that provide representations of the inequality characteristic that are true, realistic, and valid with respect to a generally acceptable criterion.
As shown in Section 3, T does not meet the condition required by the value-validity property and can therefore lead to unreliable, inappropriate, and misleading results and conclusions. Consequently, the objective of this paper is to explore an alternative formulation, as a correction of T, that satisfies the value-validity condition at least to a good approximation. The exploratory analysis is based on randomly generated income-share distributions as well as the so-called lambda distribution [16].
2. Value-Validity
Since the value-validity property and its conditions have been discussed extensively by Kvålseth [15, 16, 17], only a brief outline will be provided here. Thus, consider a generic economic inequality measure EI whose value becomes $EI(P_n)$ for the income-share distribution $P_n = (p_1, \ldots, p_n)$ and with the extreme values $EI(P_n^0)$ and $EI(P_n^1)$ for the two distributions
$$P_n^0 = \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right), \qquad P_n^1 = (1, 0, \ldots, 0) \tag{3}$$
While the strictly correct notation would be for EI to denote a measure or function and $EI(P_n)$ to denote its value for some $P_n$, EI may be used in this paper to denote both a measure (index) and its value to simplify the notation when there is no chance of ambiguity.
As a convenient starting point to introduce the value-validity concept, consider the following
lambda distribution introduced by Kvålseth [16]:
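$$P_n^{\lambda} = \left(\lambda + \frac{1-\lambda}{n}, \frac{1-\lambda}{n}, \ldots, \frac{1-\lambda}{n}\right), \qquad 0 \le \lambda \le 1 \tag{4}$$
where $\lambda$ can be considered as an inequality parameter. The uniform distribution $P_n^0$ and the degenerate distribution $P_n^1$ in (3) are seen to be extreme members of (4). Thus, for any given n, $P_n^0$ in (4) represents the income-share distribution with perfect equality while $P_n^1$ corresponds to the distribution with maximum income-share inequality. When considering some condition for the value-validity of an economic inequality index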
EI, the special distribution in (4) can conveniently be used because of the following relationship:
$$EI(P_n) = EI(P_n^{\lambda}) \tag{5}$$
which holds for any given income-share distribution $P_n$ and single-valued EI with a suitably chosen $\lambda \in [0, 1]$. Of course, there can be any number of different $P_n$ for which (5) would hold for the same $\lambda$-value.
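The distribution in (4) can be viewed as a so-called mixture distribution, since it is the following weighted mean of $P_n^0$ and $P_n^1$ in (3):
$$P_n^{\lambda} = (1-\lambda)\,P_n^0 + \lambda\,P_n^1 \tag{6}$$
As a basis for the value-validity condition for an economic inequality index EI, the following linearity (mean-value) requirement for (5) is proposed:
$$EI(P_n^{\lambda}) = (1-\lambda)\,EI(P_n^0) + \lambda\,EI(P_n^1) \tag{7}$$
for all n and $\lambda$. This linear relationship can equivalently be expressed in terms of the normalized form
$$EI^*(P_n^{\lambda}) = \frac{EI(P_n^{\lambda}) - EI(P_n^0)}{EI(P_n^1) - EI(P_n^0)} = \lambda \tag{8}$$
For any given $\lambda$, $EI(P_n^{\lambda})$ in (7) becomes a linear function of the two variables $EI(P_n^0)$ and $EI(P_n^1)$ and, for any given (fixed) $EI(P_n^0)$ and $EI(P_n^1)$, is a linear function of $\lambda$.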
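The mixture construction can be sketched numerically. The code below builds a lambda distribution as a weighted mean of the perfectly equal distribution (1/n, ..., 1/n) and the maximally unequal distribution (1, 0, ..., 0). It assumes that λ is the weight on the maximally unequal extreme, which is how the text's description of λ as an inequality parameter is read here. The function names are hypothetical. Because the mixture lies on the straight line between the two extremes, its Euclidean distance from the equal distribution, taken relative to the distance between the extremes, should equal λ.

```python
import math

def lambda_distribution(n, lam):
    """Mixture of the equal and the maximally unequal share distributions.

    ASSUMPTION: lam is the weight on (1, 0, ..., 0), so lam = 0 gives
    perfect equality and lam = 1 gives maximum inequality.
    """
    equal = [1.0 / n] * n
    extreme = [1.0] + [0.0] * (n - 1)
    return [(1 - lam) * e + lam * x for e, x in zip(equal, extreme)]

def euclidean(p, q):
    """Euclidean distance between two share vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def normalized_distance(p):
    """Distance of p from the equal distribution, divided by the distance
    between the two extreme distributions of the same dimension."""
    n = len(p)
    equal = [1.0 / n] * n
    extreme = [1.0] + [0.0] * (n - 1)
    return euclidean(p, equal) / euclidean(extreme, equal)

p = lambda_distribution(7, 0.3)
print(sum(p), normalized_distance(p))
```

An index that satisfies the linearity requirement would, after normalization to the interval from 0 to 1, return the same λ for such a distribution.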
Besides the linearity proposition in (7), this relationship can also be justified or explained in terms of metric distances between income-share distributions. Thus, by considering the distributions $P_n^0$, $P_n^{\lambda}$, and $P_n^1$ as points (vectors) in n-dimensional Euclidean space, $\lambda$ can be expressed in terms of Euclidean distances d as
$$\lambda = \frac{d(P_n^{\lambda}, P_n^0)}{d(P_n^1, P_n^0)} \tag{9}$$
Then, from (8) and (9), the value-validity condition can be expressed in terms of the normalized index EI* and distance d* as
$$EI^*(P_n^{\lambda}) = d^*(P_n^{\lambda}, P_n^0) = \lambda \tag{10}$$
While the condition in (10) is based on the specific distribution in (4), there are more general implications from the equality in (5). Consequently, it can be expected that the first equality in (10) becomes an approximate equality for any income-share distribution $P_n$, i.e.,
$$EI^*(P_n) \approx d^*(P_n, P_n^0) \tag{11}$$
3. Critical Assessment of T
It is readily seen from its definition in (2) that the Theil index T does not meet the value-validity condition in (10). Numerical examples show that T substantially understates the true extent of the economic inequality. For a simple example $P_2 = (0.75, 0.25)$, it follows from (2) that $T = \ln 2 - H(0.75, 0.25) = 0.13$ so that, since $T(P_2^0) = 0$ and $T(P_2^1) = \ln 2 = 0.69$, the normalized index value becomes $T^* = 0.13/0.69 = 0.19$. This distribution is equivalent to $P_2^{0.5}$ in (4) so that, according to the requirement in (10), the normalized index value should be 0.50 rather than the $T^*$-value of 0.19.
For an inequality index ranging in potential values between 0 and
, as is the case for Theil’s index with
and
, an equivalent index
complying with the value-validity condition in (10) can be expressed as
for the
in (4) and
. The extent to which
T lacks the value-validity property can conveniently be analyzed by comparing
with the
in (12).
Although the extent of the inequalities
and
become readily apparent from numerical data considered below, this can be performed analytically in terms of
and by defining the
value bias of
as
for the
in (12) and with
being the entropy in (2) for the distribution
in (4). In terms of partial derivatives,
and
. For any given
n, it is found from (14) that
VBT becomes maximum for
values ranging from
for
and
for
. Furthermore, by treating
n as a continuous variable for mathematical purposes, it is found that
. This analysis shows that the value bias
from (13) tends to become increasingly negative with increasing
n and as
approaches the mean
of the two extreme distributions in (3).
While the sensitivity of
in (12) to changes in the inequality or concentration parameter
in (4) remains constant for any given
n, that of
varies substantially with
. Specifically, it is found that
i.e., for any given
n, the sensitivity of
to small changes in
increases with
and at an increasing rate. For any income-share distribution
and from (5), the implication is clear: the sensitivity of
to small changes in the inequality (concentration, unevenness) of the components of
is not constant for any fixed
n, but increases with increasing inequality.
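The sensitivity argument can be checked numerically. The sketch below evaluates T on a lambda distribution with top share λ + (1 − λ)/n and remaining shares (1 − λ)/n (an assumed parameterization, with λ as the inequality weight) and compares it with the linear benchmark λ ln n implied by the value-validity condition for an index ranging from 0 to ln n. The helper names and the choice n = 10 are illustrative assumptions.

```python
import math

def theil_lambda(n, lam):
    """Theil's T = ln n - H for the assumed lambda distribution."""
    top = lam + (1 - lam) / n      # largest share
    rest = (1 - lam) / n           # each of the other n - 1 shares
    entropy = -top * math.log(top)
    if rest > 0:
        entropy -= (n - 1) * rest * math.log(rest)
    return math.log(n) - entropy

def numeric_slope(n, lam, step=1e-5):
    """Central-difference estimate of dT/d(lambda)."""
    return (theil_lambda(n, lam + step) - theil_lambda(n, lam - step)) / (2 * step)

def analytic_slope(n, lam):
    """Closed form obtained by differentiating theil_lambda directly:
    (1 - 1/n) * ln(top / rest)."""
    top = lam + (1 - lam) / n
    rest = (1 - lam) / n
    return (1 - 1 / n) * math.log(top / rest)

n = 10
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    benchmark = lam * math.log(n)  # linear, value-valid benchmark
    print(lam, round(theil_lambda(n, lam), 4), round(benchmark, 4),
          round(numeric_slope(n, lam), 4))
```

Under these assumptions, T stays below the linear benchmark for every interior λ, and its slope grows with λ, whereas the slope of the benchmark is the constant ln n.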
4. Correction of T
4.1. Specific Objective
In order to determine if
T can be corrected so as to comply with the value-validity condition in (8), an obvious approach would be to explore whether some systematic relationship exists between
T and
in (8). If the dimension
n of the income-share distribution
or the number
n of income earners is known, the results from Kvålseth [15] could be used to explore such a potential relationship. However, when various studies, organizations, or agencies provide reports on economic inequality, the values of indices such as Theil’s T are typically given without specifying values of n (e.g., [10, 11, 12, 13, 18, 19]).
Therefore, for practical purposes, it would be most useful if a value-validity correction
could be formulated at least approximately as a simple function of
T, i.e.,
Exploratory statistical analyses are used below to identify a suitable form for the function f in (15).
4.2. Data
To obtain the necessary data for analyzing the potential relationship in (15), two sources of data were used. First, randomly generated lambda distributions in (4) were obtained by generating n as a random integer between 2 and 100, inclusive, and as a random number (to 2 decimal places) such that .
Second, randomly generated distributions
were produced and based on the following computer algorithm. First,
n was generated as a random integer between 2 and 100, inclusive. Then, for each such generated
n, each
was generated in descending order
as random numbers within the following respective intervals:
Some distributions of both types were excluded when they produced nearly identical (repeat) results or when they resulted in values of T > 1, since such T-values would be unrepresentative of real reported economic data. After these exclusions, a total of 35 distributions of each of the two types were used in the analysis.
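The first data source can be imitated with a short script. The sketch below draws n as a random integer from 2 to 100 and λ to two decimal places, builds the corresponding lambda distribution, and keeps a case only if T does not exceed 1. Several details are assumptions rather than the paper's procedure: λ is drawn from 0.01 to 0.99, only exact repeats of (n, λ) are dropped (the paper also dropped near-identical results), and the seed, the function names, and the parameterization of the lambda distribution are hypothetical.

```python
import math
import random

def lambda_distribution(n, lam):
    """ASSUMED form: top share lam + (1 - lam)/n, all others (1 - lam)/n."""
    rest = (1 - lam) / n
    return [lam + rest] + [rest] * (n - 1)

def theil(shares):
    """Theil's T = ln n - Shannon entropy of the shares."""
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return math.log(len(shares)) - entropy

def sample_lambda_cases(count=35, seed=0, t_max=1.0):
    """Collect `count` distinct (n, lam, T) cases with T <= t_max."""
    rng = random.Random(seed)
    seen = set()
    cases = []
    while len(cases) < count:
        n = rng.randint(2, 100)            # inclusive on both ends
        lam = rng.randint(1, 99) / 100     # ASSUMPTION: 0 < lam < 1
        if (n, lam) in seen:
            continue                       # drop exact repeats only
        t = theil(lambda_distribution(n, lam))
        if t > t_max:
            continue                       # exclude T > 1, as in the text
        seen.add((n, lam))
        cases.append((n, lam, t))
    return cases

for n, lam, t in sample_lambda_cases()[:5]:
    print(n, lam, round(t, 3))
```

The second data source, based on shares generated in descending order within prescribed intervals, is not sketched here because the intervals are not recoverable from the text.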
4.3. Results
The results from using the randomly generated
in (4) and
are summarized in
Table 1 and
Table 2, respectively. An immediate observation from these results is how far
T deviates from the corresponding values of
in (12). The values of
and
differ greatly from the respective values of
in
Table 1 and
in
Table 2. These results support the above analysis that
T consistently and substantially understates the true inequality.
Perhaps the most interesting and promising result from
Table 1 and
Table 2 is the apparent indication that, although the values of
T and
can differ greatly, they appear to be systematically related. In fact, when the values
versus
and
versus
are represented by the scatter diagram in
Figure 1, it becomes clear that a functional relationship, as in (15), could be formulated.
It is evident from this scatter diagram that a simple power function may indeed be an appropriate correction for
T, i.e.,
where
and
are the parameters. The adequacy of this formulation would depend on how closely the values of
from (16) approximate those of
from (12).
From regression analysis of
on
T, the following parameter estimates from (16) are obtained:
and
for the data in
Table 1 and
and
for
Table 2. When combining the data from both tables into 70 data points,
and
, which turn out to be the means of the other two sets of parameter estimates. Consequently, the following value-validity correction of Theil’s
T is proposed:
which is the curve shown in
Figure 1.
When comparing the values of
and
in
Table 1 and
Table 2 and from the scatter diagram in
Figure 1, it becomes apparent that the
in (17) has the value-validity property since the values of
and
are approximately equal, to a reasonable degree. Specifically, if
is used to predict
, it is found that the coefficient of determination
, when properly computed [
20], becomes
for the fitted model
and the 70 data sets combined from
Table 1 and
Table 2. That is, 99% of the total variation of
(about its mean) is explained (accounted for) by the model
. Also,
=
. Judging from how closely the points follow the fitted curve in
Figure 1, the formulation in (17) also appears to be an appropriate one.
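One way to fit a power function of the kind described above is ordinary least squares on the logarithms of both variables. The paper does not state which estimation method was used, so the sketch below is only one plausible approach, and the data in it are synthetic values generated from a made-up power law, not the paper's tables. The coefficient of determination is computed on the original (untransformed) scale as one minus the ratio of the residual to the total sum of squares, which is one common definition.

```python
import math

def fit_power(ts, ys):
    """Least-squares fit of ln y on ln t.

    Returns (alpha, beta) for the model y = alpha * t ** beta.
    All inputs must be strictly positive.
    """
    xs = [math.log(t) for t in ts]
    zs = [math.log(y) for y in ys]
    m = len(xs)
    x_bar = sum(xs) / m
    z_bar = sum(zs) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    beta = sxz / sxx
    alpha = math.exp(z_bar - beta * x_bar)
    return alpha, beta

def r_squared(observed, predicted):
    """1 - SS_residual / SS_total, computed on the original scale."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Synthetic illustration only: exact power law with made-up parameters.
ts = [0.05, 0.1, 0.2, 0.4, 0.8]
ys = [2.0 * t ** 0.5 for t in ts]
alpha, beta = fit_power(ts, ys)
predicted = [alpha * t ** beta for t in ts]
print(alpha, beta, r_squared(ys, predicted))
```

Fitting on the log scale and judging the fit on the original scale can give different impressions, which is why the two steps are kept separate here.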
It is also of interest to note the close comparative results when based on the distribution
in (4) versus the general distribution
. From the data in
Table 1 and
Table 2 and from the scatter diagram in
Figure 1, it is evident that the results from the two different types of distribution are highly comparable. In fact, such correspondence is not surprising, in view of the relationship in (5) involving
T and the equivalent one in terms of the corrected
in (17).
4.4. Real Data Results
In addition to the results from randomly generated data, as discussed in Section 4.3, it may also be of interest to perform the same analysis using some real income data. Also, while the focus of this paper is on the Theil index, the results from the real data will also be used to make a comparison with another index, the widely used Gini index [21], which does, in fact, have the value-validity property.
By definition, if the income shares are rank ordered such that
, Gini’s index
G can be expressed as
with tied (equal)
’s being placed in any order. For the lambda distribution in (4), it is determined from (18) that
and hence
G meets the value-validity condition in (10).
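The linear behaviour of Gini's index on lambda distributions can be verified numerically. The sketch below uses the mean-absolute-difference form of the Gini index, which may look different from the rank-ordered expression in (18) but is a standard equivalent definition. The lambda distribution is again assumed to have top share λ + (1 − λ)/n and remaining shares (1 − λ)/n, and the function names are hypothetical.

```python
def gini(shares):
    """Gini index as half the relative mean absolute difference.

    Equals sum_i sum_j |p_i - p_j| / (2 * n * sum(p)), so it also works
    for raw incomes that do not sum to 1.
    """
    n = len(shares)
    total_diff = sum(abs(a - b) for a in shares for b in shares)
    return total_diff / (2 * n * sum(shares))

def lambda_distribution(n, lam):
    """ASSUMED form: top share lam + (1 - lam)/n, all others (1 - lam)/n."""
    rest = (1 - lam) / n
    return [lam + rest] + [rest] * (n - 1)

n = 8
g_max = gini(lambda_distribution(n, 1.0))      # maximum value, 1 - 1/n
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    g = gini(lambda_distribution(n, lam))
    print(lam, round(g, 4), round(g / g_max, 4))  # normalized value tracks lam
```

Under these assumptions the index equals λ(1 − 1/n), so its normalized value equals λ, which is the linearity that the value-validity condition asks for.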
In order to compare values of the indices
G,
T,
in (18), (2), (12), and (17), respectively, for some real economic income data, U.S. Census Bureau data were used, as reported by Semega and Kollar [
22] (Table A2), for total household income and all ethnic groups for various years. Nine income intervals were reported, ranging from “under USD 15,000” to “USD 200,000 and over”. The results are summarized in
Table 3 (to 3 decimal places in order to discriminate between some of the small index values).
It is clear from the data in
Table 3 that
in (17) is closely related to both
in (12) and
G in (18). However, the values of
and
do not correspond as closely as they do for the random-based data in
Table 1 and
Table 2. Of course, the range of values of
and
is much greater in
Table 1 and
Table 2 than in
Table 3. Also, the results in
Table 3 are based on a small, fixed number of income categories, $n = 9$, whereas those in Table 1 and Table 2 are based on n ranging from 2 to 100.
There are, however, close linear relationships between the indices based on the data in
Table 3. In fact, the following fitted regression models are obtained from the data in
Table 3:
showing that the variation of one index (about its mean) is nearly perfectly explained (accounted for) by its linear relationship to another index. Consequently, when making difference (interval) comparisons, the indices
can generally be expected to provide similar results, since each complies with the value-validity condition in (10) and (11).
5. Concluding Comments
The single most significant result in this paper is the simple formulation in (17), which provides a correction of Theil’s economic inequality index T that incorporates the value-validity property to a good approximation. The corrected index is a function of T only and does not explicitly depend on the number of income units n. This is important when applying the correction to published data for T, since such data typically do not specify n. In fact, this was the motivation for searching for a relationship such as (17), rather than considering a correction that is a function of both T and n.
While Theil’s T has a number of desirable properties, none of them relates specifically to the potential numerical values of T and whether those values can be justified as truly representing the economic inequality characteristic. This limitation of T is addressed by the corrected index and its value-validity property: the correction transforms understated T-values into more realistic, reliable, and valid inequality representations.
Various economic inequality indices, such as Gini’s G in (18) and Theil’s T, are commonly used to make absolute and relative comparisons between individual values and differences (intervals). The corrected index, because of its additional value-validity property, has the advantage of providing more representative economic inequality comparisons.
An interesting inconsistency occurs between
and
T when making absolute and relative comparisons. That is, for any two values
and
of
T and the corresponding values
and
from (17), a general difference between the two indices becomes:
This inconsistency can be verified from the form of (17).
What sets Theil’s
T apart from other inequality indices is its desirable decomposition property. That is,
T can be decomposed into within
and between
inequalities, such that
when, for example, considering global economic inequality versus the inequalities within and between countries or regions (see, e.g., [12, 13, 23]). While the additive decomposition does not hold for the correction in (17), i.e.,
, ratio comparisons could still be corrected, such as
,
, or
.
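The additive decomposition can be illustrated with the standard within/between formulas for Theil's T, in which each subgroup is weighted by its share of total income. The incomes below are made up, and the function names are hypothetical. The last lines use a square-root transform purely as a placeholder for a nonlinear correction. It is not the fitted formula in (17), whose parameter values are not reproduced here, but it shows why additivity is lost under any such transform.

```python
import math

def theil(incomes):
    """Theil's T from raw incomes: mean of (x/mu) * ln(x/mu)."""
    n = len(incomes)
    mu = sum(incomes) / n
    return sum((x / mu) * math.log(x / mu) for x in incomes if x > 0) / n

def decompose_theil(groups):
    """Return (within, between) components of Theil's T.

    within  = sum over groups of income_share * T(group)
    between = sum over groups of income_share * ln(group_mean / overall_mean)
    """
    pooled = [x for group in groups for x in group]
    total = sum(pooled)
    overall_mean = total / len(pooled)
    within = 0.0
    between = 0.0
    for group in groups:
        income_share = sum(group) / total
        group_mean = sum(group) / len(group)
        within += income_share * theil(group)
        between += income_share * math.log(group_mean / overall_mean)
    return within, between

# Hypothetical incomes for two subgroups.
groups = [[10.0, 20.0, 30.0], [40.0, 80.0]]
within, between = decompose_theil(groups)
overall = theil([x for group in groups for x in group])
print(overall, within + between)   # the two values agree

def placeholder_correction(t):
    """Stand-in nonlinear transform; NOT the paper's formula (17)."""
    return t ** 0.5

print(placeholder_correction(overall),
      placeholder_correction(within) + placeholder_correction(between))
```

The first pair of printed values matches, while the second pair does not, which mirrors the statement that a corrected index of this kind cannot be split additively in the same way.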
The corrected index does have a clear limitation, as does T: neither index has any intuitively appealing or meaningful interpretation. Nevertheless, as a simple quantitative measure of economic inequality, the corrected index has the important advantage over T of having the value-validity property, at least approximately. Consequently, when compared to T, the corrected form provides more realistic inequality representations and comparative results in real economic evaluations.