The Lorenz Curve: A Proper Framework to Define Satisfactory Measures of Symbol Dominance, Symbol Diversity, and Information Entropy

Novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 = N (1 − dC1) and DC2 = N (1 − dC2)), and information entropy (HC1 = log2 DC1 and HC2 = log2 DC2) are derived from Lorenz-consistent statistics that I had previously proposed to quantify dominance and diversity in ecology. Here, dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols, with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability; dC2 refers to the average absolute difference between all pairs of relative symbol abundances, with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability; N is the number of different symbols or maximum expected diversity. These Lorenz-consistent statistics are compared with statistics based on Shannon’s entropy and Rényi’s second-order entropy to show that the former have better mathematical behavior than the latter. The use of dC1, DC1, and HC1 is particularly recommended, as only changes in the allocation of relative abundance between dominant (pd > 1/N) and subordinate (ps < 1/N) symbols are of real relevance for probability distributions to achieve the reference distribution (pi = 1/N) or to deviate from it.


Introduction
Following the early use of Shannon's [1] entropy (HS) by some theoretical ecologists during the 1950s [2][3][4], HS has been extensively used in community ecology to quantify species diversity. Ecologists have considered the relative abundance or probability of the ith symbol in a message or sequence of N different symbols whose meaning is irrelevant [1,5,6] as the relative abundance or probability of the ith species in a community or assemblage of S different species whose phylogeny is irrelevant (i.e., all species are considered taxonomically equally distinct) [4,7,8]. This use of HS implies that the concept of species diversity is directly related to the concept of information entropy, basically representing the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols [1] or a set of S possible species [4]. HS takes values from 0 to log2 N or log2 S and is properly expressed in bits, but it can also be expressed in nats or dits (also called bans, decits, or Hartleys) if the natural logarithm or the decimal logarithm is calculated [1,4-8].
In recent decades, several ecologists have, however, claimed that HS is an unsatisfactory diversity index because species diversity actually takes values from 1 to S and is ideally expressed in units of species (i.e., in the same units as S). Keeping this perspective in mind, and only considering the number of different symbols as the number of different species and the relative abundances of symbols as the relative abundances of species, Hill [9] proposed the exponential form of Shannon's [1] entropy (2^HS when HS is expressed in bits) as a measure of diversity taking values from 1 to S.

In econometrics, the Lorenz curve [12] is ideally depicted within a unit (1 × 1) square, in which the cumulative proportion of income (the vertical y-axis) is related to the cumulative proportion of individuals (the horizontal x-axis), ranked from the person with the lowest income to the person with the highest income. The 45-degree (diagonal) line represents equidistribution or perfect income equality. Income inequality may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line of equidistribution if only differences in income between the rich and the poor are of interest (this measure being equivalent to the value of Schutz's inequality index), or as twice the area between the Lorenz curve and the 45-degree line of equidistribution if differences in income among all of the individuals are of interest (this measure being equivalent to the value of Gini's inequality index), with both measures exhibiting the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). Therefore, in any given population with M individuals, income inequality takes a minimum possible value of 0 when every person has the same income (= total income/M, including M = 1) and a maximum possible value of 1 − 1/M when a single person has all the income and the remaining M − 1 people have none, as persons with no income can exist in a population.
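For numerical illustration, the two income-inequality measures described above can be sketched as follows. This is a minimal Python sketch (not part of the original paper; function names are illustrative), using the pairwise-difference definitions of the Schutz and Gini indices.

```python
def schutz(incomes):
    """Maximum vertical distance from the Lorenz curve to the 45-degree line:
    the summed excess of income shares above the equal share 1/M."""
    n = len(incomes)
    total = sum(incomes)
    shares = [x / total for x in incomes]
    return sum(s - 1 / n for s in shares if s > 1 / n)

def gini(incomes):
    """Twice the area between the Lorenz curve and the 45-degree line:
    the mean absolute difference of shares divided by twice the mean share."""
    n = len(incomes)
    total = sum(incomes)
    shares = [x / total for x in incomes]
    mad = sum(abs(a - b) for a in shares for b in shares) / (n * n)
    return mad / (2 / n)

# One person holds all the income among M = 4 people:
# both indices reach the maximum 1 - 1/M = 0.75.
print(schutz([100, 0, 0, 0]))  # 0.75
print(gini([100, 0, 0, 0]))    # 0.75
```

As stated above, both indices coincide when inequality occurs only between the rich and the poor, and both vanish at perfect equality.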
If we assume that symbol dominance characterizes the extent of relative abundance inequality among different symbols, particularly between dominant and subordinate symbols, then the Lorenz-curve-based graphical representation of symbol dominance is given by the separation of the Lorenz curve from the 45-degree line of equiprobability, in which every symbol i has the same relative abundance (pi = 1/N, with N = the number of different symbols). This separation may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line if only differences in relative abundance between dominant and subordinate symbols are of interest, or as twice the area between the Lorenz curve and the 45-degree line if differences in relative abundance among all symbols are of interest, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols.
In any given message with equiprobability, the relative abundance of each different symbol equals 1/N, meaning a symbol may be objectively regarded as dominant if its probability (pd) > 1/N and as subordinate if its probability (ps) < 1/N. I had already used an equivalent method to discriminate between dominant and subordinate species [13][14][15][16][17] and between dominant and subordinate land cover types [18]. Thus, symbol dominance takes a minimum possible value of 0 when every different symbol has the same relative abundance (= 1/N, including N = 1), and approaches a maximum possible value of 1 − 1/N when a single symbol has a relative abundance very close to 1 and the remaining N − 1 symbols have minimum relative abundances (> 0), as symbols with no abundance or zero probability do not exist in a message.
In addition, if we assume that symbol diversity equals the number of different symbols or maximum expected diversity (N) in any given message with equiprobability (symbol dominance = 0 because pi = 1/N), then symbol diversity in any given message with symbol dominance > 0 must equal the maximum expected diversity minus the impact of symbol dominance on it; that is, symbol diversity = N − (N × symbol dominance) = N (1 − symbol dominance). This Lorenz-consistent measure of symbol diversity is a function of both the number of different symbols and the equal distribution of their relative abundances (i.e., symbol diversity is a probabilistic concept free of semantic attributes), taking values from 1 to N (maximum diversity if pi = 1/N) and being properly expressed in units of symbols. Therefore, symbol diversity/N = 1 − symbol dominance (i.e., symbol dominance triggers the inequality between symbol diversity and its maximum expected value).
It should also be evident that the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol, and consequently may be regarded as a Lorenz-consistent measure of symbol redundancy = 1/(N (1 − symbol dominance)). This redundancy measure is a function of both the fewness of different symbols and the unequal distribution of their relative abundances, taking values from 1/N to 1 (maximum redundancy if N = 1). Thus, symbol dominance (relative abundance inequality among different symbols) and symbol redundancy are distinct concepts, although the value of the former affects the value of the latter.
Lastly, if we assume that information entropy is mathematically equivalent to the binary logarithm of its related symbol diversity, then the resulting Lorenz-consistent measure of information entropy = log2 (N (1 − symbol dominance)). This entropy measure takes values from 0 to log2 N (maximum entropy if pi = 1/N) and is properly expressed in bits, quantifying the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols. Obviously, the degree of uncertainty attains a minimum value of 0 as symbol redundancy reaches a maximum value of 1.
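The chain of definitions given above (dominance, diversity, redundancy, entropy) can be sketched numerically. This is a minimal Python sketch (not from the paper; names are illustrative), measuring dominance as the maximum vertical distance from the Lorenz curve to the 45-degree line, i.e., the summed excess of probabilities above 1/N.

```python
import math

def symbol_dominance(probs):
    """Summed excess of dominant probabilities (p > 1/N) over 1/N."""
    n = len(probs)
    return sum(p - 1 / n for p in probs if p > 1 / n)

def symbol_diversity(probs):
    """N (1 - symbol dominance): values from 1 to N, in units of symbols."""
    return len(probs) * (1 - symbol_dominance(probs))

def symbol_redundancy(probs):
    """Reciprocal of symbol diversity: values from 1/N to 1."""
    return 1 / symbol_diversity(probs)

def information_entropy(probs):
    """log2 of symbol diversity: values from 0 to log2 N, in bits."""
    return math.log2(symbol_diversity(probs))

probs = [0.6, 0.4]                 # hypothetical message V from Table 1
print(symbol_dominance(probs))     # ≈ 0.1
print(symbol_diversity(probs))     # ≈ 1.8 symbols
print(information_entropy(probs))  # log2 1.8 ≈ 0.848 bits
```

At equiprobability (pi = 1/N) dominance is 0, diversity equals N, redundancy equals 1/N, and entropy equals log2 N, as stated above.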

Deriving Measures of Symbol Dominance, Symbol Diversity, and Information Entropy from Lorenz-Consistent Statistics
Following the theoretical approach of assessing symbol dominance, symbol diversity, and information entropy within the framework of the Lorenz curve, novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are derived from Lorenz-consistent statistics, which I had previously proposed to quantify species dominance and diversity [13][14][15][16][17] and land cover dominance and diversity [18]. In this derivation the number of different species or land cover types is considered as the number of different symbols, and the probabilities of species or land cover types are considered as the probabilities of symbols:

dC1 = [Σ (pd − ps)] / N, summed over the G subtractions between dominant and subordinate relative abundances, (1)

DC1 = N (1 − dC1), (2)

HC1 = log2 DC1, (3)

dC2 = [Σ |pi − pj|] / N, summed over the K subtractions between all pairs of relative abundances, (4)

DC2 = N (1 − dC2), (5)

HC2 = log2 DC2, (6)

where N is the number of different symbols or maximum expected diversity, pd > 1/N is the relative abundance of each dominant symbol, ps < 1/N is the relative abundance of each subordinate symbol, pi and pj are the relative abundances of two different symbols in the same message, L is the number of dominant symbols, G is the number of subtractions between the relative abundances of dominant and subordinate symbols, and K = N (N − 1)/2 is the number of subtractions between all pairs of relative symbol abundances. The dominance statistic dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)), with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC1 equals the number of different symbols minus the impact of symbol dominance (dC1) on the maximum expected diversity (Equation (2)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (3)).
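The dC1 family can be verified numerically. The sketch below (not from the paper; names are illustrative) assumes that dC1 equals the dominant-minus-subordinate differences summed over the G subtractions and divided by N, which is the form that reproduces the stated equivalence with the maximum vertical distance from the Lorenz curve to the 45-degree line.

```python
import math

def d_C1(probs):
    """Summed differences between dominant (p > 1/N) and subordinate
    (p < 1/N) relative abundances, divided by N."""
    n = len(probs)
    dominants = [p for p in probs if p > 1 / n]
    subordinates = [p for p in probs if p < 1 / n]
    return sum(pd - ps for pd in dominants for ps in subordinates) / n

def D_C1(probs):
    """Maximum expected diversity minus the impact of dominance on it."""
    return len(probs) * (1 - d_C1(probs))

def H_C1(probs):
    """Associated information entropy in bits."""
    return math.log2(D_C1(probs))

def lorenz_max_distance(probs):
    """Maximum vertical distance from the Lorenz curve to the 45-degree line."""
    n = len(probs)
    return sum(p - 1 / n for p in probs if p > 1 / n)

dist_iv = [0.3, 0.3, 0.2, 0.2]       # hypothetical message IV from Table 1
print(d_C1(dist_iv))                 # ≈ 0.1, matching its Ptransfer value
print(lorenz_max_distance(dist_iv))  # same value
```

The equivalence between the two quantities holds because the summed dominant-minus-subordinate differences equal N times the dominant excess over 1/N.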
Likewise, the dominance statistic dC2 refers to the average absolute difference between all pairs of relative symbol abundances (Equation (4)), with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC2 equals the number of different symbols minus the impact of symbol dominance (dC2) on the maximum expected diversity (Equation (5)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (6)).
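The dC2 statistic can be sketched in the same way (again an illustrative sketch, assuming the all-pairs differences are summed over the K subtractions and divided by N, which reproduces the stated equivalence with twice the area between the Lorenz curve and the 45-degree line):

```python
def d_C2(probs):
    """Absolute differences over all K = N (N - 1)/2 unordered pairs of
    relative abundances, divided by N."""
    n = len(probs)
    return sum(abs(probs[i] - probs[j])
               for i in range(n) for j in range(i + 1, n)) / n

# Inequality only between dominant and subordinate symbols: dC2 = dC1 = 0.1.
print(d_C2([0.3, 0.3, 0.2, 0.2]))
# Inequality also among dominant symbols: dC2 (0.35) exceeds dC1 (0.3).
print(d_C2([0.5, 0.3, 0.1, 0.1]))
```

The second example anticipates the behavior discussed later: dC2 responds to transfers among dominant symbols (or among subordinate symbols), while dC1 does not.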
Despite the above dissimilarities between Lorenz-consistent statistics of symbol dominance, symbol diversity, and information entropy, dC1 = dC2, DC1 = DC2, and HC1 = HC2 < log2 N whenever relative abundance inequality occurs only between dominant and subordinate symbols. In this regard, it is worth noting that dC1 is comparable to Schutz's [21] index of income inequality (also known as the Pietra ratio or Robin Hood index) and dC2 is comparable to Gini's [19,20] index of income inequality. In fact, Gini's index and Schutz's index take the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). However, there is a particular difference between the measurement of symbol dominance (dC1 and dC2) and the measurement of income inequality (Schutz's index and Gini's index): income inequality can reach a maximum value of 1 − 1/M when a single person has all the income and the remaining M − 1 people have none (as individuals with no income are considered to measure income inequality), but symbol dominance can only approach a maximum value of 1 − 1/N when a single symbol has a relative abundance very close to 1 and the remaining N − 1 symbols have minimum relative abundances (as symbols with no abundance or zero probability cannot be considered to measure symbol dominance).
Additionally, because the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol (as already explained in Section 2.1), two Lorenz-consistent statistics of symbol redundancy are RC1 = 1/DC1 and RC2 = 1/DC2. RC1 and RC2 take values from 1/N to 1 (maximum redundancy if N = 1), and therefore their mathematical behavior can considerably differ from the mathematical behavior of Gatlin's [30] classical redundancy index (RG = 1 − HS/log2 N). Indeed, since RG takes a maximum value of 1 if N = 1 and a minimum value of 0 if pi = 1/N [30], RG should be regarded as a combination of redundancy and dominance (see also [15]).
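The contrast drawn above can be checked with a short sketch (not from the paper; names are illustrative): at equiprobability Gatlin's RG vanishes, whereas the Lorenz-consistent redundancy RC1 = 1/DC1 bottoms out at 1/N.

```python
import math

def gatlin_redundancy(probs):
    """RG = 1 - HS / log2 N (meaningful for N > 1)."""
    hs = -sum(p * math.log2(p) for p in probs if p > 0)
    return 1 - hs / math.log2(len(probs))

def lorenz_redundancy(probs):
    """RC1 = 1 / (N (1 - dC1)), with dC1 the summed dominant excess over 1/N."""
    n = len(probs)
    d_c1 = sum(p - 1 / n for p in probs if p > 1 / n)
    return 1 / (n * (1 - d_c1))

equiprobable = [0.25] * 4
print(gatlin_redundancy(equiprobable))  # 0.0: the minimum of RG
print(lorenz_redundancy(equiprobable))  # 0.25 = 1/N: the minimum of RC1
```

That RG reaches 0 here while RC1 stays at 1/N illustrates why RG behaves as a combination of redundancy and dominance rather than as redundancy alone.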

Comparing Lorenz-Consistent Statistics with H S -Based and H R -Based Statistics
Lorenz-consistent statistics of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are compared with statistics based on Shannon's [1] entropy (HS) and Rényi's [10] second-order entropy (HR). More specifically, on the basis of Hill's [9] proposals for measuring diversity and Camargo's [17] proposals for measuring dominance, the HS-based and HR-based statistics are:

HS = −Σ pi log2 pi,

DS = 2^HS,

dS = 1 − DS/N,

HR = −log2 Σ pi^2,

DR = 2^HR,

dR = 1 − DR/N,

where pi is the relative abundance or probability of the ith symbol in a message or sequence of N different symbols.
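These comparison statistics can be sketched as follows (an illustrative sketch assuming, consistently with the D = N (1 − d) relation used throughout, that dS = 1 − DS/N and dR = 1 − DR/N; function names are not from the paper):

```python
import math

def H_S(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def H_R(probs):
    """Rényi second-order entropy in bits."""
    return -math.log2(sum(p * p for p in probs))

def D_S(probs):
    """Hill's exponential-form diversity (base 2, since HS is in bits)."""
    return 2 ** H_S(probs)

def D_R(probs):
    """2**HR, equal to 1 / sum(p**2), the inverse Simpson concentration."""
    return 2 ** H_R(probs)

def d_S(probs):
    """Dominance as the proportional shortfall of DS from N."""
    return 1 - D_S(probs) / len(probs)

def d_R(probs):
    return 1 - D_R(probs) / len(probs)

equiprobable = [0.2] * 5
print(D_S(equiprobable))  # ≈ 5.0: at equiprobability, diversity = N
print(d_S(equiprobable))  # ≈ 0.0: and dominance = 0
```

At equiprobability all six statistics agree with their Lorenz-consistent counterparts (d = 0, D = N, H = log2 N); the differences appear away from equiprobability, as examined next.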
Although Lorenz-consistent statistics and HS-based and HR-based statistics take the same values whenever there is equiprobability, differences in mathematical behavior between them were examined by computing all these statistics for the ten probability distributions (I-X) described as hypothetical messages in Table 1. As we can see, the hypothetical message V is the primary or starting distribution, having two different symbols with probabilities of 0.6 and 0.4. From distribution V to I, the probabilities of all different symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability remaining steady (= 0.1). From distribution V to X, only the probabilities of subordinate symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability approaching the probability of the single dominant symbol (= 0.6). Accordingly, the degree of dominance in each dominant symbol is given by the positive deviation of its probability (pd) from the expected equiprobable value of 1/N, while the degree of subordination in each subordinate symbol is given by the positive deviation of 1/N from its probability (ps). Thus, in each probability distribution or hypothetical message, symbol dominance = symbol subordination = the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)) = the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability (Ptransfer values in Table 1).

Table 1.
Ten probability distributions (I-X) are described as hypothetical messages: N = the number of different symbols; p1-p33 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).

In addition, disparities in mathematical behavior between Lorenz-consistent statistics and HS-based and HR-based statistics were examined by computing all these statistics for the ten probability distributions (XI-XX) described as hypothetical messages in Table 2, where differences in relative abundance or probability occur not only between dominant and subordinate symbols (as in Table 1), but also among dominant symbols and among subordinate symbols. However, because the Ptransfer value equals 0.25 in all hypothetical messages, only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to be of real significance for probability distributions to achieve the reference distribution (involving equiprobability) or to deviate from it.
The reasons for this are evident: in the case of a dominant symbol increasing its relative abundance at the expense of other dominant symbols (Table 2, relative abundances p1-p5 in probability distributions XVI-XIX), the resulting proportional abundance of all the dominant symbols is the same as before the transfer, since the increase in the probability of a dominant symbol (becoming more dominant) is compensated by an equivalent decrease in the probability of other dominant symbols (becoming less dominant); similarly, in the case of a subordinate symbol increasing its relative abundance at the expense of other subordinate symbols (Table 2, relative abundances p6-p10 in probability distributions XII-XV), the resulting proportional abundance of all the subordinate symbols is the same as before the transfer, since the increase in the probability of a subordinate symbol (becoming less subordinate) is compensated by an equivalent decrease in the probability of other subordinate symbols (becoming more subordinate or rare).

Table 2. Ten probability distributions (XI-XX) are described as hypothetical messages: N = the number of different symbols; p1-p10 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).

Probability distributions in Tables 1 and 2 were selected to better assess differences in mathematical behavior between Lorenz-consistent statistics (Camargo's indices) and HS-based and HR-based statistics. Otherwise, when using probability distributions that were chosen at random, we could obtain results that do not allow us to appreciate significant differences between the respective mathematical behaviors.

Results and Discussion
The Lorenz-curve-based graphical representation of symbol dominance (relative abundance inequality among different symbols) is shown in Figure 1.

Figure 1. The cumulative proportion of abundance is related to the cumulative proportion of symbols, ranked from the symbol with the lowest relative abundance to the symbol with the highest relative abundance, for the ten probability distributions (I-X) described as hypothetical messages in Table 1. The reference distribution is depicted by the 45-degree line of equiprobability, where every symbol has the same relative abundance = 1/N, symbol dominance = 0, and symbol diversity = the number of different symbols (N). Symbol dominance may be estimated as the maximum vertical distance from the Lorenz curve to the 45-degree line, or as twice the area between the Lorenz curve and the 45-degree line, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols (as shown in this figure). In addition, symbol diversity = N (1 − symbol dominance), symbol redundancy = 1/symbol diversity, and information entropy = log2 symbol diversity.

Differences in mathematical behavior between Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) and HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR) are shown in Table 3. Because dC1, dC2, DC1, DC2, HC1, and HC2 are Lorenz-consistent, their estimated values match estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1. In fact, estimated values of dC1 and dC2 are equivalent to the respective Ptransfer values in Table 1. By contrast, estimated values of dS, dR, DS, DR, HS, and HR do not match estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1, while dS and dR exhibit values even greater than the upper limit for symbol dominance (= 0.6). Consequently, DS and DR can underestimate symbol diversity when differences in relative abundance between dominant and subordinate symbols are large or can overestimate it when such differences are relatively small.
The observed shortcomings in the measurement of symbol dominance (using dS and dR) and symbol diversity (using DS and DR) seem to be a consequence of the mathematical behavior of the associated entropy measures (HS and HR). As we can see in Table 3, from distribution V to I, where the Ptransfer value remains relatively small (= 0.1; Table 1), HS and HR stay close to the maximum entropy (log2 N), so that dS and dR remain well below the actual degree of symbol dominance; from distribution V to X, where the Ptransfer value approaches 0.6, HS and HR increasingly fall below HC1 and HC2 as the number of small probabilities grows, so that dS and dR eventually exceed the upper limit for symbol dominance. This remarkable finding would indicate that HC1 and HC2 can quantify the amount of information or uncertainty in a probability distribution more efficiently than HS and HR, particularly when differences between higher and lower probabilities are maximized by increasing the number of small probabilities (as shown in Table 3 regarding data in Table 1). After all, within the context of classical information theory, the information content of a symbol is an increasing function of the reciprocal of its probability [1,5,6,10] (also see [31,32]).
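The pattern just described can be reproduced numerically. The sketch below (illustrative, assuming dS = 1 − 2^HS/N and dR = 1 − 2^HR/N as in the comparison statistics above) contrasts distributions V, I, and X from Table 1:

```python
import math

def hs(p):  # Shannon entropy in bits
    return -sum(x * math.log2(x) for x in p)

def hr(p):  # Rényi second-order entropy in bits
    return -math.log2(sum(x * x for x in p))

def ds(p):
    return 1 - 2 ** hs(p) / len(p)

def dr(p):
    return 1 - 2 ** hr(p) / len(p)

def ptransfer(p):
    n = len(p)
    return sum(x - 1 / n for x in p if x > 1 / n)

v = [0.6, 0.4]
dist_i = [p / 16 for p in v for _ in range(16)]  # distribution I, N = 32
dist_x = [0.6] + [0.4 / 32] * 32                 # distribution X, N = 33

# From V to I the true dominance (Ptransfer) stays at 0.1, yet dS and dR are
# constant and much smaller; for X they overshoot past the 0.6 upper limit.
print(ds(v), ds(dist_i))       # ≈ 0.020 in both cases
print(dr(v), dr(dist_i))       # ≈ 0.038 in both cases
print(ds(dist_x), dr(dist_x))  # ≈ 0.76 and ≈ 0.92, although Ptransfer < 0.6
```

In other words, the HS-based and HR-based dominance statistics first underestimate and then overestimate the relative abundance that would actually have to be transferred to achieve equiprobability.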
Other relevant disparities in mathematical behavior regarding measures of symbol dominance, symbol diversity, and information entropy are shown in Table 4. The respective values of dC1, DC1, and HC1 remain identical from distribution XI to XX, since dC1 is sensitive only to differences in relative abundance between dominant and subordinate symbols. On the other hand, because dC2 is sensitive to differences in relative abundance among all different symbols, the respective values of dC2, DC2, and HC2 do not remain identical from distribution XI to XX, even though they are equal in XII and XVI, in XIII and XVII, in XIV and XVIII, and in XV and XIX, as in each of these distribution pairs changes in the allocation of relative abundance among dominant symbols and among subordinate symbols are equivalent. A similar pattern of values is observed concerning dR, DR, and HR, but not regarding dS, DS, and HS, whose respective values remain distinct from distribution XI to XX.

In summary, this study has shown that the Lorenz curve is a proper framework to define satisfactory measures of symbol dominance, symbol diversity, and information entropy (Tables 3 and 4). The value of symbol dominance equals the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability when only differences in relative abundance between dominant and subordinate symbols are quantified, which is equivalent to the average absolute difference between the relative abundances of dominant and subordinate symbols = dC1 (Equation (1)); or it equals twice the area between the Lorenz curve and the 45-degree line of equiprobability when differences in relative abundance among all symbols are quantified, which is equivalent to the average absolute difference between all pairs of relative symbol abundances = dC2 (Equation (4)). Symbol diversity = N (1 − symbol dominance) (i.e., DC1 = N (1 − dC1) and DC2 = N (1 − dC2)) and information entropy = log2 symbol diversity (i.e., HC1 = log2 DC1 and HC2 = log2 DC2).
Additionally, the reciprocal of symbol diversity may be regarded as a satisfactory measure of symbol redundancy (i.e., R C1 = 1/D C1 and R C2 = 1/D C2 ).
This study has also shown that Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) have better mathematical behavior than HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR), exhibiting greater coherence and objectivity when measuring symbol dominance, symbol diversity, and information entropy (Tables 3 and 4). However, considering that the 45-degree line of equiprobability (Figure 1) represents the reference distribution (pi = 1/N), and that only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to have true relevance for probability distributions to achieve the reference distribution or to deviate from it (Table 2), the use of dC1, DC1, and HC1 may be more practical and preferable than the use of dC2, DC2, and HC2 in measuring symbol dominance, symbol diversity, and information entropy. In this regard, it should be evident that if the number of different symbols (N) is fixed in any given message, increasing differences in relative abundance between dominant and subordinate symbols necessarily imply decreases in symbol diversity and information entropy, whereas decreasing differences in relative abundance between dominant and subordinate symbols necessarily imply increases in symbol diversity and information entropy, with these two variables reaching their maximum values if pi = 1/N. By contrast, increasing or decreasing differences in relative abundance among dominant symbols or among subordinate symbols will not affect symbol diversity and information entropy, since the decrease or increase in the information content of a dominant or subordinate symbol is compensated by an equivalent increase or decrease in the information content of other dominant or subordinate symbols.
Funding: This research received no external funding.