Abstract
For income distributions divided into middle, lower, and higher regions based on scalar median cut-offs, this paper establishes the asymptotic distribution properties (including explicit, empirically applicable variance formulas and hence standard errors) of sample estimates of each group’s proportion of the population, its share of total income, and its mean income. It then applies these results to relative mean income ratios, various polarization measures, and decile-mean income ratios. Since the derived formulas are not distribution-free, the study advises using a density estimation technique proposed by Comte and Genon-Catalot. In an application to Canadian Census earnings data for 2000 and 2005, a shrinking middle-income group with declining relative incomes and marked upper-tail polarization among men’s incomes are all found to be highly statistically significant.
JEL Classification:
C10; C42; D31
1. Introduction
Two major distributional changes have characterized many developed economies since around 1980: declining middle-class incomes and rising top incomes (Hoffman et al., 2020; Blanchet et al., 2022; Guvenen et al., 2022). For example, in the case of full-time full-year workers in Canada between 1970 and 2005, the proportion of workers who received middle-class earnings fell by 11.5 percentage points among men (from 74.3 to 62.8 percent) and by 13.4 percentage points among women (from 76.5 to 63.1 percent), while the proportion of higher earners rose by 3.4 percentage points for men and 4.9 percentage points for women, and the proportion of lower-earning workers went up by 5.1 and 5.7 points, respectively. Over the same period, the corresponding shares of total earnings received by middle-class earners fell by 16.9 points for men and 17.8 points for women, while the earnings shares of higher earners rose by over 13 percentage points for both men (18.5 to 32.0 percent) and women (11.4 to 25.0 percent) (Beach, 2016, Tables 1 and 6). It would clearly be useful to be able to capture both of these sets of changes efficiently in a simple empirical framework that allows for a conventional statistical inference methodology, so one can test for the statistical significance of such changes over time.
The distributional measures that are typically used to examine these patterns of distributional change are the income shares of middle- and upper-income groups, the relative sizes of these groups, and the relative incomes of these groups. In examining these changes, Beach (2016) demonstrated the usefulness of characterizing the income groups in terms of their relationship to the median income level. So, for example, the middle-income group (M) could be defined as including those with incomes between, say, fifty percent and two hundred percent of the median, the upper group (H) as those with incomes above twice the median, and the lower group (L) as those with incomes below half the median. This allows one to obtain separate estimates for group income shares and for the proportion of recipients within the group (or population share), as well as for the group mean incomes. This distributional framework allows a more insightful interpretation of distributional change, since one can then analyze both the size and the relative prosperity of the income group separately. (Percentile- or quantile-based measures, by construction, assign the size of the income groups as a prespecified percentage such as the top decile or 10% of all income recipients.) Characterizing group size and prosperity allows one to capture the quantity dimension of a change in the group’s total income separately from the income per recipient. This in turn can be used to help identify the relative strength of demand-side or supply-side driving factors behind observed distributional change (Katz & Murphy, 1992). Such insights, though, have heretofore been based on the relative magnitude of these effects, not on their statistical significance. This framework also allows for a richer and more extensive set of measures of income polarization, in terms of both quantity and relative income dimensions at the tails of the distribution.
Any summary or scalar inequality index (such as the Gini coefficient) does not capture the complex mix of distributional changes that have been occurring and does not allow one to identify where the major changes are occurring (and hence possible appropriate policy concerns). A three-way (or more detailed) distributional characterization of these income inequality changes is required.
Davidson (2018) provided an empirical approach to calculate asymptotic variances and covariances for sample estimates of the income share and the population share of middle-group income recipients within the median-based empirical framework, thus enabling formal statistical inference on these measures. The present paper extends Davidson’s statistical analysis to apply to lower- and upper-income groups as well (all defined in terms of the median), so that one can examine a full set of population subsets covering an income distribution (i.e., the L, M, and H subsets) jointly. The analysis shows how this approach leads to explicit formulas, which can be easily programmed, for the asymptotic variances and standard errors of the estimated income and population shares of all three income groups. The paper also extends the set of distributional measures to a relative mean income statistic, the ratio of the mean of group i incomes to the overall population mean, and to the group mean itself, so that one can test for the statistical significance of growing income gaps between income groups.
The paper thus proposes a general framework for median-based income inequality analysis, based on asymptotic statistical inference. The derived formulas for variances and covariances of the various statistics are directly empirically applicable to available public microdata files such as those commonly used by research and public policy analysts. The present study serves as a complement to a separate piece by the authors (Beach & Davidson, 2025) that developed a comparable framework for inequality measures, based on quantile income shares as typically published by government statistical agencies. Together, the two papers provide the basis for a toolbox set of calculations that can be readily implemented to allow standard statistical inference for frequently used statistics of disaggregated income inequality change.
The paper first outlines the stochastic quantile function approach to statistical inference. It then extends Davidson’s (2018) middle-class group results for estimated income shares and population shares to corresponding lower- and upper-income groups as well and expresses the asymptotic variance results in terms of simple explicit formulas that can be estimated from available microdata. The extension of these results to group mean income measures is also presented. In Section 3, the results in Section 2 are used to obtain results for relative group mean incomes, measures of polarization, and mean–decile distribution functions. Section 4 provides an empirical application of the Section 2 theoretical results to Canadian Census earnings data. Section 5 summarizes the main results of the paper and notes some implications.
2. Basic Asymptotic Analysis
Let F be the population distribution of income recipients, and let Y denote a random variable of which the cumulative distribution function (CDF) is F. We make the following somewhat restrictive assumption:
Assumption 1.
The CDF F is differentiable and strictly increasing on its compact support.
The assumption is made for convenience and in order to simplify the asymptotic analysis. If it is not satisfied, various asymptotically negligible terms appear in the estimators of group population and income shares, which complicate the analysis.
2.1. Population Shares
Let m denote the median of the distribution F. Then, the population share of those recipients with income no greater than for some is . If we have a random sample from the population of size N, we can estimate the distribution by the empirical distribution function (EDF) , defined as follows:
where the , , are the observed incomes in the sample, and I is the indicator function, with value 1 if its argument is true and 0 if it is false. The sample median is defined as usual:
The natural estimate of the population share is . We have
Under our Assumption 1 and also under less restrictive but still conventional regularity conditions, the first two terms above are of order , while the last, being of order , can be ignored for asymptotic analysis. Then, to leading order, we see that
where is the population density function. According to the Bahadur (1966) representation of quantiles,
and so
Let B be equal to and consider the random variable defined as follows:
where Y is a variable that has the distribution F. Then, clearly
The terms in the sum in (3) can be seen to be IID realizations of the random variable , and so it follows that is asymptotically equal in distribution to . Asymptotic normality follows from the central-limit theorem. The variance of the limiting distribution, which, following standard terminology, we refer to as the asymptotic variance of , is then just . In order to estimate this variance, let
with , using appropriate estimates of the density. Then, to leading order,
with, from (5),
can be estimated by the sample variance of the .
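The sample-variance route just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it takes the influence values to be u_i = I(y_i ≤ b·m) − B·I(y_i ≤ m) with B = b·f(bm)/f(m), which is the form the Bahadur argument above implies. The function names, the Gaussian kernel, and the bandwidth are hypothetical, and the kernel estimator is only a stand-in for the density estimator of Appendix B.

```python
import math
import random
import statistics


def kde_at(sample, x, bandwidth):
    """Gaussian kernel density estimate at x (stand-in, not the Appendix B estimator)."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in sample) / norm


def pop_share_below(sample, b, bandwidth):
    """Estimate the population share with income <= b * median, with its standard error.

    Assumed influence values: u_i = I(y_i <= b*m) - B * I(y_i <= m),
    where B = b * f(b*m) / f(m) is estimated by plugging in kernel densities.
    """
    n = len(sample)
    m_hat = statistics.median(sample)
    cut = b * m_hat
    b_hat = b * kde_at(sample, cut, bandwidth) / kde_at(sample, m_hat, bandwidth)
    u = [float(y <= cut) - b_hat * float(y <= m_hat) for y in sample]
    share = sum(1 for y in sample if y <= cut) / n
    # Asymptotic variance is Var(U); the standard error divides by sqrt(n).
    std_err = math.sqrt(statistics.pvariance(u) / n)
    return share, std_err, b_hat


if __name__ == "__main__":
    rng = random.Random(0)
    incomes = [rng.expovariate(1.0) for _ in range(5000)]  # synthetic incomes
    print(pop_share_below(incomes, b=0.5, bandwidth=0.1))
```

For a unit exponential distribution the true share below half the median is 1 − 2^(−1/2), roughly 0.293, which gives a quick sanity check on synthetic data.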
A possibly better approach is simply to compute directly and then estimate the result. It is easy to see from (4) that
whence
Next, , and so from (5), for , we have
We see that can be estimated in a distribution-free manner by
Let , and make the definitions
and, for
Then, is asymptotically equal in distribution to .
Some comments are in order concerning the “appropriate” estimates of the density. In Appendix B, we sketch an alternative to conventional kernel density estimation that works much better with distributions that have support only on the positive real line or a subset of it. Here, we follow the work of Comte and Genon-Catalot (2012).
The analysis so far developed is sufficient for estimating and providing standard errors for the population share with income less than or greater than . However, in order to estimate the population share of recipients with income in some interval , , , as in Davidson (2018), one needs not only the variances of and but also their covariance. The asymptotic covariance of and is the covariance of and .
Make the definitions and . Then,
whence
whereas
From this, we see immediately that
and this can be estimated in a distribution-free manner.
Although the results of this section so far are quite general, for most of the rest of the paper, interest will be focused on the case with . The share of the population with income not exceeding , that is, , will be denoted by , where ‘L’ stands for the group of lower-income recipients. The population share of the middle-income group is ; it is denoted by . The share of the higher-income group, , is denoted by .
It is clear from (8) that
and from (10) that
Note that the terms on the right-hand sides of these equations have simple intuitive interpretations. The first (product) term corresponds to the variance of random recipients lying within the respective population share, the second (squares) term corresponds to the variance of the estimated median-based cut-off points, and the last term corresponds to the covariance or interaction between the first two components.
The population share of recipients of incomes between and is , and the limiting variance of is equal to . The covariance (12) can now be rewritten as
and so the asymptotic variance of , after a little algebra based on (13), (14), and (15), can be seen to be
The same expression results from calculating directly. Let . Then, (16) can also be written as
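For reference, the closed forms implied by the influence-function argument of this subsection can be written out as follows. This is a reconstruction under stated assumptions rather than a transcription of the numbered equations: it takes $U_c = I(Y \le cm) - C\,I(Y \le m)$ with $C = c f(cm)/f(m)$, writes $A$ and $B$ for the cases $c = a > 1$ and $c = b < 1$, and uses the shorthand $F_a = F(am)$ and $F_b = F(bm)$, which is introduced here for convenience.

```latex
\begin{align*}
\operatorname{Var}(U_b) &= F_b(1-F_b) + \tfrac{1}{4}B^2 - B\,F_b, \\
\operatorname{Var}(U_a) &= F_a(1-F_a) + \tfrac{1}{4}A^2 - A\,(1-F_a), \\
\operatorname{Cov}(U_a,U_b) &= F_b(1-F_a) - \tfrac{1}{2}B\,(1-F_a) - \tfrac{1}{2}A\,F_b + \tfrac{1}{4}AB, \\
\operatorname{Var}(U_a-U_b) &= \operatorname{Var}(U_a) + \operatorname{Var}(U_b) - 2\operatorname{Cov}(U_a,U_b).
\end{align*}
```

Each variance has the three-part structure described above: a binomial product term, a squared term from the estimated median, and an interaction term. The last line is the asymptotic variance relevant for the middle-group population share.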
2.2. Income Shares
We begin by considering the income share of recipients of incomes no greater than , with . The average income earned by these recipients is , defined as follows:
and the income share is , where is the mean income of the population, estimated by . Note that and . With , we denote and by and respectively, and we denote the income share of the lower-income group by . Clearly .
For incomes greater than , with , the average income is with defined just as in (18), replacing b by a. The income share is . For the middle-income group, the average income is , and the income share is .
By analogy with (1) for population shares, we have
where the third term can be ignored asymptotically. With a random sample of size N, as in the preceding subsection, the first term is exactly equal to
and the second term can be approximated to leading order by , and, by (2), that approximation is to leading order equal to
This leads to
Next, we define the random variable as
noting that . It follows now that is asymptotically equal in distribution to .
Similarly, for , we can define
where , and is asymptotically equal in distribution to .
For the variance of , we compute as follows:
so that
where we define . It follows that
In the same way, we find that
where and . Everything here can be straightforwardly estimated in a distribution-free manner.
Alternatively, by setting
for , the variance of and that of can be estimated by the sample variances of the and the , respectively.
The income share of the low-income group is , and this income share can be estimated by . We have
Now since , for the purposes of our asymptotic analysis, we can replace the denominator by . Given (19) and the definition (20) of the random variable , and the fact that , we are led to define the random variable and to conclude that (24) is asymptotically equal in distribution to . First, note that
For the variance, we have
Now,
Then, from (22) and (26), we see that the asymptotic variance of is
where . Note that the term involving corresponds to the variability of the estimated median-based cut-off point about its true population cut-off value. The terms without any B in them correspond to the variability of random recipients lying within the true median-based cut-off range. Terms involving only B then correspond to the covariance or interaction between the first two components.
Since the income share of the high-income group is , similarly to (24), we see that
Make the definition ; the asymptotic variance of is then , and after some algebra, we see that this is
For the middle-income group, we have , and so, again similarly to (24), we find that
Define the random variable
It is easy to check that the asymptotic variance of is . First,
Then,
Since
it follows that
and since
it follows that
From (29)–(31), we conclude after a bit of algebra that
Another way to estimate and is to define
for and use the sample variances of the and as and , respectively; recall the definitions (23). Further, if we define
the sample variance of the estimates .
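A hedged Python sketch of this plug-in approach for the three income shares follows. It assumes that the influence values for the partial mean take the form v_i = y_i·I(y_i ≤ c·m) − c·m·C·I(y_i ≤ m), with C = c·f(cm)/f(m), and that each share is then handled by the usual delta method for a ratio. The function name and argument names are hypothetical, and the density ratios are taken as given inputs (they would come from a separate density estimator such as the one in Appendix B).

```python
import math
import random
import statistics


def income_shares(sample, a, b, ratio_a, ratio_b):
    """Income shares of groups L, M, H with asymptotic standard errors.

    ratio_a, ratio_b: assumed estimates of A = a*f(a*m)/f(m) and B = b*f(b*m)/f(m).
    """
    n = len(sample)
    m_hat = statistics.median(sample)
    mean = sum(sample) / n

    def partial_mean_terms(c, ratio):
        # Assumed influence values for the mean of y * I(y <= c*m).
        cut = c * m_hat
        v = [y * (y <= cut) - cut * ratio * (y <= m_hat) for y in sample]
        partial = sum(y for y in sample if y <= cut) / n
        return v, partial

    v_a, n_a = partial_mean_terms(a, ratio_a)
    v_b, n_b = partial_mean_terms(b, ratio_b)
    shares = {"L": n_b / mean, "M": (n_a - n_b) / mean, "H": 1.0 - n_a / mean}
    numerators = {
        "L": v_b,
        "M": [va - vb for va, vb in zip(v_a, v_b)],
        "H": [y - va for y, va in zip(sample, v_a)],
    }
    result = {}
    for group, share in shares.items():
        # Delta method for a ratio: w_i = (numerator_i - share * y_i) / mean.
        w = [(x - share * y) / mean for x, y in zip(numerators[group], sample)]
        result[group] = (share, math.sqrt(statistics.pvariance(w) / n))
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    incomes = [rng.expovariate(1.0) for _ in range(20000)]  # synthetic incomes
    # For a unit exponential, A = 1 and B = 2 ** -0.5 when a = 2 and b = 0.5.
    print(income_shares(incomes, 2.0, 0.5, 1.0, 2 ** -0.5))
```

By construction the three estimated shares sum to one, which is a convenient check on any implementation.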
2.3. Income Group Means
The mean income of recipients with income no greater than is denoted and is equal to
estimated by . From this, we have to leading order,
This suggests the definition of a new random variable , as follows:
recall the definitions (4) and (20). Then, is asymptotically equal in distribution to . Details of the calculation of the variance of are in Appendix A(a), although it is also possible to make the definition for
with and defined, respectively, by (6) and (23), and use the sample variance of the as an estimate of . The calculation in Appendix A(a) leads to a rather simple expression for , as follows:
Note that
and so, writing , we can reformulate (35) as
Note once more that the second term in this expression corresponds to the variance of based on the true cut-off value, while the first term corresponds to the variability associated with the randomness of the cut-off about its population value .
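The two-term structure can be illustrated with a short Python sketch. It rests on an assumption that should be checked against Appendix A(a): that the asymptotic variance of the lower-group mean is B²(bm − μ_L)²/(4F_b²) + σ_L²/F_b, where F_b is the lower-group population share, μ_L the group mean, and σ_L² the variance of incomes within the group. This is what the delta method gives when the influence values of the previous subsections are combined. The function name is hypothetical, and the density ratio B is taken as a given input.

```python
import math
import random
import statistics


def lower_group_mean(sample, b, ratio_b):
    """Mean income of the group with income <= b * median, with its standard error.

    ratio_b: assumed estimate of B = b * f(b*m) / f(m).
    Returns (group mean, standard error, cut-off term, within-group term).
    """
    n = len(sample)
    m_hat = statistics.median(sample)
    cut = b * m_hat
    group = [y for y in sample if y <= cut]
    share = len(group) / n
    mu_low = sum(group) / len(group)
    # Term reflecting the randomness of the estimated cut-off b * median.
    cutoff_term = ratio_b ** 2 * (cut - mu_low) ** 2 / (4.0 * share ** 2)
    # Term reflecting income variability within the group at a fixed cut-off.
    within_term = statistics.pvariance(group) / share
    std_err = math.sqrt((cutoff_term + within_term) / n)
    return mu_low, std_err, cutoff_term, within_term


if __name__ == "__main__":
    rng = random.Random(0)
    incomes = [rng.expovariate(1.0) for _ in range(20000)]  # synthetic incomes
    print(lower_group_mean(incomes, 0.5, 2 ** -0.5))
```

Returning the two terms separately makes it easy to see how much of the sampling variability comes from estimating the median rather than from dispersion within the group.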
The mean of incomes greater than is
estimated by . Then, is asymptotically equal in distribution to the random variable
minus its expectation. Note that
so that
The variance of , derived in detail in Appendix A(b), is
Now, if we define the conditional variance
then (41) can also be expressed as
Alternatively, for , make the definition
The variance of the limiting distribution of can then be estimated by the sample variance of the . (Recall definitions (6) and (23) for and .)
The mean of the incomes between and is
estimated by . Thus, is asymptotically equal in distribution to the random variable
minus its expectation.
Note that is not a function of and alone. Estimating it poses no problem, but a new calculation is needed to find an expression for its asymptotic variance. The variance of is derived in Appendix A(c). It is
where we have made the definition
Another conditional variance:
so that (45) reformulated becomes
In order to estimate the variance of the limiting distribution of , another way to proceed is to make the definition, for ,
and use the sample variance of the to estimate the desired variance.
2.4. Summary of Main Results
2.4.1. Population Shares
2.4.2. Income Shares
2.4.3. Income Group Means
Note that the general framework of this paper allows for more and for more refined income groups than just the three employed here—so long as the cut-off points between income groups are expressed in terms of multiples of the median.
If, as for instance with stratified sampling, observations are not equally weighted, our analysis can still be applied if the number of actual observations N is replaced by the sum of the weights over the sample.
3. Inference on Related Distributional Statistics
This section considers three sets of distributional statistics that involve applications of the analytical results developed in the previous section. As there, we restrict attention to the case in which , thus defining three income groups: the lower group L, for incomes less than or equal to ; the middle group M, with incomes between and ; and the higher group H, with incomes greater than .
3.1. Relative Mean Income Ratios
The relative mean income for each income group is the ratio of the group’s mean income to the overall mean income of the distribution:
For example, in recent decades for many countries, the lower-income ratio has not changed much, while the upper-income ratio has risen substantially. It would be useful to know whether the changes in both ratios are statistically significant or only the latter.
The relative mean income ratio can be estimated directly as
However, from the definitions of , , and , we have , , and , and so for , . Thus, to leading order
In Appendix A(d), explicit expressions are derived for the asymptotic variances of , . The results are as follows:
The details of the calculation of the covariance needed in (52) are relegated to Appendix A(e). The result is
with .
3.2. Polarization Measures
The rise in upper incomes, resulting in a growing separation between high-income recipients and middle-class workers, has led to concern about the degree of polarization in income distributions. The concept of polarization can be viewed as having two quite distinct dimensions. One is the size dimension or relative mass at the two ends of the distribution (see for example Wolfson (1994)), which we label tail-frequency polarization and capture here as the proportion of recipients in the lower or higher income groups—what we are referring to here as and . Such measures then are , , and . Asymptotic variances for the first two have already been obtained in Section 2.4 above. For , note that the sum of the three population shares is one, and so the asymptotic variance of is simply that of the middle group, , which again we already have in (17).
The other aspect of polarization is the distance dimension or income-gap polarization, represented here by , , or . Both sets of measures provide useful insights, and both can be implemented in our analytical framework. In the case of the income-gap polarization measures, again, the asymptotic variances of , , and have been established in Section 2.4.2. For the differences in income group means, recall that
for . The three required covariances are provided in Appendix A(f). Thus, again, standard errors of the income-gap polarization measures can be computed in the usual fashion.
One could also posit a set of compound polarization measures, which capture both of these dimensions together: , , and also .
Analogously, one could further identify a compound measure to capture the evident decline in the economic situation of the middle class in many countries over recent decades as . This would allow one, for example, to use logarithmic derivatives to estimate the relative importance of changes in the relative size of the middle class () versus changes in their average real incomes () in this decline.
One can use the results of Section 2 to work out the asymptotic variances of these various estimated compound measures; see Appendix A(g) for details.
3.3. Mean–Decile Functions
In an environment where higher incomes have been rising dramatically relative to the rest of the distribution, one measure of interest could be an indication of skewness of the distribution, as measured by the difference between the overall mean and median of the income distribution, or . However, is simply the fifth decile of the distribution. One could, more generally, define a mean–decile function.
Choose some proportions , with for . For deciles, we would have , . Let be the -quantile of the distribution: the proportion of incomes less than is , and let be the corresponding sample quantile. Possible mean–decile functions could take on values , or alternatively , for the decile of the distribution as a further way of capturing growing income differences over various ranges of the distribution.
Here, we can make use of the work of Lin et al. (1980). These authors show that, under general regularity conditions, the and are asymptotically joint normally distributed. We denote the asymptotic variance–covariance matrix by : it is an matrix, where the index refers, not to a quantile, but to . Then, for , the elements of are
where is the density at , , and .
Thus, for the mean–decile distribution defined in levels as , we have
In relative or proportional terms,
Note that the density appears as such in the denominator of the above expressions rather than as a ratio or as elsewhere in this paper. However, can be estimated in the same way as the other densities used; see Appendix B. Standard errors can be calculated accordingly.
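As a rough illustration for the levels version, the gap between the sample mean and a sample quantile can be given a standard error from influence values. The sketch below assumes the textbook influence function of a sample quantile, −(I(Y ≤ ξ) − p)/f(ξ), so that the gap has influence values (y_i − mean) + (I(y_i ≤ ξ) − p)/f(ξ). It does not reproduce the matrix elements of Lin et al. (1980). The function name is hypothetical, and the density at the quantile is passed in as a given estimate.

```python
import math
import random
import statistics


def mean_quantile_gap(sample, p, density_at_quantile):
    """Gap between the sample mean and the p-quantile, with asymptotic standard error.

    density_at_quantile: assumed estimate of the density at the p-quantile.
    """
    n = len(sample)
    ordered = sorted(sample)
    xi_hat = ordered[max(0, math.ceil(p * n) - 1)]  # sample p-quantile
    mean = sum(sample) / n
    d = [
        (y - mean) + (float(y <= xi_hat) - p) / density_at_quantile
        for y in sample
    ]
    return mean - xi_hat, math.sqrt(statistics.pvariance(d) / n)


if __name__ == "__main__":
    rng = random.Random(0)
    incomes = [rng.expovariate(1.0) for _ in range(20000)]  # synthetic incomes
    # For a unit exponential the density at the median is 0.5.
    print(mean_quantile_gap(incomes, 0.5, 0.5))
```

With p = 0.5 this is the mean-median gap; for a unit exponential the population value is 1 − ln 2, about 0.307.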
3.4. Relation with the Bootstrap
Given the fact that the bootstrap has become an almost universal tool for reliable statistical inference, it is incumbent on us to outline how the material in this paper can be used in connection with bootstrap methods. It has been suggested that the asymptotic variances and standard errors provided here are unnecessary, as they can be obtained in a finite-sample context by use of the bootstrap. However, Horowitz (2001) points out that naive bootstrap standard errors are unlikely to be any better than asymptotic ones and may well be worse. What he and numerous other authors recommend is using an asymptotic standard error in order to construct an asymptotically pivotal quantity by studentizing, that is, dividing the quantity of interest, supposed to have expectation zero, by its standard error. The studentized quantity can then be bootstrapped in order to obtain a bootstrap P value for some null hypothesis, or to construct a bootstrap confidence interval for a parameter of interest.
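Our results can be applied readily to such a bootstrap exercise. For instance, a test of the hypothesis that the population share of the middle-income group is equal to some given value can be based on bootstrapping the studentized statistic, formed by dividing the difference between the estimated share and that value by its asymptotic standard error, which is obtained from the asymptotic variance given by (17). Similarly, a bootstrap confidence interval for the share can be constructed by conventional means.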
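A studentized bootstrap test of this kind might look as follows in Python. This is a sketch under assumptions, not the procedure used for the paper's results: the influence values for the middle-group share are taken to be I(bm < y ≤ am) − (A − B)·I(y ≤ m), the density ratios A and B are estimated with a Gaussian kernel as a stand-in for the Appendix B estimator, and all function names, the bandwidth, and the number of repetitions are illustrative.

```python
import math
import random
import statistics


def kde_at(sample, x, bandwidth):
    """Gaussian kernel density estimate at x (stand-in estimator)."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in sample) / norm


def middle_share_and_se(sample, a, b, bandwidth):
    """Middle-group population share and its asymptotic standard error."""
    n = len(sample)
    m = statistics.median(sample)
    f_m = kde_at(sample, m, bandwidth)
    ratio_a = a * kde_at(sample, a * m, bandwidth) / f_m
    ratio_b = b * kde_at(sample, b * m, bandwidth) / f_m
    u = [
        float(b * m < y <= a * m) - (ratio_a - ratio_b) * float(y <= m)
        for y in sample
    ]
    share = sum(1 for y in sample if b * m < y <= a * m) / n
    return share, math.sqrt(statistics.pvariance(u) / n)


def bootstrap_t_pvalue(sample, a, b, null_value, bandwidth, reps=199, seed=0):
    """Two-sided bootstrap P value for H0: middle share == null_value."""
    rng = random.Random(seed)
    n = len(sample)
    est, se = middle_share_and_se(sample, a, b, bandwidth)
    t_obs = (est - null_value) / se
    exceed = 0
    for _ in range(reps):
        star = rng.choices(sample, k=n)  # resample with replacement
        est_s, se_s = middle_share_and_se(star, a, b, bandwidth)
        # Centre at the original estimate so the null holds in the bootstrap world.
        t_star = (est_s - est) / se_s
        if abs(t_star) >= abs(t_obs):
            exceed += 1
    return t_obs, (exceed + 1) / (reps + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    incomes = [rng.expovariate(1.0) for _ in range(400)]  # synthetic incomes
    print(bootstrap_t_pvalue(incomes, 2.0, 0.5, 0.45, bandwidth=0.15))
```

The key design point is that the quantity resampled is the studentized statistic, not the raw share, so the asymptotic standard error is recomputed inside every bootstrap repetition.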
Another reason to exercise care in applying the bootstrap to the data used in this paper is set out in Davidson (2018). The incomes given for individuals in the census data are often, indeed usually, rounded to multiples of USD 500 or USD 1000. This means that the empirical distribution of the sample of incomes is not smooth, and this is known to cause problems for a conventional resampling bootstrap. We verified that this is the case with our samples. Asymptotic variances as given by the formulas of this paper, and variances derived from a conventional resampling bootstrap, were compared in the context of a simulation experiment that used samples of 200,000 observations realized from a lognormal distribution. The results were comparable, as might be expected with such large samples. When the same exercise was repeated with the sample of men’s incomes in 2000, the bootstrap variances were very different from the asymptotic ones.
Another point of interest for practitioners is that all the asymptotic standard errors reported in Table 1 were computed in a quarter of a second, whereas the corresponding bootstrap standard errors, with 999 bootstrap repetitions, took 80 s.
4. Empirical Study
In this section, we present results obtained using data from the Canadian Census Public Use Microdata Files (PUMF) for Individuals for 2000 and 2005, as recorded in the 2001 and 2006 censuses. We preferred these datasets to more up-to-date ones since the 2015–2020 census interval has results that are massively affected by the Canadian federal government’s response to the COVID-19 pandemic in the form of major temporary income support programs. In addition, the 2011 Census used a changed methodology (to save money) that made the income data for 2010 non-comparable to the other censuses.
We treat men and women separately, as their wages and labor-market participation rates were quite different. Accordingly, for each census year, two samples, one for each sex, are extracted from the census data files and are treated separately. In both cases, individuals younger than 15 years of age are dropped from the sample, as well as individuals who did not work in that year or for whom the information on weeks worked is missing. Earnings here refers to annual wage and salary income and net self-employment income. Statistics Canada typically rounds incomes to integer multiples of CAN 1000. Earnings are stated in thousands of 2005 (Canadian) dollars.
Given Assumption 1, it is important to see to what extent the rounding of incomes, which inevitably creates an empirical distribution more discontinuous than one generated by sampling from a genuinely differentiable distribution, has an effect on our asymptotic standard errors. We took a subsample of just 1000 observations from the dataset for men from the 2000 census, and smoothed the data by adding noise generated by the Epanechnikov kernel. To each income y, measured in dollars rather than thousands of dollars, the added noise is given by
where the bandwidth . The asymptotic standard errors computed from the smoothed data differed by less than one percent from those computed from the census data.
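The smoothing step can be sketched in Python as below. Draws from the Epanechnikov density 0.75(1 − x²) on [−1, 1] are generated with the well-known three-uniforms rejection-free rule; whether the authors used this particular sampler is not stated, so it should be read as one convenient choice. The function names are hypothetical, and the bandwidth in the usage lines is a placeholder rather than the value used in the paper.

```python
import random
import statistics


def epanechnikov_draw(rng):
    """One draw from the Epanechnikov density 0.75 * (1 - x**2) on [-1, 1]."""
    u1 = rng.uniform(-1.0, 1.0)
    u2 = rng.uniform(-1.0, 1.0)
    u3 = rng.uniform(-1.0, 1.0)
    # Three-uniforms rule: discard the last draw if it is largest in absolute value.
    if abs(u3) >= abs(u2) and abs(u3) >= abs(u1):
        return u2
    return u3


def smooth_incomes(incomes, bandwidth, seed=0):
    """Add bandwidth-scaled Epanechnikov noise to each (rounded) income."""
    rng = random.Random(seed)
    return [y + bandwidth * epanechnikov_draw(rng) for y in incomes]


if __name__ == "__main__":
    rounded = [1000.0 * k for k in range(1, 11)]  # synthetic rounded incomes, in dollars
    placeholder_bandwidth = 500.0  # illustrative only, not the paper's value
    print(smooth_incomes(rounded, placeholder_bandwidth))
    draws = [epanechnikov_draw(random.Random(i)) for i in range(1000)]
    print(statistics.pvariance(draws))  # should be near 1/5
```

The Epanechnikov kernel has variance 1/5, so the added noise has standard deviation equal to the bandwidth divided by the square root of five.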
Density estimates were given by the approach outlined in Appendix B. We experimented with different values of the parameter n using samples drawn from the lognormal distribution, for which the density is known analytically. It appeared that a larger value of n gave more accurate estimates, but that numerical overflow occurred in the computation of the gamma function for values of n greater than around 170. We found that setting gave satisfactory results, although other choices in the neighborhood of 100 gave results that were not markedly different.
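The overflow threshold mentioned here is a property of double-precision arithmetic and is easy to reproduce. The short Python sketch below shows where the gamma function stops being representable and illustrates the log-gamma function as a common workaround when only ratios of gamma values are needed. Whether the Appendix B estimator can be rewritten in terms of such ratios is not stated in the text, so this is offered only as a general numerical remark; the helper names are hypothetical.

```python
import math


def gamma_overflows(x):
    """Return True if math.gamma(x) overflows a double-precision float."""
    try:
        math.gamma(x)
    except OverflowError:
        return True
    return False


def log_gamma_ratio(x, y):
    """log(Gamma(x) / Gamma(y)), computed without forming either gamma value."""
    return math.lgamma(x) - math.lgamma(y)


if __name__ == "__main__":
    for n in (100, 170, 171, 172):
        print(n, gamma_overflows(float(n)))
    # A ratio of two gamma values that are individually too large to represent.
    print(math.exp(log_gamma_ratio(201.0, 200.0)))
```

Working on the log scale keeps intermediate quantities finite even when the individual gamma values are far beyond the floating-point range.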
In Table 1, results are shown for men in 2000. The entries for are the upper income cutoff for group L and the lower income cutoff for group H. For group M, the entry is the sample median. Asymptotic standard errors are in brackets.
Table 1.
Men in 2000.
| Group | Cut-off or median | Population share | Income share | Mean income | Relative mean income |
| --- | --- | --- | --- | --- | --- |
| L | 17.7420 | 0.2702 | 0.0500 | 7.7588 | 0.1851 |
| | | (0.0007) | (0.0002) | (0.0271) | (0.0006) |
| M | 35.4840 | 0.5811 | 0.5745 | 41.4371 | 0.9886 |
| | (0.0770) | (0.0012) | (0.0019) | (0.0937) | (0.0018) |
| H | 70.9681 | 0.1487 | 0.3755 | 105.8242 | 2.5248 |
| | | (0.0009) | (0.0019) | (0.3020) | (0.0045) |
Sample size is 227,828, and the estimates of A and B are 0.8603 and 0.4362, respectively.
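The reported figures can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the population-share standard errors from the point estimates in Table 1, using closed-form variances of the kind derived in Section 2.1, and checks that each relative mean equals the income share divided by the population share. Two assumptions are made: the variance expressions are a reconstruction based on influence values I(Y ≤ cm) − C·I(Y ≤ m), and the two reported density-ratio estimates are assigned as A = 0.8603 (for the upper cut-off) and B = 0.4362 (for the lower cut-off). Under these assumptions the standard errors come out at about 0.0007, 0.0012, and 0.0009 for L, M, and H, in line with the table.

```python
import math

# Values transcribed from Table 1 (men, 2000).
N = 227_828
POP_SHARE = {"L": 0.2702, "M": 0.5811, "H": 0.1487}
INC_SHARE = {"L": 0.0500, "M": 0.5745, "H": 0.3755}
REL_MEAN = {"L": 0.1851, "M": 0.9886, "H": 2.5248}
# Assumed assignment of the two reported density-ratio estimates.
RATIO_A, RATIO_B = 0.8603, 0.4362


def pop_share_std_errors(ps_low, ps_high, ratio_a, ratio_b, n):
    """Standard errors of the L, M, H population shares from closed forms."""
    f_b = ps_low          # F(b*m): share at or below the lower cut-off
    f_a = 1.0 - ps_high   # F(a*m): share at or below the upper cut-off
    var_low = f_b * (1 - f_b) + ratio_b ** 2 / 4 - ratio_b * f_b
    var_high = f_a * (1 - f_a) + ratio_a ** 2 / 4 - ratio_a * (1 - f_a)
    cov = (f_b * (1 - f_a) - ratio_b * (1 - f_a) / 2
           - ratio_a * f_b / 2 + ratio_a * ratio_b / 4)
    var_mid = var_low + var_high - 2 * cov
    variances = {"L": var_low, "M": var_mid, "H": var_high}
    return {g: math.sqrt(v / n) for g, v in variances.items()}


def implied_relative_means(inc_share, pop_share):
    """Relative mean income implied by income share / population share."""
    return {g: inc_share[g] / pop_share[g] for g in pop_share}


if __name__ == "__main__":
    print(pop_share_std_errors(POP_SHARE["L"], POP_SHARE["H"], RATIO_A, RATIO_B, N))
    print(implied_relative_means(INC_SHARE, POP_SHARE))
```

The same check applied with the two density ratios swapped gives standard errors that do not match the table, which is the reason for the assignment assumed here.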
Table 2 shows the corresponding results for women in 2000.
Table 2.
Women in 2000.
Table 3.
Men in 2005.
Table 4.
Women in 2005.
The sample sizes for these four tables of basic distributional results are quite large; so, it should perhaps not be surprising that the asymptotic standard errors are quite small, and all the reported statistics in these basic tables are highly statistically significant. They involve averages or proportions, which seem to be robustly estimated. The estimates of A and B are also all quite sensible in that they imply that the estimated density ratio is considerably larger than —which is what one would expect for a right-skewed distribution such as for an earnings distribution.
Table 5 and Table 6 show the differences in outcomes between men and women for the years 2000 and 2005, with asymptotic standard errors for these differences in parentheses. A positive difference means that the relevant outcome is greater for men than for women; a negative difference means the reverse. Again, all the differences are highly statistically significant. Two results are evident. First, in both years, men were relatively more concentrated in the middle-income group, with women relatively more concentrated in the lower- and higher-income groups within each distribution. This is consistent with a larger share of part-time workers among women as well as generally higher levels of education for women than for men in recent decades. Second, the earnings gap between men and women changed very little within the lower- and middle-income groups over 2000–2005. But in the higher-income group, men’s earnings shot up quite dramatically compared to women’s over this period.
Table 5.
Differences men–women in 2000.
Table 6.
Differences men–women in 2005.
Table 7 and Table 8 present differences or changes over time in the distributional outcome measures between 2000 and 2005, separately for men and women. For outcomes that were greater in 2005 than in 2000, the differences are positive. Again, asymptotic standard errors are in parentheses, and again, all but one of the changes are highly statistically significant. Here, the changes are quite dramatic given that major distributional changes have typically been rather slow and gradual over time. For both men and women, the proportion of workers in the middle-income group fell substantially between 2000 and 2005, as did the relative-mean incomes of the middle group. On the other hand, mean earnings levels in the higher-income group went up dramatically. As a result, the earnings share of the middle group of so-called middle-class earners markedly declined and was made up by a corresponding dramatic rise in the earnings share of the higher-income group. This pattern occurred for both women and men in the Canadian labor market between 2000 and 2005, but the changes were two to three times stronger in the earnings distribution for men than for women.
Table 7.
Differences 2000–2005 for men.
Table 8.
Differences 2000–2005 for women.
Table 9 and Table 10 further pursue this significant pattern of change and show results for several measures of polarization within the earnings distributions (see Section 3.2 above). Table 9 focuses on population shares or the proportion of workers towards the two ends of the distributions, while Table 10 bases alternative polarization measures on mean earnings gaps over the ends of the distributions. Again, in both sets of polarization measures, one finds broadly similar patterns of change for both men and women (though with some differences). In the case of the population-share-based measures (Table 9), the general polarization of workers out of the middle-class region was driven by an increased proportion of workers in the H earnings group among men but by an increased proportion of workers in the L earnings group among women. In the case of the earnings-gap measures (Table 10), the greatly widening gaps in earnings between groups in the distributions are almost entirely driven by the widening gap between middle-class and higher earnings levels, for both men and women in the labor market. Again, the changes are about twice as strong among men as among women workers, and again, the results are highly statistically significant.
Table 9. Measures of polarization I.
Table 10. Measures of polarization II.
Finally, Table 11 and Table 12 display estimates of and changes in the compound polarization measures (in Section 3.2) that combine the population share and earnings gap dimensions. As can be seen, for both men and women, changes in the upper end of the earnings distributions over the 2000–2005 period were much greater than changes in the lower end of the distributions. For women, the upper-end changes were about twice as large, while for men they were about eight times as large. Clearly, the large changes have been occurring between the middle-class earnings group and the higher-earnings group. This argues for the use of separate polarization measures for the lower and upper ends of the distribution rather than one that blends the two and thus potentially hides the basic structural changes that are going on over the different regions of the distribution and in the Canadian labor market. Note also that, for men, both components of the upper-end compound measure contribute to the large increases in earnings polarization (both the increase in the population-share component and the rising earnings gap), while for women the increase in the compound measure is driven completely by rapidly rising upper earnings levels. Again, these polarization changes are all highly statistically significant. Because our sample sizes are large, our asymptotic results seem to be reliable, as illustrated by the simulation evidence presented in Appendix D.
Table 11. Compound polarization measures.
Table 12. Changes in polarization measures 2000–2005.
The actual explanations for these major changes are fairly complex and overlapping; examples include skill-biased automation, globalization and deindustrialization, sectoral and demographic shifts, increased industrial concentration, and weakened private-sector unionization rates. We refer the reader to Beach (2016, 2025), among others, for more extensive discussion of the leading structural explanations of the observed distributional changes and of their possible policy implications.
One might want to follow up on the above results by investigating possible intra-group dynamics within any of the income groups.1 Since the choice of the a and b cut-off scalars is arbitrary, one could redo part or all of the above empirical analysis with different values of a and b, possibly highlighting specific narrower regions of the income distribution. Instead, the authors would recommend using a quantile-based analysis, possibly quite refined, as provided in Beach and Davidson (2025). The corresponding variance–covariance formulas in a quantile-based approach are simpler to use and are distribution-free, so that no density estimation steps need to be undertaken. Indeed, the authors view these two papers as complementary; between them, they provide a quite extensive toolbox of distributional statistics for examining possibly quite disaggregated patterns of distributional change.
5. Conclusions
This paper considers income distributions that are divided into lower, middle, and upper regions based on separating points that are scalar multiples of the median. For example, the lower region (L) could consist of recipients with incomes less than half the median, the middle group (M) could include those with incomes between 50 percent and 200 percent of the median, and the higher income group (H) could include those with incomes above twice the median. Such a characterization of an income distribution is very useful in evaluating changes over time in the economic experience of the middle-class income group and in the nature of polarization in the distribution. For each of these three income groups, separate estimates are obtained for their income shares, their group sizes or population shares, and their mean income levels. The paper derives explicit formulas for the asymptotic variances (and, hence, standard errors) of sample estimates of the groups’ population shares, income shares, and mean incomes. It is shown that these formulas are not distribution-free, but that a density-estimation technique of Comte and Genon-Catalot (2012) is well suited to provide the needed data-based density estimates in empirical income distribution analyses. The results are then applied to derive asymptotic variances for relative mean income ratios for each income group, for various polarization measures, and for decile-mean income ratios. This statistical framework is implemented with Canadian Census public-use microdata files in order to investigate some of the key features of changes in the Canadian earnings distribution.
It is found that population and income shares and income-group means can indeed be estimated with a high degree of reliability. Major patterns of distributional change that have been previously highlighted in the literature have indeed been found to be highly statistically significant. The distributional framework and statistical approach used in this paper thus allow one to move beyond descriptive analysis of distributional change to a formal framework of statistical inference and hypothesis testing.
Further, since a group's income share equals its population share multiplied by its relative mean income, changes in group income shares have been found to arise from changes in both population shares and relative mean incomes. Estimating these two dimensions separately allows for (i) a rich economic interpretation and testing of the driving factors behind distributional change and (ii) an extensive characterization (and hence better understanding) of polarization as a key aspect of on-going distributional change.
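The decomposition behind this statement can be written out explicitly. The symbols below are our own shorthand and may differ from the notation used in the body of the paper: $P_i$ is the population share of group $i$, $\mu_i$ its mean income, $\mu$ the overall mean income, $RMI_i$ the relative mean income, and $IS_i$ the income share, for $i = L, M, H$.

```latex
% Assumed notation: P_i population share, \mu_i group mean, \mu overall mean.
\[
  IS_i \;=\; \frac{P_i\,\mu_i}{\mu} \;=\; P_i \times RMI_i ,
  \qquad RMI_i \;=\; \frac{\mu_i}{\mu},
\]
% so that a change over time splits additively in logarithms:
\[
  \Delta \ln IS_i \;=\; \Delta \ln P_i \;+\; \Delta \ln RMI_i .
\]
```

As a check with the low-income values reported in Table A2, and assuming an overall mean of $e^{1/2} \approx 1.6487$ (the mean of a standard lognormal distribution), $0.2441 \times 0.3054 / 1.6487 \approx 0.0452$, which agrees with the tabulated income share.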
The results of this paper suggest that official government statistical agencies, such as Statistics Canada and the U.S. Bureau of the Census, may wish to consider providing median-based estimates of population shares, income shares, and income-group means to complement their regularly published series on decile income shares and decile means. They could also provide user information on the general reliability of these estimates. Since the deciles and decile means, which official agencies already provide, and the median-based statistics provided in this paper are usefully complementary, together they would offer a much better source of distributional information on which to base possible policy initiatives to improve policy design and targeting. For example, one might ask what the appropriate income range is for so-called middle-class income tax cuts, COVID-19-response temporary income support programs, or possibly for wage or employment adjustment programs in the face of major tariff impact adjustments.
Author Contributions
Conceptualization, C.M.B. and R.D.; methodology, C.M.B. and R.D.; software, R.D.; formal analysis, R.D.; writing—original draft, C.M.B.; writing—review and editing, R.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data used to generate the results in this paper were all extracted from the Canadian Census Public Use Microdata Files (PUMF).
Acknowledgments
Davidson’s research was supported by a Distinguished James McGill Professorship at McGill University. We thank participants at the 2024 meeting of the Canadian Econometric Study Group, especially Pujee Tuvaandorj, for valuable comments on the paper. We are grateful to the late Aidan Worswick, research assistant to both authors, for providing us data in a manageable form.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CDF | cumulative distribution function |
| EDF | empirical distribution function |
| RMI | relative mean income |
Appendix A. Detailed Calculations
- (a)
- Variance of
Recall from (33) that
Since , , and , it follows that
Next,
For the expectation of this, we use (7) for , (22) for , and (A8) for . Thus,
By collecting coefficients of powers of B, we see that
while from (A1), we have
and so
- (b)
- Variance of
From (38), we have
and so
From (39), it is easy to see that
It then follows that is
From (40), we have
and so we conclude that the asymptotic variance of is
- (c)
- Variance of
Recall the definition (44):
whence
Note that
so that and . Further,
so that
Next,
so that
Then,
so that
From all this, we see that
To ease notation in the above expression, write
as in (46). We get
Now,
and so
as stated in (45).
- (d)
- Variances of the ,
Consider first . With , (49) suggests the random variable
The asymptotic variance of is then . An easy calculation shows that
Then,
The expectation of follows from (27), and that of is given by (7). We have
It is easy to show that
so that
Thus, we have
Some algebra lets us calculate from (27), (7), (A7), and (A10). The result is
It is immediate from (9) that
Analogously to (A8) and (A9), we find that
Now, , and so, from (A15) and (A16), we see that
So, from (A12)–(A14), and (A17), we obtain the result
Although we can derive the asymptotic variance of along similar lines as above for and , this leads to expressions that are neither simple nor intuitive. A simpler procedure is to note from (49) that the asymptotic variance of is equal to
The asymptotic variances of and are given by (32) and (17), respectively; see also the summary of results Section 2.4.
The asymptotic covariance of and is the covariance of and . The details of the calculation of the covariance are in the next subsection (e) of this Appendix A.
- (e)
- Covariance of and
Recall that what we need is the covariance of and . With , we have
from which, we see that
Thus,
and
Therefore,
- (f)
- Covariances of estimates of income group means
For the purposes of evaluating the reliability of income polarization estimates, , , and , it is necessary to calculate the asymptotic covariances of the income group means. For the case of , we use the result that
By use of the same approach to the evaluation of asymptotic variances for income group means as set out in Section 2 one obtains
Since and , it follows that this is strictly positive.
For the case of , we have
For , we have
- (g)
- Compound measures
Throughout this section, the results collected in Table A1 will be freely used in the calculations.
Each of the compound measures in Section 3.2 involves the product of two terms, for instance,
We see that
All of the asymptotic variances above are given in Section 2.4 and the covariance of and in Equation (A20). What remains is to compute the two asymptotic covariances with .
First we consider . It is equal to the covariance of in (4) and in (33):
In similar fashion, the asymptotic covariance of and is the covariance of and in (44):
Here,
while
recall the definition (46) of D. Thus,
Consider next the case of , for which we need the asymptotic covariances with of and . The first of these is
which, after some algebra, becomes
Similarly, the asymptotic covariance of and is , where
The last compound polarization measure defined in Section 3.2 is , which was defined as . For this, we need the covariances with of and . First,
In addition,
Appendix B. Density Estimation on the Positive Real Line
In most applications, the support of the distribution F is a subset of the positive real line. However, it is known that, in this case, ordinary kernel density estimates are biased downwards near the lower boundary of the support. A possible way around this difficulty is to transform the data, by taking logarithms for instance, and to compute kernel density estimates of the transformed data, which can then be multiplied by the Jacobian of the transformation to obtain estimates of the density of the positive data.
A better approach is suggested by Comte and Genon-Catalot (2012), where it is unnecessary to transform the data. Here is a brief description of their approach, roughly quoted from their paper. Instead of a Gaussian or Epanechnikov kernel defined for both positive and negative arguments, consider a density function $K$ defined on the positive real line, with expectation equal to 1. Let $U_1, \ldots, U_n$ be an IID set of random variables with distribution characterized by the density $K$. Then, the density of the mean $n^{-1}\sum_{j=1}^{n} U_j$ is given by $K_n(u) = n K^{*n}(nu)$, where $K^{*n}$ is the $n$-fold convolution of $K$ with itself. As $n \to \infty$, the distribution with density $K_n$ converges to a point mass at 1. The proposal is to estimate the density $f(x)$ for $x > 0$ by
using the random sample $X_i$, $i = 1, \ldots, N$. The motivation they give is as follows:
In usual kernel methods, the intuition is that the estimation at x counts the number of observations such that is close to 0. In our strategy, the intuition is that the estimator at x counts the number of observations such that is close to 1.
They also point out that plays the same role here as does the bandwidth in conventional kernel methods.
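The paper provides some examples of functions K for which the density of the mean of n IID draws from K can be computed analytically. The easiest of these has K equal to the density of the exponential distribution with expectation 1, which is also the gamma distribution with shape parameter unity: $K(u) = e^{-u}$ for $u > 0$, from which it can be shown that the mean of $n$ such draws has the gamma density $K_n(u) = \dfrac{n^n}{(n-1)!}\,u^{n-1}e^{-nu}$.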
With this choice, (A27) becomes
Asymptotic theory requires that as , but the guidelines as to how fast or how slowly in Comte and Genon-Catalot are very loose:
In Section 4, we discuss how we chose n for the datasets considered in the empirical work.
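The following is a minimal sketch of a density estimator of this type with the exponential choice of K. The estimator form used here, $(Nx)^{-1}\sum_i K_n(X_i/x)$, is an assumption on our part that is consistent with the verbal description above (it weights observations whose ratio to x is close to 1); it should be checked against Comte and Genon-Catalot (2012) before being relied upon. The function names, the sample size, and the value n = 200 are purely illustrative.

```python
import math
import random


def mean_of_exponentials_kernel(u, n):
    """Density at u of the mean of n IID Exp(1) draws: Gamma(shape=n, rate=n)."""
    if u <= 0.0:
        return 0.0
    # Work in logs to avoid overflow in n**n and (n - 1)!; lgamma(n) = log((n-1)!).
    log_density = n * math.log(n) + (n - 1) * math.log(u) - n * u - math.lgamma(n)
    return math.exp(log_density)


def positive_density_estimate(x, sample, n):
    """ASSUMED estimator form: (1 / (N * x)) * sum_i K_n(X_i / x), for x > 0."""
    total = sum(mean_of_exponentials_kernel(x_i / x, n) for x_i in sample)
    return total / (len(sample) * x)


# Illustration on simulated standard lognormal data, whose median is 1 and
# whose density at the median is 1 / sqrt(2 * pi).
rng = random.Random(12345)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]
estimate_at_median = positive_density_estimate(1.0, sample, n=200)
true_value = 1.0 / math.sqrt(2.0 * math.pi)
print(f"estimate = {estimate_at_median:.4f}, true value = {true_value:.4f}")
```

Larger values of n concentrate the kernel more tightly around 1, which lowers bias and raises variance, much as a smaller bandwidth does in a conventional kernel estimator.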
We conducted a small simulation experiment to see to what extent the approach of Comte and Genon-Catalot described here is reliable and to compare its performance with that of conventional kernel density estimates. For the latter, we used the Epanechnikov kernel with a bandwidth proportional to the interquartile range of the sample of size n. For 10,000 replications with samples of size 10,001 (an odd number, so that the sample median is uniquely defined) drawn from the lognormal distribution, we computed realizations of the estimated density at a times the median, the estimated density at the median, and B, for a = 0.3, with the densities estimated both as kernel density estimates and as described here, and compared them with the true values for the lognormal distribution. The results, shown below, leave no doubt as to the reliability of the method described here and the unreliability of the kernel density estimates.
| | Density at a × median | Density at median | B |
|---|---|---|---|
| True value | 0.644203 | 0.398942 | 0.484433 |
| Kernel density estimate | 0.533028 | 0.403896 | 0.396132 |
| This method | 0.640649 | 0.401252 | 0.482013 |
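The "True value" row can be checked directly. The sketch below assumes a standard lognormal distribution (log-mean 0, log-standard deviation 1, hence median 1) and a = 0.3; both assumptions are inferred from the tabulated numbers rather than stated in the text. The ratio computed in the last step coincides numerically with the third tabulated value; the formal definition of B is the one given in the body of the paper, and the ratio is offered only as an observation.

```python
import math


def standard_lognormal_pdf(x):
    """Density of exp(Z) with Z ~ N(0, 1), for x > 0."""
    return math.exp(-0.5 * math.log(x) ** 2) / (x * math.sqrt(2.0 * math.pi))


a = 0.3        # assumed lower cut-off scalar for this experiment
median = 1.0   # median of the standard lognormal distribution

density_at_cutoff = standard_lognormal_pdf(a * median)
density_at_median = standard_lognormal_pdf(median)
# Observation only: this ratio matches the third tabulated "True value".
ratio = a * density_at_cutoff / density_at_median

print(round(density_at_cutoff, 6), round(density_at_median, 6), round(ratio, 6))
```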
Appendix C. Algorithm
Here is a detailed algorithm for the computation of estimates of the numerous measures presented in this paper and of their standard errors.
- Select the cut-off parameters a and b needed to define the three income groups. (We used a = 0.5, b = 2.0.)
- Choose a base unit of account. (Here, it has been thousands of 2005 constant Canadian dollars.) Convert raw income measures in the sample to the chosen unit of account, and sort the converted data.
- Compute the mean income , the mean squared income , and the variance of the sample.
- Compute the sample median and the two cut-off incomes, namely a and b times the sample median.
- By use of the approach described in Appendix B, or otherwise, obtain the estimates and of the density at the cut-off incomes, and the estimates and .
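- Count the number of data points with incomes in the three groups defined, respectively, by incomes below the lower cut-off, between the two cut-offs, and above the upper cut-off, and divide these numbers by the sample size N in order to obtain the estimated population shares of the low-, middle-, and high-income groups.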
- Compute asymptotic standard errors for the estimated population shares using the formulas in Section 2.4.
- Obtain estimates , , and of the quantities , , and , respectively. This can be achieved by averaging the incomes in the low-income group, incomes less than the median, and those in the low- and middle-income groups combined, respectively. In addition, obtain estimates and by averaging squared incomes in the relevant groups.
- Compute the estimated income shares: ; ;.
- Compute the estimated income group means: , , , and . In addition, obtain ,, .
- Compute the estimated relative mean income ratios using (48).
- Obtain the estimated asymptotic variances for population shares, income shares, and group mean incomes by use of the formulas in Section 2.4. For the relative mean income ratios, estimated asymptotic covariances are given by (A11), (A18), and (53).
- Standard errors are found by dividing the asymptotic variances by the sample size N and taking square roots.
- The above computations provide all information necessary for the polarization measures introduced in Section 3.
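The point-estimate steps of the algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function and key names are hypothetical, observations exactly at a cut-off are assigned to the middle group, and the asymptotic variances themselves must come from the formulas in Section 2.4, which are not reproduced here. Only the final conversion of a variance into a standard error is shown.

```python
import math
import statistics


def median_based_groups(incomes, a=0.5, b=2.0):
    """Population shares, income shares, means, and relative means for L, M, H."""
    data = sorted(incomes)
    n_obs = len(data)
    total = sum(data)
    mean_all = total / n_obs
    median = statistics.median(data)
    lower_cut, upper_cut = a * median, b * median
    # Assumption: boundary observations are placed in the middle group.
    groups = {
        "L": [y for y in data if y < lower_cut],
        "M": [y for y in data if lower_cut <= y <= upper_cut],
        "H": [y for y in data if y > upper_cut],
    }
    results = {}
    for name, members in groups.items():
        group_mean = sum(members) / len(members) if members else float("nan")
        results[name] = {
            "population_share": len(members) / n_obs,
            "income_share": sum(members) / total,
            "mean": group_mean,
            "relative_mean": group_mean / mean_all,
        }
    return results


def standard_error(asymptotic_variance, n_obs):
    """Standard error implied by an asymptotic variance and the sample size N."""
    return math.sqrt(asymptotic_variance / n_obs)


# Hypothetical toy data (median 50, cut-offs 25 and 100).
toy = [10, 20, 30, 40, 50, 60, 70, 80, 200]
for name, stats in median_based_groups(toy).items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

In each group, the income share equals the population share times the relative mean income, which provides a quick internal consistency check on any implementation.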
Table A1. Table of expectations.
| Reference | Random Variable | Expectation |
|---|---|---|
| Y | ||
| (5) | ||
| (5) | ||
| (A9) | ||
| (A16) | ||
| (26) | ||
| (26) | ||
| (11) | ||
| (7) | ||
| (7) | ||
| (22) | ||
| (A8) | ||
| (A15) | ||
| (25) | ||
| (A1) | ||
| (40) | ||
| (A6) |
Appendix D. Simulation Evidence
Simulations were run in order to see to what extent the numerous estimates produced by the algorithm do indeed approximate finite-sample properties. The simulated data were generated from a lognormal distribution, and each simulated sample consisted of IID drawings from this distribution. As in the empirical work in Section 4, the parameters a and b are set to 0.5 and 2.0, respectively. The true values of all the estimated properties are readily computed for the lognormal distribution.
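To illustrate how such true values can be obtained, here is a short sketch. It assumes a standard lognormal distribution (log-mean 0, log-standard deviation 1, so that the median is 1 and the mean is $e^{1/2}$) with cut-offs at 0.5 and 2.0 times the median; under these assumptions it reproduces the "Value" rows of Table A2 to four decimal places. The function name and the dictionary layout are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_group_values(a=0.5, b=2.0):
    """True population shares, income shares, and means for a standard lognormal."""
    phi = NormalDist().cdf
    mean_all = math.exp(0.5)  # E[Y] for Y = exp(Z), Z ~ N(0, 1)
    # P(Y < c) = Phi(ln c); E[Y 1(Y < c)] / E[Y] = Phi(ln c - 1).
    pop_low = phi(math.log(a))
    pop_high = 1.0 - phi(math.log(b))
    share_low = phi(math.log(a) - 1.0)
    share_high = 1.0 - phi(math.log(b) - 1.0)
    pops = {"L": pop_low, "M": 1.0 - pop_low - pop_high, "H": pop_high}
    shares = {"L": share_low, "M": 1.0 - share_low - share_high, "H": share_high}
    return {
        g: {
            "population_share": pops[g],
            "income_share": shares[g],
            "mean": shares[g] * mean_all / pops[g],
        }
        for g in ("L", "M", "H")
    }


for group, values in lognormal_group_values().items():
    print(group, {key: round(value, 4) for key, value in values.items()})
```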
For each of 100,000 replications, realizations were obtained for the estimated population share, income share, and mean income of each of the three income groups (L, M, and H). The variances of these realizations were computed and then multiplied by the sample size n, since the theoretical work concerns asymptotic variances. The estimates of the theoretical asymptotic variances, as given in the summary of results in Section 2.4, were also computed for each replication and then averaged over all of them. In some cases, a second estimate of an asymptotic variance was obtained for each replication as the sample variance of quantities like those defined in (6). These too were averaged over the replications. Table A2 below gives the averages of the point estimates, and Table A3 the averages of the variance estimates.
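A scaled-down version of this experiment, restricted to the population share of the low-income group, might look as follows. Everything here is a sketch under assumptions: standard lognormal data, a = 0.5, and far fewer replications and a smaller sample size than in the paper. The closed-form expression in `delta_method_variance` is our own first-order (Bahadur-type) derivation, included only as a cross-check; it is not copied from Section 2.4, although it evaluates to approximately 0.1472, the first entry of Table A3.

```python
import math
import random
import statistics
from statistics import NormalDist


def low_group_share(sample, a=0.5):
    """Fraction of observations below a times the sample median."""
    cut = a * statistics.median(sample)
    return sum(1 for y in sample if y < cut) / len(sample)


def scaled_variance_of_share(n_obs=1001, replications=400, a=0.5, seed=2025):
    """n times the variance, across replications, of the estimated share."""
    rng = random.Random(seed)
    realizations = []
    for _ in range(replications):
        sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n_obs)]
        realizations.append(low_group_share(sample, a))
    return n_obs * statistics.variance(realizations)


def delta_method_variance(a=0.5):
    """Our own first-order variance for the low-group share (requires a < 1)."""
    p = NormalDist().cdf(math.log(a))          # true population share
    # a * f(a * m) / f(m) for the standard lognormal, whose median m is 1.
    ratio = math.exp(-0.5 * math.log(a) ** 2)
    return p * (1.0 - p) + 0.25 * ratio ** 2 - ratio * p


print(f"simulated: {scaled_variance_of_share():.4f}")
print(f"first-order approximation: {delta_method_variance():.4f}")
```

With only a few hundred replications the simulated figure is noisy, so agreement to more than one or two significant digits should not be expected.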
Table A2. Point estimates.
| | Population share | Income share | Mean income |
|---|---|---|---|
| Value for low incomes | 0.2441 | 0.0452 | 0.3054 |
| Estimated value | 0.2439 | 0.0453 | 0.3052 |
| Value for middle incomes | 0.5118 | 0.3343 | 1.0768 |
| Estimated value | 0.5122 | 0.3352 | 1.0777 |
| Value for high incomes | 0.2441 | 0.6205 | 4.1910 |
| Estimated value | 0.2439 | 0.6195 | 4.1926 |
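Table A3. Estimates of asymptotic variances.
| Income group | Variance | Population share | Income share | Mean income |
|---|---|---|---|---|
| Low | Theoretical | 0.1472 | 0.0104 | 0.1551 |
| Low | Across replications | 0.1467 | 0.0103 | 0.1548 |
| Low | Average of estimated theoretical variances | 0.1480 | 0.0104 | 0.1546 |
| Low | Average of sample variances | 0.1483 | 0.0105 | 0.1547 |
| Middle | Theoretical | 0.2499 | 0.4282 | 2.4579 |
| Middle | Across replications | 0.2512 | 0.4317 | 2.4434 |
| Middle | Average of estimated theoretical variances | 0.2519 | 0.4359 | 2.4882 |
| Middle | Average of sample variances | 0.2522 | 0.4327 | 2.4956 |
| High | Theoretical | 0.1472 | 0.4938 | 52.6442 |
| High | Across replications | 0.1475 | 0.4934 | 52.6775 |
| High | Average of estimated theoretical variances | 0.1504 | 0.4971 | 53.1611 |
| High | Average of sample variances | 0.1504 | 0.4976 | 53.2143 |
Within each income group, the first row gives the theoretical asymptotic variances, as described in the summary of results in Section 2.4, evaluated at the true values for the lognormal distribution; the second row gives the variances of the sets of point estimates from all the replications, multiplied by the sample size; the third row gives the estimates of the theoretical variances averaged over the replications; and the fourth row gives the sample variances of quantities like those in (6), again averaged over the replications.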
Note
| 1 | The authors wish to thank an anonymous referee for raising this question. |
References
- Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.
- Beach, C. M. (2016). Changing income inequality: A distributional paradigm for Canada. Canadian Journal of Economics, 49(4), 1229–1292.
- Beach, C. M. (2025). Testing for Canadian distributional change: Declining middle class, rising top income shares and widening income gaps. Department of Economics (Working Paper No. 1531). Queen’s University.
- Beach, C. M., & Davidson, R. (2025). Quantile means and quantile share standard errors and a toolbox of distributional statistics. Econometric Reviews, 44, 1166–1185.
- Blanchet, T., Saez, E., & Zucman, G. (2022). Real-time inequality. Working Paper 30229. NBER.
- Comte, F., & Genon-Catalot, V. (2012). Density estimation for non negative random variables. Journal of Statistical Planning and Inference, 142, 1698–1715.
- Davidson, R. (2018). Statistical inference on the Canadian middle class. Econometrics, 6(1), 14.
- Guvenen, F., Pistaferri, L., & Violante, G. L. (2022). Global trends in income inequality and income dynamics: New insights from GRID. Quantitative Economics, 13, 1321–1360.
- Hoffman, F., Lee, D. S., & Lemieux, T. (2020). Growing income inequality in the United States and other advanced economies. Journal of Economic Perspectives, 34, 52–78.
- Horowitz, J. L. (2001). The bootstrap. In J. J. Heckman, & E. Leamer (Eds.), Handbook of econometrics (Vol. 5, pp. 3159–3228). Elsevier Science, B.V.
- Katz, L. F., & Murphy, K. M. (1992). Changes in relative wages, 1963–1987: Supply and demand factors. The Quarterly Journal of Economics, 107, 35–78.
- Lin, P.-E., Wu, K.-T., & Ahmad, I. A. (1980). Asymptotic joint distributions of sample quantiles and sample mean with applications. Communications in Statistics-Theory and Methods, 9(1), 51–60.
- Wolfson, M. C. (1994). When inequalities diverge. American Economic Review, 84, 353–358.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).