Top Incomes, Heavy Tails, and Rank-Size Regressions

: In economics, rank-size regressions provide popular estimators of tail exponents of heavy-tailed distributions. We discuss the properties of this approach when the tail of the distribution is regularly varying rather than strictly Pareto. The estimator then over-estimates the true value in the leading parametric income models (so the upper income tail is less heavy than estimated), which leads to test size distortions and undermines inference. For practical work, we propose a sensitivity analysis based on regression diagnostics in order to assess the likely impact of the distortion. The methods are illustrated using data on top incomes in the UK.


Introduction
Income distributions exhibit, like many other size distributions in economics and the natural science, upper tails that decay like power functions (see e.g., Schluter and Trede 2017). The recent and rapidly growing literature on top incomes focuses on this upper tail, and its presence has important consequences for the measurement of inequality. 1 However, estimating the heaviness of the upper tail is challenging, since real world size distributions usually are Pareto-like (i.e., tails are regularly varying) rather than strictly Pareto.
To be precise, let X 1 , . . . , X n be a sequence of positive independent and identically distributed random variables (e.g., incomes) with distribution function F that is regularly varying, so for large x where l is slowly varying at infinity, i.e., l(tx)/l(x) = 1 as x → ∞. The parameter γ, usually referred to as extreme value index (and 1/γ as the tail exponent), is unknown and needs to be estimated. Many estimators have been proposed in the statistical literature (see e.g., the textbook treatments in Embrechts et al. 1997 or Beirlant et al. 2004). An estimator popular among economists is based on a simple ordinary least squares (OLS) regression of log sizes on log ranks (e.g., Jenkins 2017 and Atkinson 2017, and references therein, in the income distribution and top incomes literature, this regression is ubiquitous in the city size literature). The enduring popularity of the OLS estimator is partly due to its simplicity, and partly due 1 See e.g., Schluter and Trede (2002) in the contexts of Lorenz curves, Davidson and Flachaire (2007) who propose a semi-parametric bootstrap, Cowell and Flachaire (2007) who advocate semi-parametric methods, or Burkhauser et al. (2012) who seek to reconcile survey and tax return data. Also observe that the p th moment of the income distribution is finite only if p < 1/γ, so very heavy tails can directly affect the validity of some standard inequality measurement tools. For instance, statistical inference for the Generalised Entropy index with parameter 2 requires the existence of the fourth moment (Cowell 1989). to a powerful intuition based on a Pareto quantile-quantile (QQ)-plot, the regression estimating its slope coefficient. However, if the tail of the distribution varies regularly, the Pareto QQ-plot will become linear only eventually. In particular, (1) can be expressed equivalently, using the tail quantile function U(x) = inf{t : Pr(X > t) = 1/x} where x > 1, as U(x) = x γl (x) wherel(x) is a slowly varying function. Hence, as x → ∞, log U(x) ∼ γ log(x) since then logl(x) → 0. Replacing these population quantities with their empirical counterparts gives the Pareto QQ-plot, and γ is its ultimate slope. This qualification (usually ignored by practitioners in economics) has important consequences for the behaviour of the estimator: Since the OLS estimator estimates the slope parameter of this QQ-plot, deviations from the strict Pareto model -captured by the nuisance function lwill induce distortions.
The empirical importance of this is illustrated in Figure 1, which depicts the Pareto QQ-plot for our administrative income data for the UK (the subject of our empirical application developed in Section 4 below), using the 1000 largest incomes. The plot exhibits a pronounced kink, and approximate linearity of the QQ-plot only holds for the very highest upper order statistics. Panel (b) shows the consequences for the OLS estimates: As we move in the QQ-plot from the right to the left, the departures from linearity become progressively more severe, and the OLS estimates progressively fall. Based on this first diagnostic QQ-plot, once the lower upper order statistics have been discarded as a source of downward bias, the subsequent analysis can then more clearly focus on the approximate linear part, the remaining distortions, and the choice of the number of order statistics. Figure 2 provides a further illustration for three Burr (Singh-Maddala) distributions (examined in detail in Section 3 below, being the leading parametric income distribution model) possessing the same γ. Here, the speed of decay of the nuisance function l is parametrised by the absolute value of the parameter ρ. The smaller the magnitude of ρ, the greater the initial curvature and steepness of the Pareto QQ-plot, and the larger the induced positive distortions of the OLS estimator of the slope coefficient. Estimates of γ for the k upper order statistics using the OLS regression (solid lines), and pointwise 95% symmetric confidence intervals (dashed lines). The distributional theory is stated in Equation (8).  In this paper, we examine the asymptotic distortions of the OLS estimator that arise in these circumstances, caused by the slow decay of the nuisance function l and modeled here as higher order regular variation. The theory is presented in Section 2 (proofs are collected in Appendix A), and numerical illustrations and quantifications of the distortions are provided in Section 3, as well as of the stark consequence for inference. More specifically, we show formally that the OLS estimator over-estimates the true value in the leading heavy-tailed model (i.e., the Hall class, which includes the Burr (Singh-Maddala) distribution, as well as the student, Fréchet, and Cauchy distributions). An empirical illustration in the context of top incomes in the UK using data on tax returns is the subject of Section 4.

The Log-Log Rank-Size Regression
We briefly review the rank size regression. Let X 1,n ≤ · · · ≤ X n,n denote the order statistics of X 1 , · · · , X n , and consider the k upper order statistics. Let ranks be shifted by a constant η < 1. The regression of sizes on ranks leads to the minimisation of the least squares criterion with respect to g, where η < 1 and 1 ≤ j ≤ k < n. The classic case is η = 0. However, since the OLS estimator of the slope coefficient is not invariant to shifts in the data, it is conceivable that a purposefully chosen shift could yield an asymptotic refinement (Gabaix and Ibragimov 2011 consider this in the strict Pareto model 1 − F(x) = cx −1/γ ). The analysis below allows for this possibility. The justification of considering regression (2) is based on a Pareto QQ-plot (Beirlant et al. 1996): For a sufficiently high threshold X n−k,n where k < n, the Pareto quantile plot in model (1) with coordinates (− log(j/(n + 1)), log X n−j+1,n ) j=1,··· ,k becomes ultimately linear. The line through point − log((k + 1)/(n + 1)), log X n−k,n ) with slope g is thus given by y = log X n−k,n + g[x + log((k + 1)/(n + 1))] and the data points are (x, y) = (− log(j/(n + 1)), log X n−j+1,n ) j=1,··· ,k . The regression estimator estimates this slope parameter. In particular, the OLS estimator of the slope coefficient g iŝ Note that the denominator D k is a Riemann approximation to 1 0 log 2 xdx = 2. An asymptotic expansion of the denominator reveals that From Kratz and Resnick (1996, proof of their Equation 2.4, p. 704) we know that the numerator N n,k converges in probability to 2γ, hence the estimator is weakly consistent:γ → P γ as k → ∞ and k/n → 0. We proceed in the next Section to refine this result by obtaining higher order expansions of the estimator in (3).
The literature contains several variants of regression (2). Rather regressing log sizes on log ranks, one could regress log ranks on log sizes, thus obtaining the 'dual' regression. In view of (3), our asymptotic analysis of the numerator carries immediately over to this dual regression. Another variant of (2) includes the additional estimation of a regression constant: log X n−j+1,n is regressed on a constant and log j. Kratz and Resnick (1996) obtain the distributional theory for this alternative estimator and show that its asymptotic variance is 2γ 2 /k, which exceeds, as will be shown below, the asymptotic variance ofγ given by (3). Hence this regression variant is less efficient. Schultze and Steinebach (1996) also prove weak consistency of the estimator in this setting.

Preliminaries: Higher Order Regular Variation
In order to obtain our asymptotic expansions, we use an equivalent representation of model (1) based on regular variation and extreme value theory. First we recall the definition of first-order regular variation, and then proceed to model the slowly varying nuisance function l in (1) by a refinement to second-order regular variation. We then show that most heavy-tailed distributions of interest (in the income, finance and urban literature) satisfy this condition.
It is well known that model (1) has the equivalent (first-order regular variation) representation (e.g., Dekkers et al. 1989) for all x > 0 where a is a positive norming function with the property a(t)/U(t) → γ. The problem for estimating the extreme value index γ is the behaviour of the slowly varying function l in (1). It is, therefore, common practice in the extreme value literature to model such second-order behaviour, thus strengthening model (1), by strengthening the first-order regular representation (5) to second-order regular variation. Following De Haan and Stadtmüller (1996), we assume that the following refinement of (5) holds for all This parameter ρ is the so-called second-order parameter of regular variation, and A(t) is a rate function that is regularly varying with index ρ, with A(t) → 0 as t → ∞. As ρ falls in magnitude, the nuisance part of l in (1) decays more slowly. Our numerical illustrations will thus consider small magnitudes for ρ.
Examples. Most heavy-tailed distributions of interest satisfy representation (6). Consider the Hall class of distributions (Hall 1982), given by, for large x, In this class, the nuisance function l in model (1) converges to a constant at a polynomial rate. The Hall class nests, for instance, the Burr (Singh-Maddala), Student, Fréchet, and Cauchy distributions. 2 The tail quantile function is This Hall class satisfies the second order representation (6) with ρ = γβ < 0, and rate function Figure 2 illustrates the role of ρ for the Burr distribution (examined in greater detail in Section 3) in terms of the Pareto QQ-plot, and the implications for the estimatorγ of its slope parameter. For ρ = −2 the plot is close to linear, and the estimates close to the population value. However, as ρ falls in magnitude, the initial curvature increases, and the slope estimates consequently becomes more positively distorted as the number of upper order statistics k entering the estimator increases.

The Main Results
We first state the higher order asymptotic expansion of the numerator N n,k . We then obtain the distributional theory for our estimatorγ, before returning to the distortions induced by deviations from the strict Pareto model (captured by second order regular variation).

Asymptotic expansion.
In theAppendix A we prove the following higher order expansion of the numerator N n,k under the assumption of second-order regular variation (6). Throughout, we will consider an intermediate sequence k = k n of positive integers such that k n → ∞ and k n /n → 0 as n → ∞. It is then true that, for γ > 0 and ρ < 0, A few comments are in order. The first two lines of this expression characterise the first-order behaviour of the numerator. It can be seen that setting the regression shift factor η to 1/2 eliminates the second and third term. However, the term O p log k/k 1/2 is still present. The asymptotic refinement due to second-order regular variation is given by the terms of line 3. Although A(t) → 0 as t → ∞, this decay might be slow: A(t) is regularly varying with index ρ, and as ρ falls in magnitude the nuisance part of l in (1) decays more slowly. A slow decay then introduces a noticeable distortion in finite samples. We examine these distortions after stating the distributional theory for the estimator. 2 The Burr distribution F (γ,ρ) (x) = 1 − (1 + x −ρ/γ ) 1/ρ is a member of the Hall class with parameters γ and ρ < 0, c = 1 and d = γ/ρ, as is the Student t δ distribution with δ degrees of freedom where γ = 1/δ, ρ = −2/δ, d = γBC −2γ , B = −0.5δ 2 (δ + 1)/(δ + 2), and C = Γ((δ + 1)/2)δ (δ−1)/2 /(δπ) 1/2 Γ(δ/2) (valid for δ > 2); so is the Fréchet distribution F γ (x) = exp(−x −1/γ ) with ρ = −1, c = 1, and d = −.5γ, and the Cauchy distribution with γ = 1, ρ = −2, c = 1/π, and d = −0.5π 2 . Distributional theory. Beirlant et al. (1996) observe that our slope estimatorγ, given by (3), is (to first order) a member of the class of kernel estimators discussed in Csorgo et al. (1985) with kernel K(t) = 1 − log t. Since 1 0 K(t)dt = 2 and not unity, a scale correction is required. Since 1 0 K 2 (t)dt = 5, the following result obtains as k → ∞ and k/n → 0, and if Higher order distortions. Asymptotically, the estimator is thus unbiased if √ kA(n/k) → 0. If this decay is slow, however, the estimator will suffer from a higher order distortion in finite samples. By (7), this distortion equals, for γ > 0 and ρ < 0, In particular, in the Hall model, A(t) = (ρ 2 /γ)dt ρ . The sign of the higher order distortion of N n,k and henceγ is, since ρ < 0, then given by -sgn(d). For the Burr (Singh-Maddala), student, Fréchet, and Cauchy distributions it can be shown that d < 0, leading to a positive higher order distortion. We conclude that the higher order distortion induced by higher order regular variation is positive for many popular distribution -i.e., for which the nuisance function l in model (1) converges to a constant at a polynomial rate-leading to an overestimation of γ. 3 Simulation evidence for these theoretical results is presented next. We also quantify the higher order distortions and the consequences for statistical inference about γ.

Numerical Illustrations
We illustrate numerically several of our results in a Monte Carlo study. First, we verify the distributional theory, then show that most of the empirical distortion is captured by the bias function b k,n . At the same time, we show that the distortions can be sizeable, leading to substantial test size distortions, while a bias correction using b k,n would reconcile nominal and actual test sizes.
Our Monte Carlo study is based on the Burr distribution, a member of the Hall class, parametrised here as F (γ,ρ) (x) = 1 − (1 + x −ρ/γ ) 1/ρ with parameters γ and ρ < 0. In the income distribution and inequality literature, this distribution is also know as the Singh-Maddala distribution, and used frequently in parametric income models. Specifically, we set γ = 2/3, and ρ = −1/2 to begin with. Qualitatively similar results are obtained for the student, Fréchet, and Cauchy distributions, all of which are members of the Hall class, and therefore not reported here. Since 1 < 1/γ < 2 we consider a situation of fairly heavy tails (as second moments of the distribution do not exist). However, the qualitative insights depend little on the actual choice of γ. We have chosen ρ = −1/2 as our leading example since we are interested in the consequences of deviating from a strict Pareto model. As ρ falls in magnitude the nuisance part of l in (1) decays more slowly. This is illustrated in Figure 2, where we depict three Pareto QQ-plots for different ρ. For ρ = −2, the plot is almost linear throughout. The deviations from the strict Pareto model become increasingly more pronounced in the left part of the plot as ρ falls in magnitude.
For the simulation study, we draw R = 1000 samples of size n = 10, 000 at first (then n = 1000), and consider the upper k order statistics. In order to choose a particular k, we follow standard practice and 3 De Haan and Ferreira (2006) consider the merit of shifting the tail for tail quantile functions U(t) = c 0 + c 1 t γ + c 2 t γ+τ + o(t γ+τ ) where c 0 and c 2 are not zero, c 1 > 0, and γ > 0 and τ < 0. It can then be shown that if τ < −γ, the second order parameter satisfies ρ = −γ. A data shift that eliminates c 0 then results in ρ = τ, so the post shift second order parameter has increased in magnitude, leading to a decrease in the induced distortion. However, the reverse reasoning also applies. In particular, the Hall model is , and increases the distortion if γ ≤ |ρ|. minimise the theoretical asymptotic Mean Squared Error (AMSE) (e.g., Hall 1982, or Beirlant et al. 1996, given by b 2 k,n + (1/k)(5/4)γ 2 , trading off distortion and dispersion. The theoretical higher order bias in γ induced by higher order regular variation in this Burr case is which is, of course, increasing in k. The theoretical AMSE is minimised around k * = 200, which also corresponds to the minimiser of the empirical AMSE based on the R samples. The mean ofγ at this k * is 0.739, and exceeds, as predicted by the theory, the population value γ = 2/3. Figure 3 depicts the results. In panel (a) we illustrate the distributional theory, given by (8), for k * , by plotting a kernel density estimate of √ k * γ (solid line), as well as a normal density with variance (5/4)γ 2 , centered on the empirical mean of the simulated data. The two are in close agreement. The figure also implies that any inferential problems are due to location shifts. In panel (b) we contrast the empirical distortions (solid line) with b k,n (dashed line).γ overestimates γ, and the distortion increases in k. It is evident that most of the distortion is captured by b k,n . In panel (c) we illustrate the consequences of the distortions for statistical inference, by plotting the empirical coverage error rates of the usual 95% symmetric confidence intervals. The higher order distortions lead to undermining inference because of the considerable size distortions. For instance, at k * , the empirical coverage error rate is 30% for a nominal 5% rate. Shifting the estimate by b k * ,n reduces the coverage error rate to 7%. Next, we consider the role of the sample size n. Reducing the sample sizes in the Monte Carlo to n = 1000 yields results that are in line with the above theory, and therefore not depicted. The bias ofγ increases by a factor predicted by the theory, namely b k,1000 /b k,10,000 = 10 1/2 = 3.16. The optimal k * shrinks by a factor of 4, as now k * = 50. The density of √ k * γ is in good agreement with the theory, and empirical coverage error rates at this k * are 32% for the uncorrected and 11% for the corrected estimator. The empirical coverage error rate for the uncorrected estimator rises steeply after k * , reaching 64% at k = 100. Reducing the sample sizes further to 100 results in k * = 20, and an empirical coverage error rate for the uncorrected estimator of 46% at this k * . Biases are increased by a factor b k,100 /b k,10,000 = 10.
Finally, we illustrate the importance of the speed of decay in the nuisance function l of model (1). As ρ falls in magnitude, the nuisance function l decays more slowly. For the Burr case with γ = 2/3, we depict in Figure 4 b k,n as ρ falls in magnitude for n = 1, 000 and selected k. While for ρ = −2 the distortions are negligible (in line with Figure 2, it is evident that for small magnitudes of ρ the higher order distortions cannot be ignored).  As the purpose of our simulation study is the provision of numerical evidence for our theory, we have used the theoretical bias function b k,n in the Burr case. When no such external knowledge is available, estimating the bias function requires non-parametric estimates of the second order parameter ρ and the function A(·). However, existing methods perform poorly, yielding excessively volatile estimates. The theory then informs a sensitivity analysis which is described in Section 4.1 in the context of our empirical application.

Empirical Illustration: Top incomes in the UK
Our empirical application uses administrative income tax return data are from the public-release files of the Survey of Personal Incomes (SPI) for the year 2009/10 (see e.g., Jenkins 2017 for a detailed description, and an analysis that includes rank size regressions). The SPI data underlie the UK top income share estimates in the World Top Incomes Database (WTID), and is a stratified sample of the universe of tax returns. The unit of taxation is the individual, and we use total taxable income as the income variable. The file contains 674,715 individuals, and we consider the n largest incomes.
In Figure 1 panel (a), we have depicted the Pareto QQ-plot for the 1000 largest incomes. It is evident that the data clearly reject a strict Pareto model: The plot exhibits a pronounced kink, and approximate linearity of the QQ plot only holds for the very highest upper order statistics. The function l in (1) captures this significant departure from the strict Pareto model. The Pareto QQ-plot thus conveys crucial information that is usually ignored by practitioners in economics, making it a key diagnostic device. For instance, a common mechanical approach is to set k by choosing 'blindly' (i.e., without reference to the Pareto QQ-plot) e.g., the top 1% or the top 1000 observations. Since the approximate linearity only obtains for about the 70 largest observations, the estimate of the slope parameter of the Pareto QQ-plot, i.e., the OLS estimator (3), will be severely biased if k is set to 1000 or higher. This is illustrated in panel (b) of the figure: The estimates fall for higher values of k, since the estimation procedure then attributes increasing weights to the left of the kink in the Pareto QQ-plot.
In the light of these observations, we restrict our subsequent analysis to the range of k in which the Pareto QQ-plot is approximately linear. We confirm this in Figure 5 panel (a), having restricted the plot to the n = 70 highest incomes. The plot now appears fairly linear. In panel (b), we depict the regression estimatesγ and the 95% symmetric pointwise confidence intervals. One first visual way of choosing an estimate is to consider an area of the plot where the estimate is fairly stable (as is done by inspecting Hill or so-called alternative Hill plots) and picking the largest such k since the variance of the estimate falls in k. Such subjective choice would be around k = 60 with an estimate ofγ = 1.070 (indicated by the horizontal faint line in the figure). 4 Overall, the visual method would suggest an estimate of γ between 0.9 and 1, implying very heavy tails. Taking into consideration the variability of the estimate, one cannot reject the hypothesis that the tail index be unity, i.e., Zipf's law. Returning to panel (a) we have also plotted the line with slope 1. This line does well in describing the data. We turn to a method that permits an objective choice of a particular k, and examine the remaining distortions in the estimate of γ.

Sensitivity Analysis, and the Choice of k
The preceding analysis has shown thatγ is likely to suffer from positive higher order distortions, captured by b k,n . Estimating this bias function requires non-parametric estimates of the second order parameter ρ and the function A(·), but existing methods perform poorly, yielding excessively volatile estimates. Hence we limit ourselves to a sensitivity analysis, taking ρ as a sensitivity parameter, whose objective is to gauge plausible values of the potential distortions based on diagnostics of the rank size regression. This approach is sketched next.
Following Beirlant et al. (1996), we observe that the mean weighted theoretical squared deviation for some coefficients c k depending only on k, and d k (ρ) depending on k and ρ (these are stated explicitly in the Appendix A). Set w j,k ≡ 1. An estimate of the mean theoretical deviation is the mean of the squared residuals k −1 SSR k of the rank size regression. In view of the usual bias-variance trade-off for our estimatorγ for fixed n, we ascribe all the measured deviation k −1 SSR k to the bias, thereby defining a very conservative bound, and let This conservative sensitivity analysis then consists of examiningγ −b k,n (ρ) for a range of values of ρ. Figure 5 panel (c) reports the results of such a sensitivity analysis for k being restricted to the n = 70 highest incomes. Since under this restriction the Pareto QQ-plot is approximately linear, we expect that the remaining distortions are fairly modest. This is borne out in the sensitivity plot, as the precise value of ρ now plays only a minor role.
Should a researcher wish to choose a particular k by minimising an approximation to the AMSE, Equation (10) is the basis of the procedure proposed in Beirlant et al. (1996): Apply two weighting schemes w (i) j,k (i = 1, 2), estimate the corresponding two mean weighted theoretical deviations using the residuals, and compute a linear combination thereof such that Var(γ) + b 2 k,n obtains. We have carried out this programme (see Appendix A for further details) for weights w (1) j,k ≡ 1 and w (2) j,k = j/(k + 1) for given ρ, and Figure 5 panel (d) depicts the results. Minimising this approximation to the AMSE yields k * (ρ), which, for ρ ∈ {−2, −1, −0.5}, resulted in k * = 58 across the selected ρ, for whichγ k * = 1.089 obtains. In view of the results depicted in panel (c) it is not surprising that changing ρ has only a small 4 Alternative estimators lead to similar conclusions. For instance, using the classic Hill estimator, at k = 60 an estimate of γ of 1.017 is obtained. The plot (not displayed here) is fairly stable around this value for k ∈ [20, 60].
effect. This estimate of γ is very close to the subjective visual choice ofγ of 1.075, reported above, based on Figure 5b.

Conclusions
The OLS estimator of the slope coefficient in the rank size regression (shifted or unshifted) can suffer significant higher order distortions that arise from the slow decay of the nuisance function l in Modeling the tail as second order regular variation, we have shown that the estimator over-estimates the true value in models in which l converges to a constant at a polynomial rate (i.e., in the leading heavy-tailed distributions). Our numerical illustrations have shown that these distortions can be dramatic, leading to test size distortions in which actual error rates are multiples of nominal error rates. The empirical illustration based on the Pareto QQ-plot has revealed a further distortion, namely the presence of a pronounced kink. Figure 1 has revealed that using the common rule to choose 1% of the observation for tail estimation would lead to a severe under-estimation of how heavy the tail is.
The higher order distortions are functions of A(·) and the second order regular variation parameter ρ. Since existing methods usually result in poor estimates of these, reliable bias corrections are not feasible. In view of this we have proposed a sensitivity analysis based on diagnostics from the rank size regression. When applied to our data on top incomes, we still cannot reject the hypothesis γ be unity, a situation often described in several fields as Zipf's law (e.g., Schluter and Trede 2017).
The simplicity of the regression estimator is undoubtedly the principal reason for its popularity among practitioners in economics. This paper has shown that in many situations the naive (i.e., 'blind') use of this estimator should be considered with care: Pareto QQ-plot, the sensitivity plot and the AMSE plot convey jointly important information about the behaviour of the estimator.
Proof of (A1). We adapt the proofs of Kratz and Resnick (1996) (KR henceforth) of their Equations 2.4 and 2.8. The key is the use of Renyi's representation of exponential order statistics, which implies (e.g., KR, p. 705) where E j (j = 1, · · · , n) denote iid unit exponential random variables, and E n−k,n denotes the (n − k)-th order statistic. We obtain an asymptotic refinement by using, instead of KR's Lemmas 2.2 and 2.3, the above Euler Maclaurin formulae, and Lyapunov's central limit theorem (CLT). Our numerator is denoted there by A n , and the indices are mapped by setting i = k + 1 − j. From KR (pp. 704-707), we have whereĒ k = (1/k) ∑ k j E j . We first show that KR's result (A4) can also be derived from the first order regular variation condition under the stated assumptions. Let Y denote a standard Pareto random variable, and denote the (n − k)-th order statistic by Y n−k,n . Consider the scaled log excesses log X n−i+1,n − log X n−k,n a(Y n−k,n )/U(Y n−k,n ) where a(.) and U(.) are defined in representation (5). Then, noting that X i,n = d U(Y i,n ), and using (5) with t = Y n−k,n and x = Y n−i+1,n /Y n−k,n , the scaled log excesses satisfy as n → ∞ and n/k → ∞ By Renyi's representation of exponential order statistics, we have Y n−i+1,n /Y n−k,n = d Y k−i+1,k , so From Wellner (1978), we know that k n Y n−k,n → p 1, so a(Y n−k,n )/U(Y n−k,n ) → γ. Using the definition of N n,k , on combining the results we thus obtain as claimed. We proceed to examine (A4). Using (A3) yields . Using again (A3) and substituting the result, we obtain which is Equation (A1), as claimed. Before refining the asymptotic expansion, we briefly consider: Proof of (4). D k = 2 + O log 2 k k . Expanding the quadratic in the definition of D k and using the Euler Maclaurin formulae yields the stated result.
We are now in a position to examine the behaviour of the numerator N n,k under second order regular variation.
Proof of the higher order expansion (7). Consider the scaled log excesses again, this time using representation (6) instead of (5). Set again t = Y n−k,n and x = Y n−i+1,n /Y n−k,n , and recall Y n−i+1,n /Y n−k,n = d Y k−i+1,k . Hence we obtain the higher order expression log X n−i+1,n − log X n−k,n a(Y n−k,n )/U(Y n−k,n ) The role of the first term on the right for N n,k has already been described above. In what follows, we consider the higher order term. Since H γ>0,ρ<0 (x) = 1 ρ ( x ρ −1 ρ − log x), the higher order expansion of N n,k requires the analysis of where V denotes a standard uniform random variable, and we replace the order statistic V j,k by its expectation, V j,k = j/(k + 1) + O p (k −1/2 ). A Taylor series expansion then gives For the third term on the rhs, we use the Euler Maclaurin (A3), for the second term on the rhs we have the following Euler Maclaurin Combing these two Euler Maclaurin formulae, we can simplify to get We are now in a position to combine the results. In order to simplify notation, denote the first order expansion of the numerator N n,k /γ by N 1,n,k /γ, given by the rhs of (A1). Then substituting the higher order expression for the scaled excesses (A5) into the formula for N n,k , recalling that k n Y n−k,n → p 1 (Wellner 1978), and using (A6) yields N n,k /γ = N 1,n,k /γ + A n k 1 ρ 2 − ρ (1 − ρ) 2 + O p log k k + o p (A(n/k)).
Proof of (8). The class of kernel estimator considered in Csorgo et al. (1985) is of the form γ kernel = ∑ k j=1 (j/k)K(j/k)[log X n−j+1 − log X n−j ] Their Theorem 2 (or Theorem 1.1 in Beirlant et al. 1996) states the following. Under general conditions on kernel K and distribution function F so that there exists a nonrandom sequence C n such that C n (â − γ) converges weakly to a limiting N(0, 1) distribution for some sequence k = k n → ∞ with k n /n → 0, it is necessary and sufficient that lim n→∞ √ k 1 0 b(kw/n)K(w)dw = 0 where b is a function such that b(1/x) → 0 as x → ∞, and that for the tail quantile function where c(x) → c as x → ∞. If this condition is satisfied, then as k = k n → ∞ and k n /n → 0 √ k 1 0 K 2 (v)dv −1/2 (γ kernel − γ) → d N(0, γ 2 ). Beirlant et al. (1996) observe that our slope estimatorγ, given by (3), is (to first order) a member of the class of kernel estimatorsγ kernel with kernel K(t) = 1 − log t, and that the above condition holds under the regular variation hypothesis. Turning to the specific kernel K(t) = 1 − log t, since 1 0 K(t)dt = 2 and not unity, a scale correction is required. As 1 0 K 2 (t)dt = 5, the stated result (8) follows.
Proof of (10). Consider the mean weighted theoretical squared deviation 1 k k ∑ j=1 w j,k E log X n−j+1,n X n−k,n − γ log k + 1 j 2 for some weights w j,n . Using (A5) this equals, to first order, Then, recalling that Y k−j+1,k = d (V j,k ) −1 and proceeding as in Beirlant et al. (1996, Section 4), which involves approximating expectations E( f (V j,k )) by the leading term f (j/(k + 1)) when applying the delta method yields, to first order, Finally, we set c k = (4/5)c k and w j,k ≡ 1 to arrive at (10).

Remark:
In order to obtain an estimate of the AMSE, Beirlant et al. (1996) use two weighting schemes, namely w j,k ≡ 1 leading to coefficients, say, c 1 k and d 1 k and mean weighted squared residuals k −1 SSR 1 k , and w j,k = j/(k + 1) leading to c 2 k , d 2 k , and k −1 SSR 2 k . Then a linear combination of two approximate MSE expressions (with coefficients, say, x and y) is sought that yields Var(γ) + b 2 k,n , which is achieved by solving simultaneously the equations xc 1 k + yc 2 k = 1 xd 1 k + yd 2 k = 1.