Consistency of the OLS Bootstrap for Independently but Not-Identically Distributed Data: A Permutation Perspective

Department of Economics, London School of Economics, Houghton St., London WC2A 2AE, UK
Econometrics 2025, 13(4), 41; https://doi.org/10.3390/econometrics13040041
Submission received: 30 June 2025 / Revised: 26 September 2025 / Accepted: 29 September 2025 / Published: 23 October 2025

Abstract

This paper introduces a new approach to proving bootstrap consistency based upon the distribution of permutation statistics. Using this approach, it derives results covering fundamentally not-identically distributed groups of data, in which average moments need not converge to anything, under moment conditions that are less demanding than those of earlier results for either identically distributed or not-identically distributed data.

1. Introduction

Data are often drawn from dissimilar environments which render the independent and identically distributed (iid) assumption that underlies many results on the bootstrap suspect.1 This paper extends results concerning the consistency of the pairs and wild OLS bootstraps, which have mostly been derived for iid data, to general regression frameworks with independently but not-necessarily identically distributed (inid) data. Instead of considering the sampling distribution of the bootstraps, the usual approach, it notes that any permutation of the pairs bootstrap vector of sampling frequencies or the realization of the external variable used by the wild bootstrap to transform residuals is equally likely. Using results on the asymptotic distribution of permutation statistics by Wald and Wolfowitz (1944), Noether (1949), and Hoeffding (1951), these equally likely permutations can be used to characterize the bootstrap distributions conditional on the data as normal given restrictions on sample moments of the data. White’s (1980a) conditions for the asymptotic normality of OLS coefficients with clustered/heteroskedastic residuals and inid data guarantee these restrictions almost surely, ensuring that the asymptotic distribution of pairs and wild bootstrapped coefficients and Wald statistics conditional on the data matches the unconditional distribution of the original OLS estimates.
While proofs of bootstrap consistency typically require the existence of at least fourth moments of the regressors with iid data, the permutation distribution allows this paper to prove consistency with no more than second regressor moments and inid data. For iid data, Mammen (1993) proved consistency of the wild OLS bootstrap coefficient and homoskedasticity-based Wald test distributions with bounded expectations of the product of the fourth power of the regressors with the squared errors and an additional Lindeberg condition. Similarly, for the pairs OLS bootstrap with iid data, Freedman (1981) showed that bounded fourth moments of both regressors and errors are sufficient for consistency of the pairs bootstrap coefficient distribution2 and that of the Wald statistic based upon the (potentially incorrect) assumption of homoskedastic errors. Stute (1990) tightened the result for the coefficient distribution alone, showing it is sufficient for the squared regressors and the product of the squared regressors with the squared errors to have finite expectation. This paper proves consistency of both the coefficient and clustered/heteroskedasticity robust Wald statistic distributions in a broader inid environment for both the pairs and wild bootstraps with finite expectations of only slightly more than second powers of the regressors and of the product of the second powers of the regressors with the second power of the errors. These are much less demanding assumptions than those used by Freedman and Mammen, requiring only slightly higher moments than those used by Stute for the proof of only the pairs bootstrap coefficient distribution in a narrower iid environment. Moreover, when residuals are heteroskedastic or interrelated within clusters, the homoskedasticity-based Wald test is not guaranteed to be asymptotically accurate, as recognized by Freedman (1981) and Mammen (1993). In such cases, practitioners are likely to prefer clustered/heteroskedasticity robust covariance estimates and Wald statistics, as these are asymptotically accurate and pivotal, respectively, ensuring the asymptotic accuracy of the conventional test and the higher order accuracy and faster convergence of rejection probabilities to the nominal value in the bootstrap (Singh, 1981; Hall, 1992).
For OLS models with inid data, the salient contribution is Liu (1988), who showed that the wild bootstrap provides consistent estimates of the second central moment of a linear combination of coefficients in an OLS regression model with bounded regressors, provided the first and second moments of the wild bootstrap external variable are 0 and 1, respectively. Liu’s result regarding the second central moment is easily extended to the case of the multivariate second central moments of coefficients for unbounded inid regressors without any additional restrictions on the moments of the external variable, as shown below. Our interest here, however, is in the full distribution of wild bootstrap coefficient and Wald statistic estimates, where our proof requires the existence of higher moments of the wild bootstrap external variable to ensure the convergence of higher moments to the normal. As the external variable is selected by the practitioner, and not an exogenous characteristic of the data, these additional moment conditions pose no obstacle. The two-point distribution proposed by Mammen (1993) and the Rademacher distribution, both often used in practical applications (e.g., Davidson & Flachaire, 2008), have moments of all orders.
Liu’s consideration of inid data has largely not been extended, as the OLS bootstrap literature has since focused on time series dependent data, where the absence of random sampling of independent observations raises different statistical issues and the use of different bootstrap methods (see the review in Härdle et al., 2003). Djogbenou et al. (2019), who prove consistency of the wild bootstrap t-statistic distribution for independently distributed cluster groupings of data, are a notable exception. With the moment assumptions used here, plus the additional requirement of bounded slightly higher than fourth moments of the regressors, their proof allows for heterogeneity in the distribution of data across clusters. However, they limit that heterogeneity in requiring that the cross product of the regressors and the covariance matrix of coefficient estimates converge to matrices of constants, a condition that in other papers is typically motivated by an iid assumption.3 The data-generating process examined in this paper is more fully inid in that there is no restriction that such matrices converge to anything, and the proof requires only slightly higher than second moments of the regressors. In sum, by emphasizing the permutation distribution, this paper lowers typical fourth moment restrictions on regressors to second moments, allows a fully inid data process in which average moments do not converge, and highlights the conceptual similarity between the wild and pairs bootstraps, proving results for both in a unified framework.
The paper proceeds as follows: Section 2 reviews the OLS model, White’s assumptions and results regarding OLS with inid data, and pairs and wild bootstrap methods for clustered/heteroskedastic data. Section 3 presents the foundational theorems regarding the asymptotic normality of permutation distributions that motivate the results. Section 4 then combines these with White’s (1980a) result to derive sufficient conditions for pairs and wild OLS bootstrap consistency with inid data and potentially cluster interdependent heteroskedastic residuals. Section 5 more fully contrasts the assumptions and results herein with those found in the papers cited above. Section 6 provides Monte Carlo evidence of the consistency of the bootstrap in a challenging environment with an inid data process where average moments do not converge, regressors have barely second moments, and residuals are bounded, varyingly skewed, sometimes bi-modal, and otherwise generally highly non-normal. The Appendix provides proofs of the main theorems, while the on-line Appendix extends the pair results to sub-sampling and provides lengthy technical proofs of otherwise minor lemmas and extensions of the theorems.

2. Framework and Notation

Our interest is in inference for the linear model where, with i = 1 … N observations,
$$y = X\beta + \varepsilon,\qquad(1)$$
where $y$ denotes the N × 1 vector of observations on the dependent (outcome) variable, $X$ the N × K matrix of observations on the independent variables, $\beta$ the K × 1 vector of unobserved parameters of interest, and $\varepsilon$ the N × 1 vector of unobserved disturbances. The ordinary least squares (OLS) estimates $\hat\beta$ of $\beta$ minimize the sum of squared estimated residuals $\hat\varepsilon'\hat\varepsilon$, where $\hat\varepsilon = y - X\hat\beta$, producing the estimates
$$\hat\beta = (X'X)^{-1}X'y.\qquad(2)$$
If the disturbances $\varepsilon_i$ are homoskedastic with common variance $\sigma_i^2 = \sigma^2$, one can use the homoskedastic variance estimate of $\hat\beta$, $(X'X)^{-1}\hat\varepsilon'\hat\varepsilon/(N-K)$, but we focus on more general inference where the $\varepsilon_i$ are heteroskedastic and possibly interdependent within C ≤ N “cluster” groupings of observations, using the clustered/heteroskedasticity robust covariance estimate
$$\hat V(\hat\beta) = (X'X)^{-1}\Bigl(\sum_{g=1}^{C} X_g'\hat\varepsilon_g\hat\varepsilon_g' X_g\Bigr)(X'X)^{-1},\qquad(3)$$
where we use the subscript $g$ to denote the rows of matrices and vectors associated with the observations in cluster grouping $g$. As will be seen later, we assume that the regressors and disturbances $(X_g,\varepsilon_g)$ are independent across cluster groupings. When observations themselves are independent, each grouping $g$ equals an individual observation i, C = N, and (3) is White’s (1980a) heteroskedasticity robust covariance estimate. The clustered extension, however, is often used to allow for unspecified grouped dependence, and so we present the results within the more general framework. In describing limits, we use the subscript C, as in $\hat\beta_C$ and $\hat V(\hat\beta_C)$, to emphasize that the estimated coefficients and covariance estimates are functions of C realized observation groupings.
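As a concrete reference for the estimators in (2) and (3), the following minimal numpy sketch computes the OLS coefficients and the clustered/heteroskedasticity robust covariance estimate. The function name and interface are illustrative assumptions of this sketch only; the paper's own simulations use the Stata code in the Supplementary Material. With one observation per cluster it reduces to White's (1980a) heteroskedasticity robust estimator.

```python
import numpy as np

def ols_cluster_robust(y, X, cluster_ids):
    """OLS coefficients, as in (2), and the clustered/heteroskedasticity
    robust covariance estimate of (3)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    K = X.shape[1]
    meat = np.zeros((K, K))
    for g in np.unique(cluster_ids):
        mask = cluster_ids == g
        score_g = X[mask].T @ resid[mask]   # K-vector X_g' e_g for cluster g
        meat += np.outer(score_g, score_g)  # X_g' e_g e_g' X_g
    return beta_hat, XtX_inv @ meat @ XtX_inv
```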
White (1980a) provided conditions for valid OLS inference when the row vector of random variables associated with each observation is independently but not necessarily identically distributed (inid). With $x_{gj}$ denoting the jth column of $X_g$, we extend these to allow for grouped dependence:
Theorem I (extending White, 1980a). 
If there exist strictly positive finite constants γ, Δ, and η, such that
  • (Ia) $(X_g,\varepsilon_g)$ is a sequence of independent but not necessarily identically distributed random matrices, such that $E(X_g'\varepsilon_g) = 0_K$;
  • (Ib) For all $g$, $E(|x_{gj}'x_{gk}|^{1+\gamma}) < \Delta$ for all j, k = 1 … K, and for all C sufficiently large $M_C = C^{-1}\sum_{g=1}^{C}E(X_g'X_g)$ is non-singular with det($M_C$) > η;
  • (Ic) For all $g$, $E(|x_{gj}'\varepsilon_g\varepsilon_g'x_{gk}|^{1+\gamma}) < \Delta$ for all j, k = 1 … K, and for all C sufficiently large $V_C = C^{-1}\sum_{g=1}^{C}E(X_g'\varepsilon_g\varepsilon_g'X_g)$ is non-singular with det($V_C$) > η;
then
(i) $\hat\beta_C \xrightarrow{as(X,\varepsilon)} \beta$;
(ii) $V_C^{-\frac12}M_C\sqrt C(\hat\beta_C-\beta) \xrightarrow{d(X,\varepsilon)} n_K$;
(iii) $M_C$, $V_C$, and their inverses are uniformly bounded for all C sufficiently large;
(iv) $C\hat V(\hat\beta_C) - M_C^{-1}V_CM_C^{-1} \xrightarrow{as(X,\varepsilon)} 0_{K\times K}$;
(v) $(\hat\beta_C-\beta)'\hat V(\hat\beta_C)^{-1}(\hat\beta_C-\beta) \xrightarrow{d(X,\varepsilon)} \chi^2_K$;
where $\xrightarrow{as(X,\varepsilon)}$ and $\xrightarrow{d(X,\varepsilon)}$ denote convergence almost surely and in distribution across (X,ε), respectively, $A^{\frac12}$ the “square root” of symmetric positive definite matrix A,4 $n_K$ the K-dimensional standard normal, $\chi^2_K$ the central chi-squared with K degrees of freedom, and $0_K$ and $0_{K\times K}$ vectors and matrices of zeros of the indicated dimensions.
Remark 1. 
White’s covariance estimate often motivates inference with heteroskedasticity or clustering in an otherwise iid setting where each observation or cluster grouping is a draw from a fixed distribution. However, $\hat V(\hat\beta_C)$ allows for asymptotically accurate inference in the much more general inid setting given the above, where $M_C$, $V_C$, and $C\hat V(\hat\beta_C)$ do not necessarily converge to matrices of constants, as illustrated in Monte Carlos further below.
Remark 2. 
White (1980a) used (Ia)–(Ic) to prove (i), (ii), and parts of (iii) and added the fourth moment assumption $E(|x_{gj}x_{gj}x_{gk}x_{gl}|^{1+\gamma}) < \Delta$ to prove (iv), (v), and other results. As reviewed below, a similar fourth moment condition on the regressors is also used in prior proofs of bootstrap consistency. However, (Ia)–(Ic), with only slightly higher than second regressor moments, suffice to prove (i)–(v) and ensure bootstrap consistency, as shown in the proofs and Monte Carlos below.
Remark 3. 
In practical application, moment restrictions on the data-generating process can be tested using techniques suggested by Meerschaert and Scheffler (1998), Fedotenkov (2013), and Trapani (2016), among others.
In this paper, we examine two bootstrap techniques commonly used for OLS inference with heteroskedastic and clustered disturbances and prove the asymptotic consistency of their distributions for general inid data. Wu’s (1986) external bootstrap, now commonly known as the wild bootstrap, holds the design matrix $X$ constant and generates new realizations of the outcome vector $y$ by multiplying the estimated residuals of each cluster grouping by an independently and identically distributed external random variable $\delta^w_g$, so that the dependent variable for grouping $g$ is given by $y^w_g = X_g\hat\beta_C + \hat\varepsilon_g\delta^w_g$. Selecting $\hat\beta^w_C$ so as to minimize the sum of squared residuals for these new data yields coefficient and covariance estimates expressed in terms of the original data and its estimates as
$$\hat\beta^w_C = \hat\beta_C + (X'X)^{-1}\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^w_g \quad\text{and}\quad \hat V(\hat\beta^w_C) = (X'X)^{-1}\Bigl(\sum_{g=1}^{C}X_g'\hat\varepsilon^w_g\hat\varepsilon^{w\prime}_gX_g\Bigr)(X'X)^{-1},\ \text{where}\ \hat\varepsilon^w_g = X_g(\hat\beta_C - \hat\beta^w_C) + \hat\varepsilon_g\delta^w_g.\qquad(4)$$
Repeated draws of the C×1 vector $\delta^w$ of iid variables are made, and the resulting distribution of coefficients $\hat\beta^w_C - \hat\beta_C$ and Wald statistics $(\hat\beta^w_C - \hat\beta_C)'\hat V(\hat\beta^w_C)^{-1}(\hat\beta^w_C - \hat\beta_C)$ is used to evaluate the statistical significance of the corresponding measures for tests of the null hypothesis $\beta = \beta_0$ in the original sample, i.e., $\hat\beta_C - \beta_0$ and $(\hat\beta_C - \beta_0)'\hat V(\hat\beta_C)^{-1}(\hat\beta_C - \beta_0)$. All permutations of any given realization of $\delta^w$ are equally likely, a fact that plays a prominent role in the results of this paper.
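As one concrete reading of (4), the sketch below performs a single wild bootstrap replication with Rademacher external draws, continuing and reusing the illustrative ols_cluster_robust helper defined above; all names are assumptions of the sketch, not the paper's code.

```python
def wild_bootstrap_draw(y, X, cluster_ids, rng):
    """One wild bootstrap replication per (4): each cluster's estimated
    residuals are multiplied by an iid Rademacher draw delta_g^w and the
    model is re-estimated on the reconstructed outcomes."""
    beta_hat, _ = ols_cluster_robust(y, X, cluster_ids)
    resid = y - X @ beta_hat
    resid_w = resid.copy()
    for g in np.unique(cluster_ids):
        resid_w[cluster_ids == g] *= rng.choice([-1.0, 1.0])  # delta_g^w
    y_w = X @ beta_hat + resid_w       # y_g^w = X_g beta_hat + e_g delta_g^w
    return ols_cluster_robust(y_w, X, cluster_ids)   # beta^w, V(beta^w)
```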
The pairs bootstrap samples with replacement C cluster groupings of “pairs” of dependent and independent variables $(y_g, X_g)$ from the rows of the original data $(y, X)$, producing a new data set composed of h = 1 … C cluster groups of observations $(y_h, X_h)$, with each h corresponding to one of the original g groupings.5 Selecting $\hat\beta^p_C$ so as to minimize the sum of squared residuals for these new data, the resulting coefficient and covariance estimates can be expressed in terms of the original data, its estimates, and its indices g = 1 … C as
$$\hat\beta^p_C = \hat\beta_C + \Bigl(\sum_{g=1}^{C}X_g'X_g\delta^p_g\Bigr)^{-1}\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g \quad\text{and}\quad \hat V(\hat\beta^p_C) = \Bigl(\sum_{g=1}^{C}X_g'X_g\delta^p_g\Bigr)^{-1}\Bigl(\sum_{g=1}^{C}X_g'\hat\varepsilon^p_g\hat\varepsilon^{p\prime}_gX_g\delta^p_g\Bigr)\Bigl(\sum_{g=1}^{C}X_g'X_g\delta^p_g\Bigr)^{-1},\ \text{where}\ \hat\varepsilon^p_g = X_g(\hat\beta_C - \hat\beta^p_C) + \hat\varepsilon_g,\qquad(5)$$
where $\delta^p_g$ denotes the number of times (possibly 0) cluster grouping g was drawn. Repeated bootstrap samples are made, and the resulting distribution of coefficients $\hat\beta^p_C - \hat\beta_C$ and Wald statistics $(\hat\beta^p_C - \hat\beta_C)'\hat V(\hat\beta^p_C)^{-1}(\hat\beta^p_C - \hat\beta_C)$ is once again used to evaluate the statistical significance of the corresponding measures for tests of the null hypothesis $\beta = \beta_0$ in the original sample. As in the case of the wild bootstrap, all permutations of any given realization of the C×1 sampling frequency vector $\delta^p$ are equally likely. Consequently, we use the common notation δ, distinguished by superscripted p or w, for seemingly dissimilar objects, because these operate identically in the theorems and proofs below.
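Continuing the sketch, a pairs bootstrap replication can be drawn directly as the multinomial frequency vector $\delta^p$ of (5), with each drawn copy of a grouping entering the resampled data as its own cluster; again the helper and its interface are illustrative assumptions.

```python
def pairs_bootstrap_draw(y, X, cluster_ids, rng):
    """One pairs bootstrap replication per (5): resample the C cluster
    groupings with replacement; delta_g^p counts how often grouping g is
    drawn, and each drawn copy enters as its own cluster h."""
    groups = np.unique(cluster_ids)
    C = len(groups)
    freq = rng.multinomial(C, np.full(C, 1.0 / C))  # frequencies delta^p
    rows, new_ids, h = [], [], 0
    for g, f in zip(groups, freq):
        idx = np.where(cluster_ids == g)[0]
        for _ in range(f):                          # each copy: separate cluster
            rows.append(idx)
            new_ids.append(np.full(idx.size, h))
            h += 1
    rows = np.concatenate(rows)
    new_ids = np.concatenate(new_ids)
    return ols_cluster_robust(y[rows], X[rows], new_ids)   # beta^p, V(beta^p)
```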
Our interest is in deriving sufficient conditions for the conditional consistency of the bootstrap distributions in an inid framework. Specifically, we show that White’s (1980a) assumptions are sufficient to ensure that for the bootstrapped coefficient and clustered/heteroskedasticity robust covariance estimates, with b (both) denoting p (pairs) or w (wild),
$$\Bigl(\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\Bigr)^{-\frac12}\frac{X'X}{C}\sqrt C(\hat\beta^b_C - \hat\beta_C)\ \xrightarrow{d(\delta^b)|as(X,\varepsilon)}\ n_K\qquad(6)$$
$$\text{and}\quad C(\hat\beta^b_C - \hat\beta_C)'[C\hat V(\hat\beta^b_C)]^{-1}(\hat\beta^b_C - \hat\beta_C)\ \xrightarrow{d(\delta^b)|as(X,\varepsilon)}\ \chi^2_K,\qquad(7)$$
where $\xrightarrow{d(\delta)|as(X,\varepsilon)}$ denotes convergence in distribution across δ almost surely across realizations of (X,ε). These results show that the asymptotic conditional distribution given the data (X,ε) of the bootstrap equals the asymptotic distribution of the OLS estimates across (X,ε), allowing for valid inference using the percentiles of bootstrapped coefficient estimates or Wald statistics.6
The key characteristic exploited in the proofs below is that any of the row permutations of the vectors δ are equally likely. Consequently, the distribution of the bootstraps can be thought of as the distribution across permutations of δ integrated across the ordered realizations of δ. Permutation theorems characterize this permutation distribution as asymptotically normal with covariance matrix $C\hat V(\hat\beta_C)$ provided (X,ε) and δ have certain moment properties. White’s (1980a) assumptions ensure that these properties hold almost surely for (X,ε), while the properties of the multinomial sampling frequencies $\delta^p$ and moment assumptions on the iid elements of $\delta^w$ ensure the requisite conditions on δ also hold almost surely. Consequently, almost surely conditional on the data (X,ε), the distributions of the bootstraps across the draws δ that determine their coefficient estimates and Wald statistics converge to the distribution of their OLS counterparts for the original sample (X,ε) across its data-generating process.

3. Foundational Permutation Theorems

The proofs in this paper rely on a theorem first proven by Wald and Wolfowitz (1944) and later refined by Noether (1949) and Hoeffding (1951) concerning the asymptotic distribution of root-C times the correlation of a permuted sequence with another sequence:
Theorem II: 
Let z′ = (z₁, …, z_C) and δ′ = (δ₁, …, δ_C) denote sequences of real numbers, not all equal, and d′ = (d₁, …, d_C) denote any of the C! equally likely permutations of δ. Then, as C → ∞ the distribution across the realizations of d of the random variable
$$v_C = \frac{1}{\sqrt C}\sum_{g=1}^{C}\frac{[z_g - m(z_g)][d_g - m(d_g)]}{s(z_g)\,s(d_g)},\quad\text{where for } h = z\text{ or }d,\ m(h_g) = \frac{\sum_{g=1}^{C}h_g}{C}\ \ \&\ \ s(h_g)^2 = \frac{\sum_{g=1}^{C}[h_g - m(h_g)]^2}{C},\qquad\text{(IIa)}$$
converges to that of the standard normal if for all integers τ > 2
$$\lim_{C\to\infty}\ C^{\frac{\tau}{2}-1}\,\frac{\sum_{g=1}^{C}[z_g - m(z_g)]^{\tau}\ \sum_{g=1}^{C}[\delta_g - m(\delta_g)]^{\tau}}{\bigl(\sum_{g=1}^{C}[z_g - m(z_g)]^2\bigr)^{\tau/2}\bigl(\sum_{g=1}^{C}[\delta_g - m(\delta_g)]^2\bigr)^{\tau/2}} = 0.\qquad\text{(IIb)}$$
The proof is based on showing that the moments of $v_C$ converge to those of the standard normal. A simple multivariate extension, proven in the on-line Appendix, is
Theorem IIm:
Let $O = I_{C\times C} - 1_C1_C'/C$ denote the centering matrix,7 $Z = (z_1,\ldots,z_C)'$ a sequence of K × 1 vectors such that $Z'OZ$ is positive definite, δ′ = (δ₁, …, δ_C) a sequence of real numbers not all equal, and d′ = (d₁, …, d_C) any of the C! equally likely permutations of δ. Then, as C → ∞ the distribution across the realizations of d of the random variable
$$v_C = \Bigl(\frac{Z'OZ}{C}\Bigr)^{-\frac12}\Bigl(\frac{d'Od}{C}\Bigr)^{-\frac12}\frac{Z'Od}{\sqrt C}\qquad\text{(IIc)}$$
converges to that of the multivariate iid standard normal if (IIb) holds for each element of the vector sequence $z_g$.
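The content of Theorem II is easy to see numerically: for any fixed pair of sequences, the statistic $v_C$ computed across random permutations is close to standard normal once C is large. The following self-contained simulation is my own illustration, not from the paper; the particular distributions generating the fixed sequences are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 2000
z = rng.exponential(size=C)                      # fixed sequence, not all equal
delta = rng.poisson(1.0, size=C).astype(float)   # fixed sequence to be permuted

def v_C(z, d):
    zc, dc = z - z.mean(), d - d.mean()
    s_z = np.sqrt((zc ** 2).mean())              # s(z_g) of (IIa)
    s_d = np.sqrt((dc ** 2).mean())              # s(d_g) of (IIa)
    return (zc * dc).sum() / (s_z * s_d * np.sqrt(C))

draws = np.array([v_C(z, rng.permutation(delta)) for _ in range(5000)])
print(draws.mean(), draws.std())                 # approximately 0 and 1
```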
Theorem II is easily extended to a probabilistic environment by noting the following result due to Ghosh (1950) that translates the almost-sure or in probability characteristics of an infinite number of moment conditions into similar statements regarding a distribution:
Theorem III
If all the moments of the cumulative distribution function $F_C(x)$ converge almost surely (in probability) to those of F(x), which possesses a density function, and for which, with $\nu_{k+1}$ denoting the absolute moment of order k + 1,
$$\lim_{k\to\infty}\frac{\alpha^{k+2}\,\nu_{k+1}}{(k+2)!} = 0\ \text{ for any given value of }\alpha,\qquad\text{(IIIa)}$$
then $F_C(x)$ converges almost surely (in probability) to F(x).
Condition (IIIa) is of course true for the normal distribution. Hoeffding (1952) generalized the result by showing that condition (IIIa) is not even needed for convergence in probability at all points of continuity of any F(x) that is uniquely determined by its moments. By virtue of the Cramér–Wold device, Theorem III covers the multivariate case given in (IIc) above, as for all λ such that $\lambda'\lambda = 1$, all moments of $\lambda'v_C$ converge to those of the standard normal. In light of Theorem III, in applying Theorem II below, we use the notation $\xrightarrow{d(d)|as(\delta,X,\varepsilon)}$, i.e., almost surely across the realizations of (δ,X,ε), the distribution of $v_C$ across permutations d of δ converges to the multivariate standard normal. Theorems II and III are used to characterize the asymptotic distribution of $\sum_{g=1}^{C}X_g'\hat\varepsilon_gd^b_g/\sqrt C$, which appears in the expressions for the bootstrapped coefficient estimates in (4) and (5) above.
A less demanding form of Theorem II, proven in Appendix B below, provides a weaker condition under which the mean of products converges in probability across permutations to the product of means:
Theorem IV:
Let z′ = (z₁, …, z_C) and δ′ = (δ₁, …, δ_C) denote sequences of real numbers, possibly all equal, and d′ = (d₁, …, d_C) any of the C! equally likely permutations of δ. Then, as C → ∞, across permutations d of δ,
$$m(z_gd_g) - m(z_g)m(\delta_g) = \frac{\sum_{g=1}^{C}z_gd_g}{C} - \frac{\sum_{g=1}^{C}z_g}{C}\cdot\frac{\sum_{g=1}^{C}\delta_g}{C}\ \xrightarrow{p}\ 0,\qquad\text{(IVa)}$$
if
$$\lim_{C\to\infty}\ \frac{1}{C}\cdot\frac{\sum_{g=1}^{C}[z_g - m(z_g)]^2}{C}\cdot\frac{\sum_{g=1}^{C}[\delta_g - m(\delta_g)]^2}{C} = 0.\qquad\text{(IVb)}$$
Theorem IV is used in the proofs to make statements regarding the convergence in probability of terms such as $\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g(d^w_g)^2/C$, $\sum_{g=1}^{C}X_g'\hat\varepsilon^p_g\hat\varepsilon^{p\prime}_gX_gd^p_g/C$, and $\sum_{g=1}^{C}X_g'X_gd^p_g/C$, which appear in (4) and (5) above. As the satisfaction of (IVb) depends on the realized sample moments of (X,ε) and δ, we use the notation $\xrightarrow{p(d)|as(\delta,X,\varepsilon)}$, i.e., almost surely across the realizations of (δ,X,ε), $m(z_gd_g)$ converges in probability across the permutations d of δ to $m(z_g)m(\delta_g)$.
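A quick numerical check of Theorem IV, again my own illustration with arbitrarily chosen distributions: the permutation mean of products concentrates on the product of means as C grows.

```python
import numpy as np

rng = np.random.default_rng(2)
for C in (100, 10_000):
    z = rng.standard_t(3.0, size=C)
    delta = rng.poisson(1.0, size=C).astype(float)
    gaps = [abs(np.mean(z * rng.permutation(delta)) - z.mean() * delta.mean())
            for _ in range(1000)]
    print(C, np.mean(gaps))   # average gap shrinks as C grows
```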

4. Results: Bootstrap Consistency with INID Data

The following result is proven in Appendix C, further below:
Theorem V: 
Assume that for the wild bootstrap $E[\delta^w_g] = 0$, $E[(\delta^w_g)^2] = 1$, and $E[(\delta^w_g)^{2(1+\theta_1)}] < \Delta$ for some finite Δ and θ₁ > 1/γ, with γ as given in Theorem I earlier. Assumptions (Ia)–(Ic) given in Theorem I, in combination with the properties of δ, are sufficient to ensure that across the permutations d of $\delta^b$, for b = p (pairs) or w (wild),
$$\Bigl(\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\Bigr)^{-\frac12}\frac{X'X}{C}\Bigl(\frac{\delta^{b\prime}O\delta^b}{C}\Bigr)^{-\frac12}\sqrt C(\hat\beta^b_C - \hat\beta_C)\ \xrightarrow{d(d)|as(\delta^b,X,\varepsilon)}\ n_K,\qquad\text{(Va)}$$
$$C\hat V(\hat\beta^b_C) - C\hat V(\hat\beta_C)\ \xrightarrow{p(d)|as(\delta^b,X,\varepsilon)}\ 0_{K\times K}.\qquad\text{(Vb)}$$
Bounded higher moments of $\delta^w_g$ are needed to ensure that conditions (IIb) and (IVb) in Theorems II and IV are satisfied.
Let δ* denote the ordered values of δ. Across permutations d of δ*, (Va) and (Vb) hold. These permutations, integrated across the distribution of δ*, characterize the entire distribution of δ. Adding the result8
$$\frac{\delta^{p\prime}O\delta^p}{C}\ \xrightarrow{p(\delta^p)}\ 1 \quad\text{and}\quad \frac{\delta^{w\prime}O\delta^w}{C}\ \xrightarrow{as(\delta^w)}\ 1,\qquad(8)$$
implies that
$$\Bigl(\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\Bigr)^{-\frac12}\frac{X'X}{C}\sqrt C(\hat\beta^b_C - \hat\beta_C)\ \xrightarrow{d(\delta^b)|as(X,\varepsilon)}\ n_K,\qquad(9)$$
$$C\hat V(\hat\beta^b_C) - C\hat V(\hat\beta_C)\ \xrightarrow{p(\delta^b)|as(X,\varepsilon)}\ 0_{K\times K},\qquad(10)$$
where the convergence in distribution in this case is across the bootstrap realizations of δb that determine the bootstrap coefficient and covariance estimates, as in (4) and (5) above. When combined with White’s (1980a) result in Theorem I regarding the asymptotic distribution of OLS coefficient and covariance estimates, this establishes that almost surely the conditional (on the data) distributions of the bootstrapped coefficients and Wald statistics converge to the unconditional distributions of their OLS regression counterparts.

5. Comparison of Bootstrap Consistency Results

This section contrasts the assumptions and results of this paper with other papers on bootstrap consistency. These usually assume independent observations, with moment conditions given at that level. To simplify the comparison of moment conditions, where possible I use the i = 1 … N notation, taking each cluster g as composed of one observation and using the implied observational level assumptions in the theorems given above. Table 1 below summarizes key elements of the discussion that follows.
1. Assumptions on regressors and errors
For an OLS model with iid data and potentially heteroskedastic residuals, Mammen (1993) showed that for a fixed number of regressors the wild bootstrap distributions of linear combinations of the coefficients and of Wald statistics based upon the homoskedastic covariance estimate are consistent in probability, given $\sup_{c'c=1}E[(c'x_i)^4(1+\varepsilon_i^2)] < \infty$ and the Lindeberg-type condition $E[(c'x_i)^2\varepsilon_i^2 I[(c'x_i)^2\varepsilon_i^2 \ge \gamma N]] \to 0$ for every fixed γ > 0. For the same model, Freedman (1981) proved almost-sure consistency of pairs bootstrap coefficients and homoskedasticity-based Wald tests if the row vectors $(x_i, y_i)$ are independently and identically distributed and $E[((x_i,y_i)'(x_i,y_i))^2] < \infty$. Stute (1990) tightened part of the result, showing that almost-sure convergence of the pairs bootstrap coefficients alone for iid data only requires $E(x_{ij}x_{ik})$ and $E(x_{ij}x_{ik}\varepsilon_i^2)$ to be finite. By adopting a permutation approach, this paper proves almost-sure consistency of both coefficients and Wald statistics based upon the clustered/heteroskedasticity robust covariance estimate with inid observations for both the pairs and wild bootstrap with the existence of only slightly higher moments than required by Stute (1990), i.e., $E|x_{ij}x_{ik}|^{1+\gamma} < \infty$ and $E|x_{ij}x_{ik}\varepsilon_i^2|^{1+\gamma} < \infty$ for some γ > 0. It should be noted, however, that Mammen’s result was part of a broader framework that allowed for a growing number of regressors.
For inid data, Liu (1988) proved consistency in probability of the second central moment of the wild OLS bootstrap coefficient distribution with bounded regressors (with all moments) and finite second moments of $\varepsilon_i$. This paper proves almost-sure consistency of the wild bootstrap distribution for inid data with unbounded regressors using the additional moment conditions described above.
Djogbenou et al. (2019) prove consistency in probability of the distribution of the wild bootstrap t-statistic for within-cluster correlated but cross-cluster independent but not identically distributed data. Their assumptions on the existence of moments are those used in this paper, plus the addition of the fourth moment restriction $E|x_{ij}^4|^{1+\gamma} < \infty$ for some γ > 0. They also impose asymptotic homogeneity of the data-generating process in the form of assuming that $X'X/N$ converges to a matrix of constants, while for any vector α such that $\alpha'\alpha = 1$, there exists a finite scalar $v_\alpha > 0$ and non-random sequence $\mu_N \to \infty$ such that $\mu_N\alpha'(X'X)^{-1}\sum_{g=1}^{C}E(X_g'\varepsilon_g\varepsilon_g'X_g)(X'X)^{-1}\alpha \to v_\alpha$. Thus, while papers usually use the iid assumption to motivate the convergence of key matrices to matrices of constants, Djogbenou et al. (2019) avoid the iid assumption but assume that the data nevertheless converge to such matrices. This paper, using clustered versions of White’s (1980a) assumptions, requires no such convergence of the asymptotic regressor cross product and covariance matrix of coefficient estimates and as such covers more fundamentally inid data without the addition of the fourth moment condition $E|x_{ij}^4|^{1+\gamma} < \infty$.
This paper makes no explicit assumptions regarding maximum cluster size, but in practice the assumption that the expectations of vector products of the regressors are uniformly bounded for all $g$, i.e., $E(|x_{gj}'x_{gk}|^{1+\gamma}) < \Delta$, implies that either the maximum cluster size is bounded or, as seems less likely, the expectation of individual observations shrinks with cluster size. In contrast, Djogbenou et al.’s (2019) proof of consistency allows the maximum cluster size to increase with the sample size in an unbounded fashion at a rate determined by the form of dependency (albeit unknown) within clusters. All proofs of consistency necessarily require that asymptotically individual observations or clusters exert a negligible influence on coefficient and variance estimates, although ironically it is often the strong influence of outlier observations or groupings in finite samples that makes conventional tests less accurate relative to the bootstrap (cf. Davidson & Flachaire, 2008; Young, 2019, and the simulations below).
2. Type of consistency proven
Aside from consistency of the coefficient distribution, Freedman (1981) and Mammen (1993) prove consistency of the Wald statistic for the pairs and wild bootstrap, respectively, based upon the covariance estimate with homoskedastic errors. Djogbenou et al. (2019) prove consistency of the Wald statistic using the cluster/heteroskedasticity robust covariance estimate, which is also asymptotically accurate with homoskedastic errors. This test statistic is asymptotically pivotal and hence provides higher-order asymptotic bootstrap accuracy (Singh, 1981; Hall, 1992). This paper does the same for both the pairs and wild bootstrap using weaker moment conditions and a unified permutation framework that highlights a similarity between the two methods.
Freedman and Stute allowed for sub-sampling M < N observations in the pairs bootstrap and proved convergence in distribution if M and N both go to infinity. As shown in the on-line Appendix, at the expense of complicating the proofs, the permutation-based pairs bootstrap consistency results can be extended to sub-sampling, with and without replacement, if M/N → 0 and, for some γ* > (1 + γ)⁻¹, M is such that $\liminf M/N^{\gamma^*} > 0$. The requirement that M not fall too rapidly relative to N is needed to ensure the existence and convergence of higher moments to the normal, as the proof of Theorem II is based upon the method of moments.
Liu (1988) proves consistency of the wild bootstrap second central moment with bounded regressors. Proving such consistency with the unbounded regressors of this paper is trivial. If we assume, as did Liu (1988), that $E[\delta^w] = 0_C$ and $E[\delta^w\delta^{w\prime}] = I_{C\times C}$ (the identity matrix), then taking the expectation with respect to this variable for a given realization of X and ε, we have
$$E[\hat\beta^w_C|X,\varepsilon] = \hat\beta_C + (X'X)^{-1}\sum_{g=1}^{C}X_g'\hat\varepsilon_gE[\delta^w_g] = \hat\beta_C,\ \text{ as }\ \sum_{g=1}^{C}X_g'\hat\varepsilon_g = 0_K,$$
$$E[(\hat\beta^w_C - E[\hat\beta^w_C])(\hat\beta^w_C - E[\hat\beta^w_C])'|X,\varepsilon] = (X'X)^{-1}\sum_{g=1}^{C}\sum_{h=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_h'X_hE[\delta^w_g\delta^w_h](X'X)^{-1} = (X'X)^{-1}\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g(X'X)^{-1} = \hat V(\hat\beta_C),$$
where we make use of the fact that $\sum_{g=1}^{C}X_g'\hat\varepsilon_g = X'\hat\varepsilon = 0_K$, as the OLS estimates $\hat\beta_C$ in (2) above minimize $\hat\varepsilon_C'\hat\varepsilon_C$. Thus, for any sample size the variance of wild bootstrap coefficient estimates equals White’s clustered/heteroskedasticity robust covariance estimate for the sample. Since under White’s conditions given in Theorem I, $C\hat V(\hat\beta_C)$ is a consistent estimator of the asymptotic variance of $\sqrt C(\hat\beta_C - \beta)$, it follows that for such general inid data the wild bootstrap coefficient variance is a consistent estimator as well, reproducing Liu’s result in a more general framework.
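This exact finite sample identity is easy to verify by simulation. The check below is my own illustration, reusing the sketches from Section 2; the data-generating process is an arbitrary heteroskedastic design chosen for the check, not the paper's.

```python
import numpy as np

# Verify: across wild bootstrap draws with E[delta]=0, E[delta^2]=1, the
# covariance of beta^w approximates V_hat(beta_C) for any fixed sample.
rng = np.random.default_rng(1)
N = 400
X = np.column_stack([np.ones(N), rng.standard_t(3.0, size=N)])
y = rng.normal(size=N) * np.sqrt(1.0 + X[:, 1] ** 2)   # heteroskedastic errors
ids = np.arange(N)                                     # one observation per cluster
beta_hat, V_hat = ols_cluster_robust(y, X, ids)
boot = np.array([wild_bootstrap_draw(y, X, ids, rng)[0] for _ in range(20_000)])
print(np.cov(boot.T, bias=True))                       # Monte Carlo covariance ...
print(V_hat)                                           # ... approximates V_hat
```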
A similar result for the pairs bootstrap is more problematic. The first two moments of the multinomial sampling frequencies $\delta^p$ for C draws with replacement from C cluster groups are $E[\delta^p] = 1_C$ (a vector of ones) and $E[\delta^p\delta^{p\prime}] = I_{C\times C} + (1 - C^{-1})1_C1_C'$. Examining the moments of the latter half of $\hat\beta^p_C - \hat\beta_C = (\sum_{g=1}^{C}X_g'X_g\delta^p_g)^{-1}\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g$, we see:
$$E\Bigl[\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g\Big|X,\varepsilon\Bigr] = \sum_{g=1}^{C}X_g'\hat\varepsilon_gE[\delta^p_g] = \sum_{g=1}^{C}X_g'\hat\varepsilon_g = 0_K,\ \&$$
$$E\Bigl[\Bigl(\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g\Bigr)\Bigl(\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g\Bigr)'\Big|X,\varepsilon\Bigr] = \sum_{g=1}^{C}\sum_{h=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_h'X_hE[\delta^p_g\delta^p_h] = \sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g - C^{-1}\sum_{g=1}^{C}\sum_{h=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_h'X_h = \sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g = X'X\,\hat V(\hat\beta_C)\,X'X.$$
Were $\sum_{g=1}^{C}X_g'\hat\varepsilon_g\delta^p_g$ multiplied by $(X'X)^{-1}$, this would prove consistency of the second central moment of pairs bootstrap coefficients, but unfortunately it is multiplied by $(\sum_{g=1}^{C}X_g'X_g\delta^p_g)^{-1}$. However, it is easy to show that $(\sum_{g=1}^{C}X_g'X_g\delta^p_g)^{-1}$ converges in probability to $(X'X)^{-1}$ (see Appendix C below). Using this fact, Shao and Tu (1995) prove consistency of the second central moment using the artifice of assuming that when the minimum eigenvalue of $(\sum_{g=1}^{C}X_g'X_g\delta^p_g)^{-1}$ is less than ½ of the minimum eigenvalue of $(X'X)^{-1}$, an event whose probability converges to zero, $\hat\beta^p_C$ is set equal to $\hat\beta_C$.
It is well known that convergence in distribution does not imply convergence of moments, but the fact that the proof of Theorem II regarding the asymptotic permutation distribution of root-C correlation coefficients is based upon the method of moments (see Hoeffding, 1951 and the on-line Appendix of this paper) might lead to the erroneous conclusion that the results here imply consistency of all moments. They do not, as already implied by the discussion of the second moment of the pairs bootstrap. In Appendix C below, Theorem II is used to prove that across the equally likely permutations d of a given $\delta^b$, for b (both) = p (pairs) or w (wild)
$$\Bigl(\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\Bigr)^{-\frac12}\Bigl(\frac{\delta^{b\prime}O\delta^b}{C}\Bigr)^{-\frac12}\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_gd^b_g}{\sqrt C}\ \xrightarrow{d(d)|as(\delta^b,X,\varepsilon)}\ n_K,\qquad(11)$$
signifying, by the method of proof, that the moments across permutations d of δ of the left-hand side converge to those of the multivariate standard normal. Since this is true for all δ such that $\delta^{b\prime}O\delta^b > 0$, which almost surely holds (see (L2) in Appendix C), we can equally say that across the distribution of δ, the moments of (11) converge to those of the multivariate standard normal. For the wild bootstrap, $\sqrt C(\hat\beta^w_C - \hat\beta_C)$ consists of (11) multiplied by $(X'X/C)^{-1}(\delta^{w\prime}O\delta^w/C)^{\frac12}(\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C)^{\frac12}$, and as $\delta^{w\prime}O\delta^w/C \xrightarrow{as(\delta^w)} 1$, we can say that all the moments of $\sqrt C(\hat\beta^w_C - \hat\beta_C)$ converge to those of the multivariate normal with covariance matrix $C\hat V(\hat\beta_C)$, although these need not be the asymptotic moments of the sample coefficients $\sqrt C(\hat\beta_C - \beta)$. In the case of the pairs bootstrap, $\sqrt C(\hat\beta^p_C - \hat\beta_C)$ equals (11) multiplied by $(\sum_{g=1}^{C}X_g'X_g\delta^p_g/C)^{-1}(\delta^{p\prime}O\delta^p/C)^{\frac12}(\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C)^{\frac12}$, and as both $\sum_{g=1}^{C}X_g'X_g\delta^p_g/C$ and $\delta^{p\prime}O\delta^p/C$ are only shown to converge in probability, nothing can be said about the asymptotic moments of $\sqrt C(\hat\beta^p_C - \hat\beta_C)$ without the use of an artifice such as that of Shao and Tu (1995) mentioned above.
3. Assumptions on the wild bootstrap external variable
Liu (1988) proves the consistency of the second central moment of the wild bootstrap coefficients, assuming that the first and second moments of the wild bootstrap external variable $\delta^w_i$ are 0 and 1, respectively.9 This paper extends the proof to consistency in distribution by additionally requiring that $E[(\delta^w_i)^{2(1+\theta_1)}] < \infty$ for θ₁ > 1/γ, where γ > 0 is such that $E|x_{ij}x_{ik}|^{1+\gamma} < \infty$ and $E|x_{ij}x_{ik}\varepsilon_i^2|^{1+\gamma} < \infty$. As the proof of Theorem II is based on the method of moments, depending upon the existence of higher moments for the regressors, higher moments of $\delta^w_i$ are needed to ensure that all moments of (11) above exist and converge to those of the normal. Proofs of the consistency of wild bootstrap distributions typically assume that the external variable $\delta^w_i$ comes from a particular distribution, such as the Rademacher, with moments of all orders (e.g., Mammen, 1993; Canay et al., 2021). A notable exception is Djogbenou et al. (2019), where the proof of convergence in distribution merely requires that $E|\delta^w_i|^{2+\lambda} < \infty$ for some λ > 0. The wild bootstrap external variable, however, is under the control of the practitioner (i.e., not a characteristic of the given data) and, at this time, there appear to be no known advantages to using an external variable without higher moments.

6. Monte Carlo Illustration with INID Data

To illustrate the properties and consistency of the bootstraps with fully inid data, I use a data-generating process that departs strongly from the independently and identically distributed ideal. To ensure average moments do not even begin to converge in finite samples, I model underlying distributional parameters as following a random walk across the data. To stress test the theorems above, I use regressors with heavy-tailed distributions that barely satisfy the specified moment conditions. Finally, to further hinder convergence to the normal, I choose an error distribution that departs strongly from the shape of that ideal.
To begin with unclustered data, for i = 1 … N independent observations, I assume that:
$$y_i = \varepsilon_i,\qquad \varepsilon_i = B(a_{\varepsilon i}, b_{\varepsilon i}) - \frac{a_{\varepsilon i}}{a_{\varepsilon i} + b_{\varepsilon i}},\qquad x_i = t\bigl(2.01 + B(a_{xi}, b_{xi})\bigr),$$
$$a_{\varepsilon i} = a_{\varepsilon i-1} + U(-.5,.5),\quad b_{\varepsilon i} = b_{\varepsilon i-1} + U(-.5,.5),\quad a_{\varepsilon 0} = b_{\varepsilon 0} = U(-.5,.5),$$
$$a_{xi} = a_{xi-1} + U(-.5,.5),\quad b_{xi} = b_{xi-1} + U(-.5,.5),\ \&\ a_{x0} = b_{x0} = U(-.5,.5),\qquad(12)$$
where $B(a,b)$ denotes an independent draw from the Beta distribution with parameters a and b (and expectation a/(a + b)), t(v) an independent draw from the t-distribution with v degrees of freedom, and $U(-.5,.5)$ an independent draw from the uniform distribution on [−0.5, 0.5]. The random walks a and b (separate for ε and x) with their expanding variances ensure that the moments of the data do not meaningfully converge in simulation.10 These random walks can be thought of as an underlying population characteristic that develops, say, geographically or intertemporally. Observations, however, are drawn independently from these otherwise related distributions. Beta random variables are bounded on [0, 1] and, depending upon a and b, can be heavily skewed toward 0 or 1, symmetric unimodally around 0.5, or bimodally concentrated at both 0 and 1, to name just a few possibilities. This departs strongly from the symmetric unimodal normal distribution on the real line. Random variables drawn from a t-distribution only have finite moments up to their degrees of freedom. Thus, the regressors $x_i$ only have moments between 2.01 and 3.01, approaching the limits of the assumptions in Theorem I.
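A runnable sketch of (12) follows. Note that (12) as displayed does not pin down how the random-walk Beta parameters are kept strictly positive (Beta parameters must be > 0); the clipping at 0.05 used here is my own assumption, and the Stata code in the Supplementary Material is the authoritative implementation.

```python
import numpy as np

def dgp_unclustered(N, rng):
    """Sketch of DGP (12): centered Beta errors and t regressors with
    2.01 + Beta degrees of freedom, with all four Beta parameters
    following random walks (clipped positive -- my own assumption)."""
    y, x = np.empty(N), np.empty(N)
    a_e, b_e, a_x, b_x = rng.uniform(-0.5, 0.5, size=4)
    for i in range(N):
        a_e = max(a_e + rng.uniform(-0.5, 0.5), 0.05)   # random walk parameters
        b_e = max(b_e + rng.uniform(-0.5, 0.5), 0.05)
        a_x = max(a_x + rng.uniform(-0.5, 0.5), 0.05)
        b_x = max(b_x + rng.uniform(-0.5, 0.5), 0.05)
        y[i] = rng.beta(a_e, b_e) - a_e / (a_e + b_e)   # centered Beta error
        x[i] = rng.standard_t(2.01 + rng.beta(a_x, b_x))  # barely two moments
    return y, x
```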
For a clustered data-generating process, with dependence within clusters, I generate g = 1 … C clusters with cluster effects that follow (12) above (substituting $g$ for i everywhere in those equations), and observation level data:
$$y_i = \varepsilon_i,\qquad \varepsilon_i = \varepsilon_{g(i)} + B(a_{\varepsilon g(i)}, b_{\varepsilon g(i)}) - \frac{a_{\varepsilon g(i)}}{a_{\varepsilon g(i)} + b_{\varepsilon g(i)}},\qquad x_i = x_{g(i)} + t\bigl(2.01 + B(a_{xg(i)}, b_{xg(i)})\bigr),\qquad(13)$$
where $g(i)$ denotes the cluster to which observation i belongs. Thus, each regressor and disturbance observation within a cluster is composed of a common cluster effect plus a similarly, but independently, distributed observation effect. The estimated regression model is:
$$y_i = \alpha + \beta x_i + \varepsilon_i,\quad\text{where}\ \alpha = \beta = 0.\qquad(14)$$
Table 2 reports Monte Carlo results for tests of the true null of β = 0 in the OLS regression (14) for the data-generating processes described in (12) and (13) with 10, 100, 1000, 10,000, 100,000, and 1,000,000 observations or clusters (and five observations per cluster). For the conventional test, I report the p-value of the two-sided test using the squared sample t-statistic based upon the clustered/heteroskedasticity robust covariance estimate evaluated using its asymptotic chi-squared distribution. For the bootstraps, I report p-values based upon the bootstrap-c, evaluating the squared coefficient deviation from the null using the percentiles of the squared coefficient deviations of the bootstraps from the mean of their data-generating processes, and the bootstrap-t, evaluating the sample squared t-statistic using the corresponding squared bootstrap test statistics, i.e.,
$$\text{Bootstrap-c: } (\hat\beta - \beta)^2\ \text{evaluated using}\ (\hat\beta^b - \hat\beta)^2,\qquad \text{Bootstrap-t: } \frac{(\hat\beta - \beta)^2}{\hat V(\hat\beta)}\ \text{evaluated using}\ \frac{(\hat\beta^b - \hat\beta)^2}{\hat V(\hat\beta^b)}.\qquad(15)$$
99 draws are used for each bootstrap, and an exact test relative to the bootstrap distribution is achieved using a p-value given by (G + (T + 1)U[0,1])/100, where G and T are the number of bootstrap test statistics greater than and equal to, respectively, that of the sample and U[0,1] is a draw from the uniform distribution on [0,1].11 For the wild bootstrap, $\delta^w_g$ is drawn from the Rademacher distribution, which equals ±1 with equal probability and appears to perform better than alternatives (Davidson & Flachaire, 2008). 1000 realizations of the data-generating process are used for each specification.
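For clarity, a sketch of this exact p-value computation (helper name my own): the randomized tie-breaking term (T + 1)·U[0,1] makes the p-value exact relative to the bootstrap distribution.

```python
import numpy as np

def exact_bootstrap_pvalue(t_sample, t_boot, rng):
    """Exact p-value relative to B bootstrap draws (B = 99 here):
    p = (G + (T + 1) U) / (B + 1), with G and T the numbers of bootstrap
    statistics strictly greater than and equal to the sample statistic."""
    t_boot = np.asarray(t_boot)
    B = len(t_boot)
    G = np.sum(t_boot > t_sample)
    T = np.sum(t_boot == t_sample)
    return (G + (T + 1) * rng.uniform()) / (B + 1)
```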
As can be seen in panel (A) of the table, rejection rates using both the conventional chi-squared distribution and those of the bootstraps differ substantially from their nominal values in small samples, but converge to the 0.01, 0.05, and 0.10 levels as the number of observations or clusters increases. The central 95 percent range of the binomially distributed Monte Carlo rejection rate in 1000 independent draws is 0.004 to 0.016 at the 0.01 level, 0.037 to 0.063 at the 0.05 level, and 0.082 to 0.118 at the 0.10 level. With 1,000,000 independent observations or clusters, most rejection rates lie within those bounds. As shown in panel (B) of the table, the Kolmogorov–Smirnov test statistic for the null that the p-value distributions are uniform, i.e., the maximum absolute difference between the cumulative distribution function of each set of 1000 p-values and that of the uniform distribution, is less than or equal to 0.028 in all cases, with the p-value of the null that the distributions are uniform on [0,1] exceeding 0.41 in each instance.
Theorem V asserts conditional consistency, i.e., the asymptotic distribution of the bootstraps is normal with a covariance matrix equal to that of the conventional estimate. If so, evaluating the conventional test statistic with the full distribution of each bootstrap should asymptotically yield a p-value identical to that found by evaluating the same using the chi-squared distribution.12 Panel (C) of Table 2 reports the correlation between the bootstrap and conventional p-values with 1,000,000 observations or clusters. As can be seen, this is at least 0.9889 in all cases. As the bootstrap p-values are based upon only a sample from their distribution, while exact relative to that distribution, they cannot be expected to equal the conventional chi-squared p-value. Their correlation with the conventional p-value, however, should be the same as that found when evaluating the conventional test statistic using 99 independent draws from the chi-squared distribution and the same formula p = (G + (T + 1) U[0,1])/100.
Panel (C) of Table 2 also reports the probability that a correlation less than or equal to that found between the bootstrap and conventional p-values would arise when evaluating a squared t-statistic using 99 draws from the chi-squared distribution vs. the full chi-squared distribution itself. “p-value1” evaluates the correlation using the distribution conditional on the realized conventional squared t-statistics, i.e., it is a test of conditional consistency alone and does not assume consistency of the conventional test. While most p-values are very large, two of the eight are near zero, indicating that the correlation is not yet quite up to the level that would be expected from completed conditional convergence.13 “p-value2” evaluates the correlations using the distribution across random draws of the initial conventional test statistics from the chi-squared, i.e., a joint test of convergence of the conventional test statistic and conditional consistency of the bootstraps. Here the smallest p-value is 0.045, which, given any adjustment for multiple testing, can be taken as indicating that the tests do not reject the joint null implied by the theorems above at the 0.05 level.
Table 2 illustrates the consistency of bootstrap procedures with a highly challenging data-generating process. While previous results cited above require the existence of at least fourth moments of the regressors for convergence of both coefficients and Wald statistics in environments with iid data or inid data with convergent average moments, no more than slightly higher than second regressor moments are actually sufficient for fully inid data whose moments need not converge to anything, as specified in the theorems above and illustrated in these Monte Carlos.
As can also be seen in Table 2, in small samples with heavy-tailed regressors the conventional clustered/heteroskedasticity robust test statistic performs poorly, with rejection probabilities far above the nominal value. In such environments, the bootstraps often perform better (Davidson & Flachaire, 2008; Young, 2019). In the simulations of Table 2, this is clearly the case for the pairs bootstrap and, to a lesser degree, with the wild bootstrap using the asymptotically pivotal t-statistic.14 Thus, while providing the same asymptotic assurances as conventional inference methods, the bootstraps often provide a better approximation to the distribution of test statistics in small finite sample environments. It should be noted, however, that other methods also exist for improving the finite sample performance of the conventional test, such as the HC bias corrections of MacKinnon and White (1985) and the effective degrees of freedom corrections of Bell and McCaffrey (2002), Pustejovsky and Tipton (2018), and Young (2016).15

7. Conclusions

This paper characterizes the pairs and wild bootstraps as realizations of a permutation distribution and uses previously unexploited permutation theorems to derive less restrictive moment conditions for their conditional consistency. While prior work requires at least fourth moments of the regressors for consistency of distributions in an iid framework or inid framework where average moments converge to constants, only slightly more than second moments on the regressors are actually needed for consistency in a fully inid environment where average moments need never converge. The use of the same permutation theorems to characterize and derive new results for the asymptotic distributions of other techniques, such as bootstraps for time series, the jackknife, randomization inference, and conventional estimates on exchangeable data, is the subject of ongoing research.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/econometrics13040041/s1.

Funding

This research received no external funding.

Data Availability Statement

The Stata code used in the simulations is available as Supplementary Material, as well as on the author’s website at https://personal.lse.ac.uk/YoungA/ (accessed on 26 September 2025).

Acknowledgments

I am grateful to Taisuke Otsu and anonymous referees for helpful comments.

Conflicts of Interest

The author declares no conflict of interest.

Appendices

In the proofs below, corollaries to Markov’s Law of Large Numbers and the Continuous Mapping Theorem given in White (1984), as well as a Borel–Cantelli type corollary by Galambos (1987), will be useful:
Markov’s Law Corollary. 
Let $z_g$ be a sequence of independent random variables such that $E(|z_g|^{1+\gamma}) < \Delta < \infty$ for some γ > 0 and all $g$. Then $m(z_g) - m(E(z_g)) \xrightarrow{as} 0$.
Continuous Mapping Theorem Corollary. 
Let $g: \mathbb{R}^k \to \mathbb{R}^l$ be continuous on a compact set $D \subset \mathbb{R}^k$. Suppose that $b_C(\omega)$ and $d_C$ are k×1 vectors such that $b_C(\omega) - d_C \xrightarrow{as} 0$, and for all C sufficiently large, $d_C$ is interior to D, uniformly in C. Then $g(b_C(\omega)) - g(d_C) \xrightarrow{as} 0$.
Borel–Cantelli Corollary. 
Let x₁, x₂, … be an infinite sequence of random variables, $F_g(x)$ the cumulative distribution function of $x_g$ (i.e., Prob($x_g < x$)), and $u_C$ a nondecreasing sequence of real numbers such that for all $g$, Prob($x_g < \sup_C u_C$) = 1. Then,
$$\sum_{C=1}^{\infty}[1 - F_C(u_C)] < \infty\ \Rightarrow\ \text{Prob}\bigl(\max_{g\le C} x_g \ge u_C\ \text{infinitely often}\bigr) = 0.$$

Appendix A. Proof of Theorem I

This appendix references assumptions (Ia)–(Ic) and results (i)–(v) in Theorem I and uses the notation therein. By (Ib) and (Ic), $M_C$ and $V_C$ are nonsingular with determinant > η for all C sufficiently large and their elements uniformly bounded, as from Jensen’s Inequality:
$$E(|x_{gj}'x_{gk}|) \le E(|x_{gj}'x_{gk}|^{1+\gamma})^{1/(1+\gamma)} \le \Delta^{1/(1+\gamma)},\qquad E(|x_{gj}'\varepsilon_g\varepsilon_g'x_{gk}|) \le E(|x_{gj}'\varepsilon_g\varepsilon_g'x_{gk}|^{1+\gamma})^{1/(1+\gamma)} \le \Delta^{1/(1+\gamma)}.\qquad\text{(A1)}$$
As the sum of the eigenvalues of a matrix equals its trace and their product its determinant, their maximum eigenvalues are less than $K\Delta^{1/(1+\gamma)}$ and their minimum eigenvalues are greater than $\eta/(K\Delta^{1/(1+\gamma)})^{K-1}$ for all C sufficiently large. The minimum and maximum eigenvalues of their inverses are the inverses of these. Consequently, for all sufficiently large C, the determinants of their inverses are greater than $(K\Delta^{1/(1+\gamma)})^{-K} > 0$ and, by the spectral decomposition of a real symmetric matrix, the absolute value of their elements is bounded by $(K\Delta^{1/(1+\gamma)})^{K-1}/\eta$.16 This establishes result (iii) in Theorem I.
Using Jensen’s Inequality and (Ic),
$$E(|x_{gj}'\varepsilon_g|^{1+\gamma}) \le \bigl[E(|x_{gj}'\varepsilon_g\varepsilon_g'x_{gj}|^{1+\gamma})\bigr]^{\frac12} < \Delta^{\frac12},\qquad\text{(A2)}$$
so, by the Markov Corollary, (Ib), (Ic), and the independence of $(X_g,\varepsilon_g)$ across cluster groups (Ia):
$$\frac{X'\varepsilon}{C} - 0 = \frac{\sum_{g=1}^{C}X_g'\varepsilon_g}{C} - \frac{\sum_{g=1}^{C}E(X_g'\varepsilon_g)}{C}\ \xrightarrow{as(X,\varepsilon)}\ 0_{K\times 1},$$
$$\frac{X'X}{C} - M_C = \frac{\sum_{g=1}^{C}X_g'X_g}{C} - \frac{\sum_{g=1}^{C}E(X_g'X_g)}{C}\ \xrightarrow{as(X)}\ 0_{K\times K},$$
$$\frac{\sum_{g=1}^{C}X_g'\varepsilon_g\varepsilon_g'X_g}{C} - V_C = \frac{\sum_{g=1}^{C}X_g'\varepsilon_g\varepsilon_g'X_g}{C} - \frac{\sum_{g=1}^{C}E(X_g'\varepsilon_g\varepsilon_g'X_g)}{C}\ \xrightarrow{as(X,\varepsilon)}\ 0_{K\times K},$$
$$\Bigl(\frac{X'X}{C}\Bigr)^{-1} - M_C^{-1}\ \xrightarrow{as(X)}\ 0_{K\times K},\qquad\text{(A3)}$$
where the last follows from the Continuous Mapping Theorem Corollary. These results, combined with the boundedness of $M_C^{-1}$, establish result (i) in Theorem I:
$$\hat\beta_C = \Bigl(\frac{X'X}{C}\Bigr)^{-1}\frac{X'y}{C} = \beta + \Bigl(\frac{X'X}{C}\Bigr)^{-1}\frac{X'\varepsilon}{C}\ \xrightarrow{as(X,\varepsilon)}\ \beta.\qquad\text{(A4)}$$
$X'\varepsilon/\sqrt C$ is a vector with expectation and variance:
$$E\Bigl[\frac{X'\varepsilon}{\sqrt C}\Bigr] = \frac{\sum_{g=1}^{C}E(X_g'\varepsilon_g)}{\sqrt C} = 0_{K\times 1},\qquad E\Bigl[\frac{X'\varepsilon\varepsilon'X}{C}\Bigr] = \frac{\sum_{g=1}^{C}E(X_g'\varepsilon_g\varepsilon_g'X_g)}{C} = V_C.\qquad\text{(A5)}$$
As noted in White (1980a, p. 829—see also White, 1980b; Hoadley, 1971), given (A5), a multivariate Liapounov Central Limit Theorem implies that $V_C^{-\frac12}X'\varepsilon/\sqrt C$ is asymptotically distributed multivariate standard normal, $n_K$, provided that for all κ in $\mathbb{R}^K$ and some δ > 0:
$$\frac{\sum_{g=1}^{C}E(|\kappa'V_C^{-\frac12}X_g'\varepsilon_g|^{2+\delta})}{C^{(2+\delta)/2}}\ \to\ 0.\qquad\text{(A6)}$$
Define $\varphi' = \kappa'V_C^{-\frac12}$, and note that by the properties of the Rayleigh quotient $\varphi'\varphi \le \kappa'\kappa/\lambda_{min}$, where $\lambda_{min} = \eta/(K\Delta^{1/(1+\gamma)})^{K-1}$ is the lower bound on the minimum eigenvalue of $V_C$ ($1/\lambda_{min}$ the upper bound on the maximum eigenvalue of $V_C^{-1}$) given earlier above. Keeping in mind then that the kth element of $\varphi$, $\varphi_k$, is bounded, and noting that from (Ic) $E(|x_{gj}'\varepsilon_g\varepsilon_g'x_{gj}|^{1+\gamma}) = E(|x_{gj}'\varepsilon_g|^{2+2\gamma}) < \Delta$, we apply Minkowski’s Inequality:
$$E(|\kappa'V_C^{-\frac12}X_g'\varepsilon_g|^{2+2\gamma}) = E\Bigl(\Bigl|\sum_{k=1}^{K}\varphi_kx_{gk}'\varepsilon_g\Bigr|^{2+2\gamma}\Bigr) \le \Bigl[\sum_{k=1}^{K}\bigl[E(|\varphi_kx_{gk}'\varepsilon_g|^{2+2\gamma})\bigr]^{\frac{1}{2+2\gamma}}\Bigr]^{2+2\gamma} \le \Bigl[\sum_{k=1}^{K}\bigl(|\varphi_k|^{2+2\gamma}\Delta\bigr)^{\frac{1}{2+2\gamma}}\Bigr]^{2+2\gamma} < \infty,\qquad\text{(A7)}$$
so (A6) holds with δ = 2γ. Consequently, we can say that
$$V_C^{-\frac12}M_C\sqrt C(\hat\beta_C - \beta) = V_C^{-\frac12}M_C\Bigl(\frac{X'X}{C}\Bigr)^{-1}\frac{X'\varepsilon}{\sqrt C}\ \xrightarrow{d(X,\varepsilon)}\ n_K,\quad\text{as}\quad M_C\Bigl(\frac{X'X}{C}\Bigr)^{-1}\ \xrightarrow{as(X,\varepsilon)}\ I_K.\qquad\text{(A8)}$$
This establishes result (ii) in Theorem I.
As $\hat\varepsilon_g = \varepsilon_g + X_g(\beta - \hat\beta_C)$, the jkth element of $\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C$ equals:
$$\frac{\sum_{g=1}^{C}x_{gj}'\hat\varepsilon_g\hat\varepsilon_g'x_{gk}}{C} = \sum_{r=1}^{K}\sum_{s=1}^{K}(\beta_r - \hat\beta_r)C^{\frac{1-\theta}{2}}(\beta_s - \hat\beta_s)C^{\frac{1-\theta}{2}}\frac{\sum_{g=1}^{C}x_{gj}'x_{gr}x_{gk}'x_{gs}}{C^{2-\theta}}$$
$$+\ \sum_{r=1}^{K}(\beta_r - \hat\beta_r)C^{\frac{1-\theta}{2}}\Bigl[\frac{\sum_{g=1}^{C}x_{gj}'x_{gr}x_{gk}'\varepsilon_g}{C^{1+\frac{1-\theta}{2}}} + \frac{\sum_{g=1}^{C}x_{gk}'x_{gr}x_{gj}'\varepsilon_g}{C^{1+\frac{1-\theta}{2}}}\Bigr] + \frac{\sum_{g=1}^{C}x_{gj}'\varepsilon_g\varepsilon_g'x_{gk}}{C},\qquad\text{(A9)}$$
where we select θ such that γ/(1 + γ) > θ > 0, with γ as in (Ib) and (Ic). Repeatedly applying the Cauchy–Schwarz Inequality, we have
$$\frac{\sum_{g=1}^{C}x_{gj}'x_{gr}x_{gk}'x_{gs}}{C^{2-\theta}} \le \sqrt{\frac{\sum_{g=1}^{C}(x_{gj}'x_{gr})^2}{C^{2-\theta}}}\sqrt{\frac{\sum_{g=1}^{C}(x_{gk}'x_{gs})^2}{C^{2-\theta}}} \le \sqrt{\frac{\sum_{g=1}^{C}(x_{gj}'x_{gj})(x_{gr}'x_{gr})}{C^{2-\theta}}}\sqrt{\frac{\sum_{g=1}^{C}(x_{gk}'x_{gk})(x_{gs}'x_{gs})}{C^{2-\theta}}} \le \sqrt{\prod_{i=j,k}\frac{\max_{g\le C}x_{gi}'x_{gi}}{C^{1-\theta}}\prod_{i=r,s}\frac{\sum_{g=1}^{C}x_{gi}'x_{gi}}{C}},$$
$$\frac{\sum_{g=1}^{C}x_{gj}'x_{gr}x_{gk}'\varepsilon_g}{C^{1+\frac{1-\theta}{2}}} \le \sqrt{\frac{\max_{g\le C}x_{gj}'x_{gj}}{C^{1-\theta}}\cdot\frac{\sum_{g=1}^{C}x_{gr}'x_{gr}}{C}\cdot\frac{\sum_{g=1}^{C}(x_{gk}'\varepsilon_g)^2}{C}}.\qquad\text{(A10)}$$
Using Markov’s Inequality and $E(|x_{gj}'x_{gk}|^{1+\gamma}) < \Delta$ in (Ib), we can state that for any δ > 1/(1 + γ) but < 1 − θ,
$$\sum_{C=1}^{\infty}\text{Prob}\bigl(x_{Cj}'x_{Cj} \ge C^{\delta}\bigr) \le \sum_{C=1}^{\infty}\frac{E(|x_{Cj}'x_{Cj}|^{1+\gamma})}{C^{\delta(1+\gamma)}} < \sum_{C=1}^{\infty}\frac{\Delta}{C^{\delta(1+\gamma)}} < \infty.\qquad\text{(A11)}$$
So, by the Borel–Cantelli Corollary, $\max_{g\le C}x_{gj}'x_{gj}$ is asymptotically almost surely less than $C^{\delta}$, and hence $\max_{g\le C}x_{gj}'x_{gj}/C^{1-\theta}$ almost surely converges to zero for 1 − θ > 1/(1 + γ), i.e., 0 < θ < γ/(1 + γ). Together with (A3)’s results that $\sum_{g=1}^{C}x_{gr}'x_{gr}/C$ and $\sum_{g=1}^{C}(x_{gk}'\varepsilon_g)^2/C$ almost surely converge to bounded elements of $M_C$ and $V_C$, this establishes that both left-hand side terms in (A10) almost surely converge to 0. Results (i)–(iii) show that $\sqrt C(\beta_r - \hat\beta_r)$ is asymptotically normally distributed with mean zero and variance bounded by some σ² > 0. Hence, asymptotically, the probability that $|\sqrt C(\beta_r - \hat\beta_r)| > C^{\delta}$ for any δ > 0 and < θ can be bounded by
$$\frac{2}{\sqrt{2\pi\sigma^2}}\int_{C^{\delta}}^{\infty}\exp\Bigl(\frac{-x^2}{2\sigma^2}\Bigr)dx < \frac{2}{\sqrt{2\pi\sigma^2}}\int_{C^{\delta}}^{\infty}\frac{x}{C^{\delta}}\exp\Bigl(\frac{-x^2}{2\sigma^2}\Bigr)dx = \frac{2\sigma^2}{\sqrt{2\pi\sigma^2}}\frac{1}{C^{\delta}}\exp\Bigl(\frac{-C^{2\delta}}{2\sigma^2}\Bigr),\qquad\text{(A12)}$$
which is less than $1/C^2$ for all C sufficiently large. So,
$$\sum_{C=1}^{\infty}\text{Prob}\bigl(|\sqrt C(\beta_r - \hat\beta_r)| \ge C^{\delta}\bigr) < \infty,\qquad\text{(A13)}$$
and by the Borel–Cantelli Lemma, $\sqrt C(\beta_r - \hat\beta_r)/C^{\theta} \xrightarrow{as(X,\varepsilon)} 0$. From (A3), the last $\sum_{g=1}^{C}x_{gj}'\varepsilon_g\varepsilon_g'x_{gk}/C$ term in (A9) almost surely converges to the jkth element of $V_C$. Putting all of the above together, we see that
$$\frac{\sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C} - V_C\ \xrightarrow{as(X,\varepsilon)}\ 0_{K\times K}\quad\text{and}\quad\Bigl(\frac{X'X}{C}\Bigr)^{-1} - M_C^{-1}\ \xrightarrow{as(X)}\ 0_{K\times K}\ \text{(A3 above)},\ \text{so}\quad C\hat V(\hat\beta_C) - M_C^{-1}V_CM_C^{-1}\ \xrightarrow{as(X,\varepsilon)}\ 0_{K\times K},\qquad\text{(A14)}$$
establishing (iv) in Theorem I. Result (v) follows from (i)–(iv).

Appendix B. Proof of Theorem IV

If either the $z_g$ or $\delta_g$ are all identical ($z_g = z$ or $\delta_g = \delta$), Theorem IV follows immediately. Assuming this is not the case, we use the symmetry and equal likelihood of permutations to calculate the expectation of $d_g$ and products of $d_g$ across the row permutations d of δ:
$$E_d(d_g) = \frac{\sum_{g=1}^{C}\delta_g}{C} = m(\delta_g),\quad E_d(d_g^2) = \frac{\sum_{g=1}^{C}\delta_g^2}{C} = m(\delta_g^2),\quad\&\quad E_d(d_gd_{h\ne g}) = \frac{\sum_{g=1}^{C}\sum_{h\ne g}\delta_g\delta_h}{C(C-1)} = \frac{m(\delta_g)^2C}{C-1} - \frac{m(\delta_g^2)}{C-1}.$$
The mean and variance of $m(z_gd_g) - m(z_g)m(d_g)$ across realizations of d are given by:
$$E_d\bigl(m(z_gd_g) - m(z_g)m(d_g)\bigr) = \frac{\sum_{g=1}^{C}z_gE_d(d_g)}{C} - m(z_g)m(\delta_g) = 0,$$
$$E_d\bigl[\bigl(m(z_gd_g) - m(z_g)m(d_g)\bigr)^2\bigr] = \frac{\sum_{g\ne h}z_gz_hE_d(d_gd_h)}{C^2} + \frac{\sum_{g=1}^{C}z_g^2E_d(d_g^2)}{C^2} - m(z_g)^2m(\delta_g)^2$$
$$= \Bigl[\frac{m(\delta_g)^2C}{C-1} - \frac{m(\delta_g^2)}{C-1}\Bigr]\Bigl[\frac{\sum_{g=1}^{C}\sum_{h=1}^{C}z_gz_h}{C^2} - \frac{\sum_{g=1}^{C}z_g^2}{C^2}\Bigr] + \frac{m(\delta_g^2)\sum_{g=1}^{C}z_g^2}{C^2} - m(z_g)^2m(\delta_g)^2$$
$$= \Bigl[\frac{m(\delta_g)^2C}{C-1} - \frac{m(\delta_g^2)}{C-1}\Bigr]\Bigl[m(z_g)^2 - \frac{m(z_g^2)}{C}\Bigr] + \frac{m(\delta_g^2)m(z_g^2)}{C} - m(z_g)^2m(\delta_g)^2 = \frac{[m(z_g^2) - m(z_g)^2][m(\delta_g^2) - m(\delta_g)^2]}{C-1},$$
where $\sum_{g\ne h}$ denotes the summation across the two indices, excluding ties between them. The last line shows that if (IVb) holds, then across the permutations d of δ, $m(z_gd_g) - m(z_g)m(d_g)$ converges in mean square and hence in probability to 0, as stated in Theorem IV.

Appendix C. Proof of Theorem V

We begin by noting the following Lemma, proven in Appendix D further below.
Lemma 1.
Let $\xrightarrow{as(\delta)}$ or $\xrightarrow{p(\delta)}$ denote convergence almost surely or in probability across the distribution of δ, τ any integer greater than 2, b (both) = p (pairs) or w (wild), γ > 0 as given in Theorem I, θ₁ > 0 as in Theorem V, and η₁ and κ constants > 0. For all θ such that γ/(1 + γ) > θ > 0 (pairs) or γ/(1 + γ) > θ > 1/(1 + θ₁) (wild):
$$m(\delta^w_g)\ \xrightarrow{as(\delta^w)}\ 0,\qquad m((\delta^w_g)^2)\ \xrightarrow{as(\delta^w)}\ 1,\qquad\&\qquad C^{-\theta}m((\delta^w_g)^4)\ \xrightarrow{as(\delta^w)}\ 0;\qquad\text{(L1w)}$$
$$m(\delta^p_g) = 1,\qquad m((\delta^p_g)^2)\ \xrightarrow{p(\delta^p)}\ 2,\qquad\&\qquad C^{-\theta}m((\delta^p_g)^2)\ \xrightarrow{as(\delta^p)}\ 0;\qquad\text{(L1p)}$$
$$\text{almost surely, for all C sufficiently large,}\quad \frac{\sum_{g=1}^{C}[\delta^b_g - m(\delta^b_g)]^2}{C} > \kappa > 0;\qquad\text{(L2)}$$
$$C^{(1-\theta)(\frac{\tau}{2}-1)}\frac{\sum_{g=1}^{C}[\delta^b_g - m(\delta^b_g)]^{\tau}}{\bigl(\sum_{g=1}^{C}[\delta^b_g - m(\delta^b_g)]^2\bigr)^{\tau/2}}\ \xrightarrow{as(\delta^b)}\ 0;\qquad\text{(L3)}$$
$$\text{almost surely, for all C sufficiently large,}\ X'X/C,\ \sum_{g=1}^{C}X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C,\ \text{and their inverses are bounded and positive definite with determinant} > \eta_1 > 0;\qquad\text{(L4)}$$
$$\forall\,k\ \&\ \tau:\quad C^{\theta(\frac{\tau}{2}-1)}\frac{\sum_{g=1}^{C}(x_{gk}'\hat\varepsilon_g)^{\tau}}{\bigl(\sum_{g=1}^{C}(x_{gk}'\hat\varepsilon_g)^2\bigr)^{\tau/2}}\ \xrightarrow{as(X,\varepsilon)}\ 0;\qquad\text{(L5)}$$
$$\forall\,j,k:\quad \frac{m((x_{gj}'x_{gk})^2)}{C^{1-\theta}}\ \xrightarrow{as(X)}\ 0;\qquad\text{(L6)}$$
$$\forall\,j,k:\quad \frac{m((x_{gj}'x_{gk})^4)}{C^{3-3\theta}}\ \xrightarrow{as(X)}\ 0.\qquad\text{(L7)}$$
Use of θ, θ1, and γ below follows the bounds and definitions in Lemma 1 and Theorems I and V earlier.
For a permutation d of $\delta^w$ or $\delta^p$, the coefficient estimates of the pairs and wild bootstrap are, following (4) and (5) in the text, given by
$$\sqrt{C}\left(\hat\beta_C^p - \hat\beta_C\right) = A^{-1}a \quad\text{and}\quad \sqrt{C}\left(\hat\beta_C^w - \hat\beta_C\right) = \left(\frac{X'X}{C}\right)^{-1}a, \quad\text{where}\quad A = \sum_{g=1}^C \frac{X_g'X_g}{C}d_g \quad\text{and}\quad a = \sum_{g=1}^C \frac{X_g'\hat\varepsilon_g}{\sqrt{C}}d_g. \quad (C1)$$
Our objective is to describe the distribution of these objects across permutations d given the realization of conditions on δ, X, and ε. When results apply to both bootstraps, we use the notation d; when they apply to only one bootstrap, we use the notation $d^w$ or $d^p$.
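To fix ideas before working through A and a term by term, the following minimal sketch (ours, with assumptions of our own choosing: ungrouped data so that each "cluster" is a single observation, multinomial sampling frequencies for $\delta^p$, and Rademacher weights for $\delta^w$) generates pairs and wild draws written exactly in the permutation form of (C1); it is an illustration, not the paper's code.

```python
# A minimal sketch of the two bootstraps in the form (C1):
# sqrt(C)(beta^p - beta_hat) = A^{-1} a and sqrt(C)(beta^w - beta_hat) =
# (X'X/C)^{-1} a, with a = sum_g X_g' e_g d_g / sqrt(C).
import numpy as np

rng = np.random.default_rng(1)
C, K = 1000, 3
X = rng.normal(size=(C, K))
y = X @ np.ones(K) + rng.normal(size=C) * (1 + X[:, 0] ** 2)  # heteroskedastic

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

def pairs_draw():
    d = rng.multinomial(C, np.full(C, 1 / C))        # sampling frequencies d^p
    A = (X.T * d) @ X / C
    a = (X.T * (d * e_hat)).sum(axis=1) / np.sqrt(C)
    return np.linalg.solve(A, a)                     # sqrt(C)(beta^p - beta_hat)

def wild_draw():
    d = rng.choice([-1.0, 1.0], size=C)              # Rademacher weights d^w
    a = (X.T * (d * e_hat)).sum(axis=1) / np.sqrt(C)
    return np.linalg.solve(X.T @ X / C, a)           # sqrt(C)(beta^w - beta_hat)

draws_p = np.array([pairs_draw() for _ in range(999)])
draws_w = np.array([wild_draw() for _ in range(999)])
print(draws_p.std(axis=0))   # the two permutation distributions should have
print(draws_w.std(axis=0))   # similar spread for C this large
```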
Regarding the jk-th element of A, given by $\sum_{g=1}^C x_{gj}'x_{gk}\,d_g^p/C$, we can apply Theorem IV with $z_g = x_{gj}'x_{gk}$. Condition (IVb) in this case requires that:
$$\left[\frac{m((x_{gj}'x_{gk})^2) - m(x_{gj}'x_{gk})^2}{C^{1-\theta}}\right]\left[\frac{m((\delta_g^p)^2) - m(\delta_g^p)^2}{C^{\theta}}\right] \xrightarrow{as(\delta^p,X,\varepsilon)} 0, \quad (C2)$$
which is guaranteed by (L1p), (L4), and (L6) above. So, with $m(\delta_g^p) = 1$,
$$A - \frac{X'X}{C}\,m(\delta_g^p) \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0_{K\times K}, \quad (C3)$$
and by the corollary to the Continuous Mapping Theorem given above, $A^{-1}$ converges in probability to $(X'X/C)^{-1}$, which by (L4) is bounded and positive definite.
Regarding a, we apply Theorem II in the text with $Z = \{z_1, \ldots, z_C\}$ and $z_g = X_g'\hat\varepsilon_g$. Since $Z'1_C = X'\hat\varepsilon = X'y - X'X\hat\beta_C = X'y - X'y = 0_K$, the mean of $z_g$ is zero, and so we have $Z'Z = \sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g$ and $Z'd = \sum_{g=1}^C X_g'\hat\varepsilon_g d_g$. From (L2), we know that almost surely $d'd/C = \delta^{b\prime}\delta^b/C > \kappa > 0$, while (L4) ensures that $Z'Z/C$ is positive definite with determinant $> \eta_1 > 0$. Hence, following Theorems II and III, the distribution across d of
$$\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\left(\frac{d'd}{C}\right)^{-\frac12}\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g d_g}{\sqrt{C}} \quad (C4)$$
converges almost surely (across $\delta^b$, X, ε) to that of the iid multivariate standard normal, as by (L3) and (L5) condition (IIb) in Theorem II holds for all elements $z_{gk}$ of $z_g$. Using (L4) and the fact that $\delta^{b\prime}\delta^b/C = d'd/C$ is a scalar, we then have:
$$\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\frac{X'X}{C}\left(\frac{\delta^{p\prime}\delta^p}{C}\right)^{-\frac12}\sqrt{C}\left(\hat\beta_C^p - \hat\beta_C\right) = \underbrace{\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\frac{X'X}{C}A^{-1}\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{\frac12}}_{\xrightarrow{p(d)|as(\delta^p,X,\varepsilon)}\ I_{K\times K}}\ \underbrace{\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\left(\frac{d^{p\prime}d^p}{C}\right)^{-\frac12}a}_{\xrightarrow{d(d)|as(\delta^p,X,\varepsilon)}\ n_K} \xrightarrow{d(d)|as(\delta^p,X,\varepsilon)} n_K,$$
$$\left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\frac{X'X}{C}\left(\frac{\delta^{w\prime}\delta^w}{C}\right)^{-\frac12}\sqrt{C}\left(\hat\beta_C^w - \hat\beta_C\right) = \left(\frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C}\right)^{-\frac12}\left(\frac{d^{w\prime}d^w}{C}\right)^{-\frac12}a \xrightarrow{d(d)|as(\delta^w,X,\varepsilon)} n_K, \quad (C5)$$
thereby establishing the claim in (Va).
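Continuing the sketch above (same X, C, e_hat, and draws_w), (Va) can be checked numerically: studentizing each wild draw by the diagonal of the robust sandwich covariance should leave each coefficient roughly standard normal across draws. The diagonal normalization used here is our simplification of the full matrix-square-root standardization in (C5).

```python
# Studentized wild draws should be approximately N(0,1) column by column.
Minv = np.linalg.inv(X.T @ X / C)
V = Minv @ ((X.T * e_hat**2) @ X / C) @ Minv   # = C * robust Vhat(beta_hat)
t = draws_w / np.sqrt(np.diag(V))
print(t.mean(axis=0), t.std(axis=0))           # ~ 0 and ~ 1 in each column
```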
Regarding the wild bootstrap clustered/heteroskedasticity robust covariance estimates, for a permutation $d^w$ of $\delta^w$, we have
$$C\hat V(\hat\beta_C^w) = \left(\frac{X'X}{C}\right)^{-1} W \left(\frac{X'X}{C}\right)^{-1}, \quad\text{where}\quad W = \sum_{g=1}^C \frac{X_g'\hat\varepsilon_g^w\hat\varepsilon_g^{w\prime}X_g}{C}. \quad (C6)$$
Using $\hat\varepsilon_g^w = X_g(\hat\beta_C - \hat\beta_C^w) + \hat\varepsilon_g d_g^w$, from (4) in the text, the jk-th element of W is given by:
$$\frac{\sum_{g=1}^C x_{gj}'\hat\varepsilon_g^w\,\hat\varepsilon_g^{w\prime}x_{gk}}{C} = \underbrace{m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\,(d_g^w)^2\right)}_{a} + \underbrace{\sum_{r=1}^K\sum_{s=1}^K \frac{\delta^{w\prime}\delta^w}{C^{1+\theta}}\,\hat\eta_r\hat\eta_s\,\frac{m(x_{gj}'x_{gr}\,x_{gk}'x_{gs})}{C^{1-\theta}}}_{c} - \underbrace{\sum_{r=1}^K \left(\frac{\delta^{w\prime}\delta^w}{C^{1+\theta}}\right)^{\frac12}\hat\eta_r\left[\frac{m(x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g\,d_g^w)}{C^{\frac12(1-\theta)}} + \frac{m(x_{gk}'x_{gr}\,x_{gj}'\hat\varepsilon_g\,d_g^w)}{C^{\frac12(1-\theta)}}\right]}_{b}, \quad (C7)$$
where $\hat\eta = (d'd/C)^{-\frac12}\sqrt{C}(\hat\beta_C^w - \hat\beta_C)$.
For “a”, we note that $(d_g^w)^2$ is the permutation of $(\delta_g^w)^2$ and apply Theorem IV with $z_g = x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g$. Condition (IVb) requires that:
$$\left[\frac{m((x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2) - m(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2}{C^{1-\theta}}\right]\left[\frac{m((\delta_g^w)^4) - m((\delta_g^w)^2)^2}{C^{\theta}}\right] \xrightarrow{as(\delta^w,X,\varepsilon)} 0. \quad (C8)$$
From (L1w) and (L4), we know that $[m((\delta_g^w)^4) - m((\delta_g^w)^2)^2]/C^{\theta}$ and $m(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta}$ $\xrightarrow{as(\delta^w,X,\varepsilon)} 0$. Applying the Cauchy–Schwarz Inequality (here, and frequently below),
$$\frac{m((x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2)}{C^{1-\theta}} = \frac{\sum_{g=1}^C (x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2}{C^{2-\theta}} \le \prod_{i=j,k}\left(\frac{\sum_{g=1}^C (x_{gi}'\hat\varepsilon_g)^4}{C^{2-\theta}}\right)^{\frac12} \xrightarrow{as(X,\varepsilon)} 0, \quad (C9)$$
where the last is guaranteed by (L4) and (L5), as
$$\frac{\sum_{g=1}^C (x_{gi}'\hat\varepsilon_g)^4}{C^{2-\theta}} = \underbrace{C^{\theta(\frac42-1)}\frac{\sum_{g=1}^C (x_{gi}'\hat\varepsilon_g)^4}{\left(\sum_{g=1}^C (x_{gi}'\hat\varepsilon_g)^2\right)^{4/2}}}_{\xrightarrow{as(X,\varepsilon)}\ 0\ \text{(L5 with } \tau=4)}\ \underbrace{\left(\frac{\sum_{g=1}^C (x_{gi}'\hat\varepsilon_g)^2}{C}\right)^{2}}_{as\ \text{bounded (L4)}} \xrightarrow{as(X,\varepsilon)} 0. \quad (C10)$$
So, by Theorem IV,
$$a: \quad m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\,(d_g^w)^2\right) - m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\right)\underbrace{m((\delta_g^w)^2)}_{\xrightarrow{as(\delta^w)}\ 1\ \text{(L1w)}} \xrightarrow{p(d)|as(\delta^w,X,\varepsilon)} 0. \quad (C11)$$
For “b”, we apply Theorem IV with $z_g = x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g/C^{\frac12(1-\theta)}$, so condition (IVb) requires that
$$\left[\frac{m((x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta}) - m(x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g/C^{\frac12(1-\theta)})^2}{C^{1-\theta}}\right]\left[\frac{m((\delta_g^w)^2) - m(\delta_g^w)^2}{C^{\theta}}\right] \xrightarrow{as(\delta^w,X,\varepsilon)} 0. \quad (C12)$$
Using (L1w) and
$$\left|m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g}{C^{\frac12(1-\theta)}}\right)\right| = \frac{\left|\sum_{g=1}^C x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g\right|}{C^{1+\frac12(1-\theta)}} \le \underbrace{\left(\frac{\sum_{g=1}^C (x_{gj}'x_{gr})^2}{C^{2-\theta}}\right)^{\frac12}}_{\xrightarrow{as(X,\varepsilon)}\ 0\ \text{(L6)}}\ \underbrace{\left(\frac{\sum_{g=1}^C (x_{gk}'\hat\varepsilon_g)^2}{C}\right)^{\frac12}}_{as\ \text{bounded (L4)}} \xrightarrow{as(X,\varepsilon)} 0, \quad (C13)$$
$$\frac{m((x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta})}{C^{1-\theta}} = \frac{\sum_{g=1}^C (x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g)^2}{C^{3-2\theta}} \le \underbrace{\left(\frac{\sum_{g=1}^C (x_{gj}'x_{gr})^4}{C^{4-3\theta}}\right)^{\frac12}}_{\xrightarrow{as(X,\varepsilon)}\ 0\ \text{(L7)}}\ \underbrace{\left(\frac{\sum_{g=1}^C (x_{gk}'\hat\varepsilon_g)^4}{C^{2-\theta}}\right)^{\frac12}}_{\xrightarrow{as(X,\varepsilon)}\ 0\ \text{(C10)}} \xrightarrow{as(X,\varepsilon)} 0, \quad (C14)$$
we see that condition (IVb) is met, and by Theorem IV we then have
$$b: \quad m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g\,d_g^w}{C^{\frac12(1-\theta)}}\right) - \underbrace{m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g}{C^{\frac12(1-\theta)}}\right)}_{\xrightarrow{as(X,\varepsilon)}\ 0\ \text{(C13)}}\ \underbrace{m(\delta_g^w)}_{\xrightarrow{as(\delta^w)}\ 0\ \text{(L1w)}} \xrightarrow{p(d)|as(\delta^w,X,\varepsilon)} 0. \quad (C15)$$
Finally, for “c”, we note that
$$\frac{\left|m(x_{gj}'x_{gr}\,x_{gk}'x_{gs})\right|}{C^{1-\theta}} \le \left(\frac{m((x_{gj}'x_{gr})^2)}{C^{1-\theta}}\right)^{\frac12}\left(\frac{m((x_{gk}'x_{gs})^2)}{C^{1-\theta}}\right)^{\frac12} \xrightarrow{as(X,\varepsilon)} 0 \quad\text{by (L6)}. \quad (C16)$$
From the above, we see that the $\hat\eta_r$ in (C7) are multiplied by $\delta^{w\prime}\delta^w/C^{1+\theta}$, which from (L1w) converges almost surely (across $\delta^w$) to 0; by “c” terms, which almost surely (across X, ε) converge to 0; and by “b” terms, which almost surely (across $\delta^w$, X, ε) converge in probability across permutations $d^w$ to zero. As the $\hat\eta_r$, from (L4) and (C5), almost surely (across $\delta^w$, X, ε) converge in distribution across permutations $d^w$ of $\delta^w$ to normal variables with bounded variance, it follows that when so multiplied they converge in probability across permutations $d^w$ to zero. This leaves only the “a” term, and consequently, using (C10), we see that
$$W - \frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C} \xrightarrow{p(d)|as(\delta^w,X,\varepsilon)} 0_{K\times K} \quad\text{and so}\quad C\hat V(\hat\beta_C^w) - C\hat V(\hat\beta_C) \xrightarrow{p(d)|as(\delta^w,X,\varepsilon)} 0_{K\times K}, \quad (C17)$$
which establishes (Vb) for the wild bootstrap.
For the pairs bootstrap clustered/heteroskedasticity robust covariance estimates, for a permutation $d^p$ of $\delta^p$, we have from (5),
$$C\hat V(\hat\beta_C^p) = A^{-1}PA^{-1}, \quad A = \sum_{g=1}^C \frac{X_g'X_g}{C}d_g^p \quad\text{and}\quad P = \sum_{g=1}^C \frac{X_g'\hat\varepsilon_g^p\hat\varepsilon_g^{p\prime}X_g}{C}d_g^p. \quad (C18)$$
Using $\hat\varepsilon_g^p = X_g(\hat\beta_C - \hat\beta_C^p) + \hat\varepsilon_g$, as given in (5) earlier, the jk-th element of P is given by
$$\frac{\sum_{g=1}^C x_{gj}'\hat\varepsilon_g^p\,\hat\varepsilon_g^{p\prime}x_{gk}\,d_g^p}{C} = \underbrace{m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\,d_g^p\right)}_{d} + \underbrace{\sum_{r=1}^K\sum_{s=1}^K \frac{\delta^{p\prime}\delta^p}{C^{1+\theta}}\,\hat\eta_r\hat\eta_s\,\frac{m(x_{gj}'x_{gr}\,x_{gk}'x_{gs}\,d_g^p)}{C^{1-\theta}}}_{f} - \underbrace{\sum_{r=1}^K \left(\frac{\delta^{p\prime}\delta^p}{C^{1+\theta}}\right)^{\frac12}\hat\eta_r\left[\frac{m(x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g\,d_g^p)}{C^{\frac12(1-\theta)}} + \frac{m(x_{gk}'x_{gr}\,x_{gj}'\hat\varepsilon_g\,d_g^p)}{C^{\frac12(1-\theta)}}\right]}_{e}, \quad (C19)$$
where $\hat\eta = (d'd/C)^{-\frac12}\sqrt{C}(\hat\beta_C^p - \hat\beta_C)$.
For “d”, we apply Theorem IV with $z_g = x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g$ and, as by (L1p), (L4), and (C10), $[m((\delta_g^p)^2) - m(\delta_g^p)^2]/C^{\theta}$, $m(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta}$, and $m((x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g)^2)/C^{1-\theta}$ all $\xrightarrow{as(\delta^p,X,\varepsilon)} 0$, condition (IVb) is met, so
$$d: \quad m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\,d_g^p\right) - m\left(x_{gj}'\hat\varepsilon_g\,x_{gk}'\hat\varepsilon_g\right)\underbrace{m(\delta_g^p)}_{=1\ \text{(L1p)}} \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0. \quad (C20)$$
For “e”, we apply Theorem IV with $z_g = x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g/C^{\frac12(1-\theta)}$ and, as by (L1p), (C13), and (C14), $[m((\delta_g^p)^2) - m(\delta_g^p)^2]/C^{\theta}$, $m(x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g/C^{\frac12(1-\theta)})^2/C^{1-\theta}$, and $m((x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta})/C^{1-\theta}$ all $\xrightarrow{as(\delta^p,X,\varepsilon)} 0$, condition (IVb) is met, so
$$e: \quad m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g\,d_g^p}{C^{\frac12(1-\theta)}}\right) - m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'\hat\varepsilon_g}{C^{\frac12(1-\theta)}}\right)\underbrace{m(\delta_g^p)}_{=1\ \text{(L1p)}} \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0. \quad (C21)$$
For “f”, we apply Theorem IV with $z_g = x_{gj}'x_{gr}\,x_{gk}'x_{gs}/C^{1-\theta}$ and see that condition (IVb) holds, as by (L1p) and (C16), $[m((\delta_g^p)^2) - m(\delta_g^p)^2]/C^{\theta}$ and $m(x_{gj}'x_{gr}\,x_{gk}'x_{gs}/C^{1-\theta})^2/C^{1-\theta}$ $\xrightarrow{as(\delta^p,X,\varepsilon)} 0$, while by the Cauchy–Schwarz Inequality and (L7),
$$\frac{m((x_{gj}'x_{gr}\,x_{gk}'x_{gs})^2/C^{2(1-\theta)})}{C^{1-\theta}} \le \left(\frac{\sum_{g=1}^C (x_{gj}'x_{gr})^4}{C^{4-3\theta}}\right)^{\frac12}\left(\frac{\sum_{g=1}^C (x_{gk}'x_{gs})^4}{C^{4-3\theta}}\right)^{\frac12} \xrightarrow{as(X,\varepsilon)} 0, \quad (C22)$$
so
$$f: \quad m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'x_{gs}\,d_g^p}{C^{1-\theta}}\right) - m\left(\frac{x_{gj}'x_{gr}\,x_{gk}'x_{gs}}{C^{1-\theta}}\right)\underbrace{m(\delta_g^p)}_{=1\ \text{(L1p)}} \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0. \quad (C23)$$
Similar to the case of the wild bootstrap, the $\hat\eta_r$ in (C19), which from (L4) and (C5) almost surely (across $\delta^p$, X, ε) converge in distribution across permutations $d^p$ of $\delta^p$ to normal variables with bounded variance, are multiplied by $\delta^{p\prime}\delta^p/C^{1+\theta}$, which from (L1p) converges almost surely (across $\delta^p$) to 0, and by “e” and “f” terms, which almost surely (across $\delta^p$, X, ε) converge in probability across permutations $d^p$ to zero; hence, when so multiplied, they converge in probability across permutations $d^p$ to zero. This leaves only the “d” term, and so, using (C3) earlier,
$$P - \frac{\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g}{C} \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0_{K\times K} \quad\&\quad A^{-1} - \left(\frac{X'X}{C}\right)^{-1} \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0_{K\times K}, \quad\text{and hence}\quad C\hat V(\hat\beta_C^p) - C\hat V(\hat\beta_C) \xrightarrow{p(d)|as(\delta^p,X,\varepsilon)} 0_{K\times K}, \quad (C24)$$
which establishes (Vb) for the pairs bootstrap.
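As a numerical companion to (Vb), the sketch below (continuing the earlier one, with the same rng, X, y, C, e_hat, beta_hat, and V) recomputes the robust covariance from a single pairs draw and compares it with the sample's robust covariance; the entrywise gaps shrink as C grows. Illustrative only.

```python
# (Vb) for the pairs bootstrap: A^{-1} P A^{-1} from one draw should be
# close to the sample's C * robust Vhat (the matrix V computed above).
d = rng.multinomial(C, np.full(C, 1 / C))          # pairs frequencies d^p
A = (X.T * d) @ X / C
beta_p = beta_hat + np.linalg.solve(A, (X.T * (d * e_hat)).sum(axis=1) / C)
e_p = y - X @ beta_p                               # pairs-draw residuals
P = (X.T * (d * e_p**2)) @ X / C
Ainv = np.linalg.inv(A)
print(np.round(Ainv @ P @ Ainv - V, 2))            # entrywise small
```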

Appendix D. Proof of Lemma 1 in Appendix B

(L1), (L2), (L3): We prove these for the wild bootstrap, placing the more involved proofs for the pairs in the on-line appendix. From the assumptions $E[\delta_g^w] = 0$ and $E[(\delta_g^w)^2] = 1$ (Theorem V) and the Strong Law of Large Numbers, we know that $m(\delta_g^w) \xrightarrow{as(\delta^w)} 0$ and $m((\delta_g^w)^2) \xrightarrow{as(\delta^w)} 1$. Markov's Inequality, $E[(\delta_g^w)^{2(1+\theta_1)}] < \Delta$ (Theorem V), and $\theta > 1/(1+\theta_1)$ (Lemma 1) imply that there exists a $v$ in $(1/(1+\theta_1), \theta)$ such that
$$\sum_{C=1}^{\infty}\text{Prob}\left((\delta_C^w)^2 \ge C^{v}\right) \le \sum_{C=1}^{\infty}\frac{E[(\delta_C^w)^{2(1+\theta_1)}]}{C^{v(1+\theta_1)}} < \sum_{C=1}^{\infty}\frac{\Delta}{C^{v(1+\theta_1)}} < \infty, \quad (D1)$$
and thus, by the Borel–Cantelli Corollary given above, $\max_{g\le C}(\delta_g^w)^2/C^{\theta} \xrightarrow{as(\delta^w)} 0$, and so,
$$\frac{m((\delta_g^w)^4)}{C^{\theta}} = \frac{\sum_{g=1}^C(\delta_g^w)^4}{C^{1+\theta}} \le \frac{\max_{g\le C}(\delta_g^w)^2}{C^{\theta}}\,m((\delta_g^w)^2) \xrightarrow{as(\delta^w)} 0. \quad (D2)$$
This establishes (L1w). As $\delta^{w\prime}\delta^w/C \xrightarrow{as(\delta^w)} 1$, for all C that are sufficiently large $\delta^{w\prime}\delta^w/C$ is almost surely greater than some κ > 0, as stated in (L2). Regarding (L3), as for τ > 2,
$$\left|\sum_{g=1}^C[\delta_g^w - m(\delta_g^w)]^{\tau}\right| \le \sum_{g=1}^C\left|\delta_g^w - m(\delta_g^w)\right|^{\tau} \le \left(\max_{g\le C}[\delta_g^w - m(\delta_g^w)]^2\right)^{\frac{\tau}{2}-1}\sum_{g=1}^C[\delta_g^w - m(\delta_g^w)]^2, \quad (D3)$$
we have
$$0 \le C^{(1-\theta)(\frac{\tau}{2}-1)}\frac{\left|\sum_{g=1}^C[\delta_g^w - m(\delta_g^w)]^{\tau}\right|}{\left(\sum_{g=1}^C[\delta_g^w - m(\delta_g^w)]^2\right)^{\tau/2}} \le \left(\frac{\max_{g\le C}[\delta_g^w - m(\delta_g^w)]^2/C^{\theta}}{\sum_{g=1}^C[\delta_g^w - m(\delta_g^w)]^2/C}\right)^{\frac{\tau}{2}-1}. \quad (D4)$$
From the above, we know that the denominator of the last term almost surely converges to 1, while for the numerator, using (L1w) and the result from (D1) that $\max_{g\le C}(\delta_g^w)^2/C^{\theta} \xrightarrow{as(\delta^w)} 0$:
$$\frac{\max_{g\le C}[\delta_g^w - m(\delta_g^w)]^2}{C^{\theta}} \le \frac{\max_{g\le C}(\delta_g^w)^2}{C^{\theta}} + \frac{2|m(\delta_g^w)|}{C^{\frac{\theta}{2}}}\left(\frac{\max_{g\le C}(\delta_g^w)^2}{C^{\theta}}\right)^{\frac12} + \frac{m(\delta_g^w)^2}{C^{\theta}} \xrightarrow{as(\delta^w)} 0. \quad (D5)$$
Consequently, (D4) almost surely converges to 0 for $\theta > 1/(1+\theta_1)$, proving (L3).
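For concreteness, (L1w) and (L3) are easy to check numerically for a common wild-weight choice. The snippet below (ours) uses standard normal weights, which satisfy E[δ] = 0 and E[δ²] = 1 with all moments finite, and illustrative picks θ = 0.3 and τ = 4 within the Lemma's bounds.

```python
# Both m(delta^4)/C^theta and the (L3) ratio should drift to zero as C grows.
import numpy as np

rng = np.random.default_rng(2)
theta, tau = 0.3, 4
for C in [10**3, 10**4, 10**5, 10**6]:
    delta = rng.normal(size=C)
    centered = delta - delta.mean()
    l1w = (delta**4).mean() / C**theta
    l3 = (C**((1 - theta) * (tau / 2 - 1)) * abs((centered**tau).sum())
          / (centered**2).sum() ** (tau / 2))
    print(C, round(l1w, 4), round(l3, 4))
```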
(L4): In the proof of Theorem I in Appendix A, we saw that $X'X/C - M_C \xrightarrow{as(X,\varepsilon)} 0_{K\times K}$ and $\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C - V_C \xrightarrow{as(X,\varepsilon)} 0_{K\times K}$, where the determinants of $M_C$ and $V_C$ are greater than η > 0 for all sufficiently large C and the absolute values of their elements are uniformly bounded by $\Delta^{1/(1+\gamma)}$. By the Continuous Mapping Theorem Corollary given above, $(X'X/C)^{-1} - M_C^{-1} \xrightarrow{as(X,\varepsilon)} 0_{K\times K}$ and $(\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C)^{-1} - V_C^{-1} \xrightarrow{as(X,\varepsilon)} 0_{K\times K}$, where for all C sufficiently large the determinants of $M_C^{-1}$ and $V_C^{-1}$ are greater than $(K\Delta^{1/(1+\gamma)})^{-K} > 0$ and the absolute values of their elements are bounded by $(K\Delta^{1/(1+\gamma)})^{K-1}/\eta$, as proven earlier. It follows that almost surely, for all C sufficiently large, $X'X/C$, $\sum_{g=1}^C X_g'\hat\varepsilon_g\hat\varepsilon_g'X_g/C$, and their inverses have the same properties.
(L5), (L6), and (L7): Following the same logic used in (D3) and (D4) and using the Cauchy–Schwarz Inequality, we note that:
$$C^{\theta(\frac{\tau}{2}-1)}\frac{\left|\sum_{g=1}^C(x_{gk}'\hat\varepsilon_g)^{\tau}\right|}{\left(\sum_{g=1}^C(x_{gk}'\hat\varepsilon_g)^2\right)^{\tau/2}} \le \left(\frac{\max_{g\le C}(x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta}}{\sum_{g=1}^C(x_{gk}'\hat\varepsilon_g)^2/C}\right)^{\frac{\tau}{2}-1}, \quad (D6a)$$
$$\frac{\sum_{g=1}^C(x_{gj}'x_{gk})^2}{C^{2-\theta}} \le \frac{\sum_{g=1}^C x_{gj}'x_{gj}\,x_{gk}'x_{gk}}{C^{2-\theta}} \le \frac{\max_{g\le C}x_{gj}'x_{gj}}{C^{1-\theta}}\,m(x_{gk}'x_{gk}), \quad (D6b)$$
$$\frac{\sum_{g=1}^C(x_{gj}'x_{gk})^4}{C^{4-3\theta}} \le \left(\frac{\max_{g\le C}x_{gj}'x_{gj}}{C^{1-\theta}}\right)^{2}\frac{\max_{g\le C}x_{gk}'x_{gk}}{C^{1-\theta}}\,m(x_{gk}'x_{gk}). \quad (D6c)$$
So, to prove (L5)–(L7), it is sufficient to show that the right-hand sides of these inequalities converge to zero. In Appendix A, we already showed that almost surely $m(x_{gk}'x_{gk})$ is bounded and $\max_{g\le C}x_{gj}'x_{gj}/C^{1-\theta}$ converges to 0, which establishes this for (D6b) and (D6c).
Turning to the right-hand side of (D6a), as shown in Appendix A, $m((x_{gk}'\hat\varepsilon_g)^2)$ almost surely converges to the corresponding diagonal element of $V_C$ in Theorem (Ic), whose smallest eigenvalue is greater than $\eta/(K\Delta^{1/(1+\gamma)})^{K-1}$ for all sufficiently large C. From the Schur–Horn Theorem, we know that the smallest diagonal element of $V_C$ is greater than or equal to its smallest eigenvalue, and hence the term $m((x_{gk}'\hat\varepsilon_g)^2)$ in the denominator of (D6a) is almost surely greater than $\eta/(K\Delta^{1/(1+\gamma)})^{K-1} > 0$ for all C sufficiently large. Regarding the max term in the numerator, using $\hat\varepsilon_g = \varepsilon_g + X_g(\beta - \hat\beta_C)$ and the Cauchy–Schwarz Inequality, we have
$$\frac{(x_{gk}'\hat\varepsilon_g)^2}{C^{1-\theta}} = \frac{(x_{gk}'\varepsilon_g)^2}{C^{1-\theta}} + \sum_{r=1}^K\sum_{s=1}^K(\beta_r-\hat\beta_r)C^{\frac{1-\theta}{2}}(\beta_s-\hat\beta_s)C^{\frac{1-\theta}{2}}\,\frac{x_{gk}'x_{gr}\,x_{gk}'x_{gs}}{C^{2-2\theta}} + 2\sum_{r=1}^K(\beta_r-\hat\beta_r)C^{\frac{1-\theta}{2}}\,\frac{x_{gk}'x_{gr}\,x_{gk}'\varepsilon_g}{C^{\frac32(1-\theta)}}$$
$$\le \frac{(x_{gk}'\varepsilon_g)^2}{C^{1-\theta}} + \sum_{r=1}^K\sum_{s=1}^K\left|\beta_r-\hat\beta_r\right|C^{\frac{1-\theta}{2}}\left|\beta_s-\hat\beta_s\right|C^{\frac{1-\theta}{2}}\,\frac{x_{gk}'x_{gk}}{C^{1-\theta}}\left(\frac{x_{gr}'x_{gr}}{C^{1-\theta}}\,\frac{x_{gs}'x_{gs}}{C^{1-\theta}}\right)^{\frac12} + 2\sum_{r=1}^K\left|\beta_r-\hat\beta_r\right|C^{\frac{1-\theta}{2}}\left(\frac{(x_{gk}'\varepsilon_g)^2}{C^{1-\theta}}\right)^{\frac12}\left(\frac{x_{gk}'x_{gk}}{C^{1-\theta}}\,\frac{x_{gr}'x_{gr}}{C^{1-\theta}}\right)^{\frac12}. \quad (D7)$$
It was shown in Appendix A that $\max_{g\le C}x_{gj}'x_{gj}/C^{1-\theta} \xrightarrow{as(X,\varepsilon)} 0$ and that $\sqrt{C}(\beta_r - \hat\beta_r)/C^{\delta}$ is asymptotically almost surely less than 1 for all δ > 0, so that $(\beta_r-\hat\beta_r)C^{\frac{1-\theta}{2}} = \sqrt{C}(\beta_r-\hat\beta_r)/C^{\frac{\theta}{2}} \xrightarrow{as(X,\varepsilon)} 0$. Consequently, to prove that $\max_{g\le C}(x_{gk}'\hat\varepsilon_g)^2/C^{1-\theta}$ converges almost surely to zero, it is sufficient to show that $\max_{g\le C}(x_{gk}'\varepsilon_g)^2/C^{1-\theta}$ converges almost surely to zero. However, $E(|x_{gj}'\varepsilon_g\,\varepsilon_g'x_{gk}|^{1+\gamma}) < \Delta$ in Theorem I (Ic), by the same argument used in (A11) above, ensures that this is the case for $0 < \theta < \gamma/(1+\gamma)$. In sum, (D6a)–(D6c) converge to 0 for all θ in $(0, \gamma/(1+\gamma))$, proving (L5)–(L7). As $\theta_1 > 1/\gamma$ in Theorem V, the condition $\theta > 1/(1+\theta_1)$ for the wild bootstrap in Lemma 1 and the proof of (L3) above can also be met without contradiction.
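The Markov/Borel–Cantelli mechanism behind the max terms in (D6a)–(D6c) is also easy to see numerically. In the sketch below (ours), $x_g^2$ has a Pareto tail with index 1 + γ for γ = 0.5, so that $E[(x_g^2)^{1+\gamma'}]$ is finite for any γ' < 0.5 and the bound θ < γ/(1+γ) is approximately 1/3; θ = 0.2 is our illustrative pick.

```python
# max_g x_g^2 / C^{1-theta} should drift (noisily) toward zero as C grows.
import numpy as np

rng = np.random.default_rng(3)
gamma, theta = 0.5, 0.2
for C in [10**3, 10**4, 10**5, 10**6]:
    x2 = rng.pareto(1 + gamma, size=C) + 1.0   # P(x2 > t) = t^{-(1+gamma)}, t >= 1
    print(C, x2.max() / C ** (1 - theta))
```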

Notes

1
As examples: (i) Thornton (2008) used a randomized experiment to investigate the demand for and effects of learning HIV status across north, central, and south Malawi, which differ systematically in their ethnicity and religion. (ii) Cai et al. (2009) investigated saliency by randomly assigning restaurant arrivals in China to tables with different menu setups; not surprisingly, the total bill paid varies systematically with the time of day.
2
With independent homoskedastic errors, the bootstrap resampling of estimated residuals (rather than the data itself) always yields consistent estimates of the coefficient distribution for a fixed number of OLS regressors (Bickel & Freedman, 1983).
3
Canay et al. (2021), who examine wild bootstrap consistency when the number of independent cluster groupings is fixed, similarly allow for heterogeneity across clusters while assuming convergence of the full sample cross-product and covariance matrices to matrices of constants and, additionally, convergence of the projection of regressors on each other within each cluster to a common matrix.
4
With E equal to the matrix of eigenvectors and Λ the diagonal matrix of eigenvalues of A, $A^{1/2} = E\Lambda^{1/2}E'$, where $\Lambda^{1/2}$ is the diagonal matrix with entries equal to the square roots of those of Λ.
5
The on-line Appendix proves consistency for sub-sampling, with and without replacement, of M < C groupings.
6
Although, as noted by Cavaliere and Georgiev (2020), even when conditional consistency does not hold, valid inference using the bootstrap is still possible if the unconditional limit distribution of the sample test statistic equals the average of the random limit distribution of the bootstrap given the data.
7
Where $1_C$ denotes a $C\times 1$ vector of ones and $I_{C\times C}$ the $C\times C$ identity matrix.
8
For the wild bootstrap, (7) follows immediately from the assumptions on moments. The proof for the pairs bootstrap is lengthy and is given in the on-line Appendix.
9
Liu (1988) also advocated selecting $E((\delta_g^w)^3) = 1$ to correct for skewness in the Edgeworth expansion. However, Monte Carlos find that forms of $\delta_g^w$ that make this assumption perform less well than those that do not (Davidson & Flachaire, 2008; MacKinnon, 2015).
10
To illustrate, the average skewness (i.e., standardized third central moment) of the first 1,000,000 Beta data-generating processes in 1000 runs of (12) ranges from −0.41 to 0.43, with a standard deviation of 0.14. A single run of 2 billion observations also shows no sign of converging, as the average skewness of the first 200, 400, 600, …, 2000 million Beta data-generating processes is .018, .021, .018, .014, .013, .024, .028, .031, .029, and .026, respectively.
11
Hope (1968) noted that with k an integer and M draws from a continuous bootstrap distribution, an exact test relative to that distribution at level α = k/(M + 1) is achieved when the null is rejected if k − 1 or fewer draws are greater than the sample test statistic. Jockel (1986) showed the same is true for draws from an arbitrary distribution if, with G the number of draws greater than and T the number tied with the sample statistic and U a uniform random variable, (G + (T + 1)U) is less than α(M + 1). As ties in these samples are exceedingly rare and α × 100 is an integer, in the Table U plays no role and the test rejects when G = 0, G ≤ 4, or G ≤ 9 at the 0.01, 0.05 and 0.1 levels, respectively.
12
This should be apparent for the bootstrap-t in (13), while in the case of the bootstrap-c, evaluating $(\hat\beta - \beta)^2$ using $(\hat\beta^b - \hat\beta)^2$ is identical to evaluating $(\hat\beta - \beta)^2/V(\hat\beta)$ using $(\hat\beta^b - \hat\beta)^2/V(\hat\beta)$.
13
These instances are for the bootstrap-c, which is not based upon an asymptotically pivotal test statistic and hence does not provide the higher-order asymptotic accuracy of the bootstrap-t (Singh, 1981; Hall, 1992).
14
The performance of the wild bootstrap is considerably improved if one imposes the null in the estimation of the residuals in the initial OLS regression (Davidson & Flachaire, 2008; Djogbenou et al., 2019). However, the same can be said for the conventional test, which raises the question of the correct benchmark comparison.
15
The estimates in Table 1 incorporate Stata’s default HC1 correction of the conventional covariance estimate, which reduces the rejection rate in the smallest samples. Without this correction, rejection rates at the 0.01 level are 0.134 and 0.107 with 10 observations or clusters, respectively.
16
Let E denote the eigenvectors, Λ the diagonal matrix of eigenvalues, $\lambda_{max}$ the maximum eigenvalue of the symmetric positive definite matrix A, $a_{ij}$ the ij-th element, and $\alpha_i$ a vector of 0s with a 1 in the i-th position. By the Cauchy–Schwarz Inequality and properties of the Rayleigh quotient, $a_{ij}^2 = (\alpha_i'E\Lambda E'\alpha_j)^2 \le (\alpha_i'E\Lambda E'\alpha_i)(\alpha_j'E\Lambda E'\alpha_j) \le \lambda_{max}^2$.

References

  1. Bell, R. M., & McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodology, 28, 169–181. [Google Scholar]
  2. Bickel, P. J., & Freedman, D. A. (1983). Bootstrapping regression models with many parameters. In P. J. Bickel, K. A. Doksum, & J. L. Hodges (Eds.), A festschrift for Erich L. Lehmann in honor of his sixty-fifth birthday. Wadsworth. [Google Scholar]
  3. Cai, H., Chen, Y., & Fang, H. (2009). Observational learning: Evidence from a randomized natural field experiment. American Economic Review, 99, 864–882. [Google Scholar] [CrossRef]
  4. Canay, I. A., Santos, A., & Shaikh, A. M. (2021). The wild bootstrap with a “small” number of “large” clusters. Review of Economics and Statistics, 103, 346–363. [Google Scholar] [CrossRef]
  5. Cavaliere, G., & Georgiev, I. (2020). Inference under random limit bootstrap measures. Econometrica, 88, 2547–2574. [Google Scholar] [CrossRef]
  6. Davidson, R., & Flachaire, E. (2008). The wild bootstrap, tamed at last. Journal of Econometrics, 146, 162–169. [Google Scholar] [CrossRef]
  7. Djogbenou, A. A., MacKinnon, J. G., & Nielsen, M. O. (2019). Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics, 212, 393–412. [Google Scholar] [CrossRef]
  8. Fedotenkov, I. (2013). A bootstrap method to test for the existence of finite moments. Journal of Nonparametric Statistics, 25, 315–322. [Google Scholar] [CrossRef]
  9. Freedman, D. A. (1981). Bootstrapping regression models. The Annals of Statistics, 9, 1218–1228. [Google Scholar] [CrossRef]
  10. Galambos, J. (1987). The asymptotic theory of extreme order statistics (2nd ed.). Robert E. Krieger Publishing Co. [Google Scholar]
  11. Ghosh, M. N. (1950). Convergence of random distribution functions. Bulletin of the Calcutta Mathematical Society, 42, 217–226. [Google Scholar]
  12. Hall, P. (1992). The bootstrap and Edgeworth expansion. Springer. [Google Scholar]
  13. Hardle, W., Horowitz, J., & Kreiss, J.-P. (2003). Bootstrap methods for time series. International Statistical Review, 71, 435–459. [Google Scholar] [CrossRef]
  14. Hoadley, B. (1971). Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case. The Annals of Mathematical Statistics, 42, 1977–1991. [Google Scholar] [CrossRef]
  15. Hoeffding, W. (1951). A combinatorial central limit theorem. The Annals of Mathematical Statistics, 22, 558–566. [Google Scholar] [CrossRef]
  16. Hoeffding, W. (1952). The large sample power of tests based upon permutations of observations. The Annals of Mathematical Statistics, 23, 169–192. [Google Scholar] [CrossRef]
  17. Hope, A. C. A. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal Statistical Society, Series B (Methodological), 30, 582–598. [Google Scholar] [CrossRef]
  18. Jockel, K.-H. (1986). Finite sample properties and asymptotic efficiency of Monte Carlo tests. The Annals of Statistics, 14, 336–347. [Google Scholar] [CrossRef]
  19. Liu, R. Y. (1988). Bootstrap procedures under some non-i.i.d. models. The Annals of Statistics, 16, 1696–1708. [Google Scholar] [CrossRef]
  20. MacKinnon, J. G. (2015). Wild cluster bootstrap confidence intervals. L’Actualité Économique, 91, 11–33. [Google Scholar] [CrossRef]
  21. MacKinnon, J. G., & White, H. (1985). Some heteroskedasticity-consistent covariance estimators with improved finite sample properties. Journal of Econometrics, 29, 305–325. [Google Scholar] [CrossRef]
  22. Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, 21, 255–285. [Google Scholar] [CrossRef]
  23. Meerschaert, M. M., & Scheffler, H.-P. (1998). A simple robust estimation method for the thickness of heavy tails. Journal of Statistical Planning and Inference, 71, 19–34. [Google Scholar] [CrossRef]
  24. Noether, G. E. (1949). On a theorem by Wald and Wolfowitz. The Annals of Mathematical Statistics, 20, 455–458. [Google Scholar] [CrossRef]
  25. Pustejovsky, J. E., & Tipton, E. (2018). Small-sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. Journal of Business & Economic Statistics, 36, 672–683. [Google Scholar] [CrossRef]
  26. Shao, J., & Tu, D. (1995). The jackknife and bootstrap. Springer. [Google Scholar]
  27. Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap. The Annals of Statistics, 9, 1187–1195. [Google Scholar] [CrossRef]
  28. Stute, W. (1990). Bootstrap of the linear correlation model. Statistics: A Journal of Theoretical and Applied Statistics, 21, 433–436. [Google Scholar]
  29. Thornton, R. L. (2008). The demand for, and impact of, learning HIV status. American Economic Review, 98, 1829–1863. [Google Scholar] [CrossRef]
  30. Trapani, L. (2016). Testing for (in)finite moments. Journal of Econometrics, 191, 57–68. [Google Scholar] [CrossRef]
  31. Wald, A., & Wolfowitz, J. (1944). Statistical tests based on permutations of the observations. The Annals of Mathematical Statistics, 15, 358–372. [Google Scholar] [CrossRef]
  32. White, H. (1980a). A heteroskedasticity consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838. [Google Scholar] [CrossRef]
  33. White, H. (1980b). Nonlinear regression on cross-section data. Econometrica, 48, 721–746. [Google Scholar] [CrossRef]
  34. White, H. (1984). Asymptotic theory for econometricians. Academic Press. [Google Scholar]
  35. Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, 14, 1261–1295. [Google Scholar] [CrossRef]
  36. Young, A. (2016). Improved, nearly exact, statistical inference with robust and clustered covariance matrices using effective degrees of freedom corrections. Working paper, January 2016. Available online: https://personal.lse.ac.uk/YoungA/ (accessed on 26 September 2025).
  37. Young, A. (2019). Channelling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results. Quarterly Journal of Economics, 134, 557–598. [Google Scholar] [CrossRef]
Table 1. Comparison of assumptions and results.

| | Mammen (1993) | Freedman (1981) | Stute (1990) | Liu (1988) | Djogbenou et al. (2019) | This Paper |
|---|---|---|---|---|---|---|
| type of bootstrap | wild | pairs | pairs | wild | wild | both |
| type of data | iid | iid | iid | inid | inid | inid |
| bounded moments | $x_{ij}^4\varepsilon_i^2$ & $x_{ij}^4$ | $x_{ij}^a\varepsilon_i^b$, $a+b=4$ | $x_{ij}^2\varepsilon_i^2$ & $x_{ij}^2$ | $\varepsilon_i^2$ & all $x_{ij}^n$ | $(x_{ij}^2\varepsilon_i^2)^{1+\gamma}$ & $(x_{ij}^4)^{1+\gamma}$ | $(x_{ij}^2\varepsilon_i^2)^{1+\gamma}$ & $(x_{ij}^2)^{1+\gamma}$ |
| avg. moments converge | yes | yes | yes | no | yes | no |
| maximum cluster size | | | | | unbounded | bounded |
| distribution of coefficients | yes | yes | yes | no | yes | yes |
| … and Wald statistics | homo. | homo. | no | no | cl/hetero. | cl/hetero. |
| sub-sampling M of N | | M | M | | M/N → 0 | lim inf M/N = γ* > 0 |
| moments of coefficients | | | | yes | | yes (wild) |

Notes: Wald statistics based upon homoskedastic (homo.) or clustered/heteroskedasticity robust (cl/hetero.) covariance estimates.
Table 2. Monte Carlos illustrating consistency (1000 data sets per data-generating process, 99 bootstrap draws per data set).

(A) Empirical rejection rates at the .01, .05 and .10 levels (each cell: .01/.05/.10)

Independent observations:

| observations | Conventional chi-squared | Pairs bootstrap-c | Pairs bootstrap-t | Wild bootstrap-c | Wild bootstrap-t |
|---|---|---|---|---|---|
| 10 | .108/.200/.272 | .003/.038/.098 | .020/.069/.126 | .203/.268/.308 | .084/.146/.205 |
| 100 | .043/.100/.173 | .012/.047/.105 | .033/.082/.142 | .053/.110/.178 | .062/.108/.159 |
| 1000 | .022/.072/.137 | .008/.051/.108 | .018/.067/.125 | .021/.075/.141 | .030/.076/.135 |
| 10,000 | .015/.067/.124 | .006/.050/.103 | .020/.062/.116 | .020/.073/.122 | .024/.067/.115 |
| 100,000 | .016/.063/.127 | .017/.059/.123 | .017/.060/.126 | .011/.070/.124 | .017/.068/.121 |
| 1,000,000 | .012/.065/.113 | .009/.050/.104 | .012/.056/.111 | .013/.068/.113 | .014/.069/.110 |

Independent clusters of observations:

| clusters | Conventional chi-squared | Pairs bootstrap-c | Pairs bootstrap-t | Wild bootstrap-c | Wild bootstrap-t |
|---|---|---|---|---|---|
| 10 | .096/.169/.227 | .022/.073/.126 | .023/.081/.139 | .149/.208/.245 | .083/.127/.171 |
| 100 | .030/.076/.136 | .007/.045/.095 | .018/.059/.104 | .037/.088/.131 | .037/.084/.118 |
| 1000 | .015/.062/.116 | .005/.045/.094 | .012/.061/.109 | .018/.060/.118 | .020/.053/.108 |
| 10,000 | .023/.070/.125 | .005/.050/.103 | .016/.062/.119 | .023/.076/.131 | .026/.069/.124 |
| 100,000 | .015/.058/.110 | .012/.048/.104 | .014/.060/.109 | .014/.055/.116 | .019/.057/.109 |
| 1,000,000 | .018/.053/.101 | .016/.048/.093 | .018/.053/.094 | .015/.059/.098 | .014/.055/.095 |

(B) Kolmogorov–Smirnov test statistics and p-values (1,000,000 observations or clusters; each cell: obs./clusters)

| | Conventional chi-squared | Pairs bootstrap-c | Pairs bootstrap-t | Wild bootstrap-c | Wild bootstrap-t |
|---|---|---|---|---|---|
| statistic | .021/.024 | .023/.028 | .020/.027 | .025/.021 | .022/.022 |
| p-value | .744/.616 | .678/.413 | .815/.437 | .573/.747 | .697/.720 |

(C) Correlation of bootstrap p-values with conventional p-values, and p-values of said correlations (1,000,000 obs. or cl.; each cell: obs./clusters)

| | Pairs bootstrap-c | Pairs bootstrap-t | Wild bootstrap-c | Wild bootstrap-t |
|---|---|---|---|---|
| correlation | .9907/.9889 | .9905/.9895 | .9901/.9910 | .9891/.9900 |
| p-value1 | .466/.001 | .287/.027 | .092/.849 | .000/.162 |
| p-value2 | .870/.045 | .778/.243 | .573/.940 | .087/.502 |

Notes: Author’s calculations using Stata version 18.0 code provided in on-line materials. Conventional test incorporates Stata’s HC1 correction of the covariance estimate, which substantially reduces the rejection rate in samples with 10 observations. Kolmogorov–Smirnov tests are of the null that each set of 1000 p-values is drawn from the uniform distribution, with the distribution under the null calculated using 100,000 draws of 1000 uniform random variables. p-values of correlations in panel (C) are the likelihood of a smaller correlation in 100,000 instances of using 99 independent draws from the chi-squared to evaluate each of the 1000 chi-squared statistics vs. using the chi-squared distribution itself. As explained in the text, p-value1 calculates the correlation distribution conditional on the realized conventional squared t-statistics, while p-value2 calculates the correlation distribution based on conventional test statistics which are random draws from the chi-squared.