3.1. Model and Estimation
Given an independent sample of $n$ observations, consider the linear regression model
$$ y_i = \alpha + x_i^\top \beta + z_i^\top \gamma + \epsilon_i \qquad (7) $$
for $i = 1, \ldots, n$, where $y_i$ is the i-th realization of the response variable $Y$, $x_i \in \mathbb{R}^p$ and $z_i \in \mathbb{R}^q$ are the i-th realizations of the predictor vectors $X$ and $Z$, and $\epsilon_i$ is the i-th realization of the error term $\epsilon$. The predictor variables are assumed to be partitioned into two sets, such that $x_i$ contains the variables of interest for inference purposes, and $z_i$ contains the covariates that will be conditioned on. Let $u_i = (x_i^\top, z_i^\top)^\top$ denote the combined predictor vector, and let $\theta = (\beta^\top, \gamma^\top)^\top$ denote the combined coefficient vector. Furthermore, let $\tilde{u}_i = u_i - \bar{u}$ denote the mean-centered (combined) predictor vector, where $\bar{u} = \frac{1}{n} \sum_{i=1}^{n} u_i$ is the sample average of the combined predictor vector.
To estimate the coefficients in $\theta$, consider minimizing the GRR loss function
$$ \| y - \tilde{U} \theta \|^2 + \theta^\top P \theta \qquad (8) $$
where $\tilde{U} = (\tilde{u}_1, \ldots, \tilde{u}_n)^\top$ is the centered design matrix, and $P$ is an $m \times m$ symmetric and positive semi-definite penalty matrix (where $m = p + q$ is the total number of slope coefficients). The coefficients that minimize Equation (8) can be written as
$$ \hat{\theta}_P = (\tilde{U}^\top \tilde{U} + P)^{-1} \tilde{U}^\top y \qquad (9) $$
which is subscripted to emphasize that the estimated coefficients depend on the penalty matrix $P$. Given the estimated slope vector $\hat{\theta}_P$, the least squares estimate of the intercept has the form $\hat{\alpha} = \bar{y} - \bar{u}^\top \hat{\theta}_P$.
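As a concrete illustration, the following is a minimal NumPy sketch of the GRR estimator in Equation (9); the data, dimensions, and ridge-type penalty are hypothetical choices made for the example, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 2, 3            # sample size, interest and nuisance dimensions (hypothetical)
m = p + q                      # total number of slope coefficients

U = rng.normal(size=(n, m))    # combined predictor matrix with rows u_i = (x_i, z_i)
theta_true = np.array([1.0, -0.5, 0.3, 0.0, 0.2])
y = 2.0 + U @ theta_true + rng.normal(scale=0.5, size=n)

Uc = U - U.mean(axis=0)        # mean-centered design matrix
P = 0.1 * np.eye(m)            # symmetric positive semi-definite penalty (ridge-type)

# GRR slopes: (Uc'Uc + P)^{-1} Uc'y, then the least squares intercept
theta_hat = np.linalg.solve(Uc.T @ Uc + P, Uc.T @ y)
alpha_hat = y.mean() - U.mean(axis=0) @ theta_hat
```

With $P = 0_{m \times m}$ the same code reduces to the OLS fit, while larger penalties shrink the slope estimates toward zero.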
3.2. Asymptotic Distributions
Consider the linear regression model from Equation (7) with the assumptions:
- B1. $y_i = \alpha + x_i^\top \beta + z_i^\top \gamma + \epsilon_i$ with $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma^2 < \infty$ for $i = 1, \ldots, n$
- B2. $(u_i, \epsilon_i)$ are iid from a distribution satisfying $E(u_i \epsilon_i) = 0_m$
- B3. $\Sigma$ and $\Omega$ are nonsingular, where $\Sigma = E[(u_i - \mu)(u_i - \mu)^\top]$ and $\Omega = E[\epsilon_i^2 (u_i - \mu)(u_i - \mu)^\top]$ with $\mu = E(u_i)$, and $\tilde{U}^\top \tilde{U}$ is almost surely invertible. In the fixed predictors case, the mean vector is $\mu = \bar{u}$ and the covariance matrix terms are defined as $\Sigma = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \tilde{u}_i \tilde{u}_i^\top$ and $\Omega = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2 \tilde{u}_i \tilde{u}_i^\top$, where $\sigma_i^2 = E(\epsilon_i^2)$.

Given assumptions B1–B3, the GRR estimator provides an estimate of $\theta = \Sigma^{-1} \sigma_{uy}$, where $\Sigma = \mathrm{Cov}(u_i)$ and $\sigma_{uy}$ is the covariance between $u_i$ and $y_i$. Note that $\sigma_{uy} = \Sigma \theta$, where $\theta$ is estimated by the OLS estimator $\hat{\theta} = \hat{\Sigma}^{-1} \hat{\sigma}_{uy}$ with $\hat{\Sigma} = \frac{1}{n} \tilde{U}^\top \tilde{U}$ and $\hat{\sigma}_{uy} = \frac{1}{n} \tilde{U}^\top y$. This implies that the GRR estimator can be written as $\hat{\theta}_P = M_P \hat{\theta}$, where $M_P = (\tilde{U}^\top \tilde{U} + P)^{-1} \tilde{U}^\top \tilde{U}$.
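The relationship between the GRR and OLS estimates noted above can be checked numerically. The sketch below (with hypothetical data and penalty) verifies that the GRR estimate equals the matrix $(\tilde{U}^\top \tilde{U} + P)^{-1} \tilde{U}^\top \tilde{U}$ applied to the OLS estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 4
U = rng.normal(size=(n, m))
y = U @ np.array([0.5, -1.0, 0.0, 0.25]) + rng.normal(size=n)

Uc = U - U.mean(axis=0)                  # centered design matrix
P = np.diag([1.0, 1.0, 2.0, 2.0])        # hypothetical symmetric PSD penalty

theta_ols = np.linalg.solve(Uc.T @ Uc, Uc.T @ y)        # unpenalized (OLS) estimate
theta_grr = np.linalg.solve(Uc.T @ Uc + P, Uc.T @ y)    # GRR estimate

# The matrix that maps the OLS estimate onto the GRR estimate
M_P = np.linalg.solve(Uc.T @ Uc + P, Uc.T @ Uc)
```

The identity holds exactly for any data and any PSD penalty, since both sides reduce to $(\tilde{U}^\top \tilde{U} + P)^{-1} \tilde{U}^\top y$.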
Lemma 3. Given assumptions B1–B3, the GRR estimator from Equation (9) is asymptotically normal with mean vector $\theta$ and covariance matrix $\frac{1}{n} \Sigma^{-1} \Omega \Sigma^{-1}$, i.e.,
$$ \sqrt{n} \, (\hat{\theta}_P - \theta) \overset{D}{\to} N(0_m, \Sigma^{-1} \Omega \Sigma^{-1}) $$
as $n \to \infty$, where the notation $\overset{D}{\to}$ denotes convergence in distribution.

Lemma 4. Consider the linear model in Equation (7) with the assumptions $\epsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2)$ and $\theta \sim N(0_m, \sigma^2 P^{-1})$, and suppose that $\theta$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$ are independent of one another. Under these assumptions, the posterior distribution of $\theta$ given $y$ is multivariate normal with mean vector $\hat{\theta}_P$ and covariance matrix $\sigma^2 (\tilde{U}^\top \tilde{U} + P)^{-1}$. The asymptotic mean vector is $\theta$ and the asymptotic covariance matrix is $\frac{\sigma^2}{n} \Sigma^{-1}$.

Note that Lemmas 3 and 4 can be proved using analogues of the results that were used to prove Lemmas 1 and 2. Specifically, for Lemma 3, we can use a direct analogue of the proof for Lemma 1 with $\hat{\theta}_P$ replacing $\hat{\beta}_P$ and the matrices $\Sigma$ and $\Omega$ defined here replacing their counterparts from the previous section. For Lemma 4, we can use a direct analogue of the proof for Lemma 2 with $\theta$ replacing $\beta$ and $\tilde{U}$ replacing $\tilde{X}$. For both cases, we need to replace the unpenalized coefficient vector $\hat{\beta} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y$ with the unpenalized coefficient vector $\hat{\theta} = (\tilde{U}^\top \tilde{U})^{-1} \tilde{U}^\top y$.
3.3. Test Statistics
Consider the linear model in Equation (7), and suppose that we want to test the null hypothesis $H_0\colon \beta = 0_p$ versus the alternative hypothesis $H_1\colon \beta \neq 0_p$. This is the same null hypothesis that was considered in the previous section, but now the nuisance effects $\gamma$ are included in the model (i.e., conditioned on) while testing the significance of the $\beta$ vector. Assuming that $\epsilon_i$ and $u_i$ are independent of one another, we could use the F test statistic
$$ F = \frac{\hat{\theta}_P^\top S^\top \left[ S (\tilde{U}^\top \tilde{U} + P)^{-1} S^\top \right]^{-1} S \hat{\theta}_P / p}{\hat{\sigma}^2} \qquad (10) $$
where $S = (I_p, 0_{p \times q})$ is a $p \times m$ selection matrix such that $S \theta = \beta$ (note that $I_p$ is the $p \times p$ identity matrix and $0_{p \times q}$ is a $p \times q$ matrix of zeros), $\hat{\sigma}^2 = \frac{1}{n - m - 1} \sum_{i=1}^{n} \hat{\epsilon}_i^2$ is the estimated error variance, and $\hat{\epsilon}_i = y_i - \hat{y}_i$ is the residual with $\hat{y}_i = \hat{\alpha} + u_i^\top \hat{\theta}_P$ denoting the fitted value for the i-th observation.
When $H_0\colon \beta = 0_p$ is true and the assumptions in Lemma 4 are met, the F statistic approaches an F distribution with degrees of freedom parameters p and $n - m - 1$ as $n \to \infty$. For non-zero penalties, the F statistic in Equation (10) will not follow an $F_{(p, \, n - m - 1)}$ distribution, and may produce asymptotically invalid results when used in a permutation test, especially when the error terms are heteroscedastic (see [44,50,51]). In such cases, the Wald test statistic should be preferred
$$ W = n \, \hat{\theta}_P^\top S^\top \left( S \hat{\Psi} S^\top \right)^{-1} S \hat{\theta}_P \qquad (11) $$
where $\hat{\Psi} = \hat{\Sigma}^{-1} \hat{\Omega} \hat{\Sigma}^{-1}$ with $\hat{\Sigma} = \frac{1}{n} \tilde{U}^\top \tilde{U}$ and $\hat{\Omega} = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2 \tilde{u}_i \tilde{u}_i^\top$. Under assumptions B1–B3, the W statistic asymptotically follows a $\chi^2_p$ distribution when $H_0\colon \beta = 0_p$ is true, which is a result of Lemma 3 (and the consistency of the estimators $\hat{\Sigma}$ and $\hat{\Omega}$).
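The robust Wald statistic is straightforward to compute directly. The sketch below evaluates it for an unpenalized fit ($P = 0$) on simulated data in which the null hypothesis $\beta = 0_p$ holds; all data-generating choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 400, 2, 2
X = rng.normal(size=(n, p))              # predictors of interest
Z = rng.normal(size=(n, q))              # nuisance covariates
y = 1.0 + Z @ np.array([0.5, -0.5]) + rng.normal(size=n)   # H0: beta = 0 is true

U = np.column_stack([X, Z])
Uc = U - U.mean(axis=0)                  # centered combined design matrix
m = p + q

theta_hat = np.linalg.solve(Uc.T @ Uc, Uc.T @ y)   # unpenalized fit (P = 0)
resid = y - y.mean() - Uc @ theta_hat              # residuals

# Sandwich pieces: Sigma_hat = n^{-1} Uc'Uc, Omega_hat = n^{-1} sum e_i^2 u_i u_i'
Sigma_hat = Uc.T @ Uc / n
Omega_hat = (Uc * resid[:, None] ** 2).T @ Uc / n
Sigma_inv = np.linalg.inv(Sigma_hat)
Psi_hat = Sigma_inv @ Omega_hat @ Sigma_inv        # estimated asymptotic covariance

S = np.hstack([np.eye(p), np.zeros((p, q))])       # selection matrix: S theta = beta
beta_hat = S @ theta_hat
W = n * beta_hat @ np.linalg.solve(S @ Psi_hat @ S.T, beta_hat)
```

Under the stated assumptions, W would be compared against a $\chi^2_p$ reference distribution (or a permutation distribution, as discussed next).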
3.4. Permutation Inference
Table 1 depicts eight different permutation methods that have been proposed for testing the significance of regression coefficients in the presence of nuisance parameters. The eight methods can be split into three different groups: (i) methods that permute the rows of the design matrix $X = (x_1, \ldots, x_n)^\top$ [52,53,54], (ii) methods that permute the response vector $y = (y_1, \ldots, y_n)^\top$ with the nuisance design matrix $Z = (z_1, \ldots, z_n)^\top$ included in the model [55,56,57], and (iii) methods that permute $y$ after partialling out $Z$ [58,59,60]. All of these methods were originally proposed for use with the OLS estimator $\hat{\theta}$ and the F test statistic. Recent works have incorporated the use of the robust W test statistic with these permutation methods [44,50,51]. However, these authors only considered a theoretical analysis of the DS and FL permutation methods, and no previous works seem to have studied these methods using the GRR estimators from Equations (4) and (9).
To understand the motivation of the various permutation methods in Table 1, assume that the penalty matrix is a block diagonal such as $P = \mathrm{blockdiag}(P_x, P_z)$, where $P_x$ and $P_z$ denote the penalty matrices for $\beta$ and $\gamma$, respectively. Using the well-known form for the inverse of a block matrix [61,62,63,64], the coefficient estimates from Equation (9) have the form
$$ \hat{\beta}_P = (\tilde{X}^\top R_z \tilde{X} + P_x)^{-1} \tilde{X}^\top R_z y, \qquad \hat{\gamma}_P = (\tilde{Z}^\top \tilde{Z} + P_z)^{-1} \tilde{Z}^\top (y - \tilde{X} \hat{\beta}_P) $$
where $R_z$ is the residual forming matrix for the model that only includes the nuisance effects in the model, i.e., $R_z = I_n - H_z$. This implies that the (centered) fitted values can be written as $\tilde{U} \hat{\theta}_P = \tilde{X} \hat{\beta}_P + H_z (y - \tilde{X} \hat{\beta}_P)$, where $H_z = \tilde{Z} (\tilde{Z}^\top \tilde{Z} + P_z)^{-1} \tilde{Z}^\top$ is the hat matrix for the linear model that only includes the nuisance effects. Thus, when $P_z = 0_{q \times q}$, all of the permutation methods except SW will produce the same observed F statistic (when the permutation matrix is the identity matrix $I_n$).
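The partialled representation of the interest-block coefficients follows from the block-inverse algebra and can be verified numerically. The sketch below (hypothetical data and penalties) checks that the first $p$ entries of the full GRR solution match the partialled-form estimate $(\tilde{X}^\top R_z \tilde{X} + P_x)^{-1} \tilde{X}^\top R_z y$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 100, 2, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
y = rng.normal(size=n)

Xc = X - X.mean(axis=0)
Zc = Z - Z.mean(axis=0)
Uc = np.column_stack([Xc, Zc])

Px = 0.5 * np.eye(p)                     # penalty block for the interest effects
Pz = 2.0 * np.eye(q)                     # penalty block for the nuisance effects
P = np.block([[Px, np.zeros((p, q))], [np.zeros((q, p)), Pz]])

# Full GRR fit, keeping only the interest-block coefficients
theta_hat = np.linalg.solve(Uc.T @ Uc + P, Uc.T @ y)
beta_full = theta_hat[:p]

# Partialled form: R_z is the (penalized) residual forming matrix of the nuisance fit
Hz = Zc @ np.linalg.solve(Zc.T @ Zc + Pz, Zc.T)
Rz = np.eye(n) - Hz
beta_partial = np.linalg.solve(Xc.T @ Rz @ Xc + Px, Xc.T @ Rz @ y)
```

Note that, unlike the unpenalized case, $H_z$ here is not a projection matrix, but the block-inverse identity still holds exactly.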
Consider the additional assumption that the response and predictor variables have finite fourth moments, i.e., B4: $E(y_i^4) < \infty$, $E(\|x_i\|^4) < \infty$, and $E(\|z_i\|^4) < \infty$. Assume B1–B4 and that the null hypothesis $H_0\colon \beta = 0_p$ is true. Using the W test statistic from Equation (11), the following can be said about the finite sample and asymptotic properties of the various permutation methods in Table 1: the DS method is exact when $X$ is independent of $(y, Z)$ and asymptotically valid otherwise; the MA method is exact when $\gamma = 0_q$ and asymptotically valid otherwise; the SW method is inexact and asymptotically valid only when $X$ and $Z$ are uncorrelated; the other five methods (OS, FL, TB, KC, HJ) are inexact and asymptotically valid.
The asymptotic behaviors of the DS and FL methods were proved by DiCiccio and Romano [44]. The asymptotic validity of the OS method can be proved using a similar result as used for the DS method, given that the quantities used to partial the nuisance effects out of the predictors are consistent estimators of their population counterparts. The asymptotic validity of the MA method can also be proved using a similar result as used for the DS method. The asymptotic validity of the TB method can be proved using a similar result as used for the FL method, given that $\hat{\theta}$ is a consistent estimator of $\theta$ and the full-model residuals are consistent estimators of the corresponding error terms. Finally, note that the KC and HJ methods are asymptotically equivalent, and the SW method is asymptotically equivalent to the KC and HJ methods when $X$ and $Z$ are uncorrelated. The asymptotic validity of the KC and HJ methods follows from the results in Theorem 1, given that these methods permute the response after partialling out the nuisance effects. It is important to note that if $X$ and $Z$ are correlated, the SW method will produce asymptotically invalid results because $Z$ is partialled out of $y$ but not $X$.