1. Introduction
This paper contributes to the asymptotic distribution theory for ridge parameter estimates in the linear instrumental variables model. The tuning parameter for the ridge penalty is selected by splitting the data into a training sample and a test (or holdout) sample. In [1], the ridge penalty parameter is estimated jointly with the structural parameters, and the asymptotic distribution is characterized as the projection of a stochastic process onto a cone. This gives the rate of convergence to the asymptotic distribution but does not provide guidance for inference. To allow inference, the closed form of the asymptotic distribution is presented here. These new results allow the calculation of confidence regions for the structural parameters. When the prior equals the population parameter value of the structural parameters, the tuning parameter is not identified; however, the structural parameter estimates remain consistent and their asymptotic covariance is smaller than the asymptotic covariance of the two-stage least squares estimator.
A fundamental issue with ridge regression is the selection of the tuning parameter and its resulting impact on inference for the parameters of interest. One approach is to select a deterministic function of the sample size that shrinks to zero fast enough to not impact the asymptotic distribution. The resulting asymptotic distribution is then equivalent to the OLS asymptotic distribution [2]. An alternative approach is to select the tuning parameter with the observed data. The literature contains multiple ways to estimate the tuning parameter [3,4]. For these estimates, inference typically follows the approach stated in [5]. Conditional on a fixed value of the tuning parameter, the ridge estimator’s covariance is a function of that value. The tuning parameter is selected using the observed data and substituted into the covariance that was calculated assuming a fixed tuning parameter. This covariance is used to create a test statistic. In [5], the resulting tests are correctly referred to as approximate t-tests because the authors appreciate the internal inconsistency of using a covariance obtained assuming a fixed tuning parameter with a value estimated from the observed data. (This problem is well known in the literature; see [6,7,8,9,10].) Each estimate of the tuning parameter leads to a different approximate t-test, and these are typically compared using simulations. This has been the approach for the past 20 years [5,11]. (“When the ridge parameter k is determined from the data, the above arguments are no longer valid. Hence, to investigate the size and power of the approximate ridge-based t-type tests in such cases, a Monte Carlo simulation study is conducted” [5]. “Since a theoretical assessment among the test statistics is not possible, a simulation study has been conducted to evaluate the performance of the suggested test statistics” [11].) For other models, researchers have proposed alternative procedures to obtain hyperparameter (tuning parameter) estimates [12,13]. Like the previous ridge regression literature, these approaches have relied on simulations to demonstrate their behavior. For inference, these procedures would need to be extended to establish their asymptotic distributions. A third approach is followed in this paper. The tuning parameter is selected by splitting the sample into training and test samples. The tuning parameter defines a path from the prior to the IV estimator on the training sample. On this path, the tuning parameter is selected to minimize the prediction error on the test sample. This procedure is written as a method of moments estimation problem in which the tuning parameter and the parameters of interest are estimated simultaneously. Inference is then performed using the joint asymptotic distribution.
A related literature concerns the distribution of some empirically selected ridge tuning parameters [14,15,16]. These approaches have relied on strong distributional assumptions (e.g., normal errors). In addition, they are built on tuning parameters expressed as functions of the data, where those functions are derived by assuming that the tuning parameter is fixed. This leads to an inconsistency: using the data to select the tuning parameter means that the tuning parameter is no longer fixed. In this paper, the inconsistency is avoided by estimating the structural parameters and the ridge tuning parameter simultaneously. Additionally, the method of moments framework permits weaker assumptions.
In [1], the asymptotic joint distribution for the parameters in the linear model and the ridge tuning parameter is characterized as the projection of a stochastic process onto a cone. This structure occurs because the probability limit of the ridge tuning parameter is on the boundary of the parameter space. (This is the well-studied problem of consistently estimating a population parameter that lies on the boundary of the parameter space [17,18,19,20].) The result is a nonstandard asymptotic distribution that depends on the prior and on the population parameter value of the structural parameters. When the prior is different from the population parameter value, the asymptotic distribution for the ridge tuning parameter is a mixture with a discrete mass of 1/2 at zero and a truncated normal over the positive reals. In addition, the asymptotic distribution for the structural parameters is normal with a nonzero mean. This mean and the variance both contain the population parameter value, which prevents the calculation of t-statistics for individual parameter estimates. However, a hypothesis for the entire set of structural parameters can be tested using a chi-square test, and this statistic can be inverted to give accurate confidence regions.
2. Ridge Estimator for Linear IV Model Using a Holdout Sample
Consider the linear instrumental variables model
$$ Y = X\beta_0 + e, \qquad X = Z\Pi + V, $$
where $Y$ is $n\times 1$, $X$ is $n\times k$, and $Z$ is $n\times m$ with $m \ge k$, where the $m$ instruments are exogenous, with full rank second moments $Q_{ZZ} = E[Z_i Z_i']$ and $Q_{XZ} = E[X_i Z_i']$, and, conditional on $Z$, the errors have mean zero. The IV, or 2SLS, estimator is
$$ \hat{\beta}_{IV} = (X' P_Z X)^{-1} X' P_Z Y, $$
where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix for $Z$, and it has the asymptotic distribution
$$ \sqrt{n}\,(\hat{\beta}_{IV} - \beta_0) \overset{d}{\to} N\!\left(0,\; \sigma_e^2\,(Q_{XZ} Q_{ZZ}^{-1} Q_{ZX})^{-1}\right). $$
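To make the construction concrete, the following is a minimal sketch of the 2SLS computation with a conventional homoskedastic covariance estimate; the function and variable names are illustrative and the covariance formula is the textbook one rather than an expression taken from this paper.

```python
import numpy as np

def two_stage_least_squares(Y, X, Z):
    """2SLS estimate with a conventional homoskedastic covariance estimate (a sketch).

    Y is (n,), X is (n, k), Z is (n, m) with m >= k.
    """
    n = len(Y)
    PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)        # P_Z X
    beta = np.linalg.solve(PZ_X.T @ X, PZ_X.T @ Y)      # (X'P_Z X)^{-1} X'P_Z Y
    resid = Y - X @ beta
    sigma2 = resid @ resid / n                          # error variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ PZ_X)            # estimated covariance of beta_hat
    return beta, cov
```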
Let $X' P_Z X$ have the spectral decomposition $C D C'$, where $C$ is orthonormal, i.e., $C'C = I_k$, and $D$ is a positive semidefinite diagonal $k \times k$ matrix of eigenvalues, $D = \mathrm{diag}(d_1,\ldots,d_k)$. When some of the eigenvectors explain very little variation, i.e., the corresponding eigenvalues have small magnitudes, the objective function is flatter along these dimensions and the resulting covariance estimates are larger, because the variance of $\hat{\beta}_{IV}$ is proportional to $(X'P_Z X)^{-1} = C D^{-1} C'$. This leads to a relatively large MSE. The ridge estimator addresses this by shrinking the estimated parameter towards a prior. The ridge objective function augments the usual IV objective function (4) with a quadratic penalty centered at a prior, $\beta_{\mathrm{prior}}$, weighted by a regularization tuning parameter $\lambda$:
$$ (Y - X\beta)' P_Z (Y - X\beta) + \lambda\,(\beta - \beta_{\mathrm{prior}})'(\beta - \beta_{\mathrm{prior}}). $$
Conditional on $\lambda$, the ridge solution is
$$ \hat{\beta}(\lambda) = (X' P_Z X + \lambda I_k)^{-1}\,(X' P_Z Y + \lambda\,\beta_{\mathrm{prior}}). $$
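A minimal sketch of this conditional ridge solution, assuming the prior-centered penalty written above; names are illustrative.

```python
import numpy as np

def ridge_iv(Y, X, Z, lam, beta_prior):
    """Ridge IV estimate conditional on the tuning parameter lam (a sketch).

    Minimizes (Y - Xb)'P_Z(Y - Xb) + lam * ||b - beta_prior||^2.
    """
    PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)                 # P_Z X
    A = X.T @ PZ_X + lam * np.eye(X.shape[1])                    # X'P_Z X + lam I_k
    b = PZ_X.T @ Y + lam * beta_prior                            # X'P_Z Y + lam * beta_prior
    return np.linalg.solve(A, b)
```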
Different values of $\lambda$ result in different estimated values for $\beta$. An optimal value for $\lambda$ can be determined empirically by splitting the data into training and test samples. The training sample is a randomly drawn sample of $\lfloor \pi n \rfloor$ observations, denoted $Y_1$, $X_1$, and $Z_1$. The estimate using the training sample, conditional on $\lambda$, is
$$ \hat{\beta}_1(\lambda) = (X_1' P_{Z_1} X_1 + \lambda I_k)^{-1}\,(X_1' P_{Z_1} Y_1 + \lambda\,\beta_{\mathrm{prior}}), $$
where $P_{Z_1}$ is the projection matrix onto $Z_1$ and $\lfloor \cdot \rfloor$ is the greatest integer function. The optimal $\lambda$ is selected to minimize the IV least squares objective function over the remaining $n - \lfloor \pi n \rfloor$ observations, i.e., the test or holdout sample, denoted $Y_2$, $X_2$, and $Z_2$. The estimated tuning parameter is defined by
$$ \hat{\lambda} = \arg\min_{\lambda \ge 0}\; \big(Y_2 - X_2 \hat{\beta}_1(\lambda)\big)' P_{Z_2} \big(Y_2 - X_2 \hat{\beta}_1(\lambda)\big), $$
where $\hat{\beta}_1(\lambda)$ is the training-sample estimate above and $P_{Z_2}$ is the projection matrix onto $Z_2$. The ridge regression estimate is then the ridge solution evaluated at the estimated tuning parameter $\hat{\lambda}$.
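The selection step can be sketched as a grid search over candidate tuning parameter values, reusing the `ridge_iv` sketch above; the grid, the training fraction, and the decision to return the training-sample estimate at the selected value are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def select_lambda_holdout(Y, X, Z, beta_prior, lam_grid, pi=0.5, rng=None):
    """Split the sample, fit the ridge path on the training part, and choose the
    tuning parameter that minimizes the IV objective on the holdout part (a sketch)."""
    rng = np.random.default_rng(rng)
    n = len(Y)
    train = rng.permutation(n) < int(pi * n)             # random training indicator
    Y1, X1, Z1 = Y[train], X[train], Z[train]
    Y2, X2, Z2 = Y[~train], X[~train], Z[~train]
    PZ2 = Z2 @ np.linalg.solve(Z2.T @ Z2, Z2.T)          # projection matrix onto Z2

    def holdout_loss(lam):
        b = ridge_iv(Y1, X1, Z1, lam, beta_prior)        # training-sample ridge estimate
        r = Y2 - X2 @ b
        return r @ PZ2 @ r                               # IV objective on the test sample

    lam_hat = min(lam_grid, key=holdout_loss)
    return lam_hat, ridge_iv(Y1, X1, Z1, lam_hat, beta_prior)
```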
Ref. [1] showed how the asymptotic distribution of the ridge estimator can be determined within the method of moments framework, using a parameterization in which $\mathrm{vec}(\cdot)$ stacks the elements of a matrix into a column vector and $\mathrm{vech}(\cdot)$ stacks the unique elements of a symmetric matrix into a column vector. The population parameter values are the probability limits of the corresponding sample quantities. The ridge estimator is part of the parameter estimates defined by a just identified system of moment equations, where the training and test samples are determined with an indicator function. Using the structure of Equation (9), the system can be seen as seven sets of equations. The first four sets are each self-contained systems with equal numbers of equations and parameters. The fifth set has $k$ equations and introduces $k$ parameters. The sixth is a single equation whose parameter is the tuning parameter. The seventh introduces the final $k$ parameters, the ridge estimate. Identification occurs because the expectation of the gradient is invertible. This is presented in Appendix A.
3. Asymptotic Behavior
The asymptotic distribution is derived with four high-level assumptions.
Assumption 1. The observations are iid with finite fourth moments, and the second moment matrix of the instruments has full rank.
Assumption 2. Conditional on Z, the errors are iid vectors with zero mean and a full rank covariance matrix with possibly nonzero off-diagonal elements.
Assumptions 1 and 2 imply that the moment conditions have zero mean at the population parameter values and that their scaled sample averages satisfy the CLT.
Assumption 3. The parameter space Θ is defined as follows: the second moment matrix is restricted to a symmetric positive definite matrix with eigenvalues bounded below and above, the coefficient matrix is of full rank with bounded elements, and the remaining parameters lie in bounded sets, where the bounding constants are positive and finite.
Assumption 4. The fraction of the sample used for training, $\pi$, satisfies $0 < \pi < 1$.
The tuning parameter selected using a holdout sample converges at a root-n rate when the prior differs from the population parameter value. When the prior is equal to the population parameter value, the tuning parameter is not identified.
Lemma 1. Assumptions 1–4 imply that, when the prior differs from the population parameter value, (1) the tuning parameter estimate converges in probability to zero and (2) it does so at a root-n rate; when the prior equals the population parameter value, the tuning parameter estimate converges in distribution to a draw from the a-min distribution whose matrix parameter is the symmetric matrix square root of the relevant asymptotic covariance matrix.
Proofs are given in Appendix A. The a-min distribution with matrix parameter S is the distribution of the minimizing value of a stochastic criterion built from standard normal $k$-dimensional random vectors. When the prior equals the population parameter value, the tuning parameter estimate converges in distribution to a draw from the a-min distribution with parameter S.
When the prior equals the population parameter value, the tuning parameter is no longer identified. Recall that the tuning parameter parameterizes a path from the prior to the IV estimator on the training sample. When the prior equals the population parameter value and the IV estimator is consistent for it, every value of the tuning parameter is associated with a consistent estimator of the structural parameters.
Lemma 1 implies that the probability limit of the tuning parameter is zero, which is on the boundary of the parameter space. This results in a nonstandard asymptotic distribution, which is characterized by the projection of a random vector onto a cone that allows the sample estimate to lie on the boundary of the parameter space. The estimation objective function can be expanded into a quadratic approximation about the centered and scaled population parameter values. Selecting the centered and scaled parameter values to minimize this quadratic approximation over the cone yields an asymptotic distribution equivalent to the distribution of the projection of the limiting random vector onto the cone, where the cone restricts the component corresponding to the tuning parameter to be nonnegative. The asymptotic distribution of the estimator defined in this way is characterized in Theorem 1 of [1]. For continuity of presentation, the theorem is repeated here.
Theorem 1. Assumptions 1–4 imply that, when the prior differs from the population parameter value, the asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection of the limiting random vector onto the cone.
The objective function can be minimized at a value of the tuning parameter in the interior of the parameter space or possibly at the boundary value of zero. The asymptotic distribution of the tuning parameter will therefore be composed of two parts: a discrete mass at zero and a continuous function over the positive values. The asymptotic distribution is characterized as the projection of a stochastic process onto a cone. The special structure of the ridge estimator using a holdout sample permits the calculation of the closed form of the asymptotic distribution for the parameters of interest; see Theorem 1 in [17], case 2 after Theorem 2 in [18], Section 3.8 in [19], and Theorem 5 in [20].
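As a stylized illustration of where the two parts come from (not the paper's exact limiting objects): if the component of the limiting normal vector corresponding to the tuning parameter, $z_\lambda \sim N(0, \sigma_\lambda^2)$, is only restricted by the cone to be nonnegative, then its projection is
$$ \hat{\lambda}_\infty = \max\{0,\, z_\lambda\}, \qquad \Pr\big(\hat{\lambda}_\infty = 0\big) = \tfrac{1}{2}, \qquad \hat{\lambda}_\infty \,\big|\, \hat{\lambda}_\infty > 0 \;\sim\; N(0, \sigma_\lambda^2)\ \text{truncated to } (0,\infty), $$
which produces exactly a point mass of 1/2 at zero together with a truncated normal over the positive values.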
Theorem 2. Assumptions 1–4 imply that, when the prior differs from the population parameter value, (i) the centered and scaled structural parameter estimates converge in distribution to a draw from a normal distribution whose nonzero mean and covariance both depend on the population parameter value, and (ii) the scaled tuning parameter estimate converges in distribution to a mixture distribution with discrete mass of 1/2 at zero and, over the positive values, a truncated normal distribution with zero mean.
When the prior is different from the population parameter value, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of 2SLS. However, this is restricted to only one dimension. The asymptotic MSE is the MSE of the 2SLS estimator plus a term built on a rank-one projection matrix associated with the deviation of the prior from the population parameter value. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the 2SLS estimator, except in that single dimension. Because the additional term takes its minimum at $\pi = 1/2$, the optimal sample split for minimizing the asymptotic bias, variance, and MSE is an equal split between the training and the testing (or holdout) samples.
Because the population parameter value enters into both the asymptotic bias and the asymptotic variance, it is not possible to determine individual t-statistics for the parameters. However, under the null hypothesis that the structural parameters equal a hypothesized value, the quadratic-form statistic built from the asymptotic bias and covariance evaluated at that value converges in distribution to a chi-square with k degrees of freedom. This statistic can be inverted to create accurate confidence regions.
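As an illustration of how such a statistic can be inverted, the following sketch assumes that the closed form bias and covariance from Theorem 2 are available as functions of the hypothesized value (the names `bias_fn` and `cov_fn` are hypothetical); the quadratic form is the generic Wald-type construction, not a formula quoted from this paper.

```python
import numpy as np
from scipy import stats

def in_confidence_region(beta_hat, beta_hyp, bias_fn, cov_fn, n, level=0.90):
    """Check whether beta_hyp lies inside the chi-square confidence region (a sketch).

    bias_fn(beta_hyp) and cov_fn(beta_hyp) are assumed to return the asymptotic
    bias and covariance of sqrt(n) * (beta_hat - beta) evaluated at beta_hyp.
    """
    k = len(beta_hat)
    dev = np.sqrt(n) * (beta_hat - beta_hyp) - bias_fn(beta_hyp)
    stat = dev @ np.linalg.solve(cov_fn(beta_hyp), dev)     # quadratic form
    return stat <= stats.chi2.ppf(level, df=k)
```

A confidence region is then the set of hypothesized vectors for which this check passes.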
The asymptotic behavior is different when the prior equals the population parameter value.
Theorem 3. Assumptions 1–4 imply that, when the prior equals the population parameter value, conditional on the limiting value of the tuning parameter, the centered and scaled structural parameter estimates converge in distribution to a limit indexed by that value, where the tuning parameter converges in distribution to a draw from an a-min distribution with parameter S.
In the unlikely event that the prior is selected equal to the population parameter value, the asymptotic covariance is smaller than or equal to the 2SLS asymptotic covariance. In terms of implementation, the covariance and bias associated with a prior different from the population parameter value should be used, because they are asymptotically correct for all priors except one equal to the population parameter value, where they lead to conservative confidence regions.
Linear Regression
A special case is the linear regression model $Y = X\beta_0 + e$, where $Y$ is $n \times 1$ and $X$ is $n \times k$ with full rank second moments $Q_{XX} = E[X_i X_i']$, and, conditional on $X$, the errors have mean zero. The estimation equations for the ridge regression estimate, where the tuning parameter $\lambda$ is selected with a holdout sample, can be written in the method of moments framework using a parameterization analogous to the IV case. The ridge estimator is part of the parameter estimates defined by a just identified system of moment equations, where the training and test samples are again determined with an indicator function.
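For reference, under the same prior-centered penalty as in the IV case, the conditional ridge solution and the holdout selection step in this special case take the familiar forms (a sketch in the notation introduced above, not necessarily the paper's exact parameterization):
$$ \hat{\beta}(\lambda) = (X'X + \lambda I_k)^{-1}\,(X'Y + \lambda\,\beta_{\mathrm{prior}}), \qquad \hat{\lambda} = \arg\min_{\lambda \ge 0}\; \big\| Y_2 - X_2 \hat{\beta}_1(\lambda) \big\|^2, $$
where $\hat{\beta}_1(\lambda)$ is the training-sample version of the first expression.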
Along with Assumption 4, the following three assumptions are sufficient to obtain the asymptotic results.
Assumption 5. The observations are iid with finite fourth moments, and the second moment matrix of the regressors has full rank.
Assumption 6. Conditional on X, the errors are iid with zero mean and finite variance.
Assumption 7. The parameter space Θ is defined as follows: the second moment matrix is restricted to a symmetric positive definite matrix with eigenvalues bounded below and above, and the remaining parameters lie in bounded sets, where the bounding constants are positive and finite.
Lemma 2 gives the rate of convergence for the tuning parameter when the prior is different from the population parameter value and characterizes its asymptotic distribution when the prior is equal to the population parameter value.
Lemma 2. Assumptions 4–7 imply that (i) if the prior differs from the population parameter value, then (1) the tuning parameter estimate converges in probability to zero and (2) it does so at a root-n rate; (ii) if the prior equals the population parameter value, the tuning parameter estimate converges in distribution to a draw from the a-min distribution whose parameter is the symmetric matrix square root of the relevant asymptotic covariance matrix.
As in the IV case, the asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection of a limiting random vector onto a cone, where the random vector arises from the quadratic approximation of the objective function and the cone restricts the component corresponding to the tuning parameter to be nonnegative. The estimator is defined as the minimizer of the quadratic approximation over the cone.
Theorem 4. Assumptions 4–7 imply that, when the prior differs from the population parameter value, the asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection defined above.
This theorem characterizes the asymptotic distribution of the estimator as the projection of a stochastic process onto a cone. The special structure of this problem allows for the analytic derivation of the asymptotic distribution.
Theorem 5. Assumptions 4–7 imply that, when the prior differs from the population parameter value, (i) the centered and scaled structural parameter estimates are asymptotically normally distributed with a nonzero mean and a covariance that both depend on the population parameter value, and (ii) the scaled tuning parameter estimate asymptotically has a mixture distribution with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal distribution with zero mean.
When the prior differs from the population parameter value, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of the OLS estimator. However, this is only in one dimension. The asymptotic MSE is the MSE of the OLS estimator plus a constant times a rank-one projection matrix associated with the deviation of the prior from the population parameter value. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the OLS estimator except in that single dimension. To minimize the bias, variance, and MSE of the estimator, an equal split, $\pi = 1/2$, should be selected. Because the population parameter value enters both the asymptotic bias and the asymptotic variance, individual t-statistics are not available. However, under the null hypothesis that the structural parameters equal a hypothesized value, the quadratic-form statistic built from the asymptotic bias and covariance converges in distribution to a draw from a chi-square with k degrees of freedom and can be used to create confidence regions.
When the prior equals the population parameter value, a different asymptotic distribution occurs.
Theorem 6. Assumptions 4–7 imply that, when the prior equals the population parameter value, conditional on the limiting value of the tuning parameter, the centered and scaled structural parameter estimates converge in distribution to a limit indexed by that value, where the tuning parameter converges in distribution to a draw from an a-min distribution with parameter S.
If the prior equals the population parameter value, the asymptotic covariance is smaller than or equal to the OLS estimator’s asymptotic covariance. Again, the covariance and bias associated with a prior different from the population parameter value should be used for inference.
4. Small Sample Properties
The behavior in finite samples is investigated next by simulating the model in Equations (1) and (2) with two structural parameters and four instruments. (Because the ridge estimator and the 2SLS estimator will be compared using MSE, two moments need to exist. This is ensured by having four instruments to estimate the two parameters; see [21].) To standardize the model, the second moment matrices are normalized. Endogeneity is created through correlation between the structural and first-stage errors, and the strength of the instruments is controlled by a precision parameter in the first stage; at one limiting value of this parameter, the second element of the structural parameter vector is not identified.
The ridge parameter estimate is determined in two steps. In the first step, the objective function is evaluated on a coarse grid of values to determine a starting value. In the second step, the objective function is evaluated over a finer grid (10,000 points) centered at the best value obtained from the first step. A value of zero in the second step corresponds to the ridge estimator ignoring the prior in favor of the data, whereas a value at the upper end of the grid corresponds to “infinite regularization”, implying that the ridge estimator ignores the data in favor of the prior.
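A minimal sketch of this two-step search; the coarse grid, the width of the refinement window, and the function names are illustrative, and `loss` stands for the holdout objective (e.g., the quantity minimized in the `select_lambda_holdout` sketch above).

```python
import numpy as np

def two_step_grid_search(loss, coarse_grid, n_fine=10_000):
    """Coarse grid search followed by a finer search around the best value (a sketch)."""
    coarse_grid = np.asarray(coarse_grid, dtype=float)
    start = coarse_grid[np.argmin([loss(l) for l in coarse_grid])]   # step 1: starting value
    width = np.diff(coarse_grid).max()                               # refinement window (illustrative)
    fine_grid = np.linspace(max(start - width, 0.0), start + width, n_fine)
    return fine_grid[np.argmin([loss(l) for l in fine_grid])]        # step 2: finer grid
```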
4.1. Coverage Probabilities
The chi-square test is performed for a range of parameterizations including different priors, different strengths of the instrument, and different sample sizes. Each model is simulated 10,000 times and a size of ten percent is used for each test. The results are presented in
Table A1. The observed coverage probabilities agree with the theoretical values. As expected, the approximations are best when the sample size is larger, the correlation between instruments and covariates is higher, and the prior is closer to the population parameter value.
4.2. MSE for the Ridge Estimator
The ridge estimator is compared with the 2SLS estimator to demonstrate settings where the ridge estimator can be expected to give more accurate results in small samples. The simulated models differ in three dimensions: sample size, strength of the instruments, and the prior on the structural parameters. For smaller sample sizes, the ridge estimator should have better properties, whereas, for larger sample sizes, 2SLS should perform better. Four sample sizes n are considered, including 50, 250, and 500. As noted above, the instrument signal strength depends on the precision parameter; for smaller signal strengths, the ridge estimator should perform better. Four values of the precision parameter are considered, including 0.05, 0.1, and 0.5. For prior values further from the population parameter values, the ridge estimator should perform worse. Three different values of the prior are considered. A total of 48 model specifications are simulated: four sample sizes n, four values of the precision parameter, and three values of the prior. Each specification is simulated repeatedly and estimated with 2SLS and with the ridge estimator, with the sample split into training and test subsamples as described in Section 2.
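One replication-level comparison can be sketched as follows, reusing the earlier `two_stage_least_squares`, `ridge_iv`, and `select_lambda_holdout` sketches; the data generating process below is purely illustrative and is not the design used in the paper's tables.

```python
import numpy as np

def simulate_mse(n, strength, beta0, beta_prior, lam_grid, n_rep=1000, seed=0):
    """Compare the summed MSE of 2SLS and ridge-with-holdout over n_rep replications (a sketch)."""
    rng = np.random.default_rng(seed)
    k = len(beta0)
    Pi = strength * np.vstack([np.eye(k), np.eye(k)])     # illustrative first stage, 2k instruments
    mse = {"2sls": 0.0, "ridge": 0.0}
    for _ in range(n_rep):
        Z = rng.standard_normal((n, 2 * k))
        u = rng.standard_normal(n)                        # common shock creates endogeneity
        V = rng.standard_normal((n, k)) + u[:, None]
        e = rng.standard_normal(n) + u
        X = Z @ Pi + V
        Y = X @ beta0 + e
        b_iv, _ = two_stage_least_squares(Y, X, Z)
        _, b_ridge = select_lambda_holdout(Y, X, Z, beta_prior, lam_grid)
        mse["2sls"] += np.sum((b_iv - beta0) ** 2) / n_rep
        mse["ridge"] += np.sum((b_ridge - beta0) ** 2) / n_rep
    return mse
```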
Table A2, Table A3 and Table A4 compare the performance of the 2SLS estimator with the ridge estimator for different precision levels and sample sizes, with the prior fixed at each of the three values in turn. The estimators are compared on the basis of bias, standard deviation of the estimates, MSE of the estimates, and the sum of the MSE values for the two structural parameters. All three tables demonstrate the expected results. The ridge estimator dominates in models with smaller sample sizes, weaker instruments, and priors closer to the population parameter values.
Overall, the simulations demonstrate that, for some model specifications, the ridge estimator using a holdout sample has better small sample performance than the 2SLS estimator. The simulations agree with the asymptotic distributions. As the sample size increases, the 2SLS estimator performs better than the ridge estimator.
5. Conclusions
Inference has always been a weakness of ridge regression. This paper presents a methodology and results that help address this weakness. Theoretically accurate inferences can be performed with the asymptotic distribution of the ridge estimates of the linear IV model when the tuning parameter is empirically selected by a holdout sample. It is well known that the distribution of the estimates of the structural parameters is affected by empirically selected tuning parameters. This is addressed by simultaneously estimating both the parameters of interest and the ridge regression tuning parameter in the method of moments framework. When the prior is different from the population parameter value, the estimator accounts for the probability limit of the tuning parameter being on the boundary of the parameter space. The asymptotic distribution for the tuning parameter is a nonstandard mixed distribution. The asymptotic distribution for the estimates of the structural parameters is normal but with a nonzero mean. The ridge estimator of the structural parameters has asymptotic bias and the asymptotic covariance is larger than the asymptotic covariance for the 2SLS estimator; however, the bias and larger covariance only apply to one dimension of the parameter space. The dependence of the asymptotic mean and variance on the population parameter values prevents the calculation of t-statistics for individual parameters. Fortunately, a chi-square statistic provides accurate confidence regions for the structural parameters.
If the prior is equal to the population parameter value, the ridge estimator is consistent and the asymptotic covariance is smaller than the 2SLS asymptotic covariance. The asymptotic distribution provides insights on how to perform estimation with a holdout sample. The minimum bias, variance, and MSE for the structural parameters occur when the sample is equally split into a training sample and a test (or holdout) sample.
This paper’s approach can be useful in determining the asymptotic behavior of other empirical procedures that select tuning parameters. Two natural extensions would be to generalize the approach to cross-validation (see [22]) and K-fold cross-validation, where the entire dataset would be used to select the tuning parameter. This paper has focused on strong correlations between the instruments and the regressors. Another important extension would be the asymptotic behavior in models with weaker correlation between the instruments and the regressors; see [23].