Abstract
The asymptotic distribution is presented for the linear instrumental variables model estimated with a ridge penalty and a prior, where the tuning parameter is selected with a holdout sample. The structural parameters and the tuning parameter are estimated jointly by method of moments. A chi-square statistic permits confidence regions for the structural parameters. The form of the asymptotic distribution provides insight into the optimal way to split the sample between the training and test samples. Results for the linear regression model estimated by ridge regression are presented as a special case.
Keywords: ridge regression; holdout sample; method of moments; asymptotic distribution; confidence region
JEL Classification: C13; C18
1. Introduction
This paper contributes to the asymptotic distribution theory for ridge parameter estimates of the linear instrumental variables model. The tuning parameter for the ridge penalty, denoted $\alpha$, is selected by splitting the data into a training sample and a test (or holdout) sample. In [1], the ridge penalty parameter is estimated jointly with the structural parameters, and the asymptotic distribution is characterized as the projection of a stochastic process onto a cone. This gives the rate of convergence to the asymptotic distribution but does not provide guidance for inference. To allow inference, the closed form for the asymptotic distribution is presented here. These new results allow for the calculation of confidence regions for the structural parameters. When the prior is equal to the population parameter value of the structural parameters, the tuning parameter is not identified. However, the structural parameter estimates are consistent, and the asymptotic covariance is smaller than that of the two-stage least squares (2SLS) estimator.
A fundamental issue with ridge regression is the selection of the tuning parameter and its resulting impact on inference for the parameters of interest. One approach is to select a deterministic function of the sample size that shrinks to zero fast enough not to impact the asymptotic distribution. The resulting asymptotic distribution is then equivalent to the OLS asymptotic distribution [2]. An alternative approach is to select the tuning parameter with the observed data. The literature contains multiple ways to estimate the tuning parameter [3,4]. For these estimates, inference typically follows the approach stated in [5]. Conditional on a fixed $\alpha$, the ridge estimator’s covariance is a function of $\alpha$. The $\alpha$ is selected using the observed data and substituted into the covariance that was calculated assuming that $\alpha$ is fixed. This covariance is used to create a test statistic. In [5], the resulting tests are correctly referred to as approximate t-tests because the authors appreciate the internal inconsistency of using a covariance obtained assuming a fixed $\alpha$ with an $\alpha$ value estimated with the observed data. (This problem is well known in the literature; see [6,7,8,9,10].) Each estimate of the tuning parameter leads to a different approximate t-test, and these tests are typically compared using simulations. This has been the approach for the past 20 years [5,11]. (“When the ridge parameter k is determined from the data, the above arguments are no longer valid. Hence, to investigate the size and power of the approximate ridge-based t-type tests in such cases, a Monte Carlo simulation study is conducted” [5]. “Since a theoretical assessment among the test statistics is not possible, a simulation study has been conducted to evaluate the performance of the suggested test statistics” [11].) For other models, researchers have proposed alternative procedures to obtain hyperparameter (tuning parameter) estimates [12,13]. Like the previous ridge regression literature, these approaches have relied on simulations to demonstrate their behavior. For inference, these procedures would need to be extended to establish their asymptotic distributions. A third approach is followed in this paper. The tuning parameter is selected by splitting the sample into training and test samples. The tuning parameter defines a path from the prior to the IV estimator on the training sample. On this path, the tuning parameter is selected to minimize the prediction error on the test sample. This procedure is written as a method of moments estimation problem in which the tuning parameter and the parameters of interest are estimated simultaneously. Inference is then performed using the joint asymptotic distribution.
A related literature concerns the distribution of some empirically selected ridge tuning parameters [14,15,16]. These approaches have relied on strong distributional assumptions (e.g., normally distributed errors). In addition, they are built on tuning parameters that are functions of the data, where the functions are derived by assuming that the tuning parameter is fixed. This leads to an inconsistency, because using the data to select the tuning parameter means that the tuning parameter is no longer fixed. In this paper, the inconsistency is avoided by estimating the structural parameters and the ridge tuning parameter simultaneously. Additionally, the method of moments framework permits weaker assumptions.
In [1], the joint asymptotic distribution for the parameters in the linear model and the ridge tuning parameter is characterized as the projection of a stochastic process onto a cone. This structure occurs because the probability limit of the ridge tuning parameter is on the boundary of the parameter space. (This is the familiar problem of consistently estimating a population parameter that is on the boundary of the parameter space [17,18,19,20].) This leads to a nonstandard asymptotic distribution that depends on the prior and the population parameter value of the structural parameters. When the prior is different from the population parameter value, the asymptotic distribution for the ridge tuning parameter is a mixture with a discrete mass of 1/2 at zero and a truncated normal over the positive reals. In addition, the asymptotic distribution for the structural parameters is normal with a nonzero mean. This mean and variance both contain the population parameter value. This prevents the calculation of t-statistics for individual parameter estimates. However, a hypothesis for the entire set of structural parameters can be tested using a chi-square test, and this statistic can be inverted to give accurate confidence regions.
2. Ridge Estimator for Linear IV Model Using a Holdout Sample
Consider the linear instrumental variables model where Y is $n \times 1$, X is $n \times k$, and Z is $n \times m$ with $m \geq k$,
$$Y = X\beta_0 + \varepsilon,$$
where the instruments are exogenous, with full rank second moments $Q_{zz} \equiv E[z_i z_i']$ and $Q_{zx} \equiv E[z_i x_i']$, and, conditional on Z,
$$E[\varepsilon] = 0, \qquad E[\varepsilon \varepsilon'] = \sigma_\varepsilon^2 I_n.$$
The IV, or 2SLS, estimator is
$$\hat{\beta}_{2SLS} = \left(X' P_Z X\right)^{-1} X' P_Z Y,$$
where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix for Z, and it has the asymptotic distribution
$$\sqrt{n}\left(\hat{\beta}_{2SLS} - \beta_0\right) \xrightarrow{d} N\left(0, \; \sigma_\varepsilon^2 \left(Q_{xz} Q_{zz}^{-1} Q_{zx}\right)^{-1}\right).$$
Let $Q_{xz} Q_{zz}^{-1} Q_{zx}$ have the spectral decomposition $CDC'$, where C is orthonormal, i.e., $C'C = I_k$, and D is a positive semidefinite diagonal matrix of eigenvalues, $d_1 \geq d_2 \geq \cdots \geq d_k$. When some of the eigenvectors explain very little variation, i.e., the corresponding eigenvalues have small magnitudes, the objective function is flatter along these dimensions, and the resulting covariance estimates are larger because the variance of $\hat{\beta}_{2SLS}$ is proportional to $CD^{-1}C' = \sum_{j=1}^{k} d_j^{-1} c_j c_j'$. This leads to a relatively large MSE. The ridge estimator addresses this by shrinking the estimated parameter towards a prior. The ridge objective function augments the usual IV objective function (4) with a quadratic penalty centered at a prior, $\beta_p$, weighted by a regularization tuning parameter $\alpha$:
$$\left(Y - X\beta\right)' P_Z \left(Y - X\beta\right) + \alpha \left(\beta - \beta_p\right)'\left(\beta - \beta_p\right).$$
Conditional on $\alpha$, the ridge solution is
$$\hat{\beta}(\alpha) = \left(X' P_Z X + \alpha I_k\right)^{-1} \left(X' P_Z Y + \alpha \beta_p\right).$$
Different values of $\alpha$ result in different estimated values for $\beta$. An optimal value for $\alpha$ can be determined empirically by splitting the data into training and test samples. The training sample is a randomly drawn sample of $\lfloor \pi n \rfloor$ observations, denoted $Y_1$, $X_1$, and $Z_1$. The estimate using the training sample, conditional on $\alpha$, is
$$\hat{\beta}_1(\alpha) = \left(X_1' P_{Z_1} X_1 + \alpha I_k\right)^{-1} \left(X_1' P_{Z_1} Y_1 + \alpha \beta_p\right),$$
where $P_{Z_1}$ is the projection matrix onto $Z_1$ and $\lfloor \cdot \rfloor$ is the greatest integer function. The optimal $\alpha$ is selected to minimize the IV least squares objective function over the remaining $n - \lfloor \pi n \rfloor$ observations, i.e., the test or holdout sample, denoted $Y_2$, $X_2$, and $Z_2$. The estimated tuning parameter is defined by $\hat{\alpha} = \arg\min_{\alpha \geq 0} R_n(\alpha)$, where
$$R_n(\alpha) = \left(Y_2 - X_2 \hat{\beta}_1(\alpha)\right)' P_{Z_2} \left(Y_2 - X_2 \hat{\beta}_1(\alpha)\right)$$
and $P_{Z_2}$ is the projection matrix onto $Z_2$. The ridge regression estimate is then characterized by $\hat{\beta}_{ridge} = \hat{\beta}_1(\hat{\alpha})$.
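To make the procedure concrete, the following sketch implements the estimator in Python with NumPy. It is a minimal sketch, not the paper's code: the function names, the grid over $\alpha$, and the convention that the reported estimate is the training-sample estimate evaluated at $\hat{\alpha}$ are illustrative assumptions.

```python
import numpy as np

def ridge_iv(X, Y, Z, alpha, prior):
    """Ridge IV estimate conditional on alpha: shrinks the IV estimator toward the prior."""
    PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # P_Z X without forming the n x n projection
    PZ_Y = Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y)
    k = X.shape[1]
    return np.linalg.solve(X.T @ PZ_X + alpha * np.eye(k), X.T @ PZ_Y + alpha * prior)

def holdout_ridge_iv(X, Y, Z, prior, pi=0.5, grid=None, seed=0):
    """Select alpha by minimizing the IV objective on the test (holdout) sample."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n1 = int(pi * n)                               # greatest-integer split
    tr, te = idx[:n1], idx[n1:]
    if grid is None:                               # illustrative grid, including alpha = 0
        grid = np.concatenate(([0.0], np.logspace(-4, 4, 200)))
    best_alpha, best_val = 0.0, np.inf
    for a in grid:
        b = ridge_iv(X[tr], Y[tr], Z[tr], a, prior)       # training-sample estimate
        r = Y[te] - X[te] @ b                             # test-sample residuals
        Zte = Z[te]
        val = r @ Zte @ np.linalg.solve(Zte.T @ Zte, Zte.T @ r)  # IV objective on test data
        if val < best_val:
            best_alpha, best_val = a, val
    return best_alpha, ridge_iv(X[tr], Y[tr], Z[tr], best_alpha, prior)
```

Using `np.linalg.solve` on the normal equations avoids explicitly forming the $n \times n$ projection matrices, so the sketch remains usable for moderately large samples.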
Ref. [1] showed how the asymptotic distribution of the ridge estimator can be determined within the method of moments framework using a parameterization
where $\mathrm{vec}(\cdot)$ stacks the elements of a matrix into a column vector and $\mathrm{vech}(\cdot)$ stacks the unique elements of a symmetric matrix into a column vector. The population parameter values are the corresponding population moments, the boundary value of zero for the tuning parameter, and the structural parameters $\beta_0$.
The ridge estimator is part of the parameter estimates defined by the just-identified system of equations in Equation (9),
where the training and test samples are determined with an indicator function for membership in the training sample.
Using the structure of Equation (9), the system can be seen as seven sets of equations. The first four sets are each self-contained systems with equal numbers of equations and parameters. The fifth set has k equations and introduces the k parameters of the training-sample estimator. The sixth is a single equation that introduces the tuning parameter $\alpha$. The seventh introduces the final k parameters, the ridge estimates of the structural parameters. Identification occurs because the expectation of the gradient is invertible. This is presented in Appendix A.
3. Asymptotic Behavior
The asymptotic distribution is derived from four high-level assumptions.
Assumption 1.
The instruments $z_i$ are iid with finite fourth moments, and $E[z_i z_i'] = Q_{zz}$ has full rank.
Assumption 2.
Conditional on Z, the errors are iid vectors with zero mean and a full rank covariance matrix with possibly nonzero off-diagonal elements.
Assumptions 1 and 2 imply that the sample second moments converge in probability to their population values and that the appropriately scaled sample moments satisfy the CLT.
Assumption 3.
The parameter space Θ is defined as follows: $Q_{zz}$ is restricted to symmetric positive definite matrices whose eigenvalues are bounded below and above by positive, finite constants; $Q_{zx}$ is of full rank with singular values bounded below and above by positive, finite constants; and the tuning parameter lies in a closed interval $[0, \alpha_U]$ with $\alpha_U$ positive and finite.
Assumption 4.
The fraction of the sample used for training, $\pi$, satisfies $0 < \pi < 1$.
The tuning parameter selected using a holdout sample converges to zero at a root-n rate when the prior differs from the population parameter value. When the prior is equal to the population parameter value, the tuning parameter is not identified.
Lemma 1.
Assumptions 1–4 imply that, when $\beta_p \neq \beta_0$, (1) $\hat{\alpha} \xrightarrow{p} 0$ and (2) $\hat{\alpha} = O_p\left(n^{-1/2}\right)$; and that, when $\beta_p = \beta_0$, $\hat{\alpha}$ converges in distribution to a draw from the a-min distribution,
where the matrix parameter S is the symmetric matrix square root of the limiting covariance matrix defined in the proof.
Proofs are given in Appendix A. The a-min distribution with matrix parameter S is characterized as the distribution of the minimizer constructed in the proof of Lemma 1, where the underlying random vectors are normally distributed. When $\beta_p = \beta_0$, $\hat{\alpha}$ converges in distribution to a draw from the a-min distribution with parameter S.
When $\beta_p = \beta_0$, $\alpha$ is no longer identified. Recall that $\alpha$ parameterizes a path from the prior to the IV estimator on the training sample. However, when the prior equals $\beta_0$, and the IV estimator is consistent for $\beta_0$, every value of $\alpha$ is associated with a consistent estimator for $\beta_0$.
Lemma 1 implies that the probability limit of the tuning parameter is zero, which is on the boundary of the parameter space. This results in a nonstandard asymptotic distribution, characterized by the projection of a random vector onto a cone that allows the sample estimate to be on the boundary of the parameter space. The estimation objective function can be expanded in a quadratic approximation about the centered and scaled population parameter values.
This suggests that selecting the estimates to minimize the quadratic approximation results in the asymptotic distribution of the centered and scaled estimates being equivalent to the distribution of the projection of a limiting normal random vector onto the cone of admissible values. The estimator and the cone are defined in [1], and the asymptotic distribution is characterized in Theorem 1 of [1]. For continuity of presentation, the theorem is repeated here.
Theorem 1.
Assumptions 1–4 imply that, when $\beta_p \neq \beta_0$, the asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection of the limiting normal random vector onto the cone.
The objective function can be minimized at a value of the tuning parameter in the interior of the parameter space or possibly at the boundary value of zero. The asymptotic distribution of the tuning parameter is therefore composed of two parts: a discrete mass at zero and a continuous distribution over the positive values. The asymptotic distribution is characterized as the projection of a stochastic process onto a cone. The special structure of the ridge estimator using a holdout sample permits the calculation of the closed form of the asymptotic distribution for the parameters of interest; see Theorem 1 in [17], case 2 after Theorem 2 in [18], Section 3.8 in [19], and Theorem 5 in [20].
Theorem 2.
Assumptions 1–4 imply that, when $\beta_p \neq \beta_0$,
(i) $\sqrt{n}\left(\hat{\beta}_{ridge} - \beta_0\right)$ converges in distribution to a draw from a normal distribution with mean
and covariance
and
(ii) $\sqrt{n}\,\hat{\alpha}$ will converge in distribution to a mixture distribution with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal distribution with zero mean and covariance
When the prior is different from the population parameter value, the ridge estimator has an asymptotic bias, and its asymptotic variance is larger than that of 2SLS. However, this is restricted to only one dimension. The asymptotic MSE is
the MSE of the 2SLS estimator plus a term built on the projection matrix for the direction $\beta_0 - \beta_p$. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the 2SLS estimator, except in the dimension of $\beta_0 - \beta_p$. Because the additional term takes its minimum at $\pi = 1/2$, the optimal sample split, minimizing the asymptotic bias, variance, and MSE, is an equal split between the training and the testing (or holdout) samples.
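To see why the equal split is optimal, suppose, purely for illustration, that the additional term scales with the factor $1/(\pi(1-\pi))$, a form that is symmetric in the training fraction $\pi$ and the testing fraction $1-\pi$ and is consistent with the equal-split optimum stated above. Minimizing this factor over $\pi$ gives

```latex
\frac{d}{d\pi}\left[\frac{1}{\pi(1-\pi)}\right]
  = \frac{2\pi - 1}{\pi^{2}(1-\pi)^{2}} = 0
  \quad\Longrightarrow\quad \pi = \frac{1}{2},
```

and the second derivative is positive at $\pi = 1/2$, confirming a minimum.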
Because the population parameter value enters both the asymptotic bias and the asymptotic variance, it is not possible to construct t-statistics for individual parameters. However, under the null hypothesis that $\beta = \beta_0$, the statistic
will converge in distribution to a chi-square with k degrees of freedom. This statistic can be inverted to create accurate confidence regions.
The asymptotic behavior is different when $\beta_p = \beta_0$.
Theorem 3.
Assumptions 1–4 imply that, when $\beta_p = \beta_0$, conditional on $\hat{\alpha}$, the centered and scaled ridge estimator converges in distribution to a draw from a normal distribution whose covariance depends on $\hat{\alpha}$,
where $\hat{\alpha}$ converges in distribution to a draw from an a-min distribution with parameter S.
In the unlikely event that the prior is selected equal to the population parameter value, the asymptotic covariance is smaller than or equal to the 2SLS asymptotic covariance. In terms of implementation, the covariance and bias associated with $\beta_p \neq \beta_0$ should be used, because they are asymptotically correct for all priors except $\beta_p = \beta_0$, where they lead to conservative confidence regions.
Linear Regression
A special case is the linear regression model where Y is $n \times 1$ and X is $n \times k$,
$$Y = X\beta_0 + \varepsilon,$$
with full rank second moments $Q_{xx} \equiv E[x_i x_i']$, and, conditional on X, the errors have zero mean and scalar covariance $\sigma_\varepsilon^2 I_n$. The estimation equations for the ridge regression estimate, where the tuning parameter $\alpha$ is selected with a holdout sample, can be written in the method of moments framework using the analogous parameterization.
The ridge estimator is part of the parameter estimates defined by the just-identified system of equations, where the moment conditions mirror those of the IV case with Z replaced by X.
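Because the projection of X onto itself is X, this special case corresponds to the earlier IV sketch with Z = X. A minimal standalone version, under the same illustrative conventions, is:

```python
import numpy as np

def ridge_ols(X, Y, alpha, prior):
    """Ridge regression estimate conditional on alpha."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ Y + alpha * prior)

def holdout_ridge_ols(X, Y, prior, pi=0.5, grid=None, seed=0):
    """Select alpha by minimizing squared prediction error on the holdout sample."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    n1 = int(pi * X.shape[0])
    tr, te = idx[:n1], idx[n1:]
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-4, 4, 200)))
    sse = [np.sum((Y[te] - X[te] @ ridge_ols(X[tr], Y[tr], a, prior)) ** 2) for a in grid]
    a_hat = grid[int(np.argmin(sse))]
    return a_hat, ridge_ols(X[tr], Y[tr], a_hat, prior)
```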
Along with Assumption 4, the following three assumptions are sufficient to obtain the asymptotic results.
Assumption 5.
The regressors $x_i$ are iid with finite fourth moments, and $E[x_i x_i'] = Q_{xx}$ has full rank.
Assumption 6.
Conditional on X, the errors are iid with zero mean and finite variance $\sigma_\varepsilon^2$.
Assumption 7.
The parameter space Θ is defined as follows: $Q_{xx}$ is restricted to symmetric positive definite matrices whose eigenvalues are bounded below and above by positive, finite constants, and the tuning parameter lies in a closed interval $[0, \alpha_U]$ with $\alpha_U$ positive and finite.
Lemma 2 gives the rate of convergence for the tuning parameter when the prior is different from the population parameter value and characterizes its asymptotic distribution when the prior is equal to the population parameter value.
Lemma 2.
Assumptions 4–7 imply that
(i) if $\beta_p \neq \beta_0$, (1) $\hat{\alpha} \xrightarrow{p} 0$ and (2) $\hat{\alpha} = O_p\left(n^{-1/2}\right)$; (ii) if $\beta_p = \beta_0$, $\hat{\alpha}$ converges in distribution to a draw from the a-min distribution with parameter S, the symmetric matrix square root of the limiting covariance matrix defined in the proof.
The asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection of a limiting normal random vector onto the cone of admissible values, where the random vector and the cone are defined as in the IV case. The estimator is defined by minimizing the quadratic approximation of the objective function over this cone.
Theorem 4.
Assumptions 4–7 imply that, when $\beta_p \neq \beta_0$, the asymptotic distribution of the centered and scaled estimates is equivalent to the distribution of the projection described above.
This theorem characterizes the asymptotic distribution of the estimator as the projection of a stochastic process onto a cone. The special structure of this problem allows for the analytic derivation of the asymptotic distribution.
Theorem 5.
Assumptions 4–7 imply that, when $\beta_p \neq \beta_0$,
(i) $\sqrt{n}\left(\hat{\beta}_{ridge} - \beta_0\right)$ is asymptotically normally distributed with mean
and covariance
and
(ii) $\sqrt{n}\,\hat{\alpha}$ asymptotically has a mixture distribution with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal distribution with zero mean and covariance
When $\beta_p \neq \beta_0$, the ridge estimator has an asymptotic bias, and its asymptotic variance is larger than that of the OLS estimator. However, this is only in one dimension. The asymptotic MSE is
the MSE of the OLS estimator plus a constant times the projection matrix for the direction $\beta_0 - \beta_p$. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the OLS estimator except in the dimension of $\beta_0 - \beta_p$. To minimize the bias, variance, and MSE of the estimator, the equal split $\pi = 1/2$ should be selected. Because the population parameter value appears in both the asymptotic bias and the asymptotic variance, individual t-statistics are not available. However, under the null hypothesis that $\beta = \beta_0$, the statistic
will converge in distribution to a draw from a chi-square distribution with k degrees of freedom and can be used to create confidence regions.
When $\beta_p = \beta_0$, a different asymptotic distribution occurs.
Theorem 6.
Assumptions 4–7 imply that, when $\beta_p = \beta_0$, conditional on $\hat{\alpha}$, the centered and scaled ridge estimator converges in distribution to a draw from a normal distribution whose covariance depends on $\hat{\alpha}$,
where $\hat{\alpha}$ will converge in distribution to a draw from an a-min distribution with parameter S.
If $\beta_p = \beta_0$, the asymptotic covariance is smaller than or equal to the OLS estimator’s asymptotic covariance. Again, the covariance and bias associated with $\beta_p \neq \beta_0$ should be used for inference.
4. Small Sample Properties
The behavior in finite samples is investigated next by simulating the model in Equations (1) and (2) with two structural parameters and four instruments. (Because the ridge estimator and the 2SLS estimator will be compared using MSE, two moments need to exist. This is ensured by having four instruments to estimate the two parameters. See [21].) To standardize the model, the remaining parameter values are held fixed across simulations. Endogeneity is created through correlation between the structural error and the regressors.
The strength of the instruments is controlled by a precision parameter: the smaller the parameter, the weaker the instruments, and, if the parameter is zero, the second element of $\beta_0$ is not identified.
The ridge parameter estimate is determined in two steps. In the first step, the objective function is evaluated on a coarse grid of values to determine a starting value. In the second step, the objective function is evaluated over a finer grid (10,000 points) centered at the best value obtained from the first step. A value of zero in the second step corresponds to the ridge estimator ignoring the prior in favor of the data, whereas a value at the upper end of the grid corresponds to “infinite regularization”, implying that the ridge estimator ignores the data in favor of the prior.
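A sketch of this two-step search is given below, with `objective` standing for the holdout criterion (for example, the test-sample IV objective from the earlier sketch); the coarse grid bounds and the refinement window are illustrative choices, not the paper's exact grids.

```python
import numpy as np

def two_step_alpha(objective, coarse_grid=None, fine_points=10_000):
    """Two-step grid search for the tuning parameter.

    Step 1: evaluate the holdout objective on a coarse grid to find a starting value.
    Step 2: re-evaluate on a fine grid (10,000 points) centered at the step-1 minimizer.
    """
    if coarse_grid is None:
        coarse_grid = np.concatenate(([0.0], np.logspace(-6, 6, 100)))
    vals = np.array([objective(a) for a in coarse_grid])
    a0 = coarse_grid[vals.argmin()]
    # fine grid centered at the best coarse value, truncated at zero
    lo = max(0.5 * a0, 0.0)
    hi = 1.5 * a0 if a0 > 0 else 1e-6
    fine_grid = np.linspace(lo, hi, fine_points)
    fvals = np.array([objective(a) for a in fine_grid])
    return fine_grid[fvals.argmin()]
```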
4.1. Coverage Probabilities
The chi-square test is performed for a range of parameterizations, including different priors, different strengths of the instruments, and different sample sizes. Each model is simulated 10,000 times, and a size of ten percent is used for each test. The results are presented in Table A1. The observed coverage probabilities agree with the theoretical values. As expected, the approximations are best in cases where the sample size is large, the correlation between instruments and covariates is higher, and the prior is closer to the population parameter value.
4.2. MSE for the Ridge Estimator
The ridge estimator is compared with the 2SLS estimator to demonstrate settings where the ridge estimator can be expected to give more accurate results in small samples. The simulated models differ in three dimensions: sample size, strength of the instruments, and the prior on the structural parameters. For smaller sample sizes, the ridge estimator should have better properties, whereas, for larger sample sizes, 2SLS should perform better. Sample sizes of n = 25, 50, 250, and 500 are considered. As noted above, the instrument signal strength decreases as the precision parameter approaches zero. For smaller signal strengths, the ridge estimator should perform better. Precision values of 0.01, 0.05, 0.1, and 0.5 are considered. For prior values further from the population parameter values, the ridge estimator should perform worse. Three different values of the prior are considered. A total of 48 model specifications are simulated: four sample sizes n, four values of the precision parameter, and three values of the prior. Each specification is simulated 10,000 times and estimated with 2SLS and the ridge estimator. An equal split, $\pi = 1/2$, is used to divide the sample between training and test samples for the ridge estimator.
Table A2, Table A3 and Table A4 compare the performance of the 2SLS estimator with the ridge estimator for different precision levels and sample sizes, with the prior fixed at a different value in each table. The estimators are compared on the basis of the bias, the standard deviation of the estimates, the MSE of each estimate, and the sum of the MSE values for the two structural parameters; these statistics can be computed as in the sketch below. All three tables show the expected results: the ridge estimator dominates in models with smaller sample sizes and weaker instruments, and when the prior is closer to the population parameter values.
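The summary statistics reported in the tables are straightforward to compute from the Monte Carlo output; a minimal sketch, assuming the replications are stacked in an array, is:

```python
import numpy as np

def mc_summary(estimates, beta_true):
    """Bias, standard deviation, MSE, and total MSE of Monte Carlo estimates.

    estimates: (replications, k) array of estimated coefficient vectors.
    beta_true: (k,) array of population parameter values.
    """
    err = estimates - beta_true
    bias = err.mean(axis=0)
    sd = estimates.std(axis=0, ddof=1)
    mse = (err ** 2).mean(axis=0)
    return bias, sd, mse, mse.sum()    # last entry: sum of MSEs across coefficients
```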
Overall, the simulations demonstrate that, for some model specifications, the ridge estimator using a holdout sample has better small sample performance than the 2SLS estimator. The simulations agree with the asymptotic distributions: as the sample size increases, the 2SLS estimator performs better than the ridge estimator.
5. Conclusions
Inference has always been a weakness of ridge regression. This paper presents a methodology and results that help address some of its weaknesses. Theoretically accurate inference can be performed with the asymptotic distribution of the ridge estimates of the linear IV model when the tuning parameter is empirically selected by a holdout sample. It is well known that the distribution of the estimates of the structural parameters is affected by empirically selected tuning parameters. This is addressed by simultaneously estimating both the parameters of interest and the ridge regression tuning parameter in the method of moments framework. When the prior is different from the population parameter value, the estimator accounts for the probability limit of the tuning parameter being on the boundary of the parameter space. The asymptotic distribution for the tuning parameter is a nonstandard mixed distribution. The asymptotic distribution for the estimates of the structural parameters is normal but with a nonzero mean. The ridge estimator of the structural parameters has an asymptotic bias, and its asymptotic covariance is larger than that of the 2SLS estimator; however, the bias and larger covariance apply to only one dimension of the parameter space. The dependence of the asymptotic mean and variance on the population parameter values prevents the calculation of t-statistics for individual parameters. Fortunately, a chi-square statistic provides accurate confidence regions for the structural parameters.
If the prior is equal to the population parameter value, the ridge estimator is consistent and the asymptotic covariance is smaller than the 2SLS asymptotic covariance. The asymptotic distribution provides insights on how to perform estimation with a holdout sample. The minimum bias, variance, and MSE for the structural parameters occur when the sample is equally split into a training sample and a test (or holdout) sample.
This paper’s approach can be useful in determining the asymptotic behavior of other empirical procedures that select tuning parameters. Two natural extensions would be generalized cross-validation (see [22]) and K-fold cross-validation, where the entire dataset would be used to select the tuning parameter. This paper has focused on strong correlations between the instruments and the regressors. Another important extension would be the asymptotic behavior in models with weaker correlation between the instruments and the regressors; see [23].
Supplementary Materials
The following are available online at https://www.mdpi.com/article/10.3390/stats4030043/s1.
Author Contributions
Conceptualization, F.S. and N.S.; simulations, F.S. and N.S.; writing—original draft preparation, F.S. and N.S.; writing—review and editing, F.S. and N.S. Both authors contributed equally to this project. Both authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Acknowledgments
The authors thank three anonymous referees for their helpful comments. The authors benefited from discussions during the presentation at the 2021 North American Summer Meeting of the Econometrics Society.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Table A1.
Coverage probabilities for confidence regions created from the chi-square statistic and the corresponding values from 2SLS estimation. The simulated model is given by Equations (1), (2), (4) and (12). The models differ with respect to the prior, the strength of the instruments, and the sample size (n). Each model is simulated 10,000 times, and each test is performed with a size of 10%.
| n | Ridge | 2SLS |
|---|---|---|
| 25 | 0.651 | 0.547 |
| 500 | 0.717 | 0.629 |
| 10,000 | 0.708 | 0.661 |
| 25 | 0.655 | 0.561 |
| 500 | 0.710 | 0.663 |
| 10,000 | 0.721 | 0.863 |
| 25 | 0.656 | 0.571 |
| 500 | 0.709 | 0.759 |
| 10,000 | 0.855 | 0.890 |
| 25 | 0.697 | 0.716 |
| 500 | 0.870 | 0.896 |
| 10,000 | 0.900 | 0.897 |
| 25 | 0.623 | 0.556 |
| 500 | 0.686 | 0.636 |
| 10,000 | 0.675 | 0.661 |
| 25 | 0.625 | 0.563 |
| 500 | 0.671 | 0.673 |
| 10,000 | 0.736 | 0.859 |
| 25 | 0.624 | 0.576 |
| 500 | 0.682 | 0.760 |
| 10,000 | 0.863 | 0.892 |
| 25 | 0.688 | 0.723 |
| 500 | 0.870 | 0.888 |
| 10,000 | 0.891 | 0.900 |
| 25 | 0.621 | 0.565 |
| 500 | 0.670 | 0.632 |
| 10,000 | 0.650 | 0.669 |
| 25 | 0.622 | 0.565 |
| 500 | 0.664 | 0.664 |
| 10,000 | 0.784 | 0.856 |
| 25 | 0.616 | 0.564 |
| 500 | 0.673 | 0.750 |
| 10,000 | 0.867 | 0.892 |
| 25 | 0.689 | 0.726 |
| 500 | 0.865 | 0.895 |
| 10,000 | 0.886 | 0.896 |
Table A2.
The simulated model is given by Equations (1), (2), (4) and (12). Summary statistics are reported for estimates of the two structural parameters using the 2SLS and ridge estimators, with the prior held fixed. The models differ with respect to the strength of the instruments (the precision parameter) and the sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| Precision | n | Estimator | Bias β1 | SD β1 | MSE β1 | Bias β2 | SD β2 | MSE β2 | Σ MSE |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.020 | 0.117 | 0.014 | 0.715 | 0.661 | 0.948 | 0.962 |
| | | Ridge | 0.080 | 0.108 | 0.018 | 0.604 | 0.343 | 0.483 | 0.501 |
| | 50 | 2SLS | 0.009 | 0.082 | 0.007 | 0.696 | 0.686 | 0.956 | 0.963 |
| | | Ridge | 0.055 | 0.076 | 0.009 | 0.589 | 0.333 | 0.458 | 0.467 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.694 | 0.740 | 1.029 | 1.031 |
| | | Ridge | 0.022 | 0.037 | 0.002 | 0.577 | 0.349 | 0.455 | 0.457 |
| | 500 | 2SLS | 0.001 | 0.028 | 0.001 | 0.697 | 0.710 | 0.989 | 0.990 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.577 | 0.317 | 0.433 | 0.434 |
| 0.05 | 25 | 2SLS | 0.019 | 0.122 | 0.015 | 0.688 | 0.694 | 0.954 | 0.970 |
| | | Ridge | 0.083 | 0.110 | 0.019 | 0.595 | 0.352 | 0.477 | 0.496 |
| | 50 | 2SLS | 0.011 | 0.084 | 0.007 | 0.683 | 0.697 | 0.952 | 0.959 |
| | | Ridge | 0.056 | 0.078 | 0.009 | 0.583 | 0.335 | 0.452 | 0.462 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.572 | 0.712 | 0.834 | 0.835 |
| | | Ridge | 0.022 | 0.038 | 0.002 | 0.531 | 0.332 | 0.392 | 0.394 |
| | 500 | 2SLS | 0.001 | 0.025 | 0.001 | 0.485 | 0.687 | 0.707 | 0.707 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.504 | 0.292 | 0.339 | 0.340 |
| 0.10 | 25 | 2SLS | 0.021 | 0.121 | 0.015 | 0.636 | 0.727 | 0.933 | 0.948 |
| | | Ridge | 0.082 | 0.107 | 0.018 | 0.572 | 0.356 | 0.454 | 0.472 |
| | 50 | 2SLS | 0.008 | 0.086 | 0.007 | 0.590 | 0.719 | 0.865 | 0.872 |
| | | Ridge | 0.054 | 0.079 | 0.009 | 0.552 | 0.334 | 0.416 | 0.425 |
| | 250 | 2SLS | 0.002 | 0.035 | 0.001 | 0.335 | 0.550 | 0.415 | 0.416 |
| | | Ridge | 0.024 | 0.038 | 0.002 | 0.444 | 0.265 | 0.268 | 0.270 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.181 | 0.463 | 0.247 | 0.248 |
| | | Ridge | 0.016 | 0.028 | 0.001 | 0.378 | 0.245 | 0.203 | 0.204 |
| 0.50 | 25 | 2SLS | 0.012 | 0.127 | 0.016 | 0.158 | 0.462 | 0.238 | 0.255 |
| | | Ridge | 0.088 | 0.123 | 0.023 | 0.325 | 0.258 | 0.173 | 0.195 |
| | 50 | 2SLS | 0.006 | 0.085 | 0.007 | 0.066 | 0.312 | 0.101 | 0.109 |
| | | Ridge | 0.055 | 0.089 | 0.011 | 0.243 | 0.235 | 0.114 | 0.125 |
| | 250 | 2SLS | 0.002 | 0.036 | 0.001 | 0.014 | 0.126 | 0.016 | 0.017 |
| | | Ridge | 0.016 | 0.039 | 0.002 | 0.104 | 0.148 | 0.033 | 0.035 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.007 | 0.089 | 0.008 | 0.009 |
| | | Ridge | 0.009 | 0.027 | 0.001 | 0.070 | 0.111 | 0.017 | 0.018 |
Table A3.
The simulated model is given by Equations (1), (2), (4) and (12). Summary statistics are reported for estimates of the two structural parameters using the 2SLS and ridge estimators, with the prior held fixed. The models differ with respect to the strength of the instruments (the precision parameter) and the sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| Precision | n | Estimator | Bias β1 | SD β1 | MSE β1 | Bias β2 | SD β2 | MSE β2 | Σ MSE |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.021 | 0.115 | 0.014 | 0.688 | 0.659 | 0.908 | 0.921 |
| | | Ridge | 0.084 | 0.108 | 0.019 | 0.700 | 0.334 | 0.601 | 0.620 |
| | 50 | 2SLS | 0.009 | 0.081 | 0.007 | 0.702 | 0.698 | 0.980 | 0.987 |
| | | Ridge | 0.055 | 0.077 | 0.009 | 0.708 | 0.331 | 0.611 | 0.620 |
| | 250 | 2SLS | 0.002 | 0.037 | 0.001 | 0.693 | 0.730 | 1.014 | 1.016 |
| | | Ridge | 0.024 | 0.037 | 0.002 | 0.700 | 0.316 | 0.589 | 0.591 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.675 | 0.702 | 0.949 | 0.949 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.700 | 0.304 | 0.582 | 0.583 |
| 0.05 | 25 | 2SLS | 0.021 | 0.117 | 0.014 | 0.683 | 0.654 | 0.894 | 0.908 |
| | | Ridge | 0.083 | 0.107 | 0.018 | 0.694 | 0.334 | 0.593 | 0.612 |
| | 50 | 2SLS | 0.010 | 0.081 | 0.007 | 0.666 | 0.663 | 0.884 | 0.891 |
| | | Ridge | 0.056 | 0.077 | 0.009 | 0.690 | 0.339 | 0.591 | 0.601 |
| | 250 | 2SLS | 0.002 | 0.037 | 0.001 | 0.576 | 0.683 | 0.798 | 0.800 |
| | | Ridge | 0.023 | 0.037 | 0.002 | 0.651 | 0.306 | 0.518 | 0.520 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.472 | 0.662 | 0.661 | 0.662 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.618 | 0.324 | 0.487 | 0.488 |
| 0.10 | 25 | 2SLS | 0.022 | 0.120 | 0.015 | 0.657 | 0.695 | 0.914 | 0.929 |
| | | Ridge | 0.085 | 0.111 | 0.019 | 0.679 | 0.361 | 0.591 | 0.611 |
| | 50 | 2SLS | 0.009 | 0.083 | 0.007 | 0.595 | 0.690 | 0.830 | 0.837 |
| | | Ridge | 0.056 | 0.078 | 0.009 | 0.657 | 0.316 | 0.532 | 0.541 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.334 | 0.572 | 0.439 | 0.441 |
| | | Ridge | 0.023 | 0.038 | 0.002 | 0.548 | 0.301 | 0.391 | 0.393 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.187 | 0.457 | 0.244 | 0.245 |
| | | Ridge | 0.015 | 0.027 | 0.001 | 0.463 | 0.295 | 0.301 | 0.302 |
| 0.50 | 25 | 2SLS | 0.014 | 0.124 | 0.016 | 0.169 | 0.455 | 0.235 | 0.251 |
| | | Ridge | 0.084 | 0.126 | 0.023 | 0.369 | 0.328 | 0.244 | 0.267 |
| | 50 | 2SLS | 0.006 | 0.083 | 0.007 | 0.066 | 0.293 | 0.090 | 0.097 |
| | | Ridge | 0.050 | 0.085 | 0.010 | 0.266 | 0.268 | 0.142 | 0.152 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.011 | 0.128 | 0.017 | 0.018 |
| | | Ridge | 0.013 | 0.038 | 0.002 | 0.103 | 0.155 | 0.035 | 0.036 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.005 | 0.090 | 0.008 | 0.009 |
| | | Ridge | 0.008 | 0.026 | 0.001 | 0.071 | 0.114 | 0.018 | 0.019 |
Table A4.
The simulated model is given by Equations (1), (2), (4) and (12). Summary statistics are reported for estimates of the two structural parameters using the 2SLS and ridge estimators, with the prior held fixed. The models differ with respect to the strength of the instruments (the precision parameter) and the sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| Precision | n | Estimator | Bias β1 | SD β1 | MSE β1 | Bias β2 | SD β2 | MSE β2 | Σ MSE |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.022 | 0.124 | 0.016 | 0.708 | 0.724 | 1.025 | 1.041 |
| | | Ridge | 0.086 | 0.124 | 0.023 | 0.831 | 0.435 | 0.880 | 0.903 |
| | 50 | 2SLS | 0.009 | 0.081 | 0.007 | 0.695 | 0.678 | 0.943 | 0.950 |
| | | Ridge | 0.057 | 0.083 | 0.010 | 0.836 | 0.380 | 0.842 | 0.853 |
| | 250 | 2SLS | 0.003 | 0.036 | 0.001 | 0.691 | 0.693 | 0.958 | 0.960 |
| | | Ridge | 0.024 | 0.039 | 0.002 | 0.859 | 0.385 | 0.887 | 0.889 |
| | 500 | 2SLS | 0.001 | 0.027 | 0.001 | 0.698 | 0.725 | 1.013 | 1.014 |
| | | Ridge | 0.016 | 0.028 | 0.001 | 0.868 | 0.351 | 0.877 | 0.878 |
| 0.05 | 25 | 2SLS | 0.018 | 0.125 | 0.016 | 0.688 | 0.711 | 0.979 | 0.995 |
| | | Ridge | 0.083 | 0.117 | 0.021 | 0.820 | 0.399 | 0.832 | 0.853 |
| | 50 | 2SLS | 0.010 | 0.082 | 0.007 | 0.669 | 0.681 | 0.911 | 0.918 |
| | | Ridge | 0.057 | 0.082 | 0.010 | 0.823 | 0.383 | 0.824 | 0.834 |
| | 250 | 2SLS | 0.002 | 0.035 | 0.001 | 0.568 | 0.658 | 0.755 | 0.756 |
| | | Ridge | 0.023 | 0.039 | 0.002 | 0.793 | 0.365 | 0.761 | 0.763 |
| | 500 | 2SLS | 0.001 | 0.031 | 0.001 | 0.466 | 0.725 | 0.743 | 0.744 |
| | | Ridge | 0.016 | 0.031 | 0.001 | 0.753 | 0.437 | 0.757 | 0.759 |
| 0.10 | 25 | 2SLS | 0.018 | 0.121 | 0.015 | 0.654 | 0.682 | 0.893 | 0.908 |
| | | Ridge | 0.085 | 0.119 | 0.021 | 0.799 | 0.404 | 0.802 | 0.823 |
| | 50 | 2SLS | 0.010 | 0.099 | 0.010 | 0.593 | 1.026 | 1.405 | 1.415 |
| | | Ridge | 0.057 | 0.083 | 0.010 | 0.780 | 0.410 | 0.777 | 0.787 |
| | 250 | 2SLS | 0.001 | 0.036 | 0.001 | 0.318 | 0.605 | 0.468 | 0.469 |
| | | Ridge | 0.021 | 0.040 | 0.002 | 0.643 | 0.393 | 0.567 | 0.569 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.178 | 0.464 | 0.247 | 0.247 |
| | | Ridge | 0.013 | 0.028 | 0.001 | 0.524 | 0.395 | 0.431 | 0.432 |
| 0.50 | 25 | 2SLS | 0.014 | 0.123 | 0.015 | 0.162 | 0.434 | 0.215 | 0.230 |
| | | Ridge | 0.079 | 0.129 | 0.023 | 0.405 | 0.376 | 0.306 | 0.329 |
| | 50 | 2SLS | 0.006 | 0.087 | 0.008 | 0.071 | 0.317 | 0.105 | 0.113 |
| | | Ridge | 0.043 | 0.088 | 0.010 | 0.278 | 0.309 | 0.173 | 0.182 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.012 | 0.127 | 0.016 | 0.018 |
| | | Ridge | 0.011 | 0.037 | 0.001 | 0.103 | 0.158 | 0.035 | 0.037 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.007 | 0.089 | 0.008 | 0.009 |
| | | Ridge | 0.007 | 0.026 | 0.001 | 0.071 | 0.114 | 0.018 | 0.019 |
Proofs
Lemma 1
Proof of Lemma 1.
The objective function that determines the tuning parameter is
Substitute
and write the objective function
The CLT and LLN imply that the leading terms are bounded in probability. The LLN implies the remaining term is of a larger order when $\beta_p \neq \beta_0$. However, this term will be zero if $\beta_p = \beta_0$. Hence, the limiting behavior of the objective function will be determined by the dominant term when $\beta_p \neq \beta_0$ and by the remaining terms when $\beta_p = \beta_0$.
For $\beta_p \neq \beta_0$, the consistency of $\hat{\alpha}$ is presented in Lemma 1 of [1]. For $\beta_p = \beta_0$,
the limiting objective function is a quadratic form in normal random vectors. Hence, $\hat{\alpha}$ converges in distribution to a draw from the a-min distribution with parameter S, where S is the symmetric matrix square root of the limiting covariance matrix. □
Theorem 1
This theorem and its proof are presented in [1].
Theorem 2
Proof of Theorem 2.
Let $\hat{\theta}$ collect the parameter estimates and $\theta_0$ the population parameter values. The asymptotic distribution of $\sqrt{n}\left(\hat{\theta} - \theta_0\right)$ is
The parameters and the moment conditions are written in sets. To keep track of the needed calculations, partition the gradient matrix, the covariance matrix, the limiting random vectors, and C into terms associated with the sets. Use bold subscripts and superscripts to denote the different sets
Denote the partitioned terms of each matrix conformably. The limiting random variables will be partitioned in the same way.
The partitioned elements of C will be denoted accordingly and can be written
(Note that the transpose on the second term is achieved with the index being flipped.) The relevant term is the covariance matrix for the ridge estimate of the structural parameters. The detailed calculations of the terms are presented in the Supplemental Material for this paper.
The tuning parameter estimate is restricted to be non-negative, and its probability limit is on the boundary of the parameter space, i.e., zero. Following Self and Liang (1987) [18], the probability limit being on the boundary of the parameter space results in an asymptotic distribution that is characterized by a projection onto a cone. Because the probability limit is zero, the asymptotic distribution is obtained by projecting the limiting stochastic process onto the non-negative values of the tuning parameter. This projection is defined using the limiting covariance matrix to define the inner product. When a draw from the limiting distribution has a non-negative tuning parameter term, it contributes directly to the asymptotic distribution. When a draw has a negative term, the random vector is projected onto the cone, which maps the tuning parameter term to zero. The other parameters are also adjusted depending on their covariance and correlation with the tuning parameter term. This adjustment can contribute an asymptotic bias term.
The asymptotic distribution of the estimates can be characterized
The asymptotic distribution for the tuning parameter is a mixture with 1/2 probability mass at zero and a truncated normal distribution over the non-negative values.
The asymptotic distribution for the ridge estimator of the structural parameters, conditional on the tuning parameter draw, is the asymptotic distribution for
the corresponding component, where the conditioning draw comes from the truncated normal distribution over the negative values. The asymptotic bias can be evaluated in closed form using the expectation of a normal random variable truncated to the negative values.
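For reference, the univariate version of this standard truncated-normal expectation, consistent with the calculation used here, is

```latex
W \sim N\left(0, s^{2}\right)
  \;\Longrightarrow\;
  E\left[W \mid W < 0\right]
  = -\,s\,\frac{\phi(0)}{\Phi(0)}
  = -\,s\sqrt{\frac{2}{\pi}},
```

where $\phi$ and $\Phi$ denote the standard normal density and distribution functions.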
Under the null hypothesis that $\beta = \beta_0$, the statistic
will converge in distribution to chi-square with k degrees of freedom.
The asymptotic distributions for the structural parameters, the tuning parameter, and the test statistic require the terms
The details of the matrix multiplication are presented in the Supplemental Material for this paper. The terms are
and
The asymptotic distribution for the optimally selected tuning parameter when $\beta_p \neq \beta_0$ is a mixture with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal with zero mean. The asymptotic distribution for the structural parameters is normal with the mean and covariance given above. □
Theorem 3
Proof of Theorem 3.
The tuning parameter is estimated, but it is not identified. Consider only the final system of k equations used to estimate the parameters of interest, conditional on the estimated tuning parameter $\hat{\alpha}$:
This implies
Substitute in ,
Substitute in ,
Simplify to
This implies
which gives the root-n consistency of the estimate, and the asymptotic distribution becomes
where the conditioning value is a draw from the a-min distribution with parameter S.
The inverse of the ridge estimator’s variance is
For $\hat{\alpha} > 0$, this is larger than the inverse of the 2SLS estimator’s variance. Hence, the variance of the 2SLS estimator can never be smaller than the variance of the ridge estimator when $\beta_p = \beta_0$. □
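The comparison of the variances uses the standard fact that matrix inversion reverses the Loewner order on positive definite matrices:

```latex
A \succeq B \succ 0
  \quad\Longrightarrow\quad
  B^{-1} \succeq A^{-1} \succ 0 .
```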
Theorem 4
This is a special case of Theorem 1. The proof of Theorem 1 as presented in [1] applies to this set of parameters and moment conditions.
Theorem 5
This is a special case of Theorem 2. Assumption 6 implies that there is no need for instrumental variables. The basic simplification is that the relevant second-moment matrix in the IV model involves the projection onto the instruments, while, for the linear regression model, it reduces to the second moments of the regressors.
Theorem 6
This is a special case of the results in Theorem 3. The explanation for Theorem 5 also applies to this theorem.
References
- Sengupta, N.; Sowell, F. On the Asymptotic Distribution of Ridge Regression Estimators Using Training and Test Samples. Econometrics 2020, 8, 39.
- Obenchain, R. Classical F-Tests and Confidence Regions for Ridge Regression. Technometrics 1977, 19, 429.
- Van Wieringen, W.N. Lecture notes on ridge regression. arXiv 2021, arXiv:1509.09169.
- Melo, S.; Kibria, B.M.G. On Some Test Statistics for Testing the Regression Coefficients in Presence of Multicollinearity: A Simulation Study. Stats 2020, 3, 40–55.
- Halawa, A.; Bassiouni, M.E. Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 2000, 65, 341–356.
- Theobald, C.M. Generalizations of Mean Square Error Applied to Ridge Regression. J. R. Stat. Soc. Ser. B 1974, 36, 103–106.
- Schmidt, P. Econometrics; Statistics, Textbooks and Monographs; Dekker: New York, NY, USA, 1976.
- Smith, G.; Campbell, F. A Critique of Some Ridge Regression Methods. J. Am. Stat. Assoc. 1980, 75, 74–81.
- Montgomery, D.; Peck, E.; Vining, G. Introduction to Linear Regression Analysis; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2012.
- Gómez, R.S.; García, C.G.; Pérez, J.G. The Raise Regression: Justification, properties and application. arXiv 2021, arXiv:2104.14423.
- Kibria, B.M.G.; Banik, S. A Simulation Study on the Size and Power Properties of Some Ridge Regression Tests. Appl. Appl. Math. Int. J. (AAM) 2019, 14, 741–761.
- Zorzi, M. Empirical Bayesian learning in AR graphical models. Automatica 2019, 109, 108516.
- Zorzi, M. Autoregressive identification of Kronecker graphical models. Automatica 2020, 119, 109053.
- Alheety, M.I.; Ramanathan, T.V. Confidence Interval for Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2009, 38, 3489–3497.
- Rubio, H.; Firinguetti, L. The Distribution of Stochastic Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2002, 31, 1531–1547.
- Akdeniz, F.; Öztürk, F. The distribution of stochastic shrinkage biasing parameters of the Liu type estimator. Appl. Math. Comput. 2005, 163, 29–38.
- Moran, P.A.P. Maximum-likelihood estimation in non-standard conditions. Math. Proc. Camb. Philos. Soc. 1971, 70, 441–450.
- Self, S.; Liang, K. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 1987, 82, 605–610.
- Andrews, D.W.K. Generalized method of moments estimation when a parameter is on a boundary. J. Bus. Econ. Stat. 2002, 20, 530–544.
- Andrews, D.W.K. Estimation When a Parameter is on a Boundary. Econometrica 1999, 67, 1341–1383.
- Kinal, T.W. The Existence of Moments of k-Class Estimators. Econometrica 1980, 48, 241–249.
- Golub, G.H.; Heath, M.; Wahba, G. Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics 1979, 21, 215–223.
- Antoine, B.; Renault, E. Efficient GMM with nearly-weak instruments. Econom. J. 2009, 12, S135–S171.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).