
# On the Asymptotic Distribution of Ridge Regression Estimators Using Training and Test Samples

1 School of Public Policy, Indian Institute of Technology Delhi, Delhi 110016, India
2 Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Econometrics 2020, 8(4), 39; https://doi.org/10.3390/econometrics8040039
Received: 25 November 2019 / Revised: 10 September 2020 / Accepted: 11 September 2020 / Published: 1 October 2020

## Abstract

The asymptotic distribution of the linear instrumental variables (IV) estimator with empirically selected ridge regression penalty is characterized. The regularization tuning parameter is selected by splitting the observed data into training and test samples and becomes an estimated parameter that jointly converges with the parameters of interest. The asymptotic distribution is a nonstandard mixture distribution. Monte Carlo simulations show the asymptotic distribution captures the characteristics of the sampling distributions and when this ridge estimator performs better than two-stage least squares. An empirical application on returns to education data is presented.

## 1. Introduction

This paper concerns the estimation and inference on the structural parameters in the linear instrumental variables (IV) model estimated with ridge regression. This estimator differs from previous ridge regression estimators in three important areas. First, the regularization tuning parameter is selected using a randomly selected test sample from the observed data. Second, the empirically selected tuning parameter’s impact on the estimates of the parameters of interest is accounted for by deriving their asymptotic joint distribution, which is a mixture. Third, the traditional Generalized Method of Moments (GMM) framework is used to characterize the asymptotic distribution.
The ridge estimator belongs to a family of estimators which utilize regularization, see (Bickel et al. 2006) and (Hastie et al. 2009) for overviews. Regularization requires tuning parameters, and procedures to select them can be split into three broad areas. (1) Plugin-type. For a given criterion function, an optimal value is determined in terms of the model’s parameters. These model parameters are then estimated with the data set and plugged into the formula. Generalizations include adjustments to reduce bias and iterative procedures until a fixed point is achieved. (2) Test sample. The data set is randomly split into a training and a test sample. The tuning parameter and estimates from the training sample are used to evaluate some criterion function on the test sample to determine the optimal tuning parameter and model. Generalizations include k-fold cross-validation and generalized cross-validation. (3) Rate of convergence restriction. The tuning parameter must converge at an appropriate rate to guarantee consistency and a known asymptotic distribution for the estimates of the parameters of interest. A key feature is that the tuning parameter only converges to zero asymptotically and is restricted from zero in finite samples. Previous ridge estimators have relied on plugin-type and rate of convergence restrictions. We study the test sample approach.
This estimator builds on a large literature. The ridge regression estimator was proposed in (Hoerl and Kennard 1970) to obtain smaller MSE relative to the OLS estimates in the linear model when the covariates are all exogenous but have multicollinearity. It was shown that for a fixed tuning parameter, $α$, the bias and variance of the ridge estimator implied that there exists an $α$ value with lower MSE than for the OLS estimator. Assuming that $α$ is fixed, Hoerl et al. (1975) proposed selecting $α$ to minimize the MSE of the ridge estimator. The resulting formula became a plugin selection for $α$. Subsequent research has followed the same approach of selecting the tuning parameter to minimize MSE where $α$ is assumed fixed, see (Dorugade 2014) and the papers cited there. A shortcoming of this plugin-type approach is that the form of the MSE assumes the tuning parameter is fixed; however, it is then selected on the observed sample and hence is stochastic, see (Theobald 1974) and (Montgomery et al. 2012). Instead of focusing on reducing the MSE, we focus on the large sample properties of the estimates of the parameters of interest. In this literature the work closest to ours is (Firinguetti and Bobadilla 2011) where the sampling distribution is considered for a ridge estimator. However, this estimator is built on minimizing the MSE where the tuning parameter is assumed fixed, see (Lawless and Wang 1976). Our ridge estimator is derived knowing that the tuning parameter is stochastic. This leads to the joint asymptotic distribution for the estimates of the parameters of interest and the tuning parameter.
The supervised learning (machine learning) literature focuses on the ability to generalize to new data sets by selecting the tuning parameters to minimize the prediction error for a test (or holdout or validation) sample. Starting with (Larsen et al. 1996), a test sample is used to select optimal tuning parameters for a neural network model. The problem reduces to finding a local minimum of the criterion function evaluated on the test sample. Extensions of the test sample approach include backward propagation, see (Bengio 2000; Habibnia and Maasoumi 2019), but as the number of parameters increases the memory requirements become too large. This has led to the use of stochastic gradient descent, see (Maclaurin et al. 2015). Much research in this area has focused on efficient ways to optimally select the hyperparameters to minimize prediction errors. We select the tuning parameter to address its impact on the estimates of the model’s coefficients and do not focus on the model’s predictive power.
A number of papers extend the linear IV model with tuning parameters. In structural econometrics, (Carrasco 2012) allows the number of instruments to grow with the sample size and (Zhu 2018) considers models where the number of covariates and instruments is larger than the sample size. In genetics the linear IV model is widely used to model gene regulatory networks, see (Chen et al. 2018; Lin et al. 2015). In this setting the number of covariates and instruments can be larger than the number of observations and the tuning parameter is restricted from being zero in finite samples. In contrast to these models, we fix the number of covariates and instruments to determine the asymptotic distribution and permit the tuning parameter to take the value zero.
Within the structural econometrics literature, ridge type regularization concepts are not new. Notable contributions are (Carrasco and Tchuente 2016; Carrasco and Florens 2000; Carrasco et al. 2007) which allow for a continuum of moment conditions. The authors use ridge regularization to find the inverse of the optimal weighting operator (instead of optimal weighting matrix in traditional GMM). In these papers and in (Carrasco 2012) the rate of convergence restriction is used to select the tuning parameter.
Several types of identification and asymptotic distributions can occur with linear IV models, e.g., strong instruments, nearly-strong instruments, nearly-weak instruments and weak instruments, see (Antoine and Renault 2009) for a summary. For this taxonomy, this paper and its estimator are in the strong instruments setting. The models considered in this paper are closest to the situation considered in (Sanderson and Windmeijer 2016). However, unlike (Sanderson and Windmeijer 2016) we provide point estimates instead of testing for weak instruments and restrict attention to fixed parameters that do not drift to zero. The models we study are explicitly strongly identified; however, in a finite sample the precision can be low.
This ridge estimator extends the literature in five important dimensions. First, this estimator allows a meaningful prior. When the prior is ignored, or equivalently set to zero, the model penalizes variability about the origin. However, in structural economic models a more appropriate penalty will be variability about some economically meaningful prior. Second, the regularization tuning parameter is selected empirically using the observed data. This removes the internally inconsistent argument about the minimum MSE when the tuning parameter is assumed fixed. Third, the tuning parameter is allowed to take the value zero in finite samples. Fourth, empirically selecting the tuning parameter impacts the asymptotic distribution of the parameter estimates. As stressed in (Leeb and Pötscher 2005), the final asymptotic distribution will depend on empirically selected tuning parameters. We address this directly by characterizing the joint asymptotic distribution that includes both the parameters of interest and the tuning parameter. Fifth, the GMM framework is used to characterize the asymptotic distribution.¹ The GMM framework is used because it is better suited to the social science setting where this estimator will be most useful. Rarely does a social science model imply the actual distribution for an error. Unconditional expectations of zero are more typical in social science theories and are the foundation for the GMM estimator. Adding a regularization penalty term and splitting the observed data into training and test samples takes the estimator out of the traditional GMM framework. We present new moment conditions in the traditional GMM framework which include the first order conditions for the ridge estimator.
Section 2 presents the linear IV framework, describes the precision problem and the ridge estimator. Section 3 characterizes the asymptotic distribution of the ridge estimator in the traditional GMM framework. Small sample properties are analyzed via simulations in Section 4. The procedure is applied to the returns to education data set from Angrist and Krueger (1991) in Section 5. Section 6 summarizes the results and presents directions for future research.

## 2. Ridge Estimator for Linear Instrumental Variables Model

This section presents the ridge regression estimator where the regularization tuning parameter is empirically determined by splitting the data into training and test samples. This estimator is then fit into the traditional GMM framework to characterize its asymptotic distribution. Consider the model
$Y = X\beta_0 + \varepsilon$
$X = Z\Gamma_0 + u$
where Y is $n \times 1$, X is $n \times k$, $Z = [z_1 \; z_2 \; \cdots \; z_n]'$ is $n \times m$, $m \ge k$, $z_i \sim iid$, $R_z = E[z_i z_i']$ has full rank, and conditional on Z, $\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid \left( 0, \begin{bmatrix} \sigma_\varepsilon^2 & \Sigma_{\varepsilon u} \\ \Sigma_{u \varepsilon} & \Sigma_U \end{bmatrix} \right).$ This model allows for both endogenous X’s that are correlated with $\varepsilon$ and exogenous X’s that are uncorrelated with $\varepsilon$. Endogenous regressors imply OLS will be inconsistent. The Z instruments allow consistent estimates with the IV estimator that minimizes the residual sum of squares projected onto the instruments and has the closed form
$\hat\beta_{IV} = \arg\min_{\beta} \frac{1}{2n} (Y - X\beta)' Z (Z'Z)^{-1} Z' (Y - X\beta) = (X' P_Z X)^{-1} X' P_Z Y$
where $P_Z$ is the projection matrix for Z. The well known asymptotic distribution is
$\sqrt{n} \left( \hat\beta_{IV} - \beta_0 \right) \overset{a}{\sim} N\left( 0, \sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1} \right)$
and the covariance can be consistently estimated with
$\frac{\hat\varepsilon' \hat\varepsilon}{n} \left( \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'X}{n} \right)^{-1} = \frac{\hat\varepsilon' \hat\varepsilon}{n} \left( \frac{X' P_Z X}{n} \right)^{-1}$
where $\hat\varepsilon = Y - X \hat\beta_{IV}$. Let $S_0 = E[z_i x_i'] = R_z \Gamma_0$.
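The closed form and the covariance estimate above translate almost line for line into code. The following is a minimal sketch, not the authors' implementation: the function name `tsls` and the simulated data (including the true coefficient vector and the way endogeneity is induced) are assumptions made purely for illustration.

```python
import numpy as np

def tsls(Y, X, Z):
    """Closed-form IV/TSLS estimate and its estimated asymptotic covariance."""
    n = len(Y)
    # P_Z X computed as Z (Z'Z)^{-1} Z'X, without forming the n x n projection
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(X.T @ PZX, PZX.T @ Y)
    resid = Y - X @ beta
    # (e'e / n) * (X'P_Z X / n)^{-1}
    avar = (resid @ resid / n) * np.linalg.inv(X.T @ PZX / n)
    return beta, avar

# Hypothetical data for illustration only (not the paper's design)
rng = np.random.default_rng(0)
n = 5000
Z = rng.standard_normal((n, 3))
Gamma = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
shock = rng.standard_normal(n)                  # common shock makes X endogenous
X = Z @ Gamma + rng.standard_normal((n, 2)) + 0.5 * shock[:, None]
beta0 = np.array([1.0, -1.0])
Y = X @ beta0 + rng.standard_normal(n) + 0.5 * shock

beta_iv, avar = tsls(Y, X, Z)
se = np.sqrt(np.diag(avar) / n)                 # standard errors of beta_iv
```

Dividing `avar` by `n` converts the estimated asymptotic covariance of the root-n scaled estimator into a finite-sample covariance for `beta_iv`.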
For a finite sample let² $\frac{X' P_Z X}{n}$ have the spectral decomposition $C \Lambda C'$, where $\Lambda$ is a positive definite diagonal $k \times k$ matrix, and C is orthonormal, $C'C = I_k$. A precision problem occurs when some of the eigenvectors explain very little variation, as represented by the magnitude of the corresponding eigenvalues. This occurs when the objective function is relatively flat along these dimensions and the resulting covariance estimates are large because, as Equation (4) shows, the variance of $\hat\beta_{IV}$ is proportional to $\left( \frac{X' P_Z X}{n} \right)^{-1} = \left( C \Lambda C' \right)^{-1} = C \Lambda^{-1} C'.$ The flat objective function, or equivalently large estimated variances, leads to a relatively large MSE. The ridge estimator addresses this problem by shrinking the estimated parameter toward a prior. The IV estimate still has low bias (it is consistent) and has the asymptotically minimum variance. However, accepting slightly higher bias can dramatically reduce the variance and thus provide a point estimate with lower MSE.
The ridge objective function augments the usual IV objective function (3) with a quadratic penalty centered at a prior value, $\beta^p$, weighted by a regularization tuning parameter $\alpha$
$Q_n(\beta) = \frac{1}{2n} (Y - X\beta)' P_Z (Y - X\beta) + \frac{1}{2} \alpha (\beta - \beta^p)' (\beta - \beta^p).$
The objective function’s second derivative is $\frac{X' P_Z X}{n} + \alpha I_k = C (\Lambda + \alpha I_k) C'.$ The regularization parameter injects stability since $\left( \frac{X' P_Z X}{n} + \alpha I_k \right)^{-1} = C \left( \Lambda + \alpha I_k \right)^{-1} C'$ has eigenvalues $1 / (\lambda_i + \alpha)$ for $i = 1, \ldots, k$ which are decreasing in $\alpha$. This results in smaller variance but higher bias.
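Denote the ridge solution given $\alpha$ as
$\begin{aligned} \hat\beta_{IV}(\alpha) &= \left( \frac{X' P_Z X}{n} + \alpha I_k \right)^{-1} \left( \frac{X' P_Z Y}{n} + \alpha \beta^p \right) \\ &= C \left( \Lambda + \alpha I_k \right)^{-1} C' \frac{X' P_Z Y}{n} + C \left( \Lambda + \alpha I_k \right)^{-1} C' \alpha \beta^p \\ &= C \left( \Lambda + \alpha I_k \right)^{-1} C' \cdot C \Lambda C' \cdot C \Lambda^{-1} C' \frac{X' P_Z Y}{n} + C \left( \frac{\Lambda}{\alpha} + I_k \right)^{-1} C' \beta^p \\ &= C \left( I_k + \alpha \Lambda^{-1} \right)^{-1} C' \hat\beta_{IV} + C \left( \frac{\Lambda}{\alpha} + I_k \right)^{-1} C' \beta^p. \end{aligned}$
Equation (6) shows how the tuning parameter $\alpha$ creates a smooth curve in the parameter space from the low bias-high variance IV estimate, $\hat\beta_{IV}$ (when $\alpha = 0$), to the high bias-no variance prior, $\beta^p$ (as $\alpha \to \infty$).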
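The decomposition in Equation (6) can be checked numerically. The sketch below is illustrative only: the function names and the ill-conditioned matrix `A` (standing in for $X'P_ZX/n$) and vector `b` (standing in for $X'P_ZY/n$) are hypothetical values chosen so that one eigenvalue is very small.

```python
import numpy as np

def ridge_iv(A, b, alpha, beta_p):
    """Solve (A + alpha I) beta = b + alpha beta_p, with A = X'P_Z X/n, b = X'P_Z Y/n."""
    return np.linalg.solve(A + alpha * np.eye(len(b)), b + alpha * beta_p)

def ridge_iv_spectral(A, b, alpha, beta_p):
    """Same estimate written as a matrix-weighted average of beta_IV and the prior."""
    lam, C = np.linalg.eigh(A)                       # A = C diag(lam) C'
    beta_iv = np.linalg.solve(A, b)
    w_iv = C @ np.diag(lam / (lam + alpha)) @ C.T    # C (I + alpha Lam^{-1})^{-1} C'
    w_p = C @ np.diag(alpha / (lam + alpha)) @ C.T   # C (Lam/alpha + I)^{-1} C'
    return w_iv @ beta_iv + w_p @ beta_p

# Hypothetical ill-conditioned A: eigenvalues of roughly 2.4 and 0.004
A = np.array([[2.0, 0.9], [0.9, 0.41]])
b = np.array([1.0, 0.5])
beta_p = np.array([0.5, 0.5])

# Path from the IV estimate (alpha = 0) towards the prior (alpha very large)
path = np.array([ridge_iv(A, b, a, beta_p) for a in (0.0, 0.01, 1.0, 1e8)])
```

The two weight matrices sum to the identity, so each point on the path is a matrix-weighted average of the IV estimate and the prior, with the weight shifting to the prior fastest along the eigenvector with the smallest eigenvalue.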
Different values of $\alpha$ result in different estimated values for $\beta_0$. The optimal value of $\alpha$ is determined empirically by splitting the data into training and test samples. The training sample is a randomly drawn sample of $[\tau n]$ observations, denoted $Y_{\tau n}$, $X_{\tau n}$, and $Z_{\tau n}$, which are used to calculate a path between the IV estimate and the prior as in Equation (6). The estimate using the training sample, conditional on $\alpha$, is
$\hat\beta_{tr}(\alpha) \equiv \arg\min_{\beta} \frac{1}{2[\tau n]} \left( Y_{\tau n} - X_{\tau n} \beta \right)' P_{Z_{\tau n}} \left( Y_{\tau n} - X_{\tau n} \beta \right) + \frac{\alpha}{2} (\beta - \beta^p)' (\beta - \beta^p)$
where $P_{Z_{\tau n}}$ is the projection matrix onto $Z_{\tau n}$ and $[\cdot]$ is the greatest integer function. The first order conditions for an internal solution are
$-\frac{1}{[\tau n]} X_{\tau n}' P_{Z_{\tau n}} \left( Y_{\tau n} - X_{\tau n} \hat\beta_{tr} \right) + \alpha (\hat\beta_{tr} - \beta^p) = 0$
or alternatively
$-\frac{1}{[\tau n]} \sum_{i=1}^{[\tau n]} \left\{ \frac{X_{\tau n}' Z_{\tau n}}{[\tau n]} \left( \frac{Z_{\tau n}' Z_{\tau n}}{[\tau n]} \right)^{-1} \right\} z_i \left( y_i - x_i' \hat\beta_{tr} \right) + \alpha (\hat\beta_{tr} - \beta^p) = 0.$
The closed form solution is
$\hat\beta_{tr}(\alpha) = \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1} \left( \frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha \beta^p \right).$
As $\alpha$ goes from 0 towards infinity, this gives a path from the training-sample IV estimate, $\hat\beta_{tr}(0)$, to the prior, $\beta^p$ (the limit as $\alpha \to \infty$). Following this path, the optimal $\alpha$ is selected to minimize the IV least squares objective function (3) over the remaining $(n - [\tau n])$ observations, the test sample, denoted $Y_{n(1-\tau)}$, $X_{n(1-\tau)}$ and $Z_{n(1-\tau)}$. The optimal value for the tuning parameter is defined by $\hat\alpha = \arg\min_{\alpha \in [0, \infty)} Q_{n(1-\tau)}(\alpha)$ where
$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [\tau n])} \left( Y_{n(1-\tau)} - X_{n(1-\tau)} \hat\beta_{tr}(\alpha) \right)' P_{Z_{n(1-\tau)}} \left( Y_{n(1-\tau)} - X_{n(1-\tau)} \hat\beta_{tr}(\alpha) \right)$
where $P_{Z_{n(1-\tau)}}$ is the projection matrix onto $Z_{n(1-\tau)}.$ The first order condition for an internal solution is
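$\frac{1}{(n - [\tau n])} (\beta^p - \hat\beta_{tr}(\hat\alpha))' \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \hat\alpha I_k \right)^{-1} X_{n(1-\tau)}' P_{Z_{n(1-\tau)}} \left( Y_{n(1-\tau)} - X_{n(1-\tau)} \hat\beta_{tr}(\hat\alpha) \right) = 0$
or alternatively
$\frac{1}{n - [\tau n]} \sum_{i=[\tau n]+1}^{n} \left\{ (\beta^p - \hat\beta_{tr}(\hat\alpha))' \left( \frac{X_{\tau n}' Z_{\tau n}}{[\tau n]} \left( \frac{Z_{\tau n}' Z_{\tau n}}{[\tau n]} \right)^{-1} \frac{Z_{\tau n}' X_{\tau n}}{[\tau n]} + \hat\alpha I_k \right)^{-1} \frac{X_{n(1-\tau)}' Z_{n(1-\tau)}}{n - [\tau n]} \left( \frac{Z_{n(1-\tau)}' Z_{n(1-\tau)}}{n - [\tau n]} \right)^{-1} \right\} z_i \left( y_i - x_i' \hat\beta_{tr}(\hat\alpha) \right) = 0.$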
The ridge regression estimate $\hat\beta_{\hat\alpha} \equiv \hat\beta_{IV}(\hat\alpha)$ is then characterized by
$-\frac{1}{n} X' P_Z \left( Y - X \hat\beta_{\hat\alpha} \right) + \hat\alpha (\hat\beta_{\hat\alpha} - \beta^p) = 0$
or alternatively
$-\frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \right\} z_i \left( y_i - x_i' \hat\beta_{\hat\alpha} \right) + \hat\alpha (\hat\beta_{\hat\alpha} - \beta^p) = 0.$
The first order conditions that characterize the ridge estimator, Equations (8), (11), and (12), are $2k + 1$ equations in the $2k + 1$ parameters and have the structure of sample averages being set to zero. However, the functions being averaged do not fit into the traditional GMM framework. In Equations (8), (11), and (12) the terms in the curly brackets depend on the entire sample and not just the data for index i and the parameters. The terms in the curly brackets will converge at $O_p(n^{-1/2})$ and must be considered jointly with the asymptotic distributions of $(\hat\beta_{\hat\alpha}', \hat\alpha)'$.
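The three steps described above (a training-sample path, selection of the tuning parameter on the test sample, and re-estimation on the full sample) can be sketched as follows. This is a hedged illustration rather than the authors' code: all function names, the simple grid search that stands in for solving the first order condition, and the simulated data are assumptions.

```python
import numpy as np

def iv_moments(Y, X, Z):
    """Return A = X'P_Z X / n and b = X'P_Z Y / n."""
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return X.T @ PZX / len(Y), PZX.T @ Y / len(Y)

def ridge_iv(A, b, alpha, beta_p):
    """Closed-form ridge IV estimate for a given tuning parameter."""
    return np.linalg.solve(A + alpha * np.eye(len(b)), b + alpha * beta_p)

def holdout_objective(Y, X, Z, beta):
    """IV least squares objective evaluated on the test sample."""
    Ze = Z.T @ (Y - X @ beta)
    return Ze @ np.linalg.solve(Z.T @ Z, Ze) / (2 * len(Y))

def select_ridge(Y, X, Z, beta_p, tau=0.7, grid=None, rng=None):
    """Pick alpha on the test sample, then re-estimate on the full sample."""
    rng = np.random.default_rng() if rng is None else rng
    if grid is None:
        # alpha = 0 is included so that "no regularization" can be selected
        grid = np.concatenate(([0.0], np.logspace(-5, 6, 45)))
    n = len(Y)
    idx = rng.permutation(n)
    n_tr = int(tau * n)                           # [tau n]
    tr, te = idx[:n_tr], idx[n_tr:]
    A_tr, b_tr = iv_moments(Y[tr], X[tr], Z[tr])
    q = [holdout_objective(Y[te], X[te], Z[te], ridge_iv(A_tr, b_tr, a, beta_p))
         for a in grid]
    alpha_hat = float(grid[int(np.argmin(q))])
    A, b = iv_moments(Y, X, Z)
    return alpha_hat, ridge_iv(A, b, alpha_hat, beta_p)

# Hypothetical data with beta_0 = (0, 0)' and a weak second instrument signal
rng = np.random.default_rng(0)
n = 200
Z = rng.standard_normal((n, 3))
shock = rng.standard_normal(n)                    # common shock makes X endogenous
Gamma = np.array([[1.0, 0.0], [0.0, 0.25], [1.0, 0.0]])
X = Z @ Gamma + rng.standard_normal((n, 2)) + 0.5 * shock[:, None]
Y = rng.standard_normal(n) + 0.5 * shock
beta_p = np.array([0.5, 0.5])
alpha_hat, beta_ridge = select_ridge(Y, X, Z, beta_p, rng=rng)
```

Because the split is random, `alpha_hat` changes with the seed; fixing the generator makes a given run reproducible.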
The asymptotic distribution of the ridge estimator can be determined with the GMM framework using the parameterization
$\theta = \begin{bmatrix} \mathrm{vech}(R_\tau)' & \mathrm{vech}(R_{(1-\tau)})' & \mathrm{vec}(S_\tau)' & \mathrm{vec}(S_{(1-\tau)})' & \beta_{tr}' & \alpha & \beta' \end{bmatrix}'$
where $\mathrm{vec}(\cdot)$ stacks the elements from a matrix into a column vector and $\mathrm{vech}(\cdot)$ stacks the unique elements from a symmetric matrix into a column vector. The population parameter values are
$\theta_0 = \begin{bmatrix} \mathrm{vech}(R_z)' & \mathrm{vech}(R_z)' & \mathrm{vec}(R_z \Gamma_0)' & \mathrm{vec}(R_z \Gamma_0)' & \beta_0' & 0 & \beta_0' \end{bmatrix}'.$
The ridge estimator is part of the parameter estimates defined by the just identified system of equations $H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} h_i(\theta) = 0$ where
$h_i(\theta) = \begin{bmatrix} \mathbf{1}_\tau(i) \, \mathrm{vech}(R_\tau - z_i z_i') \\ (1 - \mathbf{1}_\tau(i)) \, \mathrm{vech}(R_{(1-\tau)} - z_i z_i') \\ \mathbf{1}_\tau(i) \, \mathrm{vec}(S_\tau - z_i x_i') \\ (1 - \mathbf{1}_\tau(i)) \, \mathrm{vec}(S_{(1-\tau)} - z_i x_i') \\ \mathbf{1}_\tau(i) \left( -S_\tau' R_\tau^{-1} z_i (y_i - x_i' \beta_{tr}) + \alpha (\beta_{tr} - \beta^p) \right) \\ (1 - \mathbf{1}_\tau(i)) \, (y_i - x_i' \beta_{tr}) \, z_i' R_{(1-\tau)}^{-1} S_{(1-\tau)} \left( S_\tau' R_\tau^{-1} S_\tau + \alpha I_k \right)^{-1} (\beta^p - \beta_{tr}) \\ -\left( \tau S_\tau + (1-\tau) S_{(1-\tau)} \right)' \left( \tau R_\tau + (1-\tau) R_{(1-\tau)} \right)^{-1} z_i (y_i - x_i' \beta) + \alpha (\beta - \beta^p) \end{bmatrix}$
and the training and test samples are determined with the indicator function
$\mathbf{1}_\tau(i) = \begin{cases} 1, & i \le [\tau n] \\ 0, & [\tau n] \lt i. \end{cases}$
Using the structure of Equation (13), the system $H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} h_i(\theta) = 0$ can be seen as seven sets of equations. The first four sets are each self-contained systems of equal numbers of equations and parameters. The fifth set has k equations and introduces k new parameters, $\beta_{tr}$. The sixth set is a single equation with the new parameter $\alpha$. The seventh set has k equations and introduces the final k parameters, $\beta$. Identification occurs because the expectation of the gradient is invertible. This is presented in Appendix A and Appendix B.
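The bookkeeping for the seven sets can be tallied with a few lines of code. This is only an illustrative sketch; the function name `theta_dimension` and the block labels are hypothetical, and the values $m = 3$, $k = 2$ are used purely as an example.

```python
def theta_dimension(m, k):
    """Count the parameters in each of the seven blocks of theta."""
    n_vech = m * (m + 1) // 2        # unique elements of a symmetric m x m matrix
    n_vec = m * k                    # all elements of an m x k matrix
    blocks = {
        "vech(R_tau)": n_vech,
        "vech(R_(1-tau))": n_vech,
        "vec(S_tau)": n_vec,
        "vec(S_(1-tau))": n_vec,
        "beta_tr": k,
        "alpha": 1,
        "beta": k,
    }
    return blocks, sum(blocks.values())

blocks, total = theta_dimension(m=3, k=2)
```

With three instruments and two regressors the system has 6 + 6 + 6 + 6 + 2 + 1 + 2 = 29 equations and parameters, which agrees with $m(m+1) + 2km + 2k + 1$.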

## 3. Asymptotic Behavior

Three assumptions are sufficient to obtain the asymptotic distribution for the ridge estimator.
Assumption 1.
$z_i$ is iid with finite fourth moments and $E[z_i z_i'] = R_z$ has full rank.
Assumption 2.
Conditional on Z, $\begin{bmatrix} \varepsilon_i & u_i' \end{bmatrix}'$ are iid vectors with zero mean and a full rank covariance matrix with possibly nonzero off-diagonal elements.
Assumptions 1 and 2 imply $E[h_i(\theta_0)] = 0$ and $\sqrt{n} H_n(\theta_0)$ satisfies the CLT.
Assumption 3.
The parameter space Θ is defined by: $R_z$ is restricted to a symmetric positive definite matrix with eigenvalues $1/B_1 \le \tilde\lambda_1 \le \tilde\lambda_2 \le \ldots \le \tilde\lambda_m \le B_1,$ $|\beta_j| \le B_2$ for $j = 1, 2, \ldots, k$, $\Gamma_0 = [\gamma_{\ell, j}]$ is of full rank with $|\gamma_{\ell, j}| \le B_3$ for $\ell = 1, \ldots, m$, $j = 1, 2, \ldots, k$ and $\alpha \in [0, B_4]$ where $B_1$, $B_2$, $B_3$ and $B_4$ are positive and finite.
First consider the tuning parameter. Even though it is empirically selected using the training and testing samples, its limiting value and rate of convergence are familiar.
Lemma 1.
Assumptions 1, 2 and 3 imply
(1) $\hat\alpha \to 0$ and
(2) $\sqrt{n} \, \hat\alpha = O_p(1).$
Proofs are given in Appendix A.
Lemma 1 implies that the population parameter value for the tuning parameter is zero, $\alpha_0 = 0$, which is on the boundary of the parameter space. This results in a nonstandard asymptotic distribution which can be characterized by appealing to Theorem 1 in (Andrews 2002). The approach in (Andrews 2002) requires the root-n convergence of the parameters. Lemma 1, traditional TSLS and method of moments establish this for all the parameters in $\theta$. Equation (13) puts the ridge estimator in the form of the first part of (14) from (Andrews 2002). Because the system is just identified, the weighting matrix does not affect the estimator and is set to the identity matrix. The scaled GMM objective function can be expanded into a quadratic approximation about the centered and scaled population parameter values
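$\begin{aligned} n H_n(\theta)' H_n(\theta) &= n H_n(\theta_0)' H_n(\theta_0) + n H_n(\theta_0)' \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) + \frac{n}{2} (\theta - \theta_0)' \frac{\partial H_n(\theta_0)'}{\partial \theta} \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) + o_p(1) \\ &= \frac{n}{2} H_n(\theta_0)' H_n(\theta_0) + \frac{n}{2} \left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) \right)' \left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) \right) + o_p(1) \\ &= \frac{n}{2} H_n(\theta_0)' H_n(\theta_0) + \frac{1}{2} \left( -\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right]^{-1} \sqrt{n} H_n(\theta_0) - \sqrt{n} (\theta - \theta_0) \right)' \frac{\partial H_n(\theta_0)'}{\partial \theta} \frac{\partial H_n(\theta_0)}{\partial \theta'} \\ &\quad \times \left( -\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right]^{-1} \sqrt{n} H_n(\theta_0) - \sqrt{n} (\theta - \theta_0) \right) + o_p(1). \end{aligned}$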
The first term does not depend on $\theta$ and the last term converges to zero in probability. This suggests that selecting $\hat\theta$ to minimize $H_n(\theta)' H_n(\theta)$ will result in the asymptotic distribution of $\sqrt{n} (\hat\theta - \theta_0)$ being the same as the distribution of the $\lambda \in \Lambda \equiv \left\{ \lambda \in \mathbb{R}^{m(m+1) + 2km + 2k + 1} : \lambda_{m(m+1) + 2km + k + 1} \ge 0 \right\}$ at which $(Z - \lambda)' M_0' M_0 (Z - \lambda)$ takes its minimum, where the random variable $Z$ is defined as
$Z = \lim_{n \to \infty} E\left[ -\frac{\partial H_n(\theta_0)}{\partial \theta'} \right]^{-1} \sqrt{n} H_n(\theta_0)$
and
$M_0 = E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right].$
This indeed is the result by Theorem 1 of (Andrews 2002). The needed assumptions are given in (Andrews 2002). The estimator is defined as
$\hat\theta = \arg\min_{\theta \in \Theta} H_n(\theta)' H_n(\theta).$
Theorem 1.
Assumptions 1–3 imply the asymptotic distribution of $\sqrt{n} (\hat\theta - \theta_0)$ is equivalent to the distribution of
$\hat\lambda = \arg\min_{\lambda \in \Lambda} (Z - \lambda)' M_0' M_0 (Z - \lambda).$
The objective function can be minimized at a value of the tuning parameter in $( 0 , ∞ )$ or possibly at $α = 0 .$ The asymptotic distribution of the tuning parameter will be composed of two parts, a discrete mass at $α = 0$ and a continuous function over $( 0 , ∞ )$. The asymptotic distribution over the other parameters can be thought of as being composed of two parts, the distribution conditional on $α = 0$ and the distribution over $α > 0 .$
In terms of the framework presented in (Andrews 2002), the random sample is used to create a random variable. This is then projected onto the parameter space, which is a cone. The projection onto the cone results in the discrete mass at $α = 0$ and the continuous mass over $( 0 , ∞ )$. As noted in (Andrews 2002), this type of a characterization of the asymptotic distribution can be easily programmed and simulated.
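To illustrate how such a characterization might be simulated, the sketch below draws a Gaussian vector and solves the constrained quadratic problem in closed form when a single coordinate is restricted to be nonnegative. The matrices `M0` and `V`, the dimension, and the index `j` playing the role of the tuning parameter are all hypothetical values chosen for illustration; they are not the quantities implied by the model.

```python
import numpy as np

def project_on_halfspace(z, A, j):
    """Minimize (z - lam)' A (z - lam) subject to lam[j] >= 0, A positive definite."""
    if z[j] >= 0:
        return z.copy()                        # constraint is slack
    lam = z.copy()
    lam[j] = 0.0                               # constraint binds
    rest = np.arange(len(z)) != j
    # the free coordinates shift to absorb the binding constraint
    lam[rest] = z[rest] + np.linalg.solve(A[np.ix_(rest, rest)], A[rest, j]) * z[j]
    return lam

rng = np.random.default_rng(0)
j = 1                                          # hypothetical position of alpha
M0 = np.array([[1.0, 0.3, 0.0], [0.0, 1.0, 0.5], [0.2, 0.0, 1.0]])
A = M0.T @ M0
V = np.array([[1.0, 0.4, 0.2], [0.4, 1.0, 0.3], [0.2, 0.3, 1.0]])  # assumed covariance
draws = rng.multivariate_normal(np.zeros(3), V, size=20000)
lam_hat = np.array([project_on_halfspace(z, A, j) for z in draws])
mass_at_zero = np.mean(lam_hat[:, j] == 0.0)   # discrete mass on the boundary
```

In this toy setting the restricted coordinate sits exactly at zero whenever the unrestricted draw is negative, which happens about half the time for a mean-zero Gaussian, while the other coordinates follow a mixture of the unconstrained and the shifted distributions.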

## 4. Small Sample Properties

To investigate the small sample performance, linear IV models are simulated and estimated using TSLS and the ridge estimator. The model is given in Equations (1) to (4) with $k = 2$ and $m = 3$. To standardize the model, set $z_i \sim iid \; N(0, I_3)$ and $\beta_0 = (0, 0)'$. Endogeneity is created with
$\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid \; N\left( 0, \begin{bmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0 \\ 0.7 & 0 & 1 \end{bmatrix} \right).$
The strength of the instrument signal is controlled by the parameter³ $\delta$ in
$\Gamma_0 = \begin{bmatrix} 1 & 0 \\ 0 & \delta \\ 1 & 0 \end{bmatrix}.$
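A draw from this design could be generated along the following lines. The function name `simulate_iv` is hypothetical, and the layout of $\Gamma_0$ as a $3 \times 2$ matrix read row by row is an assumption about how the six entries are arranged.

```python
import numpy as np

def simulate_iv(n, delta, rng):
    """Draw one sample from the simulation design with beta_0 = (0, 0)'."""
    Sigma = np.array([[1.0, 0.7, 0.7],
                      [0.7, 1.0, 0.0],
                      [0.7, 0.0, 1.0]])        # covariance of (eps_i, u_i')'
    Gamma0 = np.array([[1.0, 0.0],
                       [0.0, delta],
                       [1.0, 0.0]])            # assumed 3 x 2 layout
    Z = rng.standard_normal((n, 3))            # z_i ~ iid N(0, I_3)
    errors = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    eps, u = errors[:, 0], errors[:, 1:]
    X = Z @ Gamma0 + u
    Y = X @ np.zeros(2) + eps                  # beta_0 = (0, 0)'
    return Y, X, Z

Y, X, Z = simulate_iv(n=250, delta=0.25, rng=np.random.default_rng(0))
```

Under this reading, $\delta$ scales the only instrument that moves the second regressor, so small values of $\delta$ make the second coefficient the imprecisely estimated one.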
To judge the behavior of the estimator, three different dimensions of the model are adjusted.
1. Sample size. For smaller sample sizes, the ridge estimator should have better properties whereas for larger sample sizes, TSLS should perform better. We consider sample sizes of $n = 25$, 50, 250 and 500.
2. Precision. Signal strength of the instruments is one way to vary precision. The instrument signal strength decreases with the value of $\delta$ above, conditional on holding the other model parameters fixed. For lower precision settings or smaller signal strengths the ridge estimator should perform better. We consider values of $\delta = 0.1$, 0.25, 0.5 and 1. Note that while $\delta = 1$ leads to a high precision setting for all sample sizes considered, $\delta = 0.1$ leads to a low precision setting in smaller samples and a high precision setting in larger samples.
3. Prior value relative to $\beta_0$. For the prior closer to the population parameter values the ridge estimator should perform relatively better. We consider values of $\beta^p$ which were (a) one standard deviation⁴ from the true value, $\beta^p = (1/2, 1/2)'$, (b) two standard deviations from the true value, $\beta^p = (2, 2)'$, and (c) three standard deviations from the true value,⁵ $\beta^p = (3/2, 3/2)'$.
We simulate a total of 48 model specifications corresponding to 4 sample sizes n, 4 values of the precision parameter $\delta$ and 3 values of the prior $\beta^p$. Each specification is simulated 10,000 times, and both the TSLS and ridge estimates are computed. We compare the estimates of $\beta_0$ on bias, variance and MSE. For the ridge estimator we use $\tau = 0.7$ to split the sample between training and test samples.⁶
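The comparison criteria can be computed from the replications with a small helper. This is a generic sketch with a hypothetical function name and made-up numbers; it simply reports bias, standard deviation and MSE per coefficient.

```python
import numpy as np

def summarize(estimates, beta0):
    """Bias, standard deviation and MSE per coefficient from an (R x k) array."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean(axis=0) - beta0
    sd = estimates.std(axis=0)                 # ddof=0, so that MSE = bias^2 + sd^2
    mse = ((estimates - beta0) ** 2).mean(axis=0)
    return {"bias": bias, "sd": sd, "mse": mse, "total_mse": mse.sum()}

# Made-up estimates from three replications, for illustration only
example = summarize([[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]], beta0=np.zeros(2))
```

Using the population form of the standard deviation keeps the decomposition of MSE into squared bias plus variance exact in every simulated cell.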
The regularization parameter $\alpha$ is selected in two steps: first, we search in the log-space going from $10^{-5}$ to $10^{6}$; second, we perform a grid search⁷ in a linear space around the value selected in the first step. A final selected value of $\hat\alpha = 0$ in the second step corresponds to a “no regularization” scenario which implies the ridge estimator ignores the prior in favor of the data and the value $\hat\alpha = 10^{7}$ corresponds to an “infinite regularization” scenario which implies the ridge estimator ignores the data in favor of the prior.
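One way such a two-step search might be organized is sketched below. The grid sizes, the way the linear grid is bracketed by the neighbouring log-grid points, and the handling of the two edges are assumptions made for illustration; the exact grids used in the simulations are not reproduced here.

```python
import numpy as np

def two_step_search(objective, n_log=45, n_lin=41, alpha_max=1e7):
    """Coarse search on a log grid, then a linear grid around the coarse minimizer."""
    coarse = np.logspace(-5, 6, n_log)
    i = int(np.argmin([objective(a) for a in coarse]))
    lo = 0.0 if i == 0 else coarse[i - 1]                 # lower edge allows alpha = 0
    hi = alpha_max if i == n_log - 1 else coarse[i + 1]   # upper edge allows "infinite"
    fine = np.linspace(lo, hi, n_lin)
    return float(fine[int(np.argmin([objective(a) for a in fine]))])

# Toy objective with an interior minimum at alpha = 3
alpha_star = two_step_search(lambda a: (a - 3.0) ** 2)
```

An objective that is increasing in $\alpha$ returns exactly zero, and one that is decreasing returns `alpha_max`, mirroring the "no regularization" and "infinite regularization" cases.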
Table 1 and Table 2 compare the performance of the TSLS estimator with the ridge estimator for different precision levels and sample sizes when the prior is fixed at $\beta^p = (1/2, 1/2)'$ and $\beta^p = (3/2, 3/2)'$ respectively. Recall, our parameter of interest is $\beta_0 = (\beta_1, \beta_2)' = (0, 0)'$. We compare the estimators based on (a) bias, (b) standard deviation of the estimates, (c) MSE values of the estimates and (d) sum of MSE values of $\hat\beta_1$ and $\hat\beta_2$. In both tables, the TSLS estimator performs as expected: both bias and standard deviation of estimates fall as sample size increases and as instrument signal strength increases. In smaller samples, the TSLS estimators exhibit some bias, in line with TSLS estimators being consistent but not unbiased. Table 1 presents a scenario where the prior for the ridge estimator is one standard deviation away from the true parameter value. We note that in the low precision setting of $\delta = 0.1$ the ridge estimator has lower MSE for all sample sizes considered in the simulations. However, as precision improves, we note that for larger sample sizes the TSLS estimator has lower MSE. Table 2 describes a scenario where the ridge estimator does not have any particular advantage since it is biased to a prior which is 3 standard deviations away from the true parameter value. However, even when prior values are far from true parameter values, there are a number of scenarios where the ridge estimator outperforms the TSLS estimator in terms of MSE. In particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE. When $\delta = 0.1$, the ridge estimator leads to lower MSE values for all sample sizes except $n = 500$. When $\delta = 1$ and the model has high precision, the ridge estimator has higher MSE than TSLS. Thus as the signal strength improves and low precision issues subside, TSLS dominates. The bias-variance trade-off is at work here.
Consider the results corresponding to $n = 25$ and $δ = 0.25$. The ridge estimator has higher bias compared to the TSLS estimator for both parameters, however this is compensated by considerably smaller standard deviation values leading to smaller MSE. This table also demonstrates scenarios where for a given $δ$ value, as the sample size increases the estimator with lower MSE changes from ridge to TSLS. For $δ = 0.25$, the ridge estimator performs better for sample sizes $n ≤ 50$ whereas TSLS performs better for $n ≥ 250$. Similarly, for $δ = 0.50$, the ridge estimator outperforms TSLS only for the smallest sample size of $n = 25$.
Figure 1, Figure 2, Figure 3 and Figure 4 present scatter plots of the estimates from TSLS and the ridge estimator with different priors for the following cases: (a) low precision, small sample size; (b) low precision, large sample size; (c) high precision, small sample size; (d) high precision, large sample size. These figures demonstrate the influence of the priors. The prior pulls the ridge estimates away from the population parameter values. For low precision models ($\delta = 0.1$), the variance associated with TSLS estimates is larger than that of the ridge estimates, even in larger sample sizes. The ridge estimator is biased towards the prior, which is demonstrated by the estimates not being distributed symmetrically around the true value. On the other hand, for high precision models ($\delta = 1$) the variance reduction of the ridge estimator relative to TSLS is not as dramatic. In fact, while the variance reduction appears substantial for the prior value of $\beta^p = (1/2, 1/2)'$, it is unclear, at least visually, if there is a reduction in variance for a poorly specified prior at $\beta^p = (3/2, 3/2)'$. In larger samples with high precision (Figure 4) the TSLS estimates outperform the ridge estimates, whose clouds are larger and slightly off-center from the true parameter values. However, ridge estimators using different priors are still competitive and do not lead to a drastically worse performance (as a reference compare the performance of the TSLS estimates to the ridge estimates in Figure 1).
Table 3 summarizes the distribution of the estimated regularization parameter $\hat\alpha$ for different precision levels, sample sizes and prior values. Recall Theorem 1 implies the asymptotic distribution will be a mixed distribution with some discrete mass at $\alpha = 0.$ Table 3 reports the proportion of cases which correspond to “no regularization” ($\hat\alpha = 0$), “infinite regularization” ($\hat\alpha = 10^{7} \approx \infty$) and “some regularization” ($\hat\alpha \in (0, 10^{7})$). In all cases, there is a substantial mass of the distribution concentrated at $\hat\alpha = 0$. On the other hand, we note that except in the cases where the prior is located at the true parameter value, there is no mass concentrated at $\hat\alpha \approx \infty$. We see some interesting variations corresponding to different prior values. In low precision settings (particularly $\delta = 0.1$), keeping sample size fixed, as the prior moves away from the true value, the proportion of cases with “no regularization” increases whereas the proportion of cases with “some regularization” falls. Similarly, for high precision settings (particularly $\delta = 1$), as the sample size increases, the proportion of cases with “no regularization” increases whereas the proportion of cases with “some regularization” falls. In this table we also present results for large sample sizes of $n = 10{,}000$, which demonstrate that the mass at $\hat\alpha = 0$ approaches 50% asymptotically, as predicted by Theorem 1. Histograms of the distributions of $\hat\alpha$ for large sample sizes of $n = 10{,}000$ are presented in Figure 5.
Table 4 presents summaries of the smallest singular value of the matrix⁸ $-\frac{X'Z}{n}$ for different values of $\delta$ and n. The estimated asymptotic standard deviation is inversely related to the smallest singular value, or equivalently smaller singular values are associated with flatter objective functions at their minimum values. As the precision parameter increases from $\delta = 0.1$ to $\delta = 1$, the mean of the smallest singular value increases. As the sample size increases, the variance of the smallest singular values decreases.
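The precision diagnostic described here is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, and the first stage used to draw data (standard normal instruments and errors with an assumed $3 \times 2$ coefficient matrix) is a simplification, not the full simulation design.

```python
import numpy as np

def smallest_singular_value(X, Z):
    """Smallest singular value of -X'Z/n; the sign does not change singular values."""
    return np.linalg.svd(-X.T @ Z / len(X), compute_uv=False).min()

def draw_xz(n, delta, rng):
    # Assumed first stage X = Z Gamma_0 + u with standard normal Z and u
    Gamma0 = np.array([[1.0, 0.0], [0.0, delta], [1.0, 0.0]])
    Z = rng.standard_normal((n, 3))
    return Z @ Gamma0 + rng.standard_normal((n, 2)), Z

rng = np.random.default_rng(0)
sv = {delta: smallest_singular_value(*draw_xz(20000, delta, rng))
      for delta in (0.1, 0.25, 0.5, 1.0)}
```

In this simplified setup the population counterpart of the smallest singular value equals $\delta$ itself, so the diagnostic tracks the instrument signal strength directly.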

## 5. Returns to Education

This section revisits the question of returns to schooling in Angrist and Krueger (1991). The key insight of that paper was the use of quarter of birth indicator variables as instruments to uniquely identify the impact of years of education on wages. The data are from the Public Use Micro Sample of the 1980 U.S. Census and include men born between 1920 and 1949 with positive earnings in 1979 and no missing observations. The sample is divided into three data sets, one for each decade. The empirical results are summarized in Table 5.
The specification explains the log of weekly wages with the years of education; with race, standard metropolitan statistical area, marital status, region dummies, and year of birth dummies as controls9. The same data set was used in (Staiger and Stock 1997) which focused on weak instruments using the quarter of birth interacted with year of birth to create 30 instruments. To avoid weak instruments, we restrict attention to the quarter of birth dummy variables as instruments. This gives three instruments for years of education. The resulting first stage F-tests give no indication of weak instruments. The appropriateness of the specification is tested with the Basmann test for overidentify restrictions. The specification is not rejected for the 1920s and 1930s. However, the specification is rejected for the 1940s.
Each sample is split into a training sample with $τ = 0.7$ and a testing sample. The empirical results will be sensitive to this randomization. For each decade, the data was read into R from the Stata data set from the Angrist Data Archive at MIT, dplyr was used to filter the data for each decade and the random seed was set in R with “set.seed(12345)”. The samples have hundreds of thousands of observations, n, with 22 parameters estimated. However, there are precision problems with these estimates. As noted above in Section 2, the precision can be judged by the magnitude of the smallest eigenvalue of the second derivative of the objective function at the TSLS estimate for the entire sample, these are denoted $λ min$. The precision can also be judged with the condition number for the second derivatives which varies between 12 million and 282 million.
The prior for the ridge estimates was set empirically to judge the variability in the parameter estimates in the least informative dimension of the parameter space. The prior was set at four times the eigenvector associated with the smallest eigenvalue of the second derivative of the TSLS objective function for the training sample. This is moving four unit lengths away from the TSLS estimate in the flattest (least informative) direction. The ridge estimator is then defined by the value of $α$ that minimizes the TSLS objective function using the test sample.
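The procedure described above can be sketched as follows. This is a minimal illustration rather than the code used for Table 5: the simulated data, the function names, the grid of $\alpha$ values and the use of the training-sample TSLS estimate as the starting point for the prior are all assumptions made here, and NumPy's random generator will not reproduce the split obtained from R's `set.seed(12345)`.

```python
import numpy as np

def projected_moments(X, Z, Y):
    """Return A = X'P_Z X / n and b = X'P_Z Y / n, with P_Z = Z (Z'Z)^{-1} Z'."""
    n = X.shape[0]
    ZtZ = Z.T @ Z
    A = (X.T @ Z) @ np.linalg.solve(ZtZ, Z.T @ X) / n
    b = (X.T @ Z) @ np.linalg.solve(ZtZ, Z.T @ Y) / n
    return A, b

def ridge_iv(A, b, alpha, beta_p):
    """Ridge IV estimate (A + alpha I)^{-1} (b + alpha beta_p)."""
    return np.linalg.solve(A + alpha * np.eye(A.shape[0]), b + alpha * beta_p)

def tsls_objective(beta, X, Z, Y):
    """TSLS objective (Y - X beta)' P_Z (Y - X beta) / (2 n)."""
    Zr = Z.T @ (Y - X @ beta)
    return 0.5 * Zr @ np.linalg.solve(Z.T @ Z, Zr) / X.shape[0]

def select_alpha(X, Z, Y, tau=0.7, steps=4.0, grid=None, seed=12345):
    """Split the sample, build the prior, and pick alpha on the test sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    n_tr = int(np.floor(tau * n))
    tr, te = idx[:n_tr], idx[n_tr:]
    A, b = projected_moments(X[tr], Z[tr], Y[tr])
    beta_tsls = np.linalg.solve(A, b)
    # eigh returns eigenvalues in ascending order: column 0 is the flattest direction.
    _, vecs = np.linalg.eigh(A)
    beta_p = beta_tsls + steps * vecs[:, 0]
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-6, 7, 200)))  # hypothetical grid
    losses = np.array([tsls_objective(ridge_iv(A, b, a, beta_p), X[te], Z[te], Y[te])
                       for a in grid])
    best = int(np.argmin(losses))
    return {"alpha": float(grid[best]), "beta_ridge": ridge_iv(A, b, grid[best], beta_p),
            "beta_tsls": beta_tsls, "beta_p": beta_p,
            "loss": losses[best], "loss_at_zero": losses[0]}

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 2000
Zx = rng.standard_normal((n, 3))
u = rng.standard_normal(n)
x = Zx @ np.array([1.0, 0.5, 0.2]) + u
X = np.column_stack([x, np.ones(n)])
Z = np.column_stack([Zx, np.ones(n)])
Y = X @ np.array([0.1, 1.0]) + 0.5 * u + rng.standard_normal(n)
fit = select_alpha(X, Z, Y)
print(fit["alpha"], fit["beta_tsls"], fit["beta_ridge"])
```

Because $\alpha = 0$ is included in the grid, the selected test-sample loss can never exceed the loss of the unpenalized training-sample TSLS estimate.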
Even with the large sample sizes, the prior impacts the estimates. For the 1930s and 1940s the ridge estimate is between the OLS and the TSLS estimates. For the 1920s the ridge estimate is below both the TSLS estimate and the OLS estimate.
A final simulation exercise compares TSLS estimates with ridge estimates using the Angrist and Krueger (1991) data. As above, the specification explains the log of weekly wages with the years of education, with race, standard metropolitan statistical area, marital status, region dummies, and year of birth dummies as controls. The quarter of birth dummies provide three instruments for years of education. Using the parameter estimates obtained by running the TSLS estimation on the entire sample from the 1930s we obtain residuals $\hat{u}$ and $\hat{\varepsilon}$ and their estimated covariance matrix. We then draw $N = 10{,}000$ random samples of size $n$ with replacement for the instruments and control variables and use these along with the full sample parameter estimates and error covariance to obtain simulated values of years of education and log of weekly wages.
TSLS and ridge estimates are obtained for each random sample and compared on the basis of bias, standard deviation and root mean square error (RMSE). For the ridge estimates each sample is split into a training sample with $\tau = 0.7$ and a testing sample. Priors for all parameters are set to $\hat{\beta}_{IV,train}$, i.e., the TSLS parameter estimate using only the training sample, except for the parameter corresponding to returns to education. Three priors are considered for the returns to education parameter: $\beta^{p}_{edu} = \frac{1}{2}\hat{\beta}^{IV}_{edu,train}$, $\beta^{p}_{edu} = \hat{\beta}^{IV}_{edu,train}$ and $\beta^{p}_{edu} = 2\hat{\beta}^{IV}_{edu,train}$. Results presented in Table 6 demonstrate that for the smallest sample size of $n = 1000$ the ridge estimates corresponding to all three priors produce lower RMSE values for the parameter of interest than TSLS. On the other hand, for the larger sample size of $n = 100{,}000$ the TSLS estimates produce lower RMSE values than the ridge estimates for all three priors.
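The three comparison criteria can be computed from a vector of simulated estimates as in the sketch below. The function name and the synthetic draws are illustrative assumptions; the standard deviation uses the population convention (`ddof=0`), under which $\mathrm{RMSE}^2 = \mathrm{bias}^2 + \mathrm{SD}^2$ holds exactly, and it is not known whether the reported tables use this convention.

```python
import numpy as np

def summarize(estimates, truth):
    """Bias, standard deviation and RMSE of simulated estimates of a scalar parameter."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=0)                          # population convention (assumption)
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return bias, sd, rmse

# Synthetic draws standing in for N simulated estimates (illustration only).
rng = np.random.default_rng(1)
truth = 0.08
draws = truth + 0.01 + 0.05 * rng.standard_normal(10_000)
bias, sd, rmse = summarize(draws, truth)
print(bias, sd, rmse)
```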

## 6. Conclusions

The asymptotic distribution of the ridge estimator when the tuning parameter is selected with a test sample has been characterized. This estimator incorporates a non-zero prior and allows the non-negative tuning parameter to be zero. The resulting asymptotic distribution is a mixture with discrete mass on zero for the tuning parameter, a novel result, which follows from the true value of the tuning parameter lying on the boundary of the parameter space.
Simulations identify the settings in which the ridge estimator produces lower MSE than the TSLS estimator: when model precision is low, particularly in smaller samples and often even in larger samples. Where the TSLS estimator has lower MSE, particularly when precision is high, the ridge estimator remains competitive. As an empirical application, we have applied this procedure to the returns to education data set used in Angrist and Krueger (1991). Importantly, even with over 200,000 observations, the prior still influences the point estimates.
The ridge estimator will be particularly useful in addressing applied empirical questions where TSLS is appropriate but the available data suffer from low precision or the available sample size is small. The characterization of the asymptotic distribution of the ridge estimator also provides a useful framework for other estimators that involve tuning parameters.
Extensions and improvements of the approach are worth pursuing. Possible directions include applying different loss functions to the test sample (e.g., k-fold cross validation), allowing for multiple tuning parameters, considering models without a closed form solution, allowing the number of covariates or the number of tuning parameters to grow with the sample size, considering situations with weak or nearly weak instruments, and allowing other penalty terms such as the LASSO or elastic net.

## Supplementary Materials

The following are available online at https://www.mdpi.com/2225-1146/8/4/39/s1, File S1: Supplementary Material: On the Asymptotic Distribution of Ridge Regression Estimators using Training and Test Samples.

## Author Contributions

The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

## Funding

This research received no external funding.

## Conflicts of Interest

The authors declare no conflict of interest.

## Appendix A. Proof of Lemma 1

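The objective function that determines the optimal tuning parameter is given in Equation (10). As the sample size grows the objective function uniformly converges to a deterministic function that takes a unique local minimum at $\alpha = 0$. The parameter space is bounded and the law of large numbers implies
$\lim_{n \to \infty} Q_{n(1-\tau)}(\alpha) = \frac{1}{2}\left(\beta_0 - \beta^p\right)'\left(\frac{\Gamma_0' R_z \Gamma_0}{\alpha} + I_k\right)^{-1}\Gamma_0' R_z \Gamma_0\left(\frac{\Gamma_0' R_z \Gamma_0}{\alpha} + I_k\right)^{-1}\left(\beta_0 - \beta^p\right)$
which is minimized at $\alpha = 0$. Hence $\alpha_0 = 0$. When $\alpha = 0$ then $\hat{\beta}_{tr}(0) \to \beta_0$.
The root-n consistency of $\hat{\alpha}$ follows from the standard approach of Lemma 5.4 in Ichimura (1993). The needed results are that $\frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha}$ satisfies a CLT, that $\frac{d^2Q_{n(1-\tau)}(\alpha)}{d\alpha^2}$ is continuous (from the right hand side) at $\alpha_0$, and that $\frac{d^2Q_{n(1-\tau)}(\alpha_0)}{d\alpha^2}$ converges to a positive value. These derivatives reduce to the derivatives of $\hat{\beta}_{tr}(\alpha) = \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha\beta^p\right)$ with respect to $\alpha$. The first derivative is
$\begin{aligned}\frac{d\hat{\beta}_{tr}(\alpha)}{d\alpha} &= \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1}\beta^p - \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha\beta^p\right) \\ &= \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right).\end{aligned}$
The second derivative is
$\begin{aligned}\frac{d^2\hat{\beta}_{tr}(\alpha)}{d\alpha^2} &= -\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1}\frac{d\hat{\beta}_{tr}(\alpha)}{d\alpha} - \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right) \\ &= -\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right) - \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right) \\ &= -2\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right).\end{aligned}$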
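The two closed-form derivatives of $\hat{\beta}_{tr}(\alpha)$ can be checked numerically against finite differences. In the sketch below the matrix `A` and vector `b` are randomly generated stand-ins for $X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}/[\tau n]$ and $X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}/[\tau n]$; all names and values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
G = rng.standard_normal((k, k))
A = G @ G.T + np.eye(k)           # stand-in for X'P_Z X / [tau n] (positive definite)
b = rng.standard_normal(k)        # stand-in for X'P_Z Y / [tau n]
beta_p = rng.standard_normal(k)   # prior
I_k = np.eye(k)

def beta_tr(alpha):
    """(A + alpha I)^{-1} (b + alpha beta_p)."""
    return np.linalg.solve(A + alpha * I_k, b + alpha * beta_p)

def d_beta(alpha):
    """First derivative: (A + alpha I)^{-1} (beta_p - beta_tr(alpha))."""
    return np.linalg.solve(A + alpha * I_k, beta_p - beta_tr(alpha))

def d2_beta(alpha):
    """Second derivative: -2 (A + alpha I)^{-2} (beta_p - beta_tr(alpha))."""
    M_inv = np.linalg.inv(A + alpha * I_k)
    return -2.0 * M_inv @ M_inv @ (beta_p - beta_tr(alpha))

# Central finite differences at an arbitrary interior point.
alpha, h = 0.5, 1e-3
fd1 = (beta_tr(alpha + h) - beta_tr(alpha - h)) / (2 * h)
fd2 = (beta_tr(alpha + h) - 2 * beta_tr(alpha) + beta_tr(alpha - h)) / h ** 2
print(np.max(np.abs(fd1 - d_beta(alpha))), np.max(np.abs(fd2 - d2_beta(alpha))))
```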
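Now determine the derivatives of $Q_{n(1-\tau)}(\alpha)$. The first derivative is
$\begin{aligned}\frac{dQ_{n(1-\tau)}(\alpha)}{d\alpha} &= -\frac{1}{(n-[\tau n])}\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\hat{\beta}_{tr}(\alpha)\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\frac{d\hat{\beta}_{tr}(\alpha)}{d\alpha} \\ &= -\frac{1}{(n-[\tau n])}\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\hat{\beta}_{tr}(\alpha)\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right).\end{aligned}$
Evaluate at $\alpha_0 = 0$
$\begin{aligned}\frac{dQ_{n(1-\tau)}(0)}{d\alpha} &= -\frac{1}{(n-[\tau n])}\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\hat{\beta}_{tr}(0)\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \hat{\beta}_{tr}(0)\right) \\ &= -\frac{1}{(n-[\tau n])}\left(\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\beta_0\right) - X_{n(1-\tau)}\left(\hat{\beta}_{tr}(0) - \beta_0\right)\right)' \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right) \\ &= -\frac{1}{(n-[\tau n])}\left(\varepsilon_{n(1-\tau)}' - \frac{\varepsilon_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1} X_{n(1-\tau)}'\right) \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \beta_0 - \left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\frac{X_{\tau n}' P_{Z_{\tau n}} \varepsilon_{\tau n}}{[\tau n]}\right).\end{aligned}$
The CLT applies to the $\varepsilon_{n(1-\tau)}' Z_{n(1-\tau)}$ and $\varepsilon_{\tau n}' Z_{\tau n}$ terms. The others converge by the LLN. Hence
$\begin{aligned}\sqrt{n-[\tau n]}\,\frac{dQ_{n(1-\tau)}(0)}{d\alpha} &= -\frac{1}{\sqrt{n-[\tau n]}}\left(\varepsilon_{n(1-\tau)}' - \frac{\varepsilon_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1} X_{n(1-\tau)}'\right) \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \beta_0\right) + o_p(1).\end{aligned}$
The second derivative is
$\begin{aligned}\frac{d^2Q_{n(1-\tau)}(\alpha)}{d\alpha^2} &= -\frac{1}{(n-[\tau n])}\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\hat{\beta}_{tr}(\alpha)\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\frac{d^2\hat{\beta}_{tr}(\alpha)}{d\alpha^2} \\ &\quad + \frac{1}{(n-[\tau n])}\left(X_{n(1-\tau)}\frac{d\hat{\beta}_{tr}(\alpha)}{d\alpha}\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\frac{d\hat{\beta}_{tr}(\alpha)}{d\alpha} \\ &= \frac{2}{(n-[\tau n])}\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\hat{\beta}_{tr}(\alpha)\right)' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-2}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right) \\ &\quad + \frac{1}{(n-[\tau n])}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right)'\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1} X_{n(1-\tau)}' \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k\right)^{-1}\left(\beta^p - \hat{\beta}_{tr}(\alpha)\right).\end{aligned}$
This is a bounded continuous function. Now evaluate at $\alpha_0 = 0$
$\begin{aligned}\frac{d^2Q_{n(1-\tau)}(0)}{d\alpha^2} &= \frac{2}{(n-[\tau n])}\left(\left(Y_{n(1-\tau)} - X_{n(1-\tau)}\beta_0\right) - X_{n(1-\tau)}\left(\hat{\beta}_{tr}(0) - \beta_0\right)\right)' \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-2}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right) \\ &\quad + \frac{1}{(n-[\tau n])}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right)'\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1} \\ &\quad \times X_{n(1-\tau)}' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right) \\ &= \frac{2}{(n-[\tau n])}\left(\varepsilon_{n(1-\tau)} - X_{n(1-\tau)}\left(\hat{\beta}_{tr}(0) - \beta_0\right)\right)' \\ &\quad \times P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-2}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right) \\ &\quad + \frac{1}{(n-[\tau n])}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right)'\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1} \\ &\quad \times X_{n(1-\tau)}' P_{Z_{n(1-\tau)}} X_{n(1-\tau)}\left(\frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]}\right)^{-1}\left(\beta^p - \beta_0 - \left(\hat{\beta}_{tr}(0) - \beta_0\right)\right).\end{aligned}$
The first term will converge to zero and the second term converges to the positive value
$\left(\beta^p - \beta_0\right)'\Gamma_0' R_z \Gamma_0\left(\beta^p - \beta_0\right).$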
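Now follow the standard approach (Lemma 5.4, Ichimura 1993) to show that $\sqrt{n}(\hat{\alpha} - \alpha_0) = O_p(1)$. Expand $Q_{n(1-\tau)}(\alpha)$ about $\alpha_0$ and evaluate at $\hat{\alpha}$.
$Q_{n(1-\tau)}(\hat{\alpha}) = Q_{n(1-\tau)}(\alpha_0) + \frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha}(\hat{\alpha} - \alpha_0) + \frac{1}{2}\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}(\hat{\alpha} - \alpha_0)^2$
where $0 \le \bar{\alpha} \le \hat{\alpha}$. Because $\hat{\alpha} = \arg\min_{[0,\infty)} Q_{n(1-\tau)}(\alpha)$, $0 \ge Q_{n(1-\tau)}(\hat{\alpha}) - Q_{n(1-\tau)}(\alpha_0)$, hence
$0 \ge \frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha}(\hat{\alpha} - \alpha_0) + \frac{1}{2}\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}(\hat{\alpha} - \alpha_0)^2.$
Multiply both sides by $\frac{n}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)^2}$.
$\begin{aligned}0 &\ge \frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha}(\hat{\alpha} - \alpha_0)\frac{n}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)^2} + \frac{1}{2}\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}(\hat{\alpha} - \alpha_0)^2\frac{n}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)^2} \\ &= \sqrt{n}\,\frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha}\,\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)}\,\frac{1}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)} + \frac{1}{2}\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}\left(\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)}\right)^2 \qquad \text{(A1)}\end{aligned}$
Suppose $\sqrt{n}|\hat{\alpha} - \alpha_0|$ diverged to infinity. As noted above, $\sqrt{n}\,\frac{dQ_{n(1-\tau)}(\alpha_0)}{d\alpha} = O_p(1)$. In addition, $\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)} = O_p(1)$. However, $\frac{1}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)} = o_p(1)$ and hence the first term on the right-hand side of Equation (A1) goes to zero. This means
$o_p(1) \ge \frac{1}{2}\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}\left(\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)}\right)^2.$
However, $\frac{d^2Q_{n(1-\tau)}(\bar{\alpha})}{d\alpha^2}$ converges to the limit of $\frac{d^2Q_{n(1-\tau)}(\alpha_0)}{d\alpha^2}$, a positive value, and the right-hand side can satisfy this only if
$\frac{\sqrt{n}(\hat{\alpha} - \alpha_0)}{(1 + \sqrt{n}|\hat{\alpha} - \alpha_0|)} = o_p(1).$
This occurs only if $\sqrt{n}|\hat{\alpha} - \alpha_0| = o_p(1)$, which contradicts the assumption that $\sqrt{n}|\hat{\alpha} - \alpha_0|$ diverges. Hence $\sqrt{n}(\hat{\alpha} - \alpha_0) = O_p(1)$. □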

## Appendix B. Proof of Theorem 1

This is a direct application of Theorem 1 from Andrews (2002). Assumptions A1–A5 (GMM1$*$–GMM5$*$) in Andrews (2002) are satisfied for the linear model by Assumptions 1–3. To show how the assumptions in Andrews (2002) are satisfied, we first use Assumptions 1–3 to demonstrate three useful results for the system of Equation (13). The useful results are: $E[h_i(\theta_0)] = 0$, $\sqrt{n}H_n(\theta_0)$ satisfies a central limit theorem, and $\left(\lim_{n \to \infty}\frac{\partial H_n(\theta_0)}{\partial\theta'}\right)^{-1}$ exists, which requires showing that the LLN leads to a matrix which is invertible. In the statement of the Theorem, the limiting random variable, $Z$, is composed of two terms: $\sqrt{n}H_n(\theta_0)$ and $-\left(E\left[\frac{\partial h_i(\theta_0)}{\partial\theta'}\right]\right)^{-1}$.
Evaluate the moment condition, Equation (13), at $\theta_0$, to show that $E[h_i(\theta_0)] = 0$ and that $\sqrt{n}H_n(\theta_0)$ satisfies a central limit theorem.
$H_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix} 1_\tau(i)\,\mathrm{vech}(R_z - z_i z_i') \\ (1 - 1_\tau(i))\,\mathrm{vech}(R_z - z_i z_i') \\ 1_\tau(i)\,\mathrm{vec}(R_z\Gamma_0 - z_i x_i') \\ (1 - 1_\tau(i))\,\mathrm{vec}(R_z\Gamma_0 - z_i x_i') \\ -1_\tau(i)\,\Gamma_0' R_z R_z^{-1} z_i (y_i - x_i'\beta_0) \\ (1 - 1_\tau(i))\,(y_i - x_i'\beta_0)\, z_i' R_z^{-1} R_z \Gamma_0\left(\Gamma_0' R_z R_z^{-1} R_z \Gamma_0\right)^{-1}(\beta^p - \beta_0) \\ -\Gamma_0' R_z R_z^{-1} z_i (y_i - x_i'\beta_0) \end{bmatrix}$
$= \frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix} 1_\tau(i)\,\mathrm{vech}(R_z - z_i z_i') \\ (1 - 1_\tau(i))\,\mathrm{vech}(R_z - z_i z_i') \\ 1_\tau(i)\,\mathrm{vec}(R_z\Gamma_0 - z_i u_i' - z_i z_i'\Gamma_0) \\ (1 - 1_\tau(i))\,\mathrm{vec}(R_z\Gamma_0 - z_i u_i' - z_i z_i'\Gamma_0) \\ -1_\tau(i)\,\Gamma_0' z_i \varepsilon_i \\ (1 - 1_\tau(i))\,\varepsilon_i z_i'\Gamma_0\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta^p - \beta_0) \\ -\Gamma_0' z_i \varepsilon_i \end{bmatrix}$
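Each element of $h_i(\theta_0)$ has expectation zero and bounded covariance, hence the iid assumption implies the central limit theorem
$\sqrt{n}H_n(\theta_0) \sim AN\left(0, \begin{bmatrix} \tau\chi & 0 & \tau\xi & 0 & 0 & 0 & 0 \\ 0 & (1-\tau)\chi & 0 & (1-\tau)\xi & 0 & 0 & 0 \\ \tau\xi' & 0 & \tau\zeta & 0 & \tau\Psi & 0 & \tau\Psi \\ 0 & (1-\tau)\xi' & 0 & (1-\tau)\zeta & 0 & (1-\tau)\Pi & (1-\tau)\Psi \\ 0 & 0 & \tau\Psi' & 0 & \tau\Xi & 0 & \tau\Xi \\ 0 & 0 & 0 & (1-\tau)\Pi' & 0 & (1-\tau)\Upsilon & (1-\tau)\Phi' \\ 0 & 0 & \tau\Psi' & (1-\tau)\Psi' & \tau\Xi & (1-\tau)\Phi & \Xi \end{bmatrix}\right)$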
where
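$\chi = E\left[\mathrm{vech}(R_z - z_i z_i')\,\mathrm{vech}(R_z - z_i z_i')'\right],$
$\xi = E\left[\mathrm{vech}(R_z - z_i z_i')\,\mathrm{vec}(R_z\Gamma_0 - z_i z_i'\Gamma_0)'\right],$
$\zeta = E\left[\mathrm{vec}(R_z\Gamma_0 - z_i x_i')\,\mathrm{vec}(R_z\Gamma_0 - z_i x_i')'\right],$
$\Psi = E\left[\mathrm{vec}(z_i u_i')\,\varepsilon_i z_i'\Gamma_0\right],$
$\Pi = E\left[\mathrm{vec}(-u_i z_i')\,\varepsilon_i z_i'\Gamma_0(\Gamma_0' R_z \Gamma_0)^{-1}(\beta_0 - \beta^p)\right],$
$\Xi = (\Gamma_0' R_z \Gamma_0)\,\sigma_\varepsilon^2,$
$\Upsilon = \sigma_\varepsilon^2(\beta_0 - \beta^p)'(\Gamma_0' R_z \Gamma_0)^{-1}(\beta_0 - \beta^p),$ and
$\Phi = -\sigma_\varepsilon^2(\beta_0 - \beta^p)'.$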
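The expectation of the first derivative of the moment conditions evaluated at $\theta_0$ is
$E\left[\frac{\partial h_i(\theta_0)}{\partial\theta'}\right] = \begin{bmatrix} \tau I_{\frac{m(m+1)}{2}} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & (1-\tau) I_{\frac{m(m+1)}{2}} & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \tau I_{mp} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & (1-\tau) I_{mp} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \tau(\Gamma_0' R_z \Gamma_0) & \tau(\beta_0 - \beta^p) & 0 \\ 0 & 0 & 0 & 0 & -(1-\tau)(\beta_0 - \beta^p)' & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & (\beta_0 - \beta^p) & (\Gamma_0' R_z \Gamma_0) \end{bmatrix}$
The general structure of the last matrix is
$\begin{bmatrix} A & B \\ \kappa B' & C \end{bmatrix}$
where
$A = \begin{bmatrix} \tau I_{\frac{m(m+1)}{2}} & 0 & 0 & 0 & 0 \\ 0 & (1-\tau) I_{\frac{m(m+1)}{2}} & 0 & 0 & 0 \\ 0 & 0 & \tau I_{mp} & 0 & 0 \\ 0 & 0 & 0 & (1-\tau) I_{mp} & 0 \\ 0 & 0 & 0 & 0 & \tau(\Gamma_0' R_z \Gamma_0) \end{bmatrix},$
$B = \begin{bmatrix} 0 & 0 \\ \tau(\beta_0 - \beta^p) & 0 \end{bmatrix},$ $C = \begin{bmatrix} 0 & 0 \\ (\beta_0 - \beta^p) & (\Gamma_0' R_z \Gamma_0) \end{bmatrix},$ and $\kappa = -\frac{(1-\tau)}{\tau}$. The inverse of the general structure is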
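$\begin{bmatrix} A^{-1} + \kappa A^{-1} B\left(C - \kappa B' A^{-1} B\right)^{-1} B' A^{-1} & -A^{-1} B\left(C - \kappa B' A^{-1} B\right)^{-1} \\ -\kappa\left(C - \kappa B' A^{-1} B\right)^{-1} B' A^{-1} & \left(C - \kappa B' A^{-1} B\right)^{-1} \end{bmatrix}$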
which is well defined if the inverses of $A$ and $(C - \kappa B' A^{-1} B)$ exist. The matrix $A^{-1}$ is well defined because Assumption 3 implies $(\Gamma_0' R_z \Gamma_0)$ is full rank. The matrix
$C - \kappa B' A^{-1} B = \begin{bmatrix} (1-\tau)(\beta_0 - \beta^p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0 - \beta^p) & 0_{1 \times p} \\ (\beta_0 - \beta^p) & (\Gamma_0' R_z \Gamma_0) \end{bmatrix}$
has the inverse
$\left(C - \kappa B' A^{-1} B\right)^{-1} = \begin{bmatrix} \frac{1}{(1-\tau)(\beta_0 - \beta^p)'(\Gamma_0' R_z \Gamma_0)^{-1}(\beta_0 - \beta^p)} & 0_{1 \times p} \\ \frac{-(\Gamma_0' R_z \Gamma_0)^{-1}(\beta_0 - \beta^p)}{(1-\tau)(\beta_0 - \beta^p)'(\Gamma_0' R_z \Gamma_0)^{-1}(\beta_0 - \beta^p)} & (\Gamma_0' R_z \Gamma_0)^{-1} \end{bmatrix}.$
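The partitioned inverse and the closed form for $\left(C - \kappa B' A^{-1} B\right)^{-1}$ can be checked numerically. The sketch below uses randomly generated stand-ins (`M` for $\Gamma_0' R_z \Gamma_0$ and `d` for $\beta_0 - \beta^p$, both assumptions for illustration) and keeps only the last diagonal block of $A$, since the other blocks multiply zero rows of $B$. The formula tested is the standard partitioned inverse of a matrix with blocks $A$, $B$, $\kappa B'$ and $C$.

```python
import numpy as np

rng = np.random.default_rng(1)
k, tau = 2, 0.7
G = rng.standard_normal((k, k))
M = G @ G.T + np.eye(k)             # stand-in for Gamma_0' R_z Gamma_0
d = rng.standard_normal((k, 1))     # stand-in for beta_0 - beta^p
kappa = -(1 - tau) / tau

A = tau * M                                             # last diagonal block of A
B = np.hstack([tau * d, np.zeros((k, k))])              # k x (1 + k)
C = np.block([[np.zeros((1, 1)), np.zeros((1, k))],
              [d, M]])                                  # (1 + k) x (1 + k)
J = np.block([[A, B], [kappa * B.T, C]])

A_inv = np.linalg.inv(A)
S = C - kappa * B.T @ A_inv @ B                         # Schur complement
S_inv = np.linalg.inv(S)

# Standard partitioned inverse with lower-left block kappa * B'.
J_inv_formula = np.block([
    [A_inv + kappa * A_inv @ B @ S_inv @ B.T @ A_inv, -A_inv @ B @ S_inv],
    [-kappa * S_inv @ B.T @ A_inv, S_inv],
])

# Closed form for the inverse of the Schur complement given in the text.
M_inv = np.linalg.inv(M)
a = ((1 - tau) * d.T @ M_inv @ d).item()
S_inv_closed = np.block([[np.array([[1.0 / a]]), np.zeros((1, k))],
                         [-M_inv @ d / a, M_inv]])

print(np.max(np.abs(J_inv_formula - np.linalg.inv(J))))
print(np.max(np.abs(S_inv_closed - S_inv)))
```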
Hence $-\left(E\left[\frac{\partial h_i(\theta_0)}{\partial\theta'}\right]\right)^{-1}$ is well defined. Now verify Assumptions A1–A5 (GMM1$*$–GMM5$*$) in Andrews (2002).
Assumption A1 (GMM1*).
This parameter space is bounded. Because $z_i$ has finite fourth moments and $(\varepsilon_i, u_i')'$ has a finite second moment, there exists a dominating function with a finite expectation. This implies that $H_n(\theta)' H_n(\theta)$ will uniformly converge to its limiting function, $E[H_n(\theta)']E[H_n(\theta)]$. Identification follows from $E[H_n(\theta_0)] = 0$ and the invertibility of $E\left[\frac{\partial h_i(\theta_0)}{\partial\theta'}\right]$.
Assumption A2 (GMM2*).
The data are iid. The GMM structure is presented above. The expectation of the first derivative of the moment conditions is evaluated at $\theta_0$ and inverted, hence demonstrating it is full rank. $E[H_n(\theta_0)] = 0$ is demonstrated above. The system is just identified, so an identity weighting matrix is used.
Assumption A3 (GMM3*).
The CLT applies because the data are iid, $z_i$ has finite fourth moments, $(\varepsilon_i, u_i')'$ has a finite second moment, and $z_i$ and $(\varepsilon_j, u_j')'$ are independent for all $i$ and $j$.
Assumption A4 (GMM4*).
Because the eigenvalues of $R_z$ are bounded above zero and below infinity, each element of $R_z$ and $R_z^{-1}$ is bounded above. Hence all the parameters in $\Theta$ are bounded and Equation (27) of Andrews (2002) is satisfied with $c = \max(B_1, B_2, B_3, B_4)$.
Assumption A5 (GMM5*).
The cone for this problem is $\Lambda = \left\{\lambda \in \mathbb{R}^{m(m+1) + 2mk + 2k + 1} : \lambda_{m(m+1) + 2mk + k + 1} \ge 0\right\}$, which is convex.

## References

1. Andrews, Donald W. K. 2002. Generalized method of moments estimation when a parameter is on a boundary. Journal of Business & Economic Statistics 20: 530–44.
2. Angrist, Joshua D., and Alan B. Krueger. 1991. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics 106: 979–1014.
3. Antoine, Bertille, and Eric Renault. 2009. Efficient gmm with nearly-weak instruments. The Econometrics Journal 12: S135–S171.
4. Bengio, Yoshua. 2000. Gradient-based optimization of hyperparameters. Neural Computation 12: 1889–900.
5. Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. Regularization in statistics. Test 15: 271–344.
6. Carrasco, Marine, and Guy Tchuente. 2016. Efficient estimation with many weak instruments using regularization techniques. Econometric Reviews 35: 1609–37.
7. Carrasco, Marine, and Jean-Pierre Florens. 2000. Generalization of gmm to a continuum of moment conditions. Econometric Theory 16: 797–834.
8. Carrasco, Marine, Jean-Pierre Florens, and Eric Renault. 2007. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. Handbook of Econometrics 6: 5633–751.
9. Carrasco, Marine. 2012. A regularization approach to the many instruments problem. Journal of Econometrics 170: 383–98.
10. Chen, Chen, Min Ren, Min Zhang, and Dabao Zhang. 2018. A two-stage penalized least squares method for constructing large systems of structural equations. Journal of Machine Learning Research 19: 40–73.
11. Dorugade, Ashok Vithoba. 2014. New ridge parameters for ridge regression. Journal of the Association of Arab Universities for Basic and Applied Sciences 15: 94–99.
12. Firinguetti, Luis, and Gladys Bobadilla. 2011. Asymptotic confidence intervals in ridge regression based on the edgeworth expansion. Statistical Papers 52: 287–307.
13. Habibnia, Ali, and Esfandiar Maasoumi. 2019. Forecasting in big data environments: An adaptable and automated shrinkage estimation of neural networks (aashnet). arXiv arXiv:1904.11145.
14. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. Unsupervised learning. In The Elements of Statistical Learning. New York: Springer, pp. 485–585.
15. Hoerl, Arthur E., and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
16. Hoerl, Arthur E., Robert W. Kennard, and Kent F. Baldwin. 1975. Ridge regression: Some simulations. Communications in Statistics 4: 105–23.
17. Ichimura, Hidehiko. 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58: 71–120.
18. Larsen, Jan, Lars Kai Hansen, Claus Svarer, and Børje Ola Mattias Ohlsson. 1996. Design and regularization of neural networks: The optimal use of a validation set. Paper presented at the 1996 IEEE Signal Processing Society Workshop, Kyoto, Japan, September 4–6; pp. 62–71.
19. Lawless, J. F., and P. Wang. 1976. A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods 5: 307–23.
20. Leeb, Hannes, and Benedikt M. Pötscher. 2005. Model selection and inference: Facts and fiction. Econometric Theory 21: 21–59.
21. Lin, Wei, Rui Feng, and Hongzhe Li. 2015. Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association 110: 270–88.
22. Maclaurin, Dougal, David Duvenaud, and Ryan P. Adams. 2015. Gradient-based hyperparameter optimization through reversible learning. Paper presented at the 32nd International Conference on International Conference on Machine Learning—Volume 37, ICML’15, Lille, France, July 7–9; pp. 2113–22.
23. Montgomery, Douglas C., Elizabeth A. Peck, and G. Geoffrey Vining. 2012. Introduction to Linear Regression Analysis. Wiley Series in Probability and Statistics; New York: Wiley.
24. Sanderson, Eleanor, and Frank Windmeijer. 2016. A weak instrument f-test in linear iv models with multiple endogenous variables. Journal of Econometrics 190: 212–21.
25. Staiger, Douglas, and James H. Stock. 1997. Instrumental Variables Regression with Weak Instruments. Econometrica 65: 557–86.
26. Theobald, Chris M. 1974. Generalizations of mean square error applied to ridge regression. Journal of the Royal Statistical Society. Series B (Methodological) 36: 103–6.
27. Zhu, Ying. 2018. Sparse linear models and l1-regularized 2SLS with high-dimensional endogenous regressors and instruments. Journal of Econometrics 202: 196–213.
1 Relative to the likelihood based approaches, the GMM framework is better suited to the social science setting where this estimator will be most useful. Rarely does a social science model imply the actual distribution for an error. Unconditional expectations of zero are more typical in social science theories and are the foundation for the GMM estimator.
2 This term is both the second derivative of the objective function (3) and the matrix being inverted in the last term of the covariance (4).
3 Similar results are obtained via other specifications of $\Gamma_0$. These are included as part of Supplementary Materials for the paper, available from the authors on request.
4 Each individual error term is standard normal.
5 Other specifications of prior values also led to similar results. These are included as part of Supplementary Materials for the paper, available from the authors on request.
6 The best value of $\tau$ is unclear. All the simulations reported in this section were also performed with $\tau = 0.5$ and $\tau = 0.9$. The results were similar to $\tau = 0.7$. The full set of simulations is available in the Supplementary Materials.
7 We consider a linear grid of 10,000 points in the second step.
8 This corresponds to the estimate of $E\left[\frac{\partial g_i(\beta)}{\partial\beta'}\right]$ where $g_i(\beta) = (y_i - x_i\beta) z_i$.
9 The specification with exogenous variables $W$ (Staiger and Stock 1997), $Y = \tilde{X}a + Wb + u$ and $\tilde{X} = \tilde{Z}c + Wd + \tilde{\varepsilon}$, fits into the specification of Equations (1) and (2) with $X = \begin{bmatrix} \tilde{X} & W \end{bmatrix}$, $Z = \begin{bmatrix} \tilde{Z} & W \end{bmatrix}$, $\varepsilon = \begin{bmatrix} \tilde{\varepsilon} & 0 \end{bmatrix}$, $\beta = \begin{bmatrix} a \\ b \end{bmatrix}$, and $\Gamma = \begin{bmatrix} c & 0 \\ d & I \end{bmatrix}$.
Figure 1. Scatter plots of the estimates from TSLS and ridge estimator with different priors when precision is low ($\delta = 0.1$) and sample size is small ($n = 25$). Estimates, the true parameter value and prior values are represented by blue, yellow and red points respectively. The variance associated with TSLS estimates is much larger than the ridge estimates. The ridge estimator is biased toward the prior.
Figure 2. Scatter plots of the estimates from TSLS and ridge estimator with different priors when precision is low ($\delta = 0.1$) and sample size is large ($n = 500$). Estimates, the true parameter value and prior values are represented by blue, yellow and red points respectively. The variance associated with TSLS estimates is much larger than the ridge estimates. The ridge estimator is less biased towards the prior in the larger samples.
Figure 3. Scatter plots of the estimates from TSLS and ridge estimator with different priors when precision is high ($\delta = 1$) and sample size is small ($n = 25$). Estimates, the true parameter value and prior values are represented by blue, yellow and red points respectively. TSLS performance is much better in this setting. The variance reduction for the ridge estimator is not as dramatic.
Figure 4. Scatter plots of the estimates from TSLS and ridge estimator with different priors when precision is high ($\delta = 1$) and sample size is large ($n = 500$). Estimates, the true parameter value and prior values are represented by blue, yellow and red points respectively. The TSLS estimates outperform the ridge estimators which is demonstrated by marginally larger clouds which are slightly off-center from the true parameter values for the ridge estimators. However, the ridge estimator using different priors is still competitive.
Figure 5. This figure plots the histogram of estimated regularization parameter $\hat{\alpha}$ when $n = 10{,}000$ for all precision parameters and all priors considered in the simulations. The total number of simulations to generate each of these plots is $N = 1000$. As predicted by Theorem 1, the mass at $\hat{\alpha} = 0$ is approaching $50\%$ asymptotically. Distributions of $\hat{\alpha}$ values for all cases considered are presented in Table 3.
Table 1. Estimates of $β ^ 1$ and $β ^ 2$ using TSLS and ridge estimator for $β p = ( 1 2 , 1 2 ) ′$. The ridge estimator leads to smaller combined MSE (highlighted in bold) when precision is low ($δ = 0.10$). This drop in MSE is driven primarily by large reductions in standard deviations of the estimates. The TSLS estimator leads to smaller combined MSE when precision is high ($δ = 1.00$). For intermediate precision models the ridge estimator leads to smaller combined MSE in small samples.
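
| $\delta$ | $n$ | Estimator | $\hat{\beta}_1$ Bias | $\hat{\beta}_1$ SD | $\hat{\beta}_1$ MSE | $\hat{\beta}_2$ Bias | $\hat{\beta}_2$ SD | $\hat{\beta}_2$ MSE | $(\hat{\beta}_1, \hat{\beta}_2)$ MSE |
|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 25 | TSLS | 0.013 | 0.232 | 0.054 | 0.630 | 1.514 | 2.690 | 2.744 |
| | | Ridge | 0.100 | 0.138 | 0.029 | 0.677 | 0.403 | 0.621 | 0.650 |
| | 50 | TSLS | 0.006 | 0.189 | 0.036 | 0.542 | 1.428 | 2.333 | 2.368 |
| | | Ridge | 0.062 | 0.100 | 0.014 | 0.646 | 0.548 | 0.717 | 0.731 |
| | 250 | TSLS | −0.000 | 0.081 | 0.007 | 0.204 | 1.511 | 2.323 | 2.330 |
| | | Ridge | 0.021 | 0.044 | 0.002 | 0.498 | 0.385 | 0.396 | 0.398 |
| | 500 | TSLS | −0.000 | 0.041 | 0.002 | 0.060 | 0.763 | 0.585 | 0.587 |
| | | Ridge | 0.014 | 0.032 | 0.001 | 0.401 | 0.349 | 0.282 | 0.283 |
| 0.25 | 25 | TSLS | 0.008 | 0.216 | 0.047 | 0.322 | 1.156 | 1.440 | 1.486 |
| | | Ridge | 0.101 | 0.145 | 0.031 | 0.549 | 0.467 | 0.519 | 0.550 |
| | 50 | TSLS | 0.002 | 0.147 | 0.022 | 0.179 | 1.085 | 1.209 | 1.231 |
| | | Ridge | 0.062 | 0.098 | 0.013 | 0.461 | 0.342 | 0.329 | 0.343 |
| | 250 | TSLS | −0.001 | 0.047 | 0.002 | −0.002 | 0.297 | 0.088 | 0.091 |
| | | Ridge | 0.016 | 0.045 | 0.002 | 0.215 | 0.263 | 0.116 | 0.118 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | 0.000 | 0.188 | 0.035 | 0.036 |
| | | Ridge | 0.009 | 0.032 | 0.001 | 0.143 | 0.211 | 0.065 | 0.066 |
| 0.50 | 25 | TSLS | 0.002 | 0.198 | 0.039 | 0.053 | 0.751 | 0.566 | 0.605 |
| | | Ridge | 0.099 | 0.158 | 0.035 | 0.338 | 0.362 | 0.245 | 0.280 |
| | 50 | TSLS | −0.000 | 0.113 | 0.013 | 0.005 | 0.402 | 0.162 | 0.175 |
| | | Ridge | 0.057 | 0.104 | 0.014 | 0.239 | 0.264 | 0.127 | 0.141 |
| | 250 | TSLS | −0.001 | 0.045 | 0.002 | −0.000 | 0.130 | 0.017 | 0.019 |
| | | Ridge | 0.014 | 0.045 | 0.002 | 0.088 | 0.147 | 0.029 | 0.032 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | −0.000 | 0.091 | 0.008 | 0.009 |
| | | Ridge | 0.009 | 0.032 | 0.001 | 0.057 | 0.108 | 0.015 | 0.016 |
| 1.00 | 25 | TSLS | −0.002 | 0.163 | 0.026 | −0.004 | 0.244 | 0.060 | 0.086 |
| | | Ridge | 0.090 | 0.164 | 0.035 | 0.144 | 0.221 | 0.070 | 0.105 |
| | 50 | TSLS | 0.000 | 0.106 | 0.011 | 0.001 | 0.153 | 0.023 | 0.035 |
| | | Ridge | 0.054 | 0.107 | 0.014 | 0.095 | 0.155 | 0.033 | 0.047 |
| | 250 | TSLS | −0.001 | 0.045 | 0.002 | −0.000 | 0.064 | 0.004 | 0.006 |
| | | Ridge | 0.018 | 0.048 | 0.003 | 0.035 | 0.073 | 0.007 | 0.009 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | −0.000 | 0.045 | 0.002 | 0.003 |
| | | Ridge | 0.013 | 0.034 | 0.001 | 0.024 | 0.053 | 0.003 | 0.005 |
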
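
The Bias, SD, and MSE columns of Table 1 are linked by the standard decomposition of mean squared error into squared bias plus variance, and the combined MSE is the sum of the two coefficient-level MSEs. The following is a minimal sketch that checks this against the reported $\delta = 0.10$, $n = 25$ rows. The function and variable names are illustrative and not taken from the paper, and agreement is only expected up to the three-decimal rounding of the reported values.

```python
def mse_from_bias_sd(bias: float, sd: float) -> float:
    """MSE implied by the decomposition MSE = bias^2 + variance."""
    return bias ** 2 + sd ** 2

# (bias, SD, reported MSE) copied from Table 1, delta = 0.10, n = 25.
rows = {
    "TSLS": {"beta1": (0.013, 0.232, 0.054), "beta2": (0.630, 1.514, 2.690)},
    "Ridge": {"beta1": (0.100, 0.138, 0.029), "beta2": (0.677, 0.403, 0.621)},
}

def combined_mse(estimator: str) -> float:
    """Sum of the implied MSEs of both coefficients for one estimator."""
    return sum(mse_from_bias_sd(b, s) for b, s, _ in rows[estimator].values())

for name, coefs in rows.items():
    for coef, (bias, sd, reported) in coefs.items():
        implied = mse_from_bias_sd(bias, sd)
        print(f"{name:5s} {coef}: implied MSE {implied:.3f}, reported {reported:.3f}")
    print(f"{name:5s} combined implied MSE: {combined_mse(name):.3f}")
```

In this cell the ridge estimate of $\beta_2$ has a slightly larger bias than TSLS but a much smaller standard deviation, which is what drives its lower combined MSE.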
Table 2. Estimates of $\hat{\beta}_1$ and $\hat{\beta}_2$ using the TSLS and ridge estimators for $\beta^p = (3/2, 3/2)'$. The prior is 3 standard deviations away from the true parameter value. The ridge estimator outperforms the TSLS estimator in terms of MSE values in a number of cases. In particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
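
| $\delta$ | $n$ | Estimator | $\hat{\beta}_1$ Bias | $\hat{\beta}_1$ SD | $\hat{\beta}_1$ MSE | $\hat{\beta}_2$ Bias | $\hat{\beta}_2$ SD | $\hat{\beta}_2$ MSE | $(\hat{\beta}_1, \hat{\beta}_2)$ MSE |
|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 25 | TSLS | 0.012 | 0.232 | 0.054 | 0.629 | 1.516 | 2.693 | 2.747 |
| | | Ridge | 0.090 | 0.222 | 0.058 | 1.050 | 0.895 | 1.903 | 1.961 |
| | 50 | TSLS | 0.006 | 0.189 | 0.036 | 0.546 | 1.425 | 2.328 | 2.363 |
| | | Ridge | 0.051 | 0.152 | 0.026 | 1.000 | 0.972 | 1.944 | 1.970 |
| | 250 | TSLS | −0.000 | 0.081 | 0.007 | 0.205 | 1.511 | 2.325 | 2.332 |
| | | Ridge | 0.012 | 0.074 | 0.006 | 0.686 | 1.196 | 1.902 | 1.908 |
| | 500 | TSLS | −0.000 | 0.041 | 0.002 | 0.058 | 0.764 | 0.588 | 0.589 |
| | | Ridge | 0.006 | 0.035 | 0.001 | 0.489 | 0.614 | 0.615 | 0.617 |
| 0.25 | 25 | TSLS | 0.008 | 0.216 | 0.047 | 0.324 | 1.160 | 1.451 | 1.498 |
| | | Ridge | 0.085 | 0.217 | 0.054 | 0.786 | 0.850 | 1.340 | 1.394 |
| | 50 | TSLS | 0.002 | 0.147 | 0.022 | 0.178 | 1.082 | 1.202 | 1.223 |
| | | Ridge | 0.046 | 0.127 | 0.018 | 0.629 | 0.718 | 0.910 | 0.928 |
| | 250 | TSLS | −0.001 | 0.047 | 0.002 | −0.003 | 0.297 | 0.088 | 0.091 |
| | | Ridge | 0.008 | 0.042 | 0.002 | 0.225 | 0.318 | 0.151 | 0.153 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | −0.000 | 0.188 | 0.035 | 0.036 |
| | | Ridge | 0.005 | 0.030 | 0.001 | 0.136 | 0.215 | 0.065 | 0.066 |
| 0.50 | 25 | TSLS | 0.002 | 0.199 | 0.040 | 0.053 | 0.748 | 0.562 | 0.602 |
| | | Ridge | 0.077 | 0.190 | 0.042 | 0.412 | 0.561 | 0.485 | 0.527 |
| | 50 | TSLS | −0.000 | 0.113 | 0.013 | 0.005 | 0.402 | 0.161 | 0.174 |
| | | Ridge | 0.041 | 0.108 | 0.013 | 0.263 | 0.364 | 0.201 | 0.215 |
| | 250 | TSLS | −0.001 | 0.045 | 0.002 | −0.001 | 0.130 | 0.017 | 0.019 |
| | | Ridge | 0.011 | 0.044 | 0.002 | 0.086 | 0.149 | 0.030 | 0.032 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | −0.000 | 0.091 | 0.008 | 0.009 |
| | | Ridge | 0.008 | 0.031 | 0.001 | 0.056 | 0.108 | 0.015 | 0.016 |
| 1.00 | 25 | TSLS | −0.002 | 0.162 | 0.026 | −0.003 | 0.244 | 0.060 | 0.086 |
| | | Ridge | 0.076 | 0.171 | 0.035 | 0.142 | 0.256 | 0.086 | 0.121 |
| | 50 | TSLS | 0.000 | 0.106 | 0.011 | 0.001 | 0.153 | 0.023 | 0.035 |
| | | Ridge | 0.048 | 0.108 | 0.014 | 0.093 | 0.162 | 0.035 | 0.049 |
| | 250 | TSLS | −0.001 | 0.045 | 0.002 | −0.000 | 0.064 | 0.004 | 0.006 |
| | | Ridge | 0.017 | 0.048 | 0.003 | 0.034 | 0.074 | 0.007 | 0.009 |
| | 500 | TSLS | 0.000 | 0.032 | 0.001 | −0.000 | 0.045 | 0.002 | 0.003 |
| | | Ridge | 0.012 | 0.034 | 0.001 | 0.024 | 0.053 | 0.003 | 0.005 |
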
Table 3. Distribution of the regularization parameter $\hat{\alpha}$. The mixed distribution observed in finite samples agrees with the nonstandard asymptotic distribution given in Theorem 1. The proportions of cases with “no regularization” ($\hat{\alpha} = 0$), “some regularization” ($\hat{\alpha} \in (0, 10^7)$), and “infinite regularization” ($\hat{\alpha} = 10^7 \approx \infty$) are presented. In all cases, a substantial mass of the distribution is concentrated at $\hat{\alpha} = 0$. On the other hand, there is no mass concentrated at $\hat{\alpha} \approx \infty$ except in very small samples of $n = 25$. As predicted by Theorem 1, the mass at $\hat{\alpha} = 0$ approaches $50\%$ asymptotically. Histograms for the large sample cases of $n = 10{,}000$ are presented in Figure 5.
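
| $\delta$ | $n$ | $\beta^p = (1/2, 1/2)'$: $\hat{\alpha} = 0$ | $\beta^p = (1/2, 1/2)'$: $\hat{\alpha} \in (0, 10^7)$ | $\beta^p = (1/2, 1/2)'$: $\hat{\alpha} = 10^7 \approx \infty$ | $\beta^p = (2, 2)'$: $\hat{\alpha} = 0$ | $\beta^p = (2, 2)'$: $\hat{\alpha} \in (0, 10^7)$ | $\beta^p = (2, 2)'$: $\hat{\alpha} = 10^7$ | $\beta^p = (3/2, 3/2)'$: $\hat{\alpha} = 0$ | $\beta^p = (3/2, 3/2)'$: $\hat{\alpha} \in (0, 10^7)$ | $\beta^p = (3/2, 3/2)'$: $\hat{\alpha} = 10^7$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 0.164 | 0.834 | 0.003 | 0.262 | 0.738 | 0.001 | 0.339 | 0.661 | 0.000 |
| | 50 | 0.166 | 0.834 | 0.000 | 0.275 | 0.725 | 0.000 | 0.354 | 0.646 | 0.000 |
| | 250 | 0.190 | 0.810 | 0.000 | 0.281 | 0.719 | 0.000 | 0.319 | 0.681 | 0.000 |
| | 500 | 0.220 | 0.780 | 0.000 | 0.285 | 0.715 | 0.000 | 0.302 | 0.698 | 0.000 |
| | 10,000 | 0.413 | 0.587 | 0.000 | 0.411 | 0.589 | 0.000 | 0.411 | 0.589 | 0.000 |
| 0.25 | 25 | 0.176 | 0.822 | 0.002 | 0.262 | 0.737 | 0.001 | 0.315 | 0.684 | 0.001 |
| | 50 | 0.184 | 0.816 | 0.000 | 0.263 | 0.737 | 0.000 | 0.299 | 0.701 | 0.000 |
| | 250 | 0.293 | 0.707 | 0.000 | 0.309 | 0.691 | 0.000 | 0.314 | 0.686 | 0.000 |
| | 500 | 0.346 | 0.654 | 0.000 | 0.354 | 0.646 | 0.000 | 0.354 | 0.646 | 0.000 |
| | 10,000 | 0.461 | 0.539 | 0.000 | 0.465 | 0.535 | 0.000 | 0.465 | 0.535 | 0.000 |
| 0.50 | 25 | 0.216 | 0.780 | 0.004 | 0.262 | 0.737 | 0.001 | 0.284 | 0.716 | 0.000 |
| | 50 | 0.255 | 0.745 | 0.000 | 0.284 | 0.716 | 0.000 | 0.294 | 0.706 | 0.000 |
| | 250 | 0.369 | 0.631 | 0.000 | 0.374 | 0.626 | 0.000 | 0.376 | 0.624 | 0.000 |
| | 500 | 0.412 | 0.588 | 0.000 | 0.415 | 0.585 | 0.000 | 0.417 | 0.583 | 0.000 |
| | 10,000 | 0.463 | 0.537 | 0.000 | 0.467 | 0.533 | 0.000 | 0.466 | 0.534 | 0.000 |
| 1.00 | 25 | 0.287 | 0.708 | 0.005 | 0.310 | 0.689 | 0.001 | 0.318 | 0.681 | 0.000 |
| | 50 | 0.333 | 0.667 | 0.000 | 0.346 | 0.654 | 0.000 | 0.351 | 0.649 | 0.000 |
| | 250 | 0.413 | 0.587 | 0.000 | 0.418 | 0.582 | 0.000 | 0.419 | 0.581 | 0.000 |
| | 500 | 0.439 | 0.561 | 0.000 | 0.443 | 0.557 | 0.000 | 0.442 | 0.558 | 0.000 |
| | 10,000 | 0.474 | 0.526 | 0.000 | 0.478 | 0.522 | 0.000 | 0.477 | 0.523 | 0.000 |
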
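
The three columns reported for each prior in Table 3 amount to sorting every simulated $\hat{\alpha}$ into one of three bins and reporting the share in each. A minimal sketch of that tabulation follows. The cap of $10^7$ comes from the table header; the function names and the example draws are hypothetical, and treating any value at or above the cap as "infinite" is an assumption about how the boundary is handled.

```python
from collections import Counter

ALPHA_CAP = 1e7  # value reported in Table 3 as "infinite" regularization

def classify_alpha(alpha_hat: float, cap: float = ALPHA_CAP) -> str:
    """Label one estimate as 'none', 'some', or 'infinite' regularization."""
    if alpha_hat < 0:
        raise ValueError("the ridge penalty is assumed to be non-negative")
    if alpha_hat == 0:
        return "none"
    if alpha_hat >= cap:  # assumption: the cap itself counts as infinite
        return "infinite"
    return "some"

def alpha_shares(alpha_hats, cap: float = ALPHA_CAP) -> dict:
    """Proportion of estimates falling in each of the three bins."""
    counts = Counter(classify_alpha(a, cap) for a in alpha_hats)
    total = len(alpha_hats)
    return {label: counts.get(label, 0) / total
            for label in ("none", "some", "infinite")}

# Hypothetical estimates, for illustration only (not simulation output).
draws = [0.0, 0.0, 0.3, 12.5, 1e7, 0.0, 4.0, 250.0]
shares = alpha_shares(draws)
print(shares)
```

Applied to the $N$ simulated values of $\hat{\alpha}$ for one $(\delta, n, \beta^p)$ cell, the three shares correspond to one triple of entries in a table row.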
Table 4. Summary statistics of the smallest singular value of the matrix $-X'Z/n$ for different precision parameter values $\delta$ and sample sizes $n$, using 10,000 samples each. As the precision parameter increases from $\delta = 0.1$ to $\delta = 1$, the mean of the smallest singular value increases. As the sample size increases from $n = 25$ to $n = 10{,}000$, the spread in the smallest singular value decreases.
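
| $\delta$ | $n$ | Mean | Std Dev | 1st Quartile | Median | 3rd Quartile |
|---|---|---|---|---|---|---|
| 0.10 | 25 | 0.25 | 0.14 | 0.14 | 0.23 | 0.33 |
| | 50 | 0.19 | 0.10 | 0.12 | 0.18 | 0.26 |
| | 250 | 0.12 | 0.05 | 0.08 | 0.12 | 0.16 |
| | 500 | 0.11 | 0.04 | 0.08 | 0.11 | 0.14 |
| | 2500 | 0.10 | 0.02 | 0.09 | 0.10 | 0.12 |
| | 5000 | 0.10 | 0.01 | 0.09 | 0.10 | 0.11 |
| | 10,000 | 0.10 | 0.01 | 0.09 | 0.10 | 0.11 |
| 0.25 | 25 | 0.32 | 0.17 | 0.19 | 0.30 | 0.42 |
| | 50 | 0.28 | 0.13 | 0.19 | 0.27 | 0.37 |
| | 250 | 0.26 | 0.07 | 0.21 | 0.26 | 0.30 |
| | 500 | 0.25 | 0.05 | 0.22 | 0.25 | 0.29 |
| | 2500 | 0.25 | 0.02 | 0.24 | 0.25 | 0.26 |
| | 5000 | 0.25 | 0.01 | 0.24 | 0.25 | 0.26 |
| | 10,000 | 0.25 | 0.01 | 0.24 | 0.25 | 0.26 |
| 0.50 | 25 | 0.50 | 0.21 | 0.35 | 0.48 | 0.63 |
| | 50 | 0.50 | 0.17 | 0.39 | 0.49 | 0.61 |
| | 250 | 0.50 | 0.08 | 0.45 | 0.50 | 0.55 |
| | 500 | 0.50 | 0.05 | 0.46 | 0.50 | 0.54 |
| | 2500 | 0.50 | 0.02 | 0.48 | 0.50 | 0.52 |
| | 5000 | 0.50 | 0.02 | 0.49 | 0.50 | 0.51 |
| | 10,000 | 0.50 | 0.01 | 0.49 | 0.50 | 0.51 |
| 1.00 | 25 | 0.86 | 0.27 | 0.67 | 0.85 | 1.03 |
| | 50 | 0.92 | 0.21 | 0.77 | 0.91 | 1.05 |
| | 250 | 0.98 | 0.10 | 0.91 | 0.98 | 1.05 |
| | 500 | 0.99 | 0.08 | 0.94 | 0.99 | 1.04 |
| | 2500 | 1.00 | 0.03 | 0.98 | 1.00 | 1.02 |
| | 5000 | 1.00 | 0.02 | 0.98 | 1.00 | 1.02 |
| | 10,000 | 1.00 | 0.02 | 0.99 | 1.00 | 1.01 |
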
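
The statistic summarized in Table 4 can be computed for a single sample as sketched below. This is only an illustration under stated assumptions: the data-generating design in `simulate` (two regressors, two instruments, $X = Z\,\mathrm{diag}(1, \delta) + \text{noise}$) is hypothetical and is not the paper's simulation design, and the closed-form routine applies only to a matrix with exactly two rows and at least two columns. The leading minus sign in $-X'Z/n$ does not affect the singular values.

```python
import math
import random

def cross_moment(X, Z):
    """Return A = -X'Z/n for X (n x k) and Z (n x m), given as lists of rows."""
    n = len(X)
    k, m = len(X[0]), len(Z[0])
    return [
        [-sum(X[i][a] * Z[i][b] for i in range(n)) / n for b in range(m)]
        for a in range(k)
    ]

def smallest_singular_value_2row(A):
    """Smallest singular value of a 2 x m matrix (m >= 2).

    The singular values of A are the square roots of the eigenvalues of
    the 2 x 2 matrix A A', which have a closed form.
    """
    r1, r2 = A
    g11 = sum(v * v for v in r1)
    g22 = sum(v * v for v in r2)
    g12 = sum(u * v for u, v in zip(r1, r2))
    trace, det = g11 + g22, g11 * g22 - g12 * g12
    disc = max(trace * trace - 4.0 * det, 0.0)
    lam_max = (trace + math.sqrt(disc)) / 2.0
    if lam_max == 0.0:
        return 0.0
    lam_min = max(det / lam_max, 0.0)  # uses det = lam_min * lam_max
    return math.sqrt(lam_min)

def simulate(n, delta, seed=0):
    """Hypothetical design: X = Z * diag(1, delta) + standard normal noise."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    X = [[z[0] + rng.gauss(0, 1), delta * z[1] + rng.gauss(0, 1)] for z in Z]
    return X, Z

X, Z = simulate(n=5000, delta=0.25)
sv_min = smallest_singular_value_2row(cross_moment(X, Z))
print(f"smallest singular value of -X'Z/n: {sv_min:.3f}")
```

Under this illustrative design the population value of $X'Z/n$ is $\mathrm{diag}(1, \delta)$, so the smallest singular value should be close to $\delta$ in large samples. Repeating the draw over many seeds and summarizing the results would give statistics of the kind reported in the table.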
Table 5. Effect of years of education on the log of weekly earnings.
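
| | 1920–1929 | 1930–1939 | 1940–1949 |
|---|---|---|---|
| OLS | 0.070 | 0.063 | 0.052 |
| TSLS | 0.058 | 0.099 | −0.073 |
| Ridge | 0.027 | 0.066 | 0.001 |
| First stage F-test | 38.3 | 30.5 | 26.3 |
| Overidentification, Basmann | 0.776 | 2.321 | 9.693 |
| {p-value} | {0.679} | {0.313} | {0.008} |
| $\lambda_{\min}$ | $3.3 \times 10^{-6}$ | $1.3 \times 10^{-5}$ | $6.6 \times 10^{-7}$ |
| Condition Number | $4.1 \times 10^{7}$ | $1.2 \times 10^{7}$ | $2.8 \times 10^{8}$ |
| Sample size, $n$ | 247,199 | 329,509 | 486,926 |
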
Table 6. Simulation results using returns to education data.

| Prior | Sample Size | Estimator | Bias | SD | RMSE |
|---|---|---|---|---|---|
| $\beta^{p}_{edu} = \frac{1}{2} \hat{\beta}^{IV}_{edu,train}$ | 1000 | TSLS | −0.032 | 0.185 | 0.188 |
| | | Ridge | −0.027 | 0.164 | **0.166** |
| | 100,000 | TSLS | −0.001 | 0.039 | **0.039** |
| | | Ridge | −0.0003 | 0.058 | 0.058 |
| $\beta^{p}_{edu} = \hat{\beta}^{IV}_{edu,train}$ | 1000 | TSLS | −0.032 | 0.192 | 0.194 |
| | | Ridge | −0.026 | 0.166 | **0.168** |
| | 100,000 | TSLS | −0.002 | 0.039 | **0.039** |
| | | Ridge | −0.004 | 0.062 | 0.062 |
| $\beta^{p}_{edu} = 2 \hat{\beta}^{IV}_{edu,train}$ | 1000 | TSLS | −0.036 | 0.181 | 0.184 |
| | | Ridge | −0.023 | 0.134 | **0.136** |
| | 100,000 | TSLS | −0.001 | 0.038 | **0.038** |
| | | Ridge | −0.007 | 0.050 | 0.051 |