Article

Inference for the Linear IV Model Ridge Estimator Using Training and Test Samples

Fallaw Sowell 1 and Nandana Sengupta 2
1 Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA
2 School of Public Policy, Indian Institute of Technology Delhi, New Delhi 110016, India
* Author to whom correspondence should be addressed.
Stats 2021, 4(3), 725-744; https://doi.org/10.3390/stats4030043
Submission received: 22 July 2021 / Revised: 28 August 2021 / Accepted: 30 August 2021 / Published: 3 September 2021
(This article belongs to the Special Issue Ridge Regression, Liu and Related Estimators)

Abstract

The asymptotic distribution is presented for the linear instrumental variables model estimated with a ridge penalty and a prior where the tuning parameter is selected with a holdout sample. The structural parameters and the tuning parameter are estimated jointly by method of moments. A chi-squared statistic permits confidence regions for the structural parameters. The form of the asymptotic distribution provides insights on the optimal way to perform the split between the training and test sample. Results for the linear regression estimated by ridge regression are presented as a special case.

1. Introduction

This paper contributes to the asymptotic distribution theory for ridge parameter estimates for the linear instrumental variables model. The tuning parameter for the ridge penalty, denoted α , is selected by splitting the data into a training sample and a test (or holdout) sample. In [1], the ridge penalty parameter is estimated jointly with the structural parameters and the asymptotic distribution is characterized as the projection of a stochastic process onto a cone. This gives the rate of convergence to the asymptotic distribution but does not provide guidance for inference. To allow inference, the closed form for the asymptotic distribution is presented. These new results allow for the calculation of confidence regions for the structural parameters. When the prior is equal to the population parameter value of the structural parameters, the tuning parameter is not identified. However, the structural parameter estimates are consistent and the asymptotic covariance is smaller than the asymptotic covariance of the two-stage least squares estimator.
A fundamental issue with ridge regression is the selection of the tuning parameter and its resulting impact on inference for the parameters of interest. One approach is to select a deterministic function of the sample size that shrinks to zero fast enough not to impact the asymptotic distribution. The resulting asymptotic distribution is then equivalent to the OLS asymptotic distribution [2]. An alternative approach is to select the tuning parameter with the observed data. The literature contains multiple ways to estimate the tuning parameter [3,4]. For these estimates, inference typically follows the approach stated in [5]. Conditional on a fixed $\alpha$, the ridge estimator's covariance is a function of $\alpha$. The value of $\alpha$ is selected using the observed data and substituted into the covariance that was calculated assuming that $\alpha$ is fixed. This covariance is used to create a test statistic. In [5], the resulting tests are correctly referred to as approximate t-tests because the authors appreciate the internal inconsistency of using a covariance obtained assuming a fixed $\alpha$ with an $\alpha$ value estimated from the observed data. (This problem is well known in the literature; see [6,7,8,9,10].) Each estimate of the tuning parameter leads to a different approximate t-test, and these are typically compared using simulations. This has been the approach for the past 20 years [5,11]. ("When the ridge parameter k is determined from the data, the above arguments are no longer valid. Hence, to investigate the size and power of the approximate ridge-based t-type tests in such cases, a Monte Carlo simulation study is conducted" [5]. "Since a theoretical assessment among the test statistics is not possible, a simulation study has been conducted to evaluate the performance of the suggested test statistics" [11].) For other models, researchers have proposed alternative procedures to obtain hyperparameter (tuning parameter) estimates [12,13]. Like the previous ridge regression literature, these approaches have relied on simulations to demonstrate their behavior. For inference, these procedures would need to be extended to establish their asymptotic distributions. A third approach is followed in this paper. The tuning parameter is selected by splitting the sample into training and test samples. The tuning parameter defines a path from the prior to the IV estimator on the training sample. On this path, the tuning parameter is selected to minimize the prediction error on the test sample. This procedure is written as a method of moments estimation problem where the tuning parameter and the parameters of interest are simultaneously estimated. Inference is then performed using the joint asymptotic distribution.
A related literature concerns the distribution of some empirically selected ridge tuning parameters [14,15,16]. These approaches have relied on strong distribution assumptions (e.g., normal error). In addition, they are built on tuning parameters as functions of the data, where the functions are determined by assuming that the tuning parameter is fixed. This leads to an inconsistency because using the data to select the tuning parameter means that the tuning parameter is no longer fixed. In this paper, the inconsistency is avoided by estimating the structural parameters and the ridge tuning parameter simultaneously. Additionally, the method of moments framework permits weaker assumptions.
In [1], the asymptotic joint distribution for the parameters in the linear model and the ridge tuning parameter is characterized as the projection of a stochastic process onto a cone. This structure occurs because the probability limit of the ridge tuning parameter is on the boundary of the parameter space. (This is the familiar problem of consistently estimating a population parameter that is on the boundary of the parameter space [17,18,19,20].) This leads to a nonstandard asymptotic distribution that depends on the prior and the population parameter value of the structural parameters. When the prior is different from the population parameter value, the asymptotic distribution for the ridge tuning parameter is a mixture with a discrete mass of 1/2 at zero and a truncated normal over the positive reals. In addition, the asymptotic distribution for the structural parameters is normal with a nonzero mean. This mean and variance both contain the population parameter value. This prevents the calculation of t-statistics for individual parameter estimates. However, a hypothesis for the entire set of structural parameters can be tested using a chi-square test, and this statistic can be inverted to give accurate confidence regions.

2. Ridge Estimator for Linear IV Model Using a Holdout Sample

Consider the linear instrumental variables model where $Y$ is $n \times 1$, $X$ is $n \times k$, and $Z$ is $n \times m$ with $m \ge k$:

$$Y = X \beta_0 + \varepsilon \tag{1}$$

$$X = Z \Gamma_0 + u \tag{2}$$

where the $m \times 1$ instruments $z_i$ are iid with full-rank $m \times m$ second moments $R_z = E[z_i z_i']$ and, conditional on $Z$,

$$\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid\left( 0, \begin{bmatrix} \sigma_\varepsilon^2 & \Sigma_{\varepsilon u} \\ \Sigma_{u\varepsilon} & \Sigma_u \end{bmatrix} \right).$$

The IV, or 2SLS, estimator is

$$\hat{\beta}_{IV} = \arg\min_\beta \frac{1}{2n}(Y - X\beta)' Z (Z'Z)^{-1} Z' (Y - X\beta) = (X' P_Z X)^{-1} X' P_Z Y \tag{4}$$

where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix for $Z$, and it has the asymptotic distribution $\sqrt{n}\,(\hat\beta_{IV} - \beta_0) \stackrel{a}{\sim} N\!\left(0, \sigma_\varepsilon^2 \left(\Gamma_0' R_z \Gamma_0\right)^{-1}\right)$.
Let $\frac{X' P_Z X}{n}$ have the spectral decomposition $CDC'$, where $C$ is orthonormal, i.e., $C'C = I_k$, and $D$ is a positive semidefinite diagonal $k \times k$ matrix of eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$. When some of the eigenvectors explain very little variation, i.e., the corresponding eigenvalues have small magnitudes, the objective function is flatter along these dimensions and the resulting covariance estimates are larger because the variance of $\hat\beta_{IV}$ is proportional to $\left(\frac{X' P_Z X}{n}\right)^{-1} = (CDC')^{-1} = C D^{-1} C'$. This leads to a relatively large MSE. The ridge estimator addresses this by shrinking the estimated parameters towards a prior. The ridge objective function augments the usual IV objective function (4) with a quadratic penalty centered at a prior, $\beta_p$, weighted by a regularization tuning parameter $\alpha$:

$$\frac{1}{2n}(Y - X\beta)' P_Z (Y - X\beta) + \frac{1}{2}\alpha\,(\beta - \beta_p)'(\beta - \beta_p).$$

Conditional on $\alpha$, the ridge solution is

$$\hat\beta_{IV}(\alpha) = \left( \frac{X' P_Z X}{n} + \alpha I_k \right)^{-1} \left( \frac{X' P_Z Y}{n} + \alpha \beta_p \right).$$
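As a concrete illustration (not code from the paper), the following sketch evaluates the ridge IV solution above numerically; the helper name `ridge_iv` and the linear-algebra choices are our own.

```python
import numpy as np

def ridge_iv(Y, X, Z, alpha, beta_p):
    """Ridge IV estimator: shrink the 2SLS estimator toward the prior beta_p.

    Computes (X'P_Z X/n + alpha I_k)^{-1} (X'P_Z Y/n + alpha beta_p).
    alpha = 0 returns the 2SLS estimator."""
    n, k = X.shape
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # P_Z X without forming P_Z explicitly
    A = X.T @ PZX / n + alpha * np.eye(k)         # X'P_Z X / n + alpha I_k
    b = PZX.T @ Y / n + alpha * beta_p            # X'P_Z Y / n + alpha beta_p
    return np.linalg.solve(A, b)
```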
Different values of $\alpha$ result in different estimates of $\beta_0$. An optimal value for $\alpha$ can be determined empirically by splitting the data into training and test samples. The training sample is a randomly drawn sample of $[\tau n]$ observations, denoted $Y_{\tau n}$, $X_{\tau n}$, and $Z_{\tau n}$. The estimate using the training sample, conditional on $\alpha$, is

$$\hat\beta_{tr}(\alpha) \equiv \arg\min_\beta \frac{1}{2[\tau n]}\left( Y_{\tau n} - X_{\tau n}\beta \right)' P_{Z_{\tau n}}\left( Y_{\tau n} - X_{\tau n}\beta \right) + \frac{\alpha}{2}(\beta - \beta_p)'(\beta - \beta_p)$$

$$= \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k \right)^{-1}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha \beta_p \right)$$

where $P_{Z_{\tau n}}$ is the projection matrix onto $Z_{\tau n}$ and $[\,\cdot\,]$ is the greatest integer function. The optimal $\alpha$ is selected to minimize the IV least squares objective function over the remaining $(n - [\tau n])$ observations, i.e., the test or holdout sample, denoted $Y_{n(1-\tau)}$, $X_{n(1-\tau)}$, and $Z_{n(1-\tau)}$. The estimated tuning parameter is defined by $\hat\alpha = \arg\min_{\alpha \in [0,\infty)} Q_{n(1-\tau)}(\alpha)$ where

$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [n\tau])}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right)' P_{Z_{n(1-\tau)}}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right)$$

and $P_{Z_{n(1-\tau)}}$ is the projection matrix onto $Z_{n(1-\tau)}$. The ridge regression estimate $\hat\beta_{\hat\alpha} \equiv \hat\beta_{IV}(\hat\alpha)$ is then characterized by

$$-\frac{1}{n} X' P_Z \left( Y - X \hat\beta_{\hat\alpha} \right) + \hat\alpha\,(\hat\beta_{\hat\alpha} - \beta_p) = 0.$$
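Continuing the sketch above, a minimal version of the holdout selection: the training estimator is computed for each candidate $\alpha$ and the holdout criterion is minimized over a grid. The grid and the deterministic first-$[\tau n]$ split are our simplifications (the paper draws the training sample randomly), and `select_alpha` reuses the hypothetical `ridge_iv` helper.

```python
def select_alpha(Y, X, Z, beta_p, tau=0.5, grid=None):
    """Choose alpha_hat by minimizing the holdout IV objective Q_{n(1-tau)},
    then refit the ridge IV estimator on the full sample."""
    n = len(Y)
    m = int(tau * n)                              # [tau n] training observations
    Yt, Xt, Zt = Y[:m], X[:m], Z[:m]              # training sample
    Yh, Xh, Zh = Y[m:], X[m:], Z[m:]              # test (holdout) sample
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-6, 2, 200)))  # include alpha = 0
    def Q(alpha):
        e = Yh - Xh @ ridge_iv(Yt, Xt, Zt, alpha, beta_p)
        Pe = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T @ e)  # project residuals onto Z_h
        return (e @ Pe) / (2 * (n - m))
    alpha_hat = min(grid, key=Q)
    return alpha_hat, ridge_iv(Y, X, Z, alpha_hat, beta_p)
```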
Ref. [1] showed how the asymptotic distribution of the ridge estimator can be determined with the method of moments framework using the parameterization
$$\theta = \begin{bmatrix} \operatorname{vech}(R_\tau) \\ \operatorname{vech}(R_{(1-\tau)}) \\ \operatorname{vec}(S_\tau) \\ \operatorname{vec}(S_{(1-\tau)}) \\ \beta_{tr} \\ \alpha \\ \beta \end{bmatrix}$$

where $\operatorname{vec}(\cdot)$ stacks the elements of a matrix into a column vector and $\operatorname{vech}(\cdot)$ stacks the unique elements of a symmetric matrix into a column vector. The population parameter values are

$$\theta_0 = \begin{bmatrix} \operatorname{vech}(R_z) \\ \operatorname{vech}(R_z) \\ \operatorname{vec}(R_z \Gamma_0) \\ \operatorname{vec}(R_z \Gamma_0) \\ \beta_0 \\ 0 \\ \beta_0 \end{bmatrix}.$$

The ridge estimator is part of the parameter estimates defined by the just identified system of equations $H_n(\theta) = \frac{1}{n}\sum_{i=1}^n h_i(\theta) = 0$ where

$$h_i(\theta) = \begin{bmatrix} 1_\tau(i)\operatorname{vech}(R_\tau - z_i z_i') \\ (1 - 1_\tau(i))\operatorname{vech}(R_{(1-\tau)} - z_i z_i') \\ 1_\tau(i)\operatorname{vec}(S_\tau - z_i x_i') \\ (1 - 1_\tau(i))\operatorname{vec}(S_{(1-\tau)} - z_i x_i') \\ 1_\tau(i)\left( -S_\tau' R_\tau^{-1} z_i (y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p) \right) \\ (1 - 1_\tau(i))\,(y_i - x_i'\beta_{tr})\, z_i' R_{(1-\tau)}^{-1} S_{(1-\tau)} \left( S_\tau' R_\tau^{-1} S_\tau + \alpha I_k \right)^{-1} (\beta_p - \beta_{tr}) \\ -\left(\tau S_\tau + (1-\tau) S_{(1-\tau)}\right)' \left(\tau R_\tau + (1-\tau) R_{(1-\tau)}\right)^{-1} z_i (y_i - x_i'\beta) + \alpha(\beta - \beta_p) \end{bmatrix} \tag{9}$$

and the training and test samples are determined with the indicator function

$$1_\tau(i) = \begin{cases} 1, & i \le [\tau n] \\ 0, & [\tau n] < i. \end{cases}$$
Using the structure of Equation (9), the system $H_n(\theta) = \frac{1}{n}\sum_{i=1}^n h_i(\theta) = 0$ can be seen as seven sets of equations. The first four sets are each self-contained systems with equal numbers of equations and parameters. The fifth set has $k$ equations and introduces the $k$ parameters $\beta_{tr}$. The sixth is a single equation that introduces the parameter $\alpha$. The seventh introduces the final $k$ parameters, $\beta$. Identification occurs because the expectation of the gradient is invertible. This is presented in Appendix A.

3. Asymptotic Behavior

The asymptotic distribution is derived with four high-level assumptions.
Assumption 1.
$z_i$ is iid with finite fourth moments and $E[z_i z_i'] = R_z$ has full rank.
Assumption 2.
Conditional on $Z$, the $(\varepsilon_i, u_i')'$ are iid vectors with zero mean and a full-rank covariance matrix with possibly nonzero off-diagonal elements.
Assumptions 1 and 2 imply $E[h_i(\theta_0)] = 0$ and that $\sqrt{n}\,H_n(\theta_0)$ satisfies the CLT.
Assumption 3.
The parameter space $\Theta$ is defined as follows: $R_z$ is restricted to symmetric positive definite matrices with eigenvalues $1/B_1 \le e_1 \le e_2 \le \cdots \le e_m \le B_1$; $|\beta_j| \le B_2$ for $j = 1, 2, \ldots, k$; $\Gamma_0 = [\gamma_{\ell,j}]$ is of full rank with $|\gamma_{\ell,j}| \le B_3$ for $\ell = 1,\ldots,m$, $j = 1,2,\ldots,k$; and $\alpha \in [0, B_4]$, where $B_1$, $B_2$, $B_3$, and $B_4$ are positive and finite.
Assumption 4.
The fraction of the sample used for training satisfies $0 < \tau < 1$.
The tuning parameter selected using a holdout sample converges to zero at a root-n rate when the prior is different from the population parameter value. When the prior is equal to the population parameter value, the tuning parameter is not identified.
Lemma 1.
Assumptions 1–4 imply, when $\beta_p \ne \beta_0$, (1) $\hat\alpha \stackrel{p}{\to} 0$ and (2) $\sqrt{n}\,\hat\alpha = O_p(1)$; and, when $\beta_p = \beta_0$, $\hat\alpha$ converges in distribution to a draw from the distribution of $\alpha_{\min}$,

$$\alpha_{\min} \equiv \arg\min_{a \in [0,\infty)} \left( \mathcal{Z}_{(1-\tau)} - R_z^{1/2}\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + a I_k \right)^{-1}\Gamma_0' R_z^{1/2}\mathcal{Z}_\tau \right)' \left( \mathcal{Z}_{(1-\tau)} - R_z^{1/2}\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + a I_k \right)^{-1}\Gamma_0' R_z^{1/2}\mathcal{Z}_\tau \right)$$

where $\mathcal{Z}_{(1-\tau)}$ and $\mathcal{Z}_\tau$ are iid $N(0, I_m)$ and $R_z^{1/2}$ is the symmetric matrix square root of $R_z$.
Proofs are given in Appendix A. The a-min distribution with $m \times k$ matrix parameter $S$ is characterized by

$$\arg\min_{a \in [0,\infty)}\left( \mathcal{Z}_1 - S\left( S'S + a I_k \right)^{-1} S' \mathcal{Z}_2 \right)'\left( \mathcal{Z}_1 - S\left( S'S + a I_k \right)^{-1} S'\mathcal{Z}_2 \right)$$

where $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are iid $N(0, I_m)$. When $\beta_p = \beta_0$, $\hat\alpha$ converges in distribution to a draw from the a-min distribution with parameter $R_z^{1/2}\Gamma_0$.
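The a-min distribution has no standard closed form, but it is straightforward to simulate directly from the definition above. A sketch (the function name, grid, and grid-based minimizer are our own simplifications):

```python
import numpy as np

def a_min_draws(S, n_draws=10_000, seed=0):
    """Simulate draws from the a-min distribution with m x k matrix parameter S."""
    rng = np.random.default_rng(seed)
    m, k = S.shape
    grid = np.concatenate(([0.0], np.logspace(-4, 4, 400)))  # candidate values of a
    draws = np.empty(n_draws)
    for j in range(n_draws):
        z1 = rng.standard_normal(m)                          # Z_1 ~ N(0, I_m)
        z2 = rng.standard_normal(m)                          # Z_2 ~ N(0, I_m)
        def obj(a):
            r = z1 - S @ np.linalg.solve(S.T @ S + a * np.eye(k), S.T @ z2)
            return r @ r
        draws[j] = min(grid, key=obj)                        # argmin over [0, infinity)
    return draws
```

For example, draws produced this way can be plugged into the conditional covariance of Theorem 3 below to simulate the unconditional distribution of the estimator when $\beta_p = \beta_0$.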
When $\beta_p = \beta_0$, $\alpha$ is no longer identified. Recall that $\alpha$ parameterizes a path from the prior to the IV estimator on the training sample. When the prior equals $\beta_0$ and the IV estimator is consistent for $\beta_0$, every value of $\alpha$ is associated with a consistent estimator of $\beta_0$.
Lemma 1 implies that the probability limit of the tuning parameter is zero, $\alpha_0 = 0$, which is on the boundary of the parameter space. This results in a nonstandard asymptotic distribution, characterized by the projection of a random vector onto a cone (denoted $\Lambda$) that allows the sample estimate to be on the boundary of the parameter space. The estimation objective function can be expanded into a quadratic approximation about the centered and scaled population parameter values:
$$\begin{aligned} n H_n(\theta)' H_n(\theta) &= n H_n(\theta_0)' H_n(\theta_0) + 2 n H_n(\theta_0)' \frac{\partial H_n(\theta_0)}{\partial \theta'}(\theta - \theta_0) + n (\theta - \theta_0)' \frac{\partial H_n(\theta_0)'}{\partial \theta}\frac{\partial H_n(\theta_0)}{\partial \theta'}(\theta - \theta_0) + o_p(1) \\ &= n\left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'}(\theta - \theta_0)\right)'\left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'}(\theta - \theta_0)\right) + o_p(1) \\ &= \left( \left(\frac{\partial H_n(\theta_0)}{\partial \theta'}\right)^{-1}\sqrt{n}\, H_n(\theta_0) + \sqrt{n}\,(\theta - \theta_0)\right)' \frac{\partial H_n(\theta_0)'}{\partial \theta}\frac{\partial H_n(\theta_0)}{\partial \theta'}\left( \left(\frac{\partial H_n(\theta_0)}{\partial \theta'}\right)^{-1}\sqrt{n}\, H_n(\theta_0) + \sqrt{n}\,(\theta - \theta_0)\right) + o_p(1). \end{aligned}$$

This suggests that selecting $\hat\theta$ to minimize $H_n(\theta)' H_n(\theta)$ results in the asymptotic distribution of $\sqrt{n}\,(\hat\theta - \theta_0)$ being equivalent to the distribution of $\arg\min_{\lambda \in \Lambda}(\mathcal{Z} + \lambda)' M_0' M_0 (\mathcal{Z} + \lambda)$, where the random variable is defined as

$$\mathcal{Z} = \lim_{n\to\infty} E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'}\right]^{-1}\sqrt{n}\, H_n(\theta_0), \qquad M_0 = E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right],$$

and the cone is defined by $\Lambda \equiv \left\{ \lambda \in \mathbb{R}^{m(m+1) + 2km + 2k + 1} : \lambda_{m(m+1)+2km+k+1} \ge 0 \right\}$, i.e., the element corresponding to $\alpha$ is restricted to be non-negative. The estimator is defined as $\hat\theta = \arg\min_{\theta \in \Theta} H_n(\theta)' H_n(\theta)$ and its asymptotic distribution is characterized in Theorem 1 of [1]. For continuity of presentation, the theorem is repeated here.
Theorem 1.
Assumptions 1–4 imply that, when $\beta_p \ne \beta_0$, the asymptotic distribution of $\sqrt{n}\,(\hat\theta - \theta_0)$ is equivalent to the distribution of

$$\hat\lambda = \arg\min_{\lambda \in \Lambda} (\mathcal{Z} + \lambda)' M_0' M_0 (\mathcal{Z} + \lambda).$$
The objective function can be minimized at a value of the tuning parameter in $(0,\infty)$ or at $\alpha = 0$. The asymptotic distribution of the tuning parameter is therefore composed of two parts: a discrete mass at $\alpha = 0$ and a continuous distribution over $(0,\infty)$. The asymptotic distribution is characterized as the projection of a stochastic process onto a cone. The special structure of the ridge estimator using a holdout sample permits the calculation of the closed form of the asymptotic distribution for the parameters of interest; see Theorem 1 in [17], case 2 after Theorem 2 in [18], Section 3.8 in [19], and Theorem 5 in [20].
Theorem 2.
Assumptions 1–4 imply, when $\beta_p \ne \beta_0$,
(i) $\sqrt{n}\,(\hat\beta - \beta_0)$ converges in distribution to a draw from a normal distribution with mean

$$-\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}}$$

and covariance

$$\sigma_\varepsilon^2\left( \Gamma_0' R_z \Gamma_0\right)^{-1} + \frac{1}{\tau(1-\tau)}\,\frac{\sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}}{(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)},$$

and
(ii) $\sqrt{n}\,\hat\alpha$ will converge in distribution to a mixture distribution with a discrete mass of 1/2 at zero and, over $[0,\infty)$, a truncated normal distribution with zero mean and variance

$$\frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}.$$
When the prior is different from the population parameter value, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of 2SLS. However, this is restricted to only one dimension. The asymptotic MSE is

$$\sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1} + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\,\frac{\sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}}{(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)} = \sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}\left( I_k + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\, P_{\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}(\beta_0-\beta_p)} \right)\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}.$$

This is the MSE for the 2SLS estimator plus a term built on $P_{\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}(\beta_0-\beta_p)}$, the projection matrix for $\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}(\beta_0-\beta_p)$. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the 2SLS estimator, except in the dimension of $\left(\Gamma_0' R_z \Gamma_0\right)^{-1/2}(\beta_0-\beta_p)$. Because $\frac{1}{\tau(1-\tau)}$ takes its minimum at $\tau = 0.5$, the optimal sample split to minimize the asymptotic bias, variance, and MSE is an equal split between the training and the testing (or holdout) sample.
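As a one-line check of the optimal-split claim (our restatement, by the AM–GM inequality):

$$\tau(1-\tau) \le \left( \frac{\tau + (1-\tau)}{2} \right)^2 = \frac{1}{4} \quad\Longrightarrow\quad \frac{1}{\tau(1-\tau)} \ge 4,$$

with equality exactly at $\tau = 1/2$. The excess variance scales with $\frac{1}{\tau(1-\tau)}$ and the bias with $\sqrt{\frac{1}{\tau(1-\tau)}}$, so both are minimized simultaneously by the equal split.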
Because the population parameter value enters both the asymptotic bias and the asymptotic variance, it is not possible to determine individual t-statistics for the parameters. However, under the null hypothesis $H_0: \beta = \beta_0$, the statistic

$$n\left( (\hat\beta - \beta_0) + \left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}} \right)' \left( \sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1} + \frac{1}{\tau(1-\tau)}\frac{\sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}}{(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)} \right)^{-1} \left( (\hat\beta - \beta_0) + \left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}} \right)$$

will converge in distribution to a chi-square with $k$ degrees of freedom. This statistic can be inverted to create accurate confidence regions.
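A sketch of a feasible version of this statistic, replacing $\sigma_\varepsilon^2$ and $\Gamma_0' R_z \Gamma_0$ with natural sample plug-ins; the helper name and these plug-in choices are ours, not the paper's prescription.

```python
import numpy as np
from scipy import stats

def ridge_iv_chi2(beta_hat, beta_0, Y, X, Z, beta_p, tau=0.5):
    """Feasible chi-square statistic for H0: beta = beta_0 (Theorem 2 form)."""
    n, k = X.shape
    G = X.T @ (Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)) / n  # plug-in for Gamma_0' R_z Gamma_0
    Ginv = np.linalg.inv(G)
    resid = Y - X @ beta_hat
    s2 = resid @ resid / n                                 # plug-in for sigma_eps^2
    d = beta_0 - beta_p
    q = d @ Ginv @ d                                       # (b0-bp)' G^{-1} (b0-bp)
    bias = Ginv @ d * np.sqrt(s2 / (n * 2 * np.pi * tau * (1 - tau) * q))
    V = s2 * Ginv + (s2 / (tau * (1 - tau))) * np.outer(Ginv @ d, Ginv @ d) / q
    u = (beta_hat - beta_0) + bias
    stat = n * u @ np.linalg.solve(V, u)
    return stat, 1 - stats.chi2.cdf(stat, df=k)            # statistic and p-value
```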
The asymptotic behavior is different when $\beta_p = \beta_0$.
Theorem 3.
Assumptions 1–4 imply that, when $\beta_p = \beta_0$, $\sqrt{n}\,(\hat\beta - \beta_0)$, conditional on $\hat\alpha$, converges in distribution to

$$N\left( 0,\; \sigma_\varepsilon^2\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k \right)^{-1}\Gamma_0' R_z \Gamma_0\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k\right)^{-1} \right)$$

where $\hat\alpha$ converges in distribution to a draw from an a-min distribution with parameter $S = R_z^{1/2}\Gamma_0$.
In the unlikely event that the prior is selected equal to the population parameter value, the asymptotic covariance is smaller than or equal to the 2SLS asymptotic covariance. In terms of implementation, the covariance and bias associated with $\beta_p \ne \beta_0$ should be used, because they are asymptotically correct for all priors except $\beta_p = \beta_0$, where they lead to conservative confidence regions.

Linear Regression

A special case is the linear regression model where $Y$ is $n \times 1$ and $X$ is $n \times k$:

$$Y = X\beta_0 + \varepsilon$$

with full-rank $k \times k$ second moments $R_x = E[x_i x_i']$ and, conditional on $X$, $\varepsilon_i \sim iid(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon^2 < \infty$. The estimation equations for the ridge regression estimate, where the tuning parameter $\alpha \ge 0$ is selected with a holdout sample, can be written in the method of moments framework using the parameterization

$$\theta = \begin{bmatrix} \operatorname{vech}(R_{x\tau}) \\ \beta_{tr} \\ \alpha \\ \beta \end{bmatrix} \quad\text{with}\quad \theta_0 = \begin{bmatrix} \operatorname{vech}(R_x) \\ \beta_0 \\ 0 \\ \beta_0 \end{bmatrix}.$$

The ridge estimator is part of the parameter estimates defined by the just identified system of equations $H_n(\theta) = \frac{1}{n}\sum_{i=1}^n h_i(\theta) = 0$, where

$$h_i(\theta) = \begin{bmatrix} 1_\tau(i)\operatorname{vech}(R_{x\tau} - x_i x_i') \\ 1_\tau(i)\left( -x_i(y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p) \right) \\ (1 - 1_\tau(i))\,(y_i - x_i'\beta_{tr})\,x_i'\left( R_{x\tau} + \alpha I_k \right)^{-1}(\beta_p - \beta_{tr}) \\ -x_i(y_i - x_i'\beta) + \alpha(\beta - \beta_p) \end{bmatrix}.$$
Along with Assumption 4, the following three assumptions are sufficient to obtain the asymptotic results.
Assumption 5.
$x_i$ is iid with finite fourth moments and $E[x_i x_i'] = R_x$ has full rank.
Assumption 6.
Conditional on $X$, $\varepsilon_i \sim iid(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon^2 < \infty$.
Assumption 7.
The parameter space $\Theta$ is defined as follows: $R_x$ is restricted to symmetric positive definite matrices with eigenvalues $1/B_1 \le e_1 \le e_2 \le \cdots \le e_k \le B_1$; $|\beta_j| \le B_2$ for $j = 1, 2, \ldots, k$; and $\alpha \in [0, B_3]$, where $B_1$, $B_2$, and $B_3$ are positive and finite.
Lemma 2 gives the rate of convergence for the tuning parameter when the prior is different from the population parameter value and characterizes its asymptotic distribution when the prior is equal to the population parameter value.
Lemma 2.
Assumptions 4–7 imply
(i) if $\beta_p \ne \beta_0$, (1) $\hat\alpha \stackrel{p}{\to} 0$ and (2) $\sqrt{n}\,\hat\alpha = O_p(1)$; (ii) if $\beta_p = \beta_0$, $\hat\alpha$ converges in distribution to a draw from the a-min distribution with parameter $S = R_x^{1/2}$, the symmetric matrix square root of $R_x$.
The asymptotic distribution of $\sqrt{n}\,(\hat\theta - \theta_0)$ is equivalent to the distribution of $\arg\min_{\lambda \in \Lambda}(\mathcal{Z} + \lambda)' M_0' M_0 (\mathcal{Z} + \lambda)$ where the random variable is defined as

$$\mathcal{Z} = \lim_{n\to\infty} E\left[\frac{\partial H_n(\theta_0)}{\partial \theta'}\right]^{-1}\sqrt{n}\, H_n(\theta_0), \qquad M_0 = E\left[\frac{\partial H_n(\theta_0)}{\partial \theta'}\right],$$

and the cone is defined by $\Lambda \equiv \left\{ \lambda \in \mathbb{R}^{k(k+1)/2 + 2k + 1} : \lambda_{k(k+1)/2 + k + 1} \ge 0 \right\}$. The estimator is defined as

$$\hat\theta = \arg\min_{\theta \in \Theta} H_n(\theta)' H_n(\theta).$$
Theorem 4.
Assumptions 4–7 imply, when $\beta_p \ne \beta_0$, the asymptotic distribution of $\sqrt{n}\,(\hat\theta - \theta_0)$ is equivalent to the distribution of

$$\hat\lambda = \arg\min_{\lambda \in \Lambda}(\mathcal{Z} + \lambda)' M_0' M_0 (\mathcal{Z} + \lambda).$$
This theorem characterizes the asymptotic distribution of the estimator as the projection of a stochastic process onto a cone. The special structure of this problem allows for the analytic derivation of the asymptotic distribution.
Theorem 5.
Assumptions 4–7 imply, when $\beta_p \ne \beta_0$,
(i) $\sqrt{n}\,(\hat\beta - \beta_0)$ is asymptotically normally distributed with mean

$$-R_x^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)}}$$

and covariance

$$\sigma_\varepsilon^2 R_x^{-1} + \frac{1}{\tau(1-\tau)}\,\frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)' R_x^{-1}}{(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)},$$

and
(ii) $\sqrt{n}\,\hat\alpha$ asymptotically has a mixture distribution with a discrete mass of 1/2 at zero and, over $[0,\infty)$, a truncated normal distribution with zero mean and variance

$$\frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)}.$$
When $\beta_p \ne \beta_0$, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of the OLS estimator. However, this is only in one dimension. The asymptotic MSE is

$$\sigma_\varepsilon^2 R_x^{-1} + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\,\frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)' R_x^{-1}}{(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)}.$$

This is the MSE for the OLS estimator plus a constant times the projection matrix for $R_x^{-1/2}(\beta_0-\beta_p)$. The ridge estimator using the holdout sample has the same bias, variance, and MSE as the OLS estimator except in the dimension of $R_x^{-1/2}(\beta_0-\beta_p)$. To minimize the bias, variance, and MSE of the estimator, $\tau = 0.5$ should be selected. Because the population parameter value enters both the asymptotic bias and the asymptotic variance, individual t-statistics are not available. However, under the null hypothesis $H_0: \beta = \beta_0$, the statistic

$$n\left( (\hat\beta - \beta_0) + R_x^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)}} \right)'\left( \sigma_\varepsilon^2 R_x^{-1} + \frac{1}{\tau(1-\tau)}\frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)' R_x^{-1}}{(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)} \right)^{-1}\left( (\hat\beta - \beta_0) + R_x^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)' R_x^{-1}(\beta_0-\beta_p)}} \right)$$

will converge in distribution to a draw from a chi-square with $k$ degrees of freedom and can be used to create confidence regions.
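In this special case the entire procedure collapses to ordinary ridge regression with a holdout-selected penalty. A self-contained sketch under the same illustrative conventions as above (all names and grid choices ours):

```python
import numpy as np

def ridge_ols(Y, X, alpha, beta_p):
    """Ridge regression solution (X'X/n + alpha I_k)^{-1} (X'Y/n + alpha beta_p)."""
    n, k = X.shape
    return np.linalg.solve(X.T @ X / n + alpha * np.eye(k),
                           X.T @ Y / n + alpha * beta_p)

def ridge_ols_holdout(Y, X, beta_p, tau=0.5):
    """Select alpha on the holdout sample, then refit on the full sample."""
    m = int(tau * len(Y))                              # [tau n] training observations
    grid = np.concatenate(([0.0], np.logspace(-6, 2, 200)))
    def Q(alpha):                                      # holdout least squares criterion
        e = Y[m:] - X[m:] @ ridge_ols(Y[:m], X[:m], alpha, beta_p)
        return e @ e
    alpha_hat = min(grid, key=Q)
    return alpha_hat, ridge_ols(Y, X, alpha_hat, beta_p)
```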
When $\beta_p = \beta_0$, a different asymptotic distribution occurs.
Theorem 6.
Assumptions 4–7 imply that, when $\beta_p = \beta_0$, $\sqrt{n}\,(\hat\beta - \beta_0)$, conditional on $\hat\alpha$, converges in distribution to a draw from

$$N\left( 0,\; \sigma_\varepsilon^2\left( R_x + \hat\alpha I_k \right)^{-1} R_x \left( R_x + \hat\alpha I_k \right)^{-1} \right)$$

where $\hat\alpha$ will converge in distribution to a draw from an a-min distribution with parameter $S = R_x^{1/2}$.
If $\beta_p = \beta_0$, the asymptotic covariance is smaller than or equal to the OLS estimator's asymptotic covariance. Again, the covariance and bias associated with $\beta_p \ne \beta_0$ should be used for inference.

4. Small Sample Properties

The behavior in finite samples is investigated next by simulating the model in Equations (1) and (2) with $k = 2$ and $m = 4$. (Because the ridge estimator and the 2SLS estimator are compared using MSE, two moments need to exist. This is ensured by having four instruments to estimate the two parameters; see [21].) To standardize the model, set $z_i \sim iid\, N(0, I_4)$ and $\beta_0 = (0, 0)'$. Endogeneity is created with

$$\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid\; N\left( 0, \begin{bmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0 \\ 0.7 & 0 & 1 \end{bmatrix} \right).$$

The strength of the instruments is controlled by the parameter $\delta$ with

$$\Gamma_0 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & \delta \\ 0 & 0 \end{bmatrix}.$$

If $\delta = 0$, the second element of $\beta_0$ is not identified.
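A sketch of this data-generating process (our reading of the design; the function name is ours):

```python
import numpy as np

def simulate_iv(n, delta, seed=0):
    """Simulate Equations (1) and (2) with k = 2, m = 4, and beta_0 = (0, 0)'."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, 0.7, 0.7],     # cov of (eps_i, u_i')': endogeneity via
                      [0.7, 1.0, 0.0],     # corr(eps, u1) = corr(eps, u2) = 0.7
                      [0.7, 0.0, 1.0]])
    Gamma0 = np.array([[1.0, 0.0],
                       [1.0, 1.0],
                       [0.0, delta],       # delta controls instrument strength
                       [0.0, 0.0]])
    Z = rng.standard_normal((n, 4))        # z_i ~ iid N(0, I_4)
    err = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    eps, u = err[:, 0], err[:, 1:]
    X = Z @ Gamma0 + u
    Y = eps                                # Y = X beta_0 + eps with beta_0 = 0
    return Y, X, Z
```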
The ridge tuning parameter estimate $\hat\alpha$ is determined in two steps. In the first step, the objective function is evaluated on a coarse grid of values to determine a starting value. In the second step, the objective function is evaluated over a finer grid (10,000 points) centered at the best value obtained from the first step. A value of $\hat\alpha = 0$ in the second step corresponds to the ridge estimator ignoring the prior in favor of the data, whereas a value of $\hat\alpha = 10^7$ corresponds to "infinite regularization", implying that the ridge estimator ignores the data in favor of the prior.
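The two-step search can be sketched as follows, reusing the hypothetical `ridge_iv` helper from Section 2; the grid ranges and the way the fine grid is centered are our own guesses at the details.

```python
def two_step_alpha(Y, X, Z, beta_p, tau=0.5):
    """Two-step grid search for alpha_hat: coarse scan, then a fine local grid."""
    n = len(Y)
    m = int(tau * n)
    def Q(alpha):                          # holdout objective from Section 2
        e = Y[m:] - X[m:] @ ridge_iv(Y[:m], X[:m], Z[:m], alpha, beta_p)
        Pe = Z[m:] @ np.linalg.solve(Z[m:].T @ Z[m:], Z[m:].T @ e)
        return e @ Pe
    coarse = np.concatenate(([0.0], np.logspace(-6, 7, 100)))
    a0 = min(coarse, key=Q)                            # step 1: starting value
    fine = np.linspace(a0 / 10.0, 10.0 * a0 + 1e-8, 10_000)
    return min(fine, key=Q)                            # step 2: 10,000-point fine grid
```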

4.1. Coverage Probabilities

The chi-square test is performed for a range of parameterizations, including different priors, different strengths of the instruments, and different sample sizes. Each model is simulated 10,000 times and a size of ten percent is used for each test. The results are presented in Table A1. The observed coverage probabilities agree with the theoretical values. As expected, the approximation is best when the sample size is larger, the correlation between instruments and covariates is higher, and the prior is closer to the population parameter value.

4.2. MSE for the Ridge Estimator

The ridge estimator is compared with the 2SLS estimator to demonstrate settings where the ridge estimator can be expected to give more accurate results in small samples. The simulated models differ in three dimensions: sample size, strength of the instruments, and the prior on the structural parameters. For smaller sample sizes, the ridge estimator should have better properties, whereas, for larger sample sizes, 2SLS should perform better. Sample sizes of $n = 25$, 50, 250, and 500 are considered. As noted above, the strength of the instruments increases with $\delta$. For smaller signal strengths, the ridge estimator should perform better. Values of $\delta = 0.01$, 0.05, 0.1, and 0.5 are considered. For prior values further from the population parameter values, the ridge estimator should perform worse. Three values of $\beta_p$ are considered: $(1/2, 1/2)'$, $(-1/2, -1/2)'$, and $(1, 1)'$. A total of 48 model specifications are simulated: four sample sizes $n$, four values of the precision parameter $\delta$, and three values of the prior $\beta_p$. Each specification is simulated 10,000 times and estimated with 2SLS and the ridge estimator. $\tau = 0.5$ is used to split the sample between the training and test samples for the ridge estimator.
Table A2, Table A3, and Table A4 compare the performance of the 2SLS estimator with the ridge estimator for different precision levels and sample sizes when the prior is fixed at $\beta_p = (1/2, 1/2)'$, $\beta_p = (-1/2, -1/2)'$, and $\beta_p = (1, 1)'$, respectively. The estimators are compared on the basis of the bias, standard deviation, and MSE of each estimate, and the sum of the MSE values for $\hat\beta_1$ and $\hat\beta_2$. All three tables show the expected patterns. The ridge estimator dominates in models with smaller sample sizes and weaker instruments, and when the prior is closer to the population parameter values.
Overall, the simulations demonstrate that, for some model specifications, the ridge estimator using a holdout sample has better small sample performance than the 2SLS estimator. The simulations agree with the asymptotic distributions. As the sample size increases, the 2SLS estimator performs better than the ridge estimator.

5. Conclusions

Inference has always been a weakness of ridge regression. This paper presents a methodology and results that help address some of its weaknesses. Theoretically accurate inference can be performed with the asymptotic distribution of the ridge estimates of the linear IV model when the tuning parameter is empirically selected with a holdout sample. It is well known that the distribution of the estimates of the structural parameters is affected by empirically selected tuning parameters. This is addressed by simultaneously estimating both the parameters of interest and the ridge regression tuning parameter in the method of moments framework. When the prior is different from the population parameter value, the estimator accounts for the probability limit of the tuning parameter being on the boundary of the parameter space. The asymptotic distribution for the tuning parameter is a nonstandard mixed distribution. The asymptotic distribution for the estimates of the structural parameters is normal but with a nonzero mean. The ridge estimator of the structural parameters has an asymptotic bias, and its asymptotic covariance is larger than the asymptotic covariance of the 2SLS estimator; however, the bias and larger covariance apply to only one dimension of the parameter space. The dependence of the asymptotic mean and variance on the population parameter values prevents the calculation of t-statistics for individual parameters. Fortunately, a chi-square statistic provides accurate confidence regions for the structural parameters.
If the prior is equal to the population parameter value, the ridge estimator is consistent and the asymptotic covariance is smaller than the 2SLS asymptotic covariance. The asymptotic distribution provides insights on how to perform estimation with a holdout sample. The minimum bias, variance, and MSE for the structural parameters occur when the sample is equally split into a training sample and a test (or holdout) sample.
This paper’s approach can be useful in determining the asymptotic behavior for other empirical procedures that select tuning parameters. Two natural extensions would be to generalize cross-validation (see [22]) and K-fold cross-validation, where the entire dataset would be used to select the tuning parameter. This paper has focused on strong correlations between the instruments and the regressors. Another important extension would be the asymptotic behavior in models with weaker correlation between the instruments and the regressors, see [23].

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/stats4030043/s1.

Author Contributions

Conceptualization, F.S. and N.S.; simulations, F.S. and N.S.; writing—original draft preparation, F.S. and N.S.; writing—review and editing, F.S. and N.S. Both authors contributed equally to this project. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The authors thank three anonymous referees for their helpful comments. The authors benefited from discussions during the presentation at the 2021 North American Summer Meeting of the Econometrics Society.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Coverage probabilities for confidence regions created from the $\chi^2$ statistic with $\tau = 0.5$ and the corresponding values from 2SLS estimation. The simulated model is given by Equations (1), (2), (4) and (12) with $k = 2$, $m = 4$, $z_i \sim iid\, N(0, I_4)$, and $\beta_0 = (0, 0)'$. The models differ with respect to the prior ($\beta_p$), the strength of the instruments ($\delta$), and the sample size ($n$). Each model is simulated 10,000 times and the test is performed with a size of 10%.
| $\beta_p$ | $\delta$ | $n$ | $\chi^2_{\tau=0.50}$ | $\chi^2_{2sls}$ |
|---|---|---|---|---|
| $(1/2, 1/2)$ | 0.01 | 25 | 0.651 | 0.547 |
| | | 500 | 0.717 | 0.629 |
| | | 10,000 | 0.708 | 0.661 |
| | 0.05 | 25 | 0.655 | 0.561 |
| | | 500 | 0.710 | 0.663 |
| | | 10,000 | 0.721 | 0.863 |
| | 0.10 | 25 | 0.656 | 0.571 |
| | | 500 | 0.709 | 0.759 |
| | | 10,000 | 0.855 | 0.890 |
| | 0.50 | 25 | 0.697 | 0.716 |
| | | 500 | 0.870 | 0.896 |
| | | 10,000 | 0.900 | 0.897 |
| $(-1/2, -1/2)$ | 0.01 | 25 | 0.623 | 0.556 |
| | | 500 | 0.686 | 0.636 |
| | | 10,000 | 0.675 | 0.661 |
| | 0.05 | 25 | 0.625 | 0.563 |
| | | 500 | 0.671 | 0.673 |
| | | 10,000 | 0.736 | 0.859 |
| | 0.10 | 25 | 0.624 | 0.576 |
| | | 500 | 0.682 | 0.760 |
| | | 10,000 | 0.863 | 0.892 |
| | 0.50 | 25 | 0.688 | 0.723 |
| | | 500 | 0.870 | 0.888 |
| | | 10,000 | 0.891 | 0.900 |
| $(1, 1)$ | 0.01 | 25 | 0.621 | 0.565 |
| | | 500 | 0.670 | 0.632 |
| | | 10,000 | 0.650 | 0.669 |
| | 0.05 | 25 | 0.622 | 0.565 |
| | | 500 | 0.664 | 0.664 |
| | | 10,000 | 0.784 | 0.856 |
| | 0.10 | 25 | 0.616 | 0.564 |
| | | 500 | 0.673 | 0.750 |
| | | 10,000 | 0.867 | 0.892 |
| | 0.50 | 25 | 0.689 | 0.726 |
| | | 500 | 0.865 | 0.895 |
| | | 10,000 | 0.886 | 0.896 |
Table A2. The simulated model is given by Equations (1), (2), (4) and (12) with $k = 2$, $m = 4$, $z_i \sim iid\, N(0, I_4)$, and $\beta_0 = (0, 0)'$. Summary statistics are reported for the estimates $\hat\beta_1$ and $\hat\beta_2$ using the 2SLS and ridge estimators for $\beta_p = (1/2, 1/2)'$, where $\tau = 0.5$. The models differ with respect to the strength of the instruments ($\delta$) and the sample size ($n$). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| $\delta$ | $n$ | Estimator | Bias $\hat\beta_1$ | SD $\hat\beta_1$ | MSE $\hat\beta_1$ | Bias $\hat\beta_2$ | SD $\hat\beta_2$ | MSE $\hat\beta_2$ | MSE $\hat\beta_1$ + MSE $\hat\beta_2$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.020 | 0.117 | 0.014 | 0.715 | 0.661 | 0.948 | 0.962 |
| | | Ridge | 0.080 | 0.108 | 0.018 | 0.604 | 0.343 | 0.483 | 0.501 |
| | 50 | 2SLS | 0.009 | 0.082 | 0.007 | 0.696 | 0.686 | 0.956 | 0.963 |
| | | Ridge | 0.055 | 0.076 | 0.009 | 0.589 | 0.333 | 0.458 | 0.467 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.694 | 0.740 | 1.029 | 1.031 |
| | | Ridge | 0.022 | 0.037 | 0.002 | 0.577 | 0.349 | 0.455 | 0.457 |
| | 500 | 2SLS | 0.001 | 0.028 | 0.001 | 0.697 | 0.710 | 0.989 | 0.990 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.577 | 0.317 | 0.433 | 0.434 |
| 0.05 | 25 | 2SLS | 0.019 | 0.122 | 0.015 | 0.688 | 0.694 | 0.954 | 0.970 |
| | | Ridge | 0.083 | 0.110 | 0.019 | 0.595 | 0.352 | 0.477 | 0.496 |
| | 50 | 2SLS | 0.011 | 0.084 | 0.007 | 0.683 | 0.697 | 0.952 | 0.959 |
| | | Ridge | 0.056 | 0.078 | 0.009 | 0.583 | 0.335 | 0.452 | 0.462 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.572 | 0.712 | 0.834 | 0.835 |
| | | Ridge | 0.022 | 0.038 | 0.002 | 0.531 | 0.332 | 0.392 | 0.394 |
| | 500 | 2SLS | 0.001 | 0.025 | 0.001 | 0.485 | 0.687 | 0.707 | 0.707 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.504 | 0.292 | 0.339 | 0.340 |
| 0.10 | 25 | 2SLS | 0.021 | 0.121 | 0.015 | 0.636 | 0.727 | 0.933 | 0.948 |
| | | Ridge | 0.082 | 0.107 | 0.018 | 0.572 | 0.356 | 0.454 | 0.472 |
| | 50 | 2SLS | 0.008 | 0.086 | 0.007 | 0.590 | 0.719 | 0.865 | 0.872 |
| | | Ridge | 0.054 | 0.079 | 0.009 | 0.552 | 0.334 | 0.416 | 0.425 |
| | 250 | 2SLS | 0.002 | 0.035 | 0.001 | 0.335 | 0.550 | 0.415 | 0.416 |
| | | Ridge | 0.024 | 0.038 | 0.002 | 0.444 | 0.265 | 0.268 | 0.270 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.181 | 0.463 | 0.247 | 0.248 |
| | | Ridge | 0.016 | 0.028 | 0.001 | 0.378 | 0.245 | 0.203 | 0.204 |
| 0.50 | 25 | 2SLS | 0.012 | 0.127 | 0.016 | 0.158 | 0.462 | 0.238 | 0.255 |
| | | Ridge | 0.088 | 0.123 | 0.023 | 0.325 | 0.258 | 0.173 | 0.195 |
| | 50 | 2SLS | 0.006 | 0.085 | 0.007 | 0.066 | 0.312 | 0.101 | 0.109 |
| | | Ridge | 0.055 | 0.089 | 0.011 | 0.243 | 0.235 | 0.114 | 0.125 |
| | 250 | 2SLS | 0.002 | 0.036 | 0.001 | 0.014 | 0.126 | 0.016 | 0.017 |
| | | Ridge | 0.016 | 0.039 | 0.002 | 0.104 | 0.148 | 0.033 | 0.035 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.007 | 0.089 | 0.008 | 0.009 |
| | | Ridge | 0.009 | 0.027 | 0.001 | 0.070 | 0.111 | 0.017 | 0.018 |
Table A3. The simulated model is given by Equations (1), (2), (4) and (12) with $k = 2$, $m = 4$, $z_i \sim iid\, N(0, I_4)$, and $\beta_0 = (0, 0)'$. Summary statistics are reported for the estimates $\hat\beta_1$ and $\hat\beta_2$ using the 2SLS and ridge estimators for $\beta_p = (-1/2, -1/2)'$, where $\tau = 0.5$. The models differ with respect to the strength of the instruments ($\delta$) and the sample size ($n$). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| $\delta$ | $n$ | Estimator | Bias $\hat\beta_1$ | SD $\hat\beta_1$ | MSE $\hat\beta_1$ | Bias $\hat\beta_2$ | SD $\hat\beta_2$ | MSE $\hat\beta_2$ | MSE $\hat\beta_1$ + MSE $\hat\beta_2$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.021 | 0.115 | 0.014 | 0.688 | 0.659 | 0.908 | 0.921 |
| | | Ridge | 0.084 | 0.108 | 0.019 | 0.700 | 0.334 | 0.601 | 0.620 |
| | 50 | 2SLS | 0.009 | 0.081 | 0.007 | 0.702 | 0.698 | 0.980 | 0.987 |
| | | Ridge | 0.055 | 0.077 | 0.009 | 0.708 | 0.331 | 0.611 | 0.620 |
| | 250 | 2SLS | 0.002 | 0.037 | 0.001 | 0.693 | 0.730 | 1.014 | 1.016 |
| | | Ridge | 0.024 | 0.037 | 0.002 | 0.700 | 0.316 | 0.589 | 0.591 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.675 | 0.702 | 0.949 | 0.949 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.700 | 0.304 | 0.582 | 0.583 |
| 0.05 | 25 | 2SLS | 0.021 | 0.117 | 0.014 | 0.683 | 0.654 | 0.894 | 0.908 |
| | | Ridge | 0.083 | 0.107 | 0.018 | 0.694 | 0.334 | 0.593 | 0.612 |
| | 50 | 2SLS | 0.010 | 0.081 | 0.007 | 0.666 | 0.663 | 0.884 | 0.891 |
| | | Ridge | 0.056 | 0.077 | 0.009 | 0.690 | 0.339 | 0.591 | 0.601 |
| | 250 | 2SLS | 0.002 | 0.037 | 0.001 | 0.576 | 0.683 | 0.798 | 0.800 |
| | | Ridge | 0.023 | 0.037 | 0.002 | 0.651 | 0.306 | 0.518 | 0.520 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.472 | 0.662 | 0.661 | 0.662 |
| | | Ridge | 0.016 | 0.027 | 0.001 | 0.618 | 0.324 | 0.487 | 0.488 |
| 0.10 | 25 | 2SLS | 0.022 | 0.120 | 0.015 | 0.657 | 0.695 | 0.914 | 0.929 |
| | | Ridge | 0.085 | 0.111 | 0.019 | 0.679 | 0.361 | 0.591 | 0.611 |
| | 50 | 2SLS | 0.009 | 0.083 | 0.007 | 0.595 | 0.690 | 0.830 | 0.837 |
| | | Ridge | 0.056 | 0.078 | 0.009 | 0.657 | 0.316 | 0.532 | 0.541 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.334 | 0.572 | 0.439 | 0.441 |
| | | Ridge | 0.023 | 0.038 | 0.002 | 0.548 | 0.301 | 0.391 | 0.393 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.187 | 0.457 | 0.244 | 0.245 |
| | | Ridge | 0.015 | 0.027 | 0.001 | 0.463 | 0.295 | 0.301 | 0.302 |
| 0.50 | 25 | 2SLS | 0.014 | 0.124 | 0.016 | 0.169 | 0.455 | 0.235 | 0.251 |
| | | Ridge | 0.084 | 0.126 | 0.023 | 0.369 | 0.328 | 0.244 | 0.267 |
| | 50 | 2SLS | 0.006 | 0.083 | 0.007 | 0.066 | 0.293 | 0.090 | 0.097 |
| | | Ridge | 0.050 | 0.085 | 0.010 | 0.266 | 0.268 | 0.142 | 0.152 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.011 | 0.128 | 0.017 | 0.018 |
| | | Ridge | 0.013 | 0.038 | 0.002 | 0.103 | 0.155 | 0.035 | 0.036 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.005 | 0.090 | 0.008 | 0.009 |
| | | Ridge | 0.008 | 0.026 | 0.001 | 0.071 | 0.114 | 0.018 | 0.019 |
Table A4. The simulated model is given by Equations (1), (2), (4) and (12) with $k = 2$, $m = 4$, $z_i \sim iid\, N(0, I_4)$, and $\beta_0 = (0, 0)'$. Summary statistics are reported for the estimates $\hat\beta_1$ and $\hat\beta_2$ using the 2SLS and ridge estimators for $\beta_p = (1, 1)'$, where $\tau = 0.5$. The models differ with respect to the strength of the instruments ($\delta$) and the sample size ($n$). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE in a number of cases; in particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.
| $\delta$ | $n$ | Estimator | Bias $\hat\beta_1$ | SD $\hat\beta_1$ | MSE $\hat\beta_1$ | Bias $\hat\beta_2$ | SD $\hat\beta_2$ | MSE $\hat\beta_2$ | MSE $\hat\beta_1$ + MSE $\hat\beta_2$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 25 | 2SLS | 0.022 | 0.124 | 0.016 | 0.708 | 0.724 | 1.025 | 1.041 |
| | | Ridge | 0.086 | 0.124 | 0.023 | 0.831 | 0.435 | 0.880 | 0.903 |
| | 50 | 2SLS | 0.009 | 0.081 | 0.007 | 0.695 | 0.678 | 0.943 | 0.950 |
| | | Ridge | 0.057 | 0.083 | 0.010 | 0.836 | 0.380 | 0.842 | 0.853 |
| | 250 | 2SLS | 0.003 | 0.036 | 0.001 | 0.691 | 0.693 | 0.958 | 0.960 |
| | | Ridge | 0.024 | 0.039 | 0.002 | 0.859 | 0.385 | 0.887 | 0.889 |
| | 500 | 2SLS | 0.001 | 0.027 | 0.001 | 0.698 | 0.725 | 1.013 | 1.014 |
| | | Ridge | 0.016 | 0.028 | 0.001 | 0.868 | 0.351 | 0.877 | 0.878 |
| 0.05 | 25 | 2SLS | 0.018 | 0.125 | 0.016 | 0.688 | 0.711 | 0.979 | 0.995 |
| | | Ridge | 0.083 | 0.117 | 0.021 | 0.820 | 0.399 | 0.832 | 0.853 |
| | 50 | 2SLS | 0.010 | 0.082 | 0.007 | 0.669 | 0.681 | 0.911 | 0.918 |
| | | Ridge | 0.057 | 0.082 | 0.010 | 0.823 | 0.383 | 0.824 | 0.834 |
| | 250 | 2SLS | 0.002 | 0.035 | 0.001 | 0.568 | 0.658 | 0.755 | 0.756 |
| | | Ridge | 0.023 | 0.039 | 0.002 | 0.793 | 0.365 | 0.761 | 0.763 |
| | 500 | 2SLS | 0.001 | 0.031 | 0.001 | 0.466 | 0.725 | 0.743 | 0.744 |
| | | Ridge | 0.016 | 0.031 | 0.001 | 0.753 | 0.437 | 0.757 | 0.759 |
| 0.10 | 25 | 2SLS | 0.018 | 0.121 | 0.015 | 0.654 | 0.682 | 0.893 | 0.908 |
| | | Ridge | 0.085 | 0.119 | 0.021 | 0.799 | 0.404 | 0.802 | 0.823 |
| | 50 | 2SLS | 0.010 | 0.099 | 0.010 | 0.593 | 1.026 | 1.405 | 1.415 |
| | | Ridge | 0.057 | 0.083 | 0.010 | 0.780 | 0.410 | 0.777 | 0.787 |
| | 250 | 2SLS | 0.001 | 0.036 | 0.001 | 0.318 | 0.605 | 0.468 | 0.469 |
| | | Ridge | 0.021 | 0.040 | 0.002 | 0.643 | 0.393 | 0.567 | 0.569 |
| | 500 | 2SLS | 0.001 | 0.026 | 0.001 | 0.178 | 0.464 | 0.247 | 0.247 |
| | | Ridge | 0.013 | 0.028 | 0.001 | 0.524 | 0.395 | 0.431 | 0.432 |
| 0.50 | 25 | 2SLS | 0.014 | 0.123 | 0.015 | 0.162 | 0.434 | 0.215 | 0.230 |
| | | Ridge | 0.079 | 0.129 | 0.023 | 0.405 | 0.376 | 0.306 | 0.329 |
| | 50 | 2SLS | 0.006 | 0.087 | 0.008 | 0.071 | 0.317 | 0.105 | 0.113 |
| | | Ridge | 0.043 | 0.088 | 0.010 | 0.278 | 0.309 | 0.173 | 0.182 |
| | 250 | 2SLS | 0.001 | 0.037 | 0.001 | 0.012 | 0.127 | 0.016 | 0.018 |
| | | Ridge | 0.011 | 0.037 | 0.001 | 0.103 | 0.158 | 0.035 | 0.037 |
| | 500 | 2SLS | 0.000 | 0.026 | 0.001 | 0.007 | 0.089 | 0.008 | 0.009 |
| | | Ridge | 0.007 | 0.026 | 0.001 | 0.071 | 0.114 | 0.018 | 0.019 |
Proofs
Lemma 1
Proof of Lemma 1.
The objective function that determines the tuning parameter is

$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [n\tau])}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha)\right)' P_{Z_{n(1-\tau)}}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha)\right).$$

Substitute

$$\hat\beta_{tr}(\alpha) = \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha\beta_p \right)$$

and write the objective function as

$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [n\tau])}\left( Z_{n(1-\tau)}'\epsilon_{n(1-\tau)} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\frac{X_{\tau n}' P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\alpha\left( \beta_p - \beta_0\right)\right)' \left( Z_{n(1-\tau)}' Z_{n(1-\tau)}\right)^{-1}\left( Z_{n(1-\tau)}'\epsilon_{n(1-\tau)} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\frac{X_{\tau n}' P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\alpha\left( \beta_p - \beta_0\right)\right).$$
The CLT and LLN imply that $Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\frac{X_{\tau n}' P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]}$ and $Z_{n(1-\tau)}'\epsilon_{n(1-\tau)}$ are $O_p(n^{1/2})$. The LLN implies that $Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I\right)^{-1}\alpha(\beta_p - \beta_0)$ is $O_p(n)$ when $\beta_p \ne \beta_0$. However, this term is zero if $\beta_p = \beta_0$. Hence, the limiting behavior of the objective function is determined by the $O_p(n)$ term when $\beta_p \ne \beta_0$ and by the $O_p(n^{1/2})$ terms when $\beta_p = \beta_0$.
For $\beta_p \ne \beta_0$, the consistency of $\hat\alpha$ is presented in Lemma 1 of [1]. For $\beta_p = \beta_0$,

$$\lim_{n\to\infty} Q_{n(1-\tau)}(\alpha) = \frac{1}{2}\left( V_{(1-\tau)} - R_z\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + \alpha I_k\right)^{-1}\Gamma_0' V_\tau\right)' R_z^{-1}\left( V_{(1-\tau)} - R_z\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + \alpha I_k \right)^{-1}\Gamma_0' V_\tau\right)$$

where $V_{(1-\tau)}$ and $V_\tau$ are iid $N(0, R_z\sigma_\epsilon^2)$. Hence, $\hat\alpha$ converges in distribution to a draw from the a-min distribution with parameter $S = R_z^{1/2}\Gamma_0$, where $R_z^{1/2}$ is the symmetric matrix square root of $R_z$. □
Theorem 1
This theorem and its proof are presented in [1].
Theorem 2
Proof of Theorem 2.
Let $\Sigma_h = E[h_i(\theta_0)h_i(\theta_0)']$ and $M_0 = E\left[\frac{\partial h_i(\theta_0)}{\partial \theta'}\right]$. The asymptotic distribution of $\mathcal{Z}$ is

$$\mathcal{Z} \stackrel{a}{\sim} N\left( 0,\; M_0^{-1}\Sigma_h M_0^{-1\prime} \right) \equiv N(0, C).$$

The parameters and the moment conditions are written in sets. To keep track of the needed calculations, partition $\theta$, $h_i(\theta)$, $\Sigma_h$, $M_0$, $\mathcal{Z}$, and $C$ into terms associated with the sets, using bold subscripts and superscripts to denote the different sets:

$$\theta = \left( \theta_{\mathbf{1}}', \theta_{\mathbf{2}}', \theta_{\mathbf{3}}, \theta_{\mathbf{4}}'\right)' \equiv \left( \left[ \operatorname{vech}(R_\tau)', \operatorname{vech}(R_{(1-\tau)})', \operatorname{vec}(S_\tau)', \operatorname{vec}(S_{(1-\tau)})'\right], \;\beta_{tr}', \;\alpha, \;\beta' \right)'$$

$$\begin{bmatrix} h_{\mathbf{1},i}(\theta) \\ h_{\mathbf{2},i}(\theta) \\ h_{\mathbf{3},i}(\theta) \\ h_{\mathbf{4},i}(\theta)\end{bmatrix} \equiv \begin{bmatrix} \begin{bmatrix} 1_\tau(i)\operatorname{vech}(R_\tau - z_i z_i') \\ (1-1_\tau(i))\operatorname{vech}(R_{(1-\tau)} - z_i z_i') \\ 1_\tau(i)\operatorname{vec}(S_\tau - z_i x_i') \\ (1-1_\tau(i))\operatorname{vec}(S_{(1-\tau)} - z_i x_i')\end{bmatrix} \\ 1_\tau(i)\left( -S_\tau' R_\tau^{-1} z_i (y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p)\right) \\ (1-1_\tau(i))\,(y_i - x_i'\beta_{tr})\, z_i' R_{(1-\tau)}^{-1} S_{(1-\tau)}\left( S_\tau' R_\tau^{-1} S_\tau + \alpha I_k\right)^{-1}(\beta_p - \beta_{tr}) \\ -\left(\tau S_\tau + (1-\tau) S_{(1-\tau)}\right)'\left(\tau R_\tau + (1-\tau) R_{(1-\tau)}\right)^{-1} z_i (y_i - x_i'\beta) + \alpha(\beta - \beta_p)\end{bmatrix}.$$

Let $\Sigma_h^{\mathbf{i},\mathbf{j}} \equiv E\left[ h_{\mathbf{i},i}(\theta_0) h_{\mathbf{j},i}(\theta_0)'\right]$ and $M_{0\,\mathbf{i},\mathbf{j}} \equiv E\left[ \frac{\partial h_{\mathbf{i},i}(\theta_0)}{\partial \theta_{\mathbf{j}}'}\right]$ for $\mathbf{i}, \mathbf{j} = 1, \ldots, 4$, and denote the partitioned terms of $M_0^{-1}$ by $M_0^{\mathbf{i},\mathbf{j}}$ for $\mathbf{i}, \mathbf{j} = 1, \ldots, 4$. The limiting random variables are partitioned as

$$\begin{bmatrix} \mathcal{Z}_{\mathbf{1}} \\ \mathcal{Z}_{\mathbf{2}} \\ \mathcal{Z}_{\mathbf{3}} \\ \mathcal{Z}_{\mathbf{4}}\end{bmatrix} = \begin{bmatrix} M_0^{\mathbf{1},\mathbf{1}} & M_0^{\mathbf{1},\mathbf{2}} & M_0^{\mathbf{1},\mathbf{3}} & M_0^{\mathbf{1},\mathbf{4}} \\ M_0^{\mathbf{2},\mathbf{1}} & M_0^{\mathbf{2},\mathbf{2}} & M_0^{\mathbf{2},\mathbf{3}} & M_0^{\mathbf{2},\mathbf{4}} \\ M_0^{\mathbf{3},\mathbf{1}} & M_0^{\mathbf{3},\mathbf{2}} & M_0^{\mathbf{3},\mathbf{3}} & M_0^{\mathbf{3},\mathbf{4}} \\ M_0^{\mathbf{4},\mathbf{1}} & M_0^{\mathbf{4},\mathbf{2}} & M_0^{\mathbf{4},\mathbf{3}} & M_0^{\mathbf{4},\mathbf{4}}\end{bmatrix} \frac{1}{\sqrt{n}}\sum_{i=1}^n \begin{bmatrix} h_{\mathbf{1},i}(\theta_0) \\ h_{\mathbf{2},i}(\theta_0) \\ h_{\mathbf{3},i}(\theta_0) \\ h_{\mathbf{4},i}(\theta_0)\end{bmatrix}.$$

The partitioned elements of $C$ are denoted $C_{\mathbf{i},\mathbf{j}}$ for $\mathbf{i}, \mathbf{j} = 1, \ldots, 4$ and can be written as

$$C_{\mathbf{i},\mathbf{j}} = \sum_{\mathbf{l}=1}^{4}\sum_{\mathbf{k}=1}^{4} M_0^{\mathbf{i},\mathbf{l}}\,\Sigma_h^{\mathbf{l},\mathbf{k}}\,M_0^{\mathbf{j},\mathbf{k}\,\prime}.$$
(Note that the transpose on the second $M_0$ term is achieved by flipping the index.) The $C_{4,4}$ term is the covariance matrix for the estimate of $\theta_{\mathbf{4}}$, i.e., the ridge estimate of the structural parameters $\beta$. The detailed calculations of the $C_{\mathbf{i},\mathbf{j}}$ terms are presented in the Supplementary Material for this paper.
The $\theta_{\mathbf{3}}$ term is $\alpha$, which is restricted to be non-negative, and its probability limit is on the boundary of the parameter space, i.e., $\alpha_0 = 0$. Following Self and Liang [18], the probability limit being on the boundary of the parameter space results in an asymptotic distribution that is characterized by a projection onto a cone. Because the probability limit is zero, the asymptotic distribution is obtained by projecting the limiting stochastic process onto the non-negative values of $\theta_{\mathbf{3}} = \alpha$. This projection is defined using the limiting covariance matrix to define the inner product. When a draw from the limiting distribution $\mathcal{Z}$ has a non-negative $\mathcal{Z}_{\mathbf{3}}$ term, it contributes directly to the asymptotic distribution. When a draw has a negative $\mathcal{Z}_{\mathbf{3}}$ term, the random vector is projected onto the cone with $\mathcal{Z}_{\mathbf{3}} = 0$. This means $\mathcal{Z}_{\mathbf{3}}$ is mapped to zero. The other parameters are also adjusted, depending on their covariance and correlation with $\mathcal{Z}_{\mathbf{3}}$. This adjustment can contribute an asymptotic bias term.
The asymptotic distribution of the estimates can be characterized as

$$\sqrt{n}\left( \hat\theta - \theta_0 \right) \stackrel{a}{\sim} \begin{bmatrix} \mathcal{Z}_{\mathbf{1}} \\ \mathcal{Z}_{\mathbf{2}} \\ \mathcal{Z}_{\mathbf{3}} \\ \mathcal{Z}_{\mathbf{4}}\end{bmatrix} 1_{\{\mathcal{Z}_{\mathbf{3}} \ge 0\}} + \begin{bmatrix} \mathcal{Z}_{\mathbf{1}} - (C_{1,3}/C_{3,3})\,\mathcal{Z}_{\mathbf{3}} \\ \mathcal{Z}_{\mathbf{2}} - (C_{2,3}/C_{3,3})\,\mathcal{Z}_{\mathbf{3}} \\ 0 \\ \mathcal{Z}_{\mathbf{4}} - (C_{4,3}/C_{3,3})\,\mathcal{Z}_{\mathbf{3}}\end{bmatrix} 1_{\{\mathcal{Z}_{\mathbf{3}} < 0\}}.$$

The asymptotic distribution for $\hat\alpha$ is a mixture with probability mass 1/2 at zero and a truncated normal distribution $N(0, C_{3,3})$ over the non-negative values.
The asymptotic distribution for the ridge estimator of the structural parameters $\beta$, conditional on $\mathcal{Z}_{\mathbf{3}}$, is the asymptotic distribution of $\theta_{\mathbf{4}}$:

$$\sqrt{n}\,(\hat\beta - \beta_0) \stackrel{a}{\sim} N\left( -\tfrac{1}{2}\,(C_{4,3}/C_{3,3})\,\mathcal{Z}_{\mathbf{3}},\; C_{4,4}\right)$$

where $\mathcal{Z}_{\mathbf{3}}$ is a draw from the truncated normal distribution $N(0, C_{3,3})$ over the negative values. The asymptotic bias can be evaluated in closed form. The expectation of $|\mathcal{Z}_{\mathbf{3}}|$ for this truncated normal distribution is

$$E = \int_0^\infty z\,\frac{2}{\sqrt{2\pi C_{3,3}}}\exp\left( -\frac{z^2}{2 C_{3,3}}\right)dz = \frac{2}{\sqrt{2\pi C_{3,3}}}\int_0^\infty z\exp\left(-\frac{z^2}{2 C_{3,3}}\right)dz = \frac{2}{\sqrt{2\pi C_{3,3}}}\left[ -C_{3,3}\exp\left( -\frac{z^2}{2 C_{3,3}}\right)\right]_0^\infty = \frac{2 C_{3,3}}{\sqrt{2\pi C_{3,3}}} = \sqrt{\frac{2 C_{3,3}}{\pi}}.$$
Under the null hypothesis $H_0: \beta = \beta_0$, the statistic

$$n\left( (\hat\beta - \beta_0) - \frac{C_{4,3}}{2 C_{3,3}}\sqrt{\frac{2 C_{3,3}}{\pi n}}\right)' C_{4,4}^{-1}\left( (\hat\beta - \beta_0) - \frac{C_{4,3}}{2 C_{3,3}}\sqrt{\frac{2 C_{3,3}}{\pi n}}\right)$$

will converge in distribution to a chi-square with $k$ degrees of freedom.
The asymptotic distributions for $\hat\alpha$ and $\hat\beta$ and the test statistic require the terms $C_{3,3}$, $C_{4,3}$, and $C_{4,4}$. The details of the matrix multiplication are presented in the Supplementary Material for this paper. The terms are

$$C_{3,3} = \frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)},$$

$$C_{4,3} = -\frac{\sigma_\varepsilon^2\,\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}{\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)},$$

$$C_{4,4} = \sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1} + \frac{\sigma_\varepsilon^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}}{\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)},$$

and

$$\frac{C_{4,3}}{2 C_{3,3}}\sqrt{\frac{2 C_{3,3}}{\pi}} = -\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0-\beta_p)'\left(\Gamma_0' R_z \Gamma_0\right)^{-1}(\beta_0-\beta_p)}}.$$

The asymptotic distribution for the optimally selected tuning parameter when $\beta_p \ne \beta_0$ is a mixture with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal with zero mean and variance $C_{3,3}$. The asymptotic distribution of $\sqrt{n}\,(\hat\beta - \beta_0)$ is normal with mean $\frac{C_{4,3}}{2 C_{3,3}}\sqrt{\frac{2 C_{3,3}}{\pi}}$ and covariance $C_{4,4}$. □
Theorem 3
Proof of Theorem 3.
The tuning parameter is estimated, but it is not identified. Consider only the final system of $k$ equations used to estimate the parameters of interest, conditional on the estimated tuning parameter $\hat\alpha$:

$$-\frac{X' P_Z (Y - X\hat\beta)}{n} + \hat\alpha\,(\hat\beta - \beta_p) = 0.$$

This implies

$$\hat\beta = \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)^{-1}\left( \frac{X' P_Z Y}{n} + \hat\alpha\beta_p\right).$$

Substitute in $\beta_p = \beta_0$:

$$\hat\beta = \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)^{-1}\left( \frac{X' P_Z Y}{n} + \hat\alpha\beta_0\right).$$

Substitute in $Y = X\beta_0 + \varepsilon$:

$$\hat\beta = \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)^{-1}\left( \frac{X' P_Z (X\beta_0 + \varepsilon)}{n} + \hat\alpha\beta_0\right).$$

Simplify to

$$\hat\beta = \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)^{-1}\left( \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)\beta_0 + \frac{X' P_Z\varepsilon}{n}\right).$$

This implies

$$(\hat\beta - \beta_0) = \left( \frac{X' P_Z X}{n} + \hat\alpha I_k\right)^{-1}\frac{X' P_Z\varepsilon}{n},$$

which gives the root-n consistency of the estimate, and the asymptotic distribution becomes

$$\sqrt{n}\,(\hat\beta - \beta_0) \stackrel{a}{\sim} N\left( 0,\; \sigma_\varepsilon^2\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k\right)^{-1}\Gamma_0' R_z \Gamma_0\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k\right)^{-1}\right)$$

where $\hat\alpha$ is a draw from the a-min distribution with parameter $R_z^{1/2}\Gamma_0$.
The inverse of the ridge estimator's asymptotic variance is

$$\sigma_\varepsilon^{-2}\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k\right)\left( \Gamma_0' R_z \Gamma_0\right)^{-1}\left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k\right) = \sigma_\varepsilon^{-2}\left(\Gamma_0' R_z \Gamma_0\right) + 2\sigma_\varepsilon^{-2}\hat\alpha I_k + \sigma_\varepsilon^{-2}\hat\alpha^2\left(\Gamma_0' R_z \Gamma_0\right)^{-1}.$$

For $\hat\alpha > 0$, this is larger than the inverse of the 2SLS estimator's asymptotic variance, $\sigma_\varepsilon^{-2}\left(\Gamma_0' R_z \Gamma_0\right)$. Hence, the variance of the 2SLS estimator can never be smaller than the variance of the ridge estimator when $\beta_p = \beta_0$. □
Theorem 4
This is a special case of Theorem 1. The proof of Theorem 1 as presented in [1] applies to this set of parameters and moment conditions.
Theorem 5
This is a special case of Theorem 2. Assumption 6 implies that there is no need for instrumental variables. The basic simplification is that in the IV model $\frac{1}{n} X' P_Z X \stackrel{p}{\to} \Gamma_0' R_z \Gamma_0$, while, for the linear regression model, this reduces to $\frac{1}{n} X' X \stackrel{p}{\to} R_x$.
Theorem 6
This is a special case of the results in Theorem 3. The explanation for Theorem 5 also applies for this theorem.

References

  1. Sengupta, N.; Sowell, F. On the Asymptotic Distribution of Ridge Regression Estimators Using Training and Test Samples. Econometrics 2020, 8, 39.
  2. Obenchain, R. Classical F-Tests and Confidence Regions for Ridge Regression. Technometrics 1977, 19, 429.
  3. Van Wieringen, W.N. Lecture notes on ridge regression. arXiv 2021, arXiv:1509.09169.
  4. Melo, S.; Kibria, B.M.G. On Some Test Statistics for Testing the Regression Coefficients in Presence of Multicollinearity: A Simulation Study. Stats 2020, 3, 40–55.
  5. Halawa, A.; Bassiouni, M.E. Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 2000, 65, 341–356.
  6. Theobald, C.M. Generalizations of Mean Square Error Applied to Ridge Regression. J. R. Stat. Soc. Ser. B 1974, 36, 103–106.
  7. Schmidt, P. Econometrics; Statistics, Textbooks and Monographs; Dekker: New York, NY, USA, 1976.
  8. Smith, G.; Campbell, F. A Critique of Some Ridge Regression Methods. J. Am. Stat. Assoc. 1980, 75, 74–81.
  9. Montgomery, D.; Peck, E.; Vining, G. Introduction to Linear Regression Analysis; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2012.
  10. Gómez, R.S.; García, C.G.; Pérez, J.G. The Raise Regression: Justification, properties and application. arXiv 2021, arXiv:2104.14423.
  11. Kibria, B.M.G.; Banik, S. A Simulation Study on the Size and Power Properties of Some Ridge Regression Tests. Appl. Appl. Math. Int. J. (AAM) 2019, 14, 741–761.
  12. Zorzi, M. Empirical Bayesian learning in AR graphical models. Automatica 2019, 109, 108516.
  13. Zorzi, M. Autoregressive identification of Kronecker graphical models. Automatica 2020, 119, 109053.
  14. Alheety, M.I.; Ramanathan, T.V. Confidence Interval for Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2009, 38, 3489–3497.
  15. Rubio, H.; Firinguetti, L. The Distribution of Stochastic Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2002, 31, 1531–1547.
  16. Akdeniz, F.; Öztürk, F. The distribution of stochastic shrinkage biasing parameters of the Liu type estimator. Appl. Math. Comput. 2005, 163, 29–38.
  17. Moran, P.A.P. Maximum-likelihood estimation in non-standard conditions. Math. Proc. Camb. Philos. Soc. 1971, 70, 441–450.
  18. Self, S.; Liang, K. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 1987, 82, 605–610.
  19. Andrews, D.W.K. Generalized method of moments estimation when a parameter is on a boundary. J. Bus. Econ. Stat. 2002, 20, 530–544.
  20. Andrews, D.W.K. Estimation When a Parameter is on a Boundary. Econometrica 1999, 67, 1341–1383.
  21. Kinal, T.W. The Existence of Moments of k-Class Estimators. Econometrica 1980, 48, 241–249.
  22. Golub, G.H.; Heath, M.; Wahba, G. Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics 1979, 21, 215–223.
  23. Antoine, B.; Renault, E. Efficient GMM with nearly-weak instruments. Econom. J. 2009, 12, S135–S171.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
