Bootstrap Tests for Overidentification in Linear Regression Models

We study the finite-sample properties of tests for overidentifying restrictions in linear regression models with a single endogenous regressor and weak instruments. Under the assumption of Gaussian disturbances, we derive expressions for a variety of test statistics as functions of eight mutually independent random variables and two nuisance parameters. The distributions of the statistics are shown to have an ill-defined limit as the parameter that determines the strength of the instruments tends to zero and as the correlation between the disturbances of the structural and reduced-form equations tends to plus or minus one. This makes it impossible to perform reliable inference near the point at which the limit is ill-defined. Several bootstrap procedures are proposed. They alleviate the problem and allow reliable inference when the instruments are not too weak. We also study their power properties.


Introduction
In recent years, there has been a great deal of work on the finite-sample properties of estimators and tests for linear regression models with endogenous regressors when the instruments are weak. Much of this work has focused on the case in which there is just one endogenous variable on the right-hand side, and numerous procedures for testing hypotheses about the coefficient of this variable have been studied. See, among many others, Staiger and Stock (1997), Stock, Wright, and Yogo (2002), Kleibergen (2002), Moreira (2003, 2009), Andrews, Moreira, and Stock (2006), and Davidson and MacKinnon (2008, 2010). However, the closely related problem of testing overidentifying restrictions when the instruments are weak does not appear to have been studied to anything like the same extent. In the next section, we discuss the famous test of Sargan (1958) and other asymptotic tests for overidentification in linear regression models estimated by instrumental variables (IV) or limited information maximum likelihood (LIML). We show that the test statistics are all functions of six quadratic forms defined in terms of the two endogenous variables of the model, the linear span of the instruments, and its orthogonal complement. In fact, they can be expressed as functions of a certain ratio of sums of squared residuals and are closely related to the test proposed by Anderson and Rubin (1949). In Section 3, we analyze the properties of these overidentification test statistics. We use a simplified model with only three parameters, which is nonetheless capable of generating statistics with exactly the same distributions as those generated by a more general model. In Section 4, we derive the limiting behavior of the statistics in the context of weak-instrument asymptotics as the instrument strength tends to zero, as the correlation between the disturbances in the structural and reduced-form equations tends to unity, and as the sample size tends to infinity.
In Section 5, we investigate by simulation the finite-sample behavior of the statistics we consider. We find that simulation evidence and theoretical analysis concur in strongly preferring a variant of a likelihood-ratio test to the more conventional forms of Sargan test. Section 6 discusses a number of bootstrap procedures that can be used in conjunction with any of the overidentification tests. Some of these procedures are purely parametric, while others make use of resampling. In Section 7, we look at the performance of bootstrap tests, finding that the best of them behave very well if the instruments are not too weak. However, as our theory suggests, they improve very little over tests based on asymptotic critical values in the neighborhood of the singularity that occurs where the instrument strength tends to zero and the correlation of the disturbances tends to one.
In Section 8, we analyze the power properties of the two main variants of bootstrap test. We obtain analytical results that generalize those of Section 3. Using those analytical results, we conduct extensive simulation experiments, mostly for cases that allow the bootstrap to yield reliable inference. We find that bootstrap tests based on IV estimation seem to have a slight power advantage over those based on LIML, at the cost of slightly greater size distortion under the null when the instruments are not too weak. Section 9 presents a brief discussion of how both test statistics and bootstrap procedures can be modified to take account of heteroskedasticity and clustered data. Finally, some concluding remarks are made in Section 10.

Tests for Overidentification
Although the tests for overidentification that we deal with are applicable to linear regression models with any number of endogenous right-hand-side variables, we restrict attention in this paper to a model with just one such variable. We do so partly for expositional convenience and partly because this special case is of particular interest and has been the subject of much research in recent years. The model consists of just two equations,

y1 = βy2 + Zγ + u1, and (1)
y2 = Wπ + u2. (2)

Here y1 and y2 are n-vectors of observations on endogenous variables, Z is an n × k matrix of observations on exogenous variables, and W is an n × l matrix of instruments such that S(Z) ⊂ S(W), where the notation S(A) means the linear span of the columns of the matrix A. The disturbances are assumed to be homoskedastic and serially uncorrelated. We assume that l > k + 1, so that the model is overidentified.
The parameters of this model are the scalar β, the k-vector γ, the l-vector π, and the 2 × 2 contemporaneous covariance matrix of the disturbances u1i and u2i:

Σ ≡ [ σ1²    ρσ1σ2 ]
    [ ρσ1σ2  σ2²   ]. (3)

Equation (1) is the structural equation we are interested in, and equation (2) is a reduced-form equation for the second endogenous variable y2.
The model (1) and (2) implicitly involves one identifying restriction, which cannot be tested, and q ≡ l − k − 1 overidentifying restrictions. These restrictions say, in effect, that if we append q regressors all belonging to S(W ) to equation (1) in such a way that the equation becomes just identified, then the coefficients of these q additional regressors are zero.
The most common way to test the overidentifying restrictions is to use a Sargan test (Sargan, 1958), which can be computed in various ways. The easiest is probably to estimate equation (1) by instrumental variables (IV), using the l columns of W as instruments, and then to regress the IV residuals û1 on W. The explained sum of squares from this regression, divided by the IV estimate of σ1², is the test statistic, and it is asymptotically distributed as χ²(q).
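For readers who want to see this recipe in code, the following is a minimal sketch (NumPy only; the function and variable names are ours, not the paper's) of the Sargan test computed exactly as described: IV/2SLS estimation of equation (1) with the columns of W as instruments, followed by a regression of the IV residuals on W.

```python
import numpy as np

def sargan_statistic(y1, y2, Z, W):
    """Sargan test: regress the IV residuals on W and divide the
    explained sum of squares by the IV estimate of sigma_1^2.
    Asymptotically chi^2(q), q = l - k - 1, under the null."""
    n = len(y1)
    X = np.column_stack([y2, Z])               # regressors of equation (1)
    PW = W @ np.linalg.solve(W.T @ W, W.T)     # projection onto S(W)
    coef = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y1)  # generalized IV (2SLS)
    u_hat = y1 - X @ coef                      # IV residuals
    ess = u_hat @ PW @ u_hat                   # explained SS of u_hat on W
    sigma2 = (u_hat @ u_hat) / n               # IV estimate of sigma_1^2
    return ess / sigma2
```

Comparing this statistic with a χ²(q) critical value gives the asymptotic test; the simulations in Section 5 show how unreliable that comparison can be when the instruments are weak.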
The numerator of the Sargan statistic can be written as

(y1 − Zγ̂IV − β̂IV y2)′ P_W (y1 − Zγ̂IV − β̂IV y2), (4)

where β̂IV and γ̂IV denote the IV estimates of β and γ, respectively, and P_W ≡ W(W′W)⁻¹W′ projects orthogonally on to S(W). We define P_Z similarly, and let M_W ≡ I − P_W and M_Z ≡ I − P_Z. Since Z is orthogonal to the IV residuals, û1 = M_Z(y1 − β̂IV y2). Then, since P_W M_Z = M_Z P_W = P_W − P_Z = M_Z − M_W, the numerator of the Sargan statistic can also be written as

(y1 − β̂IV y2)′ (M_Z − M_W)(y1 − β̂IV y2). (5)

Similarly, the denominator is just

(1/n) û1′û1
= (1/n)(y1 − β̂IV y2)′ M_Z (y1 − β̂IV y2). (6)

Expression (5) is the numerator of the Anderson-Rubin, or AR, statistic for the hypothesis that β = β̂IV; see Anderson and Rubin (1949). The denominator of this same AR statistic is

(1/(n − l)) (y1 − β̂IV y2)′ M_W (y1 − β̂IV y2), (7)

which may be compared to the second line of (6). We see that the Sargan statistic estimates σ1² under the null hypothesis, and the AR statistic estimates it under the alternative.
Of course, AR statistics are usually calculated for the hypothesis that β takes on a specific value, say β0, rather than β̂IV. Since by definition β̂IV minimizes the numerator (4), it follows that the numerator of the AR statistic is always no smaller than the numerator of the Sargan statistic. Even though the AR statistic is not generally thought of as a test of the overidentifying restrictions, it could be used as such a test, because it will always reject if the restrictions are sufficiently false.
It seems natural to modify the Sargan statistic by using (7) instead of (6) as the denominator, and this was done by Basmann (1960). The usual Sargan statistic can be written as

S = n(1 − ζ(β̂IV)), (8)

and the Basmann statistic as

S′ = (n − l)(1 − ζ(β̂IV))/ζ(β̂IV), (9)

where SSR0 is the sum of squared residuals from regressing y1 − β̂IV y2 on Z, SSR1 is the SSR from regressing y1 − β̂IV y2 on W, and ζ(β̂IV) ≡ SSR1/SSR0. Observe that both test statistics are simply monotonic functions of ζ(β̂IV), the ratio of the two sums of squared residuals.
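A short sketch (our own notation; NumPy only) makes the common structure explicit: both statistics are computed from the single ratio ζ = SSR1/SSR0, assuming the Sargan and Basmann forms S = n(1 − ζ) and S′ = (n − l)(1 − ζ)/ζ discussed above.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

def zeta_and_stats(y1, y2, Z, W, beta_iv):
    """zeta = SSR1/SSR0; Sargan S = n(1 - zeta); Basmann
    S' = (n - l)(1 - zeta)/zeta.  Both are monotonic in zeta."""
    n, l = W.shape
    v = y1 - beta_iv * y2          # residual direction y1 - beta_hat * y2
    ssr0 = ssr(v, Z)               # regress on Z
    ssr1 = ssr(v, W)               # regress on W; S(Z) in S(W) => ssr1 <= ssr0
    zeta = ssr1 / ssr0
    return zeta, n * (1 - zeta), (n - l) * (1 - zeta) / zeta
```

Because S(Z) ⊂ S(W), the ratio ζ always lies in (0, 1], so both statistics are nonnegative.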
In what follows, it will be convenient to analyze LR′ rather than LR.
We have seen that the Sargan statistic (8), the Basmann statistic (9), and the two likelihood ratio statistics LR and LR′ are all monotonic functions of the ratio of SSRs ζ(β̂) for some estimator β̂. Both the particular function of ζ(β̂) that is used and the choice of β̂ affect the finite-sample properties of an asymptotic test. For a bootstrap test, however, it is only the choice of β̂ that matters. This follows from the fact that it is only the rank of the actual test statistic in the ordered list of the actual and bootstrap statistics that determines a bootstrap P value; see Section 6 below and Davidson and MacKinnon (2006a). Therefore, for any given bootstrap data-generating process (DGP) and any estimator β̂, bootstrap tests based on any monotonic transformation of ζ(β̂) yield identical results.
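This invariance is easy to verify numerically. In the sketch below (illustrative numbers only), the same ζ values are passed through a Sargan-type and a Basmann-type transformation; because both transformations are strictly decreasing in ζ, the upper-tail bootstrap P values coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, B = 400, 10, 399

# One "actual" ratio of SSRs and B bootstrap ratios (illustrative values).
zeta_hat = 0.93
zeta_star = rng.uniform(0.85, 1.0, size=B)

sargan_form = lambda z: n * (1 - z)              # decreasing in z
basmann_form = lambda z: (n - l) * (1 - z) / z   # also decreasing in z

def pval(tau_hat, tau_star):
    """Upper-tail bootstrap P value: share of bootstrap stats above tau_hat."""
    return np.mean(tau_star > tau_hat)

p_sargan = pval(sargan_form(zeta_hat), sargan_form(zeta_star))
p_basmann = pval(basmann_form(zeta_hat), basmann_form(zeta_star))
assert p_sargan == p_basmann   # identical bootstrap P values
```

Any strictly monotonic transformation preserves the ranking of the actual statistic among the bootstrap statistics, which is all that the bootstrap P value depends on.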

Analysis using a Simpler Model
It is clear from (5), (6), and (10) that all the statistics we have considered for testing the overidentifying restrictions depend on y1 and y2 only through their projections M_Z y1 and M_Z y2. We see also that ζ(β) is homogeneous of degree zero with respect to M_Z y1 and M_Z y2 separately, for any β. Thus the statistics depend on the scale of neither y1 nor y2. Moreover, the matrix Z plays no essential role. In fact, it can be shown that the distributions of the test statistics generated by the model (1) and (2) for sample size n are identical to those generated by the simpler model

y1 = βy2 + u1, and (11)
y2 = Wπ + u2, (12)

where the sample size is n − k, the matrix W has l − k columns, and σ1 = σ2 = 1. Of course, y1, y2, and W in the simpler model (11) and (12) are not the same as in the original model. In the remainder of the paper, we deal exclusively with the simpler model. For the original model, n and l in our results below would have to be replaced by n − k and l − k, and y1 and y2 would have to be replaced by M_Z y1 and M_Z y2.
It is well known -- see Mariano and Sawa (1972) -- that all the test statistics depend on the data generated by (11) and (12) only through the six quadratic forms

P11 ≡ y1′P_W y1,  P12 ≡ y1′P_W y2,  P22 ≡ y2′P_W y2,
M11 ≡ y1′M_W y1,  M12 ≡ y1′M_W y2,  M22 ≡ y2′M_W y2. (13)
This is also true for the general model (1) and (2), except that P W must be replaced by P W − P Z = P W M Z .
In this section and the next two, we make the additional assumption that the disturbances u1 and u2 are normally distributed. Since the quadratic forms in (13) depend on the instruments only through the projections P_W and M_W, it follows that their joint distribution depends on W only through the number of instruments l and the norm of the vector Wπ. We can therefore further simplify equation (12) as

y2 = aw + u2, (14)

where the vector w ∈ S(W) is normalized to have length unity, which implies that a² = π′W′Wπ. Thus the joint distribution of the six quadratic forms depends only on the three parameters β, a, and ρ, and on the dimensions n and l; for the general model (1) and (2), the latter would be n − k and l − k.
The above simplification was used in Davidson and MacKinnon (2008) in the context of tests of hypotheses about β, and further details can be found there. The parameter a determines the strength of the instruments. In weak-instrument asymptotics, a = O(1), while in conventional strong-instrument asymptotics, a = O(n^{1/2}). Thus, by treating a as a parameter of order unity, we are in the context of weak-instrument asymptotics; see Staiger and Stock (1997). The square of the parameter a is often referred to as the (scalar) concentration parameter; see Phillips (1983, p. 470) and Stock, Wright, and Yogo (2002).
By equating the two expressions for β̂ in (17) and (18), we obtain an expression which does not depend on β. Similar calculations show that the denominator in (15) and the coefficients A and B in equation (19) do not depend on β, and so neither do the statistics S′ and LR′.
In Davidson and MacKinnon (2008) it is shown that, under the assumption of normal disturbances, the six quadratic forms (13) can be expressed as functions of the three parameters β, a, and ρ and eight mutually independent random variables, the distributions of which do not depend on any of the parameters. Four of these random variables, which we denote by x1, x2, zP, and zM, are standard normal, and the other four, which we denote by tP11, tP22, tM11, and tM22, are respectively distributed as χ²(l−2), χ²(l−1), χ²(n−l), and χ²(n−l−1). In terms of these eight variables, we make the definitions given in (20) and (21). These quantities have simple interpretations: Qij = ui′P_W uj and Nij = ui′M_W uj, for i, j = 1, 2.
Realizations of LR can be generated similarly. From the standard formula for the roots of a quadratic equation, we see that the required root can be expressed in terms of the discriminant ∆, which is given by

∆ = (P11M22 − 2P12M12 + P22M11)² − 4(M11M22 − M12²)(P11P22 − P12²).

Limits
In this section, we show that no test of the overidentifying restrictions is robust to weak instruments. In fact, the distributions of S′ and LR′ have a singularity at the point in the parameter space at which a = 0 and ρ = ±1, or, equivalently, a = r = 0. In order to show this, we consider the limits of the expressions (15) and (22), first when a → 0, and then when r → 0. It is also useful to check that the finite-sample expressions have the form given by conventional (strong-instrument) asymptotics when a → ∞ and n → ∞.
Note that (23) no longer depends on r at all. Thus the distribution of LR′ in the limit of completely irrelevant instruments is independent of all the model parameters.
Thus the limit of LR′ when r → 0 is just

n(Q11 − x1²)/N11. (25)

This is independent of a, and it tends to a χ²(l−1) variable as n → ∞. The singularity mentioned above is a consequence of the fact that the limit at a = r = 0 is ill-defined, since LR′ converges to two different random variables as r → 0 for a = 0 and as a → 0 for r = 0. These random variables are quite different and have quite different distributions.
The limit of LR′ as a → ∞, which is the limit when the instruments are strong, can be computed in a similar way, by isolating the coefficients of powers of a rather than those of r and performing a Taylor expansion for small 1/a. The limit turns out to be, like the limit as r → 0, n(Q11 − x1²)/N11. As n → ∞, N11/n → 1, which shows that the asymptotic distribution with strong instruments is just χ²(l−1). It would be tedious to go through analogous calculations for the statistic S′. We content ourselves with presenting the results. First, the value of S′ for a = 0 is

(n − l)(Q11Q22 − Q12²)(ρ²Q11 + 2rρQ12 + r²Q22) / (ρ²D0 + 2rρD1 + r²D2),

where
D0 = Q12²N11 − 2Q11Q12N12 + Q11²N22,
D1 = Q12Q22N11 − N12(Q11Q22 + Q12²) + Q11Q12N22, and
D2 = Q22²N11 − 2Q12Q22N12 + Q12²N22.
This expression does depend on r, unlike the analogous expression for LR′. When r → 0 with a = 0, it is easy to see that S′ tends to the limit

(n − l)Q11(Q11Q22 − Q12²)/D0. (26)

When r → 0 with a ≠ 0, the limit of S′ is

(n − l)(Q11 − x1²)(a² + 2ax1 + Q11) / (N11(a + x1)²).
This does depend on a, and its limit as a → 0 is just

(n − l)Q11(Q11 − x1²)/(N11x1²), (27)

which is quite different from (26), where the order of the limits is inverted. Lastly, as expected, the limit of S′ as a → ∞ is the same as that of LR′.
The fact that the test statistics S′ and LR′ depend on the parameters a and ρ indicates that these statistics are not robust to weak instruments. Passing to the limit as n → ∞ with weak-instrument asymptotics does not improve matters. Of the six quadratic forms on which everything depends, only the Mij depend on n. Their limiting behavior is such that M11/n → 1, M22/n → 1, and M12/n → ρ as n → ∞. But the Pij do not depend on n, and they do depend on a and ρ.

Finite-Sample Properties of the Tests
The discussion in the previous section was limited to the statistics S′ and LR′. When we discuss bootstrap tests, it is enough to consider just these two, since all other statistics mentioned in Section 2 are monotonic transforms of them. But, of course, the different versions of the Sargan test and the LR test have different properties when used with (strong-instrument) asymptotic critical values. In this section, therefore, we present some Monte Carlo results on the finite-sample performance of five test statistics, including the four discussed above (S, S′, LR, and LR′).
The fifth test statistic we examine is based on the estimator proposed by Fuller (1977). Like the IV and LIML estimators, Fuller's estimator is a K-class estimator for the model (1) and (2). It takes the form given in equation (28). Setting K = κ̂, the minimized value of the variance ratio (10), in equation (28) gives the LIML estimator, while setting K = 1 gives the IV estimator. Fuller's estimator sets K = κ̂ − η/(n − l) for some nonrandom number η > 0 independent of the sample size n. We set η = 1. With this choice, Fuller's estimator β̂F has all moments (except when the sample size is very small) and is approximately unbiased. The corresponding test statistic is simply −n log ζ(β̂F), which has the same form as the LR statistic. We will refer to this as the LRF test.
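The K-class family can be sketched as follows (a minimal illustration in our own notation, not the paper's equation (28); the variance-ratio minimizer κ̂ is computed as the smallest generalized eigenvalue of Y′M_Z Y with respect to Y′M_W Y, which is always at least one):

```python
import numpy as np

def projector(A):
    """Orthogonal projection matrix onto the column span of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def kappa_hat(y1, y2, Z, W):
    """Smallest root of det(Y'M_Z Y - kappa Y'M_W Y) = 0, Y = [y1 y2]."""
    n = len(y1)
    MZ = np.eye(n) - projector(Z)
    MW = np.eye(n) - projector(W)
    Y = np.column_stack([y1, y2])
    roots = np.linalg.eigvals(np.linalg.solve(Y.T @ MW @ Y, Y.T @ MZ @ Y))
    return roots.real.min()        # >= 1 because M_Z - M_W is positive semidefinite

def kclass_beta(y1, y2, Z, W, K):
    """K-class estimate of beta: K = 1 gives IV/2SLS, K = kappa_hat gives
    LIML, and K = kappa_hat - eta/(n - l) gives Fuller's estimator."""
    n, l = W.shape
    MW = np.eye(n) - projector(W)
    X = np.column_stack([y2, Z])
    A = np.eye(n) - K * MW
    return np.linalg.solve(X.T @ A @ X, X.T @ A @ y1)[0]
```

With β̂F = kclass_beta(..., K = κ̂ − 1/(n − l)) in hand, the LRF statistic is computed from the same SSR ratio as the other LR-type statistics.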
The data-generating processes, or DGPs, used for our simulations all belong to the simplified model (11) and (14). The disturbances are generated according to the relations u1 = v1 and u2 = ρv1 + rv2, where v1 and v2 are n-vectors with independent standard normal elements, and r ≡ (1 − ρ²)^{1/2}. Of course, it is quite unnecessary to generate simulated samples of n observations, as it is enough to generate the six quadratic forms (13) as functions of eight mutually independent random variables, using the relations (20) and (21). The sample size n affects only the degrees of freedom of the two χ² random variables tM11 and tM22 that appear in (20). Although any DGP given by (11) and (14) involves no explicit overidentifying restrictions, the test statistics are computed for the model (1) and (2), for which there are q ≡ l − k − 1 of them.
The first group of experiments is intended to provide guidance on the appropriate sample size to use in the remaining experiments. Our objective is to mimic the common situation in which the sample size is reasonably large and the instruments are quite weak. Since the behavior of our simulation DGPs is governed by weak-instrument asymptotics, we should not expect any of the test statistics to have the correct size asymptotically. However, for any given a and ρ, the rejection frequency converges as n → ∞ to that given by the asymptotic distribution of the statistic used; these asymptotic distributions were discussed at the end of the previous section. In the experiments, we use sample sizes of 20, 28, 40, 56, and so on, up to 1810. Each of these numbers is larger than its predecessor by a factor of approximately √2. Each experiment used 10^6 replications.
The results of four sets of experiments are presented in Figure 1, in which we plot rejection frequencies in the experiments for a nominal level of 5%. In the top two panels, a = 2, so that the instruments are very weak. In the bottom two panels, a = 8, so that they are reasonably strong. Recall that the concentration parameter is a 2 . In the two panels on the left, ρ = 0.5, so that there is moderate correlation between the structural and reduced form disturbances. In the two panels on the right, ρ = 0.9, so that there is strong correlation. Note that the vertical axis differs across most of the panels.
It is evident that the performance of all the tests varies greatly with the sample size. The Sargan (S) and Basmann (S′) tests perform almost the same for large samples but very differently for small ones, with the latter much more prone to overreject than the former. For a = 2, the LR test and its linearized version LR′ perform quite differently in small samples but almost identically once n ≥ 200. In this case, the Fuller variant of the LR test performs somewhat differently from both LR and LR′ for all sample sizes. In contrast, for a = 8, LR and LRF are so similar that we did not graph LR′ to avoid making the figure unreadable. LR, LR′, and LRF perform almost identically, and very well indeed, for large sample sizes, even though they overreject severely for small sample sizes.
As expected, all of the rejection frequencies seem to be converging to constants as n → ∞. Moreover, in every case, it appears that the (interpolated) results for n = 400 are very similar to the results for larger values up to n = 1810. Accordingly, we used n = 400 in all the remaining experiments.
In the second group of experiments, the number of overidentifying restrictions q is varied. The four panels in Figure 2 correspond to those of Figure 1. In most cases, performance deteriorates as q increases. Sometimes, rejection frequencies seem to be converging, but by no means always. In the remaining experiments, we somewhat arbitrarily set q = 8. Choosing a smaller number would generally have resulted in smaller size distortions.
In the third group of experiments, the results of which are shown in Figure 3, we set n = 400 and q = 8, and we vary ρ between 0.0 and 0.99 at intervals of 0.01 for four values of a. The vertical axis is different in each of the four panels, because the tests all perform much better as a increases. For clarity, rejection frequencies for LR′ are not shown in the figure, because they always lie between those for LR and LRF. They are very close to those for LR when a is small, and they are very close to those for LRF when a is large.
For the smaller values of a, all of the tests can either overreject or underreject, with rejection frequencies increasing in ρ. The Sargan and Basmann tests overreject very severely when a is small and ρ is large. The LR′, LR, and LRF tests underreject severely when a is small and ρ is not large, but they overreject slightly when a is large. Based on Figure 1 and on the analysis of the previous section, we expect that this slight overrejection vanishes for larger samples.
Although the performance of all the tests is quite poor when a is small, it is worth noting that the Sargan tests are not as unreliable as t tests of the hypothesis that β has a specific value, and the LR tests are not as unreliable as LR tests for that hypothesis; see Davidson and MacKinnon (2008, 2010).

Near the Singularity
From Figures 1-3, we see that the rejection probabilities of all the tests vary considerably with the parameters a and ρ in the neighborhood of the singularity at a = 0, ρ = 1. Further insight into this phenomenon is provided by Figures 4 and 5.
These are contour plots of rejection frequencies near the singularity for tests at the 0.05 level with a and ρ on the horizontal and vertical axes, respectively. Figure 4 is for the Basmann statistic S′, and Figure 5 is for the LR′ statistic. Both figures are for the case dealt with in Figure 3, for which n = 400 and q = 8. The rejection frequencies are, once again, estimated using 10^6 replications.
It is clear from these figures that rejection frequencies tend to be greatest as the singularity is approached by first setting r = 0 and then letting a tend to zero. In this limit, S′ is given by expression (27) and LR′ by expression (25). For extremely small values of a, S′ actually underrejects. But, as a rises to values that are still very small, rejection frequencies soar, sometimes to over 0.80. In contrast, LR′ underrejects severely for small values of a, values which do not have to be nearly as small as in the case of S′. In much of the figure, however, the rejection frequencies for LR′ are just a little greater than 0.05.
The 95% quantile of the distribution of expression (27) has the huge value of 16,285, as estimated from 9,999,999 independent realizations. In contrast, recall that the 95% quantile of the χ²(q) distribution for q = 8 is 15.5073. Since the distribution of S′ for arbitrary a and ρ is stochastically bounded by that of (27), S′ is boundedly pivotal. However, basing inference on the distribution of (27) is certain to be extremely conservative.

Bootstrap Tests
Every test statistic has a distribution which depends on the DGP that generated the sample from which it is computed. The "true" DGP that generated an observed realization of the statistic is in general unknown. However, according to the bootstrap principle, one can perform inference by replacing the unknown DGP by an estimate of it, which is called the bootstrap DGP. Because what we need for inference is the distribution of the statistic under DGPs that satisfy the null hypothesis, the bootstrap DGP must necessarily impose the null. This requirement by itself does not normally lead to a unique bootstrap DGP, and we will see in this section that, for an overidentification test, there are several plausible choices.
If the observed value of a test statistic τ is τ̂, and the rejection region is in the upper tail, then the bootstrap P value is the probability, under the bootstrap distribution of the statistic, that τ is greater than τ̂. To estimate this probability, one generates a large number, say B, of realizations of the statistic using the bootstrap DGP. Let the j-th realization be denoted by τ*j. Then the simulation-based estimate of the bootstrap P value is just the proportion of the τ*j greater than τ̂:

p̂*(τ̂) = (1/B) Σ_{j=1}^{B} I(τ*j > τ̂),

where I(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise. If this fraction is smaller than α, the level of the test, then we reject the null hypothesis. See Davidson and MacKinnon (2006a).
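In code, the estimate is one line once the B bootstrap statistics are in hand. The sketch below makes the loop explicit; generate_sample and statistic are hypothetical callables standing in for whatever bootstrap DGP and test statistic are being used.

```python
import numpy as np

def bootstrap_pvalue(tau_hat, generate_sample, statistic, B=399, seed=12345):
    """Upper-tail bootstrap P value: the proportion of the B bootstrap
    statistics tau*_j that exceed the actual statistic tau_hat."""
    rng = np.random.default_rng(seed)
    tau_star = np.array([statistic(generate_sample(rng)) for _ in range(B)])
    return float(np.mean(tau_star > tau_hat))
```

B is conventionally chosen so that α(B + 1) is an integer; B = 399 works for α = 0.05.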

Parametric Bootstraps
The DGPs contained in the simple model defined by equations (11) and (14) are characterized by just three parameters, namely, β, a, and ρ. Since the value of β does not affect the distribution of the overidentification test statistics, the bootstrap DGP for a parametric bootstrap (assuming normally distributed disturbances) is completely determined by the values of a and ρ that characterize it.
The test statistic τ̂ itself may be any one of the overidentification statistics we have discussed. The model that is actually estimated in order to obtain τ̂ is not the simple model, but rather the full model given by (1) and (2). The parameters of this model include some whose values do not interest us for the purpose of defining a bootstrap DGP: β, since it has no effect on the distribution of the statistic, and γ, since the matrix Z plays no role in the simple model, from which the bootstrap DGP is taken. There remain π, ρ, σ1, and σ2.
For equation (14), the parameter a was defined as the square root of π′W′Wπ, but that definition assumes that the vector w has unit length, and that all the variables are scaled so that the variance of the disturbances u2 is 1. In order to take account of these facts, a suitable definition of a is

a² = π′W′Wπ / σ2². (29)

It follows from (29) that, in order to estimate a, it is necessary also to estimate σ2². Since the parameter ρ is the correlation of the disturbances, which are not observed, any estimate of ρ must be based on the residuals from the estimation of equations (1) and (2). Let these residuals be denoted by ü1 and ü2. Then the obvious estimators of the parameters of the covariance matrix are σ̈1² = n⁻¹ü1′ü1, σ̈2² = n⁻¹ü2′ü2, and ρ̈ = ü1′ü2/(ü1′ü1 ü2′ü2)^{1/2}, and the obvious estimator of a is given by ä² = π̈′W′Wπ̈/σ̈2², where π̈ estimates π. For ü1, there are two obvious choices, the IV residuals and the LIML residuals from (1). For ü2, the obvious choice is the vector of OLS residuals from (2), possibly scaled by a factor of (n/(n − l))^{1/2} to take account of the lost degrees of freedom in the OLS estimation. However, this obvious choice is not the only one, because, if we treat the model (1) and (2) as a system, the system estimator of π that comes with the IV estimator of β is the three-stage least squares (3SLS) estimator, and the one that comes with the LIML estimator of β is the full-information maximum likelihood (FIML) estimator. These system estimators give rise to estimators not only of π, but also of u2, that differ from those given by OLS.
The system estimators of π can be computed without actually performing a system estimation, by running the regression

y2 = Wπ + φü1 + residuals; (31)

see Davidson and MacKinnon (2008), where this matter is discussed in greater detail. If ü1 is the vector of IV residuals, then the corresponding estimator π̈ is the 3SLS estimator; if it is the vector of LIML residuals, then π̈ is the FIML estimator.
For the purpose of computation, it is worth noting that all these estimators can be expressed as functions of the six quadratic forms (13). A short calculation shows that the estimators of a² and ρ based on IV residuals, scaled OLS residuals, and the OLS estimator of π can be written in this way; in particular, â² = (n − l)P22/M22, where b̂ ≡ P12/P22 is the difference between the IV estimator of β and the true β of the DGP, σ̂1² = n⁻¹(Q11 + N11 − 2b̂(P12 + M12) + b̂²(P22 + M22)), and σ̂2² = M22/(n − l).
The weak-instrument asymptotic limit of this expression replaces M22/(n − l) in the denominator by 1. The expectation of the numerator without the factor of n − l is a² + ρ²l + r²l = a² + l. Consequently, it may be preferable to reduce bias in the estimation of a² by setting â² = max(0, (n − l)P22/M22 − l); see Davidson and MacKinnon (2008).
It is plain that, the closer the bootstrap DGP to the true DGP, the better will be bootstrap inference; see Davidson and MacKinnon (1999). We may therefore expect that IV-ER should perform better than IV-R, and that LIML-ER should perform better than IV-ER. Between LIML-ER and F(1)-ER, there is no obvious reason a priori to expect that one of them would outperform the other. But, whatever the properties of these bootstraps may be when the true DGP is not in the neighborhood of the singularity at a = 0, ρ = 1, we cannot expect anything better than some improvement over inference based on asymptotic critical values, rather than truly reliable inference, in the neighborhood of the singularity.

Resampling
Any parametric bootstrap risks being unreliable if the strong assumptions used to define the null hypothesis are violated. Most practitioners would therefore prefer a more robust bootstrap method. The strongest assumption we have made so far is that the disturbances are normally distributed. It is easy to relax this assumption by using a bootstrap DGP based on resampling, in which the bivariate normal distribution is replaced by the joint empirical distribution of the residuals. The discussion of the previous subsection makes it clear that several resampling bootstraps can be defined, depending on the choice of residuals that are resampled.
The most obvious resampling bootstrap DGP in the context of IV estimation is

y1* = β̂IV y2* + û1*, and (35)
y2* = Wπ̂ + û2*, (36)

where y1* and y2* are n-vectors of bootstrap observations, û1* and û2* are n-vectors of bootstrap disturbances with typical elements û1i* and û2i*, respectively, and π̂ is the OLS estimate from (2). The bootstrap disturbances are drawn in pairs from the bivariate empirical distribution of the structural residuals û1i,IV and the rescaled reduced-form residuals (n/(n − l))^{1/2} û2i,OLS:

(û1i*, û2i*) ∼ EDF(û1i,IV, (n/(n − l))^{1/2} û2i,OLS). (37)

Here EDF stands for "empirical distribution function". The rescaling of the reduced-form residuals û2i,OLS ensures that the distribution of the û2i* has variance equal to the unbiased OLS variance estimator.
Since all of the overidentification test statistics are invariant to the values of β and γ, we may replace the bootstrap DGP for y1* given by (35) by

y1* = û1*. (38)
The bootstrap statistics generated by (38) and (36) are identical to those generated by (35) and (36). We will refer to the bootstrap DGP given by (38), (36), and (37) as the IV-R resampling bootstrap. It is a semiparametric bootstrap, because it uses parameter estimates of the reduced-form equation, but it does not assume a specific functional form for the joint distribution of the disturbances. The empirical distribution of the residuals has a covariance matrix which is exactly that used to estimate a and ρ by the IV-R parametric bootstrap; hence our nomenclature.
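A single draw from the IV-R resampling bootstrap can be sketched as follows (our own helper, with hypothetical argument names). The key point is that the two residuals for each observation are resampled jointly, so the correlation between the disturbances is preserved in every bootstrap sample.

```python
import numpy as np

def ivr_draw(W, pi_hat, u1_iv, u2_ols, rng):
    """One bootstrap sample from the IV-R resampling DGP:
    y1* = u1* (invariance to beta lets us drop beta_hat * y2*),
    y2* = W pi_hat + u2*, with (u1*_i, u2*_i) resampled as pairs."""
    n, l = W.shape
    u2_rescaled = np.sqrt(n / (n - l)) * u2_ols  # match unbiased OLS variance
    idx = rng.integers(0, n, size=n)             # resample with replacement
    u1_star = u1_iv[idx]
    u2_star = u2_rescaled[idx]                   # same idx keeps pairs together
    return u1_star, W @ pi_hat + u2_star
```

Each bootstrap replication then applies the chosen overidentification statistic to the pair (y1*, y2*) with the same instrument matrix W.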
The IV-ER resampling bootstrap draws pairs from the joint EDF of the IV residuals û_1^IV from equation (1) and the residuals y_2 − Wπ̂_IV computed by running regression (31) with û_1^IV replacing u_1. It also uses the resulting estimator π̂_IV in (36) instead of the OLS estimator π̂. Note that the residuals y_2 − Wπ̂_IV are not the residuals from (31), but rather those residuals plus φ̂û_1^IV. The LIML-ER resampling bootstrap is very similar to the IV-ER one, except that it uses û_1^LIML both directly and in regression (31); formally, the resampling draws pairs from the bivariate empirical distribution of the û_1i^LIML and the corresponding reduced-form residuals. Similarly, for the F(1)-ER resampling bootstrap, the structural equation (1) is estimated by Fuller's estimator with η = 1, and the residuals from this are used both for resampling and in regression (31).
A word of caution is advisable here. Although the values of overidentification test statistics are invariant to β, thereby allowing us to use (38) instead of (35) in the bootstrap DGP, the residuals from which we resample in (37) and (39) do depend on the estimate of β, as does the estimate of π if it is based on any variant of equation (31). But the test statistics depend on the estimate of β only through the residuals and the estimate of π.
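The pairwise resampling step shared by all of these bootstrap DGPs is straightforward. The following sketch is ours, not the authors' code; the function name and the plain-Python representation are illustrative, and the (n/(n − l))^{1/2} rescaling follows the adjustment described above for the reduced-form residuals.

```python
import random

def resample_pairs(u1, u2, n_boot, n_params, rng):
    """Draw bootstrap disturbances in pairs from the joint EDF of two
    residual vectors. The reduced-form residuals u2 are rescaled by
    sqrt(n / (n - l)) so that their variance matches the unbiased OLS
    variance estimator, as described in the text."""
    n = len(u1)
    scale = (n / (n - n_params)) ** 0.5
    pairs = []
    for _ in range(n_boot):
        i = rng.randrange(n)  # drawing an index keeps (u1_i, u2_i) together
        pairs.append((u1[i], scale * u2[i]))
    return pairs

# Illustrative residual vectors (in practice these are the structural and
# reduced-form residuals from the estimated model).
rng = random.Random(42)
u1 = [rng.gauss(0, 1) for _ in range(50)]
u2 = [rng.gauss(0, 1) for _ in range(50)]
boot = resample_pairs(u1, u2, n_boot=50, n_params=5, rng=rng)
```

Drawing an observation index, rather than resampling each residual series separately, is what preserves the contemporaneous correlation between the structural and reduced-form disturbances.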

Performance of Bootstrap Tests
In principle, any of the bootstrap DGPs discussed in the previous section can be combined with any of the test statistics discussed in Section 2. However, there is no point in considering both S and S′, or both LR and LR′, because in each case one test statistic is simply a monotonic transformation of the other. If both statistics in each pair are bootstrapped using the same bootstrap DGP, they must therefore yield identical results.
All of our experiments involve 100,000 replications for each set of parameter values, and the bootstrap tests mostly use B = 399. This is a smaller number than should generally be used in practice, but it is perfectly satisfactory for simulation experiments, because experimental randomness in the bootstrap P values tends to average out across replications. Although the disturbances of the true DGPs are taken to be normally distributed, the bootstrap DGPs we investigate in the main experiments are resampling ones, because we believe they are the ones that will be used in practice.
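For concreteness, a bootstrap P value of the kind used in these experiments can be computed as below. This is a hypothetical sketch with placeholder statistics, not the authors' code; it also illustrates why two statistics that are monotonic transformations of one another yield identical bootstrap tests.

```python
import math
import random

def bootstrap_p_value(tau_hat, tau_star):
    """Upper-tail bootstrap P value: the proportion of bootstrap statistics
    at least as extreme as the actual statistic."""
    return sum(1 for t in tau_star if t >= tau_hat) / len(tau_star)

rng = random.Random(7)
B = 399
tau_hat = 3.2  # placeholder value of the actual test statistic
tau_star = [rng.expovariate(0.5) for _ in range(B)]  # placeholder bootstrap statistics

p = bootstrap_p_value(tau_hat, tau_star)
# A strictly increasing transformation of the statistic, applied to both the
# actual and bootstrap values, leaves every indicator t >= tau_hat unchanged,
# so the P value is identical.
p_mono = bootstrap_p_value(math.exp(tau_hat), [math.exp(t) for t in tau_star])
```

With B = 399, the P value is a multiple of 1/399, which is why a larger B is advisable in applied work even though B = 399 suffices in simulation experiments.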
Figures 6, 7, and 8 present the results of a large number of Monte Carlo experiments. Figure 6 concerns Sargan tests, Figure 7 concerns LR tests, and Figure 8 concerns Fuller LR tests. Each of the figures shows rejection frequencies as a function of ρ for 34 values of ρ, namely, 0.00, 0.03, 0.06, . . . , 0.99. The four panels correspond to a = 2, 4, 6, and 8. Note that the scale of the vertical axis often differs across panels within each figure and across figures for panels corresponding to the same value of a.
It is important to keep this in mind when interpreting the results.
As we have already seen, for small and moderate values of a, Sargan tests tend to overreject severely when ρ is large and to underreject modestly when it is small. It is evident from Figure 6 that, for a = 2, using either the IV-R or IV-ER bootstrap improves matters only slightly. However, both these methods do provide a more and more noticeable improvement as a increases. For a = 8, the improvement is very substantial. If we were increasing n as well as a, it would be natural to see this as evidence of an asymptotic refinement.
There seems to be no advantage to using IV-ER rather than IV-R. In fact, the latter always works a bit better when ρ is very large. This result is surprising in light of the findings of Davidson and MacKinnon (2008, 2010) for bootstrapping t tests on β. However, the bootstrap methods considered in those papers imposed the null hypothesis that β = β_0, while the ones considered here do not. Apparently, this makes a difference.
Using the LIML-ER and F(1)-ER bootstraps with the Sargan statistic yields entirely different results. The former underrejects very severely for all values of ρ when a is small, but the extent of the underrejection drops rapidly as a increases. The latter always underrejects less severely than LIML-ER (it actually overrejects for large values of ρ when a = 2), and it performs surprisingly well for a ≥ 6. Of course, it may seem a bit strange to bootstrap a test statistic based on IV estimation using a bootstrap DGP based on LIML or its Fuller variant.
In Figure 7, we see that, in contrast to the Sargan test, the LR test generally underrejects, often very severely when both ρ and a are small. Its performance improves rapidly as a increases, however, and it actually overrejects slightly when ρ and a are both large. All of the bootstrap methods improve matters, and the extent of the improvement increases with a. For a = 8, all the bootstrap methods work essentially perfectly. For small values of a, the IV-R bootstrap actually seems to be the best in many cases, although it does lead to modest overrejection when ρ is large.
In Figure 8, we see that the Fuller LR test never underrejects as much as the LR test, and it actually overrejects quite severely when ρ is large and a = 2. However, that is the only case in which it overrejects much. This is the only test for which its own bootstrap DGP, namely, F(1)-ER, is arguably the best one to use. Except when the asymptotic test already works perfectly, using that bootstrap method almost always improves the performance of the test. The bottom two panels of Figure 8 look very similar to the corresponding panels of Figure 7, except that the bootstrapped Fuller test tends to underreject just a bit. It is evident that, as a increases, the LR test and its Fuller variant become almost indistinguishable.
Figures 6, 7, and 8 provide no clear ranking of tests and bootstrap methods. There seems to be a preference for the LR and Fuller LR tests, and for the LIML-ER and F(1)-ER bootstrap DGPs. In no case does any combination of those tests and those bootstrap DGPs overreject anything like as severely as the Sargan test bootstrapped using IV-R or IV-ER. Provided the instruments are not very weak, any of these combinations should yield reasonably accurate, but perhaps somewhat conservative, inferences in most cases.
The rather mixed performance of the bootstrap tests can be understood by using the concept of the "bootstrap discrepancy," which is a function of the nominal level of the test, say α. The bootstrap discrepancy is simply the actual rejection rate of a bootstrap test at level α minus α itself. Davidson and MacKinnon (2006b) shows that the bootstrap discrepancy at level α is a conditional expectation of the random variable
q(α) = R(Q(α, µ*), µ_0) − α,  (40)
where R(α, µ) is the probability, under the DGP µ, that the test statistic is in the rejection region for nominal level α, and Q(α, µ) is the inverse function that satisfies the equation R(Q(α, µ), µ) = α = Q(R(α, µ), µ).
Thus Q(α, µ) is the true level-α critical value of the asymptotic test under µ. The random element in (40) is µ*, the bootstrap DGP. If µ* = µ_0, then we see clearly that q(α) = 0, and the bootstrap discrepancy vanishes. For more detail, see Davidson and MacKinnon (2006b).
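The mechanics of the R and Q functions can be illustrated outside the paper's model. In the toy case below (our own construction, not from the paper), the statistic is N(0, σ²) under the DGP, the asymptotic test uses N(0, 1) critical values, and the bootstrap DGP may get σ wrong; R, Q, and the discrepancy q(α) can then be computed in closed form.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal reference distribution

def R(alpha, sigma):
    """Rejection probability of the two-sided asymptotic test at nominal
    level alpha when the statistic is actually N(0, sigma^2)."""
    c = N01.inv_cdf(1 - alpha / 2)  # asymptotic (N(0,1)) critical value
    return 2 * (1 - N01.cdf(c / sigma))

def Q(alpha, sigma):
    """Inverse of R in its first argument: R(Q(alpha, sigma), sigma) = alpha."""
    c = sigma * N01.inv_cdf(1 - alpha / 2)  # true level-alpha critical value
    return 2 * (1 - N01.cdf(c))

def q(alpha, sigma_boot, sigma_true):
    """Bootstrap discrepancy when the bootstrap DGP uses sigma_boot but the
    true DGP has sigma_true; compare equation (40)."""
    return R(Q(alpha, sigma_boot), sigma_true) - alpha
```

When the bootstrap DGP understates the dispersion of the statistic, its critical values are too small, and q(α) > 0, i.e. the bootstrap test overrejects; when it overstates the dispersion, q(α) < 0. This is exactly the mechanism invoked in the next paragraph for DGPs near the singularity.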
Suppose now that the true DGP µ_0 is near the singularity. The bootstrap DGP can reasonably be expected also to be near the singularity, but most realizations of it are likely to be farther away from the singularity than µ_0 itself. If µ_0 were actually at the singularity, then any bootstrap DGP would necessarily be farther away. If the statistic used is S, then we see from Figure 4 that rejection frequencies fall as the DGP moves away from the singularity in most, but not all, directions. Thus, for most such bootstrap DGPs, Q(α, µ*) is smaller than Q(α, µ_0) for any α, and so the probability mass R(Q(α, µ*), µ_0) in the distribution generated by µ_0 is greater than α. This means that q(α) is positive, and so the bootstrap test overrejects. However, if the statistic used is LR, the reverse is the case, as we see from Figure 5, and the bootstrap test underrejects. This is just what we see in Figures 6 through 8.
Figures 9 and 10 are contour plots similar to Figures 4 and 5, but they are for bootstrap rather than asymptotic tests. The IV-R parametric bootstrap is used for the Sargan test in Figure 9, and the LIML-ER parametric bootstrap is used for the LR test in Figure 10. In both cases, there are 100,000 replications, and B = 199. Figure 9 looks remarkably like Figure 4, with low rejection frequencies for extremely small values of a, then a ridge where rejection frequencies are very high for slightly larger values of a. The ridge is not quite as high as the one in Figure 4, and the rejection frequencies diminish more rapidly as a increases.
Similarly, Figure 10 looks like Figure 5, but the severe underrejection in the far left of the figure occurs over an even smaller region, and there is an area of modest overrejection nearby. Both of these size distortions can be explained by Figure 5. When a is extremely small, the estimate of a used by the bootstrap DGP tends, on average, to be larger than the true value, so the bootstrap critical values tend, on average, to be overestimates. This leads to underrejection. However, there is a region where a is not quite so small in which the bootstrap DGP uses estimates of a that are sometimes too small and sometimes too large. The former cause overrejection, the latter underrejection. Because of the curvature of the rejection probability function, the net effect is modest overrejection; see Davidson and MacKinnon (1999). This is actually the case for most of the parameter values shown in the figure, but the rejection frequencies are generally not much greater than 0.05.

Power Considerations
Overidentification tests are performed in order to check whether some of the assumptions needed for the two-equation model (1) and (2) to be correctly specified are valid. Those assumptions are not valid if the DGP for equation (1) is actually
y_1 = Zγ + βy_2 + W_1δ + u_1,  (41)
where the columns of the matrix W_1 are in the span of the columns of the matrix W and are linearly independent of those of Z. As in Section 3, we can eliminate Z from the model, replacing all other variables and the disturbances by their projections onto the orthogonal complement of the span of the columns of Z. The simpler model of equations (11) and (14) becomes
y_1 = βy_2 + δw_p + v_1,  (42)
y_2 = aw_1 + v_2.  (43)
The vector Wπ is now written as aw_1 instead of aw, and the vector W_1δ is written as δw_p. As before, we impose the normalizations ‖w_1‖² = 1 and a² = π′W′Wπ. In addition, we normalize so that ‖w_p‖² = 1 and δ² = δ′W_1′W_1δ.
The Basmann statistic S′ is still given by equation (15), which is simply an algebraic consequence of the definition (9). Since the DGP for y_2 is unchanged, the quantities denoted in (9) by P_22 and M_22 are the same under the alternative as under the null. Since the DGP for M_W y_1 is also the same under the null and the alternative, so are M_11 and M_12. Thus only P_11 and P_12 differ from the expressions for them in equations (21). It is easy to check that neither the numerator nor the denominator of S′ in (15) depends on β under the alternative, and so in our computations we set β = 0 without loss of generality.
In order to analyze the asymptotic power of the Sargan test in Basmann form, we seek to express its limiting asymptotic distribution as a chi-squared variable that is non-central under the alternative. As usual, in order for the non-centrality parameter (NCP) to have a finite limit, we invoke a Pitman drift. With our normalization of w p , this just means that δ is constant as the sample size n tends to infinity. Again, we cannot expect to find a limiting chi-squared distribution with weak-instrument asymptotics, and so our asymptotic construction supposes that a → ∞ as n → ∞.
Under both the null and the alternative, the denominator of (15), divided by (n − l)P_22, is simply an estimate of the variance of v_1. For the purposes of the asymptotic analysis of the simpler model, it can therefore be replaced by 1. The quantity whose limiting distribution is expected to be chi-squared is therefore P_11 − P_12²/P_22. Recall that this is just the numerator of both the S and S′ statistics.
With β = 0, we compute as follows:
P_11 = y_1′P_W y_1 = δ² + 2δθx_1 + 2δtz_1 + v_1′P_W v_1,
P_12 = y_1′P_W y_2 = a(x_1 + δθ) + O_p(1), and
P_22 = y_2′P_W y_2 = a² + O_p(a),
where the symbol O_p(1) means of order unity as a → ∞. Also, z_i = w_2′v_i and, as before, x_i = w_1′v_i, i = 1, 2. Thus the limit as a → ∞ of P_11 − P_12²/P_22 is
Q_11 + δ² + 2δθx_1 + 2δtz_1 − (x_1 + δθ)².  (44)
In equation (20), we introduced the quantity Q_11, equal to v_1′P_W v_1 and distributed as χ²_l. It was expressed as the sum of three mutually independent random variables, x_1², z_P², and t_P11. Now we separate out both the terms x_1² and z_1² to obtain
Q_11 = x_1² + z_1² + z_P² + t⁰_P11,  (45)
where all four random variables on the right-hand side are independent, with x_1, z_1, and z_P standard normal, and t⁰_P11 distributed as χ²_{l−3}. Note that t⁰_P11 is not to be confused with t_P11 in equations (20), which is distributed as χ²_{l−2}.
It is legitimate to write Q_11 in this way because it can be constructed as the sum of the squares of the l independent N(0, 1) variables w_j′v_1, where the w_j form an arbitrary orthonormal basis of the span of the columns of W. Using (45), the right-hand side of (44) can be written as
z_P² + t⁰_P11 + (z_1 + δt)².
This is the sum of three independent random variables. The first is χ²_1, the second is χ²_{l−3}, and the last is noncentral χ²_1(δ²t²). It follows that, when a² and the sample size both tend to infinity, which implies that the instruments are not weak, the numerator of the test statistic is distributed as χ²_{l−1}(δ²t²). Note that, if θ = 1, so that w_p = w_1, the NCP δ²t² vanishes. For the general model (1) and (2), with DGP given by equation (41), it can be shown that the NCP is
δ′W_1′W_1δ − (δ′W_1′Wπ)²/(π′W′Wπ).  (46)
For the simpler model given by equations (42) and (43), the first term here collapses to δ² and the second term, which arises because β has to be estimated, collapses to −θ²δ². Therefore, expression (46) as a whole corresponds to δ²t² for the simpler model.
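The decomposition into z_P² + t⁰ + (z_1 + δt)² is easy to check by simulation. The sketch below (ours, with arbitrary parameter values) draws the three independent pieces directly and verifies that their sum has mean (l − 1) + δ²t², the mean of a noncentral χ²_{l−1}(δ²t²) variable.

```python
import random

def draw_numerator(l, delta, theta, rng):
    """One draw of z_P^2 + t0 + (z_1 + delta*t)^2, where z_P and z_1 are
    independent N(0,1), t0 is chi-squared with l - 3 degrees of freedom,
    and t = sqrt(1 - theta^2)."""
    t = (1 - theta ** 2) ** 0.5
    z_p = rng.gauss(0, 1)
    z_1 = rng.gauss(0, 1)
    t0 = sum(rng.gauss(0, 1) ** 2 for _ in range(l - 3))  # chi^2 with l-3 df
    return z_p ** 2 + t0 + (z_1 + delta * t) ** 2

rng = random.Random(123)
l, delta, theta = 8, 2.0, 0.6      # arbitrary illustrative values
draws = [draw_numerator(l, delta, theta, rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
ncp = delta ** 2 * (1 - theta ** 2)  # delta^2 t^2
# mean should be close to (l - 1) + ncp
```

The same simulation with θ = 1 gives ncp = 0 and a mean near l − 1, matching the observation that the NCP vanishes when w_p = w_1.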

Finite-sample concerns
The asymptotic result that S′ follows the χ²_{l−1}(δ²t²) distribution strongly suggests that S, LR, and LR′ must do so as well, because all these statistics are asymptotically equivalent. In fact, a more tedious calculation than that in equations (44) and (45) shows that the limiting distribution of LR as both n and a tend to infinity is the same as for S′, namely χ²_{l−1}(δ²t²). Because these results are only asymptotic, however, it is necessary to resort to simulation in order to investigate behavior under the alternative in finite samples.
Under the null, we were able in Section 3 to express all six quantities, the P_ij and the M_ij, for i, j = 1, 2, in terms of eight independent random variables. Under the alternative, we require ten such variables. For the M_ij, there is no need to change the expressions for them in (21), where we use the three variables t_M11, t_M22, and z_M, distributed respectively as χ²_{n−l}, χ²_{n−l−1}, and N(0, 1). These represent the projections of v_1 and v_2 onto the orthogonal complement of the span of the instruments. For the P_ij, however, we use a finer decomposition, in which x_i and z_i, i = 1, 2, and z_P are standard normal, t_P11 is χ²_{l−3}, and t_P22 is χ²_{l−2}, all seven variables being mutually independent. We can simulate both S′ and LR very cheaply, by drawing ten random variables, independently of both the sample size n and the degree of overidentification l − 1, because all the statistics are deterministic functions of the P_ij and the M_ij and, of course, n and l. The relations in (21) hold except those for P_11 and P_12, which are replaced by
P_11 = Q_11 + δ² + 2δθx_1 + 2δtz_1, and
P_12 = ax_1 + ρQ_11 + rQ_12 + δ(aθ + ρθx_1 + ρtz_1 + rθx_2 + rtz_2).
These equations differ from the corresponding ones in (21) only by terms proportional to a positive power of δ.

Simulation evidence
Since we have seen that the LR test often has much better finite-sample properties than the S test, even when both are bootstrapped, it is important to see whether the superior performance of LR comes at the expense of power. In this section, we employ simulation methods to do so.
Given the considerable size distortion of the asymptotic tests for most of that part of the parameter space considered in Section 7, we limit attention to parametric bootstrap tests. In this, we follow Horowitz and Savin (2000), which argues that, whenever the rejection probability of an asymptotic test is far removed from its nominal level, the best way to proceed is to use a bootstrap test. But that proposition is based on the assumption that the bootstrap discrepancy is small enough to be ignored, which is not the case for the overidentification tests we have considered in the neighborhood of the singularity. Because of that, and because it is unreasonable to expect much usable power near the singularity, it is primarily of interest to investigate power for situations in which the instruments are not too weak.
As before, all the simulation results are presented graphically. These results are based on 200,000 replications with 399 bootstrap repetitions. The same random variables are used for every set of parameter values. These experiments would have been extremely computationally demanding without the theoretical results of Section 6 and the first part of this section, which allow us to calculate everything very cheaply after we have generated and stored 200,000 × 10 + 200,000 × 399 × 8 random variables. The first set of random variables is used to calculate the actual test statistics and the estimates of a and ρ, and the second set is used to calculate the bootstrap statistics.
We report results only for S′ bootstrapped using the IV parameter estimates and for LR bootstrapped using the LIML estimates. Recall from Section 2 that the former results apply to S as well as S′, and the latter apply to LR′ as well as LR, because the test statistics in each pair are monotonically related. Figure 11 shows power functions for q = 8, ρ = 0.5, and four values of a. When a = 2, LR rejects much less frequently than S′, both under the null and under the alternative. Both power functions level out as δ becomes large, and it appears that neither test rejects with probability one as δ → ∞. As a increases, the two power functions converge, and both tests do seem to reject with probability one for large δ.
The top two panels of Figure 12 are comparable to the top two panels of Figure 11, but with q = 2. When a = 2, S′ now rejects less often than it did before, but LR rejects more often. When a = 4, LR rejects very much more often than it did before, and the two power functions are quite close. We also obtained results for a = 6, a = 8, and a = 16, which are not shown. For a = 6, the power functions for S′ and LR are extremely similar, and for a ≥ 8 they are visually indistinguishable.
The bottom two panels of Figure 12 are comparable to the top right panel, except that ρ = 0.1 or ρ = 0.9 instead of ρ = 0.5. It is evident that the shapes of the power functions depend on ρ, but for most values of δ the dependence is moderate. This justifies our use of ρ = 0.5 in most of the experiments. Using other values of ρ would not change the main results.
When one power function is always above another, as is the case in all the panels of Figures 11 and 12, it is difficult to conclude that one test is genuinely more powerful than the other. Perhaps greater power is just an artifact of greater rejection frequencies whether or not the null hypothesis is true.
One way to compare such tests is to graph rejection frequencies under the alternative against rejection frequencies under the null. Each point on such a "size-power curve" corresponds to some nominal level for the bootstrap test, with levels running from 0 to 1. The abscissa is the rejection frequency when the DGP satisfies the null, the ordinate the rejection frequency when the DGP belongs to the alternative. For a level of 0, the test never rejects, since bootstrap P values cannot be negative. If the level is 1, the test always rejects. As the nominal level increases from 0 to 1, we expect power (on the vertical axis) to increase more rapidly than the rejection frequency under the null (on the horizontal axis). See Davidson and MacKinnon (1998).
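A size-power curve of this kind is straightforward to construct from two sets of simulated P values, one generated under the null and one under the alternative. The sketch below is our own illustration with placeholder P values, not the paper's experimental code.

```python
import random

def size_power_curve(p_null, p_alt, levels):
    """For each nominal level, pair the rejection frequency under the null
    (abscissa) with the rejection frequency under the alternative
    (ordinate), rejecting whenever the P value is at most the level."""
    curve = []
    for a in levels:
        size = sum(1 for p in p_null if p <= a) / len(p_null)
        power = sum(1 for p in p_alt if p <= a) / len(p_alt)
        curve.append((size, power))
    return curve

rng = random.Random(1)
# Placeholder P values: roughly uniform under the null, pushed toward zero
# under the alternative. A real study would use bootstrap P values here.
p_null = [rng.random() for _ in range(10_000)]
p_alt = [rng.random() ** 3 for _ in range(10_000)]
levels = [i / 100 for i in range(101)]
curve = size_power_curve(p_null, p_alt, levels)
```

Because the curve conditions on the rejection frequency under the null rather than on the nominal level, it compares tests on an equal footing even when their finite-sample sizes differ, which is precisely the issue raised in the preceding paragraph.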
The top two panels of Figure 13 show size-power curves for q = 2, a = 4, and four values of δ. Perhaps surprisingly, the curves for LR in the left-hand panel look remarkably similar to the ones for S′ in the right-hand panel. The apparently greater power of S′, which is evident in the top right panel of Figure 12, seems to be almost entirely accounted for by its greater tendency to reject under the null.
The bottom two panels of Figure 13 show size-power curves for q = 2, δ = 4, and four values of a. It is clear that power increases with a, but at a decreasing rate. As a → ∞, the curves converge to the one given by asymptotic theory, where the distribution under the null is central χ²_{l−1} and the one under the alternative is noncentral χ²_{l−1}(δ²t²). This curve is graphed in the figure and labelled a = ∞. The asymptotic result that the test statistics follow the χ²_{l−1}(δ²t²) distribution suggests that only the product δt = δ(1 − θ²)^{1/2} influences power and, in particular, that there should be no power beyond the level of the test when θ = 1. In finite samples, things turn out to be more complicated, as can be seen from Figure 14, which plots power against θ for δ = 4. The top two panels show results for a = 2 and a = 4. The S test has substantial power when θ = 1 and a = 2, which presumably reflects its tendency to overreject severely under the null when the instruments are weak. Those panels also show, once again, that S can reject far more often than LR when the instruments are weak. This is much less evident in the bottom two panels, which show results for larger values of a (6 and 8).
One surprising feature of Figure 14 is that, in all cases, power initially increases as θ increases from 0, even though δ(1 − θ 2 ) 1/2 declines. This is true even for quite large values of a, such as a = 16, although, of course, it is not true for extremely large values.

Relaxing the IID Assumption
The resampling bootstraps that we looked at in Section 7 do not implicitly make the assumption that the disturbances are normal. They do, however, assume that the disturbances are pairwise IID. If instead the disturbances are heteroskedastic, then the covariance matrix of their bivariate distribution may be different for each observation. In that case, all the test statistics we have studied have distributions that depend on the pattern of heteroskedasticity, and so they are no longer approximately pivotal for the model (1) and (2) under either weak-instrument or strong-instrument asymptotics.
Andrews, Moreira, and Stock (2004) proposes heteroskedasticity-robust versions of test statistics for tests about the value of β that are robust to weak instruments. Note that, although Andrews, Moreira, and Stock (2006) is based on the 2004 paper and has almost the same title, it does not contain this material. However, this work cannot be applied here, because, as we have seen, the overidentification tests are not robust to weak instruments.
The role of the denominators of the statistics S, S′, and LR is simply to provide non-robust estimates of the scale of the numerators. In order to make those statistics robust to heteroskedasticity, we have to provide robust measures instead. The numerators of all three statistics can be written as
û_1′W(W′W)^{−1}W′û_1,  (47)
where the vector û_1 denotes either y_1 − Zγ̂_IV − β̂_IV y_2, in the case of S and S′, or y_1 − Zγ̂_LIML − β̂_LIML y_2, in the case of LR. Expression (47) is a quadratic form in the l-vector W′û_1. The usual estimate of the covariance matrix of that vector is W′Ω̂W, where Ω̂ = diag(û²_1i). Thus the heteroskedasticity-robust variant of all three test statistics is the quadratic form
û_1′W(W′Ω̂W)^{−1}W′û_1.  (48)
There would be no point in using a heteroskedasticity-robust statistic along with a bootstrap DGP that imposes homoskedasticity. The natural way to avoid doing so is to use the wild bootstrap. In Davidson and MacKinnon (2010), the wild bootstrap is shown to have good properties when used for tests about the value of β. The disturbances of the wild bootstrap DGP are given by
û*_1i = û_1i ν*_i,  û*_2i = û_2i ν*_i,  (49)
where ν*_i is an auxiliary random variable with expectation 0 and variance 1. The easiest choice for the distribution of the ν*_i is the Rademacher distribution, which sets ν*_i to +1 or −1, each with probability one half. This is also probably the best choice in most cases; see Davidson and Flachaire (2008).
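The Rademacher wild bootstrap draw described above can be sketched as follows. The names are ours; a full implementation would feed these disturbances into the bootstrap DGP for y*_1 and y*_2 in place of the resampled pairs.

```python
import random

def wild_bootstrap_disturbances(u1, u2, rng):
    """One wild bootstrap draw: multiply BOTH residuals for observation i
    by the same Rademacher variable nu_i. This preserves the
    contemporaneous correlation between the two disturbances while
    allowing heteroskedasticity of unknown form, since each |u_ji| is
    left unchanged."""
    nu = [1 if rng.random() < 0.5 else -1 for _ in u1]  # Rademacher draws
    u1_star = [n * u for n, u in zip(nu, u1)]
    u2_star = [n * u for n, u in zip(nu, u2)]
    return u1_star, u2_star

# Illustrative heteroskedastic residuals (standard deviations vary by i).
rng = random.Random(99)
u1 = [rng.gauss(0, s) for s in (0.5, 1.0, 2.0, 4.0)]
u2 = [rng.gauss(0, s) for s in (0.5, 1.0, 2.0, 4.0)]
u1_star, u2_star = wild_bootstrap_disturbances(u1, u2, rng)
```

The clustered variant described below would instead draw one Rademacher variable per cluster and apply it to every observation in that cluster, in both equations.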
The IID assumption can, of course, be relaxed in other ways. In particular, it would be easy to modify the test statistic (48) to allow for clustered data by replacing the middle matrix with one that resembles the middle matrix for the usual cluster robust covariance matrix. We could then use a variant of the cluster robust wild bootstrap of Cameron et al. (2008) that allows for simultaneity. The Rademacher random variable associated with each cluster, the analog of ν * i in equation (49), would then multiply the residuals for all observations within that cluster for both equations.

Concluding Remarks
We have shown that the well-known Sargan test for overidentification in a linear simultaneous-equations model estimated by instrumental variables often overrejects severely when the instruments are weak. In the same circumstances, the likelihood ratio test often underrejects severely. We provide a finite-sample analysis that explains these facts and shows that the distributions of the different test statistics we consider have a singularity when the concentration parameter vanishes and the absolute value of the correlation between the disturbances of the structural and reduced-form equations tends to one. Thus it can be risky to use asymptotic tests in this situation. We have proposed a new test based on Fuller's modified LIML estimator, which often outperforms the ordinary LR test.
We have also proposed four bootstrap methods which can be applied to all three of these tests. Although bootstrapping does not help much when the instruments are extremely weak, especially when the disturbances of the two equations are highly correlated, it does help substantially when the instruments are only moderately weak. In particular, using a bootstrap DGP based on Fuller's estimator generally leads to much more accurate inferences than simply using asymptotic theory in this case.
There is a cost in terms of power to using a bootstrap test based on any version of the likelihood ratio statistic relative to a test based on the conventional Sargan or Basmann statistics. This cost generally seems to be very modest, except when the instruments are very weak.