A Comparison of Some Recent Bayesian and Classical Procedures for Simultaneous Equation Models with Weak Instruments

We compare the finite sample performance of a number of Bayesian and classical procedures for limited information simultaneous equations models with weak instruments in a Monte Carlo study. We consider Bayesian approaches developed by Chao and Phillips, Geweke, Kleibergen and van Dijk, and Zellner. Among the sampling theory methods, OLS, 2SLS, LIML, Fuller's modified LIML, and the jackknife instrumental variable estimator (JIVE) due to Angrist et al. and Blomquist and Dahlberg are also considered. Since the posterior densities and their conditionals in Chao and Phillips and Kleibergen and van Dijk are nonstandard, we use a novel "Gibbs within Metropolis-Hastings" algorithm, which only requires the availability of the conditional densities from the candidate generating density. Our results show that with very weak instruments, no single estimator is superior to the others in all cases. When endogeneity is weak, Zellner's MELO does best. When the endogeneity is not weak and ρω12 > 0, where ρ is the correlation coefficient between the structural and reduced form errors and ω12 is the covariance between the unrestricted reduced form errors, the Bayesian method of moments (BMOM) outperforms all other estimators by a wide margin. When the endogeneity is not weak and βρ < 0 (β being the structural parameter), the Kleibergen and van Dijk approach seems to work very well. Surprisingly, the performance of JIVE was disappointing in all our experiments.


Introduction
Research on Bayesian analysis of simultaneous equations models addresses a problem, raised initially by Maddala (1976), that is now recognized as related to local nonidentification when diffuse/flat priors are used in traditional Bayesian analysis, e.g., Drèze (1976), Drèze and Morales (1976), and Drèze and Richard (1983). In this paper, we examine the approaches developed by Chao and Phillips (1998, hereafter CP), Geweke (1996), Kleibergen and van Dijk (1998, hereafter KVD), and Zellner (1998). The idea in KVD was to treat an overidentified simultaneous equations model (SEM) as a linear model with nonlinear parameter restrictions; it has been extended further in Kleibergen and Zivot (2003). While KVD focused mainly on resolving the problem of local nonidentification, CP explored further the consequences of using a Jeffreys prior. By deriving exact and (asymptotically) approximate representations for the posterior density of the structural parameter, CP showed that the use of a Jeffreys prior brings Bayesian inference closer to classical inference, in the sense that this prior choice leads to posterior distributions with Cauchy-like tail behavior akin to that of the LIML estimator. Geweke (1996), aware of the potential problem of local nonidentification, suggests a shrinkage prior such that the posterior density is properly defined for each parameter. In another approach, Zellner (1998) suggested a finite sample Bayesian method of moments (BMOM) procedure based on the given data, without specifying a likelihood function or introducing any sampling assumptions.
For the Bayesian approaches considered, while Geweke (1996) proposed Gibbs sampling (GS) to evaluate the posterior density with a shrinkage prior, the posterior densities as well as their conditional densities resulting from CP and KVD are nonstandard and cannot be readily simulated. Within the category of "block-at-a-time" approaches, we suggest a novel MCMC procedure, which we call a "Gibbs within M-H" algorithm. The advantage of this algorithm is that it only requires the availability of the conditional densities from the candidate generating density. These conditional densities are used in a Gibbs sampler to simulate the candidate generating density, whose drawings on convergence are then weighted to generate drawings from the target density in a Metropolis-Hastings (M-H) algorithm. In this study, we will focus on weak instruments, where the classical approach has been particularly problematic. Ni and Sun (2003) have studied similar issues in the context of vector autoregressive models; see also Ni et al. (2007). Radchenko and Tsurumi (2006) used many of the procedures analyzed in this paper to estimate a gasoline demand model using an MCMC algorithm.
The main objective of the present paper is to compare the small sample performance of some Bayesian and classical approaches using Monte Carlo simulations. For the purpose of comparison, a number of classical methods including OLS, 2SLS, LIML, Fuller's modified LIML, and a jackknife instrumental variables estimator (JIVE), originally due to Angrist et al. (1999) and Blomquist and Dahlberg (1999), are also computed from the generated data. Our simulation results from repeated sampling experiments provide some unambiguous guidelines for empirical practitioners.
The plan of the paper is as follows. In Section 2, we set up the model. Section 3 briefly reviews the recent Bayesian approaches and JIVE. Section 4 suggests a new MCMC procedure for evaluating the posterior distributions for CP and KVD, and discusses the convergence diagnostics implemented. Section 5 presents simulation results and some discussion. Section 6 contains the main conclusions.

The Model
Consider the following limited information formulation of the m-equation simultaneous equations model (LISEM):

y1 = Y2 β + Z1 γ + u,    (1)

Y2 = Z1 Π1 + Z2 Π2 + V2,    (2)

where y1 : (T × 1) and Y2 : (T × (m − 1)) are the m included endogenous variables; Z1 : (T × k1) is an observation matrix of exogenous variables included in the structural Equation (1); Z2 : (T × k2) is an observation matrix of exogenous variables excluded from (1); and u and V2 are, respectively, a T × 1 vector and a T × (m − 1) matrix of random disturbances to the system. We assume that (u, V2) ∼ N(0, Σ ⊗ I_T), where the m × m covariance matrix Σ is positive definite symmetric (pds) and is partitioned conformably with the rows of (u, V2) as

Σ = [ σ11  σ21′ ; σ21  Σ22 ].

The likelihood function for the model described by (1) and (2) can then be written as

L(β, γ, Π1, Π2, Σ | Y, Z) ∝ |Σ|^(−T/2) exp{ −(1/2) tr[ Σ^(−1) (u, V2)′(u, V2) ] },    (3)

where Y = (y1, Y2) and Z = (Z1, Z2).

There has been a lot of interest in the estimation of the LISEM with weak instruments; see Buse (1992), Bound et al. (1995), Staiger and Stock (1997), Angrist et al. (1999), and Blomquist and Dahlberg (1999), among others. More recently, Andrews et al. (2019) review the literature on weak instruments in linear IV regression, and suggest that weak instruments remain an important issue in empirical practice.
Note that in the absence of restrictions on the covariance structure, (1) is fully identified if and only if rank(Π2) = (m − 1), which requires k2 ≥ (m − 1).

Review of Some Bayesian Formulations
Among the Bayesian approaches, Geweke (1996) used a shrinkage prior such that all parameters are identified (in the sense that a proper posterior distribution exists) even when Π2 has reduced rank. KVD treated overidentified SEMs as linear models with nonlinear parameter restrictions using the singular value decomposition. A diffuse or natural conjugate prior for the parameters of the embedding linear model results in a posterior for the parameters of the SEM that has zero weight in the region of the parameter space where Π2 has reduced rank. This is a feature of the Jacobian of the transformation from the multivariate linear model to the SEM. CP used a prior obtained by applying the Jeffreys principle to the model described by (1) and (2) and the assumptions regarding the disturbances. An important advantage of the Jeffreys prior in the present context is that it places no weight in the region of the parameter space where rank(Π2) < (m − 1), and relatively low weight in close neighborhoods of this region where the model is nearly unidentified.
Geweke (1996) considered a more general specification. To facilitate comparison, for the Geweke approach only, we have denoted Y = (Y2, y1).

Zellner's Bayesian Method of Moments Approach (BMOM)
Among the various Bayesian treatments of the SEM proposed by Zellner (1971, 1978, 1986, 1994, 1998), the Bayesian method of moments approach applies the principle of maximum entropy and generates optimal estimates which can be evaluated by double K-class estimators. Given the unrestricted reduced form equation y1 = Zπ1 + ξ1, Zellner (1998) considered a balanced loss function defined in terms of X = (Y2, Z1), δ = (β′, γ′)′, and an estimate δ̂ of δ. The BMOM estimate minimizes the expected balanced loss E[L_b], where the expectation is taken with respect to a probability density function of the π matrices of the unrestricted reduced form equations; the resulting estimate varies depending on the value of the weight ω in the balanced loss function. When ω = 1, it is the optimal estimate resulting from a "goodness of fit" loss function L_g. When ω = 0, it is the optimal estimate given by a precision of estimation loss function L_p. Meanwhile, the well-known minimum expected loss (MELO) estimator is derived using a precision of estimation loss function and may be evaluated as a K-class estimator. Similar to the BMOM method, Conley et al. (2008) developed a Bayesian semiparametric approach to the instrumental variable problem assuming linear structural and reduced form equations, but with unspecified error distributions.

The Geweke Approach

Geweke (1996) assumes a reference prior that is the product of an independent inverted Wishart distribution for Σ, with v degrees of freedom and scale matrix S, and independent N(0, τ²) shrinkage priors for each element of β and Π2. Geweke derived the respective conditional posterior distributions, which may be used to generate drawings from the joint posterior distribution through Gibbs sampling. For the parameter blocks Σ⁻¹, A, Π2, and β, the full conditional densities each take a standard form; the expressions involve Φ⁺, a generalized inverse of Φ, where the columns of Φ⁺ and Φ₀ are orthogonal.

The Chao and Phillips Approach
Using the Jeffreys prior, CP obtain exact and approximate analytic expressions for the posterior density of the structural coefficient β in the LISEM (1) and (2). Their formulas are found to exhibit Cauchy-like tails analogous to comparable results in the classical literature on LIML estimation. For the model (1) and (2) under the normality assumption for the disturbances, the Jeffreys prior on the parameters is

p(θ) ∝ |I(θ)|^(1/2),  with I(θ) = −E[∂² ln L(θ | Y, Z)/∂θ ∂θ′],    (13)

where ln L(θ | Y, Z) is the log-likelihood function as specified in (3), and Q_X = I_T − P_X, P_X = X(X′X)⁻¹X′. As first noted by Poirier (1996), the prior in (13) places no weight where rank(Π2) < (m − 1). (The expressions for the conditional densities of Π2 and β given in Geweke (1996), expressions (11) and (13), contain some typographical errors and are corrected here in (11) and (12).)
The joint posterior of the parameters of the LISEM (1) and (2) is constructed as proportional to the product of the prior (13) and the likelihood function (3), where (u, V2) is defined in (1) and (2). Note that (14) and its conditionals do not belong to any standard class of probability density functions.

The Kleibergen and van Dijk Approach
To solve the problem of local nonidentification and to avoid the so-called Borel-Kolmogorov paradox (see Billingsley (1986) and Poirier (1995)), KVD considered (4) as a multivariate linear model with nonlinear parameter restrictions, given in (15). The reduced form model (4) is obtained if a reduced rank restriction is imposed on the linear model (15) such that rank(Φ) = (m − 1) instead of m.
Note that λ is obtained through pre- and postmultiplication of s2 by orthogonal matrices, while s2 contains the smallest singular values of Φ and is invariant with respect to the ordering of the variables contained in Y and Z2.
According to KVD, the above shows that the model described by (1) and (2) can be considered as equivalent to the linear model (16) with a nonlinear (reduced rank) restriction λ = 0 on the parameters. Therefore, the priors and posteriors of the parameters of the LISEM (1) and (2) may be constructed as proportional to the priors and posteriors of the parameters of the linear model (16) evaluated at λ = 0.
The joint posterior of the parameters of the LISEM (4) is readily constructed as proportional to the product of the prior (21) and the likelihood function (5). Unfortunately, this posterior and its conditional densities do not belong to a known class of probability density functions.
The Jackknife Instrumental Variables Estimator (JIVE)

In JIVE, the usual unrestricted reduced form predictions are replaced by 'leave-one-out' jackknife-type fitted values, so that the constructed instrument is independent of the disturbances even in finite samples. The estimator is δ̂_JIVE = (X̂′X)⁻¹X̂′y1, where the t-th row of X̂ is

Z_t Π̂(−t) = (Z_t Π̂ − h_t X_t)/(1 − h_t),    (23)

with Π̂(−t) = (Z′(−t) Z(−t))⁻¹ Z′(−t) X(−t); Z(−t) and X(−t) are the (T − 1) × k and (T − 1) × (m − 1 + k1) matrices obtained after eliminating the t-th rows of Z and X, respectively; Π̂ = (Z′Z)⁻¹Z′X; and h_t = Z_t(Z′Z)⁻¹Z_t′. Angrist et al. (1999) also proposed a second jackknife estimator that is a slight modification of (23). Similar to their study, we found that its performance is very similar to JIVE, and it is not reported here.
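As an illustration, the leave-one-out construction can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `jive` and the numerical design in the usage example are ours, and the leverage-based identity for the leave-one-out fitted values is the standard computational shortcut.

```python
import numpy as np

def jive(y1, X, Z):
    """Jackknife IV estimator (illustrative sketch).

    y1 : (T,) outcome; X : (T, p) right-hand-side variables (endogenous
    plus included exogenous); Z : (T, k) full instrument matrix.
    Uses the leave-one-out identity
        Z_t Pi_hat(-t) = (Z_t Pi_hat - h_t X_t) / (1 - h_t),
    with Pi_hat = (Z'Z)^(-1) Z'X and leverages h_t = Z_t (Z'Z)^(-1) Z_t'.
    """
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    Pi_hat = ZtZ_inv @ (Z.T @ X)
    h = np.einsum('ti,ij,tj->t', Z, ZtZ_inv, Z)          # leverage values
    X_jack = (Z @ Pi_hat - h[:, None] * X) / (1.0 - h)[:, None]
    # IV estimator using the jackknife fitted values as instruments
    return np.linalg.solve(X_jack.T @ X, X_jack.T @ y1)
```

With strong instruments and a moderate sample, the estimate should land near the true structural coefficient, while avoiding the finite-sample correlation between fitted values and disturbances that afflicts 2SLS.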

Posterior Simulator: "Gibbs within M-H" Algorithm
Given the full conditional densities in (9) through (12) for the four blocks of parameters, evaluating the joint posterior density by Gibbs sampling is straightforward; see Geweke (1996) for a detailed description. Although Geweke's (1996) shrinkage prior does not meet the requirement argued for in KVD, namely that the implied prior/posterior on the parameters of an embedding linear model should be well-behaved, we found that the use of Geweke's shrinkage prior does not lead to a reducible Markov chain. With the specification of a shrinkage prior, when Π2 has reduced rank, the joint posterior density still depends on β and does not exhibit an asymptotic cusp. In the following, we only discuss the posterior simulation for CP and KVD.
KVD suggested two simulation algorithms for the posterior (22): an importance sampler and a Metropolis-Hastings algorithm. We found that their M-H algorithm performs unsatisfactorily, with a low acceptance rate even for reasonable parameter specifications. As mentioned earlier, since the posteriors (14) and (22) as well as their conditional posteriors do not belong to any standard class of probability density functions, Gibbs sampling cannot be used directly. In this section, we suggest an alternative simulation algorithm which combines Gibbs sampling (see Casella and George (1992) and Chib and Greenberg (1996)) and the Metropolis-Hastings algorithm (see Metropolis et al. 1953; Hastings 1970; Smith and Roberts 1993; Tierney 1994; Chib and Greenberg 1995). Our algorithm is different from the "M-H within Gibbs" algorithm and can find its usefulness in other applications as well.
To generate drawings from the target density p(x), we use a candidate-generating density r(x). An independence sampler, which is a special case of the M-H sampler, in algorithmic form is as follows:

0. Choose starting values x^0.
1. Draw x^i from r(x).
2. Accept x^i with probability min{ [p(x^i) r(x^(i−1))] / [p(x^(i−1)) r(x^i)], 1 }; otherwise set x^i = x^(i−1).

It is generally not feasible to draw all elements of the vector x simultaneously. A block-at-a-time possibility was first discussed in (Hastings 1970, sct. 2.4) and then in Chib and Greenberg (1995) along with an example. Chib and Greenberg (1995) considered applying the M-H algorithm in turn to sub-blocks of the vector x, which presumes that the target density p(x) may be manipulated to generate full conditional densities for each of the sub-blocks of x, conditioning on other elements of x. However, the full conditionals are sometimes not readily available from the target density for empirical investigators. The posteriors (14) and (22) happen to fall into this category. In this latter case, problems come up at step 1 while trying to generate drawings from the joint marginal density r(x). Note that these drawings, whether accepted or rejected at step 2, satisfy the necessary reversibility condition if step 1 is performed successfully.

Zellner et al. (2014) suggested a variant of this approach, called Acceptance-Rejection within Direct Monte Carlo (ARDMC), to evaluate the posterior density, and report substantial gains in computational efficiency, particularly with weak instruments. They also studied the existence conditions for posterior moments of the parameters of interest in terms of the number of available instruments being greater than the number of endogenous variables plus the order of the moment.
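Steps 0 through 2 can be sketched compactly. The following is an illustrative Python implementation with a toy target and candidate of our choosing (a standard normal target with a heavier-tailed Cauchy candidate); the function name and interface are ours, not part of the paper.

```python
import numpy as np

def independence_sampler(log_p, log_r, draw_r, n, x0, rng):
    """Independence M-H sampler: draw candidates from r(x) and accept
    with probability min{w(cand)/w(current), 1}, where w(x) = p(x)/r(x).
    log_p, log_r are log densities up to additive constants."""
    x = x0
    out = np.empty(n)
    for i in range(n):
        cand = draw_r(rng)                                   # step 1
        log_alpha = (log_p(cand) - log_r(cand)) - (log_p(x) - log_r(x))
        if np.log(rng.random()) < log_alpha:                 # step 2
            x = cand
        out[i] = x
    return out

# Toy run: N(0,1) target, heavier-tailed standard Cauchy candidate.
rng = np.random.default_rng(1)
draws = independence_sampler(
    log_p=lambda x: -0.5 * x * x,            # N(0,1), up to a constant
    log_r=lambda x: -np.log1p(x * x),        # standard Cauchy, ditto
    draw_r=lambda g: g.standard_cauchy(),
    n=20000, x0=0.0, rng=rng)
```

The heavier-tailed candidate matters: if r(x) had thinner tails than p(x), the importance weights w(x) would be unbounded and the chain could stick for long stretches, which is exactly the low-acceptance pathology discussed below.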
To simplify the notation, consider a vector x which contains two blocks, x = (x1, x2). KVD used the fact that r(x1, x2) = r(x1) r(x2 | x1), and suggested drawing x1^i from r(x1) and then drawing x2^i from r(x2 | x1^i). The pair (x1^i, x2^i) is then taken as a drawing from r(x). It turns out that this strategy gives a very low acceptance rate at step 2 in simulation studies for various reasonable parameter values. Sometimes the move never takes place and the posterior has all its mass at the parameter values of the first drawing. The reason for the failure is that information is not updated at subsequent drawings, so the transition kernel in (25) is static.
If the full conditionals r(x 1 |x 2 ) and r(x 2 |x 1 ) are available, which is usually true for many standard densities, we propose to use them in a Gibbs sampler to make independent drawings from the invariant density r(x) after the Markov chain has converged.
The combined algorithm, which we call "Gibbs within M-H", is thus as follows:

0. Choose starting values.
1. Run the Gibbs chain based on the full conditionals r(x1 | x2) and r(x2 | x1) until it has converged.
2. (Gibbs step) Cycle once more through the full conditionals to obtain a candidate pair (x1^i, x2^i) as a drawing from r(x).
3. (M-H step) Accept the candidate as a drawing from p(x) with probability min{ [p(x^i) r(x^(i−1))] / [p(x^(i−1)) r(x^i)], 1 }; otherwise set x^i = x^(i−1).

As explained, step 2 is the Gibbs step and step 3 is the M-H step in our combined algorithm. In the following subsections, we describe the steps for implementing the above procedure to generate drawings from the posteriors (14) and (22).
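The combined algorithm can be sketched on a two-block toy problem of our own construction: the candidate r is a bivariate normal (sd 1.5, correlation 0.5) simulated only through its full conditionals, and the target p is a bivariate normal (sd 1, correlation 0.8). All names and numerical choices here are illustrative assumptions, not the paper's posteriors; the candidate's larger variance keeps the weights p/r bounded.

```python
import numpy as np

def log_bvn(x1, x2, rho, tau=1.0):
    """Log density (up to a constant) of a bivariate normal with equal
    standard deviations tau and correlation rho."""
    return -(x1 * x1 - 2 * rho * x1 * x2 + x2 * x2) / (2 * tau ** 2 * (1 - rho ** 2))

def gibbs_within_mh(n, rng, sweeps=5):
    """'Gibbs within M-H' sketch for a two-block vector x = (x1, x2)."""
    rho_r, tau, rho_p = 0.5, 1.5, 0.8
    sd_cond = tau * np.sqrt(1 - rho_r ** 2)   # conditional sd under r
    g1 = g2 = 0.0                             # state of the inner Gibbs chain
    x1 = x2 = 0.0                             # state of the outer M-H chain
    out = np.empty((n, 2))
    for i in range(n):
        # Gibbs step: cycle through the full conditionals of r to
        # produce a candidate drawing (g1, g2) from r(x1, x2)
        for _ in range(sweeps):
            g1 = rho_r * g2 + sd_cond * rng.normal()   # draw from r(x1 | x2)
            g2 = rho_r * g1 + sd_cond * rng.normal()   # draw from r(x2 | x1)
        # M-H step: accept with prob min{w(cand)/w(curr), 1}, w = p/r
        log_w_cand = log_bvn(g1, g2, rho_p) - log_bvn(g1, g2, rho_r, tau)
        log_w_curr = log_bvn(x1, x2, rho_p) - log_bvn(x1, x2, rho_r, tau)
        if np.log(rng.random()) < log_w_cand - log_w_curr:
            x1, x2 = g1, g2
        out[i] = (x1, x2)
    return out
```

Because only the conditionals of r are ever evaluated for simulation, the sketch mirrors the key property of the algorithm: the joint candidate density never has to be sampled directly, only reweighted.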

Implementing the CP Approach
Note that the posterior in the CP approach is proportional to the product of the prior, which is uniformly bounded, and the likelihood function, which can be sampled by a Gibbs sampler. Therefore, we choose the candidate-generating density in the way suggested by Chib and Greenberg (1995): we use the likelihood function, L(β, γ, Π1, Π2, Σ | Y, Z), as the candidate-generating density for the posterior (14). Working with the precision matrix Σ⁻¹, the simulation proceeds as in the algorithm above: starting values are chosen, the Gibbs step cycles through the conditional densities of the candidate-generating density, and each candidate is accepted as a drawing from the posterior (14) with probability min{·, 1} in the M-H step. The conditional densities used in the first step are constructed as follows (see Percy (1992) and Chib and Greenberg (1996)): rewrite the model (1) and (2) as a SUR model; the conditional density of Σ⁻¹ then follows a Wishart distribution with (T − m − 1) degrees of freedom and scale matrix H⁻¹, where H = Σ_{t=1}^{T} (y_t − W_t δ)(y_t − W_t δ)′.
Implementing the KVD Approach

The posterior (22) is explored in the same way. After choosing starting values, the Gibbs step cycles through the conditional densities of the candidate-generating density, computing (29) and (30); the resulting candidate is then accepted as a drawing from the posterior (22) with probability min{·, 1}. Note that the conditional densities used in the first step include a matric-variate normal density.
The conditional density used in step 5 is then evaluated at λ = 0.

Convergence Diagnosis
One important implementation issue associated with MCMC methods is determining the number of iterations required. There are various informal and formal methods for diagnosing convergence; see Cowles and Carlin (1996) and Brooks and Roberts (1998) for comprehensive reviews and recommendations. Since the posterior densities (14) and (22) resulting from CP and KVD do not have moments of any positive integer order, most of the methods proposed in the MCMC literature, which require the existence of at least the first moment (posterior mean), are ruled out. We are left with very few alternatives that can be used in our context.
First, the popular Raftery and Lewis (1992) method has been recognized as the best for estimating the convergence rate of the Markov chain if quantiles of the posterior density are of major interest, although the method does not provide any information about the convergence rate of the chain as a whole. Because we are interested in the posterior modes and medians for β associated with the Bayesian approaches, we largely rely on Raftery and Lewis' method to determine the number of burn-ins and the subsequent number of iterations required to attain a specified accuracy (e.g., estimating the 0.50 quantile in any posterior within ±0.05 with probability 0.95). However, we do not adopt their suggested skip-interval: MacEachern and Berliner (1994) showed that estimation quality is always degraded by discarding samples. We also experimented with the skip-intervals and found that the results are basically the same if a sufficient number of iterations is run, but this is inefficient and sometimes infeasible in terms of computation time.
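The minimum-iterations component of the Raftery-Lewis calculation can be reproduced directly. The sketch below (our code, not part of gibbsit) computes N_min, the number of iterations needed under independent sampling to estimate the q-quantile to within ±r with probability s; the full Raftery-Lewis method then inflates this by an estimated dependence factor for the correlated chain.

```python
from math import ceil
from statistics import NormalDist

def raftery_lewis_nmin(q=0.5, r=0.05, s=0.95):
    """N_min of Raftery and Lewis (1992): iterations needed under
    independence to pin down the q-quantile within +/- r with
    probability s."""
    z = NormalDist().inv_cdf((s + 1) / 2)
    return ceil(q * (1 - q) * (z / r) ** 2)
```

With the accuracy target quoted above (the 0.50 quantile within ±0.05 with probability 0.95), this gives N_min = 385; the conventional Raftery-Lewis default (q = 0.025, r = 0.005, s = 0.95) gives the familiar 3746.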
For each specification in our Monte Carlo study with repeated experiments, we determined the number of burn-ins and the subsequent number of iterations by running the publicly available FORTRAN code gibbsit on MCMC output of 10,000 iterations from three or more testing replications. For the KVD and CP approaches, the number of burn-ins was estimated for both the GS step and the M-H algorithm. It was found that the number of burn-ins in the GS step is negligible for most cases. Nevertheless, we discarded more iterations as the transient phase than the estimated number of burn-ins. The estimated number of subsequent iterations across testing replications was stable for the Gibbs sampler (in both the Geweke approach and the GS step of the KVD and CP approaches), but it varied a lot for the M-H procedures, which is also demonstrated by the variation in acceptance rates over repeated experiments. We used a generous value for the number of subsequent iterations when feasible.
Second, for the MCMC output from each testing replication, we also applied other convergence diagnostics, including percentiles derived from each quarter of the long chain, the CUSUM plot of Yu and Mykland (1998), and the D-sequence statistic of Brooks (1996). While the CUSUM partial sums involve averaging over sampled drawings, the computation of Brooks' statistic is justified on the basis that it is designed to measure the frequency of back and forth movement in the MCMC algorithm. However, these diagnostics may sometimes provide contradictory outcomes, so one has to be extra careful in interpreting them before making a judgment on convergence.

Simulation Results and Discussions
In this section, we present the results of the Monte Carlo experiments and discuss some of the findings. As mentioned before, for the purpose of comparison, we also computed a number of single K-class estimators, including OLS, 2SLS, LIML, and Fuller's modified LIML. In summary, the set of K-class estimators for the structural coefficients in the model (1) and (2) is given by

δ̂(K) = [(Y2 − K V̂2, Z1)′ X]⁻¹ (Y2 − K V̂2, Z1)′ y1,

where X = (Y2, Z1) and V̂2 = Q_Z Y2; see Equation (7) above.
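A minimal Python sketch of this family follows. It uses the instrumental-variables representation δ̂(K) = [W′X]⁻¹W′y1 with W = (Y2 − K·V̂2, Z1), which is an assumption on our part, chosen to be consistent with V̂2 = Q_Z Y2 above; K = 0 gives OLS, K = 1 gives 2SLS, while LIML and Fuller use data-dependent values of K not computed here.

```python
import numpy as np

def k_class(y1, Y2, Z1, Z2, K):
    """K-class estimator of delta = (beta', gamma')' (sketch).

    y1 : (T,); Y2 : (T, m-1) included endogenous; Z1 : (T, k1) included
    exogenous; Z2 : (T, k2) excluded exogenous.  V2_hat = Q_Z Y2 are the
    residuals from regressing Y2 on Z = (Z1, Z2).
    """
    Z = np.column_stack([Z1, Z2])
    V2_hat = Y2 - Z @ np.linalg.lstsq(Z, Y2, rcond=None)[0]
    X = np.column_stack([Y2, Z1])
    W = np.column_stack([Y2 - K * V2_hat, Z1])   # K-class instrument
    return np.linalg.solve(W.T @ X, W.T @ y1)
```

In a simulated design with endogeneity, the K = 1 (2SLS) estimate concentrates around the true β while the K = 0 (OLS) estimate stays biased in the direction of the error correlation.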
The following LISEM estimators have been considered:
(1) Ordinary least squares (OLS).
(2) Two-stage least squares (2SLS).
(3) Zellner's (1978) Bayesian minimum expected loss estimator (MELO).
(4) Zellner's Bayesian method of moments relative to the balanced loss function (BMOM).
(5) Classical LIML, computed as an iterated Aitken estimator (see Pagan (1979) and Gao and Lahiri (2000a)).
(6) Fuller's (1977) modified LIML estimators (Fuller1 and Fuller4), where K is computed from the LIML estimate.
(7) JIVE.
(8) Posterior mode and median from the Geweke (1996) approach using Gibbs sampling. The values of the hyperparameters are chosen to be τ² = 0.01, v = m(m + 1)/2, S = 0.01 I_m.
(9) Mode and median of the marginal density of β based on classical LIML from Gibbs sampling (LIML-GS). LIML-GS is a byproduct of the "Gibbs within M-H" algorithm for the CP approach, since the likelihood function is used as the candidate-generating density to explore the CP posterior.
(10) Posterior mode and median from the CP approach using the "Gibbs within M-H" algorithm.
(11) Posterior mode and median from the KVD approach using the "Gibbs within M-H" algorithm.
For the Bayesian approaches and LIML-GS, we report both the (posterior) mode and median to show possible asymmetry in the marginal densities of β. Any preference for one over the other will depend on the researcher's loss function. We thus obtain 16 estimates for each generated data set. The data are generated from the model, where y1 and Y2 are T × 1 (so that m = 2) and Z2 is T × k2. We further specify β = 1, and for ρ we used 0.20, 0.60, and 0.95. Z2 is simulated from a N(0, I_{k2} ⊗ I_T) distribution and (u, V2) from a N(0, Σ ⊗ I_T) distribution. A constant term is added in each equation, i.e., Z1 is a T × 1 vector of ones.
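The data generating process can be sketched as follows. This is an illustrative simplification: intercepts are set to zero and the Π2 coefficients are fixed at an arbitrary 0.1, whereas the paper calibrates the instrument strength to a target adjusted R²; the function name and these defaults are ours.

```python
import numpy as np

def generate_lisem(T, k2, beta, rho, rng):
    """One replication of a simplified version of the Monte Carlo design:
    m = 2, Sigma normalized so sigma_11 = omega_22 = 1, sigma_12 = rho."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    errs = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
    u, V2 = errs[:, 0], errs[:, 1]
    Z1 = np.ones(T)                 # constant term (vector of ones)
    Z2 = rng.normal(size=(T, k2))   # instruments, N(0, I)
    Pi2 = np.full(k2, 0.1)          # assumed instrument strength
    Y2 = Z2 @ Pi2 + V2              # reduced form for Y2 (Pi1 = 0 here)
    y1 = beta * Y2 + u              # structural equation (gamma = 0 here)
    return y1, Y2, Z1, Z2
```

Each replication of the experiment draws a fresh data set from this process and feeds it to all sixteen estimators.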
The simulation results are reported in Tables 1-13. Tables 1-12 are for cases with ρ > 0, each table reporting results for one specification (for example, Table 10 has T = 100, ρ = 0.20, k2 = 9, R² = 0.10). Table 13 summarizes the results for cases with ρ < 0 for BMOM and KVD, for which a negative ρ made a surprising difference. As mentioned before, we focus on the estimates of the structural parameter β. Specifically, we analyze the sensitivity of the various estimates of β with respect to the strength of the instrumental variables Z, the degree of overidentification (k2 − m + 1), the degree of endogeneity (ρ), and the sample size (T). We also examine whether the performance of an estimator is symmetric with respect to the sign of the parameter ρ, an issue generally overlooked in the literature. Note that the strength of the instrumental variables for the included endogenous variable Y2 is measured in terms of the adjusted R² from regressing Y2 on Z = (Z1, Z2). In the data generating process, we controlled R² to be within ±2.5% of the specified value to reduce unnecessary variation. We did not experiment with extremely small R² (say, 0.01 or less); in such cases, the mean values of all estimators approached the point of concentration ω12/ω22, which is equal to (β + ρ) for our data generating process (DGP).

For each specification, the number of replications is 400.The number of burn-ins (nburn_GS and nburn_MH), and subsequent number of iterations (n) determined at the convergence diagnosis step are reported in the footnotes to each table.
The average acceptance rate and its standard deviation (in parentheses) across replications for each M-H routine are reported as well. To evaluate the alternative estimators, we computed the mean, standard deviation (Std), root mean squared error (RMSE), and mean absolute deviation (MAD) over repeated experiments for all the estimators considered. Since LIML, the posterior densities for CP and KVD, and 2SLS in the just-identified case do not have finite moments of positive order in finite samples, one should interpret the computed mean, standard deviation, and RMSE across replications for these estimators with caution. In this sense, the MAD across replications is the preferred measure to consider.
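The evaluation criteria are straightforward to compute across replications; the sketch below is ours, and it assumes deviations are measured from the true value of β, which matches how bias-type criteria are discussed in this section.

```python
import numpy as np

def summarize(estimates, beta_true):
    """Summary criteria across Monte Carlo replications: mean, Std,
    RMSE, and MAD (the preferred measure when moments may not exist).
    Deviations are taken from the true value beta_true (assumption)."""
    e = np.asarray(estimates, dtype=float)
    dev = e - beta_true
    return {
        "Mean": e.mean(),
        "Std": e.std(ddof=1),                 # sample standard deviation
        "RMSE": np.sqrt(np.mean(dev ** 2)),
        "MAD": np.mean(np.abs(dev)),
    }
```

The reason MAD is preferred for heavy-tailed estimators is visible in the formulas: a single extreme replication enters RMSE quadratically but MAD only linearly.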
We first look at the cases reported in Tables 1-12 with ρ > 0. In Table 1, we consider a case (T = 50, ρ = 0.60, k2 = 4) with moderately strong instruments (R² = 0.40). We find that with reasonably strong instruments, all estimators designed for simultaneous equations perform reasonably well. As expected, OLS is seriously biased. BMOM has a slight edge over the others in terms of RMSE and MAD. For all Bayesian approaches and LIML-GS, the medians perform a little better than the modes, and CP performs a little better than KVD, in terms of bias, RMSE, and MAD. Notice that the classical LIML estimates are different from LIML-GS (mode or median). As noted by Drèze (1976), from a Bayesian viewpoint, LIML produces an estimate of β conditionally on the overidentifying restrictions, the modal values of all the remaining parameters, and a uniform prior. In other words, the concentrated likelihood function of β, after concentrating out (i.e., maximizing with respect to) the other reduced-form and nuisance parameters, is a conditional density.
However, LIML-GS is a marginal density with all other parameters being integrated out.Due to possible asymmetry in the distribution of the nuisance parameters, the modal/median values of LIML-GS may not coincide with classical LIML estimates.In all our experiments, we find that the median-unbiasedness property of (conditional) LIML does not carry over to the marginal LIML (i.e., LIML-GS); however, the former generally has a much larger standard deviation than the latter.In a way, LIML-GS brings the classical LIML estimator close to its Bayesian counterpart for the purpose of comparison.
Letting ρ = σ12/√(σ11 σ22), the second relationship may be rewritten as ω12 = βω22 + ρ√(σ11 ω22). If Σ is normalized as in (35) with σ11 = ω22 = 1, then ω12 = β + ρ. Therefore, in our context, given β = 1, the sign and magnitude of ρ (or ω12) has a special significance. (Medians of the summary statistics across replications were also calculated; since they were very close to the corresponding means in all our experiments, we do not report them in this paper.)
However, the use of the Jeffreys prior reduces the bias in CP quite substantially. For example, in Table 4, with T = 50 and a high degree of overidentification, the bias is reduced from 0.36 to 0.25. A simple case in which the structural model is just identified (k2 = 1) is reported in Table 2. For this case, it is well known that classical LIML coincides with 2SLS. The KVD approach does not accommodate the just-identified case, since (15) requires k2 > (m − 1). In this case, we find that CP-Mode produces results closer to LIML-GS-Mode than to LIML. CP (1998) showed that for a two-equation just-identified SEM in orthonormal canonical form, the posterior density of β with the Jeffreys prior has precisely the same functional form as the density of the finite sample distribution of the corresponding LIML estimator as obtained by Mariano and McDonald (1979). Our simulation results show that the assumption of orthonormal canonical form is crucial for this exact correspondence, which cannot be extended to a general SEM. In general, the Bayesian marginal density is not the same as the classical conditional density. Interestingly, JIVE is considerably more biased and has a larger standard deviation than 2SLS. Also, CP-Median and LIML-GS-Median perform significantly worse than their modes. This is because in an exactly identified model with weak instruments, the probability of local nonidentification is substantial, and the resulting nonstandard marginal density exhibits a very high variance. The same holds true for Geweke-Median, but to a lesser extent. Thus, for exactly identified SEMs with very weak instruments, the mode of the marginal density is a more dependable measure of β. We should point out that in all other cases in this study, the medians generally turned out to be preferable to the modes in terms of bias, RMSE, and MAD (see Tables 11 and 12, for instance).
The results reported in Tables 3-12 consider cases with general overidentification and weak instruments. As noted in the literature, OLS and 2SLS are median-biased in the direction of the correlation coefficient ρ, and the bias in 2SLS grows with the degree of overidentification and decreases as the sample size increases. The results in Tables 3-10 confirm this. Since MELO is a single K-class estimator with 0 < K < 1, its performance is always between the OLS and 2SLS estimates. The bias in MELO shows the same pattern as that of 2SLS. With moderate simultaneity, the median-bias in 2SLS can be as large as about 40% of the true value (see Table 8). We note that MELO, LIML-GS-Mode, and KVD-Mode or KVD-Median are also median-biased in the direction of ρ. However, the bias in JIVE is consistently in the opposite direction of ρ. Classical LIML is remarkably median-unbiased when the instrumental variables are not very weak, which is well documented in the literature. We find that LIML is median-biased in the direction of ρ when the instruments are very weak (Table 8), which is consistent with the finding in Staiger and Stock (1997) using local-to-zero asymptotic theory. Even in this situation, the bias of LIML is much smaller than that of any other estimator, except BMOM.
The MAD of OLS is very close to its bias (i.e., a relatively small Std) across all cases, which implies that OLS is robust in the sense that it does not suffer from heavy tails or outlying estimates; see Zellner (1998). In this sense, MELO and BMOM are also robust, with relatively small standard deviations across replications. However, OLS exhibits large bias in the presence of simultaneity and is therefore not appealing. It is known that for a degree of overidentification strictly less than seven, 2SLS has a smaller asymptotic mean squared error (AMSE) than LIML; cf. Mariano and McDonald (1979) and references therein.16,17 In cases with weak instruments the situation becomes more complicated in finite samples. In our experiments, LIML has larger RMSE and MAD than 2SLS except in Tables 11 and 12, where ρ was 0.95. Note that the degree of overidentification is 8.0 in Tables 4, 6, 8 and 10. Among the classical estimators, JIVE turns out to be the least appealing. Monte Carlo simulations in Angrist et al. (1999) showed that JIVE has a slight median bias in the opposite direction of ρ (but less than 2SLS) and has heavier tails than LIML. Our Table 6 is comparable to panel 2 of their Table 1, and the results are similar. Our other experiments show that JIVE may also have a large absolute bias (larger than LIML) in the case with weak instruments, sometimes even greater than 2SLS (see Table 2). Generally, JIVE has slightly less bias than 2SLS, but this gain is overshadowed by an enlarged standard deviation, so that in finite samples it has no advantage over 2SLS in terms of MAD and RMSE. We also find that JIVE has greater RMSE and MAD than LIML. Blomquist and Dahlberg (1999) experimented with much larger sample sizes than ours. Comparing our Table 4 with Table 6 and with an unreported simulation with a sample size of 500, we found that the relative gain in JIVE as the sample size increases is greater than for the other estimators, even though its relatively low standing remains. Examined from different angles, these results are very similar to those reported by Davidson and MacKinnon (2006a, 2006b).18 Fuller's modified LIML estimators are included because Fuller1 is designed to minimize the median-bias and Fuller4 to minimize the mean squared error. It seems that this design rationale is also problematic in the presence of weak instruments. Between the two, Fuller1 has smaller median-bias, and Fuller4 has smaller standard deviations across replications. However, in terms of RMSE or MAD, Fuller4 shows no advantage over Fuller1 in most of the cases.

16 When k2 = (m − 1), a diffuse prior in (20) for the linear model implies a prior for the parameters of the LISEM (4), and a prior for the parameters of the LISEM (1) and (2), that is identical to the Jeffreys prior; see also expressions (22) and (42) in CP.

17 Note that the relationship between the standardized parameter vector and the original parameter vector involves the nuisance parameters; cf. Chao and Phillips (1998). However, when a SEM is in orthonormal canonical form (i.e., the exogenous regressors are orthonormal and the disturbance covariance matrix Ω is an identity matrix), both the density of the random parameter β from the CP approach and the probability density of the classical LIML estimator of β are conditioned on this information.
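The K-class family referred to above can be sketched as follows. This is our illustrative code, not the paper's: the data-generating design, coefficient values, and function names are hypothetical. OLS and 2SLS are the K = 0 and K = 1 members of the family, so any single K-class estimator with 0 < K < 1 (such as MELO) lies between the two.

```python
import numpy as np

def k_class(y1, Y2, Z, K):
    """K-class estimator: [Y2'(I - K*M_Z)Y2]^{-1} Y2'(I - K*M_Z)y1,
    where M_Z = I - Z(Z'Z)^{-1}Z' is the annihilator of Z."""
    M_Z = np.eye(len(y1)) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    A = Y2.T @ Y2 - K * (Y2.T @ M_Z @ Y2)
    b = Y2.T @ y1 - K * (Y2.T @ M_Z @ y1)
    return np.linalg.solve(A, b)

# Hypothetical design: one endogenous regressor, four instruments,
# correlated structural/reduced-form errors inducing endogeneity.
rng = np.random.default_rng(0)
T, k2, beta, rho = 100, 4, 1.0, 0.6
Z = rng.standard_normal((T, k2))
e = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=T)
Y2 = Z @ np.full((k2, 1), 0.3) + e[:, [1]]
y1 = Y2 * beta + e[:, [0]]

b_ols = k_class(y1, Y2, Z, 0.0)   # OLS is K = 0
b_2sls = k_class(y1, Y2, Z, 1.0)  # 2SLS is K = 1
b_mid = k_class(y1, Y2, Z, 0.5)   # a MELO-type intermediate K
```

For a single endogenous regressor, the K-class estimate is monotone in K over [0, 1], so the intermediate estimate necessarily lies between the OLS and 2SLS values, consistent with the observation that MELO's performance is always between OLS and 2SLS.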
Because all the estimators except OLS are consistent and share the same asymptotic distribution, results in Tables 3-6 confirm that their bias and dispersion decrease as the sample size increases. But if the instruments are very weak (see Tables 7 and 8), their bias and dispersion may remain significant, a point emphasized forcefully by Zellner (1998). However, when the endogeneity is not strong (see Tables 9 and 10), bias and dispersion may not be a big concern for some of the estimators.
Across all cases, we find that the bias in BMOM is small if ρ is not too small and the structural Equation (1) is overidentified. As the sample size increases or the degree of overidentification rises, the observed bias in BMOM decreases. The most striking feature of BMOM is that it exhibits the smallest MAD and Std when ρ is not too small. MELO shows slightly smaller MAD and Std than BMOM if ρ is small (see Tables 9 and 10). In cases with very weak instruments and a high degree of overidentification, the MAD of BMOM is only one-fourth of that of the other estimators (see Table 8). These results accord with Tsurumi's (1990) finding that in many cases ZEM has the least relative mean absolute deviation. Meanwhile, if ρ is very small and the structural equation is overidentified, the bias in BMOM can be large; 2SLS, LIML-GS, Geweke, and CP perform remarkably well in these situations.
Next, we examine in more detail the performance of the Bayesian approaches. Overall, the median bias from these approaches exhibits the same pattern as the bias of 2SLS: it increases with the degree of overidentification and decreases as the sample size rises. The Geweke (1996) approach uses a shrinkage prior, but its performance is comparable with LIML-GS and CP. The median-bias of PMOD-Geweke is the same as or slightly less than that of LIML-GS-Mode, and the bias of Geweke-Median is always slightly less than that of LIML-GS-Median. Similar relationships are observed for the MADs. These reflect the impact of the (informative) shrinkage prior on the posterior density.
For each specification, the acceptance rate in the M-H algorithm using the CP approach is stable, while that using the KVD approach shows huge variation across replications. First, the acceptance rate for CP is generally above 40%, except when the sample size is small and the degree of overidentification is high. This shows that the posterior of CP is largely dominated by the likelihood function (3) and that the Jeffreys prior generally carries little information. Second, in terms of the computed standard deviations (Stds) of the estimates across replications, CP-Mode has larger dispersion than LIML-GS-Mode, and CP-Median has larger dispersion than LIML-GS-Median. These results also support the notion that the Jeffreys prior is less informative than a uniform prior. However, between the Jeffreys prior (13) used by CP and the implied prior (21) resulting from a diffuse/Jeffreys prior on a linear model used by KVD, it is not clear which one is less informative.
As for the KVD (1998) approach, we observe that it performs as well as any other estimator if the instruments are not weak (see Table 1). But when the instruments are weak and ρ is positive, KVD shows more bias and a higher MAD than CP. In Table 4, with T = 50 and a high degree of overidentification, KVD performs as badly as OLS.
Next, we consider cases with negative ρ; the results are summarized in Table 13. We replicate each case in Tables 1-12 with the same specification except that ρ is negative. Since the performance of all estimators except BMOM and KVD was essentially invariant to the sign of ρ, we report results only on these two in Table 13. We find that when ρ changes sign, the bias of BMOM does not change sign and even increases in magnitude. Note also that the computed Stds for BMOM when ρ < 0 are close to the respective ones when ρ > 0. Therefore, for cases with ρ < 0, BMOM has large RMSEs/MADs and loses its attraction. Note that BMOM is the same as the double K-class estimator (DKC) with the K values fixed. This asymmetry in the performance of DKC is not well recognized in the literature, and has been discussed in Gao and Lahiri (2001). The observed asymmetry of its bias with respect to ρ in our experiments is readily explained by examining the expression for the mean of the double K-class estimator in Dwivedi and Srivastava (1984, Theorem 1). When Z1′Z2 = 0, which is satisfied in our experimental specifications, a double K-class estimator of β may be written as

β̂_DKC = β̂_K1 − (K2 − K1)[Y2′(I − K1 Q_Z)Y2]^{-1} Y2′Q_Z y1,

where β̂_K1 is a single K-class estimator with characterizing scalar K1 and Q_Z = I − Z(Z′Z)^{-1}Z′. Observe that for 0 < K1 < 1, β̂_K1 is biased in the direction of ρ, as noted in Mariano and McDonald (1979). Note also that Y2′(I − K1 Q_Z)Y2 > 0, and that Y2′Q_Z y1 provides an estimate of ω12. Although Dwivedi and Srivastava (1984) explored the dominance of the double K-class over the K-class using the exact MSE criterion, their guidelines for the selection of K2 for a given K1 are not entirely valid, because the conditions were derived from a small Monte Carlo simulation with positive ω12 and negative ρ only. Since K1 < K2 for BMOM, when ρ and ω12 have opposite signs the second term in β̂_DKC has the same sign as the bias of β̂_K1, and therefore β̂_DKC (hence BMOM) exhibits large bias. Otherwise, when ρω12 > 0, the bias is mitigated. Based on our simulation results, we found that the sign of ρ has no effect on the standard deviation of BMOM. This finding shows that the greater RMSE of BMOM when ρω12 < 0 is due to the aggravated bias. For the Table 13 specification corresponding to Table 4 (i.e., T = 50, ρ = −0.60, k2 = 4, R2 = 0.10), we find that for given K1 = 0.947, the RMSE is minimized if K2 is chosen to be 0.829, which is much less than K1, and less than the K2 = 0.987 used in BMOM. See Gao and Lahiri (2001) for further details.
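The decomposition of the double K-class estimator into a single K-class estimate plus a (K2 − K1)-driven correction term, as discussed above, is an exact algebraic identity and can be checked numerically. The following sketch is our illustration (the simulated design is hypothetical, not the paper's experiments); the K values are those cited for BMOM in the Table 4 specification.

```python
import numpy as np

# Hypothetical design with no included exogenous regressors, so the
# condition Z1'Z2 = 0 is trivially satisfied.
rng = np.random.default_rng(1)
T, k2 = 50, 4
Z = rng.standard_normal((T, k2))
e = rng.multivariate_normal([0, 0], [[1, -0.6], [-0.6, 1]], size=T)
Y2 = Z @ np.full((k2, 1), 0.2) + e[:, [1]]
y1 = Y2 * 1.0 + e[:, [0]]

Q_Z = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)  # annihilator of Z
K1, K2 = 0.947, 0.987  # K values cited for BMOM in the Table 4 case

# double K-class: [Y2'(I - K1 Q_Z)Y2]^{-1} [Y2'(I - K2 Q_Z)y1]
A = Y2.T @ (np.eye(T) - K1 * Q_Z) @ Y2
b_dkc = np.linalg.solve(A, Y2.T @ (np.eye(T) - K2 * Q_Z) @ y1)

# single K-class with K1, plus the correction term -(K2-K1) A^{-1} Y2'Q_Z y1
b_k1 = np.linalg.solve(A, Y2.T @ (np.eye(T) - K1 * Q_Z) @ y1)
correction = -(K2 - K1) * np.linalg.solve(A, Y2.T @ Q_Z @ y1)
```

Since K2 > K1, the sign of the correction term is opposite to that of Y2′Q_Z y1 (the estimate of ω12), which is the source of the sign asymmetry discussed in the text.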
In Tables 3-12 we found that KVD with ρ > 0 performs very poorly, often with substantial bias and relatively high RMSE and MAD. CP uniformly dominates KVD in these cases. However, with ρ < 0 the picture turns around remarkably in favor of KVD. As we see in Table 13, across all cases the bias tends to be negative and relatively small. With the other parameter values being the same, KVD with ρ < 0 has significantly lower RMSE and MAD than when ρ > 0, and performs unequivocally the best among all estimators when endogeneity is strong. However, since this observed asymmetry is essentially a finite sample problem with KVD, the improved performance when ρ < 0 becomes less pronounced when the sample size increases from 50 to 100. With ρ < 0 the overall performance of KVD is very comparable to that of CP, if not slightly better in some cases.
After experimenting with widely different negative and positive values of β and ρ, we found that the performance of KVD depends on the sign of βρ, rather than on the sign of ρ. When βρ > 0, it performs very unsatisfactorily, as documented in Tables 3-12. Kleibergen and Zivot (2003) have derived exact analytical expressions for the conditional densities of β given Ω for both the KVD and CP posteriors. They show that the difference between the two lies in the Jacobian relating the unrestricted linear multivariate model to the restricted reduced form model. We expect that this additional term may account for the asymmetry in KVD with respect to βρ. In our experiments we found that, in finite samples with βρ > 0, the reduced rank restriction based on a singular value decomposition shifts the marginal posterior for KVD away from the marginal posterior of the linear multivariate model. However, when the sample size gets large, the problem seems to go away.
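The reduced-rank device mentioned above can be sketched with a singular value decomposition. This is our generic illustration of the technique (the matrix and its dimensions are arbitrary), not the KVD posterior computation itself: zeroing the smallest singular value of an unrestricted reduced-form coefficient matrix yields its closest reduced-rank counterpart in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(2)
Pi = rng.standard_normal((4, 2))  # stand-in for unrestricted reduced-form coefficients
U, s, Vt = np.linalg.svd(Pi, full_matrices=False)
s_restricted = s.copy()
s_restricted[-1] = 0.0            # drop the smallest singular value
Pi_restricted = U @ np.diag(s_restricted) @ Vt  # nearest rank-one matrix
```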

Conclusions
This paper examines the relative merits of some contemporary developments in the Bayesian and classical analysis of limited information simultaneous equations models in situations where the instruments are very weak. Since the posterior densities and their conditionals in the Bayesian approaches developed by Chao and Phillips (1998) and Kleibergen and van Dijk (1998) are nonstandard, we proposed and implemented a "Gibbs within Metropolis-Hastings" algorithm, which only requires the availability of the conditional densities from the candidate-generating density. These conditional densities are used in a Gibbs sampler (GS) to simulate the candidate-generating density, whose drawings, after convergence, are then weighted to generate drawings from the target density in a Metropolis-Hastings (M-H) algorithm. We rely on Raftery and Lewis (1992) to determine the number of burn-ins and the subsequent number of iterations required to ensure convergence. The results of our Monte Carlo study provide useful guidelines for empirical practitioners.
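The general shape of such an algorithm can be sketched as follows. This toy example is ours, not the authors' implementation: the candidate density is a correlated bivariate normal simulated through its two full conditionals (one Gibbs cycle), and each Gibbs draw is proposed to an independence M-H step targeting a heavier-tailed density. For simplicity, the toy candidate's joint kernel is also available in closed form, which need not be the case in the paper's setting.

```python
import math
import random

RHO = 0.8  # correlation of the Gaussian candidate density

def gibbs_candidate(state):
    """One Gibbs cycle through the candidate's full conditionals:
    x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2)."""
    x, y = state
    x = random.gauss(RHO * y, math.sqrt(1 - RHO**2))
    y = random.gauss(RHO * x, math.sqrt(1 - RHO**2))
    return (x, y)

def log_candidate(s):
    # bivariate normal kernel with unit variances and correlation RHO
    x, y = s
    return -(x*x - 2*RHO*x*y + y*y) / (2*(1 - RHO**2))

def log_target(s):
    # heavier-tailed target: a bivariate Student-t-like kernel (4 d.o.f.)
    x, y = s
    q = (x*x - 2*RHO*x*y + y*y) / (1 - RHO**2)
    return -3.0 * math.log(1 + q / 4)

random.seed(0)
state, cand = (0.0, 0.0), (0.0, 0.0)
accepted, draws = 0, []
for _ in range(5000):
    cand = gibbs_candidate(cand)  # candidate chain evolves by Gibbs
    log_ratio = (log_target(cand) - log_candidate(cand)
                 - log_target(state) + log_candidate(state))
    if math.log(random.random()) < log_ratio:  # independence M-H step
        state = cand
        accepted += 1
    draws.append(state)
rate = accepted / 5000  # acceptance rate, as monitored in the experiments
```

The acceptance rate plays the diagnostic role discussed in the results: a stable, high rate indicates that the candidate density tracks the target well.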
The first comforting result is that with reasonably strong instruments (marginal R2 in excess of 0.40), all estimators perform equally well in finite samples. In cases with very weak instruments (marginal R2 less than 0.10), there is no single estimator that is superior to the others in all cases, a conclusion also reached by Andrews and Stock (2005). When endogeneity is weak (ρ less than 0.20), Zellner's MELO does the best. When the endogeneity is relatively strong (ρ in excess of 0.60) and ρω12 > 0, BMOM outperforms all other estimators by wide margins. When the endogeneity is strong but βρ < 0, the KVD approach becomes very appealing; otherwise, its performance is surprisingly poor. With βρ > 0, as the sample size gets larger, the performance of KVD improves rapidly. Fortunately, the Geweke and CP approaches exhibit no such asymmetry, and their performances based on bias, RMSE, and MAD are very similar. Based on the medians of the marginal posteriors, their performance ranking is consistently a distant second. The record of JIVE is quite disappointing across all our experiments, and it is not recommended in practice. Even though JIVE is slightly less biased than 2SLS in most cases, its standard deviation is considerably higher, particularly in small samples. The most remarkable result of this study is that poor instruments can affect the performance of different estimators differently, depending on the signs and magnitudes of certain key parameters of the model. Given the finding that even in finite samples with very weak instruments BMOM and KVD perform remarkably well on certain parts of the parameter space, more research is needed to understand the reasons for the asymmetry and to find ways to fix the problem. Another important caveat of our comparative study is that it was done in an iid setting. Heteroskedastic and autocorrelated errors, particularly in highly leveraged regressions, can affect inferences based on alternative instrumental variable regressions differentially relative to ordinary least squares; see Young (2019). These issues remain unresolved.

Acknowledgments: This paper was presented at the 2000 Winter Meetings of the Econometric Society. We are grateful to the late Arnold Zellner for his early encouragement with this work, and to Ingolf Dittmann, Yuichi Kitamura, Roberto Mariano, Eric Zivot, Herman van Dijk, and two anonymous referees for many helpful comments and suggestions. Peter Dang re-typed the whole manuscript. However, the responsibility for any remaining errors and shortcomings is solely ours.

Conflicts of Interest:
The authors declare no conflict of interest.