Accuracy and Efficiency of Various GMM Inference Techniques in Dynamic Micro Panel Data Models

The finite sample performance is examined of inference obtained by variants of the Arellano-Bond and the Blundell-Bond GMM estimation techniques for single dynamic panel data models with possibly endogenous regressors and cross-sectional heteroskedasticity. By simulation we examine the effects of using particular instrument strength enhancing reductions and transformations of the matrix of instrumental variables, of less robust implementations of the GMM weighting matrix, and of corrections to the standard asymptotic variance estimates. We compare the root mean squared errors of the coefficient estimators as well as the size of tests on coefficient values and of different implementations of overidentification restriction tests. Also the size and power of tests on the validity of the additional orthogonality conditions exploited by the Blundell-Bond technique are assessed over a fairly wide grid of relevant cases. Surprisingly, particular asymptotically optimal and relatively robust weighting matrices are found to be superior in finite samples to ostensibly more appropriate versions. Most of the variants of tests for overidentification restrictions show serious deficiencies. A recently developed modification of GMM is found to have great potential when the cross-sectional heteroskedasticity is pronounced and the time-series dimension of the sample is not too small. Finally, all techniques are applied to actual data and lead to some profound insights.


Introduction
One of the major attractions of analyzing panel data rather than single indexed variables is that they allow one to cope with the empirically very relevant situation of unobserved heterogeneity correlated with included regressors. Econometric analysis of dynamic relationships on the basis of panel data, where the number of surveyed individuals is relatively large while covering just a few time periods, is very often based on GMM (generalized method of moments). Its reputation is built on its claimed flexibility, generality, ease of use, robustness and efficiency. Widely available standard software enables one to estimate models including exogenous, predetermined and endogenous regressors consistently, while allowing for semiparametric approaches regarding the presence of heteroskedasticity and the type of distribution of the disturbances. This software also provides specification checks regarding the adequacy of the internal and external instrumental variables employed and the specific assumptions made regarding the (absence of) serial correlation.
Especially popular are the GMM implementations put forward by Arellano and Bond (1991). However, practical problems have often been reported, such as vulnerability due to the abundance of internal instruments, discouraging improvements of 2-step over 1-step GMM findings, poor size control of test statistics, and weakness of instruments especially when the dynamic adjustment process is slow (a root is close to unity). As remedies it has been suggested to reduce the number of instruments by renouncing some valid orthogonality conditions, but also to extend the number of instruments by adopting more orthogonality conditions. Extra orthogonality conditions can be based on certain homoskedasticity or stationarity assumptions or initial value conditions, see Blundell and Bond (1998). By abandoning weak instruments finite sample bias may be reduced, whereas by extending the instrument set with a few strong ones the bias may be further reduced and the efficiency enhanced. Presently, it is not yet clear how practitioners can best make use of these suggestions, because no set of preferred testing tools is yet available that allows one to select instruments by assessing both their validity and their strength, and to classify individual regressors accurately as either endogenous, predetermined or strictly exogenous. Therefore it often happens in applied research that models and techniques are selected simply on the basis of the perceived significance and plausibility of their coefficient estimates, whereas it is well known that imposing invalid coefficient restrictions and employing regressors wrongly as instruments will often lead to relatively small estimated standard errors, which then contain misleading information on the actual precision of often seriously biased estimators.
The available studies on the performance of alternative inference techniques for dynamic panel data models have obvious limitations when it comes to advising practitioners on the most effective implementations of estimators and tests under general circumstances. As a rule, they do not consider various empirically relevant issues in conjunction, such as: (i) occurrence and the possible endogeneity of regressors additional to the lagged dependent variable, (ii) occurrence of individual effect (non-)stationarity of both the lagged dependent variable and other regressors, (iii) cross-section and/or time-series heteroskedasticity of the idiosyncratic disturbances, and (iv) variation in signal-to-noise ratios and in the relative prominence of individual effects. For example: the simulation results in Arellano and Bover (1995), Hahn and Kuersteiner (2002), Alvarez and Arellano (2003), Hahn et al. (2007), Kiviet (2007), Kruiniger (2008), Okui (2009), Roodman (2009b), Hayakawa (2009) and Han and Phillips (2013) just concern the panel AR(1) model under homoskedasticity. Although an extra regressor is included in the simulation studies in Arellano and Bond (1991), Kiviet (1995), Bowsher (2002), Hsiao et al. (2002), Bond and Windmeijer (2005), Bun and Carree (2005a,b), Bun and Kiviet (2006), Gouriéroux et al. (2010), Hayakawa (2010), Dhaene and Jochmans (2012), Flannery and Watson Hankins (2013), Everaert (2013) and Kripfganz and Schwarz (2013), this regressor is (weakly-)exogenous and most experiments just concern homoskedastic disturbances and stationarity regarding the impact of individual effects. Blundell et al. (2001) and Bun and Sarafidis (2013) include an endogenous regressor, but their design does not allow one to control the degree of simultaneity; moreover, they stick to homoskedasticity. Harris et al. (2009) only examine the effects of neglected endogeneity.
Heteroskedasticity is considered in a few simulation experiments in Arellano and Bond (1991) in the model with an exogenous regressor, and just for the panel AR(1) case in Blundell and Bond (1998). Windmeijer (2005) analyzes panel GMM with heteroskedasticity, but without including a lagged dependent variable in the model. Bun and Carree (2006) and Juodis (2013) examine effects of heteroskedasticity in the model with a lagged dependent and a strictly exogenous regressor under stationarity regarding the effects. So, we have to conclude that knowledge is still scarce with respect to the performance of GMM when it is not only needed to cope with genuine simultaneity (which we consider to be the core of econometrics), but also because of the occurrence of heteroskedasticity of unknown form. Moreover, many of the simulation studies mentioned above did not systematically explore the effects of relevant nuisance parameter values on the finite sample distortions to asymptotic approximations. Regarding the performance of tests on the validity of instruments, worrying results have been obtained in Bowsher (2002) and Roodman (2009b) for homoskedastic models. On the other hand, Bun and Sarafidis (2013) report reassuring results, but these just concern models where T = 3. Our grid of examined cases will be much wider and have more dimensions. Moreover, we will deliberately explore both feasible and unfeasible versions of estimators and test statistics (in unfeasible versions particular nuisance parameter values are assumed to be known). Therefore we will be able to draw more useful conclusions on which aspects do have major effects on any inference inaccuracies in finite samples.
The data generating process designed here can be simulated for classes of models which may include individual and time effects, a lagged dependent variable regressor and another regressor, which may be correlated with these and with the individual effects and be either strictly exogenous or jointly dependent with regard to the idiosyncratic disturbances, whereas the latter may show a form of cross-sectional heteroskedasticity associated with the individual effects. For a range of relevant parameter values we will verify in moderately large samples the properties of alternative GMM estimators, both 1-step and 2-step, focusing on alternative implementations regarding the weighting matrix and corresponding corrections to variance estimates according to the often practiced approach by Windmeijer (2005). This will include variants of the popular system estimator, which exploit as instruments the first-difference of lagged internal variables for the untransformed model in addition to lagged level internal variables as instruments for the model from which the individual effects have been removed. We will examine cases where the extra instruments are valid, and where they are not, to verify whether particular tests for overidentification restrictions have appropriate size and power, such that with reasonable probabilities valid instruments will be recognized as appropriate and invalid instruments will be detected and discarded. Moreover, following Kiviet and Feng (2014), we shall investigate a rather novel modification of the traditional GMM implementation which aims at improving the strength of the exploited instruments in the presence of heteroskedasticity. Of course, also the simulation design used here has its limitations. It has only one extra regressor next to the lagged dependent variable, we only consider cross-sectional heteroskedasticity, and all basic random terms have been drawn from the normal distribution.
Moreover, the design does not accommodate general forms of cross-sectional dependence between error terms. However, by including individual and time specific effects, particular simple forms of cross-sectional dependence are accommodated.
The structure of this study is as follows. In Section 2 we first present the major issues regarding IV and GMM coefficient and variance estimation in linear models, and regarding techniques for inference on instrument validity and on the coefficient values by standard and by corrected test statistics. Next, in Section 3, the generic results of Section 2 are used to discuss in more detail than provided elsewhere the various options for their implementation in linear models for single dynamic simultaneous micro econometric panel data relationships with both individual and time effects and some form of cross-sectional heteroskedasticity. In Section 4 the Monte Carlo design is developed to analyze and compare the performance of alternative, often asymptotically equivalent, inference methods in finite samples for empirically relevant parametrizations. Section 5 summarizes the simulation results, from which some preferred techniques for use in finite samples of particular models emerge, plus a warning regarding particular types of models that require more refined methods yet to be developed. Next, in Section 6, we examine whether these techniques yield new insights when applied to empirical data on life-cycle labor supply which have already been analyzed extensively in the literature. Finally, Section 7 concludes.

Basic GMM results for linear models
Here we present concisely the major generic results on IV and GMM inference for single indexed data that could represent either a time series or a cross-section. First we define the model and estimators, discuss some of their special properties and consider specific test situations. From these general findings for linear regressions the examined implementations for specific linear dynamic panel data models follow easily in Section 3.

Model and estimators
Let the scalar dependent variable y_i depend linearly on K regressors x_i and an unobserved disturbance term u_i, and let there be L ≥ K variables z_i (the instruments) that establish orthogonality conditions such that

  y_i = x_i'β_0 + u_i,  E(z_i u_i) = 0,  i = 1, ..., n. (2.1)

Here x_i and β_0 are K × 1 vectors, β_0 containing the true values of the unknown coefficients, and z_i is an L × 1 vector. Applying the analogy principle, the method of moments (MM) aims to find an estimator for model parameter β by solving the L sample moment equations

  Σ_{i=1}^n z_i (y_i − x_i'β) = 0. (2.2)

Generally, these have a unique solution only when L = K, and then yield

  β̂_MM = (Σ_{i=1}^n z_i x_i')^{-1} Σ_{i=1}^n z_i y_i, (2.3)

provided the inverse exists. For L > K the MM recipe to find a unique estimator is: minimize with respect to β the criterion function

  [Σ_{i=1}^n z_i (y_i − x_i'β)]' G [Σ_{i=1}^n z_i (y_i − x_i'β)]

for some weighting matrix G. It can be shown that the asymptotically optimal choice for G is an expression which has a probability limit that is proportional to the inverse of the asymptotic variance V of n^{-1/2} Σ_{i=1}^n z_i u_i. When u_i ∼ iid(0, σ²_u) an optimal choice for G is proportional to the inverse of Σ_{i=1}^n z_i z_i' and the MM estimator is

  β̂_IV = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'y, (2.4)

where y = (y_1 ⋯ y_n)', X = (x_1 ⋯ x_n)' and Z = (z_1 ⋯ z_n)'. But, when E(z_i u_i) = 0 while u = (u_1 ⋯ u_n)' ∼ (0, σ²_u Ω), where Ω has full rank and without loss of generality tr(Ω) = n, the optimal choice for G is a matrix proportional to (Z'ΩZ)^{-1}, yielding the MM estimator

  β̂_GMM = [X'Z(Z'ΩZ)^{-1}Z'X]^{-1} X'Z(Z'ΩZ)^{-1}Z'y. (2.5)

Note that for Ω = I the latter formula simplifies to β̂_IV. When L = K both β̂_GMM and β̂_IV simplify to (2.3), i.e. to (Z'X)^{-1}Z'y.
When Ω is unknown and therefore (2.5) is unfeasible, one should use an informed guess Ω^{(0)} to obtain the 1-step estimator β̂^{(1)}_GMM, which is sub-optimal when Ω^{(0)} ≠ Ω, though consistent under the assumptions made. Then the residuals

  û^{(1)} = y − Xβ̂^{(1)}_GMM (2.6)

are consistent for u; thus from them it should be possible to obtain an expression Ω̂^{(1)} such that plim n^{-1}(Z'Ω̂^{(1)}Z − Z'ΩZ) = O. Substituting Ω̂^{(1)} in (2.5) yields the 2-step estimator

  β̂^{(2)}_GMM = [X'Z(Z'Ω̂^{(1)}Z)^{-1}Z'X]^{-1} X'Z(Z'Ω̂^{(1)}Z)^{-1}Z'y, (2.7)

which is asymptotically equivalent to β̂_GMM and thus asymptotically optimal, given the L instruments used.
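To make the 1-step/2-step scheme concrete, here is a minimal numerical sketch (our illustration, not taken from this study): simulated data with L = 3 valid instruments and a heteroskedastic disturbance, a 1-step estimator using G proportional to (Z'Z)^{-1}, and a 2-step estimator whose weighting matrix is built from squared 1-step residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 500, 3
beta0 = 0.5

# Simulated overidentified model: x correlated with 3 valid instruments,
# disturbance u heteroskedastic (scale depends on the first instrument).
Z = rng.normal(size=(n, L))
x = Z @ np.array([1.0, 0.7, 0.4]) + rng.normal(size=n)
u = rng.normal(size=n) * (1.0 + 0.5 * np.abs(Z[:, 0]))
y = x * beta0 + u
X = x.reshape(-1, 1)

def gmm(X, y, Z, W):
    """GMM estimator [X'Z W Z'X]^{-1} X'Z W Z'y for weighting matrix W."""
    A = X.T @ Z @ W @ Z.T @ X
    b = X.T @ Z @ W @ Z.T @ y
    return np.linalg.solve(A, b)

# 1-step: weighting (Z'Z)^{-1}, optimal only under homoskedasticity (= 2SLS)
W1 = np.linalg.inv(Z.T @ Z)
b1 = gmm(X, y, Z, W1)

# 2-step: weighting (Z' diag(u1^2) Z)^{-1} from 1-step residuals,
# asymptotically optimal under heteroskedasticity
u1 = y - X @ b1
W2 = np.linalg.inv(Z.T @ (Z * (u1**2)[:, None]))
b2 = gmm(X, y, Z, W2)
```

Both estimators are consistent here; the 2-step version is asymptotically more efficient when the disturbances are heteroskedastic.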

Some algebraic peculiarities
Defining P_Z = Z(Z'Z)^{-1}Z' and X̂ = P_Z X, one finds β̂_IV = (X̂'X̂)^{-1}X̂'y, which highlights its two-stage least-squares character. Now suppose that X = (X_1 X_2) and Z = (Z_1 Z_2) with Z_2 = X_2, whereas Xβ = X_1β_1 + X_2β_2, where β_1 and β_2 have K_1 and K_2 elements respectively. Standard results on partitioned regression yield

  β̂_{1,IV} = (X̂_1'M_{X_2}X̂_1)^{-1} X̂_1'M_{X_2}y = (X_1'P_{M_{X_2}Z_1}X_1)^{-1} X_1'P_{M_{X_2}Z_1}y, (2.8)

which is the IV estimator in the regression of y on just X_1 using the L − K_2 instruments M_{X_2}Z_1. This result is known as partialling out the predetermined regressors X_2. It follows from X̂_2 = P_Z X_2 = X_2, which yields M_{X_2}X̂_1 = M_{X_2}P_Z X_1 = M_{X_2}(P_{X_2} + P_{M_{X_2}Z_1})X_1 = P_{M_{X_2}Z_1}X_1. A similar result is not straightforwardly available for GMM, because of the following.
As is well known and easily verified, linear transformations of the matrix of instruments of the form Z̄ = ZC, where C is a full rank L × L matrix, have no effect on β̂_IV nor on β̂_GMM. However, there is no such invariance when the matrix Z is premultiplied by some transformation matrix, so that not the columns but the rows of Z are directly affected. It has been shown in Kiviet and Feng (2014) that such transformations, chosen in correspondence with the required transformation of the model when Ω ≠ I, may lead to modified GMM estimation achieving higher efficiency levels and better results in finite samples than standard GMM, provided the validity of the transformed instruments is maintained. We will examine here the effects of employing the transformation Z* = ΨZ, which provides the modified GMM estimator β̂_MGMM given in (2.11). When this can be made feasible, it yields its 2-step counterpart β̂^{(2)}_MGMM.
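The invariance to column transformations ZC, and its failure for row transformations ΨZ, can be checked numerically. The following sketch (our illustration, with an arbitrary positive diagonal Ψ) does so for β̂_IV:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 400, 3
Z = rng.normal(size=(n, L))
x = Z @ np.array([1.0, 0.5, 0.3]) + rng.normal(size=n)
y = x * 0.8 + rng.normal(size=n)
X = x.reshape(-1, 1)

def iv(X, y, Z):
    """IV/2SLS estimator [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'y."""
    W = np.linalg.inv(Z.T @ Z)
    A = X.T @ Z @ W @ Z.T @ X
    return np.linalg.solve(A, X.T @ Z @ W @ Z.T @ y)

C = rng.normal(size=(L, L))          # full-rank column transformation
b_Z  = iv(X, y, Z)
b_ZC = iv(X, y, Z @ C)               # algebraically identical to b_Z

Psi = np.diag(1.0 + rng.uniform(size=n))  # a row transformation Z* = Psi Z
b_PsiZ = iv(X, y, Psi @ Z)                # generally a different estimate
```

The column transformation leaves the column space of Z, and hence P_Z, unchanged; the row transformation does not.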

Particular test procedures
Inference on elements of β_0 based on β̂^{(2)}_GMM of (2.7) requires an asymptotic approximation to its distribution. Under correct specification the standard first-order approximation is

  β̂^{(2)}_GMM ∼a N(β_0, σ̂²_u [X'Z(Z'Ω̂^{(1)}Z)^{-1}Z'X]^{-1}). (2.12)

It allows testing general restrictions by Wald-type tests. For an individual coefficient, say β_{0k} = e_{K,k}'β_0, where e_{K,k} is a K × 1 vector with all elements zero except its k-th element (1 ≤ k ≤ K) which is unity, testing H_0: β_{0k} = β⁰_k and allowing for one-sided alternatives amounts to comparing the test statistic

  W_{β_k} = (e_{K,k}'β̂^{(2)}_GMM − β⁰_k) / {σ̂²_u e_{K,k}'[X'Z(Z'Ω̂^{(1)}Z)^{-1}Z'X]^{-1}e_{K,k}}^{1/2} (2.13)

with the appropriate quantile of the standard normal distribution. Note that this test statistic is actually an asymptotic t-test; in finite samples the type I error probability may deviate from the chosen nominal level, also depending on whether σ̂²_u has been obtained from 1-step or from 2-step residuals and on any employed degrees-of-freedom corrections. In fact it has been observed that the consistent estimator of the variance of 2-step GMM estimators, Var(β̂^{(2)}_GMM) given in (2.12), often underestimates the finite sample variance, because in its derivation the randomness of Ω̂^{(1)} is not taken into account. Windmeijer (2005) provides a corrected formula Varc(β̂^{(2)}_GMM), see Appendix A, which can be used in the corrected t-test

  Wc_{β_k} = (e_{K,k}'β̂^{(2)}_GMM − β⁰_k) / {e_{K,k}' Varc(β̂^{(2)}_GMM) e_{K,k}}^{1/2}. (2.14)

When L > K the overidentification restrictions can be tested by the Sargan-Hansen statistic J^{(1)} of (2.15), which under correct specification and valid instruments Z is asymptotically distributed as χ²_{L−K}. Because Ω̂^{(1)} is based on û^{(1)}, which relates to a consistent but not an asymptotically optimal estimator, it seems better to perform at least one further iteration and use the analogous statistic J^{(2)} (2.16), in which β̂^{(2)}_GMM is used to construct Ω̂^{(2)}.
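A sketch (ours, not from this study) of the 2-step estimator together with a Hansen-type overidentification statistic, computed with the weighting matrix built from 1-step residuals, in the spirit of J^{(1)} above but without the σ²_u normalization used in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, K = 1000, 3, 1

# Valid instruments: u is independent of Z, so the L - K = 2
# overidentifying restrictions hold and J is approximately chi2(2).
Z = rng.normal(size=(n, L))
x = Z @ np.array([1.0, 0.6, 0.3]) + rng.normal(size=n)
u = rng.normal(size=n)
y = x * 0.5 + u
X = x.reshape(-1, 1)

def two_step_gmm_and_J(X, y, Z):
    """2-step GMM estimate and Sargan-Hansen statistic J = g' W g,
    where g are the sample moments at the 2-step estimate and W is the
    weighting matrix built from 1-step residuals."""
    W1 = np.linalg.inv(Z.T @ Z)
    b1 = np.linalg.solve(X.T @ Z @ W1 @ Z.T @ X, X.T @ Z @ W1 @ Z.T @ y)
    u1 = y - X @ b1
    W2 = np.linalg.inv(Z.T @ (Z * (u1**2)[:, None]))
    b2 = np.linalg.solve(X.T @ Z @ W2 @ Z.T @ X, X.T @ Z @ W2 @ Z.T @ y)
    u2 = y - X @ b2
    g = Z.T @ u2
    return b2, g @ W2 @ g

b2, J = two_step_gmm_and_J(X, y, Z)
# the chi2(2) 95% quantile is about 5.99; with valid instruments
# J should usually fall below it
```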
When Z = (Z_m Z_a), where Z_m is an n × L_m matrix with L_m ≥ K containing the instruments whose validity seems very likely, then, under the maintained hypothesis E(Z_m'u) = 0, one can test the validity of the L − L_m additional instruments Z_a by the incremental test statistic

  JI = J − J_m, (2.17)

which under correct specification of the model with valid instruments Z is asymptotically distributed as χ²_{L−L_m}. Here J_m denotes the overidentification statistic in which û_m^{(2)} and Ω̂_m^{(2)} are obtained by just using the instruments Z_m. Note that for L_m = K we have JI = J, because then J_m = 0. In the simulations we will also examine unfeasible versions of the above test statistics, which exploit information that is usually not available in practice. This will produce evidence on which elements of the feasible asymptotic tests may cause any inaccuracies in finite samples. So, next to (2.13), (2.16) and (2.17) we will also examine their unfeasible counterparts, such as W^{(u)}_{β_k}.

Implementations for dynamic micro panel models

Model and assumptions
We consider the balanced linear dynamic panel data model

  y_it = x_it'β + w_it'γ + v_it'δ + τ_t + η_i + ε_it  (i = 1, ..., N; t = 1, ..., T), (3.1)

where x_it contains K_x ≥ 0 strictly exogenous regressors (excluding fixed time effects), w_it contains K_w ≥ 0 predetermined regressors (probably including lags of the dependent variable and other variables affected by lagged feedback from y_it or just from ε_it), v_it contains K_v ≥ 0 endogenous regressors (affected by instantaneous feedback from y_it and therefore jointly dependent with y_it), the τ_t are random or fixed time effects, and the η_i are random individual specific effects (most likely correlated with many of the regressors) such that

  E(η_i) = 0, E(η_i²) = σ²_η, (3.2)
  E(ε_it) = 0, E(ε_it ε_js) = 0 for i ≠ j or t ≠ s, E(ε_it²) = σ²_{ε,it}, E(ε_it η_j) = 0 ∀i, j, t. (3.3)

The classification of the regressors implies

  E(x_is ε_it) = 0 ∀s, t;  E(w_is ε_it) = 0 for s ≤ t;  E(v_is ε_it) = 0 for s < t. (3.4)

For the sake of simplicity we assume that all regressors are time varying and that the vectors x_it, w_it and v_it are defined for t = 1, ..., T. However, their elements may contain observations prior to t = 1 for regressors that are actually the l-th order lag of a current variable. Only these lagged regressors are observed from t = 1 − l onwards. This means that all regressors, be it current variables or lags of them, have exactly T observations. So, any unbalancedness problems have been defined away; moreover, no internal instrumental variables can be constructed involving observations prior to those included in the regressor matrix. Stacking the T time-series observations of (3.1), the equation in levels can be written

  y_i = X_iβ + W_iγ + V_iδ + I_Tτ + η_iι_T + ε_i, (3.5)

where ι_T is the T × 1 vector with all its elements equal to unity. Defining the i-th T × K block of the regressor matrix R_i = (X_i W_i V_i I_T), the K × 1 coefficient vector α = (β' γ' δ' τ')' and the compound disturbances as u_i = η_iι_T + ε_i, the model can compactly be expressed as

  y_i = R_iα + u_i. (3.6)

We do allow K_x = 0, K_w = 0 or K_v = 0, but not all three at the same time, so K_x + K_w + K_v ≥ 1. The parameter vector τ could be void too, or have all its elements equal though non-zero.
However, it seems better to allow for a general τ, for the same reason that an ordinary cross-section regression should normally include an intercept, namely to help underpin the assumption that both the idiosyncratic disturbances and the individual effects have expectation zero. We focus on micro panels, where the number of time-series observations T is usually very small, possibly a one-digit number, and the number of cross-section units N is large, usually at least several hundreds. Therefore asymptotic approximations will be for N → ∞ and T finite.
To construct a full column rank matrix of instrumental variables Z = (Z_1' ⋯ Z_N')', which expresses as many linearly independent orthogonality conditions as possible for (3.9), while restricting ourselves to internal variables, i.e. variables occurring in (3.1), we define the following vectors

  x_i^s = (x_i1', ..., x_is')',  w_i^s = (w_i1', ..., w_is')',  v_i^s = (v_i1', ..., v_is')'. (3.10)

Without making this explicit in the notation, it should be understood that these three vectors only contain unique elements. Hence, if vector x_i^s (or w_i^s) contains for 1 < s ≤ T a particular value and also its lag (which is not possible for v_it), then this lag should be taken out, since it already appears in x_i^{s−1}. Matrix Z_i is of order (T − 1) × L and consists of four blocks (though some of these may be void),

  Z_i = (I_{T−1}  Z_i^x  Z_i^w  Z_i^v). (3.11)

The first block is associated with the fundamental moment conditions E(ε̃_i) = 0 and could therefore form part of Z_i even if one imposes τ = 0. For the other blocks we have the specifications given in (3.13), whereas MM estimation requires L ≥ K − 1. It follows from (3.3) and (3.4) that E(Z_i'ε̃_i) = 0 indeed. In actual estimation one may use a subset of these instruments by taking the linear transformation Z_i* = Z_iC, where C is an L × L* matrix (with all its elements often being either zero, one or minus one) of rank L* < L, provided L* ≥ K. In the above we have implicitly assumed that the variables are such that Z = (Z_1' ⋯ Z_N')' will have full column rank, so another necessary condition is N(T − 1) ≥ L. Of course, it is not required that individual blocks Z_i have full column rank.
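As a concrete illustration of how such a block grows with T, consider the simplest case of a single predetermined variable (e.g. the lagged dependent variable): for the differenced equation at period t, the levels dated t − 2 and earlier qualify as instruments. The sketch below (our notation, not this study's) builds that block for one unit:

```python
import numpy as np

def ab_instruments(y_i):
    """Full Arellano-Bond style instrument block for one unit and a single
    predetermined variable observed as y_{i0}, ..., y_{iT}: the row for the
    differenced equation at period t (t = 2, ..., T) holds the levels
    y_{i0}, ..., y_{i,t-2}, and each period gets its own set of columns."""
    y_i = np.asarray(y_i, dtype=float)
    T = len(y_i) - 1                       # periods 1..T after the start-up value
    Z = np.zeros((T - 1, T * (T - 1) // 2))
    c = 0
    for t in range(2, T + 1):
        k = t - 1                          # number of available instruments at t
        Z[t - 2, c:c + k] = y_i[:k]
        c += k
    return Z

# with T = 3 the block has 2 rows and T(T-1)/2 = 3 instrument columns
Z = ab_instruments([0.5, 0.8, 1.1, 1.3])
```

The instrument count grows quadratically in T, which is exactly the instrument proliferation that the reductions discussed next aim to curb.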
Despite its undesirable effect on the asymptotic variance of method of moments estimators, reducing the number of instruments may improve estimation precision, because it may at the same time mitigate estimation bias in finite samples, especially when weak instruments are being removed. So, instead of including the block I_{T−1} in Z_i one could, when the model has no time effects, replace it by I_{T−1}ι_{T−1} = ι_{T−1}. Regarding Z_i^w and Z_i^v two alternative instrument reduction methods have been suggested, namely omitting long lags (see Bowsher 2002, Windmeijer 2005 and Bun and Kiviet 2006) and collapsing (see Roodman 2009b, but also suggested in Anderson and Hsiao 1981). Both are employed in Ziliak (1997); these two methods can also be combined.
Omitting long lags could be achieved by reducing Z_i^w to, for instance, the matrix given in (3.14), and similarly for Z_i^v. The collapsed versions of Z_i^w and of Z_i^v can be denoted as in (3.15). Collapsing can be combined with omitting long lags, if one removes all the columns of Z_i^{*w} and Z_i^{*v} which have at least a certain number of zero elements (say 1 or 2 or more) in their top rows. In corresponding ways, the column space of Z_i^x can be reduced by including in Z_i either a limited number of lags and leads, or the collapsed matrix (3.16), or just its first two or three columns, or, what is often done in practice, simply the difference between the first two columns, the K_x regressors (Δx_i2', ..., Δx_iT')'.
In the simulations we will examine the following specific forms of instrument matrix reduction of the case where all instruments associated with valid linear moment restrictions are being used. The latter case we label as A (all); the reductions are labelled C (standard collapsing), L0, L1, L2, L3 (which primarily restrict the lag length), and C0, C1, C2, C3 (which combine the two reduction principles). In all the reductions we replace I_{T−1} by ι_{T−1} when the model does not include time effects. Regarding Z_i^x, Z_i^w and Z_i^v different types of reductions can be taken, which we will distinguish by using for example the characterization A^v, L2^w, C1^x etc. The particular reductions that we will consider are defined as indicated in Table 1.

Table 1. Definition of labels for particular instrument matrix reductions

Note that for all three types of regressors L2, like L1, uses one extra lag compared to L0, but does not impose the first-difference restrictions characterizing L1. We skipped a similar intermediary case between L2 and L3. Self-evidently, L2^x can also be represented by combining diag(x_i1, ..., x_{i,T−1}) with L1^x, and similarly for L2^w and L2^v. The reductions C0 and C1, which yield just one instrument per regressor, constitute generalizations of the classic instruments suggested by Anderson and Hsiao (1981). These may lead to just identified models, where the number of instruments equals the number of regressors, and to the occurrence of the non-existence of moments problem. To avoid that, and also because we suppose that in general some degree of overidentification will have advantages regarding both estimation precision and the opportunity to test model adequacy, we will restrict ourselves to the popular C1^x and the reductions C and C3 as far as collapsing is concerned. In C3 just the first three columns of the matrices in (3.15) and (3.16) are being used as instruments.
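A minimal sketch (ours) of standard collapsing for a single predetermined variable: instead of one column per (period, lag) pair, the collapsed block has one column per lag depth, so only T − 1 instruments remain instead of T(T − 1)/2.

```python
import numpy as np

def ab_collapsed(y_i):
    """Collapsed instrument block for one unit and a single predetermined
    variable observed as y_{i0}, ..., y_{iT}: column j-1 holds the j-th lag
    y_{i,t-1-j} of the differenced equation at period t where that lag
    exists, and zero otherwise."""
    y_i = np.asarray(y_i, dtype=float)
    T = len(y_i) - 1
    Z = np.zeros((T - 1, T - 1))
    for t in range(2, T + 1):          # row for the differenced equation at t
        for j in range(1, t):          # the j-th lag exists when t - 1 - j >= 0
            Z[t - 2, j - 1] = y_i[t - 1 - j]
    return Z
```

Each moment condition now averages over periods as well as units, which sacrifices some information but keeps the instrument count linear in T.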

Alternative weighting matrices
We assumed in (3.3) that the ε_it are serially and cross-sectionally uncorrelated but may be heteroskedastic. Let us define the matrices Ω_i = E(ε_iε_i') and H = DD', where D is the (T − 1) × T first-difference operator matrix such that ε̃_i = Dε_i. Under standard regularity we have

  N^{-1/2} Σ_{i=1}^N Z_i'Dε_i →d N(0, plim N^{-1} Σ_{i=1}^N Z_i'DΩ_iD'Z_i). (3.18)

Hence, the optimal GMM estimator of α̃ of (3.9) should use a weighting matrix such that its inverse has probability limit proportional to plim N^{-1} Σ_{i=1}^N Z_i'DΩ_iD'Z_i. This can be achieved by starting with a consistent 1-step GMM estimator α̃^{(1)}, which, assuming incorrectly Ω_i = σ²_εI, uses the weighting matrix

  G^{(1)} = (Σ_{i=1}^N Z_i'HZ_i)^{-1}, (3.19)

and then yields what in the simulations we will call the AB1 estimator,

  α̃^{(1)} = [(Σ_i R̃_i'Z_i) G^{(1)} (Σ_i Z_i'R̃_i)]^{-1} (Σ_i R̃_i'Z_i) G^{(1)} (Σ_i Z_i'ỹ_i). (3.20)

Next, in a second step, the consistent 1-step residuals ε̃_i^{(1)} can be used to construct the asymptotically optimal weighting matrix

  Ĝ_a = (Σ_{i=1}^N Z_i'ε̃_i^{(1)}ε̃_i^{(1)'}Z_i)^{-1}. (3.21)

However, an alternative is using Ĝ_b of (3.22)-(3.23), which exploits the fact that under (3.3) the matrix DΩ_iD' is tridiagonal, and therefore retains from ε̃_i^{(1)}ε̃_i^{(1)'} just the main diagonal and the two first sub-diagonals. Note that both (NĜ_a)^{-1} and (NĜ_b)^{-1} have a probability limit equal to the limiting variance in (3.18). The latter is less robust, but may converge faster when Ω_i is diagonal indeed. On the other hand (3.22) may not be positive definite, whereas (3.21) is.
For the special case Ω_i = σ²_{ε,i}I of cross-section heteroskedasticity but time-series homoskedasticity, one could use

  σ̂²_{ε,i} = ε̃_i^{(1)'}H^{-1}ε̃_i^{(1)}/(T − 1), (3.26)

although this estimator is not consistent for T finite. All these different weighting matrices can be used to calculate alternative α̃^{(2)}(j) estimators for j ∈ {a, b, c}, to be labelled AB2a, AB2b and AB2c, according to

  α̃^{(2)}(j) = [(Σ_i R̃_i'Z_i) Ĝ_j (Σ_i Z_i'R̃_i)]^{-1} (Σ_i R̃_i'Z_i) Ĝ_j (Σ_i Z_i'ỹ_i).

When the employed weighting matrix is asymptotically optimal indeed, the first-order asymptotic approximation to the variance of α̃^{(2)}(j) is given by the inverse of the matrix in square brackets. From this (corrected) t-tests are easily obtained, see Section 3.4. Matching implementations of Sargan-Hansen statistics follow easily too, see Section 3.5.
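The 1-step weighting matrix and the robust 2-step (type a) weighting matrix can be assembled as in the following sketch (our code, not this study's implementation; Z_blocks holds the per-unit instrument blocks Z_i and de_blocks the 1-step residual vectors of the differenced equation):

```python
import numpy as np

def ab_weighting_1step(Z_blocks):
    """1-step AB weighting: inverse of sum_i Z_i' H Z_i, where H = DD' is
    the tridiagonal covariance of first-differenced iid disturbances
    (2 on the main diagonal, -1 on the first off-diagonals)."""
    Tm1 = Z_blocks[0].shape[0]
    H = 2 * np.eye(Tm1) - np.eye(Tm1, k=1) - np.eye(Tm1, k=-1)
    G = sum(Z.T @ H @ Z for Z in Z_blocks)
    return np.linalg.inv(G)

def ab_weighting_2step_a(Z_blocks, de_blocks):
    """Robust 2-step (type a) weighting from 1-step residual vectors de_i
    of the differenced equation: inverse of sum_i Z_i' de_i de_i' Z_i."""
    G = sum(Z.T @ np.outer(de, de) @ Z for Z, de in zip(Z_blocks, de_blocks))
    return np.linalg.inv(G)
```

The 1-step version is optimal only under homoskedasticity; the type-a version is heteroskedasticity-robust but requires enough units N relative to the number of instruments for its inner sum to be well conditioned.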
A valid instrument for model (3.29) is the (as a regressor possibly redundant) intercept, because this embodies the single orthogonality condition E[Σ_{t=1}^T (η_i + ε_it)] = E[Σ_{t=1}^T u_it] = 0 (∀i), which is implied by the T + 1 assumptions E(η_i) = 0 and E(ε_it) = 0 (for t = 1, ..., T) made in (3.2) and (3.3). These T + 1 assumptions can also be represented (through linear transformation) by (i) E(η_i) = 0, (ii) E(Δε_it) = 0 (for t = 2, ..., T) and (iii) E(Σ_{t=1}^T u_it) = 0. Because we cannot express η_i exclusively in observed variables and unknown parameters, it is impossible to convert (i) into a separate sample orthogonality condition. The T − 1 orthogonality conditions (ii) are already employed by Arellano-Bond estimation, through including I_{T−1} in Z_i of (3.11) for the equation in first differences. Orthogonality condition (iii), which is in terms of the level disturbances, can be exploited by including the column ι_T in the i-th block of an instrument matrix for level equation (3.29).
Combining the T − 1 difference equations and the T level equations in a system yields (3.30). Its regressor block R̈_i is extended by an extra column of zeros (to annihilate coefficient τ_1 in the equation in first differences), and ü_i = (ε̃_i' u_i')'. We find that E(ε̃_iu_i') = E(Dε_iε_i') = DΩ_i, and the variance matrix of ü_i is given in (3.31). Model (3.30) can be estimated by MM using the N(2T − 1) × (L + 1) matrix of instruments with blocks Z̈_i, provided L + 1 ≥ K and N(2T − 1) ≥ L + 1. Since both R̈_i and Z̈_i contain a column (0' ι_T')', and due to the occurrence of the O block in Z̈_i, by a minor generalization of result (2.8) the IV estimator of α̈ obtained by using instrument blocks Z̈_i in (3.30) will be equivalent, regarding α, with the IV estimator of equation (3.9) using instruments with blocks Z_i. That the same holds here for GMM under cross-sectional heteroskedasticity when using optimal instruments is due to the very special shape of Z̈_i and is proved in Appendix B. Hence, there is no reason to estimate the system just in order to exploit the extra valid instrument (0' ι_T')'.

Effect stationarity
More internal instruments can be found for the equation in levels when some of the regressors are known to be uncorrelated with the individual effects. We will assume that after first differencing some of the explanatory variables may be uncorrelated with η_i. Let r̄_it = (x̄_it' w̄_it' v̄_it')' contain the K̄ = K̄_x + K̄_w + K̄_v unique elements of r_it which are effect stationary, by which we mean that E(r̄_itη_i) is time-invariant, so that

  E(Δr̄_itη_i) = 0, ∀i, t = 2, ..., T.
This implies that for the equation in levels the orthogonality conditions (3.33) hold. When w_it includes y_{i,t−1}, then apparently y_it is effect stationary, so that the adopted model (3.1) suggests that all regressors in r_it must be effect stationary, resulting in K̄ = K. Like for the T − 1 conditions E(Δε_it) = 0 discussed below (3.6), many of the conditions (3.33) are already implied by the orthogonality conditions E(Z_i'ε̃_i) = 0 for the equation in first differences. In Appendix C we demonstrate that a matrix Z̃_i^s of instruments can be designed for the equation in levels (3.6) just containing instruments additional to those already exploited by E(Z_i'ε̃_i) = 0. Under effect stationarity of the K̄ variables in (3.33) the system (3.30) can be estimated while exploiting the accordingly extended matrix of instruments. If one decides to collapse the instruments included in Z_i, it seems reasonable to collapse Z̃_i^s as well and replace it by the collapsed version given in (3.36). Note that the collapsed Z̃_i^{*s} has 1 + K̄ columns.
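Under effect stationarity, the extra level-equation instruments are lagged first differences. A minimal sketch of this (our notation; for a single variable only, and skipping the first level equation, for which no lagged difference is available):

```python
import numpy as np

def bb_level_instruments(y_i):
    """Additional Blundell-Bond style instruments for the level equations:
    under effect stationarity the lagged first difference dy_{i,t-1} is a
    valid instrument for the level equation at period t (t = 2, ..., T).
    Returned as a diagonal block with one instrument per period."""
    y_i = np.asarray(y_i, dtype=float)
    T = len(y_i) - 1
    dy = np.diff(y_i)              # dy[t-1] = y_{it} - y_{i,t-1}
    return np.diag(dy[:T - 1])     # row for period t uses dy_{i,t-1}
```

Collapsing this block, as in (3.36), would stack the lagged differences into a single column instead of a diagonal.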

Alternative weighting matrices under effect stationarity
For the above system we have corresponding expressions for the variance of the sample moments. Hence a feasible initial weighting matrix is given by S^{(0)}(q), with q some nonnegative real value. Weighting matrix S^{(0)}(q) would be optimal if Ω_i = σ²_εI with q = σ²_η/σ²_ε. The, for any q, consistent 1-step GMM system estimator BB1(q) is given in (3.40). Next, in a second step, the consistent 1-step residuals ü_i^{(1)}(q) can be used to construct the asymptotically optimal a-type weighting matrix. However, several alternatives are possible. One may consider a b-type weighting matrix of similar form, but on the basis of the residuals ε̃_i^{(1)}, see (3.43). For the special case σ²_εΩ_i = σ²_{ε,i}I of cross-section heteroskedasticity and time-series homoskedasticity one can use a c-type weighting matrix. Note that, unlike the former, the latter is consistent, because for t = s

The three alternatives yield 2-step system estimators α̈^{(2)}(j) for j ∈ {a, b, c}, labelled BB2a, BB2b and BB2c, as given in (3.47); the inverse matrix expression occurring there can be used again to estimate the variance of α̈^{(2)}(j) if all employed moment conditions are valid.

Coefficient restriction tests
Simple Student-type coefficient test statistics can be obtained from 1-step and 2-step AB and BB estimation for the different weighting matrices considered. The 1-step estimators can be used in combination with a robust variance estimate (which takes possible heteroskedasticity into account). The 2-step estimators can be used in combination with the standard or a corrected variance estimate.¹ When testing particular coefficient values, the relevant element of the AB1 estimator $\hat\alpha^{(1)}$ given in (3.20) should under homoskedasticity be scaled by the corresponding diagonal element of the standard expression for its estimated variance, where $\hat\sigma^2_{\varepsilon,i}$ is given in (3.26). This test will be indicated as AB1. Its robust version under cross-sectional heteroskedasticity uses for $j \in \{a, b, c\}$ the variance estimate (3.49), which will be indicated as either AB1aR, AB1bR or AB1cR. However, under heteroskedasticity the estimators $\hat\alpha^{(2)}_{(j)}$ given in (3.47) are more efficient. The standard estimator for their variance is (3.50), which yields test statistics AB2a etc. The corrected version $\widehat{Var}_c(\hat\alpha_{(j)})$ requires derivation, for $k = 1, ..., K-1$, of the actual implementation of the matrix $\partial\Omega(\beta)/\partial\beta_k$ of Appendix A, which is here $N(T-1) \times N(T-1)$. We denote its $i$-th block as $\partial\Omega_{(j)i}(\alpha)/\partial\alpha_k$. For the a-type weighting matrix² the relevant $(T-1) \times (T-1)$ matrix $\partial\hat\varepsilon_i\hat\varepsilon_i'/\partial\alpha_k$, with $\hat\varepsilon_i = \tilde y_i - \tilde R_i\hat\alpha$, is $-(\hat\varepsilon_i\tilde R_{ik}' + \tilde R_{ik}\hat\varepsilon_i')$, where $\tilde R_{ik}$ denotes the $k$-th column of $\tilde R_i$. For weighting matrix b it simplifies to the matrix consisting of the main diagonal and the two first sub-diagonals of $-(\hat\varepsilon_i\tilde R_{ik}' + \tilde R_{ik}\hat\varepsilon_i')$, with all other elements zero. The corrected variance then follows from (3.51).

¹ Many authors and the Stata program confusingly address the corrected 2-step variance as robust.
2 This is the only variant considered in Windmeijer (2005).
with the $k$-th column of $\hat F_{(j)}$ given by (3.52). This gives rise to tests indicated as AB2aW etc.
All above expressions become a bit more complex when considering BB estimation of the $K$ coefficients $\alpha$. The suboptimal 1-step BB estimator (3.40) of $\alpha$ should not be used for testing, unless in combination with a robust variance estimator. This yields BB1aR, BB1bR and BB1cR. It seems better of course to use the efficient estimator $\hat\alpha^{(2)}_{(j)}$ of (3.47). The standard expression for its estimated variance leads to tests BB2a etc. Their corrected versions (BB2aW etc.) can be obtained by a formula similar to (3.51), upon replacing $\hat\alpha$ by its BB counterpart and $\hat F_{(j)\cdot k}$ of (3.52) by (3.55), where the block of $\partial\Omega_{(j)i}(\alpha)/\partial\alpha_k$ corresponding to the equation in first differences is similar to that for AB, but with an extra column and row of zeros for the intercept. The block corresponding to the equation in levels we took for weighting matrices a and b equal to $\partial\hat u_i\hat u_i'/\partial\alpha_k = -(\hat u_i\hat R^{*\prime}_{ik} + \hat R^*_{ik}\hat u_i')$; for type c the first term yields $-2[(\hat\varepsilon_i' H^{-1}\tilde R_{ik})/(T-1)]I_T$. For the nondiagonal upper block of $\partial\Omega_{(j)i}(\alpha)/\partial\alpha_k$ we took in cases a and b $\partial\hat\varepsilon_i\hat u_i'/\partial\alpha_k = -(\hat\varepsilon_i\hat R^{*\prime}_{ik} + \tilde R_{ik}\hat u_i')$, and for the derivative with respect to the intercept $-\hat\varepsilon_i\iota_T'$. In case c it is $-2[(\hat\varepsilon_i' H^{-1}\tilde R_{ik})/(T-1)]D$, and a zero matrix for the derivative with respect to the intercept.

Tests of overidentification restrictions
Using Arellano-Bond and Blundell-Bond type estimation, many options exist with respect to testing the overidentification restrictions. These options differ in the residuals and weighting matrices being employed. After 1-step estimation, see (3.20) and (3.26), we have a test statistic which is only valid in the case of homoskedasticity. Alternatively, after 1-step estimation the heteroskedasticity-robust test statistics (3.57) may be used, where $\hat G^{(1)}_j$ is given in (3.21), (3.22) and (3.24). Given the type of weighting matrix being used, one obtains 2-step residuals $\hat\varepsilon_{(j)}$, and from these overidentification restrictions test statistics can be obtained, which may differ depending on whether the weighting matrix is still obtained from 1-step or already from 2-step residuals.³ Exploiting effect stationarity of a subset of the regressors by estimating the Blundell-Bond system leads to corresponding 1-step test statistics, where the relevant weighting matrices can be found in (3.39), (3.41), (3.42) and (3.44). Defining the various 2-step residuals and variance estimators similarly to (3.45) and (3.46), though obtained from the appropriate 2-step residuals, yields the statistics to be used after 2-step estimation.

³ Package xtabond2 for Stata always reports JAB(1,0) after Arellano-Bond estimation, which is inappropriate when there is heteroskedasticity. After requesting robust standard errors in 1-step estimation it also presents JABa(2,1) instead of JABa(1,1). Requesting 2-step estimation also presents both JAB(1,0) and JABa(2,1). Blundell-Bond estimation yields JBB(1,0) and JBB(2,1), although the version of JBB(1,0) reported does not use weighting matrix $S^{(0)}(\hat\sigma^{2,s(1)}_\eta/\hat\sigma^{2,s(1)}_\varepsilon)$ but $S^{(0)}(0)$, which is only valid under homoskedasticity and $\sigma^2_\eta = 0$.
Under their respective null hypotheses the tests based on Arellano-Bond estimation follow asymptotically $\chi^2$ distributions with $L - K + 1$ degrees of freedom, whereas the tests based on Blundell-Bond estimates have $L + L^s - K$ degrees of freedom. Self-evidently, tests on the effect-stationarity related orthogonality conditions are obtained as the difference of the two statistics and should be compared with a $\chi^2$ critical value for $L^s - 1$ degrees of freedom.⁴
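Whatever the residuals and weighting matrix chosen, each of these Sargan-Hansen statistics is a quadratic form in the stacked sample moments; a minimal generic sketch (hypothetical helper name, not the paper's specific implementations):

```python
import numpy as np

def hansen_j(ys, Rs, Zs, b, W):
    """Overidentification statistic J = g' W g, with g = sum_i Z_i' u_i the
    stacked sample moments evaluated at estimate b, and W the weighting
    matrix of the step indicated by the superscripts in the text."""
    g = sum(Z.T @ (y - R @ b) for Z, R, y in zip(Zs, Rs, ys))
    return float(g @ W @ g)
```

The resulting value is then compared with the appropriate $\chi^2$ critical value; in the just-identified case the moments can be set exactly to zero, so J degenerates to zero and there is nothing to test.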

Modified GMM
In the special case that panel model (3.5) has cross-sectional heteroskedasticity but no time-series heteroskedasticity, we can easily employ MGMM estimator (2.11). However, because $H^{-1}$ is not a lower-triangular matrix, not all instruments $\sigma^{-2}_{\varepsilon,i} H^{-1} Z_i$ would be valid for the equation in first differences. This problem can be avoided by using, instead of first-differencing, the forward orthogonal deviation (FOD) transformation for removing the individual effects. After this transformation the required orthogonality conditions still hold, whereas $E(Z_i'\check\varepsilon_i) = 0$ for $Z_i$ given by (3.11). Hence, estimating the model for $\check y_i = B y_i$ and $\check R_i = B R_i$ by GMM, but using an instrument matrix with components $\sigma^{-2}_{\varepsilon,i} Z_i$, yields the unfeasible MABu estimator for the model with cross-sectional heteroskedasticity. For $\sigma^2_{\varepsilon,i} > 0$ these conditions are intrinsically equivalent with $E(Z_i'\check\varepsilon_i) = 0$, but they induce the use of a different set of instruments, yielding a different estimator. That the unfeasible standard AB estimator ABu, which uses instruments $\sigma_{\varepsilon,i} Z_i$ for regressors $\sigma^{-1}_{\varepsilon,i}\check R_i$, will generally exploit weaker instruments than MABu, which uses $\sigma^{-1}_{\varepsilon,i} Z_i$ for regressors $\sigma^{-1}_{\varepsilon,i}\check R_i$, should be intuitively obvious. To convert this into a feasible procedure, one could initially assume that all $\sigma^2_{\varepsilon,i}$ are equal. Then the first-step MGMM estimator $\hat\alpha^{(1)}_{MAB}$ is numerically equivalent to AB1 of (3.20), provided all instruments are being used.⁵ Next, exploiting (3.26), the feasible 2-step MAB2 estimator can be obtained. Modifying the system estimator is more problematic, primarily because of the form of the inverse of the matrix $Var(u_i)$. However, as an unfeasible modified system estimator we can combine estimation of the model for $\check y_i$ using instruments $\sigma^{-2}_{\varepsilon,i} Z_i$ with estimation of the model for $y_i$ using instruments $(\sigma^2_{\varepsilon,i} + \sigma^2_\eta)^{-1}\tilde Z^s_i$. So, the system is then given by the model ...
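The display defining the FOD transformation matrix $B$ has dropped out above, but the transformation itself is standard: row $t$ subtracts from $x_t$ the mean of all future observations, rescaled so that homoskedastic uncorrelated errors stay so. A sketch under that usual definition:

```python
import numpy as np

def fod_matrix(T):
    """(T-1) x T forward orthogonal deviations transform B.  Row t
    (t = 1,...,T-1) maps (x_1,...,x_T) to c_t*(x_t - mean(x_{t+1},...,x_T))
    with c_t = sqrt((T-t)/(T-t+1)), so each transformed error involves only
    current and future periods (preserving validity of lagged instruments)."""
    B = np.zeros((T - 1, T))
    for t in range(1, T):
        c = np.sqrt((T - t) / (T - t + 1.0))
        B[t - 1, t - 1] = c
        B[t - 1, t:] = -c / (T - t)
    return B
```

Two properties matter here: $B\iota_T = 0$ (the individual effect is wiped out) and $BB' = I_{T-1}$ (spherical idiosyncratic errors remain spherical after transformation).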
For the 1-step estimator we could again choose some nonnegative value $q$ and calculate the 1-step estimator BB1 given in (3.40) in order to find residuals and obtain the estimators $\hat\sigma^{2,s(1)}_{\varepsilon,i}$ and $\hat\sigma^{2,s(1)}_\eta$ of (3.45) and (3.46). Building on (3.70), these can be exploited in the feasible 2-step MGMM system estimator MBB2, for which relevant t-test and Sargan-Hansen test statistics can be constructed. Regarding the latter we will just examine the versions given below; the FOD transformation stems from Arellano and Bover (1995).

where the residuals and weighting matrices are those of the 2-step MGMM estimators. Under their respective null hypotheses these follow asymptotically $\chi^2$ distributions with $L - K + 1$ and $L + L^s - K$ degrees of freedom. Self-evidently, the test on the effect-stationarity related orthogonality conditions is given by the difference of the two statistics.

Simulation design
We will examine the stable dynamic simultaneous heteroskedastic DGP ($i = 1, ..., N$, $t = 1, ..., T$) given in (4.1). Here $\beta$ has just one element, relating to the (for each $i$ stable) autoregressive regressor $x_{it}$. The standardized random components are IID(0, 1) and mutually independent. Parameter $\rho_{v\varepsilon}$ indicates the correlation between the disturbances $\varepsilon_{it}$ and $v_{it}$. These may have different variances, but have a similar cross-sectional heteroskedasticity pattern, determined by the fixed values $\omega_1, ..., \omega_N$. How we generated these $\omega_i$ and the start-up values $x_{i,0}$ and $y_{i,0}$, and chose relevant numerical values for the other eleven parameters, will be discussed extensively below.
Note that in this DGP $x_{it}$ is either strictly exogenous ($\rho_{v\varepsilon} = 0$) or otherwise endogenous⁶; the only weakly exogenous regressor is $y_{i,t-1}$. Regressor $x_{it}$ may be affected contemporaneously by two independent individual specific effects when $\pi_\eta \neq 0$ and $\pi_\lambda \neq 0$, but also with delays if $\xi \neq 0$. The dependent variable $y_{it}$ may be affected contemporaneously by the (standardized) individual effect $\eta^\bullet_i$, both directly and indirectly; directly if $\sigma_\eta \neq 0$, and indirectly via $x_{it}$ when $\beta\pi_\eta \neq 0$. However, $\eta^\bullet_i$ will also have delayed effects on $y_{it}$ when $\gamma \neq 0$ or $\xi\beta\pi_\eta \neq 0$, and so has $\lambda^\bullet_i$ when $\xi\beta\pi_\lambda \neq 0$. For the cross-sectional heteroskedasticity we follow an approach similar to Kiviet and Feng (2014). It is determined by both $\eta^\bullet_i$ and $\lambda^\bullet_i$, the two standardized individual effects, and is thus associated with the regressors $x_{it}$ and $y_{i,t-1}$. It follows a lognormal pattern when both standardized effects are standard normal, because we take $\omega_i$ such that it has a lognormal distribution with $E(\omega_i) = 1$ and $Var(\omega_i) = e^{\theta^2} - 1$. So, for $\theta = 0$ the $\varepsilon_{it}$ and $v_{it}$ are homoskedastic. The seriousness of the heteroskedasticity increases with the absolute value of $\theta$. Obviously, for $\kappa = 0$ the error components $\eta_i$ and $\varepsilon_{it}$ are independent (like $\eta^\bullet_i$ and $\varepsilon^\bullet_{it}$), but not for $0 < \kappa \le 1$. Table 2 presents some quantiles of the distributions of $\omega_i$ and $\omega_i^{1/2}$ (taken as the positive square root of $\omega_i$) and the expectation of the latter, in order to disclose the effects of parameter $\theta$. It shows that $\theta \ge 1$ implies pretty serious heteroskedasticity, whereas it may be qualified mild when $\theta \le 0.3$, say.

Table 2. Heteroskedasticity quantiles and $E(\omega_i^{1/2})$

Without loss of generality we may choose $\sigma_\varepsilon = 1$ and $\alpha_y = \alpha_x = 0$. Note that (4.1) implicitly specifies $\tau = 0$. All simulation results refer to estimators where these $T-1$ restrictions have been imposed (there are no time effects), but $\alpha_y = \alpha_x = 0$ have not been imposed. Hence, when estimating the model in levels $\iota_T$ is one of the regressors.
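The explicit formula for $\omega_i$ has dropped out of the passage above, but a generating scheme consistent with the stated moments $E(\omega_i) = 1$ and $Var(\omega_i) = e^{\theta^2} - 1$ is sketched below for the $\kappa = 0$ case, where a single standard normal draw drives $\log\omega_i$; the mean shift $-\theta^2/2$ is what pins the expectation at 1. This is an illustrative reconstruction, not necessarily the paper's exact mixing of $\eta^\bullet_i$ and $\lambda^\bullet_i$.

```python
import numpy as np

rng = np.random.default_rng(42)
theta = 1.0
N = 200_000                       # large only to check the moments accurately
z = rng.standard_normal(N)        # plays the role of the standardized effect
omega = np.exp(theta * z - 0.5 * theta**2)   # lognormal, E=1, Var=e^{th^2}-1
omega_mean = omega.mean()
omega_var = omega.var()
```

For $\theta = 1$ this gives $Var(\omega_i) = e - 1 \approx 1.72$, consistent with the "pretty serious heteroskedasticity" qualification for $\theta \ge 1$.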
Moreover, we may always include $I_{T-1}$ in $Z_i$ and $\iota_T$ in $\tilde Z^s_i$ in order to exploit the fundamental moment conditions $E(\varepsilon_{it}) = 0$ (for $t = 2, ..., T$) and $E[\sum_{t=1}^T(\eta_i + \varepsilon_{it})] = 0$ for $i = 1, ..., N$.
Apart from values for $\theta$ and $\kappa$, we have to make choices on relevant values for eight more parameters. We could choose $\gamma \in \{0.2, 0.5, 0.8\}$, which covers a broad range of adjustment processes for dynamic behavioral relationships, and $\xi \in \{0.5, 0.8, 0.95\}$ to include less and more smooth $x_{it}$ processes. Next, interesting values should be given to the remaining six parameters, namely $\beta$, $\sigma_\eta$, $\pi_\eta$, $\pi_\lambda$, $\sigma_v$ and $\rho_{v\varepsilon}$. We will do this by choosing relevant values for six alternative, more meaningful notions, which are all functions of some of the eight DGP parameters and allow us to establish relevant numerical values for them, as suggested in Kiviet (2012).
The first three notions will be based on (ratios of) particular variance components of the long-run stationary path of the process for $x_{it}$. Using lag-operator notation and assuming that $v^\bullet_{it}$ (and $\varepsilon^\bullet_{it}$) exist for $t = -\infty, ..., 0, 1, ..., T$, we find⁷ that the long-run path for $x_{it}$ consists of three mutually independent components. The third component, the accumulated contributions of $v^\bullet_{it}$, is a stationary AR(1) process with variance $\sigma^2_v/(1-\xi^2)$. The other two components have variances $\pi^2_\eta/(1-\xi)^2$ and $\pi^2_\lambda/(1-\xi)^2$ respectively, so the average long-run variance $\bar V_x$ of $x_{it}$ equals the sum of these three variances. A first characterization of the $x_{it}$ series can be obtained by setting $\bar V_x = 1$. This is an innocuous normalization, because $\beta$ is still a free parameter. As a second characterization of the $x_{it}$ series, we can choose what we call the (average) effects variance fraction of $x_{it}$, $EVF_x$, with $0 \le EVF_x \le 1$, for which we could take, say, $EVF_x \in \{0, 0.3, 0.6\}$. To balance the two individual effect variances we define, for the case $EVF_x > 0$, what we call the individual effect fraction of $\eta^\bullet_i$ in $x_{it}$. From these three characterizations we obtain $\pi_\eta$, $\pi_\lambda$ and $\sigma_v$. For all three we will only consider the nonnegative root, because changing the sign would have no effects on the characteristics of $x_{it}$, as we will generate the series from symmetric distributions. The above choices regarding the $x_{it}$ process have direct implications for the average correlations between $x_{it}$ and its two constituting effects. Now the $x_{it}$ series can be generated upon choosing a value for $\rho_{v\varepsilon}$, which we obtain from (4.14). In order that both correlations are smaller than 1 in absolute value an admissibility restriction has to be satisfied, namely $\bar\rho^2_{x\varepsilon} \le \sigma^2_v$. When choosing $EVF_x = 0.6$ and $\xi = 0.8$ we should have $|\bar\rho_{x\varepsilon}| \le 0.379$. That we should not exclude negative values of $\bar\rho_{x\varepsilon}$ will become obvious in due course. For the moment it seems interesting to examine, say, $\bar\rho_{x\varepsilon} \in \{-0.3, 0, 0.3\}$.
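The quoted bound of 0.379 can be traced through the normalization above: with $\bar V_x = 1$, the idiosyncratic AR(1) component has variance $\sigma^2_v/(1-\xi^2) = 1 - EVF_x$, so $\sigma^2_v = (1 - EVF_x)(1 - \xi^2)$ and the admissibility restriction $\bar\rho^2_{x\varepsilon} \le \sigma^2_v$ becomes a bound on $|\bar\rho_{x\varepsilon}|$. A small check (hypothetical helper name):

```python
import numpy as np

def rho_bound(EVF_x, xi):
    """Admissibility bound |rho_bar_x_eps| <= sigma_v, using the
    normalization Vbar_x = 1 so sigma_v^2 = (1 - EVF_x) * (1 - xi^2)."""
    return np.sqrt((1.0 - EVF_x) * (1.0 - xi**2))
```

For $EVF_x = 0.6$ and $\xi = 0.8$ this yields $\sqrt{0.4 \times 0.36} \approx 0.379$, reproducing the value in the text.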
The remaining choices concern $\beta$ and $\sigma_\eta$, which both directly affect the DGP for $y_{it}$. Substituting (4.5) and (4.3) in (4.1) we find that the long-run stationary path for $y_{it}$ entails four mutually independent components, see (4.17). The third term of the final expression constitutes for each $i$ an AR(2) process and the fourth one an ARMA(2,1) process. The variance of each of the four terms is given in the Appendix (where the derivations can be found). Averaging the last two over all $i$ yields $\bar V_\zeta$ and $\bar V_\varepsilon$, from which the average long-run variance $\bar V_y$ of $y_{it}$ can be evaluated. When choosing fixed values for ratios involving these components to obtain values for $\beta$ and $\sigma_\eta$ we will run into the problem of multiple solutions. On the other hand, the four components of (4.17) have particular invariance properties regarding the signs of $\beta$, $\sigma_\eta$ and $\rho_{v\varepsilon}$: changing the sign of all three yields exactly the same value of $\bar V_y$. We coped with this as follows. Although we note that $V_\eta$ does depend on $\beta\pi_\eta$, we set $\sigma_\eta$ simply by fixing the direct cumulated effect impact of $\eta^\bullet_i$ on $y_{it}$ relative to the current noise $\sigma_\varepsilon = 1$. This is $DEN^y_\eta = \sigma_\eta/(1-\gamma)$. Finally we fix a signal-noise ratio, which gives a value for $\beta$. Because under simultaneity the noise and current signal conflate, we focus on the case where $\bar\rho_{x\varepsilon} = 0$. Leaving the variance due to the effects aside, the average signal variance is $\bar V_\zeta + \bar V_\varepsilon - 1$, because the current average noise variance is unity. Hence, we may define a signal-noise ratio $SNR$, in which we have substituted (4.11). For this we may choose, say, $SNR \in \{2, 3, 5\}$, in order to find $\beta$. Note that here another admissibility restriction crops up; however, for $|\gamma| \le 0.8$ it is satisfied for $SNR \ge 1.78$. From (4.21) we only examined the positive root. Instead of fixing $SNR$, another approach would be to fix the total multiplier $TM$, which would directly lead to a value for $\beta$, given $\gamma$. However, different $TM$ values will then lead to different $SNR$ values, see (4.24).

At this stage it is hard to say what would yield more useful information from the Monte Carlo: fixing $TM$ or $SNR$. Keeping both constant for different $\gamma$ and some other characteristics of this DGP is out of the question. We chose to fix $SNR = 3$, which yields $TM$ values in the range 1.5-1.8. When comparing with results for $TM = 1$ we did not note substantial differences. For all different design parameter combinations considered, which involve sample size $N = 200$ and $T \in \{3, 6, 9\}$, we used the very same realizations of the underlying standardized random components $\eta^\bullet_i$, $\lambda^\bullet_i$, $\varepsilon^\bullet_{it}$ and $\zeta^\bullet_{it}$ over the respective 10000 replications that we performed. At this stage, all these components have been drawn from the standard normal distribution. To speed up the convergence of our simulation results, in each replication we have modified the $N$ drawings $\eta^\bullet_i$ and $\lambda^\bullet_i$ such that they have sample mean zero, sample variance 1 and sample correlation zero. This rescaling is achieved by standardizing the $\eta^\bullet_i$, and by replacing the $\lambda^\bullet_i$ by the residuals obtained after regressing $\lambda^\bullet_i$ on $\eta^\bullet_i$ and an intercept, and next scaling these to unit sample variance as well. In addition, we have rescaled in each replication the $\omega_i$ by dividing them by $N^{-1}\sum_{i=1}^N \omega_i$, so that the resulting $\omega_i$ have sum $N$, as they should in order to avoid that the presence of heteroskedasticity is conflated with a larger or smaller average disturbance variance.
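The rescaling of the drawn effects described above can be sketched as follows; the function name is hypothetical and the sketch assumes sample variance with denominator $N$ (ddof = 0).

```python
import numpy as np

def rescale_effects(eta, lam):
    """Give both draw vectors exact sample mean 0 and sample variance 1,
    and make lam exactly uncorrelated with eta in-sample, mirroring the
    modification of the eta_i and lambda_i drawings described in the text."""
    eta = eta - eta.mean()
    eta = eta / eta.std()                          # sample variance 1
    lam = lam - lam.mean()
    lam = lam - (lam @ eta / (eta @ eta)) * eta    # residual of lam on eta
    lam = lam / lam.std()
    return eta, lam
```

Because the modified components are held fixed across designs, differences between experiments reflect the design parameters rather than sampling noise in the effects.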
In the simulation experiments we will start up the processes for $x_{it}$ and $y_{it}$ at a presample period $s < 0$ by taking $x_{is} = 0$ and $y_{is} = 0$, and next generate $x_{it}$ and $y_{it}$ for the indices $t = s+1, ..., T$. The data with time indices $s, ..., -1$ will be discarded when estimating the model. We suppose that for $s = -50$ both series will be on their stationary track from $t = 0$ onwards. When taking $s = -1$ or $-2$ the initial values $y_{i0}$ and $x_{i1}$ will be such that effect stationarity has not yet been achieved. Due to the fixed zero start-ups (which are equal to the unconditional expectations) the (cross-)autocorrelations of the $x_{it}$ and $y_{it}$ series have a very peculiar start then too, so our results regarding effect nonstationarity will certainly not be fully general, but in a way for $s$ close to zero we mimic the situation that a process only started very recently.
Another simple way to mimic a situation in which lagged first-differenced variables are invalid instruments for the model in levels can be designed as follows. Equations (4.5) and (4.16) highlight that in the long run $\Delta x_{it}$ and $\Delta y_{it}$ are uncorrelated with the effects $\eta^\bullet_i$ and $\lambda^\bullet_i$. This can be undermined by perturbing $x_{i0}$ and $y_{i0}$ as obtained from $s = -50$, by adding to them the values given in (4.25). Note that for $\phi = 1$ effect stationarity is maintained, whereas for $0 \le \phi < 1$ the dependence of $x_{i0}$ and $y_{i0}$ on the effects is mitigated in comparison to the stationary track (upon maintaining stationarity regarding $\varepsilon^\bullet_{it}$ and $\zeta^\bullet_{it}$), whereas for $\phi > 1$ this dependence is inflated. Note that this is a straightforward generalization of the approach followed in Kiviet (2007) for the panel AR(1) model.

Simulation results
To limit the number of tables we proceed as follows. Often we will first produce results on unfeasible implementations of the various inference techniques in relatively simple DGPs. These exploit the true values of $\omega_1, ..., \omega_N$, $\sigma^2_\varepsilon$ and $\sigma^2_\eta$ instead of their estimates. Although this information is generally not available in practice, only when such unfeasible techniques behave reasonably well in finite samples does it seem useful to examine the performance of feasible implementations in more detail. Results for the unfeasible Arellano and Bond (1991) and Blundell and Bond (1998) GMM estimators are denoted ABu and BBu respectively. As indicated before, their feasible counterparts are denoted AB1 and BB1 for the 1-step estimators (which under homoskedasticity are equivalent to their unfeasible counterparts) and AB2 and BB2 for the 2-step estimators. For 2-step estimators the lower-case letters a, b or c are used (as in, for instance, AB2c) to indicate which type of weighting matrix has been exploited, as discussed in sections 3.2.1 and 3.3.2. For corresponding MGMM implementations these acronyms are preceded by the letter M. Under homoskedasticity their unfeasible implementation has been omitted when it is equivalent to GMM. In BB estimation we have always used $q = 1$.
First we will discuss the results for DGPs in which the initial conditions are such that BB estimation will be consistent and more efficient than AB, and subsequently, in a separate subsection, the situation where BB is inconsistent is examined. Within these two subsections (5.1 and 5.2) we will examine different parameter value combinations for the DGP in separate subsections. We will start by presenting results for a reference parametrization (indicated P0), which has been chosen such that the model has in fact four parameters fewer, by choosing $\bar\rho_{x\varepsilon} = 0$ ($x_{it}$ is strictly exogenous), $EVF_x = 0$ (hence $\pi_\lambda = \pi_\eta = 0$, so $x_{it}$ is neither correlated with $\lambda_i$ nor with $\eta_i$) and $\kappa = 0$ (any cross-sectional heteroskedasticity is just related with $\lambda_i$). These choices (implying that any heteroskedasticity will be unrelated to the regressor $x_{it}$) may (hopefully) lead to results where little difference between unfeasible and feasible estimation will be found and where test sizes are relatively close to the nominal level of 5%. Next we will discuss the effects of settings (to be labelled P1, P2 etc.) which deviate from this reference parametrization P0 in one or more aspects regarding the various correlations and variance fractions and ratios. In P0 the relationship for $y_{it}$ will be characterized by $DEN^y_\eta = 1$ (the impact on $y_{it}$ of the individual effect $\eta_i$ and of the idiosyncratic disturbance $\varepsilon_{it}$ have equal variance). The two remaining parameters are fixed over all cases examined (including P0): the $x_{it}$ series has autoregressive coefficient $\xi = 0.8$, and regarding $y_{it}$ we take $SNR = 3$ (excluding the impacts of the individual effects, the variance of the explanatory part of $y_{it}$ is three times as large as $\sigma^2_\varepsilon$).
In section 3.2 we already indicated that we will examine implementations of GMM where all internal instruments associated with linear moment conditions will be employed (A), but also particular reductions based either on collapsing (C) or omitting long lags (L3, etc.), or a combination (C3, etc.). On top of this we will also distinguish situations that may lead to reductions of the instruments that are being used, because the regressor x it in model (4.1), which will either be strictly exogenous or endogenous with respect to ε it , might be rightly or wrongly treated as either strictly exogenous, or as predetermined (weakly exogenous), or as endogenous. These three distinct situations will be indicated by the letters X, W and E respectively. So, in parametrization P0, where x it is strictly exogenous, the instruments used by either A, C or, say, L2, are not the same under the situations X, W and E. This is hopefully clarified in the next paragraph.
Since we assume that for estimation just the observations $y_{i0}, ..., y_{iT}$ and $x_{i1}, ..., x_{iT}$ are available, the number of internal instruments used under XA (all instruments, $x_{it}$ treated as strictly exogenous) for estimation of the equation in first differences is: $T-1$ (time dummies) + $T(T-1)/2$ (lags of $y_{it}$) + $(T-1)T$ (lags and leads of $x_{it}$). This yields {11, 50, 116} instruments for $T = \{3, 6, 9\}$. Under WA this is {8, 35, 80} and under EA {6, 30, 72}. From section 3.3.1 it follows that for BB estimation this number of instruments increases by 1 (intercept) plus $T-1$ (when $y_{i,t-1}$ is supposed to be effect stationary) plus $T-1$ (when $x_{it}$ is supposed to be effect stationary) minus 1 (when $x_{it}$ is treated as endogenous). This implies for $T = \{3, 6, 9\}$ a total of {5, 11, 17} extra instruments under XA and WA, and of {4, 10, 16} under EA, whereas these extra instruments will be valid in section 5.1 and invalid in section 5.2.
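The instrument counts quoted above follow from simple lag-availability arithmetic, which the sketch below reproduces (hypothetical helper name and X/W/E labels taken from the text):

```python
def n_instruments_diff(T, treatment):
    """Instrument counts for the equations in first differences:
    time dummies (T-1) plus all GMM-style lags of y (T(T-1)/2) plus the
    columns for x, whose number depends on whether x is treated as strictly
    exogenous ('X'), predetermined ('W') or endogenous ('E')."""
    x_counts = {
        "X": (T - 1) * T,             # all of x_{i1},...,x_{iT} in every equation
        "W": T * (T - 1) // 2,        # only x_{is} with s <= t-1
        "E": (T - 1) * (T - 2) // 2,  # only x_{is} with s <= t-2
    }
    return (T - 1) + T * (T - 1) // 2 + x_counts[treatment]
```

Running this for $T \in \{3, 6, 9\}$ reproduces the triples {11, 50, 116}, {8, 35, 80} and {6, 30, 72} stated in the text, and makes clear how quickly the instrument count grows with $T$ under A-type (uncollapsed) instrumentation.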
In the tables to follow we always examine the three values T ∈ {3, 6, 9}, while N = 200 (as in the classic Arellano and Bond, 1991 study) and the three values γ ∈ {0.2, 0.5, 0.8} for the dynamic adjustment coefficient. This is done for both θ = 0 (homoskedasticity) and θ = 1 (substantial cross-sectional heteroskedasticity). Tables have a code which starts by the design parametrization, followed by the character u or f, indicating whether the table contains unfeasible or feasible results. Because of the many feasible variants not all results can be combined in just one table. Therefore, the f is followed by c, t or J, where c indicates that the table just contains results on coefficient estimates, which are estimated bias, standard deviation (Stdv) and RMSE (root mean squared error; below often loosely addressed as precision); t refers to estimates of the actual rejection probabilities of tests on true coefficient values; and J indicates that the table only contains results on Sargan-Hansen tests. Next, after a bar (-), the earlier discussed code for how regressor x it is actually treated when selecting the instruments is given, followed by the type of instrument reduction.

DGPs under effect stationarity
Here we focus on the case where BB is consistent and more efficient than AB ($s = -50$, $\phi = 1$). Table 3, with code P0u-XA, gives results for unfeasible GMM coefficient estimators, unfeasible single coefficient tests, and unfeasible Sargan-Hansen tests for the reference parametrization P0 when $x_{it}$ is (correctly) treated as strictly exogenous and all available instruments are being used. Table 4 (P0fc-XA) presents a selection of feasible counterparts regarding the coefficient estimators. Under homoskedasticity we see that for $\hat\gamma_{ABu} = \hat\gamma_{AB1}$ the bias (which is negative), Stdv and thus RMSE increase with $\gamma$ and decrease with $T$, whereas the bias of $\hat\beta_{ABu} = \hat\beta_{AB1}$ is moderate and its RMSE, although decreasing in $T$, is almost invariant with respect to $\beta$. The BBu coefficient estimates are indeed superior, the more so for larger $\gamma$ values (as is already well known), but less so for $\beta$. As already conjectured in section 3.6, under cross-sectional heteroskedasticity both ABu and BBu are substantially less precise than under homoskedasticity. However, modifying the instruments under cross-sectional heteroskedasticity, as is done by MABu and MBBu, yields considerable improvements in performance both in terms of bias and RMSE. In fact, the precision of the unfeasible modified estimators under heteroskedasticity comes very close to that of their counterparts under homoskedasticity.

Results for the reference parametrization P0
The simulation results for feasible estimation do not contain the b variant of the weighting matrix⁹ because it is so bad, whereas both the a and c variants yield RMSE values very close to their unfeasible counterparts, under homoskedasticity as well as heteroskedasticity. Although the best unfeasible results under heteroskedasticity are obtained by MBBu, this does not fully carry over to MBB, because for small $T$, and also for moderate $T$ and large $\gamma$, BB2c performs much better. The performance of MAB and AB2c is rather similar, whereas we established that their unfeasible variants differ a lot when $\gamma$ is large. Apparently, the modified estimators can be much more vulnerable when the variances of the error components, $\sigma^2_{\varepsilon,i}$ and $\sigma^2_\eta$, are unknown, probably because their estimates have to be inverted in (3.68) and (3.70).
From the type I error estimates for unfeasible single coefficient tests in Table 3 we see that the standard test procedures work pretty well for all techniques regarding $\beta$, but with respect to $\gamma$ ABu fails for larger $\gamma$. This gets even worse under heteroskedasticity, but less so for MABu. For BBu and MBBu the results are reasonable; here the test seems to benefit from the small bias of BBu. For the feasible variants we find in Table 5 (P0ft-XA) that under homoskedasticity AB1 has a reasonable actual significance level for $\beta$, but for $\gamma$ only when it is small. The same holds for AB2c. Under heteroskedasticity $\sigma^2_{\varepsilon,i}$ has to be estimated and then AB2c overrejects, especially for $\gamma$ or $T$ large, but only mildly so for tests on $\beta$. Both AB2a and MAB overreject enormously. Employing the Windmeijer correction mitigates the overrejection probabilities in many cases, but not in all. AB2cW has appropriate size for tests on $\beta$, but for tests on $\gamma$ the size increases with both $\gamma$ and $T$, from 7% to 37% over the grid examined. Since the test based on ABu shows a similar pattern, it is self-evident that a correction which takes into account the randomness of AB1 cannot be fully effective. Oddly enough, the Windmeijer correction is occasionally more effective for the heavily oversized AB2a than for the less oversized AB2c. Under homoskedasticity both BB2c and BB2cW behave very reasonably, both for tests on $\beta$ and on $\gamma$. Under heteroskedasticity BB2cW is still very reasonable, but all other implementations fail in some instances, especially for tests on $\gamma$ when $\gamma$ or $T$ are large.
Regarding the unfeasible J tests, Table 3 shows reasonable size properties under homoskedasticity, especially for JBBu, but less so for the incremental test on effect stationarity when $\gamma$ is large. Under heteroskedasticity this problem is more serious, though less so for the unfeasible modified procedure. Heteroskedasticity and large $\gamma$ lead to underrejection of the JABu test, especially when $T$ is large too. Turning now to some of the many variants of feasible J tests, of which some¹⁰ are presented in Table 6 (P0fJ-XA), we first focus on JAB. Under homoskedasticity JAB(1,0) behaves reasonably, though when applied when $\theta = 1$ it rejects with high probability (thus detecting heteroskedasticity instead of instrument invalidity, probably due to underestimation of the variance of the still valid moment conditions). Of the JAB(1,1) tests, both for $\theta = 0$ and $\theta = 1$, the c variant severely underrejects when $T = 9$ (when there is an abundance of instruments), though less so than the a version. Such severe underrejection had already been noted by Bowsher (2002). We note an almost similar pattern for JAB(2,1) and JAB(2,2). Test JMAB(2,1) overrejects severely for $T = 3$ and underrejects otherwise, whereas JMAB(2,2) is fine for $T = 3$ but underrejects for larger values of $T$. Turning now to feasible JBB tests, we find that JBB(1,0) underrejects when $\theta = 0$ and, like JAB(1,0), rejects with high probability when $\theta = 1$. Both the a and c variants of test JBB(1,1), like JAB(1,1), have rejection probabilities that are not invariant with respect to $T$, $\gamma$ and $\theta$. The c variants seem the least vulnerable, and therefore also yield an almost reasonable incremental test JES(1,1), although it underrejects when $\theta = 0$ and overrejects when $\theta = 1$ for $\gamma = 0.8$.
For JBB (2,1) and JBB (2,2) too the c variant has rejection probabilities which vary the least with T, γ and θ, but they are systematically below the nominal significance level, which is also the case for the resulting incremental tests. Oddly enough, the incremental tests resulting from the a variants have type I error probabilities reasonably close to 5%, despite the serious underrejection of both the JAB and JBB tests from which they result.
When treating regressor $x_{it}$ as predetermined (P0-WA), whereas it is strictly exogenous, fewer instruments are being used. Since the ones that are now abstained from are most probably the strongest ones regarding $\Delta x_{it}$, it is no surprise that in the simulation results (not tabulated here) especially the standard deviation of the $\beta$ coefficient suffers. Also the rejection probabilities of the various tests differ slightly between implementations WA and XA, but not in a very systematic way, as it seems.
When treating $x_{it}$ as endogenous (P0-EA) the precision of the estimators gets worse, again with no striking effects on the performance of test procedures under their respective null hypotheses. Upon comparing for P0 the instrument set A (and set C) with the one where A$^x$ (C$^x$) is replaced by C1$^x$, it has been found that the in practice quite popular choice C1$^x$ often yields slightly less efficient estimates for $\beta$, but much less efficient estimates for $\gamma$. When $x_{it}$ is again treated as strictly exogenous, but the number of instruments is reduced by collapsing the instruments stemming from both $y_{it}$ and $x_{it}$, then we note from Table 7 (P0fc-XC) a mixed picture regarding the coefficient estimates. Although any substantial bias is always reduced by collapsing, standard errors always increase at the same time, leading either to an increase or a decrease in RMSE. Decreases occur for the AB estimators of $\gamma$, especially when $\gamma$ is large; for $\beta$ just increases occur. A noteworthy reduction in RMSE does show up for BB2a when $\gamma$ is large, $T = 9$ and $\theta = 1$, but then the RMSE of BB2c using all instruments is in fact smaller. However, Table 8 (P0ft-XC) shows that collapsing is certainly very beneficial for the type I error probability of coefficient tests, especially in cases where collapsing yields substantially reduced coefficient bias. The AB tests benefit a lot from collapsing, especially the c variant, leaving only little room for further improvement by employing the Windmeijer correction. After collapsing, AB1 works well under homoskedasticity, and also under heteroskedasticity provided robust standard errors are being used, where the c version is clearly superior to the a version. AB2c has appropriate type I error probabilities, except for testing $\gamma$ when it is 0.8 at $T = 3$ and $\theta = 1$ (which is not repaired by a Windmeijer correction either), and is for most cases superior to AB2aW.
After collapsing, BB2a shows overrejection which is not completely repaired by BB2aW when θ = 1. BB2c and BB2cW generally show lower rejection probabilities, with occasionally some underrejection. Tests based on MAB and MBB still heavily overreject. Table 9 (P0fJ-XC) shows that by collapsing the JAB and JBB tests suffer much less from underrejection when T is larger than 3. However, both the a and c versions of the J(2,1) and J(2,2) tests usually still underreject, mostly by about 1 or 2 percentage points. Good performance is shown by JES. When x_it is still correctly treated as strictly exogenous but for the level instruments just a few lags or first differences are used (XL0 ... XL3) for both y_it and x_it, then we find the following. Regarding feasible AB and BB estimation, collapsing (XC) always gives smaller RMSE values than XL0 and XL1 (which is much worse than XL0), but this is not the case for XL2 and XL3. Whereas XC yields smaller bias, XL2 and XL3 often achieve smaller Stdv and RMSE. Especially regarding β, XL3 performs better than XL2. Probably due to its smaller bias, XC is more successful in mitigating size problems of coefficient tests than XL0 through XL3. The effect on J tests is less clear-cut. Combining collapsing with restricting the lag length, we find that XC2 and XC3 are in some respects slightly worse but in others occasionally better than XC for P0. We also examined the hybrid instrumentation, which seems popular amongst practitioners, where C_w is combined with L1_x. Especially for γ this leads to loss of estimator precision without any other clear advantages, so it does not outperform the XC results for P0. From examining P0-WC (and P0-EC) we find that in comparison to P0-WA (P0-EA) there is often some increase in RMSE, but the size control of especially the t-tests is much better.
Summarizing the results for P0 on feasible estimators and tests, we note that when choosing between different possible instrument sets a trade-off has to be made between estimator precision and test size control. For both, some form of reduction of the instrument set is often, but not always, beneficial. No single method seems superior irrespective of the actual values of γ, β and T. Using all instruments is not necessarily a bad choice; XC, XL3 and XC3 also often work well. To mitigate estimator bias and foster test size control while not sacrificing too much estimator precision, collapsing (C) for all regressors seems a reasonable compromise, as far as P0 is concerned. Coefficient and J tests based on the feasible modified estimator behave so poorly that in the remainder we no longer mention its results.
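Since collapsing plays such a central role in these comparisons, a minimal sketch may clarify what it does to the instrument matrix. The helper below is hypothetical (it is not from this paper or from any particular package) and covers only the lagged-level instruments of a single individual for the first-differenced equations: the full Arellano-Bond set is block-diagonal and grows quadratically in T, while the collapsed set keeps just one column per lag depth.

```python
import numpy as np

def ab_instruments(y, collapse=False):
    """Arellano-Bond instrument matrix for one individual's
    first-differenced equations (t = 2..T), using the lagged levels
    y_0, ..., y_{t-2} as instruments for equation t.

    Full set: block-diagonal, one block per equation -> O(T^2) columns.
    Collapsed: one column per lag depth -> T - 1 columns."""
    T = len(y) - 1            # y holds y_0, ..., y_T
    n_eq = T - 1              # differenced equations t = 2, ..., T
    if collapse:
        Z = np.zeros((n_eq, T - 1))
        for row, t in enumerate(range(2, T + 1)):
            for lag in range(t - 1):          # y_{t-2}, y_{t-3}, ..., y_0
                Z[row, lag] = y[t - 2 - lag]
        return Z
    cols = sum(t - 1 for t in range(2, T + 1))
    Z = np.zeros((n_eq, cols))
    c = 0
    for row, t in enumerate(range(2, T + 1)):
        Z[row, c:c + (t - 1)] = y[:t - 1]     # block for equation t
        c += t - 1
    return Z

y = np.arange(1.0, 6.0)       # y_0, ..., y_4, so T = 4
Z_full = ab_instruments(y)
Z_col = ab_instruments(y, collapse=True)
print(Z_full.shape, Z_col.shape)   # (3, 6) (3, 3)
```

Collapsing thus imposes that each lag depth enters through a single orthogonality condition summed over equations, which is what reduces the instrument count, and thereby bias, at some cost in efficiency.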

Results for alternative parametrizations
Next we examine a series of alternative parametrizations, where each time we change just one of the parameter values of one of the already examined cases. In P1 we increase DEN_ηy from 1 to 4 (hence, substantially increasing the relative variance of the individual effects). We note that for P1-XA (not tabulated here) all estimators of γ are more biased and dispersed than for P0-XA, but there is little or no effect on the β estimates. For both T and γ large this leads to serious overrejection by the unfeasible coefficient tests regarding γ, in particular for ABu. Self-evidently, this carries over to the feasible tests and, although a Windmeijer correction has a mitigating effect, the overrejection often remains serious for both AB and BB based tests. Tests on β based on AB behave reasonably, apart from non-robustified AB1 and AB2a. For the latter a Windmeijer correction proves reasonably effective. When exploiting effect stationarity, the BB2c implementation seems preferable. The unfeasible J tests show a similar though slightly more extreme pattern than for P0-XA. Among the feasible tests both serious underrejection and overrejection occur. As far as the incremental tests are concerned, JES(2,2)c behaves remarkably well. In Tables 10, 11 and 12 (P1fj-XC, j = c,t,J) we find that collapsing again leads to reduced bias and slightly deteriorated precision, though improved size control (here all unfeasible tests behave reasonably well). All feasible AB1R and AB2W tests have reasonable size control, apart from tests on γ when T is small and γ large. These give actual significance levels close to 10%. BB2cW seems slightly better than BB2aW. The 1-step J tests show some serious overrejection, whereas the 2-step J tests behave quite satisfactorily. For C3 reasonably similar results are obtained, but those for L3 are generally slightly less attractive.
In P2 we increase EVF_x from 0 to 0.6, while again IEF_ηx = 0 (hence, x_it is still uncorrelated with effect η_i though correlated with effect λ_i, which determines any heteroskedasticity). This leads to increased β values. Results for P2-XA show larger absolute values for the standard deviations of the β estimates than for P0-XA, but they are almost similar in relative terms. The patterns in the rejection probabilities under the respective null hypotheses are hardly affected, and P2-XC again shows improved behavior of the test statistics due to reduced estimator bias, whereas the RMSE values have slightly increased.

In P3 we change IEF_ηx from 0 to 0.3, while keeping EVF_x = 0.6 (hence, now realizing dependence between regressor x_it and the individual effect η_i). Comparing the results for P3-XA with those for P2-XA (which have the same β values) we find that all patterns are pretty similar. Also P3-XC follows the P2-XC picture closely.
P4 differs from P3 in that κ = 0.25, so that the heteroskedasticity is now determined by η_i too. This has a noteworthy effect on MBB estimation and only a minor effect on JBB (and thus on JES) testing.
P5 differs from P0 just in having ρ_xε = 0.3, so x_it is now endogenous with respect to ε_it. P5-EA uses all instruments available when correctly taking the endogeneity into account. This leads to very unsatisfactory results. The coefficient estimates of γ have serious negative bias and those of β positive bias, whereas the standard deviations are slightly larger than for P0-EA, which in turn are substantially larger than for P0-XA. All coefficient tests are very seriously oversized, also after a Windmeijer correction, both for AB and BB. Tests JABu and JBBu show underrejection, whereas the matching JES tests show serious overrejection when T is large, but the feasible 2-step variants are not all that bad. From Tables 13, 14 and 15 (P5fj-EC, j = c,t,J) we see that most results which correctly handle the simultaneity of x_it are still bad after collapsing, especially for small T (where collapsing can only lead to a minor reduction of the instrument set), although not as bad as those for P5-EA at larger values of T. For P5-EC the rejection probabilities of the corrected coefficient tests are usually in the 10-20% range, but those of the 2-step J tests are often close to 5%. Both AB and BB are inconsistent when treating x_it either as predetermined or as exogenous. For P5-WA and P5-XA the coefficient bias is similar but much more serious than for P5-EA. For the inconsistent estimators the bias does not reduce when collapsing the instruments. Because the inconsistent estimators have a much smaller standard deviation than the consistent estimators, practitioners should be warned never to select an estimator simply because of its attractive estimated standard error. The consistency of AB and BB should be tested with the Sargan-Hansen test. In this study we did not examine the particular incremental test which focuses on the validity of the extra instruments when comparing E with W or E with X.
Here we just examine the rejection probabilities of the overall overidentification J tests for case P5 using all instruments, and compare the rejection frequencies when treating x_it correctly as endogenous, or incorrectly as either predetermined or exogenous.
From Tables 16, 17 and 18 (P5fJ-jA, j = E,W,X) we find that the detection of inconsistency in this way often has a higher probability when the null hypothesis is W than when it is X. The probability generally increases with T and with γ, is often better for the c variant than for the a variant and slightly better for BB implementations than for AB implementations, whereas in general heteroskedasticity mitigates the rejection probability. In the situation where all instruments have been collapsed, for which we already established that the J tests do have reasonable size control, we find the following. For T = 3 and γ = 0.2 the rejection probability of the JAB and JBB tests does not rise very much when ρ_xε moves from 0 to 0.3, whereas for T = 9 and ρ_xε = 0.3 this rejection probability is often larger than 0.7 when γ ≥ 0.5 and often larger than 0.9 for γ = 0.8. Hence, only for particular T, γ and θ parametrizations does the probability of detecting inconsistency seem reasonable, whereas the major consequence of inconsistency, which is serious estimator bias, is relatively invariant regarding T, γ and θ.
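The incremental (difference-in-Sargan-Hansen) tests referred to throughout compare the J statistic of the model exploiting the larger, possibly invalid instrument set with that of the model maintaining only the smaller set; the difference is referred to a chi-square distribution with degrees of freedom equal to the number of extra orthogonality conditions. A minimal sketch, with a hypothetical function name and purely illustrative numbers:

```python
from scipy import stats

def incremental_j(j_large, df_large, j_small, df_small):
    # Difference of the Sargan-Hansen J statistic of the model using
    # the larger instrument set and that of the model using the smaller
    # set; asymptotically chi-square with df_large - df_small degrees
    # of freedom when the extra orthogonality conditions are valid.
    j_inc = j_large - j_small
    df_inc = df_large - df_small
    return j_inc, df_inc, stats.chi2.sf(j_inc, df_inc)

# illustrative numbers, not taken from the paper's tables:
j_inc, df_inc, p = incremental_j(15.3, 10, 9.1, 8)
print(round(j_inc, 1), df_inc, round(p, 3))   # 6.2 2 0.045
```

A small p-value then signals that the extra instruments (e.g. those valid only under predeterminedness or effect stationarity) should be discarded.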

Summarizing our results for effect stationary models, we note the following. We established that the finite sample inaccuracies of the asymptotic techniques seriously aggravate either when σ_η/(1−γ) is large relative to σ_ε or under simultaneity. For both problems it helps to collapse instruments, and the first problem is mitigated, and the second problem detected with higher probability, by instrumenting according to W rather than X. Neglected simultaneity leads to seemingly accurate but seriously biased coefficient estimators, whereas asymptotically valid inference on simultaneous dynamic relationships is often not very accurate either. Even when the more efficient BB estimator is used with Windmeijer corrected standard errors, the bias in both γ and β is very substantial and test sizes are seriously distorted. Some further pilot simulations disclosed that N should be very much larger than 200 in order to find much more reasonable asymptotic approximation errors.

Nonstationarity
Next we examine the effects of a value of φ different from unity, while s = −50. We just consider setting φ = 0.5 and perturbing x_i0 and y_i0 according to (4.25), so that their dependence on the effects is initially 50% away from stationarity, which renders BB estimation inconsistent. We indicate this in the parametrization code by changing P into Pφ. Comparing the results for Pφ0-XA with those for P0-XA, where φ = 1 (effect stationarity), we note from Table 19 (Pφ0fc-XA) a rather moderate positive bias in the BB estimators of both γ and β when both T and γ are small. Despite the inconsistency of BB, the bias is very mild for larger T and for smaller γ. This can be explained because convergence towards effect stationarity does occur as time proceeds, and this convergence is faster for smaller γ. In fact, we note that the RMSE of inconsistent BB1, BB2a and BB2c is often smaller than that of consistent AB1, AB2a and AB2c when T and γ are not both small. With respect to the AB estimators we find little to no difference compared to the results under stationarity. Table 20 (Pφ0ft-XA) shows that when γ = 0.8 the BB2cW coefficient test on γ yields very mild overrejection, while AB2aW and AB2cW seriously overreject. For smaller values of γ it is the other way around. After collapsing (not tabulated here) similar but more moderate patterns are found, due to the mitigated bias, which again goes together with slightly increased standard errors. Hence, for this case we find that if T is not too small and γ not very large, we should not worry too much when applying BB even if effect stationarity does not strictly hold for the initial observations. As it happens, we note from Table 21 (Pφ0fJ-XA) that the rejection probabilities of the JES tests are such that they are relatively low when BB inference is preferable to AB inference, and relatively high when either T or γ are low and φ = 0.5.
This pattern is much more pronounced for the JES tests than for the JBB tests. However, it is also the case in Pφ0 that collapsing mitigates this welcome quality of the JES tests to warn against unfavorable consequences of effect nonstationarity for BB inference. From Pφ1-XA, in which the individual effects are much more prominent, we find that φ = 0.5 has curious effects on AB and BB results. For effect stationarity (φ = 1) we already noted more bias for AB than under P0. For γ large, this bias is even more serious when φ = 0.5, despite the consistency of AB. For BB estimation the reduction of φ leads to much larger bias and much smaller Stdv, with the effect that the RMSE values of inconsistent BB are usually much worse than those of AB, though often slightly better, except for BB2c, when γ = 0.8. All BB coefficient tests on γ have size close or equal to 1 under Pφ1-XA, and the AB tests for γ = 0.8 overreject very seriously as well. Under Pφ1-XC the bias of AB is reasonable except for γ = 0.8. The bias of BB has decreased but is still enormous, although BB still has preferable RMSE when γ = 0.8. Especially regarding tests on γ, BB fails. For both the a and c versions the JES test has a high rejection probability for detecting φ ≠ 1, except when γ is large. The relatively low rejection probability of the JES tests obtained after collapsing when γ = 0.8 and φ = 0.5 again indicates that, despite its inconsistency, BB has similar or smaller RMSE than AB for that specific case.
Next we consider the simultaneous model again. In case Pφ5-EA estimator AB is consistent and BB again inconsistent. Nevertheless, for all γ and T values examined in Table 22 (Pφ5fc-EA), AB has a more severe bias than BB, whereas BB at the same time has smaller Stdv values and thus a smaller RMSE for all γ and T values examined. The size control of coefficient tests is worse for AB, but for BB it is appalling too, where BB2aW, with estimated type I error probabilities ranging from 5% to 70%, is often preferable to BB2cW. The 2-step JAB tests behave reasonably (they wrongly indicate inconsistency with a probability rather close to 5%), whereas the JBB tests reject with probabilities in the 3-38% range, and JES in the 3-69% range. By collapsing, the RMSE of AB generally reduces when T ≥ 6, and that of BB especially when γ = 0.8. BB again has smaller RMSE than AB. The rejection rates of the JBB and JES tests are substantially lower now, which seems bad because the invalid (first-differenced) instruments are detected less often, but may nevertheless be appreciated because it induces a preference for the less inaccurate BB inference over AB inference. After collapsing, the size distortions of BB2aW and BB2cW are less extreme too, now ranging from 5-33%, but the RMSE values for BB may suffer from collapsing, especially when γ and T are small. The RMSE values for BB under Pφ5-WA and Pφ5-XA are usually much worse than those for AB under Pφ5-EA. Hence, although the invalid instruments for the level equation are not necessarily a curse when the endogeneity of x_it is respected, they should not be used when they are invalid for two reasons (φ ≠ 1, ρ_xε ≠ 0). That neither AB nor BB should be used in P5 under W and X will be indicated with highest probability under WC, and then this probability is larger than 0.8 for JBB(2,1)a only when T is high and for JAB(2,1)a only when both T and γ are high.
Summarizing our findings regarding effect nonstationarity, we have established that although φ ≠ 1 renders the BB estimators inconsistent, BB inference nevertheless beats consistent AB, especially when T is not small and γ not large, provided any endogeneity of x_it is respected. The JES test seems to have the remarkable property of being able to guide towards the technique with the smallest RMSE instead of the technique exploiting the valid instruments.

Empirical results
The above findings will now be employed in a re-analysis of the data and some of the techniques studied in Ziliak (1997). The main purpose of that article was to expose the downward bias in GMM as the number of moment conditions expands. This is done by estimating a static life-cycle labor-supply model for a ten year balanced panel of males, and comparing for various implementations of 2SLS and GMM the coefficient estimates and their estimated standard errors when exploiting expanding sets of instruments. We find this approach rather naive for various reasons: (a) the difference between empirical coefficient estimates will at best provide a very poor proxy for any underlying difference in bias; (b) standard asymptotic variance estimates of IV estimators are known to be very poor representations of true estimator uncertainty; 11 (c) the whole analysis is based on just one sample and possibly the model is seriously misspecified. 12 The latter issue also undermines the conclusions drawn in Ziliak (1997) on overrejection by the J test, because it is of course unknown in which, if any, of his empirical models the null hypothesis is true. To avoid such criticism we designed the controlled experiments in the two foregoing sections on the effects of different sets of instruments on various relevant inference techniques. And now we will examine how these simulation results can be exploited to underpin actual inference from the data set used by Ziliak. This data set originates from waves XII-XXI and the years 1979-1988 of the PSID. The subjects are N = 532 continuously married working men aged 22-51 in 1979. Ziliak (1997) employs the static model 13 ln h_it = β ln w_it + z_it′δ + η_i + ε_it, where h_it is the observed annual hours of work, w_it the hourly real wage rate, z_it a vector of four characteristics (kids, disabled, age, age-squared), η_i an individual effect and ε_it the idiosyncratic error term. It is assumed that ln w_it may be an endogenous regressor and that all variables included in z_it are predetermined.
The parameter of interest is β, and its GMM estimates range from approximately 0.07 to 0.52, depending on the implementation. Column (3) shows that although some estimated standard errors substantially increase, many others remain almost the same or even decrease when we treat log wage as endogenous. Especially the findings in column (3) on the coefficients γ1 and γ2 strongly reject the static specifications maintained in Ziliak (1997) and in Cameron and Trivedi (2005) for the same data. On the other hand, the estimated values are such that the simulation results on large γ values do not seem relevant here. Testing the endogeneity by the difference between the two obtained JAB(2,1)a statistics yields 10.52 (with p-value 0.161), leaving us still in doubt whether we should simply accept predeterminedness of wage, or err on the side of caution and treat it as endogenous. The rather indeterminate test results on endogeneity are in agreement with the finding that the coefficient estimates in columns (1) and (3) are only slightly different. They imply an immediate wage elasticity of hours worked (β_w0) of 0.40 and 0.46 respectively, and an estimated long-run elasticity TM_w = (β_w0 + β_w1 + β_w2)/(1 − γ1 − γ2) of 0.68 and (assuming endogeneity of wage) 0.48. A static model would self-evidently produce identical point estimates for the immediate and the long-run wage elasticity of hours worked. The estimates in columns (1) and (3) suggest a value of DEN_ηy of about 1.22 or 0.96 respectively, which we take to mean that the relatively unfavorable simulation results for case P1, where it is 4, do not apply here.
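The long-run elasticity TM_w just reported can be evaluated mechanically from the coefficient estimates. The sketch below uses hypothetical coefficient values chosen for illustration only, not the actual estimates from the columns discussed above:

```python
def long_run_elasticity(beta_w, gamma):
    # TM_w = (beta_w0 + beta_w1 + beta_w2) / (1 - gamma1 - gamma2):
    # the summed current and lagged wage coefficients divided by one
    # minus the summed lagged-dependent-variable coefficients.
    return sum(beta_w) / (1.0 - sum(gamma))

# hypothetical coefficients, not the paper's estimates:
print(round(long_run_elasticity([0.40, 0.05, -0.11], [0.30, 0.20]), 2))  # 0.68
```

In a static model the lagged coefficients are all zero, so the immediate and long-run elasticities coincide, which is exactly why the dynamic estimates above can differ so much between the two.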
Next we estimate the same model again, but now by robust 1-step Arellano-Bond, assuming all variables to be predetermined in column (4) and wage to be endogenous in column (5). Most AB1aR standard errors are a bit larger than those for the corresponding AB2aW. This is in agreement with the differences between the true standard deviations that we noted in the simulations for P0 and P5. Although these simulations did not produce a serious difference between the coefficient bias of AB2 and AB1, the results in column (4) yield a much larger estimated TM_w of 0.908. The test on overidentification provided by xtabond2 14 (after estimating by AB1aR) is again JAB(2,1)a and not JAB(1,1)a.
Next we re-estimate the models of columns (4) and (5) by AB1aR, though employing substantially fewer instruments by collapsing. The results in columns (6) and (7) show some remarkable changes, both regarding the coefficient estimates (yielding substantially lower TM_w) and regarding the standard errors, which are often larger (as expected) but sometimes much lower. According to the simulation results, both (6) and (7) will produce seriously biased coefficient estimates in the presence of a genuinely endogenous regressor, and both virtually unbiased estimates if all regressors are actually predetermined. The two versions of the incremental test (with 1 degree of freedom) on endogeneity give here p-values of 0.100 and 0.078 respectively. The collapsed AB2aW results in column (8) are less extreme and more in line with those of column (3). An unpleasant finding in columns (7) and (8) is that they lead to rejection by the incremental J test of the validity of the time dummies as instruments. Since these are exogenous by nature, this raises doubts about the adequacy of the model specification. Such doubts may also be fed by the fact that all columns, except (8), suggest that a disability leads to increased labor supply, although this might perhaps be realistic in the US.
We also employed BB2aW estimation, for E and W, and for A and C. When using all instruments we found p-values for the incremental test on effect stationarity of 0.061 and 0.045. However, after collapsing these are 0.314 and 0.252. This is a pattern that we noted in our simulations, and it gives reason to believe that we should reject effect stationarity and take this test outcome as a warning that BB inference may here be less accurate than AB inference. The BB coefficient estimates imply an insignificant but positive immediate wage elasticity, whereas the long-run elasticity for all four is close to zero or even negative. This could well be due to the huge coefficient bias that we noted in our experiments with φ = 0.5, especially those with an endogenous regressor.
Hence, supposing that for these data wage is indeed endogenous, that the effects are nonstationary and that heteroskedasticity is present, our cases Pφ5-EA and Pφ5-EC with moderate γ, T = 6 and θ = 1 seem most relevant. For these we learned that collapsing has some advantages, but otherwise there is little to choose between AB1a and AB2a, because both yield substantial negative bias (about −50%) for γ and huge positive bias (about +30%) for β, with actual test sizes of corrected t-tests close to 25%, even though in this data set N is actually larger than 200. It therefore seems safe to conclude that there is a great need for more refined inference procedures for structural dynamic panel data models.

Major findings
In social science the quantitative analysis of many highly relevant problems requires structural dynamic panel data methods. These methods allow for observed data of an at best quasi-experimental nature, whereas the causal structure and the dynamic interactions in the presence of unobserved heterogeneity have yet to be unraveled. When the cross-section dimension of the sample is not very small, employing GMM techniques seems most appropriate in such circumstances. This is also practical, since corresponding software packages are widely available. However, not much is known yet about the actual accuracy, in practical situations, of the abundance of different, not always asymptotically equivalent, implementations of estimators and test procedures. This study aims to demarcate the areas in the parameter space where the asymptotic approximations to the properties of the relevant inference techniques in this context have either proved to be reliable beacons or are actually often misguiding marsh fires.
In this context we provide a rather rigorous treatment of many major variants of GMM implementations, as well as of the inference techniques for testing the validity of particular orthogonality assumptions and restrictions on individual coefficient values. Special attention is given to the consequences of the joint presence in the model of time-constant and individual-constant unobserved effects, covariates that may be strictly exogenous, predetermined or endogenous, and disturbances that may show particular forms of heteroskedasticity. Also the implications of the initial conditions of the separate regressors with respect to individual effect stationarity are analyzed in great detail, and various popular options that aim to mitigate bias by reducing the number of exploited internal instruments are elucidated. In addition, as alternatives to those used in current standard software, less robust weighting matrices and additional variants of Sargan-Hansen test implementations are considered, as well as the effects of particular modifications of the instruments under heteroskedasticity.
Next, a simulation study is designed in which all the above variants and details are being parametrized and categorized, which leads to a data generating process involving 10 parameters, for which, under 6 different settings regarding sample size and initial conditions, 60 different grid points are examined. For each setting and various of the grid points 13 different choices regarding the set of instruments have been used to examine 12 different implementations of GMM coefficient estimates, giving rise to 24 different implementations of t-tests and 27 different implementations of Sargan-Hansen tests. From all this only a pragmatically selected subset of results is actually presented in this paper.
The major conclusion from the simulations is that, even when the cross-section sample size is several hundreds, the quality of this type of inference depends heavily on a great number of aspects of which many are usually beyond the control of the investigator, such as: magnitude of the time-dimension sample size, speed of dynamic adjustment, presence of any endogenous regressors, type and severity of heteroskedasticity, relative prominence of the individual effects and (non)stationarity of the effect impact of any of the explanatory variables. The quality of inference also depends seriously on choices made by the investigator, such as: type and severity of any reductions applied regarding the set of instruments, choice between (robust) 1-step or (corrected) 2-step estimation, employing a modified GMM estimator, the chosen degree of robustness of the adopted weighting matrix, the employed variant of coefficient tests and of (incremental) Sargan-Hansen tests in deciding on the endogeneity of regressors, the validity of instruments and on the (dynamic) specification of the relationship in general.
Our findings regarding the alternative approaches of modifying instruments and exploiting different weighting matrices are as follows for the examined case of cross-sectional heteroskedasticity. Although the unfeasible form of modification does yield very substantial reductions in both bias and variance, for the straightforward feasible implementation examined here the potential efficiency gains do not materialize. The robust weighting matrix, which also allows for possible time-series heteroskedasticity, often performs as well as (and sometimes even better than) a specially designed less robust version, although the latter occasionally demonstrates some benefits for incremental Sargan-Hansen tests.
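In sketch form, the standard robust weighting matrix discussed here inverts the average over individuals of Z_i′ u_i u_i′ Z_i, with u_i the 1-step residual vector of individual i; because it is built from outer products of entire residual vectors, it is robust to both cross-sectional and time-series heteroskedasticity. The helper name and the simulated inputs below are hypothetical:

```python
import numpy as np

def robust_weight_matrix(Z_list, u_list):
    # W = [ (1/N) * sum_i Z_i' u_i u_i' Z_i ]^{-1}, with u_i the 1-step
    # residual vector of individual i; robust to arbitrary cross-sectional
    # and time-series heteroskedasticity.
    N = len(Z_list)
    S = sum(Z.T @ np.outer(u, u) @ Z for Z, u in zip(Z_list, u_list)) / N
    return np.linalg.inv(S)

# simulated example: N individuals, 4 differenced equations, 3 instruments
rng = np.random.default_rng(0)
N, rows, cols = 200, 4, 3
Zs = [rng.standard_normal((rows, cols)) for _ in range(N)]
us = [rng.standard_normal(rows) for _ in range(N)]
W = robust_weight_matrix(Zs, us)
print(W.shape)  # (3, 3)
```

The less robust alternatives examined in the paper replace the outer product u_i u_i′ by more restrictive structures; the simulations indicate that this rarely pays off.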
Furthermore we can report to practitioners: (a) when the effect-noise ratio is large, the performance of all GMM inference deteriorates; (b) the same occurs in the presence of a genuinely endogenous regressor (or one needlessly treated as endogenous); (c) in many settings the coefficient restriction tests show serious size problems, which can usually be mitigated by a Windmeijer correction, although for γ large or under simultaneity serious overrejection remains unless N is very much larger than 200; (d) the limited effectiveness of the Windmeijer correction is due to the fact that the positive or negative bias in the coefficient estimates is often more serious than the negative bias in the variance estimate; (e) limiting to some degree the number of instruments usually reduces bias and therefore improves the size properties of coefficient tests, though at the potential cost of power loss because efficiency usually suffers; (f) for the case of an autoregressive strictly exogenous regressor we noted that it is better not to instrument it just by itself, but also by some of its lags, because this improves inference, especially regarding the lagged dependent variable coefficient; (g) to mitigate size problems of the overall Sargan-Hansen overidentification tests the set of instruments should be reduced, possibly by collapsing, and then, after 2-step estimation, none of the examined alternative variants systematically beat the one using the standard robust weighting matrix based on 1-step residuals; (h) collapsing also reduces size problems of the incremental Sargan-Hansen effect stationarity test; (i) except under simultaneity, the GMM estimator which exploits instruments that are invalid under effect nonstationarity may nevertheless perform better than the estimator abstaining from these instruments; (j) the rejection probability of the incremental Sargan-Hansen test for effect stationarity is such that it tends to direct the researcher towards applying the most accurate estimator, even if this is inconsistent.
When re-analyzing a popular empirical data set in the light of the above simulation findings, we note in particular that actual dynamic feedbacks may be much more subtle than those that can be captured by just including a lagged dependent variable regressor, which at present seems the most common approach to modeling dynamics in panels. In theory the omission of further lagged regressor variables should result in rejections by Sargan-Hansen test statistics, but their power suffers when many valid and some invalid orthogonality conditions are tested jointly, instead of by deliberately chosen sequences of incremental tests or direct variable addition tests. Hopefully tests for serial correlation, which we intentionally left out of this already overloaded study, provide extra help to practitioners in guiding them towards well-specified models.