1. Introduction
One of the major attractions of analyzing panel data rather than single-indexed variables is that they allow us to cope with the empirically very relevant situation of unobserved heterogeneity correlated with included regressors. Econometric analysis of dynamic relationships on the basis of panel data, where the number of surveyed individuals is relatively large while only a few time periods are covered, is very often based on GMM (the generalized method of moments). Its reputation is built on its claimed flexibility, generality, ease of use, robustness and efficiency. Widely available standard software enables us to estimate models including exogenous, predetermined and endogenous regressors consistently, while allowing for semiparametric approaches regarding the presence of heteroskedasticity and the type of distribution of the disturbances. This software also provides specification checks regarding the adequacy of the internal and external instrumental variables employed and the specific assumptions made regarding (absence of) serial correlation.
Especially popular are the GMM implementations put forward by Arellano and Bond [1]. However, practical problems have often been reported, such as vulnerability due to the abundance of internal instruments, discouraging improvements of 2-step over 1-step GMM findings, poor size control of test statistics, and weakness of instruments, especially when the dynamic adjustment process is slow (a root is close to unity). As remedies it has been suggested to reduce the number of instruments by renouncing some valid orthogonality conditions, but also to extend the number of instruments by adopting more orthogonality conditions. Extra orthogonality conditions can be based on certain homoskedasticity or stationarity assumptions or initial value conditions, see Blundell and Bond [2]. By abandoning weak instruments finite sample bias may be reduced, whereas by extending the instrument set with a few strong ones the bias may be further reduced and the efficiency enhanced. Presently, it is not yet clear how practitioners can best make use of these suggestions, because no set of preferred testing tools is available, nor a comprehensive sequential specification-search strategy, which would allow us in a systematic fashion to select instruments by assessing both their validity and their strength, as well as to classify individual regressors accurately as relevant and either endogenous, predetermined or strictly exogenous. Therefore, it often happens in applied research that models and techniques are selected simply on the basis of the perceived significance and plausibility of their coefficient estimates, whereas it is well known that imposing invalid coefficient restrictions and employing regressors wrongly as instruments will often lead to relatively small estimated standard errors. Then, however, these provide misleading information on the actual precision of the often seriously biased estimators.
The available studies on the performance of alternative inference techniques for dynamic panel data models have obvious limitations when it comes to advising practitioners on the most effective implementations of estimators and tests under general circumstances. As a rule, they do not consider various empirically relevant issues in conjunction, such as: (i) occurrence and the possible endogeneity of regressors additional to the lagged dependent variable, (ii) occurrence of individual effect (non-)stationarity of both the lagged dependent variable and other regressors, (iii) cross-section and/or time-series heteroskedasticity of the idiosyncratic disturbances, and (iv) variation in signal-to-noise ratios and in the relative prominence of individual effects. For example: the simulation results in Arellano and Bover [3], Hahn and Kuersteiner [4], Alvarez and Arellano [5], Hahn et al. [6], Kiviet [7], Kruiniger [8], Okui [9], Roodman [10], Hayakawa [11] and Han and Phillips [12] just concern the panel AR(1) model under homoskedasticity. Although an extra regressor is included in the simulation studies in Arellano and Bond [1], Kiviet [13], Bowsher [14], Hsiao et al. [15], Bond and Windmeijer [16], Bun and Carree [17,18], Bun and Kiviet [19], Gouriéroux et al. [20], Hayakawa [21], Dhaene and Jochmans [22], Flannery and Hankins [23], Everaert [24] and Kripfganz and Schwarz [25], this regressor is (weakly-)exogenous and most experiments just concern homoskedastic disturbances and stationarity regarding the impact of individual effects. Blundell et al. [26] and Bun and Sarafidis [27] include an endogenous regressor, but their design does not allow us to control the degree of simultaneity; moreover, they stick to homoskedasticity. Harris et al. [28] only examine the effects of neglected endogeneity. Heteroskedasticity is considered in a few simulation experiments in Arellano and Bond [1] in the model with an exogenous regressor, and just for the panel AR(1) case in Blundell and Bond [2]. Windmeijer [29] analyzes panel GMM with heteroskedasticity, but without including a lagged dependent variable in the model. Bun and Carree [30] and Juodis [31] examine effects of heteroskedasticity in the model with a lagged dependent and a strictly exogenous regressor under stationarity regarding the effects. Moral-Benito [32] examines stationary and nonstationary regressors in a dynamic model with heteroskedasticity, but the extra regressor is predetermined or strictly exogenous. Moreover, his study is restricted to time-series heteroskedasticity, while assuming cross-sectional homoskedasticity. In a micro context cross-sectional heteroskedasticity seems more realistic to us, whereas it is also trickier when N is large and T small.
So, knowledge is still scarce with respect to the performance of GMM when it has to cope not only with genuine simultaneity (which we consider to be the core of econometrics), but also with heteroskedasticity of unknown form. Moreover, many of the simulation studies mentioned above did not systematically explore the effects of relevant nuisance parameter values on the finite sample distortions to asymptotic approximations. We also examine estimation of a prominent nuisance parameter, namely the variance of the individual effects, which to date has received surprisingly little attention in the literature. Regarding the performance of tests on the validity of instruments, worrying results have been obtained in Bowsher [14] and Roodman [10] for homoskedastic models. On the other hand, Bun and Sarafidis [27] report reassuring results, but these just concern a restricted class of models. Hence, it would be useful to examine more cases; our grid of examined cases will be much wider and cover more dimensions. Moreover, we will deliberately explore both feasible and unfeasible versions of estimators and test statistics (in unfeasible versions particular nuisance parameter values are assumed to be known). Therefore we will be able to draw more useful conclusions on which aspects have major effects on any inference inaccuracies in finite samples.
The data generating process designed here can be simulated for classes of models which may include individual and time effects, a lagged dependent variable regressor and another regressor, which may be correlated with the individual effects and be either strictly exogenous or jointly dependent with regard to the idiosyncratic disturbances, whereas the latter may show a form of cross-section heteroskedasticity associated with both individual effects. For a range of relevant parameter values we will verify in moderately large samples the properties of alternative GMM estimators, both 1-step and 2-step, focusing on alternative implementations regarding the weighting matrix and corresponding corrections to variance estimates according to the often practiced approach by Windmeijer [29]. This will include variants of the popular system estimator, which exploit as instruments the first-differences of lagged internal variables for the untransformed model in addition to lagged level internal variables as instruments for the model from which the individual effects have been removed. We will examine cases where the extra instruments are (in)valid in order to verify whether particular tests for overidentification restrictions have appropriate size and power, such that with reasonable probabilities valid instruments will be recognized as appropriate and invalid instruments will be detected and can be discarded. Moreover, following Kiviet and Feng [33], we shall investigate a rather novel modification of the traditional GMM implementation which aims at improving the strength of the exploited instruments in the presence of heteroskedasticity. Of course, the simulation design used here has its limitations too. It has only one extra regressor next to the lagged dependent variable, we only consider cross-sectional heteroskedasticity, and all basic random terms have been drawn from the normal distribution. Moreover, the design does not accommodate general forms of cross-sectional dependence between error terms. However, by including individual and time specific effects, particular simple forms of cross-sectional dependence are accommodated.
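To fix ideas, the following minimal sketch (in Python) generates a panel of the general kind just described: a lagged dependent variable, one extra regressor that may load on the individual effect and may be correlated with the idiosyncratic disturbance, and lognormal cross-sectional heteroskedasticity. All parameter names (gamma, beta, rho_xe, xi, pi_eta, theta, burn_in) and their values are our own illustrative assumptions, not the paper's DGP or parametrization.

# A minimal sketch (not the paper's exact DGP) of a dynamic panel with one extra regressor,
# individual effects and lognormal cross-sectional heteroskedasticity.
import numpy as np

def simulate_panel(N=200, T=6, gamma=0.5, beta=1.0, rho_xe=0.3, xi=0.8,
                   pi_eta=0.5, theta=0.5, sigma_eta=1.0, burn_in=50, seed=0):
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(N)                   # standardized individual effects
    sig_eps = np.exp(theta * eta)                  # lognormal cross-sectional heteroskedasticity
    sig_eps /= np.sqrt(np.mean(sig_eps ** 2))      # rescale so that the average variance is one
    y = np.zeros((N, T + burn_in))
    x = np.zeros((N, T + burn_in))
    for t in range(1, T + burn_in):
        u = rng.standard_normal(N)                 # standardized shock of the y-equation
        eps = sig_eps * u                          # heteroskedastic idiosyncratic disturbance
        v = rho_xe * u + np.sqrt(1 - rho_xe ** 2) * rng.standard_normal(N)  # x-shock, corr(v,u)=rho_xe
        x[:, t] = xi * x[:, t - 1] + pi_eta * eta + v       # extra regressor: AR(1) plus effect loading
        y[:, t] = gamma * y[:, t - 1] + beta * x[:, t] + sigma_eta * eta + eps
    return y[:, burn_in:], x[:, burn_in:]          # discard pre-sample (burn-in) periods

y, x = simulate_panel()                            # rho_xe != 0 makes x endogenous, rho_xe = 0 exogenous

Setting rho_xe to zero yields a strictly exogenous regressor; a nonzero value introduces simultaneity of the kind studied below.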
Due to the high dimensionality of the Monte Carlo design a general discussion of the major findings is hard, because particular qualities (and failures) of inference techniques are usually not global but only occur in a particular limited context. However, in the penultimate paragraph of the concluding Section 7 we nevertheless provide a list of eleven (a through k) established observations which seem very useful for practitioners. Here we list a few more recommendations, most of which seem contrary to current dominant practice: (i) many studies claim to have dealt with the limitations of a static model by just including the lagged dependent variable, whereas its single extra coefficient just leads to a highly restrictive dynamic model; (ii) it seems widely believed that an exogenous regressor should just be instrumented by itself, whereas using its lags as instruments too is highly effective for instrumenting further non-exogenous regressors; (iii) test statistics involving a large number of degrees of freedom will generally lack power when they jointly test restrictions of which only few are false, and therefore Sargan-Hansen statistics should as a rule be partitioned into a series of well-chosen increments (a minimal sketch of such an incremental test follows below); (iv) it has been reported (see, for instance, Hayashi [34], p. 218) that Sargan-Hansen tests tend to overreject, especially when using many instruments, though in our simulations we find that underrejection is predominant, as already reported under homoskedasticity by Bowsher [14]; (v) estimates of nuisance parameters are generally useful to interpret estimated parameters of primary interest, and therefore not only the variance of the idiosyncratic disturbances but also the variance of the individual effects should be examined.
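As an illustration of recommendation (iii), the sketch below shows how an incremental (difference-in-Sargan/Hansen) test can be computed from the J statistics of two nested instrument sets; the numerical inputs are purely illustrative and the usual GMM regularity conditions are assumed.

# Sketch of an incremental (difference-in-Sargan/Hansen) test: compare the J statistic of the
# full instrument set with that of the uncontroversial subset; the increment is asymptotically
# chi-square with df equal to the number of extra overidentifying restrictions.
from scipy.stats import chi2

def incremental_j_test(j_full, df_full, j_subset, df_subset):
    j_inc = j_full - j_subset            # difference of the two overidentification statistics
    df_inc = df_full - df_subset         # number of additional instruments being tested
    return j_inc, df_inc, chi2.sf(j_inc, df_inc)

# Illustrative numbers only, e.g., testing the additional level-equation instruments of BB
# on top of those already used by AB:
print(incremental_j_test(j_full=31.2, df_full=25, j_subset=24.8, df_subset=20))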
The structure of this study is as follows. In Section 2 we first present the major issues regarding IV and GMM coefficient and variance estimation in linear models, and regarding inference techniques for establishing instrument validity and for testing coefficient values by standard and by corrected test statistics. Next, in Section 3 the generic results of Section 2 are used to discuss in more detail than provided elsewhere the various options for their implementation in linear models for single dynamic simultaneous micro econometric panel data relationships with both individual and time effects and some form of cross-sectional heteroskedasticity. In Section 4 the Monte Carlo design is developed to analyze and compare the performance of alternative, often asymptotically equivalent, inference methods in finite samples for empirically relevant parametrizations. Section 5 summarizes the simulation results, from which some preferred techniques for use in finite samples of particular models emerge, plus a warning regarding particular types of models that require more refined methods yet to be developed. An empirical illustration, which involves data on labor supply earlier examined by Ziliak [35], can be found in Section 6, where we also formulate a tentative comprehensive specification search strategy for dynamic micro panel data models. Finally, in Section 7 the major findings are summarized.
4. Simulation Design
We will examine the stable dynamic simultaneous heteroskedastic DGP given by (99) and its companion equations. Here β has just one element, relating to an extra regressor which follows, for each i, a stable autoregressive process. All underlying random drawings are mutually independent. One parameter indicates the correlation between the cross-sectionally heteroskedastic disturbances of the two equations, which are both homoskedastic over time. How we generated the individual disturbance variances and the start-up values of both series, and how we chose relevant numerical values for the other eleven parameters, will be discussed extensively below.
Note that in this DGP the extra regressor is either strictly exogenous or otherwise endogenous; the only weakly exogenous regressor is the lagged dependent variable. The extra regressor may be affected contemporaneously by two independent individual-specific effects, but also with delays. The dependent variable may be affected contemporaneously by one of the (standardized) individual effects both directly and indirectly (indirectly via the extra regressor), and both effects may also have delayed impacts through the dynamics of the two equations.
The cross-sectional heteroskedasticity is determined by both standardized individual effects and is thus associated with the regressors. It follows a lognormal pattern when both effects are standard normal, because the individual disturbance variances are taken to be log-linear in the effects. The seriousness of the heteroskedasticity increases with the absolute value of θ; for θ = 0 the disturbances of both equations are homoskedastic. Table 2 presents some quantiles and moments of the distributions of the individual disturbance variances and of the corresponding standard deviations (taken as the positive square roots) in order to disclose the effects of parameter θ. It shows that larger absolute values of θ imply pretty serious heteroskedasticity, whereas it may be qualified mild when θ is closer to zero. In all our simulation experiments the disturbances will be unconditionally homoskedastic, irrespective of the value of θ. However, for θ ≠ 0 none of the experiments will be characterized by conditional homoskedasticity, for the following reason: the lagged dependent variable will always be employed as instrument, and because its realizations depend on the individual effects they also depend on the individual disturbance variances.
Without loss of generality we may choose the time effects equal to zero; note that (99) implicitly specifies them. All simulation results refer to estimators where these T restrictions (there are no time effects in the DGP) have not been imposed in estimation. Hence, when estimating the model in levels, time dummies are included among the regressors. Moreover, we may always include the time dummies in the instrument sets for both the transformed and the level equation in order to exploit the corresponding fundamental moment conditions.
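The following sketch illustrates, under our own simplifying assumption that the individual disturbance variances are proportional to exp(θ z) with z standard normal and rescaled to average one (mimicking, but not reproducing, the lognormal construction described above and summarized in Table 2), how the severity of the cross-sectional heteroskedasticity grows with |θ|.

# Sketch: gauge the severity of lognormal cross-sectional heteroskedasticity for several theta
# values, assuming individual variances proportional to exp(theta * z), z ~ N(0,1), rescaled to
# average one. This mimics, but does not reproduce, the construction summarized in Table 2.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)            # standardized individual effect
for theta in (0.0, 0.5, 1.0):
    var_i = np.exp(theta * z)
    var_i /= var_i.mean()                     # normalize the average variance to one
    sd_i = np.sqrt(var_i)
    q5, q50, q95 = np.quantile(sd_i, [0.05, 0.5, 0.95])
    print(f"theta={theta}: sd quantiles 5%/50%/95% = {q5:.2f}/{q50:.2f}/{q95:.2f}")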
Apart from values for θ and the simultaneity parameter, we have to make choices of relevant values for eight more parameters. We could choose values for the adjustment coefficient γ which cover a broad range of adjustment processes for dynamic behavioral relationships, and values for the autoregressive coefficient of the extra regressor process so as to include both less and more smooth processes. Next, interesting values should be given to the remaining six parameters. We will do this by choosing relevant values for six alternative, more meaningful notions, which are all functions of some of the eight DGP parameters and allow us to establish relevant numerical values for them, as suggested in Kiviet [39].
The first three notions will be based on (ratios of) particular variance components of the long-run stationary path of the process for the extra regressor. Using lag-operator notation, and assuming that the relevant long-run expressions exist, we find that the long-run path of the regressor consists of three mutually independent components: two stemming from the two individual effects and one being the accumulated contributions of the idiosyncratic regressor disturbances. The third component is a stationary AR(1) process; approximating its time-varying variance by its average over the observed periods yields the average variance of this component. Adding the variances of the other two components then gives the average long-run variance of the regressor. A first characterization of the regressor series can be obtained by normalizing this average long-run variance; this is an innocuous normalization, because β is still a free parameter. As a second characterization of the series, we choose what we call the (average) effects variance fraction of the regressor, i.e., the share of its average long-run variance that stems from the two individual effects. To balance the two individual effect variances, we define in addition what we call the individual effect fraction, which expresses which part of the (long-run) variance of the regressor stemming from the two individual effects is due to the first of them. For both fractions particular values are chosen in the simulations.
From these three characterizations we can solve for the corresponding DGP parameters. For all three we will only consider the nonnegative root, because changing the sign would have no effects on the characteristics of the regressor series, as we will generate the individual effects and the idiosyncratic disturbances from symmetric distributions. The above choices regarding the regressor process have direct implications for the average correlations between the regressor and its two constituting effects. Now the regressor series can be generated upon choosing a value for the remaining parameter of its equation. This we obtain by fixing the average simultaneity, i.e., the average correlation between the regressor and the idiosyncratic disturbance of the dependent variable equation. In order that both correlations are smaller than 1 in absolute value, an admissibility restriction has to be satisfied, which limits the admissible combinations of the chosen characterizations. That we should not exclude negative values of the simultaneity parameter will become obvious in due course; for the moment it seems interesting to examine a few moderate values.
The remaining choices concern β and the coefficient through which the individual effect directly affects the dependent variable, which both directly affect the DGP for the dependent variable. Substituting (103) and (101) in (99) we find that the long-run stationary path for the dependent variable entails four mutually independent components. The second term of the final expression constitutes for each i an AR(2) process and the third one an ARMA(2,1) process. The variance of the dependent variable thus has four components (derivations in Appendix D). Averaging the last two over all i, we can evaluate the average long-run variance of the dependent variable, see (115). When choosing fixed values for ratios involving these components in order to obtain values for β and the remaining coefficient, we run into the problem of multiple solutions. On the other hand, the four components of (115) have particular invariance properties regarding the signs of the coefficients involved, since changing the sign of all three yields exactly the same value of the average long-run variance. We coped with this as follows. We set the direct effect coefficient simply by fixing the direct cumulated impact of the individual effect on the dependent variable relative to the current noise. Because the direct and indirect (via the regressor) effects of the individual effect may have opposite signs, this coefficient could be given negative values too, but we restricted ourselves to nonnegative values. Finally we fix a signal-to-noise ratio, which gives a value for β. Because under simultaneity the noise and the current signal conflate, we focus for this purpose on the exogenous case. Leaving the variance due to the effects aside, the average signal variance then follows directly, because the current average noise variance is unity. Hence, we may define a signal-to-noise ratio, where we have substituted (109), and choose a value for it in order to find β. Note that here another admissibility restriction crops up, but it is satisfied for the parameter values that we examine. From (119) we only examined the positive root.
Instead of fixing the signal-to-noise ratio SNR, another approach would be to fix the total multiplier, which would directly lead to a value for β given γ. However, different γ values will then lead to different SNR values. At this stage it is hard to say which would yield more useful information from the Monte Carlo, fixing SNR or fixing the total multiplier; keeping both constant for different γ and some other characteristics of this DGP is out of the question. We chose to fix SNR, which yields total multiplier values in the range 1.5–1.8. When comparing with results obtained when fixing the total multiplier instead, we did not note substantial differences of principle.
For all different design parameter combinations considered, which involve the sample sizes and the various DGP parameter values, we used the very same realizations of the underlying standardized random components over the respective 10,000 replications that we performed. At this stage, all these components have been drawn from the standard normal distribution. To speed up the convergence of our simulation results, in each replication we have modified the N drawings of the two individual effects such that they have sample mean zero, sample variance 1 and sample correlation zero. This rescaling is achieved by replacing the N draws of the first effect by their standardized values (subtracting the sample mean and dividing by the sample standard deviation), and by replacing the draws of the second effect by the residuals obtained after regressing them on the first effect and an intercept, next scaling these residuals to unit sample variance as well. In addition, we have rescaled in each replication the individual variance factors by dividing them by their sample average, so that they sum to N, as they should in order to avoid that the presence of heteroskedasticity is conflated with a larger or smaller average disturbance variance.
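A minimal sketch of this within-replication rescaling, with our own variable names (eta and zeta for the two standardized effects, var_i for the individual variance factors), is given below.

# Sketch of the within-replication rescaling: give the two vectors of N effect draws sample
# mean zero, unit sample variance and zero sample correlation, and rescale the individual
# variance factors so that they sum to N.
import numpy as np

def standardize_effects(eta, zeta):
    eta = (eta - eta.mean()) / eta.std()                 # sample mean zero, sample variance one
    zeta = zeta - zeta.mean()
    zeta = zeta - (zeta @ eta) / (eta @ eta) * eta       # residuals of regression on eta and intercept
    return eta, zeta / zeta.std()                        # rescale residuals to unit sample variance

def normalize_variances(var_i):
    return var_i * (len(var_i) / var_i.sum())            # variance factors now sum to N

rng = np.random.default_rng(1)
eta, zeta = standardize_effects(rng.standard_normal(200), rng.standard_normal(200))
var_i = normalize_variances(np.exp(0.5 * eta))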
In the simulation experiments we start up the processes for the dependent variable and the regressor at a pre-sample period by setting them to zero, and next generate both series over s pre-sample periods plus the T sample periods; the data with pre-sample time indices are discarded when estimating the model. We suppose that for s sufficiently large both series will be on their stationary track from the first sample period onwards. When taking s equal to zero or very small, the initial values will be such that effect stationarity has not yet been achieved. Due to the fixed zero start-ups (which are equal to the unconditional expectations), the (cross-)autocorrelations of the two series then have a very peculiar start too, so such results regarding effect nonstationarity will certainly not be fully general, but for s close to zero they mimic in a particular way the situation that the process started only very recently.
Another simple way to mimic a situation in which lagged first-differenced variables are invalid instruments for the model in levels can be designed as follows. Equations (103) and (114) highlight that in the long run the first differences of both series are uncorrelated with the two effects. This can be undermined by perturbing the start-up values of both series, as obtained from their stationary expressions, in such a way that their individual-effect components are scaled by a factor ϕ (equivalently, multiples of these components are added to the stationary start-ups). Note that for ϕ = 1 effect stationarity is maintained, whereas for ϕ < 1 the dependence of the two start-up values on the effects is mitigated in comparison to the stationary track (upon maintaining stationarity regarding the idiosyncratic components), and for ϕ > 1 this dependence is inflated. Note that this is a straightforward generalization of the approach followed in Kiviet [7] for the panel AR(1) model.
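A minimal sketch of this perturbation, under the assumption spelled out above that it amounts to scaling the individual-effect component of both start-up values by ϕ (Equation (123) itself is not reproduced here), could look as follows.

# Sketch (own simplifying assumption, not Equation (123) itself): perturb the dependence of the
# start-up values on the individual effects by a factor phi. phi = 1 keeps effect stationarity,
# phi < 1 mitigates the dependence, phi > 1 inflates it.
def perturb_startups(y0_stat, x0_stat, y0_effect_part, x0_effect_part, phi):
    # replace the stationary effect component mu by phi * mu, i.e., add (phi - 1) * mu
    y0 = y0_stat + (phi - 1.0) * y0_effect_part
    x0 = x0_stat + (phi - 1.0) * x0_effect_part
    return y0, x0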
5. Simulation Results
To limit the number of tables we proceed as follows. Often we will first produce results on unfeasible implementations of the various inference techniques in relatively simple DGPs. These exploit the true values of the disturbance variance parameters instead of their estimates. Although this information is generally not available in practice, only when such unfeasible techniques behave reasonably well in finite samples does it seem useful to examine in more detail the performance of feasible implementations. Results for the unfeasible Arellano and Bond [1] and Blundell and Bond [2] GMM estimators are denoted as ABu and BBu respectively. Their feasible counterparts are denoted as AB1 and BB1 for the 1-step estimators (which under homoskedasticity are equivalent to their unfeasible counterparts) and AB2 and BB2 for the 2-step estimators. For 2-step estimators the lower case letters a, b or c are used (as, for instance, in AB2c) to indicate which type of weighting matrix has been exploited, as discussed in Section 3.2.1 and Section 3.3.2. For the corresponding MGMM implementations these acronyms are preceded by the letter M. Under homoskedasticity their unfeasible implementation has been omitted when it is equivalent to GMM. In BB estimation we have always used the same choice for the initial weighting matrix.
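For reference, the following generic sketch shows how such 1-step and 2-step linear GMM estimators are computed from stacked data; it does not reproduce the particular a, b and c weighting-matrix variants compared in the paper, and assumes a balanced panel with individuals stacked in blocks.

# Generic sketch of 1-step and 2-step linear GMM for a (transformed) panel equation with
# stacked regressors X, instruments Z and dependent vector y.
import numpy as np

def gmm(y, X, Z, W):
    A = X.T @ Z @ W @ Z.T @ X
    b = X.T @ Z @ W @ Z.T @ y
    return np.linalg.solve(A, b)

def two_step_gmm(y, X, Z, W1, n_individuals):
    beta1 = gmm(y, X, Z, W1)                       # 1-step estimator with chosen initial weights
    u = y - X @ beta1
    S = np.zeros((Z.shape[1], Z.shape[1]))
    for Zi, ui in zip(np.split(Z, n_individuals), np.split(u, n_individuals)):
        S += Zi.T @ np.outer(ui, ui) @ Zi          # residual-based 2-step weighting matrix
    return gmm(y, X, Z, np.linalg.pinv(S)), beta1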
First, in Section 5.1, we will discuss the results for DGPs in which the initial conditions are such that BB estimation will be consistent and more efficient than AB; subsequently, in Section 5.2, the situation where BB is inconsistent is examined. Within these subsections we will examine different parameter value combinations for the DGP. We will start by presenting results for a reference parametrization (indicated P0), which has been chosen such that the model has in fact four parameters less: the extra regressor is strictly exogenous, particular correlations with the individual effects are set to zero, and any cross-sectional heteroskedasticity is just related to one of the two standardized effects. These choices (implying that any heteroskedasticity will be unrelated to the mean of the regressor) may (hopefully) lead to results where little difference between unfeasible and feasible estimation will be found and where test sizes are relatively close to the nominal level of 5%. Next we will discuss the effects of settings (to be labelled P1, P2, etc.) which deviate from this reference parametrization P0 in one or more aspects regarding the various correlations and variance fractions and ratios. In P0 the relationship for the dependent variable will be characterized by equally important error components: the impacts of the individual effect and of the idiosyncratic disturbance have equal variance. The two remaining parameters have been held fixed over all cases examined (including P0): the regressor series has a fixed autoregressive coefficient, and the signal-to-noise ratio is held fixed as well (excluding the impacts of the individual effects, the variance of the explanatory part of the dependent variable is three times as large as the disturbance variance).
In Section 3.2 we already indicated that we will examine implementations of GMM where all internal instruments associated with linear moment conditions are employed (A), but also particular reductions based either on collapsing (C) or omitting long lags (L3, etc.), or a combination (C3, etc.). On top of this we will also distinguish situations that may lead to reductions of the instruments that are being used, because the extra regressor in model (99), which will either be strictly exogenous or endogenous with respect to the idiosyncratic disturbances, might be rightly or wrongly treated as either strictly exogenous, or as predetermined (weakly exogenous), or as endogenous. These three distinct situations will be indicated by the letters X, W and E respectively. So, in parametrization P0, where the regressor is strictly exogenous, the instruments used by either A, C or, say, L2, are not the same under the situations X, W and E. This is hopefully clarified in the next paragraph.
Since we assume that for estimation just the observations over the T sample periods (plus the initial values) are available, the internal instruments that are used under XA (all instruments, with the extra regressor treated as strictly exogenous) for estimation of the equation in first differences consist of the time dummies, all permitted lags of the dependent variable, and all lags and leads of the extra regressor. Under WA the leads (and current values) of the regressor are abstained from, and under EA its most recent admissible lag is dropped as well, so these instrument sets are smaller. From Section 3.3.1 it follows that for BB estimation the number of instruments increases further with 1 (for the intercept) plus additional first-differenced instruments when the dependent variable and/or the regressor are supposed to be effect stationary, with fewer extra instruments when the regressor is treated as endogenous. Hence the number of extra instruments is largest under XA and WA and somewhat smaller under EA, whereas these extra instruments will be valid in Section 5.1 below and invalid in Section 5.2.
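The sketch below applies the usual textbook counting rules for GMM-style instruments in the first-differenced equation, per classification of a variable as exogenous, predetermined or endogenous, and with optional collapsing or lag truncation; it illustrates the mechanics only and does not reproduce the exact totals under XA, WA and EA.

# Sketch of standard GMM-style instrument counting for the first-differenced equation (periods
# t = 0,...,T-1, differenced equations at t = 2,...,T-1), per classification of one variable,
# with optional collapsing or lag truncation. Textbook rules only; not the paper's exact totals.
def n_diff_instruments(T, kind, collapse=False, max_lag=None):
    total, deepest = 0, 0
    for t in range(2, T):
        if kind == "exogenous":
            admissible = T                  # all periods usable (lags, current value and leads)
        elif kind == "predetermined":
            admissible = t                  # observations dated t-1 and earlier
        elif kind == "endogenous":
            admissible = t - 1              # observations dated t-2 and earlier
        else:
            raise ValueError(kind)
        if max_lag is not None:
            admissible = min(admissible, max_lag)
        total += admissible
        deepest = max(deepest, admissible)
    return deepest if collapse else total   # collapsing keeps one column per lag depth

for kind in ("exogenous", "predetermined", "endogenous"):
    print(kind, n_diff_instruments(6, kind), n_diff_instruments(6, kind, collapse=True))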
For the tables to follow we always examine three values for the dynamic adjustment coefficient γ at three sample size values, while T is mostly small, as in the classic Arellano and Bond [1] study. This is done both for θ = 0 (homoskedasticity) and for a nonzero θ (substantial cross-sectional heteroskedasticity). Tables have a code which starts with the design parametrization, followed by the character u or f, indicating whether the table contains unfeasible or feasible results. Because of the many feasible variants not all results can be combined in just one table. Therefore, the f is followed by c, t, J or σ, where c indicates that the table just contains results on coefficient estimates, namely estimated bias, standard deviation (Stdv) and RMSE (root mean squared error; below often loosely addressed as precision); t refers to estimates of the actual rejection probabilities of tests on true coefficient values; J indicates that the table only contains results on Sargan-Hansen tests; and σ indicates that the table just contains results on estimating the standard deviations of the idiosyncratic disturbances and of the individual effects. Next, after a dash (-), the earlier discussed code is given for how the extra regressor is actually treated when selecting the instruments, followed by the type of instrument reduction.
5.1. DGPs under Effect Stationarity
Here we focus on the case where BB is consistent and more efficient than AB, since the initial conditions satisfy effect stationarity for both the dependent variable and the regressor.
5.1.1. Results for the Reference Parametrization P0
Table 3, with code P0u-XA, gives results for unfeasible GMM coefficient estimators, unfeasible single coefficient tests, and for unfeasible Sargan-Hansen tests for the reference parametrization P0 when the extra regressor is (correctly) treated as strictly exogenous and all available instruments are being used.
Table 4 (P0fc-XA) presents a selection of feasible counterparts regarding the coefficient estimators. Under homoskedasticity we see that for the ABu estimator of γ its bias (which is negative), Stdv and thus its RMSE increase with γ and decrease with the sample size, whereas the bias of the β estimator is moderate and its RMSE, although decreasing in the sample size, is almost invariant with respect to β. The BBu coefficient estimates are superior indeed, the more so for larger γ values (as is already well known), but less so for β. As already conjectured in Section 3.6, under cross-sectional heteroskedasticity both ABu and BBu are substantially less precise than under homoskedasticity. However, modifying the instruments under cross-sectional heteroskedasticity, as is done by MABu and MBBu, yields considerable improvements in performance both in terms of bias and RMSE. In fact, the precision of the unfeasible modified estimators under heteroskedasticity comes very close to that of their counterparts under homoskedasticity.
The simulation results in Table 4 for feasible estimation do not contain the b variant of the weighting matrix because it performs so badly, whereas both the a and c variants yield RMSE values very close to their unfeasible counterparts, under homoskedasticity as well as heteroskedasticity. Although the best unfeasible results under heteroskedasticity are obtained by MBBu, this does not fully carry over to MBB, because for T small, and also for moderate T and large γ, BB2c performs much better. The performance of MAB and AB2c is rather similar, whereas we established that their unfeasible variants differ a lot when γ is large. Apparently, the modified estimators can be much more vulnerable when the variances of the error components are unknown, probably because their estimates have to be inverted in (92) and (94).
From the type I error estimates for unfeasible single coefficient tests in Table 3 we see that the standard test procedures work pretty well for all techniques regarding β, but with respect to γ ABu fails for larger γ values. This gets even worse under heteroskedasticity, but less so for MABu. For BBu and MBBu the results are reasonable; here the test seems to benefit from the smaller bias of BBu. For the feasible variants we find in Table 5 (P0ft-XA) that under homoskedasticity AB1 has a reasonable actual significance level for β, but for γ only when it is small. The same holds for AB2c. Under heteroskedasticity AB2c overrejects, especially for γ or T large, but only mildly so for tests on β. Both AB2a and MAB overreject enormously. Employing the Windmeijer [29] correction mitigates the overrejection probabilities in many cases, but not in all. AB2cW has appropriate size for tests on β, but for tests on γ the size increases both with γ and with T, from 7% to 37% over the grid examined. Since the test based on ABu shows a similar pattern, it is self-evident that a correction which just takes the randomness of AB1 into account cannot be fully effective. Oddly enough, the Windmeijer correction is occasionally more effective for the heavily oversized AB2a than for the less oversized AB2c. Under homoskedasticity both BB2c and BB2cW behave very reasonably, both for tests on β and on γ. Under heteroskedasticity BB2cW is still very reasonable, but all other implementations fail in some instances, especially for tests on γ when γ or T are large. The failure of BB1 under heteroskedasticity is self-evident, see (76).
Regarding the unfeasible J tests, Table 3 shows reasonable size properties under homoskedasticity, especially for the overall test, but less so for the incremental test on effect stationarity when γ is large. Under heteroskedasticity this problem is more serious, though less so for the unfeasible modified procedure. Heteroskedasticity and large γ lead to underrejection of the incremental test, especially when T is large too. Turning now to the many variants of feasible J tests, of which only a selection is presented in Table 6 (P0fJ-XA), we first focus on the JAB tests. Under homoskedasticity the variant based on 1-step residuals behaves reasonably, though when (inappropriately) applied under heteroskedasticity it rejects with high probability (thus detecting heteroskedasticity instead of instrument invalidity, probably due to underestimation of the variance of the still valid moment conditions). Of the variants which are only valid under homoskedasticity, the c variant severely underrejects when there is an abundance of instruments, but less so than the a version. Such severe underrejection under homoskedasticity had already been noted by Bowsher [14]. An almost similar pattern is noted for the variants which are asymptotically valid under any form of heteroskedasticity, while one further variant overrejects severely in some cases and underrejects otherwise. Turning now to the feasible JBB tests, we find that the 1-step based variant underrejects under homoskedasticity and, like its JAB counterpart, rejects with high probability under heteroskedasticity. Both the a and c variants of the JBB test, like those of JAB, have rejection probabilities that are not invariant with respect to γ and T. The c variants seem the least vulnerable, and therefore also yield an almost reasonable incremental JES test, although it underrejects in some cases and overrejects in others. For the remaining variants too the c version has rejection probabilities which vary the least with γ and T, but they are systematically below the nominal significance level, which is also the case for the resulting incremental tests. Oddly enough, the incremental tests resulting from the a variants have type I error probabilities reasonably close to 5%, despite the serious underrejection of both the JAB and JBB tests from which they result.
From Table 7 it can be seen that in the base case P0 estimation of the idiosyncratic disturbance standard deviation (which has true value 1) is pretty accurate for all techniques and all T and γ values, but less so under heteroskedasticity when T is small and γ large. Estimation of the standard deviation of the individual effects is much more problematic. Only when γ is moderate is the estimation bias moderate too. The bias can exceed 100% when γ is large and T is small, and gets even worse under heteroskedasticity. Employing BB mitigates this bias.
When treating the regressor as predetermined (P0-WA, not presented here), although it is strictly exogenous, fewer instruments are being used. Since the instruments that are now abstained from are most probably the strongest ones regarding β, it is no surprise that in the simulation results we note that especially the standard deviation of the β coefficient suffers. Also the rejection probabilities of the various tests differ slightly between implementations WA and XA, but not in a very systematic way, as it seems. When treating the regressor as endogenous (P0-EA) the precision of the estimators gets worse, with again no striking effects on the performance of test procedures under their respective null hypotheses. Upon comparing for P0 the instrument set A (and set C) with the one where A (C) is replaced by C1, it has been found that the in practice quite popular choice C1 often yields slightly less efficient estimates for β, but much less efficient estimates for γ.
When the regressor is again treated as strictly exogenous, but the number of instruments is reduced by collapsing the instruments stemming from both the lagged dependent variable and the regressor itself, then we note from Table 8 (P0fc-XC) a mixed picture regarding the coefficient estimates. Although any substantial bias is always reduced by collapsing, standard errors always increase at the same time, leading either to an increase or a decrease in RMSE. Decreases occur for the AB estimators of γ, especially when γ is large; for β just increases occur. A noteworthy reduction in RMSE does show up for BB2a when γ is large, but then the RMSE of BB2c using all instruments is in fact smaller. However, Table 9 (P0ft-XC) shows that collapsing is certainly found to be very beneficial for the type I error probability of coefficient tests, especially in cases where collapsing yields substantially reduced coefficient bias. The AB tests benefit a lot from collapsing, especially the c variant, leaving only little room for further improvement by employing the Windmeijer correction. After collapsing, AB1 works well under homoskedasticity, and also under heteroskedasticity provided robust standard errors are being used, where the c version is clearly superior to the a version. AB2c has appropriate type I error probabilities, except for testing γ when it is large (which is not repaired by a Windmeijer correction either), and is in most cases superior to AB2aW. After collapsing, BB2a shows overrejection which is not completely repaired by BB2aW in all cases. BB2c and BB2cW generally show lower rejection probabilities, with occasionally some underrejection. Tests based on MAB and MBB still heavily overreject.
Table 10 (P0fJ-XC) shows that by collapsing the JAB and JBB tests suffer much less from underrejection when T is larger than 3. However, both the a and c versions of these tests usually still underreject, mostly by about 1 or 2 percentage points. Good performance is shown by the resulting incremental tests. Table 11 (P0fσ-XC) shows that collapsing reduces the bias in estimates of the standard deviation of the individual effects substantially, although the bias is still huge when γ is large and T small, especially for AB and more so under heteroskedasticity.
When the regressor is still correctly treated as strictly exogenous but for the level instruments just a few lags or first differences are being used (XL0 ... XL3) for both the lagged dependent variable and the regressor, then we find the following. Regarding feasible AB and BB estimation, collapsing (XC) always gives smaller RMSE values than XL0 and XL1 (which is much worse than XL0), but this is not the case for XL2 and XL3. Whereas XC yields smaller bias, XL2 and XL3 often reach smaller Stdv and RMSE. Especially regarding β, XL3 performs better than XL2. Probably due to the smaller bias of XC it is more successful in mitigating size problems of coefficient tests than XL0 through XL3. The effects on the J tests are less clear-cut. Combining collapsing with restricting the lag length, we find that XC2 and XC3 are in some aspects slightly worse but in others occasionally better than XC for P0. We also examined the hybrid instrumentation which seems popular amongst practitioners, where collapsing (C) for one of the variables is combined with L1 for the other (see Table 1). Especially for γ this leads to loss of estimator precision without any other clear advantages, so it does not outperform the XC results for P0. From examining P0-WC (and P0-EC) we find that in comparison to P0-WA (P0-EA) there is often some increase in RMSE, but the size control of especially the t-tests is much better.
Summarizing the results for P0 on feasible estimators and tests, we note that when choosing between different possible instrument sets a trade-off has to be made between estimator precision and test size control. For both, some form of reduction of the instrument set is often, but not always, beneficial. No single method seems superior irrespective of the actual values of β and γ. Using all instruments is not necessarily a bad choice; also XC, XL3 and XC3 often work well. To mitigate estimator bias and foster test size control while not sacrificing too much estimator precision, using collapsing (C) for all regressors seems a reasonable compromise, as far as P0 is concerned. Coefficient and J tests based on the modified estimator using its simple feasible implementation examined here behave so poorly that in the remainder we no longer mention its results.
5.1.2. Results for Alternative Parametrizations
Next we examine a series of alternative parametrizations where each time we just change one of the parameter values of one of the already examined cases. In P1 we substantially increase the relative variance of the individual effects (the relevant ratio goes from 1 to 4). We note that for P1-XA (not tabulated here) all estimators regarding γ are more biased and dispersed than for P0-XA, but there is little or no effect on the β estimates. For both T and γ large this leads to serious overrejection of the unfeasible coefficient tests regarding γ, in particular for ABu. Self-evidently, this carries over to the feasible tests and, although a Windmeijer correction has a mitigating effect, the overrejection often remains serious for both AB and BB based tests. Tests on β based on AB behave reasonably, apart from the non-robustified AB1 and AB2a. For the latter a Windmeijer correction proves reasonably effective. When exploiting the effect stationarity, the BB2c implementation seems preferable. The unfeasible J tests show a similar though slightly more extreme pattern as for P0-XA. Among the feasible tests both serious underrejection and some overrejection occurs. The variant that is invalid under heteroskedasticity is not much worse than the valid tests. As far as the incremental tests are concerned, the JES test behaves remarkably well.
In Table 12, Table 13, Table 14 and Table 15 (P1fj-XC for j = c, t, J, σ) we find that collapsing leads again to reduced bias and slightly deteriorated precision, though improved size control (here all unfeasible tests behave reasonably well). All feasible AB1R and AB2W tests have reasonable size control, apart from tests on γ when T is small and γ large; these give actual significance levels close to 10%. BB2cW seems slightly better than BB2aW. The J tests using 1-step residuals only show some serious overrejection under heteroskedasticity, whereas the 2-step based variants behave quite satisfactorily. The increase of the individual effect variance has an adverse effect on its estimate when using uncollapsed BB for γ small, but collapsing substantially reduces the bias in these estimates. For C3 reasonably similar results are obtained, but those for L3 are generally slightly less attractive.
In P2 we increase one of the effect-correlation parameters of the regressor from 0 to 0.6, so that the regressor is still uncorrelated with one of the effects, though now correlated with the effect which determines any heteroskedasticity. This leads to increased β values. Results for P2-XA show larger absolute values for the standard deviations of the β estimates than for P0-XA, but they are almost similar in relative terms. The patterns in the rejection probabilities under the respective null hypotheses are hardly affected, and P2-XC shows again improved behavior of the test statistics due to reduced estimator bias, whereas the RMSE values have slightly increased. Under P2 the estimates of the standard deviation of the individual effects are more biased than under P0.
In P3 we change another of these correlation parameters from 0 to 0.3, while keeping the remaining settings, hence now realizing dependence between the regressor and the individual effect. Comparing the results for P3-XA with those for P2-XA (which have the same β values) we find that all patterns are pretty similar. Also P3-XC follows the P2-XC picture closely. Under P3 the estimates of the standard deviation of the individual effects are more biased than under P0.
P4 differs from P3 in that now the heteroskedasticity is determined by the other standardized effect too. This has a noteworthy effect on MBB estimation, a minor effect on JBB (and thus on JES) testing, and almost no effect on the remaining estimation results.
P5 differs from P0 just in having nonzero simultaneity, so the regressor is now endogenous with respect to the idiosyncratic disturbances. P5-EA uses all instruments available when correctly taking the endogeneity into account. This leads to very unsatisfactory results. The coefficient estimates of γ have serious negative bias, and those for β positive bias, whereas the standard deviations are slightly larger than for P0-EA, which in turn are substantially larger than for P0-XA. All coefficient tests are very seriously oversized, also after a Windmeijer correction, both for AB and BB. Some of the J test variants show underrejection, whereas their matching counterparts show serious overrejection when T is large, but the feasible 2-step variants are not all that bad. From Table 16, Table 17 and Table 18 (P5fj-EC for j = c, t, J) we see that most results which correctly handle the simultaneity of the regressor are still bad after collapsing, especially for T small (where collapsing can only lead to a minor reduction of the instrument set), although not as bad as those for P5-EA and larger values of T. For P5-EC the rejection probabilities of the corrected coefficient tests are usually in the 10%–20% range, but those of the 2-step J tests are often close to 5%. Under P5 the estimates of both the idiosyncratic and the individual effect standard deviations are much more biased than under P0. Both AB and BB are inconsistent when treating the regressor either as predetermined or as exogenous. For P5-WA and P5-XA the coefficient bias is almost similar but much more serious than for P5-EA. For the inconsistent estimators the bias does not reduce when collapsing the instruments. Because the inconsistent estimators have a much smaller standard deviation than the consistent estimators, practitioners should be warned never to select an estimator simply because of its attractive estimated standard error. The consistency of AB and BB should be tested with the Sargan-Hansen test.
In this study we did not examine the particular incremental test which focuses on the validity of the extra instruments when comparing E with W, or E with X. Here we just examine the rejection probabilities of the overall overidentification J tests for case P5 using all instruments, and compare the rejection frequencies when treating the regressor correctly as endogenous, or incorrectly as either predetermined or exogenous. From Table 19 (P5fJ-jA for j = E, W, X) we find that size control under E can be slightly better for some variants than for others. The detection of inconsistency by these J tests often has a higher probability when the null hypothesis is W than when it is X. The probability generally increases with T and with γ, is often better for the c variant than for the a variant and slightly better for BB implementations than for AB implementations, whereas in general heteroskedasticity mitigates the rejection probability. In the situation where all instruments have been collapsed, where we already established that the J tests do have reasonable size control, we find the following. For some parametrizations the rejection probability of the JAB and JBB tests does not rise very much when the simultaneity parameter moves from 0 to 0.3, whereas for others this rejection probability becomes substantially higher. Hence, only for particular γ and θ parametrizations does the probability to detect the inconsistency seem reasonable, whereas the major consequence of inconsistency, which is serious estimator bias, is relatively invariant regarding γ and θ.
Summarizing our results for effect-stationary models, we note the following. We established that finite sample inaccuracies of the asymptotic techniques seriously aggravate when either the individual effects are relatively prominent or under simultaneity. For both problems it helps to collapse instruments, and the first problem is mitigated and the second problem detected with higher probability by instrumenting according to W rather than X. Neglected simultaneity leads to seemingly accurate but seriously biased coefficient estimators, whereas asymptotically valid inference on simultaneous dynamic relationships is often not very accurate either. Even when the more efficient BB estimator is used with Windmeijer-corrected standard errors, the bias in both γ and β is very substantial and test sizes are seriously distorted. Some further pilot simulations disclosed that N should be very much larger than 200 in order to find much more reasonable asymptotic approximation errors.
5.2. Nonstationarity
Next we examine the effects of a value of ϕ different from unity. We will just consider a setting in which the start-up values are perturbed according to (123) such that their dependence on the effects is initially 50% away from stationarity, so that BB estimation is inconsistent. That this occurred we will indicate in the parametrization code by adding a marker to P. Comparing the results for this nonstationary counterpart of P0-XA with those for P0-XA itself, where ϕ = 1 (effect stationarity), we note from Table 20 a rather moderate positive bias in the BB estimators for both γ and β when both T and γ are small. Despite the inconsistency of BB the bias is very mild for larger T, and especially for larger γ it is much smaller than for consistent AB. The pattern regarding T can be explained, because convergence towards effect stationarity does occur when time proceeds. Since this convergence is faster for smaller γ, the good results for large γ seem due to the great strength of the first-differenced lagged instruments regarding the level equation. Note that the RMSE of inconsistent BB1, BB2a and BB2c is always smaller than that of consistent AB1, AB2a and AB2c, except when T and γ are both small. With respect to the AB estimators we find little to no difference compared to the results under stationarity.
Table 21 shows that when γ is large the BB2cW coefficient test on γ yields very mild overrejection, while AB2aW and AB2cW seriously overreject; for smaller values of γ it is the other way around. After collapsing (not tabulated here) similar but more moderate patterns are found, due to the mitigated bias which again goes together with slightly increased standard errors. Hence, for this case we find that one should perhaps not worry too much when applying BB even if effect stationarity does not strictly hold for the initial observations. As it happens, we note from Table 22 that the rejection probabilities of the JES tests are such that they are relatively low when BB inference is more precise than AB inference, and relatively high when either T or γ are low. This pattern is much more pronounced for the JES tests than for the JBB tests. However, it is also the case here that collapsing mitigates this welcome quality of the JES tests to warn against unfavorable consequences of effect nonstationarity on BB inference.
From the nonstationary counterpart of P1-XA, in which the individual effects are much more prominent, we find that effect nonstationarity has curious effects on AB and BB results. For effect stationarity we already noted more bias for AB than under P0; for γ large, this bias is even more serious under nonstationarity, despite the consistency of AB. For BB estimation the reduction of ϕ leads to much larger bias and much smaller Stdv, with the effect that RMSE values for inconsistent BB are usually much worse than for AB, though often slightly better (except for BB2c) in particular cases. All BB coefficient tests for γ have size close or equal to 1 in this design, and the AB tests for γ overreject very seriously as well. Under the collapsed variant the bias of AB is reasonable except for large γ. The bias of BB has decreased but is still enormous, although its RMSE remains preferable in some cases. Especially regarding tests on γ, BB fails. For both the a and c versions the JES test has a high rejection probability to detect the nonstationarity, except when γ is large. The relatively low rejection probability of the JES tests obtained after collapsing in that case again indicates that despite its inconsistency BB has similar or smaller RMSE than AB for that specific case.
Next we consider the simultaneous model again. In the nonstationary counterpart of case P5-EA, estimator AB is consistent and BB again inconsistent. Nevertheless, for all γ and T values examined in Table 23, AB has a more severe bias than BB, whereas BB has smaller Stdv values at the same time and thus has smaller RMSE for all γ and T values examined. The size control of coefficient tests is worse for AB, but for BB it is appalling too, where BB2aW, with estimated type I error probabilities ranging from 5% to 70%, is often preferable to BB2cW. The 2-step JAB tests behave reasonably, whereas the JBB tests reject with probabilities in the 3%–38% range, and JES in the 3%–69% range. By collapsing, the RMSE of AB generally reduces, and in most of these cases BB again has smaller RMSE than AB. The rejection rates of the JBB and JES tests are substantially lower now, which seems bad because the invalid (first-differenced) instruments are less often detected, but this may nevertheless be appreciated because it induces a preference for the less inaccurate BB inference over AB inference. After collapsing, the size distortions of BB2aW and BB2cW are less extreme too, now ranging from 5% to 33%, but the RMSE values for BB may suffer due to collapsing, especially when γ and T are small. The RMSE values for BB under the W and X treatments are usually much worse than those for AB under E. Hence, although the invalid instruments for the level equation are not necessarily a curse when the endogeneity of the regressor is respected, they should not be used when they are invalid for two reasons (both simultaneity and effect nonstationarity). That neither AB nor BB should be used in P5 under W and X will be indicated with highest probability under WC, and even then this probability is larger than 0.8 in one of the heteroskedasticity settings only when T is high, and in the other only when both T and γ are high.
Summarizing our findings regarding effect nonstationarity, we have established that although it renders BB estimators inconsistent, especially when T is not small BB inference nevertheless often beats consistent AB, provided possible endogeneity of the regressor is respected. The JES test seems to have the remarkable property of being able to guide towards the technique with the smallest RMSE instead of the technique exploiting the valid instruments. For further details we refer to the full set of Monte Carlo results.
6. Empirical Results
The above findings will now be employed in a re-analysis of the data and some of the techniques studied in Ziliak [35]. The main purpose of that article was to expose the downward bias in GMM as the number of moment conditions expands. This is done by estimating a static life-cycle labor-supply model for a ten year balanced panel of males, and comparing for various implementations of 2SLS and GMM the coefficient estimates and their estimated standard errors when exploiting expanding sets of instruments. We find this approach rather naive for various reasons: (a) the difference between empirical coefficient estimates will at best provide a very poor proxy for any underlying difference in bias; (b) standard asymptotic variance estimates of IV estimators are known to be very poor representations of true estimator uncertainty; (c) the whole analysis is based on just one sample and possibly the model is seriously misspecified. The latter issue also undermines conclusions drawn in Ziliak [35] on overrejection by the J test, because it is of course unknown in which, if any, of his empirical models the null hypothesis is true. To avoid such criticism we designed the controlled experiments in the two foregoing sections on the effects of different sets of instruments on various relevant inference techniques. Now we will examine how these simulation results can be exploited to underpin actual inference from the data set used by Ziliak.
This data set originates from waves XII–XXI and the years 1979–1988 of the PSID. The subjects are continuously married working men aged 22–51 in 1979. Ziliak [35] employs a static model in which observed annual hours of work are explained by the hourly real wage rate, a vector of four characteristics (kids, disabled, age, age-squared), an individual effect and an idiosyncratic error term. He assumes that the wage rate may be an endogenous regressor and that all variables included in the characteristics vector are predetermined. The parameter of interest is β, the wage coefficient, and in the various static models examined its GMM estimates range from approximately 0.07 to 0.52, depending on the number of instruments employed.
After some experimentation we inferred that lagged reactions play a significant role in this relationship and that in fact a general second-order linear dynamic specification is required in order to pass the diagnostic tests which are provided by default in the Stata package xtabond2 (StataCorp LLC), see Roodman [43]. This model, which also allows for time effects, is given by Equation (125); we did not include lags of age and its square. Contrary to Ziliak, we will not treat the age variable as predetermined, since due to its very nature (no feedbacks from hours worked to age) it must be strictly exogenous. On the other hand, lagged or even immediate feedbacks from labor supply to the kids and disabled variables seem well possible.
In the sequence of the various model specifications and instrument set compositions embarked on below, we adopted the following methodological strategy. We start with a rather general initial dynamic model specification employing a relatively uncontroversial set of instruments, hence avoiding as much as possible the imposition of doubtful exclusion restrictions on (lagged) regressor variables as well as the exploitation of yet unconfirmed orthogonality conditions. This initial model is estimated by 1-step AB with heteroskedasticity robust standard errors, neglecting any coefficient t-tests until serial correlation tests and heteroskedasticity robust J tests show favorable results. As long as the latter is not the case, the model should be re-specified by adapting the functional form and/or including additional explanatory variables, either new ones or transformations of already included ones, such as longer lags or interactions. When favorable serial correlation and robust J tests have been obtained, and when these are reconfirmed (especially in case evidence has been found indicating the presence of heteroskedasticity) by favorable autocorrelation and J tests after 2-step AB estimation, hopefully initial consistent estimates have been accomplished. Then, in the next stages, the two further aims are: attaining increased efficiency and mitigating finite sample bias. These are pursued first by sequentially testing additional orthogonality conditions: initially by testing whether variables treated as endogenous seem actually predetermined, and next by verifying whether predetermined variables seem in fact exogenous, possibly followed by testing the orthogonality conditions implied by effect stationarity. In this process the tested extra instruments are added to the already adopted set of instruments, provided the incremental J tests are convincingly insignificant. Next, one could test coefficient restrictions (on the basis of robust 1-step AB standard errors in case of suspected heteroskedasticity, or using Windmeijer-corrected 2-step AB standard errors) and impose these restrictions when convincingly insignificant from both a statistical and an economic point of view. During the whole process the effects on the various estimates and test statistics of collapsing the instrument set and/or removing instruments with long lags could be monitored, and could possibly induce not exploiting particular probably valid orthogonality conditions represented by apparently weak instruments.
For the present data set, the inclusion of second-order lags in the initial specification determines the number of observations available for the first-differenced model (125), and thereby the available degrees of freedom and the number of instruments. Although no generally accepted rules of thumb exist yet on the required number of degrees of freedom and on the appropriate degree of overidentification for GMM to work well in the analysis of micro panel data sets, we chose to respect particular lower bounds on both at every stage of the specification search, but also examined some cases in which these were violated.
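To indicate how quickly the degree of overidentification builds up, recall the generic textbook count of untransformed ("uncollapsed") internal level instruments for the equations in first differences; the expressions below are standard and are not taken from the derivations elsewhere in this paper. If the first-differenced equation is available for periods $t=t_0,\dots,T$, then for period $t$ a regressor $x$ treated as endogenous supplies the instruments $x_{i1},\dots,x_{i,t-2}$, a predetermined regressor supplies $x_{i1},\dots,x_{i,t-1}$, and a strictly exogenous regressor instrumented just by itself supplies one instrument per period, giving the respective totals
$$
\sum_{t=t_0}^{T}(t-2),\qquad \sum_{t=t_0}^{T}(t-1),\qquad T-t_0+1 .
$$
The first two counts grow quadratically in $T$; collapsing replaces the period-specific columns by a single column per lag distance, which makes the count linear in $T$.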
Table 24 presents some estimation and test results for model (125) obtained by employing different estimation methods and instrument sets. All results have been obtained by Stata/SE 14.0 with package xtabond2 (StataCorp LLC), abstaining from any finite sample corrections, and supplemented with our own code for calculating additional J test variants. In column (1) 1-step Arellano-Bond GMM estimates are presented (omitting the results for the included time-effects) with heteroskedasticity robust standard errors (indicated by AB1R), using all level instruments that are valid when (with respect to the idiosyncratic disturbances) the lagged dependent variable is predetermined, the regressors wage, kids and disabled could be endogenous, and age is exogenous (indicated by 1P3E1X). Both age and its square, like the seven time-dummies, are instrumented just by themselves. For the AR and J tests given in the bottom lines of the table, p-values are presented. In column (1) the (first-differenced) residuals do exhibit 1st order serial correlation (as they should), but no significant 2nd order problems emerge. We supplemented the Sargan and Hansen statistics as presented by xtabond2 (StataCorp LLC, see our footnote 3) with further J variants. The p-value of 0.000 found for one of these variants should be neglected, because we found convincing evidence of heteroskedasticity from an auxiliary LSDV regression (not presented) of the squared level residuals for the findings in column (1) on all regressors of model (125), except the current endogenous ones. Xtabond2 now suggests that we judge the adequacy of model and instruments on the basis of its reported Hansen test, hence on a hybrid test statistic involving both 1-step and 2-step residuals. Its p-value is high and thus seems to approve the validity of the instruments; however, in some of our simulations this variant underrejects. The purely 1-step based Sargan-type test is only valid under conditional homoskedasticity, so we may neglect its low p-value in this and all other columns.
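The auxiliary heteroskedasticity check just mentioned can be sketched as follows, again with the hypothetical variable names used above and assuming that a variable ehat holding the level residuals implied by the column (1) coefficient estimates has already been constructed:

```stata
* Sketch only: LSDV (within) regression of squared level residuals on the
* model (125) regressors, excluding the current values of the variables
* treated as endogenous; panel assumed to be xtset.
generate ehat2 = ehat^2
xtreg ehat2 L(1/2).lnhours L(1/2).lnwage L(1/2).kids L(1/2).disab ///
    age agesq yr*, fe
* Joint significance of the regressors signals heteroskedasticity.
testparm L(1/2).lnhours L(1/2).lnwage L(1/2).kids L(1/2).disab age agesq
```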
Because many regressors in column (1) have very low absolute t-values, this may undermine the finite sample performance of the tests. Therefore, in column (2) we examine removal of the time-effects from the regression, which in column (1) have absolute t-values between 0.34 and 1.15. In column (3) we remove the time-dummies from the set of instruments too; this has little effect. Because the exogeneity of the time-effects is self-evident, we decide to keep them in the instrument set, though exclude them from the regressors. Since we did not manage to obtain more satisfactory results regarding the J tests by relaxing implicit restrictions (including interactions, generalizing the functional form), we adopt with some hesitance the specification and classification of the variables of column (2) as an acceptable starting point. In the table all coefficient estimates with a t-ratio above 2 are marked by a double asterisk, and by a single asterisk when it lies between 1 and 2 (estimated standard errors are given between parentheses). The modest estimates of the lagged dependent variable coefficients suggest that the relatively unfavorable simulation results for case P1 do not seem to apply here. Column (4) presents the Windmeijer corrected 2-step AB estimates. For many coefficients these suggest an improvement in estimator efficiency, although from the simulations we learned that we should not overrate the qualities of 2-step estimation. Also note that some of the coefficient estimates deviate from their 1-step counterparts, which might be due to vulnerability to finite sample bias. This can also be seen from the bottom row of the table, which presents the estimate of the long-run wage elasticity of hours worked. This total multiplier is given by the sum of the current and lagged wage coefficients divided by one minus the sum of the two lagged dependent variable coefficients. Column (4) suggests a lower elasticity than column (2). Many of the static models estimated by Ziliak suggest even lower values for this elasticity (and force equality of the immediate and long-run elasticity, which is sharply rejected by all our models).
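The total multiplier follows in the usual way from the second-order dynamic structure; the derivation below uses the illustrative notation of the sketch given earlier, which may differ from that of (125). In a stationary equilibrium with $h_{it}=h^{*}$, $w_{it}=w^{*}$ and the remaining regressors held fixed,
$$
h^{*}=(\gamma_1+\gamma_2)h^{*}+(\beta_0+\beta_1+\beta_2)w^{*}+\cdots
\quad\Longrightarrow\quad
\frac{\partial h^{*}}{\partial w^{*}}=\frac{\beta_0+\beta_1+\beta_2}{1-\gamma_1-\gamma_2}.
$$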
Before we proceed, we want to report that when estimating model (125) by AB1R without the second-order lags, the p-values of the AR(1) and AR(2) tests are 0.000 and 0.754 respectively, whereas that of the J test is not small either. Hence, despite the significance of several of the coefficients of twice lagged variables in columns (1) through (3), these tests do not detect the apparent dynamic underspecification; hence, they lack power.
Although quite a few slope coefficients in columns (1) through (3) have t-ratios with small absolute values, similar to the time-effects, we prefer not to proceed at this stage by imposing further coefficient restrictions on the model. Instead, we shall try to decrease the estimated standard errors and mitigate finite sample bias by examining whether the three regressors which we treated as endogenous could actually be classified such that additional and stronger instruments might be used. However, before we do that, just for illustrative purposes, we present again AB1 and AB2 results for the model specification and instrument set as used in column (2), but now without robustification of AB1 in column (5) and without the Windmeijer correction of AB2 in column (6). For most coefficients column (5) suggests smaller standard errors than column (2), but given the detected heteroskedasticity we know that these are misleading, inconsistent standard error estimates. Column (6) shows that not using the Windmeijer correction would incorrectly suggest that AB2 is substantially more efficient than (robust) AB1, which often it is not, as we already learned from our simulations. Note that the value of the serial correlation tests does not depend just on the (unaffected) residuals, but also on the (affected) coefficient standard errors. Therefore, we interpret the rejection by AR(2) in column (5) as due to size problems.
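In xtabond2 these four variants differ only in the estimation and reporting options; a schematic illustration, with the hypothetical regressor list and instrument options collected in local macros, could look as follows:

```stata
* Sketch only: `xlist' and `ivopts' stand for the (hypothetical) regressor list
* and instrument options used for column (2); difference GMM via noleveleq.
local xlist  "L(1/2).lnhours L(0/2).lnwage L(0/2).kids L(0/2).disab age agesq"
local ivopts "gmm(lnhours, lag(2 .)) gmm(lnwage kids disab, lag(2 .)) iv(age agesq yr*) noleveleq"

xtabond2 lnhours `xlist', `ivopts' robust            // AB1R: 1-step, robust SEs
xtabond2 lnhours `xlist', `ivopts' twostep robust    // AB2W: 2-step, Windmeijer-corrected SEs
xtabond2 lnhours `xlist', `ivopts'                   // AB1 : 1-step, classical SEs
xtabond2 lnhours `xlist', `ivopts' twostep           // AB2 : 2-step, uncorrected SEs
```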
Next, a series of incremental tests (not presented in the table) has been performed to establish the actual classification of the three regressors treated as endogenous thus far. Testing against 1P4X (which implies 42 extra instruments) yields a p-value below 0.005, so we had better proceed step by step to assess whether some of these 42 instruments are nevertheless valid. Testing the validity of the 7 extra instruments that become available when the first of these variables is treated as predetermined yields a p-value of 0.029, so this variable seems truly endogenous. Doing the same for the second variable gives 0.520; next, testing whether the 7 extra instruments involving its current values seem valid too yields a p-value of 0.398, and testing the 14 extra instruments jointly against column (2) yields a p-value of 0.490. Accepting exogeneity of this variable and maintaining endogeneity of the first, we now focus on the classification of the third. Testing the extra 7 instruments obtained when treating it as predetermined yields a p-value of 0.330, and testing jointly the 21 instruments additional to column (2) gives a p-value of 0.429. We decide to adopt the classification implied by these outcomes, in which, self-evidently, the lagged dependent variable remains predetermined (all with respect to the idiosyncratic disturbances). The corresponding AB1R and AB2W estimates can be found in columns (7) and (8). Note that the extra instruments are especially beneficial for the standard errors of the coefficients. Again the estimated long-run elasticity is larger for 1-step than for 2-step estimation.
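Such incremental (difference-in-Hansen) tests can be computed by comparing the Hansen statistics of nested instrument sets. The sketch below assumes that xtabond2 stores the Hansen statistic and its degrees of freedom in e(hansen) and e(hansendf) (verify the exact names via ereturn list in the installed version), and that the locals restricted_ivopts and extended_ivopts hold the two instrument specifications.

```stata
* Sketch only: incremental Sargan-Hansen test of extra orthogonality conditions.
* e(hansen) and e(hansendf) are assumed stored results; check ereturn list.
quietly xtabond2 lnhours `xlist', `restricted_ivopts' twostep   // maintained set
scalar J0  = e(hansen)
scalar df0 = e(hansendf)
quietly xtabond2 lnhours `xlist', `extended_ivopts' twostep     // plus instruments under test
scalar J1  = e(hansen)
scalar df1 = e(hansendf)
display "incremental J = " %6.3f J1 - J0 ///
        "   p-value = "    %6.4f chi2tail(df1 - df0, J1 - J0)
```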
In columns (9) and (10) we examine the effects on the results of column (7) of reducing the number of instruments: in column (9) by collapsing, and in column (10) by discarding instruments lagged more than two periods. This leads to disturbing results. If the instruments used in column (7) are valid, then those used in columns (9) and (10) cannot be invalid. Nevertheless, the p-value of the J test drops substantially. That the estimated coefficient standard errors increase in columns (9) and (10) is understandable, but the substantial shifts in the coefficient estimates are seriously worrying. The negative estimate found after collapsing seems not very realistic. The main question now is whether this is just caused by finite sample bias, or by inconsistency. In the latter case the results of all other columns must be inconsistent too.
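Both instrument reductions are available as suboptions of gmm() in xtabond2. Schematically, with the hypothetical names used above and with `other_gmm_collapsed' and `other_gmm_laglimited' standing for the correspondingly modified gmm() blocks of the remaining non-exogenous regressors:

```stata
* Sketch only: two ways of reducing the instrument count relative to column (7).
* (a) Column (9)-style: collapse the instrument matrix (one column per lag distance).
xtabond2 lnhours `xlist', gmm(lnhours, lag(2 .) collapse) `other_gmm_collapsed' ///
    iv(age agesq yr*) noleveleq robust
* (b) Column (10)-style: retain only the two shortest valid lags as instruments.
xtabond2 lnhours `xlist', gmm(lnhours, lag(2 3)) `other_gmm_laglimited' ///
    iv(age agesq yr*) noleveleq robust
```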
Finally, we examine 2-step Blundell-Bond system estimation with Windmeijer correction. Testing the validity of the 34 instruments used in column (11) additional to those used in column (8) yields a p-value for the incremental J test of 0.016, whereas the Hayashi-type version (see our footnote 3) calculated by xtabond2 (StataCorp LLC) gives a p-value of 0.136. So effect stationarity seems doubtful, although the five γ and β coefficients all seem highly significant now (with all further coefficients insignificant). Their estimates deviate strongly from those of columns (1) through (8). Even more distorted BB2 results are obtained after collapsing. We find it hard to believe that this is all due to increased efficiency and reduced finite sample bias, so we simply reject effect stationarity and tend to accept the results of columns (7) and (8). Or should we declare all results in Table 24 uninterpretable, simply because no model from the class examined here matches the Ziliak data? It is hard to answer this question, simply because we learned from the simulations how vulnerable all the employed tools are, even in cases where the adopted model specification fully corresponds with the underlying DGP.
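In terms of xtabond2, moving from the AB to the BB (system) estimator amounts to dropping noleveleq, so that each gmm() block also generates lagged-difference instruments for the untransformed (level) equation; a schematic Windmeijer-corrected 2-step call, with hypothetical names and with `other_gmm_blocks' standing for the remaining gmm() specifications, is:

```stata
* Sketch only: Blundell-Bond system GMM, 2-step with Windmeijer-corrected SEs.
* Omitting noleveleq adds, for every gmm() block, lagged first differences as
* instruments for the level equation; their validity requires effect stationarity.
xtabond2 lnhours `xlist', gmm(lnhours, lag(2 .)) `other_gmm_blocks' ///
    iv(age agesq yr*) twostep robust
```

The difference-in-Hansen statistics that xtabond2 reports for the level-equation instrument groups then correspond to incremental tests of these effect stationarity conditions.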
Hopefully the small sample bias is such that a proper interpretation of the coefficients of column (7) is possible. Then we note that, although not statistically significant, a positive change in either kids or disabled tends to lead to an immediate drop in hours supplied, although this drop is mitigated for a substantial part after a few periods. Also, as an individual gets older there is a tendency (again insignificant) to work fewer hours. The wage elasticity is positive, with a larger value than was inferred in earlier (static) studies. However, given what we learned from the simulations, we should restrain ourselves when drawing far-reaching conclusions from the estimation and test results given in Table 24, simply because we established that for the currently available techniques for the analysis of dynamic panel data models the bias of coefficient estimates can be substantial, the actual size of tests may deviate considerably from the nominal levels, and their actual power seems modest.
7. Major Findings
In the social sciences the quantitative analysis of many highly relevant problems requires structural dynamic panel data methods. These allow for the fact that the observed data have at best a quasi-experimental nature, whereas the causal structure and the dynamic interactions in the presence of unobserved heterogeneity have yet to be unraveled. When the cross-section dimension of the sample is not very small, employing GMM techniques seems most appropriate in such circumstances. This is also practical, since corresponding software packages are widely available. However, not much is known yet about the actual accuracy, in practical situations, of the abundance of different and not always asymptotically equivalent implementations of estimators and test procedures. This study aims to demarcate the areas in the parameter space where the asymptotic approximations to the properties of the relevant inference techniques have either been shown to be reliable beacons or often turn out to be misleading marsh fires.
In this context we provide a rather rigorous treatment of many major variants of GMM implementations, as well as of the inference techniques for testing the validity of particular orthogonality assumptions and of restrictions on individual coefficient values. Special attention is given to the consequences of the joint presence in the model of time-constant and individual-invariant unobserved effects, of covariates that may be strictly exogenous, predetermined or endogenous, and of disturbances that may show particular forms of heteroskedasticity. Also the implications of the initial conditions of separate regressors with respect to individual effect stationarity are analyzed in great detail, and various popular options that aim to mitigate bias by reducing the number of exploited internal instruments are elucidated. In addition, as alternatives to those used in current standard software, less robust weighting matrices and additional variants of Sargan-Hansen test implementations are considered, as well as the effects of particular modifications of the instruments under heteroskedasticity.
Next, a simulation study is designed in which all the above variants and details are parametrized and categorized. This leads to a data generating process involving 10 parameters, for which, under 6 different settings regarding sample size and initial conditions, 60 different grid points are examined. For each setting, and for various of these grid points, 13 different choices regarding the set of instruments have been used to examine 12 different implementations of GMM coefficient estimators, giving rise to 24 different implementations of t-tests and 27 different implementations of Sargan-Hansen tests. From all this only a pragmatically selected subset of results is actually presented in this paper.
The major conclusion from the simulations is that, even when the cross-section sample size is several hundreds, the quality of this type of inference depends heavily on a great number of aspects of which many are usually beyond the control of the investigator, such as: magnitude of the time-dimension sample size, speed of dynamic adjustment, presence of any endogenous regressors, type and severity of heteroskedasticity, relative prominence of the individual effects and (non)stationarity of the effect impact on any of the explanatory variables. The quality of inference also depends seriously on choices made by the investigator, such as: type and severity of any reductions applied regarding the set of instruments, choice between (robust) 1-step or (corrected) 2-step estimation, employing a modified GMM estimator, the chosen degree of robustness of the adopted weighting matrix, the employed variant of coefficient tests and of (incremental) Sargan-Hansen tests in deciding on the endogeneity of regressors, the validity of instruments and on the (dynamic) specification of the relationship in general.
Our findings regarding the alternative approaches of modifying instruments and exploiting different weighting matrices are as follows for the examined case of cross-sectional heteroskedasticity. Although the infeasible form of modification does yield very substantial reductions in both bias and variance, for the straightforward feasible implementation examined here the potential efficiency gains do not materialize. The robust weighting matrix, which also allows for possible time-series heteroskedasticity, often performs as well as (and sometimes even better than) a specially designed less robust version, although the latter occasionally demonstrates some benefits for incremental Sargan-Hansen tests.
Furthermore, we can report to practitioners: (a) when the effect-noise-ratio is large, the performance of all GMM inference deteriorates; (b) the same occurs in the presence of a genuinely endogenous regressor (or one superfluously treated as endogenous); (c) in many settings the coefficient restriction tests show serious size problems, which usually can be mitigated by a Windmeijer correction, although for γ large or under simultaneity serious overrejection remains unless N is very much larger than 200; (d) the limited effectiveness of the Windmeijer correction is due to the fact that the positive or negative bias in coefficient estimates is often more serious than the negative bias in the variance estimate; (e) limiting to some degree the number of instruments usually reduces bias and therefore improves the size properties of coefficient tests, though at the potential cost of power loss, because efficiency usually suffers; (f) for the case of an autoregressive strictly exogenous regressor we noted that it is better not to instrument it just by itself, but also by some of its lags, because this improves inference, especially regarding the lagged dependent variable coefficient; (g) to mitigate size problems of the overall Sargan-Hansen overidentification tests, the set of instruments should be reduced, possibly by collapsing; under conditional heteroskedasticity one should employ the quadratic form in 2-step residuals, possibly in combination with a weighting matrix based on 1-step residuals, although occasionally the 2-step weighting matrix seems preferable; (h) collapsing also reduces size problems of the incremental Sargan-Hansen effect stationarity test; (i) except under simultaneity, the GMM estimator which exploits instruments that are invalid under effect nonstationarity (BB) may nevertheless perform better than the estimator abstaining from these instruments (AB); (j) the rejection probability of the incremental Sargan-Hansen test for effect stationarity is such that it tends to direct the researcher towards applying the most accurate estimator, even if this is inconsistent; (k) some of the parameter estimates are usually pretty accurate, which is certainly not always the case for others, although quality improves for larger N, is better for BB than for AB, and usually benefits from collapsing.
When re-analyzing a popular empirical data set in the light of the above simulation findings, we note in particular that actual dynamic feedbacks may be much more subtle than those that can be captured by just including a lagged dependent variable as regressor, which at present seems the most common approach to modeling dynamics in panels. In theory the omission of further lagged regressor variables should result in rejections by Sargan-Hansen test statistics, but their power suffers when many valid and some invalid orthogonality conditions are tested jointly, instead of by deliberately chosen sequences of incremental tests or by direct variable addition tests. Hopefully tests for serial correlation, which we intentionally left out of this already overloaded simulation study, provide extra help in guiding practitioners towards well-specified models. Our results demonstrate that, especially under particular unfavorable settings, there is a great need for developing more refined inference procedures for structural dynamic panel data models.