For many of us, the essential distinction between econometric regression analysis and otherwise-similar forms of regression analysis conducted outside of economics is the overarching concern shown in econometrics with regard to the model assumptions made in order to obtain consistent ordinary least squares (OLS) parameter estimation and asymptotically valid statistical inference. As Friedman (1953) famously noted (in the context of economic theorizing), it is both necessary and appropriate to make model assumptions—notably, even assumptions which we know to be false—in any successful economic modeling effort: the usefulness of a model, he asserted, inheres in the richness/quality of its predictions rather than in the accuracy of its assumptions. Our contribution here—and in Ashley (2009) and Ashley and Parmeter (2015), which address similar issues in the context of IV and GMM/2SLS inference using possibly-flawed instruments—is to both posit and operationalize a general proposition that is a natural corollary to Friedman's assertion: it is perfectly acceptable to make possibly-false (and even very-likely-false) assumptions—if and only if one can and does show that the model results one cares most about are insensitive to the levels of violations in these assumptions that it is reasonable to expect.
In particular, the present paper proposes a sensitivity analysis for OLS estimation/inference in the presence of unmodeled endogeneity in the explanatory variables of the usual linear multiple regression model. This context provides an ideal setting in which to both exhibit and operationalize a quantification of the “insensitivity” alluded to in the proposition above, because this setting is so very simple. This setting is also attractive in that OLS estimation of multiple regression models with explanatory variables of suspect exogeneity is common in applied economic work. The extension of this kind of sensitivity analysis to more complex—e.g., nonlinear—estimation settings is feasible, but will be laid out in separate work, as it requires a different computational framework, and does not admit the closed-form results obtained (for some cases) here.
The diagnostic analysis proposed here quantifies the sensitivity of OLS hypothesis test rejection p-values (for multiple and/or nonlinear parameter restrictions)—and also of single-parameter confidence intervals—with respect to possible endogeneity (i.e., non-zero correlations between the explanatory variables and the model errors) in the usual k-variate multiple regression model.
This sensitivity analysis rests on a derivation of the sampling distribution of β̂_OLS, the OLS parameter estimator, extended to the case where some or all of the explanatory variables are endogenous to a specified degree. In exchange for restricting attention to possible endogeneity which is solely linear in nature—the most typical case—the derivation of this sampling distribution proceeds in a particularly straightforward way. And under this "linear endogeneity" restriction, no additional model assumptions—e.g., with respect to the third and fourth moments of the joint distribution of the explanatory variables and the model errors—are necessary, beyond the usual assumptions made for any model with stochastic regressors. The resulting analysis quantifies the sensitivity of hypothesis test rejection p-values (and/or estimated confidence intervals) to such linear endogeneity, enabling the analyst to make an informed judgment as to whether any selected inference is "robust" versus "fragile" with respect to likely amounts of endogeneity in the explanatory variables.
We show below that, in the context of the linear multiple regression model, this additional sensitivity analysis is so straightforward an addendum to the usual OLS diagnostic checking which applied economists are already doing—both theoretically, and in terms of the effort required to set up and use our proposed procedure—that analysts can routinely use it. In this regard we see our sensitivity analysis as a general diagnostic “screen” for possibly-important, unaddressed endogeneity issues.
1.3. Preview of the Rest of the Paper
In Section 2 we provide a new derivation of the asymptotic sampling distribution of the OLS structural parameter estimator (β̂_OLS) for the usual k-variate multiple regression model, where some or all of the explanatory variables are endogenous to a specified degree. This degree of endogeneity is quantified by a given set of covariances between these explanatory variables and the model error term (ε); these covariances are collected in a vector denoted λ. The derivation of the sampling distribution of β̂_OLS is greatly simplified by restricting the form of this endogeneity to be solely linear, as follows:
We first define a new regression model error term (denoted ν), from which all of the linear dependence on the explanatory variables has been stripped out. Thus—under this restriction of the endogeneity to be purely linear in form—this new error term must (by construction) be completely unrelated to the explanatory variables. Hence, with solely-linear endogeneity, ν must be statistically independent of the explanatory variables. This allows us to easily construct a modified regression equation in which the explanatory variables are independent of the model errors. This modified regression model now satisfies the assumptions of the usual multiple regression model (with stochastic regressors that are independent of the model errors), for which the OLS parameter estimator asymptotic sampling distribution is well known.
Thus, under the linear-endogeneity restriction, OLS estimation of this modified regression model yields unbiased estimation and asymptotically valid inferences on the model's parameter vector; but these OLS estimates are now, as one might expect, unbiased for a coefficient vector which differs from β by an amount depending explicitly on the posited endogeneity covariance vector, λ. Notably, however, this sampling distribution derivation requires no additional model assumptions beyond the usual ones made for a model with stochastic regressors: in particular, one need not specify any third or fourth moments for the joint distribution of the model errors and explanatory variables, as would be the case if the endogeneity were not restricted to be solely linear in form.
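As an illustrative sketch (ours, not the paper's code), the simulation below generates data satisfying the solely-linear endogeneity restriction—the structural error is the sum of a linear function of the regressor and an independent stripped error ν—and confirms numerically that the OLS slope then estimates the structural coefficient plus the endogeneity covariance divided by the regressor variance. All names (`beta`, `lam`, `nu`) are our notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural model: y = beta*x + eps, with cov(x, eps) = lam (linear endogeneity).
beta, lam = 1.0, 0.3
sigma_x = 1.0

# Build eps = (lam / sigma_x**2) * x + nu, with nu independent of x:
# exactly the "solely linear" endogeneity restriction described above.
x = rng.normal(0.0, sigma_x, n)
nu = rng.normal(0.0, 1.0, n)
eps = (lam / sigma_x**2) * x + nu
y = beta * x + eps

b_ols = (x @ y) / (x @ x)

# Under linear endogeneity, OLS converges to beta + lam / var(x) = 1.3 here.
print(b_ols)
```

This is the shift that the modified-regression derivation makes explicit: OLS is consistent for a coefficient displaced from β by an amount determined by λ.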
In Section 3 we show how this sampling distribution for β̂_OLS can be used to assess the robustness/fragility of an OLS regression model inference with respect to possible endogeneity in the explanatory variables; this Section splits the discussion into several parts. The first parts of this discussion—Section 3.1 through Section 3.5—describe the sensitivity analysis with respect to how much endogeneity is required in order to "overturn" an inference result with respect to the testing of a particular null hypothesis regarding the structural parameter, β, where this null hypothesis has been rejected at some nominal significance level—e.g., 0.05—under the assumption that all of the explanatory variables are exogenous. Section 3.6 closes our sensitivity analysis algorithm specification by describing how the sampling distribution for β̂_OLS derived in Section 2 can be used to display the sensitivity of a confidence interval for a particular component of β to possible endogeneity in the explanatory variables; this is the "parameter-value-centric" sensitivity analysis outlined at the end of Section 1.2. So as to provide a clear "road-map" for this Section, each of these six subsections is briefly described next.
Section 3.1 shows how a specific value for the posited endogeneity covariance vector (λ)—combined with the sampling distribution of β̂_OLS from Section 2—can be used both to recompute the rejection p-value for the specified null hypothesis with regard to β and to convert this endogeneity covariance vector (λ) into the corresponding endogeneity correlation vector, ρ, which is more interpretable than λ. This conversion into a correlation vector is possible (for any given value of λ) because the β̂_OLS sampling distribution yields a consistent estimator of β, making the error term ε in the original (structural) model asymptotically available; this allows the necessary variance of ε to be consistently estimated.
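These two computations can be sketched, for a single possibly-endogenous regressor, as follows. The function and parameter names here are ours (illustrative, not the paper's implementation), and the variance bookkeeping assumes the solely-linear setting, in which the OLS residuals estimate the stripped error, so the structural-error variance equals the residual variance plus the squared covariance over the regressor variance.

```python
import math

def norm_sf(z):
    # Upper-tail standard normal probability via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def adjusted_inference(b_ols, var_x, s2_resid, n, lam, beta0=0.0):
    """Sketch: recompute the rejection p-value for H0: beta = beta0 under a
    posited endogeneity covariance lam, and convert lam to a correlation."""
    b_adj = b_ols - lam / var_x                 # endogeneity-adjusted estimate
    se = math.sqrt(s2_resid / (n * var_x))      # asymptotic standard error
    t = (b_adj - beta0) / se
    p_value = 2.0 * norm_sf(abs(t))
    # Structural-error variance under solely-linear endogeneity:
    s2_eps = s2_resid + lam**2 / var_x
    rho = lam / math.sqrt(var_x * s2_eps)       # implied endogeneity correlation
    return b_adj, p_value, rho

# Example: a rejection (t about 3) that a posited covariance of 0.2 overturns.
print(adjusted_inference(b_ols=0.30, var_x=1.0, s2_resid=1.0, n=100, lam=0.2))
```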
Section 3.2 operationalizes these two results from Section 3.1 into an algorithm for a sensitivity analysis with regard to the impact (on the rejection of this specified null hypothesis) of possible endogeneity in a specified subset of the k regression model explanatory variables; this algorithm calculates a vector we denote as "ρ^min," whose Euclidean length—denoted "r^min" here—is our basic measure of the robustness/fragility of this particular OLS regression model hypothesis testing inference.
The definition of ρ^min is straightforward: it is simply the shortest endogeneity correlation vector, ρ, for which possible endogeneity in this subset of the explanatory variables suffices to raise the rejection p-value for this particular null hypothesis beyond the specified nominal level (e.g., 0.05)—thus overturning the null hypothesis rejection observed under the assumption of exogenous model regressors. Since the sampling distribution derived in Section 2 is expressed in terms of the endogeneity covariance vector (λ), this implicit search proceeds in the space of the possible values of λ, using the p-value result from Section 3.1 to eliminate all λ values still yielding a rejection of the null hypothesis, and using the ρ value corresponding to each non-eliminated λ value to supply the relevant minimand, the length of ρ.
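For the one-regressor case, this implicit search can be illustrated as a crude grid search over posited covariance values (our sketch, using our own notation and a normal approximation; the paper's algorithm may differ in detail):

```python
import math

def norm_sf(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def r_min_search(b_ols, var_x, s2_resid, n, beta0=0.0, alpha=0.05,
                 lam_grid=None):
    """Scan posited covariances lam, discard those still rejecting H0, and
    return the smallest implied |rho| among the survivors."""
    if lam_grid is None:
        lam_grid = [i / 1000.0 for i in range(-1000, 1001)]  # crude grid
    se = math.sqrt(s2_resid / (n * var_x))
    best = None
    for lam in lam_grid:
        t = (b_ols - lam / var_x - beta0) / se
        p = 2.0 * norm_sf(abs(t))
        if p <= alpha:          # still rejecting: this lam is eliminated
            continue
        rho = abs(lam) / math.sqrt(var_x * (s2_resid + lam**2 / var_x))
        if best is None or rho < best:
            best = rho
    return best  # grid approximation to the minimal |rho|; None if H0 survives

print(r_min_search(b_ols=0.30, var_x=1.0, s2_resid=1.0, n=100))
```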
In Section 3.3 we obtain a closed-form expression for ρ^min in the (not uncommon) special case of a one-dimensional sensitivity analysis, where only a single explanatory variable is considered possibly-endogenous. This closed-form result reduces the computational burden involved in calculating ρ^min, by eliminating the numerical minimization over the possible endogeneity covariance vectors; but its primary value is didactic: its derivation illuminates what is going on in the calculation of ρ^min.
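In this simple one-regressor setting, the closed form can be derived directly (our own derivation under a normal approximation, not necessarily the paper's expression): the minimizing covariance is the one at which the adjusted t-ratio just reaches the critical value, and it is then converted to a correlation.

```python
import math

def r_min_closed_form(b_ols, var_x, s2_resid, n, beta0=0.0, z_crit=1.96):
    """Closed-form sketch for one possibly-endogenous regressor: the smallest
    |lam| at which the adjusted t-ratio equals z_crit, as a correlation."""
    se = math.sqrt(s2_resid / (n * var_x))
    gap = abs(b_ols - beta0) - z_crit * se   # margin by which H0 is rejected
    if gap <= 0:
        return 0.0                           # H0 not rejected to begin with
    lam = var_x * gap                        # boundary covariance value
    return lam / math.sqrt(var_x * (s2_resid + lam**2 / var_x))

print(r_min_closed_form(0.30, 1.0, 1.0, 100))
```

Because the implied correlation is monotone in the covariance, the minimum occurs at the nearer boundary of the non-rejection region, which is what makes a closed form possible here.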
The calculation of ρ^min is not ordinarily computationally burdensome, even for the general case of multiple possibly-endogenous explanatory variables and tests of complex null hypotheses—with the sole exception of the simulation calculations alluded to in Section 1.2 above. These bootstrap simulations, quantifying the impact of substituting sample estimates for the corresponding population quantities in the ρ^min calculation, are detailed in Section 3.4. In practice these simulations are quite easy to do, as they are already coded up as an option in the implementing software, but the computational burden imposed by the requisite set of bootstrap replications of the ρ^min calculation can be substantial. In the case of a one-dimensional sensitivity analysis—where the closed-form results are available—these analytic results dramatically reduce the computational time needed for this simulation-based assessment of the extent to which the length of the sample data set is sufficient to support the sensitivity analysis. And our sensitivity results for the illustrative empirical example examined in Section 4 suggest that such one-dimensional sensitivity analyses may frequently suffice. Still, it is fortunate that this simulation-based assessment in practice needs only to be done once, at the outset of one's work with a particular regression model.
The portion of Section 3 describing the sensitivity analysis with respect to inference in the form of hypothesis testing then concludes with some preliminary remarks, in Section 3.5, as to how one can interpret ρ^min (and its length, r^min) in terms of the "fragility" or "robustness" of such a hypothesis test inference. This topic is taken up again in Section 6, at the end of the paper.
Section 3.6 closes with a description of the implementation of the "parameter-value-centric" sensitivity analysis outlined at the end of Section 1.2. This version of the sensitivity analysis is simply a display of how the estimated confidence interval for any particular component of β varies with the posited degree of endogeneity. Its implementation is consequently a straightforward variation on the algorithm for implementing the hypothesis-testing-centric sensitivity analysis, as the latter already obtains both the sampling distribution of β̂_OLS and the value of ρ for any given value of the endogeneity covariance vector, λ. Thus, each value of λ chosen yields a point in the requisite display. The results of a univariate sensitivity analysis (where only one explanatory variable is considered to be possibly endogenous, and hence only one component of λ can be non-zero) are readily displayed in a two-dimensional plot. Multi-dimensional "parameter-value-centric" sensitivity analysis results are also easy to compute, but these are more challenging to display; in such settings one could, however, still resort to a tabulation of the results, as in Kiviet.
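For a single possibly-endogenous regressor, the points of such a display can be sketched as follows (our algebra and naming: each posited correlation is converted back to a covariance using the implied structural-error variance, and the interval is shifted accordingly, under a normal approximation with a 1.96 critical value):

```python
import math

def ci_versus_rho(b_ols, var_x, s2_resid, n, rho_grid, z=1.96):
    """For each posited endogeneity correlation rho, solve
    lam = rho * sqrt(var_x * s2_resid / (1 - rho**2)) for the implied
    covariance, shift the point estimate, and report the interval."""
    out = []
    se = math.sqrt(s2_resid / (n * var_x))
    for rho in rho_grid:
        lam = rho * math.sqrt(var_x * s2_resid / (1.0 - rho**2))
        b_adj = b_ols - lam / var_x
        out.append((rho, b_adj - z * se, b_adj + z * se))
    return out

for rho, lo, hi in ci_versus_rho(0.30, 1.0, 1.0, 100, [0.0, 0.1, 0.2]):
    print(f"rho={rho:4.1f}  CI=({lo:+.3f}, {hi:+.3f})")
```

In this illustrative example the interval excludes zero at ρ = 0 but comes to include it as the posited correlation grows, which is exactly the kind of pattern the two-dimensional plot makes visible.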
In Section 4 we illustrate the application of this proposed sensitivity analysis by assessing the robustness/fragility of the inferences obtained by Mankiw, Romer, and Weil (Mankiw et al. 1992)—denoted "MRW" below—in their influential study of the impact of human capital accumulation on economic growth. Their model provides a nice example of the application of our proposed sensitivity analysis procedure, in that the exogeneity of several of their explanatory variables is in doubt, and in that one of their two key null hypotheses is a simple zero-restriction on a single model coefficient while the other is a linear restriction on several of their model coefficients. We find that some of their inferences are robust with respect to possible endogeneity in their explanatory variables, whereas others are fragile. Hypothesis testing was the focus of the MRW study, but this setting also allows us to display how an estimated confidence interval for one of their model coefficients varies with the degree of (linear) endogeneity in a selected explanatory variable, both for the actual MRW sample length and for an artificially-huge elaboration of their data set.
We view the sensitivity analysis proposed here as a practical addendum to the profession's usual toolkit of diagnostic checking techniques for OLS multiple regression analysis—in this case as a "screen" for assessing the impact of likely amounts of endogeneity in the explanatory variables. To that end—as noted above—we have written scripts encoding the technique for several popular computing frameworks; Section 5 briefly discusses exactly what is involved in using these scripts.
Finally, in Section 6 we close the paper with a brief summary and a modest elaboration of our Section 3.5 comments on how to interpret the quantitative (objective) sensitivity results which this proposed technique provides.