#### 1.1. Motivation

For many of us, the essential distinction between econometric regression analysis and otherwise-similar forms of regression analysis conducted outside of economics is the overarching concern shown in econometrics for the model assumptions made in order to obtain consistent ordinary least squares (OLS) parameter estimation and asymptotically valid statistical inference. As Friedman (1953) famously noted (in the context of economic theorizing), it is both necessary and appropriate to make model assumptions—notably, even assumptions which we know to be false—in any successful economic modeling effort: the usefulness of a model, he asserted, inheres in the richness/quality of its predictions rather than in the accuracy of its assumptions. Our contribution here—and in Ashley (2009) and Ashley and Parmeter (2015b), which address similar issues in the context of IV and GMM/2SLS inference using possibly-flawed instruments—is to both posit and operationalize a general proposition that is a natural corollary to Friedman’s assertion: it is perfectly acceptable to make possibly-false (and even very-likely-false) assumptions—if and only if one can and does show that the model results one cares most about are insensitive to the levels of violations in these assumptions that it is reasonable to expect.

In particular, the present paper proposes a sensitivity analysis for OLS estimation/inference in the presence of unmodeled endogeneity in the explanatory variables of the usual linear multiple regression model. This context provides an ideal setting in which to both exhibit and operationalize a quantification of the “insensitivity” alluded to in the proposition above, because this setting is so very simple. This setting is also attractive in that OLS estimation of multiple regression models with explanatory variables of suspect exogeneity is common in applied economic work. The extension of this kind of sensitivity analysis to more complex—e.g., nonlinear—estimation settings is feasible, but will be laid out in separate work, as it requires a different computational framework, and does not admit the closed-form results obtained (for some cases) here.

The diagnostic analysis proposed here quantifies the sensitivity of OLS hypothesis test rejection p-values (for multiple and/or nonlinear parameter restrictions)—and also of single-parameter confidence intervals—with respect to possible endogeneity (i.e., non-zero correlations between the explanatory variables and the model errors) in the usual k-variate multiple regression model.

This sensitivity analysis rests on a derivation of the sampling distribution of ${\widehat{\beta}}^{OLS}$, the OLS parameter estimator, extended to the case where some or all of the explanatory variables are endogenous to a specified degree. In exchange for restricting attention to possible endogeneity which is solely linear in nature—the most typical case—the derivation of this sampling distribution proceeds in a particularly straightforward way. And under this “linear endogeneity” restriction, no additional model assumptions—e.g., with respect to the third and fourth moments of the joint distribution of the explanatory variables and the model errors—are necessary, beyond the usual assumptions made for any model with stochastic regressors. The resulting analysis quantifies the sensitivity of hypothesis test rejection p-values (and/or estimated confidence intervals) to such linear endogeneity, enabling the analyst to make an informed judgment as to whether any selected inference is “robust” versus “fragile” with respect to likely amounts of endogeneity in the explanatory variables.

We show below that, in the context of the linear multiple regression model, this additional sensitivity analysis is so straightforward an addendum to the usual OLS diagnostic checking which applied economists are already doing—both theoretically, and in terms of the effort required to set up and use our proposed procedure—that analysts can routinely use it. In this regard we see our sensitivity analysis as a general diagnostic “screen” for possibly-important, unaddressed endogeneity issues.

#### 1.3. Preview of the Rest of the Paper

In Section 2 we provide a new derivation of the asymptotic sampling distribution of the OLS structural parameter estimator (${\widehat{\beta}}^{OLS}$) for the usual k-variate multiple regression model, where some or all of the explanatory variables are endogenous to a specified degree. This degree of endogeneity is quantified by a given set of covariances between these explanatory variables and the model error term ($\epsilon $); these covariances are denoted by the vector $\lambda $. The derivation of the sampling distribution of ${\widehat{\beta}}^{OLS}$ is greatly simplified by restricting this endogeneity to be solely linear in form, as follows:

We first define a new regression model error term (denoted $\nu $), from which all of the linear dependence on the explanatory variables has been stripped out. Thus—under this restriction of the endogeneity to be purely linear in form—this new error term must (by construction) be completely unrelated to the explanatory variables. Hence, with solely-linear endogeneity, $\nu $ must be statistically independent of the explanatory variables. This allows us to easily construct a modified regression equation in which the explanatory variables are independent of the model errors. This modified regression model now satisfies the assumptions of the usual multiple regression model (with stochastic regressors that are independent of the model errors), for which the OLS parameter estimator asymptotic sampling distribution is well known.

Thus, under the linear-endogeneity restriction, OLS estimation of this modified regression model yields unbiased estimation and asymptotically valid inferences on the model’s parameter vector; but these OLS estimates are now, as one might expect, unbiased for a coefficient vector which differs from $\beta $ by an amount depending explicitly on the posited endogeneity covariance vector, $\lambda $. Notably, however, this sampling distribution derivation requires no additional model assumptions beyond the usual ones made for a model with stochastic regressors: in particular, one need not specify any third or fourth moments for the joint distribution of the model errors and explanatory variables, as would be the case if the endogeneity were not restricted to be solely-linear in form.
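In symbols, this construction can be sketched as follows (our notation, abstracting from the intercept; $x_i$ denotes the regressor vector and $\Sigma_{XX}$ its covariance matrix). Stripping the linear dependence out of $\epsilon$ and substituting back into the structural equation gives

$$
\epsilon_i = x_i'\gamma + \nu_i, \qquad \gamma \equiv \Sigma_{XX}^{-1}\lambda \;\Rightarrow\; \operatorname{Cov}(x_i,\nu_i) = \lambda - \Sigma_{XX}\gamma = 0,
$$

$$
y_i = x_i'\beta + \epsilon_i = x_i'\!\left(\beta + \Sigma_{XX}^{-1}\lambda\right) + \nu_i, \qquad \operatorname{plim}\,{\widehat{\beta}}^{OLS} = \beta + \Sigma_{XX}^{-1}\lambda ,
$$

so the OLS estimator is consistent for $\beta + \Sigma_{XX}^{-1}\lambda$ rather than for $\beta$ itself.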

In Section 3 we show how this sampling distribution for ${\widehat{\beta}}^{OLS}$ can be used to assess the robustness/fragility of an OLS regression model inference with respect to possible endogeneity in the explanatory variables; this Section splits the discussion into several parts. The first parts of this discussion—Section 3.1 through Section 3.5—describe the sensitivity analysis with respect to how much endogeneity is required in order to “overturn” an inference result with respect to the testing of a particular null hypothesis regarding the structural parameter, $\beta $, where this null hypothesis has been rejected at some nominal significance level—e.g., $5\%$—under the assumption that all of the explanatory variables are exogenous. Section 3.6 closes our sensitivity analysis algorithm specification by describing how the sampling distribution for ${\widehat{\beta}}^{OLS}$ derived in Section 2 can be used to display the sensitivity of a confidence interval for a particular component of $\beta $ to possible endogeneity in the explanatory variables; this is the “parameter-value-centric” sensitivity analysis outlined at the end of Section 1.2. So as to provide a clear “road-map” for this Section, each of these six subsections is briefly described next.

Section 3.1 shows how a specific value for the posited endogeneity covariance vector ($\lambda $)—combined with the sampling distribution of ${\widehat{\beta}}^{OLS}$ from Section 2—can be used both to recompute the rejection p-value for the specified null hypothesis with regard to $\beta $ and to convert this endogeneity covariance vector ($\lambda $) into the corresponding endogeneity correlation vector, ${\widehat{\rho}}_{X\epsilon}$, which is more interpretable than $\lambda $. This conversion into a correlation vector is possible (for any given value of $\lambda $) because the ${\widehat{\beta}}^{OLS}$ sampling distribution yields a consistent estimator of $\beta $, making the error term $\epsilon $ in the original (structural) model asymptotically available; this allows the necessary variance of $\epsilon $ to be consistently estimated.
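The $\lambda$-to-$\rho$ conversion just described can be sketched in a few lines of Python (an illustrative sketch under the notation above, not the paper's implementing R/Stata software; function and variable names here are ours):

```python
import numpy as np

def endogeneity_corr(X, y, lam):
    """Convert a posited endogeneity covariance vector `lam` (= Cov(X, eps))
    into the corresponding correlation vector rho_hat_{X eps}.

    Illustrative sketch only: X is n-by-k (no intercept column; variables are
    centered internally), y has length n.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    lam = np.asarray(lam, float)
    Xc = X - X.mean(axis=0)                  # center, so X'X/n estimates Sigma_XX
    yc = y - y.mean()
    Sigma_XX = Xc.T @ Xc / len(y)
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    # plim beta_ols = beta + Sigma_XX^{-1} lam, so a consistent estimate of the
    # structural beta (for this posited lam) is:
    beta_struct = beta_ols - np.linalg.solve(Sigma_XX, lam)
    eps_hat = yc - Xc @ beta_struct          # structural-error proxy
    return lam / (Xc.std(axis=0) * eps_hat.std())   # rho_j = lam_j/(sd_Xj sd_eps)
```

Each posited $\lambda$ thus maps to one endogeneity correlation vector, which is the quantity reported to the analyst.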

Section 3.2 operationalizes these two results from Section 3.1 into an algorithm for a sensitivity analysis with regard to the impact (on the rejection of this specified null hypothesis) of possible endogeneity in a specified subset of the k regression model explanatory variables; this algorithm calculates a vector we denote as “${r}_{min}$,” whose Euclidean length—denoted “${\left|r\right|}_{min}$” here—is our basic measure of the robustness/fragility of this particular OLS regression model hypothesis testing inference.

The definition of ${r}_{min}$ is straightforward: it is simply the shortest endogeneity correlation vector, ${\widehat{\rho}}_{X\epsilon}$, for which possible endogeneity in this subset of the explanatory variables suffices to raise the rejection p-value for this particular null hypothesis beyond the specified nominal level (e.g., $0.05$)—thus overturning the null hypothesis rejection observed under the assumption of exogenous model regressors. Since the sampling distribution derived in Section 2 is expressed in terms of the endogeneity covariance vector ($\lambda $), this implicit search proceeds in the space of the possible values of $\lambda $, using the p-value result from Section 3.1 to eliminate all $\lambda $ values still yielding a rejection of the null hypothesis, and using the ${\widehat{\rho}}_{X\epsilon}$ result corresponding to each non-eliminated $\lambda $ value to supply the relevant minimand, $|{\widehat{\rho}}_{X\epsilon}|$.
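To convey the mechanics of this search, here is a stylized Python sketch for the simplest case—a single possibly-endogenous regressor and a two-sided t-test of $H_0\!:\beta_j = 0$, with the standard error held at its exogeneity-case value. This is our simplification for illustration, not the paper's general algorithm (which handles regressor subsets and composite hypotheses):

```python
import numpy as np
from math import erfc, sqrt

def r_min_one_var(X, y, j, level=0.05, grid=4001):
    """Grid search for the smallest-|rho| endogeneity correlation in regressor j
    that raises the p-value of the t-test of H0: beta_j = 0 above `level`.
    Returns that correlation, or None if no grid point overturns the rejection.
    """
    X = np.asarray(X, float); y = np.asarray(y, float)
    n, k = X.shape
    Xc = X - X.mean(axis=0); yc = y - y.mean()
    Sigma = Xc.T @ Xc / n
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    resid = yc - Xc @ beta_ols
    se = np.sqrt((resid @ resid / (n - k - 1))
                 * np.linalg.inv(Xc.T @ Xc).diagonal())  # conventional OLS s.e.
    lam_bound = Xc[:, j].std() * resid.std()             # keeps |rho| roughly <= 1
    best = None
    for lam_j in np.linspace(-lam_bound, lam_bound, grid):
        lam = np.zeros(k); lam[j] = lam_j
        beta_struct = beta_ols - np.linalg.solve(Sigma, lam)  # structural beta
        eps = yc - Xc @ beta_struct                      # structural-error proxy
        rho = lam_j / (Xc[:, j].std() * eps.std())
        t = beta_struct[j] / se[j]                       # s.e. held fixed (sketch)
        p = erfc(abs(t) / sqrt(2))                       # two-sided normal p-value
        if p >= level and (best is None or abs(rho) < abs(best)):
            best = rho
    return best
```

In practice the implicit minimization is over the full $\lambda$ subspace for the possibly-endogenous subset; the one-dimensional grid above is merely the easiest case to visualize.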

In Section 3.3 we obtain a closed-form expression for ${r}_{min}$ in the (not uncommon) special case of a one-dimensional sensitivity analysis, where only a single explanatory variable is considered possibly-endogenous. This closed-form result reduces the computational burden involved in calculating ${r}_{min}$ by eliminating the numerical minimization over the possible endogeneity covariance vectors; but its primary value is didactic: its derivation illuminates what is going on in the calculation of ${r}_{min}$.
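The paper’s exact expression is developed in Section 3.3; merely to convey its flavor, consider the simplest sub-case—a two-sided test of $H_0\!:\beta_j = 0$, with the standard error ${\widehat{se}}_j$ held at its exogeneity-case value and only $\lambda_j$ non-zero (our simplification). The rejection is just overturned when the implied structural coefficient estimate reaches the critical boundary:

$$
{\widehat{\beta}}_j^{OLS} - \left(\Sigma_{XX}^{-1}\right)_{jj}\lambda_j \;=\; \pm z_{crit}\,{\widehat{se}}_j
\quad\Longrightarrow\quad
\lambda_j^{min} \;=\; \frac{{\widehat{\beta}}_j^{OLS} \mp z_{crit}\,{\widehat{se}}_j}{\left(\Sigma_{XX}^{-1}\right)_{jj}},
$$

with the sign chosen to minimize $|\lambda_j^{min}|$; converting $\lambda_j^{min}$ into a correlation as in Section 3.1 then yields ${\left|r\right|}_{min}$ without any numerical search.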

The calculation of ${r}_{min}$ is not ordinarily computationally burdensome, even for the general case of multiple possibly-endogenous explanatory variables and tests of complex null hypotheses—with the sole exception of the simulation calculations alluded to in Section 1.2 above. These bootstrap simulations, quantifying the impact of substituting ${\widehat{\Sigma}}_{XX}$ for ${\Sigma}_{XX}$ in the ${r}_{min}$ calculation, are detailed in Section 3.4. In practice these simulations are quite easy to do, as they are already coded up as an option in the implementing software, but the computational burden imposed by the requisite set of ${N}_{boot}\approx 1000$ replications of the ${r}_{min}$ calculation can be substantial. In the case of a one-dimensional sensitivity analysis—where the closed-form results are available—these analytic results dramatically reduce the computational time needed for this simulation-based assessment of the extent to which the length of the sample data set is sufficient to support the sensitivity analysis. And our sensitivity results for the illustrative empirical example examined in Section 4 suggest that such one-dimensional sensitivity analyses may frequently suffice. Still, it is fortunate that this simulation-based assessment in practice needs only to be done once, at the outset of one’s work with a particular regression model.
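Schematically, this resampling check can be sketched as follows (a minimal Python illustration, not the implementing R/Stata software; the `r_min` argument here is a hypothetical stand-in for whatever routine computes ${\left|r\right|}_{min}$ for a data set):

```python
import numpy as np

def bootstrap_r_min(X, y, r_min, n_boot=1000, seed=0):
    """Resample (X, y) rows with replacement and recompute |r|_min each time,
    to gauge how much sampling variation in Sigma_hat_XX feeds through into
    the r_min calculation.  `r_min` is a user-supplied function of (X, y)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float); y = np.asarray(y, float)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # bootstrap resample of observations
        draws.append(r_min(X[idx], y[idx]))
    draws = np.asarray(draws)
    return draws.mean(), draws.std()         # a wide spread flags an inadequate n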

The portion of Section 3 describing the sensitivity analysis with respect to inference in the form of hypothesis testing then concludes with some preliminary remarks, in Section 3.5, as to how one can interpret ${r}_{min}$ (and its length, ${\left|r\right|}_{min}$) in terms of the “fragility” or “robustness” of such a hypothesis test inference. This topic is taken up again in Section 6, at the end of the paper.

Section 3 closes with a description of the implementation of the “parameter-value-centric” sensitivity analysis outlined at the end of Section 1.2. This version of the sensitivity analysis is simply a display of how the $95\%$ confidence interval for any particular component of $\beta $ varies with ${\widehat{\rho}}_{X\epsilon}$. Its implementation is consequently a straightforward variation on the algorithm for implementing the hypothesis-testing-centric sensitivity analysis, as the latter already obtains both the sampling distribution of ${\widehat{\beta}}^{OLS}$ and the value of ${\widehat{\rho}}_{X\epsilon}$ for any given value of the endogeneity covariance vector, $\lambda $. Thus, each value of $\lambda $ chosen yields a point in the requisite display. The results of a univariate sensitivity analysis (where only one explanatory variable is considered to be possibly endogenous, and hence only one component of $\lambda $ can be non-zero) are readily displayed in a two-dimensional plot. Multi-dimensional, “parameter-value-centric” sensitivity analysis results are also easy to compute, but these are more challenging to display; in such settings one could, however, still resort to a tabulation of the results, as in Kiviet (2016).
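The points for such a two-dimensional display can be generated as in this Python sketch (illustrative only, with our names; standard errors are held at their exogeneity-case values for simplicity):

```python
import numpy as np

def ci_vs_lambda(X, y, j, lam_grid, z=1.96):
    """For one possibly-endogenous regressor j and each posited lambda_j in
    `lam_grid`, return (rho_hat, CI lower, CI upper) for beta_j -- the points
    of the 'parameter-value-centric' display."""
    X = np.asarray(X, float); y = np.asarray(y, float)
    n, k = X.shape
    Xc = X - X.mean(axis=0); yc = y - y.mean()
    Sigma = Xc.T @ Xc / n
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    resid = yc - Xc @ beta_ols
    se_j = np.sqrt((resid @ resid / (n - k - 1))
                   * np.linalg.inv(Xc.T @ Xc)[j, j])     # conventional OLS s.e.
    rows = []
    for lam_j in lam_grid:
        lam = np.zeros(k); lam[j] = lam_j
        b = beta_ols - np.linalg.solve(Sigma, lam)       # structural beta
        eps = yc - Xc @ b                                # structural-error proxy
        rho = lam_j / (Xc[:, j].std() * eps.std())
        rows.append((rho, b[j] - z * se_j, b[j] + z * se_j))
    return rows
```

Plotting the interval endpoints against the implied ${\widehat{\rho}}_{X\epsilon}$ values then yields the univariate display described above.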

In Section 4 we illustrate the application of this proposed sensitivity analysis to assess the robustness/fragility of the inferences obtained by Mankiw, Romer, and Weil (Mankiw et al. 1992)—denoted “MRW” below—in their influential study of the impact of human capital accumulation on economic growth. This model provides a nice example of the application of our proposed sensitivity analysis procedure, in that the exogeneity of several of their explanatory variables is in doubt, and in that one of their two key null hypotheses is a simple zero-restriction on a single model coefficient while the other is a linear restriction on several of their model coefficients. We find that some of their inferences are robust with respect to possible endogeneity in their explanatory variables, whereas others are fragile. Hypothesis testing was the focus in the MRW study, but this setting also allows us to display how a $95\%$ confidence interval for one of their model coefficients varies with the degree of (linear) endogeneity in a selected explanatory variable, both for the actual MRW sample length ($n=98$) and for an artificially-huge elaboration of their data set.

We view the sensitivity analysis proposed here as a practical addendum to the profession’s usual toolkit of diagnostic checking techniques for OLS multiple regression analysis—in this case as a “screen” for assessing the impact of likely amounts of endogeneity in the explanatory variables. To that end—as noted above—we have written scripts encoding the technique for two popular computing frameworks: $\mathtt{R}$ and $\mathtt{Stata}$; Section 5 briefly discusses exactly what is involved in using these scripts.

Finally, in Section 6 we close the paper with a brief summary, and a modest elaboration of our Section 3.5 comments on how to interpret the quantitative (objective) sensitivity results which this proposed technique provides.