Evaluating Eigenvector Spatial Filter Corrections for Omitted Georeferenced Variables

The Ramsey regression equation specification error test (RESET) furnishes a diagnostic for omitted variables in a linear regression model specification (i.e., the null hypothesis is no omitted variables). Integer powers of fitted values from a regression analysis are introduced as additional covariates in a second regression analysis. The former regression model can be considered restricted, whereas the latter model can be considered unrestricted; this first model is nested within this second model. A RESET significance test is conducted with an F-test using the error sums of squares and the degrees of freedom for the two models. For georeferenced data, eigenvectors can be extracted from a modified spatial weights matrix, and included in a linear regression model specification to account for the presence of nonzero spatial autocorrelation. The intuition underlying this methodology is that these synthetic variates function as surrogates for omitted variables. Accordingly, a restricted regression model without eigenvectors should indicate an omitted variables problem, whereas an unrestricted regression model with eigenvectors should result in a failure to reject the RESET null hypothesis. This paper furnishes eleven empirical examples, covering a wide range of spatial attribute data types, that illustrate the effectiveness of eigenvector spatial filtering in addressing the omitted variables problem for georeferenced data as measured by the RESET.


Introduction
A practitioner spends considerable time contemplating which covariates to include in a descriptive regression equation, as well as the functional forms they should have. A serious problem in regression analysis is misspecification of a descriptive equation by failing to include all relevant covariates in it: the omitted variables problem. One result of such omissions is omitted-variable bias (OVB), which arises when parameter estimates for the covariates included in a descriptive equation are over- or under-estimated because estimation attempts to compensate for the omitted variables. In part, this outcome arises from multicollinearity; in part, this outcome arises from a biased error variance estimate (i.e., covariates being removed from a specification because they are deemed insignificant when they are significant). A serious linear regression consequence of OVB for ordinary least squares (OLS) estimation is biased and inconsistent parameter estimates. OVB also impacts on non-linear regression.
The Ramsey (1969) [1] regression equation specification error test (RESET) furnishes a tool to at least partially assess OVB. Technically, it is not about omitted variables, but rather it is about functional form (e.g., Wooldridge 2013 [2], Chapter 9). It addresses the question asking whether or not non-linear combinations of fitted values help explain a response variable. Its supporting logic contends that non-linear combinations (e.g., exponential powers and cross-products) of covariates that correlate with a response variable signify a mis-specified equation. Consequently, the RESET specifically tests functional form, but often with inferences drawn about omitted variables. Shukur and Mantalos (2004) [3] comment that the RESET has good statistical power with increasing misspecification, and as the RESET proxy variate more closely approximates omitted variables. Of note is that the only way to truly assess OVB is to have the omitted variables to assess, which is not practical.
Studies (e.g., Brasington and Hite 2005 [4], Pace and LeSage 2010 [5]) show that spatial models accommodating spatial dependence are less influenced by OVB, especially when a true data generating process contains a spatial dependence component. Comparisons of model specifications between non-spatial and/or spatial models already appear in the literature. LeSage and Parent (2007) [6] investigate OVB with different model specifications, including ones for non-spatial and spatial regression, using a Bayesian model averaging technique. LeSage and Fischer (2008) [7] and Piribauer and Fischer (2015) [8] extend this approach for model uncertainty in spatial growth modeling. Piribauer (2016) [9] further extends it using stochastic search variable selection priors to improve OVB as well as over-parameterization.
The purpose of this paper is to demonstrate how eigenvector spatial filtering (ESF) impacts OVB as measured by the RESET. As a popular alternative approach for spatial regression model specification (Griffith 2003 [10], Pace, LeSage, and Zhu 2013 [11], Chun and Griffith 2014 [12]), ESF offers the potential to alleviate OVB by including spatial dependence components.

The RESET for a Linear Regression Specification
Ramsey (1969) [1] formulated his test for the case of linear regression. His test begins with the conditional expectation

$$E(Y \mid X) = X\beta, \tag{1}$$

where Y is an n-by-1 vector of response values, hat (the diacritical mark) denotes fitted value, E denotes the calculus of expectation operator, X is an n-by-(p + 1) matrix containing p covariates (p must be at least 1 here), n is the number of observations, and β is a (p + 1)-by-1 vector of regression coefficients. If some n-by-q matrix of covariates Z is incorrectly omitted from this regression equation, in the case where X and Z are non-stochastic, then

$$E(\hat{\gamma}) = \beta + (X^T X)^{-1} X^T Z \theta, \tag{2}$$

where superscript T denotes the matrix transpose operation, θ denotes regression coefficients for the covariates Z, and γ denotes the full set of regression coefficients. If $X^T Z = 0$, which is highly unlikely in practice, then no OVB is present, emphasizing the relationship between OVB and multicollinearity.
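The reasoning behind the omitted variables bias expression can be made explicit with a short derivation. It is a sketch under the assumptions that the true specification is $Y = X\beta + Z\theta + \varepsilon$ with $E(\varepsilon) = 0$, that X and Z are non-stochastic, and that $\hat{\gamma} = (X^T X)^{-1} X^T Y$ denotes the OLS coefficient vector obtained when Z is omitted:

$$
\begin{aligned}
E(\hat{\gamma}) &= (X^T X)^{-1} X^T E(Y) \\
&= (X^T X)^{-1} X^T (X\beta + Z\theta) \\
&= \beta + (X^T X)^{-1} X^T Z \theta .
\end{aligned}
$$

The second term is the matrix of coefficients from regressing each column of Z on X, post-multiplied by θ; it vanishes when $X^T Z = 0$ or when θ = 0.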
If the covariate matrix in Equation (2) is expanded to (X Z), then $E(\hat{\gamma}) = (\beta^T, \theta^T)^T$. Therefore, if this covariate matrix can be augmented with proxy covariates that approximate matrix Z (or at least the part of Z correlated with X), then the OVB decreases, converging on zero as the approximation becomes increasingly better. Thursby and Schmidt (1977) [13] discuss that an approximation being correlated with omitted variables can lead to a powerful test. The RESET uses exponential powers of $X\hat{\beta}$ for this approximation. Accordingly, matrix X must contain more than the vector of ones (for the intercept term). The resulting set of equations for testing purposes is given by

$$Y = X\beta + \phi_2 \hat{Y}^2 + \phi_3 \hat{Y}^3 + \cdots + \varepsilon, \tag{3}$$

where $\hat{Y}^k = (X\hat{\beta})^k$ for integer $k \geq 2$, and ε is an n-by-1 vector of random errors for a non-spatial model. The joint null hypothesis for the $\phi_k$ coefficients is that all of them are zero, which is tested using the F-ratio

$$F = \frac{(ESS_1 - ESS_2)/(df_1 - df_2)}{ESS_2/df_2},$$

where $ESS_j$ and $df_j$ are, respectively, the error sum of squares and the degrees of freedom for model j (j = 1, 2). Rejection of the null hypothesis implies misspecification. When implementing Equation (3), in order to exploit the spatial autocorrelation common to X and Z, as well as the spatial autocorrelation unique to Z, our analyses used exponential powers of fitted values from an eigenvector spatial filter for this approximation: $\hat{Y} = X\hat{\beta} + E_h\hat{\beta}_h$, where $E_h$ are the eigenvectors discussed in Section 4. That is, an ESF model can be expressed as

$$Y = X\beta + E_h\beta_h + \varepsilon. \tag{4}$$

The RESET for a Generalized Linear Regression Specification
Sapra (2005) [14] extends Ramsey's RESET to generalized linear models (GLMs). The logic remains the same here; the response variable no longer is a normal random variable (RV). Rather, it is a Poisson, binomial, or other RV from the exponential family.
The basic equation is similar to (3): assessment is in terms of powers of a linear combination of covariates. For a Poisson RV, the linear combination is the log-mean estimate. For a binomial random variable, the linear combination is the log-odds ratio function. The test statistic is the chi-square, whereas the calculation is −2 times the log-likelihood function differences (subtracting that for the expanded specifications from the original specification). Sapra (2005) [14] comments that this extended version of the RESET appears to have reasonable statistical power for medium to large sample sizes.
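The mechanics of the linear-model RESET described in the preceding section can be illustrated with a minimal sketch. The function names (`ols_ess`, `reset_f`), the maximum power of 3, and the synthetic data are assumptions made for illustration only; they are not the implementation used for the analyses reported here. Powers of z-scored fitted values are used, which span the same column space as powers of the raw fitted values when X contains an intercept, so the F-ratio is unchanged.

```python
import numpy as np

def ols_ess(X, y):
    """Return (error sum of squares, fitted values) for an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(X, y, max_power=3):
    """RESET F-ratio comparing y ~ X (restricted) with
    y ~ X + fitted^2 + ... + fitted^max_power (unrestricted).
    X must already hold a column of ones plus at least one covariate.
    max_power=3 is an illustrative choice, not a value from the paper."""
    n, p1 = X.shape
    ess1, fitted = ols_ess(X, y)
    # z-score the fitted values to keep their powers numerically well scaled
    z = (fitted - fitted.mean()) / fitted.std()
    powers = np.column_stack([z ** k for k in range(2, max_power + 1)])
    ess2, _ = ols_ess(np.hstack([X, powers]), y)
    df1 = n - p1                    # restricted-model degrees of freedom
    df2 = df1 - powers.shape[1]     # unrestricted-model degrees of freedom
    f_ratio = ((ess1 - ess2) / (df1 - df2)) / (ess2 / df2)
    return f_ratio, df1 - df2, df2

# Synthetic illustration: one correctly specified and one misspecified response.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=200)
X = np.column_stack([np.ones_like(x), x])
noise = rng.normal(0.0, 0.1, size=200)
y_linear = 1.0 + x + noise            # linear in x: specification is correct
y_curved = 1.0 + x + x ** 2 + noise   # omits x^2: specification is wrong

f_linear, df_num, df_den = reset_f(X, y_linear)
f_curved, _, _ = reset_f(X, y_curved)
print(f"F (correct specification): {f_linear:.3f} on ({df_num}, {df_den}) df")
print(f"F (omitted quadratic term): {f_curved:.3f} on ({df_num}, {df_den}) df")
```

An upper-tail probability for the F-ratio can be obtained from an F distribution with the returned degrees of freedom. For the GLM version, the same nested comparison would use −2 times the log-likelihood difference, referred to a chi-square distribution, in place of the F-ratio.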

Eigenvector Spatial Filtering and Omitted Variables
One contention about the presence of non-zero spatial autocorrelation in regression residuals is that it arises because covariates with spatial patterns are missing from a descriptive equation specification (e.g., Temple 1999 [15]). Shifting this spatial autocorrelation from the residuals to the systematic part of the equation (e.g., introducing a spatial autoregressive term) furnishes a surrogate for the missing variable(s), which can be seen by, for example, an increase in the accompanying pseudo-R² value. But auto-models are complicated. ESF offers a simpler approach to handling this omitted variables problem. In other words, because spatial autocorrelation can arise from a missing relevant variable that has an underlying spatial map pattern, a spatial filter constructed with eigenvectors that shows this same underlying spatial autocorrelation pattern can serve as a proxy for missing variables by accounting for spatial autocorrelation.
ESF uses a set of synthetic proxy variables, which are extracted as eigenvectors from an adjusted spatial weights matrix C (defined in Equation (5)) that links geographic objects together in space, and then adds these vectors as control variables to an equation specification. These control variables identify and isolate the stochastic spatial dependencies among a given set of georeferenced observations, resulting in their mimicking independent ones, thus allowing spatial statistical analysis to proceed in standard ways. Spatial autocorrelation in regression residuals often arises because of a missing relevant variable that has an underlying spatial pattern (e.g., McMillan 2003 [16]). Thus, a spatial filter constructed with eigenvectors that exhibit appropriate spatial autocorrelation patterns can serve as a proxy by accounting for spatial autocorrelation.
ESF applies the mathematical decomposition that creates eigenfunctions to the following transformed spatial weights matrix:

$$\left(I - \mathbf{1}\mathbf{1}^T/n\right) C \left(I - \mathbf{1}\mathbf{1}^T/n\right), \tag{5}$$

where I is an n-by-n identity matrix, and 1 is an n-by-1 vector of ones. This decomposition generates n eigenvectors and their associated n eigenvalues. In descending order, the n eigenvalues can be denoted as $\lambda = (\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n)$, ranging between the largest eigenvalue that is positive, $\lambda_1$, and the smallest eigenvalue that is negative, $\lambda_n$. The corresponding n eigenvectors can be denoted as $E = (E_1, E_2, E_3, \ldots, E_n)$, where each eigenvector, $E_j$, is an n-by-1 vector. These eigenfunctions have a number of important properties. First, the eigenvectors are mutually orthogonal and uncorrelated (Griffith 2000) [17]: the symmetry of matrix C ensures orthogonality, and the projection matrix $(I - \mathbf{1}\mathbf{1}^T/n)$ ensures that eigenvectors have zero means, guaranteeing uncorrelatedness. That is, $EE^T = I$ and $E^T\mathbf{1} = 0$, and the correlation between any pair of eigenvectors, say $E_i$ and $E_j$, is zero when $i \neq j$. Second, the eigenvectors portray distinct, selected map patterns. Tiefelsdorf and Boots (1995) [18] establish that each eigenvector portrays a different map pattern exhibiting a specified level of spatial autocorrelation when it is mapped onto the n areal units associated with the corresponding spatial weights matrix C. They also establish that the Moran coefficient (MC) value for a mapped eigenvector is equal to a function of its corresponding eigenvalue (i.e., $MC_j = \frac{n}{\mathbf{1}^T C \mathbf{1}} \lambda_j$ for $E_j$). Third, given a spatial weights matrix C, the feasible range of MC values is determined by the largest and smallest eigenvalues; i.e., by $\lambda_1$ and $\lambda_n$ (de Jong et al. 1984) [19]. Based upon these properties, the eigenvectors can be interpreted as follows (Griffith 2003) [10]: The first eigenvector, $E_1$, is the set of real numbers that has the largest MC value achievable by any set of real numbers for the spatial arrangement defined by the spatial weight matrix C; the second eigenvector, $E_2$, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with $E_1$; the third eigenvector, $E_3$, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with both $E_1$ and $E_2$; the fourth eigenvector is the fourth such set of values; and so on through $E_n$, the set of real numbers that has the largest negative MC value achievable by any set that is uncorrelated with the preceding (n − 1) eigenvectors.
As such, these eigenvectors furnish distinct map pattern descriptions of latent spatial autocorrelation in spatial variables, because they are mutually both orthogonal and uncorrelated.
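The eigenfunction properties listed above can be checked numerically with a small sketch. The 5-by-5 regular lattice with rook adjacency is a hypothetical geography chosen only for illustration, and the helper names (`rook_weights`, `moran_coefficient`) are assumptions rather than part of any ESF software. The check on the MC relationship is restricted to eigenvectors with nonzero eigenvalues, because the constant vector (eigenvalue zero) has no variance and hence no defined MC.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary rook-adjacency matrix C for a rows-by-cols regular lattice
    (a hypothetical example geography)."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:               # neighbor to the east
                C[i, i + 1] = C[i + 1, i] = 1.0
            if r + 1 < rows:               # neighbor to the south
                C[i, i + cols] = C[i + cols, i] = 1.0
    return C

def moran_coefficient(x, C):
    """Moran coefficient of the values x for the weights matrix C."""
    d = x - x.mean()
    return (len(x) / C.sum()) * (d @ C @ d) / (d @ d)

C = rook_weights(5, 5)
n = C.shape[0]
M = np.eye(n) - np.ones((n, n)) / n        # projection matrix I - 11^T/n

# Eigen-decomposition of the transformed weights matrix, sorted descending.
lam, E = np.linalg.eigh(M @ C @ M)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

# MC implied by each eigenvalue versus MC computed directly from the eigenvector.
keep = np.abs(lam) > 1e-8                  # drop numerically zero eigenvalues
mc_from_eigenvalues = (n / C.sum()) * lam[keep]
mc_direct = np.array([moran_coefficient(E[:, j], C) for j in np.flatnonzero(keep)])

print("largest and smallest eigenvalues:", lam[0], lam[-1])
print("maximum |MC difference|:", np.abs(mc_direct - mc_from_eigenvalues).max())
```

In an application, a subset of these eigenvectors would be chosen (for example, by stepwise selection) and appended to the covariate matrix as the eigenvector spatial filter.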
ESF furnishes a promising alternative approach to the popular spatial auto-model for describing a spatial process. Pace, LeSage, and Zhu (2013) [11] comment that ESF is an effective method to alleviate OVB. With a simulation experiment that examines ESF estimates for two different types of data generating processes (i.e., spatial autoregressive and spatial error processes), they find that ESF reduces bias in parameter estimates. One appealing feature of ESF is that it utilizes a relevant subset of eigenvectors extracted from a spatial weights matrix, whereas a spatial autoregressive model utilizes the full set of these eigenvectors, both ones that correlate and ones that do not correlate (and hence introduce noise) with the response variable in question. Another appealing feature of ESF is that determining its associated degrees of freedom is more straightforward; a spatial autoregressive model has a complicated degrees of freedom structure because of its multiplicative form. The number of degrees of freedom for the spatial autocorrelation parameter can differ from 1 (Janson, Fithian, and Hastie 2015) [20].

Specimen Empirical Datasets
Illustrative analyses have been completed with eleven empirical datasets (see footnote 1) that span a range of sample sizes (49 to 3109): Dallas, TX City and County census tracts; United States (US) state economic areas (SEAs); US as well as Texas counties; Anselin's Columbus neighborhoods; Plano, TX block groups; Mercer-Hall agricultural field plots; and Puerto Rico municipalities. Figure 1 portrays the various surface partitionings associated with these datasets. These datasets not only furnish a range of sizes, but Figure 1 reveals that they also furnish a wide range of qualitatively different surface partitionings. In addition, they furnish a range of covariate set sizes, as well as a range of response variable types that includes examples of each of the three most commonly encountered varieties of georeferenced RVs (i.e., normal, binomial, and Poisson).

RESET Results for the Specimen Empirical Datasets
The RESET for an ESF model was conducted with the selected eigenvectors as additional independent variables. That is, the F-test was calculated with the sums of squared errors for the ESF model and its counterpart with additional fitted value terms (see footnote 2). Inclusion of a constructed eigenvector spatial filter improves the RESET analysis in all eleven cases (Tables 1 and 2). This improvement is of three types: when the diagnostic fails to indicate omitted variables; when the diagnostic indicates omitted variables before, but not after, adding an eigenvector spatial filter; and when the diagnostic still indicates omitted variables after inclusion of an eigenvector spatial filter.
In all cases, inclusion of an eigenvector spatial filter increases the (pseudo-)R², sometimes more than tripling it. Both Columbus, OH crime rates, and Puerto Rico density of irrigated farms include covariates that do not yield a RESET diagnostic suggesting omitted variables; nevertheless, inclusion of an eigenvector spatial filter increases the null hypothesis (no omitted variables) RESET probability.
Plano vehicle burglary rates, City of Dallas crime rates, Mercer-Hall wheat yield, US SEA prostate cancer rates, and Dallas County population density have an initial RESET diagnostic suggesting omitted variables, and a RESET diagnostic with a probability of at least 0.1 after inclusion of an eigenvector spatial filter. The implication here is that an eigenvector spatial filter substitutes well for omitted variables.
Texas median monthly mortgages, US population density, and GLM results for both percentage of Puerto Rican irrigated farms and Texas cancer counts have RESET diagnostics that indicate the presence of omitted variables both with and without inclusion of an ESF. Inclusion of an ESF increases the RESET probabilities, but not enough for them to be non-significant. These may be cases in which a spatially unstructured term also is needed to compensate for omitted variables.
For comparison purposes, a RESET was conducted for spatial lag and spatial error model specifications using the Columbus dataset. Here, because of their non-linear forms, the RESET employs the chi-square test for the likelihood ratio difference between a restricted model and its unrestricted counterpart (Vaona 2009) [21]. That is, integer powers of (z-score versions of) fitted values from a spatial regression model are introduced as explanatory variables. Here the resulting RESET p-values are 0.3663 and 0.1852, respectively, whereas the resulting pseudo-R² values are 0.6523 and 0.6584, respectively. These findings suggest that spatial autoregressive models also correct for OVB, offering spatial analysts two ways of exploiting spatial autocorrelation to compensate for omitted variables.
2 $ESS_1$ was calculated with covariates and selected eigenvectors, and $ESS_2$ was calculated with additional fitted terms as well as the covariates and the selected eigenvectors. For Columbus data, $df_2$ for the non-spatial model is 41 (= 49 − the number of independent variables; that is, 2 covariates, intercept, and 5 fitted terms); $df_2$ for the ESF model is 38 (= 49 − the number of independent variables with 3 additional eigenvectors).
Pseudo-R² 0.1315 0.3722
1 The covariates are log of household median income, log of white population rates, and log of single marital status rates.

Cross-Validation RESET Results for the Specimen Empirical Datasets
Each of the specimen datasets was subjected to a cross-validation evaluation to examine the sensitivity of the RESET to individual observations, with each observation in a dataset being left out, in turn, and then predicted. Table 3 summarizes results for the linear model examples, and Table 4 summarizes results for the generalized linear model examples. These results are encouraging, given the number of improvements, but indicate the need for further refinement work in this area. The goal would be for almost all, if not all, of the cases to improve, achieving a RESET probability exceeding 0.1.

Correction for Omitted Variable Bias: Selected Simulation Experiments
OVB results in an estimated regression coefficient differing substantially from its population parameter, often in an attempt by included covariates to compensate for omitted variables. This substantial difference can render an incorrect null hypothesis test result concerning included variables. Empirical evidence presented here suggests that an eigenvector spatial filter helps remediate this situation.
The first simulation experiment summarized here is based upon the Puerto Rico (n = 73) agricultural dataset. The response variable is the sum of the density of farms using irrigation ($X_1$) and Box-Tidwell transformed mean rainfall ($X_2$), plus an independent and identically distributed (iid) random error term that is $N(0, 0.1^2)$. The correlation between the two covariates is 0.43, indicating modest collinearity. The response variable (containing 73 values) was simulated 10,000 times, followed by estimation of its linear regression equation as well as each of the two individual bivariate regression equations. The intercept term estimate is not reported here because it is not of interest. The average regression coefficient estimates of 1.00046 and 0.99996 are not different from 1 (standard errors of roughly 0.049), their population parameter counterparts (i.e., the true model). The bivariate regression coefficient estimates indicate that the OVB is sizeable, exceeding 42%, and significant (standard errors of 0.044).
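The design of this first experiment can be mimicked with a minimal sketch. The covariates below are synthetic stand-ins, standardized and constructed to have a sample correlation of exactly 0.43; they are not the Puerto Rico variables, and the seed and variable names are assumptions. With equal covariate variances, the expected bivariate slope is 1 plus the correlation, so the sketch shows how a bias of this general magnitude arises.

```python
import numpy as np

def standardize(v):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(42)
n, r, n_rep = 73, 0.43, 10_000       # sample size, target correlation, replicates

# Synthetic covariates (hypothetical stand-ins for X1 and X2).
x1 = standardize(rng.standard_normal(n))
u = rng.standard_normal(n)
u = u - u.mean()
u = standardize(u - (u @ x1) / (x1 @ x1) * x1)   # component orthogonal to x1
x2 = r * x1 + np.sqrt(1.0 - r ** 2) * u          # sample correlation with x1 equals r

# Each column of Y is one simulated response: x1 + x2 + iid N(0, 0.1^2) error.
Y = (x1 + x2)[:, None] + rng.normal(0.0, 0.1, size=(n, n_rep))

ones = np.ones(n)
b_full = np.linalg.lstsq(np.column_stack([ones, x1, x2]), Y, rcond=None)[0]
b_x1 = np.linalg.lstsq(np.column_stack([ones, x1]), Y, rcond=None)[0]
b_x2 = np.linalg.lstsq(np.column_stack([ones, x2]), Y, rcond=None)[0]

mean_full = b_full[1:].mean(axis=1)   # average slopes, correctly specified model
mean_x1_only = b_x1[1].mean()         # average slope when x2 is omitted
mean_x2_only = b_x2[1].mean()         # average slope when x1 is omitted

print("full model slopes:", mean_full)
print("x1-only slope:", mean_x1_only, " x2-only slope:", mean_x2_only)
```

The correctly specified model recovers slopes near 1, whereas each bivariate regression absorbs the omitted covariate's effect in proportion to its correlation with the included covariate.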
Powers of the eigenvector spatial filter fitted values ($\hat{Y}_j$) furnish the RESET terms for simulation replicate j. Table 5 summarizes outcomes of this simulation experiment, which involved stepwise selection of the RESET terms (which are constructed from eigenvector spatial filters). The average bivariate regression coefficient estimates corrected by the RESET are 0.95574 and 0.94882, both of which are markedly less than their OVB counterparts, although they are modestly deflated. Their respective standard errors are 0.062 and 0.067, which, unlike the original OVB estimates, mean they are not significantly different from 1.
The second simulation experiment summarized here is based upon the Texas (n = 254) cancer dataset. The response variable is the exponentiated weighted sum of the logarithms of median household income ($X_1$), percentage of white population ($X_2$), and percentage of single (i.e., unmarried) people ($X_3$), plus log-total population as an offset variable. The weights are the Poisson regression coefficients from a GLM. Because the expectation equation is a description of cancer counts that are overdispersed, it was used as the mean of a gamma RV, whose sampled values were treated as means of Poisson RVs. (Footnote 3: The mean of the empirical RV is 133, its standard deviation is 407, and its overdispersion scale parameter is 2.8. The simulated data have a mean of 134, a standard deviation of 419, and a scale parameter of approximately 2.8.) The response variable (containing 254 values) was simulated 10,000 times, followed by estimation of its Poisson GLM equation as well as each of the three individual bivariate and individual trivariate binomial regression equations. Again, the intercept term estimate is not reported here because it is not of interest; however, in some empirical cases, it is of interest, another reason to use the z-score versions of fitted values. Table 6 summarizes outcomes of this simulation experiment, which involved stepwise selection of the RESET terms (which, as before, are constructed from eigenvector spatial filters). The average regression coefficient estimates of −0.30213, 0.21343, and −0.80155 respectively do not differ from −0.3, 0.2, and −0.8 (standard errors of roughly 0.2), their population parameter counterparts. The bivariate and trivariate Poisson regression coefficient estimates indicate that the OVB is sizeable, many being at least 20%, and statistically significant. For the bivariate regressions, the eigenvector spatial filter reduces the OVB as reported in Table 7. For the bivariate cases, the estimates with the ESF RESET adjustment are closer to their true values. Specifically, the estimates for $X_1$ and $X_3$ are close to their true values, whereas the adjustment for $X_2$ is less effective. These results indicate that the ESF adjustment is reasonable in a bivariate regression case, but not so in a trivariate regression case. The correlation structure may play a role here: $r_{X_1X_2} = 0.11$, $r_{X_1X_3} = -0.10$, and $r_{X_2X_3} = -0.53$.
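The two-stage sampling used to generate overdispersed counts in the second experiment (gamma-distributed means feeding Poisson draws) can be sketched as follows. The mean values, the gamma shape parameter, the seed, and the function name are hypothetical choices for illustration; the paper does not report the gamma parameters used.

```python
import numpy as np

def simulate_overdispersed_counts(mu, shape, n_rep, rng):
    """Gamma-Poisson mixture: draw gamma means with expectation mu,
    then Poisson counts around those means.
    Implied moments: E[count] = mu, Var[count] = mu + mu**2 / shape."""
    mu = np.asarray(mu, dtype=float)
    # Gamma with the given shape and scale mu/shape has mean mu.
    lam = rng.gamma(shape, mu / shape, size=(n_rep, mu.size))
    return rng.poisson(lam)

rng = np.random.default_rng(7)
mu = np.array([20.0, 100.0, 500.0])   # hypothetical expected counts
shape = 4.0                           # hypothetical gamma shape parameter
counts = simulate_overdispersed_counts(mu, shape, n_rep=100_000, rng=rng)

sample_mean = counts.mean(axis=0)
sample_var = counts.var(axis=0)
theoretical_var = mu + mu ** 2 / shape

print("sample means:", sample_mean)
print("sample variances:", sample_var)
print("theoretical variances:", theoretical_var)
```

In the experiment described above, the role of `mu` would be played by the exponentiated linear predictor (including the log-population offset); the variance exceeding the mean is what makes the simulated counts overdispersed relative to a Poisson RV.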
These two empirically based simulation experiments furnish a proof of concept, and indicate that ESFs offer promise for effectively dealing with the OVB problem. Clearly, future research should be devoted to this theme.

Implications and Conclusions
Properly testing for OVB requires knowing the omitted variables, which does not help in practice. This situation also can be assessed if instrumental variables are available to use. At least in some cases, an eigenvector spatial filter can be treated like an instrument (see Le Gallo and Paez 2013 [22]). Ramsey's RESET furnishes a special case test where the omitted variables are nonlinear functions of the included covariates. This paper summarizes findings based upon a set of empirical examples and a pair of conditional simulations suggesting that an ESF often can serve as a surrogate for omitted variables. Inclusion of an eigenvector spatial filter tends to increase the (pseudo-)R² and the RESET null hypothesis probability. Combining an eigenvector spatial filter with a spatially unstructured term to correct for OVB merits subsequent research, too.

1 Several of these datasets were used in the 2008 US National Science Foundation funded spatial filtering workshop held at the University of Texas at Dallas during June 16-20 (http://www.spatialfiltering.com/).

Table 1. Ramsey regression equation specification error test (RESET) results for the linear model empirical examples.

Table 2. RESET results for the generalized linear model (GLM) empirical examples.

Table 3. RESET cross-validation results for the specimen linear models.
1 p-values for 35 (out of 92) cases increased from less than 0.0001 to greater than 0.05.
2 p-values for 252 (out of 254) cases increased from less than $10^{-7}$ to greater than 0.001.
3 p-values for 256 (out of 261) cases increased from less than $10^{-9}$ to greater than 0.001.
4 p-values of 3104 (out of 3109) cases increased from less than $10^{-10}$ to greater than 0.0001.

Table 4. RESET cross-validation results for the specimen generalized linear models.

Table 5. Selection frequency of RESET terms for the Puerto Rico simulation experiment.

Table 6. Selection frequency of RESET terms for the Texas data simulation experiment.

Table 7. Parameter estimates with OVB and ESF RESET adjustments.