# Evaluating Eigenvector Spatial Filter Corrections for Omitted Georeferenced Variables


## Abstract


## 1. Introduction

## 2. The RESET for a Linear Regression Specification

**Y** is an n-by-1 vector of response values, the hat (the diacritical mark) denotes a fitted value, E denotes the expectation operator, **X** is an n-by-(p + 1) matrix containing p covariates (p must be at least 1 here), n is the number of observations, and $\beta$ is a (p + 1)-by-1 vector of regression coefficients. If some n-by-q matrix of covariates **Z** is incorrectly omitted from this regression equation, in the case where **X** and **Z** are non-stochastic, then

**Z**, and $\gamma$ denotes the full set of regression coefficients. If $\mathbf{X}^{\mathrm{T}}\mathbf{Z} = \mathbf{0}$, which is highly unlikely in practice, then no omitted variable bias (OVB) is present, emphasizing the relationship between OVB and multicollinearity.
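The role of $\mathbf{X}^{\mathrm{T}}\mathbf{Z}$ can be checked numerically with the textbook omitted-variable result for non-stochastic covariates, $E(\widehat{\beta}) = \beta + (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Z}\theta$. The following is a minimal sketch, not the authors' code: the sample size, the coefficient values, and the helper name `expected_short_coefficients` are illustrative assumptions.

```python
import numpy as np

def expected_short_coefficients(X, Z, beta, theta):
    """Expected OLS coefficients when Z is omitted: beta + (X'X)^{-1} X'Z theta."""
    bias = np.linalg.solve(X.T @ X, X.T @ Z @ theta)
    return beta + bias

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])      # intercept plus one covariate
beta = np.array([1.0, 2.0])               # hypothetical true coefficients
theta = np.array([0.5])                   # hypothetical coefficient of the omitted variable

# Case 1: the omitted variable is correlated with x, so the slope is biased.
z_corr = 0.8 * x + rng.normal(scale=0.5, size=n)
Z_corr = z_corr[:, None]

# Case 2: the omitted variable is made exactly orthogonal to the columns of X (X'Z = 0).
P = X @ np.linalg.solve(X.T @ X, X.T)     # projection onto the column space of X
Z_orth = ((np.eye(n) - P) @ z_corr)[:, None]

biased = expected_short_coefficients(X, Z_corr, beta, theta)
unbiased = expected_short_coefficients(X, Z_orth, beta, theta)
print("X'Z != 0:", biased)
print("X'Z == 0:", unbiased)
```

In the second case the bias term vanishes because $\mathbf{X}^{\mathrm{T}}\mathbf{Z} = \mathbf{0}$, so the expected coefficients equal the hypothetical true values.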

**X** **Z**), then E($\widehat{\gamma}$) = $\left(\begin{array}{c}\beta\\ \theta\end{array}\right)$. Therefore, if this covariate matrix can be augmented with proxy covariates that approximate matrix **Z** (or at least the part of **Z** correlated with **X**), then the OVB decreases, converging on zero as the approximation improves. Thursby and Schmidt (1977) [13] note that an approximation that is correlated with the omitted variables can lead to a powerful test. The RESET uses powers of $X\beta$ for this approximation. Accordingly, matrix **X** must contain more than the vector of ones (for the intercept term). The resulting set of equations for testing purposes is given by

$[(\mathrm{ESS}_{1}-\mathrm{ESS}_{2})/(\mathrm{df}_{2}-\mathrm{df}_{1})]/[\mathrm{ESS}_{2}/(n-\mathrm{df}_{2})]$

where ESS_{j} and df_{j} are, respectively, the error sum of squares and the degrees of freedom for model j (j = 1, 2, …). Rejection of the null hypothesis implies misspecification. When implementing Equation (3), in order to exploit the spatial autocorrelation common to **X** and **Z**, as well as the spatial autocorrelation unique to **Z**, our analyses used powers of fitted values from an eigenvector spatial filter for this approximation: $\widehat{Y}=X\widehat{\beta}+{E}_{\mathrm{h}}{\widehat{\beta}}_{\mathrm{h}}$, where **E**_{h} are the eigenvectors discussed in Section 4. That is, an ESF model can be expressed as

## 3. The RESET for a Generalized Linear Regression Specification

## 4. Eigenvector Spatial Filtering and Omitted Variables

^{2} value. But auto-models are complicated. ESF offers a simpler approach to handling this omitted variables problem. In other words, because spatial autocorrelation can arise from a missing relevant variable that has an underlying spatial map pattern, a spatial filter constructed with eigenvectors that shows this same underlying spatial autocorrelation pattern can serve as a proxy for missing variables by accounting for spatial autocorrelation.

**C** (defined in Equation (5)) that links geographic objects together in space, and then adds these vectors as control variables to an equation specification. These control variables identify and isolate the stochastic spatial dependencies among a given set of georeferenced observations, so that the observations mimic independent ones, thus allowing spatial statistical analysis to proceed in standard ways. Spatial autocorrelation in regression residuals often arises because of a missing relevant variable that has an underlying spatial pattern (e.g., McMillen 2003 [16]). Thus, a spatial filter constructed with eigenvectors that exhibit appropriate spatial autocorrelation patterns can serve as a proxy by accounting for spatial autocorrelation.
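The proxy argument can be illustrated with a small synthetic example. The sketch below is not the authors' procedure: it assumes a hypothetical chain of 60 locations, takes the eigenvectors from the doubly centred matrix $(\mathbf{I}-\mathbf{11}^{\mathrm{T}}/n)\mathbf{C}(\mathbf{I}-\mathbf{11}^{\mathrm{T}}/n)$ (the usual ESF form, assumed here), and builds an omitted variable `z` that lies exactly in the span of five eigenvectors, so the correction is exact rather than approximate. All coefficient values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical spatial structure: a chain of n locations, each linked to its neighbours.
n = 60
C = np.zeros((n, n))
idx = np.arange(n - 1)
C[idx, idx + 1] = C[idx + 1, idx] = 1.0

# Candidate eigenvectors from the doubly centred weights matrix (assumed ESF form).
M = np.eye(n) - np.ones((n, n)) / n
lam, E = np.linalg.eigh(M @ C @ M)
E_h = E[:, np.argsort(lam)[::-1][:5]]     # five eigenvectors with the largest eigenvalues

# Omitted variable z with a spatial pattern in the span of E_h; x is correlated with z.
z = E_h @ np.array([3.0, -2.0, 1.5, 1.0, -1.0])
x = 0.7 * z + rng.normal(scale=0.3, size=n)
beta, theta = 2.0, 1.0                    # hypothetical true coefficients
y = 1.0 + beta * x + theta * z            # noise-free, so the comparison is exact

def slope_on_x(y, columns):
    """OLS slope of x from a regression of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(n)] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

naive = slope_on_x(y, [x])                # z omitted: slope absorbs part of z's effect
filtered = slope_on_x(y, [x, E_h])        # eigenvectors stand in for z
print("slope without filter:", naive)
print("slope with filter   :", filtered)
```

With real data the omitted variable is only approximated by the selected eigenvectors, so the bias would shrink rather than vanish.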

**I** is an n-by-n identity matrix, and **1** is an n-by-1 vector of ones. This decomposition generates n eigenvectors and their associated n eigenvalues. In descending order, the n eigenvalues can be denoted as **λ** = (λ_{1}, λ_{2}, λ_{3}, …, λ_{n}), ranging between the largest eigenvalue that is positive, λ_{1}, and the smallest eigenvalue that is negative, λ_{n}. The corresponding n eigenvectors can be denoted as **E** = (**E**_{1}, **E**_{2}, **E**_{3}, …, **E**_{n}), where each eigenvector, **E**_{j}, is an n-by-1 vector.

**C** ensures orthogonality, and the projection matrix $\left(I-{\mathbf{11}}^{\mathrm{T}}/n\right)$ ensures that eigenvectors have zero means, guaranteeing uncorrelatedness. That is, $\mathbf{E}^{\mathrm{T}}\mathbf{E} = \mathbf{I}$ and $\mathbf{E}^{\mathrm{T}}\mathbf{1} = \mathbf{0}$, and the correlation between any pair of eigenvectors, say **E**_{i} and **E**_{j}, is zero when i ≠ j. Second, the eigenvectors portray distinct, selected map patterns. Tiefelsdorf and Boots (1995) [18] establish that each eigenvector portrays a different map pattern exhibiting a specified level of spatial autocorrelation when it is mapped onto the n areal units associated with the corresponding spatial weights matrix **C**. They also establish that the Moran coefficient (MC) value for a mapped eigenvector is equal to a function of its corresponding eigenvalue (i.e., $\mathrm{MC}_{j} = \frac{n}{\mathbf{1}^{\mathrm{T}}\mathbf{C}\mathbf{1}}\cdot \lambda_{j}$, for **E**_{j}). Third, given a spatial weights matrix **C**, the feasible range of MC values is determined by the largest and smallest eigenvalues; i.e., by λ_{1} and λ_{n} (de Jong et al. 1984) [19]. Based upon these properties, the eigenvectors can be interpreted as follows (Griffith 2003) [10]:

The first eigenvector, **E**_{1}, is the set of real numbers that has the largest MC value achievable by any set of real numbers for the spatial arrangement defined by the spatial weights matrix **C**; the second eigenvector, **E**_{2}, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with **E**_{1}; the third eigenvector, **E**_{3}, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with both **E**_{1} and **E**_{2}; the fourth eigenvector is the fourth such set of values; and so on through **E**_{n}, the set of real numbers that has the largest negative MC value achievable by any set that is uncorrelated with the preceding (n − 1) eigenvectors.

As such, these eigenvectors furnish distinct map pattern descriptions of latent spatial autocorrelation in spatial variables, because they are mutually both orthogonal and uncorrelated.
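The properties listed above can be verified on a toy surface partitioning. This is a minimal sketch under stated assumptions: the 4-by-5 rook-contiguity grid and the helper names `rook_adjacency`, `esf_eigenvectors`, and `moran_coefficient` are hypothetical, and the decomposed matrix is taken to be $(\mathbf{I}-\mathbf{11}^{\mathrm{T}}/n)\mathbf{C}(\mathbf{I}-\mathbf{11}^{\mathrm{T}}/n)$, consistent with the projection matrix named in the text.

```python
import numpy as np

def rook_adjacency(rows, cols):
    """Binary rook-contiguity matrix C for a rows-by-cols regular grid (illustrative)."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # east neighbour
                C[i, i + 1] = C[i + 1, i] = 1.0
            if r + 1 < rows:                      # south neighbour
                C[i, i + cols] = C[i + cols, i] = 1.0
    return C

def esf_eigenvectors(C):
    """Eigen-decompose (I - 11'/n) C (I - 11'/n); eigenvalues returned in descending order."""
    n = C.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n
    lam, E = np.linalg.eigh(M @ C @ M)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    return lam[order], E[:, order]

def moran_coefficient(v, C):
    """Moran coefficient of a mapped vector v under the spatial weights matrix C."""
    n = C.shape[0]
    d = v - v.mean()
    return (n / C.sum()) * (d @ C @ d) / (d @ d)

C = rook_adjacency(4, 5)
n = C.shape[0]
lam, E = esf_eigenvectors(C)
mc_from_eigenvalues = n / C.sum() * lam           # MC_j = n / (1'C1) * lambda_j

print("largest MC :", mc_from_eigenvalues[0], moran_coefficient(E[:, 0], C))
print("smallest MC:", mc_from_eigenvalues[-1], moran_coefficient(E[:, -1], C))
```

The zero-mean and MC identities apply to eigenvectors with nonzero eigenvalues; the eigenvector proportional to **1** has a zero eigenvalue and is normally excluded from the candidate set.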

## 5. Specimen Empirical Datasets

## 6. RESET Results for the Specimen Empirical Datasets

^{2}, sometimes more than tripling it. For both the Columbus, OH crime rates and the Puerto Rico density of irrigated farms, the included covariates do not yield a RESET diagnostic suggesting omitted variables; nevertheless, inclusion of an eigenvector spatial filter increases the RESET probability under the null hypothesis of no omitted variables.

^{2} values are 0.6523 and 0.6584, respectively. These findings suggest that spatial autoregressive models also correct for OVB, offering spatial analysts two ways of exploiting spatial autocorrelation to compensate for omitted variables.

#### Cross-Validation RESET Results for the Specimen Empirical Datasets

## 7. Correction for Omitted Variable Bias: Selected Simulation Experiments

X_{1}) and Box-Tidwell transformed mean rainfall (X_{2}), plus an independent and identically distributed (iid) random error term that is N(0, 0.1^{2}). The correlation between the two covariates is 0.43, indicating modest collinearity. The response variable (containing 73 values) was simulated 10,000 times, followed by estimation of its linear regression equation as well as each of the two individual bivariate regression equations, resulting in

X_{1}), percentage of white population (X_{2}), and percentage of single (i.e., unmarried) people (X_{3}), plus log-total population as an offset variable. The weights are the Poisson regression coefficients from a GLM. Because the expectation equation is a description of cancer counts that are overdispersed, it was used as the mean of a gamma RV, whose sampled values were treated as means of Poisson RVs.^{3} The response variable (containing 254 values) was simulated 10,000 times, followed by estimation of its Poisson GLM equation as well as each of the three individual bivariate and individual trivariate binomial regression equations, resulting in

X_{1} and X_{3} are close to their true values, whereas the adjustment for X_{2} is less effective. These results indicate that the ESF adjustment is reasonable in a bivariate regression case, but not so in a trivariate regression case. The correlation structure may play a role here: ${\mathrm{r}}_{{\mathrm{X}}_{1}{\mathrm{X}}_{2}}$ = 0.11, ${\mathrm{r}}_{{\mathrm{X}}_{1}{\mathrm{X}}_{3}}$ = −0.10, and ${\mathrm{r}}_{{\mathrm{X}}_{2}{\mathrm{X}}_{3}}$ = −0.53.

## 8. Implications and Conclusions

^{2} and the RESET null hypothesis probability. Combining an eigenvector spatial filter with a spatially unstructured term to correct for OVB merits subsequent research, too.

## Author Contributions

## Conflicts of Interest

## References

1. J.B. Ramsey. “Tests for specification errors in classical linear least squares regression analysis.” J. Royal Stat. Soc.: Ser. B 31 (1969): 350–371.
2. J. Wooldridge. Introductory Econometrics: A Modern Approach, 5th ed. Mason, OH, USA: South-Western, 2013.
3. G. Shukur, and P. Mantalos. “Size and power of the RESET test as applied to systems of equations: A bootstrap approach.” J. Mod. Appl. Stat. Methods 3 (2004): 370–385.
4. D.M. Brasington, and D. Hite. “Demand for environmental quality: A spatial hedonic analysis.” Reg. Sci. Urban Econ. 35 (2005): 57–82.
5. R.K. Pace, and J.P. LeSage. “Omitted variable biases of OLS and spatial lag models.” In Progress in Spatial Analysis. Edited by A. Páez, J. LeGallo, R. Buliung and S. Dall’Erba. Berlin, Germany: Springer, 2010, pp. 17–28.
6. J. LeSage, and O. Parent. “Bayesian model averaging for spatial econometric models.” Geogr. Anal. 39 (2007): 241–267.
7. J. LeSage, and M.M. Fischer. “Spatial growth regressions: Model specification, estimation and interpretation.” Spat. Econ. Anal. 3 (2008): 275–304.
8. P. Piribauer, and M.M. Fischer. “Model uncertainty in matrix exponential spatial growth regression models.” Geogr. Anal. 47 (2015): 240–261.
9. P. Piribauer. “Heterogeneity in spatial growth clusters.” Empir. Econ., 2016.
10. D.A. Griffith. Spatial Autocorrelation and Spatial Filtering: Gaining Understanding through Theory and Scientific Visualization. Berlin, Germany: Springer, 2003.
11. R.K. Pace, J.P. LeSage, and S. Zhu. “Interpretation and computation of estimates from regression models using spatial filtering.” Spat. Econ. Anal. 8 (2013): 352–369.
12. Y. Chun, and D.A. Griffith. “A quality assessment of eigenvector spatial filtering based parameter estimates for the normal probability model.” Spat. Stat. 10 (2014): 1–11.
13. J.G. Thursby, and P. Schmidt. “Some properties of tests for specification error in a linear regression model.” J. Am. Stat. Assoc. 72 (1977): 635–641.
14. S. Sapra. “A regression error specification test (RESET) for generalized linear model.” Econ. Bull. 3 (2005): 1–6.
15. J. Temple. “The New Growth Evidence.” J. Econ. Lit. 37 (1999): 112–156.
16. D.P. McMillen. “Spatial autocorrelation or model misspecification?” Int. Reg. Sci. Rev. 26 (2003): 208–217.
17. D.A. Griffith. “Eigenfunction properties and approximations of selected incidence matrices employed in spatial analyses.” Linear Algebra Its Appl. 321 (2000): 95–112.
18. M. Tiefelsdorf, and B.N. Boots. “The exact distribution of Moran’s I.” Environ. Plan. A 27 (1995): 985–999.
19. P. De Jong, C. Sprenger, and F.V. Veen. “On extreme values of Moran’s I and Geary’s c.” Geogr. Anal. 16 (1984): 17–24.
20. L. Janson, W. Fithian, and T.J. Hastie. “Effective degrees of freedom: A flawed metaphor.” Biometrika 102 (2015): 479–485.
21. A. Vaona. “Spatial autocorrelation or model misspecification? The help from RESET and the curse of small samples.” Lett. Spat. Resour. Sci. 2 (2009): 53–59.
22. J. Le Gallo, and A. Paez. “Using synthetic variables in instrumental variable estimation of spatial series models.” Environ. Plan. A 45 (2013): 2227–2242.

^{1} Several of these datasets were used in the 2008 US National Science Foundation funded spatial filtering workshop held at the University of Texas at Dallas during June 16–20 (http://www.spatialfiltering.com/).

^{2} ESS_{1} was calculated with covariates and selected eigenvectors, and ESS_{2} was calculated with additional fitted terms as well as the covariates and the selected eigenvectors. For the Columbus data, df_{2} for the non-spatial model is 41 (= 49 − the number of independent variables; that is, 2 covariates, intercept, and 5 fitted terms); df_{2} for the ESF model is 38 (= 49 − the number of independent variables with 3 additional eigenvectors).

^{3} The mean of the empirical RV is 133, its standard deviation is 407, and its overdispersion scale parameter is 2.8. The simulated data have a mean of 134, a standard deviation of 419, and a scale parameter of approximately 2.8.

**Figure 1.** Surface partitionings for the specimen datasets. (**a**) Columbus, OH (n = 49); (**b**) US counties (n = 3109); (**c**) US state economic areas (n = 508); (**d**) City of Dallas census tracts (n = 264); (**e**) Dallas County census tracts (n = 529); (**f**) Texas counties (n = 254); (**g**) Mercer-Hall agricultural field plots (n = 500); (**h**) City of Plano census block groups (n = 159); (**i**) Puerto Rico municipalities (n = 73).

**Table 1.** Ramsey regression equation specification error test (RESET) results for the linear model empirical examples.

Data | n | Y | X | Non-spatial model: R^{2} | Non-spatial model: RESET | Non-spatial model: DF1, DF2 | Non-spatial model: p-value | ESF model: R^{2} | ESF model: RESET | ESF model: DF1, DF2 | ESF model: p-value |
---|---|---|---|---|---|---|---|---|---|---|---|
Columbus | 49 | Crime rates | Housing value, household income | 0.5524 | 1.6122 | 5, 41 | 0.1784 | 0.7419 | 1.4361 | 5, 38 | 0.2337 |
Puerto Rico | 73 | Irrigated farm density | Mean rainfall | 0.1383 | 1.8075 | 5, 66 | 0.1235 | 0.4686 | 1.4361 | 5, 60 | 0.2245 |
Plano Census Block groups | 159 | Box-Cox ^{1} transformed Vehicle Burglary rates | Rates of population aged between 18 and 24, Distance to highway | 0.1428 | 4.7558 | 5, 151 | 0.0005 | 0.4169 | 1.5777 | 5, 142 | 0.1701 |
Texas Counties | 254 | Median Monthly Mortgage | Log of Population Density, Log of Household Median Income, % of housing units built since 1980 | 0.7740 | 10.5403 | 5, 245 | 3.5 × 10^{−9} | 0.8597 | 3.6297 | 5, 228 | 0.0035 |
City of Dallas Census Tracts | 264 | Log of violation crime rates in 2000 | Rates of population aged between 13 and 17, Black population rates, Poverty rate | 0.5336 | 20.6077 | 4, 256 | 9.8 × 10^{−15} | 0.7374 | 1.9273 | 4, 245 | 0.1065 |
Mercer Hall | 500 | Wheat yield | Straw yield | 0.5326 | 3.9194 | 3, 496 | 0.0088 | 0.7376 | 0.991 | 4, 455 | 0.4121 |
US SEA | 508 | White male Prostate cancer rates | White male Bladder cancer rate, Mean indoor radon concentration | 0.1392 | 4.0884 | 5, 500 | 0.0012 | 0.4857 | 0.3308 | 5, 470 | 0.8943 |
Dallas County Census tracts | 529 | Box-Cox ^{2} transformed Pop. Density | Y coordinates, # of families, Log of distance to CBD | 0.1671 | 11.9806 | 3, 522 | 1.3 × 10^{−7} | 0.5949 | 1.2653 | 3, 472 | 0.2857 |
US Counties | 3109 | Log of population density | Log of # of families, Old population rates (60+) | 0.7394 | 13.8545 | 5, 3101 | 2.1 × 10^{−13} | 0.8952 | 4.4681 | 5, 2894 | 0.0005 |

^{1} The Box-Cox transformation was performed with $({y}^{\lambda}-1)/\lambda$ where $\widehat{\lambda}=-0.1113$.

^{2} The Box-Cox transformation was performed with $\widehat{\lambda}=0.3408$.
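The RESET and "DF1, DF2" columns of Table 1 follow from the ratio $[(\mathrm{ESS}_{1}-\mathrm{ESS}_{2})/(\mathrm{df}_{2}-\mathrm{df}_{1})]/[\mathrm{ESS}_{2}/(n-\mathrm{df}_{2})]$ given in Section 2. The sketch below shows one way to compute it for a linear specification; it treats df_{j} as the number of estimated parameters in model j, uses synthetic data, and introduces the hypothetical helpers `ols_fit` and `reset_f`. It is not the authors' implementation, and the ESF variant (with selected eigenvectors appended to the covariates) is not shown.

```python
import numpy as np

def ols_fit(y, X):
    """Return (error sum of squares, fitted values) from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(y, X, powers=(2, 3)):
    """RESET F-ratio: y ~ X (model 1) versus y ~ X + powers of fitted values (model 2)."""
    n = len(y)
    ess1, fitted = ols_fit(y, X)
    X2 = np.column_stack([X] + [fitted ** k for k in powers])
    ess2, _ = ols_fit(y, X2)
    df1, df2 = X.shape[1], X2.shape[1]        # parameter counts for models 1 and 2
    f_ratio = ((ess1 - ess2) / (df2 - df1)) / (ess2 / (n - df2))
    return f_ratio, (df2 - df1, n - df2)

rng = np.random.default_rng(42)
n = 49
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(scale=0.5, size=n)

y_linear = 1.0 + 2.0 * x + noise              # correctly specified by X
y_curved = 1.0 + 2.0 * x + x ** 2 + noise     # x**2 is omitted from X

f_ok, dof_ok = reset_f(y_linear, X)
f_bad, dof_bad = reset_f(y_curved, X)
print("correct specification:", f_ok, dof_ok)
print("omitted x**2         :", f_bad, dof_bad)
```

With n = 49, an intercept, two covariates, and five fitted-value terms (for example powers 2 through 6, which is an assumption about which powers were used), the same function returns degrees of freedom (5, 41), matching the Columbus non-spatial entry. The p-value would then come from the upper tail of the F distribution with those degrees of freedom.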

Term | Before ESF: ${\mathit{\chi}}^{\mathbf{2}}$ | Before ESF: p-value | After ESF: ${\mathit{\chi}}^{\mathbf{2}}$ | After ESF: p-value |
---|---|---|---|---|
Puerto Rico (Binomial): Irrigated farms (y) with log of mean rainfall (x) | | | | |
${\widehat{\text{Y}}}^{2}$ | 0.4510 | 0.5018 | 0.0003 | 0.9853 |
${\widehat{\text{Y}}}^{3}$ | 100.7835 | <2.2 × 10^{−16} | 16.2781 | 0.0003 |
Pseudo-R^{2} | 0.4528 | | 0.4829 | |
Texas counties (Poisson): Cancer counts (y) with three covariates ^{1} | | | | |
${\widehat{\text{Y}}}^{2}$ | 127.3967 | <2.2 × 10^{−16} | 3.5006 | 0.0614 |
${\widehat{\text{Y}}}^{3}$ | 147.8025 | <2.2 × 10^{−16} | 18.2274 | 0.0001 |
Pseudo-R^{2} | 0.1315 | | 0.3722 | |

^{1} The covariates are log of household median income, log of white population rates, and log of single marital status rates.

Data | n | Maintained p ≤ 0.1 | Improved from p ≤ 0.1 to p > 0.1 | Declined from p > 0.1 to p ≤ 0.1 | Maintained p > 0.1 |
---|---|---|---|---|---|
Columbus | 49 | 0 | 2 | 2 | 45 |
Puerto Rico | 73 | 0 | 6 | 2 | 65 |
Plano Census Block Groups | 159 | 92 ^{1} | 67 | 0 | 0 |
Texas Counties | 254 | 254 ^{2} | 0 | 0 | 0 |
City of Dallas Census Tracts | 264 | 261 ^{3} | 3 | 0 | 0 |
Mercer Hall | 500 | 4 | 495 | 0 | 1 |
US SEA | 508 | 0 | 507 | 0 | 1 |
Dallas County Census Tracts | 529 | 0 | 528 | 1 | 0 |
US Counties | 3109 | 3109 ^{4} | 0 | 0 | 0 |

^{1} p-values for 35 (out of 92) cases increased from less than 0.0001 to greater than 0.05.

^{2} p-values for 252 (out of 254) cases increased from less than 10^{−7} to greater than 0.001.

^{3} p-values for 256 (out of 261) cases increased from less than 10^{−9} to greater than 0.001.

^{4} p-values for 3104 (out of 3109) cases increased from less than 10^{−10} to greater than 0.0001.

Term | n | Maintained p ≤ 0.1 | Improved from p ≤ 0.1 to p > 0.1 | Declined from p > 0.1 to p ≤ 0.1 | Maintained p > 0.1 |
---|---|---|---|---|---|
Puerto Rico (Binomial): Irrigated farms (y) with log of mean rainfall (x) | | | | | |
${\widehat{\text{Y}}}^{2}$ | 73 | 0 | 1 | 70 | 2 |
${\widehat{\text{Y}}}^{3}$ | 73 | 72 ^{1} | 1 | 0 | 0 |
Texas counties (Poisson): Cancer counts (y) with three covariates ^{1} | | | | | |
${\widehat{\text{Y}}}^{2}$ | 254 | 246 | 8 | 0 | 0 |
${\widehat{\text{Y}}}^{3}$ | 254 | 253 ^{2} | 1 | 0 | 0 |

^{1} The p-values of 70 cases (out of 72) increased from one less than 1.0 × 10^{−16} to one greater than 1.0 × 10^{−5}, and for 12 of these cases, increased to one greater than 0.0001.

^{2} The p-values of 252 cases (out of 253) increased from one less than 1.0 × 10^{−16} to one greater than 1.0 × 10^{−5}, and for 190 of these cases, increased to one greater than 0.0001.

Variable | None | ${\widehat{Y}}^{\mathbf{2}}$ | ${\widehat{Y}}^{\mathbf{3}}$ | ${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{3}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{3}}$&${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{3}}$&${\widehat{Y}}^{\mathbf{4}}$ |
---|---|---|---|---|---|---|---|---|
X_{1} | 0 | 843 | 4607 | 4282 | 0 | 4 | 210 | 54 |
X_{2} | 0 | 1673 | 4660 | 3666 | 0 | 1 | 0 | 0 |

Variable | None | ${\widehat{Y}}^{\mathbf{2}}$ | ${\widehat{Y}}^{\mathbf{3}}$ | ${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{3}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{3}}$&${\widehat{Y}}^{\mathbf{4}}$ | ${\widehat{Y}}^{\mathbf{2}}$&${\widehat{Y}}^{\mathbf{3}}$&${\widehat{Y}}^{\mathbf{4}}$ |
---|---|---|---|---|---|---|---|---|
X_{1} | 27 | 594 | 566 | 923 | 902 | 553 | 20 | 6415 |
X_{2} | 84 | 677 | 609 | 1258 | 271 | 364 | 72 | 6665 |
X_{3} | 1336 | 773 | 251 | 636 | 688 | 500 | 40 | 5776 |
X_{1} & X_{2} | 625 | 770 | 499 | 1025 | 332 | 406 | 60 | 6283 |
X_{1} & X_{3} | 1320 | 809 | 249 | 635 | 533 | 515 | 85 | 5854 |
X_{2} & X_{3} | 1718 | 825 | 287 | 702 | 578 | 517 | 43 | 5330 |

Number of Omitted Variables | Estimate Type | X_{1} | X_{2} | X_{3} |
---|---|---|---|---|
two | Parameter | −0.30 | 0.20 | −0.80 |
two | OVB | −0.46 | 1.05 | −0.97 |
two | ESF RESET adjusted | −0.23 | 0.90 | −0.87 |
one | OVB | −0.32 | 0.95 | |
one | ESF RESET adjusted | −0.25 | 0.92 | |
one | OVB | −0.31 | | −0.90 |
one | ESF RESET adjusted | −0.34 | | −0.97 |
one | OVB | | 0.29 | −0.82 |
one | ESF RESET adjusted | | 0.39 | −0.76 |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license ( http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Griffith, D.A.; Chun, Y. Evaluating Eigenvector Spatial Filter Corrections for Omitted Georeferenced Variables. *Econometrics* **2016**, *4*, 29.
https://doi.org/10.3390/econometrics4020029
