3. The Latitude–Longitude Model
Ross et al. [14] present an alternative to the use of distance variables in hedonic house price regressions. Their alternative replaces the distance variables with a latitude–longitude co-ordinate system that does not require information on the locations or number of amenities or dis-amenities. In its simplest form, the regression model they propose is
where HP is house price, and Zi is a vector of non-distance house characteristics and their coefficients, including an intercept. The x and y variables refer to the latitude and longitude of the house location along with their squares, and the θs represent the parameters to be estimated.
One important implication of this formulation can be seen by partially differentiating the regression model with respect to x and y. Setting each partial derivative equal to zero and solving yields
and
These solutions indicate that underlying Ross et al.’s [14] co-ordinate system is the idea that the neighborhood is characterized by a single maximum property value. One obvious problem with this analytic solution is that it will not exist if either θ3 or θ4, or both, equal zero. Ross et al. [14] illustrate some of the problems with the use of distance variables in regressions and investigate the relative merits of their latitude–longitude approach using three simulations: their Illustrative Simulation, Simulation I and Simulation II. We discuss each of these simulations below.
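The first-order conditions referred to above can be written out explicitly. The display below is a reconstruction from the definitions in this section, not a quotation from Ross et al. [14]: it assumes that θ1 and θ2 multiply x and y, that θ3 and θ4 multiply their squares, and that an additive error term is present.

```latex
\begin{align}
HP_i &= Z_i + \theta_1 x_i + \theta_2 y_i + \theta_3 x_i^2 + \theta_4 y_i^2 + \varepsilon_i \tag{1}\\
\frac{\partial HP_i}{\partial x_i} &= \theta_1 + 2\theta_3 x_i = 0
  \quad\Longrightarrow\quad x^{*} = -\frac{\theta_1}{2\theta_3} \tag{2}\\
\frac{\partial HP_i}{\partial y_i} &= \theta_2 + 2\theta_4 y_i = 0
  \quad\Longrightarrow\quad y^{*} = -\frac{\theta_2}{2\theta_4} \tag{3}
\end{align}
```

Under this form, the stationary point is a maximum only when both θ3 and θ4 are negative, and it is undefined when either is zero, which is the problem noted above.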
The Ross et al. [
14]
Illustrative Simulation is not really a Monte Carlo simulation in the usual sense, as the results are based on a single sample, and the true regression model contains no random error term. This model is used to illustrate the consequences of omitting and/or redefining independent variables on model performance measured in terms of slope estimates and model
R2 values. They estimate four models, all of which are variations of their true model. Two of the four models are straightforward applications of omitted variable results and do not require a simulation for analysis [
14]. We discuss omitted variable bias and apply omitted variable results to obtain parameter estimates for these two cases in
Appendix A.
The sample space used in Ross et al.’s [
14]
Simulations I and
II is based, in part, on locations inside a 40 × 40 square centered at the origin. They also make use of a smaller square of side length 20, also centered at the origin, inside this larger square. Ross et al.’s [
14]
Simulations I and
II are based on the two models below, where
A and
B refer to the true landmarks, and
DA and
DB are the distances to the true landmarks. A non-distance variable,
x, is also included, yielding the following regression models:
and
Models (4) and (5) represent the true models or data-generating processes. Ross et al. [
14] focus on the statistical properties of the estimators of slope parameters in these models, particularly those of the non-distance variable
x, when the true landmark locations are unknown, and distances to the mis-measured landmarks,
Da and
Db, are used instead. These estimated models are calculated using
and
These incorrect landmark locations are generated randomly for each of Ross et al.’s [
14] Monte Carlo experiments. As a result, there are always three possible models to be estimated: one model using distances to correct landmarks, one model using distances to incorrect landmarks, and the latitude–longitude version of the hedonic regression. Ross et al. [
14] estimate separate distance models, one based on incorrect landmarks and another based on true landmarks, as well as the latitude–longitude version of the hedonic regression. Ross et al.’s [14] simulations use 400 observations per trial and 1000 trials.
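The models numbered (4) to (7) above can be sketched from the parameter names reported in Section 3.1. The forms below are a reconstruction under that assumption, not a quotation from Ross et al. [14]. In particular, the tildes on the coefficients of the estimated models and the error symbols are illustrative notation.

```latex
\begin{align}
HP &= \beta_0 + \beta_1 x + \gamma_1 D_A + \varepsilon \tag{4}\\
HP &= \beta_0 + \beta_1 x + \gamma_1 D_A + \gamma_2 D_B + \varepsilon \tag{5}\\
HP &= \tilde{\beta}_0 + \tilde{\beta}_1 x + \tilde{\gamma}_1 D_a + u \tag{6}\\
HP &= \tilde{\beta}_0 + \tilde{\beta}_1 x + \tilde{\gamma}_1 D_a + \tilde{\gamma}_2 D_b + u \tag{7}
\end{align}
```

Here (4) and (5) use distances to the true landmarks A and B, while (6) and (7) use distances to the randomly drawn incorrect landmarks a and b.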
3.1. Simulation I
For Monte Carlo
Simulation I, Ross et al. [
14] choose random house locations inside the inner square, which we call the
neighborhood. We refer to the locations outside the inner square but inside the larger square as the
periphery. The landscape features two true landmarks,
A and
B, both located in the periphery. Their co-ordinates are
A = (0, 15) and
B = (10, 15), both unknown to the researcher. Instead, random landmarks are chosen from the periphery, and distances to each are used in the estimation. All of the random house locations are chosen from the neighborhood. Ross et al.’s [
14]
Simulation I examines the consequences for the estimators in two cases. In the first case, the estimated model contains distances to one or two incorrect landmarks when the true model contains the distance to a single true landmark. In the second case, the estimated model contains distances to one or two incorrect landmarks when the true model contains distances to two true landmarks. In other words, Ross et al. [
14] investigate parameter estimates if the true model is given by (4) above, but instead (6) and (7) are estimated, as well as what happens if the true model is (5) above, but models (6) and (7) are estimated. For these models, Ross et al. [
14] obtain parameter estimates using several Monte Carlo experiments. Ross et al. [
14] assume the coefficient of
x is 2,
x is an N(0, 1) random variable, and the error term is an N(0, 2) random variable, where the second argument denotes the standard deviation. As previously noted, the true locations of
A and
B are (0, 15) and (10, 15), respectively. The true values of the model parameters are
β0 = 1,
β1 = 2,
γ1 = −0.25, and
γ2 = −0.1.
The regression models in Ross et al.’s [
14] Monte Carlo
Simulation I are based on distances to incorrect landmarks that also change from experiment to experiment. Based on their simulation results, they note that the explanatory power of the model is, on average, higher when the estimated model contains two incorrectly measured distance variables rather than one incorrectly measured distance variable, whether the data-generating process contains one or two true landmark distances (see
Appendix B for a presentation of the Ross et al. [
14]
Simulation I results, along with our own re-estimation). From this simulation, Ross et al. [
14] also conclude “that incorrectly identifying the landmark has consequences for producing unbiased estimates of the non-distance variables in the regression”. As we show below, this assertion is not true because the non-distance variable
x in the Ross et al. [
14] simulations is not correlated with any potentially included independent variable, so the slope coefficient of
x is not subject to omitted variable bias in these simulations.
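The reasoning behind this claim is the standard omitted variable result, sketched here in generic matrix notation. The symbols y, X1, X2, β1 and β2 are illustrative and are not taken from Ross et al. [14] or from Appendix A. If the true model is y = X1β1 + X2β2 + ε but y is regressed on X1 alone, then

```latex
\begin{equation*}
E\left[\hat{\beta}_1 \mid X_1, X_2\right]
  = \beta_1 + \left(X_1' X_1\right)^{-1} X_1' X_2 \, \beta_2 .
\end{equation*}
```

The second term is the bias. Its element for a given included regressor is the coefficient of that regressor in an auxiliary regression of the omitted variables on the included ones. Because x is drawn independently of the house locations, that auxiliary coefficient is zero in expectation, so the coefficient of x is unbiased whichever landmarks are used.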
3.2. Simulation II
Their Monte Carlo
Simulation II uses the same two true models as in the Monte Carlo
Simulation I given in (4) and (5), but now the true landmarks change from experiment to experiment [
14]. Estimates from this model are compared to an alternative, which is the latitude–longitude version of the model preferred by Ross et al. [
14]. The estimated distance models include their non-distance variable
x and either one or two distances to true landmarks. The alternative model includes their non-distance variable but uses the Ross et al. [
14] co-ordinate system in place of distances to the true landmarks (see
Appendix C for details on the Ross et al. [
14]
Simulation II results, which we repair). The locations of the true landmarks and the housing units are randomly chosen over the larger square (i.e., the neighborhood and the periphery combined).
3.3. Conclusions and Conjectures
There are two main conclusions reached by Ross et al. [
14] regarding the virtues of using the latitude–longitude system in place of distance variables. The first is that omitted variable bias can be eliminated or greatly reduced in the estimates of the parameters of non-distance variables, even if the true landmark locations are unknown, provided that the latitude–longitude version of the hedonic regression model is estimated. The second is that multicollinearity is reduced, and the possibility of extreme multicollinearity is eliminated as the use of sometimes highly correlated distance variables is avoided. We fully address these claims and more in the next section after presenting and discussing our simulation results. However, before doing so, we discuss several unusual features of the Ross et al. [
14] simulations.
3.4. Peculiar Features of Simulations I and II
Ross et al.’s [
14]
Simulation I is based on distances to two true fixed landmarks in the periphery, random house locations chosen from the neighborhood and distances to randomly chosen incorrect landmarks from the periphery. From these samples, Ross et al. [
14] estimate models with one or two distance variables based on the inclusion of incorrect landmarks and compare estimation results for the two when the underlying true model contains either one true distance variable or two true distance variables. Their
Simulation II is based on one or two randomly chosen landmarks from the neighborhood and periphery combined. Random house locations are chosen from the neighborhood and periphery combined. In each case, the true model is estimated, the latitude–longitude version of the model is estimated, and the results are compared [
14]. Their
Simulation I parameter estimates are obtained from regression models based on distances to incorrect landmarks, while their
Simulation II includes only the true regression model and the associated latitude–longitude version. None of the Ross et al. [
14] simulations estimate all three possible models, and the sample spaces are different for
Simulations I and
II. This setup greatly complicates comparisons.
The construction of the non-distance variable
x is unusual. As previously mentioned, this variable is a normally distributed random variable with a mean of zero. Thus, even with a nonzero coefficient, the variable’s average contribution to the regression model is nil. In effect, this non-distance variable is akin to having a second error term in the model. A related peculiarity is found in the construction of the regression models used in the simulations: variation in the generated house prices is largely dominated by the randomly generated non-distance variable and the randomly generated error term. Consider the Ross et al. [
14] model with two known landmarks, 1 and 2, calculated using
Recall that for the Ross et al. [
14] Monte Carlo
Simulation I, house locations are randomly chosen from a neighborhood square of length 20 centered at the origin. The locations of the true landmarks in their
Simulation I are
A = (0, 15) and
B = (10, 15). The most distant location from points
A and
B inside the neighborhood is at the point (−10, −10). From this location, the distance to landmark
A is approximately 27 (i.e., $\sqrt{10^2 + 25^2}$), and the distance to landmark B is approximately 32 (i.e., $\sqrt{20^2 + 25^2}$). In the regression model above, the penalty subtracted for this maximum distance to A is one-fourth of 27 or 6.75, and the value subtracted for this maximum distance to B is one-tenth of 32 or 3.2. These penalties combined are likely to be dominated by the contributions of the non-distance variable and the error term. The coefficient of the non-distance variable is 2.00, and the values of the non-distance variable will typically fall between −2 and +2. This implies that the contribution to the hedonic regression from the non-distance variable will fall between −4 and +4 for a range of 8. The same is true for the random error term, which has a standard deviation of 2; the bulk of the values will fall between −4 and +4. Together, these two random components of the regression model can cause house prices to go up or down by approximately eight for a range of 16. Thus, the distance variables based on the most distant locations have about sixty-two percent of the impact of the non-distance variable and the random error combined. For less distant locations, the impact of the distance variables on house prices will diminish, but the impact of the non-distance variable and the error term will not. For the not-so-distant housing units, the random component is likely to be more dominant. This feature of the Ross et al. [
14] simulations complicates pinpointing optimal house locations because of the enormous random component in house prices built into the simulation.
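The arithmetic in this paragraph can be checked with a few lines of Python. This is a minimal sketch: the variable names are hypothetical, and the "typical" ranges follow the two-standard-deviation convention used in the text.

```python
import math

# Landmark and house co-ordinates taken from the text.
A = (0.0, 15.0)
B = (10.0, 15.0)
far_corner = (-10.0, -10.0)


def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


d_a = distance(far_corner, A)  # sqrt(10^2 + 25^2)
d_b = distance(far_corner, B)  # sqrt(20^2 + 25^2)

# Distance penalties use the magnitudes of the two distance coefficients.
penalty = 0.25 * d_a + 0.10 * d_b

# Assumption: "typical" values lie within two standard deviations of zero.
range_x = 2.0 * (2.0 - (-2.0))  # coefficient 2 applied to values in [-2, 2]
range_error = 4.0 - (-4.0)      # error term with standard deviation 2
share = penalty / (range_x + range_error)

print(round(d_a, 2), round(d_b, 2), round(penalty, 2), round(share, 2))
```

With unrounded distances, the combined penalty comes to a little under 10 and the share to roughly three-fifths, close to the figure quoted above.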
A third issue concerns the locations of the true landmarks
A and
B in Ross et al.’s [
14] Monte Carlo
Simulation I. The fact that they are both outside of the neighborhood square
and on the same side of the neighborhood makes for an easier estimation problem because, on average, the highest-valued houses will be on the “north side” of the neighborhood. Compare this situation with a single centrally located landmark. From sample to sample, the estimated “optimal location” can be pulled in
any direction. More curious, perhaps, is the fact that Landmark
B does not matter for the determination of the optimal location, as the central location and higher distance penalty (the coefficient) associated with Landmark A mean that distances to B affect only the final house price and not the optimal location. In effect, the two-landmark model using distances to both
A and
B is essentially a one-landmark model consisting of Landmark
A.
Below, we show that the highest value, on average, is the point in the neighborhood (0, 10) closest to Landmark
A. From this location, the distance to Landmark A is 5.0, and the distance to Landmark B is 11.18 (i.e., $\sqrt{10^2 + 5^2}$). Thus, the penalty for locating at (0, 10), the center north position in the neighborhood, is 2.37 (i.e., 0.1(11.18) + 0.25(5) = 1.12 + 1.25). Movements due east from this center north point will increase the distance to Landmark
A and decrease the distance to Landmark
B until the northeast corner of the neighborhood is reached at (10, 10). The penalty at this corner is approximately 3.29 (i.e., 0.25(11.18) + 0.1(5) = 2.79 + 0.5). Thus, the costs are lower at the center north point
even with two landmarks. Although house prices will change in the two-landmark model because there is a second distance penalty to consider, the optimal location will not change; thus, Ross et al.’s [
14] two-landmark model in
Simulation I is, for the purposes of finding the optimal location, actually a one-landmark model.
Another unusual feature of the Ross et al. [
14] simulations is the relationship between their sample space and their sample sizes. The landscape used by Ross et al. [
14] is typically based on a square of length 40 centered at the origin. This large square contains 1600 unit squares. The Ross et al. [
14] sample size is 400, which means that in each Monte Carlo experiment, at least 75% of these unit squares contain zero housing units, which is a problem that is mitigated when the inner square or neighborhood is used for the sample space. This is a very sparse landscape, which, along with the extra variation induced by the construction of their non-distance variable, means there will be considerable parameter fluctuation from sample to sample. This fact prompted us to use a much larger number of Monte Carlo trials in our own simulations.
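The sparsity claim can be made concrete with a short calculation. The sketch below assumes that houses are placed independently and uniformly over the square, which is our reading of the setup and not a detail confirmed in Ross et al. [14]; the variable names are hypothetical.

```python
# Unit squares in the full 40 x 40 landscape and in the 20 x 20 neighborhood.
n_cells = 40 * 40
n_cells_inner = 20 * 20
n_houses = 400

# Lower bound: 400 houses can occupy at most 400 of the 1600 unit squares.
min_empty_share = 1 - n_houses / n_cells

# Expected share of empty unit squares under independent uniform placement:
# a given square is missed by every house with probability (1 - 1/n_cells)^n.
expected_empty_share = (1 - 1 / n_cells) ** n_houses
expected_empty_inner = (1 - 1 / n_cells_inner) ** n_houses

print(min_empty_share, round(expected_empty_share, 3), round(expected_empty_inner, 3))
```

Under this assumption, roughly 78% of the unit squares in the large square are expected to be empty, compared with roughly 37% when the sample space is restricted to the neighborhood.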
Ross et al.’s [
14]
Simulation I is based on distances to two known landmarks, but the true model is not estimated. Instead, Ross et al.’s [
14] Table 1 compares estimates from
two different mis-specified models based on incorrect landmarks for each of the two true underlying DGPs. In
Simulation II, Ross et al. [
14] estimate the true model and the latitude–longitude version of the model with the locations of the underlying true landmarks known but changing from sample to sample. The changing locations of the true landmarks add an additional layer of variability to the estimation results, which are already highly variable due to the unusual structure of the non-distance variable discussed earlier and the sparse coverage of the landscape afforded by the relatively small sample size in a relatively large neighborhood.
Finally, we note that there are two obvious omissions in the Ross et al. [
14] work. These are issues mentioned frequently by Ross et al. [
14] but not investigated in their paper. Despite their claim that the latitude–longitude approach is useful when the locations of the true landmarks are unknown or likely incorrect, Ross et al. [
14] do not directly compare their latitude–longitude formulation to a distance regression model based on distances to incorrect landmarks. A second omission concerns the optimal location in space. Ross et al. [
14] also frequently mention finding optimal locations in space as one of the virtues of their co-ordinate system, again, without any investigation of the issue. We address these omissions in our Monte Carlo simulations in the next section.
4. Our Monte Carlo Simulations
We conducted an extensive set of Monte Carlo experiments to examine omissions and extensions of the Ross et al. [
14] analysis. We combined features of Ross et al.’s [
14] Monte Carlo
Simulations I and
II into a single simulation. We estimated housing models containing from one to five known landmarks. The locations were fixed and given as follows: L1 = (−15, −15), L2 = (10, 10), L3 = (−5, −5), L4 = (0, 0) and L5 = (5, −13). The true model in each case is given by applying zero restrictions on the coefficients of the model below to change the number of distance variables included from one to five:
We also included the Ross et al. [
14] non-distance variable
x. For each version of the specification above, we estimated three models: the true model using distances to the true landmark(s), a model using distances to randomly chosen incorrect landmarks, and the Ross et al. [
14] latitude–longitude version of the model. From each trial, we collected information on the parameter estimates, the model
R2, the estimate of σ
2 and the implied optimal location, that is, the location of the highest property value. We used the same sample size (i.e., 400) as Ross et al. [
14] but conducted one million Monte Carlo trials in an effort to reduce the sample variation we discussed previously. The sample space was a 40 × 40 square centered at the origin from which we randomly generated house locations.
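The structure of one such experiment can be sketched in Python for the single-landmark case. This is an illustration under stated assumptions, not our actual simulation code. It borrows the values β0 = 1, β1 = 2 and γ1 = −0.25 from Section 3.1, uses the landmark L1, draws the incorrect landmark uniformly over the whole square, and runs only 300 trials. The function names `ols`, `one_trial` and `run` are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients and R-squared for a design matrix with intercept."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


def one_trial(rng, n=400, half_width=20.0, landmark=(-15.0, -15.0)):
    """Generate one sample and fit the three competing models."""
    lat = rng.uniform(-half_width, half_width, n)   # house co-ordinates
    lon = rng.uniform(-half_width, half_width, n)
    x = rng.normal(0.0, 1.0, n)                     # non-distance variable
    eps = rng.normal(0.0, 2.0, n)                   # error, standard deviation 2
    d_true = np.hypot(lat - landmark[0], lon - landmark[1])
    hp = 1.0 + 2.0 * x - 0.25 * d_true + eps        # assumed data-generating process

    wrong = rng.uniform(-half_width, half_width, 2)  # random incorrect landmark
    d_wrong = np.hypot(lat - wrong[0], lon - wrong[1])

    ones = np.ones(n)
    designs = {
        "true": np.column_stack([ones, x, d_true]),
        "wrong": np.column_stack([ones, x, d_wrong]),
        "latlon": np.column_stack([ones, x, lat, lon, lat**2, lon**2]),
    }
    return {name: ols(X, hp) for name, X in designs.items()}


def run(trials=300, seed=0):
    """Average the coefficient of x and the R-squared over the trials."""
    rng = np.random.default_rng(seed)
    out = {k: {"b_x": [], "r2": []} for k in ("true", "wrong", "latlon")}
    for _ in range(trials):
        for name, (beta, r2) in one_trial(rng).items():
            out[name]["b_x"].append(beta[1])
            out[name]["r2"].append(r2)
    return {k: {m: float(np.mean(v)) for m, v in d.items()} for k, d in out.items()}


summary = run()
for name, stats in summary.items():
    print(name, round(stats["b_x"], 3), round(stats["r2"], 3))
```

In a run of this kind, the average coefficient of x should sit close to 2 in all three specifications, while the average R2 of the incorrect-landmark model should fall well below that of the other two.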
The simulation results begin with
Table 1. Panel A contains the average regression statistics for each model. Panel B contains the model estimates for the co-ordinates of the optimal locations in space. For each trial, the optimal location in space was determined by recording the co-ordinates of the highest predicted house price for each model. These values were averaged over the simulation to obtain means and standard deviations for each set of co-ordinates. We also calculated and tracked the co-ordinates of the optimal location obtained using the analytic solution given above in Equations (2) and (3). This solution is of considerable practical importance in applied work. The fact that this approach uses estimated model parameters allows one to obtain point estimates of the optimal location and the estimated variances of these co-ordinates from a single sample. Approximate variances can be obtained using the delta method (see Greene [
25], pp. 172–173). A standard error for the maximum predicted value from any hedonic regression model can be obtained using the bootstrap.
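The analytic optimum and its delta-method variance can be sketched as follows. The sketch assumes that the analytic solution takes the form x* = −θ1/(2θ3) and y* = −θ2/(2θ4), which is what setting the partial derivatives of a quadratic in x and y to zero implies. The function name `analytic_optimum` and the numerical inputs are hypothetical and are not estimates from our tables.

```python
import numpy as np


def analytic_optimum(theta, cov):
    """Optimal location and its approximate covariance via the delta method.

    theta: (th1, th2, th3, th4), coefficients of x, y, x^2 and y^2.
    cov:   4 x 4 covariance matrix of those coefficient estimates.
    """
    th1, th2, th3, th4 = theta
    if th3 == 0 or th4 == 0:
        raise ValueError("optimum is undefined when a squared-term coefficient is zero")
    x_star = -th1 / (2.0 * th3)
    y_star = -th2 / (2.0 * th4)
    # Jacobian of (x*, y*) with respect to (th1, th2, th3, th4).
    jac = np.array([
        [-1.0 / (2.0 * th3), 0.0, th1 / (2.0 * th3**2), 0.0],
        [0.0, -1.0 / (2.0 * th4), 0.0, th2 / (2.0 * th4**2)],
    ])
    cov_opt = jac @ np.asarray(cov, dtype=float) @ jac.T
    return (x_star, y_star), cov_opt


# Hypothetical coefficient estimates and a diagonal covariance matrix.
theta = (-0.4, 0.6, -0.02, -0.03)
cov = np.diag([0.01, 0.01, 1e-4, 1e-4])
(x_star, y_star), cov_opt = analytic_optimum(theta, cov)
print(x_star, y_star, np.sqrt(np.diag(cov_opt)))
```

The diagonal of `cov_opt` gives the approximate variances of the two co-ordinates. Small squared-term coefficients inflate these variances quickly, because they enter the Jacobian in the denominator.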
The simulation results for models based on a single distance variable are given in Panel A of
Table 1. We first examined the model
R2 values. The
R2 for the true model is 0.744, for the latitude–longitude model, 0.725, and for the model with incorrect landmarks, 0.443. We see that the true model and the latitude–longitude version offer very similar explanatory power, with the model based on incorrect landmarks performing relatively poorly. Also worthy of note is that the estimated standard deviation of the model
R2 values for the true model and the latitude–longitude versions is 0.02, but the value for the model based on the distance to an incorrect landmark location is a much higher 0.15. We calculated the correlations between these
R2 values over the simulation. For the true model and the latitude–longitude version, the simple correlation is 0.94. For the true model and the latitude–longitude version, the correlations with the incorrect landmark model are 0.11 and 0.12, respectively. The high correlation between
R2 values from the true landmark model and the latitude–longitude version, along with the low standard deviations of the
R2 values, indicates that both models explain the data well and that the difference in performance between the two is small in this case. The model based on incorrect landmarks explains the data relatively poorly compared to the other two models, and the explanatory power does not track the others very closely. Although the average
R2 values are lower for the incorrect landmark model, the maximum values of
R2 are nearly equal. This is not surprising as the probability of choosing a random landmark near the true landmark is high in one million samples. However, this is a reminder that the landmark must be at some location in the neighborhood, and, though unknown, searching for the landmark location with a computer is a possibility.
Turning now to the parameter estimates, Column Two provides the results obtained from the estimation of the true model. The averages of all estimated model parameters, including the estimate of σ², are equal to their true values. For the latitude–longitude version of the model, we find that the coefficient of the non-distance variable is also 2.000 and equal to its true value. The estimated value of σ² is somewhat high at 4.321 compared to the true value of 4.000. For the model based on incorrect landmarks, the coefficient of the non-distance variable is also 2.000, while the estimate for σ² is much higher at 8.705 compared to the true value of 4.000. Thus, all cases support our claim about the absence of omitted variable bias in the coefficient of the non-distance variable. The reason for the absence of this bias is shown analytically in Appendix A.
Panel B of
Table 1 contains co-ordinate estimates for optimal locations. The correct landmark model yields the average location of the maximum to be (−13.965, 13.963), with the average maximum house price of 4.058. The latitude–longitude version yields the optimum co-ordinates of (−13.418, 13.421), with an average maximum house price of 3.205. The analytic solution provides nothing useful at this point but will be of value for other models. This may be a result of division by zero as we indicated earlier. For the model with incorrect landmarks, the location of the highest price is (−6.323, 6.326), with an average maximum house value of 1.806. The first two models give somewhat similar results, but the model based on a single incorrect landmark is far from the results of the other two models.
The simulation results for models based on two true distance variables are given in
Table 2. First, we note the average model
R2 values in Panel A. All three
R2 values increased over the single landmark results. For the true model, we obtained 0.761; for the latitude–longitude version, we obtained 0.749; and for the model based on incorrect landmarks, we obtained 0.590. The first two
R2 values are close, as before, and the model based on incorrect landmarks is closing the gap with the other two and enjoying a reduction in variance. The correlation between
R2 values for the true model and the latitude–longitude version is 0.96. The correlation between
R2 values for the incorrect landmark and either of the other two is about 0.14.
Column Two of Panel A in
Table 2 shows the coefficient estimates for the true model. Not surprisingly, the average estimated values of 1.000, 2.000, −0.250 and −0.010 are all exactly equal to their true values, as is the value of σ
2, which is equal to its true value of 4.000. For the latitude–longitude model, the
R2 is slightly lower than for the true model, as we saw previously, and, as before, the estimated value of σ
2 is larger than for the true model at 4.217. Again, the value of the non-distance coefficient is exactly 2.000. Column Four contains the estimation results for the model based on two incorrect landmarks. The coefficients are far different from the true value save for, once again, the coefficient of the non-distance variable, which is 2.000 on average. The estimated value for σ
2 is the highest in the group at 6.854.
Panel B of
Table 2 shows the average optimal location estimates from each of our models. Using the correct landmarks, the co-ordinates of the average optimum location are (−11.426, 13.660). The average house price there is 1.715. For the latitude–longitude version, the average co-ordinates of the optimal location are (−8.975, 13.859), with an average house price of 1.143. The analytic solution gives the optimal co-ordinates as (−10.797, 20.780). Using two incorrect landmarks yields an average optimal location of (−6.024, 11.381), with an average house price of 0.990. The optimal co-ordinates of the true model and the latitude–longitude version are similar, but the average predicted optimal house prices are closer for the latitude–longitude version and the incorrect landmark version.
Panel A of
Table 3 contains the estimation results for a true model containing three landmarks. The first row indicates the continued convergence of the model
R2 values for our three specifications. The
R2 for the model with true landmarks is 0.777; for the latitude–longitude version, it is 0.767; and for the model with three incorrect landmarks, the
R2 is 0.692. The standard deviations for each continue to decline, and the pairwise correlations continue to increase. For the true model and the latitude–longitude version, the correlation between
R2 values is 0.97. The correlation between
R2 values for the model using distances to incorrect landmarks is 0.20 with either of the other two.
Column Two of Panel A of
Table 3 indicates that the coefficients are well-estimated when the true landmarks are used. The coefficients of the intercept, the non-distance variable, the three true distance variables and the estimate of σ
2 are equal to their true values. For the latitude–longitude version, the coefficient of the non-distance variable equals its true value, and the estimate of σ
2 exceeds its true value by 0.181. In the version of the model using incorrect landmarks, the coefficient of the non-distance variable is also equal to its true value on average, and the estimate of σ
2 is highest at 5.523 but appears to be getting smaller with additional landmarks. The other coefficients in this model bear little resemblance to their true values.
Panel B of
Table 3 shows the estimates of the optimal house location based on these simulations. Using the correct landmarks, the highest valued house location is (−9.342, 9.316), with an estimated house value of −1.020. Using the latitude–longitude version of the model, the optimal location is (−7.751, 8.544), with an estimated house value of −1.401. The analytic solution using derivatives based on coefficients from the latitude–longitude regression yields the location of (−7.886, 8.544) for the highest value. For the model using incorrect landmarks, the optimal location is (−7.224, 7.952), with an optimal house value of −0.947, which is surprisingly
higher than the values obtained from either of the other two specifications. This is possible because there are one million different incorrect landmark locations from which to choose and a large (as we explained previously) random component of house prices.
The estimation results for hedonic regression models based on four landmarks are given in Panel A of
Table 4. Row Two contains the model
R2 values, and the convergence in values continues. For the model using distances to the correct landmarks, the model
R2 values average 0.801. The
R2 values for the latitude–longitude version and the model using incorrect landmarks follow closely at 0.790 and 0.750, respectively. The average correlation between R2 values for the true model and the latitude–longitude version is about 0.96, whereas the correlation between R2 values for the model using incorrect distance variables and those of the other two models remains low, although it rises to about 0.30.
As expected, the coefficients are well-estimated when the distances to the true landmarks are used, as the results in Column Two of Panel A indicate. All of the estimated coefficients, except the intercept, are exactly equal to their true values, while the intercept is low, on average, by 0.001. The average estimate of σ2 is 4.000. With the latitude–longitude version, the estimate of the coefficient of the non-distance variable is unbiased, but the estimate of σ2 is 0.222 higher than its true value. When the incorrect distance variables are used, the average bias in the estimate of the coefficient of the non-distance variable is zero, and the average estimate of σ2 is 5.033.
Panel B of
Table 4 shows the corresponding estimates of the optimal locations. For the model using true landmarks, the average optimum location is (−6.339, 6.027), and the house value there is, on average, −2.209. For the latitude–longitude version of the model, the average co-ordinates of the optimum are (−5.850, 6.509), and the house value there is, on average, −2.556. The analytic solution using coefficients from the latitude–longitude version of the model is very nearly the same as the location obtained from the highest predicted house values from the latitude–longitude version of the model. The average analytic location of the optimum is (−5.852, 6.514). The average optimum location using the incorrect landmarks is (−5.868, 6.333), with a house value of −1.920. For the model with four distance variables, the optimal locations are nearly the same, but the estimated variances of the co-ordinates found analytically are much lower than the others, perhaps only one-tenth as large as with the other methods. Again, the highest optimum house value is found using incorrect landmarks, with the latitude–longitude version giving the lowest maximum house price.
Table 5 contains the simulation results based on five landmarks. As indicated in Panel A of the table, the model
R2 values are consistent across specifications but slightly lower than the four-landmark models, as the result is dependent on the particular landmark locations chosen. For the regression model with the true distances, the
R2 is 0.786; for the latitude–longitude version, it is 0.770; and for the model including incorrect distance variables, it is 0.750. The two
R2 values for the true model and the latitude–longitude version fall slightly compared to the four-landmark values, but the
R2 for the model using incorrect landmarks is unchanged. An interesting result is that the variance of the
R2 values for both the true model and the latitude–longitude version of the model increases slightly with the addition of the fifth landmark. In the model based on incorrect landmarks, the addition of the fifth landmark caused the variance to
fall by about one-third, though it is still larger (about double) than that of the other two models. For the model using the correct landmarks or the latitude–longitude version of the model, the estimated standard deviations are 0.02. For the model with incorrect distances, the value is now 0.04. The average correlation between
R2 values for the true model and the latitude–longitude version of the model fell slightly to 0.95, but the correlation between
R2 values for the model based on incorrect distance variables and either of the other two
R2s increased to 0.46.
The simulation results for models using correct distance variables are shown in Column Two of Panel A in Table 5. The results show that the intercept is underestimated by 0.001, and all other estimated coefficients and the estimates of σ² have mean values equal to the true values (to three decimal places). The latitude–longitude version of the model is shown in Column Three of Table 5. The average estimated coefficient of the non-distance variable is 2.000, exactly equal to its true value. The average estimate of σ² is somewhat high, as usual, at 4.288. The results based on incorrect landmarks are given in Column Four of the table. Once more, the average value of the coefficient of the non-distance variable is 2.000. The average estimate of σ² is high at 4.688 but seems to fall as more landmarks are added.
Panel B of Table 5 presents the estimated optimal locations for each model and the analytic version. There is general agreement as to the optimal location. Using the correct distances, the optimal location is estimated to be (−4.015, 2.408), with an average house value of −4.231. Using the latitude–longitude version, the optimal location is (−4.021, 3.556), with an average house value of −4.572. The analytic solution gives (−4.019, 3.553) as the optimal location, and the optimal location using incorrect distance variables is (−4.288, 3.558), with an average house price of −4.005. These models all seem to give similar locations for the optimum, and there is good agreement between the two methods of regression and the analytic approach based on the latitude–longitude version of the model. Note that the model with incorrect distance variables produced the highest-valued house, but this may be misleading because that estimate is based on one million different sets of distance variables.
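The analytic optimum referred to here follows from setting the partial derivatives of the latitude–longitude terms to zero. A minimal sketch is given below, assuming the quadratic form θ1·x + θ2·y + θ3·x² + θ4·y² with no interaction term; the function name and the coefficient values are hypothetical and are not estimates from Table 5.

```python
def ll_optimum(theta1, theta2, theta3, theta4):
    """Stationary point of theta1*x + theta2*y + theta3*x**2 + theta4*y**2.

    Returns (x_star, y_star, is_maximum). The point is a maximum only
    when both squared-term coefficients are negative.
    """
    if theta3 == 0 or theta4 == 0:
        # The first-order conditions cannot be solved in this case.
        raise ValueError("no stationary point when a squared-term coefficient is zero")
    x_star = -theta1 / (2.0 * theta3)
    y_star = -theta2 / (2.0 * theta4)
    is_maximum = theta3 < 0 and theta4 < 0
    return x_star, y_star, is_maximum


# Hypothetical coefficients, chosen only to illustrate the calculation.
x_star, y_star, is_max = ll_optimum(-0.8, 0.7, -0.1, -0.1)
```

The explicit check on θ3 and θ4 mirrors the caveat that the analytic solution does not exist if either coefficient equals zero.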
5. Omitted Variable Bias and Correlated Regressors
In this section, we take a different approach and present the results of a simulation that is unlike our previous simulations but goes directly to the heart of the two main issues in Ross et al. [
14]. These are (1) the extent to which the latitude–longitude system is a substitute for distance measures and (2) the effectiveness of the latitude–longitude system at reducing omitted variable bias. Unlike prior simulations, the true model in the upcoming simulations contains only two independent variables, both of which are true distance variables. We examined the impact of omitting one of the distance variables on the coefficient estimate for the other distance variable when the omitted distance variable is replaced with the latitude–longitude system.
In these simulations, we fixed the location of one landmark and changed the location of the second landmark to alter the correlation between the resulting two distance measures, as this correlation is central to an examination of omitted variable bias. Unlike the previous simulations, there actually
is omitted variable bias in the slope estimator of the remaining distance variable. We sought to determine the effectiveness of the latitude–longitude system in reducing or eliminating this bias. In each Monte Carlo experiment, we replaced the omitted distance variable with the Ross et al. [
14] co-ordinate system to determine whether the resulting omitted variable bias is reduced by the inclusion of the co-ordinate system. We examined the impact of this replacement on the estimated coefficient of the included distance variable and noted how this impact is affected by the correlation between the included and excluded distance variables.
The model examined in this simulation is like those previously estimated. In particular, the true model is
$$HP = \beta_0 + \beta_A D_A + \beta_B D_B + \varepsilon,$$
where, as before, A and B denote the true and known locations of Landmarks A and B, and $D_A$ and $D_B$ represent the respective distances to them. The coefficient of interest is $\beta_A$, which is equal to −0.1; this is also the true value of $\beta_B$. The true value of the intercept is 1.00 and that of σ² is 4.00. The problem for estimation is that $D_B$ is unknown and replaced by the Ross et al. [14] co-ordinate system to yield the estimated model:
$$HP = \beta_0 + \beta_A D_A + \theta_1 x + \theta_2 y + \theta_3 x^2 + \theta_4 y^2 + u.$$
By allowing the location of B to change relative to A, we varied the correlation between the distances to A and B, which changes the impact of the omitted variable, $D_B$, on the size of the bias in the estimated coefficient of the included distance variable, $D_A$.
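The role of this correlation can be seen from the standard omitted variable result for a regression with one included and one omitted regressor. As a sketch, write $\beta_A$ and $\beta_B$ for the true slopes on the distances to A and B, $b_A$ for the OLS slope when the distance to B is simply dropped, $\rho_{AB}$ for the sample correlation between the two distances, and $s_A$ and $s_B$ for their sample standard deviations (these symbols are assumed here for illustration). Conditional on the sampled locations,

$$E[b_A] = \beta_A + \beta_B \, \frac{\widehat{\mathrm{Cov}}(D_A, D_B)}{\widehat{\mathrm{Var}}(D_A)} = \beta_A + \beta_B \, \rho_{AB} \, \frac{s_B}{s_A}.$$

With $\beta_B = -0.1$, the bias vanishes when $\rho_{AB} = 0$, is positive (toward zero) when the distances are negatively correlated, and is negative when they are positively correlated. This expression applies to the model with no replacement for the omitted distance; it does not describe the model that adds the latitude–longitude terms.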
For this simulation, we used random samples from a square with sides of length 20 centered at the origin. The location of Landmark A is fixed at (10, 0). The location of B varies across simulations, taking the values (−10, 0), (−10, 5), (−10, 10), (−5, 10), (0, 10), (5, 10), (10, 10) and (10, 5). These changing B locations allow the correlations between distances to change from large negative values to large positive values, respectively. For each pair of true landmarks, we conducted 10,000 Monte Carlo experiments. For each Monte Carlo experiment, three models were estimated: the
true model, including distances to both
A and
B; the model with distance to
B omitted and thus subject to omitted variable bias; and the model with distance to
B omitted but replaced with the Ross et al. [
14] latitude–longitude system or LL. The results of this simulation are shown in
Table 6.
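The design just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample size, the number of replications, the seed, uniform sampling over the square, and normally distributed errors are choices made here and are not taken from the study, so the output will not reproduce Table 6.

```python
import numpy as np


def simulate(b_loc, n_obs=200, n_reps=500, seed=0):
    """Mean and RMSE of the slope on distance to A under three specifications."""
    rng = np.random.default_rng(seed)
    a_loc = np.array([10.0, 0.0])            # Landmark A, fixed
    b_loc = np.asarray(b_loc, dtype=float)   # Landmark B, varied by the caller
    slopes = {"true": [], "omitted": [], "LL": []}
    for _ in range(n_reps):
        # House locations: uniform on the square [-10, 10] x [-10, 10] (assumed).
        xy = rng.uniform(-10.0, 10.0, size=(n_obs, 2))
        d_a = np.linalg.norm(xy - a_loc, axis=1)
        d_b = np.linalg.norm(xy - b_loc, axis=1)
        # True model: intercept 1, both slopes -0.1, error variance 4.
        hp = 1.0 - 0.1 * d_a - 0.1 * d_b + rng.normal(0.0, 2.0, n_obs)
        ones = np.ones(n_obs)
        x, y = xy[:, 0], xy[:, 1]
        designs = {
            "true": np.column_stack([ones, d_a, d_b]),
            "omitted": np.column_stack([ones, d_a]),
            "LL": np.column_stack([ones, d_a, x, y, x**2, y**2]),
        }
        for name, design in designs.items():
            coef = np.linalg.lstsq(design, hp, rcond=None)[0]
            slopes[name].append(coef[1])     # coefficient on distance to A
    summary = {}
    for name, values in slopes.items():
        values = np.array(values)
        rmse = np.sqrt(np.mean((values + 0.1) ** 2))  # true slope is -0.1
        summary[name] = (values.mean(), rmse)
    return summary


results = simulate(b_loc=(-10.0, 0.0))
```

Calling `simulate` once for each of the eight B locations gives one row per location, with the mean slope and RMSE for the true, omitted and LL specifications.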
As the first column indicates, the average correlations between the distances to the true landmarks are −0.821, −0.752, −0.620, −0.379, 0.044, 0.452, 0.680 and 0.876. For the true model with both distance variables included, the estimated average slope of the distance to Landmark A is −0.100 regardless of the landmark distance correlation. Note that in some of these cases, the correlation between distance measures has a large magnitude (−0.821 and 0.876), and the slope estimator remains unbiased just as the theory suggests; multicollinearity does not invalidate the statistical properties of the OLS estimator and is not a problem that needs a solution.
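The way the landmark geometry drives these correlations can be checked directly. The sketch below assumes house locations drawn uniformly from the square and uses a single large sample (the sample size and seed are arbitrary choices made here), so the values it produces are approximations and need not match the averages reported in the table.

```python
import numpy as np

rng = np.random.default_rng(1)
# One large sample of house locations, uniform on the square (assumed).
xy = rng.uniform(-10.0, 10.0, size=(100_000, 2))

a_loc = np.array([10.0, 0.0])
b_locs = [(-10, 0), (-10, 5), (-10, 10), (-5, 10),
          (0, 10), (5, 10), (10, 10), (10, 5)]

d_a = np.linalg.norm(xy - a_loc, axis=1)
corrs = {}
for b in b_locs:
    d_b = np.linalg.norm(xy - np.array(b, dtype=float), axis=1)
    # Pearson correlation between the distances to A and to B.
    corrs[b] = np.corrcoef(d_a, d_b)[0, 1]

for b, r in corrs.items():
    print(b, round(r, 3))
```

Landmarks on opposite sides of the square produce distances that move in opposite directions, while a B located close to A produces distances that move together.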
A more interesting comparison can be made between the estimates from the model using the Ross et al. [
14] co-ordinate system (LL) and estimates from the model with distance to
B omitted with no replacement (
omitted). Of the eight cases presented in
Table 6, the mean slope estimate with the distance to
B replaced by the LL system is closer to the truth in seven of the eight cases compared to the
omitted variable model, and, on average, the latitude–longitude estimates are much closer. These results appear to suggest that the use of the Ross et al. [
14] co-ordinate system provides some relief for omitted variable bias. However, this is misleading as the overall performance of the co-ordinate system as a replacement for the distance variable is poor.
This conclusion is apparent if one examines the root mean square errors (RMSEs) for the estimates. As the table shows, the RMSEs for the true model are the smallest in every case, ranging from 0.020 to 0.041. The RMSEs for the omitted variable model range from 0.021 to 0.094; they are generally not nearly as good as those of the true model but are close when the correlation between the distance variables is near zero. However, the RMSEs for the latitude–longitude version of the model range from 0.146 to 0.166, with little variation. The latitude–longitude model gives RMSEs about twice as large as those of the omitted variable model regardless of the correlation.
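An estimator whose mean is closer to the truth can still have a larger RMSE, because the RMSE combines bias and sampling variability. As a reminder of the standard decomposition, with $b_A$ denoting an estimator of the slope on the distance to A and $\beta_A$ its true value (notation assumed here for illustration):

$$\mathrm{RMSE}(b_A) = \sqrt{E\left[(b_A - \beta_A)^2\right]} = \sqrt{\left(E[b_A] - \beta_A\right)^2 + \mathrm{Var}(b_A)}.$$

A specification can therefore shrink the first term while inflating the second by more, which is one way to read the pattern of a mean slope nearer to −0.1 alongside a much larger RMSE.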
As expected, the RMSEs for the true model and the omitted variable model are nearly equal when the correlation is near zero, which is exactly what the theory of omitted variable bias presented earlier predicts (and is exactly the situation with the Ross et al. [
14] non-distance variable). On the other hand, the LL RMSEs are, by far, the largest and are
independent of the correlations. The LL RMSEs hover around 0.150 regardless of the correlation. This value is about five to seven times the average RMSEs from the true model and about
two to three times larger than the average RMSEs from the omitted variable model. Thus, we conclude that using the LL model in this case is demonstrably worse than doing nothing; the omitted variable model is better, often much better. Furthermore, this simulation indicates that the use of the latitude–longitude system does nothing to reduce omitted variable bias in models actually subjected to omitted variable bias and could make the situation worse.
6. Summary and Conclusions
As pointed out above, hedonic house price studies typically incorporate information about location into the non-stochastic portion of regression models either by including a set of dummy variables to represent individual locations called “neighborhoods” or by using a set of distance (or travel time) variables to characterize locations in terms of proximity to amenities and dis-amenities. The latter approach requires relatively few additional parameters, although, unlike the dummy variable approach, the locations of the amenities and dis-amenities must be known. Ross et al. [14] present an alternative to the use of distance variables in hedonic house price regressions that replaces the distance variables with a latitude–longitude co-ordinate system, which requires no information on the locations or the number of amenities or dis-amenities. Based on Monte Carlo simulation results, they claim that the use of the latitude–longitude system eliminates omitted variable bias in the estimation of the coefficients of non-distance variables in the case where the locations of the amenities are unknown or incorrect.
Despite the claims of Ross et al. [
14], we show that their models are not subject to omitted variable bias due to their construction and inclusion of a non-distance variable. Thus, the latitude–longitude system cannot be credited with eliminating omitted variable bias in their simulations, as there is none to eliminate. The random nature of their non-distance variable removes the possibility of omitted variable bias, as their non-distance variable is uncorrelated with other regressors. In addition to showing how the theory predicts no bias in this case, we also illustrate this fact with several simulations. The results of 15 simulations presented above in this study reveal that the average value of the coefficient estimates attached to their non-distance variable is equal to its true value of 2.000 in 14 of the simulations, while in the 1 remaining simulation, the average value is 2.001. Moreover, we failed to encounter any problems with extreme multicollinearity, even though we estimated 15 different models, containing from one to five distance variables (both correct and incorrect), one million times. We report that the model
R2 values for the model using correct distances and the Ross et al. [
14] latitude–longitude version were nearly equal and highly correlated over different specifications. The model using the incorrect distance variables yielded a lower
R2 than the other two models initially, but the gap closed as more incorrect distance variables were added. After three or four distance variables were added, the average
R2 values were similar for the three models, although the
R2 for the incorrect distance model had a larger variance that decreased with the addition of more distance variables.
We extended the Ross et al. [14] Monte Carlo experiments to examine how well their co-ordinate system can reduce biases in the coefficients of distance variables when the regressors actually are correlated. The results of this simulation are based on a true regression model containing two independent variables, both true distance variables, one of which is omitted. We altered the location of the omitted landmark to change the correlation between the omitted and included distance variables, as this relationship underlies any omitted variable bias. We investigated the parameter estimates for the retained distance variable from the estimation of the true model, the model with the omitted variable, and a model including the retained distance variable and the Ross et al. [
14] co-ordinate system. We allowed the correlations between the true distance variables to change and found that in terms of the root-mean-squared error, the
omitted variables model provides better estimates than the model that replaces the missing distance variable with the Ross et al. [
14] co-ordinate system.
What benefits, if any, does the Ross et al. [
14] co-ordinate system provide? In the absence of any distance variables, the use of the system can lead to a higher model
R2, although a model with a few distance variables to incorrect locations can also improve the explanatory power (i.e., choosing locations for incorrect landmarks to use is easy to implement and does appear to provide some benefits). Our results seem to suggest that distances to incorrect landmarks are better than no distance variables at all. Is increasing the
R2 a desirable goal? If the goal is the prediction of the dependent variable, the answer is yes. However, given that
R2 is a non-decreasing function of the number of explanatory variables, the inclusion of
any additional variables will cause an increase or at least no decrease in
R2. If prediction of the dependent variable is not the objective, but rather explaining the dependent variable is, then one should consider specification issues based on some underlying theory rather than adding variables in order to maximize
R2.
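The non-decreasing property of R2 is easy to verify numerically. The following is a minimal sketch on simulated data, where every name, the sample size and the coefficient values are hypothetical choices made for illustration: a regressor that is unrelated to the dependent variable by construction is added to an OLS model with an intercept, and R2 is compared before and after.

```python
import numpy as np


def r_squared(design, y):
    """R2 from an OLS fit of y on a design matrix that includes an intercept."""
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=2.0, size=n)
noise = rng.normal(size=n)  # generated independently of y

base = np.column_stack([np.ones(n), x1])
augmented = np.column_stack([base, noise])

r2_base = r_squared(base, y)
r2_aug = r_squared(augmented, y)
print(r2_base, r2_aug)
```

Because the augmented model nests the base model, its residual sum of squares cannot be larger, so `r2_aug` is never below `r2_base` (up to floating-point error) even though the added variable carries no information about the dependent variable.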