Article

Measurement Errors Arising When Using Distances in Microeconometric Modelling and the Individuals’ Position Is Geo-Masked for Confidentiality

Giuseppe Arbia, Giuseppe Espa and Diego Giuliani

1 Department of Statistical Science, Catholic University of the Sacred Heart, Rome 00168, Italy
2 Department of Economics and Management, University of Trento, Trento 38122, Italy
* Author to whom correspondence should be addressed.
Econometrics 2015, 3(4), 709-718; https://doi.org/10.3390/econometrics3040709
Submission received: 25 August 2015 / Revised: 16 October 2015 / Accepted: 20 October 2015 / Published: 29 October 2015
(This article belongs to the Special Issue Spatial Econometrics)

Abstract

In many microeconometric models we use distances. For instance, in modelling the individual behavior in labor economics or in health studies, the distance from a relevant point of interest (such as a hospital or a workplace) is often used as a predictor in a regression framework. However, in order to preserve confidentiality, spatial micro-data are often geo-masked, thus reducing their quality and dramatically distorting the inferential conclusions. In particular in this case, a measurement error is introduced in the independent variable which negatively affects the properties of the estimators. This paper studies these negative effects, discusses their consequences, and suggests possible interpretations and directions to data producers, end users, and practitioners.

1. Introduction

Microeconometric studies often use distances. For instance, in labor, schooling, or health studies, the distance between each observed individual and a conspicuous point (e.g., a hospital, a workplace, or a school) is often used as a predictor in a regression model. Furthermore, if our aim is to take into account the interactions between individuals using econometric methods [1], we often build weight matrices based on some inverse-distance function, using the pairwise distances between individuals as the basis for the calculation. When analyzing granular micro spatial data (e.g., household surveys), the coordinates of each individual are often displaced according to some random method in order to preserve confidentiality. For instance, in some recent DHS surveys the coordinates are first collected in the field using GPS receivers accurate to within 15 m, but then, in order to ensure that respondent confidentiality is maintained, the observed points are geo-masked along a random distance and a random angle (see, e.g., [2]) using different maximum error distances in urban and rural areas (see [3,4,5] for details). A second method is Gaussian geo-masking, which consists in randomly reallocating the point in the neighborhood of the true location, with a probability described by a bivariate Gaussian density function centered on the true point.

This paper's contribution consists in analyzing the problems induced by geo-masking in econometric models as a problem of measurement error. In Section 2 we consider the measurement error induced by geo-masking when the distance is used as a predictor in a regression model. Section 3 concludes with some practical comments and directions to help statistics producers calibrate the geo-masking procedures before releasing the data to the public.

2. Effects of Geo-Masking When We Use the Distance as a Predictor in a Regression

2.1. Generalities

Let us consider the estimation of a linear regression when the individuals’ true coordinates are not disclosed for confidentiality and a distance from a relevant point is used as a predictor. For instance, in health economics it is common practice to postulate a relationship between a health outcome for each individual (such as the effect of a health policy), say y, and the individual’s distance (say d) from a clinic or a hospital. To illustrate the essence of the problem we will restrict ourselves to the case, admittedly unrealistic, of a simple regression without any further predictors. Furthermore, to simplify the algebra without compromising the generality of the results, we will postulate a relationship between the health outcome and the squared distance from a relevant point. When data are geo-masked, the distance between each individual and a conspicuous point will be upward biased (as shown, e.g., by Arbia et al., 2015 [5] and Elkies et al., 2015 [6]). This paper shows that, when the masking procedure is disclosed, this information can be taken into account for the benefit of the analysis.
To start with, let us recall that classical measurement error theory (e.g., [7]) defines the true model as:
$$y_{ij} = \alpha + \beta d_{ij}^2 + v_{ij} \tag{1}$$
for each individual observed at the point of coordinates (i,j), with $d_{ij}$ the distance between point (i,j) and the point of interest and $v_{ij} \sim \text{n.i.d.}(0, \sigma_v^2)$. The distance is observed with an error due to geo-masking, and the measurement error is defined as:
$$u_{ij} = \bar{d}_{ij}^2 - d_{ij}^2$$
with $\bar{d}_{ij}^2$ the squared distance observed after geo-masking. Following the classical theory, $u_{ij}$ is assumed to be such that $E(u_{ij}) = 0$, $Var(u_{ij}) = \sigma_u^2$ = constant, and $u_{ij}$ independent of $v_{ij}$ and of $d_{ij}^2$. Normality of u is also often assumed. Under these conditions, calling $\hat{\beta}$ the OLS estimator of β, such an estimator will still be unbiased, but less efficient, since its variance can now be expressed as:
$$Var(\hat{\beta}) = \frac{\beta^2 \sigma_u^2 + \sigma_v^2}{n \sigma_{d^2}^2} > \frac{\sigma_v^2}{n \sigma_{d^2}^2} \tag{2}$$
The estimator will also be inconsistent, with a downward asymptotic bias towards zero (called attenuation) quantified by the expression [7]:
$$\text{plim}\, \hat{\beta} = \beta \left( \frac{\sigma_{d^2}^2}{\sigma_u^2 + \sigma_{d^2}^2} \right) < \beta \tag{3}$$
with $\sigma_{d^2}^2$ the variance of the squared uncontaminated distances.
However, in the case of a measurement error induced by geo-masking, the results are quite different from the classical, as we will illustrate in the next sections.
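The classical attenuation result recalled above can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the regressor is a generic Gaussian variable rather than a squared distance, and the function names, sample size, and parameter values are hypothetical choices, not taken from the paper.

```python
import random


def ols_slope(x, y):
    """OLS slope of y on x in a simple regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def simulate_classical_attenuation(n=100_000, alpha=1.0, beta=2.0,
                                   sd_x=1.0, sd_u=0.5, sd_v=1.0, seed=0):
    """Compare the OLS slope on a noisy regressor with the plim formula."""
    rng = random.Random(seed)
    # True regressor (assumed Gaussian here purely for illustration).
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    # Outcome generated from the true regressor.
    y = [alpha + beta * x + rng.gauss(0.0, sd_v) for x in x_true]
    # Observed regressor contaminated by a classical (zero-mean) error.
    x_obs = [x + rng.gauss(0.0, sd_u) for x in x_true]
    slope = ols_slope(x_obs, y)
    # Attenuated limit: beta * var_x / (var_u + var_x).
    predicted = beta * sd_x ** 2 / (sd_u ** 2 + sd_x ** 2)
    return slope, predicted


slope, predicted = simulate_classical_attenuation()
print(f"OLS slope on the noisy regressor: {slope:.3f}")
print(f"attenuated limit from the formula: {predicted:.3f}")
```

With these illustrative settings the estimated slope should fall visibly below the true β = 2 and close to the value given by the attenuation formula.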

2.2. Gaussian Geo-Masking within a Circle

Let us start by considering the effect of geo-masking when a distance is used as a regressor in an econometric model, in the case of Gaussian geo-masking, that is, when the individuals’ true location is perturbed with a bivariate Gaussian distribution centered on the true point. More formally, let us consider the point of coordinates (i,j) and let us geo-mask this point by disclosing, instead, the coordinates $\bar{i} = i + \varepsilon_i$ and $\bar{j} = j + \varepsilon_j$, with $\varepsilon_i, \varepsilon_j \sim \text{n.i.d.}(0, \sigma_\varepsilon^2)$. Let us further consider, for each individual point, its distance from a conspicuous point that, without loss of generality, we can place at the origin of the Cartesian system. In this case the true squared distance of the point of coordinates (i,j) from the conspicuous point is given by:
$$d_{ij}^2 = i^2 + j^2$$
while, after geo-masking, we observe instead
$$\bar{d}_{ij}^2 = \bar{i}^2 + \bar{j}^2 = d_{ij}^2 + (\varepsilon_i^2 + \varepsilon_j^2) + 2(i \varepsilon_i + j \varepsilon_j) = d_{ij}^2 + u_{ij} \tag{4}$$
because of our definitions.
So the term u defines the measurement error on the independent variable of the model as in the classical theory. However, in contrast with the classical theory, in this case we have:
$$E(u_{ij}) = 2\sigma_\varepsilon^2 \neq 0 \quad \forall\, i, j \tag{5}$$
and
$$Var(u_{ij}) = 4\sigma_\varepsilon^2 (\sigma_\varepsilon^2 + d_{ij}^2) \tag{6}$$
Thus the measurement error has non-zero mean and non-constant variances. (See Appendix A for the proof). The non-zero mean does not affect the point estimate of the parameter β, but only the constant term. Equation (6) shows that the procedure of geo-masking also induces heteroscedasticity.
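The two moments of the Gaussian geo-masking error can be checked numerically. The sketch below displaces a single point many times and compares the simulated mean and variance of u with 2σ² and 4σ²(σ² + d²). The point (0.6, 0.4), the value σ = 0.1, the number of draws, and the function names are illustrative assumptions, not values from the paper.

```python
import random


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def gaussian_mask_errors(i, j, sigma, n_draws=200_000, seed=0):
    """Simulate u = (masked squared distance) - (true squared distance)."""
    rng = random.Random(seed)
    d2 = i * i + j * j  # true squared distance from the origin
    errors = []
    for _ in range(n_draws):
        # Independent Gaussian displacement of each coordinate.
        i_bar = i + rng.gauss(0.0, sigma)
        j_bar = j + rng.gauss(0.0, sigma)
        errors.append(i_bar * i_bar + j_bar * j_bar - d2)
    return errors


i, j, sigma = 0.6, 0.4, 0.1  # hypothetical point and masking scale
u = gaussian_mask_errors(i, j, sigma)
mean_u, var_u = mean_and_variance(u)
mean_theory = 2 * sigma ** 2
var_theory = 4 * sigma ** 2 * (sigma ** 2 + i * i + j * j)
print(f"mean of u: simulated {mean_u:.5f}, closed form {mean_theory:.5f}")
print(f"var of u:  simulated {var_u:.5f}, closed form {var_theory:.5f}")
```

Repeating the experiment for points farther from the origin should show the variance growing with d², which is the heteroscedasticity discussed in the text.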
Furthermore from Equation (4) we have:
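$$u_{ij} = \bar{d}_{ij}^2 - d_{ij}^2 = (i + \varepsilon_i)^2 + (j + \varepsilon_j)^2 - d_{ij}^2 \tag{7}$$
As a consequence (since $d_{ij}^2$, i and j are constant terms and $\varepsilon_i, \varepsilon_j \sim \text{n.i.d.}(0, \sigma_\varepsilon^2)$), the rescaled observed squared distance $\bar{d}_{ij}^2 / \sigma_\varepsilon^2$ is distributed as $\chi_2^2(\lambda)$, a non-central Chi-squared with 2 degrees of freedom and non-centrality parameter $\lambda = d_{ij}^2 / \sigma_\varepsilon^2$, so that u is this variable rescaled by $\sigma_\varepsilon^2$ and shifted by the constant $-d_{ij}^2$. The proof is left to Appendix B.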
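As a consistency check, which is not part of the original derivation, the moments implied by the non-central Chi-squared can be compared with Equations (5) and (6). The sketch assumes the standard results $E[\chi_k^2(\lambda)] = k + \lambda$ and $Var[\chi_k^2(\lambda)] = 2(k + 2\lambda)$, and introduces the auxiliary symbol $W = \bar{d}_{ij}^2 / \sigma_\varepsilon^2 \sim \chi_2^2(\lambda)$ with $\lambda = d_{ij}^2 / \sigma_\varepsilon^2$, so that $u_{ij} = \sigma_\varepsilon^2 W - d_{ij}^2$:

$$E(u_{ij}) = \sigma_\varepsilon^2 (2 + \lambda) - d_{ij}^2 = 2\sigma_\varepsilon^2 + d_{ij}^2 - d_{ij}^2 = 2\sigma_\varepsilon^2$$
$$Var(u_{ij}) = \sigma_\varepsilon^4 \cdot 2(2 + 2\lambda) = 4\sigma_\varepsilon^4 + 4\sigma_\varepsilon^2 d_{ij}^2 = 4\sigma_\varepsilon^2 (\sigma_\varepsilon^2 + d_{ij}^2)$$

Both expressions agree with the mean and variance obtained directly in Appendix A.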
Following the classical theory, and recalling Equations (2) and (3), the OLS estimator will be less efficient and inconsistent. In particular, in the case of Gaussian geo-masking, the variance of the OLS estimator will be:
$$Var(\hat{\beta})_{ij} = \frac{4\beta^2 \sigma_\varepsilon^2 (\sigma_\varepsilon^2 + d_{ij}^2) + \sigma_v^2}{n \sigma_{d^2}^2} > \frac{\sigma_v^2}{n \sigma_{d^2}^2} \tag{8}$$
Thus, the larger $\sigma_\varepsilon^2$ is and the larger the distance from the conspicuous point is, the lower the precision of the estimate will be. The precision also depends on the square of the true value of β. Furthermore, using Equation (6) to evaluate the attenuation, we have:
$$\text{plim}\, \hat{\beta} = \beta \left( \frac{\sigma_{d^2}^2}{\sigma_u^2 + \sigma_{d^2}^2} \right) = \beta \left( \frac{\sigma_{d^2}^2}{4\sigma_\varepsilon^2 (\sigma_\varepsilon^2 + d_{ij}^2) + \sigma_{d^2}^2} \right) \tag{9}$$
which shows that the attenuation effect on the OLS estimator is greater in the presence of a larger geo-masking variance or of larger distances.
In practical cases, to communicate with practitioners, it is useful to describe the Gaussian geo-masking mechanism with reference to a maximum displacement distance, which is easier for non-specialists to interpret than a variance. Since in a Gaussian distribution $P(-3\sigma \le x \le 3\sigma) = 0.9973$, with a probability close to 1 we can assume that the maximum displacement distance is 3σ. If we call θ* such maximum distance, we have that 3σ = θ*. So the expected measurement error is $E(u) = 2\sigma_\varepsilon^2 = \frac{2}{9}\theta^{*2}$ and the bias can be seen as a fraction of the maximum squared displacement distance. Furthermore, its variance can be expressed as $Var(u) = \frac{4}{81}\theta^{*4} + \frac{4}{9}\theta^{*2} d_{ij}^2$, which shows that uncertainty increases with the maximum displacement distance and with the absolute position of the individual with respect to the conspicuous point. By using this alternative expression, the variance of the OLS estimator can be expressed as:
$$Var(\hat{\beta})_{ij} = \frac{\beta^2 \left( \frac{4}{81}\theta^{*4} + \frac{4}{9}\theta^{*2} d_{ij}^2 \right) + \sigma_v^2}{n \sigma_{d^2}^2} \tag{10}$$
and the attenuation effect as:
$$\text{plim}\, \hat{\beta} = \beta \left( \frac{\sigma_{d^2}^2}{\sigma_u^2 + \sigma_{d^2}^2} \right) = \beta \left( \frac{\sigma_{d^2}^2}{\frac{4}{81}\theta^{*4} + \frac{4}{9}\theta^{*2} d_{ij}^2 + \sigma_{d^2}^2} \right) \tag{11}$$
which shows more intuitively the negative effects of geo-masking on the OLS estimates. The greater the maximum displacement distance is, the larger both the loss in efficiency and the attenuation effect will be.
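The change of parameter from the variance to the maximum displacement distance can be written out explicitly. Substituting $\sigma_\varepsilon = \theta^*/3$ into Equations (5) and (6) gives:

$$E(u) = 2\sigma_\varepsilon^2 = 2\left(\frac{\theta^*}{3}\right)^2 = \frac{2}{9}\theta^{*2}$$
$$Var(u) = 4\sigma_\varepsilon^2 (\sigma_\varepsilon^2 + d_{ij}^2) = \frac{4}{9}\theta^{*2}\left(\frac{1}{9}\theta^{*2} + d_{ij}^2\right) = \frac{4}{81}\theta^{*4} + \frac{4}{9}\theta^{*2} d_{ij}^2$$

As a purely illustrative numerical case (the values are assumptions, not taken from the paper), a maximum displacement $\theta^* = 0.3$ corresponds to $\sigma_\varepsilon = 0.1$; for a point with $d_{ij}^2 = 0.52$ this gives $E(u) = 0.02$ and $Var(u) = 0.0004 + 0.0208 = 0.0212$.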

2.3. Uniform Geo-Masking within a Circle

Let us now turn to analyze the effects of a uniform geo-masking (such as the one employed, e.g., by DHS, 2013 [4]), that is a mechanism which transforms the coordinates displacing them along a random angle (say δ) and a random distance (say θ) both obeying a uniform probability law. The mechanism can be formally expressed through the following hypotheses:
HP1: $\theta \sim \text{i.i.d. } U(0, \theta^*)$ and $\delta \sim \text{i.i.d. } U(0, 360°)$, with θ* the maximum distance error, and
HP2: θ and δ are independent.
Assuming again, without loss of generality, that the conspicuous point is located at the origin, the true squared distance between the point of coordinates (i,j) and the conspicuous point before geo-masking is $d_{ij}^2 = i^2 + j^2$, while, after geo-masking, it can be expressed, using polar coordinates, as:
$$\bar{d}_{ij}^2 = (i + \theta \cos\delta)^2 + (j + \theta \sin\delta)^2 \tag{12}$$
Expanding Equation (12) we obtain:
$$\bar{d}_{ij}^2 = i^2 + 2i\theta \cos\delta + \theta^2 \cos^2\delta + j^2 + 2j\theta \sin\delta + \theta^2 \sin^2\delta$$
so that we can now express the measurement error as:
$$u_{ij} = \bar{d}_{ij}^2 - d_{ij}^2 = 2\theta (i \cos\delta + j \sin\delta) + \theta^2 (\cos^2\delta + \sin^2\delta) \tag{13}$$
Similarly to the Gaussian case, we have a non-zero mean and a non-constant variance, given by:
$$E(u_{ij}) = \frac{\theta^{*2}}{3} \neq 0 \tag{14}$$
and
$$Var(u_{ij}) = \frac{17}{180}\theta^{*4} + \frac{2}{3}\theta^{*2} d_{ij}^2 \tag{15}$$
The proofs are left to Appendix C.
So, consistently with the results obtained with Gaussian geo-masking and in line with intuition, the variance of the measurement error increases as the maximum displacement distance θ* increases and as we move away from the conspicuous point.
If we use this result again to provide explicit expressions for the estimation variance and for the attenuation effect, we have, respectively:
$$Var(\hat{\beta})_{ij} = \frac{\beta^2 \left( \frac{17}{180}\theta^{*4} + \frac{2}{3}\theta^{*2} d_{ij}^2 \right) + \sigma_v^2}{n \sigma_{d^2}^2} \tag{16}$$
and
$$\text{plim}\, \hat{\beta} = \beta \left( \frac{\sigma_{d^2}^2}{\sigma_u^2 + \sigma_{d^2}^2} \right) = \beta \left( \frac{\sigma_{d^2}^2}{\frac{17}{180}\theta^{*4} + \frac{2}{3}\theta^{*2} d_{ij}^2 + \sigma_{d^2}^2} \right) \tag{17}$$
which lead to very similar conclusions to those found for the Gaussian geo-masking (see Equations (10) and (11)). The greater the maximum displacement distance is, the lower the precision and the larger the attenuation effects are.
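The uniform mechanism defined by HP1 and HP2 can be simulated in the same spirit. The sketch below draws a random distance and a random angle (in radians rather than degrees), displaces one point repeatedly, and compares the simulated moments of u with the closed forms θ*²/3 and (17/180)θ*⁴ + (2/3)θ*²d². The point (0.6, 0.4), the value θ* = 0.3, the number of draws, and the function names are illustrative assumptions.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def uniform_mask_errors(i, j, theta_max, n_draws=200_000, seed=0):
    """Simulate u under a uniform random distance and a uniform random angle."""
    rng = random.Random(seed)
    d2 = i * i + j * j  # true squared distance from the origin
    errors = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)       # displacement distance (HP1)
        delta = rng.uniform(0.0, 2.0 * math.pi)   # displacement angle (HP1)
        i_bar = i + theta * math.cos(delta)
        j_bar = j + theta * math.sin(delta)
        errors.append(i_bar * i_bar + j_bar * j_bar - d2)
    return errors


i, j, theta_max = 0.6, 0.4, 0.3  # hypothetical point and maximum displacement
d2 = i * i + j * j
u = uniform_mask_errors(i, j, theta_max)
mean_u, var_u = mean_and_variance(u)
mean_theory = theta_max ** 2 / 3
var_theory = (17 / 180) * theta_max ** 4 + (2 / 3) * theta_max ** 2 * d2
print(f"mean of u: simulated {mean_u:.5f}, closed form {mean_theory:.5f}")
print(f"var of u:  simulated {var_u:.5f}, closed form {var_theory:.5f}")
```

One detail may be worth keeping in mind when reading the output. Since cos²δ + sin²δ = 1, the second term of u is simply θ², whose variance under HP1 is E(θ⁴) − E(θ²)² = θ*⁴/5 − θ*⁴/9 = (4/45)θ*⁴, that is 16/180 rather than 17/180. For displacements that are small relative to d the difference is negligible, so the simulated variance should still track Equation (15) closely.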

3. Discussion and Conclusions

In this paper we examined the measurement error on distances introduced by the procedure of geo-masking the individuals’ true location to protect their confidentiality, when such distances are used as predictors in a linear regression. The formal expressions that we derived for the loss in efficiency and for the attenuation in the case of Gaussian and uniform geo-masking are very important from a practical point of view. In fact, the true location of each of the observed individuals is known to the producer of the official statistics before the displacement is introduced, and so is the variance term $\sigma_{d^2}^2$. So, in principle, the data producers could calculate the appropriate expression before geo-masking the data when choosing the maximum location error (θ*), so as to limit the negative consequences on the subsequent analysis. Furthermore, the data producer could disclose to the end users and to the practitioners the level of attenuation which is expected given the chosen geo-masking procedure. In fact, for any given dataset, Expressions (11) and (17) are just functions of the maximum displacement distance θ*.
To illustrate this point, suppose, for instance, that n = 100 individuals have been observed in a unit square study area, as shown in Figure 1.
Taking these points as given, we have $d_{ij}^2 = 0.520151$ (considering, for operational reasons, the mean of all squared distances from the origin) and $\sigma_{d^2}^2 = 0.1592879$. Figure 2 reports the behavior of the attenuation effect for Gaussian and uniform geo-masking for values of the maximum displacement distance θ* ranging between 0 and 1.44 (slightly above √2 ≈ 1.41, the theoretical maximum possible distance in a unit square).
Two features emerge from the inspection of the graph. First, the attenuation increases dramatically already at small levels of θ*. Secondly, the Gaussian geo-masking, other things being constant, produces more severe consequences on the estimation of β than the uniform geo-masking. This graph could be used by the data producers to calibrate the optimal value of θ* and to communicate to the practitioners the resulting level of attenuation they should expect from a regression analysis.
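The calibration idea can be sketched in code. The functions below evaluate the attenuation factor in Expressions (11) and (17), using the two summary values reported in the text, and search by bisection for the largest θ* that keeps the factor above a chosen target. The function names, the target of 0.90, and the bisection tolerance are hypothetical choices made for illustration; this is not the procedure used to draw Figure 2.

```python
D2_MEAN = 0.520151   # mean squared distance from the origin, as reported in the text
VAR_D2 = 0.1592879   # variance of the squared distances, as reported in the text


def sigma_u2_gaussian(theta_max, d2=D2_MEAN):
    """Measurement-error variance under Gaussian geo-masking with 3*sigma = theta_max."""
    return (4 / 81) * theta_max ** 4 + (4 / 9) * theta_max ** 2 * d2


def sigma_u2_uniform(theta_max, d2=D2_MEAN):
    """Measurement-error variance under uniform geo-masking, as in Equation (15)."""
    return (17 / 180) * theta_max ** 4 + (2 / 3) * theta_max ** 2 * d2


def attenuation_factor(theta_max, sigma_u2, var_d2=VAR_D2):
    """Ratio plim(beta_hat) / beta; equals 1 when there is no displacement."""
    return var_d2 / (sigma_u2(theta_max) + var_d2)


def largest_theta_for(target, sigma_u2, upper=1.44, tol=1e-9):
    """Largest theta_max in [0, upper] whose attenuation factor is >= target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must lie in (0, 1]")
    if attenuation_factor(upper, sigma_u2) >= target:
        return upper
    lo, hi = 0.0, upper
    # The factor decreases monotonically in theta_max, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if attenuation_factor(mid, sigma_u2) >= target:
            lo = mid
        else:
            hi = mid
    return lo


theta_star = largest_theta_for(0.90, sigma_u2_uniform)
print(f"largest theta* keeping at least 90% of beta (uniform): {theta_star:.4f}")
```

The Gaussian case can be handled in the same way by passing `sigma_u2_gaussian` instead of `sigma_u2_uniform`. A data producer could tabulate the resulting θ* for several targets and release the table alongside the masked data.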
Figure 1. Spatial coordinates of n = 100 individuals located in a unitary square.
Figure 2. Attenuation effect in the presence of geo-masking as a function of the maximum displacement distance θ*. Gaussian geo-masking (red line). Uniform geo-masking (green line).

Author Contributions

The authors contributed equally to this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof: Expression (5) in the text can be proved as follows. First of all, consider that:
$$E(u_{ij}) = E[(\varepsilon_i^2 + \varepsilon_j^2) + 2(i \varepsilon_i + j \varepsilon_j)] = 2E(\varepsilon^2) + 2E(\varepsilon)(i + j) = 2E(\varepsilon^2)$$
since $E(\varepsilon) = 0$. Furthermore, $Var(\varepsilon) = E(\varepsilon^2) = \sigma_\varepsilon^2$, so that $E(u_{ij}) = 2\sigma_\varepsilon^2$.
                                     Q.E.D.
Similarly, Equation (6) can be proved by considering that:
$$Var(u) = Var[(\varepsilon_i^2 + \varepsilon_j^2) + 2(i \varepsilon_i + j \varepsilon_j)] = 2Var(\varepsilon^2) + 4Var(\varepsilon)(i^2 + j^2) = 2Var(\varepsilon^2) + 4\sigma_\varepsilon^2 d_{ij}^2$$
due to independence. For the term $Var(\varepsilon^2)$ in this expression, consider that, since $\varepsilon \sim \text{n.i.d.}(0, \sigma_\varepsilon^2)$, we have $\varepsilon / \sigma_\varepsilon \sim \text{n.i.d.}(0, 1)$ and $\varepsilon^2 / \sigma_\varepsilon^2 \sim \chi_1^2$. So $Var(\varepsilon^2 / \sigma_\varepsilon^2) = 2$ and, hence, $Var(\varepsilon^2) = 2\sigma_\varepsilon^4$, which, substituted in the previous expression, proves the result.
                                     Q.E.D.

Appendix B

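Proof: To prove that the distribution of the measurement error derives from a non-central Chi-squared, consider Expression (7) and divide both sides by $\sigma_\varepsilon^2$:
$$\frac{u_{ij}}{\sigma_\varepsilon^2} = \left( \frac{i + \varepsilon_i}{\sigma_\varepsilon} \right)^2 + \left( \frac{j + \varepsilon_j}{\sigma_\varepsilon} \right)^2 - \frac{d_{ij}^2}{\sigma_\varepsilon^2} = \left( \frac{i}{\sigma_\varepsilon} + \frac{\varepsilon_i}{\sigma_\varepsilon} \right)^2 + \left( \frac{j}{\sigma_\varepsilon} + \frac{\varepsilon_j}{\sigma_\varepsilon} \right)^2 - \frac{d_{ij}^2}{\sigma_\varepsilon^2}$$
The last term in this expression is a constant, while the terms in the brackets are distributed as $N(i/\sigma_\varepsilon; 1)$ and $N(j/\sigma_\varepsilon; 1)$, respectively. Hence the first two terms are the sum of two independent squared normal variables with non-zero mean and unit variance and, as a consequence, their sum is distributed as a non-central Chi-squared with 2 degrees of freedom and non-centrality parameter $\lambda = \left( \frac{i^2}{\sigma_\varepsilon^2} + \frac{j^2}{\sigma_\varepsilon^2} \right) = \frac{d^2}{\sigma_\varepsilon^2}$. The measurement error u is this variable rescaled by $\sigma_\varepsilon^2$ and shifted by the constant $-d_{ij}^2$.
                                     Q.E.D.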

Appendix C

Proof: The result reported in Equation (14) can be proved as follows. From Expression (13) we can write:
$$E(u_{ij}) = E[2\theta (i \cos\delta + j \sin\delta) + \theta^2 (\cos^2\delta + \sin^2\delta)]$$
and, due to HP2:
$$E(u_{ij}) = 2i E(\theta) E(\cos\delta) + 2j E(\theta) E(\sin\delta) + E(\theta^2) E(\cos^2\delta + \sin^2\delta)$$
Furthermore, due to HP1, we have that:
$$E(\cos\delta) = \int_0^{360} \cos(\delta) f(\delta)\, d\delta = \frac{1}{360} \int_0^{360} \cos(\delta)\, d\delta = \frac{1}{360} [\sin\delta]_0^{360} = 0 = E(\sin\delta)$$
We also have that:
$$E(\theta^2) [E(\cos^2\delta + \sin^2\delta)] = E(\theta^2)$$
from the Pythagorean identity.
Finally, since, from HP1, $\theta \sim \text{i.i.d. } U(0, \theta^*)$, then $E(\theta) = \frac{\theta^*}{2}$ and $Var(\theta) = \frac{\theta^{*2}}{12}$, so that $E(\theta^2) = Var(\theta) + E(\theta)^2 = \frac{\theta^{*2}}{12} + \frac{\theta^{*2}}{4} = \frac{\theta^{*2}}{3}$, which proves Expression (14).
                                     Q.E.D.
Similarly, to prove Expression (15) consider that:
$$Var(u_{ij}) = 4i^2 Var(\theta \cos\delta) + Var(\theta^2 \cos^2\delta) + 4j^2 Var(\theta \sin\delta) + Var(\theta^2 \sin^2\delta)$$
since all cross terms have zero expectation.
Let us examine the various terms in this expression. First of all, we have that:
$$Var(\theta \cos\delta) = E(\theta^2 \cos^2\delta) - E(\theta \cos\delta)^2$$
and, due to HP2:
$$Var(\theta \cos\delta) = E(\theta^2) E(\cos^2\delta) - E(\theta)^2 E(\cos\delta)^2 = E(\theta^2) E(\cos^2\delta)$$
because $E(\cos\delta) = 0$.
Furthermore:
$$E(\cos^2\delta) = \int_0^{360} \cos^2(\delta) f(\delta)\, d\delta = \frac{1}{360} \int_0^{360} \cos^2(\delta)\, d\delta = \frac{1}{360} \left[ \frac{360}{2} + \frac{\cos 360° \sin 360° - \cos 0° \sin 0°}{2} \right] = \frac{1}{2}$$
and, as already seen, $E(\theta^2) = \frac{\theta^{*2}}{3}$. So, eventually:
$$Var(\theta \cos\delta) = \frac{\theta^{*2}}{6} = Var(\theta \sin\delta)$$
Secondly, we have that:
$$Var(\theta^2 \cos^2\delta) = E(\theta^4 \cos^4\delta) - E(\theta^2 \cos^2\delta)^2 = E(\theta^4) E(\cos^4\delta) - E(\theta^2)^2 E(\cos^2\delta)^2$$
In this expression we have that:
$$E(\theta^4) = \int_0^{\theta^*} \theta^4 f(\theta)\, d\theta = \frac{1}{\theta^*} \int_0^{\theta^*} \theta^4\, d\theta = \frac{1}{\theta^*} \left[ \frac{\theta^5}{5} \right]_0^{\theta^*} = \frac{1}{\theta^*} \frac{\theta^{*5}}{5} = \frac{\theta^{*4}}{5},$$
$$E(\cos^4\delta) = \int_0^{360} \cos^4\delta\, f(\delta)\, d\delta = \frac{1}{360} \left\{ \left[ \tfrac{1}{4} \cos^3\delta \sin\delta \right]_0^{360} + \frac{3}{4} \int_0^{360} \cos^2\delta\, d\delta \right\} = \frac{1}{360} \left\{ \frac{3}{4} \int_0^{360} \cos^2\delta\, d\delta \right\} = \frac{1}{360} \left\{ \frac{3}{4} \cdot \frac{360}{2} \right\} = \frac{3}{8}$$
and, as already seen:
$$E(\cos^2\delta) = \frac{1}{2}$$
As a consequence we have:
$$Var(\theta^2 \cos^2\delta) = \frac{\theta^{*4}}{5} \cdot \frac{3}{8} - \frac{\theta^{*4}}{9} \cdot \frac{1}{4} = \frac{17}{360}\theta^{*4} = Var(\theta^2 \sin^2\delta)$$
So eventually:
$$Var(u) = 4i^2 \frac{\theta^{*2}}{6} + 4j^2 \frac{\theta^{*2}}{6} + \frac{34}{360}\theta^{*4} = \frac{2\theta^{*2}}{3}(i^2 + j^2) + \frac{17}{180}\theta^{*4}$$
and finally
$$Var(u) = \frac{2\theta^{*2}}{3} d^2 + \frac{17}{180}\theta^{*4}$$
                                     Q.E.D.

References

  1. G. Arbia. A Primer for Spatial Econometrics. London, UK: Palgrave Macmillan, 2014.
  2. B. Collins. Boundary Respecting Point Displacement. Python Script; Arlington, VA, USA: Blue Raster, LLC, 2011.
  3. C.R. Burgert, J. Colston, T. Roy, and B. Zachary. “Geographic displacement procedure and georeferenced data release policy for the Demographic and Health Surveys.” In DHS Spatial Analysis Report No. 7. Calverton, MD, USA: ICF International, 2013.
  4. DHS. Spatial Analysis Report 8. Guidelines on the Use of GPS DHS Data. Calverton, MD, USA: ICF International, September 2013.
  5. G. Arbia, G. Espa, and D. Giuliani. “Dirty spatial econometrics.” Ann. Reg. Sci., 2015. submitted.
  6. N. Elkies, G. Fink, and F.G. Baernighausen. “Scrambling geo-referenced data to protect privacy induces bias in distance estimation.” Popul. Environ. 37 (2015): 83–98.
  7. M. Verbeek. A Guide to Modern Econometrics. Chichester, UK: Wiley, 2004.

Share and Cite

MDPI and ACS Style

Arbia, G.; Espa, G.; Giuliani, D. Measurement Errors Arising When Using Distances in Microeconometric Modelling and the Individuals’ Position Is Geo-Masked for Confidentiality. Econometrics 2015, 3, 709-718. https://doi.org/10.3390/econometrics3040709
