Measurement Errors Arising When Using Distances in Microeconometric Modelling and the Individuals' Position Is Geo-masked for Confidentiality

In many microeconometric models we use distances. For instance, in modelling individual behavior in labor economics or in health studies, the distance from a relevant point of interest (such as a hospital or a workplace) is often used as a predictor in a regression framework. However, in order to preserve confidentiality, spatial micro-data are often geo-masked, thus reducing their quality and dramatically distorting the inferential conclusions. In particular, a measurement error is introduced in the independent variable, which negatively affects the properties of the estimators. This paper studies these negative effects, discusses their consequences, and suggests possible interpretations and directions to data producers, end users, and practitioners.


Introduction
Distances are widely used in microeconometric studies. For instance, in labor, schooling, or health studies, the distance between each observed individual and a conspicuous point (e.g., a hospital, a workplace, or a school) is often used as a predictor in a regression model. Furthermore, if the aim is to account for interaction effects between individuals using econometric methods [1], weight matrices are often built from some inverse-distance function of the pairwise distances between individuals.
When analyzing granular spatial micro-data (e.g., household surveys), the coordinates of each individual are often displaced according to some random mechanism in order to preserve confidentiality. For instance, in some recent DHS surveys, coordinates are first collected in the field using GPS receivers with an accuracy of less than 15 m; then, to ensure that respondent confidentiality is maintained, the observed points are geo-masked along a random distance and a random angle (see, e.g., [2]), using different maximum error distances in urban and rural areas (see [3][4][5] for details). A second method is Gaussian geo-masking, which randomly reallocates each point in the neighborhood of its true location, with a probability described by a bivariate Gaussian density function centered on the true point.
This paper's contribution is to analyze the problems induced by geo-masking in econometric models as a problem of measurement error. Section 2 considers the measurement error induced by geo-masking when the distance is used as a predictor in a regression model. Section 3 concludes with some practical comments and directions to help statistics producers calibrate the geo-masking procedures before releasing the data to the public.

Generalities
Let us consider the estimation of a linear regression when the true individuals' coordinates are not disclosed for confidentiality and a distance from a relevant point is used as a predictor. For instance, in health economics it is common practice to postulate a relationship between a health outcome for each individual (such as the effect of a health policy), say y, and the individual's distance (say d) from a clinic or a hospital. To illustrate the essence of the problem we restrict ourselves to the case, admittedly unrealistic, of a simple regression without any further predictors. Furthermore, to simplify the algebra without compromising the generality of the results, we postulate a relationship between the health outcome and the squared distance from a relevant point. When data are geo-masked, the distance between each individual and a conspicuous point will be upward biased (as shown, e.g., by Arbia et al., 2015 [5] and Elkies et al., 2015 [6]). This paper shows that, when the masking procedure is disclosed, this information can be taken into account for the benefit of the analysis.
To start with, let us recall that the classical measurement error theory (e.g., [7]) defines the true model as:
$$y_{ij} = \alpha + \beta d_{ij}^{2} + v_{ij} \qquad (1)$$
for each individual observed at the point of coordinates (i,j), with $d_{ij}$ the distance between point (i,j) and the point of interest and $v_{ij} \sim \text{i.i.d. } N(0, \sigma_v^2)$. The distance is observed with an error due to geo-masking, and the measurement error $u_{ij}$ is defined as:
$$\tilde{d}_{ij}^{2} = d_{ij}^{2} + u_{ij}$$
with $\tilde{d}_{ij}^{2}$ the squared distance observed after geo-masking. Following the classical theory, $u_{ij}$ is assumed to be such that $E(u_{ij}) = 0$, $Var(u_{ij}) = \sigma_u^2$ = constant, and $u_{ij}$ independent of $v_{ij}$ and of $d_{ij}^{2}$. Normality of u is also often assumed. Under these conditions, calling $\hat{\beta}$ the OLS estimator of β, the estimator is still unbiased but less efficient, since its variance is inflated by the measurement error variance $\sigma_u^2$ (Equation (2)). The estimator is also inconsistent, with a downward asymptotic bias towards zero (called attenuation) quantified by the expression [7]:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + \sigma_u^2 / \sigma_{d^2}^2} \qquad (3)$$
with $\sigma_{d^2}^2$ the variance of the squared uncontaminated distances. However, in the case of a measurement error induced by geo-masking, the results are quite different from the classical ones, as illustrated in the next sections.
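The classical attenuation result can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it takes the textbook attenuation factor 1 / (1 + σ_u² / σ²_{d²}), draws hypothetical points uniformly in the unit square, and uses arbitrary parameter values (α = 1, β = 2, σ_u = 0.3, σ_v = 1) that do not come from the paper. The function names are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope of y on x in a simple regression with intercept."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_classical_attenuation(n=50_000, alpha=1.0, beta=2.0,
                                   sd_u=0.3, sd_v=1.0, seed=0):
    """Compare OLS slopes with and without classical measurement error.

    Assumption: the error u is zero-mean, homoscedastic and independent
    of the true regressor, i.e. the classical case, not geo-masking.
    """
    rng = random.Random(seed)
    # True squared distances from the origin for points in the unit square.
    d2_true = [rng.random() ** 2 + rng.random() ** 2 for _ in range(n)]
    y = [alpha + beta * d2 + rng.gauss(0.0, sd_v) for d2 in d2_true]
    # Regressor contaminated by classical measurement error.
    d2_obs = [d2 + rng.gauss(0.0, sd_u) for d2 in d2_true]
    var_d2 = statistics.pvariance(d2_true)
    predicted = beta / (1.0 + sd_u ** 2 / var_d2)
    return ols_slope(d2_true, y), ols_slope(d2_obs, y), predicted


if __name__ == "__main__":
    slope_true, slope_obs, predicted = simulate_classical_attenuation()
    print(f"slope with true regressor:     {slope_true:.3f}")
    print(f"slope with observed regressor: {slope_obs:.3f}")
    print(f"slope predicted by the factor: {predicted:.3f}")
```

With the contaminated regressor the estimated slope should fall visibly below β and sit close to the value predicted by the attenuation factor.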

Gaussian Geo-Masking within a Circle
Let us start by considering the effect of geo-masking when a distance is used as a regressor in an econometric model, in the case of Gaussian geo-masking, that is, when each individual's true location is perturbed with a bivariate Gaussian distribution centered on the true point. More formally, let us consider the point of coordinates (i,j) and let us geo-mask this point by disclosing, instead, the coordinates $(i + \varepsilon_1, j + \varepsilon_2)$, with $\varepsilon_1$ and $\varepsilon_2$ independent $N(0, \sigma_\varepsilon^2)$ displacements. Let us further consider, for each individual point, its distance from a conspicuous point that, without loss of generality, we can locate at the origin of the Cartesian system. In this case the true squared distance of the point of coordinates (i,j) from the conspicuous point is given by:
$$d_{ij}^{2} = i^{2} + j^{2}$$
while, after geo-masking, we observe instead
$$\tilde{d}_{ij}^{2} = (i + \varepsilon_1)^{2} + (j + \varepsilon_2)^{2} = d_{ij}^{2} + u_{ij} \qquad (4)$$
because of our definitions. So the term u defines the measurement error on the independent variable of the model, as in the classical theory. However, in contrast with the classical theory, in this case we have:
$$E(u_{ij}) = 2\sigma_\varepsilon^{2} \qquad (5)$$
and
$$Var(u_{ij}) = 4\sigma_\varepsilon^{2}\left(\sigma_\varepsilon^{2} + d_{ij}^{2}\right) \qquad (6)$$
Thus the measurement error has non-zero mean and non-constant variance (see Appendix A for the proof). The non-zero mean does not affect the point estimate of the parameter β, but only the constant term. Equation (6) shows that the geo-masking procedure also induces heteroscedasticity.
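The moments of the Gaussian measurement error can be checked by Monte Carlo. The sketch below assumes the mechanism described above, with independent N(0, σ_ε²) displacements on each coordinate and the conspicuous point at the origin, and compares simulated moments with the closed forms E(u) = 2σ_ε² and Var(u) = 4σ_ε²(σ_ε² + d²) that follow from that mechanism. Function names and the example point are hypothetical.

```python
import random
import statistics


def gaussian_mask_errors(i, j, sigma, n_draws=100_000, seed=1):
    """Simulate u = masked squared distance minus true squared distance."""
    rng = random.Random(seed)
    d2 = i * i + j * j
    errors = []
    for _ in range(n_draws):
        e1 = rng.gauss(0.0, sigma)  # displacement on the first coordinate
        e2 = rng.gauss(0.0, sigma)  # displacement on the second coordinate
        errors.append((i + e1) ** 2 + (j + e2) ** 2 - d2)
    return errors


def gaussian_error_moments(i, j, sigma):
    """Closed-form mean and variance of u under Gaussian geo-masking."""
    d2 = i * i + j * j
    return 2.0 * sigma ** 2, 4.0 * sigma ** 2 * (sigma ** 2 + d2)


if __name__ == "__main__":
    u = gaussian_mask_errors(0.6, 0.8, sigma=0.1)
    mean_theory, var_theory = gaussian_error_moments(0.6, 0.8, sigma=0.1)
    print(f"mean: simulated {statistics.fmean(u):.4f}, closed form {mean_theory:.4f}")
    print(f"var:  simulated {statistics.pvariance(u):.4f}, closed form {var_theory:.4f}")
```

Repeating the run for points farther from the origin should show the variance growing with d², which is the heteroscedasticity discussed in the text.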
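Furthermore, from Equation (4) we have:
$$u_{ij} = (i + \varepsilon_1)^{2} + (j + \varepsilon_2)^{2} - d_{ij}^{2} \qquad (7)$$
As a consequence (since $d_{ij}^{2}$, i and j are constant terms and $\varepsilon_1, \varepsilon_2 \sim \text{i.i.d. } N(0, \sigma_\varepsilon^2)$), $(u_{ij} + d_{ij}^{2})/\sigma_\varepsilon^{2}$ is distributed as a non-central chi-squared with 2 degrees of freedom and non-centrality parameter $d_{ij}^{2}/\sigma_\varepsilon^{2}$; the proof is left to Appendix B. Following the classical theory, the OLS estimator will be less efficient and inconsistent, recalling Equations (2) and (3). In particular, in the case of Gaussian geo-masking, the variance of the OLS estimator (Equation (8)) is such that the larger $\sigma_\varepsilon^{2}$ and the larger the distance from the conspicuous point, the lower the precision of the estimate. The precision also depends on the square of the true value of β. Furthermore, using Equation (6) to evaluate the attenuation, we have:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + 4\sigma_\varepsilon^{2}\left(\sigma_\varepsilon^{2} + d_{ij}^{2}\right) / \sigma_{d^2}^2} \qquad (9)$$
which shows that the attenuation effect on the OLS estimator is greater in the presence of a larger geo-masking variance and of larger distances.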
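In practical cases, to communicate with practitioners, it is useful to describe the Gaussian geo-masking mechanism with reference to a maximum displacement distance, which is easier for non-specialists to interpret than a variance. Since in a Gaussian distribution $P(-3\sigma_\varepsilon \le \varepsilon \le 3\sigma_\varepsilon) = 0.9973$, with a probability close to 1 we can assume that the maximum displacement distance is $3\sigma_\varepsilon$. If we call θ* such maximum distance, we have that $3\sigma_\varepsilon = \theta^{*}$. So the expected measurement error is
$$E(u_{ij}) = \frac{2}{9}\theta^{*2}$$
and the bias can be seen as a fraction of the maximum squared displacement distance. Furthermore, its variance can be expressed as
$$Var(u_{ij}) = \frac{4}{9}\theta^{*2}\left(\frac{\theta^{*2}}{9} + d_{ij}^{2}\right)$$
which shows that uncertainty increases with the maximum displacement distance and with the absolute position of the individual with respect to the conspicuous point. By using this alternative expression, the variance of the OLS estimator can be restated in terms of θ* (Equation (10)), and the attenuation becomes:
$$\operatorname{plim} \hat{\beta} = \frac{\beta}{1 + \frac{4}{9}\theta^{*2}\left(\frac{\theta^{*2}}{9} + d_{ij}^{2}\right) / \sigma_{d^2}^2} \qquad (11)$$
which shows more intuitively the negative effects of geo-masking on the OLS estimates. The greater the maximum displacement distance is, the larger both the loss in efficiency and the attenuation effect will be.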
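The link between the maximum displacement distance and the Gaussian standard deviation can be made concrete with a few lines of code. The sketch below assumes the three-sigma convention described above (3σ_ε = θ*) and the Gaussian moments E(u) = 2σ_ε² and Var(u) = 4σ_ε²(σ_ε² + d²). Function names are hypothetical.

```python
import math


def coverage_within_k_sigma(k=3.0):
    """P(-k*sigma <= eps <= k*sigma) for a zero-mean Gaussian variable."""
    return math.erf(k / math.sqrt(2.0))


def sigma_from_max_displacement(theta_star, k=3.0):
    """Standard deviation implied by a maximum displacement theta_star."""
    return theta_star / k


def gaussian_error_moments_from_theta(theta_star, d2):
    """Mean and variance of u written in terms of theta_star and d^2."""
    s2 = sigma_from_max_displacement(theta_star) ** 2
    return 2.0 * s2, 4.0 * s2 * (s2 + d2)


if __name__ == "__main__":
    print(f"coverage within 3 sigma: {coverage_within_k_sigma():.4f}")
    mean_u, var_u = gaussian_error_moments_from_theta(0.3, d2=1.0)
    print(f"theta* = 0.3, d^2 = 1: E(u) = {mean_u:.4f}, Var(u) = {var_u:.4f}")
```

A data producer could change `k` to explore a stricter or looser definition of "maximum" displacement; the three-sigma choice is the one used in the text.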

Uniform Geo-Masking within a Circle
Let us now turn to the effects of uniform geo-masking (such as the one employed, e.g., by DHS, 2013 [4]), that is, a mechanism which transforms the coordinates by displacing them along a random angle (say δ) and a random distance (say θ), both obeying a uniform probability law. The mechanism can be formally expressed through the following hypotheses: HP1: $\theta \sim U(0, \theta^{*})$ and $\delta \sim U(0, 2\pi)$, with θ* the maximum distance error, and HP2: θ and δ are independent. Assuming again, without loss of generality, that the conspicuous point is located at the origin, the true squared distance between the point of coordinates (i,j) and the conspicuous point before geo-masking is $d_{ij}^{2} = i^{2} + j^{2}$, while, after geo-masking, it can be expressed, using polar coordinates, as:
$$\tilde{d}_{ij}^{2} = (i + \theta\cos\delta)^{2} + (j + \theta\sin\delta)^{2} \qquad (13)$$
Similarly to the Gaussian case, the measurement error has a non-zero mean and a non-constant variance (Equation (14)). Again the proofs are left to the appendices, specifically to Appendix C.
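The uniform mechanism can be sketched as a simulation. The code below assumes one reading of the hypotheses, namely θ uniform on (0, θ*) and δ uniform on (0, 2π), independent of each other, with the conspicuous point at the origin. Under that reading the cross term averages to zero, so the mean of u equals E(θ²) = θ*²/3; this value is derived here from the stated assumptions and is not claimed to be the paper's Equation (14). Function names and example points are hypothetical.

```python
import math
import random
import statistics


def uniform_mask(i, j, theta_star, rng):
    """Displace (i, j) by a uniform random distance and a uniform random angle."""
    theta = rng.uniform(0.0, theta_star)      # displacement distance
    delta = rng.uniform(0.0, 2.0 * math.pi)   # displacement angle in radians
    return i + theta * math.cos(delta), j + theta * math.sin(delta)


def uniform_mask_errors(i, j, theta_star, n_draws=100_000, seed=2):
    """Simulate u = masked squared distance minus true squared distance."""
    rng = random.Random(seed)
    d2 = i * i + j * j
    errors = []
    for _ in range(n_draws):
        mi, mj = uniform_mask(i, j, theta_star, rng)
        errors.append(mi * mi + mj * mj - d2)
    return errors


if __name__ == "__main__":
    theta_star = 0.3
    for point in [(0.1, 0.1), (0.9, 0.9)]:
        u = uniform_mask_errors(*point, theta_star)
        print(point, f"mean {statistics.fmean(u):.4f}",
              f"variance {statistics.pvariance(u):.4f}")
```

Comparing the two example points gives an empirical view of the non-zero mean and of the variance increasing with the distance from the conspicuous point.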
So, consistently with the results obtained with Gaussian geo-masking and in line with intuition, the variance of the measurement error increases as the maximum displacement distance θ* increases and as we move away from the conspicuous point.
If we use this result again to provide explicit expressions for the estimation variance and for the attenuation effect (Equations (16) and (17), respectively), we reach conclusions very similar to those found for Gaussian geo-masking (see Equations (10) and (11)). The greater the maximum displacement distance is, the lower the precision and the larger the attenuation effect.

Discussion and Conclusions
In this paper we examined the measurement error on distances introduced by geo-masking the individuals' true location to protect their confidentiality, when such distances are used as predictors in a linear regression. The formal expressions that we derived for the loss in efficiency and for the attenuation in the case of Gaussian and uniform geo-masking are very important from a practical point of view. In fact, the true location of each of the observed individuals is known to the producer of the official statistics before the displacement is introduced, and so is the variance term $\sigma_{d^2}^2$. So, in principle, the data producers could calculate the appropriate expression before geo-masking the data when choosing the maximum location error (θ*), so as to limit the negative consequences on the subsequent analysis. Furthermore, the data producer could disclose to the end users and to the practitioners the level of attenuation which is expected given the chosen geo-masking procedure. In fact, for any given dataset, Expressions (11) and (17) are just functions of the maximum displacement distance θ*.
To illustrate this point, suppose, for instance, that n = 100 individuals have been observed in a unit square study area, as shown in Figure 1.
Taking these points as given, we have $d_{ij}^{2}$ = 0.520151 (considering, for operational reasons, the mean of all squared distances from the origin) and $\sigma_{d^2}^2$ = 0.1592879. Figure 2 reports the behavior of the attenuation effect for Gaussian and uniform geo-masking for values of the maximum displacement distance θ* ranging between 0 and 1.44 (approximately the theoretical maximum possible distance in a unit square, √2 ≈ 1.41). Two features emerge from the inspection of the graph. First, the attenuation increases dramatically already at small levels of θ*. Secondly, Gaussian geo-masking, other things being constant, produces more severe consequences on the estimation of β than uniform geo-masking. This graph could be used by the data producers to calibrate the optimal value of θ* and to communicate to the practitioners the resulting level of attenuation they should expect from a regression analysis.
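The calibration exercise can be sketched numerically for the Gaussian case. The code below uses the two summary values reported in the text (mean squared distance 0.520151 and variance of the squared distances 0.1592879) and assumes the attenuation factor takes the classical form 1 / (1 + Var(u) / σ²_{d²}), with Var(u) = 4σ_ε²(σ_ε² + d²) and σ_ε = θ*/3. It is an illustration under those assumptions, not a reproduction of Figure 2, and the function name is hypothetical.

```python
MEAN_SQ_DIST = 0.520151   # mean squared distance reported in the text
VAR_SQ_DIST = 0.1592879   # variance of the squared distances reported in the text


def gaussian_attenuation_factor(theta_star,
                                mean_d2=MEAN_SQ_DIST,
                                var_d2=VAR_SQ_DIST):
    """Ratio plim(beta_hat) / beta under Gaussian geo-masking.

    Assumptions: sigma = theta_star / 3 and the classical attenuation
    form, evaluated at the mean squared distance.
    """
    s2 = (theta_star / 3.0) ** 2
    var_u = 4.0 * s2 * (s2 + mean_d2)
    return 1.0 / (1.0 + var_u / var_d2)


if __name__ == "__main__":
    # Grid of maximum displacement distances from 0 to 1.4 in steps of 0.1.
    for k in range(15):
        theta_star = 0.1 * k
        factor = gaussian_attenuation_factor(theta_star)
        print(f"theta* = {theta_star:.1f}  attenuation factor = {factor:.3f}")
```

A factor of 1 means no attenuation; a data producer could read off the largest θ* that keeps the factor above a chosen threshold.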

Appendix B
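Proof: To prove that the distribution of the measurement error is a non-central chi-squared, in Expression (7) consider that, if we divide both sides by $\sigma_\varepsilon^{2}$, we have:
$$\frac{u_{ij}}{\sigma_\varepsilon^{2}} = \left(\frac{i + \varepsilon_1}{\sigma_\varepsilon}\right)^{2} + \left(\frac{j + \varepsilon_2}{\sigma_\varepsilon}\right)^{2} - \frac{d_{ij}^{2}}{\sigma_\varepsilon^{2}}$$
where $\varepsilon_1$ and $\varepsilon_2$ denote the Gaussian displacements of the two coordinates. The last term in this expression is a constant, while the terms in the brackets are distributed as $N(i/\sigma_\varepsilon, 1)$ and $N(j/\sigma_\varepsilon, 1)$, respectively. Hence, apart from the constant, $u_{ij}/\sigma_\varepsilon^{2}$ is the sum of two independent squared normal variables with non-zero mean and unit variance and, as a consequence, it is distributed as a non-central chi-squared with 2 degrees of freedom and non-centrality parameter $\lambda = (i^{2} + j^{2})/\sigma_\varepsilon^{2} = d_{ij}^{2}/\sigma_\varepsilon^{2}$.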
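As a consistency check, the mean and variance of the measurement error can be recovered from the standard moments of a non-central chi-squared variable with k degrees of freedom and non-centrality parameter λ, namely mean k + λ and variance 2(k + 2λ). The derivation below is a sketch that assumes the parameterization λ = d²/σ_ε² and writes W for the shifted and rescaled error.

```latex
\begin{align*}
W &= \frac{u_{ij} + d_{ij}^{2}}{\sigma_\varepsilon^{2}} \sim \chi'^{2}_{2}(\lambda),
  \qquad \lambda = \frac{d_{ij}^{2}}{\sigma_\varepsilon^{2}}, \\
E(W) &= 2 + \lambda, \qquad Var(W) = 2\,(2 + 2\lambda) = 4 + 4\lambda, \\
E(u_{ij}) &= \sigma_\varepsilon^{2}\,(2 + \lambda) - d_{ij}^{2} = 2\sigma_\varepsilon^{2}, \\
Var(u_{ij}) &= \sigma_\varepsilon^{4}\,(4 + 4\lambda)
  = 4\sigma_\varepsilon^{2}\left(\sigma_\varepsilon^{2} + d_{ij}^{2}\right).
\end{align*}
```

Both results agree with the non-zero mean and the distance-dependent variance stated for Gaussian geo-masking.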

Appendix C
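Proof: The result reported in Equation (14) can be proved as follows. Consider that Expression (13), $\tilde{d}_{ij}^{2} = (i + \theta\cos\delta)^{2} + (j + \theta\sin\delta)^{2}$, can be written as:
$$\tilde{d}_{ij}^{2} = d_{ij}^{2} + \theta^{2} + 2\theta\,(i\cos\delta + j\sin\delta)$$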

In the expressions above, $\sigma_{d^2}^2$ denotes the variance of the squared uncontaminated distances and θ* the maximum distance error. Under Gaussian geo-masking the bias can be seen as a fraction of the maximum squared displacement distance, and the proof that the measurement error follows a non-central chi-squared distribution with 2 degrees of freedom is left to Appendix B. Under uniform geo-masking, hypothesis HP2 states that θ and δ are independent.

Figure 1. Spatial coordinates of n = 100 individuals located in a unit square.

Figure 2. Attenuation effect in the presence of geo-masking as a function of the maximum displacement distance θ*. Gaussian geo-masking (red line). Uniform geo-masking (green line).