1. Geosocial Network Data in Research
In recent years, data from geosocial networks such as Twitter, Flickr, Instagram, Foursquare and others have become a widely used basis for geospatial analysis in a number of application areas, including disaster management (Laituri and Kodrich 2008; Resch et al. 2018), public health and epidemiology (Santillana et al. 2015; Boulos et al. 2011), urban planning (Foth et al. 2011; Resch et al. 2016), traffic management (Pan et al. 2013; Steiger et al. 2016a), crime analysis (Malleson and Andresen 2015; Ristea et al. 2018; Kounadi et al. 2018), and others. While early research efforts focused on simple analysis using traditional methods (Girardin et al. 2008; Sagl et al. 2012), more recent research has developed more sophisticated approaches, including self-learning systems such as artificial neural networks (ANN) (Steiger et al. 2016b), machine learning semantic topic models (Hasan and Ukkusuri 2014; Kovacs-Gyori et al. 2018) or real-time analysis algorithms (Sakaki et al. 2010).
As a result of the rapid development of social media analysis, data analysis methods have become more robust and results more reliable. In turn, geosocial networks are now acknowledged as a high-quality data source that supports the investigation of real-world problems and subsequent decision-making. This development has been fostered by the dramatically increasing availability of social media posts throughout the world, particularly in urban settings. Consequently, we have witnessed the emergence of far-reaching analysis efforts that investigate urban processes at a remarkably high spatial and temporal resolution.
On the downside, this leads to the pressing question of how to preserve the privacy of social media users, an issue which is becoming more and more serious as the spatial and temporal density of social media posts increases. This is because extracting user profiles and identifying single users can be done relatively easily by analysing accumulated social media posts, particularly when coupled with other data sources such as demographic data, statistical data or household data, which are increasingly available as open-source repositories (Steiger et al. 2015).
Potential infringements, which may arise when analysing geosocial media data and when publishing the corresponding results, include revealing a user’s identity, building behaviour profiles of users, generating political profiles of users, putting users at physical risk (e.g., lateral thinkers, public figures, etc.), or making information permanently available on the Internet. These infringements are particularly critical because fine-grained research outputs may only constitute a surrogate for more concrete and personal influences on users: a spatial accumulation of “negative emotions”, “high traffic volumes”, or “bad air quality” may have very direct consequences on a user’s permanent stress level, their quality of life or even their life expectancy. Thus, more accurate, finer-grained or more complete information may in some cases not be desirable, as this would potentially allow for conclusions regarding the subjects’ identity on a very small scale or, in extreme cases, even on the individual level (Resch 2013).
For the particular case of geosocial media, the responsibility for sharing data in “appropriate” semantic, content-wise, spatial and temporal detail cannot be shifted to the user because terms and conditions of the use of social networks are mostly articulated in convoluted and hardly understandable language. Posts in geosocial media are usually available through (public) Application Programming Interfaces (APIs), which enable data access without the users’ awareness even though users knowingly and consensually share their data as agreed in the terms and conditions of a geosocial network application. The above-mentioned conditions necessitate a rigorous way of handling data from geosocial media to preserve the users’ privacy. The challenge of this goal lies in the spatial nature of geosocial media data: the first of the 21 theses in the “Geoprivacy Manifesto” by Keßler and McKenzie states that “information about an individual’s location is substantially different from other kinds of personally identifiable information”. Individual time trajectories in space reveal activity spaces (e.g., locations of work, social clubs, day care, grocery, place of worship, etc.) that can, in turn, be processed to gain insights into human behaviour and detect personal profiles (Armstrong et al. 2018). Therefore, the major factor that makes spatial information more challenging is the potential for inferences about identity that may be drawn.
This paper discusses privacy risks associated with research efforts using geosocial media data, identifies the limitations of existing studies regarding inference and protection, and proposes a set of geoprivacy-by-design recommendations with respect to sharing these data, anonymising them, publishing the resulting maps, modalities of data storage, and privacy-preserving measures. This is followed by a thorough discussion of the proposed recommendations, particularly in light of the recent General Data Protection Regulation (GDPR), and a set of future research directions in the area of geoprivacy. Furthermore, this paper addresses the use (storage, analysis, visualisation and sharing) of geosocial network data, but it shall not be understood as a guideline relevant to building location-based social networks.
2. Background on Inferences, Users, and Policies
Studies on location inference attacks examine the types of personal information that can be revealed from individual-level spatial trajectories and the accuracy of inferred information. Thus, privacy policies in research efforts involving LBSN data should take possible inference attacks into consideration (Section 2.1) and provide mechanisms that are in line with the users’ preferences and attitudes towards privacy protection (Section 2.2).
2.1. Inference Attacks vs. Risk of Re-Identification
In this section, we review the literature on inference attacks and re-identification risk from spatial trajectory data. First, we should clarify the difference between inference attacks and re-identification. Inference is to draw conclusions based on observations and analytical results of the data. For instance, given a set of locations of a GPS user, the clusters of high point intensity can be computed. Then, the central point of the cluster with the highest density can be assumed as the home location of the user, especially if there are temporal signals during late night hours when people usually stay at home. In statistical terms, inference is accompanied by accuracy results. To calculate the accuracy of inference, true outcomes should be measurable. This means that, in the previous example, the true home location of the user is known and compared with the estimated one. However, this is not always the case for the studies on inference attacks from location data. Hence, in many cases, inference attacks show the potential of the kind of information that can be disclosed without necessarily validating the conclusions. On the other hand, re-identification involves a disclosure method as well as an accuracy assessment against the actual information.
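The home-location example above can be sketched in a few lines. This is only an illustration of the kind of inference described, not a method from any of the reviewed studies: the function name `infer_home`, the grid cell size and the night-time window are all assumptions chosen for the example, and the input is assumed to be a list of `(lat, lon, timestamp)` tuples for a single user.

```python
from collections import Counter
from datetime import datetime


def infer_home(points, cell_size=0.001, night_start=22, night_end=6):
    """Guess a home location from one user's (lat, lon, datetime) points.

    Hypothetical sketch: keep night-time points only, bin them into a
    regular grid, and return the centroid of the densest cell.
    """
    night = [(lat, lon) for lat, lon, ts in points
             if ts.hour >= night_start or ts.hour < night_end]
    if not night:
        return None  # no night-time signal, so no guess is made

    def cell(lat, lon):
        # Index of the grid cell that contains the point.
        return (round(lat / cell_size), round(lon / cell_size))

    counts = Counter(cell(lat, lon) for lat, lon in night)
    densest, _ = counts.most_common(1)[0]
    members = [p for p in night if cell(*p) == densest]
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))


if __name__ == "__main__":
    sample = [
        (47.8000, 13.0400, datetime(2018, 5, 21, 23, 10)),
        (47.8001, 13.0401, datetime(2018, 5, 22, 1, 30)),
        (47.8120, 13.0550, datetime(2018, 5, 22, 12, 0)),
    ]
    print(infer_home(sample))
```

Note that the function returns a guess without any accuracy statement. Turning this inference into a validated re-identification would require comparing the guess with the true home location, which is exactly the step many of the reviewed studies lack.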
Taking this distinction into consideration, most of the studies have examined the potential of drawing conclusions about the private matters of individuals, and only a few of these studies validated the degree to which such conclusions are accurate (Table 1, category: validation data). Types of re-identified information are, for example, the prediction of a social media user’s next location in a georeferenced post (Preoţiuc-Pietro and Cohn 2013), the location of social media posts from a georeferenced dataset (Schulz et al. 2013), or the home address and identity of individuals that carry GPS receivers (Krumm 2007). Other studies evaluated their inference results in a questionable way because the validation data that were used had significant limitations. Zang and Bolot aimed at detecting the home and work locations of cell phone users, but only had 12 subscribers to validate their results. Li et al. employed a significant number of participants in their experiments in order to reveal highly sensitive information, such as age, gender, education, living place, and location patterns, from geosocial network media data and Wi-Fi traffic records. However, the participants are only representative of a particular subgroup of the population, that is, people who work, study, or live on a university campus. One could argue that the sample’s characteristics are considerably less variant than a representative sample of the general population and thus easier to predict. For example, more than half of the participants have a bachelor’s degree, and there are only three types of education levels (i.e., bachelor, master, and PhD). In addition, Schulz et al. inferred the home location of Twitter users and then validated the estimated home location by using the last location of the user as ground truth. Similarly, Pontes et al. validated an estimated city of a user using as ground truth the information provided in the user’s home city attribute. All approaches highlight the potential for re-identification but do not reveal the actual re-identified information.
Furthermore, some of these studies propose measures to protect subjects’ anonymity in location trajectories, such as perturbation, aggregation (e.g., areal, point, or temporal), considering the desired level of privacy defined by user preferences, shortening the time collection period, and removing sensitive areas (i.e., spatial cloaking) (Table 1, category: countermeasures). Most of these measures deliver sufficiently large anonymous datasets and may work well for plain spatiotemporal trajectories. Nevertheless, in Section 7, we outline the limitations of the k-anonymity concept with respect to geosocial network data, due to the diversity and variety of potential disclosed information, and we explain why alternative measures based on differential privacy or l-diversity are preferable.
2.2. Users’ Privacy Preferences
Geosocial network data are provided voluntarily by the users who are also the data subjects. Their preferences on the protection of personal privacy in LBSN have been studied and conceptualised as opinions, attitudes, and behaviours. Beldad and Kusumadewi identified major determinants of sharing locations in LBSN. The first two are related to personal benefits, such as entertainment and impression management (i.e., controlling the impression they have on others), while the third one is trusting the competences of an application to protect personal privacy. A similar study on location information disclosure behaviour also confirms that privacy risks weaken the relationship between perceived benefits and intention to disclose personal information (Sun et al. 2015). In addition, both studies on location disclosure behaviour found that there are significant gender differences in the responses of the participants. Benisch et al. surveyed 27 participants and collected their location trajectories over three weeks. Then, the participants ranked and explained their disclosing criteria. One of the most significant findings was that users’ decisions on whether or not to disclose their locations vary depending on the time of the day, the day of the week, and their exact location. A second finding, which is important for policy implementation, is that users would prefer a more complex set of location- and time-based privacy rules over a simpler approach that restricts disclosed information to a particular group (i.e., friends or family).
On the other hand, people’s opinions, attitudes, and behaviours regarding geoprivacy risks are connected to their geoprivacy awareness, which is not yet widespread or well understood. Half or more of the participants in a location awareness study did not know whether: (a) their profiles in Twitter and Instagram are private or public; (b) they use the geolocation feature; and (c) they ever changed default privacy settings (Furini and Tamanini 2015). Another study asked participants to state their awareness of 14 types of inference attacks from geosocial network data (e.g., to infer home and work location, to know their friend network and weekly habits, etc.) (Alrayes and Abdelmoty 2014). More than one-third of the participants were not aware of possible attacks such as those that allow other people to learn about their personal activities.
Users’ preferences regarding the protection of their geosocial data are diverse and probably linked to their awareness about which data are available (on the Internet) and how they can be used. Thus, it is not advisable for researchers or institutions to construct privacy-by-design guidelines based on generalised preferences of the public, who lack specialised knowledge on geoprivacy implications.
2.3. Privacy Policies in LBSN
Gambs et al. gave a comprehensive overview of the privacy policies implemented by four LBSN (Foursquare, Qype, La Ruche and Twitter). They identified the following eight privacy criteria and also checked whether the LBSN adhere to them:
Privacy criteria in LBSN:
Registration information: How much personal information from users is needed for registration?
Real identities versus pseudonyms: Are users allowed to use pseudonyms instead of their real name?
Information available to others (friends, public, and third parties): What personal information about users is disclosed to other parties operating on the LBSN?
Privacy settings: Do users have control over how their data is collected, used and disseminated?
Policy of data retention in case of account deletion: Does the LBSN delete all data from a user after they delete their network account?
Mobility data collection and management: Are location data collected continuously or only when a user action requires location data access?
Security features: Does the LBSN implement reasonable IT security measures to prevent data theft?
The authors concluded that the platforms largely do not implement measures to fit the criteria and provide a list of practical recommendations for LBSN to use. Vicente et al. performed a similar study that examines a larger number of LBSN and outlines the features of the services that increase the re-identification risk. These features are publication in real time (occurs in 43% of the examined LBSNs of the study), the use of exact location (occurs in 62% of the LBSNs), and the ability to tag or check in multiple users (occurs in 19% of the LBSNs). However, only 14% of the LBSNs use anonymous user identities, which is a feature that decreases the re-identification risk. Furthermore, some of their listed LBSN pose privacy issues for users, although they leave a threat formalisation to future studies and suggest spatial and temporal cloaking as a possible privacy protection measure. Further, they provided an outlook, in which they name user awareness of publishing location information as a factor in privacy protection.
3. Data Sharing
Researchers may share processed or unprocessed datasets for several reasons, for example, to allow research replicability, to establish synergies with research partners, or to publish in open data scientific journals. These datasets, typically, do not contain key identifiers (i.e., name, home address, etc.), but pseudonyms (i.e., username) that can be used to derive subsets for each subject. Inferential disclosure can be applied to the attributes (e.g., location) and reveal not only the identity but also further personal information about the subject.
Starting with the inferential disclosure of the subject’s identity, an attacker may use the subject’s space-time stamps to make a guess about their potential home address. In a study that used GPS trajectories of participants, the author re-engineered the real home addresses with a median distance error of 60.7 meters and the identity of a small fraction of the participants (Krumm 2007). Furthermore, LBSN data contain additional attributes that can lead to greater information disclosure compared to GPS data. Alrayes and Abdelmoty listed twelve types of disclosure that can be inferred from combining and analysing the spatial, temporal, and non-spatial semantics such as times spent away from home, activities during weekends, and time and location of meetings with friends, amongst others.
The simplest approach to mitigating such privacy threats is to remove pseudonyms and other information that can be used to derive subsets per subject prior to the release of a dataset. This can indeed be an effective solution, since many studies are interested in aggregated results, such as the spatial-temporal distribution of a topic of interest in a study area. If analysis by user or group of users is needed, pseudonyms may be stored, but the dataset has to be anonymised. Prior to the anonymisation, the data holder should consider the total number of observations by time intervals per user. Time information is critical in inferring personal information from geosocial network data. For example, locations derived after midnight and during weekdays can be used as a starting dataset to infer home addresses. The probability of an accurate inference is related to the entropy of the locations; in other words, the lower the entropy, the more confident the inference. More observations within a time interval may lower the entropy, resulting in easier detectability of a pattern. Ultimately, this depends on the temporal frequency of posts by users. A user with sporadic posts may be harder to identify compared to a frequently posting user. Thus, restricting the temporal frequency of the observations per user can be an anonymisation strategy to mitigate disclosure risk.
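The two ideas in this paragraph, location entropy and restricting the temporal frequency of observations per user, can be made concrete with a minimal sketch. The helper names `location_entropy` and `cap_posts_per_day`, the use of Shannon entropy over discrete location labels, and the one-day interval are assumptions made for illustration; the paper does not prescribe a specific entropy definition or interval.

```python
import math
from collections import Counter
from datetime import datetime


def location_entropy(cells):
    """Shannon entropy (in bits) of the location labels visited by one user."""
    counts = Counter(cells)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cap_posts_per_day(posts, max_per_day=1):
    """Keep at most `max_per_day` posts per user and calendar day.

    `posts` is an iterable of (user, datetime, payload) tuples; the
    earliest posts of each day are kept (an arbitrary choice).
    """
    kept, seen = [], Counter()
    for user, ts, payload in sorted(posts, key=lambda p: p[1]):
        key = (user, ts.date())
        if seen[key] < max_per_day:
            seen[key] += 1
            kept.append((user, ts, payload))
    return kept


if __name__ == "__main__":
    # A user who always posts from the same cell has zero entropy,
    # so a home or work inference would be comparatively confident.
    print(location_entropy(["cell_17"] * 5))
    print(location_entropy(["cell_17", "cell_4", "cell_9", "cell_2"]))
    demo = [("u1", datetime(2018, 5, 25, 8), "a"),
            ("u1", datetime(2018, 5, 25, 9), "b")]
    print(cap_posts_per_day(demo))
```

A data holder could compute the entropy per user and time interval before release and use it to decide which users or intervals need stronger protection.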
There are several methods for the anonymisation of LBSN data, which are critically discussed in Section 7. The primary criterion for the selection of an optimal method is that it protects the data sufficiently based on an anonymity measure, such as k-anonymity (Sweeney 2002), l-diversity (Machanavajjhala et al. 2007), or differential privacy (Dwork 2008). Another significant criterion is the spatial effect that a method imposes on the anonymised dataset. For instance, spatial clusters are detected more accurately in anonymised data produced by random perturbation approaches than in data produced by aggregation approaches (Hampton et al. 2010; Kounadi and Leitner 2016). On the other hand, aggregation may be preferred if data are to be analysed or visualised in the same aggregation level as the anonymised data. The effect should also be calculated and communicated to future users. Kounadi and Resch proposed several measures that calculate the effect of the spatial error on the spatial analysis to be performed.
Remove pseudonyms or other subject identification attributes.
Anonymise data if subject distinguishability is required.
Ensure anonymisation method provides sufficient protection based on an anonymity measure.
Select a method that minimises the spatial effect on the anonymised dataset based on their utility.
Calculate the spatial effect of anonymised data on certain types of analysis.
Communicate anonymity level and accuracy errors of anonymised data to future users.
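As an illustration of the last three recommendations, the following sketch applies a simple random perturbation and reports the resulting spatial error. It is a generic example rather than the specific masking methods of the cited studies: the function names, the uniform-disc displacement and the use of projected coordinates in metres are assumptions. It also does not by itself guarantee any anonymity measure such as k-anonymity; that would have to be checked separately.

```python
import math
import random


def perturb(points, max_dist, rng=None):
    """Displace each (x, y) point by a random offset within `max_dist`.

    Coordinates are assumed to be projected (metres). The square root
    makes the displaced positions uniform over the disc area.
    """
    rng = rng or random.Random()
    masked = []
    for x, y in points:
        angle = rng.uniform(0, 2 * math.pi)
        dist = max_dist * math.sqrt(rng.random())
        masked.append((x + dist * math.cos(angle), y + dist * math.sin(angle)))
    return masked


def mean_displacement(original, masked):
    """Average distance between original and masked points (spatial error)."""
    return sum(math.dist(p, q) for p, q in zip(original, masked)) / len(original)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (120.0, 40.0), (300.0, 310.0)]
    masked = perturb(pts, max_dist=250.0, rng=random.Random(7))
    print(f"mean displacement: {mean_displacement(pts, masked):.1f} m")
```

The reported mean displacement is one simple figure that could be communicated to future users alongside the anonymised dataset.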
5. Publication of Maps
Publication deliverables such as maps of confidential and private data may also lead to information disclosure. More specifically, research on confidential point data shows that locations depicted on maps in scientific publications can be re-engineered with considerable accuracy either from a digital or a printed map (Brownstein et al. 2006; Leitner et al. 2007). To the best of our knowledge, no similar research has been conducted regarding social media data. However, if a map distinguishes locations or trips by data subject, a similar re-engineering process can be used and, thus, the risk of disclosure remains.
Researchers must ensure that public cartographic visualisations do not compromise the privacy of the individuals involved in the dataset. A simple way to ensure privacy protection in maps is to lower the spatial or the temporal precision and present aggregate data (Graham 2012). In addition, the Information Commissioner’s Office (ICO), an independent body in the UK, suggests using heat maps (i.e., continuous or aggregated surfaces of densities) or exploring alternative ways of representing confidential information on maps (ICO 2012). Of course, if researchers wish to present detailed unprocessed information on maps, they should use the anonymised versions of their data. In line with Section 3, the spatial error of the map should be evaluated concerning the impact it may have on the readers when interpreting the map.
Ensure privacy protection of public cartographic visualisations.
Reduce the spatial and/or temporal resolution of public maps.
Consider the use of heat maps or other types of cartographic visualisations.
Use anonymised data if it is necessary to publish detailed maps (i.e., locations or trajectories distinguished by subject).
Assess the spatial error and its impact when anonymised data are used in maps.
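Reducing the spatial and temporal resolution before mapping can be as simple as the following sketch, which rounds coordinates and truncates timestamps to the day, then counts posts per coarse cell. The function name `coarsen` and the choice of two decimal places (roughly a kilometre in latitude) are assumptions for illustration; an appropriate resolution depends on the study area and data density.

```python
from collections import Counter
from datetime import datetime


def coarsen(points, decimals=2):
    """Aggregate (lat, lon, datetime) points into coarse space-time cells.

    Returns a Counter keyed by (rounded_lat, rounded_lon, date). No user
    identifier is carried over, so the result cannot be split by subject.
    """
    return Counter(
        (round(lat, decimals), round(lon, decimals), ts.date())
        for lat, lon, ts in points
    )


if __name__ == "__main__":
    posts = [
        (47.8012, 13.0449, datetime(2018, 5, 25, 8, 15)),
        (47.8049, 13.0412, datetime(2018, 5, 25, 19, 40)),
    ]
    for cell, n in coarsen(posts).items():
        print(cell, n)
```

The resulting counts can feed an aggregated density surface instead of a point map.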
6. Data Storing
Boulos et al. described data security as the “missing ring” in privacy-preserving discussions that have predominately neglected risks such as data theft, data loss, or data disclosure to non-authorised parties. The authors highlighted several security measures (e.g., building security, cable locks, cryptography, access authentication, etc.) and suggested a “purpose-built” combination of measures that depend on the type, sensitivity, value, and risk of data. An expert, who acts as a designated privacy manager and whose knowledge extends beyond location-related disclosure risks, should oversee data storing and processing tasks. If unauthorised persons can physically access the storage devices, sensitive data on them should be encrypted to avoid theft. In case data are to be stored or processed on machines provided by third parties within a cloud computing environment, the entire workflow from sending the data to receiving results has to be subject to encryption, which must not be compromised at any stage (e.g., by the use of client-side encryption). Chen and Zhao gave a more detailed overview of cloud computing security architectures and data security issues. On top of these measures, it is of course also important to adhere to well-known security routines that help prevent data theft. Examples of such measures are locking computers when not needed, not writing down passwords, using strong passwords, and not reusing passwords.
Assign a privacy manager or security expert to oversee data storage and processing tasks.
Apply all necessary security measures and best practices throughout the entire workflow.
If storing or processing data on third-party machines, ensure that security standards are upheld throughout the entire workflow.
7. Privacy Concepts and Protection Methods
An approach to protect the locations of LBSN data is to prevent them from being released to unauthorised parties. This can be achieved by allowing the user to set up their own privacy preferences for location disclosure or by transmitting data in an encrypted form. For example, the data can be encrypted when shared with untrusted third-party servers, and then decrypted by the users that the data is intended for (e.g., friends) (Puttaswamy and Zhao 2010). In addition, encryption can be placed in the hands of users, who may apply policies on who may access their private data based on their attributes (i.e., attribute-based encryption) (Baden et al. 2009). Another possibility is that the users decide and adapt the granularity of their shared locations, while probabilistic encryption ensures that their data and preferences remain private (Hu et al. 2017). A third approach is to divide the released information between social network servers and location-based servers (Wei et al. 2012) or to further split them into multiple location servers to prohibit access to users’ social network topology based on their friend sets (Li et al. 2017).
Although encryption, adaptive privacy preferences, and location servers are straightforward privacy protection approaches, they prohibit or limit the use of LBSN data for secondary purposes, such as research studies, which are the scope of this paper. On the other hand, location transformation promises privacy protection while data are shared openly, and data can, thus, be extracted and used for research purposes. Armstrong et al. were the first scholars to anonymise data by transforming their locations and established the term “geographical masking” for the protection of discrete spatial datasets. Later approaches applied geographical masking with the privacy measure of k-anonymity, which ensures that a data subject cannot be distinguished from at least k-1 other subjects (Sweeney 2002; Cassa et al. 2006; Hampton et al. 2010). This concept is best applicable to confidential datasets (e.g., health and crime information) such as locations of domestic violence events where each location can be a direct link to a building or a household. In practice, spatially anonymised regions are defined based on the dataset and underlying population, and then data are either displaced (Kounadi and Leitner 2016) or aggregated within these regions (Croft et al. 2017). Furthermore, spatiotemporal versions of k-anonymity, commonly known as cloaking, have been applied to location-based services data by degrading the location and/or time information that is sent to a server to ensure that queries contain at least k-1 users (Gruteser and Grunwald 2003; Mokbel et al. 2006; Kalnis et al. 2007). Regarding geosocial networks, Freni et al. proposed a technique that combines generalisation, spatial cloaking, and temporal cloaking to protect two types of privacy concerns. The first concern is the uncontrolled disclosure of a user’s location at specific times and the second concern is the uncontrolled disclosure of the absence of a user at a location at specific times (e.g., a user is not at work or home).
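To make the idea of spatial cloaking under k-anonymity more tangible, the sketch below coarsens a user’s location to a grid cell and doubles the cell size until the cell contains at least k users. It is a simplified illustration, not the algorithm of any cited work: the function name `cloak`, the starting cell size, the doubling rule and the projected metre coordinates are all assumptions.

```python
import math


def cloak(location, others, k, cell=100.0, max_cell=1e6):
    """Return a bounding box (xmin, ymin, xmax, ymax) that hides `location`
    among at least k users (the requester plus k-1 others), or None.

    `others` holds the (x, y) positions of the other users. The grid cell
    size is doubled until the cell around `location` contains enough users.
    """
    x, y = location
    while cell <= max_cell:
        cx, cy = math.floor(x / cell), math.floor(y / cell)
        n = 1 + sum(
            1 for ox, oy in others
            if math.floor(ox / cell) == cx and math.floor(oy / cell) == cy
        )
        if n >= k:
            return (cx * cell, cy * cell, (cx + 1) * cell, (cy + 1) * cell)
        cell *= 2  # not enough users: try a coarser cell
    return None  # anonymity level cannot be met within max_cell


if __name__ == "__main__":
    others = [(150.0, 50.0), (350.0, 350.0), (900.0, 900.0)]
    print(cloak((50.0, 50.0), others, k=2))
    print(cloak((50.0, 50.0), others, k=3))
```

The sketch shows the basic trade-off: a higher k or a sparser population leads to a larger reported region and thus lower spatial accuracy. Temporal cloaking follows the same logic along the time axis.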
Shokri et al., in their work on location privacy of mobile users, developed a quantifier (metric) for location privacy that is based on the incorrectness of the adversary in their inference attack (i.e., the higher the number of incorrect inferences, the higher the privacy level achieved). The authors analysed the localisation of users over time from trajectories protected via k-anonymity but found that the desired anonymity level was in some cases over- or underestimated. Another limitation of k-anonymity is that it cannot prevent disclosure from a homogeneity attack (i.e., knowing a person who is in the database) and a background knowledge attack (i.e., knowing a person who is in the database and additional information on the distribution of the sensitive attribute/attributes). Unfortunately, there are many types of datasets, including LBSN data, which may suffer from these attacks. For example, an attacker may know a user or groups of users that have accounts in a geosocial network as well as other background information on the type of inference they are about to make. Privacy concepts such as l-diversity (ensures that an equivalence class has at least l “well-represented” values for the sensitive attributes) (Machanavajjhala et al. 2007) or differential privacy (ensures that the presence or absence of a subject in the data does not alter the probability of the properties of a query answer) (Dwork 2006) are able to protect against these two types of attacks. L-diversity results in protected datasets and differential privacy yields answers to aggregate queries. Although both approaches were formulated in the context of statistical databases, they show great potential for protecting data from geosocial networks and spatial data in general. Nevertheless, we should stress that l-diversity data are still vulnerable to composition attacks (i.e., an attacker uses independent anonymised releases about overlapping populations to compromise privacy), whereas an approach based on differential privacy may withstand them (Ganta et al. 2008).
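A minimal check for l-diversity could look as follows. The sketch uses the simplest reading of “well-represented”, namely at least l distinct sensitive values per equivalence class (often called distinct l-diversity); stricter variants exist. The function name and the example attributes (a cloaked region as equivalence class, a venue category as sensitive attribute) are hypothetical.

```python
from collections import defaultdict


def is_l_diverse(records, l):
    """Check distinct l-diversity.

    `records` is an iterable of (equivalence_class, sensitive_value)
    pairs. Every equivalence class must contain at least `l` distinct
    sensitive values; otherwise membership in the class reveals the value.
    """
    values = defaultdict(set)
    for group, sensitive in records:
        values[group].add(sensitive)
    return all(len(v) >= l for v in values.values())


if __name__ == "__main__":
    # Hypothetical check-ins: (cloaked region, venue category).
    checkins = [
        ("region_A", "bar"), ("region_A", "clinic"),
        ("region_B", "clinic"), ("region_B", "clinic"),
    ]
    # region_B is homogeneous: knowing that a user is in region_B
    # already discloses the venue category.
    print(is_l_diverse(checkins, 2))
```

In the example, region_B may satisfy k-anonymity with k = 2 and still leak the sensitive value, which is the weakness that l-diversity addresses.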
One of the first attempts to prevent these attacks for location data is the work by Cormode et al. (2012), who adapted spatial indexing methods such as quadtrees and kd-trees to provide spatial decompositions that are differentially private. The decompositions allow queries on how many individuals (or other point objects in question) fall within a given region. However, considering the complexity of geosocial network data, this is only one of the many possible queries that entail private information.
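The following sketch illustrates the underlying idea of differentially private region counts with a flat grid and the Laplace mechanism. It is a deliberate simplification and not the hierarchical decomposition of the cited work: the function names and the single-level grid are assumptions, and the privacy guarantee stated in the comments only holds if each individual contributes exactly one point.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_grid_counts(points, cell, epsilon, bounds, rng=None):
    """Noisy point counts for every grid cell inside `bounds`.

    Assumes one point per individual, so adding or removing a person
    changes exactly one cell count by 1 (sensitivity 1). Laplace noise
    with scale 1/epsilon per cell then gives epsilon-differential
    privacy for the whole histogram. Empty cells receive noise as well.
    """
    rng = rng or random.Random()
    xmin, ymin, xmax, ymax = bounds
    nx = math.ceil((xmax - xmin) / cell)
    ny = math.ceil((ymax - ymin) / cell)
    true = Counter(
        (int((x - xmin) // cell), int((y - ymin) // cell)) for x, y in points
    )
    return {
        (i, j): true[(i, j)] + laplace_noise(1 / epsilon, rng)
        for i in range(nx) for j in range(ny)
    }


if __name__ == "__main__":
    pts = [(5, 5), (6, 7), (15, 5)]
    noisy = private_grid_counts(pts, cell=10, epsilon=0.5, bounds=(0, 0, 20, 20))
    for key in sorted(noisy):
        print(key, round(noisy[key], 2))
```

Smaller values of epsilon give stronger privacy but noisier counts, so the usefulness of the released grid depends on how many points fall into each cell.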
Another possible query is to identify locations of interest near other locations. For instance, social media applications use the users’ personal trajectories (captured via check-ins) to give them suggestions about which places to visit. An attacker may use the recommended locations to make individual inferences such as the user’s actual trajectory. Zhang et al. proposed sanitisation approaches that allow recommendation queries without revealing the user’s trajectory. In a similar way, the LocBorg approach retains online personas by suggesting users add posts that are similar to their topics of interest but have fake locations (Zakhary et al. 2017). This approach might be useful for aspatial studies, in which sensitive attributes and personal profiles are important, but location accuracy is of no interest.
Another approach based on differential privacy for individual-level location data is the notion of “geoindistinguishability” (Chatzikokolakis et al. 2015), which allows users to be protected within an adaptive radius of r, for which the desired privacy (l-privacy) increases with the distance. The advantage of geoindistinguishability compared to the previous two approaches is that it is applicable to queries related to a single user (location at specific time) rather than providing aggregate information about several users.
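A common way to obtain this kind of guarantee in the geo-indistinguishability literature is to add planar (two-dimensional) Laplace noise to a single location. The sketch below assumes projected coordinates in metres and a privacy parameter epsilon expressed per metre; the function name is hypothetical, and practical issues such as truncation to a study area or discretisation are ignored. The radial distance of planar Laplace noise follows a gamma distribution with shape 2 and scale 1/epsilon, which is what the code samples.

```python
import math
import random


def planar_laplace(x, y, epsilon, rng=None):
    """Report a noisy version of the location (x, y).

    The direction is uniform and the displacement radius is drawn from
    Gamma(shape=2, scale=1/epsilon), i.e. the radial part of a planar
    Laplace density proportional to exp(-epsilon * distance).
    """
    rng = rng or random.Random()
    theta = rng.uniform(0, 2 * math.pi)
    radius = rng.gammavariate(2, 1 / epsilon)
    return (x + radius * math.cos(theta), y + radius * math.sin(theta))


if __name__ == "__main__":
    # epsilon = 0.01 per metre gives an expected displacement of 2 / 0.01 = 200 m.
    print(planar_laplace(1000.0, 2000.0, epsilon=0.01, rng=random.Random(3)))
```

Because the noise is applied to each reported location independently, the approach suits single-user queries, but repeated reports from the same place consume privacy budget and can be averaged by an attacker.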
Furthermore, a typical use of social media data in research is to identify and examine spatial clusters of features of interest. Wang et al. proposed a method that provides differentially private results in areas with high concentrations of privacy-preserved tweets. The outcome can be used to identify correlations between users and events without identifying the exact locations. Moving away from differentially private methods but still looking into research applications, the location history of users can be used to predict their next locations. Xue et al. developed a destination prediction model that explores the check-in service of geosocial networks and a privacy protection method against such attacks.
8. Discussion and Future Research Directions
How LBSN data can be analysed and published in a responsible manner is not only a technical question but a legal one as well. Depending on which country data are from and published in, different legal restrictions that may go beyond the recommendations given in this paper may apply. Examples of such legal frameworks can be found accompanying the many open data portals that some governments operate to share their data.
8.1. The EU Open Data Portal and the General Data Protection Regulation (GDPR)
The EU Open Data Portal (European Parliament 2011) is used to unify all open data portals of the EU member states. As a legal framework, it adheres to Regulation 45/2001 on processing of personal data by the EU institutions (European Parliament 2001), which applies by proxy to the EU member states. However, taking a closer look at how it is implemented in the respective member states reveals that, even with a universal privacy protection law, differences may occur in this respect. This is shown by Custers et al., who compared how different EU governments and their citizens enforce and allocate resources for data protection and engage in debate about the topic.
As of 25 May 2018, the General Data Protection Regulation (GDPR) (European Parliament 2016
) has been in effect, which affects LBSN operating within the European Union or the European Economic Area. Its goal is to empower users by enforcing transparency and constraints for the storage and processing of personal data, thus forcing LBSN and other organisations to design their data storage facilities in a way that precludes misuse. The GDPR requires that personal data are stored according to principles such as privacy by design and by default, minimising data storage time, informing individuals about how their data are processed, and purpose limitation. Concretely, it regulates data processing with respect to the following aspects:
Lawful basis for processing: if no user consent for data processing has been provided, there needs to be a legal basis for analysing data, such as public interest, contractual obligations or to protect the interest of the subject
Responsibility and accountability: responsibility and the liability of the data controller to implement effective data and privacy protection measures
Data protection by design and by default: high level of privacy by default, including encryption, and rules for the analysis of data
Pseudonymisation: replacing bits of information with random information (e.g., replacing names with random names) to avoid re-identification
Right of access: a subject’s right to access their personal data
Right to erasure: a subject may request the erasure of all their personal data
Records of processing activities: documentation of the data processing steps, including their purpose, the categories of used personal data, the projected time limits for erasure, or a general description of taken security measures
Data protection officer: a data protection manager has to be assigned in every institution
Data breaches: the data controller is legally obliged to notify the supervisory authority about any data breach
Sanctions: warnings, audits or fines can be issued
Business to business (B2B) marketing: allowed, provided consent or legitimate interest is given
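As an illustration of the pseudonymisation aspect listed above, the following minimal sketch replaces user identifiers in a set of posts with random pseudonyms while keeping the mapping consistent, so that posts by the same user remain linkable for analysis. The class and function names and the record layout (`user`, `text`) are hypothetical and chosen only for this example. Note that pseudonymising identifiers does not by itself remove the re-identification risk carried by coordinates or post content.

```python
import secrets


class Pseudonymiser:
    """Map each real identifier to a stable random pseudonym."""

    def __init__(self):
        # real identifier -> pseudonym; this table must be stored separately and securely
        self._mapping = {}

    def pseudonym_for(self, identifier):
        if identifier not in self._mapping:
            self._mapping[identifier] = "user_" + secrets.token_hex(8)
        return self._mapping[identifier]


def pseudonymise_posts(posts, pseudonymiser, id_field="user"):
    """Return copies of the post records with the identifier field replaced."""
    return [
        {**post, id_field: pseudonymiser.pseudonym_for(post[id_field])}
        for post in posts
    ]
```

Deleting the mapping table after processing makes the replacement irreversible for the data holder as well, which may or may not be desirable depending on whether the right to erasure needs to be served later.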
Importantly, for research campaigns, the GDPR does not apply in the following circumstances:
Lawful interception, national security, military, police, justice
Statistical and scientific analysis
Deceased persons are subject to national legislation
There is a dedicated law on employer-employee relationships
Processing of personal data by a natural person in the course of a purely personal or household activity
8.2. The Challenges of Diverging National and Supra-national Legislation
The legal constraints for storing and processing personal data and the right to privacy differ widely between countries. Noorda and Hanloser
) provided an overview of selected national legislations from across the world and pointed out some of their incompatibilities. They also gave examples of cases in which such incompatibilities allowed privacy violations to be committed with impunity, illustrating the serious consequences of such unclear legal situations. Custers et al.
) showed that even on a smaller scale, within the EU, such incompatibilities exist.
The most severe limiting factor in this regard is the varying interpretation of “privacy” in different parts of the world. For instance, privacy can be traded as an economic good by its owner in the USA, whereas it is protected by law in the European Union. An ideal but unlikely case would be that supra-national legislative bodies and initiatives set up appropriate world-wide regulations (Resch et al. 2012
). As shown in Figure 1, legislation and governments play highly different roles in these two environments.
This makes it impossible to draw up one universally applicable and legally binding set of rules for data storage. As a consequence, researchers must respect not only the data storage and processing conditions set forth by data providers and best practice guidelines but also those imposed by their national jurisdiction and the jurisdiction of their data subjects.
8.3. Future Research Directions
outlines the limitations of existing studies on inference attacks
. They mainly arise from the lack of true and measurable actual data. When inferences are made about the private information of individuals, the ideal means of validating them is to confirm them with the tested individuals. In reality, however, it is virtually impossible to get these individuals to report on their private matters (e.g., where do you live? Where do you go on weekends?). Furthermore, if private aspects are reported, they are often prone to a number of biases, such as the cooperative principle (respondents may alter their statements when answering questions repeatedly), retrospective biases that may be caused by delayed responding (e.g., inaccurate recall, recency effects, false memories), or the fact that some respondents may stick with what they answered earlier in order to appear consistent and not contradict themselves (Bluemke et al. 2017).
A major incentive of inference attack studies is to raise awareness of the negative implications of people's regular geosocial media practices (e.g., geotagging posts). If such studies are conducted in an ethically responsible manner and carried out by reliable research campaigns, people may become motivated to participate for the benefit of society.
In addition, studies on inference attacks typically use the spatial information (i.e., coordinates and trajectories) to make inferences. However, McKenzie et al.
) demonstrated that the protection of private location information should not be exclusively handled from a spatial perspective. Place-based information co-exists in the semantic signatures of geosocial footprints such as the spatial, temporal, and thematic inductiveness of posts. Furthermore, another feature of LBSN that has not been discussed in this paper is image data (geocoded or not). In fact, the link between image data and information disclosure remains a significant research gap in the literature of geoprivacy. Image data generally belong to the LBSN features that may increase the re-identification risk. For example, similarity and clustering algorithms can match an image “A” (or set of images) to another image “B” (or set of images) (Kawakubo and Yanai 2011
; Lv et al. 2004
; Chen et al. 2005
). If image “A” belongs to a fully anonymous account of an LBSN user that has location information data (e.g., geotagged messages and geotagged pictures), and image “B” belongs to another identifiable account (or any source of information linked to individuals), then one can draw conclusions about the involved individuals or even infer that they are the same person. As a result, information from the anonymous account can be disclosed.
Most importantly, the field of computer vision deals, amongst other things, with image location recognition algorithms (Arase et al. 2009
; Zhang and Kosecka 2006
; Hays and Efros 2008
; Li et al. 2010
). Thus, images of anonymous accounts can be directly processed to identify location patterns of the users. However, this has not yet been studied in the context of geoprivacy and spatial re-identification risk.
Moreover, there is a need to match and harmonise scientific knowledge (e.g., protection methods and privacy by design guidelines) and the legal aspects of location privacy (e.g., how should privacy be protected in Europe based on the GDPR?) with the use of technological tools. One such tool is a spatial decision support system (SDSS). An SDSS can be tailored to the application domain of geoprivacy in order to guide “data holders” and researchers (or principal investigators in larger research campaigns) when anonymising their data. Such a system can take the form of a graphical user interface (GUI) that allows users to interact with the program and make informed decisions. As we explained earlier in the paper, decisions on the anonymisation or protection of LBSN data involve certain standard principles but also depend on, or could be adapted to, the intended analyses (e.g., regression, classification, point pattern analysis, clustering, etc.), as well as the type of release (e.g., an aggregated table or a detailed point distribution map).
Finally, an essential aspect of future research efforts in the area of geoprivacy is the unclear consequences that the GDPR may have. Although the GDPR is in place and has to be followed, it is not entirely clear which measures researchers need to take to comply with the regulation with respect to data acquisition, storage, processing, visualisation or sharing. This problem is rooted in the ambiguous and non-exhaustive formulations of the GDPR. Consequently, detailed interpretations of the GDPR may only be possible after a number of court cases, which may compromise current research practices and put severe limits on the operational procedures of research involving personal geodata.