Digital Contact Tracing: Large-scale Geolocation Data as an Alternative to Bluetooth-based Apps' Failure

The currently deployed contact-tracing mobile apps have failed as an efficient solution in the context of the COVID-19 pandemic. None of them has managed to attract the number of active users required to achieve an efficient operation. This urges the research community to re-open the debate and explore new avenues that lead to efficient contact-tracing solutions. This paper contributes to this debate with an alternative contact-tracing solution that leverages already available geolocation information owned by BigTech companies with very large penetration rates in most countries adopting contact-tracing mobile apps. Moreover, our solution provides sufficient privacy guarantees to protect the identity of infected users as well as precluding Health Authorities from obtaining the contact graph from individuals.


I. INTRODUCTION
There is growing evidence that any strategy to effectively fight COVID-19 requires an efficient tracing of all contacts of infected individuals. Recent studies conclude that manual tracing is not fast enough and recommend the use of digital contact tracing systems able to use large-scale location information [1]. A key element of the success of a digital contacttracing system is its adoption.
Singapore was one of the first countries implementing a digital contact-tracing system in early 2020. They opted to implement a mobile app that uses bluetooth (BT) technology to identify when two users have been in close proximity. If one of those users is tested positive in COVID-19, the other one is identified as a potential contagion. 20% of the population in Singapore installed the mobile app. But this was not enough. Indeed, a responsible from the Ministry of Health of Singapore stated that they would need three quarters of the citizens installing the app to make the digital contact-tracing strategy successful [2].
Although it is not clear what is the adoption rate from which a BT contact-tracing app becomes efficient in controlling a pandemic, some preliminary studies suggest that to mitigate the pandemic an adoption by 60% of the population in a country would be required [1] [3]. Some simulations studies show that if the adoption is below 20% the benefit of a BT contact-tracing app is very small, but we can observe a significant impact with 40+% adoption rate [3]. Note that, by adoption rate we refer to the rate of people actively using the app, rather than the number of installations.
BT-based contact-tracing apps have a major problem. They are newly released and thus they need to achieve the required high adoption rate in a short period of time from scratch. To the best of our knowledge, neither researchers nor public or private institutions have proposed a convincing strategy to achieve the required adoption rate. For the time being, it seems that the success of any BT contact-tracing app depends solely on the self-responsibility of people, and it has not been enough.
Despite of the described problem and the reported failure of Singapore's app, most western countries (especially in Europe) have also opted for mobile apps using BT technology as their contact-tracing systems. In particular, most of these countries opted for using the Decentralized Privacy-Preserving Proximity Tracing (DP-3T) protocol [4]. The main design goal of DP-3T is to provide full-privacy guarantees. In particular, it aims at guaranteeing that the contact-tracing applications using this protocol cannot be misused in the future for privacy-intrusive practices such as advertising, or even, massive surveillance. To support Health Authorities willing to deploy contact-tracing apps, Google and Apple integrated in their mobile operative systems (OS) a version of the DP-3T protocol. The OS just records user encounters using BT and offers this information to the mobile app, which implements the algorithm to identify risk contacts. In spite of this effort, to the best of our knowledge, none of the existing contact-tracing apps has significantly contributed to mitigate the virus transmission so far. For instance, early data from the Swiss Health Authority indicates that just 12% of infected individuals report their positive through the app [5]. In Spain this number shrinks to 2%.
In addition, as scientific evidences about the airbone transmission of COVID-19 become almost irrefutable [6], another important limitation of existing BT contact-tracing apps arises. They are designed to identify short-distance contact between two individuals, i.e., less than 2 meters apart. However, airbone transmission implies that contagion between two persons at longer distances is possible. Hence, existing BT contacttracing apps may miss an important fraction of contacts that should be identified as risk contacts.
Finally, solutions like DP-3T, designed with the main goal of offering full-privacy, present further important shortcomes in the fight of a pandemic. Some of them are: 1) Even if the adoption rate were high, they require infected users voluntarily declare their positive condition through the app, leaving a very important task such as the control of a pandemic in the hands of individuals' decision. For instance, an early study in Switzerland shows that 1/3 of the users of the app tested positive did not use the app to report their case [5]; 2) The performance and efficiency of the contact-tracing app cannot be assessed. Not even how many infected users have been detected through the app, as recognized by authors of the DP-3T protocol [5]; 3) They are unable to provide even aggregate (and not privacy invasive) context information, which might be of great value to improve our knowledge on the transmission patterns used by COVID-19 (or other viruses). For instance, in this paper we consider the following one: revealing aggregate statistics of the type of locations (restaurants, sport facilities, public transportation, hospitals, etc) infected users visited while they were contagious may be useful to identify statistical biases on specific type of locations that may reveal hotspots for the virus transmission.
Given the described context, the main goal of this paper is to urge the research community to expand the definition of digital contact-tracing systems having in mind the following key elements: 1) the experience gained so far invites to avoid solutions that require to achieve massive adoption from scratch; 2) contact-tracing solutions must be designed considering airbone transmission distances (few meters) as reference; 3) guide the design of the solutions setting the efficiency in fighting the pandemic (i.e., saving lives and mitigating the impact on the economy) as the primary goal instead of privacy. Of course, the proposed solution should be compliant with the existing data protection and privacy laws in the country where it is deployed.
In this paper, we propose an alternative digital contacttracing system based on the three previous key elements as fundamental design principles: 1) High adoption rate: We propose to use real-time location information from (literally) billions of people around the world that is already available in databases of large BigTech companies like Facebook (FB), Google, Apple, etc. We refer to these players as Location Providers (LPs) in this paper. Some of these LPs, mainly Google and Facebook, have a very large rate of active users, over 50%, in many western countries.

2) Contact identification in airbone transmission range:
To geolocate users at both outdoor [7] and indoor locations [8] with an accuracy of few meters, these BigTech firms use a combination of techniques that rely on multiple signals including: GPS location information, WiFi SSIDs signal's power, cellular network signals, etc. 3) Legal and Ethical Requirements: We are interested in performing contact-tracing just for individuals who has been tested positive of COVID-19. The identity of infected individuals is sensitive information handled by the Health Authority (HA) of each country, which is also responsible for running the contact-tracing strategy. Therefore, the HA has the identity of infected individuals while the LP has the data to perform the contacttracing for those individuals. We propose a system that allows running contact-tracing using LPs data on those individuals tested positive as reported by HAs. Even the most restrictive data protection laws, like the GDPR [9], explicitly provision exceptions in which personal data can be used to monitor epidemics and their spread (see GDPR Article 6 Recital 46 [9]). Sustained on this legal basis an agreement to perform an exchange of data between LPs and HAs might be possible. However, to provide higher privacy guarantees, in this paper we propose a simple architecture and communication protocol that enable the exchange of information between a LP and a HA significantly limiting the ability of (1) HAs to obtain the contact graph of an individual and (2) LPs to learn the identity of infected individuals. We acknowledge that we have no evidence of whether our system will solve the contact-tracing problem, but we believe it is a technically sound alternative worth exploring. In addition, it serves the main purpose of this paper: to encourage the research community to revisit the design of digital contacttracing solutions that are efficient to mitigate future waves of COVID-19 pandemic and if not, at least, future epidemics.

II. SOLUTION RATIONALE
We propose a novel contact-tracing solution that uses geolocation data of billions of users to find people that have been in contact to individuals tested positive. We refer to them as risk contacts. The geolocation information is owned by BigTech companies referred to as Location Providers (LPs) in this paper, and the information of users tested positive is owned by Health Authorities (HA). The core of our solution can be described as follow: HAs send to LPs the IDs of infected users. LPs use the location information they own to find the risk contacts of the received IDs (according to the guidelines provided by epidemiology experts) and send back the list of risk contacts IDs to the HA. Finally, HAs reach out the risk contacts to inform them about the prevention protocol they have to follow.
Note that for practical purposes, we propose to use the mobile phone number of individuals as user IDs in our solution. LPs know the mobile phone number of a major part of the users using their services, and it is reasonable to assume HAs record the mobile phone of infected users to communicate with them.
Unfortunately, the direct exchange of data in clear between HAs and LPs presents important privacy issues. In particular, LPs should not receive in clear IDs of infected individuals and HAs should not be able to link the IDs of risk contacts to their correspondent infected user. Our solution addresses this challenge allowing to perform the contact-tracing task with strong privacy guarantees. To this end, we define an architecture and a communication protocol that involve in addition to LPs and HAs two more players: an Identity Provider (IDP) and an Independent Third-Party Authority (ITPA).
A. Why using geolocation data?
Adoption: The main limitation of contact-tracing based on mobile apps is the need to achieve a high rate of active users. This is a major bottleneck that so far has led every attempt in this line to fail.
Our solution avoids this bottleneck by using large-scale already available geolocation data owned by BigTech companies. To explicitly compare the penetration of BigTechs' data vs. BT mobile apps, Table I shows for 18 countries we have found data on the number of installations of contact tracing apps: 1) the penetration rate of smartphones, Android OS [10] [11] [12] and the Monthly Active Users (MAU) reported by FB; 1 2) the penetration rate of BT mobile-app in number of installations as well as an estimation of its penetration in terms of active users. The list of sources we have used to report the number of mobile apps installations can be accessed here. 2 Note that, to the best of our knowledge, Switzerland is the unique country reporting the percentage of active users of its app, 63% as of Dec 21, 2020 [13] . In order to have an estimation of the fraction of active users for other countries reporting the number of installations, we apply to their number of installations the Swiss ratio.
According to our estimation, none of the countries reach a significant adoption rate close to 40% for the contacttracing mobile apps, and only 5 countries are above 20%. In contrast, Facebook penetration is beyond 50% in all countries but Germany (45.5%). Similarly, the penetration of Android is higher than 40% in all countries but US (32%) and Switzerland (39%). Note the Android penetration just represents a lower bound of Google penetration. Google has few other extremely popular apps such as Gmail and Google Maps that are widely used by iOS users.
Accuracy: BigTech companies use sophisticated techniques combining GPS, WiFi and cellular networks signals to geolocate users with high precision both outdoors and indoors [7] [8]. Indeed, Google claims to be able to geolocate users with an accuracy of 1 to 2 meters using multilateration algorithms based on the WiFi signal from 3 access points. [8].
Therefore, high penetration rates and location accuracy of BigTechs present them as a data source that may be enough to implement efficient contact-tracing solutions. Recent research works, which uses data from LPs with much lower penetration than FB or Google, also backup this hypothesis [14].

B. Other benefits
The proposed solution allows to monitor its performance. In addition, geographical locations can be associated to specific categories referred to as Point of Interests (POIs). For instance, a given location can be mapped to a restaurant, a train station or a hospital. Our solution exploits this to provide statistical distribution of the POIs visited by infected users vs. POIs visited by the general population. The comparison of these distributions may help to identify statistical biases in POIs regularly more visited by infected users, that might be infection hotspots.

C. Privacy requirements
On the one hand, privacy experts and Data Protection Authorities (DPAs) have shown concerns to use geolocation information for digital contact tracing. They basically argue that it may ease governments through their HAs to implement massive surveillance due to the scalability provided by digital technologies. Therefore, our solution should limit the ability of HAs to massively infer the contact graph information of individuals using the data received from LPs. In addition, it should provide privacy provisions to allow revealing targeted attacks willing to infer the contact graph of particular individuals.
On the other hand, BigTech companies have means to infer the identity of infected individuals. They can leverage geolocation data but also other information sources such as emails, posts in social networks or queries in search engines they own. For instance, they can detect a user who visited a testing facility after visiting its website and then remains at home for a period similar to the mandatory quarantine period. Therefore, we believe that proposals, as ours, leveraging BigTech companies' geolocation data do not impose any extra risk to infected users' privacy. In spite of this, appropriate privacy guarantees should be provided. In particular, our solution should not provide LPs explicit information about the identity of infected users. It also should limit the ability of LPs to infer such identities from the information received from HAs.

D. Meeting Privacy Requirements
In order to meet the defined privacy requirements we leverage the following principles: K-anonymity, basic cryptography and non-repudiation auditing. K-anonymity: In our solution the HA sends a list of user IDs to the LP and the LP answers with the risk contacts of those user IDs.
Leveraging K-anonymity principles, the HA mixes in its request M IDs from infected users and N real random IDs (i.e., random mobile phone numbers associated to real users) where M <<< N . This serves to anonymize the identity of infected users and to hinder the capacity of LPs to easily infer the IDs belonging to infected users. The random IDs used by the HA are provided by the Identity Provider (IDP) in order to guarantee that they are existing IDs. In our solution, IDPs are represented by mobile network operators.
In addition, the HA must aggregate the IDs into groups. There are two types of groups: infected groups include exclusively IDs from infected users; random groups include IDs from random users or a mix of random and infected users. The messages from the HA to the LP include K groups from which only L are infected groups, where L <<< K. Upon the reception of a request message from the HA, the LP computes the risk contacts of each user ID. After that, it aggregates together in the reply the risk contacts of all user IDs in a single group. This aggregation process relies on the Kanonymity concept to prevent the HA from linking the received risk contacts IDs to a specific individual. Note that the larger is the size of the groups the higher are the privacy guarantees.   [15].
Cryptography: A honest HA is interested only on the risk contacts IDs associated to infected groups. To hinder the ability of HAs to access contacts IDs from random groups, the LP encrypts the list of contacts of each group (included in the reply to the HA) using a different key per group. Therefore, the HA receives the contact IDs of all groups encrypted. To retrieve the keys of infected groups, the HA has to send a request to an intermediary that we refer to as Independent Third-Party Authority (ITPA). In this request, the HA indicates the total number of groups in the query as well as the ID of infected groups. In turn, the ITPA requests the keys of all groups to the LP and forwards to the HA only the keys associated to the infected groups. Finally, using the received keys, the HA obtains the risk contact IDs associated only to infected groups, thus completing the contact tracing procedure.
Non-repudiation auditing: Our solution relies on the concept of liability to guarantee privacy rights of the users. Note that this is a widely adopted approach in the legal system of advanced democracies. For instance, a state cannot prevent anyone from driving above the speed limit, but anyone doing so is liable for it. In the case of privacy, a state cannot prevent a BigTech company implementing privacy intrusive practices, but punish them in case an auditing process reveals the use of those practices. Therefore, a HA or a LP that uses the data they receive for purposes different than contact-tracing will be liable for it.
For instance, a malicious HA can implement a targeted attack (see Section IV) to unveil the contact graph of an individual and leaking it to other government branches. This would be a crime equivalent to leaking the medical record of a target individual to other government branches. Our solution collects the required non-repudiation proofs to be used by the corresponding auditing entity to unveil any potential attack by a HA.

III. PROTOCOL FOR CONTACT-TRACING USING LOCATION PROVIDERS INFORMATION
In this section we describe the steps of the communication protocol including the sequence of messages exchanged by the four players involved in our solution: Health Authority (HA), Location Provider (LP), Identity Provider (IDP) and Independent Third Party Authority (ITPA).
Step 0: This step refers to the basic context our solution relies on. On the one hand, LPs record historical location information from users running their OSes, mobile apps, etc. In addition, they also store the mobile phone number for a major portion of the users. On the other hand, IDPs (i.e., mobile operators) provide users with mobile phone numbers that serve as user IDs in our solution.
Step 1: The HA obtains the IDs of users that have been tested positive in a given time window (e.g., a day).
Step 2: The HA triggers the contact-tracing process by requesting the IDP a list of N user IDs (i.e., real mobile phone numbers). The value of N is decided by the HA and may differ from one request to another.
There are few remarks to be considered: (1) This message includes a unique identifier referred to as Transaction ID that will be included in all the remaining messages in the process; (2) The message is signed with the private key of the HA. Note that in the rest of the process all entities sign with their private key the messages they send.
Step 3: The IDP responds the HA request with a a list of N random user IDs.
Step 4: The HA creates K groups. As explained above, only L of these groups are infected groups and K − L are random groups. The resulting groups are included in a Contact-Tracing Request message that is sent to the LP. It is important to note that the user IDs included in an infected group cannot be present in other infected groups neither in this request nor in past or future requests.
Step 5: Upon the reception of the Contact-Tracing Request the LP runs the contact-tracing algorithm to identify the risk contact IDs of each user ID included in the request. The risk contact IDs from all users in a group are aggregated so that any link between a user ID and a risk contact ID is eliminated. In addition the LP collects the POIs visited by each user ID in a defined time window in the past (e.g., last 10 days). Then, it computes the distribution of types of POIs visited by the user IDs included in each group as well as the overall distribution of types of POIs visited by all user IDs included in the request.
The information associated to each group, i.e., list of risk contact IDs and distribution of type of POIs is encrypted with an independent key per group.
Finally, the LP aggregates the encrypted information per group along with the distribution of types of POIs for all users IDs and creates a Contact-Tracing Reply message that is sent to the HA.
Three important remarks to consider are: (1) The LP must keep record of the key used to encrypt each group; (2) The contact tracing algorithm implemented by the LP as well as the number of days for the identification of visited POIs must be defined by epidemiologists and it is out of the scope of this paper; (3) the LP stores all the Contact-Tracing Request messages received for auditing purposes.
Step 6: Upon the reception of the Contact-Tracing Reply the HA needs to decrypt the information associated to infected groups, i.e., the risk contacts list and type of POIs distribution.
To this end, it sends a Keys Request message to the ITPA including the total number of groups included in the Contact-Tracing Request and the identifiers of the infected groups.
Step 7: The ITPA sends to the LP the Keys Request message but it only includes the Transaction ID.
Step 8: Upon the reception of the Keys Request message, the LP sends to the ITPA a Keys Reply message including the keys for all groups.
Step 9: The ITPA checks if the number of keys in the received reply matches the actual number of groups reported by the HA. If the numbers are the same, the ITPA generates a Keys Reply message to the HA that only includes the keys of the infected groups. Otherwise, the Keys Reply message includes an error indicating that the reported number of groups does not match with the number of keys provide by the LP.
Step 10: Upon the reception of the Keys Reply message, the HA decrypts the information about risk contacts and type of POIs distributions included in the Contact-Tracing Reply for the groups of infected users.
Step 11: The HA gets in touch with risk contacts.

IV. POTENTIAL ATTACKS AND COUNTERMEASURES
As explained above, our solution is designed to hinder both, LPs and HAs, from misbehaving to get access to information they are not authorized to obtain. Next, we explain in detail the countermeasures provided by our solution to avoid: (i) LPs trying to infer the IDs associated with infected individuals, (ii) HAs trying to obtain the contact graph of citizens.

A. LP inference of infected users identity
A malicious LP may intend to unveil the identity of infected users based on the information received in Contact-Tracing Request messages (known as re-identification attack). To this end they could use a single request or combine subsequent requests to obtain the identity of infected users.
In order to prevent re-identification attacks, the HA has to reuse the IDs that have been already used including them in random groups of subsequent requests. Otherwise, if random IDs are only used once and discarded, the LP could infer with very high probability that repeated IDs in different queries belong to infected individuals.
In addition to reusing IDs, our solution relies on the Kanonymity principle. The number of random IDs, N , in request messages is several times larger than the number of infected user IDs, M . The complexity to perform a re-identification attack grows with the ratio N M . In addition, our solution allows introducing a high level of randomness in the request messages to avoid that LPs can infer patterns that allow identifying groups including infected users IDs: (i) the number of infected and random user IDs differs from message to message, (ii) the number of groups in a message differs from message to message, (iii) the length of the different groups within the same message should also differ. In addition, the HA could send messages that do not include any infected user ID from time to time.
Beyond the technical measures, the main argument to support our solution is that powerful LPs such as Google or Facebook willing to identify infected citizens can easily do it already with the information they own. Therefore, the privacy measures adopted in our solution provide sufficient guarantees to avoid increasing the risk of a potential re-identification attack by LPs.

B. HAs inference of the contact graph of a user-id
Our solution cannot prevent beforehand a malicious HA from obtaining the contact graph of a particular individual. For instance, a HA can perform a targeted attack by using twice the same ID in two different infected groups (despite it is forbidden in our solution). The common risk contacts in the two groups may reveal the contact graph of the targeted individual.
However, our solution keeps the required non-repudiation proofs to show such an attack has happened. The auditing entity just needs to check whether the HA has used the same ID twice (or more times) in groups of infected users in the same or different messages. The auditing entity can retrieve all the Contact-Tracing Request messages from the LP. Similarly, the auditing entity retrieves from the ITPA, for each Contact-Tracing Request message, what are the infected groups declared by the HA. With that information the auditing entity can easily identify attacks from the HA. The described auditing capacity provides privacy guarantees based on undeniable liability, a widely used technique in developed democracies.
Finally, our recommendation is to run the described auditing process once a day to detect any malicious HA soon after it has implemented an attack.

V. CONCLUSION
The only digital contact-tracing approach used so far to fight COVID-19 pandemic consists on the utilization of mobile apps that leverage Bluetooth technology to identify proximity encounters. The lack of sufficient adoption of such mobileapps has led every single attempt in this direction to fail.
Due to the importance that digital contact-tracing solutions may have to help fighting the pandemic, it is the obligation of researchers, public health authorities and technology companies to explore alternatives until an effective contact-tracing solution is found. To trigger this exploration effort, in this paper we propose a first alternative solution. We propose to use already existing scalable and accurate geolocation data, which is likely to serve to build an efficient digital contact-tracing solution. Our proposal defines an architecture that leverages such data and provide sufficient privacy guarantees to citizens.