Efficient Aggregate Queries on Location Data with Confidentiality

Location data have great value for facility location selection. Because of the privacy issues surrounding both location data and user identities, a location service provider cannot hand over private location data to a business or a third party for analysis, nor reveal the location data in order to run a joint data analysis with a business. In this paper, we propose a newly constructed PSI filter that helps the two parties privately find the data corresponding to the items in the intersection without any additional computation, and we give the PSI filter generation protocol. We use it to construct three types of aggregate query protocols for facility location selection with confidentiality. We then propose a ciphertext matrix compressing method that makes one block of ciphertext contain many plaintext data items while keeping the homomorphic property valid. This method further reduces the computation and communication cost of the query process, and we give the improved query protocol that uses it. We show the correctness and privacy of the proposed query protocols. The theoretical analysis of the computation and communication overhead shows that the proposed query protocols are efficient in both respects, and the experimental results of the efficiency tests show the practicality of the protocols.


Introduction
With the widespread popularity of GPS-enabled electronic devices and advances in information and communication technologies, location-based services (LBS) have developed significantly, and location service providers (LSPs), such as Google Maps and FourSquare, already own a very large amount of user location data, which are of great commercial value in facility location selection. LSPs seek ways to provide aggregate analysis services on location data to other businesses. Businesses are also willing to pay for this kind of service, because an appropriate location can bring much more economic benefit than an ordinary one. For example, suppose a bank plans to open a branch and has several candidate addresses. "Which is the best location?" is a puzzle for the bank because it does not have the location data of its users, so the bank seeks the help of an LSP. Even in public sectors (communal facilities), where economic interest is not the core issue and "value" is defined as being used by, or being convenient for, more people, a proper location makes a public facility more valuable. The desire of the LSP and the business to cooperate raises a problem: for both legal and ethical privacy reasons, the LSP cannot simply send location data to the business or a third party for analysis, nor reveal the location data in order to run a joint data analysis query with the business. How to run a two-party privacy-preserving query between an LSP and a business is therefore a meaningful topic.
The selection of a facility location among alternative locations has been widely studied and is regarded as a multi-criteria decision-making problem that includes both quantitative and qualitative criteria [1]. The location selection query chooses a location with maximum influence for a new facility [2]. We mainly focus on three types of objectives [3] for location selection: (1) maximizing the number of users attracted (maximization is not appropriate for some special businesses such as express delivery, which may instead aim to share the workload equally among facilities; for this objective, the RNNC query is still applicable); (2) minimizing the average distance from all users to their nearest facilities; (3) minimizing the maximum distance from all users to their nearest facilities. These selection indicators correspond to the following three types of aggregate queries, respectively.
(1) Reverse Nearest Neighbor Cardinality (RNNC) query. The RNN query is a very common query for location selection. In our two-party privacy-preserving query scenario, for a (public) set of existing facilities and given query locations, the RNN query computes all the users that are closer to each given query location than to any other facility. Considering the first objective above, our RNNC query only needs to return the set of the cardinalities of the RNN.
(2) Average Distance (AVGD) query. The average distance is the mean value of the distances between all users and their nearest facilities. It is valuable information for the client: minimizing it maximizes user benefit. The AVGD query result measures the impact of a candidate location of the new facility on all users, which can help a business such as a supermarket decide where to put a new branch store in order to maximize the benefit to its customers. In short, the minimum average distance corresponds to the most valuable location.
(3) Maximum Distance (MAXD) query. The maximum distance is the maximum value of the distances between all users and their nearest facilities. It takes the worst case as the selection indicator, i.e., after running several MAXD queries that return the maximum distances corresponding to different candidate locations, the business can compare them and choose the location with the smallest one, maximizing the benefit to the most "inconvenienced" user.
In practice, for multiple alternative locations, a business can run one of these three types of queries according to the chosen criterion for each location, then decide which location is more beneficial to it.
In this paper, we use "server (S)" to denote the LSP that provides aggregate query services and "client (C)" to denote the business that needs location data analyses. Accordingly, we use U s /U c to represent the user list of S/C. We continue to focus on the practical settings described in [3]: (1) We use I = U s ∩ U c , the intersection of the user lists of S and C, as a substitute for C's user list U c , because U s is so large that the number of items (user identities) in U c that fall outside the intersection is small enough to be ignored; (2) The two user lists U s and U c are both private, and neither should be revealed to the other party; (3) The location data of users held by S are private from C.
OUR CONTRIBUTIONS. We propose two-party privacy-preserving aggregate query protocols for three types of queries (RNN Cardinality query, Average Distance query and Maximum Distance query) for facility location selection. Compared with the state-of-the-art scheme [3], our protocols are one step closer to practical application in terms of efficiency. Specifically, our main contributions are as follows:

1. We propose a new construction that we call a PSI filter, which can be used by two parties to filter out irrelevant data corresponding to items outside the intersection without any computation and without revealing any information about the items in their private sets to each other. Specifically, we extract the PSI process of [4] and modify it into a PSI filter generation protocol. The generated filter F consists of two parts, in both of which the items are indistinguishable (they look random). The two parts are held by the two parties (S and C), respectively, and they help to find the valid data in our aggregate query protocols.

2. We propose privacy-preserving aggregate query protocols for three types of queries (RNN Cardinality query, Average Distance query and Maximum Distance query) by using our PSI filter to find the valid data (corresponding to the items in the intersection) and using the Paillier cryptosystem to process the ciphertexts. Compared with the state-of-the-art scheme [3], the use of a PSI filter rather than a superset of the user identity space (for hiding user identities and serving as an index) brings a new advantage: there is no need to increase the communication cost in order to hide the user identity lists from each other. Further, it reduces the number of encryptions from ñ to n s by removing many invalid computations, where ñ (larger than n s ) is the size of the superset and n s is the size of the LSP's user set.

3. We propose a ciphertext matrix compressing (CMC) method that further reduces the communication cost (and, incidentally, the computation cost) of the RNN Cardinality query protocol, and we give the optimized version of the RNNC query protocol that uses CMC. Although our original three types of query processing are similar to some extent, the communication cost of the RNN Cardinality query protocol is O(n s · k), which is k times that of the other two protocols, where k is the number of facilities. Our ciphertext matrix compressing method reduces the communication cost to O(n s · k/η ), where η depends on the security parameter λ and a threshold value α (Section 6). Notably, the compressing method has a broader scope of application: it has great potential in small-value plaintext scenarios for compressing the scale of the ciphertext while maintaining the homomorphic property.

4. We give the security analysis and performance evaluation of the aggregate query protocols. First, we prove the security of the PSI filter generation protocol, based on which we show that the aggregate query protocols are privacy-preserving for the location data owned by the LSP and for the user lists of both the LSP and the business. Finally, we give the theoretical analysis and experimental results of performance.
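To make the ciphertext compression idea in the third contribution more concrete, the following is a minimal sketch of generic slot packing on plain integers. It is not the CMC method of Section 6, whose exact construction (including the roles of η and α) is not described in this part of the paper. The function names `pack` and `unpack` and the slot width `SLOT_BITS` are hypothetical. The sketch only illustrates why several small plaintexts can share one plaintext block under an additively homomorphic scheme: integer addition acts slot by slot as long as no slot overflows.

```python
def pack(values, slot_bits):
    """Pack small non-negative integers into one integer, one slot per value."""
    packed = 0
    for j, v in enumerate(values):
        if not 0 <= v < (1 << slot_bits):
            raise ValueError("value does not fit in its slot")
        packed |= v << (j * slot_bits)
    return packed


def unpack(packed, count, slot_bits):
    """Recover `count` slot values from a packed integer."""
    mask = (1 << slot_bits) - 1
    return [(packed >> (j * slot_bits)) & mask for j in range(count)]


# Hypothetical example: three rows of a 0/1 nearest-facility matrix (k = 3).
SLOT_BITS = 8  # assumed slot width; each column sum must stay below 2**8
rows = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

# Adding packed rows adds the columns slot by slot. Under an additively
# homomorphic scheme this addition would be carried out on ciphertexts.
total = sum(pack(row, SLOT_BITS) for row in rows)
column_sums = unpack(total, 3, SLOT_BITS)
```

In this sketch one packed integer replaces k separate plaintexts, so one ciphertext would replace k ciphertexts, provided the packed value stays below the plaintext modulus and every slot sum stays below its bound.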
The remainder of this paper is organized as follows. After reviewing the related research and giving the preliminaries in Section 2, we introduce our system model and problem definitions in Section 3. In Section 4, we give the definition of a PSI filter and the generation protocol that we use to construct the three types of query protocols in Section 5. We present a ciphertext matrix compressing method in Section 6 to further improve the efficiency of the RNNC query protocol. In Section 7, we give the proofs of correctness and privacy. In Section 8, we give the theoretical analysis and experimental results for performance evaluation. Finally, the concluding remarks are given in Section 9.

Related Work and Preliminaries
The nearest neighbor (NN) query has received vast attention in the spatial database research community since its introduction in [5]. We say point A is the nearest neighbor of B when A is closer to B than any other point under consideration, and naturally, B is one of the reverse nearest neighbors of A. The RNN query was first considered in [6]. In monochromatic RNN [7,8], all points are regarded as belonging to the same class; in contrast, the dichromatic RNN [2,6,9,10] considers two different classes of points: objects and sites. The average distance (for L 1 distance) query was first considered in the min-dist optimal location query [11]. Then, [12] gave a scheme for the L 2 distance version. Recently, for the TIPS problem (Trajectory-aware Inconvenience-minimizing Placement of Services), [13] showed that both MAX-TIPS and AVG-TIPS are NP-hard. Then, a novel query called the reverse nearest neighborhood (RNH) query was proposed in [14]. Unlike an RNN query, an RNH query emphasizes a group of users instead of an individual user.
Most schemes for aggregate queries on location data assume that the data are public or allowed to be sent to another party. Today, when people attach great importance to data privacy, these schemes do not meet our requirements.

Privacy-Preserving Aggregate Query
The privacy-preserving database query was first defined in [15], and Ref. [16] then gave a solution. Ref. [17] showed how to execute SQL aggregations over encrypted data: the authors developed an enhanced encrypted data storage model and introduced formal query implementation techniques to translate the original aggregation queries into a form that can be executed directly over the encrypted data. Ref. [18] proposed a privacy-preserving range query protocol. Ref. [19] proposed Casper, a privacy-aware query processing framework for privacy-preserving NN queries. Ref. [20] then gave efficient protocols for privacy-preserving k-NN searches by using secure multi-party computation technologies. Ref. [21] proposed PDAS, a simple privacy-preserving protocol for computing and verifying queries in outsourced databases against malicious adversaries. For the same problem as in [20], Ref. [22] gave a solution using the Paillier cryptosystem. Recently, Ref. [3] proposed two types of solutions for privacy-preserving aggregate query protocols. The main difference between the two types is which party (the server or the client) does most of the encryptions and decryptions. In either case, the protocols need a superset of size ñ in order to conceal the identities of users and to serve as an index: the positions of all users are fixed, so both parties can operate directly on the corresponding positions.
How to eliminate the superset and filter data without an "index" for aggregate queries on location data, when the two parties each hold a private user set, has not yet been solved. In general, we use homomorphic encryption and the idea of PSI to design our protocols. Unlike traditional PSI protocols, our newly constructed PSI filter generation protocol does not find the intersection directly, because of privacy issues. Instead, the two parties obtain sets of computationally indistinguishable labels generated in the protocol. In comparison with [3], we eliminate the superset through our new primitive, the PSI filter, and further improve efficiency by compressing multiple data items into one block of ciphertext.

Private Set Intersection
A private set intersection (PSI) protocol enables two parties, the Sender and the Receiver, holding private sets X and Y, respectively, to jointly compute the intersection X∩Y without revealing any additional information about their respective sets (the conditions may be weakened or changed in different circumstances). In a one-way version of PSI, only the Receiver can learn the intersection X∩Y, while the Sender learns nothing. In a concrete setting called labeled PSI [23] for applications, there is a label l i for each item x i ∈X, and the protocol enables the Receiver to learn the labels corresponding to the items in the intersection, i.e., the Receiver should learn the set {(x i , l i ) : x i ∈Y}. A labeled PSI can be seen as a variant of one-way PSI.
We construct our protocols using a PSI filter (Section 4), which is generated by a modified one-way PSI. The generation process is particularly similar to a labeled PSI protocol, and the difference is that the labels do not exist before our PSI processing. In other words, the labels corresponding to the items will be generated in the intermediate process.
Moreover, the Receiver does not know the specific correspondence of any pair (x i , l i ). Note that the (generated) labels are important for filtering data in queries, which reduces the communication and computation cost. For consistency, we will hereafter use the titles Server (S)/Client (C) instead of Sender/Receiver and, accordingly, U s /U c instead of X/Y.

Decisional Diffie-Hellman Assumption
The decisional Diffie-Hellman (DDH) assumption is a computational hardness assumption in cyclic groups. It states that no efficient algorithm can distinguish between the two distributions $(g, g^a, g^b, g^{ab})$ and $(g, g^a, g^b, g^c)$, which enables one to construct efficient cryptographic systems with strong security properties.
Formally, let G(λ) be a cyclic group of order q with security parameter λ. For a probabilistic polynomial-time adversary (algorithm) A, we define the advantage of A as
$$\mathrm{Adv}[A, G] = \left| \Pr\left[A(g, g^a, g^b, g^{ab}) = 1\right] - \Pr\left[A(g, g^a, g^b, g^c) = 1\right] \right|,$$
where the probability is over a random generator g of G and random $a, b, c \in \mathbb{Z}_q$. We say that the decisional Diffie-Hellman assumption holds in group G if, for any probabilistic polynomial-time adversary A, the advantage Adv[A, G] is negligible.

Homomorphic Encryption
Homomorphic encryption is a cryptographic primitive that allows computation on ciphertexts without decryption. The homomorphism can be expressed as Enc(m 1 )•Enc(m 2 ) = Enc(m 1 •m 2 ), more strictly, Dec(Enc(m 1 )• Enc(m 2 )) = m 1 •m 2 . Note that in nondeterministic encryption, the previous homomorphic equation is not directly equivalent due to the existence of random numbers. It represents that Enc(m 1 )• Enc(m 2 ) equals one of the ciphertexts of the message m 1 •m 2 .
We choose the Paillier cryptosystem [24] to achieve additive homomorphism, i.e., Enc(m 1 )·Enc(m 2 ) = Enc(m 1 + m 2 ). After selecting the public key pk = (n, g) and the private key sk = (λ, µ), for a message m ∈ M, the encryption algorithm computes the ciphertext $c = g^m \cdot r^n \bmod n^2$, where r is randomly selected from $\mathbb{Z}_n^*$. The decryption algorithm recovers the message by computing $m = L(c^{\lambda} \bmod n^2) \cdot \mu \bmod n$, where $L(x) = \frac{x-1}{n}$. The Paillier cryptosystem provides semantic security against chosen-plaintext attacks (IND-CPA) under the decisional composite residuosity assumption (DCRA). In other words, the encryptions of different messages are computationally indistinguishable without knowledge of the private key.
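The encryption, decryption and additive homomorphism described above can be sketched as follows. This is a toy illustration with deliberately tiny primes (1009 and 1013) and the common choice g = n + 1, both of which are assumptions made for readability; it offers no security and is not the implementation used in the experiments. The function names `keygen`, `encrypt` and `decrypt` are hypothetical.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key generation with g = n + 1 (insecure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^{-1} mod n, with L(x) = (x - 1) / n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    """c = g^m * r^n mod n^2 with random r coprime to n."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


pk, sk = keygen()
n2 = pk[0] ** 2
c_sum = (encrypt(pk, 17) * encrypt(pk, 25)) % n2  # ciphertext product
m_sum = decrypt(pk, sk, c_sum)                    # plaintext sum
```

Multiplying two ciphertexts modulo n^2 yields a ciphertext of the sum of the plaintexts modulo n, which is the only homomorphic operation the query protocols in Section 5 rely on.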

Security Definition
We consider two-party privacy-preserving aggregate query protocols in a semi-honest model, i.e., the adversaries are honest but curious, where the main security issue is to protect the privacy of the two parties. We formally define the security of a two-party protocol by using a simulation paradigm [25,26], which gives us an effective method to prove the security.
Let f = ( f 1 , f 2 ) be a probabilistic polynomial-time functionality and let π be a two-party protocol for computing f . The view of the ith party (i ∈ {1, 2}) during the execution of π on (x, y) and security parameter λ is denoted by $\mathrm{view}^{\pi}_i(x, y, \lambda)$ and equals $(w, r^i; m^i_1, \ldots, m^i_t)$, where w ∈ {x, y} (depending on the value of i), $r^i$ equals the contents of the ith party's internal random tape and $m^i_j$ represents the jth message that it received. The output of the ith party during an execution of π on (x, y) and security parameter λ is denoted by $\mathrm{output}^{\pi}_i(x, y, \lambda)$ and can be computed from its own view of the execution. We denote the joint output of both parties by $\mathrm{output}^{\pi}(x, y, \lambda) = (\mathrm{output}^{\pi}_1(x, y, \lambda), \mathrm{output}^{\pi}_2(x, y, \lambda))$.

Definition 1. (Security for Semi-honest Adversaries) Let f = ( f 1 , f 2 ) be a functionality. We say that π securely computes f in the presence of static semi-honest adversaries if there exist probabilistic polynomial-time algorithms SIM 1 and SIM 2 such that
$$\{(\mathrm{SIM}_1(1^{\lambda}, x, f_1(x, y)), f(x, y))\}_{x, y, \lambda} \stackrel{c}{\equiv} \{(\mathrm{view}^{\pi}_1(x, y, \lambda), \mathrm{output}^{\pi}(x, y, \lambda))\}_{x, y, \lambda},$$
$$\{(\mathrm{SIM}_2(1^{\lambda}, y, f_2(x, y)), f(x, y))\}_{x, y, \lambda} \stackrel{c}{\equiv} \{(\mathrm{view}^{\pi}_2(x, y, \lambda), \mathrm{output}^{\pi}(x, y, \lambda))\}_{x, y, \lambda},$$
where x, y ∈ {0, 1} * such that |x| = |y|, and λ ∈ N.
The simulation paradigm is one of the most important paradigms in the definition and design of cryptographic primitives. In a two-party setting, it can ensure one has not learned anything about the other's secret by showing that the one could have simulated the entire interaction by itself. In other words, one can gain no further knowledge as the result of interacting with the other beyond what it could have discovered by itself. It gives a formal way to prove the security; the details of the proof are shown in Section 7.

System Model
The system model is composed of two parties, as shown in Figure 1: the server S, which provides location-based services, and the client C, which is interested in some aggregated data (query results) that are helpful for selecting the optimal facility location.

Server S has a private user set U s = {v 1 , . . . , v n s }, where v i is the unique ID of the ith user of S. Further, S knows all of its users' locations L = {l 1 , . . . , l n s }, where l i is the location of v i . (S's own data privacy, i.e., how it avoids being compromised, is not in our consideration. Some techniques can be found in related fields, such as hardware security and database security. We focus on the privacy-preserving interactive protocols between S and C.) In practice, l i can be expressed as a point in any coordinate system, for example, l i = (x i , y i ) in a Cartesian coordinate system and l i = (lng i , lat i ) in a geographic coordinate system. C has a private user set U c = {u 1 , . . . , u n c }, where u i is the unique ID of the ith user of C. In addition, C has F = { f 1 , . . . , f k }, a list of the locations of (k − 1) existing facilities and one candidate location, which can be public.
As described in Figure 1, there are two phases of interaction: the setup phase and the query phase. In the setup phase, C and S interact to do some pre-computation in which a PSI filter is generated on the basis of the two user sets U c and U s , in order to facilitate the query phase. After finishing the setup phase, when C wants to run a query, it sends a query request with a specific query type to S. After receiving the query request, S interacts with C to jointly compute an aggregate query result that is obtained by C, while the privacy of both C and S is protected. We discuss the details of the privacy requirements and list all the sensitive information involved in Section 3.2. A useful property is that a fixed C collaborating with S can process aggregate queries many times with just one setup phase, so the computation and communication costs of the setup phase can be amortized over a number of queries.

Threat Model
Both C and S are considered semi-honest (also called honest-but-curious) adversaries [27], as described in Section 2.2.4. That is, they honestly follow their prescribed actions in the designed protocols but may try to learn additional information from the encrypted data and all the intermediate messages they obtain, and there is no collusion between C and S. We say a two-party protocol π is secure in the semi-honest model if neither party gains information about the other's private inputs, other than what can be deduced from the result of the protocol.
In our system model, there is some private information that needs to be hidden as follows.
• U s : these user identities should not be revealed to C;
• U c : these user identities should not be revealed to S;
• L: the user location data are private to C;
• Q: the query result is private to S.
Note that the values of n s , n c and n I are not actually sensitive information. We do not hide them in our protocol, but just in case, we give a simple way to hide n s and n c to some degree. Moreover, although the query result Q is private to S, the location that is finally selected is not private information, because the final location is visible to anyone.

Problem Definitions
Different sorts of aggregate queries can employ the PSI filter proposed in Section 4 to build privacy-preserving query protocols. In our system model, a privacy-preserving aggregate query protocol is a secure two-party computation protocol for a functionality Q that takes (U c , F) from C and (U s , L) from S as inputs. After executing the protocol, C receives the query result Q corresponding to a specific query type, while S gains nothing. Note that this is a sketchy representation of the functionality Q that ignores some inputs such as security parameters and keys, and that the input of the functionality Q has not changed from the previous state-of-the-art scheme. For popularity (usefulness and usage frequency) and ease of comparison with the previous literature, we consider the same three types of queries as in [3]: the RNN Cardinality (RNNC) query, the Average Distance (AVGD) query, and the Maximum Distance (MAXD) query. The formal definitions of the three types of queries are given as follows:

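Definition 2. (RNNC Query) Given a set of locations of facilities $F = \{f_1, \ldots, f_k\}$, the RNNC query returns
$$\mathrm{RNNC}(F) = \big[\, |\mathrm{RNN}(f_1)|, \ldots, |\mathrm{RNN}(f_k)| \,\big], \qquad \mathrm{RNN}(f_i) = \{u \in I : f_i \text{ is the NN of } u\},$$
where | · | means the cardinality of a set. In other words, it intends to find the cardinality of the RNN for each facility f i .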
For the RNNC query, the query result Q = RNNC(F) is a k-vector (ordered set) of RNN cardinalities.

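Definition 3. (AVGD Query) Given a set of locations of facilities $F = \{f_1, \ldots, f_k\}$, the AVGD query returns
$$\mathrm{AVGD}(F) = \frac{1}{|I|} \sum_{u_i \in I} d(u_i, f_j),$$
where f j is the NN of u i , and d(u i , f j ) means the distance between u i and f j . In other words, it intends to find the average distance between all users and their nearest facilities.

Definition 4. (MAXD Query) Given a set of locations of facilities $F = \{f_1, \ldots, f_k\}$, the MAXD query returns
$$\mathrm{MAXD}(F) = \max_{u_i \in I} d(u_i, f_j),$$
where f j is the NN of u i . In other words, it intends to find the maximum distance between a user u i ∈ I and its nearest facility f j ∈ F.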
Note that we consider Manhattan distance (L 1 distance) instead of Euclidean distance (L 2 distance), because L 1 distance is more accurate for representing the driving distance in a city road network [28].
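As a plaintext reference for the three query definitions, the following minimal sketch computes RNNC, AVGD and MAXD directly on unencrypted locations with the L 1 distance. It ignores all privacy requirements and only shows what the private protocols are expected to output. The function names and the sample coordinates are hypothetical, and ties between equally near facilities are broken by the lowest facility index, which is an assumption not stated in the definitions.

```python
def l1(a, b):
    """Manhattan (L1) distance between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest(user, facilities):
    """Index of the facility nearest to `user` (lowest index wins ties)."""
    return min(range(len(facilities)), key=lambda j: l1(user, facilities[j]))


def rnnc(users, facilities):
    """Number of users whose nearest facility is f_j, for each j."""
    counts = [0] * len(facilities)
    for u in users:
        counts[nearest(u, facilities)] += 1
    return counts


def avgd(users, facilities):
    """Average distance between users and their nearest facilities."""
    return sum(l1(u, facilities[nearest(u, facilities)]) for u in users) / len(users)


def maxd(users, facilities):
    """Maximum distance between a user and its nearest facility."""
    return max(l1(u, facilities[nearest(u, facilities)]) for u in users)


# Hypothetical users in the intersection I and two facility locations.
users = [(0, 0), (1, 1), (5, 5), (9, 8)]
facilities = [(0, 1), (6, 5)]
print(rnnc(users, facilities), avgd(users, facilities), maxd(users, facilities))
```

A business comparing several candidate locations would evaluate one of these functions once per candidate, each time with the candidate appended to the list of existing facilities.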

Design Objectives
To meet the anticipated requirements under the aforementioned system model and threat model, the proposed protocols aim to simultaneously achieve the following objectives:
• Correctness. The client should correctly obtain the desired query result if the two parties execute the protocol honestly.
• Privacy. The private information U s , U c , L and Q should be hidden throughout the whole protocol. Although we use I as a substitute for U c , the client does not gain any knowledge of U s ; in other words, the client does not learn whether an item in U c also belongs to U s , and similarly, the server gains nothing about U c .
• Efficiency. Our protocol should be efficient for both S and C. That is, both the computation cost and the communication cost should be low enough to support aggregate queries with large-scale location datasets and user lists.

PSI Filter
Definition 5. (PSI filter) For a two-party query protocol π and two datasets D 1 and D 2 held by S and C, respectively, let F = (Θ 1 , Θ 2 ) be a tuple of two formally analogous sets that are held by S and C, respectively. F is said to be a PSI filter if it does not contain any sensitive information and the protocol π can use F to filter out irrelevant data corresponding to items outside the intersection D 1 ∩ D 2 with no extra computation.
Our newly constructed PSI filter is a pair of special sets of labels designed to filter data, that is, to help C find and retain the relevant data corresponding to the items (users) in the intersection and discard other irrelevant data without revealing the private identities of users to either C or S. As described in Section 2.2, the process of generating the PSI filter is, in fact, a modified one-way PSI that is similar to a labeled PSI. Concretely, C and S hold private sets U c and U s , respectively. An originally non-existent label label i that looks random will be generated for each item v i ∈ U s in the intermediate process, and C will learn {label i : v i ∈ U c }, the set of labels corresponding to the items in the intersection, as a result. It should be noted that in the PSI filter generating process, C does not know the specific correspondence of any pair (v i , label i ), and S does not know which labels C will retain, so the identities are hidden from both C and S.
We give the basic idea of the PSI filter generation protocol design in Figure 2. C and S hold their private keys k c and k s , respectively. For each item v i ∈ U s , S computes $H(v_i)^{k_s}$ as the distinguishable labels forming the set Θ 1 and specifying all the correspondences between items and labels. To determine if u i = v j , C and S jointly compute the double-encryption results $H(u_i)^{k_c k_s}$ and $H(v_j)^{k_c k_s}$, which is similar to the Diffie-Hellman key exchange process. If u i = v j , then $H(u_i)^{k_c k_s} = H(v_j)^{k_c k_s}$, and the converse holds with overwhelming probability as long as collision resistance holds. After the intersection operation on the sets $\{H(u_i)^{k_c k_s}\}_{i \in [n_c]}$ and $\{H(v_j)^{k_c k_s}\}_{j \in [n_s]}$, which preserves the privacy of the identities u i and v j under the DDH assumption, C obtains the subset Θ 2 of the distinguishable labels, in which the labels correspond one by one to the items in U c . The details of the PSI filter generation protocol PFGen are given in Figure 3.

[Figure 3. The PSI filter generation protocol PFGen. Common input: a security parameter λ, a group G in which the DDH assumption holds, a user identifier space U, and a cryptographic hash function H : U → G. Legible steps: (1) for each u i ∈ U c , C computes $H(u_i)^{k_c}$, that is, maps the user identity to the group G and then encrypts it; (2) the resulting set is shuffled … ; for j = 1 to n s , with an integer i increasing from 1, … .]

Thus far, the PSI filter F = (Θ 1 , Θ 2 ) has been generated, and Θ 1 and Θ 2 are held separately by S and C. Note that Θ 1 is required to be an ordered set, while Θ 2 is an ordinary set. The generation protocol PFGen will be used as a subprotocol in the setup phase of the aggregate query protocols. We formally state and prove the security of the PFGen protocol in Section 7; here we only note that the PFGen protocol is secure against semi-honest adversaries, so it can be used to construct privacy-preserving query protocols. Note that the difference from a traditional PSI protocol is that neither party knows whether a given item is in the intersection I. All they obtain are labels that look random.
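The double-exponentiation idea behind the PSI filter can be sketched as follows. This is a minimal sketch under loud assumptions, not a transcription of the PFGen protocol in Figure 3: the message flow (in particular, that C receives the labels of Θ 1 during setup and raises them to k c ) is inferred from the prose above, both parties are simulated inside one function, and the group is the multiplicative group modulo the Mersenne prime 2^127 − 1, which is a toy choice that does not provide DDH security. All names (`pfgen`, `hash_to_group`, `random_key`) are hypothetical.

```python
import hashlib
import math
import random

P = 2**127 - 1  # toy prime modulus (assumption; not a DDH-secure group choice)


def hash_to_group(identity):
    """Map an identity string to a nonzero-with-overwhelming-probability element mod P."""
    h = int.from_bytes(hashlib.sha256(identity.encode()).digest(), "big") % P
    return pow(h, 2, P)


def random_key(rng):
    """Exponent coprime to P - 1, so that x -> x^k is a bijection mod P."""
    while True:
        k = rng.randrange(2, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def pfgen(server_ids, client_ids, rng=None):
    """Simulate both parties and return (theta1, theta2)."""
    rng = rng or random.Random()
    k_s, k_c = random_key(rng), random_key(rng)
    # S: ordered labels H(v_j)^{k_s} for its own users.
    theta1 = [pow(hash_to_group(v), k_s, P) for v in server_ids]
    # C -> S: shuffled H(u_i)^{k_c}.
    a = [pow(hash_to_group(u), k_c, P) for u in client_ids]
    rng.shuffle(a)
    # S -> C: shuffled H(u_i)^{k_c k_s}; C cannot link these back to u_i.
    b = [pow(x, k_s, P) for x in a]
    rng.shuffle(b)
    double_client = set(b)
    # C: keep each label whose double encryption H(v_j)^{k_s k_c} appears in b.
    theta2 = {lab for lab in theta1 if pow(lab, k_c, P) in double_client}
    return theta1, theta2


theta1, theta2 = pfgen(["alice", "bob", "carol", "dave"], ["bob", "dave", "erin"])
```

In this sketch Θ 2 ends up containing exactly the labels of the users that appear in both lists, while the labels themselves are group elements that reveal neither the underlying identity nor, to C, which of its own users they belong to.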

Aggregate Query Protocols
In this section, we use the PFGen protocol proposed in Section 4 to build privacy-preserving query protocols for different types of aggregate queries. In these protocols, the client C and the server S execute the PFGen protocol to generate a PSI filter in the setup phase. In the query phase, C finds the encrypted data related to the users in the intersection among all the candidate structured data received from S by comparing the labels in Θ 1 with those in Θ 2 .

RNNC Query
Setup:
Step 1: S and C execute the PFGen protocol with security parameter λ. The output is the PSI filter F = (Θ 1 , Θ 2 ), where Θ 1 and Θ 2 are held by S and C, respectively.
Step 2: S takes the same security parameter λ to generate the key pair (sk, pk) of the Paillier cryptosystem, then sends pk to C.
Query:
Step 1: C sends the query request F = { f 1 , . . . , f k } with the query type "RNNC" to S.
Step 2: After receiving the facility locations, S computes d( f i , l j ) for i ∈ [k] and j ∈ [n s ], i.e., the distance between f i ∈ F and v j ∈ U s . S then determines each user's nearest neighbor.
Step 3: S initializes an n s × k zero matrix M, then sets m ij = 1 for all i, j such that f j is the nearest neighbor of the ith user. S computes M′, the ciphertext of M under the public key pk, i.e., computes each m′ ij = Enc(pk, m ij ) (a prime marks a ciphertext).
Step 4: S augments the ciphertext matrix M′ with Θ 1 and gets the n s × (k + 1) matrix N′ = [Θ 1 | M′] (treating the ordered set Θ 1 as a column vector, the same below), then sends N′ to C.
Step 5: For each row of N′, C retains the row if its first element is also in Θ 2 ; otherwise, C discards the row. Then, C discards the first column to obtain an n I × k matrix Y′.
Step 6: C multiplies the elements in each column of Y′ and gets the vector s′ = [s′ 1 , . . . , s′ k ]. C picks a random vector r = [r 1 , . . . , r k ] where r i ∈ Z n , computes r′ = [r′ 1 , . . . , r′ k ] where r′ i = Enc(pk, r i ), then computes t′ = [t′ 1 , . . . , t′ k ] where t′ i = s′ i · r′ i . C sends t′ to S.
Step 7: S decrypts t′ by computing t i = Dec(sk, t′ i ), then returns t = [t 1 , . . . , t k ] to C.
Step 8: C computes s = t − r (mod n) and lets q i = s i to obtain Q = [q 1 , . . . , q k ] as the query result.
The key parts are as follows: (1) C and S each hide their user set from the other by using the PSI filter, which also reduces the computation and communication costs; (2) S hides the users' location information by using additively homomorphic encryption; (3) C picks a random vector r to mask s (on the ciphertext) in order to hide the query result from S. Clearly, Q = [q 1 , . . . , q k ] is the desired query result, where q i is equal to the RNN cardinality of f i . We give the other two aggregate query protocols with similar ideas, and a detailed explanation of correctness and privacy is given in Section 7.
In practice, it is not necessary to generate a key pair of the Paillier cryptosystem every time a client comes in. The server can set up several key pairs {sk i , pk i } corresponding to different security levels (parameters λ i ) and make all {λ i , pk i } public. When a new client comes in and needs to set up with the server, it can pick an appropriate public key and just tell the server which one it chose. The same goes for the next two query protocols in Sections 5.2 and 5.3.
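The query phase of the RNNC protocol can be sketched end to end as follows. This is a minimal single-process simulation under stated assumptions: the Paillier helper uses toy primes and g = n + 1, the labels in Θ 1 and Θ 2 are stand-in integers rather than the output of PFGen, and all function names and sample data are hypothetical. It is meant to show the data flow of Steps 2 to 8, not to be a secure or efficient implementation.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys with g = n + 1 (insecure size, for illustration)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))


def enc(n, m):
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def dec(n, sk, c):
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def server_rows(n, theta1, locations, facilities):
    """Steps 2-4: one row per user, (label, encrypted one-hot NN indicator)."""
    k = len(facilities)
    rows = []
    for label, loc in zip(theta1, locations):
        nn = min(range(k), key=lambda j: l1(loc, facilities[j]))
        rows.append((label, [enc(n, int(j == nn)) for j in range(k)]))
    return rows


def client_mask(n, rows, theta2, k):
    """Steps 5-6: filter rows by label, multiply columns, add random masks."""
    n2 = n * n
    s = [1] * k  # 1 is a trivial ciphertext of 0
    for label, cts in rows:
        if label in theta2:
            s = [a * c % n2 for a, c in zip(s, cts)]
    r = [random.randrange(n) for _ in range(k)]
    t = [a * enc(n, ri) % n2 for a, ri in zip(s, r)]
    return t, r


def rnnc_query(theta1, theta2, locations, facilities):
    n, sk = keygen()
    rows = server_rows(n, theta1, locations, facilities)
    t_enc, r = client_mask(n, rows, theta2, len(facilities))
    t = [dec(n, sk, c) for c in t_enc]             # Step 7 (server)
    return [(ti - ri) % n for ti, ri in zip(t, r)]  # Step 8 (client), mod n


theta1 = [11, 22, 33, 44]   # stand-in labels, one per server user
theta2 = {11, 22, 44}       # labels C retained for the intersection
locations = [(0, 0), (1, 1), (5, 5), (9, 8)]
facilities = [(0, 1), (6, 5)]
Q = rnnc_query(theta1, theta2, locations, facilities)
```

Because the mask r i is uniform in Z n , the value t i that S decrypts is uniformly distributed and reveals nothing about the count, and the subtraction in Step 8 has to be taken modulo n to undo the wrap-around.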

Setup:
The setup phase is the same as in Section 5.1.
Query:
Step 1: C sends the query request $F = \{f_1, \dots, f_k\}$ with the type of query "AVGD" to S.
Step 2: After receiving the facility locations, S computes $d(f_i, l_j)$ for $i \in [k]$ and $j \in [n_s]$, i.e., the distance between each $f_i \in F$ and each $u_j \in U_s$. S determines each user's nearest neighbor.
Step 3: S initializes a column vector $D = [d_1, \dots, d_{n_s}]^T$ where $d_i$ is the distance between $u_i$ and its nearest neighbor, then computes $\mathrm{Enc}(D)$, the ciphertext of $D$ under the public key $pk$, i.e., computes each $\mathrm{Enc}(pk, d_i)$.
Step 4: S augments the ciphertext vector $\mathrm{Enc}(D)$ with $\Theta_1$, gets the $n_s \times 2$ matrix $N = [\Theta_1 \mid \mathrm{Enc}(D)]$, then sends $N$ to C.
Step 5: For each row of $N$, C retains the row if the first element of the row is also in $\Theta_2$. Otherwise, it discards the row. Then, C discards the first column to obtain an $n_I$-dimensional column vector $Y = [\mathrm{Enc}(y_1), \dots, \mathrm{Enc}(y_{n_I})]^T$ and saves $n_I$.
Step 6: C picks a random number $r \in \mathbb{Z}_n$ and computes $\mathrm{Enc}(r)$. C computes $\mathrm{Enc}(s) = \prod_{i=1}^{n_I} \mathrm{Enc}(y_i)$ and $\mathrm{Enc}(t) = \mathrm{Enc}(s) \cdot \mathrm{Enc}(r)$, then sends $\mathrm{Enc}(t)$ to S.
Step 7: S decrypts by computing $t = \mathrm{Dec}(sk, \mathrm{Enc}(t))$, then returns $t$ to C.
Step 8: C computes $s = t - r$ and $Q = s / n_I$ to obtain the query result $Q$.
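The AVGD steps can be sketched the same way. This is a minimal, insecure toy sketch under stated assumptions: the Paillier helper uses small Mersenne primes picked for illustration, distances are assumed to be pre-scaled to non-negative integers, labels are plain strings standing in for the PSI filter, and `avgd_query` is a hypothetical name.

```python
import random

# Toy Paillier with g = n + 1 (assumption: small Mersenne primes, NOT secure).
P, Q = 2**61 - 1, 2**89 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)  # pow(..., -1, N) needs Python 3.8+


def enc(m):
    return (1 + (m % N) * N) * pow(random.randrange(1, N), N, N2) % N2


def dec(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N


def avgd_query(nn_dist, theta1, theta2):
    """nn_dist[i] is the integer distance from server user i to its nearest
    facility; theta1 and theta2 are the server-side and client-side labels."""
    # Steps 3-4 (S): encrypt each distance and tag it with the user's label.
    n_mat = [(lab, enc(d)) for lab, d in zip(theta1, nn_dist)]
    # Step 5 (C): keep rows whose label is in theta2 and remember n_I.
    keep = set(theta2)
    y = [ct for lab, ct in n_mat if lab in keep]
    n_i = len(y)
    # Step 6 (C): homomorphic sum of the retained rows plus a random mask r.
    r = random.randrange(N)
    t_ct = enc(r)
    for ct in y:
        t_ct = t_ct * ct % N2
    # Step 7 (S): decrypt the masked sum.
    t = dec(t_ct)
    # Step 8 (C): unmask and divide by n_I (n_I = 0 would raise an error).
    return ((t - r) % N) / n_i


avg = avgd_query([120, 300, 450, 90], ["a", "b", "c", "d"], ["b", "c", "z"])
print(avg)  # 375.0
```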

Setup:
The setup phase is the same as in Section 5.1.
Query:
Step 1: C sends the query request $F = \{f_1, \dots, f_k\}$ with the type of query "MAXD" to S.
Step 2: After receiving the facility locations, S computes $d(f_i, l_j)$ for $i \in [k]$ and $j \in [n_s]$, i.e., the distance between each $f_i \in F$ and each $u_j \in U_s$. S determines each user's nearest neighbor.
Step 5: From the top to the bottom of $N$, C finds the first row whose label $S_{\tau(j)}$ is in $\Theta_2$ (suppose it is the $j$th row). C holds $\mathrm{Enc}(d_{\tau(j)})$ and discards all other elements.
Step 6: C picks a random number $r \in \mathbb{Z}_n$, computes $\mathrm{Enc}(r)$, then computes $\mathrm{Enc}(t) = \mathrm{Enc}(d_{\tau(j)}) \cdot \mathrm{Enc}(r)$. C sends $\mathrm{Enc}(t)$ to S.
Step 7: S decrypts by computing $t = \mathrm{Dec}(sk, \mathrm{Enc}(t))$, then returns $t$ to C.
Step 8: C computes $Q = t - r$ (in fact $Q = d_{\tau(j)}$) as the query result.
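The selection logic of Step 5 can be illustrated on plaintext values. The sketch below is a simplification made for illustration only: it leaves out encryption and masking entirely, and it assumes, in line with the monotonicity condition $d_{\tau(i)} \ge d_{\tau(i+1)}$ used in the correctness discussion, that S orders the rows by descending distance before sending them. The name `maxd_select` and the sample values are hypothetical.

```python
def maxd_select(nn_dist, theta1, theta2):
    """Plaintext view of MAXD Step 5 (encryption and masking left out).

    nn_dist[i] is the distance from server user i to its nearest facility,
    theta1[i] is that user's label, and theta2 holds the client's labels.
    """
    # Assumed server side: order the rows by descending distance.
    rows = sorted(zip(theta1, nn_dist), key=lambda row: row[1], reverse=True)
    keep = set(theta2)
    # Client side: the first retained row carries the largest distance in I.
    for label, dist in rows:
        if label in keep:
            return dist
    return None  # empty intersection


result = maxd_select([120, 300, 450, 90], ["a", "b", "c", "d"], ["b", "d", "z"])
print(result)  # 300
```

Because the rows arrive sorted, C needs only one ciphertext, which is why the client cost of MAXD is a single multiplication and a single encryption.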

Application Examples
To make it clearer how our privacy-preserving aggregate query protocols apply to real scenarios, we give application examples for each of the three types of queries as follows. Choosing a site for a hospital may need to consider the worst case: the hospital wants even the farthest potential patient to be able to arrive in a short time, i.e., the goal is to minimize the maximum distance between any potential patient and their nearest hospital. The hospital can therefore carry out the MAXD query by interacting with the LSP and then be established at the best of several candidate locations.

Ciphertext Matrix Compressing
Although the three query protocols in Section 5 share similar design ideas, their communication costs are not of the same magnitude. Concretely, the communication costs of the AVGD and MAXD queries are both roughly $O(n_s)$, but that of the RNNC query is roughly $O(n_s \cdot k)$. The essential cause is that the matrix $M$ initialized in Step 3 of the RNNC query must be large enough to hold the relationship between all $n_s$ users and all $k$ facilities, whereas each row of the $n_s$-dimensional vector $D$ in Step 3 of the AVGD/MAXD query corresponds to only one (the nearest) facility. When the facility set $F$ is large, the linear growth of the communication cost in the number of facilities becomes a challenge. To address this issue, we compress the ciphertext for this particular scenario, in which the plaintext space actually used is much smaller than $\mathbb{Z}_n$. More precisely, under the same security parameter, we make a ciphertext of the same size carry more plaintext information, which reduces the dimension of the ciphertext matrix.
In homomorphic cryptosystems satisfying CPA security, the ciphertext is larger than the plaintext because there must be enough randomization to make encryption non-deterministic. In the Paillier cryptosystem, the encryption function is essentially a map $f : \mathbb{Z}_n \times \mathbb{Z}_n^* \to \mathbb{Z}_{n^2}^*$, where $\mathbb{Z}_n$ is the plaintext space, $\mathbb{Z}_{n^2}^*$ is the ciphertext space, and the selected random number $r \in \mathbb{Z}_n^*$ ensures the non-deterministic property. Clearly, the size of the ciphertext is 2× the size of the plaintext. It seems unrealistic to directly compress the ciphertext space while preserving security: (1) it is impossible to compress $\mathbb{Z}_{n^2}^*$ while keeping $\mathbb{Z}_n$ and $\mathbb{Z}_n^*$ fixed because $f$ is bijective; (2) ensuring a constant security level means that $\mathbb{Z}_n^*$ cannot be compressed; (3) it is hard to compress both $\mathbb{Z}_n$ and $\mathbb{Z}_{n^2}^*$ simultaneously because doing so would easily break the bijection structure that is the underlying theoretical basis of the Paillier cryptosystem. The generalized Paillier cryptosystem [29] can reduce the ciphertext expansion factor to $(s+1)/s$, where $s$ is proportional to the block size of the ciphertext, but under the same security parameter $\lambda$, picking a large $s$ means expanding both the plaintext space and the ciphertext space. In short, in a generalized Paillier cryptosystem, a smaller ciphertext expansion factor leads to a larger ciphertext block size, which is not suitable for our scenario. A method similar to ours [30] is an efficient algorithm for fast batch summation using homomorphism on generalized Paillier cryptosystems, but we do not want to use a big ciphertext block because the number of facilities $k$ may be too small to fill it, which would waste space and bring unnecessary computational overhead.
We observe that even encrypting a small number needs a complete block in a Paillier cryptosystem. For a small number $m \ll n$, the plaintext space $\mathbb{Z}_n$ is far from being efficiently used, i.e., about $\log_2 (n/m)$ bits are wasted. In the RNNC query protocol, the elements of the ciphertext matrix are either $\mathrm{Enc}(0)$ or $\mathrm{Enc}(1)$; even in Steps 6 and 8, for each ciphertext $\mathrm{Enc}(s_i)$, the plaintext $s_i$ is very small compared with $n$.
Our solution is to put multiple data items (small numbers) in one plaintext while keeping the ciphertexts computable column by column (in the vertical direction) using the homomorphic property. Suppose the plaintext space is $\mathbb{Z}_n$ and each single data item in the plaintext throughout the process (including the intermediate results obtained after decryption) is no more than $\alpha$ bits, i.e., no plaintext overflow occurs after the homomorphic sum operations in the intermediate process. The process of ciphertext matrix compressing is given in Figure 4. Suppose we have already generated the parameters of the Paillier cryptosystem for the security parameter $\lambda$. First, we choose the parameter $\alpha$ such that no plaintext overflow occurs throughout the whole query protocol and let $\eta = \lfloor \log_2 n / \alpha \rfloor$, where $\eta$ means that any block $i$ can contain up to $\eta$ data items $(m_{i1}, \dots, m_{i\eta})$.
Then, we perform the following process.
(1) For $1 \le i \le \mu$, put $\eta$ data items in one plaintext block: for $1 \le j \le \eta$, pad each $m_{ij}$ (equal to 1 or 0) to $\alpha$ bits with "0", then attach them end to end to construct a plaintext block.
Thus far, we have encrypted $\mu \cdot \eta$ data items using just $\mu$ plaintext blocks. Meanwhile, we obtain $s_1, \dots, s_\eta$ by running $(\mu - 1)$ rather than $(\mu - 1) \cdot \eta$ multiplications on the ciphertext. To make this clearer, we give a toy example here and explain why the homomorphic property can still be used. After the padding and attaching procedures, we get three blocks as follows.
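As a minimal sketch of such a toy example, assume $\mu = 3$ rows, $\eta = 4$ items per block and $\alpha = 4$ bits per item. These values, the 0/1 matrix and the helper names `pack` and `unpack` are chosen here for illustration and are not taken from the original example. Adding the packed integers stands in for multiplying their Paillier ciphertexts, since that multiplication adds the underlying plaintexts.

```python
ALPHA = 4  # bits reserved per data item (illustrative value)


def pack(row, alpha=ALPHA):
    """Concatenate small non-negative ints into one integer, alpha bits each."""
    block = 0
    for value in row:
        block = (block << alpha) | value
    return block


def unpack(block, eta, alpha=ALPHA):
    """Split a block back into eta items of alpha bits each."""
    mask = (1 << alpha) - 1
    return [(block >> (alpha * (eta - 1 - j))) & mask for j in range(eta)]


matrix = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
blocks = [pack(row) for row in matrix]
print([format(b, "016b") for b in blocks])
# ['0001000000000001', '0000000100000001', '0001000000000001']

# Plaintext addition of the blocks yields all column sums at once,
# as long as no column sum exceeds 2**ALPHA - 1.
total = sum(blocks)
print(unpack(total, eta=4))  # [2, 1, 0, 3]
```

Each column sum stays inside its own $\alpha$-bit slot, so no carry spills into the neighboring slot; this is the no-overflow condition that governs the choice of $\alpha$.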
For our column-wise summation problem on a small-value $\mu \times \eta$ matrix, the proposed ciphertext matrix compressing combines (pads with 0, then concatenates) multiple plaintexts in a row before encrypting. The number of ciphertext blocks is reduced from $\mu \times \eta$ to $\mu$, and the number of encryption/decryption operations is reduced from $\mu \times \eta$ to $\mu$. At the same time, the homomorphic property remains valid. Therefore, we can reform the RNNC query protocol by utilizing the ciphertext matrix compressing method. The modified steps of the query process are as follows.

An Improved Protocol RNNC-CMC:
Step 3: S initializes an $n_s \times k$ zero matrix $M$, then sets $m_{ij} = 1$ for all $i, j$ such that $f_j$ is $u_i$'s nearest neighbor. S compresses the matrix $M$ into an $n_s \times \lceil k/\eta \rceil$ matrix $\tilde{M}$. (When $k$ is not an integer multiple of $\eta$, the extra space is simply padded with "0" in Step 3.) S computes $\mathrm{Enc}(\tilde{M})$, the ciphertext of $\tilde{M}$ under the public key $pk$, i.e., computes each entry $\mathrm{Enc}(pk, \tilde{m}_{ij})$.
Step 5: For each row of $N$, C retains the row if the first element of the row is also in $\Theta_2$. Otherwise, it discards the row. Then, C discards the first column to obtain an $n_I \times \lceil k/\eta \rceil$ matrix $\tilde{Y}$.
The changes in the reformed protocol RNNC-CMC are intuitive: (1) the matrix $M$ is compressed from $n_s \times k$ to $n_s \times \lceil k/\eta \rceil$, which reduces the communication cost; (2) one homomorphic operation (multiplication) on a ciphertext column is equivalent to $\eta$ operations in the original RNNC query protocol. The details of the comparison are given in Section 8.
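A compact sketch of the RNNC-CMC query phase follows. It is an insecure toy under stated assumptions, not the authors' implementation: Paillier is instantiated with small Mersenne primes, $\eta$ is derived conservatively so that a packed block always stays below $n$, the client is assumed to unpack the result after removing the mask, and all function names are hypothetical.

```python
import random

# Toy Paillier with g = n + 1 (assumption: small Mersenne primes, NOT secure).
P, Q = 2**61 - 1, 2**89 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)       # pow(..., -1, N) needs Python 3.8+
ALPHA = 16                            # bits per slot
ETA = (N.bit_length() - 1) // ALPHA   # slots per block, kept safely below n


def enc(m):
    return (1 + (m % N) * N) * pow(random.randrange(1, N), N, N2) % N2


def dec(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N


def pack_row(nn, k):
    """One-hot row of length k packed into ceil(k / ETA) integers."""
    blocks = []
    for start in range(0, k, ETA):
        block = 0
        for j in range(start, start + ETA):   # positions >= k are zero padding
            block = (block << ALPHA) | (1 if j == nn else 0)
        blocks.append(block)
    return blocks


def unpack_block(block):
    mask = (1 << ALPHA) - 1
    return [(block >> (ALPHA * (ETA - 1 - j))) & mask for j in range(ETA)]


def rnnc_cmc_query(nearest, theta1, theta2, k):
    # Step 3-4 (S): pack, encrypt and label each row.
    n_mat = [(lab, [enc(b) for b in pack_row(nn, k)])
             for lab, nn in zip(theta1, nearest)]
    # Step 5 (C): filter rows by label.
    keep = set(theta2)
    y = [row for lab, row in n_mat if lab in keep]
    # Step 6 (C): one masked column product per packed column.
    width = -(-k // ETA)                      # ceil(k / ETA)
    r = [random.randrange(N) for _ in range(width)]
    t_ct = []
    for c in range(width):
        acc = enc(r[c])
        for row in y:
            acc = acc * row[c] % N2
        t_ct.append(acc)
    # Step 7 (S): decrypt; Step 8 (C): unmask, then split into slots.
    t = [dec(ct) for ct in t_ct]
    counts = []
    for tc, rc in zip(t, r):
        counts.extend(unpack_block((tc - rc) % N))
    return counts[:k]


counts = rnnc_cmc_query([0, 10, 10, 3, 11], list("abcde"), ["b", "c", "d"], k=12)
print(counts)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0]
```

With these toy parameters `ETA` is 9, so the 12 facilities fit into 2 ciphertexts per row instead of 12.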

Security Analysis
In this section, we first derive the correctness of our aggregate protocols. Then, we consider the privacy of the PSI filter generation protocol PFGen, give an optional handling for stricter privacy protection as a supplement and finally discuss the privacy of our aggregate query protocols and two types of external attackers.

Correctness
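We briefly derive the correctness of the three types of queries. For a component $q_i \in Q$ in the RNNC query, the following equation holds:
\[ q_i = t_i - r_i = \mathrm{Dec}\big(sk, \mathrm{Enc}(s_i) \cdot \mathrm{Enc}(r_i)\big) - r_i = (s_i + r_i) - r_i = s_i = \sum_{j=1}^{n_I} y_{ji}, \]
where $y_{ji} \in \{0, 1\}$ is the plaintext of the entry in row $j$ and column $i$ of $Y$, so $q_i$ counts the users in $I$ whose nearest neighbor is $f_i$. For the query result $Q$ in the AVGD query,
\[ Q = \frac{s}{n_I} = \frac{t - r}{n_I} = \frac{\mathrm{Dec}\big(sk, \prod_{i=1}^{n_I} \mathrm{Enc}(y_i) \cdot \mathrm{Enc}(r)\big) - r}{n_I} = \frac{1}{n_I} \sum_{i=1}^{n_I} y_i. \]
For the query result $Q$ in the MAXD query, due to the monotonicity of $\tilde{D}$, i.e., $d_{\tau(i)} \ge d_{\tau(i+1)}$, and Step 5, we only need to demonstrate that $t - r = d_{\tau(j)}$ holds as follows:
\[ t - r = \mathrm{Dec}\big(sk, \mathrm{Enc}(d_{\tau(j)}) \cdot \mathrm{Enc}(r)\big) - r = (d_{\tau(j)} + r) - r = d_{\tau(j)}. \]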

Privacy
As for the privacy of our aggregate query protocols, we first show that our PSI filter generation protocol PFGen proposed in Section 4 does not reveal any information about the private data, and we give an optional handling for stricter privacy protection (treating $n_s$ and $n_c$ as private). Based on this, we show that the proposed aggregate query protocols are privacy-preserving.
The following lemma illustrates the security of the PFGen protocol in the semi-honest model, and we give the proof in the simulation paradigm. We can say that for all security parameters $\lambda$ and inputs and $\{H(u_i)^{k_c k_s}\}_{i \in [n_c]}$.
From Lemma 1, we know that throughout the PFGen protocol, C learns nothing more than $n_s$ and $n_I$, and S learns nothing more than $n_c$ and $n_I$. These values are, in fact, not very sensitive for S and C. When S and C want to further enhance privacy, i.e., to hide the cardinalities $n_s$ and $n_c$ from each other, S/C can simply add some dummy users to $U_s$/$U_c$. This conceals $n_s$ and $n_c$ to some extent; when a new dummy user is added, after being hashed to the group $G$, it may collide with a real user with probability $1/|G|$. It can be said that $n_I$ is not a privacy issue once $n_s$ and $n_c$ have been concealed.
In the query process, the only messages C receives are $N$ and the decrypted value $t$ (a vector in RNNC, a scalar in AVGD and MAXD), where $t$ is not a private message to C; the only message S receives is the ciphertext $\mathrm{Enc}(t)$. In the RNNC query protocol, $N = [\Theta_1 \mid \mathrm{Enc}(M)]$ is a ciphertext matrix augmented with a vector of randomized labels, from which C cannot learn any plaintext information without the private key $sk$, and $\mathrm{Enc}(t)$ is a ciphertext vector in which each component is $\mathrm{Enc}(s_i) \cdot \mathrm{Enc}(r_i) = \mathrm{Enc}(pk, q_i + r_i)$. Even if a curious S decrypts $\mathrm{Enc}(pk, q_i + r_i)$ by using $sk$, the masked query result $q_i + r_i$ does not reveal any information about $q_i$. In the AVGD query protocol, $N = [\Theta_1 \mid \mathrm{Enc}(D)]$ is also a ciphertext from which C cannot learn anything, and $\mathrm{Enc}(t) = \mathrm{Enc}(s) \cdot \mathrm{Enc}(r) = \mathrm{Enc}(Q \cdot n_I + r)$ is a masked version of $\mathrm{Enc}(Q \cdot n_I)$, ensuring the privacy of the query result $Q$. In the MAXD query, similarly, $N$ reveals nothing to C, and C computes $\mathrm{Enc}(t) = \mathrm{Enc}(d_{\tau(j)}) \cdot \mathrm{Enc}(r)$ to keep the query result $Q = d_{\tau(j)}$ private from S, since $\mathrm{Dec}(sk, \mathrm{Enc}(t)) = d_{\tau(j)} + r$. Moreover, our ciphertext matrix compressing method does not sacrifice or reduce the security level. Because of the indistinguishability of ciphertexts in the Paillier cryptosystem, for the same security parameter $\lambda$, the RNNC-CMC query protocol is exactly as secure as the original RNNC query protocol in Section 5.
In addition, traditional protocols with ideas similar to Diffie-Hellman key exchange may suffer from two attacks: the statistical attack and the man-in-the-middle attack. We discuss these two types of attacks as follows.
Statistical attack. In the PFGen protocol, each user obtains a fixed label, and the label corresponding to each user is always the same. Intuitively, it seems that a statistical attack could obtain the frequency of the users across queries. In fact, once the client and the server have jointly executed the PFGen protocol, the generated PSI filter (two sets of fixed labels) has also fixed the user sets with unknown correspondences. This means that a client with a modified user set $U_c$ cannot successfully execute the query protocols: even if the client keeps the set $\mathrm{Shuf}(\{label_i\}_{i \in [n_s]})$, the set $\Theta_2$ cannot be modified appropriately based only on $U_c$ without the correspondences.
Man-in-the-middle attack. A malicious party may intercept the traffic between S and C, sending and receiving data by using another key $k_e$. There is a simple way to deal with this attack: using signatures to ensure the authenticity of the source. Specifically, the sender (whether S or C) signs all messages as a whole during each interaction. Even without signatures, it is difficult for the attacker to obtain useful information. When the attacker has jointly performed the PFGen protocol with S, the computationally indistinguishable labels are meaningless to the attacker, who cannot obtain any private information from S during forged queries, according to Lemma 1. When the attacker, pretending to be a server, has jointly performed the PFGen protocol with C, the difference is that C may obtain a wrong result (a nonoptimal location) from the messages received from the attacker during some queries. It can be said that the attacker spends considerable effort without obtaining the information they want. Again, to prevent a man-in-the-middle attacker from returning a wrong query result, signatures are sufficient.

Theoretical Analysis
We first analyze the computational complexity for each party in the four proposed protocols and compare them with those of the state-of-the-art schemes in [3]: the server-based protocols RNNC/S, AVGD/S and MAXD/S, and the client-based protocols RNNC/C, AVGD/C and MAXD/C, as shown in Table 1. We focus on the relatively time-consuming operations, i.e., encryption, decryption, multiplication on ciphertexts and exponentiation on ciphertexts, and on one cheap but frequently repeated operation, i.e., distance calculation. Note that we ignore the constant number of permutation operations in Table 1. For clarity, the setup phase and the query phase are considered separately.
Setup phase. In the setup phase of all four of our query protocols, S and C jointly run the PFGen protocol, where both S and C need $(n_s + n_c)$ exponentiations modulo $n$ (with $n \approx 2^\lambda$), which is much faster than exponentiation modulo $n^2$. In addition, S needs two permutations and C needs one permutation, both of which can be ignored. This is a small cost compared to the client-based protocols RNNC/C, AVGD/C and MAXD/C, where the client C needs to execute $\tilde{n}$ encryptions. Although the setup phase is very simple in the server-based protocols RNNC/S, AVGD/S and MAXD/S, it brings a great amount of computation cost to C in the query phase.
Our schemes:

| Protocol | Setup | Query (server S) | Query (client C) |
|---|---|---|---|
| RNNC | 1 PFGen | $n_s \cdot k$ dist, $n_s \cdot k$ enc, $k$ dec | $n_I \cdot k$ mult, $k$ enc |
| AVGD | 1 PFGen | $n_s \cdot k$ dist, $n_s$ enc, 1 dec | 1 enc, $n_I$ mult, 1 div |
| MAXD | 1 PFGen | $n_s \cdot k$ dist, $n_s$ enc, 1 dec | 1 enc, 1 mult |
| RNNC-CMC | 1 PFGen | $n_s \cdot k$ dist, $n_s \cdot \lceil k/\eta \rceil$ enc, $\lceil k/\eta \rceil$ dec | $n_I \cdot \lceil k/\eta \rceil$ mult, $\lceil k/\eta \rceil$ enc |

Notes:
1. $\tilde{n}$ is the size of the superset. Typically, in order to achieve enough privacy for both $U_s$ and $U_c$, $\tilde{n}$ should be far greater than $n_s$, or at least $\tilde{n}/n_s$ should be not less than a specified threshold.
2. $w$ is a large number satisfying $w > \max$, where $\max$ is the maximum distance between a user and their nearest facility.
3. When S and C jointly run PFGen, both S and C need $(n_c + n_s)$ exponentiations modulo $n$, while "exp" represents the exponentiation operation modulo $n^2$.
4. The parameter $\eta = \lambda / \alpha$ (with $\lambda = \log_2 n$) is the amount of data contained in one ciphertext block, where $\lambda$ is the security parameter and $\alpha$ is the number of bits reserved so that the intermediate process does not overflow.
Query phase. In the query phase of our RNNC, AVGD and MAXD query protocols, the computation of S is similar: $n_s \cdot k$ distance calculations, $n_s$ encryptions and 1 decryption for both the AVGD query and the MAXD query and, due to the dimension of the matrix $M$, $n_s \cdot k$ distance calculations, $n_s \cdot k$ encryptions and $k$ decryptions for the RNNC query. The main operations of C take place in Step 6: $n_I \cdot k$ multiplications on ciphertexts and $k$ encryptions for RNNC, $n_I$ multiplications on ciphertexts and one encryption for AVGD, and one multiplication on ciphertexts and one encryption for MAXD. In addition, for AVGD, C needs to execute one division in Step 8. For our RNNC-CMC protocol, the costs after the distance calculations are further reduced from $k$ to $\lceil k/\eta \rceil$ units for both S and C, i.e., to $1/k$ of our original RNNC protocol whenever $k \le \eta$.
From the analysis results, since we avoid the use of a superset with the very large cardinality $\tilde{n}$ (there are no "$\tilde{n}$ enc" operations in our schemes), we completely avoid a large number of homomorphic encryption operations on both the server and client sides, and the corresponding computational complexity is reduced. Further, we avoid a big number $w > \max$ in the MAXD query protocol, which removes a large number of exponentiation and encryption operations. In addition, we reduce the parameter $k$ to $\lceil k/\eta \rceil$ in the improved protocol RNNC-CMC.
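The query-phase counts described above can be tabulated for concrete parameters. The function below is only an illustrative sketch: the name `query_op_counts` is hypothetical, and $k/\eta$ is rounded up, which is an assumption about how partially filled blocks are counted.

```python
from math import ceil


def query_op_counts(n_s, n_i, k, eta):
    """Query-phase operation counts for the four proposed protocols."""
    blocks = ceil(k / eta)  # ciphertext blocks per row after compression
    return {
        "RNNC": {
            "server": {"dist": n_s * k, "enc": n_s * k, "dec": k},
            "client": {"mult": n_i * k, "enc": k},
        },
        "AVGD": {
            "server": {"dist": n_s * k, "enc": n_s, "dec": 1},
            "client": {"mult": n_i, "enc": 1, "div": 1},
        },
        "MAXD": {
            "server": {"dist": n_s * k, "enc": n_s, "dec": 1},
            "client": {"mult": 1, "enc": 1},
        },
        "RNNC-CMC": {
            "server": {"dist": n_s * k, "enc": n_s * blocks, "dec": blocks},
            "client": {"mult": n_i * blocks, "enc": blocks},
        },
    }


counts = query_op_counts(n_s=100_000, n_i=20_000, k=50, eta=64)
print(counts["RNNC"]["server"]["enc"])      # 5000000
print(counts["RNNC-CMC"]["server"]["enc"])  # 100000
```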
For the communication cost, we give the final results directly as follows; they lie between those of the server-based protocols and the client-based protocols in [3].
(1) RNNC: $O(n_s \cdot k)$ × size of a cipher block;
(2) AVGD: $O(\frac{3}{2} n_s + n_I)$ × size of a cipher block;
(3) MAXD: $O(\frac{3}{2} n_s)$ × size of a cipher block;
(4) RNNC-CMC: $O(n_s \cdot (\lceil k/\eta \rceil + \frac{1}{2}))$ × size of a cipher block.
In summary, the proposed aggregate query protocols are ahead of the server-based protocols across the board and have both advantages and disadvantages compared with the client-based protocols.
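To get a feel for these magnitudes, the expressions can be evaluated for concrete parameters. The sketch below makes two simplifying assumptions that are not claims of the analysis itself: the asymptotic expressions are treated as exact block counts, and one cipher block is taken to be 256 bytes, i.e., an element modulo $n^2$ for a 1024-bit $n$. The name `comm_blocks` is hypothetical.

```python
from math import ceil

BLOCK_BYTES = 2 * 1024 // 8  # one ciphertext modulo n^2 for a 1024-bit n


def comm_blocks(n_s, n_i, k, eta):
    """Leading-order ciphertext-block counts from the stated costs."""
    return {
        "RNNC": n_s * k,
        "AVGD": 1.5 * n_s + n_i,
        "MAXD": 1.5 * n_s,
        "RNNC-CMC": n_s * (ceil(k / eta) + 0.5),
    }


blocks = comm_blocks(n_s=100_000, n_i=20_000, k=50, eta=64)
for name, count in blocks.items():
    print(f"{name}: {count * BLOCK_BYTES / 2**20:.1f} MiB")
```

For $k = 50$ and $\eta = 64$ the compressed RNNC-CMC matrix needs the same number of blocks as the MAXD vector, which is the point of the compression.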

Experimental Results
We implement all our protocols in Python 3.7 on a 64-bit Windows 10 system with a 3.20 GHz Intel Core i5 processor and 16 GB RAM. We choose the security parameter $\lambda = 1024$ and set $n_c = 20{,}000$ and $\tilde{n} = 1{,}000{,}000$, the same values as in [3], for a reasonable comparison. (We do not emphasize the use of the real-world dataset [31], because whether the location data are real or randomly selected has no effect on our experimental results, and without considering differential privacy, real-world datasets are meaningless for efficiency tests. In fact, the results of the experiments (the trends of the lines in Figures A1 and A2) on real and synthetic location datasets are exactly the same.) In addition, we suppose $n_I = n_c$ in our experiment. First, for the three types of queries (RNNC, AVGD and MAXD), we fix $n_s = 100{,}000$ and test the computation costs of our protocols and of the previous ones as $k$ changes. Then, we fix $k = 50$ and test the computation costs as $n_s$ changes. For our RNNC-CMC protocol, we set the threshold $\alpha = 16$, which is enough for the general RNNC query (each $q_i$ is no more than $2^{16} - 1 = 65{,}535$), and accordingly $\eta = 64$.
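The choice of $\alpha$ and $\eta$ can be checked with a few lines of arithmetic. This is a small sanity check rather than part of the experiments; it relies on the observation that each count $q_i$ can be at most $n_I$, and the variable names are chosen here for illustration.

```python
lam, alpha, k, n_i = 1024, 16, 50, 20_000

eta = lam // alpha            # data items per ciphertext block
slot_max = 2**alpha - 1       # largest value one slot can hold
blocks_per_row = -(-k // eta) # ceil(k / eta)

# Each RNN count is at most n_I, so it must fit in one alpha-bit slot.
assert n_i <= slot_max

print(eta, slot_max, blocks_per_row)  # 64 65535 1
```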
Since the server and the client may have different scales of computing power in a real environment, we test the cost of the server and the cost of the client separately and give their lines in figures. Our experimental results are given in Figures A1 and A2 in Appendix A.
There are two different types of exponentiation operations in our experiments. The first is the exponentiation modulo $n^2$ (2048 bits) executed on the client side of MAXD/S and the server side of AVGD/C and MAXD/C. The second is the exponentiation modulo $n$ (1024 bits) executed in the setup phase (jointly running the PFGen protocol) of our four protocols. Note that the time cost of a modular exponentiation is very unstable when the exponent changes, and computing $a^x$ costs the equivalent of about $\log_2 x$ multiplications with the fast exponentiation algorithm. Therefore, we use the mathematical expectation instead of the unstable value. Suppose the exponent is a $b$-bit number; the expected number of equivalent multiplications is then $E(t) \approx b - 1$. Accordingly, we use $(b - 1)$× the time of one multiplication in place of one exponentiation when drawing the figures, to clearly show the trend of change. Moreover, according to our test, the time cost of the 1024-bit modular exponentiation is a little less than a quarter of that of the 2048-bit one.
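One way to arrive at a value of about $b - 1$ is sketched below. Both modeling choices are assumptions made here for illustration and may differ from the authors' derivation: the exponent is drawn uniformly from $[0, 2^b)$, and an exponent of bit length $i$ is charged $i$ multiplications. The name `expected_mults` is hypothetical.

```python
from fractions import Fraction


def expected_mults(b):
    """Expected cost when exactly 2**(i-1) of the 2**b exponents have bit
    length i and each of them is charged i multiplications."""
    return sum(Fraction(i * 2**(i - 1), 2**b) for i in range(1, b + 1))


for b in (8, 16, 1024):
    e = expected_mults(b)
    # Closed form of the sum: b - 1 + 2**(-b), i.e. essentially b - 1.
    assert e == b - 1 + Fraction(1, 2**b)
    print(b, float(e))
```

Under these assumptions the expectation differs from $b - 1$ only by $2^{-b}$, which is negligible for 1024-bit and 2048-bit exponents.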
From the experimental results, we make the following observations.
(1) For the RNNC query, our protocols (RNNC and RNNC-CMC) are more efficient than the RNNC/S protocol on both the client side and the server side, and not as efficient as the RNNC/C protocol but acceptable. (Our protocols avoid a lot of encryption operations in the setup phase compared to the client-based protocols RNNC/C, AVGD/C and MAXD/C. The price of the client-based protocols being faster than ours in some cases (on the client side for RNNC and AVGD queries, and on the server side for RNNC and MAXD queries) is that a lot of calculations are transferred to the setup phase. In addition, regarding the time cost ratios in (1), (2) and (3), 1/80 and 1/25 are non-negligible factors, but 0.24 s for the client and hundreds of seconds for the server, both in our experimental environment, are actually very small values.) Concretely, taking $k = 50$ and $n_s = 100{,}000$, on the client side, our RNNC-CMC protocol (0.24 s) is 50× faster than the RNNC/S protocol and 1/3× as fast as the RNNC/C protocol; on the server side, our RNNC-CMC protocol (407.47 s) is 620× faster than the RNNC/S protocol and 1/80× as fast as the RNNC/C protocol.
(2) For the AVGD query, on the client side, the time cost of our AVGD protocol is between that of the AVGD/S protocol and that of the AVGD/C protocol. However, on the server side, our protocol is much more efficient than the previous two, and it is valuable that our query protocol reduces the rate of increase in the time cost of the server when $n_s$ increases (Figure A2, AVGD Query). Concretely, taking $k = 50$ and $n_s = 100{,}000$, on the client side, our AVGD protocol (0.24 s) is 2× faster than the AVGD/S protocol and 1/80× as fast as the AVGD/C protocol; on the server side, our AVGD protocol (508.46 s) is 20× faster than the AVGD/S protocol and 5× faster than the AVGD/C protocol. In addition, when $n_s$ increases to 500,000, our AVGD protocol (2542.30 s) is 4× faster than the AVGD/S protocol and 5× faster than the AVGD/C protocol on the server side.
(3) For the MAXD query, our protocol is friendly to a resource-limited client compared with the two previous protocols. It is more than one order of magnitude faster than the client-based protocol MAXD/C. Moreover, on the server side, the time cost of our MAXD protocol is between that of the MAXD/S protocol and that of the MAXD/C protocol. Concretely, taking $k = 50$ and $n_s = 100{,}000$, on the client side, our MAXD protocol (0.0051 s) is 26,000× faster than the MAXD/S protocol and 15× faster than the MAXD/C protocol; on the server side, our MAXD protocol (508.46 s) is 5000× faster than the MAXD/S protocol and 1/25× as fast as the MAXD/C protocol.

Conclusions
In this paper, we have studied aggregate queries with confidentiality for facility location selection between an LSP and a business. We proposed the construction of a PSI filter to help two parties find the relevant data corresponding to the items in the intersection without revealing the items themselves, including those in the intersection. Then, for location data analysis, we used our PSI filter to construct three types of aggregate query protocols that do not require a large superset for concealing the user identities and serving as an index. Moreover, we proposed a ciphertext matrix compressing method that makes one ciphertext block contain multiple small plaintext data items while keeping the homomorphic property valid, and we gave an improved protocol utilizing this method. It further raises the efficiency of both computation and communication on the basis of the original query process. Finally, we showed the correctness and privacy of the protocols and evaluated their performance through theoretical analysis and experiments, which demonstrate the advantages of our protocols.

Conflicts of Interest:
The authors declare no conflict of interest.

Notations
The following notations are used in this manuscript:

| Notation | Description |
|---|---|
| C and S | client and server |
| $k_c$, $k_s$ | C's and S's secret keys |
| $U_c = \{u_1, \dots, u_{n_c}\}$ | C's user set |
| $U_s = \{v_1, \dots, v_{n_s}\}$ | S's user set |
| $I$ | the intersection of $U_c$ and $U_s$ |
| $F$ | facility location set |
| $L$ | user location set |
| $Q$ | the desired query result |
| $n_s$, $n_c$, $n_I$ and $k$ | the cardinalities of the sets $U_s$, $U_c$, $I$ and $F$ |
| $(sk, pk)$ | the key pair of the Paillier cryptosystem |
| $label_i$ | a label that looks random |
| $H()$ | a hash function |
| $\tau()$ | a random permutation |

Appendix A. Figures of Experimental Results
We give the experimental results (of Section 8.2) in Figures A1 and A2.