A Privacy Preserving Cloud-Based K-NN Search Scheme with Lightweight User Loads

With the growing popularity of cloud computing, it is convenient for data owners to outsource their data to a cloud server. By utilizing the massive storage and computational resources in the cloud, data owners can also provide a platform for users to make query requests. However, due to privacy concerns, sensitive data should be encrypted before outsourcing. In this work, a novel privacy preserving K-nearest neighbor (K-NN) search scheme over the encrypted outsourced cloud dataset is proposed. The problem is to let the cloud server find the K nearest points with respect to an encrypted query on the encrypted dataset, which was outsourced by data owners, and return the search results to the querying user. Compared with other existing methods, our approach leverages the resources of the cloud more by shifting most of the required computational loads from data owners and query users to the cloud server. In addition, there is no need for data owners to share their secret key with others. In a nutshell, in the proposed scheme, data points and user queries are encrypted attribute-wise and the entire search algorithm is performed in the encrypted domain; therefore, our approach not only preserves the data privacy and query privacy but also hides the data access pattern from the cloud server. Moreover, by using a tree structure, the proposed scheme can accomplish query requests in sub-linear time, according to our performance analysis. Finally, experimental results demonstrate the practicability and the efficiency of our method.


Introduction
By outsourcing data and/or tasks to the cloud, even devices with low computational ability can conduct analytic work on a large amount of data. Otherwise, data owners need to build a data warehouse to host their data so that other users can make queries on it for further research or services. Cloud servers benefit the above application scenarios; however, it is well known that security issues arise if sensitive data are involved. For example, medical records have not yet been stored on the cloud platform because the records may contain sensitive information which needs specific privacy protection before being released to the public. A compromised cloud server might expose the medical dataset outsourced from medical data owners or infringe upon patients' privacy by leaking the associated symptoms or diagnosis results.
For the protection of sensitive data, data owners usually encrypt their data before outsourcing them to the cloud. Moreover, to make sure none of the others can retrieve the private data, no one except the data owner themselves can access the decryption key. At the same time, other users who want to access the data can make query requests to the cloud. Interestingly, sometimes these users also want to preserve their query privacy from data owners and the cloud server. Therefore, the query would be encrypted as well. The above-mentioned security issues can be solved if we can process the queries directly over the encrypted dataset and make sure that the whole scheme works as a searchable privacy preserving data storage system.
In this work, we focus on solving the K-nearest neighbor (K-NN) search problem over the encrypted dataset in the cloud. The K-NN search is a basic method widely used for classification and regression in pattern recognition or data mining areas. The K-NN search algorithm takes a dataset and a query as input and outputs K data points closest to the query. Both data points in the dataset and the query can be multi-dimensional data and the measurement of the distance between them is the norm defined in a specific space, usually the Euclidean space. In the proposed scheme, data points are encrypted attribute-wise and so is the query. The entire search algorithm is performed in the encrypted domain and the user obtains the unencrypted results relative to his query. Apparently, the user has to be trusted by data owners since they can obtain a small portion of the plaintext data, eventually.
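As a point of reference for the problem statement above, the plaintext K-NN search can be sketched in a few lines. This is only a baseline under the assumption of Euclidean distance and a linear scan; the function name `knn` and the toy data are hypothetical and not part of the proposed scheme, which performs the search in the encrypted domain.

```python
import heapq
import math

def knn(dataset, query, k):
    """Return the k points of `dataset` closest to `query` (Euclidean norm)."""
    return heapq.nsmallest(k, dataset, key=lambda p: math.dist(p, query))

# Hypothetical 2-D toy data, not taken from the paper.
points = [(1, 1), (2, 2), (5, 5), (0, 3), (9, 9)]
nearest = knn(points, (1, 2), 2)
```

The linear scan touches every data point, which is the cost that the tree-based search discussed later aims to avoid.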
Several works have been proposed to solve the above-described privacy-preserving K-NN (PPkNN) search problem, but all of them have certain limitations, such as the need to share the data owner's decryption key with others, additional storage at the user's end, the use of two non-colluding servers, and the need for the query user to refine the resultant data. In this work, we try to overcome those limitations and find a new approach which meets the needs of privacy preservation and searchability simultaneously. Moreover, unlike most existing methods, which search for the K-NN results linearly through the entire dataset, our approach leverages a tree structure, named R-tree [1], to store the data, making the search complexity lower than linear. In short, the contributions of our work can be summarized as follows:
a. We build a new encryption scheme by combining the ideal secure order-preserving encryption [2] with the well-known Paillier cryptosystem [3] and show that the resultant scheme is still ideally secure.
b. We propose a PPkNN search scheme based on the new encryption method described above, which can perform comparison and addition operations in the encrypted domain, and use an R-tree structure to achieve sub-linear search complexity.
c. Benefiting from the new encryption method, the construction of an R-tree can also be done by the cloud server in the encrypted domain, which reduces the computational overhead of data owners and enhances the query user's security by preventing data owners from knowing the data access pattern.
d. We address and overcome the limitations of related works and compare our work with them.
e. Experimental results over synthetic datasets show that our method is practical and efficient. Moreover, the workloads on the user ends are very lightweight.
The rest of this paper is organized as follows. In Section 2, we survey some related works and discuss their limitations. The preliminary backgrounds to understand our scheme are introduced in Section 3 and the proposed work is addressed in Section 4. The security guarantees, such as the ideal security definition, the data privacy, and the query privacy, are analyzed in Section 5. In Section 6, we evaluate the performances of our work by complexity analyses and demonstrate them by the experimental results. Finally, Section 7 concludes this work.

A Brief Survey of the Access Control Mechanisms
As mentioned by one of the anonymized reviewers, cloud computing provides on-demand and scalable services; therefore, the corresponding environment is highly dynamic. In addition to seeking the assistance of cryptographic techniques, applying an access control mechanism is another fundamental and important choice that can meet the security requirements in the cloud environment. The main goal of applying an access control mechanism is to restrict users from performing any unauthorized activity, so as to protect the sensitive information.
As summarized in [4], there is a wide variety of models, methods, and policies proposed for designing an access control system. Each access control system has its own attributes, methods, and functions based on a set of policies. For example, the Mandatory Access Control model (MAC) [5] is an access control policy in which a subject or request initiator can perform some sort of operation on a particular object or resource. When a subject or user attempts to access an object or resource, an authorization rule is enforced to determine whether the access can take place by examining the security attributes. The Discretionary Access Control model (DAC) [6] is another well-known access control policy, which determines the owner of an object. The owner decides who is allowed to access the service based on users' identities. Some researchers analyzed the dynamic requirements of the cloud environment and introduced the Role Based Access Control model (RBAC) [7] to it. RBAC is an access control policy determined by the system rather than by the owner. The RBAC model can only be applied within a closed network and is based on identification: it only checks the user identity and the roles assigned to the user, and based on these roles it decides whether the user is authorized or not.
However, as also commented in [4], both MAC and DAC can only be applied to specific environments such as operating systems and database management systems, while the RBAC model fails to check malicious activity performed by authorized users. In response to these shortcomings, the trust-based mechanisms (TBMs) [8] and the context-aware RBAC (CA-RBAC) [9][10][11][12], which fit well with today's dynamic environments provided by cloud and fog computing, have arisen. Both TBMs and CA-RBAC effectively secure sensitive information by authorizing legal users and protecting the cloud resources from malicious activities. However, the focus of this work is to find a cryptography-based security solution for the cloud-based K-NN search problem, and a complete survey of TBMs and CA-RBAC cannot be provided within such a limited space. Interested readers can find the corresponding details in the listed references and the references therein.

A General Survey of Privacy-Preserving K-NN Problem
As mentioned in Section 1, many works have been proposed to solve the secure K-NN search problem over the encrypted outsourced dataset in the cloud. In [13], a PPkNN approach was proposed to provide the privacy of dataset, input query, kNN result, and data access pattern, but it is vulnerable to collusion attacks. In other words, it assumed that there is neither collusion among cloud servers nor collusion between any data owner and cloud server. In [14], a privacy preserving medical diagnosis system using E-health was proposed, in which the cloud server worked in a privacy preserving manner even though the medical datasets are owned by multiple data owners. As a building block of the proposed diagnosis system, the authors designed a deterministic K-NN based privacy preserving protocol for finding the k data with the highest similarity to a queried symptom, which reduces the average running time by 35% compared to that of a previous probabilistic-behaved work [15]. Basically, this PPkNN protocol is constructed on the basis of multiparty computation (MPC) based on secret sharing, and therefore works in a distributed manner without any trusted server. A survey about existing works related to PPkNN can be found in [14]. Similarly, in [16], the authors focused on solving the clustering problem over encrypted cloud data. In particular, they proposed a privacy-preserving k-means clustering technology over encrypted multi-dimensional cloud data by leveraging the scalar-product-preserving encryption primitive, called PPK-means. The proposed technique is able to achieve efficient multi-dimensional data clustering as well as preserve the confidentiality of the outsourced cloud data. The authors claimed that their work is the first one to explore the privacy-preserving multi-dimensional data clustering in the cloud computing environment. 
Extensive experiments on simulated and real-life datasets demonstrate that the proposed PPK-means is secure, efficient, and practical.

A Survey of the Most Related Works
In this subsection, we briefly describe the most related works and their corresponding characteristics such as the scheme structure, the main idea of searchable encryption, the security definition and the corresponding performances.
Wong et al. [17] presented an asymmetric scalar-product preserving encryption (ASPE) scheme to perform the secure K-NN computation on encrypted databases. The word asymmetric indicates the fact that the data owner and the query user perform their encryption in different ways. The scalar product between the encrypted query and the data points is preserved, and the encryption equations built by ASPE allow the user to find the K closest points among all the encrypted data. The data owner is forced to share the decryption key with users so that they can recover the encrypted results after receiving them. We consider that a risk of revealing more private information to users than expected. In addition, the method simply uses a linear scan to find the results, whereas the complexity we require is sub-linear.
Hu et al. [18] discussed the general problem of query processing over an untrusted data cloud, including K-NN queries and range queries. Unlike most of the other methods, which simply outsourced the data to the cloud server, they restructured their data based on an R-tree and sent the associated indexes to users. The query processing procedure is executed by the cloud server, which has the decryption key, and the user. They leverage a privacy homomorphism encryption method to make a secure version of the K-NN best first search (BFS) algorithm over the R-tree executable. However, most of the computational cost associated with the processing procedure is borne at the user end, which is against the original intention of using the cloud. Moreover, the indexes of the R-tree need to be stored, which consumes additional memory at the user's end. Furthermore, the scheme was proved not secure under a probing attack [19]: after sending an adequate number of query requests, the user is able to recover the plaintext owned by the data owner.
Some security issues of [17,18] have been reported by Yao et al. in their work [19]. They revisited the above two methods and proved that they are insecure under chosen plaintext attack by introducing a new attack model. Another secure nearest neighbor finding scheme using a secure Voronoi Diagram (SVD) was presented in [19]. The SVD scheme induces a partition of the dataset and the data owner encrypts those partitions and sends them to the server with their identifiers. The query user finds the nearest neighbor of his query by asking server to return the partitions relative to the results. However, the above-mentioned limitations such as the need of sharing decryption key with users and the additional storage consumption (to store the descriptions of partitioned data) still exist in this scheme. Users also need to provide extra computational resources to retrieve the accurate results among the returned partitions. Worst of all, the construction of SVD on data with higher than two dimensions needs large computational cost.
Considering the security level more strictly, Elmehdwi et al. [20] proposed a fully secure K-NN search protocol over encrypted datasets in an outsourced environment. They introduced several basic security protocols, e.g., secure multiplication, secure squared Euclidean distance, secure minimum, and secure bit-or, with the Paillier cryptosystem to build the whole secure scheme, which not only preserves the data privacy and query privacy but also hides the data access pattern from the cloud server. Unfortunately, the protocols are based on the setup of two non-colluding cloud servers, and the corresponding complicated protocols between the two servers lead to a long query response time. Those characteristics are regarded as the disadvantages of this fully secure scheme, which may handicap its value in practical usage. Our approach relaxes the security guarantee of this scheme by not requiring that the data access pattern be hidden from the server, so as to enhance the practicability.
Wang et al. [21] proposed a practical method using the ideal secure order preserving encryption and the R-tree structure to solve a similar problem on large-scale data. In their work, the search procedure includes two rounds of interaction between the user and the server. In the first round, the server narrows down the candidate results by identifying which minimum bounding rectangle (MBR) contains the query and sends the points within that MBR back to the user. Then the user creates a search box according to the nearest point in the returned set and sends it to the server, and the server outputs all points in that box as the resulting set. One can see that this scheme, like [19], does not return accurate results, and extra distance calculations must be performed by the user. The problem structure of this work is also different from the ordinary one; for example, in [21] a user plays the roles of both the data owner and the query user. That is, if we want to allow a third-party user to make a query request, the data owner still needs to share the decryption key.
In addition, a few new works have been proposed to solve the same problem, but most of them do not meet our needs either. Those works either need two non-colluding servers as a model assumption [22] or only return the indexes of the search results [23,24]; moreover, a decryption key shared by the data owner is a must if the user wants to obtain the unencrypted results.
In summary, plenty of privacy preserving K-NN search schemes have been proposed, but all of them have certain shortcomings that put up barriers to their usage in practice. On the other hand, our approach achieves a relaxed security level, which preserves the data privacy of the data owner and the query privacy of the query user but reveals the data access pattern to the cloud server. Notice that the leakage of the data access pattern is not an issue when the stored data has been encrypted. In other words, without the decryption key, no damage occurs even if the server knows which ciphertexts the user has accessed. More importantly, the limitations of the other works have been relaxed to some extent by our scheme. We summarize the characteristics of the proposed scheme as follows and also show the comparisons with some related works in Table 1:
a. The search complexity of our scheme is lower than linear.
b. The data owner does not have to share the decryption key with the cloud server or other query users.
c. Our approach does not need extra local storage at the users' end and consumes extremely small computational resources.
d. The cloud server returns neither approximate results nor the indexes of the resultant data points, but the accurate K-NN values.
e. Our approach is not only suitable for two-dimensional spatial tuples but can also be extended to high-dimensional datasets directly.
f. Compared with some methods in which two non-colluding servers are required, one honest-but-curious server is sufficient for our scheme.

Preliminaries
For the ease of explanation, the three most important components of the proposed scheme: Paillier Cryptosystem, Order-preserving Encryption, and R-tree will be briefly reviewed in this section.

Paillier Cryptosystem
The Paillier cryptosystem is a probabilistic public key cryptographic system proposed by Pascal Paillier in 1999 [3]. Since it is a secure and efficient cryptosystem with the additive homomorphic property, it has been utilized in many existing applications such as E-voting. The Paillier encryption scheme, denoted as (KeyGen, Enc, Dec), is defined as follows:

a. KeyGen
The key generation procedure is responsible for generating a pair of keys $(pk, sk)$. The public key $pk$ is for data encryption and is available to all users in the system. The secret key $sk$ is for decryption, and only a privileged member, e.g., an official manager, has the right to access it. A simplified variant of the generation steps [25] can be described as follows: (1) Randomly select two large primes $p$ and $q$ such that $pq$ and $(p-1)(q-1)$ are relatively prime, i.e., $\gcd(pq, (p-1)(q-1)) = 1$.

b. Enc
The encryption procedure takes the plaintext message $m \in \mathbb{Z}_n$, where $n = pq$, and $pk$ as input, while the output is the encrypted ciphertext $c$. Notation $E_{pk}(\cdot)$ is used to represent the encryption procedure with the encryption key $pk$. The following two steps complete the corresponding encryption process.

c. Dec
The decryption procedure takes the ciphertext $c \in \mathbb{Z}^*_{n^2}$ and $sk$ as input, while the output is the decrypted plaintext message $m$. Notation $D_{sk}(\cdot)$ is used to represent the decryption procedure with the decryption key $sk$. The decryption procedure can be represented by the following formulas.
The most well-known feature of the Paillier cryptosystem is its additive homomorphic property, which enables users to perform additions directly in the encrypted domain without decrypting the ciphertexts first. From the decryption point of view, the additive homomorphic property of the Paillier cryptosystem includes:
(1) Decrypting the product of two ciphertexts $c_1$ and $c_2$ yields the sum of the two corresponding plaintexts $m_1$ and $m_2$, that is, $D_{sk}(c_1 \cdot c_2 \bmod n^2) = D_{sk}(E_{pk}(m_1) \cdot E_{pk}(m_2) \bmod n^2) = m_1 + m_2 \bmod n$.
(2) Decrypting a ciphertext $c_1$ raised to the power of a plaintext $m_2$ yields the product of the corresponding plaintexts $m_1$ and $m_2$, that is, $D_{sk}(c_1^{m_2} \bmod n^2) = m_1 \cdot m_2 \bmod n$.
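The two homomorphic properties can be checked numerically with a small sketch. The code below assumes the commonly used simplified Paillier variant with $g = n + 1$, $\lambda = (p-1)(q-1)$ and $\mu = \lambda^{-1} \bmod n$; this choice, the function names, and the toy primes are assumptions made for illustration and may differ from the exact steps in [25]. It requires Python 3.8 or later for the modular inverse via `pow`, and the key size is far too small for real use.

```python
import math
import secrets

def keygen(p, q):
    """Toy key generation for the simplified variant (assumed: g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1)
    assert math.gcd(n, lam) == 1          # condition from step (1)
    mu = pow(lam, -1, n)                  # modular inverse of lam mod n
    return (n, n + 1), (lam, mu)          # pk = (n, g), sk = (lam, mu)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    while True:                           # pick random r in Z_n^*
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n        # L(u) = (u - 1) / n, then times mu

# Usage with toy primes (hypothetical values, insecure key size).
pk, sk = keygen(1789, 1861)
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 10), encrypt(pk, 14)
sum_plain = decrypt(pk, sk, (c1 * c2) % n2)     # property (1)
prod_plain = decrypt(pk, sk, pow(c1, 3, n2))    # property (2) with exponent 3
```

Because a fresh random value is drawn for every encryption, encrypting the same plaintext twice generally gives different ciphertexts, which is the probabilistic behaviour mentioned at the start of this subsection.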

Order-Preserving Encryption
Order-preserving encryption or order-preserving encoding (OPE) [26] is a special kind of encryption scheme in which the order relationship between plaintext messages is preserved after the encryption is done. The security of an ideal OPE should be defined with indistinguishability under ordered chosen plaintext attacks (IND-OCPA) [27]. That means the ciphertexts of an OPE should reveal nothing more but the ordering of the plaintexts. Popa et al. [2] presented the first ideal-security protocol for OPE. They showed that the mutability of some ciphertexts is required for ideal-security OPE even if the encryption model is already stateful and interactive. The proposed mutable order preserving encoding (mOPE) protocol is conducted between a client (or data owner) and an OPE server. A basic mOPE protocol scheme, denoted as (KeyGen, InitState, Enc, Dec, Order) , can be understood as follows:

a. KeyGen
The client takes a security parameter κ as input and generates the secret key $sk$ by the key generation module of any symmetric deterministic encryption scheme (SDES) which satisfies the pseudo-random function security property [28].

b. InitState
The server initializes its state by creating a binary search tree, or a multi-way B-tree, as the OPE tree and a table as the OPE table. After encryption, the nodes of the OPE tree contain the ciphertexts, arranged so that the plaintexts underlying the ciphertexts in a left subtree are smaller than those in the corresponding right subtree. The OPE table records a mapping from each ciphertext to its corresponding OPE value.

c. Enc
The encryption algorithm of mOPE is run interactively by a client and a server. The client takes $sk$ as input to encrypt the original plaintext $m$. After running the algorithm, the server state is updated. The full encryption scheme can be addressed as follows: (1) Client: Encrypt $m$ by SDES and send the ciphertext $c$ to the server. The OPE value of a ciphertext in the OPE table is computed from the path from the root to the node holding the ciphertext and from the position of the ciphertext in that node, based on Equation (2), where [path] is the concatenation of the binary strings of the path pointer indexes from the root to the node, and [pos] is the binary string of one plus the position index of the ciphertext in that node. The total length of both depends on the maximum number of ciphertexts in a node. Figure 1 shows an example of the encryption algorithm described above. The texts in each node represent a plaintext-ciphertext pair. The blue strings represent the path string from the root to the target node. The red strings represent the ciphertext positions in that node.

d. Dec
The decryption of mOPE is just the same as the decryption of SDES. There are no additional operations needed for calculating OPE value or OPE binary string.

e. Order
Since OPE table stores the ordered OPE values for each ciphertext, by taking ciphertexts as input, we can compare which corresponding plaintext is larger by using the ordering function Ord(•). Take numbers 10 and 14 in Figure 1 as an example, since 10 < 14, Ord("x2b017") = 1 < Ord("x6481d") = 6.
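The interaction behind Enc and Order can be illustrated with a deliberately simplified sketch. It assumes a plain binary search tree and a binary-tree style encoding (path bits, then a 1, then zero padding) rather than the B-tree [path][pos] layout of Equation (2); it omits rebalancing, so the "mutable" part of mOPE is not shown; and it uses XOR with a fixed key as a toy stand-in for SDES, which is not secure. The class and method names are hypothetical.

```python
class Client:
    """Holds the secret key. XOR is a toy stand-in for SDES and is NOT secure."""
    def __init__(self, key):
        self.key = key

    def enc(self, m):
        return m ^ self.key

    def dec(self, c):
        return c ^ self.key

    def is_smaller(self, c_new, c_node):
        # Interactive step: the client decrypts both and reports only the order.
        return self.dec(c_new) < self.dec(c_node)

class Server:
    """Stores ciphertexts in a binary search tree keyed by path strings."""
    def __init__(self, max_depth=8):
        self.tree = {}                # path ("", "0", "10", ...) -> ciphertext
        self.max_depth = max_depth

    def insert(self, c, client):
        path = ""
        while path in self.tree:
            if self.tree[path] == c:  # deterministic encryption: already stored
                return
            path += "0" if client.is_smaller(c, self.tree[path]) else "1"
        if len(path) > self.max_depth:
            raise OverflowError("tree too deep; real mOPE would rebalance here")
        self.tree[path] = c

    def order(self, c):
        # Assumed encoding: path bits, a 1, then zero padding to a fixed length.
        path = next(p for p, v in self.tree.items() if v == c)
        bits = (path + "1").ljust(self.max_depth + 1, "0")
        return int(bits, 2)

client = Client(key=0x5A)
server = Server()
for m in [10, 14, 3, 25, 12]:
    server.insert(client.enc(m), client)
```

In this sketch the server never sees a plaintext; it learns only the left or right answers returned by the client, and the resulting OPE values increase with the underlying plaintexts.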

R-Tree
R-tree is a tree data structure proposed by Antonin Guttman in 1984 [1]. It is widely used in many multi-dimensional data management tasks, such as K-NN or geometric search. Each node in an R-tree represents an MBR of all its children MBRs. At the leaf level, each MBR contains a specific number of spatial data points or data objects as its children. R-tree is also a balanced tree, like B-tree, so the average time complexity for searching a data point on it is $O(\log_M N)$, where $M$ is the maximum number of children in each node and $N$ is the total number of data points. Figure 2 shows an example of a two-dimensional R-tree, in which $M = 3$ and $N = 10$.

Bulk-Loading
Bulk-loading is a construction method for R-trees which loads many data points into an R-tree at once. Unlike the normal construction method, where the data points are inserted one by one, bulk-loading provides a more efficient construction but needs to know all data points beforehand. An R-tree constructed by the bulk-loading method also has better query performance since the overlap between MBRs can be reduced.
Leutenegger et al. [29] proposed an efficient R-tree bulk-loading algorithm in 1997. The leaf-MBRs of an R-tree constructed by this algorithm do not overlap at all, and the MBRs in higher levels overlap only slightly. The following is the construction algorithm for $d$-dimensional data points, where $M$ is the maximum number of children in a node:
(1) Set $n = N$; if $n < M$, simply create the root-MBR and end the algorithm.
(2) Calculate the number of leaf-MBR pages, that is, $P = \lceil n / M \rceil$, where $\lceil \cdot \rceil$ denotes the ceiling function.
(3) Sort the data points according to the $d$-th coordinate and divide them into $S = \lceil P^{1/d} \rceil$ slices. Each slice has $M \cdot \lceil P^{(d-1)/d} \rceil$ data points.
(4) Recursively process each slice by repeating steps 2 and 3, where $n$ is the current number of data points in the slice and $d = d - 1$, until each slice contains only $M$ data points.
(5) Create an MBR for each slice with pointers pointing to all data points in that slice.
(6) Treat the MBRs created in step 5 as data points and create higher level MBRs based on step 1.
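The slicing in steps (2) to (5) can be sketched as follows for the leaf level. This is a minimal illustration under a few assumptions: points are plain tuples, coordinates are processed in increasing index order rather than from the $d$-th downwards (the tiling is the same up to the order of dimensions), and the helper names `ceil_root`, `str_leaves`, and `mbr` are hypothetical. Higher levels, as in step (6), would repeat the procedure on the resulting MBRs.

```python
import math

def ceil_root(x, d):
    """Smallest integer s with s**d >= x (exact, avoids float rounding)."""
    s = 1
    while s ** d < x:
        s += 1
    return s

def str_leaves(points, M, dim=0):
    """Partition points into leaf groups of at most M points each."""
    n = len(points)
    if n <= M:
        return [list(points)]
    d = len(points[0]) - dim                      # dimensions still to slice on
    pts = sorted(points, key=lambda p: p[dim])
    if d == 1:                                    # last dimension: cut into pages
        return [pts[i:i + M] for i in range(0, n, M)]
    P = math.ceil(n / M)                          # number of leaf pages
    slice_size = M * ceil_root(P ** (d - 1), d)   # M * ceil(P^((d-1)/d))
    leaves = []
    for i in range(0, n, slice_size):
        leaves.extend(str_leaves(pts[i:i + slice_size], M, dim + 1))
    return leaves

def mbr(pts):
    """Minimum bounding rectangle as (lower corner, upper corner)."""
    return (tuple(min(c) for c in zip(*pts)), tuple(max(c) for c in zip(*pts)))

# Hypothetical data: 10 two-dimensional points with M = 3.
points = [(x, (3 * x) % 10) for x in range(10)]
leaves = str_leaves(points, 3)
boxes = [mbr(leaf) for leaf in leaves]
```

Note that the only operations applied to the coordinates are comparisons (for sorting and for the MBR corners), which is what makes this construction a candidate for execution on order-preserving ciphertexts.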

K-NN Search over R-Tree
The structure of an R-tree can improve the performance of a K-NN search. By traversing only a part of the MBRs and the data points in them, there is no need to compute the distance between the query point and all data points. The idea of the best first search (BFS), proposed in [20] for K-NN search over R-tree, is to always access the nearest MBRs or data points first by using a priority queue. The input of the algorithm is an R-tree, $T$, and a query point $q$, and the output is $R$, the set of $K$ data points nearest to $q$. The algorithm can be understood as follows:
(1) Initialize a priority queue $Q$ and the result set $R$.
(2) Enqueue the root of $T$ into $Q$.
(3) While $R$ has fewer than $K$ elements, dequeue the element $e$ with the smallest distance from $Q$. If $e$ is a data point, add it to $R$; otherwise:
i. Compute the distance from $q$ to each of the children of $e$.
ii. Enqueue all children of $e$ into $Q$ using their distance to $q$ as priority.
At the beginning, we enqueue the root of the R-tree into the priority queue. For every element dequeued from the priority queue, we check whether it is a data point. If so, we add it to the result set; otherwise, it is still an MBR and we enqueue its children into the priority queue. The algorithm ends when the result set has K elements. Since the priority used in the priority queue is the distance to the query point, we can always access the best candidate containing the resultant points.
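The best first search described above can be sketched in the plaintext domain as follows. The sketch makes several assumptions: the tree is built by a naive sort-and-chunk packing instead of a proper bulk-loading method, the priority of an MBR is its minimum Euclidean distance to the query, and the names `Node`, `mindist`, `build`, and `knn_bfs` are hypothetical.

```python
import heapq
import itertools
import math

class Node:
    """An MBR whose children are either Nodes or data points (tuples)."""
    def __init__(self, children):
        self.children = children
        if isinstance(children[0], Node):
            corners = [c for ch in children for c in (ch.lo, ch.hi)]
        else:
            corners = children
        self.lo = tuple(min(c) for c in zip(*corners))
        self.hi = tuple(max(c) for c in zip(*corners))

def mindist(q, lo, hi):
    """Distance from q to the closest point of the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def build(points, M):
    """Toy packing: sort the points, chunk them into leaves, then stack levels."""
    pts = sorted(points)
    level = [Node(pts[i:i + M]) for i in range(0, len(pts), M)]
    while len(level) > 1:
        level = [Node(level[i:i + M]) for i in range(0, len(level), M)]
    return level[0]

def knn_bfs(root, q, k):
    tie = itertools.count()                 # tie-breaker so the heap never compares Nodes
    heap = [(0.0, next(tie), root)]
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if not isinstance(item, Node):      # a data point: next nearest neighbour
            result.append(item)
            continue
        for child in item.children:         # an MBR: enqueue all of its children
            if isinstance(child, Node):
                dist = mindist(q, child.lo, child.hi)
            else:
                dist = math.dist(q, child)
            heapq.heappush(heap, (dist, next(tie), child))
    return result

# Hypothetical data: 20 two-dimensional points, at most 3 children per node.
pts = [(x, (7 * x) % 11) for x in range(20)]
root = build(pts, 3)
q = (4.2, 6.1)
res = knn_bfs(root, q, 3)
```

Since the distance to an MBR never exceeds the distance to any point inside it, a data point is only dequeued once no unexplored MBR could contain a closer one, so the points leave the queue in order of increasing distance.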

Overview
In this section, the structure of our proposed scheme and some notations will be described first, and the details of the scheme will be addressed in the remaining paragraphs. There are three parties (or players) involved in the proposed scheme: the data owner (DO), who owns the original data; the cloud server (CS), which provides the storage for the data and shares the computation loads of the K-NN search; and the query user (QU), who makes the K-NN search query request to CS and obtains the final results. In order to preserve the privacy of the data, DO encrypts his data before outsourcing them to CS. On the other hand, QU wants to protect his own query privacy from being revealed to DO and CS, so he also encrypts the query. We assume that CS is an honest-but-curious third party, i.e., it will follow the protocol to maintain its business credit but will try to learn as much as it can about the data or the query during the execution of the procedures. We leverage the Paillier cryptosystem as the encryption method for DO and QU so that both of them can perform encryption with the same public key while the privilege of decryption is retained by DO. The homomorphic properties of Paillier help us shift some operations to CS in the encrypted domain. At the same time, the ordering between the encrypted data should be preserved if we want to outsource the construction of the R-tree and the execution of the BFS algorithm as well. After executing the BFS algorithm, the results returned to QU should be decrypted, but without revealing them to DO; otherwise, the approximate value of the query will be known by DO. CS scrambles the results before sending them to DO for decryption and helps QU restore them, and then QU obtains the real data results. The structure of the proposed scheme and the detailed steps of the proposed protocol are presented in Section 4.4.
In the next section, we describe the main encryption method used in the proposed scheme. Some security data management building blocks, used by the server for protecting query privacy, are introduced in Section 4.3, and a newly proposed privacy preserving K-NN search method is described in Section 4.4.

Encryption Method
As mentioned above, the encryption method in the proposed protocol is a combination of the Paillier cryptosystem and OPE. By utilizing the properties of both, CS is able to perform additive operations and check the order between ciphertexts without asking DO to decrypt them first. First of all, we replace SDES in mOPE by the Paillier cryptosystem. There are several issues to be considered because Paillier is neither a symmetric nor a deterministic cryptosystem. In mOPE, the order relation is preserved by the OPE tree, whose construction needs decryption. If we change the cryptosystem from a symmetric to an asymmetric one, all members of the system are able to encrypt data using the public key but cannot obtain the order unless access to the secret key is provided. Therefore, no matter how the other members encrypt their own new data by the public key, they are unable to find out the original data of DO through a probing attack. More security analyses will be presented in Section 5. On the other hand, the original mOPE makes use of a deterministic algorithm, so two equal plaintexts yield the same ciphertext, and the mOPE protocol can simply terminate if a ciphertext already exists in the OPE table. If we change SDES to Paillier, two equal plaintexts will be encrypted into two different ciphertexts and their OPE values will also be different. This is not a problem, as will become clear once the whole scheme has been described, because the corresponding OPE values are still adjacent to each other in ascending or descending order and ciphertexts of equal plaintexts are always decrypted to the same number. The homomorphic properties of Paillier also need to be considered in this combination. The situation in which some system members, without having the secret key, create new ciphertexts with the aid of homomorphic addition is similar to the symmetric/asymmetric situation described above. The OPE value of a new ciphertext obtained from homomorphic addition will not be the sum of the two OPE values. In order to obtain the real one, one first has to access the secret key of DO. The full encryption scheme combining Paillier and mOPE is denoted as (KeyGen, InitState, Enc, SmOPE, Dec, Order) in the rest of this paper.

a. KeyGen
As in the KeyGen of Paillier, DO generates the public key $pk$, which DO and QU use for encryption, and keeps the secret key $sk$ to himself for conducting the decryption job.

c. Enc
As in the Enc of Paillier, whoever holds $pk$ encrypts a plaintext $m$ to a ciphertext $c$ with the Paillier cryptosystem.

d. SmOPE
SmOPE stands for the secure mOPE. First of all, in the first step of the Enc of mOPE, we replace the SDES encryption by taking the ciphertext produced by the Paillier-based Enc above. Secondly, we do not need to check whether the ciphertext is already in the OPE table (in step 2) because Paillier is a non-deterministic algorithm. The rest of the protocol will be described, as one of the security building blocks, in Section 4.3.

e. Dec
The data owner is the only one who owns the secret key in this scheme, and a ciphertext is decrypted in the same way as in Paillier decryption.

f. Order
We can obtain the ordering between Paillier ciphertexts because of the functionality of mOPE. However, the difference from the original Order of mOPE is that if $m_1$ and $m_2$ are two plaintexts with $m_1 = m_2$, there will be two different ciphertexts $c_1$ and $c_2$, where $c_1 \neq c_2$ and Ord($c_1$) < Ord($c_2$) if $m_1$ is encrypted before $m_2$.
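The tie-breaking behaviour of Order can be sketched as follows. This is a simplified stand-in rather than the mOPE tree itself: `toy_encrypt`, `toy_decrypt`, and `OrderTable` are hypothetical names, the "ciphertext" is just a tagged tuple standing in for a probabilistic ciphertext, and the order value is a plain rank instead of a tree-path encoding.

```python
import bisect
import itertools

_nonce = itertools.count()

def toy_encrypt(m):
    """Stand-in for a probabilistic cipher: equal plaintexts give distinct tokens."""
    return ("ct", next(_nonce), m)

def toy_decrypt(c):
    """Stand-in for decryption by the key holder."""
    return c[2]

class OrderTable:
    """Keeps ciphertexts sorted by plaintext; ties are broken by insertion time."""

    def __init__(self):
        self._plain = []  # sorted plaintexts (key holder's view)
        self._cts = []    # ciphertexts kept in the same order

    def insert(self, c):
        m = toy_decrypt(c)
        # bisect_right places a new ciphertext after all equal plaintexts,
        # so an earlier encryption of the same value keeps the smaller order.
        pos = bisect.bisect_right(self._plain, m)
        self._plain.insert(pos, m)
        self._cts.insert(pos, c)

    def ord(self, c):
        """Rank of ciphertext c in the current ordering."""
        return self._cts.index(c)
```

With this placement rule, two ciphertexts of the same plaintext are distinct, sit next to each other in the ordering, and the one encrypted first has the smaller order value.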

Security Building Blocks
In our scheme, only the data owner has the right to use the secret key to decrypt data, so that nobody except himself has the authority to obtain the original data. The data owner will be more willing to provide data to query users through the cloud server if he has such a security guarantee. On the other hand, a query user wants to access the data owner's data by sending queries without revealing his personal query to anyone else. However, the query, even when encrypted by the scheme described in Section 4.2, can easily be revealed to the data owner by running Dec. Thus, the cloud server plays an important role in meeting the needs of both. The cloud server uses several security building blocks so that the data owner does not need to share his secret key with others and, at the same time, the query user can keep his query private even though decryption is involved.

R-Tree Construction on the Encrypted Domain
In existing methods that leverage the R-tree data structure to speed up K-NN search, the construction steps are carried out by data owners in the plaintext domain. After constructing the tree, the data owners encrypt the plaintext values at the tree nodes, leave the pointers between parents and children unencrypted, and send the tree structure to the server. Though the server cannot obtain the real plaintext value in each node, it can obviously learn the data access pattern of a query, since the pointers are unencrypted. Therefore, if we used a construction procedure similar to that of the existing methods, the data access pattern would be revealed to the data owner when running the BFS algorithm. That is unacceptable, because the data owner is the one who built the tree in this situation: once he obtains the access pattern, he can find the data near the query and thus obtain an approximation of the query user's personal query value. To prevent this, we shift the construction steps of the R-tree from the data owner to the cloud server. By doing this, we hide the access pattern from the data owner and also take the computational cost of construction off the data owner. The R-tree construction method introduced in the previous section only needs to sort the data along different dimensions and perform a few computations on the total number of data points, the dimension, and the maximum number of children in a node. The sorting procedure can be carried out in the encrypted domain by Ord(.). The total number of data points and the dimension are known to the cloud server even though the data outsourced by the data owner are encrypted. In our method, the cloud server decides the dimension order followed by the sorting procedure, in addition to the maximum number of children in a node, as a source of randomness in the construction. The altered algorithm of R-tree construction on the encrypted domain, denoted as RtConstruct, is shown in Algorithm 1.

Algorithm 1. RtConstruct
Input: -dimensional encrypted data points
Output: Encrypted R-tree root pointer t
(3) Calculate the number of the leaf MBR pages, = .
(4) Sort the encrypted data points by Ord() according to coordinate and divide them into = slices. Each slice has • data points.
(5) Recursively process each slice by repeating steps 3 and 4; while becomes the current data number of this slice, let = − 1 and = + 1 until each slice contains only data points.
(6) Create an MBR for each slice with pointers pointing to all data points in that slice.
Treat MBRs created in step 6 as data points and create higher level MBRs based on step 2.
Because two parameters are decided by the cloud server and the ciphertexts are sorted with the function Ord(.), the data owner cannot directly know which MBRs or data points are near the query user's query during the execution of the BFS algorithm. The data access pattern is revealed to the cloud server, as in the construction process of existing methods; but this does not matter, since the cloud server cannot perform decryption to find out the plaintext values.
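The sort-and-slice idea behind this kind of bulk loading can be sketched in the plaintext domain. The sketch follows the general sort-tile-recursive pattern and is not a transcription of RtConstruct: the slice-count formula, the function names `str_pack` and `mbr`, and the fixed dimension order are assumptions made for illustration. In the encrypted setting, the comparison inside `sorted` would be answered through Ord(.) rather than on plaintext coordinates.

```python
import math

def str_pack(points, max_children, dim=0, n_dims=None):
    """Group points into leaf pages of at most max_children points each."""
    if n_dims is None:
        n_dims = len(points[0])
    if len(points) <= max_children:
        return [list(points)]
    pts = sorted(points, key=lambda p: p[dim])  # sort along the current dimension
    remaining = n_dims - dim
    if remaining == 1:
        # Last dimension: cut the sorted run into consecutive pages.
        return [pts[i:i + max_children] for i in range(0, len(pts), max_children)]
    pages = math.ceil(len(pts) / max_children)       # leaf pages needed
    slices = math.ceil(pages ** (1.0 / remaining))   # slices along this dimension
    per_slice = math.ceil(len(pts) / slices)
    leaves = []
    for i in range(0, len(pts), per_slice):
        # Each slice is packed recursively along the next dimension.
        leaves.extend(str_pack(pts[i:i + per_slice], max_children, dim + 1, n_dims))
    return leaves

def mbr(points):
    """Minimum bounding rectangle of a group of points as (low corner, high corner)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi
```

Higher tree levels can be built by treating the MBRs of the leaf pages as items and packing them again in the same way.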

Secure Compare and Secure mOPE
The cloud server keeps the structure of the R-tree in order to prevent the data owner from finding out the query through the data access pattern. However, the data owner already has a chance to learn the query before the search starts if we simply follow the original mOPE protocol. The problem is that mOPE requires pre-decryption by the data owner, which would directly reveal the query value and is therefore unacceptable. We need a secure protocol that makes the mOPE procedure more robust.
The data owner should learn nothing about the query, including its plaintext value and its ordering relative to any of his data points. We leverage a secure comparison protocol, SCompare, shown in Algorithm 2, as a building block and propose the secure version of mOPE, denoted as SmOPE, shown in Algorithm 3. We also let the cloud server decide the order in which data points are input to SmOPE once it obtains all the outsourced data. As a result, the data owner does not obtain the structural details of the OPE tree (just as with the R-tree); thus, he truly has no idea what the query value is.

Secure Square Euclidean Distance
The last security building block introduced in this section is the secure square Euclidean distance protocol, SSED. This protocol allows the cloud server and the data owner to jointly compute the squared Euclidean distance between two data points without revealing the plaintext distance value to either of them. It is derived from the security protocols proposed in [20]. In this protocol, the difference between two encrypted data points is first calculated through homomorphic subtraction. Each component of that difference vector is then squared through a secure multiplication protocol. Finally, the ciphertext of the squared Euclidean distance is computed by homomorphically adding all the components of the squared difference vector. We do not need to compute the square root of the squared distance, because it is used only to determine the priority in the BFS algorithm. The secure multiplication, SM, is a multiplication protocol between the cloud server and the data owner for two given ciphertexts; it outputs the ciphertext of the product of the two corresponding plaintexts without letting either party learn any of the plaintext values. Details of SM and SSED are shown in Algorithms 4 and 5, respectively.
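A minimal sketch of the masking idea commonly used for such a secure multiplication, and of a squared-distance computation built on it, is given next. It is not a transcription of Algorithms 4 and 5: the toy Paillier parameters, all function names, and the simulation of both parties in one process are assumptions of this sketch.

```python
import math
import random

# Toy Paillier parameters (tiny primes, insecure, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    """Only the key holder would be able to run this."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def h_add(c1, c2):
    return c1 * c2 % N2          # E(a) * E(b) = E(a + b)

def h_scale(c, k):
    return pow(c, k % N, N2)     # E(a)^k = E(k * a)

def h_sub(c1, c2):
    return h_add(c1, h_scale(c2, N - 1))  # E(a - b)

def key_holder_multiply(ca, cb):
    """Key holder's step: it only ever sees additively masked values."""
    return enc(dec(ca) * dec(cb) % N)

def secure_multiply(ca, cb):
    """Server's steps: mask both inputs, ask for the product, remove the masks."""
    ra, rb = random.randrange(N), random.randrange(N)
    ch = key_holder_multiply(h_add(ca, enc(ra)), h_add(cb, enc(rb)))
    # (a + ra)(b + rb) - a*rb - b*ra - ra*rb = a*b
    out = h_sub(ch, h_scale(ca, rb))
    out = h_sub(out, h_scale(cb, ra))
    return h_sub(out, enc(ra * rb % N))

def ssed(cx, cy):
    """Encrypted squared Euclidean distance of two encrypted points."""
    total = enc(0)
    for a, b in zip(cx, cy):
        d = h_sub(a, b)                                  # encrypted difference
        total = h_add(total, secure_multiply(d, d))      # add its square
    return total
```

The result is correct as long as the true squared distance is smaller than the modulus `N`; a negative difference is represented modulo `N`, and its square reduces to the expected non-negative value.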

A Newly Proposed Privacy Preserving K-NN Search Scheme (PPKSS)
The setup of our scheme starts with the data owner encrypting all data points. The data owner publishes the Paillier public key and then outsources the encrypted data points to the cloud server. Upon receiving the encrypted data, the cloud server permutes them and runs SmOPE on the ciphertexts one by one. The maximum number of ciphertexts in a node of the OPE tree is also decided by the cloud server, so that the data owner cannot obtain the structural details of the OPE tree. After the cloud server has run the RtConstruct algorithm, the system is ready to respond to query requests. A query user can encrypt his query at any time using the public key broadcast by the data owner and send the encrypted query to the cloud server. To avoid requiring the data owner to stay online all the time (which is impractical), the query user needs to notify the data owner every time he wants to make a query request. Upon receiving the request, the data owner checks whether the sender is a trusted query user and gets ready to participate in the protocols initiated by the cloud server. The first protocol to run is SmOPE, which inserts the query into the OPE tree and records its order relation with the other data points in the OPE table. Secondly, the cloud server runs the secure BFS algorithm, SBFS, shown in Algorithm 6, in the encrypted domain, using the encrypted distance as the priority: the encrypted distance is calculated by SSED and pushed into the priority queue by SCompare. Notice that the distance between the query and an MBR should be the distance from the query to the closest point in that MBR, which can be found by using Ord(). Figure 3 shows two examples of finding the closest points to queries in an MBR with the aid of the order-preserving properties described above. The output returned by the algorithm is a set of encrypted data points that represents the K-NN search results of the query.
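For orientation, the plaintext counterpart of such a priority-queue search over an R-tree can be sketched as follows. The node layout (tagged tuples), the function names, and the use of `heapq` are assumptions of this sketch; in the encrypted scheme the distances would be produced by SSED and the heap comparisons answered by SCompare.

```python
import heapq
import itertools

def min_dist2(q, box):
    """Squared distance from query q to the closest point of an MBR (lo, hi)."""
    lo, hi = box
    return sum(max(l - x, 0, x - h) ** 2 for x, l, h in zip(q, lo, hi))

def dist2(q, p):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(q, p))

def best_first_knn(root, q, k):
    """Return the k nearest points to q.

    A tree item is either ("node", (lo, hi), children) or ("point", coords).
    """
    tie = itertools.count()            # tie-breaker so heap never compares items
    heap = [(0, next(tie), root)]
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if item[0] == "point":
            result.append(item[1])     # popped points come out in distance order
            continue
        for child in item[2]:
            if child[0] == "point":
                d = dist2(q, child[1])
            else:
                d = min_dist2(q, child[1])
            heapq.heappush(heap, (d, next(tie), child))
    return result
```

Because an MBR is keyed by the distance to its closest possible point, a data point is only returned once no unexplored MBR could contain anything closer.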
The cloud server then adds a random number to each dimensional component of each point by homomorphic addition and sends the resultant set to the data owner. The data owner decrypts the whole resultant set and returns it to the query user. At the same time, the cloud server sends the plaintext matrix of the random numbers it added to the encrypted results to the query user, to help recover the real plaintext results. The steps of the full privacy preserving K-NN search scheme, PPKSS, are shown in Algorithm 7, and the corresponding flow diagram is given in Figure 4.
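The final masking and unmasking steps can be sketched with a toy Paillier instance. The three functions stand for the three roles and are hypothetical names; the parameters are tiny and insecure, and the sketch assumes all coordinates are smaller than the modulus.

```python
import math
import random

# Toy Paillier parameters (tiny primes, insecure, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def mask_results(enc_points):
    """Cloud server: add a fresh random number to every encrypted coordinate."""
    masks = [[random.randrange(N) for _ in pt] for pt in enc_points]
    masked = [[c * enc(r) % N2 for c, r in zip(pt, row)]
              for pt, row in zip(enc_points, masks)]
    return masked, masks

def decrypt_masked(masked):
    """Data owner: decrypts, but sees only masked coordinates."""
    return [[dec(c) for c in pt] for pt in masked]

def unmask(plain_masked, masks):
    """Query user: plaintext subtractions recover the real coordinates."""
    return [[(v - r) % N for v, r in zip(pt, row)]
            for pt, row in zip(plain_masked, masks)]
```

The query user's work in this sketch is one modular subtraction per coordinate, which matches the lightweight role described for the query user.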

Indistinguishability under Order Chosen Plaintext Attack
The ideal security of OPE is founded on its indistinguishability under ordered chosen plaintext attack (IND-OCPA), defined by Boldyreva et al. [27]. Generally speaking, an ideal OPE method is IND-OCPA secure if the encryption procedure leaks nothing more than the ordering between ciphertexts. An IND-OCPA security game between the data owner, the cloud server, and a malicious adversary describes the definition precisely. Consider two sequences $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ sent from the adversary to the data owner, where $n$ is the sequence length and the sequences have the same order relation, that is, $x_i < x_j$ if and only if $y_i < y_j$ for all indices $i, j$. The data owner then randomly selects one sequence to encrypt, and the adversary attacks the model by guessing which sequence has been chosen. The definition assumes that the adversary is able to observe every ciphertext of the chosen sequence and, if the scheme is interactive, the corresponding state of the server. We say that the adversary wins, denoted Win, if it guesses the sequence randomly selected by the data owner, and the scheme is IND-OCPA secure if Equation (3) is satisfied:
$$\Pr[\text{Win}] \le \frac{1}{2} + \operatorname{negl}(\kappa) \quad (3)$$
where Pr[A] denotes the probability of event A and negl(κ) is a negligible function with parameter κ. That is, the adversary can only have a negligible advantage over random guessing if the scheme is IND-OCPA secure.
We prove that our modified mOPE combined with the Paillier cryptosystem is IND-OCPA secure based on the security guarantees of the two primitives. The Paillier cryptosystem has been proved to provide semantic security against chosen plaintext attack (IND-CPA), so, by definition, it meets a higher security requirement than an OPE scheme. That is, if the scheme is IND-CPA secure, the order relations of the two sequences sent by the adversary are not necessarily the same. Therefore, we can prove that the proposed encryption method provides IND-OCPA security merely by showing that the information leaked while running mOPE is indistinguishable between the two sequences. We prove this by induction on the number of encrypted messages. The base case is the beginning of the procedure: when no message has been encrypted yet, the server state is initialized, which leaks the same initial information to the adversary for both sequences. We assume that, after the $i$-th encryption, the adversary has received the same information no matter which sequence has been encrypted and cannot distinguish the results. Showing that indistinguishability is maintained at the $(i+1)$-th encryption completes the proof. Since sequences $X$ and $Y$ have the same order relation, the associated OPE trees have the same structure after the $i$-th encryption has been conducted. The traversal of the tree and the update of the OPE table at the $(i+1)$-th encryption also proceed in the same way for both sequences. Therefore, by observing the server state, the adversary can obtain nothing but the ordering between the ciphertexts, which is the same for both sequences. By mathematical induction, the mOPE procedure reveals the same ordering information to the adversary in both cases. Additionally, owing to the IND-CPA security of Paillier, this completes the proof that the encryption scheme withstands the ordered chosen plaintext attack (IND-OCPA).

Privacy Preserving of Data
We claim that the data privacy of our scheme is preserved if the following two statements can be justified: (1) The cloud server is prevented from knowing the actual plaintexts of the original data. (2) The query user receives only the K-NN search results for each query request and learns nothing about the rest of the dataset.
Since the data are encrypted by the Paillier cryptosystem before outsourcing and the secret key is held by the data owner at all times, the cloud server cannot learn the plaintexts directly. Therefore, our analysis focuses on the interactions within mOPE. Although we assume an honest-but-curious server, we also consider the situation in which the cloud server submits wrong ciphertexts while running SmOPE, thereby launching a probing attack that tries to find out the plaintexts of the dataset.
The data owner can prevent this kind of attack by setting a reasonable threshold on the maximum number of comparison requests the cloud server may issue to complete the system setup. Any requests beyond this threshold are regarded as malicious behavior, and the data owner can shut the comparison service down before privacy is exposed.
The second requirement for preserving data privacy is much simpler to achieve than the first one, since the query user never has direct contact with the encrypted dataset. The data involved in the K-NN search are shared only by the cloud server and the data owner, so the query user learns nothing about the dataset except the results received. Certainly, one should also set a threshold on the maximum acceptable value of K, to prevent a query user from sending an extremely large value and retrieving the entire dataset.

Privacy Preserving of Query
Apart from preserving data privacy, the proposed scheme preserves query privacy as well. Similarly, two statements have to be justified: (1) The cloud server is prevented from knowing the actual value of the query. (2) While running the search scheme, the data owner must learn nothing about the query, including its plaintext, its order relation with any other point in the dataset, and the corresponding K-NN search result.
The first statement can be justified in a way similar to that of the previous section, i.e., the query can be treated as another new encrypted data point. When the SmOPE protocol is executed, the only information leaked to the cloud server is the order relation. On the other hand, since our approach requires the query user to make a request to the data owner at the same time, the cloud server is unable to maliciously pose as a fake query user and send a long sequence of values to probe for the plaintext data.
Protecting query privacy against the data owner is more complicated, because the data owner is the one who has the decryption key. We design the scheme carefully by introducing randomness into every operation that involves decryption. First of all, the structure of the OPE tree is unknown to the data owner, as its parameters are decided by the cloud server. Under this condition, SmOPE appears to the data owner as a sequence of comparisons of random values. The true value of the query and its order relation with the data points in the dataset are hidden as a result. Secondly, the structure of the R-tree is also unknown to the data owner, as in the case of the OPE tree. Therefore, the data owner has no choice but to treat the SBFS algorithm as an ordinary sequence of multiplications and comparisons. Moreover, the protocols used in SBFS, i.e., SSED and SCompare, always add random numbers to the ciphertexts through the Paillier homomorphic property before they are decrypted, so that the data owner performs operations on obscured data values. Finally, the data owner receives a set of encrypted results at the end of PPKSS; however, even if decryption is performed on all of them, the data owner still cannot obtain the K-NN search result associated with the query, since the results have been randomized in advance.

Performance Evaluation
The effectiveness of our scheme, PPKSS, is demonstrated in the following two different ways. First, we analyze the complexity of the scheme in the next section. Secondly, we show the experimental results in Section 6.2. The full PPKSS can be separated into different parts and each part will be discussed independently.

Complexity Analysis
First of all, the data owner encrypts the dataset attribute-wise and outsources it to the cloud server, which takes O( * ) encryptions. This task can be fully parallelized, since each data item can be encrypted separately. After the outsourcing, the cloud server runs the SmOPE protocol O( * ) times to preserve the order relation of each ciphertext. Breaking SmOPE down further, it consists of O(log( * )) SCompare calls and O(1) OPE table updates, where log( * ) is the height of the OPE tree. The computational load of SCompare is shared by the data owner and the cloud server, which need O(1) Paillier decryptions and plaintext comparisons and O(1) Paillier encryptions and ciphertext multiplications, respectively. Additionally, a hash table is used to implement the OPE table, so each update and access takes O(1) plaintext operations on average.
After the construction of the OPE tree, we leverage the order relation stored in the OPE table to build the R-tree with RtConstruct, which is done simply by sorting. The complexity of RtConstruct is O( log( )), and it can be derived as follows.
Since RtConstruct sorts data along with every dimension at each level of an R-tree, the comparison time using quick sort is the multiplication of and the summation of sorting complexity at each level. That is, * log ≤ log( ) enrolls the system after the setup steps were completed. We will discuss why the scheme is claimed to be with lightweight user loads, later. The complexity of query encryption is the same as the complexity of encryption of an individual data point, where the Paillier encryption is done by . We analyze the complexity of SBFS based on the original search algorithm over an R-tree in the plaintext domain [29]. The analysis is simplified under the assumption that the data in the dataset are uniformly distributed. Let be the hypersphere centered at the query , with (the distance from to the -th result ), as radius. The search complexity or the total time SSED would take is the access number of MBRs from the root of an R-tree to the leaf-MBRs, which is O(log ), plus the total access number of data points of the leaf-MBRs (i) inside , which is O( ), or (ii) intersected by , which is the most complicated part to analyze and it is derived as follows. For simplification, another assumption is made, that is, we assume each leaf-MBR intersected by forms a hypercube with average occupancy of data points. Since the data points are assumed to be uniformly distributed, the expected volume of search region is and the expected volume of each leaf-MBR is . Moreover, the volume of is proportional to , so if = , = .
Similarly, the side volume of each leaf-MBR = . In addition, the number of leaf-MBRs intersected by is the same as that intersected by the circumscribed cube of [30,31], so the 2 * sides of a circumscribed cube intersect O(log + D ) is the expected node number in the priority queue.
Each sub-task of SSED is analyzed separately to see the workload in detail. From the pseudocode given in Section 4.3.3, it follows that SSED runs O( ) SM protocols, each of which needs O(1) decryptions, plaintext multiplications, and encryptions on the data owner's side, and O(1) encryptions, ciphertext multiplications, and exponentiations on the cloud server's side. On the other hand, a heap is used to implement the priority queue. If the expected number of MBRs and data points in the priority queue is | |, which is about O(log + ) in total, then each enqueue or dequeue operation takes O(log (| |)) SCompare calls. Moreover, the complexity of our enqueue and dequeue operations is approximately O(log + + ).
At the end of PPKSS, the cloud server performs O( ) Paillier encryptions and multiplications to randomize the results, and the data owner performs O( ) decryptions. Finally, only O( ) plaintext subtractions, an extremely light load, need to be done by the query user. We summarize our complexity analysis in Table 2, which lists the complexity of each type of operation in the different protocols of our scheme. Clearly, from the table, the operations in the Paillier ciphertext domain, which are the most expensive, are performed by the cloud server. The data owner focuses mainly on decryption tasks and needs to perform only a few plaintext operations. Most importantly, only a very lightweight workload is placed on the query user, which raises the possibility of extending our approach to users with resource-limited mobile devices. Another highlight of our approach is that the search complexity, O(log + + ), is lower than that of a linear scan, whose complexity is O( ), since the dataset size is, in general, much larger than the other parameters.

Experimental Results
The overall performance of our proposed scheme is also demonstrated by several simulation experiments. We implemented the proposed scheme, including the data owner, the cloud server, and the query user, in C and ran it on a laptop running macOS Sierra 10.12.3, with a 2.7 GHz Intel Core i5 and 8 GB of 1867 MHz DDR3 memory. The test dataset for the data owner is randomly generated with different sizes and dimensions. We use a 4-ary B-tree as the OPE tree and set the maximum number of children per node to 5 for the R-tree. The key length of the Paillier cryptosystem is set to 1024 bits. The rest of this section shows the execution time of each part of PPKSS except for the encryption of the dataset and the query, which is simply accomplished by applying the original Paillier encryption. In general, a Paillier encryption of a plaintext data item or a query takes about 7.5 milliseconds (ms) under our settings. Specifically, the timing of the proposed scheme is examined separately for each part so as to show the workload of each party independently. Moreover, the communication cost is also listed, because the proposed procedures always include interactions between two parties. First of all, Figure 5 shows the time cost of performing SmOPE on a single Paillier ciphertext, that is, the time required to insert one ciphertext into the OPE tree and update the corresponding OPE table. When the total number of encrypted attributes grows from 100,000 to 1,000,000, each insertion takes longer, increasing from 40 ms to 45 ms, since the height of the tree grows. Notice that it may take a long time to complete the order-preserving setup, but that does not matter since it is a one-time job. The second task of the setup is RtConstruct, which can be accomplished entirely by the cloud server. We illustrate the time required to construct the R-tree for test datasets of different sizes and dimensions in Figure 6. As the size of the 2-dimensional dataset grows from 20,000 to 100,000, the construction time increases from 8 s to 45 s.
Since the required construction time is proportional to the dimension of the dataset, as expected, Figure 6 shows that it grows linearly with respect to the number of data attributes involved. We define the query search time of our scheme as the time that starts after the cloud server has completed SmOPE for inserting the encrypted query into the OPE table and ends after the query user has removed the random numbers and correctly obtained the unencrypted results. The size of the dataset, the dimension of the data, and the value of K for a K-NN search are used as parameters to generate various sets of experiments. Each experimental result (corresponding to a different set of parameters) is obtained by running 100 queries and recording the average search time. First, by fixing the dimension to 2 and K to 1, that is, finding the nearest neighbor in a 2-dimensional dataset, we illustrate the average search time for datasets with sizes from 10,000 to 100,000 in Figure 7. Since our approach utilizes the R-tree structure to manage the dataset, the search time grows only a little while the size of the dataset increases tremendously. Next, with K set to 1 and the dataset size set to 20,000, Figure 8 shows the search time of our scheme over multi-dimensional datasets. The average search time grows noticeably as the data dimension increases; for example, the search time increases from 3.52 s to 53.49 s when the dimension changes from 2 to 6. The reasons for this are the relatively complicated structure of the R-tree and the increase in the number of data points in MBRs that must be visited when high-dimensional datasets are involved. In the last experiment, we fix the dataset size to 20,000 and the dimension to 2 and conduct K-NN searches with varying K; the resulting search time is shown in Figure 9. As the value of K increases from 1 to 20, the search time grows linearly from 3.52 s to 9.17 s.
Finally, we break the search time of our approach down into the cloud server's CPU time, the data owner's CPU time, and the query user's CPU time to clearly identify the workload of each party. The data transmission costs are also listed to evaluate the extra communication overhead of the proposed scheme. We take a dataset size of 20,000, K = 20, and different dimensions as test parameters and report the composition of the search time in Table 3. One can see that although the search time becomes somewhat longer as the dimension of the dataset increases, about 80% of the computational load is outsourced to the cloud server, and the data owner takes care of the remaining 20%. On the query user's side, the computation overhead is extremely lightweight and negligible. This makes the proposed scheme more applicable and practical even if the data owner and the query user work in resource-limited environments. Notice that there is a high correlation between the communication cost and the total query search time, which is quite reasonable.

Conclusions and Future Works
In this work, a privacy preserving K-NN search scheme is constructed based on a newly proposed encryption scheme, which supports not only comparison but also addition operations in the encrypted domain. Moreover, some disadvantages of related works are addressed and alleviated by our approach. The proposed scheme is realized on the basis of several security protocols among a data owner, a cloud server, and a query user, and the dataset is managed with an R-tree structure. Security analysis shows that the proposed encryption scheme not only achieves IND-OCPA security but also preserves both data privacy and query privacy. Furthermore, we justify by complexity analysis that our approach runs in sub-linear time. Finally, experimental results demonstrate its effectiveness with respect to different sizes of datasets. Most of all, the lightweight user loads enhance the applicability of our approach to resource-limited mobile platforms.
As an interesting extension of PPkNN, Wu et al. [32] presented the first solution to the so-called Group k-nearest neighbor (kGNN) search problem, which allows a group of n mobile users to jointly retrieve, from a location-based service provider (LSP), the k points that minimize the aggregate distance to all of them. The authors identified four protection objectives in the privacy-preserving kGNN (PPkGNN) search: (i) every user's location should be protected from the LSP; (ii) the group's query and the query answer should be protected from the LSP; (iii) the LSP's private database information should be protected from users, i.e., the users cannot learn more information beyond the answer they requested; (iv) every user's location should be protected from the other users in the group. Since the encryption mechanism of our scheme is based on the Paillier cryptosystem, the same as in [32], we might extend our PPkNN scheme to solve the PPkGNN search problem, which is one of our future research directions. Furthermore, how to improve the overall system performance with regard to high-dimensional datasets is another research direction.
One of the anonymous reviewers raised the following important and interesting issue: "How a data owner can be contributed in the process of securing/encrypting cloud dataset?", which is an important aspect of dynamically changing environments such as cloud-based servers. Although this topic is outside the main scope of our current work, it should be included in our future research directions, as suggested by the reviewer. Special thanks go to the same reviewer for pointing out the useful reference [33]. As mentioned by the same reviewer: "Using more than one Cloud servers the processing and computational overheads can be further reduced." This interesting and challenging issue will be included in our future research topics. Thanks also to the reviewer for bringing the useful references [34,35] to our attention.
Our scheme would fail if a cloud server did not follow the pre-defined protocols and returned wrong results to the query user. In other words, how to design a cloud-based privacy preserving K-NN search scheme that is robust to a "Man-in-the-middle attack" is still an open problem.

Conflicts of Interest:
The authors declare no conflict of interest.