Article

Parallelly Running and Privacy-Preserving k-Nearest Neighbor Classification in Outsourced Cloud Computing Environments

Graduate School of Information Security, Korea University, Seoul 02841, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(24), 4132; https://doi.org/10.3390/electronics11244132
Submission received: 8 November 2022 / Revised: 2 December 2022 / Accepted: 6 December 2022 / Published: 11 December 2022

Abstract:
Classification is used in various areas, and k-nearest neighbor classification is one of the most popular methods because it produces results efficiently and performs well. Cloud computing, with its powerful resources, is a reliable option for handling large-scale data efficiently, but many companies are reluctant to outsource data due to privacy concerns. This paper aims to implement a privacy-preserving k-nearest neighbor classification (PkNC) in an outsourced environment. Existing work proposed a secure protocol (SkLE/SkSE) to privately compute the k data with the largest/smallest values, but this protocol discloses information. Moreover, SkLE/SkSE requires a secure comparison protocol, and the existing comparison protocols also suffer from information disclosure problems. In this paper, we propose new secure comparison and SkLE/SkSE protocols that solve the abovementioned information disclosure problems, and we implement PkNC with these novel protocols. Our proposed protocols disclose no information, and we prove their security formally. Then, through extensive experiments, we demonstrate that the PkNC applying the proposed protocols is also efficient. In particular, the PkNC is suitable for big data analysis handling large amounts of data, since our SkLE/SkSE is executed for each data item in parallel. Although the proposed protocols do sacrifice some efficiency to improve security, the running time of our PkNC is still significantly lower than that of previously proposed PkNCs.

1. Introduction

In the era of big data, data mining and machine learning are important tools for extracting valuable information and predicting outcomes, and these tools need to be able to analyze large-scale data [1,2]. For the sake of efficiency, large volumes of data are typically analyzed by cloud computing services at large IT companies, such as Amazon and Google [3], which offer ample and easy access to powerful resources for analyzing the plethora of data from a massive number of data owners. From the standpoint of data owners, it is more efficient to have the cloud handle the analysis and return the results than to attempt to analyze their own data. These days, as parallel-processing frameworks, such as Hadoop, become more widely disseminated, it is becoming easier to utilize cloud computing.
Although cloud computing for big data analysis has significant advantages, many companies and users are still reluctant to use these services due to privacy concerns surrounding outsourced cloud computing environments [3]: cloud computing service providers can access and reveal outsourced data, thus causing privacy problems. Even when data owners encrypt their own data before transmission as a preventative privacy-protecting measure, cloud service providers can still obtain information just by analyzing access patterns, i.e., the records of which original data are accessed according to computation results.
As for privacy protection techniques, secure multiparty computation and homomorphic encryption are currently in use. To facilitate privacy-preserving computation on secret values in secure multiparty computation, non-colluding parties perform computations on shares generated from the secret values, so that neither the values nor the computation results can be obtained by any party as long as the parties do not collude. However, since shared data are typically not encrypted, we focus on homomorphic encryption in this paper. Homomorphic encryption is an encryption scheme in which original data can be computed on in encrypted form by a third party, such as a cloud. Partially homomorphic encryption is a type of homomorphic encryption that supports only one type of operation, but an unlimited number of times [4].
Classification is a major data analysis task that is used in a variety of areas, such as medical diagnoses, spam mail detection, and credit evaluation [5,6,7,8,9]. In this vein, k-nearest neighbor classification is popular since it produces results efficiently and yields high performance. Given a classified dataset and an unclassified input query, k-nearest neighbor classification selects the k data most similar to the input query, which is then classified by the majority class of those k data. The target of the present work is to implement a protocol that computes k-nearest neighbor classification privately, which we call privacy-preserving k-nearest neighbor classification (PkNC). Since k-nearest neighbor classification is used for big data analysis, in which the dominant parameter is the number of data, it is critical that the communication round count (i.e., running time) of PkNC be independent of the number of data.
As shown in Figure 1, the proposed PkNC is executed by dual non-colluding cloud servers: a data host (DH) and a cryptographic service provider (CSP). DH receives an encrypted dataset from a data owner and an encrypted input query from a querier, and CSP holds the decryption key. At a high level, DH runs PkNC with CSP and then returns the class of the input query, determined from the dataset, in encrypted form. Our PkNC does not disclose any information about the dataset, the input query, or the resultant class to an adversary, nor even to DH and CSP, which run the PkNC.
In order to compute the k data most similar to an input query (i.e., the k smallest distances between the data and the input query) both privately and efficiently, an existing work [10] proposed a secure k-largest/smallest element (SkLE/SkSE) protocol and applied it to PkNC. SkLE/SkSE executes at most l rounds, where l is the bit length of an element, meaning that if the k elements (distances) with the largest/smallest values are found before the last round, it terminates early for efficiency. The existence of different ending points for each input dataset implies that SkLE/SkSE [10] discloses information about the input dataset. In other words, as the protocol scans from the (l−1)-th bit of all input elements, the bit at which SkLE/SkSE ends carries information about the k largest/smallest elements. For example, if SkLE, which finds the k largest elements, terminates at the first bit (the (l−1)-th bit), the values of the k largest elements are at least 2^(l−1). (This case is realistic, since l is the effective size needed to represent an element rather than the maximum size supported by a protocol.) Conversely, if SkLE terminates at the last bit (the 0-th bit), the smallest value among the k largest elements and the largest value among the remaining elements are either equal or differ by 1. Therefore, it is necessary to protect the ending bit in the existing SkLE/SkSE protocol [10]. Table 1 briefly summarizes the security of our proposed protocols compared with existing protocols.
Moreover, SkLE/SkSE requires a secure comparison protocol, but the existing protocols [10,11] have information disclosure problems. Specifically, when two input data are unequal, DH sends CSP a vector that consists of random values including 0 or 1. When two input data are equal, however, DH sends CSP a vector that consists of only random values, meaning that CSP learns whether the two input data are equal or not.
In addition, researchers have recently proposed many PkNCs, but they are either not formally proven [12,13,14] or expose some information about the input data. PkNCs that are formally proven [11,15,16], unfortunately, become increasingly inefficient as the volume of data grows, making them unsuitable for big data analysis, which must handle large-scale data.

Contributions

In this paper, we propose new secure comparison and SkLE/SkSE protocols that solve the abovementioned information disclosure problems. We subsequently implement PkNC using the proposed protocols and demonstrate through experiments that our proposals are practical. Firstly, we propose a secure comparison protocol that improves security by solving the information disclosure problem. In short, regardless of whether the two input data are equal or unequal, DH sends CSP a similar vector that consists of either random values including 0 or only random values, according to a random coin. Our secure comparison protocol guarantees privacy for the input data and results. We present this proposed secure comparison protocol in Section 4.1 and formally prove its security in Section 4.2.
Secondly, we propose a new SkLE/SkSE that improves security by solving the information disclosure problems in the existing SkLE/SkSE [10]. To achieve this, the proposed SkLE/SkSE consistently terminates in the last round regardless of the input dataset, meaning that it does not disclose any information about the content of the input dataset. We denote the existing SkLE/SkSE [10], which focuses on efficiency, as the efficient version of SkLE/SkSE (SkLE_E/SkSE_E). Similarly, we denote the proposed SkLE/SkSE, which improves security, as the secure version of SkLE/SkSE (SkLE_S/SkSE_S), which we present in Section 4.3. The proposed SkLE_S/SkSE_S preserves the privacy of the input dataset, including the results, and hides data access patterns even from DH and CSP. We formally prove its security in Section 4.4.
SkLE_S/SkSE_S is advantageous because it is highly efficient for large datasets, as it executes for each data item in parallel. In other words, the number of communication rounds, which is proportional to the running time, is independent of the number of data, which makes it suitable for big data analysis. It is additionally suitable for PkNC applications with a large k of nearest neighbors, since its communication rounds are independent of the parameter k. In order to privately compute the k largest/smallest data in a dataset, existing protocols [11,17] must serially run a maximum/minimum protocol, which compares all data, k times. This means that the communication rounds in these existing protocols grow linearly with the number of data and the parameter k. That is, existing works are unsuitable for both big data analysis and PkNC applications with a large k.
In order to demonstrate that our proposed SkLE_S/SkSE_S and secure comparison protocols are practical, we implement a PkNC that includes them and conduct extensive experiments with a real dataset. Figure 2 shows the ratio of the running times of PkNCs for the same volume of data, in which our PkNC is much more efficient than existing PkNCs. Specifically, our PkNC takes 4.38 min for 1728 data and 28.95 min for 8124 data. Note that the running-time characteristics of our PkNC are comparable to those of SkLE_S/SkSE_S, since the running time of SkLE_S/SkSE_S accounts for most of the running time of our PkNC. The performance of our PkNC is greatly improved in the cloud computing environment, since SkLE_S/SkSE_S in our PkNC is executed in parallel and the cloud enables numerous simultaneously running parallel operations. In addition, the running time of our PkNC is also independent of the parameter k of nearest neighbors, like SkLE_S/SkSE_S. We present our PkNC and its experiments in Section 5, where the experimental results support the above arguments.
However, it cannot be denied that our security-enhanced protocols do sacrifice some efficiency. While the existing SkLE_E/SkSE_E [10] runs at most l rounds according to the input dataset, our SkLE_S/SkSE_S consistently runs l rounds regardless, meaning that the number of communication rounds of SkLE_S/SkSE_S is equal to or greater than that of SkLE_E/SkSE_E. Our secure comparison protocol does require one more communication round than the existing comparison protocols [10,11]. Nevertheless, we emphasize that the improved security of our proposed protocols compared to the existing protocols [10,11] outweighs this sacrifice, and our PkNC is still more efficient than existing PkNCs [11,15,16]. Lastly, we summarize the contributions of this paper as follows.
  • We propose a secure comparison (SCI) protocol to solve the information disclosure problem in existing works.
  • Using the secure comparison, we propose new secure k-largest/smallest element (SkLE/SkSE) protocols, which solve the information disclosure problem and hide data access patterns.
  • Using the proposed SkLE/SkSE, we implement a privacy-preserving k-nearest neighbor classification (PkNC) protocol.
  • We formally prove the security of the proposed protocols and demonstrate that they are practical through PkNC experiments with real datasets. In particular, the PkNC is suitable for big data analysis with large-scale datasets and a large k of nearest neighbors, since it is executed for each data item in parallel.
The remainder of this paper is organized as follows. We briefly review existing works in Section 2 and explain preliminary concepts necessary for understanding our work such as system model and adversary model, performance evaluation measures, and functionalities for our proposed protocols in Section 3. In Section 4, we present the proposed secure comparison and SkLE S /SkSE S protocols along with formal proofs. Then, we explain the implementation of PkNC using the proposed protocols and demonstrate their efficiency by analyzing experimental results in Section 5. Lastly, we conclude this work in Section 6.

2. Related Works

Privacy-preserving data analysis was first proposed by Lindell and Pinkas in 2000. In their protocol, two parties, each with its own confidential dataset, wish to extract valuable information from the union of their datasets without disclosing information to the other party. Since then, many researchers have become interested in privacy-preserving data analysis, especially PkNC, and have proposed many related protocols.
The authors of [17] proposed a privacy-preserving k-nearest neighbor (PPkNN) protocol using the Paillier cryptosystem, which has an additively homomorphic encryption property. The PPkNN guarantees privacy for both the dataset and the input query, including the PPkNN results, and hides data access patterns. Once data owners outsource their datasets and a querier sends its query (as with PkNC), the cloud servers (i.e., DH and CSP) do not need to communicate with the data owners or the querier. However, the PPkNN returns the k data closest to an input query rather than their majority class. The work of [11] improved on the PPkNN in [17] by proposing a PPkNN classification (PkNC) that returns the majority class of the k data closest to an input query as the result, and formally proved its security. However, the comparison protocol in that PkNC discloses information about whether two input data are equal, which we explain in Section 4.1. We also demonstrate that our PkNC is more efficient than the PkNC in [11] in Section 5.2.
The work of [18] proposed a PkNC in an environment with multiple keys and multiple clouds. Similar to the existing works [11,17], this PkNC guarantees privacy of the datasets and the input query, along with the result, and hides data access patterns. In this PkNC, after data owners upload encrypted data to their respective cloud servers, they can still download and decrypt the encrypted data, since they encrypt the data with their own keys. In order to run the PkNC, the cloud servers first convert the data encrypted under individual keys into data encrypted under the same key by proxy re-encryption, but in doing so, the PkNC exposes the class information of the k data closest to a query.
The authors of [19] proposed a more efficient PkNC than the scheme in [17] using the Paillier and ElGamal cryptosystems. Similar to existing works, this PkNC returns the majority class of the k data closest to an input query as the result and provides privacy of the dataset, the input query, the result, and data access patterns. However, the PkNC exposes to a querier the classes of the k data closest to an input query rather than only their majority class. The authors in [20] proposed a very efficient PkNC for a large dataset, and this PkNC provides dataset security, key confidentiality, and query privacy, as well as hiding data access patterns. However, the PkNC does not provide semantic security for an outsourced dataset.
The authors of [21] proposed a very efficient PPkNN for a large dataset using an improved secure protocol for top-k selection and formally proved its simulation-based security. However, in order to improve efficiency, the top-k selection protocol returns an approximate result. In other words, it clusters a dataset using the k-means algorithm and then, given a query, selects several clusters that are closest to the query and computes the closest k data within those clusters. A PPkNN that outputs an approximate result, though, is unsuitable for applications that require an accurate classification result, such as medical diagnoses. This PPkNN also returns the k data closest to a query rather than their majority class.
The PkNC in [16] provides not only privacy but also reliability of the collected data. By using blockchain, it ensures that the dataset collected from data owners is trustworthy. However, there are efficiency concerns with this protocol, as this PkNC requires almost one hour to process only 760 data. In practice, our PkNC is much more efficient, as we explain in detail later in this paper. Another PkNC, presented in [15], is suitable for high-dimensional datasets. While most existing PkNCs deal only with integers, this PkNC allows the dataset and the input query to consist of real numbers. Similar to our PkNC, its running time is independent of the parameter k of nearest neighbors, but in contrast with our proposal, this PkNC requires huge memory to handle large volumes of data. The authors conducted an experiment with only 60 data on a machine with 8 GB RAM. This suggests that it requires large-memory hardware, which is unsuitable for data analysis in the era of big data. Finally, its running time is also inefficient compared with our PkNC.
The authors of [22] proposed an efficient and privacy-preserving medical pre-diagnosis scheme based on multi-label k-nearest neighbors. Since a medical user can have multiple diseases at the same time, the scheme is practical. For the sake of efficiency, the scheme selects the part of the dataset related to a medical user using k-means clustering and then performs the diagnosis on that specific subset. Consequently, the scheme exposes data access patterns, and the cloud parties that run the scheme learn information about the dataset or the input query. The work of [23] proposed a PPkNN for eHealthcare data that combines a kd-tree structure with homomorphic encryption. However, the PPkNN returns the k data closest to a query as the result rather than their majority class. Moreover, a user must be authorized by the data owners before sending an input query, and therefore the scheme is impractical. The authors of [12] proposed a PkNC using a kd-tree technique and order-preserving encryption, which protects data privacy as well as data access patterns. However, since the scheme also assumes that data owners and users are honest, it is impractical and its application is limited.
PPkNN is also used for location-based services. The scheme of [13], which utilizes the Moore curve [24], protects the privacy of input data such as location information and ensures the accuracy of query results. The authors of [25] proposed a verifiable PPkNN that uses a network Voronoi diagram [26]. It not only ensures the confidentiality of input data but also verifies the integrity of results. The mechanism in [27] protects the location privacy of the Internet of Connected Vehicles using Intent-Based Networking. Using the machine learning ability of the network, it predicts the intent of location accesses and penalizes malicious accesses. The authors of [28] proposed a privacy-preserving data sharing scheme for the edge computing service of IoT, which provides data services for IoT devices. The privacy-preserving scheme, based on an attribute encryption scheme, realizes anonymous data sharing and access control. The authors of [29] proposed an online privacy-preserving on-chain certificate status service based on the blockchain architecture, which ensures decentralized trust and provides privacy protection. In other words, the efficient privacy-preserving certificate status check protocol solves the problems of limited block size, high latency, and privacy leakage in comparison to existing works based on blockchain technology. The work of [30] suggested a feature weighting algorithm to select informative features from redundant data. The feature weight is measured by the margin between a sample and its hyperplane, which is more robust to noise and outliers than existing works.

3. Preliminaries

In this section, we introduce our system model, security definitions, and Paillier cryptosystem as an additively homomorphic encryption scheme. We also explain how to evaluate the performance of a protocol and briefly introduce the functionalities used in our protocols.

3.1. System Model

Our proposed protocols are executed in dual non-colluding cloud servers (Figure 1): data host (DH) and cryptographic service provider (CSP). CSP generates a public key for encryption and a secret key for decryption, then sends the public key to DH. DH, which already has encrypted input data, runs a protocol with CSP. After completing a protocol, DH returns a result in an encrypted form.
In the dual non-colluding cloud server model, neither DH nor CSP discloses any information about the input data or results. Specifically, since DH runs a protocol on data in encrypted form, it cannot obtain any information about the input data or results. Even though CSP decrypts the encrypted intermediate results that it receives from DH, it cannot obtain any information about the input data or results, since the decrypted data are blinded by random values. Therefore, as long as DH does not collude with CSP, our protocols ensure that no information about the input data or computation results is revealed. The dual non-colluding cloud server model is realistic and reasonable, since large IT companies such as Amazon and Google provide cloud computing services and prioritize reputation over gains from collusion.

3.2. Adversary Model and Security Definitions

Semi-Honest Adversary Model: In this paper, we assume that DH and CSP operate within a semi-honest adversary model, in which a compromised party follows the protocol specification but tries to obtain information about the input data and results by analyzing intermediate results. For example, in the comparison protocols of existing works [10,11], CSP obtains information about whether two input data are equal by decrypting and analyzing intermediate results received from DH. In SkLE_E/SkSE_E [10], information about the input dataset is also exposed by its end point. Designing a protocol in the semi-honest adversary model is meaningful as a first step toward designing a protocol with stronger security.
Security Definition: In order to formally prove the security of our proposed protocols, we use the security definition of a semi-honest adversary model in terms of two-party computation [31]. Loosely speaking, we demonstrate that a simulator can generate the view of a corrupted party in real protocol execution when given only the input and output [32]. The view of a corrupted party consists of inputs, internal coin tosses, and received messages. If a simulator can generate indistinguishable values from the view of a corrupted party in real execution, then the definition states that the protocol is secure. The definition is as follows [31].
Let f : {0,1}* × {0,1}* → {0,1}* × {0,1}* be a functionality, and let f_DH(x, y) (resp., f_CSP(x, y)) denote DH's (resp., CSP's) element of f(x, y). Let π be a two-party protocol for computing f. DH's (resp., CSP's) view during an execution of π on (x, y), denoted VIEW^π_DH(x, y) (resp., VIEW^π_CSP(x, y)), is (x, r, m_1, …, m_t) (resp., (y, r, m_1, …, m_t)), where r represents the outcome of DH's (resp., CSP's) internal coin tosses, and m_i represents the i-th message it has received. DH's (resp., CSP's) output after an execution of π on (x, y), denoted OUTPUT^π_DH(x, y) (resp., OUTPUT^π_CSP(x, y)), is implicit in the party's own view of the execution, and OUTPUT^π(x, y) = (OUTPUT^π_DH(x, y), OUTPUT^π_CSP(x, y)).
Definition 1
(Privacy with respect to semi-honest behavior, general case). We say that π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S_DH and S_CSP, such that
{(S_DH(x, f_DH(x, y)), f(x, y))}_{x,y} ≡_c {(VIEW^π_DH(x, y), OUTPUT^π(x, y))}_{x,y}
{(S_CSP(y, f_CSP(x, y)), f(x, y))}_{x,y} ≡_c {(VIEW^π_CSP(x, y), OUTPUT^π(x, y))}_{x,y}
Here, ≡_c means that the two distributions are computationally indistinguishable. Since the functionalities of our proposed protocols are probabilistic, we use the above general-case security definition, with which we prove the security of our secure comparison protocol in Section 4.2 and that of SkLE_S in Section 4.4.
Sequential Modular Composition Theorem: The sequential modular composition theorem [33] is a tool used to analyze the security of a protocol in a modular way [32]. Assume that π_f is a protocol that computes a functionality f and that it calls a subprotocol π_g to compute a functionality g. The theorem states that, in order to analyze the security of π_f, it suffices to consider executing π_f in a hybrid model where a third party computes the functionality g ideally, instead of a party executing the real subprotocol π_g [32]. Therefore, to analyze the security of a protocol in a modular way, one first proves the security of π_g and then proves the security of π_f in a model that allows a party to compute the functionality g ideally [32]. The model used to analyze π_f, in which an ideal functionality g is called instead of π_g, is denoted the g-hybrid model. We prove the security of our secure comparison protocol in the F_SZP-hybrid model in Section 4.2 and that of SkLE_S in the (F_SM, F_SBD, F_SCI)-hybrid model in Section 4.4.

3.3. Paillier Cryptosystem

As a partially homomorphic encryption scheme, we use the Paillier cryptosystem [34] in this paper. The Paillier cryptosystem is a probabilistic asymmetric encryption scheme with semantic security, which means that an adversary cannot learn any information about the original data when given its encrypted form. Let E_pk(·) = E(·) be the encryption function with a public key pk, and let D_sk(·) = D(·) be the decryption function with a secret key sk, where we drop pk and sk for succinctness in this paper. The Paillier cryptosystem also has an additively homomorphic property, which allows the addition of original data to be computed locally in encrypted form. In other words, given any two data a, b ∈ Z_N, the following equations [10] are satisfied.
D(E(a) * E(b) mod N²) = a + b mod N
D(E(a)^b mod N²) = a · b mod N
For succinctness, we drop the mod N² and mod N terms in the remainder of this paper. We stress that alternative additively homomorphic schemes can also be applied to our proposed protocols in lieu of the Paillier cryptosystem.
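To make the two homomorphic identities above concrete, the following minimal Python sketch builds a toy Paillier instance with tiny hard-coded primes and checks both equations. The parameters and helper names are our own illustration; real deployments use keys of 2048 bits or more, and any additively homomorphic scheme could be substituted.

import math
import secrets

# Toy Paillier parameters for illustration only; real keys use large random primes.
p, q = 101, 113
N = p * q
N2 = N * N
lam = math.lcm(p - 1, q - 1)       # lambda = lcm(p-1, q-1)
mu = pow(lam, -1, N)               # valid because we take g = N + 1

def E(m):
    """Encrypt m in Z_N with fresh randomness r."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return (pow(N + 1, m % N, N2) * pow(r, N, N2)) % N2

def D(c):
    """Decrypt a ciphertext c."""
    return ((pow(c, lam, N2) - 1) // N) * mu % N

a, b = 7, 5
assert D(E(a) * E(b) % N2) == (a + b) % N   # homomorphic addition
assert D(pow(E(a), b, N2)) == (a * b) % N   # multiplication by a plaintext constant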

3.4. Performance Evaluation

We analyze the performance of a protocol in terms of computational costs (i.e., the number of encryptions/decryptions and exponentiations, where we assume that encryption and decryption take the same amount of time) and communication costs (i.e., the amount of communication and the number of communication rounds) [10]. Since operations other than encryption/decryption and exponentiation, such as homomorphic addition, have little influence on efficiency, we do not include them in the computational costs. The amount of communication means the total amount of data transmitted to complete a protocol, which we denote as a multiple of C, the size of a ciphertext. The number of communication rounds counts communications executed in parallel as a single round [10].

3.5. Notation

For data x with 0 ≤ x < 2^l, we let x_B = x_{l−1}, …, x_1, x_0 be the binary representation of the data x, where x_0 (resp., x_{l−1}) is the least significant bit, denoted LSB (resp., the most significant bit, denoted MSB), and x = Σ_{j=0}^{l−1} x_j · 2^j for x_j ∈ {0, 1} [10]. Similarly, for a ciphertext E(x) with 0 ≤ x < 2^l, we let E(x)_B = E(x_{l−1}), …, E(x_1), E(x_0) be the ciphertexts of the individual bits of the corresponding data x, where x = Σ_{j=0}^{l−1} x_j · 2^j for x_j ∈ {0, 1} [10]. Let x̄ be the 1's complement of data x, which is computed by toggling all bits of the data. For example, the 1's complement of the binary number 1010 is 0101. Similarly, for a bit x_i, we let the complement of x_i be x̄_i, which is computed as x̄_i = 1 − x_i for x_i ∈ {0, 1} [10].
[n] for n ≥ 1 denotes the set {1, 2, …, n}. For a set I = {i_1, i_2, …, i_n}, {d_i}_{i∈I} means {d_{i_1}, d_{i_2}, …, d_{i_n}}; {d_i}_{i∈[l]} can be called a vector d. For a set S, r ∈_R S means that a value r is chosen from the set S uniformly at random. DH → CSP: E(x) means that DH sends CSP a ciphertext E(x). a · b denotes multiplication of integers, and E(a) * E(b) denotes the homomorphic addition mentioned in Section 3.3. Throughout this paper, we let n be the number of data, C the size of a Paillier ciphertext, and l the upper-bound number of bits required to represent data, which is less than or equal to the modulus size |N| of the Paillier cryptosystem (i.e., l ≤ |N|) [10].
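As a quick plain-integer illustration of this notation (the variable names below are ours), the bits x_j and the 1's complement x̄ of a 4-bit value can be computed as follows.

l = 4
x = 10                                            # x_B = 1010
x_bits = [(x >> j) & 1 for j in range(l)]         # [x_0, x_1, x_2, x_3] = [0, 1, 0, 1]
x_comp = (2**l - 1) - x                           # 1's complement: toggle every bit -> 0101 = 5
assert all(((x_comp >> j) & 1) == 1 - x_bits[j] for j in range(l))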

3.6. Referenced Functionalities

In our protocols, calling a subprotocol to compute a functionality is presented as DH and CSP running an interactive protocol with a third party that computes the functionality ideally. Our proposed protocols call multiplication and bit decomposition protocols, for which we introduce the secure multiplication functionality F_SM and the secure bit decomposition functionality F_SBD in this subsection. The existing works [17,35] proposed real protocols that privately compute F_SM and F_SBD in the dual non-colluding cloud server model mentioned in Section 3.1 and formally proved their security under the semi-honest adversary model.
Secure Multiplication functionality F_SM: F_SM receives {E(a), E(b)} from DH and a secret key SK from CSP, and then it sends E(c) to DH, where c = a · b. We define F_SM as follows.
F_SM({E(a), E(b)}, SK) → (E(c), ⊥)
The real protocol that privately computes the functionality F_SM was proposed in [17]. It requires 6 encryptions/decryptions and 2 exponentiations, and 3 · C bits are transmitted in 1 round.
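For intuition, the sketch below shows one common blinding-based way such a secure multiplication can be realized in this two-server setting, repeating the toy Paillier helpers from the sketch in Section 3.3 so that it runs on its own. The function and variable names are ours, and the exact protocol of [17] may differ in its details.

import math
import secrets

# Toy Paillier setup (same illustrative parameters as the sketch in Section 3.3).
p, q = 101, 113
N, N2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, N)

def E(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return (pow(N + 1, m % N, N2) * pow(r, N, N2)) % N2

def D(c):
    return ((pow(c, lam, N2) - 1) // N) * mu % N

def secure_multiplication(Ea, Eb):
    """Blinding-based realization of F_SM: DH obtains E(a*b); CSP sees only blinded values."""
    # DH: additively blind both ciphertexts and send them to CSP.
    ra, rb = secrets.randbelow(N), secrets.randbelow(N)
    a_blind = Ea * E(ra) % N2                        # E(a + ra)
    b_blind = Eb * E(rb) % N2                        # E(b + rb)
    # CSP: decrypt the blinded values, multiply in the clear, re-encrypt.
    h = E(D(a_blind) * D(b_blind) % N)               # E((a + ra) * (b + rb))
    # DH: homomorphically remove the blinding terms a*rb, b*ra, and ra*rb.
    s = h * pow(Ea, N - rb, N2) % N2
    s = s * pow(Eb, N - ra, N2) % N2
    return s * pow(E(ra * rb % N), N - 1, N2) % N2   # = E(a * b)

a, b = 9, 4
assert D(secure_multiplication(E(a), E(b))) == (a * b) % N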
Secure Bit Decomposition functionality F_SBD: F_SBD receives E(s) from DH and a secret key SK from CSP, and then it sends S to DH, where S = {E(s)_B, E(s̄)_B}. Recall that E(s)_B = E(s_{l−1}), …, E(s_1), E(s_0) and s̄ is the 1's complement of the data s. We define F_SBD as follows.
F_SBD(E(s), SK) → (S, ⊥)
The real protocol that privately computes the functionality F_SBD is implemented by adding E(s̄_i) = E(1) * E(s_i)^(N−1) to the secure bit decomposition protocol proposed in [35]. We omit the detailed algorithm in this paper, as it is straightforward. This protocol requires (3l + 1) encryptions/decryptions and (4l + 2) exponentiations, and (2l + 2) · C bits are transmitted in (l + 1) rounds.

4. Proposed Secure Comparison and SkLE/SkSE Protocols

As mentioned earlier, if the existing SkLE_E [10] finds the k largest elements before the last (l-th) round, it terminates early. In other words, the existing SkLE_E [10] exposes information about the input dataset because the end point varies according to the input data. In this section, we solve this information disclosure problem with our proposed SkLE_S/SkSE_S, whose end point is always the same regardless of the input dataset, so that SkLE_S/SkSE_S does not expose any information about the input dataset. In order to construct SkLE_S/SkSE_S, we first propose a secure comparison and inequality (SCI) protocol that does not disclose any information.

4.1. Secure Comparison and Inequality (SCI) Protocol

In this section, we propose an SCI protocol to compare two input data privately. The proposed SCI protocol solves the information disclosure problem that occurs in existing comparison protocols (SMIN of [11] and SCP of [10]). At a high level, in those protocols, when two input data are unequal (i.e., one input is larger or smaller than the other), DH sends CSP a vector that consists of random values including 0 or 1; when two input data are equal, however, DH sends CSP a vector that consists of only random values, and thus CSP learns whether the two input data are equal or not. Our proposed SCI protocol does not disclose any information about the two input data, since DH sends CSP a vector that consists of either random values including 0 or only random values according to a random coin, both when the two input data are equal and when they are unequal.
The secure comparison and inequality functionality F_SCI receives {S, k} from DH and a secret key SK from CSP, where S = {E(s)_B, E(s̄)_B} and k = {k_B, k̄_B}. Recall that E(s)_B = E(s_{l−1}), …, E(s_1), E(s_0) and s̄ is the 1's complement of the data s, as mentioned in Section 3.5. Then, F_SCI sends {E(M), E(D)} to DH, where E(M) = E(1) if s < k and E(M) = E(0) otherwise, and E(D) = E(1) if s ≠ k and E(D) = E(0) otherwise. We define F_SCI as follows.
F_SCI({S, k}, SK) → ({E(M), E(D)}, ⊥)
We present a real protocol that privately computes the functionality F_SCI in Algorithm 1 and provide an example in Table 2 for easier understanding. Our SCI returns not only a comparison result (E(M)) but also an inequality result (E(D)), and one of the two inputs (k) is in plaintext form. However, by modifying the SCI protocol slightly, it is possible to construct a common secure comparison protocol that returns only the comparison result without the inequality result and takes both inputs in encrypted form. We omit the detailed algorithm because it is beyond the scope of this paper.
Algorithm 1: Secure Comparison and Inequality (SCI)
Intuitively, DH selects a functionality F (F: s < k or F: s ≥ k) by a random coin α and computes the functionality on the two input data. CSP converts the computation result and returns the converted value (β) back to DH. Then, DH outputs the result based on the converted value according to the random coin α (i.e., the functionality F) it selected. As the idea behind the SCI protocol, we modified the existing comparison protocols [10,11] so that the intermediate result of DH is a vector that consists of either random values including 0 or only random values, according to a random coin. We mentioned earlier that the intermediate result of DH in the existing comparison protocols [10,11] is either a vector that consists of random values including 0 or 1 if the two input data are unequal, or a vector that consists of only random values if the two input data are equal. Therefore, when the two input data are unequal, we modify the 1 in the vector to be a random value. When the two input data are equal, we modify one of the random values in the vector to 0 according to a random coin.
In order to avoid the scenario in which the intermediate result of DH becomes a vector that consists of only random values regardless of the random coin α when the two input data are equal, we use the functionality F_SZP, which privately computes whether all input data are 0 or not. The functionality F_SZP receives {E(x_i)}_{i∈[l]} from DH and a secret key SK from CSP, and then it sends E(γ) to DH, where E(γ) = E(1) if all x_i = 0 and E(γ) = E(0) otherwise. We define F_SZP as follows.
F_SZP({E(x_i)}_{i∈[l]}, SK) → (E(γ), ⊥)
We present a real protocol that privately computes the functionality F_SZP in Algorithm A1 of Appendix A. The protocol requires l encryptions/decryptions and (l + 1) exponentiations, and (l + 1) · C bits are transmitted in 1 round.
For easier understanding of Algorithm 1, we intuitively explain the data without encryption. DH selects the functionality F by tossing a random coin α (line 2), where F: s < k if α = 0 and F: s ≥ k otherwise. When DH selects the functionality F: s < k (resp., F: s ≥ k), w_j is random if (s_j, k_j) = (0, 1) (resp., (s_j, k_j) = (1, 0)); otherwise, w_j = 0 (lines 4–9). x_j = 1 if s_j ≠ k_j; otherwise, x_j = 0 (line 10). y_j = 1 at the first bit with s_j ≠ k_j counted from the (l−1)-th bit, and the other y_j are either 0 or a random value (line 11). Let the first bit with s_j ≠ k_j from the (l−1)-th bit be the t-th bit. Then y_t = 1, y_j = 0 for j = t+1, …, l−1, and y_j is a random value for j = 0, …, t−1.
If s is equal to k, then γ = 1; otherwise, γ = 0 (line 14), since x_j = 0 if s_j = k_j and x_j = 1 otherwise in line 10. If DH selects F: s < k (α = 0), it adds γ to y_0 so that it can send CSP a vector that consists of random values including 0 when s = k, and CSP returns β = 0 back to DH (lines 14–18). Note that, even though γ is added to the fixed 0-th position of y, CSP cannot learn the position, since the information about the position is removed by a permutation σ (line 23). If DH selects F: s ≥ k (α = 1), it does not add γ, so that it can send CSP a vector that consists of only random values, and CSP returns β = 1 back to DH.
Let the position with y_j = 1 be the t-th bit. When the selected functionality F is different from the relation of the two input data (i.e., either DH selects F: s < k and the relation of the two inputs is s > k, or DH selects F: s ≥ k and the relation is s < k), u_t = 0 and the other u_j are random (lines 19–22). Conversely, when the selected functionality F is the same as the relation of the two input data (i.e., either DH selects F: s < k and the relation is s < k, or DH selects F: s ≥ k and the relation is s > k), all u_j are random because w_j is added in line 21. In addition, when the two input data are equal (s = k) and the selected functionality is F: s < k (i.e., the selected functionality F is different from the relation of the two input data), u is a vector that consists of random values including u_0 = 0, since DH adds γ = 1 to y_0 in lines 16–18. If the selected functionality is F: s ≥ k (i.e., the selected functionality F is the same as the relation of the two input data), u is a vector that consists of only random values. After a permutation σ of {u_j}_{j∈{0,…,l−1}}, DH sends CSP the permuted vector v (lines 23–24).
CSP returns β = 0 (resp., β = 1) back to DH if it receives a vector that consists of random values including 0 (resp., a vector that consists of only random values). Specifically, CSP decrypts v_j and obtains v′_j (line 26). If there is an element equal to 0 in the decrypted vector, CSP sends β = 0 to DH; if the vector consists of only random values, CSP sends β = 1 to DH (lines 27–32). Even though the position of the element equal to 0 in the vector u (line 21) carries information about the first position with s_j ≠ k_j, or about the 0-th bit when s = k, CSP cannot learn any of it, since this information is removed by the permutation in line 23. DH privately computes the result M based on β and according to the α it selected. Specifically, DH sets the result M to β (M ← β) if α = 0 and to the complement of β (M ← 1 − β) if α = 1 (lines 34–38). In conclusion, the result M satisfies the condition M = (s < k) = (α ⊕ β) for the input data s and k. DH also privately computes the inequality result D for the input data s and k (line 39).
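The following plaintext sketch mirrors the behavior described above without any encryption: it models only which party sees what (DH's coin α, the vector sent to CSP, CSP's reply β) and how DH derives M = (s < k) and D = (s ≠ k). It is a behavioral illustration under our own naming, not the bit-level computation of Algorithm 1.

import secrets

def sci_plaintext_sketch(s: int, k: int, l: int):
    """Unencrypted behavioral sketch of the SCI protocol (illustration only)."""
    # DH: toss a random coin and select the functionality F.
    alpha = secrets.randbelow(2)                     # 0 -> F: s < k, 1 -> F: s >= k
    F_holds = (s < k) if alpha == 0 else (s >= k)
    # DH: build the vector sent to CSP. It contains a 0 exactly when the
    # selected functionality F does not hold for (s, k); otherwise it is
    # all random values. A random position hides where the 0 sits.
    v = [secrets.randbelow(2**30) + 1 for _ in range(l)]
    if not F_holds:
        v[secrets.randbelow(l)] = 0
    # CSP: sees only random-looking values; reports whether a 0 is present.
    beta = 0 if 0 in v else 1
    # DH: combine its coin with CSP's answer; neither party alone learns M.
    M = beta if alpha == 0 else 1 - beta             # M = alpha XOR beta = (s < k)
    D = 0 if s == k else 1                           # inequality result (via F_SZP in Algorithm 1)
    return M, D

# Example: compare s = 5 and k = 9 over l = 4 bits; M = 1 and D = 1 regardless of the coin.
assert sci_plaintext_sketch(5, 9, 4) == (1, 1)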
Computation and communication costs: The SCI protocol requires 2l encryptions/decryptions and (4l + 3) exponentiations, and 2(l + 1) · C bits are transmitted in two rounds. Specifically, SZP requires l encryptions/decryptions in line 14, and CSP decrypts v_j for j = 0, …, l−1 in line 26. Exponentiations are executed 3l times in lines 9, 11, and 21, and two times in lines 37 and 39. The exponentiations in lines 5 and 7 are excluded, since k_j is a public value in {0, 1}. Therefore, the total number of exponentiations of SCI is (4l + 3), including (l + 1) in line 14. The amount of communication is 2(l + 1) · C bits, including (l + 1) · C bits in line 14. There are two communication rounds, including one in line 14.

4.2. Proof of SCI Protocol

In this section, we denote the SCI protocol (Algorithm 1) by π_SCI and prove that π_SCI privately computes F_SCI in the F_SZP-hybrid model. In other words, we demonstrate that π_SCI privately computes F_SCI given access to the functionality F_SZP. Intuitively, π_SCI does not disclose any information about the comparison result, including the input data, to DH or CSP. The output M of π_SCI satisfies the condition M = (s < k) = (α ⊕ β), where DH knows the random coin α but not β, since it receives β from CSP in encrypted form. Therefore, DH cannot learn any information about the computation result. Conversely, CSP knows β but not α, and since the functionality F selected by α is chosen uniformly at random by DH, CSP cannot learn any information about the computation result either. In terms of the proof, the DH simulator can generate the view of DH, since DH only sees a random coin and data encrypted with a semantically secure encryption scheme. The CSP simulator can also generate the view of CSP, which sees a vector that consists of either random values including 0 or only random values according to the random coin α, which is selected uniformly at random.
Theorem 1.
π_SCI privately computes F_SCI in the F_SZP-hybrid model in the presence of a semi-honest adversary.
Proof of Theorem 1.
We demonstrate that the joint distribution of the view and the output of the real protocol π_SCI is computationally indistinguishable from that of the outputs of the simulators and the functionality F_SCI. Specifically, we demonstrate that (1) the view of DH is computationally indistinguishable from the output of the DH simulator, (2) the view of CSP is computationally indistinguishable from the output of the CSP simulator, and (3) the output of the real execution π_SCI is computationally indistinguishable from that of the functionality F_SCI.
(1)
The view of DH in the real protocol π_SCI is as follows.
VIEW^SCI_DH({S, k}, SK) = {{S, k}, α, E(γ), E(β)}
In π_SCI, α is a random coin that DH tosses (line 2), E(γ) is a ciphertext returned from F_SZP (line 14), and E(β) is a ciphertext received from CSP (line 32). Intuitively, since E(γ) and E(β) are in encrypted form, the DH simulator can generate the corresponding parts of the view as random values.
DH simulator S_DH
Input: The simulator S_DH receives the input {S, k} and the output {E(M), E(D)} of DH.
  • Simulation
    - The simulator chooses values r_1, r_2, and r_3 uniformly at random, where r_1 ∈_R {0, 1} and r_2, r_3 ∈_R Z_{N²}.
    - The simulator defines {{S, k}, r_1, r_2, r_3} as the view of DH.
    - The simulator outputs the view of DH and halts.
The random coin α ∈_R {0, 1} is indistinguishable from the random value r_1 ∈_R {0, 1}. Since the Paillier cryptosystem is semantically secure and a ciphertext is less than N², E(γ) and E(β) are computationally indistinguishable from r_2 and r_3. Therefore, the view of DH and the output of S_DH are computationally indistinguishable.
(2)
The view of CSP in the real protocol π_SCI is as follows.
VIEW^SCI_CSP({S, k}, SK) = {SK, {v_j}_{j∈{0,…,l−1}}, {v′_j}_{j∈{0,…,l−1}}}
In π_SCI, {v_j}_{j∈{0,…,l−1}} is a vector that consists of ciphertexts received from DH (line 24), and {v′_j}_{j∈{0,…,l−1}} is obtained by decrypting {v_j}_{j∈{0,…,l−1}} (line 26). Intuitively, the CSP simulator can generate {v′_j}_{j∈{0,…,l−1}} as a vector that consists of either random values including 0 or only random values according to a random coin, and it can generate {v_j}_{j∈{0,…,l−1}} by encrypting {v′_j}_{j∈{0,…,l−1}}.
CSP simulator S_CSP
Input: The simulator S_CSP receives a secret key SK as CSP's input.
  • Simulation
    - The simulator tosses a random coin c ∈ {0, 1}.
    - If the random coin c is 0, the simulator sets r^4_0 to 0 and chooses {r^4_j}_{j∈{1,…,l−1}} uniformly at random, where r^4_j ∈_R Z_N.
    - If the random coin c is 1, the simulator chooses {r^4_j}_{j∈{0,…,l−1}} uniformly at random, where r^4_j ∈_R Z_N.
    - The simulator permutes {r^4_j}_{j∈{0,…,l−1}} uniformly at random and sets the result to {r^5_j}_{j∈{0,…,l−1}}.
    - The simulator computes {E(r^5_j)}_{j∈{0,…,l−1}}.
    - The simulator defines {SK, {E(r^5_j)}_{j∈{0,…,l−1}}, {r^5_j}_{j∈{0,…,l−1}}} as the view of CSP.
    - The simulator outputs the view of CSP and halts.
As presented in Algorithm 1 (π_SCI), {v′_j}_{j∈{0,…,l−1}} is a vector that consists of either random values including 0 or only random values according to the random coin α. Therefore, {v′_j}_{j∈{0,…,l−1}} is indistinguishable from {r^5_j}_{j∈{0,…,l−1}}, and {v_j}_{j∈{0,…,l−1}}, which encrypts {v′_j}_{j∈{0,…,l−1}}, is indistinguishable from {E(r^5_j)}_{j∈{0,…,l−1}}.
(3)
As explained earlier, the result of π_SCI satisfies the condition M = (α ⊕ β) = (s < k). As shown in Table 2, when s < k (M = 1) and DH chooses the random coin α = 0 (resp., α = 1), CSP returns β = 1 (resp., β = 0). On the contrary, when s ≥ k (M = 0) and DH chooses the random coin α = 0 (resp., α = 1), CSP returns β = 0 (resp., β = 1). Therefore, the result E(M) of π_SCI is the same as that of F_SCI. As for E(D), x_j = 0 if s_j = k_j and x_j = 1 otherwise (line 10). If s is equal to k (i.e., all x_j are 0), F_SZP returns γ = 1 (line 14) and D = 0 (line 39); otherwise, if s is unequal to k (i.e., there is an element equal to 1 in the vector x), F_SZP returns γ = 0 (line 14) and D = 1 (line 39). In other words, the result E(D) of π_SCI is the same as that of F_SCI. Therefore, the output of π_SCI is computationally indistinguishable from that of F_SCI. □

4.3. Secure Version of SkLE/SkSE (SkLE_S/SkSE_S)

Secure version of SkLE (SkLE_S): In this subsection, we propose SkLE_S to privately compute the k largest elements in an array (i.e., the k largest data in a dataset) without disclosing any information. The merit of SkLE_S is that it is very efficient, since it is executed for each element in parallel. In order to compute the k largest elements privately, the communication rounds of existing protocols [11,17] are proportional to the number of elements and the parameter k of nearest neighbors, since they serially repeat a maximum protocol k times, where the maximum protocol serially compares all elements. However, since our SkLE_S is executed for each element in parallel and computes the k largest elements in a single execution, its communication rounds are independent of the number of elements and the parameter k. Therefore, it is suitable both for big data analysis handling a large volume of data (elements) and for PkNC applications with a large k of nearest neighbors. Since, for best performance, SkLE_S needs to simultaneously execute as many operations as there are elements, its performance is greatly improved in the cloud computing environment, which enables numerous parallel operations. In addition, our SkLE_S solves the information disclosure problem occurring in SkLE_E [10]. SkLE_E [10] runs at most l rounds, where l is the bit length of an element, with the end point varying according to the input array; this ultimately means that it discloses information about the input array. In contrast, SkLE_S consistently runs l rounds regardless of the input array, and therefore it does not disclose any information about the input array.
F_SkLE_S receives a set of encrypted elements {E(e_i)_B}_{i∈[n]} from DH and a secret key SK from CSP, and then it sends {E(K_i)}_{i∈[n]} to DH, where K_i = 1 if the element e_i is included in the set of k largest elements and K_i = 0 otherwise. We define F_SkLE_S as follows.
F_SkLE_S({E(e_i)_B}_{i∈[n]}, SK) → ({E(K_i)}_{i∈[n]}, ⊥)
Each element e_i in the input dataset has auxiliary data consisting of {K_i, P_i, C_i} ∈ {0, 1}³, and SkLE_S privately computes the set of k largest elements by computing this auxiliary data in each round. K_i is the output of SkLE_S, which indicates whether or not the corresponding element e_i is included in the set of k largest elements. P_i indicates whether or not the corresponding element e_i is a predicted k-largest element in the corresponding round. SkLE_S finds the k largest elements and the predicted k-largest elements within the set of candidate elements, where C_i indicates whether or not the corresponding element e_i is included in the set of candidate elements. Once an element e_i is included in the set of k largest elements, this is irreversible (i.e., K_i can change from 0 to 1 but not from 1 to 0). Similarly, once an element e_i is excluded from the set of candidate elements, this is irreversible (i.e., C_i can change from 1 to 0 but not from 0 to 1).
In each round, SkLE_S privately computes the auxiliary data for one bit of all elements, from the (l−1)-th bit (MSB) to the 0-th bit (LSB), where l is the bit length of an element. The resultant k largest elements in the array are the elements e_i with K_i = 1, i.e., the elements included in the set of k largest elements. We present a real protocol that privately computes the functionality F_SkLE_S in Algorithm 2 and show an example in Table 3 for easier understanding. Note that DH locally performs all computations in Algorithm 2 except for the interactive protocols for the functionalities F_SM, F_SBD, and F_SCI. Recall that n is the number of elements and l is the upper-bound number of bits required to represent an element e_i.
The idea of SkLE_S is to remove all elements from the set of candidate elements after it finds the k largest elements, so that the result remains unchanged. Since the existing SkLE_E [10] terminates after it finds the k largest elements, it does not need to consider the computation of auxiliary data after that point. Since our SkLE_S does not terminate after finding the k largest elements but instead consistently executes l rounds, it needs a method to keep the result (the k largest elements) unchanged even though it performs the same computation as before finding them. For this reason, when SkLE_S finds the k largest elements, it removes all elements from the set of candidate elements (C_i ← 0), since the k largest elements are found within the set of candidate elements.
For easier understanding of Algorithm 2, we intuitively explain the data without encryption. First, DH initializes the auxiliary data K_i and C_i for all elements e_i so that there are no elements in the set of k largest elements (i.e., K_i ← 0) and all elements are included in the set of candidate elements (i.e., C_i ← 1) (lines 2–5). For one bit of each element e_i, DH and CSP privately compute the auxiliary data P_i, K_i, and C_i ∈ {0, 1} in each round (lines 6–24), which consists of the following four steps. We assume DH and CSP compute the auxiliary data for the j-th bit of all elements in the (l − j)-th round (j = l−1, …, 0).
Algorithm 2: Secure version of SkLE (SkLE S )
(Step 1: lines 7–10) Privately computing the predicted k-largest elements (P_i): A predicted k-largest element for the j-th bit is an element e_i in which bit 1 has appeared at least once from the (l−1)-th bit down to the j-th bit. In other words, a predicted k-largest element for the j-th bit is either a candidate element (C_i = 1) whose j-th bit is 1 (e_{i,j} = 1) or an element already in the set of k largest elements (K_i = 1) from a previous round, which means that a bit 1 appeared at least once among its (l−1)-th to (j+1)-th bits. Therefore, E(P_i) is computed as follows.
E(P_i) ← E(e_{i,j} · C_i + K_i)
(Step 2: line 11) Privately computing the number of predicted k-largest elements (s): Since the value of the auxiliary data for a predicted k-largest element is either 0 or 1 (i.e., P_i ∈ {0, 1}), the number s of all predicted k-largest elements is computed by adding up all of these values as follows.
E(s) ← E(Σ_{i=1}^{n} P_i) = E(P_1) * E(P_2) * ⋯ * E(P_n)
(Step 3: lines 12–13) Privately comparing the number s of predicted k-largest elements with the parameter k of nearest neighbors: For the comparison of s and k, DH and CSP run an interactive protocol with a party that ideally computes the functionality F_SCI mentioned in Section 4.1. In order to compute S (the input of F_SCI), DH and CSP also run an interactive protocol with a party that ideally computes the secure bit decomposition functionality F_SBD mentioned in Section 3.6. We do not present how to compute k = {k_B, k̄_B} in this paper, because k is a public parameter and can be computed easily.
(Step 4: lines 14–23) Privately computing the k largest elements (K_i) and candidate elements (C_i): As mentioned earlier, a predicted k-largest element is a candidate element whose bit is 1 in the corresponding round. Similarly, let an unpredicted k-largest element be a candidate element whose bit is 0 in the corresponding round. DH and CSP privately compute whether or not an element (e_i) is included in the set of k largest elements (K_i) and the set of candidate elements (C_i) according to the comparison results (M and D) between the number s of predicted k-largest elements and the parameter k. The idea for computing K_i and C_i is to include the predicted k-largest elements in the set of k largest elements if s < k and to exclude the unpredicted k-largest elements from the set of candidate elements if s > k. If s = k, all predicted k-largest elements are included in the set of k largest elements, and the other candidate elements (i.e., the unpredicted k-largest elements) are excluded from the set of candidate elements, in order to fix the set of k largest elements as the result, since the k largest elements are found within the set of candidate elements.
Table 4 shows the values of C_i and K_i for an element e_i in each case. (Case 1) When s < k (i.e., the number of predicted k-largest elements is less than the parameter k), a predicted k-largest element (i.e., an element e_i with e_{i,j} = 1 that is in the set of candidate elements) is included in the set of k largest elements (K_i ← 1) and is excluded from the set of candidate elements (C_i ← 0). (Case 2) When s > k, an unpredicted k-largest element (i.e., an element e_i with e_{i,j} = 0 that is in the set of candidate elements) is excluded from the set of candidate elements (C_i ← 0). (Case 3) When s = k, a predicted k-largest element is included in the set of k largest elements (K_i ← 1) and is excluded from the set of candidate elements (C_i ← 0). (Case 4) Then, the other elements in the set of candidate elements (i.e., the unpredicted k-largest elements) are excluded from the set (C_i ← 0). (Case 5) Since there is then no element in the set of candidate elements, the values of K_i and C_i for all elements (i.e., the set of k largest elements and the set of candidate elements) remain unchanged under the same computation of K_i and C_i. According to Table 4, E(K_i) and E(C_i) are computed as follows.
E(K_i) ← E(K_i + e_{i,j} · C_i · (1 − D + D · M))
E(C_i) ← E(C_i · (M + e_{i,j} · D · (1 − 2M)))
When SkLE_S cannot determine exactly k largest elements due to the presence of multiple elements with the same value, it returns as the result the elements remaining in the set of candidate elements in addition to the set of k largest elements (lines 25–27). For example, when SkLE_S finds the three largest elements (k = 3) in the array {1, 2, 3, 3, 4, 5}, it returns the four largest elements {3, 3, 4, 5} as the result.
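The following plaintext sketch traces the four steps above on unencrypted integers, using the update formulas for P_i, K_i, and C_i as written. In the real Algorithm 2, the same arithmetic is carried out over Paillier ciphertexts via F_SM, F_SBD, and F_SCI; the function name and the tie-handling line below are our own illustration.

def sk_le_plaintext_sketch(elements, k, l):
    """Unencrypted trace of SkLE_S: mark the k largest elements of `elements`."""
    n = len(elements)
    K = [0] * n                                   # 1 iff e_i is in the set of k largest elements
    C = [1] * n                                   # 1 iff e_i is still a candidate element
    for j in range(l - 1, -1, -1):                # always l rounds, from MSB to LSB
        bits = [(e >> j) & 1 for e in elements]   # e_{i,j}
        P = [bits[i] * C[i] + K[i] for i in range(n)]   # Step 1: predicted k-largest elements
        s = sum(P)                                # Step 2: number of predicted k-largest elements
        M = 1 if s < k else 0                     # Step 3: comparison result (F_SCI)
        D = 0 if s == k else 1                    #         inequality result (F_SCI)
        for i in range(n):                        # Step 4: update K_i and C_i as in Table 4
            K[i] = K[i] + bits[i] * C[i] * (1 - D + D * M)
            C[i] = C[i] * (M + bits[i] * D * (1 - 2 * M))
    # Lines 25-27: if ties remain, the leftover candidates are reported as well.
    return [1 if K[i] or C[i] else 0 for i in range(n)]

# k = 3 largest of {1, 2, 3, 3, 4, 5} with l = 3 bits: ties make the result {3, 3, 4, 5}.
assert sk_le_plaintext_sketch([1, 2, 3, 3, 4, 5], k=3, l=3) == [0, 0, 1, 1, 1, 1]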
Parallelism: In each round, the proposed SkLE S performs its computations either for each element independently in parallel or once in common for all elements. Therefore, the number of communication rounds, which determines the running time, is independent of the number of elements. The operations in lines 7–10 and lines 17–23 are computed for each element independently and in parallel. The operations in lines 12–16 are computed once in common regardless of the number of elements. Although as many homomorphic additions as there are elements are computed serially in line 11, we do not count them toward the computation and communication costs since homomorphic addition has little influence on efficiency, as mentioned in Section 3.4.
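As a rough illustration of this per-element parallelism (not the actual implementation evaluated in Section 5.2), the sketch below launches one task per element with std::async; secure_multiply() is a hypothetical stand-in for one interactive SM execution with CSP, shown here computing a per-element product such as P_i = e_{i,j} · C_i in plaintext.

```cpp
// Sketch of per-element parallelism in SkLE_S: the n independent SM-style operations
// of one round run concurrently, so the round count does not grow with n.
#include <future>
#include <iostream>
#include <vector>

long secure_multiply(long a, long b) {   // placeholder for the interactive SM protocol
    return a * b;                        // plaintext stand-in for illustration only
}

int main() {
    std::vector<long> bits = {1, 0, 1, 1, 0};   // e_{i,j} for the current round
    std::vector<long> C    = {1, 1, 1, 1, 1};   // candidate flags C_i
    std::vector<std::future<long>> jobs;
    for (size_t i = 0; i < bits.size(); ++i)    // one independent task per element
        jobs.push_back(std::async(std::launch::async, secure_multiply, bits[i], C[i]));
    for (size_t i = 0; i < jobs.size(); ++i)    // gather P_i = e_{i,j} * C_i
        std::cout << "P_" << i + 1 << " = " << jobs[i].get() << "\n";
}
```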
Computation and communication costs: SkLE S requires ( 24 n + 5 l′ + 7 ) · l encryptions/decryptions and ( 8 n + 8 l′ + 9 ) · l exponentiations, and ( 12 n + 4 l′ + 7 ) · l · C bits are transmitted in ( l′ + 8 ) · l rounds, where l represents the bit length of an element e i and l′ represents the bit length of the number of elements. Recall that SM requires six encryptions/decryptions and two exponentiations, and 3 · C bits are transmitted in one round, as mentioned in Section 3.6. Specifically, the SM in line 8 requires 6 n l encryptions/decryptions and 2 n l exponentiations, and 3 n l · C bits are transmitted, since the SMs for the n elements are repeated l times. However, the number of communication rounds is l since the SMs for the n elements are executed in parallel. Likewise, the SM operations in lines 18–22 are executed n l times, and the operations in lines 11–16 are executed l times.
Secure version of SkSE (SkSE S ): SkSE S privately computes the k smallest elements without disclosing information about the input array or the results. In short, SkSE S is constructed by feeding the 1's complement of the input array into SkLE S as follows.
$\mathrm{SkSE}\big(\{E(e_i)_B\}_{i \in [n]}\big) = \mathrm{SkLE}\big(\{E(\overline{e_i})_B\}_{i \in [n]}\big)$
In order to construct SkSE S , we followed a process similar to that used for SkLE S in this section and reached the above conclusion. For more details, please refer to [10].
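In plaintext terms, the reduction simply states that the k smallest elements of an array are the k largest elements of its l-bit 1's complement; in the protocol itself, the complement is taken bit by bit over the encrypted inputs. A minimal plaintext sketch of the reduction, in which select_k_largest() stands in for SkLE S :

```cpp
// Plaintext sketch of the SkSE_S reduction: the k smallest elements of e are the
// k largest elements of its l-bit 1's complement.
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

std::vector<int> select_k_largest(std::vector<int> v, int k) {  // stand-in for SkLE_S
    std::sort(v.begin(), v.end(), std::greater<int>());
    return std::vector<int>(v.begin(), v.begin() + k);
}

int main() {
    const int l = 8, mask = (1 << l) - 1;                 // l-bit elements
    std::vector<int> e = {73, 54, 45, 41, 38};
    std::vector<int> comp;
    for (int x : e) comp.push_back(~x & mask);            // 1's complement of each element

    for (int c : select_k_largest(comp, 3))               // SkSE(e) = SkLE(complement of e)
        std::cout << (~c & mask) << " ";                  // map back: prints 38 41 45
    std::cout << "\n";
}
```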

4.4. Proof of SkLE S Protocol

In this section, we denote SkLE S protocol (Algorithm 2) by π S k L E S and prove that π S k L E S privately computes F S k L E S in the ( F S M , F S B D , F S C I ) -hybrid model. Since SkSE S is constructed by SkLE S , we do not prove the security of SkSE S . We demonstrate that π S k L E S privately computes F S k L E S given access to the functionalities F S M , F S B D , and F S C I . Intuitively, π S k L E S does not disclose any information about an input array and results to DH and CSP since DH receives randomized ciphertexts from functionalities and CSP does not receive any data.
Theorem 2.
π S k L E S privately computes F S k L E S in the ( F S M , F S B D , F S C I ) -hybrid model in the presence of a semi-honest adversary.
Proof of Theorem 2.
We demonstrate that the joint distribution of the view and the output of the real protocol π S k L E S is computationally indistinguishable from the joint distribution of the outputs of the simulators and the functionality F S k L E S . Specifically, we demonstrate that (1) the view of DH is computationally indistinguishable from the output of the DH simulator and (2) the output of DH in π S k L E S is computationally indistinguishable from the output of DH in F S k L E S . We do not consider the view and output of CSP because CSP does not receive any messages.
(1) We define DH's view of the real protocol π S k L E S as follows.
$\mathrm{VIEW}_{DH}^{SkLE_S}\big(\{E(e_i)_B\}_{i \in [n]}, SK\big) = \big\{ \{E(e_i)_B\}_{i \in [n]},\, \{E(u_i)\}_{i \in [n]},\, \{E(s)_B, E(\overline{s})_B\},\, \{E(M), E(D)\},\, E(\alpha),\, \{E(v_i)\}_{i \in [n]},\, \{E(w_i)\}_{i \in [n]},\, \{E(C_i)\}_{i \in [n]} \big\}$
All data in DH's view are received from the functionalities except for the input {E(e_i)_B}_{i∈[n]}. Specifically, {E(s)_B, E(s̄)_B} are received from F S B D in line 12, {E(M), E(D)} are received from F S C I in line 13, and the other data are received from F S M . Intuitively, since DH sees only data encrypted with a semantically secure encryption scheme, we can simulate DH's view.
DH simulator S D H
Input: The simulator S D H receives input { E ( e i ) B } i [ n ] and output { E ( K i ) } i [ n ] of DH.
  • Simulation
    - The simulator S D H chooses values {r_i^1}_{i∈[n]}, {r_i^2}_{i∈{0,…,l−1}}, {r_i^3}_{i∈{0,…,l−1}}, r^4, r^5, r^6, {r_i^7}_{i∈[n]}, {r_i^8}_{i∈[n]}, and {r_i^9}_{i∈[n]} uniformly at random, where the values are in Z_{N^2}.
    - The simulator defines {{E(e_i)_B}_{i∈[n]}, {r_i^1}_{i∈[n]}, {{r_i^2}_{i∈{0,…,l−1}}, {r_i^3}_{i∈{0,…,l−1}}}, {r^4, r^5}, r^6, {r_i^7}_{i∈[n]}, {r_i^8}_{i∈[n]}, {r_i^9}_{i∈[n]}} as the view of DH.
    - The simulator outputs DH's view and halts.
{E(u_i)}_{i∈[n]} are computationally indistinguishable from {r_i^1}_{i∈[n]} since the Paillier cryptosystem is semantically secure and its ciphertexts are less than N². Similarly, the other data in the view of DH are computationally indistinguishable from the simulator's outputs. Therefore, the distribution of DH's view is computationally indistinguishable from the outputs of the simulator S D H .
(2) Let π S k L E S compute the auxiliary data for the j-th bit of all elements in the ( l − j )-th round ( j = l − 1 , … , 0 ). A predicted k-largest element is larger than the other elements with respect to the j-th bit because, from the ( l − 1 )-th bit down to the j-th bit, a predicted k-largest element has bit 1 at least once while the other elements have only 0s. The predicted k-largest elements include the k largest elements in the corresponding round. As shown in case 1 of Table 4, when s < k , the predicted k-largest elements are included in the set of k largest elements ( K i ← 1 ). The prior k-largest elements remain the same since the predicted k-largest elements include them. When s = k , the predicted k-largest elements are included in the set of k largest elements ( K i ← 1 ), as shown in case 3. In addition, the other candidate elements (i.e., the unpredicted k-largest elements) are excluded from the set of candidate elements ( C i ← 0 ), as shown in case 4, so that the k largest elements remain the same, as in case 5. Since all elements in the set of k largest elements are larger than the other elements, the output of π S k L E S is computationally indistinguishable from that of the functionality F S k L E S .
Hiding data access patterns: SkLE S hides the data access patterns for the k largest of all elements. Informally, SkLE S performs either the same computation for each element or a computation that is common to all elements. Specifically, the auxiliary data P i , K i , and C i regarding the k largest elements are computed by the same equations regardless of the result (lines 7–10 and lines 17–23). The other computations, in lines 11–16, are executed once in common for all elements. Since all data are encrypted with semantically secure encryption schemes and therefore no information is disclosed, SkLE S is secure against data access pattern attacks. □

5. Implementation and Experimental Results of Privacy-Preserving k -Nearest Neighbor Classification

In order to demonstrate its efficiency, we implemented privacy-preserving k-nearest neighbor classification (PkNC) using the proposed SkLE S /SkSE S and SCI protocols. Our extensive experiments used real datasets, and we compared the experimental results with those of existing PkNC experiments [11].

5.1. Privacy-Preserving k -Nearest Neighbor Classification

Given an unclassified input query and a classified dataset, k-nearest neighbor classification selects the k data most similar to the input query and classifies the input query by the majority class of the k selected data. Typically, a PkNC algorithm consists of the following three steps; a plaintext sketch of these steps is given after the list.
  • Step 1: computing distances between an input query and data in a dataset.
  • Step 2: selecting k smallest distances.
  • Step 3: computing the majority class of k data corresponding to the k smallest distances.
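The following plaintext C++ sketch illustrates these three steps on toy data; the actual protocol carries out each step over Paillier ciphertexts using the functionalities described in Appendix B, so the sketch only conveys the control flow.

```cpp
// Plaintext sketch of the three PkNC steps: squared Euclidean distances,
// selection of the k smallest, and majority vote over their classes.
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<std::vector<int>> data = {{1, 2}, {4, 4}, {1, 1}, {5, 5}, {2, 2}};
    std::vector<int> cls = {0, 1, 0, 1, 0};          // class of each datum
    std::vector<int> q   = {1, 2};                   // unclassified input query
    const int k = 3;

    // Step 1: squared Euclidean distance between q and every datum
    std::vector<std::pair<long, int>> dist;          // (distance, index)
    for (size_t i = 0; i < data.size(); ++i) {
        long d = 0;
        for (size_t j = 0; j < q.size(); ++j)
            d += (long)(data[i][j] - q[j]) * (data[i][j] - q[j]);
        dist.push_back({d, (int)i});
    }

    // Step 2: k smallest distances (handled by SkSE_S in the private protocol)
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

    // Step 3: majority class among the k nearest data
    std::map<int, int> freq;
    for (int i = 0; i < k; ++i) ++freq[cls[dist[i].second]];
    auto best = std::max_element(freq.begin(), freq.end(),
                                 [](auto& a, auto& b) { return a.second < b.second; });
    std::cout << "query classified as class " << best->first << "\n";   // prints class 0
}
```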
Table 5 shows the ratio of running time broken down by step in our PkNC when applying SkLE S /SkSE S . Since the running time of SkLE S /SkSE S accounts for most of the PkNC running time, the running-time characteristics of SkLE S /SkSE S largely determine those of PkNC. In order to improve the efficiency of PkNC, it is therefore crucial to compute step 2 efficiently. For more details about our PkNC algorithm, refer to Algorithm A2 in Appendix B.
As mentioned earlier, SkLE/SkSE has an efficient version (SkLE E /SkSE E ) that focuses on efficiency and a secure version (SkLE S /SkSE S ) that improves security. In terms of the communication rounds related to running time, the secure version requires 2l more communication rounds than the efficient version in the worst-case scenario. Despite this, we emphasize again that the security of the secure version is much improved.

5.2. Implementation and Experimental Results

We implemented PkNC with SkLE S /SkSE S and SCI applied, using the Paillier cryptosystem [36] as an additively homomorphic encryption scheme, in C++. We then conducted experiments on two Linux machines, one for DH and one for CSP. Each machine features an Intel Core i7-4790 CPU at 3.60 GHz and 15 GB of RAM and runs Ubuntu 18.04 LTS. In particular, each machine has four cores and supports eight parallel operations via hyper-threading technology [37].
In order to clearly show the performance of our PkNC with the proposed SkLE S /SkSE S applied, we conducted experiments with the Car Evaluation dataset used in existing work [11] and the Mushroom dataset, both from the UCI machine learning repository [38]. The Car Evaluation dataset [39] consists of 1728 data with six attributes and four classes, and the Mushroom dataset [40] consists of 8124 data with twenty-two attributes and two classes.
We first ran our PkNC for k = 1000 with the Car Evaluation dataset, which took 4 min 23 s with a 1024-bit key and eight threads. Table 6 compares the running times of our PkNC and the existing PkNCs; these results verify that our PkNC, when utilizing the SkLE S /SkSE S and SCI protocols, is very efficient. Starting from this experiment, we varied parameters such as the number k of nearest neighbors, the key size, the number of data, and the number of threads for parallel operations. Finally, we compared and analyzed the resulting running times against that of the existing PkNC [11], which is the most efficient of the previous protocols in Table 6.
Figure 3 shows the running times of our PkNC applying SkLE S /SkSE S and of the existing PkNC [11] as the number k of nearest neighbors varies. It shows that the running time of our PkNC is independent of k, as is that of SkLE S /SkSE S . Since the existing PkNC runs a minimum protocol k times in order to privately compute the k data with the smallest values, its running time increases rapidly with k; since SkLE S /SkSE S in our PkNC privately computes the k smallest data in a single execution regardless of k, our PkNC is independent of k. Specifically, the existing PkNC took from 12.02 min to 55.5 min as k increased from 5 to 25, and we expect that it would take more than one hour for k values exceeding 25. In contrast, our PkNC took roughly 4.4 min regardless of k. Therefore, our PkNC is much more efficient with respect to k than the existing PkNC, thanks to the efficiency of SkLE S /SkSE S .
Figure 4 shows the running times of our PkNC applying SkLE S /SkSE S and of the existing PkNC [11] for different key sizes. It shows that the running time of our PkNC increases only gradually in comparison with the existing PkNC. As the key size increases from 512 bits to 1024 bits, the running time of the existing PkNC rapidly increases from 9.98 min to 66.97 min, whereas that of our PkNC increases from 1.2 min to 4.38 min. For a 2048-bit key, our PkNC took only 27.2 min. Therefore, our PkNC is more efficient than the existing PkNC with respect to key size.
Table 7 shows the amount of communication for our PkNC and the existing PkNC [11] when the key size is 1024 bits. The communication amount of our PkNC is roughly one-third of that of the existing PkNC: while the existing PkNC needs to transmit 154.78 megabytes, our PkNC transmits only 54.72 megabytes. Assuming a common 10 Mbps LAN environment, the network delay for the existing PkNC is 123.82 s, whereas the delay for our PkNC is only about 44 s.
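These delays follow directly from the transmitted volume and the assumed 10 Mbps link rate:

$\dfrac{154.78\ \text{MB} \times 8\ \text{bits/byte}}{10\ \text{Mbit/s}} \approx 123.8\ \text{s}, \qquad \dfrac{54.72\ \text{MB} \times 8\ \text{bits/byte}}{10\ \text{Mbit/s}} \approx 43.8\ \text{s}.$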
We also conducted an experiment with much more data than the Car Evaluation dataset (n = 1728) used in the existing PkNC [11]. Even though the Mushroom dataset (n = 8124) contains about 4.7 times as many data as the Car Evaluation dataset, the running time of our PkNC is less than 30 min, which is still more efficient than the existing PkNC with k = 15 . Therefore, our PkNC is more efficient than the existing PkNC with respect to the number of data.
Figure 5 shows the running time of our PkNC as the number of threads (parallel operations) varies. It implies that the performance of our PkNC improves greatly in a cloud computing environment that enables numerous parallel operations, because the proposed SkLE S /SkSE S is executed for each dataset in parallel, as mentioned in Section 4.3. In other words, the figure shows that the running time is roughly halved whenever the number of threads doubles. Specifically, our PkNC took 23.5 min with only one thread, i.e., without parallel operations; with two threads, it took 11.4 min, and with four threads, 5.3 min. When the number of threads reached eight, the running time decreased by slightly less than half, since the machines used in our experiment have four cores and allow eight parallel operations via hyper-threading technology. Beyond eight threads, the running time ceased to decrease since our machines allow at most eight parallel operations. This implies that the performance of our PkNC, including SkLE S /SkSE S , can be improved greatly if it is executed in a cloud where more parallel operations are supported.
However, our PkNC with SkLE S /SkSE S applied requires a little more running time than the existing PkNC [10] with SkLE E /SkSE E , which focuses on efficiency but reveals some information. In our observations, the running time of our PkNC increased by roughly 11% in comparison with the existing PkNC [10]. The first reason is that SkLE S /SkSE S requires 2l more communication rounds than SkLE E /SkSE E , as mentioned in Section 5.1. The second reason is that our SkLE S /SkSE S always terminates in the last round regardless of the input dataset, while the earlier SkLE E /SkSE E can terminate before the last round (only in the worst case does it terminate in the last round, depending on the input dataset); in other words, the number of rounds for the proposed SkLE S /SkSE S is greater than or equal to that of the earlier SkLE E /SkSE E . The last reason is that our SCI protocol requires one more communication round than the existing comparison protocols [10,11]. However, we emphasize that the security of our SkLE S /SkSE S is improved in comparison to that of SkLE E /SkSE E . Therefore, SkLE S /SkSE S or SkLE E /SkSE E should be selected with regard to the trade-off between security and efficiency. Nevertheless, the running time of our PkNC with SkLE S /SkSE S applied is still more efficient than that of the existing PkNCs [11,15,16].

6. Conclusions

Data mining and machine learning are significant tools for analyzing large-scale data and extracting meaningful information. In order to handle large volumes of data efficiently, outsourced cloud computing services have emerged as a viable option, but they can lead to privacy problems, which are a pressing concern to resolve. Therefore, we focused on privacy-preserving k-nearest neighbor classification (PkNC) in outsourced cloud computing environments. To this end, we proposed the SkLE S /SkSE S and SCI protocols to solve the information disclosure problems of SkLE E /SkSE E and of the secure comparison protocols in the existing works [10,11]. We formally proved the security of our SkLE S /SkSE S and SCI protocols via the simulation paradigm. Then, we implemented PkNC with the proposed protocols in C++ and conducted extensive experiments with real datasets. Although the proposed SkLE S /SkSE S and SCI protocols sacrifice some efficiency to improve security, our PkNC with these protocols applied is still more efficient than the existing PkNCs.
Efficient and private algorithms such as SkLE S /SkSE S , which run in parallel and disclose no information in outsourced cloud computing environments, will play an important role in improving the efficiency of big data analysis. We will therefore continue to study privacy-preserving big data analysis techniques in future work, with specific regard to improving SkLE/SkSE and proposing a privacy-preserving secure maximum/minimum protocol. By applying these protocols to big data analysis techniques such as clustering, we aim to contribute to research on efficient and privacy-preserving big data analysis.

Author Contributions

Conceptualization, J.P.; Methodology, J.P.; Software, J.P.; Validation, D.H.L.; Formal analysis, J.P.; Writing—original draft, J.P.; Writing—review & editing, D.H.L.; Project administration, D.H.L.; Funding acquisition, D.H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2022R1A6A3A01087466) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-00518, Blockchain privacy preserving techniques based on data encryption).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Secure Zero Protocol (SZP)

In this section, we explain secure zero protocol (SZP) to privately compute whether all input data are zero or not. SZP is a real protocol to privately compute F S Z P defined in Section 4.1, and it is constructed based on the idea of the existing equality protocol. We present SZP in Algorithm A1 and provide an example in Table A1 for easy understanding.
Algorithm A1: Secure Zero Protocol (SZP)
Table A1. Example of Algorithm A1 for SZP (l = 6).
Input F : x i = 0 ( α = 0 ) F : x i 0 ( α = 1 )
i123456i123456
(Case 1)
x i = 0
E ( x i ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( x i ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 )
E ( s i ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( s 1 ) , E ( c 1 ) E ( 0 ) , E ( 0 )
E ( c i ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r ) E ( c i ) E ( 0 ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r )
E ( β ) E ( 1 ) E ( β ) E ( 0 )
E ( γ ) E ( 1 ) E ( γ ) E ( 1 )
(Case 2)
x i = 1
E ( x i ) E ( 0 ) E ( 0 ) E ( 1 ) E ( 0 ) E ( 1 ) E ( 0 ) E ( x i ) E ( 0 ) E ( 0 ) E ( 1 ) E ( 0 ) E ( 1 ) E ( 0 )
E ( s i ) E ( 3 ) E ( 3 ) E ( 2 ) E ( 1 ) E ( 0 ) E ( 1 ) E ( s 1 ) , E ( c 1 ) E ( 2 ) , E ( r )
E ( c i ) E ( r ) E ( r ) E ( r ) E ( r ) E ( 0 ) E ( r ) E ( c i ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r )
E ( β ) E ( 0 ) E ( β ) E ( 1 )
E ( γ ) E ( 0 ) E ( γ ) E ( 0 )
(Case 3)
x i = 1
E ( x i ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( x i ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 ) E ( 1 )
E ( s i ) E ( 10 ) E ( 8 ) E ( 6 ) E ( 4 ) E ( 2 ) E ( 0 ) E ( s 1 ) , E ( c 1 ) E ( 6 ) , E ( r )
E ( c i ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r ) E ( 0 ) E ( c i ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r ) E ( r )
E ( β ) E ( 0 ) E ( β ) E ( 1 )
E ( γ ) E ( 0 ) E ( γ ) E ( 0 )
r: random value.
Intuitively, DH selects the functionality F (F: x_i = 0 or F: x_i ≠ 0) by a random coin α, where F: x_i = 0 computes whether or not all input data {x_i}_{i∈[l]} are 0, and F: x_i ≠ 0 computes whether or not some x_i is non-zero in the input dataset {x_i}_{i∈[l]}. DH computes the selected functionality F on the input dataset and sends CSP the computation result, which is a vector containing either random values including 0 or only random values. CSP converts the vector and returns the converted value (β) to DH. Then, DH outputs the result (γ) based on the converted value according to the random coin α (functionality F) selected by DH.
For easy understanding of Algorithm A1, we intuitively explain the data without encryption. DH selects the functionality F by tossing a random coin α (line 2), where F: x_i = 0 if α = 0; otherwise, F: x_i ≠ 0. When DH selects α = 0 (resp., α = 1), if all input data {x_i}_{i∈[l]} are 0, DH sends CSP a vector c that consists of only random values (resp., random values including 0), so that CSP returns β = 1 (resp., β = 0) to DH. If x_i = 1 occurs at least once in the input dataset {x_i}_{i∈[l]}, DH sends CSP a vector c with random values including 0 (resp., only random values), so that CSP returns β = 0 (resp., β = 1) to DH. Specifically, when DH selects the random coin α = 0 (F: x_i = 0), it computes s_i on the input datum x_i for i = 1, …, l as follows.
$s_i \leftarrow 1 - x_i + 2\sum_{j=i+1}^{l} x_j$
If all input data {x_i}_{i∈[l]} are 0, c is a vector with only random values since all s_i = 1. When x_i = 1 occurs at least once in the input dataset {x_i}_{i∈[l]}, let the last position with x_i = 1 be the t-th bit; then c_t = 0 since s_t = 0, and the other c_i for i ≠ t are all random since the other s_i ≠ 0 for i ≠ t. Therefore, c is a vector that consists of random values including 0 (lines 4–7). When DH selects the random coin α = 1 (F: x_i ≠ 0), s_1 is the sum of all input data {x_i}_{i∈[l]} (line 9). If all input data {x_i}_{i∈[l]} are 0, both s_1 and c_1 are 0 (lines 9–10), and therefore c is a vector that consists of random values including 0 (lines 11–13). If x_i = 1 occurs at least once in the input dataset {x_i}_{i∈[l]}, c is a vector with only random values since c_1 is random (lines 9–13). After a permutation σ of {c_i}_{i∈[l]}, DH sends CSP the permuted vector d (lines 15–16). The subsequent process is the same as in Algorithm 1 of the SCI protocol. CSP then sends DH β = 0 (resp., β = 1) if the vector received from DH contains random values including 0 (resp., only random values) (lines 18–24). Then, the result γ is β if α = 0; otherwise, it is the complement of β (lines 26–30).
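The following C++ sketch reproduces DH's computation of the vector c for both choices of the random coin α in plaintext, using the per-position values s_i from the equation above; the encryption, the permutation σ, and the homomorphic blinding of the actual protocol are omitted, and the multiplicative blinding shown here is only illustrative.

```cpp
// Plaintext sketch of DH's computation in SZP: for alpha = 0 the vector c contains
// a 0 iff some x_i = 1; for alpha = 1 it contains a 0 iff all x_i = 0.
#include <cstdlib>
#include <iostream>
#include <vector>

std::vector<long> szp_vector(const std::vector<int>& x, int alpha) {
    const int l = (int)x.size();
    std::vector<long> c(l);
    if (alpha == 0) {                                        // F: x_i = 0
        for (int i = 0; i < l; ++i) {
            long s = 1 - x[i];                               // s_i = 1 - x_i + 2*sum_{j>i} x_j
            for (int j = i + 1; j < l; ++j) s += 2 * x[j];
            c[i] = s * (1 + std::rand() % 100);              // blind: c_i = 0 iff s_i = 0
        }
    } else {                                                 // F: x_i != 0
        long s = 0;
        for (int v : x) s += v;                              // s_1 = sum of all x_i
        c[0] = s * (1 + std::rand() % 100);                  // c_1 = 0 iff all x_i = 0
        for (int i = 1; i < l; ++i) c[i] = 1 + std::rand() % 100;   // remaining entries random
    }
    return c;
}

int main() {
    std::vector<int> x = {0, 0, 1, 0, 1, 0};                 // Case 2 of Table A1
    for (long v : szp_vector(x, 0)) std::cout << v << " ";   // contains a 0 (some x_i = 1)
    std::cout << "\n";
    for (long v : szp_vector(x, 1)) std::cout << v << " ";   // only non-zero values
    std::cout << "\n";
}
```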
Similar to Algorithm 1 of the SCI protocol, the result γ of SZP satisfies the condition $\gamma = (\forall i,\, x_i = 0) = (\alpha \oplus \beta)$ for an input dataset {x_i}_{i∈[l]}. Since DH cannot learn β and CSP cannot learn information about α, SZP does not disclose any information about the result. Specifically, DH knows the random coin α but cannot learn information about β since the value β is encrypted with a semantically secure encryption scheme. Likewise, CSP learns β by decryption but cannot learn information about α since the intermediate result from DH is blinded by a random value and the value α is chosen uniformly at random. We do not include a proof of SZP security in this paper.
Computation and communication costs: SZP requires l encryptions/decryptions and at most (l + 1) exponentiations, and (l + 1) · C bits are transmitted in one round. Specifically, CSP decrypts d_i for i = 1, …, l in line 18. The encryption of E(β) in lines 20 and 22 is excluded since it can be performed by precomputation. According to α, exponentiations are executed l times or once in lines 3–14, and at most once in lines 26–30.

Appendix B. Privacy-Preserving k-Nearest Neighbor Classification (PkNC)

In this section, we present PkNC in Algorithm A2, which privately classifies an unclassified input query based on a classified dataset. In order to construct PkNC, additional functionalities are required. For information on the detailed protocols, refer to [10].
Algorithm A2: Privacy-preserving k-Nearest Neighbor Classification (PkNC)
Secure Squared Euclidean Distance functionality F S S E D : F S S E D privately computes the squared Euclidean distance between two input data with m attributes. F S S E D receives {E(a_j), E(b_j)}_{j∈[m]} from DH and a secret key SK from CSP, and then it sends E(e) to DH, where $e = \sum_{j=1}^{m} (a_j - b_j)^2$.
Secure Class Frequency functionality F S C F : Given the class information ( c i ) of the k data most similar to an input query, F S C F privately computes the number of the k data in each class. F S C F receives {E(c_i), E(K_i)}_{i∈[n]} from DH and a secret key SK from CSP, and then it sends {E(f_j)}_{j∈[t]} to DH, where f_j is the number of data with the j-th class among the k data most similar to the input query.
Other functionalities F S B D ′, F S B D ″, and F S 1 L E : We introduced F S B D in Section 3.6, where, given an encrypted datum E(x), F S B D returns ciphertexts for the individual bits of the corresponding data x and their 1's complements (i.e., E(x)_B, E(x̄)_B). Similarly, F S B D ′ (denoted with a single prime) returns only the ciphertexts E(x)_B for the individual bits of x, and F S B D ″ (denoted with a double prime) returns only the ciphertexts E(x̄)_B for the 1's complements of x. F S 1 L E denotes F S k L E with k = 1, which privately computes the maximum datum in a dataset; in other words, the datum e_i with K_i = 1 is the maximum.
We assume that a classified dataset consists of n data and their classes {d_i, c_i}_{i∈[n]}, where each datum d_i consists of m attributes (i.e., d_i = {d_{i,j}}_{j∈[m]}). Similarly, we assume that an input query q consists of m attributes (i.e., q = {q_j}_{j∈[m]}). We assume that DH has an encrypted dataset {E(d_{i,j}), E(c_i)}_{i∈[n], j∈[m]} and an encrypted input query {E(q_j)}_{j∈[m]}. After PkNC, DH obtains the class information of the input query based on the dataset in encrypted form. Specifically, DH holds {E(K_j)}_{j∈[t]} after PkNC, and if K_α = 1, the input query is classified into the α-th class. Recall that n is the number of data, m is the number of attributes, and t is the number of class types. For easy understanding of Algorithm A2, we intuitively explain the data without encryption.
(Step 1: line 3) privately computing distances between an input query and data in a dataset: F S S E D privately computes the squared Euclidean distance e i between an unclassified input query q and the data d i in a dataset.
(Step 2: lines 4–6) privately selecting k smallest distances: In line 6, F S k L E S , whose input is 1’s complement of distances, privately computes k data closest to an input query as mentioned in Section 4.3. In other words, it privately computes k smallest distances between an input query and data in a dataset. F S B D is used to comply with the input format of F S k L E S .
(Step 3: lines 7–11): privately computing the majority class of k data corresponding to the k smallest distances: F S C F privately computes the number of each class of k data closest to an input query. F S 1 L E privately computes the majority class of the k data by computing the maximum number of the classes where F S B D is used to comply with the input format of F S 1 L E .
PkNC discloses no information since DH receives only semantically secure ciphertexts from third parties to ideally compute functionalities. Similar to the formal proof for SkLE S in Section 4.4, we can simulate DH’s view in random values since DH receives only randomized ciphertexts from third parties for functionalities. Since CSP does not receive any messages, we do not consider CSP’s view. We do not include a formal proof of PkNC security in this paper.

References

  1. Wu, X.; Zhu, X.; Wu, G.Q.; Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 2013, 26, 97–107.
  2. Beam, A.L.; Kohane, I.S. Big data and machine learning in health care. JAMA 2018, 319, 1317–1318.
  3. Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Khan, S.U. The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115.
  4. Acar, A.; Aksu, H.; Uluagac, A.S.; Conti, M. A survey on homomorphic encryption schemes: Theory and implementation. ACM Comput. Surv. (CSUR) 2018, 51, 1–35.
  5. Price, W.N.; Cohen, I.G. Privacy in the age of medical big data. Nat. Med. 2019, 25, 37–43.
  6. Mehmood, A.; Natgunanathan, I.; Xiang, Y.; Hua, G.; Guo, S. Protection of big data privacy. IEEE Access 2016, 4, 1821–1834.
  7. Botta, A.; De Donato, W.; Persico, V.; Pescapé, A. Integration of cloud computing and internet of things: A survey. Future Gener. Comput. Syst. 2016, 56, 684–700.
  8. Li, F.; Shin, R.; Paxson, V. Exploring privacy preservation in outsourced k-nearest neighbors with multiple data owners. In Proceedings of the 2015 ACM Workshop on Cloud Computing Security Workshop, New York, NY, USA, 16 October 2015; pp. 53–64.
  9. Bost, R.; Popa, R.A.; Tu, S.; Goldwasser, S. Machine learning classification over encrypted data. In Proceedings of the NDSS, San Diego, CA, USA, 8–11 February 2015; Volume 4324, p. 4325.
  10. Park, J.; Lee, D.H. Parallelly running k-nearest neighbor classification over semantically secure encrypted data in outsourced environments. IEEE Access 2020, 8, 64617–64633.
  11. Samanthula, B.K.; Elmehdwi, Y.; Jiang, W. K-nearest neighbor classification over semantically secure encrypted relational data. IEEE Trans. Knowl. Data Eng. 2014, 27, 1261–1273.
  12. Du, J.; Bian, F. A Privacy-Preserving and Efficient k-nearest neighbor query and classification scheme based on k-dimensional tree for outsourced data. IEEE Access 2020, 8, 69333–69345.
  13. Lian, H.; Qiu, W.; Yan, D.; Huang, Z.; Tang, P. Efficient and secure k-nearest neighbor query on outsourced data. Peer-to-Peer Netw. Appl. 2020, 13, 2324–2333.
  14. Song, F.; Qin, Z.; Liu, Q.; Liang, J.; Ou, L. Efficient and Secure k-Nearest Neighbor Search Over Encrypted Data in Public Cloud. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6.
  15. Sun, M.; Yang, R. An efficient secure k nearest neighbor classification protocol with high-dimensional features. Int. J. Intell. Syst. 2020, 35, 1791–1813.
  16. Haque, R.U.; Hasan, A.; Jiang, Q.; Qu, Q. Privacy-preserving K-nearest neighbors training over blockchain-based encrypted health data. Electronics 2020, 9, 2096.
  17. Elmehdwi, Y.; Samanthula, B.K.; Jiang, W. Secure k-nearest neighbor query over encrypted data in outsourced environments. In Proceedings of the 2014 IEEE 30th International Conference on Data Engineering, Chicago, IL, USA, 31 March–4 April 2014; pp. 664–675.
  18. Rong, H.; Wang, H.M.; Liu, J.; Xian, M. Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access 2016, 4, 9589–9603.
  19. Wu, W.; Liu, J.; Rong, H.; Wang, H.; Xian, M. Efficient k-nearest neighbor classification over semantically secure hybrid encrypted cloud database. IEEE Access 2018, 6, 41771–41784.
  20. Wu, W.; Parampalli, U.; Liu, J.; Xian, M. Privacy preserving k-nearest neighbor classification over encrypted database in outsourced cloud environments. World Wide Web 2019, 22, 101–123.
  21. Chen, H.; Chillotti, I.; Dong, Y.; Poburinnaya, O.; Razenshteyn, I.; Riazi, M.S. SANNS: Scaling up secure approximate k-nearest neighbors search. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), Berkeley, CA, USA, 12–14 August 2020; pp. 2111–2128.
  22. Zhu, D.; Zhu, H.; Liu, X.; Li, H.; Wang, F.; Li, H.; Feng, D. CREDO: Efficient and privacy-preserving multi-level medical pre-diagnosis based on ML-kNN. Inf. Sci. 2020, 514, 244–262.
  23. Zheng, Y.; Lu, R.; Shao, J. Achieving efficient and privacy-preserving k-nn query for outsourced ehealthcare data. J. Med. Syst. 2019, 43, 1–13.
  24. Jagadish, H.V. Linear clustering of objects with multiple attributes. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 23–26 May 1990; pp. 332–342.
  25. Yang, S.; Tang, S.; Zhang, X. Privacy-preserving k nearest neighbor query with authentication on road networks. J. Parallel Distrib. Comput. 2019, 134, 25–36.
  26. Kolahdouzan, M.; Shahabi, C. Voronoi-based k nearest neighbor search for spatial network databases. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, ON, Canada, 31 August–3 September 2004; Volume 30, pp. 840–851.
  27. Wang, Y.; Tian, Z.; Sun, Y.; Du, X.; Guizani, N. LocJury: An IBN-based location privacy preserving scheme for IoCV. IEEE Trans. Intell. Transp. Syst. 2020, 22, 5028–5037.
  28. Sun, Y.; Yin, L.; Sun, Z.; Tian, Z.; Du, X. An IoT data sharing privacy preserving scheme. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; pp. 984–990.
  29. Jia, M.; He, K.; Chen, J.; Du, R.; Chen, W.; Tian, Z.; Ji, S. PROCESS: Privacy-Preserving On-Chain Certificate Status Service. In Proceedings of the IEEE INFOCOM 2021-IEEE Conference on Computer Communications, Virtual, 10–13 May 2021; pp. 1–10.
  30. Raj, D.; Mohanasundaram, R. An efficient filter-based feature selection model to identify significant features from high-dimensional microarray data. Arab. J. Sci. Eng. 2020, 45, 2619–2630.
  31. Goldreich, O. Foundations of Cryptography: Volume 2, Basic Applications; Cambridge University Press: Cambridge, UK, 2009.
  32. Asharov, G.; Lindell, Y. A full proof of the BGW protocol for perfectly secure multiparty computation. J. Cryptol. 2017, 30, 58–151.
  33. Canetti, R. Security and composition of multiparty cryptographic protocols. J. Cryptol. 2000, 13, 143–202.
  34. Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Prague, Czech Republic, 2–6 May 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 223–238.
  35. Samanthula, B.K.; Chun, H.; Jiang, W. An efficient and probabilistic secure bit-decomposition. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, Hangzhou, China, 8–10 May 2013; pp. 541–546.
  36. Bethencourt, J. Paillier Library. 2010. Available online: https://acsc.cs.utexas.edu/libpaillier/ (accessed on 11 December 2022).
  37. Intel. Intel Core i7-4790 Processor Specification. 2014. Available online: https://ark.intel.com/content/www/us/en/ark/products/80806/intel-core-i74790-processor-8m-cache-up-to-4-00-ghz.html (accessed on 11 December 2022).
  38. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 11 December 2022).
  39. Bohanec, M.; Zupan, B. Car Evaluation Data Set. UCI Machine Learning Repository. 1997. Available online: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation (accessed on 11 December 2022).
  40. Schlimmer, J. Mushroom Data Set. UCI Machine Learning Repository. 1987. Available online: https://archive.ics.uci.edu/ml/datasets/Mushroom (accessed on 11 December 2022).
Figure 1. System model of PkNC.
Figure 2. Running time ratio of PkNCs: Refs. [11,15,16] and [This work], respectively.
Figure 3. Running time comparison of our PkNC and the existing PkNC [11] for the number of nearest neighbors (k).
Figure 4. Running time comparison of our PkNC and the existing PkNC [11] for key size (bit).
Figure 5. Running time for the number of threads.
Table 1. Security comparison of our protocols and existing protocols.
Secure Comparison ProtocolSkLE/SkSE Protocol
[11]×
[10]××
  [ This work ]    
Table 2. Example of Algorithm 1 for SCI protocol (l = 5).
InputFunctionalityj E ( s j ) k j E ( w j ) E ( x j ) E ( y j ) E ( γ ) E ( y 0 ) E ( z j ) E ( u j ) E ( β ) E ( M ) E ( D )
(Case 1)
s < k

s = 26
k = 29
F : s < k
( α = 0 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 0 ) 1 E ( r ) E ( 1 ) E ( 1 ) E ( 0 ) E ( r ) E ( 1 ) E ( 1 ) E ( 1 )
1 E ( 1 ) 0 E ( 0 ) E ( 1 ) E ( r ) E ( r ) E ( r )
0 E ( 0 ) 1 E ( r ) E ( 1 ) E ( r ) E ( 0 ) E ( r ) E ( r ) E ( r )
F : s k
( α = 1 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 0 ) 1 E ( 0 ) E ( 1 ) E ( 1 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( 1 )
1 E ( 1 ) 0 E ( r ) E ( 1 ) E ( r ) E ( r ) E ( r )
0 E ( 0 ) 1 E ( 0 ) E ( 1 ) E ( r ) E ( 0 ) E ( r ) E ( r )
(Case 2)
s > k

s = 29
k = 26
F : s < k
( α = 0 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 1 ) 0 E ( 0 ) E ( 1 ) E ( 1 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 )
1 E ( 0 ) 1 E ( r ) E ( 1 ) E ( r ) E ( r ) E ( r )
0 E ( 1 ) 0 E ( 0 ) E ( 1 ) E ( r ) E ( 0 ) E ( r ) E ( r ) E ( r )
F : s k
( α = 1 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 1 ) 0 E ( r ) E ( 1 ) E ( 1 ) E ( 0 ) E ( r ) E ( 1 ) E ( 0 ) E ( 1 )
1 E ( 0 ) 1 E ( 0 ) E ( 1 ) E ( r ) E ( r ) E ( r )
0 E ( 1 ) 0 E ( r ) E ( 1 ) E ( r ) E ( 0 ) E ( r ) E ( r )
(Case 3)
s = k

s = 26
k = 26
F : s < k
( α = 0 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 0 ) 0 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r ) E ( 0 ) E ( 0 ) E ( 0 )
1 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
0 E ( 0 ) 0 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( 1 ) E ( 0 ) E ( 0 )
F : s k
( α = 1 )
4 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
3 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
2 E ( 0 ) 0 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r ) E ( 1 ) E ( 0 ) E ( 0 )
1 E ( 1 ) 1 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( r )
0 E ( 0 ) 0 E ( 0 ) E ( 0 ) E ( 0 ) E ( 1 ) E ( 1 ) E ( r )
Table 3. Example of Algorithm 2 for SkLE S protocol.
j { E ( e i ) B } i [ 5 ] 1 { E ( P i ) } i [ 5 ] E ( M )
{ E ( K i ) } i [ 5 ] { E ( C i ) } i [ 5 ] E ( D )
····
E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) ·
7 E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 )
E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 1 )
6 E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 )
E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 1 )
5 E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 0 )
E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 1 )
4 E ( 0 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 1 )
E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) E ( 1 )
3 E ( 1 ) , E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) E ( 0 )
E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) E ( 1 )
2 E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 1 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) E ( 0 )
E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 )
1 E ( 0 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) , E ( 1 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) E ( 0 )
E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 )
0 E ( 1 ) , E ( 0 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) E ( 0 )
    E ( 1 ) , E ( 1 ) , E ( 1 ) , E ( 0 ) , E ( 0 ) 2 E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) , E ( 0 ) E ( 0 )
Parameters: k = 3 ,   l = 8 ; 1 input: { E ( e i ) B } i [ 5 ]   =   { E ( 73 ) B , E ( 54 ) B , E ( 45 ) B , E ( 41 ) B , E ( 38 ) B , } ; 2 output { E ( K i ) } i [ 5 ]   =   { E ( 1 ) E ( 1 ) E ( 1 ) E ( 0 ) E ( 0 ) } means {73, 54, 45} are the three largest elements in array {73, 54, 45, 41, 38}.
Table 4. Values of K_i and C_i according to the cases in SkLE S .

| | Candidate (K_i = 0, C_i = 1): Predicted (e_{i,j} = 1) | Candidate (K_i = 0, C_i = 1): Unpredicted (e_{i,j} = 0) | Non-Candidate (C_i = 0): e_{i,j} = 1 or e_{i,j} = 0 |
|---|---|---|---|
| s > k | K_i ← K_i, C_i ← C_i | (case 2) K_i ← K_i, C_i ← 0 | (case 5) K_i ← K_i, C_i ← C_i |
| s < k | (case 1) K_i ← 1, C_i ← 0 | K_i ← K_i, C_i ← C_i | (case 5) K_i ← K_i, C_i ← C_i |
| s = k | (case 3) K_i ← 1, C_i ← 0 | (case 4) K_i ← K_i, C_i ← 0 | (case 5) K_i ← K_i, C_i ← C_i |
Table 5. Running time ratio by steps in our PkNC.

| | Step 1 | Step 2: SBD | Step 2: SkLE S | Step 3 | Total |
|---|---|---|---|---|---|
| Ratio | 5% | 9% | 75% | 11% | 100% |
Table 6. Running time comparison of our PkNC and the existing PkNCs.

| | Number of Data | Running Time |
|---|---|---|
| [16] | 760 | 61.8 min |
| [15] | 60 | 5.4 min |
| [11] | 1728 | 12 min |
| [This work] | 1728 | 4.38 min |
Table 7. Communication amount and network delay of our PkNC and the existing PkNC [11] in a 10 Mbps LAN.

| | Communication Amount (Megabytes) | Network Delay (s) |
|---|---|---|
| [11] | 154.78 | 123.82 |
| [This work] | 54.72 | 43.77 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
