Privacy-Preserving K-Nearest Neighbors Training over Blockchain-Based Encrypted Health Data

Abstract: Numerous works focus on the data privacy issue of the Internet of Things (IoT) when training a supervised Machine Learning (ML) classifier. Most of the existing solutions assume that the classifier's training data can be obtained securely from different IoT data providers. The primary concern is data privacy when training a K-Nearest Neighbour (K-NN) classifier with IoT data from various entities. This paper proposes secure K-NN, which provides privacy-preserving K-NN training over IoT data. It employs Blockchain technology with a partial homomorphic cryptosystem (PHC) known as Paillier in order to protect all participants (i.e., the IoT data providers and the data analysts).


Introduction
At present, smart cities deploy innumerable IoT infrastructures [1] to manage their components efficiently [2]. A tremendous volume of information is accumulated from numerous IoT devices stationed in different city areas, such as medical health, agriculture, transportation, and energy transmission [3]. Many ML-based reforms have been proposed to handle the issues emerging from the processing obligations of IoT data [4]. Among all ML models, K-means [5] and K-NN [6] are pre-eminent unsupervised and supervised learning models, respectively, that can effectively implement data classification [7]. Therefore, these ML models have been used in various specialties to answer real-world classification dilemmas in IoT-enabled smart health, for instance over individual fitness and healthcare records observed by wearable IoT sensors [8][9][10]. However, three issues arise when such data are shared for training:
1. Most of the training phases manipulate intimate data samples, such as medical data reported from clinical wearable IoT devices, which may leak private, sensitive, and confidential information during training tasks.
2. Latent invaders may make unauthorized modifications to data records by altering or tampering with them during the data-sharing process, resulting in inaccurate classification by the ML model.
3. The data provider may lose authority over the data, and replication of the shared datasets may occur, as the datasets are available to all associates.
To address the data privacy issue of individual data providers, most of the present solutions [14][15][16][17] focused on cryptography and differential privacy. Those solutions assumed that the data required for training could be obtained securely from various data providers for classification and analysis; issues of ownership and data integrity received only trivial attention. However, such assumptions are invalid in most realistic cases due to potential attacks. This paper uses Blockchain technology to build a reliable data-sharing platform, which can bridge the gap between realistic constraints and idealized assumptions. In general, a shared filing scheme intended to permit the distribution of tamper-proof records among various individuals is called a Blockchain [18]. Auditing of the immutable records is enabled on the Blockchain, which confirms the ownership of recorded data.
Some works focus on privacy-preserving K-NN computation, search, query, and classification [19][20][21][22][23][24][25][26][27][28]. None of these works used homomorphic encryption with Blockchain in order to secure sensitive information. Integrating Blockchain into an ML training method is laborious but encouraging. The first difficulty is to outline a training data format suitable for storage on the Blockchain that secures each data provider's privacy. The second difficulty is to develop a training algorithm that establishes an accurate K-NN classifier from the Blockchain's recorded data while securing sensitive information. Secure SVM [29] was proposed to address the problems mentioned earlier: it employs a Blockchain-based, privacy-preserving training algorithm for the SVM using encrypted data of IoT devices from smart cities. Secure SVM applies a public-key cryptosystem to protect the privacy of the data, which are encrypted by the private keys of the data providers. However, secure SVM requires too many calculations and comparisons, and too much time and space, to analyze health data. In most medical health research, K-NN outperformed SVM in terms of performance and time complexity [30][31][32].
To handle the above challenges, we propose secure K-NN, a privacy-preserving K-NN schema based on Blockchain and the encrypted data of IoT devices. A public-key cryptosystem called Paillier is employed to shield the privacy of the IoT data, which are encrypted with the own private key of the respective data provider. Paillier is an additive homomorphic cryptosystem (HC) that is more efficient, in terms of encryption and decryption time complexity, than other algorithms such as Rabin, RSA, and Goldwasser-Micali [33]. Handling encrypted data could be a problem because of the tremendous amount of intercommunication. However, K-NN can handle these circumstances, as it has no separate optimization algorithm; its essential procedures are polynomial operations and comparison. We design the secure building blocks SPO (addition and subtraction), SC, and SBO using the homomorphic properties of Paillier for secure K-NN. With these building blocks, no iteration of secure K-NN needs a trusted third party during interaction, which significantly lessens the risk of a data breach.
The main contributions of this paper are as follows.
• To establish protected and trustworthy IoT data sharing, Blockchain technology is employed. All the IoT data are encrypted locally with the own private key of the respective data provider.
The encrypted data are recorded on a Blockchain by uniquely formatted transactions.

• We designed protected building blocks, such as SPO (addition, subtraction), SBO, and SC, using the PHC (i.e., Paillier), and developed a secure K-NN training algorithm. There is no requirement for a trusted third party.

• Rigorous analysis proves that secure K-NN can protect data privacy during training, achieve accuracy similar to general K-NN, and outperform the previous state-of-the-art methods.
The rest of the paper is organized as follows. Related work and preliminaries are discussed in Sections 2 and 3, respectively. The system overview is presented in Section 4. Section 5 summarizes the proposed method. The analysis of confidentiality issues and the evaluation of the proposed scheme are presented in Sections 6 and 7, respectively. Finally, Section 8 concludes the paper.

Related Work
Supervised learning contains two phases: the training phase, where the ML model learns from a given set of labeled specimens, and the classification phase, where the label with the maximum likelihood is output for a given sample. Thus, current research on privacy-preserving ML can be divided into two categories, namely privacy-preserving ML training and privacy-preserving ML classification.

Privacy-Preserving ML Training
In most cases, multiple parties are involved when training an ML model, which raises the privacy issue of the IoT data. The main goal is to protect each data provider's IoT data from being discovered by others during the training of an ML model. During the last decade, numerous works have been done in this category [15,34–40], and our work focuses here.
There are many methods used to secure data privacy in the publishing stage [41][42][43], but the most common approach is differential privacy (DP) [15]. It ensures the protection of published data by adding carefully calibrated perturbations to the underlying data. A DP-based deep learning method was proposed by Abadi et al. [15]: they developed a system to jointly train a neural network while preserving the sensitive information of the training datasets. DP-based solutions can achieve excellent computational performance, since they execute calculations over plaintext data. There are also some limitations. Firstly, due to the perturbations, the quality and integrity of the training data are reduced significantly. Secondly, the sensitive information of each training record is still publicly exhibited, so perturbation alone is not enough to effectively secure data privacy. Moreover, the privacy budget parameter trades model accuracy off against data privacy protection: a smaller budget strengthens privacy but reduces accuracy.
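The perturbation idea can be sketched with the basic Laplace mechanism (a standard DP primitive, not the DP-SGD method of [15]; the function name and parameters here are ours):

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Laplace mechanism: release true_value + Lap(0, sensitivity / epsilon).
    # A smaller privacy budget epsilon means a larger noise scale:
    # stronger privacy protection but lower accuracy of the released value.
    scale = sensitivity / epsilon
    # The difference of two iid exponentials with mean `scale`
    # is Laplace-distributed with scale `scale`.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise
```

For example, releasing a count with `epsilon = 0.1` injects far more noise than with `epsilon = 10`, illustrating the accuracy/privacy trade-off described above.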
Homomorphic encryption allows ML training to be performed reliably, with a privacy guarantee, on encrypted IoT data: it permits calculations on ciphertexts while preserving the correctness of the data. In order to train various ML models, different protected methods based on HC have been proposed, such as SVM [29,34], Logistic Regression [35,36], Decision Tree [38], and Naive Bayes [37]. Secure protocols for addition and subtraction based on Paillier were developed in [34], resulting in a secure SVM training algorithm. Due to the computational limitations of Paillier, the authors introduced an authorization server that worked as a trusted third party.
The PHC method can reach higher data privacy with more efficiency than fully homomorphic encryption (FHE) systems, and PHC is much more practical than FHE for addition and multiplication operations. Complicated calculations can be carried out with a trusted third party [34]; without one, the model becomes inaccurate due to the approximation of complex equations by a single computational operation [39,40]. On the other hand, the calculations of FHE are costly in terms of time and space complexity. Therefore, existing uses of FHE are prohibitive for encryption and prediction scenarios and, as a result, unrealistic in terms of application.

Privacy-Preserving ML Classification
Usually, two different parties interact in a classification-as-a-service scenario: one holds the data sample, and the other holds the ML model. For a data owner who wants to know the classification result, it is not safe to reveal sensitive data to an unreliable ML model owner. On the other hand, the model owner may decline to share the classifier, as its asset value is too high for the service provider.
Several existing solutions [14,17,44–46] develop effective methods to secure the privacy of both parties. Wang et al. [45] proposed a method to classify encrypted images based on multi-layer learning; the authors used a public classifier but required that the image data be kept secure. Zhu et al. [46] proposed a privacy-preserving nonlinear SVM method for online medical prediagnosis; their design protects both the individual information of the health record and the SVM model. Rahulamathavan et al. [17] proposed a privacy-preserving SVM data classification system that can securely classify multiclass datasets: the client's input data samples are anonymous to the server, while clients are likewise unaware of the server-side classifier during the classification process. A group of classification protocols was developed using HC techniques for employing simple ML classifiers on encrypted data, such as Naive Bayes, hyperplane decision, and decision trees [14,44].
All the above studies employed standard ML classifiers and developed building blocks to assemble privacy-preserving classification methods. However, the calculations required when training an ML classifier are much more complex than those of the classification phase, so those building blocks might be useless given the complexity of the training algorithm.

The Novelty of This Paper
Earlier research on secure K-NN focused on specific domains such as data confidentiality, secure query, secure search, secure computation, and secure classification [19][20][21][22][23][24][25][26][27][28]. None of them keeps track of all the transactions, and most of them assume that the K-NN model is already trained. In this study, a partial homomorphic cryptosystem known as Paillier is employed together with Blockchain technology to handle the issues of ownership, integrity, and data privacy when training K-NN classifiers on data from various data providers. Specifically, all IoT data from individual data providers are encrypted using Paillier and then registered on a distributed ledger. Any data analyst can obtain the encrypted data by interacting with the respective data provider and train the K-NN classifier; the data analysts can never obtain the plaintext of the IoT data on the Blockchain. Secure protocols for the operations in K-NN are developed to conduct training tasks on encrypted data, i.e., SPO (addition/subtraction), SBO, and SC. A privacy-preserving K-NN training algorithm, secure K-NN, is proposed based on these building blocks, and no trusted third party is needed. Secure K-NN can train K-NN classifiers without loss of accuracy, as the training is based on Paillier.
Two well-known security definitions are used as security goals: secure two-party computation [47] and modular sequential composition [48]. The proposed method shows that the data provided by an individual data provider is insufficient to reveal any information about other data providers' data. Simultaneously, the data analyst's model parameters are kept secret from all data providers during the training process.

Preliminaries
This section describes all notations, background ideas, and related technologies of this research.

Notation
A dataset D consists of m records, where (x_i, y_i) is the i-th record in D and l_i is the label corresponding to (x_i, y_i). Define d and (c_x_i, c_y_i) as the two relevant parameters of K-NN. In this paper, we use a PHC named Paillier as the cryptosystem, and let [[m]] represent the encryption of a message m under Paillier. Notations are summarized in Table 1. Table 1. Notations.

Signs — Interpretations
D — Labeled data array
[[m]] — Encryption of m under Paillier

Homomorphic Cryptosystem
Cryptosystems are mainly based on three algorithms: key generation (KeyGen), data encryption (Enc), and data decryption (Dec). In public-key cryptosystems, a pair of keys (PK, SK) is used: the public key (PK) for encryption and the private key (SK) for decryption. A cryptosystem property that maps operations over ciphertexts to corresponding operations over plaintexts, without knowledge of the decryption key, is known as homomorphism. Definition 1 describes the homomorphic property of a cryptosystem. Definition 1 (Homomorphic [33]). A public-key encryption scheme (Gen, Enc, Dec) is homomorphic if, for all n and all (PK, SK) output by Gen(1^n), it is possible to define groups M, C (depending on PK only) such that: 1. M is the message space, and all ciphertexts output by Enc_PK are elements of C. 2. Dec_SK(c_1 ∘ c_2) = m_1 ⊗ m_2 holds for any m_1, m_2 ∈ M, any c_1 output by Enc_PK(m_1), and any c_2 output by Enc_PK(m_2), where ∘ and ⊗ denote the group operations of C and M, respectively.
A partial homomorphic cryptosystem known as Paillier is used in the proposed schema. It is a public-key cryptography method with the partial homomorphic property, as it supports only two operations over encrypted data, secure addition and subtraction. Let p and q be n-bit primes and N = pq. N and (N, φ(N)) are the public key and the private key, respectively. (Let N > 1 be an integer. Then Z*_N is an abelian group under multiplication modulo N. Define φ(N) := |Z*_N|, the order of the group Z*_N.) Encryption computes c := (1 + N)^m · r^N mod N² for a random r ∈ Z*_N, and decryption computes m := ((c^φ(N) mod N² − 1)/N) · φ(N)^(−1) mod N. More details about Paillier are explained in [32].
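A toy sketch of this construction with g = N + 1 (the function names and the small fixed primes are ours, for illustration only; real deployments require large random primes and secure randomness):

```python
import math
import random

# Toy Paillier parameters: small fixed primes, illustration only.
P, Q = 5003, 4999
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)        # private key component: phi(N)
MU = pow(PHI, -1, N)           # precomputed inverse used in decryption

def encrypt(m):
    """c := (1 + N)^m * r^N mod N^2 for a random r coprime to N."""
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(2, N)
    return pow(1 + N, m % N, N2) * pow(r, N, N2) % N2

def decrypt(c):
    """m := ((c^phi(N) mod N^2 - 1) / N) * phi(N)^(-1) mod N."""
    u = pow(c, PHI, N2)
    return (u - 1) // N * MU % N
```

Multiplying two ciphertexts adds the underlying plaintexts modulo N, which is the additive homomorphic property exploited throughout the paper.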

K-Nearest Neighbors (K-NN)
K-nearest neighbors (K-NN) [6] is a supervised ML algorithm used for classification and predictive regression problems. It is called a lazy learning algorithm because it has no specific training phase and uses the whole training data [49] during classification. It does not assume anything about the underlying data. The distance d needs to be calculated in order to find the designated centroid (c_x_j, c_y_j). There are different methods to measure the distance in the K-NN algorithm, e.g., the Euclidean distance d_e (Equation (1)), the Manhattan distance d_m (Equation (2)), and the Cosine distance d_c (Equation (3)). In this study, we use the Manhattan distance d_m. Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) ∈ D. Algorithm 1 illustrates the entire process.
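In standard form, for points x = (x_1, ..., x_n) and c = (c_1, ..., c_n), these three distances are:

```latex
d_e(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2} \qquad (1)

d_m(x, c) = \sum_{i=1}^{n} \lvert x_i - c_i \rvert \qquad (2)

d_c(x, c) = 1 - \frac{\sum_{i=1}^{n} x_i c_i}
                     {\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} c_i^2}} \qquad (3)
```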
Algorithm 1 (excerpt): compute d_m_j by Equation (2); if d_m_j > t then ... 10: Put ...
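A minimal plaintext sketch of this classification step (standard K-NN with Manhattan distance and majority voting; the names and the voting rule are illustrative — the paper's Algorithm 1 additionally works against designated centroids and a threshold t):

```python
from collections import Counter

def manhattan(a, b):
    # d_m: sum of absolute coordinate differences (Equation (2)).
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_classify(train, labels, query, k=3):
    # Rank training points by Manhattan distance to the query,
    # then majority-vote over the labels of the k nearest points.
    ranked = sorted(range(len(train)), key=lambda i: manhattan(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because K-NN is lazy, all work happens at query time over the stored training data, which is why the paper's secure protocols concentrate on distance computation and comparison.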

Blockchain System
Blockchain is a public, shared ledger consisting of a list of blocks. It was developed in cryptocurrency systems, such as Bitcoin, for registering transactions, and it ensures secure transactions among untrusted participants. Various Blockchain platforms, e.g., Ethereum and HyperLedger, have been employed in different real-life sectors. According to the access restrictions placed on users, Blockchain platforms are classified into consortium Blockchains, private Blockchains, and public Blockchains.
There are various advantages of Blockchain:
• Decentralized: It is developed on a peer-to-peer network as a shared ledger, and there is no requirement for a trusted third party.
• Tamper-proof: Consensus protocols, such as Proof-of-Work (PoW), are employed by Blockchain; thus, data manipulation is impractical.
• Traceability: The remaining participants can easily verify the transactions between any two parties in a Blockchain system.
Despite these advantages, Blockchain's data privacy is vulnerable to skilled attackers. By default, all transactions are registered as plaintext in blocks, which exposes the transactions' vital information to other participants and adversaries [50]. Therefore, privacy and security issues must be handled cautiously when using Blockchain as a data-sharing platform.

Problem Description
This section illustrates the issues of secure K-NN training over encrypted IoT data accumulated from various parties, including the system design, threat types, and design purposes.

System Design
A data flow IoT ecosystem is developed and shown in Figure 1, including IoT devices, data providers, the Blockchain platform, and data analysts.

• IoT devices sense and transmit valuable information (e.g., medical health data in smart cities) through wired or wireless networks such as ZigBee, 3rd generation (3G)/4th generation (4G), and Wireless Fidelity (WiFi). In this study, due to their limited computational capabilities, IoT devices do not participate in the data sharing and analysis processes.

• Data providers gather all the data from the IoT devices within their range. Since the data comprise sensitive information, the data provider encrypts all the data using partially homomorphic encryption and registers them in a Blockchain.

• The Blockchain-based IoT platform serves as a distributed database that gathers the encrypted IoT data from all data providers; protocols are maintained, and all data are recorded in a shared ledger. The built-in consensus mechanism ensures that the IoT data are shared in a secure and tamper-proof way.

• IoT data analysts intend to gain deeper insight into the data registered on the Blockchain-based platform by using existing analysis techniques. Data analysts obtain encrypted data from the corresponding data providers in order to train the K-NN classifiers.


Threat Type
Various latent threats exist over individual entities and during their interactions, according to the system model description in Figure 1. This study focuses on the threats to data privacy throughout the interaction between the data providers and the data analysts. It is assumed that the data analyst is a curious-but-honest adversary: honest in following the predesigned ML training protocol, but curious about the contents of the data. Moreover, the data analyst strives to acquire further knowledge by analyzing the intermediate data during computation on encrypted data.
The following threat models are considered, based on the vital information the data analyst can collect, with attack assumptions commonly employed in the literature [51,52]. On the other hand, the data provider may also try to identify the data analyst's model parameters from the intermediate data.
• Recognized Ciphertext Model. The data analyst can merely obtain the encrypted IoT data registered on the Blockchain platform. The IoT data analyst can record intermediate outputs of the secure training algorithm, such as iteration steps.
• Recognized Background Model. The IoT data analyst seeks further details of the shared data. Beyond the recognized ciphertext model, the IoT data analyst may gather more information by using her previous knowledge; specifically, the IoT data analyst can conspire with distinct IoT data providers to infer the sensitive information of other participants.

Design Purposes
Consider that more than one IoT data provider and data analyst may conspire to steal other participants' privacy. Assume that all participants are curious-but-honest adversaries who execute the protocol honestly but take an interest in others' private information. Any number of participants may conspire with each other. The proposed method aims to shield each participant's privacy while securely training the K-NN classifiers. The security goals are as follows:
• When encountering a curious-but-honest adversary, the data analyst's and each individual data provider's data are protected from disclosure.

• When more than one party conspires with others, the data analyst's and each individual data provider's privacy will likewise be protected from disclosure.

The Construction of Secure K-NN
This section illustrates the system specifications of the proposed privacy-preserving K-NN training method over Blockchain-based encrypted IoT data.

System Overview
For clarity, consider that a data analyst intends to train a K-NN classifier based on data gathered from various IoT data providers. Figure 2 illustrates the system overview: each data provider preprocesses its IoT data instances, encrypts them locally with its own private key, and registers the encrypted data in the Blockchain-based distributed ledger. Existing key management mechanisms [53][54][55] can be applied to handle the encryption abilities of data providers. The IoT data analyst can train a K-NN classifier by collecting the encrypted data registered in the public ledger and building a protected algorithm from the building blocks SPO, SBO, and SC. During the training process, the IoT data analyst and the IoT data providers must interact to exchange intermediate outcomes.
However, it is essential to mention that many comparison tasks are necessary to train a K-NN model, and performing comparisons on encrypted data is extremely expensive and time-consuming. On the other hand, exact intermediate data cannot be shared, because the parameters of the K-NN algorithm are then easy to guess: in that situation, there is a high possibility that the data analyst could recover the original data. Therefore, to reduce the algorithm's complexity and make the method more practical and protected against privacy breaches of both data providers and data analysts, we introduce the SBO. The data provider adds a small amount of bias (δ) to protect the data's privacy when sharing the intermediate data. This small bias does not cause any significant change in the classification result.

Encrypted Data Sharing via Blockchain
To aid model training, and without loss of generality, consider that the data instances of a given training task have been locally preprocessed and represented by the corresponding feature vectors [16].
A unique transaction arrangement is defined in order to save the encrypted IoT data in the Blockchain. The proposed transaction structure primarily consists of two fields: input and output.
The input terminal comprises:

The input terminal comprises:
• The address of the data provider
• The encrypted version of the data
• The name of the IoT device that generated the data

The corresponding output terminal holds:
• The address of the data analyst
• The encrypted version of the data
• The name of the IoT device that generated the data

The addresses of the data provider and the data analyst are hash values, and the encrypted data is produced by the homomorphic encryption, i.e., Paillier. Assuming that the key length is 128 bytes, the length of an individual encrypted data instance is set to 128 bytes and stored in the Blockchain. The field for the IoT device type is 4 bytes long.
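A sketch of this layout as Python dataclasses (the class and field names are assumptions; the sizes follow the text: a 128-byte ciphertext and a 4-byte device-type tag):

```python
from dataclasses import dataclass

@dataclass
class TxEndpoint:
    # One side of a transaction: the input terminal (data provider)
    # or the output terminal (data analyst).
    address: bytes      # hash-derived address of the participant
    enc_data: bytes     # Paillier ciphertext of the data instance, 128 bytes
    device_type: bytes  # IoT device type tag, 4 bytes

    def __post_init__(self):
        assert len(self.enc_data) == 128, "ciphertext must be 128 bytes"
        assert len(self.device_type) == 4, "device tag must be 4 bytes"

@dataclass
class Transaction:
    tx_input: TxEndpoint   # data-provider side
    tx_output: TxEndpoint  # data-analyst side
```

Such a transaction is what the provider node broadcasts to the P2P network for miners to validate and package into a block.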
After assembling a new transaction, the node serving as the data provider broadcasts it in the P2P Blockchain network, where the miner nodes validate the correctness of the operation. A specific miner node can package the transaction into a new block and add the block to the existing chain using a current consensus algorithm, i.e., the PoW mechanism. Multiple transactions can be registered in a single block.

Building Blocks
Section 4 already specified that the goal is to secure the privacy of various IoT providers and design a privacy-preserving algorithm for training K-NN models over multiple private datasets afforded by diverse IoT providers.

K-NN
Several methods are available to calculate the distance, which is a model parameter of K-NN. In this research, the Manhattan distance d_m (Equation (2)) is used due to its simplicity of calculation. Algorithm 1 illustrates the entire process.

Secure Polynomial Operations (SPO)
In the proposed secure K-NN training schema, we develop secure polynomial addition and subtraction to securely train the K-NN model using the homomorphic property of Paillier.
Similarly, secure polynomial division can be achieved, as shown in Equation (5).
However, this research needs only secure polynomial addition and subtraction. The secure polynomial addition and subtraction are statistically indistinguishable, as Paillier itself is statistically indistinguishable [33].
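A standalone sketch of these two operations over a toy Paillier instance (small fixed primes, for illustration only; function names are ours):

```python
import math
import random

# Toy Paillier parameters so the secure polynomial operations run standalone.
P, Q = 5003, 4999
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)

def enc(m):
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(2, N)
    return pow(1 + N, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N

def spo_add(c1, c2):
    # Secure addition: multiplying ciphertexts adds the plaintexts mod N.
    return c1 * c2 % N2

def spo_sub(c1, c2):
    # Secure subtraction: multiply by the modular inverse of c2,
    # yielding an encryption of m1 - m2 (mod N).
    return c1 * pow(c2, -1, N2) % N2
```

Neither operation requires decrypting the operands, which is why no trusted third party is needed for these steps.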

Secure Biasing Operations (SBO)
In line 6 of Algorithm 2, the data provider P calculates the distance [[d_m_j]] using SPO. The next step is to send the encrypted distance [[d_m_j]] to the data analyst C. If P sent [[d_m_j]] directly, C would decrypt it and try to guess the private data of the data provider, with a high possibility of success, since C initiated the clustering point. Therefore, to protect the privacy of P's data while training the K-NN algorithm, we introduce the secure biasing operation (SBO): P adds a small amount of bias δ using SPO before sending the data to C. The bias δ is unknown to C.

Algorithm 2 (excerpt):
8: C sends flag 0 to P;
9: end
10: else
11: C sends flag 1 to P;
12: end

Algorithm 2: Secure Comparison
In this study, the range of the bias depends on the coefficient of variation (CV), where CV = StandardDeviation/Mean = σ/x̄. The standard deviation and the mean are computed as σ = √(∑(x_i − x̄)²/n) and x̄ = ∑x_i/n, respectively, where n is the total number of data points and x_i stands for each data point of the dataset. A CV value greater than or equal to one means that the data is scattered; a CV value less than one means that the data is not scattered. Therefore, various values for the bias were tried during the experiments: the range was within [1 ≤ δ ≤ 5] when CV < 1 and [−5 ≤ δ ≤ −1] when CV ≥ 1. We found that the bias ranges [−3 ≤ δ ≤ −1] and [1 ≤ δ ≤ 3] give proper classification results when CV ≥ 1 and CV < 1, respectively. Therefore, if the coefficient of variation is greater than or equal to 1 (CV ≥ 1), the range of δ is set to [−3 ≤ δ ≤ −1]; on the other hand, if the coefficient of variation is less than 1 (CV < 1), the range of δ is set to [1 ≤ δ ≤ 3]. Note that δ never equals 0.
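The CV-driven choice of δ can be sketched as follows (the function name is ours; population statistics are assumed):

```python
import random
import statistics

def pick_bias(data):
    # Coefficient of variation CV = sigma / mean (population statistics).
    cv = statistics.pstdev(data) / statistics.mean(data)
    # CV >= 1 (scattered data): delta drawn from [-3, -1];
    # CV <  1 (compact data):   delta drawn from [ 1,  3].
    # Both ranges exclude 0, so delta is never 0.
    return random.randint(-3, -1) if cv >= 1 else random.randint(1, 3)
```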

Secure Comparison (SC)
The secure comparison in the proposed method is defined as the comparison between two encrypted numbers [[m_1]] and [[m_2]]. When participants A and B engage in the secure comparison algorithm, neither party can obtain the other's real information. Our secure comparison protocol is exhibited in Algorithm 2, and the security proof is described in Section 6.

Proposition 1 (Security of the Secure Comparison Algorithm). Algorithm 2 is secure in the curious-but-honest model.
To develop SBO, we use the secure polynomial operations (mainly addition and subtraction) based on the homomorphic property of Paillier. A small bias δ (where −3 ≤ δ ≤ −1 or 1 ≤ δ ≤ 3, so δ ≠ 0) is encrypted and added to d_m_j by the data provider P using SPO (addition/subtraction). The data analyst C can never extract the exact value of the bias δ, or even its range, and this small bias does not affect the classification task, as shown in the performance evaluation in Section 7. SBO ensures the privacy of the data provider P. If δ is positive, the definition will be: Again, when δ is negative, the definition will be:
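A plaintext simulation of the SBO-plus-SC interaction (encryption is elided here; in the protocol the biased values travel as Paillier ciphertexts that C decrypts before comparing, and the function names are ours):

```python
import random

def provider_mask(m1, m2, scattered):
    # SBO step: P adds the SAME secret bias delta to both operands, so
    # the ordering of the pair is preserved while the true values stay
    # hidden from C. (In the real protocol these sums are Paillier
    # ciphertexts produced via SPO.)
    delta = random.randint(-3, -1) if scattered else random.randint(1, 3)
    return m1 + delta, m2 + delta

def analyst_compare(b1, b2):
    # SC step: C compares the biased values and returns only a flag:
    # 0 if b1 >= b2, else 1 (lines 8-11 of Algorithm 2).
    return 0 if b1 >= b2 else 1
```

Because the same δ is added to both operands, the flag C returns matches the comparison of the true values, yet C never sees m_1 or m_2.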

Training Algorithm of Secure K-NN
To protect the optimal design parameters, we outline a privacy-preserving K-NN training algorithm. Assume there is a single IoT data analyst C and n data providers P. Algorithm 3 specifies the training algorithm for secure K-NN. In Algorithm 3, the K-NN model parameters and the sensitive data of the IoT data providers are kept confidential. When facing any collusion or curious-but-honest adversaries, individual members cannot infer any vital information about another member from the intermediate outcomes of the algorithm's execution. Section 6 gives the security proofs for Algorithm 3.

Security Analysis
The security analysis is presented in this section under the recognized ciphertext model and the recognized background model. Two security definitions are followed: secure two-party computation [47] and modular sequential composition [48]. A protocol satisfying secure two-party computation is safe in the face of curious-but-honest adversaries, and modular sequential composition provides a way to develop secret protocols in a modular way. The security proof of the proposed algorithm is given based on these two definitions.

Background of Security Proof
We follow the notation of [14]. Let F = (F_A, F_B) be a polynomial function computed by a protocol π; A and B hold inputs a and b, respectively, and use π to compute F(a, b). A's view is the tuple view_A^π(λ, a, b) = (λ, a, m_1, m_2, ..., m_n), where m_1, m_2, ..., m_n are the messages received by A during the execution; B's view is defined in a similar manner. output_A^π(a, b) and output_B^π(a, b) are the outputs of A and B, respectively, and π's global output is output^π(a, b) = (output_A^π(a, b), output_B^π(a, b)).

Definition 2.
(Secure Two-Party Computation [47]). A protocol π privately computes F with statistical security if, for all possible inputs (a, b), there exist simulators S_A and S_B satisfying the following properties, where ≈ denotes computational indistinguishability against probabilistic polynomial-time adversaries with negligible advantage in the security parameter λ.

The fundamental idea of sequential modular composition is the following: a protocol π run by n participants calls an ideal functionality F, e.g., to calculate F privately, A and B send their inputs to a trusted third party and receive the result. If protocol π satisfies secure two-party computation, and a protocol ρ privately achieves the same functionality as F, then ρ can replace the ideal protocol for the functionality F in π; the resulting protocol π^ρ is then protected and safe under the curious-but-honest model [14,48].

Theorem 1 (Modular Sequential Composition [48]). Let F_1, F_2, ..., F_n be two-party probabilistic polynomial-time functionalities computed by the protocols ρ_1, ρ_2, ..., ρ_n in the presence of curious-but-honest adversaries. Let G be a two-party probabilistic polynomial-time functionality securely computed by a protocol π in the (F_1, F_2, ..., F_n)-hybrid model in the presence of curious-but-honest adversaries. Then π^{ρ_1, ρ_2, ..., ρ_n} securely computes G in the presence of curious-but-honest adversaries.
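The two indistinguishability conditions of Definition 2, restated in the standard simulation-based form:

```latex
\{\, S_A(\lambda, a, F_A(a,b)) \,\}_{(a,b)} \;\approx\; \{\, \mathrm{view}^{\pi}_{A}(\lambda, a, b) \,\}_{(a,b)}

\{\, S_B(\lambda, b, F_B(a,b)) \,\}_{(a,b)} \;\approx\; \{\, \mathrm{view}^{\pi}_{B}(\lambda, a, b) \,\}_{(a,b)}
```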

Security Proof for Secure Comparison
Two entities are involved in Algorithm 2: P and C. The bias δ is unknown to C; for that reason, C can never recover the real m_1 and m_2 from (m_1 + δ) and (m_2 + δ). After comparing (m_1 + δ) and (m_2 + δ), C returns a flag with value 0 if (m_1 + δ) ≥ (m_2 + δ); otherwise, the flag is 1. Since the bias shifts both operands equally, the comparison result is unchanged, yet C learns nothing beyond the ordering. C honestly follows the protocol of the method; therefore, C can never infer the values directly.
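The blinding idea above can be illustrated with a minimal sketch (this is an illustration of the biasing principle, not the paper's full Algorithm 2): P shifts both operands by the same secret bias δ before sending them, and C compares only the blinded values.

```python
import secrets

def provider_blind(m1: int, m2: int, bias_bits: int = 64):
    """P adds the same secret bias delta to both values.

    C never learns delta, so the blinded pair reveals only the ordering."""
    delta = secrets.randbits(bias_bits)
    return m1 + delta, m2 + delta

def analyst_compare(b1: int, b2: int) -> int:
    """C compares the blinded values: flag 0 if b1 >= b2, otherwise 1."""
    return 0 if b1 >= b2 else 1

b1, b2 = provider_blind(17, 42)
flag = analyst_compare(b1, b2)  # ordering of 17 vs. 42 is preserved: flag == 1
```

Because the same δ is added to both operands, the comparison outcome is identical to comparing the plaintexts, which is exactly why C learns the flag but not m_1 or m_2.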

Security Proof for Secure K-NN Training Algorithm
IoT data providers P and an IoT data analyst C are the roles involved in Algorithm 3. Every IoT data provider functions in the same manner, so if one of them can be shown to meet the security requirements, all of them do. The view of C is view_C^π = ((d_mj + δ), (c_xk, c_yk), PK_C, SK_C). Now the confidentiality of (d_mj + δ) needs to be discussed, i.e., whether the IoT data analyst can predict the private data of individual IoT data providers from this value. The IoT data analyst may try to compute the unknown x_i and y_i from the known distance (d_mj + δ) and centroid (c_xk, c_yk); clearly, this system has no solution for the unknown x_i and y_i. The data analyst cannot identify the points of the IoT data providers, because the distance is shifted by a bias value and the analyst knows neither the bias nor its range. Even with brute-force cracking, it is not possible to recover the real values of dataset D. Consider that an individual IoT data provider holds a small dataset of 100 two-dimensional instances, each dimension stored in 32 bits (a single-precision floating-point number typically occupies 4 bytes of memory), for a total of 2 × 100 × 32 = 6400 bits. Under this situation, the probability of the IoT data analyst guessing successfully is (1/2)^(n×6400), which is negligible [33]. The security of Algorithm 3 follows from modular sequential composition: since SPO, SBO, and SC are secure and are used as sub-protocols in Algorithm 3, the algorithm is secure in the curious-but-honest model.

Performance Evaluation
In this section, the performance of secure K-NN is evaluated for efficiency and accuracy through extensive analysis on real-world datasets. First, the experiment settings are described; then the effectiveness and efficiency are demonstrated by the experimental results.

Experiment Setup
This segment discusses the testbed, dataset, and all other tasks for data preprocessing and experimental environment.

Dataset
This study uses three real-world datasets, namely the Breast Cancer Wisconsin Data Set (BCWD), the Heart Disease Data Set (HDD), and the Diabetes Data Set (DD) [56,57]. These datasets are publicly available from the UCI machine learning repository. The features of BCWD are computed from a digitized image of a breast mass and describe characteristics of the cell nuclei present in the image; each data instance is labeled as benign or malignant. The HDD and DD contain 13 and 9 numeric attributes, respectively, and instances are classified by the type of heart disease and diabetes symptoms. Table 2 presents the statistics of the datasets. We run 10-fold cross-validation to avoid overfitting or contingent results, and the average results are recorded. 80% of the data is selected for model training and the remaining 20% for testing.
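The 10-fold splitting described above can be sketched as follows (a minimal stdlib-only sketch; the paper does not specify its splitting code, and the fold count and seed here are illustrative assumptions):

```python
import random

def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Split sample indices into k roughly equal folds for cross-validation.

    Each fold serves once as the test set; the remaining folds form the
    training set, and results are averaged over the k rounds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # deterministic shuffle for the demo
    folds = [idx[i::k] for i in range(k)]     # k disjoint folds covering all indices
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(100))  # 10 (train, test) index pairs
```

Each index appears in exactly one test fold, so averaging the per-fold scores gives the cross-validated estimate reported in the evaluation.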

Float Format Conversion
The general K-NN training algorithm operates on both integer and floating-point numbers, depending on the dataset. However, all operations of the cryptosystem are performed on integers; therefore, for safety, a format conversion into an integer representation must be performed. Let D be a binary floating-point number represented as D = (−1)^s × M × 2^E according to the global standard IEEE 754, where s is the sign bit, M the significand, and E the exponent. A data analyst may perform this format conversion during the implementation of secure K-NN, depending on the dataset type.
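One common way to realize this conversion (a sketch of the general fixed-point technique; the paper does not fix the scaling factor, so `FRACTION_BITS` below is an assumption) is to scale each float by a power of two, round to an integer before encryption, and divide the scale back out after decryption:

```python
FRACTION_BITS = 16  # assumed precision; not specified by the paper

def float_to_int(x: float, f: int = FRACTION_BITS) -> int:
    """Scale a float into a fixed-point integer that Paillier can encrypt."""
    return round(x * (1 << f))

def int_to_float(n: int, f: int = FRACTION_BITS) -> float:
    """Invert the scaling after decryption."""
    return n / (1 << f)

x = 3.14159
roundtrip = int_to_float(float_to_int(x))  # differs from x by at most 2**-17
```

Note that Paillier plaintexts live in Z_N, so negative values must additionally be encoded, e.g., as N − |m|, and products of scaled values accumulate extra factors of 2^f that must be tracked.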

Key Length Setting
The security of a public-key cryptosystem is closely tied to its key length, and a compact key leads to vulnerable encryption. Conversely, a long key reduces the efficiency of the homomorphic operations, while a too-short key may cause the plaintext space to overflow during the homomorphic operations (i.e., the secure polynomial operation and the secure biasing operation) on the ciphertext. Therefore, it is crucial to choose the key length so as to avoid the possibility of overflow. The Paillier key N is set to 1024 bits in secure K-NN.
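The overflow concern can be made concrete with a toy Paillier implementation (deliberately tiny primes for illustration only; secure K-NN uses a 1024-bit N, and a vetted library should be used in practice). Homomorphic addition is ciphertext multiplication modulo N², and the plaintext sum must stay below N:

```python
import math
import random

def paillier_keygen(p: int = 10007, q: int = 10009):
    """Toy Paillier key generation with tiny fixed primes (demo only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)            # modular inverse; gcd(lam, n) == 1 here
    return (n,), (n, lam, mu)       # public key, private key

def encrypt(pub, m, rng=random.SystemRandom()):
    (n,) = pub
    n2 = n * n
    r = rng.randrange(1, n)
    # with g = n + 1, g^m mod n^2 simplifies to 1 + m*n
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n

pub, priv = paillier_keygen()
n2 = pub[0] ** 2
c = encrypt(pub, 12) * encrypt(pub, 30) % n2  # homomorphic addition: 12 + 30
# decrypt(priv, c) == 42
```

If the accumulated plaintext sum reached N, the result would wrap around modulo N and silently corrupt the training computation, which is exactly why the 1024-bit modulus must comfortably exceed the largest intermediate value.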
Classification performance is measured with precision and recall, Precision = t_p / (t_p + f_p) and Recall = t_p / (t_p + f_n), where t_p is the number of relevant instances (the positive class) labeled correctly, t_n is the number of irrelevant instances (the negative class) labeled correctly, f_p is the number of irrelevant instances mislabeled as relevant, and f_n is the number of relevant instances mislabeled as irrelevant in the test outcomes. General K-NN was implemented in plain Python in order to demonstrate that secure K-NN does not lose accuracy while preserving each IoT data provider's privacy and training the classifier securely. Since the main focus is training the classifier securely, the training parameters are not tuned and the defaults are used. The precision and recall results are summarized in Table 3. The performance of secure K-NN is almost identical to that of standard K-NN and better than that of SVM [29]. However, the data provider must choose the bias value carefully, because a larger bias may reduce the classifier's performance. The proposed design shows good robustness on both discrete-attribute and numerical-attribute datasets.

Table 4 reports the running time of the SPO with encrypted datasets in Algorithm 3, together with the time consumption of the IoT data providers P, the data analyst C, and the total. According to the results in Table 4, secure K-NN spends less than an hour training on the encrypted datasets BCWD, HDD, and DD, which is an acceptable time consumption for a stand-alone algorithm. It is worth mentioning that general K-NN is comparatively slow, so it is better not to train a K-NN algorithm on a large dataset all at once; we recommend splitting a larger dataset into small portions before training secure K-NN. In our implementation, multi-threading in Python is used to control the running time on larger datasets.
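The precision and recall formulas implied by the definitions above amount to:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = tp / (tp + fp); Recall = tp / (tp + fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical counts for illustration (not from Table 3):
# 90 true positives, 10 false positives, 30 false negatives
p, r = precision_recall(90, 10, 30)  # p == 0.9, r == 0.75
```

Precision penalizes false positives while recall penalizes false negatives, so reporting both, as Table 3 does, guards against a classifier that trades one for the other.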

Building Blocks Evaluation
In this experiment, the various P are simulated sequentially; therefore, the P time shown in Table 4 is the accumulated time consumed by the different P. In a real-world application, the various P can run their algorithms in parallel, so the time consumption of P, and hence the total time consumption, can be reduced. We believe Algorithm 3 to be useful for real-world sensitive applications. Across the different datasets, BCWD and DD (numerical attributes) and HDD (discrete attributes), secure K-NN shows satisfactory robustness in time consumption.

Scalability Evaluation
Secure K-NN assumes that various IoT data providers are engaged and contributing data. We partition the dataset into identical sections to mimic the situation of different IoT data providers, and we observe the fluctuation in time consumption to evaluate the scalability of the proposed scheme as multiple IoT data providers join the calculation. Cases are simulated with the number of IoT data providers rising from 1 to 5. The outcomes are shown in Figure 3, where the X-axis represents the number of IoT data providers involved in the calculation and the Y-axis represents the time consumption. Theoretically, the time consumption of secure K-NN is proportional to the amount of data and the number of iterations in the comparison portion. If the total amount of data and the data quality are fixed, increasing the number of P does not affect the time consumption, and the time consumption of P or C remains the same for different numbers of P. There is a small fluctuation in the total time consumption as the number of P rises from 1 to 5, because the program's running time is disturbed by other host processes used for the simulation.

Conclusions
This paper introduced a novel privacy-preserving K-NN training method called secure K-NN, which addresses both data privacy and data integrity concerns. It employs Blockchain technology to train the algorithm in a multi-party scenario where IoT data are received from various data providers, and it applies a partial homomorphic cryptosystem known as Paillier to assemble an effective and reliable method. The efficiency and security of secure K-NN are demonstrated in this study: the proposed method achieves almost the same accuracy as general K-NN and outperforms the earlier state of the art. Future work includes developing a versatile structure that allows assembling a broad range of privacy-preserving ML training algorithms in a multi-party scenario with encrypted datasets.