Efficient Privacy-Preserving K-Means Clustering from Secret-Sharing-Based Secure Three-Party Computation

Privacy-preserving machine learning has become an important research topic due to privacy regulations. However, an efficiency gap between plaintext algorithms and their privacy-preserving versions still exists. In this paper, we focus on designing a novel secret-sharing-based K-means clustering algorithm. In particular, we present an efficient privacy-preserving K-means clustering algorithm based on replicated secret sharing with an honest majority in the semi-honest model. More concretely, the clustering task is outsourced to three semi-honest computing servers. We prove that the proposed privacy-preserving scheme provides full data privacy. Furthermore, the experimental results demonstrate that our privacy-preserving version reaches the same accuracy as the plaintext one. Compared with the existing privacy-preserving scheme, our protocol achieves about 16.5×–25.2× faster computation and 63.8×–68.0× lower communication. Consequently, the proposed privacy-preserving scheme is well suited for secret-sharing-based secure outsourced computation.


Introduction
With the rapid development of machine learning, Machine Learning as a Service (MLaaS) has become a popular business. Machine learning is now widely applied in fields such as finance, healthcare, and image recognition. Major companies such as Microsoft, Google, and Amazon have begun to provide cloud-based MLaaS. In general, these services allow the machine learning models to be updated and improved using input data from their users. To obtain a high-precision model, companies tend to come together and train a common model on their combined datasets.
However, with the growing awareness of privacy, the problems caused by data privacy leakage have become increasingly prominent. On the one hand, a user of an MLaaS service hopes that the service is conducted without revealing any information about their queries and prediction results. On the other hand, companies want to train a common model without sharing their datasets. Therefore, it is important to find a secure way to perform privacy-preserving machine learning (PPML).
Privacy-preserving machine learning can be traced back to privacy-preserving data mining, which was first introduced by Lindell and Pinkas [1]. Since then, more and more researchers have focused on privacy-preserving machine learning.
Datasets can be divided into two main types: labeled data and unlabeled data. Generally speaking, the former is used with supervised learning algorithms when training a model, while the latter is used with unsupervised learning algorithms [2]. In recent years, most PPML solutions have considered only supervised learning, and unsupervised learning has received much less attention.
Clustering is an unsupervised machine learning technique in which similar input records are grouped into the same cluster, while records belonging to different clusters should be maximally different [3].
In this work, we focus on SS-based K-means clustering. Doganay et al. [22] proposed distributed privacy-preserving K-means clustering with additive secret sharing (ASS), but their work reveals the final cluster assignments to the parties. Patel et al. [23,24] proposed K-means algorithms under different security models. Upmanyu et al. [25] and Baby and Chandra [26] each presented a distributed threshold secret sharing scheme based on the Chinese remainder theorem (CRT-SS). However, none of them provides full privacy guarantees, as shown in [3]. In 2020, Mohassel et al. [27] presented a 2PC K-means clustering protocol based on 2-out-of-2 additive secret sharing; although it provides full data privacy, it is inefficient in terms of computation and communication overhead because it relies heavily on garbled circuits and oblivious transfer (OT). Therefore, this scheme is not practical for large-scale clustering tasks.
As shown above, most existing privacy-preserving K-means clustering protocols do not provide full data privacy. In addition, an efficiency gap compared with plaintext training still exists, and this inefficiency limits practicality, especially for large-scale training tasks. In this work, we construct a privacy-preserving K-means clustering protocol that provides full data privacy for security and high efficiency for practicality.

Our Contributions
In this work, we focus on SS-based K-means clustering schemes and provide a comparison with the existing SS-based K-means clustering schemes in Table 1. We propose an efficient three-party computation protocol for privacy-preserving K-means clustering. Concretely, our contributions are as follows:
• Our protocol provides full privacy guarantees, which allows different computing parties to cluster the combined datasets without revealing any other information except the final centroids.
• Our protocol is based on replicated secret sharing (RSS), a 2-out-of-3 threshold secret sharing scheme proposed by Araki et al. [28] that is suitable for constructing efficient protocols over Z_{2^ℓ}. Our protocol is secure against a single corrupted server under the semi-honest model. We analyze its security in the universal composability framework [29].
• The experimental results demonstrate that our protocol reaches the same accuracy as the plaintext K-means clustering algorithm. With a fast network, our privacy-preserving scheme can handle datasets of a million points in an acceptable time.

Table 1. Comparison with existing SS-based K-means clustering schemes.

Scheme                  Threat Model   Techniques   Domain
Upmanyu et al. [25]     Semi-honest    CRT-SS       Z_p
Patel et al. [23]       Semi-honest    SSS          Z_p
Patel et al. [24]       Malicious      SSS+ZKP      Z_p
Baby and Chandra [26]   N/A            CRT-SS       Z_p
Mohassel et al. [27]    Semi-honest    ASS+GC+OT    Z_{2^ℓ}
This work               Semi-honest    RSS          Z_{2^ℓ}

Roadmap
The remaining sections are organized as follows. Section 2 gives the basic notation, the threat model and security assumptions of secure computation, and the plaintext K-means clustering algorithm. Section 3 presents the cryptographic building blocks. Section 4 proposes our efficient three-party protocol construction. Section 5 gives a detailed security analysis of our protocol. Section 6 reports the experimental results of our construction. Finally, Section 7 concludes this paper.

Basic Notation
We denote party i by P_i for each i ∈ {1, 2, 3}. For simplicity, we define P_0 = P_3 and P_4 = P_1 in context. x ∈_R F denotes that x is chosen uniformly at random from the finite set F. We write a bold letter v to denote a d-dimensional vector; the j-th component of vector v is v_j. If x is an ℓ-bit number, then x[i] is its i-th bit. Let κ be the security parameter. We use [n] to denote the set {1, …, n}. Furthermore, we assume all floating-point data are encoded as ℓ-bit fixed-point numbers with f-bit precision, where f < ℓ.

Threat Model and Security Assumption
Our protocol follows a static, semi-honest model [30] under the honest-majority setting, i.e., the adversary A corrupts only a single, fixed party during protocol execution. In this setting, the corrupted party follows the protocol honestly but tries to learn the inputs of the other parties from the received messages; hence, the semi-honest model is also called the passive model. Furthermore, we assume the parties communicate through secure channels and that the network is synchronous.
We prove security in the universally composable framework [29] using the ideal-real paradigm [30]. Let F be the ideal functionality executed by a trusted third party (TTP) in the ideal world, and Π be the real protocol executed by all parties in the real world. In the ideal world, a simulator Sim plays the role of the adversary A. Let C be the set of corrupted parties and x_i be P_i's input. We define the ideal and real interactions as follows:
• Ideal_{F,Sim}(κ, C; x_1, …, x_n): compute (y_1, …, y_n) ← F(x_1, …, x_n); output (Sim(C, {(x_i, y_i) : i ∈ C}), (y_1, …, y_n)), where y_i is P_i's output.
• Real_{Π,A}(κ, C; x_1, …, x_n): run the protocol Π; output ({View_i : i ∈ C}, (y_1, …, y_n)), where View_i is the final view of P_i.
We say the protocol Π securely computes the functionality F in the semi-honest model if the simulator's view in the ideal world is indistinguishable from the adversary's view in the real world. We refer the reader to [30] for more details.

The K-Means Clustering Algorithm
Given a dataset D = {P_1, …, P_n} with n data points, each point P_i is a d-dimensional vector (P_i1, …, P_id); together they form the n × d matrix P. A standard K-means clustering algorithm consists of the following steps [10,27]:

1. Cluster centroid initialization: randomly choose K different points as the initial centroids φ_1, …, φ_K.
2. Repeat the following until the stopping criterion is met (Lloyd's steps):
(a) For i ∈ [n], k ∈ [K], compute the Euclidean distance X_ik between point P_i and centroid φ_k by Equation (1).
(b) Assign each data point P_i to the closest cluster. This can be done by first computing k_i ← arg min{X_i1, …, X_iK}, and then generating a K-dimensional one-hot vector c_i whose k_i-th component is '1'. Form the K × n matrix C such that the i-th column of C is the one-hot vector c_i, and let m_k be the k-th row of C.
(c) Recalculate the average of the points in each cluster. For each cluster k ∈ [K], compute the new cluster center ϕ_k = (∑_{i=1}^n m_ki · P_i) / D_k, where D_k = ∑_{i=1}^n m_ki is the number of points in the k-th cluster.
(d) Check the stopping criterion and update the cluster centers. For each k ∈ [K], first compute the Euclidean distance between ϕ_k and φ_k, and then the total squared error e. Given a small error ε, if e ≥ ε, update φ_k with ϕ_k; otherwise, stop and output ϕ_k.
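The plaintext Lloyd's steps above can be sketched in pure Python (a minimal illustration; the function and parameter names are ours, not from the paper):

```python
def kmeans(P, centroids, eps=1e-4, max_iter=100):
    """Plaintext Lloyd's iterations (sketch)."""
    phi = [c[:] for c in centroids]
    K, d = len(phi), len(P[0])
    for _ in range(max_iter):
        # Steps (a)+(b): assign each point to its nearest centroid,
        # using the squared Euclidean distance.
        assign = []
        for p in P:
            dists = [sum((p[j] - c[j]) ** 2 for j in range(d)) for c in phi]
            assign.append(dists.index(min(dists)))
        # Step (c): recompute each centroid as the mean of its cluster.
        new_phi = []
        for k in range(K):
            members = [p for p, a in zip(P, assign) if a == k]
            if members:
                new_phi.append([sum(m[j] for m in members) / len(members)
                                for j in range(d)])
            else:
                new_phi.append(phi[k][:])   # keep an empty cluster's centroid
        # Step (d): stop once the total squared centroid movement is small.
        e = sum(sum((new_phi[k][j] - phi[k][j]) ** 2 for j in range(d))
                for k in range(K))
        phi = new_phi
        if e < eps:
            break
    return phi
```

For instance, clustering the four points (0,0), (0,1), (10,10), (10,11) from initial centroids (0,0) and (10,10) converges to (0, 0.5) and (10, 10.5).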

Building Blocks
This section gives the building blocks for our privacy-preserving K-means clustering protocol.

Correlated Randomness
In order to generate randomness among the parties without any interaction, we introduce correlated randomness to this work, similar to [28,31,32].
Let F : Z_2^κ × Z_2^κ → Z_{2^ℓ} be a secure pseudo-random function (PRF), and let count be a counter maintained by the parties and updated after every PRF invocation. All parties run a one-time setup to establish the PRF keys: each P_i chooses a random key k_i and sends it to P_{i+1}, so that each P_i holds the PRF keys k_i and k_{i−1} after the setup. In this way, the parties can generate correlated randomness locally, e.g., α_i = F_{k_i}(count) − F_{k_{i−1}}(count). Note that this 3-out-of-3 randomness has the property that α_1 + α_2 + α_3 ≡ 0 mod 2^ℓ, which is known as zero sharing.
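The zero-sharing trick can be sketched as follows, with HMAC-SHA256 standing in for the PRF F (an assumption made for illustration; any secure PRF works). Since P_i holds keys k_i and k_{i−1}, the three shares α_i = F_{k_i}(count) − F_{k_{i−1}}(count) telescope to 0 mod 2^ℓ:

```python
import hmac, hashlib

L = 64
MOD = 1 << L

def prf(key: bytes, count: int) -> int:
    """PRF F_k(count) into Z_{2^l}; HMAC-SHA256 is a stand-in choice."""
    digest = hmac.new(key, count.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % MOD

def zero_shares(keys, count):
    """alpha_i = F_{k_i}(count) - F_{k_{i-1}}(count); the shares sum to 0."""
    return [(prf(keys[i], count) - prf(keys[i - 1], count)) % MOD
            for i in range(3)]
```

Each party computes its own α_i locally from its two keys, with no interaction, and the same count value yields the same zero sharing on every invocation.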

Replicated Secret Sharing
Replicated secret sharing (RSS) was first proposed by Ito et al. [33]. In CCS'16, Araki et al. [28] presented a 2-out-of-3 replicated secret sharing scheme with high throughput and low latency. As a well-known 3PC framework, ABY3 is also based on this variant over the 2-power ring Z_{2^ℓ} [31], and our protocol builds on ABY3. Let m be a general modulus; we describe this replicated secret sharing as follows.
• ⟦x⟧ ← share(x): To share a secret x ∈ Z_m, the dealer samples three random values x_1, x_2, x_3 ∈ Z_m such that x = x_1 + x_2 + x_3 mod m, and sends (x_i, x_{i+1}) to P_i.
• x ← reveal(⟦x⟧): To reveal x to all parties, each P_i sends x_i to P_{i+1}; then each party reconstructs x locally by computing x = x_1 + x_2 + x_3 mod m. To reveal x only to P_i, P_{i−1} sends x_{i−1} to P_i, which reconstructs x locally.
In this work, m can take different values. When m = 2^ℓ, we call ⟦x⟧ an arithmetic share, denoted ⟦x⟧ or ⟦x⟧^A. When m = 2, we call it a boolean share, denoted ⟦x⟧^B.
Linear operations on shares can be computed locally due to linearity: given public constants a, b, c and two shares ⟦x⟧ = (x_1, x_2, x_3), ⟦y⟧ = (y_1, y_2, y_3), the share ⟦ax ± by ± c⟧ can be computed locally as (ax_1 ± by_1 ± c, ax_2 ± by_2, ax_3 ± by_3). To compute the shares of the multiplication z = xy, the parties first generate a zero sharing locally, and then each P_i locally computes the 3-out-of-3 share z_i = x_i y_i + x_{i+1} y_i + x_i y_{i+1} + α_i for i ∈ [3]. Finally, the parties re-share to return to 2-out-of-3 sharing semantics, which is done by each P_i sending z_i to P_{i+1}. It is easy to see that each party sends only 1 ring element per multiplication; compared with ASS in the 3PC setting, RSS thus reduces communication overhead by 50%. In this context, we denote the RSS multiplication protocol as Π_Mul.
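A minimal plaintext sketch of RSS sharing, reconstruction, and the local multiplication step follows (hypothetical helper names; the final re-sharing round is omitted, so mul returns the 3-out-of-3 shares z_i):

```python
import random

L = 64
MOD = 1 << L

def share(x):
    """2-out-of-3 replicated shares: x = x1 + x2 + x3 mod 2^l,
    and P_i holds the pair (x_i, x_{i+1})."""
    x1, x2 = random.randrange(MOD), random.randrange(MOD)
    x3 = (x - x1 - x2) % MOD
    s = [x1, x2, x3]
    return [(s[i], s[(i + 1) % 3]) for i in range(3)]

def reveal(shares):
    """Reconstruct x = x1 + x2 + x3 mod 2^l."""
    return (shares[0][0] + shares[1][0] + shares[2][0]) % MOD

def mul(xs, ys, alphas):
    """Each P_i locally computes z_i = x_i*y_i + x_{i+1}*y_i + x_i*y_{i+1}
    + alpha_i, where (alpha_1, alpha_2, alpha_3) is a zero sharing."""
    return [(xs[i][0] * ys[i][0] + xs[i][1] * ys[i][0]
             + xs[i][0] * ys[i][1] + alphas[i]) % MOD
            for i in range(3)]
```

Summing the three z_i recovers xy mod 2^ℓ, since the nine cross terms x_i y_j are each covered exactly once and the α_i sum to zero.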
If the dealer wants to share a random value r ∈_R Z_{2^ℓ} with all parties for the j-th time, the PRF keys from the setup can be used: in particular, P_i sets r_i = F_{k_i}(j) and r_{i−1} = F_{k_{i−1}}(j). If P_1 wants to share a private input x with all parties, the parties first generate another zero sharing (β_1, β_2, β_3), define the share as ⟦x⟧ := (β_1 + x, β_2, β_3), and P_1 sends x_1 = β_1 + x to P_3 in the end.
Note that decimals inevitably appear in the computation, while secret sharing only works over integer rings. To represent a real number x̃ ∈ R, we use a fixed-point representation with f-bit precision [34]: we scale x̃ by a factor of 2^f and represent the rounded integer x = 2^f · x̃ as an ℓ-bit signed integer over Z_{2^ℓ}. However, the multiplication result z of two shares ⟦x⟧ and ⟦y⟧ then has 2f-bit precision.
To reduce the precision from 2f bits back to f bits, Mohassel and Rindal [31] introduced two probabilistic truncation techniques; in this work, we use the second one. First, the parties generate a random truncation pair (⟦s′⟧, ⟦s⟧) in the preprocessing phase, where s′ ∈ Z_{2^ℓ} has 2f-bit precision and s ∈ Z_{2^ℓ} has f-bit precision, such that s = s′/2^f. In the online phase, the parties jointly compute ⟦z − s′⟧, open z − s′, and then locally compute ⟦z′⟧ = ⟦s⟧ + (z − s′)/2^f. The truncation-induced error is only 2^{−f}.
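The truncation step can be illustrated on plaintext values (a sketch under our notation; truncate and the sample pair are hypothetical, and in the protocol z − s′ is opened from shares rather than computed in one place):

```python
L, F = 64, 13
MOD = 1 << L

def truncate(z, s_big, s_small):
    """Open z - s', arithmetic-shift the opened value down by f bits,
    and re-mask with s, where (s_big, s_small) satisfies s_small = s_big >> f."""
    opened = (z - s_big) % MOD
    if opened >= MOD // 2:      # interpret the opened value as signed first
        opened -= MOD
    return (s_small + (opened >> F)) % MOD
```

The result is off by at most one unit in the last fixed-point place, matching the stated 2^{−f} error.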

Oblivious Selection Protocol
As an important part of our privacy-preserving K-means clustering protocol, we define the oblivious selection functionality F_OS, which takes the arithmetic shares ⟦x⟧, ⟦y⟧ and the boolean share ⟦b⟧^B as input, and returns ⟦x⟧ if b = 0 and ⟦y⟧ otherwise. Note that F_OS can be written as ⟦z⟧ = ⟦x⟧ + b · (⟦y⟧ − ⟦x⟧). It may seem that the parties only need to compute one multiplication between (y − x) and b to implement F_OS. However, this is non-trivial, since ⟦y − x⟧^A is an arithmetic share while ⟦b⟧^B is a boolean share, and RSS multiplication only works on shares of the same type.
A natural idea is to convert ⟦b⟧^B to ⟦b⟧^A using the bit injection protocol Π_B2A from ABY3 [31], which requires three-party OT. Instead of using OT, we implement the conversion with the Beaver trick [35]. Suppose the parties obtain precomputed random bit shares ⟦c⟧^B and ⟦c⟧^A in the preprocessing phase. In the online phase, the parties compute and reconstruct the bit e = b ⊕ c, followed by setting ⟦b⟧^A = ⟦c⟧^A + e − 2e · ⟦c⟧^A, since b = c ⊕ e = c + e − 2ce. Since the random bit shares ⟦c⟧^B and ⟦c⟧^A are used as a one-time pad, the corrupted party cannot learn any information about b; even though the parties reveal the masked value e in the clear, the oblivious selection protocol remains secure. Looking ahead, the oblivious selection protocol is used to find the index of the minimum component of a vector and to generate the one-hot vector.
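On plaintext values, the Beaver-style bit conversion and the selection itself look as follows (a sketch with hypothetical names; in the protocol every quantity except e stays secret-shared):

```python
MOD = 1 << 64

def b2a(b, c_bool, c_arith):
    """Convert a boolean bit b to arithmetic form using a precomputed
    random bit c: open e = b XOR c, then b = c + e - 2*e*c mod 2^l."""
    e = b ^ c_bool              # e is safe to reveal: c one-time-pads b
    return (c_arith + e - 2 * e * c_arith) % MOD

def oblivious_select(x, y, b_arith):
    """F_OS: return x if b == 0 and y if b == 1, as x + b*(y - x)."""
    return (x + b_arith * ((y - x) % MOD)) % MOD
```

Checking all four (b, c) combinations confirms the identity b = c + e − 2ce used in the conversion.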

Secure Euclidean Squared Distance Protocol
As discussed in Section 2.3, we need to compute the Euclidean distance between two points using Equation (1). However, the square root is unfriendly to SS-based protocols, since it is a nonlinear operation with expensive communication overhead. Note that f(x) = √x is monotonically increasing, so it can be dropped without affecting the clustering result, because all we need is the order relation between distances. In this way, we replace Equation (1) with the Euclidean squared distance ∑_{j=1}^d (x_j − y_j)², and similarly replace Equation (3) with Equation (5). We now focus on the following problem: given the shares of the vectors x = (x_1, …, x_d) and y = (y_1, …, y_d), how to compute the share of their Euclidean squared distance. We define this functionality as F_SED. Observe that this can be done by first computing ⟦z⟧ = ⟦x − y⟧ = (⟦x_1 − y_1⟧, …, ⟦x_d − y_d⟧), and then computing the inner product of z with itself. A naive approach is for the parties to invoke the RSS multiplication d times, consume d truncation pairs to truncate the results, locally sum the results, and re-share. The communication is O(d) ring elements, which depends on the vector dimension d.
In this work, we use the delayed re-share technique [31] to reduce the communication complexity of the inner product to only O(1) ring elements, independent of d.
Let (⟦s′⟧, ⟦s⟧) be the shares of a truncation pair among the parties, where s′ has 2f-bit precision and s has f-bit precision, such that s = s′/2^f. The delayed re-share technique is summarized by Equation (6): in a word, the parties first locally compute a 3-out-of-3 additive sharing of each x_i y_i, then sum them together, mask, truncate, and re-share the final result for 2-out-of-3 replicated sharing semantics. It is easy to see that this requires each party to communicate only 1 ring element, independent of d. Furthermore, the truncation-induced error is only 2^{−f} with respect to the overall inner product.
The protocol for secure Euclidean squared distance is described in Figure 1.

Secure Comparison Protocol
In order to obtain the order relation between two given values, we must consider how to perform comparison when the values are shared. We define the secure comparison functionality F_LT, which takes ⟦x⟧ and ⟦y⟧ as input and returns the boolean share ⟦b⟧^B, where b = 1 if x < y and b = 0 otherwise.
Let a = x − y; then b can be computed by extracting the most significant bit (MSB) of a, i.e., b = MSB(a). Instead of using the optimized parallel-prefix-adder-based bit extraction protocol from ABY3 [31], we adopt the more efficient alternative of Wagh et al. [32].
From Equation (7), we observe that each P_i can compute MSB(a_i) locally; thus, the main challenge is to compute the carry bit c securely. Note that Equation (8) is equivalent to c = (2a_1 + 2a_2 + 2a_3 ≥ 2^ℓ), which can be computed by the wrap function. Wagh et al. [32] give a protocol for the wrap function, denoted Π_WA; we refer the reader to their work for correctness and security.
The secure comparison protocol is described in Figure 2. Furthermore, if one of the secrets, e.g., y, is known to all parties, the share of y can be defined without any interaction by letting ⟦y⟧ := (y + α_1, α_2, α_3), where (α_1, α_2, α_3) is a zero sharing generated from the PRF keys. Thus, the secure comparison protocol still works in this case, which we denote as ⟦b⟧^B ← F_LT(⟦x⟧, y).

Secure Assignment Protocol
Recall that once the parties obtain the shares of the Euclidean squared distances between a point and all centroids, the next step is to assign this point to the closest cluster. This step can be abstracted as the following question: given a K-dimensional secret-shared vector ⟦v⟧ = (⟦v_1⟧, …, ⟦v_K⟧), how to compute the secret-shared one-hot vector ⟦e⟧, where '1' appears in the k-th component and k = arg min{v_1, …, v_K}. We denote the secure assignment functionality as F_Assign. The idea is straightforward; we implement it with the protocol Π_Assign (see Figure 3).

⟦e⟧ ← Π_Assign(⟦v⟧)
Input: The parties hold shares of the vector v = (v_1, …, v_K) over Z_{2^ℓ}.
Output: The parties get arithmetic shares of the one-hot vector e = (e_1, …, e_K), where e_k = 1 for k = arg min{v_1, …, v_K} and e_j = 0 for all j ≠ k.
Online:
1. Set ⟦v_min⟧ ← ⟦v_1⟧ and ⟦e⟧ ← ⟦e_1⟧.
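On plaintext values, the compare-and-select scan performed by Π_Assign can be sketched as follows (hypothetical helper; the comparison plays the role of F_LT, and the blended updates play the role of F_OS):

```python
def assign_one_hot(v):
    """Scan v, keeping the running minimum and a one-hot indicator of
    its position, using only compare (F_LT) and select (F_OS) patterns."""
    K = len(v)
    e = [1] + [0] * (K - 1)
    v_min = v[0]
    for k in range(1, K):
        b = 1 if v[k] < v_min else 0            # F_LT
        v_min = v_min + b * (v[k] - v_min)      # F_OS on the running minimum
        one_hot_k = [1 if j == k else 0 for j in range(K)]
        e = [e[j] + b * (one_hot_k[j] - e[j]) for j in range(K)]  # F_OS per slot
    return e
```

Because every update has the form x + b·(y − x), the secret-shared version never branches on the comparison bit, so the access pattern is data-independent.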

Secure Division Protocol
As shown in Section 2.3, the parties need to recalculate the average vector of the points in each cluster in a secure way. Note that the average splits into an addition and a division, where the addition can be computed locally; hence, the key point is to compute the division. If the divisor were known to all parties, the division could be computed locally. However, the divisor here is the number of points in each cluster, which must be protected since full data privacy is required; thus, the computation of the division becomes difficult in our scenario. We define the secure division functionality F_Div as follows: given the secret-shared values ⟦a⟧ and ⟦b⟧ with b ∈ Z^+, the parties compute the share ⟦c⟧ such that c = a/b.
Instead of invoking a division garbled circuit protocol, we implement the division with a numerical method, one of the most commonly used techniques for constructing SS-based secure protocols due to its efficiency. In this work, we implement the division using Goldschmidt's algorithm [36], which approximates the desired operation by a series of multiplications.
Let w_0 be an initial approximation of 1/b, and let ε_0 := 1 − b · w_0 be the relative error of the approximation w_0, such that |ε_0| < 1. The Goldschmidt algorithm iteratively computes Equation (9): c_0 = a · w_0, and c_{i+1} = c_i · (1 + ε_i), ε_{i+1} = ε_i², where t is the number of iterations. As t → +∞, c_t converges to a/b [36]. We set t = 2, which is sufficient for a close approximation with our choice of fixed-point precision. Catrina and Saxena [34] give a good initial approximation for b in the interval [0.5, 1), namely w_0 = 2.9142 − 2b. However, the major challenge in our setting is that b ∈ Z^+ does not lie in the interval [0.5, 1). We therefore use the technique proposed by Wagh et al. [32]: the key insight is that b is interpreted as a value with (α + 1)-bit fixed-point precision instead of f-bit precision, where α ∈ Z is such that 2^α ≤ b < 2^{α+1}; thus, one first extracts α. The secure division protocol is described in Figure 4. In steps 1 and 2, the parties extract and reveal α, which leaks only the range of b and nothing else.
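The normalization and iteration can be checked on plaintext values (a sketch; goldschmidt_div is our hypothetical name, and the secure protocol performs the same arithmetic on shares with truncation after each multiplication):

```python
def goldschmidt_div(a, b, t=2):
    """Approximate a/b: normalize b into [0.5, 1) via 2^alpha <= b < 2^(alpha+1),
    take w0 = 2.9142 - 2*b_norm, then run t Goldschmidt iterations."""
    alpha = b.bit_length() - 1
    scale = 1 << (alpha + 1)
    b_norm = b / scale                  # now in [0.5, 1)
    w = 2.9142 - 2 * b_norm             # initial approximation of 1/b_norm
    c, err = a * w, 1 - b_norm * w      # c ~ a/b_norm, err = relative error
    for _ in range(t):
        c, err = c * (1 + err), err * err
    return c / scale                    # undo the normalization of b
```

With t = 2 the relative error is on the order of ε_0⁴, well below the 2^{−13} fixed-point resolution used in the experiments.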
⟦c⟧ ← Π_Div(⟦a⟧, ⟦b⟧)
Input: The parties hold arithmetic shares of a, b over Z_{2^ℓ}.
Output: The parties get the arithmetic share of c = a/b over Z_{2^ℓ} with f-bit precision (steps 1-7; see Figure 4).

Privacy-Preserving K-Means Clustering
We now give a formal description of our privacy-preserving K-means clustering protocol, following the basic building blocks outlined above.

Secret Distribution
Recall that in the secure outsourced scenario all data are held by the data owner; thus, the secret distribution phase is completed by the data owner, which implies that all data are horizontally partitioned. As an optimization, instead of generating 2-out-of-3 replicated shares directly, the data owner generates 3-out-of-3 additive shares and sends them to the three computing parties/servers, who re-share them to obtain valid 2-out-of-3 replicated shares. In this way, we reduce the communication cost of the data owner by half.

Cluster Initialization
As shown in Section 2.3, we need to initialize K centroids before the Lloyd's steps. In this work, for simplicity, we assume that the data owner chooses K random points as the initial centroids and secret-shares them to the three computing parties. In this way, cluster initialization can be combined with secret distribution.

Approximation of Euclidean Distance
Recall that we replace the Euclidean distance with the Euclidean squared distance, which does not affect the final result, as shown in Section 3.4. For i ∈ [n] and k ∈ [K], the parties first invoke F_SED to compute the shares of the Euclidean squared distance ⟦X_ik⟧ between data point P_i and centroid φ_k, and then form the n × K matrix ⟦X⟧.

Assigning Data Points to the Closest Cluster
For i ∈ [n], we denote by X_i the i-th row vector of X. In order to assign data point P_i to the closest cluster, the parties invoke our F_Assign protocol as described in Section 3.6. The one-hot output vector ⟦c_i⟧ ← F_Assign(⟦X_i⟧) indicates which cluster center this data point is assigned to. The parties form the K × n cluster matrix ⟦C⟧ such that the i-th column of ⟦C⟧ is ⟦c_i⟧.

Recalculating Cluster Centers
Given the cluster matrix ⟦C⟧, the parties need to recalculate each centroid. Observe that for each k ∈ [K], the k-th cluster contains exactly D_k = ∑_{i=1}^n C_ki points. Instead of computing ϕ_k = (C_k · P)/D_k separately for each k, one can first compute M = C · P, and then compute ϕ_k = M_k/D_k, where M_k is the k-th row of M. Given the secret-shared matrix ⟦C⟧, the new centroid matrix ⟦ϕ⟧ can be computed with the vectorized multiplication technique [31] and the secure division protocol (Section 3.7). Furthermore, the parties also compute the shares of the Euclidean squared distance ⟦e_k⟧ between ϕ_k and φ_k for checking the stopping criterion.
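The matrix trick can be sketched on plaintext values (pure-Python illustration with hypothetical names; in the protocol, M = C · P is one vectorized RSS multiplication and the per-cluster division is F_Div):

```python
def recompute_centroids(C, P):
    """One matrix product M = C * P (K x d) plus a per-cluster division
    by the cluster size D_k, instead of K separate averaging passes.
    Assumes every cluster is non-empty (D_k > 0)."""
    K, n, d = len(C), len(C[0]), len(P[0])
    M = [[sum(C[k][i] * P[i][j] for i in range(n)) for j in range(d)]
         for k in range(K)]
    D = [sum(C[k]) for k in range(K)]   # D_k = number of points in cluster k
    return [[M[k][j] / D[k] for j in range(d)] for k in range(K)]
```

Because C is one-hot per column, the k-th row of M is exactly the coordinate sum of the points assigned to cluster k.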

Checking the Stopping Criterion and Updating Centroids
In order to check the stopping criterion, the parties first locally compute ⟦e⟧ = ∑_{k=1}^K ⟦e_k⟧ and then compare it with the given small error ε, stopping if e < ε and otherwise updating the centroids with ϕ and continuing with the next round. This is done by invoking the secure comparison protocol Π_LT. The only revealed message is the bit b = (e < ε), which does not affect full data privacy.

Main Construction
The secure K-means protocol is described in Figure 5. According to the definition of full data privacy, the intermediate centroids, cluster assignments, and cluster sizes must all be protected. From Figure 5, we can see that the only information leaked is the range of D_k and nothing else; therefore, our construction provides full data privacy.

Parameters:
• Number of clusters K; number of data points n; dimension d.
• Ideal F_OS, F_SED, F_LT, F_Assign, F_Div primitives.

Secret Distribution:
1. The data owner generates the 3-out-of-3 additive shares of the data point matrix P = (P_1, …, P_n) with dimension n × d.
2. The data owner sends the shares to the three semi-honest computing parties/servers P_1, P_2, P_3.
3. Each party re-shares the shares and obtains valid 2-out-of-3 replicated shares.

Cluster Initialization:
1. The data owner chooses K random points as the initial centroids φ = (φ_1, …, φ_K).
2. The data owner shares φ to the three computing parties, similar to secret distribution.

Lloyd's Steps:
For t ∈ [T], the parties repeat the following steps until the stopping criterion is met.
1. For i ∈ [n], k ∈ [K], the parties compute the shares of the Euclidean squared distance by invoking ⟦X_ik⟧ ← F_SED(⟦P_i⟧, ⟦φ_k⟧), and form the n × K matrix ⟦X⟧.
2. For i ∈ [n], the parties assign point P_i to the closest cluster by invoking ⟦c_i⟧ ← F_Assign(⟦X_i⟧), and form the K × n matrix ⟦C⟧ such that the i-th column of ⟦C⟧ is ⟦c_i⟧.
3. The parties compute the share of the matrix product ⟦M⟧ ← ⟦C⟧ · ⟦P⟧ using the RSS vectorized multiplication technique. For k ∈ [K], the parties jointly recalculate the sharing ⟦ϕ_k⟧ of the new centroid as follows:
(a) Compute the shares of the denominator ⟦D_k⟧ = ∑_{i=1}^n ⟦C_ki⟧.
(b) Compute the shares of the average of the points in each cluster by ⟦ϕ_k⟧ ← F_Div(⟦M_k⟧, ⟦D_k⟧).
(c) Compute the shares of the Euclidean squared distance between ϕ_k and φ_k by invoking ⟦e_k⟧ ← F_SED(⟦ϕ_k⟧, ⟦φ_k⟧).
4. The parties check the stopping criterion and update the cluster centroids: compute ⟦e⟧ = ∑_{k=1}^K ⟦e_k⟧ and ⟦b⟧^B ← F_LT(⟦e⟧, ε); if the revealed bit b = 1, stop and output the centroids; otherwise, set ⟦φ_k⟧ ← ⟦ϕ_k⟧ for k ∈ [K] and continue.
Security Analyses
Our protocol follows the universally composable framework [29] and provides security against a single corrupted party under the semi-honest model. Since the universally composable framework guarantees security under arbitrary composition of different protocols, we only need to prove the security of the individual protocols.

Theorem 1. Π_OS, Π_SED, Π_LT, Π_Assign, Π_Div securely realize F_OS, F_SED, F_LT, F_Assign, F_Div, respectively, in the presence of one semi-honest corrupted party in the hybrid model.

Hybrid 4. Let T_4 be the same as T_3, except that the F_Div execution is replaced with running the simulator Sim_F_Div. Its output is a share, which is indistinguishable from pseudo-randomness; in other words, the output of Sim_F_Div is indistinguishable from the real-world execution. Thus, T_4 and T_3 are indistinguishable.
Hybrid 5. Let T_5 be the same as T_4, except that the F_LT execution is replaced with running the simulator Sim_F_LT(⟦e⟧, ε). The output is a share, which reveals no information about the data points; moreover, the output of Sim_F_LT is indistinguishable from the real-world execution. Thus, T_5 and T_4 are indistinguishable.
In summary, T_0 and T_5 are indistinguishable, which completes the proof.

Experimental Setup
We implemented our privacy-preserving clustering protocol and report the experimental results in this section. All experiments were executed on Ubuntu 22.04 LTS with an Intel(R) Xeon(R) Gold 5222 CPU @ 3.41 GHz and 256 GB RAM. All parties run in the same network, and the connection is simulated using the Linux tc command. The LAN setting has 0.2 ms round-trip latency and 5 Gbps bandwidth, while the WAN setting has 20 ms round-trip latency and 400 Mbps bandwidth. We implemented our protocol with the C++ open-source framework FALCON (https://github.com/snwagh/falcon-public (accessed on 18 June 2022)) [32].
In all experiments, we assume the original data have been normalized to the same scale. For the public parameters, we set the bit length ℓ = 64, the fixed-point precision f = 13, and the number of Goldschmidt iterations t = 2. Table 2 summarizes the real datasets used in our experiments. Furthermore, we also compare with Mohassel et al. [27] on a self-generated dataset.

Accuracy
In order to evaluate the clustering accuracy on real datasets, we compare against the ground truth model. We downloaded the ground truth models of Iris and arff from the GitHub repository (https://github.com/deric/clustering-benchmark (accessed on 18 June 2022)). Note that standard K-means clustering is sensitive to the initial centroids; thus, we ran the algorithm many times with different initial centroids and took the best result as the global optimal solution.
We used the 2D dataset arff and the 4D dataset Iris to evaluate accuracy and report the experimental results in Table 2. For a visual comparison, the experimental result on dataset arff is shown in Figure 6. Compared with the ground truth model of dataset arff, our privacy-preserving model reaches 98.20% accuracy; for dataset Iris, it reaches 92.67% accuracy. Both results are the same as those of the plaintext algorithm; hence, our privacy-preserving protocol is feasible.
Recall that our secure division protocol adopts a numerical method and that the results of the secure multiplication protocol need to be truncated; thus, the privacy-preserving result ϕ only approximates the plaintext result rather than matching it exactly. In fact, our experiments show that the relative error is about 10^{-2}, which has a negligible impact on model accuracy compared with the plaintext algorithm. In other words, our privacy-preserving protocol is feasible even though ϕ is only an approximation of the true value. For large-scale datasets, we argue that the relative error can be reduced by choosing larger public parameters ℓ, f, and t, at the cost of more runtime and communication.
There is thus a trade-off between the runtime and the accuracy of the division. We do not further consider accuracy in the following sections.
(a) Ground Truth Model (b) Plaintext and Privacy-preserving Model Figure 6. Comparison of accuracy for ground truth, plaintext, and privacy-preserving model for 2D dataset arff. Our privacy-preserving model reaches the same accuracy as the plaintext model. The accuracy is 98.20% compared to the ground truth model.

Runtime and Communication
In this section, we focus on the total communication cost and runtime of our privacy-preserving protocol. We ran each experiment five times and report the average wall-clock runtime. The experimental results are given in Table 3. Note that the number of iterations T depends on the initial clustering centroids; hence, we fix it in this experiment (T = 10). As shown in Table 3, the runtime and communication overhead are independent of the dimension d, because we benefit from the vectorized multiplication and delayed re-share techniques [31].
Even with n = 100,000, our scheme runs in less than 8 minutes and communicates less than 400 MB in total under the LAN setting. Although the overall communication cost is low, the runtime of our scheme degrades in the WAN setting. The reason is that the WAN has low bandwidth and high latency, while secret-sharing-based schemes usually require many communication rounds, which is unfavorable in this setting. Thus, we argue that our privacy-preserving scheme is practical when the network is fast, i.e., high-bandwidth and low-latency. We estimate that our protocol can handle datasets of a million points in an acceptable time (for example, within 2 hours for clustering one million points into 2 groups in the LAN setting).

Comparison with Mohassel et al. [27]
Recall that Mohassel et al. [27] also provide full data privacy guarantees (see Table 1); thus, we compare with their work on self-generated datasets in our experimental environment. We downloaded the code of Mohassel et al. [27] from their GitHub repository (https://github.com/osu-crypto/secure-kmean-clustering (accessed on 18 June 2022)). To save time, all experiments are conducted only in the localhost setting, which has 0.027 ms round-trip latency and 41.1 Gbps bandwidth. We ran each protocol five times and report the average wall-clock runtime and communication costs. Instead of running the code of Mohassel et al. [27] for every iteration, we measured its runtime for one iteration and multiplied it by the number of iterations T. Table 4 presents the computation and communication costs of our protocol compared with [27]. The results demonstrate that our protocol is about 16.5×–25.2× faster in computation and requires about 63.8×–68.0× less communication than [27]. This is because their construction relies heavily on garbled circuits and oblivious transfer, while our scheme is based only on replicated secret sharing. Table 4. Comparison with Mohassel et al. [27] on large-scale self-generated datasets under the localhost setting, where n is the number of data points, K is the number of clusters, and the dimension is d = 2.

Conclusions and Future Work
In this work, we presented an efficient RSS-based privacy-preserving K-means clustering scheme over Z_{2^ℓ} under the semi-honest model. Our scheme provides full data privacy. The experimental results show that our protocol is highly efficient and practical, and is suitable for large-scale clustering tasks when the network is fast. Therefore, we argue that our scheme is well suited for secret-sharing-based secure outsourced computation.
A natural direction for future work is to extend our scheme to the malicious adversarial setting, which is a non-trivial problem: a malicious adversary need not follow the protocol specification and may deviate arbitrarily in any phase. For example, the adversary may make the corrupted party send incorrect messages, breaking the correctness of the protocol. Therefore, we would need to ensure that the messages sent by the parties are correct. A promising direction is to introduce the SPDZ protocol [37], where the correctness of the sent messages is ensured by message authentication codes (MACs).