Article

A Fast and Privacy-Preserving Outsourced Approach for K-Means Clustering Based on Symmetric Homomorphic Encryption

College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(17), 2893; https://doi.org/10.3390/math13172893
Submission received: 3 July 2025 / Revised: 1 August 2025 / Accepted: 5 August 2025 / Published: 8 September 2025

Abstract

Training a machine learning (ML) model typically requires substantial computing resources, and cloud-based outsourced training is an attractive way to address this shortage. However, the cloud may be untrustworthy and pose a privacy threat to the training process. Most existing work relies on multi-party computation protocols and lattice-based homomorphic encryption to solve the privacy problem, but these tools are inefficient in communication or computation. In this paper, we therefore focus on k-means and propose a fast and privacy-preserving method for outsourced k-means clustering based on symmetric homomorphic encryption (SHE), which we use to encrypt both the clustering dataset and the model parameters. We design an interactive protocol and apply various tools to optimize its time overheads. We perform a security analysis and a detailed performance evaluation, and the experimental results show that our scheme achieves better prediction accuracy as well as lower computation and total overheads.

1. Introduction

In modern life, we encounter vast amounts of big data in various fields. Data mining serves as a powerful tool to uncover hidden patterns and extract valuable insights from these massive datasets. However, running data mining algorithms requires significant computational and storage resources. When leveraging cloud-based computing and storage solutions to meet these demands, there is an inherent risk of privacy breaches, as sensitive data may be exposed to unauthorized access or leakage during processing and transmission.
Several privacy-preserving cryptographic techniques, such as differential privacy (DP), homomorphic encryption (HE), and secure multi-party computation (SMPC), have been employed to protect sensitive data and algorithm parameters in data mining. Differential privacy safeguards individual data privacy by injecting noise into datasets and is highly efficient; however, the model parameters and noise-added data remain in plaintext, so the outsourced process as a whole is only weakly protected. SMPC allows for encrypted data mining but requires multiple nodes to perform ciphertext operations, resulting in high computational and communication overheads. Homomorphic encryption allows a single node to compute directly on ciphertext without inter-node communication, yet its adoption is constrained by its computational inefficiency and by the restricted set of operations some HE schemes support.
In this paper, we focus on the privacy-preserving outsourced clustering of k-means, which is a popular unsupervised machine learning (ML) algorithm used for clustering data into groups (clusters) based on similarity. It partitions a dataset into k distinct, non-overlapping clusters by minimizing the within-cluster variance (e.g., the sum of the squared distances between data points and their cluster centroids).
Similarly to other ML algorithms, various HE-based [1,2,3,4,5] and SMPC-based [6] schemes have already been proposed to implement privacy-preserving outsourced k-means clustering. However, constrained by the inherent characteristics of the adopted cryptographic techniques (i.e., HE and SMPC), the existing solutions suffer from either excessive communication overhead, high computational complexity, or suboptimal prediction accuracy. We will provide a detailed discussion of the related work in the subsequent section.

1.1. Related Work

Secret sharing (SS), a classic realization of SMPC, has been widely adopted to ensure secure outsourced computation, and the works in [7,8,9] make use of different kinds of secret sharing protocols to implement privacy-preserving outsourced k-means clustering. Although SS-based methods offer strong security under standard assumptions, these protocols usually incur relatively heavy computational and communication overheads due to secret splitting and frequent interaction between multiple servers.
In addition to SMPC and SS, numerous studies leverage homomorphic encryption (HE) to preserve the privacy of k-means models and associated datasets. For instance, Ref. [3] employs TFHE [10], which replaces non-linear functions with programmable bootstrapping but suffers from extremely slow homomorphic addition and multiplication operations, severely hindering the efficiency of outsourced training. In contrast, works like [4,5] adopt alternative FHE schemes (i.e., BGV [11] and CKKS [12]), offering faster homomorphic operations than TFHE. However, BGV and CKKS also rely on the Ring-LWE problem, resulting in larger ciphertext sizes and slower encryption/decryption speeds. To obtain faster encryption/decryption speeds, researchers [2,7] have made use of Paillier [13], which is a kind of partially homomorphic encryption that only supports addition operations between ciphertexts. However, multiplication/division operations are carried out when distances are calculated; therefore, the researchers [2,7] needed to develop a secure multiparty ciphertext multiplication/division protocol, which increased the communication overhead between multiple servers.
In summary, Table 1 compares our SHE-based work (symmetric homomorphic encryption [14]) with the related works in terms of the deployed cryptographic tools, the use of approximated computation, and the communication overhead of each training run, based on the experimental results.

1.2. Example Applications of SHE

SHE was first proposed in [14] and used for privacy-preserving range queries in fog-based IoT. Since SHE provides reasonably efficient (leveled) fully homomorphic encryption, a large body of subsequent work has used it to implement privacy-preserving and efficient federated learning [15,16,17] and community/range/similarity queries in cloud computing, eHealthcare, and IoT [18,19,20,21]. Given these successful applications of SHE, it is natural to integrate it with privacy-preserving outsourced k-means clustering in order to improve the efficiency of the outsourced scheme.

1.3. Our Contributions

To improve the computation and communication efficiency, we leverage SHE, a (leveled) fully homomorphic encryption scheme supporting efficient additive and multiplicative ciphertext operations, to safeguard the outsourced k-means training process. Our key contributions are as follows:
  • During the protocol, we develop the Modified Euclidean Distance (MED) to avoid non-linear operations (e.g., division and square root) and reduce both the computational and communication costs.
  • We conduct comprehensive experiments and comparisons to demonstrate our solution's superiority in prediction accuracy, computational efficiency, and total runtime (including computation and communication).

2. Preliminaries

2.1. K-Means

K-means is an unsupervised learning algorithm that partitions n data points in a dataset into k clusters, where each data point is assigned to the cluster with the nearest centroid. The algorithm proceeds in the following steps:
Firstly, given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ ($i \in [1, n]$), $k$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_k$ are randomly initialized. Each data point $x_i$ is then assigned to the nearest cluster by calculating the cluster index $c(i) = \arg\min_j \|x_i - \mu_j\|^2$, where $j \in \{1, 2, \ldots, k\}$ and $c(i)$ denotes the cluster assignment for $x_i$.
Secondly, each centroid $\mu_j$ is recalculated as the mean of all data points assigned to its cluster $C_j$: $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$, where $C_j$ is the set of points assigned to cluster $j$. After the cluster centroids are updated, each data point $x_i$ is reassigned to the nearest new centroid using the same distance calculation as in the first step.
Thirdly, the algorithm iterates between the centroid update and data point assignment until the cluster assignments stabilize or the movement of the centroids falls below a specified threshold.
Upon convergence, the resulting model, defined by the final $k$ centroids, can assign a new data point $x_{\text{new}}$ by calculating $c_{\text{new}} = \arg\min_j \|x_{\text{new}} - \mu_j\|^2$, where $c_{\text{new}}$ is the predicted cluster index.
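As a reference, the steps above can be sketched in plain (unencrypted) Python. This is an illustrative baseline only; it uses first-$k$ initialization rather than random initialization so that the example is deterministic:

```python
def kmeans(X, k, iters=100):
    """Plain k-means; initial centroids are simply the first k points."""
    mu = [list(x) for x in X[:k]]
    assign = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new = [min(range(k),
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(x, mu[l])))
               for x in X]
        if new == assign:            # assignments stable -> converged
            break
        assign = new
        # Update step: each centroid becomes the mean of its assigned points.
        for l in range(k):
            pts = [x for x, c in zip(X, assign) if c == l]
            if pts:
                mu[l] = [sum(col) / len(pts) for col in zip(*pts)]
    return mu, assign

# Two well-separated blobs separate cleanly.
X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
mu, assign = kmeans(X, 2)
assert assign[0] == assign[1] == assign[2]
assert assign[3] == assign[4] == assign[5]
assert assign[0] != assign[3]
```

In the outsourced protocol described later, the assignment and update steps of this loop are split between the CSP (distances and accumulations on ciphertexts) and the DO (comparisons and divisions on plaintexts).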

2.2. Symmetric Homomorphic Encryption (SHE)

The SHE scheme consists of the following three algorithms:
  • $(sk, pp) \leftarrow \mathrm{SHE.KeyGen}(k_0, k_1, k_2)$ is the key generation algorithm, which uses the input security parameters $k_0$, $k_1$, and $k_2$ (satisfying $k_1 \ll k_2 < k_0/2$) to randomly generate a secret key $sk$ and public parameters $pp$. The secret key is $sk = (p, q, R)$, where $p$ and $q$ are two large prime numbers satisfying $|p| = |q| = k_0$, $R$ is a random number satisfying $|R| = k_2$, and the message space is $\mathcal{M} = \{m \mid m \in [-2^{k_1-1}, 2^{k_1-1})\}$. Then, $N = pq$ is computed, and the public parameters are set to $pp = (k_0, k_1, k_2, N)$.
  • $[[m]]_{sk} \leftarrow \mathrm{SHE.Enc}(m, sk)$ is the encryption algorithm, which encrypts a message $m \in \mathcal{M}$ using the secret key $sk = (p, q, R)$. The ciphertext is computed as $[[m]]_{sk} = \mathrm{SHE.Enc}(m, (p, q, R)) = (rR + m)(1 + r'p) \bmod N$, where $r \in \{0,1\}^{k_2}$ and $r' \in \{0,1\}^{k_0}$ are randomly sampled.
  • $m \leftarrow \mathrm{SHE.Dec}([[m]]_{sk}, sk)$ is the decryption algorithm, which recovers the message $m$ from a ciphertext $[[m]]_{sk}$ using the secret key $sk$ via the computation $m = ([[m]]_{sk} \bmod p) \bmod R$.
Given a plaintext $m_0$ and two ciphertexts $[[m_1]]_{sk}$ and $[[m_2]]_{sk}$ (encrypted under the same key), the scheme supports the following homomorphic operations: $m_0 \oplus [[m_1]]_{sk} = [[m_0 + m_1]]_{sk}$, $m_0 \otimes [[m_1]]_{sk} = [[m_0 m_1]]_{sk}$, $[[m_1]]_{sk} \oplus [[m_2]]_{sk} = [[m_1 + m_2]]_{sk}$, and $[[m_1]]_{sk} \otimes [[m_2]]_{sk} = [[m_1 m_2]]_{sk}$, where $\oplus$ and $\otimes$ denote homomorphic addition and multiplication, respectively. Furthermore, the scheme supports an arbitrary number of homomorphic additions but is limited to $(k_0/(2k_2)) - 1$ sequential homomorphic multiplications.
Although SHE is a symmetric scheme, it can be transformed into an asymmetric one [15] by generating a public key $pk = r_0 \otimes [[0]]_{sk}$, where $r_0$ is a $k_2$-bit random number and $sk$ serves as the private key. Using this key pair, a message $m$ is encrypted with the public key $pk$ via the operation $[[m]]_{pk} = m \oplus pk = [[m]]_{sk}$, and the resulting ciphertext is decrypted with the private key $sk$ using the original decryption algorithm: $m = \mathrm{SHE.Dec}([[m]]_{pk}, sk)$.
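As a sketch, the scheme above can be prototyped in a few lines of Python with toy parameters (e.g., $k_0 = 512$; these are far too small for real security, which requires the parameter sizes analyzed later in the security analysis). Homomorphic addition and multiplication are plain modular addition and multiplication of ciphertexts, and negative messages are recovered by complement-style decoding:

```python
import random

def is_probable_prime(n, rounds=40):
    # Miller-Rabin probabilistic primality test.
    if n < 2:
        return False
    for sp in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % sp == 0:
            return n == sp
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        x = pow(random.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def rand_prime(bits):
    while True:
        n = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(n):
            return n

class SHE:
    """Toy SHE: Enc(m) = (r*R + m)(1 + r'*p) mod N, Dec(c) = (c mod p) mod R."""
    def __init__(self, k0=512, k1=20, k2=60):
        self.k0, self.k2 = k0, k2
        self.p, self.q = rand_prime(k0), rand_prime(k0)
        self.R = random.getrandbits(k2) | (1 << (k2 - 1))
        self.N = self.p * self.q

    def enc(self, m):
        r, r2 = random.getrandbits(self.k2), random.getrandbits(self.k0)
        return ((r * self.R + m) * (1 + r2 * self.p)) % self.N

    def dec(self, c):
        m = (c % self.p) % self.R
        return m - self.R if m > self.R // 2 else m  # complement-coded negatives

she = SHE()
c1, c2 = she.enc(7), she.enc(-3)
assert she.dec((c1 + c2) % she.N) == 4     # ciphertext (+) ciphertext
assert she.dec((c1 * c2) % she.N) == -21   # ciphertext (x) ciphertext
assert she.dec((5 + c1) % she.N) == 12     # plaintext (+) ciphertext
assert she.dec((5 * c1) % she.N) == 35     # plaintext (x) ciphertext
pk = (random.getrandbits(she.k2) * she.enc(0)) % she.N  # asymmetric variant
assert she.dec((9 + pk) % she.N) == 9      # public-key encryption of 9
```

The last two lines illustrate the asymmetric variant: $pk = r_0 \otimes [[0]]_{sk}$, and encrypting under $pk$ is simply adding the message to $pk$ modulo $N$.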

3. System Design

3.1. System Architecture

Similar to previous works, our system architecture involves two entities: a data owner ( DO ) and a cloud service provider ( CSP ), where the DO owns the dataset but lacks sufficient computational resources, while the CSP provides on-demand cloud computing services. Figure 1 illustrates our outsourced k-means clustering protocol, an interactive process between the DO and the CSP . Before the protocol begins, the DO generates a key pair, normalizes the dataset, and encrypts the dataset and initial centroids, providing only the public keys to the CSP . During the protocol, the CSP is responsible for the bulk of the computational tasks, operating on the encrypted dataset and model parameters. However, because SHE does not support the direct comparison of ciphertexts, the CSP must return intermediate results to the DO for any necessary comparisons. The DO then decrypts, compares, re-encrypts, and returns the updated data to the CSP . Finally, upon convergence, the DO receives the encrypted final k-means model parameters from the CSP and decrypts them.

3.2. Security Models and Desired Properties

In our protocol’s threat model, the DO is assumed to be trusted, while the CSP is considered honest-but-curious with respect to the DO ’s dataset and model parameters. Accordingly, our protocol is designed to uphold the following security properties:
  • Dataset Privacy: The protocol must ensure the CSP cannot learn the contents of the original clustering dataset provided by the DO .
  • Model Privacy: The protocol must protect the confidentiality of the k-means model parameters throughout the entire outsourced clustering process. Specifically, CSP should not learn (i) the feature values of the cluster centroids or (ii) the distances between individual data points and these centroids.
To provide these security guarantees, we employ the (leveled) fully homomorphic encryption scheme SHE, which has been proven to be IND-CPA secure [19].

4. Protocol Implementation

In this section, we detail our k-means clustering protocol, aligning with the process illustrated in Figure 1.

4.1. Protocol Initialization

Before the protocol begins, the DO executes the initialization step, which corresponds to Phase (1) in Figure 1.
Phase (1). The DO 's clustering dataset consists of $n$ data points. To begin, the DO generates an SHE key pair, denoted as $(pk_1, sk_1)$. Because SHE only supports operations on integers, the DO must normalize each feature $x_{ij}$ (for $i \in [1, n]$, $j \in [1, d]$) of the $n$ $d$-dimensional data points into a predefined integer range $[-2^{b-1}, 2^{b-1})$, where $b \in \mathbb{Z}^+$.
Next, the DO randomly generates $k$ $d$-dimensional cluster centroids $(\mu_1, \mu_2, \ldots, \mu_k)$ and similarly normalizes their features $\mu_{lj}$ (for $l \in [1, k]$, $j \in [1, d]$) into the same integer range. To facilitate subtraction between the data point and centroid features, the DO negates the centroid features ($-\mu_{lj}$), encodes both the data features ($x_{ij}$) and the negated centroid features ($-\mu_{lj}$) as complements, and finally encrypts them with $sk_1$ to obtain $[[x_{ij}]]_{sk_1}$ and $[[-\mu_{lj}]]_{sk_1}$.
Finally, the DO sends all encrypted features of the data points and cluster centroids ( [ [ x i j ] ] s k 1 and [ [ μ l j ] ] s k 1 ) to the CSP to begin the protocol’s first round.
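As an example, the normalization in Phase (1) might be implemented as a per-feature min-max scaling into the signed $b$-bit integer range; the helper below is our own illustrative sketch, not necessarily the paper's exact encoding:

```python
def normalize_column(col, b=8):
    """Min-max scale one feature column into the signed integer range
    [-2**(b-1), 2**(b-1) - 1], since SHE only encrypts integers."""
    lo, hi = min(col), max(col)
    if hi == lo:
        return [0] * len(col)          # constant feature: map everything to 0
    steps = (2 ** b) - 1               # number of representable steps
    return [round((v - lo) / (hi - lo) * steps) - 2 ** (b - 1) for v in col]

assert normalize_column([0.0, 2.5, 5.0], b=8) == [-128, 0, 127]
```

Each normalized value then fits in the SHE message space as long as $b \le k_1$.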

4.2. The First Protocol Round (Distance Calculation and Comparison)

The first protocol round corresponds to Phase (2.1)–(2.2) in Figure 1, where the primary tasks for the CSP and the DO are distance calculation and comparison, respectively.
Phase (2.1). In this phase, using the encrypted features previously received from the DO , the CSP calculates the distance between each data point and every cluster centroid. Because SHE does not support the square root operation on ciphertexts, we use a Modified Euclidean Distance (MED) metric, which consists solely of addition, subtraction, and multiplication operations. For instance, the MED between the i-th data point x i and the l-th cluster centroid μ l is defined as
$$MED(x_i, \mu_l) = \sum_{j \in [1, d]} (x_{ij} - \mu_{lj})^2.$$
To compute this MED, the CSP first homomorphically adds each encrypted data feature $[[x_{ij}]]_{sk_1}$ to the corresponding encrypted negated centroid feature $[[-\mu_{lj}]]_{sk_1}$, then squares each resulting sum, and finally sums these squared results to obtain the encrypted MED value. As direct ciphertext comparison is not supported by SHE, the CSP then sends all the encrypted MED values back to the DO for decryption and comparison.
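Because the square root is strictly monotone, ranking centroids by MED selects the same nearest centroid as the true Euclidean distance, so dropping the root loses nothing for clustering. A quick plaintext check (illustrative only):

```python
import random

def med(x, mu):
    # Squared Euclidean distance: additions, subtractions, multiplications only.
    return sum((a - b) ** 2 for a, b in zip(x, mu))

random.seed(1)
points = [[random.randint(0, 200) for _ in range(4)] for _ in range(200)]
centroids = [[random.randint(0, 200) for _ in range(4)] for _ in range(5)]
for x in points:
    nearest_med = min(range(5), key=lambda l: med(x, centroids[l]))
    nearest_euc = min(range(5), key=lambda l: med(x, centroids[l]) ** 0.5)
    assert nearest_med == nearest_euc  # dropping the sqrt never changes the argmin
```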
Phase (2.2). Upon receiving the encrypted MEDs from the CSP , the DO decrypts and compares them to determine the closest cluster centroid for each data point. On one hand, after the comparisons, the DO counts the number of data points assigned to each centroid and generates a count vector $[cnt_1, cnt_2, \ldots, cnt_k]$, where $cnt_l$ is the number of points closest to centroid $l$ and $\sum_{l=1}^{k} cnt_l = n$. For example, the vector $[3, 5, 7, 4, 2]$ indicates that 3, 5, 7, 4, and 2 data points are closest to the first, second, third, fourth, and fifth centroids, respectively. This count vector is subsequently used for the centroid update step.
On the other hand, the DO also generates a centroid mask for each data point $i$, denoted as a boolean vector $[msk_{i1}, msk_{i2}, \ldots, msk_{ik}]$ (e.g., $[0, 0, 1, 0, 0]$ indicates the point is closest to the third centroid). The DO then encrypts these $n$ mask vectors and sends the resulting ciphertexts ($[[msk_{i1}]]_{sk_1}, \ldots, [[msk_{ik}]]_{sk_1}$ for $i \in [1, n]$) to the CSP .
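In plaintext form, the bookkeeping of Phase (2.2) amounts to the following (the helper name is ours, for illustration); in the protocol itself, every mask entry is encrypted before being sent:

```python
def masks_and_counts(assignments, k):
    """Build the one-hot centroid masks and the per-centroid count vector
    from each point's nearest-centroid index."""
    counts = [0] * k
    masks = []
    for c in assignments:
        counts[c] += 1
        masks.append([1 if l == c else 0 for l in range(k)])
    return masks, counts

# Five points whose nearest centroids are 2, 0, 2, 1, 2.
masks, counts = masks_and_counts([2, 0, 2, 1, 2], k=3)
assert counts == [1, 1, 3] and sum(counts) == 5
assert masks[0] == [0, 0, 1]  # point 0 is closest to the third centroid
```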

4.3. The Second Protocol Round (Centroid Calculation and Update)

The second protocol round corresponds to Phase (3.1)–(3.3) in Figure 1. During this round, the CSP and the DO collaboratively calculate and update the cluster centroids. If the centroids have not yet converged, the CSP and the DO repeat Phase (2.1)–(3.2) using the newly updated centroids.
Phase (3.1). After receiving the n encrypted centroid mask vectors from the DO , the CSP calculates the accumulated features for each new centroid according to the following equation:
$$[[acc\_\mu_{lj}]]_{sk_1} = \bigoplus_{i \in [1, n]} \left( [[msk_{il}]]_{sk_1} \otimes [[x_{ij}]]_{sk_1} \right),$$
where $[[x_{ij}]]_{sk_1}$ is the encrypted feature of the data point, $[[msk_{il}]]_{sk_1}$ is the corresponding encrypted mask value, and $i \in [1, n]$, $j \in [1, d]$, $l \in [1, k]$. Although the new centroid feature is defined as the accumulated feature divided by the number of associated data points, the CSP cannot perform this division directly on ciphertexts, as this operation is not supported by SHE. Therefore, the CSP must send the encrypted accumulated features $[[acc\_\mu_{lj}]]_{sk_1}$ back to the DO to perform the final calculation and update of the new centroids.
Phase (3.2). Upon receiving the encrypted accumulated features $[[acc\_\mu_{lj}]]_{sk_1}$, the DO first decrypts them using $sk_1$ to obtain the plaintext accumulated features $acc\_\mu_{lj}$. After decryption, the DO calculates each new centroid feature $new\_\mu_{lj}$ by dividing the accumulated feature $acc\_\mu_{lj}$ by the corresponding data point count $cnt_l$: $new\_\mu_{lj} = acc\_\mu_{lj} / cnt_l$ (for $j \in [1, d]$, $l \in [1, k]$).
Next, the DO checks for convergence by comparing the previous centroid features, μ l j , with the new ones, n e w _ μ l j . If n e w _ μ l j = μ l j for all features, the centroids have converged, the clustering process is complete, and these are the final model parameters. Otherwise, the DO encrypts the updated centroid features ( n e w _ μ l j ) with s k 1 and sends the resulting ciphertexts, [ [ n e w _ μ l j ] ] s k 1 , to the CSP for the next iteration.
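Phases (3.1) and (3.2) in plaintext terms: the mask-weighted sums mirror what the CSP computes homomorphically, while the final division is what the DO performs after decryption (an illustrative sketch with our own function name):

```python
def update_centroids(masks, X, counts):
    """New centroid l, feature j: (sum_i masks[i][l] * X[i][j]) / counts[l].
    The multiply-by-mask accumulation mirrors the CSP's homomorphic sums;
    the division is done by the DO on decrypted values."""
    k, d = len(counts), len(X[0])
    acc = [[0] * d for _ in range(k)]
    for msk, x in zip(masks, X):
        for l in range(k):
            for j in range(d):
                acc[l][j] += msk[l] * x[j]
    return [[acc[l][j] / counts[l] for j in range(d)] for l in range(k)]

X = [[2, 0], [4, 0], [0, 6]]
masks = [[1, 0], [1, 0], [0, 1]]  # points 0,1 -> cluster 0; point 2 -> cluster 1
assert update_centroids(masks, X, [2, 1]) == [[3.0, 0.0], [0.0, 6.0]]
```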
Phase (3.3). Using the newly encrypted centroid features [ [ n e w _ μ l j ] ] s k 1 , the CSP and the DO repeat Phase (2.1)–(3.2) until the clustering process converges.

5. Experimental Results and Analysis

The performance of our scheme was evaluated with the DO deployed on a machine equipped with a 2-core 11th Gen Intel Core i7-1195G7 CPU (operating at 2.92 GHz), 4GB of RAM, and Windows 10. The CSP was hosted on a machine with a 4-core CPU of the same model (2.92 GHz), 16GB of RAM, and Windows 11. Notably, the hardware used in our experiments is less powerful than that of the related works we compare our work against; nevertheless, our scheme demonstrates superior time performance, as will be detailed later.
The datasets used in our experiments, listed in Table 2, were selected from the standard UCI repository [22] to facilitate a direct comparison with the key related works [2,3,4,5] and demonstrate our scheme’s feasibility and efficiency. For our experiments, we used seven datasets [23,24,25,26,27,28,29], which were partitioned into a 70% training set and a 30% validation set.
To ensure a sufficient security level, we set the SHE parameters for $sk_1$ to $(k_0, k_1, k_2) = (1048, 30, 90)$, which offers a security guarantee comparable or superior to that of the related works; a detailed security analysis of these parameters is presented later. For a balance of efficiency and precision, we normalized all dataset features to the integer range $[0, 200]$. This range can be extended up to $[0, 2^{30} - 1]$ with negligible impact on accuracy or performance, as our chosen SHE parameters support the encryption of 30-bit ($k_1$-bit) integers.

5.1. Prediction Accuracy and Comparison

Table 3 compares the accuracy of our scheme with the key related works [2,3,4,5], each of which utilizes different clustering datasets and evaluation metrics.
Regarding the datasets, some of the compared works utilize self-generated synthetic data [2,5] or the non-UCI FCPS benchmark [3]. In contrast, our experiments use standard UCI datasets with parameters ( k , n , d ) chosen to be comparable to those in the related works.
As detailed in the notes of Table 3 and summarized in its final column, our scheme’s accuracy is benchmarked against these varied metrics. As the results show, our scheme outperforms the methods in [3,4,5] and achieves a clustering accuracy comparable to the work in [2].
The superior accuracy of our scheme can be attributed to our choice of cryptographic primitive. The compared works [3,4,5] rely on lattice-based schemes (e.g., CKKS, TFHE, BGV) that require approximations like polynomial functions to handle non-linear operations. In contrast, our approach utilizes an integer-based SHE scheme combined with data normalization. This method introduces only minimal precision loss by truncating the least significant bits, thereby preserving the accuracy. Furthermore, our interactive protocol is designed such that all homomorphic computations consist of exact integer operations (additions and multiplications), eliminating the need for approximations and thus minimizing the accuracy degradation.

5.2. Communication Overhead of Our Scheme

Figure 2 and Figure 3 illustrate the communication overhead of our scheme for various datasets, corresponding to the interactive data flows between the DO and CSP depicted in Figure 1.
In the direction from the DO to CSP , the communication overhead (illustrated in Figure 2) primarily consists of two components, corresponding to the two transmission steps shown in Figure 1.
  • The first component is generated at the end of Phase (1), when the DO transmits the initial encrypted data package to the CSP , which includes the dataset, initial centroids, and public parameters.
  • The second component occurs at the end of Phase (2.2) and consists of the encrypted new centroid mask vectors.
Conversely, in the direction from the CSP to DO , the communication overhead (Figure 3) is also composed of two main parts, reflecting the two return transmissions in the protocol.
  • The first part is the encrypted distance matrix transmitted at the end of Phase (2.1), which the DO uses for comparison.
  • The second part, sent at the end of Phase (3.1), consists of the encrypted accumulated centroid features, which are required for the new centroid calculation.
In summary, all data exchanged between the parties are encrypted, and the resulting communication overhead is of a manageable size. Under typical network conditions, this communication overhead has a negligible impact on the total protocol execution time, a point we will elaborate on in the following subsection.

5.3. Time Performance and Comparison

In this subsection, we evaluate the time performance of our protocol by analyzing three key aspects: the computation time for the DO and CSP , the theoretical computational complexity, and a comparison of the total execution time against the related works. To ensure reliability, all the reported time overheads are the average of 10 independent runs. Our analysis of the computational complexity is presented alongside the empirical computation times for each protocol phase. As plaintext operations are negligible in comparison, this complexity analysis focuses primarily on operations involving ciphertexts.
The empirical computation times for our scheme are presented in Figure 4, which breaks down the performance of both the DO and the CSP . The DO ’s computation time, detailed in Figure 4a, is divided into three parts corresponding to its three phases of activity in the protocol.
  • Phase (1). In this phase, the computation time covers protocol initialization, mainly including key pair generation, dataset normalization, and encryption of the dataset and initial centroids. The initial setup requires the DO to encrypt $n \cdot d$ data point features and $k \cdot d$ centroid features. As typically $n \gg k$, the complexity is dominated by dataset encryption, giving an initial computational complexity of $O(n \cdot d)$.
  • Phase (2.2). This phase involves two main operations: comparing distances to find the nearest centroid for each data point and generating the new centroid masks. During this process, the DO mainly performs O ( n · k ) plaintext comparisons to find the nearest centroid for each point and encrypts and returns n × k mask values. The complexity of this phase is dominated by the O ( n · k ) cryptographic operations.
  • Phase (3.2). In this phase, the DO mainly needs to divide each accumulated centroid feature by the number of data points closest to the corresponding centroid and to check for convergence. This requires $k \times d$ decryptions of the encrypted accumulated features and $k \times d$ encryptions of the new centroid features, so the complexity of this phase is $O(k \cdot d)$.
As shown in Figure 4a, the computation time overheads of the DO in the different phases are mostly under twenty seconds, and the total DO computation time remains low (at most dozens of seconds) across our clustering datasets, even for the largest one. By the theoretical analysis, since typically $n \gg k$, the total computational complexity of the DO is $O(n \cdot d)$.
The CSP ’s computation time, presented in Figure 4b, comprises two parts corresponding to its two active phases in the protocol.
  • Phase (2.1). The CSP performs ciphertext additions and multiplications to calculate the distances between each data point and each centroid. To compute the Modified Euclidean Distance (MED) between one data point and one centroid, the CSP performs operations for each of the $d$ dimensions, which involves $d$ homomorphic multiplications and $2d - 1$ homomorphic additions. Since this must be performed for all $n$ data points and $k$ centroids, the total complexity of this phase is $O(n \cdot k \cdot d)$ homomorphic operations.
  • Phase (3.1). In this phase, the CSP performs homomorphic multiplications and additions to compute the new accumulated centroid features according to the equation in Phase (3.1). For each of the $k \times d$ features, this requires a summation over all $n$ data points, involving $n$ homomorphic multiplications and $n - 1$ homomorphic additions. Therefore, the total complexity of this phase is also $O(n \cdot k \cdot d)$ homomorphic operations.
Similar to the DO , the total computation time overheads of the CSP are also low (at most a few dozen seconds, even for the largest dataset) across the different clustering datasets. The overall computational complexity of the CSP is $O(n \cdot k \cdot d)$, which constitutes the dominant portion of the protocol's entire computation time.
While the computation times for both parties are low, the interactive nature of our protocol means that the communication overhead could also impact the total execution time. Therefore, to comprehensively evaluate the impact of different network environments on the communication overheads, we simulated a standard Internet environment (100 Mbps bandwidth, 10 ms round-trip delay) to test the total execution time overheads of our protocol (including the computation time overheads and communication time overheads). The experimental results, as shown in Figure 5, demonstrate that our scheme maintains lower total execution time overheads than the related work [4] based on various UCI datasets.
While a poor network environment (e.g., 10 Mbps bandwidth, 30 ms delay) could become a bottleneck for larger datasets, this effect is mitigated by modern high-speed Internet. Moreover, Gigabit-bandwidth connections, which are quite common nowadays, can significantly reduce the transmission time and thus further lower the total execution time overhead of our protocol.
Since three other related works [2,3,5] used non-UCI datasets for comparison, we selected corresponding UCI datasets with larger dataset parameters; the total time overhead comparison results are shown in Table 4. As the table shows, our work has an overwhelming advantage in total time overhead over the compared non-UCI-based works, while achieving equal or better accuracy, as discussed above.
A few details in Table 4 require explanation. In the comparison with [3], our scheme has converged and stops after 5 protocol iterations, so the reported total time overhead is five times our one-iteration runtime. In the comparison with [5], we chose datasets with larger dimensions (keeping the other parameters the same) and still obtained a smaller total time overhead.

5.4. Security Analysis

As proven in [19], the SHE scheme employed in our work satisfies IND-CPA (indistinguishability under chosen-plaintext attack) security, based on the $(R, p)$-decision assumption. Specifically, the ability of the SHE scheme to resist chosen-plaintext attacks, and thus achieve IND-CPA security, is conditional upon the lengths of the security parameters $R$ (length $k_2$) and $p$ (length $k_0$). On one hand, the security of $R$, a random number of length $k_2$, relies on the infeasibility of a brute-force attack. To render such an attack computationally infeasible (requiring $2^{k_2}$ operations), a length of $k_2 \ge 90$ is required. On the other hand, the security of the large prime $p$ is tied to the difficulty of the integer factorization problem. Given the public modulus $N = pq$, an attacker must factor $N$ to recover $p$. With the currently recommended minimum secure length for $N$ at 1024 bits, the length of $p$, $k_0$, must be at least 512 bits.
The security parameters selected for our implementation, $k_2 = 90$ and $k_0 = 1048$, meet or exceed the required minimums of 90 and 512 bits, respectively. Therefore, our configuration of the SHE scheme satisfies the requirements for IND-CPA security.
Given this IND-CPA secure foundation, we now analyze how our protocol achieves the specific privacy properties defined in Section 3.2 under the honest-but-curious model.
  • Dataset Privacy. The clustering dataset X is encrypted using the secret key s k 1 . Without access to s k 1 , the CSP has no means to decrypt the data features. This directly guarantees the privacy of the dataset.
  • Model Privacy. Throughout the protocol, all sensitive model parameters are encrypted with s k 1 . This includes initial/updated centroid features, intermediate MED values, centroid masks, and accumulated features. The CSP operates exclusively on these ciphertexts. Given the IND-CPA security of SHE, the CSP cannot infer any information about these parameters, thus ensuring the privacy of the k-means model.

6. Conclusions

In this paper, we proposed a fast and privacy-preserving outsourced k-means clustering scheme based on symmetric homomorphic encryption (SHE). Our scheme enables a data owner ( DO ) to securely outsource encrypted data to a cloud service provider ( CSP ) for k-means model training, while guaranteeing the privacy of both the dataset and the model parameters. We developed a Modified Euclidean Distance (MED) metric to avoid costly non-linear operations, thereby reducing both the computational and communication overheads. The experimental results demonstrate that our scheme achieves superior performance compared to the related works in terms of prediction accuracy, computation time, and total execution overhead. Building on the success of SHE for efficient privacy-preserving k-means clustering, our future work will focus on integrating this approach with outsourced incremental clustering and classification schemes to support a broader range of application scenarios.

Author Contributions

Conceptualization, S.X.; Methodology, S.X.; Software, W.T.; Validation, W.T.; Data curation, W.T.; Writing—original draft, W.T.; Writing—review & editing, S.X.; Visualization, W.T.; Supervision, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61902285, 62004077), the Open Foundation of Hubei Key Laboratory of Applied Mathematics (Hubei University) (Grant No. HBAM202101), the Major Program (JD) of Hubei Province (2023BAA027), and the Fundamental Research Funds for the Central Universities (Program No. 2662022XXYJ004). We thank Bo Li and the experimental teaching center of the College of Informatics, Huazhong Agricultural University for providing the experimental environment and computing resources.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheon, J.H.; Kim, D.; Park, J.H. Towards a Practical Cluster Analysis over Encrypted Data. In Proceedings of the Selected Areas in Cryptography–SAC 2019, Waterloo, ON, Canada, 12–16 August 2019; Paterson, K.G., Stebila, D., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 227–249. [Google Scholar]
  2. Catak, F.O.; Aydin, I.; Elezaj, O.; Yildirim-Yayilgan, S. Practical Implementation of Privacy Preserving Clustering Methods Using a Partially Homomorphic Encryption Algorithm. Electronics 2020, 9, 229. [Google Scholar] [CrossRef]
  3. Jaschke, A.; Armknecht, F. Unsupervised Machine Learning on Encrypted Data. In Proceedings of the Selected Areas in Cryptography–SAC 2018, Calgary, AB, Canada, 15–17 August 2018; Cid, C., Jacobson, M.J., Jr., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 453–478. [Google Scholar]
  4. Lorenzo, R. Fast but approximate homomorphic k-means based on masking technique. Int. J. Inf. Secur. 2023, 22, 1605–1619. [Google Scholar] [CrossRef]
  5. Sakellariou, G.; Gounaris, A. Homomorphically encrypted k-means on cloud-hosted servers with low client-side load. Computing 2019, 101, 24. [Google Scholar] [CrossRef]
  6. Mohassel, P.; Rosulek, M.; Trieu, N. Practical Privacy-Preserving K-means Clustering. Proc. Priv. Enhancing Technol. 2020, 2020, 414–433. [Google Scholar] [CrossRef]
  7. Bunn, P.; Ostrovsky, R. Secure two-party k-means clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 29 October–2 November 2007; CCS ’07. pp. 486–497. [Google Scholar] [CrossRef]
  8. Liu, X.; Jiang, Z.L.; Yiu, S.M.; Wang, X.; Tan, C.; Li, Y.; Liu, Z.; Jin, Y.; Fang, J. Outsourcing Two-Party Privacy Preserving K-Means Clustering Protocol in Wireless Sensor Networks. In Proceedings of the 2015 11th International Conference on Mobile Ad-hoc and Sensor Networks (MSN), Shenzhen, China, 16–18 December 2015; pp. 124–133. [Google Scholar] [CrossRef]
  9. Zhang, E.; Li, H.; Huang, Y.; Hong, S.; Zhao, L.; Ji, C. Practical multi-party private collaborative k-means clustering. Neurocomputing 2022, 467, 256–265. [Google Scholar] [CrossRef]
  10. Chillotti, I.; Gama, N.; Georgieva, M.; Izabachène, M. TFHE: Fast fully homomorphic encryption over the torus. J. Cryptol. 2020, 33, 34–91. [Google Scholar] [CrossRef]
  11. Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) fully homomorphic encryption without bootstrapping. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Cambridge, MA, USA, 8–10 January 2012; ITCS ’12. pp. 309–325. [Google Scholar] [CrossRef]
  12. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Proceedings of the Advances in Cryptology–ASIACRYPT 2017, Hong Kong, China, 3–7 December 2017; Takagi, T., Peyrin, T., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar]
  13. Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Prague, Czech Republic, 2–6 May 1999; pp. 223–238. [Google Scholar]
  14. Mahdikhani, H.; Lu, R.; Zheng, Y.; Shao, J.; Ghorbani, A.A. Achieving O(log3n) Communication-Efficient Privacy-Preserving Range Query in Fog-Based IoT. IEEE Internet Things J. 2020, 7, 5220–5232. [Google Scholar] [CrossRef]
  15. Zhao, J.; Zhu, H.; Wang, F.; Lu, R.; Liu, Z.; Li, H. PVD-FL: A Privacy-Preserving and Verifiable Decentralized Federated Learning Framework. IEEE Trans. Inf. Forensics Secur. 2022, 17, 2059–2073. [Google Scholar] [CrossRef]
  16. Zhao, J.; Zhu, H.; Wang, F.; Lu, R.; Wang, E.; Li, L.; Li, H. VFLR: An Efficient and Privacy-Preserving Vertical Federated Framework for Logistic Regression. IEEE Trans. Cloud Comput. 2023, 11, 3326–3340. [Google Scholar] [CrossRef]
  17. Miao, Y.; Liu, Z.; Li, X.; Li, M.; Li, H.; Choo, K.K.R.; Deng, R.H. Robust Asynchronous Federated Learning With Time-Weighted and Stale Model Aggregation. IEEE Trans. Dependable Secur. Comput. 2024, 21, 2361–2375. [Google Scholar] [CrossRef]
  18. Sun, L.; Zhang, Y.; Zheng, Y.; Song, W.; Lu, R. Towards Efficient and Privacy-Preserving High-Dimensional Range Query in Cloud. IEEE Trans. Serv. Comput. 2023, 16, 3766–3781. [Google Scholar] [CrossRef]
  19. Zheng, Y.; Lu, R.; Guan, Y.; Shao, J.; Zhu, H. Efficient and Privacy-Preserving Similarity Range Query over Encrypted Time Series Data. IEEE Trans. Dependable Secur. Comput. 2022, 19, 2501–2516. [Google Scholar] [CrossRef]
  20. Miao, Y.; Xu, C.; Zheng, Y.; Liu, X.; Meng, X.; Deng, R.H. Efficient and Secure Spatial Range Query over Large-scale Encrypted Data. In Proceedings of the 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), Hong Kong, China, 18–21 July 2023; pp. 1–11. [Google Scholar] [CrossRef]
  21. Zheng, Y.; Lu, R.; Guan, Y.; Zhang, S.; Shao, J.; Zhu, H. Efficient and Privacy-Preserving Similarity Query with Access Control in eHealthcare. IEEE Trans. Inf. Forensics Secur. 2022, 17, 880–893. [Google Scholar] [CrossRef]
  22. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ (accessed on 4 August 2025).
  23. Iris. 2025. Available online: http://archive.ics.uci.edu/dataset/53/iris (accessed on 4 August 2025).
  24. Wine. 2025. Available online: http://archive.ics.uci.edu/dataset/186/wine+quality (accessed on 4 August 2025).
  25. Breast Cancer. 2025. Available online: http://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (accessed on 4 August 2025).
  26. Banknote. 2025. Available online: https://archive.ics.uci.edu/dataset/267/banknote+authentication (accessed on 4 August 2025).
  27. Musk. 2025. Available online: https://archive.ics.uci.edu/dataset/75/musk-version-2 (accessed on 4 August 2025).
  28. Sepsis. 2025. Available online: https://archive.ics.uci.edu/dataset/827/sepsis+survival+minimal+clinical+records (accessed on 4 August 2025).
  29. Gas. 2025. Available online: https://archive.ics.uci.edu/dataset/224/gas+sensor+array+drift+dataset (accessed on 4 August 2025).
  30. Ultsch, A. CLUSTERING WITH SOM: U* C. In Proceedings of the Workshop on Self-Organizing Maps, WSOM’05, Paris, France, 5–8 September 2005; pp. 75–82. [Google Scholar]
  31. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. Clustering validity checking methods: Part II. ACM Sigmod Rec. 2002, 31, 19–27. [Google Scholar] [CrossRef]
Figure 1. The overview of the outsourced training system architecture.
Figure 2. Communication overhead from DO to CSP .
Figure 3. Communication overhead from CSP to DO .
Figure 4. Computation overhead of DO and CSP .
Figure 5. Total time overhead comparison with UCI-based related work (i.e., Lorenzo et al. [4]).
Table 1. Comparison of related works (NOTE: SS and SHE separately denote secret sharing and symmetric homomorphic encryption).
| Schemes | [7,8,9] | [2,7] | [3] | [5] | [4] | Our Work |
|---|---|---|---|---|---|---|
| Num. of servers | >1 | >1 | 1 | 1 | 1 | 1 |
| Techniques | SS | Paillier [13] | TFHE [10] | BGV [11] | CKKS [12] | SHE [14] |
| Comp. overhead | medium | low | high | medium | medium | low |
| Comm. overhead | medium | medium | high | medium | medium | low |
Table 2. Specifications of the benchmark datasets used in our experiments.
| Datasets | Centroid Num. (k) | Dim. (d) | Data Point Num. (n) |
|---|---|---|---|
| iris [23] | 3 | 4 | 150 |
| wine [24] | 3 | 13 | 178 |
| breast cancer [25] | 2 | 30 | 569 |
| banknote [26] | 2 | 5 | 1372 |
| musk [27] | 2 | 166 | 6598 |
| sepsis [28] | 2 | 3 | 110,204 |
| gas [29] | 6 | 128 | 13,910 |
Table 3. Accuracy comparisons between our scheme and related works.
| Related Works | Datasets | Evaluation Metrics | Accuracy Comparison |
|---|---|---|---|
| (i) [4] | Iris | Mean error | 2.95∼7.21% (> our 0%) |
| (ii) [2] | Self-synthetic | Adjusted Rand Index | 49∼72.4% (≈ our 40∼80%) |
| (iii) [3] | FCPS [30] | Misclassification rate | 0∼3% (≥ our 0%) |
| (iv) [5] | Self-synthetic | Relative error | 0.65∼26.91% (> our 0%) |
(i) The work in [4] uses the mean error between the centroids from ciphertext-based clustering and those from plaintext-based clustering as its accuracy metric. (ii) The authors of [2] employ the Adjusted Rand Index (ARI) [31], which measures the similarity between the clustering result and the ground-truth labels. (iii) Reference [3] utilizes the misclassification rate, which quantifies the fraction of incorrectly clustered data points, as its accuracy metric. (iv) The scheme in [5] is evaluated using the relative error between centroids computed on ciphertexts versus those computed on plaintexts.
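As a concrete illustration of the centroid-error metrics described in (i) and (iv), the following plaintext sketch computes the mean and relative error between centroids obtained on ciphertexts and on plaintexts (the function names and the averaging convention are our assumptions; the cited works may aggregate coordinates differently):

```python
def mean_error(enc_centroids, plain_centroids):
    """Mean absolute difference over all corresponding centroid coordinates."""
    diffs = [abs(e - p)
             for ec, pc in zip(enc_centroids, plain_centroids)
             for e, p in zip(ec, pc)]
    return sum(diffs) / len(diffs)

def relative_error(enc_centroids, plain_centroids):
    """Mean of |e - p| / |p| over all coordinates (plaintext values non-zero)."""
    errs = [abs(e - p) / abs(p)
            for ec, pc in zip(enc_centroids, plain_centroids)
            for e, p in zip(ec, pc)]
    return sum(errs) / len(errs)

plain = [(1.0, 2.0), (4.0, 4.0)]
exact = [(1.0, 2.0), (4.0, 4.0)]   # an exact scheme reproduces the plaintext result
assert mean_error(exact, plain) == 0.0
assert relative_error(exact, plain) == 0.0
```

Because our scheme performs exact integer arithmetic under SHE, both errors are zero, which is why the "Our 0%" entries appear in rows (i) and (iv) of Table 3.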
Table 4. Total time overhead comparison with non-UCI-based related works.
Comparison results between [2] (based on Paillier) and our work:
| Dataset Size | Research [2] | Our Work |
|---|---|---|
| 2000 | 99,402.12 s ≈ 27.61 h | 1.14 s |
| 3000 | 255,790.18 s ≈ 71.05 h | 1.29 s |
| 4000 | 377,794.82 s ≈ 104.94 h | 1.86 s |
| 5000 | 680,925.54 s ≈ 189.15 h | 4.76 s |

Comparison results between [3] (exact/approximate mode, based on TFHE) and our work:
| | [3] (Exact Mode) | [3] (Approximate Mode) | Our Work |
|---|---|---|---|
| Runtime per iteration | 873.46 h | 15.56 h | 0.166 s |
| Iteration number | 15 | 40 | 5 |
| Total time | 545.91 days ≈ 17.95 months | 25.93 days ≈ 0.85 months | 0.83 s |

Comparison results between [5] (fastest/slowest mode, based on BGV) and our work:
| (d, n) for [5] | [5] (Fastest Mode) | [5] (Slowest Mode) | (d, n) for Our Work | Our Work |
|---|---|---|---|---|
| d = 2, n = 100 | 1479 s | 3164 s | d = 4, n = 100 | 0.16 s |
| d = 10, n = 100 | 150 s | 702 s | d = 13, n = 100 | 0.97 s |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, W.; Xu, S. A Fast and Privacy-Preserving Outsourced Approach for K-Means Clustering Based on Symmetric Homomorphic Encryption. Mathematics 2025, 13, 2893. https://doi.org/10.3390/math13172893
