Electronics
  • Article
  • Open Access

10 August 2025

A Privacy-Enhanced Multi-Stage Dimensionality Reduction Vertical Federated Clustering Framework

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 College of Artificial Intelligence, Nankai University, Tianjin 300350, China
* Author to whom correspondence should be addressed.

Abstract

Federated clustering aims to discover latent knowledge in multi-source distributed data through clustering algorithms while preserving data privacy. Federated learning is categorized into horizontal and vertical federated learning based on data partitioning scenarios. Horizontal federated learning is applicable to scenarios with overlapping feature spaces but different sample IDs across parties. Vertical federated learning facilitates cross-institutional feature complementarity, which is particularly suited for scenarios with highly overlapping sample IDs yet significantly divergent features. As a classic clustering algorithm, k-means has seen extensive improvements and applications in horizontal federated learning. However, its application in vertical federated learning remains insufficiently explored, with room for enhancement in privacy protection and communication efficiency. Moreover, client feature imbalance may lead to biased clustering results. To improve communication efficiency, this paper introduces Product Quantization (PQ) to compress high-dimensional data into low-dimensional codes by generating local codebooks. Leveraging the k-means algorithm inherent in PQ, local training preserves data structure while avoiding the privacy risks associated with traditional PQ methods, which require server-side data reconstruction and may thus leak data distributions. To enhance privacy without compromising performance, Multidimensional Scaling (MDS) maps codebook cluster centers into distance-preserving indices. Only these indices are uploaded to the server, eliminating the need for data reconstruction. The server executes k-means on the indices to maximize intra-group similarity and inter-group divergence. This scheme retains the original codebooks locally for strict privacy protection. The nested application of PQ and MDS significantly reduces communication volume and frequency while effectively alleviating the clustering bias caused by client feature dimension imbalance. Validation on the MNIST dataset confirms that the approach maintains k-means clustering performance while meeting federated learning requirements for privacy and efficiency.

1. Introduction

Federated learning (FL) is a distributed machine learning approach whose core concept involves multiple clients (e.g., mobile devices or enterprises) collaboratively training a shared global model without sharing local data [1]. Federated learning clustering integrates federated learning and clustering algorithms to mine latent knowledge from multi-source distributed data while ensuring data privacy protection [2]. Federated learning faces multiple challenges including data heterogeneity [3,4], communication bottlenecks [1,5], and privacy protection [6].
Federated learning is categorized into horizontal federated learning (HFL) [7] and vertical federated learning (VFL) [8] based on data distribution patterns. In horizontal federated learning, participating parties share identical features but possess distinct sample IDs [9]. In vertical federated learning, different clients own distinct feature sets while sharing identical sample IDs. Horizontal federated learning finds widespread application in edge computing and IoT collaboration [10,11]. The value of vertical federated learning lies in cross-institutional feature complementarity, which makes it particularly suitable for scenarios with highly overlapping sample IDs yet significantly divergent features, such as in healthcare [12], finance [13], and other industries.
Current research on federated clustering algorithms predominantly focuses on the horizontal federated domain, with relatively limited exploration in vertical federated settings. Compared to horizontal federated learning, vertical federated learning relies more heavily on high-frequency encrypted interactions and high-dimensional feature transmission [14], imposing greater demands on communication. Simultaneously, vertical federated learning requires multi-party collaboration to compute loss functions and gradients. Due to features being dispersed across different participants, it necessitates reliance on Secure Multi-Party Computation (MPC), homomorphic encryption (HE), and other technologies to achieve joint training under privacy protection, all incurring substantial computational costs [15].
In 2020, Ghosh et al. proposed an iterative federated clustering algorithm that achieves efficient clustering by alternately estimating user identities and optimizing model parameters [16]. This method holds significant reference value for feature alignment. In 2023, Li et al. proposed a differentially private k-means clustering algorithm based on the Flajolet–Martin (FM) technique [17]. By aggregating differentially private cluster centers and membership information of local data on an untrusted central server, they constructed a weighted grid as a summary of the global dataset, ultimately generating global centers by executing the k-means algorithm. Subsequently, in [18], they introduced Flajolet–Martin (FM) sketches to encode local data and estimate cross-party marginal distributions under differential privacy constraints, thereby constructing a global Markov Random Field (MRF) model to generate high-quality synthetic data. Mazzone et al. [19] proposed a vertical federated k-means algorithm based on homomorphic encryption and differential privacy protection, demonstrating its superior clustering performance (e.g., k-means loss and clustering accuracy) over traditional privacy-preserving k-means algorithms while maintaining the same privacy level. Li et al. [20] proposed a vertical federated density peaks clustering algorithm based on a hybrid encryption framework. Building upon the merged distance matrix, they introduced a more effective clustering method under nonlinear mapping, enhancing density peaks clustering (DPC) performance while addressing privacy protection issues in vertical federated learning (VFL). Duan et al. [21] introduced communication models for federated learning and three state-of-the-art algorithms, and proposed a k-median clustering optimization algorithm. Huang et al. [22] conducted an in-depth analysis of vertical federated clustering mechanisms, proposing a privacy attack model targeting vertical federated k-means clustering. Their findings revealed the possibility of reconstructing raw data in vertical federated learning, thereby causing data privacy leakage.
This study addresses the privacy protection issues of k-means clustering in vertical federated learning scenarios by proposing a federated clustering framework based on lightweight multi-stage dimensionality reduction. It primarily resolves three major challenges: privacy constraints, communication bottlenecks, and feature imbalance problems.
Our main contributions are as follows:
(1)
We propose a multi-stage dimensionality reduction framework for vertical federated clustering based on Product Quantization (PQ) and Multidimensional Scaling (MDS). Locally, feature compression and codebook generation effectively reduce data volume, while PQ-quantized parameters inherently provide a noise-injection effect that enhances privacy. We further introduce a one-dimensional MDS embedding that maps the cluster centers in the codebook into distance-preserving indices. This achieves zero raw-codebook upload, dispenses with data reconstruction on the server side, and fundamentally eliminates the risk of data distribution leakage.
(2)
The multi-stage dimensionality reduction mechanism significantly reduces transmitted data volume and communication frequency while ensuring clustering accuracy, thereby improving communication efficiency.
(3)
The combination of PQ dimensionality reduction and MDS embedding algorithms mitigates clustering bias caused by feature imbalance in vertical federated learning.
(4)
Extensive experiments on the MNIST dataset validate that our algorithm satisfies federated learning privacy requirements while preserving clustering accuracy.
The structure of our paper is arranged as follows: Section 2 discusses existing clustering algorithms, challenges in VFL, and theoretical analyses of various dimensionality reduction techniques; Section 3 details our algorithmic design; Section 4 presents the experimental process and results analysis conducted to validate our algorithm; finally, Section 5 concludes the paper.

3. Method

The essence of clustering is grouping data points based on similarity. As long as compressed data preserves the “proximity relationships among data points” (i.e., distant samples remain separated and close samples remain aggregated), clustering on compressed data remains effective. Motivated by this principle, we design a multi-level dimensionality-reduction-based clustering framework for vertical federated learning that jointly accounts for feature imbalance, communication overhead, and privacy protection.
This framework is implemented based on PQ quantization and MDS techniques. After privacy-preserving alignment across multiple clients, the Product Quantization technique replaces traditional local model training. Algorithmically, PQ fundamentally relies on k-means clustering principles, compressing high-dimensional features into low-dimensional codes to significantly reduce communication overhead. Regarding the privacy protection design: directly uploading codebooks (cluster center sets) risks exposing original data distributions. We observe that clustering effectiveness depends heavily on the relative distances between cluster centers. Thus, we employ the MDS one-dimensional embedding algorithm $f: \mathbb{R}^n \to \mathbb{R}$ with $f(x_i) = z_i$, ensuring the following:
$\left| z_i - z_j \right| \approx \left\| x_i - x_j \right\|, \quad \forall \, i, j$
The server executes a one-shot clustering algorithm on the distance-preserving indices to obtain the clustering structure, maximizing intra-group similarity and inter-group divergence.
This approach preserves raw codebook privacy locally while maintaining distance relationships essential for clustering through indices, ensuring algorithmic performance and further enhancing communication efficiency.
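For illustration, the following minimal Python sketch (our own, not the paper's implementation) computes such a one-dimensional embedding via classical (Torgerson) MDS: the squared-distance matrix of the codebook centers is double-centered, and the leading eigenpair yields scalars whose gaps reflect the dominant pairwise-distance structure.

```python
import numpy as np

def mds_1d(centers: np.ndarray) -> np.ndarray:
    """One-dimensional classical MDS (Torgerson) embedding of cluster centers.

    centers: (k, d) array of codebook cluster centers.
    Returns z with |z_i - z_j| tracking ||c_i - c_j|| as closely as one dimension allows.
    """
    sq = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
    n = sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ sq @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    lead = np.argmax(eigvals)                    # leading eigenpair -> 1-D embedding
    return eigvecs[:, lead] * np.sqrt(max(eigvals[lead], 0.0))

# Toy example: embed the centers of a small codebook into one dimension.
rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 4))
z = mds_1d(centers)
print(np.round(z, 3))   # the distance-preserving indices used in place of the centers
```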
We adopt multi-stage dimensionality reduction rather than direct dimensionality reduction partly because feature counts differ significantly across clients; under such imbalance, direct reduction would bias the clustering results toward certain features.
Figure 1 illustrates the data flow of the entire algorithm. On the client side, the local data are D-dimensional. The original data are partitioned into M groups of subvectors, each of dimension D/M. PQ quantization is performed on these subvectors to generate a codebook, at which point original data values are converted into indices within the codebook. Subsequently, the MDS one-dimensional embedding algorithm is executed on the cluster centers in this codebook to obtain remapped values, which replace the original indices in the data. Finally, these local indices are uploaded to the server. After aligning all data by columns, the server executes the k-means clustering algorithm to obtain global cluster centers.
Figure 1. Data flow diagram.
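To make the local PQ stage of Figure 1 concrete, the sketch below (illustrative code with variable names of our choosing, using scikit-learn's KMeans) splits the D-dimensional data into subspaces, clusters each subspace to build the codebook, and replaces every sub-vector with the index of its nearest sub-center; the MDS remapping of the codebook centers then follows as sketched above.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train_encode(X: np.ndarray, sub_dim: int, sub_k: int):
    """Minimal PQ sketch: per-subspace k-means codebooks plus integer codes.

    X: (n, D) local data with D divisible by sub_dim (pad with zeros otherwise).
    Returns (codebooks, codes): per-subspace centers and an (n, M) index matrix.
    """
    n, D = X.shape
    M = D // sub_dim                                  # number of subspaces
    codebooks, codes = [], np.empty((n, M), dtype=np.int32)
    for m in range(M):
        sub = X[:, m * sub_dim:(m + 1) * sub_dim]     # (n, sub_dim) sub-vectors
        km = KMeans(n_clusters=sub_k, n_init=10, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)         # (sub_k, sub_dim) codebook
        codes[:, m] = km.labels_                      # index of the nearest sub-center
    return codebooks, codes

# Toy example: 100 samples, 8 dimensions, 4 subspaces of dimension 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
codebooks, codes = pq_train_encode(X, sub_dim=2, sub_k=16)
print(codes.shape)   # (100, 4): each sample is now 4 small integers
```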

3.1. Algorithm Design

Suppose there is a federated learning framework with m clients. The global dataset is D with global dimension dim, distributed across the clients. $D_g$ denotes the dataset of the g-th client, whose data points are $dim_g$-dimensional vectors, with $\sum_{g=1}^{m} dim_g = dim$. Define sub_dim as the subspace dimensionality in PQ quantization, and sub_k as the number of cluster centers per subspace in PQ quantization. $Codebook_g$ represents the codebook of the g-th client, $code_g^i$ denotes the cluster center set of the i-th subspace of the g-th client, $B_g$ is the data index set of the g-th client, and B is the global data index set.
Algorithm mainly consists of the following steps:
1.
Encrypted Entity Alignment
Select common samples: Extract common samples from each party’s dataset.
2.
Local Initialization and Training of PQ Quantizer:
(1)
Pad dimensions: Determine whether the local dimension dim is divisible by sub_dim. If not divisible, pad dimensions.
First, calculate the number of dimensions p to pad:
$dim' = \left\lceil dim / sub\_dim \right\rceil \times sub\_dim$
$p = dim' - dim$
Then, pad the original vector by appending p zeros to its end.
$v' = [\, v_1, v_2, \ldots, v_{dim}, \underbrace{0, 0, \ldots, 0}_{p} \,]$
(2)
Generate subspace codebooks based on training data.
3.
Secondary mapping:
Perform secondary mapping on cluster centers in subspace codebooks using MDS one-dimensional embedding algorithm, then use the normalized mapped values as codebook indices.
4.
Data transmission:
Codebooks are stored locally. Transfer the indices to the server side.
5.
Server-side global cluster center aggregation:
Execute k-means algorithm on the indices uploaded by clients at the server side to obtain abstract global cluster centers.
When using abstract cluster centers, apply the same mapping operation to local data, then compute distances to global abstract cluster centers to determine true cluster assignments.
The algorithm requires only one round of communication.
Table 1 shows the explanation of symbols:
Table 1. Symbol explanation.
The pseudocode is given in Algorithm 4:
Algorithm 4: Our algorithm
Input: Distributed dataset $D = \{D_1, D_2, \ldots, D_m\}$, where $D_g$ is the data of the g-th client. Number of clusters k. Maximum iterations T.
Output: Global cluster centers $C = \{c_1, c_2, \ldots, c_k\}$.
Steps:
 1:
   RSA Private Set Intersection
(1)
Server generates RSA key pair and broadcasts public key
(2)
Clients blind their own elements
(3)
Server signs blinded elements
(4)
Clients unblind received signatures
(5)
Clients send unblinded signatures to server
(6)
Server computes intersection signatures and sends to clients
(7)
Clients map signatures back to original elements
 2:
   Model Training:
(1)
Server distributes parameters: sub_dim (subspace dimensionality), sub_k (number of cluster centers per subspace)
(2)
At local client (g-th client):
   if dim % sub_dim != 0:
      pad dimensions with zeros
   codebook = pq.train(sub_dim, sub_k)
   for code in codebook:
      MDS dimensionality reduction: code = mds(code)
      Normalization: code = normalize(code)
      Add to code_list: code_list.append(code)
   Convert data to code indices: data_codes = pq.encode(data, code_list)
   Upload encrypted data indices to server
(3)
Server aggregation:
   Collect data indices from all clients
   Initialize the global index set B = ∅
   for client in client_list:
      Column-wise merge: B = B ∪ B_g
   On the merged index set B:
      Initialize cluster centers by randomly selecting k data points as initial global cluster centers
      Run k-means on B with the initialized centers to obtain the final global cluster centers C = {c_1, c_2, …, c_k}
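For concreteness, the following sketch illustrates the server-side aggregation of step (3) (our own illustration using scikit-learn's KMeans; encrypted transport and the PSI step are omitted, and the per-client index matrices are assumed to be already row-aligned):

```python
import numpy as np
from sklearn.cluster import KMeans

def server_aggregate(client_indices, k):
    """Server side of Algorithm 4: column-wise merge of index matrices + k-means.

    client_indices: list of (n, M_g) arrays of distance-preserving indices,
    with rows aligned to the same sample IDs by the PSI step.
    """
    B = np.hstack(client_indices)        # B = B ∪ B_g, realised as column-wise concat
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(B)
    return km.cluster_centers_, km.labels_

# Toy example: two clients, 200 aligned samples, 4 and 2 subspaces respectively.
rng = np.random.default_rng(1)
B1 = rng.random((200, 4))                # client 1: MDS-remapped, normalised indices
B2 = rng.random((200, 2))                # client 2
centers, labels = server_aggregate([B1, B2], k=10)
print(centers.shape, labels.shape)       # (10, 6) (200,): abstract global centers
```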

3.2. Cluster Accuracy Analysis

The performance guarantee of the algorithm stems from the multi-stage dimensionality reduction framework preserving critical information required for clustering tasks—namely, the relative similarity (distance structure) between data points.

3.2.1. PQ Quantization

PQ quantization reduces dimensionality via subspace decomposition. Its core principle is distributing high-dimensional quantization errors across low-dimensional subspaces, thereby controlling overall information loss and preserving fundamental similarity for subsequent clustering.
As shown in the error analysis of PQ quantization in Section 2, when the codebook size k is sufficiently large, the quantization error $\|x_i - x_i'\|_2$ in each subspace converges to the inherent noise level (e.g., the data variance) of that subspace. Consequently, the total error $\|x - x'\|_2$ is bounded. As the number of subspaces m increases (i.e., the subspace dimensionality decreases), errors distribute more uniformly, avoiding the error explosion caused by the “curse of dimensionality” in high-dimensional spaces.
Clustering relies on the property that “similar data points are closer in space”. For any two raw data points x and y, if x and y are similar ($\|x - y\|_2$ small), their corresponding subvectors $x_i$ and $y_i$ should also be similar ($\|x_i - y_i\|_2$ small) in each subspace. With bounded PQ quantization errors, the quantized distance satisfies the following:
$\left| \, \|x' - y'\|_2 - \|x - y\|_2 \, \right| \leq \|x - x'\|_2 + \|y - y'\|_2 \leq 2\varepsilon$
(where $\varepsilon$ denotes the maximum quantization error of a single vector).
When $\varepsilon \ll \|x - y\|_2$, the ordering of the quantized distances $\|x' - y'\|_2$ follows that of the original distances $\|x - y\|_2$; that is, quantized distances between similar data points remain smaller than those between dissimilar points, preserving the fundamental similarity required for clustering.
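The bound can also be checked numerically; the snippet below (a sanity check on synthetic data, with a quantize helper we define purely for illustration) trains per-subspace codebooks, quantizes two vectors, and verifies that the change in their pairwise distance never exceeds the sum of the two reconstruction errors.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))             # data used to train the codebooks
sub_dim, sub_k = 2, 64
M = X.shape[1] // sub_dim

def quantize(v, codebooks):
    """Replace each sub-vector of v with its nearest codebook center."""
    parts = []
    for m, cb in enumerate(codebooks):
        sub = v[m * sub_dim:(m + 1) * sub_dim]
        parts.append(cb[np.argmin(((cb - sub) ** 2).sum(1))])
    return np.concatenate(parts)

codebooks = [KMeans(n_clusters=sub_k, n_init=10, random_state=0)
             .fit(X[:, m * sub_dim:(m + 1) * sub_dim]).cluster_centers_
             for m in range(M)]

x, y = X[0], X[1]
xq, yq = quantize(x, codebooks), quantize(y, codebooks)
lhs = abs(np.linalg.norm(xq - yq) - np.linalg.norm(x - y))
rhs = np.linalg.norm(x - xq) + np.linalg.norm(y - yq)
print(lhs <= rhs + 1e-9)                  # True: the triangle-inequality bound holds
```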

3.2.2. MDS Mapping

Multidimensional Scaling (MDS) seeks to preserve the original pairwise distance relationships in a low-dimensional embedding. Because clustering (especially k-means) is driven by relative distances, as long as the distance ordering after the MDS mapping is consistent with that of the original space, clustering accuracy can be maintained. As shown in Section 2, in MDS the stress function attains its minimum when the mapping uses the eigenvector associated with the largest eigenvalue; equivalently, the squared error between the one-dimensional mapped distances and the original distances $d_{ij}$ is minimized. In this case, pairs that are close in the original space remain close in one dimension, and pairs that are far remain far, thereby preserving the global distance structure.
Assume that in the original codebook, $c_i$ and $c_j$ belong to the same class (small $d_{ij}$), while $c_i$ and $c_k$ belong to different classes (large $d_{ik}$). Because the MDS stress is minimized, the mapping preserves this relation with high probability, i.e., the mapped distances satisfy $|z_i - z_j| < |z_i - z_k|$, so intra-class distances remain smaller than inter-class distances.
Even under a one-dimensional mapping, as long as this “intra-class near, inter-class far” trend is maintained, clustering will correctly group the points. The advantage of a one-dimensional embedding is that it suppresses high-dimensional noise while compressing global distance relations into a simple numerical ordering, which is well suited for cross-party concatenation and global clustering.
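This ordering property can be verified on a toy configuration; the snippet below (illustrative only, using scikit-learn's metric MDS rather than the closed-form embedding) maps two well-separated groups of centroids to one dimension and checks that every intra-group gap stays below every inter-group gap.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Two well-separated groups of codebook centroids in 4-D.
group_a = rng.normal(loc=0.0, scale=0.3, size=(5, 4))
group_b = rng.normal(loc=5.0, scale=0.3, size=(5, 4))
centers = np.vstack([group_a, group_b])

# Metric MDS to one dimension (minimizes the stress w.r.t. Euclidean distances).
z = MDS(n_components=1, random_state=0).fit_transform(centers).ravel()

intra = [abs(z[i] - z[j]) for g in (range(5), range(5, 10))
         for i in g for j in g if i < j]
inter = [abs(z[i] - z[j]) for i in range(5) for j in range(5, 10)]
print(max(intra) < min(inter))   # True for this well-separated toy example
```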

3.2.3. Effectiveness of the Two-Stage Joint Compression

In this framework, although the errors introduced by PQ and by MDS are not independent, their accumulated effect does not destroy the distance structure required for clustering.
(1)
Bound of the error-propagation chain:
i.
PQ: quantizes the original vector x to x′ with a bounded reconstruction error $\varepsilon_1 = \|x - x'\|_2$.
ii.
MDS: maps the codebook centroids to a one-dimensional embedding z with a bounded distance distortion $\varepsilon_2 = \left| d_{ij} - |z_i - z_j| \right|$, where $\varepsilon_2$ is small due to stress minimization.
The joint effect on clustering appears as the probability of similarity inversion (i.e., pairs that were similar becoming dissimilar, or vice versa). Theoretical analysis shows that when $\varepsilon_1$ and $\varepsilon_2$ are both smaller than the gap between the maximum intra-class distance and the minimum inter-class distance in the original data, the probability of such inversions tends to zero, and the clustering result on the compressed data coincides with that on the original data.
(2)
Consistency between the multi-stage compression and the clustering objective: The k-means algorithm seeks to minimize the within-cluster sum of squared errors (SSE), denoted as follows:
$J = \min_{C} \sum_{k=1}^{K} \sum_{x \in C_k} \left\| x - \mu_k \right\|^2$
After PQ quantization, the corresponding SSE $J_{PQ}$ differs from J by the total quantization distortion. Similarly, when applying MDS to the PQ centroids, the resulting SSE $J_{MDS}$ deviates from $J_{PQ}$ by the distance-preservation error of the embedding. Because both the PQ and MDS errors can be made arbitrarily small (by choosing a sufficiently large codebook size and by minimizing the MDS stress), the optimal cluster centers obtained on the fully compressed data (the minimizer of $J_{MDS}$) will coincide closely with those of the original problem (the minimizer of J). In other words, the two-stage compressed representation still admits an approximately optimal k-means solution.
Overall, PQ controls the quantization error by decomposing the feature space into low-dimensional subspaces, thereby preserving local similarity among data points. MDS then preserves the global distance structure by embedding codebook centroids into one dimension. Because both PQ and MDS introduce bounded distortions that are compatible with the clustering criterion—“intra-cluster points remain closer than inter-cluster points”—the two-stage compression can guarantee that, for sufficiently large codebook size and sufficiently small MDS stress, clustering on the compressed data closely approximates the result on the original data.
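As an end-to-end illustration of this argument (not a replication of the experiments in Section 4), the sketch below compresses the scikit-learn digits dataset with a compact PQ-plus-1-D-remap stand-in for the two-stage pipeline and compares the k-means partition on the compressed representation against the partition on the raw features via the Adjusted Rand Index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

def compress(X, sub_dim=4, sub_k=32):
    """Compact stand-in for the two-stage pipeline: per-subspace k-means (PQ)
    followed by a 1-D remapping of the sub-centers (classical 1-D MDS under
    Euclidean distances coincides with projection on the leading principal axis)."""
    n, D = X.shape
    out = np.empty((n, D // sub_dim))
    for m in range(D // sub_dim):
        sub = X[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=sub_k, n_init=4, random_state=0).fit(sub)
        c = km.cluster_centers_ - km.cluster_centers_.mean(0)
        axis = np.linalg.svd(c, full_matrices=False)[2][0]   # leading principal axis
        z = c @ axis
        z = (z - z.min()) / (z.max() - z.min() + 1e-12)      # normalised index
        out[:, m] = z[km.labels_]
    return out

# 8x8 digits as a small stand-in for MNIST (which is used in Section 4).
X, y = load_digits(return_X_y=True)
labels_raw = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
labels_cmp = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(compress(X))
print(adjusted_rand_score(labels_raw, labels_cmp))  # agreement between the two partitions
```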

3.3. Hyperparameter Selection and Trade-Offs

In our framework, PQ quantization granularity is controlled by the subspace dimension ($sub\_dim$) and the number of codebook centroids per subspace ($sub\_k$).
A smaller $sub\_dim$ (more subspaces) disperses quantization errors more evenly but increases the number of codebooks, raising both computational and communication costs. A larger $sub\_k$ yields higher quantization fidelity (smaller error) and thus better global clustering performance, but reduces the compression ratio and increases privacy and communication overhead.
In our experiments, we evaluated combinations with $sub\_dim \in \{1, 2, 4, 8\}$ and $sub\_k \in \{10, 64, 128, 256\}$ to quantify performance under varying PQ granularities.
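These trade-offs can be quantified with a back-of-the-envelope estimate. Assuming 32-bit floats for raw features and for the MDS-remapped indices, and $\lceil \log_2 sub\_k \rceil$-bit integers if raw PQ codes were transmitted instead (an accounting of ours, not a measured result), the per-round payload for the MNIST dimensionality of Section 4 works out as follows:

```python
import math

def communication_estimate(D=784, sub_dim=8, sub_k=256, n=60000):
    """Rough per-round payload estimate under the stated bit-width assumptions."""
    M = math.ceil(D / sub_dim)                            # number of subspaces
    raw_bits = n * D * 32                                 # sending raw features
    float_idx_bits = n * M * 32                           # sending remapped float indices
    int_code_bits = n * M * math.ceil(math.log2(sub_k))   # sending integer PQ codes
    return raw_bits / float_idx_bits, raw_bits / int_code_bits

print(communication_estimate())   # (8.0, 32.0): compression vs. raw MNIST features
```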

3.4. Impact of Dimension Padding

Padding dimensions may introduce additional information or noise. In our solution, since we pad with zeros, these dimensions contribute nothing to the original data’s features yet increase computational and storage demands. Although the padded values are zeros, each subspace quantization must still process these dimensions, since distance calculations during quantization account for them. This cost remains relatively acceptable compared to the data compression benefits of PQ quantization.
However, dimension padding may disrupt natural correlations between subspaces. High-dimensional features in real data often exhibit local correlations (e.g., adjacent dimensions in image features). When padded dimensions are irrelevant to the original data, they may break intrinsic intra-subspace correlations. Consequently, the quantized indices fail to accurately reflect the original structure, thereby increasing quantization error.

3.5. Comparison with Other Dimensionality-Reduction Methods

Consistency with the clustering objective: PCA excels at preserving global linear variance but does not directly preserve pairwise distances, which can distort the neighborhood structure (near points staying near, far points staying far) that clustering exploits. MDS explicitly preserves both local and global distance relationships, making it more effective for similarity-based clustering.
Information loss vs. communication cost: One-dimensional MDS retains only the principal distance variations, incurring higher information loss. Two-dimensional MDS or PCA can capture more complex relationships but multiply the communication cost and risk leaking additional distributional information. Thus, for vertical federated clustering under strict privacy and communication constraints, our PQ and 1D MDS two-stage approach strikes a balanced trade-off between fidelity, privacy, and efficiency.

3.6. Privacy Enhancement

The privacy protection mechanism of this algorithm derives from two aspects. First, PQ quantization reduces dimensionality and transforms the raw data, so the raw data are never uploaded directly to the server. Second, the PQ codebooks are retained locally, which avoids leakage of the original data distribution.
Additionally, to defend against attacks such as membership inference, differential privacy noise can be incorporated. Differential privacy provides a rigorous framework ensuring that analytical results do not reveal individual information. By adding noise to data or intermediate results, attackers cannot determine any individual’s presence in the dataset. According to the sequential/parallel composition theorems of differential privacy [47], we have the following:
1.
Sequential Composition: For a given dataset D, assume there exist randomized algorithms $M_1, M_2, \ldots, M_n$ with privacy budgets $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$, respectively. The composed algorithm $M(M_1(D), M_2(D), \ldots, M_n(D))$ provides $(\sum_{i=1}^{n} \epsilon_i)$-DP protection. That is, applying a series of differentially private algorithms sequentially to the same dataset provides protection equivalent to the sum of the privacy budgets.
2.
Parallel Composition: For disjoint datasets $D_1, D_2, \ldots, D_n$, assume there exist randomized algorithms $M_1, M_2, \ldots, M_n$ with privacy budgets $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$, respectively. $M(M_1(D_1), M_2(D_2), \ldots, M_n(D_n))$ provides $(\max_i \epsilon_i)$-DP protection. That is, applying different differentially private algorithms in parallel to disjoint datasets provides privacy protection equivalent to the maximum privacy budget among the composed algorithms.
Adding differential privacy noise to raw data or codebooks can effectively enhance privacy protection.
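As an optional hardening step (not part of the experiments in Section 4), Laplace noise could be added to each subspace codebook before the MDS remapping. The sketch below is a hedged illustration: the sensitivity is treated as a user-supplied bound, and which composition theorem applies depends on whether the released codebooks are built from disjoint records.

```python
import numpy as np

def privatize_codebook(codebook, epsilon, sensitivity, rng=None):
    """Laplace mechanism on one subspace codebook (illustrative sketch; `sensitivity`
    must be a valid upper bound on how much a single record can shift a center)."""
    rng = np.random.default_rng(0) if rng is None else rng
    scale = sensitivity / epsilon
    return codebook + rng.laplace(0.0, scale, size=codebook.shape)

# One codebook per subspace. If each codebook were computed on disjoint records,
# parallel composition would cap the total cost at max(eps_i); since the subspaces
# here share the same records, sequential composition sums the budgets instead.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 2)) for _ in range(4)]
noisy = [privatize_codebook(cb, epsilon=1.0, sensitivity=0.05, rng=rng) for cb in codebooks]
print(noisy[0].shape)   # (16, 2): same shape, perturbed centers
```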

4. Experiments and Results

4.1. Experimental Settings

4.1.1. Dataset

The MNIST dataset is adopted, with columns split according to client feature ratios, and different subsets are distributed to different clients.

4.1.2. Parameter Settings

  • Total number of clients: 2;
  • Total number of client features: 784;
  • Client feature ratios: (1:1), (1:6), (1:13);
  • PQ quantization subspace dimensions: 1, 2, 4, 8;
  • Number of PQ quantization cluster centers per subspace: 10, 64, 128, 256.

4.1.3. Evaluation Metrics

The evaluation employs Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) as metrics.

4.2. Performance Analysis

Centralized data clustering serves as the baseline for validation.

4.2.1. Comparison Between the Proposed Method and Centralized k-Means

This experiment involves two clients, each with 392 data dimensions at a 1:1 ratio. As shown in Table 2, the proposed algorithm achieves slightly higher average NMI and ARI values under various subspace dimensions and cluster center counts than centralized k-means, indicating effective extraction of original data features. When the subspace dimension is 1, equivalent to performing k-means on each data column and replacing values with cluster centers, PQ loss is minimized. Data confirms superior performance in this scenario. Performance improves with fewer cluster centers per subspace, partly because the sparse MNIST dataset can be effectively represented by fewer clusters.
Table 2. Algorithm performance comparison.

4.2.2. Impact of MDS on Algorithm Performance

To evaluate the impact of MDS, we compare against two alternative implementations without MDS: one clusters the codebook indices directly at the server, and the other reconstructs data from the PQ codes at the server. The privacy comparison is as follows: direct index clustering > proposed method > data reconstruction.
As Table 2 shows, the performance comparison is as follows: proposed method ≥ data reconstruction > direct index clustering. When the subspace dimension is 1, direct index clustering performs comparably to the other methods. However, its performance declines sharply as the PQ subspace dimension increases, indicating insufficient capture of distance relationships in the data. The proposed method better captures these distance relationships, yielding slightly superior performance over data reconstruction.

4.2.3. Impact of Client Feature Quantity on Performance

As Table 3 shows, by varying the feature-partition ratios among clients (1:1, 1:6, and 1:13), our algorithm consistently achieves NMI scores of approximately 0.5 and ARI scores of approximately 0.4, significantly outperforming the baseline. Moreover, its performance remains stable within a narrow range, indicating that the relative number of features held by each client has minimal effect on clustering quality and that the method effectively overcomes feature-dimension imbalance.
Table 3. Experimental results: impact of client feature quantity on performance.

5. Conclusions

Based on the characteristics of clustering algorithms, this study transforms the core contradiction of federated clustering—“data utility vs. privacy preservation vs. communication efficiency”—into a verifiable distance-preserving optimization problem, providing a secure clustering implementation framework for vertical federated learning. The proposed algorithm can address communication challenges in vertical federated learning while preserving privacy.
Addressing three core challenges of k-means clustering in vertical federated learning—inadequate privacy protection, excessive communication overhead, and feature dimension imbalance—this paper innovatively proposes a multi-level dimensionality reduction federated clustering framework integrating Product Quantization (PQ) and Multidimensional Scaling (MDS). By compressing original high-dimensional features into low-dimensional codes via PQ and further reducing dimensionality through one-dimensional MDS embedding on codebooks, communication efficiency is significantly enhanced. Privacy protection is achieved through the following: (1) dimensionality reduction lowering data precision and transmission volume, and (2) mapping sensitive codebooks to secure indices via MDS embedding, transmitting only distance-preserving indices to the server. The original codebooks and feature data remain exclusively on local clients.
Experimental results demonstrate that our algorithm significantly outperforms the baseline algorithm DP-VFC across multiple scenarios and even marginally surpasses centralized algorithms. This indicates that while performing data dimensionality reduction and compression, our algorithm effectively preserves distance characteristics of the data. Through varying the ratio of feature dimensions across clients during experiments, the stability of the algorithm is validated. We conclude that in federated settings, our approach enhances communication efficiency and privacy while reducing sensitivity to data distribution, demonstrating strong performance stability.
Nevertheless, this study has certain limitations. Currently, uniform quantization granularity is adopted during data quantization. To better handle heterogeneous data, personalized granularity warrants further investigation—for example, allowing participants to set parameters based on their local data characteristics. The trade-off between privacy protection and compression loss also requires deeper exploration. Although single-round communication improves efficiency, critical information loss in client-uploaded statistical features (due to noise or compression) cannot be corrected through multiple rounds of interaction. This is especially so in extreme heterogeneity scenarios, as single-pass compression may cause irreversible deviation in global clustering, leading to an accuracy–efficiency trade-off dilemma. Consequently, single-round communication imposes exceptionally high requirements on local data training and compression quality. Subsequent research will address these aspects theoretically and experimentally.
Beyond clustering applications, the multi-stage dimensionality reduction framework for vertical federated learning can be readily extended to other machine learning algorithms, such as replacing global aggregation mechanisms. Future work may explore optimization and integration of multi-layered privacy-preserving techniques, including the fusion of secure multi-party computation (MPC) with homomorphic encryption (HE), as well as multi-tier privacy protection policies.

Author Contributions

Methodology, J.W.; software, J.W. and J.Z.; validation, J.Z. and X.C.; investigation, X.C.; writing—original draft preparation, J.W.; writing—review and editing, J.W. and J.Z.; visualization, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: Cambridge, MA, USA, 2017; pp. 1273–1282. [Google Scholar]
  2. Stallmann, M.; Wilbik, A. Towards Federated Clustering: A Federated Fuzzy c-Means Algorithm (FFCM). arXiv 2022, arXiv:2201.07316. [Google Scholar]
  3. Chen, M.; Shlezinger, N.; Poor, H.V.; Eldar, Y.C.; Cui, S. Communication-efficient federated learning. Proc. Natl. Acad. Sci. USA 2021, 118, e2024789118. [Google Scholar] [CrossRef]
  4. Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390. [Google Scholar] [CrossRef]
  5. Zhou, X.; Yang, G. Communication-efficient and privacy-preserving large-scale federated learning counteracting heterogeneity. Inf. Sci. 2024, 661, 120167. [Google Scholar] [CrossRef]
  6. Mohammadi, N.; Bai, J.; Fan, Q.; Song, Y.; Yi, Y.; Liu, L. Differential privacy meets federated learning under communication constraints. IEEE Internet Things J. 2021, 9, 22204–22219. [Google Scholar] [CrossRef]
  7. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  8. Wei, K.; Li, J.; Ma, C.; Ding, M.; Wei, S.; Wu, F.; Chen, G.; Ranbaduge, T. Vertical federated learning: Challenges, methodologies and experiments. arXiv 2022, arXiv:2202.04309. [Google Scholar] [CrossRef]
  9. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
  10. Li, J.; Wei, H.; Liu, J.; Liu, W. FSLEdge: An energy-aware edge intelligence framework based on Federated Split Learning for Industrial Internet of Things. Expert Syst. Appl. 2024, 255, 124564. [Google Scholar] [CrossRef]
  11. Khan, L.U.; Pandey, S.R.; Tran, N.H.; Saad, W.; Han, Z.; Nguyen, M.N.; Hong, C.S. Federated learning for edge networks: Resource optimization and incentive mechanism. IEEE Commun. Mag. 2020, 58, 88–93. [Google Scholar] [CrossRef]
  12. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
  13. Wu, Z.; Hou, J.; He, B. Vertibench: Advancing feature distribution diversity in vertical federated learning benchmarks. arXiv 2023, arXiv:2307.02040. [Google Scholar]
  14. Khan, A.; ten Thij, M.; Wilbik, A. Communication-efficient vertical federated learning. Algorithms 2022, 15, 273. [Google Scholar] [CrossRef]
  15. Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; Papadopoulos, D.; Yang, Q. Secureboost: A lossless federated learning framework. IEEE Intell. Syst. 2021, 36, 87–98. [Google Scholar] [CrossRef]
  16. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An Efficient Framework for Clustered Federated Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19586–19597. [Google Scholar] [CrossRef]
  17. Li, Z.; Wang, T.; Li, N. Differentially private vertical federated clustering. arXiv 2022, arXiv:2208.01700. [Google Scholar] [CrossRef]
  18. Zhao, F.; Li, Z.; Ren, X.; Ding, B.; Yang, S.; Li, Y. VertiMRF: Differentially Private Vertical Federated Data Synthesis. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4431–4442. [Google Scholar]
  19. Mazzone, F.; Brown, T.; Kerschbaum, F.; Wilson, K.H.; Everts, M.; Hahn, F.; Peter, A. Privacy-Preserving Vertical K-Means Clustering. arXiv 2025, arXiv:2504.07578. [Google Scholar]
  20. Li, C.; Ding, S.; Xu, X.; Guo, L.; Ding, L.; Wu, X. Vertical Federated Density Peaks Clustering under Nonlinear Mapping. IEEE Trans. Knowl. Data Eng. 2024, 37, 1004–1017. [Google Scholar] [CrossRef]
  21. Duan, Q.; Lu, Z. Edge Cloud Computing and Federated–Split Learning in Internet of Things. Future Internet 2024, 16, 227. [Google Scholar] [CrossRef]
  22. Huang, Y.; Huo, Z.; Fan, Y. DRA: A data reconstruction attack on vertical federated k-means clustering. Expert Syst. Appl. 2024, 250, 123807. [Google Scholar] [CrossRef]
  23. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  24. Han, J.; Kamber, M. Data Mining: Concepts and Techniques; Morgan Kaufmann: San Francisco, CA, USA, 2006. [Google Scholar]
  25. Mary, S.S.; Selvi, T. A study of K-means and cure clustering algorithms. Int. J. Eng. Res. Technol. 2014, 3, 1985–1987. [Google Scholar]
  26. Hwang, H.; Yang, S.; Kim, D.; Dua, R.; Kim, J.Y.; Yang, E.; Choi, E. Towards the practical utility of federated learning in the medical domain. In Proceedings of the Conference on Health, Inference, and Learning, Cambridge, MA, USA, 22–24 June 2023; PMLR: Cambridge, MA, USA, 2023; pp. 163–181. [Google Scholar]
  27. Luo, Y.; Lu, Z.; Yin, X.; Lu, S.; Weng, Y. Application research of vertical federated learning technology in banking risk control model strategy. In Proceedings of the 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Wuhan, China, 21–24 December 2023; IEEE: New York, NY, USA, 2023; pp. 545–552. [Google Scholar]
  28. Dwarampudi, A.; Yogi, M. Application of federated learning for smart agriculture system. Int. J. Inform. Technol. Comput. Eng. (IJITC) ISSN 2024, 2455–5290. [Google Scholar] [CrossRef]
  29. Jiang, D.; Wang, H.; Li, T.; Gouda, M.A.; Zhou, B. Real-time tracker of chicken for poultry based on attention mechanism-enhanced YOLO-Chicken algorithm. Comput. Electron. Agric. 2025, 237, 110640. [Google Scholar] [CrossRef]
  30. Gao, Y.; Xie, Y.; Deng, H.; Zhu, Z.; Zhang, Y. A Privacy-preserving Data Alignment Framework for Vertical Federated Learning. J. Electron. Inf. Technol. 2024, 46, 3419–3427. [Google Scholar]
  31. Yang, L.; Chai, D.; Zhang, J.; Jin, Y.; Wang, L.; Liu, H.; Tian, H.; Xu, Q.; Chen, K. A survey on vertical federated learning: From a layered perspective. arXiv 2023, arXiv:2304.01829. [Google Scholar] [CrossRef]
  32. Liu, Y.; Kang, Y.; Zou, T.; Pu, Y.; He, Y.; Ye, X.; Ouyang, Y.; Zhang, Y.Q.; Yang, Q. Vertical federated learning: Concepts, advances, and challenges. IEEE Trans. Knowl. Data Eng. 2024, 36, 3615–3634. [Google Scholar] [CrossRef]
  33. Zhao, Z.; Mao, Y.; Liu, Y.; Song, L.; Ouyang, Y.; Chen, X.; Ding, W. Towards efficient communications in federated learning: A contemporary survey. J. Frankl. Inst. 2023, 360, 8669–8703. [Google Scholar] [CrossRef]
  34. Yang, H.; Liu, H.; Yuan, X.; Wu, K.; Ni, W.; Zhang, J.A.; Liu, R.P. Synergizing Intelligence and Privacy: A Review of Integrating Internet of Things, Large Language Models, and Federated Learning in Advanced Networked Systems. Appl. Sci. 2025, 15, 6587. [Google Scholar] [CrossRef]
  35. Zhang, C.; Li, S. State-of-the-art approaches to enhancing privacy preservation of machine learning datasets: A survey. arXiv 2024, arXiv:2404.16847. [Google Scholar]
  36. Qi, Z.; Meng, L.; Li, Z.; Hu, H.; Meng, X. Cross-Silo Feature Space Alignment for Federated Learning on Clients with Imbalanced Data. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
  37. Hu, K.; Xiang, L.; Tang, P.; Qiu, W. Feature norm regularized federated learning: Utilizing data disparities for model performance gains. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju Island, Republic of Korea, 3–9 August 2024; pp. 4136–4146. [Google Scholar]
  38. Aramian, A. Managing Feature Diversity: Evaluating Global Model Reliability in Federated Learning for Intrusion Detection Systems in IoT. Eng. Technol. 2024, 39. [Google Scholar]
  39. Johnson, A. A Survey of Recent Advances for Tackling Data Heterogeneity in Federated Learning. Preprints 2025. [Google Scholar]
  40. Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
  41. Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
  42. Yue, K.; Jin, R.; Wong, C.W.; Baron, D.; Dai, H. Gradient obfuscation gives a false sense of security in federated learning. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 6381–6398. [Google Scholar]
  43. Ge, T.; He, K.; Ke, Q.; Sun, J. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 744–755. [Google Scholar] [CrossRef]
  44. Xiao, S.; Liu, Z.; Shao, Y.; Lian, D.; Xie, X. Matching-oriented product quantization for ad-hoc retrieval. arXiv 2021, arXiv:2104.07858. [Google Scholar]
  45. Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
  46. Izenman, A.J. Linear Dimensionality Reduction; Springer: New York, NY, USA, 2013. [Google Scholar]
  47. Vadhan, S.; Zhang, W. Concurrent Composition Theorems for Differential Privacy. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, Orlando, FL, USA, 20–23 June 2023. [Google Scholar]
