A Privacy-Enhanced Multi-Stage Dimensionality Reduction Vertical Federated Clustering Framework
Abstract
1. Introduction
- (1)
- We propose a multi-stage dimensionality reduction framework for vertical federated clustering built on Product Quantization (PQ) and Multidimensional Scaling (MDS). Local feature compression and codebook generation substantially reduce the data volume, and the quantization noise inherent in PQ provides an additional privacy-enhancing effect. We further introduce a one-dimensional MDS embedding that maps the cluster centers in each codebook to distance-preserving indices. As a result, no raw codebook is ever uploaded, the server performs no data reconstruction, and the risk of leaking the local data distribution is eliminated at the source.
- (2)
- The multi-stage dimensionality reduction mechanism significantly reduces transmitted data volume and communication frequency while ensuring clustering accuracy, thereby improving communication efficiency.
- (3)
- The combination of PQ dimensionality reduction and MDS embedding algorithms mitigates clustering bias caused by feature imbalance in vertical federated learning.
- (4)
- Extensive experiments on the MNIST dataset validate that our algorithm satisfies federated learning privacy requirements while preserving clustering accuracy.
2. Related Work
2.1. K-Means Clustering Algorithm
Algorithm 1: k-means [23].
Input: Dataset $X = \{x_1, \dots, x_n\}$, number of clusters $K$, maximum iterations $T$.
Output: Partitioning of $X$ into $K$ clusters.
Steps:
1. Randomly select $K$ points from $X$ as the initial cluster centers.
2. Assign each point to the cluster whose center is nearest.
3. Recompute each center as the mean of the points assigned to it.
4. Repeat steps 2–3 until the centers no longer change or $T$ iterations are reached.
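To ground the notation, the following is a minimal NumPy sketch of the k-means loop in Algorithm 1; the names (`kmeans`, `X`, `K`, `T`) are our own illustration, not code from [23].

```python
# Minimal NumPy sketch of Algorithm 1 (k-means); illustrative only.
import numpy as np

def kmeans(X, K, T=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(T):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Step 4: stop early once the centers stabilize.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```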
2.2. Vertical Federated Learning
2.3. PQ Quantization Technique
- (1)
- Space Decomposition: Decompose a vector of dimension $D$ into $M$ subspaces, each of dimension $D/M$ (requiring $D \bmod M = 0$). Formally, the original vector $v \in \mathbb{R}^D$ is partitioned into $M$ subvectors $v_i \in \mathbb{R}^{D/M}$, and each subvector is quantized independently.
- (2)
- Subspace Quantization: Perform k-means clustering on each subspace to generate a codebook containing k cluster centers.
- (3)
- Encoding Representation: The original vector is represented by the combination of indices of the nearest cluster centers of its subvectors. For example, a 128-dimensional vector can be compressed into eight 8-bit indices (requiring only 64 bits of storage, achieving a compression ratio of up to 97%).
- (1)
- Reconstruction Error: Each subspace generates $K$ centroids via k-means, $C_i = \{c_{i,1}, \dots, c_{i,K}\}$. Subvector $v_i$ is quantized to its nearest centroid $q_i(v_i) = \arg\min_{c \in C_i} \|v_i - c\|$, with reconstruction error $e_i = \|v_i - q_i(v_i)\|^2$. This error decreases as $K$ increases.
- (2)
- Subspace Partitioning Error: A $D$-dimensional vector is partitioned into $M$ subspaces (each of dimension $d = D/M$), so an original vector $v \in \mathbb{R}^D$ is written as $v = [v_1, \dots, v_M]$. Independent quantization across subspaces destroys interdimensional correlations, and the reconstruction error decomposes across subspaces as $E(v) = \sum_{i=1}^{M} \|v_i - q_i(v_i)\|^2$. When the subspace partitioning does not align with the principal components of the data distribution (e.g., without PCA preprocessing), this error increases significantly. A code sketch of the PQ pipeline follows.
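The sketch below walks through the three PQ steps and the reconstruction-error behavior discussed above. It reuses the `kmeans` sketch from Section 2.1, and the helper names (`pq_train`, `pq_encode`, `pq_decode`) are assumptions for illustration, not the authors' implementation.

```python
# Product Quantization sketch: M subspaces, K centroids each (Section 2.3).
import numpy as np

def pq_train(X, M, K, T=50):
    n, D = X.shape
    assert D % M == 0, "requires D mod M = 0"
    d = D // M
    # One codebook of K centroids per subspace; shape (M, K, d).
    return np.stack([kmeans(X[:, i*d:(i+1)*d], K, T)[0] for i in range(M)])

def pq_encode(X, codebooks):
    M, K, d = codebooks.shape
    codes = np.empty((len(X), M), dtype=np.int64)
    for i in range(M):
        sub = X[:, i*d:(i+1)*d]
        dist = np.linalg.norm(sub[:, None, :] - codebooks[i][None], axis=2)
        codes[:, i] = dist.argmin(axis=1)  # index of the nearest centroid
    return codes

def pq_decode(codes, codebooks):
    # Reconstruct x' by concatenating the selected centroids.
    return np.hstack([codebooks[i][codes[:, i]] for i in range(codes.shape[1])])

X = np.random.default_rng(0).normal(size=(500, 16))
cb = pq_train(X, M=4, K=10)
err = np.mean(np.sum((X - pq_decode(pq_encode(X, cb), cb)) ** 2, axis=1))
print(f"mean reconstruction error: {err:.4f}")  # shrinks as K grows
```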
2.4. MDS Mapping Method
Algorithm 2: MDS [46] one-dimensional embedding.
Input: $m \times n$ data matrix $X$, distance metric (default: Euclidean distance).
Output: One-dimensional coordinate vector $z \in \mathbb{R}^m$.
Steps:
1. Compute the pairwise squared-distance matrix $D^{(2)}$ of the rows of $X$.
2. Double-center it: $B = -\frac{1}{2} J D^{(2)} J$, where $J = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$.
3. Compute the leading eigenpair $(\lambda_1, u_1)$ of $B$.
4. Return $z = \sqrt{\lambda_1}\, u_1$.
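Below is a minimal classical-MDS sketch of Algorithm 2, assuming Euclidean distances and NumPy; the final normalization mirrors the method's use of normalized mapped values as codebook indices (Section 3.1).

```python
# Classical MDS, one-dimensional embedding (Algorithm 2); illustrative only.
import numpy as np

def mds_1d(X):
    m = len(X)
    # Pairwise squared Euclidean distances between the rows of X.
    D2 = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    # Double-centering: B = -1/2 * J D2 J, with J = I - (1/m) 11^T.
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    # The leading eigenpair of B yields the best distance-preserving 1-D map.
    vals, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    z = np.sqrt(max(vals[-1], 0.0)) * vecs[:, -1]
    # Normalize to [0, 1] so the values can serve as codebook indices.
    return (z - z.min()) / (z.max() - z.min() + 1e-12)
```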
2.5. Baseline
Algorithm 3: DP-VFC [17] algorithm steps (as given in [17]).
3. Method
3.1. Algorithm Design
- 1.
- Encrypted Entity Alignment: extract the samples common to all parties' datasets.
- 2.
- Local Initialization and Training of the PQ Quantizer:
- (1)
- Pad dimensions: Check whether the local dimension $\mathrm{dim}$ is divisible by $\mathrm{sub\_dim}$. If not, first compute the number of dimensions to pad, $p = \mathrm{sub\_dim} - (\mathrm{dim} \bmod \mathrm{sub\_dim})$, then pad the original vector by appending $p$ zeros to its end.
- (2)
- Generate subspace codebooks based on training data.
- 3.
- Secondary mapping: Apply the MDS one-dimensional embedding algorithm to the cluster centers in the subspace codebooks, then use the normalized mapped values as codebook indices.
- 4.
- Data transmission: Codebooks are stored locally; only the indices are transmitted to the server.
- 5.
- Server-side global cluster center aggregation: The server executes the k-means algorithm on the indices uploaded by the clients to obtain abstract global cluster centers. To use these abstract centers, each client applies the same mapping to its local data and computes distances to the global abstract centers to determine the true cluster assignments. The algorithm requires only one round of communication; a composed code sketch follows Algorithm 4.
Algorithm 4: Proposed algorithm.
Input: Distributed dataset $D = \{D_1, \dots, D_G\}$, where $D_g$ is the data of the $g$-th client; number of clusters $k$; maximum iterations $T$.
Output: Global cluster centers $C = \{c_1, \dots, c_k\}$.
Steps: as described in Section 3.1 (entity alignment, local PQ training, MDS mapping, index upload, server-side aggregation).
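Composing the earlier sketches gives a hedged end-to-end illustration of steps 1–5; the function names, padding rule, and index layout are our assumptions for exposition, not the authors' released code.

```python
# End-to-end sketch of Algorithm 4, built from the kmeans/pq_*/mds_1d
# sketches above; illustrative only.
import numpy as np

def client_upload(X_local, sub_dim, sub_k):
    dim = X_local.shape[1]
    # Step 2(1): pad with p zeros so that sub_dim divides the local dimension.
    p = (sub_dim - dim % sub_dim) % sub_dim
    X_pad = np.pad(X_local, ((0, 0), (0, p)))
    M = X_pad.shape[1] // sub_dim
    # Step 2(2): train the local subspace codebooks.
    codebooks = pq_train(X_pad, M=M, K=sub_k)
    # Step 3: MDS-embed each subspace codebook; the raw codebook stays local.
    mapped = np.stack([mds_1d(codebooks[i]) for i in range(M)])
    codes = pq_encode(X_pad, codebooks)
    # Step 4: upload only the distance-preserving indices (an n x M matrix).
    return np.take_along_axis(mapped, codes.T, axis=1).T

def server_aggregate(uploads, k):
    # Step 5: one k-means pass over the aligned, concatenated client indices;
    # only one round of communication is needed.
    Z = np.hstack(uploads)
    centers, labels = kmeans(Z, k)   # abstract global cluster centers
    return centers, labels
```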
3.2. Cluster Accuracy Analysis
3.2.1. PQ Quantization
3.2.2. MDS Mapping
3.2.3. Effectiveness of the Two-Stage Joint Compression
- (1)
- Bound of the error-propagation chain:
- i.
- PQ quantizes the original vector $x$ to $x'$ with a bounded reconstruction error $\varepsilon_{\mathrm{PQ}} = \|x - x'\|$;
- ii.
- MDS maps the codebook centroids to a one-dimensional embedding $z$ with a bounded distance distortion $\varepsilon_{\mathrm{MDS}}$, which is small due to stress minimization.
The joint effect on clustering appears as the probability of similarity inversion (i.e., pairs that were similar becoming dissimilar, or vice versa). Theoretical analysis shows that when $\varepsilon_{\mathrm{PQ}}$ and $\varepsilon_{\mathrm{MDS}}$ are both smaller than the gap between the maximum intra-class distance and the minimum inter-class distance in the original data, the probability of such inversions tends to zero, and the clustering result on the compressed data coincides with that on the original data.
- (2)
- Consistency between the multi-stage compression and the clustering objective: The k-means algorithm minimizes the within-cluster sum of squared errors (SSE), $J = \sum_{j=1}^{k} \sum_{x \in S_j} \|x - \mu_j\|^2$. After PQ quantization, the corresponding SSE $J'$ differs from $J$ by the total quantization distortion. Similarly, when MDS is applied to the PQ centroids, the resulting SSE $J''$ deviates from $J'$ by the distance-preservation error of the embedding. Because both the PQ and MDS errors can be made arbitrarily small (by choosing a sufficiently large codebook and by minimizing the MDS stress), the optimal cluster centers obtained on the fully compressed data (the minimizer of $J''$) closely coincide with those of the original problem (the minimizer of $J$). In other words, the two-stage compressed representation still admits an approximately optimal k-means solution; the sketch below gives a quick numerical check.
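As a numerical sanity check of this argument (our own synthetic example, reusing the sketches above), one can compare the k-means SSE on the original data with the SSE on the PQ-reconstructed data; with a large codebook the two nearly coincide.

```python
# SSE comparison on original vs. PQ-reconstructed data; illustrative only.
import numpy as np

def sse(X, centers):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.sum(d.min(axis=1) ** 2)

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters.
X = np.vstack([rng.normal(mu, 0.3, size=(200, 16)) for mu in (-2.0, 0.0, 2.0)])
cb = pq_train(X, M=4, K=64)                 # large K => small distortion
Xq = pq_decode(pq_encode(X, cb), cb)
c_orig, _ = kmeans(X, 3)
c_quant, _ = kmeans(Xq, 3)
print(sse(X, c_orig), sse(Xq, c_quant))     # close when distortion is small
```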
3.3. Hyperparameter Selection and Trade-Offs
3.4. Impact of Dimension Padding
3.5. Comparison with Other Dimensionality-Reduction Methods
3.6. Privacy Enhancement
- 1.
- Serial Composition: For a given dataset $D$, assume there exist randomized algorithms $A_1, A_2, \dots, A_n$ with privacy budgets $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$, respectively. The composed algorithm $A(A_1, \dots, A_n)$ provides $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-DP protection. That is, for the same dataset, applying a series of differentially private algorithms sequentially provides protection equivalent to the sum of their privacy budgets.
- 2.
- Parallel Composition: For disjoint datasets $D_1, D_2, \dots, D_n$, assume there exist randomized algorithms $A_1, A_2, \dots, A_n$ with privacy budgets $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$, respectively. The composed algorithm $A(A_1, \dots, A_n)$ provides $\left(\max_{i} \varepsilon_i\right)$-DP. That is, for disjoint datasets, applying different differentially private algorithms separately in parallel provides privacy protection equivalent to the maximum privacy budget among the composed algorithms.
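As a concrete worked example (our own illustration): releasing two statistics of the same dataset with budgets $\varepsilon_1 = 0.3$ and $\varepsilon_2 = 0.5$ yields $(0.3 + 0.5) = 0.8$-DP under serial composition, whereas running the same two mechanisms on two disjoint partitions of the data yields only $\max(0.3, 0.5) = 0.5$-DP under parallel composition.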
4. Experiments and Results
4.1. Experimental Settings
4.1.1. Dataset
4.1.2. Parameter Settings
- Total number of clients: 2;
- Total number of client features: 784;
- Client feature ratios: (1:1), (1:6), (1:13);
- PQ quantization subspace dimensions: 1, 2, 4, 8;
- Number of PQ quantization cluster centers per subspace: 10, 64, 128, 256.
4.1.3. Evaluation Metrics
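Both metrics reported in the tables below, NMI and ARI, are available in scikit-learn. A minimal sketch follows, with toy arrays standing in for the MNIST ground-truth digit labels (`labels_true`) and the federated cluster assignments (`labels_pred`), both of which are our illustrative names.

```python
# Computing NMI and ARI with scikit-learn; toy labels for illustration.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]   # stand-in for ground-truth digits
labels_pred = [1, 1, 0, 0, 2, 2]   # stand-in for federated cluster IDs

print("NMI:", normalized_mutual_info_score(labels_true, labels_pred))
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
```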
4.2. Performance Analysis
4.2.1. Comparison Between the Proposed Method and Centralized k-Means
4.2.2. Impact of MDS on Algorithm Performance
4.2.3. Impact of Client Feature Quantity on Performance
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: Cambridge, MA, USA, 2017; pp. 1273–1282. [Google Scholar]
- Stallmann, M.; Wilbik, A. Towards Federated Clustering: A Federated Fuzzy c-Means Algorithm (FFCM). arXiv 2022, arXiv:2201.07316. [Google Scholar]
- Chen, M.; Shlezinger, N.; Poor, H.V.; Eldar, Y.C.; Cui, S. Communication-efficient federated learning. Proc. Natl. Acad. Sci. USA 2021, 118, e2024789118. [Google Scholar] [CrossRef]
- Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390. [Google Scholar] [CrossRef]
- Zhou, X.; Yang, G. Communication-efficient and privacy-preserving large-scale federated learning counteracting heterogeneity. Inf. Sci. 2024, 661, 120167. [Google Scholar] [CrossRef]
- Mohammadi, N.; Bai, J.; Fan, Q.; Song, Y.; Yi, Y.; Liu, L. Differential privacy meets federated learning under communication constraints. IEEE Internet Things J. 2021, 9, 22204–22219. [Google Scholar] [CrossRef]
- Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
- Wei, K.; Li, J.; Ma, C.; Ding, M.; Wei, S.; Wu, F.; Chen, G.; Ranbaduge, T. Vertical federated learning: Challenges, methodologies and experiments. arXiv 2022, arXiv:2202.04309. [Google Scholar] [CrossRef]
- Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
- Li, J.; Wei, H.; Liu, J.; Liu, W. FSLEdge: An energy-aware edge intelligence framework based on Federated Split Learning for Industrial Internet of Things. Expert Syst. Appl. 2024, 255, 124564. [Google Scholar] [CrossRef]
- Khan, L.U.; Pandey, S.R.; Tran, N.H.; Saad, W.; Han, Z.; Nguyen, M.N.; Hong, C.S. Federated learning for edge networks: Resource optimization and incentive mechanism. IEEE Commun. Mag. 2020, 58, 88–93. [Google Scholar] [CrossRef]
- Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
- Wu, Z.; Hou, J.; He, B. Vertibench: Advancing feature distribution diversity in vertical federated learning benchmarks. arXiv 2023, arXiv:2307.02040. [Google Scholar]
- Khan, A.; ten Thij, M.; Wilbik, A. Communication-efficient vertical federated learning. Algorithms 2022, 15, 273. [Google Scholar] [CrossRef]
- Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; Papadopoulos, D.; Yang, Q. Secureboost: A lossless federated learning framework. IEEE Intell. Syst. 2021, 36, 87–98. [Google Scholar] [CrossRef]
- Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An Efficient Framework for Clustered Federated Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19586–19597. [Google Scholar] [CrossRef]
- Li, Z.; Wang, T.; Li, N. Differentially private vertical federated clustering. arXiv 2022, arXiv:2208.01700. [Google Scholar] [CrossRef]
- Zhao, F.; Li, Z.; Ren, X.; Ding, B.; Yang, S.; Li, Y. VertiMRF: Differentially Private Vertical Federated Data Synthesis. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4431–4442. [Google Scholar]
- Mazzone, F.; Brown, T.; Kerschbaum, F.; Wilson, K.H.; Everts, M.; Hahn, F.; Peter, A. Privacy-Preserving Vertical K-Means Clustering. arXiv 2025, arXiv:2504.07578. [Google Scholar]
- Li, C.; Ding, S.; Xu, X.; Guo, L.; Ding, L.; Wu, X. Vertical Federated Density Peaks Clustering under Nonlinear Mapping. IEEE Trans. Knowl. Data Eng. 2024, 37, 1004–1017. [Google Scholar] [CrossRef]
- Duan, Q.; Lu, Z. Edge Cloud Computing and Federated–Split Learning in Internet of Things. Future Internet 2024, 16, 227. [Google Scholar] [CrossRef]
- Huang, Y.; Huo, Z.; Fan, Y. DRA: A data reconstruction attack on vertical federated k-means clustering. Expert Syst. Appl. 2024, 250, 123807. [Google Scholar] [CrossRef]
- Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
- Han, J.; Kamber, M. Data Mining: Concepts and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2006. [Google Scholar]
- Mary, S.S.; Selvi, T. A study of K-means and cure clustering algorithms. Int. J. Eng. Res. Technol. 2014, 3, 1985–1987. [Google Scholar]
- Hwang, H.; Yang, S.; Kim, D.; Dua, R.; Kim, J.Y.; Yang, E.; Choi, E. Towards the practical utility of federated learning in the medical domain. In Proceedings of the Conference on Health, Inference, and Learning, Cambridge, MA, USA, 22–24 June 2023; PMLR: Cambridge, MA, USA, 2023; pp. 163–181. [Google Scholar]
- Luo, Y.; Lu, Z.; Yin, X.; Lu, S.; Weng, Y. Application research of vertical federated learning technology in banking risk control model strategy. In Proceedings of the 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Wuhan, China, 21–24 December 2023; IEEE: New York, NY, USA, 2023; pp. 545–552. [Google Scholar]
- Dwarampudi, A.; Yogi, M. Application of federated learning for smart agriculture system. Int. J. Inform. Technol. Comput. Eng. (IJITC) 2024. [Google Scholar] [CrossRef]
- Jiang, D.; Wang, H.; Li, T.; Gouda, M.A.; Zhou, B. Real-time tracker of chicken for poultry based on attention mechanism-enhanced YOLO-Chicken algorithm. Comput. Electron. Agric. 2025, 237, 110640. [Google Scholar] [CrossRef]
- Gao, Y.; Xie, Y.; Deng, H.; Zhu, Z.; Zhang, Y. A Privacy-preserving Data Alignment Framework for Vertical Federated Learning. J. Electron. Inf. Technol. 2024, 46, 3419–3427. [Google Scholar]
- Yang, L.; Chai, D.; Zhang, J.; Jin, Y.; Wang, L.; Liu, H.; Tian, H.; Xu, Q.; Chen, K. A survey on vertical federated learning: From a layered perspective. arXiv 2023, arXiv:2304.01829. [Google Scholar] [CrossRef]
- Liu, Y.; Kang, Y.; Zou, T.; Pu, Y.; He, Y.; Ye, X.; Ouyang, Y.; Zhang, Y.Q.; Yang, Q. Vertical federated learning: Concepts, advances, and challenges. IEEE Trans. Knowl. Data Eng. 2024, 36, 3615–3634. [Google Scholar] [CrossRef]
- Zhao, Z.; Mao, Y.; Liu, Y.; Song, L.; Ouyang, Y.; Chen, X.; Ding, W. Towards efficient communications in federated learning: A contemporary survey. J. Frankl. Inst. 2023, 360, 8669–8703. [Google Scholar] [CrossRef]
- Yang, H.; Liu, H.; Yuan, X.; Wu, K.; Ni, W.; Zhang, J.A.; Liu, R.P. Synergizing Intelligence and Privacy: A Review of Integrating Internet of Things, Large Language Models, and Federated Learning in Advanced Networked Systems. Appl. Sci. 2025, 15, 6587. [Google Scholar] [CrossRef]
- Zhang, C.; Li, S. State-of-the-art approaches to enhancing privacy preservation of machine learning datasets: A survey. arXiv 2024, arXiv:2404.16847. [Google Scholar]
- Qi, Z.; Meng, L.; Li, Z.; Hu, H.; Meng, X. Cross-Silo Feature Space Alignment for Federated Learning on Clients with Imbalanced Data. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
- Hu, K.; Xiang, L.; Tang, P.; Qiu, W. Feature norm regularized federated learning: Utilizing data disparities for model performance gains. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju Island, Republic of Korea, 3–9 August 2024; pp. 4136–4146. [Google Scholar]
- Aramian, A. Managing Feature Diversity: Evaluating Global Model Reliability in Federated Learning for Intrusion Detection Systems in IoT. Eng. Technol. 2024, 39. [Google Scholar]
- Johnson, A. A Survey of Recent Advances for Tackling Data Heterogeneity in Federated Learning. Preprints 2025. [Google Scholar]
- Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
- Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
- Yue, K.; Jin, R.; Wong, C.W.; Baron, D.; Dai, H. Gradient obfuscation gives a false sense of security in federated learning. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 6381–6398. [Google Scholar]
- Ge, T.; He, K.; Ke, Q.; Sun, J. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 744–755. [Google Scholar] [CrossRef]
- Xiao, S.; Liu, Z.; Shao, Y.; Lian, D.; Xie, X. Matching-oriented product quantization for ad-hoc retrieval. arXiv 2021, arXiv:2104.07858. [Google Scholar]
- Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
- Izenman, A.J. Linear Dimensionality Reduction; Springer: New York, NY, USA, 2013. [Google Scholar]
- Vadhan, S.; Zhang, W. Concurrent Composition Theorems for Differential Privacy. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, Orlando, FL, USA, 20–23 June 2023. [Google Scholar]
| Symbol | Description |
|---|---|
| $D$ | Distributed (global) dataset |
| sub_dim | Subspace data dimensionality |
| sub_k | Number of cluster centers per subspace (k-value) for the g-th client |
| $D_g$ | Dataset of the g-th client |
| $C_g$ | Codebook of the g-th client |
| $C_i^g$ | Cluster center set of the i-th subspace for the g-th client |
| $B_g$ | Data index set of the g-th client |
| $B$ | Global data index set |
(The centralized and DP-VFC baselines do not depend on the subspace parameters and are reported once.)

| Subspace Dimension | Cluster Centers per Subspace | Proposed (NMI) | Proposed (ARI) | Without MDS, Codebook Indices at Server (NMI) | Without MDS, Codebook Indices at Server (ARI) | Without MDS, Restored Data at Server (NMI) | Without MDS, Restored Data at Server (ARI) | Centralized (NMI) | Centralized (ARI) | DP-VFC, ε = 1.0 (NMI) | DP-VFC, ε = 1.0 (ARI) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 0.53403 | 0.42542 | 0.50209 | 0.41110 | 0.51871 | 0.40490 | 0.49581 | 0.36387 | 0.2703 | 0.1456 |
| 2 | 10 | 0.49353 | 0.37211 | 0.47981 | 0.38815 | 0.49406 | 0.38168 | | | | |
| 4 | 10 | 0.51018 | 0.38628 | 0.44081 | 0.33740 | 0.52128 | 0.40873 | | | | |
| 8 | 10 | 0.52476 | 0.43289 | 0.42186 | 0.31696 | 0.52207 | 0.40520 | | | | |
| 1 | 64 | 0.49707 | 0.36712 | 0.48437 | 0.37222 | 0.51585 | 0.40136 | | | | |
| 2 | 64 | 0.53773 | 0.42763 | 0.44766 | 0.33290 | 0.49617 | 0.36369 | | | | |
| 4 | 64 | 0.50124 | 0.37786 | 0.42137 | 0.32132 | 0.49500 | 0.38318 | | | | |
| 8 | 64 | 0.48329 | 0.39052 | 0.41128 | 0.34143 | 0.49390 | 0.36302 | | | | |
| 1 | 128 | 0.48375 | 0.36266 | 0.50666 | 0.41889 | 0.49081 | 0.36068 | | | | |
| 2 | 128 | 0.49855 | 0.38596 | 0.46257 | 0.35915 | 0.49043 | 0.36039 | | | | |
| 4 | 128 | 0.48026 | 0.35906 | 0.42605 | 0.34344 | 0.49403 | 0.38208 | | | | |
| 8 | 128 | 0.47704 | 0.37444 | 0.40413 | 0.31093 | 0.48159 | 0.37014 | | | | |
| 1 | 256 | 0.52029 | 0.40768 | 0.49181 | 0.40905 | 0.49049 | 0.36069 | | | | |
| 2 | 256 | 0.48873 | 0.36059 | 0.49902 | 0.39875 | 0.48150 | 0.35965 | | | | |
| 4 | 256 | 0.49801 | 0.39486 | 0.43289 | 0.35194 | 0.49628 | 0.36436 | | | | |
| 8 | 256 | 0.46309 | 0.36877 | 0.41015 | 0.34877 | 0.48375 | 0.36400 | | | | |
| Number of Clients | Feature Ratio Among Clients | Subspace Dimension | Cluster Centers per Subspace | Proposed (NMI) | Proposed (ARI) | MDS Stress | DP-VFC (NMI) | DP-VFC (ARI) |
|---|---|---|---|---|---|---|---|---|
| 2 | 1:1 | 1 | 10 | 0.53403 | 0.42542 | 0.82306 | 0.2703 | 0.1456 |
| | | 2 | 10 | 0.49353 | 0.37211 | 4.44490 | | |
| | | 4 | 10 | 0.51018 | 0.38628 | 10.73975 | | |
| | | 8 | 10 | 0.52476 | 0.43289 | 18.34833 | | |
| | 1:6 | 1 | 10 | 0.51950 | 0.40631 | 0.79459 | 0.1973 | 0.0747 |
| | | 2 | 10 | 0.51160 | 0.39005 | 4.41391 | | |
| | | 4 | 10 | 0.51162 | 0.38794 | 10.97604 | | |
| | | 8 | 10 | 0.52514 | 0.42123 | 18.58371 | | |
| | 1:13 | 1 | 10 | 0.49672 | 0.36139 | 0.77438 | 0.2351 | 0.1418 |
| | | 2 | 10 | 0.49207 | 0.36924 | 4.34313 | | |
| | | 4 | 10 | 0.51211 | 0.38790 | 10.96123 | | |
| | | 8 | 10 | 0.53030 | 0.44559 | 18.55026 | | |