Extension Distance-Driven K-Means: A Novel Clustering Framework for Fan-Shaped Data Distributions
Abstract
1. Introduction
2. Extension Distance
2.1. One-Dimensional Extension Distance
2.2. Limitation in High Dimensions
3. Extension Distance in Two-Dimensional Space
3.1. Straight Line Traversal Method
3.2. Properties and Verification
- When point P lies outside set S, any line passing through both P and the set intersects it in a segment whose projections onto the x and y axes form intervals X and Y with x_P ∉ X and y_P ∉ Y. For any such line, the one-dimensional extension distances ρ(x_P, X) and ρ(y_P, Y) are therefore both greater than 0, and consequently ρ(P, S) is greater than 0.
- When point P is on the boundary of set S, for any line passing through the set, x_P coincides with an endpoint of X or y_P coincides with an endpoint of Y; hence ρ(x_P, X) and ρ(y_P, Y) are both equal to 0, so ρ(P, S) is equal to 0.
- When point P lies inside set S, for any line passing through the set, x_P ∈ X and y_P ∈ Y, so ρ(x_P, X) and ρ(y_P, Y) are both less than 0, and it can be concluded that ρ(P, S) is less than 0 (see the sketch below).
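These three sign cases are exactly those of the classical one-dimensional extension distance applied to each projected interval. For reference, here is a minimal Python sketch of that one-dimensional distance, following Cai's definition [16]; the function name is ours, chosen for illustration:

```python
def ext_dist_1d(x, a, b):
    """One-dimensional extension distance of x to the interval <a, b>,
    following Cai's classical definition [16]:
    rho(x, X) = |x - (a + b)/2| - (b - a)/2."""
    return abs(x - (a + b) / 2.0) - (b - a) / 2.0

# Sign behaviour matches the three cases above:
print(ext_dist_1d(5.0, 1.0, 3.0))  #  2.0 -> point outside the interval
print(ext_dist_1d(3.0, 1.0, 3.0))  #  0.0 -> point on the boundary
print(ext_dist_1d(2.0, 1.0, 3.0))  # -1.0 -> point inside the interval
```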
3.3. Fixed-Angle Traversal Method
3.4. Verification and Set Intersection
- In Scenario 1, only one point of set S falls within the fan-shaped range. The projected interval X = ⟨a, b⟩ is then a single point with a = b, so we have ρ(x_P, X) = |x_P − (a + b)/2| − (b − a)/2 = |x_P − a|.
- In Scenario 2, two or more points in the sector belong to set S and none of the sector boundaries is vertical or horizontal. According to Figure 1, the projection onto each axis is then a proper interval with a < b, so the one-dimensional extension distance on each axis follows directly from the definition.
- In Scenario 3, two or more points within the sector belong to set S and one sector edge is horizontal or vertical. In this case, projecting the obtained points onto the x or y axis results in a single point rather than an interval, so a = b on that axis. According to Equation (3), analyzing the case where a = b again yields the degenerate form ρ(x_P, X) = |x_P − a| (a sketch of the traversal follows this list).
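For concreteness, the sketch below organizes such a fixed-angle sweep in Python: sectors of width α are swept around the query point, the member points of S falling in each sector are projected onto the x and y axes, and the two one-dimensional extension distances are accumulated. Aggregating sectors by simple summation, like the function names, is our illustrative assumption; the paper's exact combination rule is fixed by Equations (1)–(3).

```python
import numpy as np

def ext_dist_1d(x, lo, hi):
    """Classical one-dimensional extension distance to <lo, hi> [16];
    degenerates to |x - lo| when lo == hi (Scenarios 1 and 3)."""
    return abs(x - (lo + hi) / 2.0) - (hi - lo) / 2.0

def fixed_angle_ext_dist(p, S, alpha_deg=30.0):
    """Illustrative fixed-angle traversal around query point p.

    S is an (m, 2) array of member points of the set. Sectors of width
    alpha_deg are swept around p; in each non-empty sector the member
    points are projected onto the x and y axes and the two 1-D extension
    distances are summed. Summing across sectors is an assumption made
    for this sketch, not the paper's exact combination rule.
    """
    angles = np.degrees(np.arctan2(S[:, 1] - p[1], S[:, 0] - p[0])) % 360.0
    total = 0.0
    for start in np.arange(0.0, 360.0, alpha_deg):
        pts = S[(angles >= start) & (angles < start + alpha_deg)]
        if len(pts) == 0:
            continue  # empty sectors contribute nothing
        total += ext_dist_1d(p[0], pts[:, 0].min(), pts[:, 0].max())
        total += ext_dist_1d(p[1], pts[:, 1].min(), pts[:, 1].max())
    return total
```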
4. K-Means Variant Based on Extension Distance
4.1. Limitations of Standard K-Means
4.2. Proposed Algorithm Framework
Algorithm 1: Extension-Distance-Driven K-means Clustering

1: Input: dataset S, number of clusters n, scanning angle α
2: Calculate the distance extrema min(D), max(D)
3: Calculate …, …, … following [19]
4: for i in … do
5:  if i > … then
6:   Set the point corresponding to i as centroid c1
7:   break
8:  end if
9: end for
10: repeat
11:  Ptemp = randomly select from …
12:  ctemp = the point corresponding to Ptemp
13:  if all … > … then
14:   Set ctemp as a new centroid
15:  end if
16: until the number of centres meets the requirement
17: repeat
17.1:  Precompute the angular relation matrix A (Equation (1)) storing the relative angles between all points
17.2:  Use A to accelerate the fixed-angle traversal and avoid redundant calculations
18:  for i_sample in S do
19:   for i_plane in [plane_1 … plane_e] do
20:    for i_cluster in all clusters do
21:     Calculate P(i_plane, i_cluster)
22:    end for
23:   end for
24:   P(i_cluster) = Σ over planes of P(i_plane, i_cluster)
25:   Assign i_sample to the i_cluster with min(P(i_cluster))
26:  end for
27:  For each cluster, set the new centre to the sample with min(P(i_cluster))
28: until the centre points remain stable
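The assignment phase (steps 18 to 27) dominates the runtime. The following sketch mirrors that triple loop in Python, assuming a pluggable two-dimensional extension-distance routine such as the fixed-angle traversal sketched in Section 3.3; clusters are passed as point sets because the extension distance is defined between a point and a set, and all identifiers here are illustrative:

```python
import itertools
import numpy as np

def assign_samples(X, clusters, ext_dist_2d, alpha=30.0):
    """One assignment pass of Algorithm 1 (steps 18-27).

    X        : (n, d) data matrix
    clusters : list of (m_c, d) arrays, the current member points of each cluster
    ext_dist_2d(p2, S2, alpha) : a 2-D extension-distance routine,
                                 e.g. the fixed-angle traversal above
    """
    n, d = X.shape
    planes = list(itertools.combinations(range(d), 2))  # all 2-D feature planes
    labels = np.empty(n, dtype=int)
    for i in range(n):
        totals = np.zeros(len(clusters))
        for f1, f2 in planes:                       # loop over feature planes
            for c, members in enumerate(clusters):  # loop over clusters
                totals[c] += ext_dist_2d(X[i, [f1, f2]], members[:, [f1, f2]], alpha)
        labels[i] = int(np.argmin(totals))          # smallest summed distance wins
    return labels
```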
4.3. Angle Relation Matrix
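Steps 17.1 and 17.2 of Algorithm 1 precompute the relative angles between all pairs of points so the fixed-angle traversal can reuse them across iterations instead of recomputing arctangents. The sketch below shows one plausible form of such a matrix; Equation (1)'s exact definition may differ, and the names are illustrative:

```python
import numpy as np

def angle_relation_matrix(X):
    """Pairwise relative angles between all points (step 17.1).

    A[i, j] is the polar angle of the vector from point i to point j,
    measured on the first two coordinates; one reading of what
    Equation (1) stores, not necessarily the paper's exact form.
    """
    dx = X[:, 0][None, :] - X[:, 0][:, None]
    dy = X[:, 1][None, :] - X[:, 1][:, None]
    A = np.degrees(np.arctan2(dy, dx)) % 360.0  # angles in [0, 360)
    np.fill_diagonal(A, 0.0)                    # a point has no angle to itself
    return A
```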
4.4. Complexity Analysis
- The algorithm processes each data point (n points in total), each cluster (k clusters, where k is the number of clusters), and each feature plane. With d features, the number of two-dimensional feature planes is d(d − 1)/2.
- For each such combination, the fixed-angle traversal method is applied with a number of angles a (determined by the input angle α, e.g., a ≈ 360°/α). The traversal involves O(a) operations per combination.
- Thus, the per-iteration complexity is O(n · k · a · d(d − 1)/2). For example, n = 1000 points, k = 3 clusters, d = 4 features (six planes), and a = 36 angles give on the order of 6.5 × 10^5 operations per iteration.
5. Algorithm Comparison Experiment
5.1. Evaluation Metrics
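The experiments report two external metrics, ARI [22] and NMI [23], which compare predicted labels against ground truth, and two internal metrics, the Silhouette score [21] and the Davies–Bouldin index (DBI) [20], which use only the cluster geometry. A small helper using scikit-learn's standard implementations suffices to reproduce the columns of the results tables:

```python
from sklearn import metrics

def evaluate_clustering(X, labels_true, labels_pred):
    """Compute the four metrics reported in the results tables."""
    return {
        "ARI": metrics.adjusted_rand_score(labels_true, labels_pred),
        "NMI": metrics.normalized_mutual_info_score(labels_true, labels_pred),
        "Silhouette Score": metrics.silhouette_score(X, labels_pred),
        "DBI": metrics.davies_bouldin_score(X, labels_pred),
    }
```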
5.2. Datasets and Experimental Setup
5.3. Clustering Results Visualization
5.4. Quantitative Results Analysis
6. Discussion
7. Conclusions
- The proposed algorithm significantly outperforms conventional methods (e.g., K-means++, GMM) in external metrics such as ARI and NMI, highlighting its robustness for fan-shaped distributions.
- The two-dimensional extension distance framework effectively handles inter-feature correlations, overcoming the high-dimensional limitations of one-dimensional approaches.
- However, challenges remain in optimizing the scanning angle parameter and extending the method to non-convex sets. Future work will focus on adaptive angle selection and applications to multi-modal datasets. Overall, this research contributes a scalable and interpretable clustering framework, with implications for fields such as image segmentation and anomaly detection.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yuan, C.; Yang, H. Research on K-value selection method of K-means clustering algorithm. J 2019, 2, 226–235.
- Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High-Dimensional Space. In Proceedings of the 8th International Conference on Database Theory (ICDT 2001), London, UK, 4–6 January 2001; Van den Bussche, J., Vianu, V., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2001; Volume 1973, pp. 420–434.
- Von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416.
- Ding, C.; He, X. Cluster Structure of K-means Clustering via Principal Component Analysis. In Advances in Knowledge Discovery and Data Mining; Dai, H., Srikant, R., Zhang, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2004; Volume 3056, p. 29.
- Xu, Q.; Ding, C.; Liu, J.; Luo, B. PCA-guided search for K-means. Pattern Recognit. Lett. 2015, 54, 50–55.
- Feldman, D.; Schmidt, M.; Sohler, C. Turning Big Data into Tiny Data: Constant-Size Coresets for K-Means, PCA and Projective Clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '13), New Orleans, LA, USA, 6–8 January 2013; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2013; pp. 1434–1453.
- Suwanda, R.; Syahputra, Z.; Zamzami, E.M. Analysis of Euclidean Distance and Manhattan Distance in the K-Means Algorithm for Variations Number of Centroid K. J. Phys. Conf. Ser. 2020, 1566, 012058.
- Wu, Z.; Song, T.; Zhang, Y. Quantum k-means algorithm based on Manhattan distance. Quantum Inf. Process. 2022, 21, 19.
- Singh, A.; Yadav, A.; Rana, A. K-means with Three different Distance Metrics. Int. J. Comput. Appl. 2013, 67, 13–17.
- Faisal, M.; Zamzami, E.M.; Sutarman. Comparative Analysis of Inter-Centroid K-Means Performance using Euclidean Distance, Canberra Distance and Manhattan Distance. J. Phys. Conf. Ser. 2020, 1566, 012112.
- Chen, L.; Roe, D.R.; Kochert, M.; Simmerling, C.; Miranda-Quintana, R.A. k-Means NANI: An Improved Clustering Algorithm for Molecular Dynamics Simulations. J. Chem. Theory Comput. 2024, 20, 5583–5597.
- Premkumar, M.; Sinha, G.; Ramkumar, M.D.; Ramakurthi, V. Augmented weighted K-means grey wolf optimizer: An enhanced metaheuristic algorithm for data clustering problems. Sci. Rep. 2024, 14, 5434.
- Huang, W.; Peng, Y.; Ge, Y.; Kong, W. A new Kmeans clustering model and its generalization achieved by joint spectral embedding and rotation. PeerJ Comput. Sci. 2021, 7, 450.
- Yang, Y.; Zhu, Z. A Fast and Efficient Grid-Based K-means++ Clustering Algorithm for Large-Scale Datasets. In Proceedings of the Fifth Euro-China Conference on Intelligent Data Analysis and Applications (ECC 2018), Xi'an, China, 12–14 October 2018; Volume 891, pp. 485–495.
- Moghaddam, S.S.; Ghasemi, M. Efficient Clustering for Multicast Device-to-Device Communications. In Proceedings of the 7th International Conference on Computer and Communication Engineering (ICCCE 2018), Kuala Lumpur, Malaysia, 19–20 September 2018; pp. 228–233.
- Cai, W. Extension theory and its application. Chin. Sci. Bull. 1999, 44, 1538–1548.
- Qin, Y.; Li, X. A method for calculating two-dimensional spatially extension distances and its clustering algorithm. Procedia Comput. Sci. 2023, 221, 1187–1193.
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
- Zhao, Y.; Zhu, F.; Gui, F.; Ren, S.; Xie, Z.; Xu, C. Improved k-means algorithm based on extension distance. CAAI Trans. Intell. Syst. 2020, 15, 344–351.
- Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227.
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
- Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
- Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617.
| Positional Relationship Between Point and Interval or Set | Extension Distance Between Point and Interval | Extension Distance Between Point and Two-Dimensional Plane Set |
|---|---|---|
| Point outside the interval or set | ρ(x, X) > 0 | ρ(P, S) > 0 |
| Point on the edge of the interval or set | ρ(x, X) = 0 | ρ(P, S) = 0 |
| Point inside the interval or set | ρ(x, X) < 0 | ρ(P, S) < 0 |
| Algorithm | ARI | NMI | Silhouette Score | DBI |
|---|---|---|---|---|
| Algorithm before improvement | 0.304 | 0.480 | 0.329 | 0.904 |
| This article's algorithm | 0.480 | 0.597 | 0.259 | 0.974 |
| K-means++ | 0.289 | 0.485 | 0.388 | 0.794 |
| GMM | 0.383 | 0.526 | 0.346 | 0.860 |
| Agglomerative | 0.305 | 0.478 | 0.330 | 0.854 |
| Algorithm | ARI | NMI | Silhouette Score | DBI |
|---|---|---|---|---|
| Algorithm before improvement | 0.328 | 0.529 | 0.354 | 0.855 |
| This article's algorithm | 0.658 | 0.736 | 0.367 | 0.927 |
| K-means++ | 0.378 | 0.604 | 0.473 | 0.732 |
| GMM | 0.389 | 0.610 | 0.471 | 0.732 |
| Agglomerative | 0.395 | 0.617 | 0.453 | 0.760 |