Reduced Clustering Method Based on the Inversion Formula Density Estimation
Abstract
1. Introduction
2. Clustering Based on the Density of the Modified Inversion Formula
Algorithm 1: CBMIDE clustering algorithm
Input: data set X = [X1, X2, …, Xn]; number of clusters K
Output: clusters C1, C2, …, Ct
Initialize the mean vectors using the k-means method.
Generate the matrix T, whose directions are evenly spaced on the sphere.
1 For i = 1 : t do
2   Estimate the density of each point for each cluster with formula (8); update the parameter values.
3 End
4 Return C1, C2, …, Ct
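The loop in Algorithm 1 can be sketched as below. This is an illustrative reading only, not the authors' implementation: formula (8) is not reproduced in this excerpt, so a plain one-dimensional Gaussian kernel density along each projection direction stands in for the modified inversion-formula estimator, random unit vectors stand in for the evenly spaced sphere directions, and nearest-random-center assignment stands in for the k-means initialization.

```python
import numpy as np

def unit_directions(n_dir, d, seed=0):
    """Random unit vectors; a stand-in for the evenly spaced directions in T."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n_dir, d))
    return T / np.linalg.norm(T, axis=1, keepdims=True)

def projected_density(x, cluster_pts, T, h):
    """Average 1-D Gaussian KDE of x over the cluster's projections onto each
    direction in T (a stand-in for the paper's formula (8))."""
    dens = 0.0
    for tau in T:
        proj = cluster_pts @ tau                  # project cluster onto direction
        u = (x @ tau - proj) / h
        dens += np.exp(-0.5 * u**2).sum() / (len(proj) * h * np.sqrt(2 * np.pi))
    return dens / len(T)

def density_clustering(X, K, n_dir=10, h=0.5, iters=5, seed=0):
    """Iteratively reassign each point to the cluster under which its
    projection-based density estimate is highest."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # crude init (paper uses k-means)
    labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
    T = unit_directions(n_dir, X.shape[1], seed)
    for _ in range(iters):
        dens = np.array([[projected_density(x, X[labels == k], T, h)
                          if np.any(labels == k) else 0.0
                          for k in range(K)] for x in X])
        labels = dens.argmax(axis=1)              # reassign by maximum density
    return labels
```

The key design point the sketch preserves is that density is estimated per direction on one-dimensional projections and then aggregated, rather than in the full d-dimensional space.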
Algorithm 2: RCBMIDE clustering algorithm
Input: data set X = [X1, X2, …, Xn]; number of clusters K; smoothness parameter h; percentage ratio of outliers p0; number of iterations t; number of dimensions d; data-reduction method
Output: clusters C1, C2, …, Ct
Input data reduction:
For j = 2 : min(d, 15) do
  Reduce the data to j dimensions with the dimensionality-reduction method.
  Calculate the trustworthiness of the reduced data with (13).
End
Choose the reduced data with the highest trustworthiness.
Initialize the mean vectors using the k-means and k-means++ methods.
Generate the matrix T, whose directions are evenly spaced on the sphere.
1 For i = 1 : t do
2   Estimate the density of each point for each cluster; update the parameter values.
3 End
4 Return C1, C2, …, Ct
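The dimension-selection step of Algorithm 2 (try candidate target dimensions, score each embedding by trustworthiness, keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions: PCA stands in for the paper's full set of reduction methods (UMAP, TriMap, LLE, MDS, etc.), and scikit-learn's `trustworthiness` function is used in place of the paper's Equation (13).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

def best_reduction(X, max_dim=15, n_neighbors=5):
    """Embed X into each candidate dimension j = 2 .. min(d, max_dim) - 1 and
    return the embedding with the highest trustworthiness score."""
    best_emb, best_score, best_dim = None, -np.inf, None
    for j in range(2, min(X.shape[1], max_dim)):
        Xj = PCA(n_components=j).fit_transform(X)      # stand-in reduction method
        score = trustworthiness(X, Xj, n_neighbors=n_neighbors)
        if score > best_score:
            best_emb, best_score, best_dim = Xj, score, j
    return best_emb, best_score, best_dim

# Usage on a small built-in data set (d = 4, so candidates are j = 2 and j = 3):
X = load_iris().data
emb, score, dim = best_reduction(X)
```

Trustworthiness lies in [0, 1] and penalizes embeddings that pull points into a neighborhood they did not belong to in the original space, which is why it is a natural criterion for choosing among candidate reduced dimensions before clustering.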
3. Materials and Methods
3.1. Clustering Evaluation Metrics
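The metric definitions themselves are not reproduced in this excerpt. As one illustration of how accuracy-style scores for unsupervised clusterings (such as those reported in the tables below) are commonly computed, the sketch here matches predicted cluster ids to true class labels with the Hungarian algorithm before counting agreements; the paper's exact metrics may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching arbitrary cluster ids to true labels.
    Builds a contingency table and solves the assignment that maximizes the
    number of correctly matched points."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    C = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            C[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-C)   # negate to maximize matched counts
    return C[row, col].sum() / len(y_true)
```

Matching first is essential because cluster ids are arbitrary: a clustering that perfectly separates the classes but swaps the labels should still score 1.0.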
3.2. Research Datasets
3.3. Experimental Setup
4. Results
4.1. Performance of Reduced Clustering Based on the Modified Inversion Density Estimation for Lower-Dimensional Datasets
4.2. Performance of Reduced Clustering Based on the Modified Inversion Density Estimation for Higher-Dimensional Datasets
5. Discussion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Dataset | RCBMIDE: Best Reduction Method (Dimensions) |
---|---|
arrhythmia | TriMap (5) |
breast | UMAP (3) |
Coil20 | LLE (3) |
dermatology | NMF (2) |
german | MDS (3) |
heart-statlog | CosinePCA (4) |
iono | LLE (2) |
segment | TriMap (5) |
spambase | PCA (10) |
wdbc | MDS (5) |
Dataset | RCBMIDE: Best Reduction Method (Dimensions) |
---|---|
balance-scale | UMAP (2) |
atom | UMAP (2) |
cpu | TriMap (5) |
diabetes | TriMap (5) |
ecoli | UMAP (4) |
glass | LLE (5) |
Haberman | TSVD (2) |
iris | UMAP (6) |
pmf | LLE (2) |
thyroid | TSVD (4) |
Wine | PCA (4) |
References
ID | Data Sets | Sample Size (N) | Dimensions (D) | Classes |
---|---|---|---|---|
1 | Balance-scale | 625 | 4 | 3 |
2 | Arrhythmia | 452 | 262 | 13 |
3 | atom | 800 | 3 | 2 |
4 | Breast | 570 | 30 | 2 |
5 | Coil20 | 1440 | 1024 | 20 |
6 | CPU | 209 | 6 | 4 |
7 | Dermatology | 366 | 17 | 6 |
8 | Diabetes | 442 | 10 | 4 |
9 | Ecoli | 336 | 7 | 8 |
10 | German | 1000 | 60 | 2 |
11 | Glass | 214 | 9 | 6 |
12 | Haberman | 306 | 3 | 2 |
13 | Heart-statlog | 270 | 13 | 2 |
14 | Iono | 351 | 34 | 2 |
15 | Iris | 150 | 4 | 3 |
16 | pmf | 649 | 3 | 5 |
17 | segment | 2310 | 19 | 7 |
18 | spambase | 4601 | 57 | 2 |
19 | Thyroid | 215 | 5 | 3 |
20 | wdbc | 569 | 30 | 2 |
21 | Wine | 178 | 13 | 3 |
Dataset | Agg | BIRCH | GMM | BGMM | DBSCAN | K-Means | HDBSCAN | CBMIDE | RCBMIDE |
---|---|---|---|---|---|---|---|---|---|
balance-scale | 0.624 | 0.658 | 0.568 | 0.586 | 0.464 | 0.603 | 0.597 | 0.576 | 0.693 |
atom | 1.000 | 0.868 | 0.883 | 0.960 | 1.000 | 0.719 | 1.000 | 0.891 | 1.000 |
cpu | 0.823 | 0.828 | 0.746 | 0.641 | 0.833 | 0.761 | 0.813 | 0.815 | 0.858 |
diabetes | 0.507 | 0.514 | 0.459 | 0.455 | 0.482 | 0.428 | 0.477 | 0.502 | 0.512 |
ecoli | 0.804 | 0.845 | 0.762 | 0.747 | 0.682 | 0.688 | 0.646 | 0.754 | 0.817 |
glass | 0.514 | 0.565 | 0.509 | 0.528 | 0.514 | 0.547 | 0.528 | 0.527 | 0.607 |
Haberman | 0.748 | 0.761 | 0.667 | 0.716 | 0.758 | 0.748 | 0.739 | 0.735 | 0.742 |
iris | 0.967 | 0.973 | 0.967 | 0.893 | 0.940 | 0.967 | 0.700 | 0.975 | 0.983 |
pmf | 0.977 | 0.977 | 0.920 | 0.977 | 0.983 | 0.844 | 0.978 | 0.934 | 0.981 |
thyroid | 0.930 | 0.949 | 0.963 | 0.949 | 0.874 | 0.944 | 0.823 | 0.778 | 0.834 |
Wine | 0.978 | 0.994 | 0.972 | 0.983 | 0.949 | 0.978 | 0.876 | 0.953 | 0.963 |
Dataset | Agg | BIRCH | GMM | BGMM | DBSCAN | K-Means | HDBSCAN | RCBMIDE |
---|---|---|---|---|---|---|---|---|
arrhythmia | 0.600 | 0.571 | 0.485 | 0.431 | 0.573 | 0.438 | 0.582 | 0.597 |
Breast | 0.942 | 0.954 | 0.951 | 0.953 | 0.903 | 0.928 | 0.743 | 0.909 |
Coil20 | 0.738 | 0.738 | 0.638 | 0.675 | 0.867 | 0.733 | 0.884 | 0.882 |
dermatology | 0.956 | 0.978 | 0.910 | 0.855 | 0.694 | 0.962 | 0.809 | 0.867 |
german | 0.705 | 0.713 | 0.696 | 0.638 | 0.712 | 0.673 | 0.704 | 0.716 |
heart-statlog | 0.807 | 0.811 | 0.819 | 0.826 | 0.815 | 0.848 | 0.626 | 0.831 |
iono | 0.729 | 0.795 | 0.849 | 0.809 | 0.929 | 0.712 | 0.906 | 0.899 |
segment | 0.708 | 0.732 | 0.631 | 0.612 | 0.529 | 0.665 | 0.530 | 0.789 |
spambase | 0.878 | 0.918 | 0.856 | 0.857 | 0.693 | 0.854 | 0.690 | 0.905 |
wdbc | 0.942 | 0.954 | 0.951 | 0.958 | 0.903 | 0.928 | 0.743 | 0.951 |
Lukauskas, M.; Ruzgas, T. Reduced Clustering Method Based on the Inversion Formula Density Estimation. Mathematics 2023, 11, 661. https://doi.org/10.3390/math11030661