# Reduced Clustering Method Based on the Inversion Formula Density Estimation


## Abstract


## 1. Introduction

## 2. Clustering Based on the Density of the Modified Inversion Formula

A random vector $X \in \mathbf{R}^d$ satisfies the distribution mixture model if the distribution density $f\left(x\right)$ satisfies Equation (1):

$$f(x) = \sum_{k=1}^{q} p_k f_k(x). \tag{1}$$

The estimates $\widehat{f}_k$ are obtained by replacing the unknown parameters with their statistical estimates. Here $p_1, \dots, p_q$ are the a priori probabilities of the mixture components. In classification theory, $v$ is interpreted as the number of the class to which the observed object belongs. Thus, observations $X(t)$ would correspond to $v(t)$, $t = 1, \dots, n$. The functions $f_k$ are treated as the conditional distribution density of $X$ under the condition $v = k$. According to this approach, loose (soft) clustering of the sample is understood as the posterior probabilities

$$\pi_k(x) = \mathbf{P}\{v = k \mid X = x\}.$$
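The posterior-probability rule above can be sketched numerically. The snippet below is a minimal illustration that assumes isotropic Gaussian component densities as stand-ins for the $f_k$; the function names and toy parameters are illustrative, not the paper's estimator.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Isotropic Gaussian density in R^d -- an illustrative stand-in
    # for the component densities f_k of the mixture model.
    d = x.shape[-1]
    diff = x - mean
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / var) / ((2 * np.pi * var) ** (d / 2))

def posterior_probabilities(X, means, variances, priors):
    # pi_k(x) = p_k f_k(x) / sum_j p_j f_j(x)  (Bayes' rule for the mixture)
    weighted = np.stack(
        [p * gaussian_pdf(X, m, v) for m, v, p in zip(means, variances, priors)],
        axis=1,
    )
    return weighted / weighted.sum(axis=1, keepdims=True)

# Two well-separated toy components: each point is assigned almost
# entirely to the nearby component.
X = np.array([[0.0, 0.0], [5.0, 5.0]])
pi = posterior_probabilities(X, [np.zeros(2), np.full(2, 5.0)], [1.0, 1.0], [0.5, 0.5])
```

Each row of `pi` sums to one, so the soft assignments can be read directly as class membership probabilities.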

**Algorithm 1: CBMIDE clustering algorithm**

**Input:** data set $X = [X_1, X_2, \dots, X_n]$; cluster number $K$.

**Output:** clusters $C_1, C_2, \dots, C_K$ and $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$.

Initialize the mean vectors using the k-means method. Generate the matrix $T$, whose design directions are evenly spaced on the sphere.

1. **for** $i = 1:t$ **do**
2. Estimate the density of each point for each cluster with formula (8); update $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$.
3. **end**
4. **return** $C_1, C_2, \dots, C_K$ and $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$.
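The only geometric ingredient in the initialization above is the matrix $T$ of directions evenly spaced on the sphere. One common way to generate such directions in $\mathbf{R}^3$ is a Fibonacci (golden-angle) lattice; this is a generic construction sketched under that assumption, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def fibonacci_sphere_directions(n):
    # Places n roughly evenly spaced unit vectors on the sphere in R^3
    # using the golden-angle (Fibonacci lattice) construction.
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0))   # golden angle in radians
    z = 1 - 2 * (i + 0.5) / n            # z-coordinates evenly spread in (-1, 1)
    r = np.sqrt(1 - z**2)                # radius of each horizontal circle
    theta = phi * i                      # azimuthal angle advances by the golden angle
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

T = fibonacci_sphere_directions(100)
# Every row of T is a unit vector, so T can serve as a design-direction matrix.
```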

If the observed variables are $X_1, \dots, X_k$ and we seek to determine the factors that describe these variables, the number of which is $m$, then the mathematical model of factor analysis can be summarized as

$$X_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + \dots + \lambda_{im} F_m + e_i, \quad i = 1, \dots, k,$$

where $\lambda_{ij}$ are the factor loadings, $F_j$ are the common factors, and $e_i$ are the unique (error) terms.
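A minimal numerical check of this model: simulating data from chosen loadings and verifying that the covariance of $X$ is approximately $\Lambda\Lambda^{\top} + \Psi$. All sizes and parameter values below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 6, 2, 20000              # observed variables, factors, samples (illustrative)
Lam = rng.normal(size=(k, m))      # factor loadings Lambda
F = rng.normal(size=(n, m))        # latent common factors
E = 0.1 * rng.normal(size=(n, k))  # unique errors, each with variance 0.01
X = F @ Lam.T + E                  # the factor model X_i = sum_j lambda_ij F_j + e_i

# Implied covariance structure: Lambda Lambda^T + Psi
cov_model = Lam @ Lam.T + 0.01 * np.eye(k)
```

With a large sample, the empirical covariance `np.cov(X.T)` matches `cov_model` closely, which is exactly the decomposition factor analysis exploits for dimensionality reduction.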

**Algorithm 2: RCBMIDE clustering algorithm**

**Input:** data set $X = [X_1, X_2, \dots, X_n]$; cluster number $K$; smoothness parameter $h$; percentage ratio of outliers $p_0$; number of iterations $t$; number of dimensions $d$; data reduction method.

**Output:** clusters $C_1, C_2, \dots, C_K$ and $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$.

Input data reduction: **for** $j = 2:d$, $d < 15$ **do** reduce the dimension to $j$ with the dimensionality reduction method and calculate the trustworthiness of the reduced data with (13). Choose the best reduced-dimension data based on trustworthiness. Initialize the mean vectors using the k-means and k-means++ methods. Generate the matrix $T$, whose directions are evenly spaced on the sphere.

1. **for** $i = 1:t$ **do**
2. Estimate the density of each point for each cluster; update the $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$ values.
3. **end**
4. **return** $C_1, C_2, \dots, C_K$ and $\widehat{M}$, ${\widehat{p}}_{k}$, $\widehat{R}$.
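The dimension-selection step in Algorithm 2 ranks candidate embeddings by trustworthiness. The sketch below assumes the standard Venna–Kaski definition of $T(k)$ (it penalizes points that are among the $k$ nearest neighbours in the embedding but not in the original space) rather than reproducing formula (13) verbatim.

```python
import numpy as np

def trustworthiness(X, X_emb, k=5):
    # Trustworthiness T(k): 1 minus a normalized penalty over "false
    # neighbours" -- points in the embedding's k-NN set whose rank in the
    # original space exceeds k.
    n = X.shape[0]

    def ranks(D):
        # rank[i, j] = neighbour rank of j for point i (self has rank 0).
        order = np.argsort(D, axis=1)
        r = np.empty_like(order)
        r[np.arange(n)[:, None], order] = np.arange(n)[None, :]
        return r

    D_hi = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    D_lo = np.linalg.norm(X_emb[:, None] - X_emb[None, :], axis=2)
    rank_hi, rank_lo = ranks(D_hi), ranks(D_lo)

    penalty = 0.0
    for i in range(n):
        # k nearest neighbours of i in the embedding (excluding i itself)
        nn_lo = np.where((rank_lo[i] >= 1) & (rank_lo[i] <= k))[0]
        for j in nn_lo:
            if rank_hi[i, j] > k:           # j is a false neighbour
                penalty += rank_hi[i, j] - k
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

A perfect (identity) embedding scores exactly 1, while a random embedding scores lower, which is what makes the measure usable for choosing among candidate reduced dimensions.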

## 3. Materials and Methods

#### 3.1. Clustering Evaluation Metrics

Clustering accuracy is computed as $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{k} n_i$, where $N$ is the sample size, $n_i$ is the number of data points correctly divided into the corresponding cluster $i$, and $k$ is the cluster number.
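Read this way, the metric can be computed with a simple majority-vote matching between predicted clusters and true classes; whether the paper uses majority voting or an optimal one-to-one assignment is an assumption made here.

```python
import numpy as np

def clustering_accuracy(labels_true, labels_pred):
    # Majority-vote accuracy: match each predicted cluster to the true class
    # most common within it, then return sum_i n_i / N.
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        # n_i: points in cluster c that belong to its majority class
        total += np.bincount(members).max()
    return total / labels_true.size
```

For example, a clustering that merely swaps the two label names still scores 1.0, since the metric is invariant to the arbitrary numbering of clusters.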

#### 3.2. Research Datasets

#### 3.3. Experimental Setup

## 4. Results

#### 4.1. Performances of Reduced Clustering Based on the Modified Inversion Density Estimation for Lower Dimensions Datasets

#### 4.2. Performances of Reduced Clustering Based on the Modified Inversion Density Estimation for Higher-Dimensional Datasets

## 5. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

| Dataset | RCBMIDE |
|---|---|
| arrhythmia | TriMap (5) |
| breast | UMAP (3) |
| Coil20 | LLE (3) |
| dermatology | NMF (2) |
| german | MDS (3) |
| heart-statlog | CosinePCA (4) |
| iono | LLE (2) |
| segment | TriMap (5) |
| spambase | PCA (10) |
| wdbc | MDS (5) |

| Dataset | RCBMIDE |
|---|---|
| balance-scale | UMAP (2) |
| atom | UMAP (2) |
| cpu | TriMap (5) |
| diabetes | TriMap (5) |
| ecoli | UMAP (4) |
| glass | LLE (5) |
| Haberman | TSVD (2) |
| iris | UMAP (6) |
| pmf | LLE (2) |
| thyroid | TSVD (4) |
| Wine | PCA (4) |


| ID | Data Sets | Sample Size (N) | Dimensions (D) | Classes |
|---|---|---|---|---|
| 1 | Balance-scale | 625 | 4 | 3 |
| 2 | Arrhythmia | 452 | 262 | 13 |
| 3 | atom | 800 | 3 | 2 |
| 4 | Breast | 570 | 30 | 2 |
| 5 | Coil20 | 1440 | 1024 | 20 |
| 6 | CPU | 209 | 6 | 4 |
| 7 | Dermatology | 366 | 17 | 6 |
| 8 | Diabetes | 442 | 10 | 4 |
| 9 | Ecoli | 336 | 7 | 8 |
| 10 | German | 1000 | 60 | 2 |
| 11 | Glass | 214 | 9 | 6 |
| 12 | Haberman | 306 | 3 | 2 |
| 13 | Heart-statlog | 270 | 13 | 2 |
| 14 | Iono | 351 | 34 | 2 |
| 15 | Iris | 150 | 4 | 3 |
| 16 | pmf | 649 | 3 | 5 |
| 17 | segment | 2310 | 19 | 7 |
| 18 | spambase | 4601 | 57 | 2 |
| 19 | Thyroid | 215 | 5 | 3 |
| 20 | wdbc | 569 | 30 | 2 |
| 21 | Wine | 178 | 13 | 3 |

| Dataset | Agg | BIRCH | GMM | BGMM | DBSCAN | K-Means | HDBSCAN | CBMIDE | RCBMIDE |
|---|---|---|---|---|---|---|---|---|---|
| balance-scale | 0.624 | 0.658 | 0.568 | 0.586 | 0.464 | 0.603 | 0.597 | 0.576 | 0.693 |
| atom | 1.000 | 0.868 | 0.883 | 0.960 | 1.000 | 0.719 | 1.000 | 0.891 | 1.000 |
| cpu | 0.823 | 0.828 | 0.746 | 0.641 | 0.833 | 0.761 | 0.813 | 0.815 | 0.858 |
| diabetes | 0.507 | 0.514 | 0.459 | 0.455 | 0.482 | 0.428 | 0.477 | 0.502 | 0.512 |
| ecoli | 0.804 | 0.845 | 0.762 | 0.747 | 0.682 | 0.688 | 0.646 | 0.754 | 0.817 |
| glass | 0.514 | 0.565 | 0.509 | 0.528 | 0.514 | 0.547 | 0.528 | 0.527 | 0.607 |
| Haberman | 0.748 | 0.761 | 0.667 | 0.716 | 0.758 | 0.748 | 0.739 | 0.735 | 0.742 |
| iris | 0.967 | 0.973 | 0.967 | 0.893 | 0.940 | 0.967 | 0.700 | 0.975 | 0.983 |
| pmf | 0.977 | 0.977 | 0.920 | 0.977 | 0.983 | 0.844 | 0.978 | 0.934 | 0.981 |
| thyroid | 0.930 | 0.949 | 0.963 | 0.949 | 0.874 | 0.944 | 0.823 | 0.778 | 0.834 |
| Wine | 0.978 | 0.994 | 0.972 | 0.983 | 0.949 | 0.978 | 0.876 | 0.953 | 0.963 |

| Dataset | Agg | BIRCH | GMM | BGMM | DBSCAN | K-Means | HDBSCAN | RCBMIDE |
|---|---|---|---|---|---|---|---|---|
| arrhythmia | 0.600 | 0.571 | 0.485 | 0.431 | 0.573 | 0.438 | 0.582 | 0.597 |
| Breast | 0.942 | 0.954 | 0.951 | 0.953 | 0.903 | 0.928 | 0.743 | 0.909 |
| Coil20 | 0.738 | 0.738 | 0.638 | 0.675 | 0.867 | 0.733 | 0.884 | 0.882 |
| dermatology | 0.956 | 0.978 | 0.910 | 0.855 | 0.694 | 0.962 | 0.809 | 0.867 |
| german | 0.705 | 0.713 | 0.696 | 0.638 | 0.712 | 0.673 | 0.704 | 0.716 |
| heart-statlog | 0.807 | 0.811 | 0.819 | 0.826 | 0.815 | 0.848 | 0.626 | 0.831 |
| iono | 0.729 | 0.795 | 0.849 | 0.809 | 0.929 | 0.712 | 0.906 | 0.899 |
| segment | 0.708 | 0.732 | 0.631 | 0.612 | 0.529 | 0.665 | 0.530 | 0.789 |
| spambase | 0.878 | 0.918 | 0.856 | 0.857 | 0.693 | 0.854 | 0.690 | 0.905 |
| wdbc | 0.942 | 0.954 | 0.951 | 0.958 | 0.903 | 0.928 | 0.743 | 0.951 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lukauskas, M.; Ruzgas, T.
Reduced Clustering Method Based on the Inversion Formula Density Estimation. *Mathematics* **2023**, *11*, 661.
https://doi.org/10.3390/math11030661
