Open Access Article

Optimal Clustering and Cluster Identity in Understanding High-Dimensional Data Spaces with Tightly Distributed Points

1 Living PlanIT AG, Knonauerstrasse 52E, 6330 Cham, Switzerland
2 Data Science Central, 2428 35th Avenue NE, Issaquah, WA 98029, USA
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2019, 1(2), 715-744; https://doi.org/10.3390/make1020042
Received: 8 April 2019 / Revised: 31 May 2019 / Accepted: 3 June 2019 / Published: 5 June 2019
The sensitivity of the elbow rule in determining an optimal number of clusters is demonstrated for high-dimensional spaces characterized by tightly distributed data points. The high-dimensional data samples are not artificially generated; they are taken from a real-world evolutionary many-objective optimization and comprise Pareto fronts from the last 10 generations of an optimization computation with 14 objective functions. The choice of Pareto fronts is strategic: a user ultimately implements only one solution from the Pareto set, so a systematic means of reducing the cardinality of candidate solutions is imperative. Accordingly, this manuscript covers clustering the data and identifying the cluster from which to pick the desired solution, highlighting the implementation of the elbow rule and the use of hyper-radial distances for cluster identity. The Calinski-Harabasz statistic was favored as the criterion used in the elbow rule because of its robustness; it accounts for both the variance within clusters and the variance between clusters. The exercise also provided an opportunity to revisit the justification for using the highest Calinski-Harabasz criterion to determine the optimal number of clusters for multivariate data. The elbow rule predicted the optimal number of clusters at the upper end of the candidate range, whereas the highest Calinski-Harabasz criterion favored a number at the lower end. Both results are used in a complementary way for understanding high-dimensional data, although it remains inconclusive which of the two methods determines the true optimal number of clusters.
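The two ingredients named in the abstract, the Calinski-Harabasz criterion scanned over candidate cluster counts (for the elbow rule) and a hyper-radial distance for identifying the cluster of interest, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a scikit-learn k-means workflow, synthetic data standing in for the 14-objective Pareto fronts, min-max normalization of the objectives, and a hyper-radial distance taken as the root-mean-square of the normalized objective values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))  # synthetic stand-in for a 14-objective Pareto front

# 1. Calinski-Harabasz criterion over a range of candidate cluster counts.
ks = list(range(2, 15))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(calinski_harabasz_score(X, labels))

# Highest-criterion rule: take the k with the largest score.
k_highest = ks[int(np.argmax(scores))]
print("k with the highest Calinski-Harabasz score:", k_highest)
# Elbow rule: plot `scores` against `ks` and take the k where the curve bends;
# the abstract reports that this tends to sit at the upper end of the range.

# 2. Hyper-radial distance of each cluster centroid (assumed here to be the
#    root-mean-square of min-max-normalized objective values, with all
#    objectives treated as minimized).
km = KMeans(n_clusters=k_highest, n_init=10, random_state=0).fit(X)
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
hrd = np.array([
    np.sqrt((X_norm[km.labels_ == c].mean(axis=0) ** 2).mean())
    for c in range(k_highest)
])
print("cluster nearest the ideal point (smallest hyper-radial distance):",
      int(np.argmin(hrd)))
```

Reading the elbow from the score curve remains a visual judgment, which is consistent with the abstract's observation that the elbow rule and the highest-criterion rule can disagree on the optimal number of clusters.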
Keywords: the elbow rule; Calinski-Harabasz criterion; Pareto front; evolutionary many-objective optimization; high-dimensional data; Sammon’s nonlinear mapping; classical multi-dimensional scaling; hyper-radial distance

Graphical abstract

MDPI and ACS Style

Chikumbo, O.; Granville, V. Optimal Clustering and Cluster Identity in Understanding High-Dimensional Data Spaces with Tightly Distributed Points. Mach. Learn. Knowl. Extr. 2019, 1, 715-744.

