Pre-Determining the Optimal Number of Clusters for k-Means Clustering Using the Parameters Package in R and Distance Metrics
Abstract
1. Introduction
2. Materials and Methods
2.1. The Distance Metrics
2.1.1. The Euclidean Distance
2.1.2. The Manhattan Distance
2.1.3. The Canberra Distance
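For reference, the standard definitions of the three distance metrics named in Sections 2.1.1–2.1.3, for two p-dimensional points x = (x_1, …, x_p) and y = (y_1, …, y_p), are:

```latex
d_{\mathrm{Euclidean}}(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}, \qquad
d_{\mathrm{Manhattan}}(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{p} \lvert x_i - y_i \rvert, \qquad
d_{\mathrm{Canberra}}(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{p} \frac{\lvert x_i - y_i \rvert}{\lvert x_i \rvert + \lvert y_i \rvert}.
```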
3. The Datasets Used in the Study
4. The Methodology for K-Means Clustering
The Parameters Package for Determining the Number of Clusters
1. Comprehensive Evaluation: The various validation techniques consider different aspects of clustering quality, including compactness, separation, density, and statistical significance. Using a wide range of metrics ensures a more thorough and robust evaluation of the optimal number of clusters.
2. Mitigation of Method-Specific Biases: Each approach has its own assumptions and sensitivities; using multiple methods reduces dependence on any single one, making the results more reliable and less biased.
3. Consensus-Based Determination: Combining results from different methods yields a more stable, agreed-upon estimate of the best number of clusters, especially for complex datasets.
4. Flexibility and Adaptability: Different datasets may call for different methods; offering a wide set of options lets users select the metrics most appropriate for their data.
5. Facilitating Comparative Analysis: Having multiple validation results enables users to compare them and understand where they differ, leading to better-informed and more confident choices of the optimal number of clusters (a brief usage sketch follows this list).
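As a brief, hedged sketch of this consensus approach (not the authors' exact script), the call below applies parameters::n_clusters() to a generic built-in dataset; the optional arguments shown follow the package documentation and may differ across package versions.

```r
# Minimal sketch of the consensus approach: n_clusters() runs many cluster-validity
# indices and reports how many of them support each candidate number of clusters.
# The data below (iris) are illustrative only, not one of the study's datasets.
# install.packages(c("parameters", "NbClust"))   # if not already installed
library(parameters)

X <- as.data.frame(scale(iris[, 1:4]))   # standardize features before clustering

res <- n_clusters(X, package = c("easystats", "NbClust"))  # aggregate many indices
res   # the printed summary reports the k supported by the largest number of methods
```

The agreed-upon k can then be passed as the centers argument of a standard kmeans() call.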
5. Performance Metrics
5.1. Confusion Matrix
5.2. Accuracy
5.3. Precision and Recall
5.4. F1-Score
5.5. Numerical Study
6. Results and Discussion
6.1. The Optimal Number of Clusters
6.2. The Three Distance Metrics
6.3. Comparison of the Results Using the Three Distance Metrics with the True Group
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Dataset | No. of Features | No. of Classes | No. of Data Items | No. of Objects in Each Class |
|---|---|---|---|---|
| S1 | 3 | 3 | 500 | 166, 168, 166 |
| S2 | 10 | 3 | 500 | 137, 182, 181 |
| S3 | 10 | 3 | 500 | 148, 180, 172 |
| S4 | 10 | 3 | 500 | 206, 187, 107 |
| S5 | 2 | 3 | 200 | 40, 100, 60 |
| S6 | 3 | 3 | 1000 | 179, 563, 258 |
| S7 | 5 | 3 | 500 | 217, 191, 92 |
| S8 | 8 | 3 | 1000 | 320, 291, 389 |
| S9 | 8 | 3 | 1000 | 169, 486, 345 |
| S10 | 5 | 4 | 500 | 103, 120, 138, 139 |
| R1 | 13 | 3 | 178 | 59, 71, 48 |
| R2 | 4 | 3 | 272 | 143, 77, 52 |
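The simulation settings behind datasets S1–S10 are not reproduced in this overview; purely to illustrate the structure summarized above, a dataset shaped like S5 (2 features, 3 classes of 40, 100, and 60 objects) could be drawn from Gaussian clusters as in the hedged sketch below, whose centers and spreads are hypothetical rather than the study's parameters.

```r
# Hypothetical sketch: a synthetic dataset with the shape of S5 (2 features,
# 3 classes of 40, 100 and 60 objects). Centers and standard deviations are
# illustrative only; they are not the settings used in the study.
set.seed(1)
sizes   <- c(40, 100, 60)
centers <- matrix(c(0, 0,
                    4, 4,
                    8, 0), ncol = 2, byrow = TRUE)

S5 <- do.call(rbind, lapply(seq_along(sizes), function(g) {
  data.frame(x1    = rnorm(sizes[g], mean = centers[g, 1], sd = 1),
             x2    = rnorm(sizes[g], mean = centers[g, 2], sd = 1),
             class = g)
}))

table(S5$class)   # 40, 100, 60 objects per class, as reported for S5
```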
| Method | Description |
|---|---|
| Elbow | The within-cluster sum of squares (WCSS) values are plotted on the y-axis against different values of k on the x-axis. The optimal k is identified at the point where the curve shows an “elbow”, i.e., where the rate of decrease in WCSS changes markedly (a computational sketch of this and the silhouette criterion follows this table). |
| Silhouette | A silhouette score, ranging from −1 to +1, is calculated for each data point. These scores are averaged for each candidate value of k, and the k that yields the highest average silhouette score is selected. |
| CH | The best k value is the one with the maximal Calinski–Harabasz (CH) index. |
| DB | The best k value is the one with the lowest Davies–Bouldin Index (DBI). |
| Ratkowsky | The best k value is the one with the highest Ratkowsky index. |
| PtBiserial | This involves using the Point-Biserial Correlation Coefficient to assess the quality of clustering. The k value that produces the highest Point-Biserial score is selected. |
| McClain | This index evaluates the clustering quality by comparing the distances within clusters (compactness) to the distances between clusters (separation). The k value with the lowest McClain index value is chosen. |
| Dunn | The k value that yields the highest Dunn index is selected. |
| Gap_Maechler2012 | This method uses a bootstrap-based comparison of the observed within-cluster dispersion against a null (random) reference model and applies the firstSEmax rule, selecting the smallest k whose gap value is within one standard error of the first local maximum of the gap curve. |
| Gap_Dudoit2002 | This method builds upon the original gap statistic by incorporating practical refinements and a focus on high-dimensional biological data. The goal is to select the smallest value of k for which the gap statistic falls within one standard error of the maximum observed gap. |
| kl | This method measures the improvement in clustering quality as the number of clusters increases and identifies the point at which the improvement begins to slow down, commonly referred to as the “elbow.” The optimal k value is selected when the value of kl is maximal. |
| Hartigan | This method evaluates the improvement in the total within-cluster sum of squares when adding another cluster. A Hartigan value of 10 or less indicates that adding another cluster does not significantly enhance the clustering quality. |
| Scott | An adaptation of Scott’s Rule, which was originally designed for estimating histogram bin widths, this method uses the spread and dimensionality of the data to estimate the optimal value of k. |
| Marriot | The aim of this method is to identify the point at which adding more clusters no longer significantly enhances their compactness. The optimal value of k is determined at the point where adding additional clusters does not noticeably improve the Marriot index. |
| trcovw | The value of k is selected where the trcovw plot either has a sharp drop (the Elbow point) or stabilizes, thereby indicating that adding more clusters does not significantly reduce the within-cluster covariance. |
| Tracew | This method analyzes the trace of the within-cluster dispersion matrix. The optimal value of k is identified when adding more clusters does not reduce the Trace(W) value by much, thus indicating diminishing returns. |
| Friedman | This evaluates the clustering results by analyzing dispersion matrices, focusing on compactness and separation. The k value with the lowest Friedman index value is chosen. |
| Rubin | This index is closely related to the Friedman method; the value of k that maximizes the Rubin index is selected. |
| Duda | This approach evaluates whether splitting a cluster into two significantly improves the quality of the clusters. |
| Pseudot2 | The optimal value of k is identified at a local minimum on the Pseudot2 curve, which is followed by a sudden increase at the next value of k. |
| Beale | This method compares the residual sum of squares when using k clusters versus k + 1 clusters and uses an F-test to determine whether the improvement is statistically significant. |
| Ball | This is an internal cluster validation technique based on analyzing the ratio of within-cluster to between-cluster variation, with the aim of finding a balance between cluster compactness and separation. |
| Frey | The Frey method establishes a stopping rule by examining how much the WCSS decreases when increasing from k to k + 1. The goal is to stop increasing k when the percentage decrease in WCSS becomes consistently small, indicating that adding more clusters does not significantly improve data modeling. |
| SDindex | The best value for k is indicated by choosing the one that corresponds to the lowest SD index. |
| Cindex | This method compares the distances between points within the same cluster to the minimum and maximum possible distances among all data points. The best value of k has the lowest C-index. |
| CCC | The Cubic Clustering Criterion (CCC) method determines the optimal value for k by identifying the point where the CCC value reaches a local maximum or exhibits a significant peak for the first time. |
| SDbw | The Scattering-Density between-within (SDbw) index is a cluster validity index used to assess both the compactness (density) and separation (scatter) of clusters. The best value for k has the lowest SDbw value. |
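Most of these indices are computed internally by the packages used in the study; as a standalone illustration (assuming base R's kmeans() and the cluster package rather than the authors' code), the elbow and silhouette criteria described in the first two rows can be evaluated as follows.

```r
# Sketch of the elbow (WCSS) and average-silhouette criteria for k = 2..10,
# assuming base R kmeans() and cluster::silhouette(); data are illustrative.
library(cluster)

X  <- scale(iris[, 1:4])          # illustrative data, not one of the study's datasets
ks <- 2:10
D  <- dist(X)                     # Euclidean distances for the silhouette widths

wcss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
sil  <- sapply(ks, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, D)[, "sil_width"])          # average silhouette width at this k
})

plot(ks, wcss, type = "b", xlab = "k", ylab = "WCSS")   # inspect for the elbow
ks[which.max(sil)]                                      # k with the highest average silhouette
```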
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |
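From this 2 × 2 layout, the metrics of Sections 5.2–5.4 follow directly; the sketch below uses hypothetical counts (not the study's results) to spell out the formulas.

```r
# Hypothetical TP/FP/FN/TN counts, used only to spell out the formulas in
# Sections 5.2-5.4; the study's actual counts come from its clustering results.
TP <- 90; FP <- 10; FN <- 15; TN <- 85

accuracy  <- (TP + TN) / (TP + FP + FN + TN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)                       # also known as sensitivity
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
```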
Number of groups identified by each method for each dataset (the true number of groups is given in parentheses after each dataset name; * marks agreement with the true number):

| Method | S1 (3) | S2 (3) | S3 (3) | S4 (3) | S5 (3) | S6 (3) | S7 (3) | S8 (3) | S9 (3) | S10 (4) | R1 (3) | R2 (3) | True Group Detection Frequency (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Elbow | 3 * | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 (8.33) |
| Silhouette | 3 * | 2 | 3 * | 3 * | 3 * | 3 * | 2 | 4 | 3 * | 2 | 2 | 3 * | 7 (58.33) |
| CH | 3 * | 2 | 3 * | 3 * | 3 * | 3 * | 2 | 3 * | 2 | 2 | 2 | 3 * | 7 (58.33) |
| DB | 4 | 10 | 3 * | 3 * | 3 * | 3 * | 2 | 3 * | 3 * | 2 | 2 | 3 * | 7 (58.33) |
| Ratkowsky | 3 * | 3 * | 3 * | 3 * | 2 | 3 * | 2 | 3 * | 3 * | 2 | 2 | 3 * | 8 (66.67) |
| PtBiserial | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 2 | 4 | 3 * | 4 * | 2 | 3 * | 9 (75.00) |
| McClain | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 (0.00) |
| Dunn | 10 | 8 | 2 | 3 * | 3 * | 3 * | 2 | 2 | 4 | 4 * | 2 | 3 * | 5 (41.67) |
| Gap_Maechler2012 | 1 | 3 * | 3 * | 3 * | 1 | 1 | 3 * | 4 | 3 * | 4 * | 4 | 3 * | 7 (58.33) |
| Gap_Dudoit2002 | 3 * | 6 | 3 * | 3 * | 3 * | 3 * | 3 * | 4 | 3 * | 4 * | 6 | 3 * | 9 (75.00) |
| kl | 3 * | 3 * | 3 * | 3 * | 4 | 3 * | 3 * | 5 | 4 | 4 * | 5 | 3 * | 8 (66.67) |
| Hartigan | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 4 * | 3 * | 3 * | 12 (100.00) |
| Scott | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 | 3 * | 3 * | 11 (91.67) |
| Marriot | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 4 | 3 * | 4 * | 3 * | 3 * | 11 (91.67) |
| trcovw | 4 | 3 * | 3 * | 3 * | 5 | 3 * | 3 * | 3 * | 3 * | 3 | 3 * | 3 * | 9 (75.00) |
| Tracew | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 4 * | 3 * | 3 * | 12 (100.00) |
| Friedman | 3 * | 3 * | 3 * | 3 * | 9 | 3 * | 3 * | 3 * | 3 * | 3 | 3 * | 3 * | 10 (83.33) |
| Rubin | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 4 | 4 | 4 * | 5 | 3 * | 9 (75.00) |
| Duda | 2 | 2 | 2 | 3 * | 2 | 2 | 3 * | 3 * | 2 | 2 | 3 * | 3 * | 5 (41.67) |
| Pseudot2 | 2 | 2 | 2 | 3 * | 2 | 2 | 3 * | 3 * | 2 | 2 | 3 * | 3 * | 5 (41.67) |
| Beale | 2 | 2 | 2 | 3 * | 2 | 2 | 3 * | 2 | 2 | 2 | 2 | 2 | 2 (16.67) |
| Ball | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 * | 3 | 3 * | 3 * | 11 (91.67) |
| Frey | 1 | 1 | 1 | 1 | 1 | 1 | 3 * | 1 | 1 | 1 | 3 * | 1 | 2 (16.67) |
| SDindex | 4 | 10 | 3 * | 3 * | 3 * | 3 * | 3 * | 4 | 3 * | 4 * | 3 * | 3 * | 9 (75.00) |
| Cindex | 4 | 9 | 10 | 4 | 10 | 9 | 4 | 3 * | 3 * | 8 | 4 | 5 | 2 (16.67) |
| CCC | 3 * | 10 | 3 * | 8 | 3 * | 4 | 5 | 4 | 4 | 4 * | 5 | 3 * | 5 (41.67) |
| SDbw | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 8 | 0 (0.00) |
| Dataset | Distance Metric | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| S1 | Canberra | 0.8160 | 0.8181 | 0.8160 | 0.8202 |
| | Euclidean | 0.9260 | 0.9275 | 0.9261 | 0.9288 |
| | Manhattan | 0.9320 | 0.9334 | 0.9321 | 0.9346 |
| S2 | Canberra | 0.4920 | 0.5463 | 0.5059 | 0.5936 |
| | Euclidean | 0.8900 | 0.8918 | 0.8954 | 0.8883 |
| | Manhattan | 0.8540 | 0.8531 | 0.8551 | 0.8512 |
| S3 | Canberra | 0.7655 | 0.7658 | 0.7669 | 0.7647 |
| | Euclidean | 0.9559 | 0.9558 | 0.9568 | 0.9547 |
| | Manhattan | 0.9639 | 0.9638 | 0.9648 | 0.9628 |
| S4 | Canberra | 0.6700 | 0.6821 | 0.6939 | 0.6707 |
| | Euclidean | 0.9680 | 0.9708 | 0.9701 | 0.9715 |
| | Manhattan | 0.9620 | 0.9663 | 0.9648 | 0.9678 |
| S5 | Canberra | 0.6450 | 0.7417 | 0.7633 | 0.7212 |
| | Euclidean | 0.9500 | 0.9499 | 0.9667 | 0.9337 |
| | Manhattan | 0.9350 | 0.9366 | 0.9567 | 0.9174 |
| S6 | Canberra | 0.7550 | 0.7595 | 0.7866 | 0.7341 |
| | Euclidean | 0.9560 | 0.9563 | 0.9512 | 0.9616 |
| | Manhattan | 0.9430 | 0.9421 | 0.9414 | 0.9428 |
| S7 | Canberra | 0.7620 | 0.7449 | 0.7498 | 0.7400 |
| | Euclidean | 0.9900 | 0.9870 | 0.9913 | 0.9828 |
| | Manhattan | 0.9780 | 0.9720 | 0.9789 | 0.9652 |
| S8 | Canberra | 0.6470 | 0.6578 | 0.6581 | 0.6574 |
| | Euclidean | 0.9060 | 0.9087 | 0.9140 | 0.9034 |
| | Manhattan | 0.8810 | 0.8864 | 0.8916 | 0.8813 |
| S9 | Canberra | 0.6360 | 0.6559 | 0.6754 | 0.6374 |
| | Euclidean | 0.9410 | 0.9432 | 0.9521 | 0.9345 |
| | Manhattan | 0.9110 | 0.9076 | 0.9232 | 0.8925 |
| S10 | Canberra | 0.5760 | 0.5499 | 0.5523 | 0.5476 |
| | Euclidean | 0.8660 | 0.8698 | 0.8702 | 0.8695 |
| | Manhattan | 0.8320 | 0.8348 | 0.8333 | 0.8363 |
| R1 | Canberra | 0.7978 | 0.8013 | 0.7976 | 0.8050 |
| | Euclidean | 0.9607 | 0.9624 | 0.9671 | 0.9577 |
| | Manhattan | 0.9551 | 0.9577 | 0.9605 | 0.9549 |
| R2 | Canberra | 0.6544 | 0.5843 | 0.5763 | 0.5924 |
| | Euclidean | 0.9265 | 0.9212 | 0.9212 | 0.9212 |
| | Manhattan | 0.9338 | 0.9255 | 0.9257 | 0.9253 |
| Null Hypothesis | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|
| The performances of the three methods follow the same distribution | 23.473 K (<0.0001) *** | 23.502 K (<0.0001) *** | 23.536 K (<0.0001) *** | 23.502 K (<0.0001) *** |
| The performances of Canberra and Manhattan follow the same distribution | −17.250 M (<0.0001) *** | −17.167 M (<0.0001) *** | −17.083 M (<0.0001) *** | −17.167 M (<0.0001) *** |
| The performances of Canberra and Euclidean follow the same distribution | −18.750 M (<0.0001) *** | −18.833 M (<0.0001) *** | −18.817 M (<0.0001) *** | −18.833 M (<0.0001) *** |
| The performances of Manhattan and Euclidean follow the same distribution | 1.500 M (0.727) ns | 1.667 M (0.698) ns | 1.833 M (0.670) ns | 1.667 M (0.727) ns |

Each cell reports the test statistic with its p-value in parentheses; *** denotes a statistically significant result and ns denotes a non-significant one.
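The specific tests are not named in this overview (the K- and M-suffixed statistics suggest an omnibus rank test with pairwise follow-ups). Assuming a Kruskal–Wallis omnibus test and paired Wilcoxon comparisons across the twelve datasets, which is an assumption rather than the authors' stated procedure, such an analysis could be sketched as:

```r
# Hedged sketch: nonparametric comparison of the three distance metrics across
# datasets. The accuracies below are placeholders, and the choice of tests is an
# assumption, since the table above does not name them.
set.seed(1)
acc <- data.frame(
  Canberra  = runif(12, 0.49, 0.82),
  Euclidean = runif(12, 0.85, 0.99),
  Manhattan = runif(12, 0.83, 0.98)
)

# Omnibus: do the three metrics share the same performance distribution?
kruskal.test(list(acc$Canberra, acc$Euclidean, acc$Manhattan))

# Pairwise follow-ups on per-dataset (paired) differences
wilcox.test(acc$Canberra,  acc$Manhattan, paired = TRUE)
wilcox.test(acc$Canberra,  acc$Euclidean, paired = TRUE)
wilcox.test(acc$Manhattan, acc$Euclidean, paired = TRUE)
```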
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Junthopas, W.; Wongoutong, C. Pre-Determining the Optimal Number of Clusters for k-Means Clustering Using the Parameters Package in R and Distance Metrics. Appl. Sci. 2025, 15, 11372. https://doi.org/10.3390/app152111372