1. Introduction
Clustering is an unsupervised data processing technique that can effectively uncover the distribution structure and key features hidden in a detection dataset [1,2]. Because clustering tasks vary widely, a large number of clustering algorithms have been proposed, such as C-Means (CM) [3], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [4], and, more recently, Density Peak Clustering (DPC) [5]. Different algorithms can yield clusters with different features; therefore, evaluating clustering results is an important part of clustering analysis. In the absence of prior information, clustering results are generally evaluated by one or several clustering validity indexes (functions) [6]. So far, several dozen validity indexes with different purposes have been proposed [7]. The core task of these validity indexes is to determine the optimal number of clusters, which decides the correctness of the clustering results. Typical validity indexes include the Davies–Bouldin measure [8], the Tibshirani Gap statistic [9], and the Xie–Beni separation norm [10]. Arbelaitz et al. [11] compared existing validity indexes by their applicable ranges. In recent years, further indexes have been proposed, such as set matching measures for external validity [12]. More comprehensive reviews can be found in [13,14].
These indexes have their respective application ranges and limitations. However, the following issues remain unsolved:
- (1) Various features. The clusters in a dataset can have various features, such as different densities, sizes, shapes, overlaps among clusters, and high dimensionality. However, most typical indexes are designed to assess the clustering results of the widely used CM algorithm and cannot evaluate results produced by other clustering algorithms. Consequently, clustering results other than those of CM cannot be accurately evaluated;
- (2) Evaluation criterion. Almost all validity indexes are constructed on the principle of maximizing the inter-cluster distance while minimizing the intra-cluster distance, so different constructions of these two distances lead to different indexes. Essentially, most validity indexes directly compute the mutual distances among data points but neglect their backgrounds. Hence, their evaluation methods fail to fully use the available information.
To address these issues, in this paper we propose a grid partitioning-based validity index that can evaluate clustering results produced by any algorithm and covering arbitrary cluster features. Grid-based partitioning quickly and easily maps all data points into a grid structure. To measure clusters with different features, all data points in each cluster are normalized towards a spherical shape. Different from the existing validity indexes, which only compute the mutual distances among data points, we count the empty and intersecting grids that serve as the background of each cluster. Extensive experiments on real and synthetic datasets validate the promising efficiency and effectiveness of the proposed validity index.
2. Related Work
To better understand the context and challenges of clustering result evaluation, in this section, three typical clustering algorithms widely used in the field of data clustering, namely CM, DBSCAN, and DPC, are introduced. Subsequently, three representative validity indexes are examined and summarized to provide a basis for the following analysis of the proposed grid partitioning-based validity index.
Let X = {x1, x2, …, xn} be a detection dataset with n data points in a d-dimensional data space, and let S1, S2, …, Sc be c independent subsets of X. If data point xj is assigned to the ith subset Si, then uij = 1; otherwise uij = 0, where uij is a binary membership function computed as follows:

$$u_{ij}=\begin{cases}1, & x_j\in S_i\\ 0, & \text{otherwise.}\end{cases}$$

When a clustering algorithm is applied to X, all data in X are assigned to the subsets S1, S2, …, Sc. Such an assignment is called a hard partitioning, satisfying the following:

$$\sum_{i=1}^{c}u_{ij}=1,\; j=1,\ldots,n; \qquad 0<\sum_{j=1}^{n}u_{ij}<n,\; i=1,\ldots,c; \qquad \bigcup_{i=1}^{c}S_i=X.$$
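For illustration only (this sketch is not code from the paper), the binary membership function and the hard-partitioning constraint can be expressed as a c × n matrix built from hard cluster labels:

```python
import numpy as np

def hard_membership(labels, c):
    """Build the binary membership matrix u (c x n) from hard cluster labels.

    labels[j] = i means point x_j is assigned to subset S_i, so u[i, j] = 1.
    Every column sums to 1, which is exactly the hard-partitioning constraint.
    """
    labels = np.asarray(labels)
    n = labels.size
    u = np.zeros((c, n), dtype=int)
    u[labels, np.arange(n)] = 1
    return u

# Example: 6 points assigned to 3 subsets
u = hard_membership([0, 0, 1, 2, 1, 2], c=3)
assert (u.sum(axis=0) == 1).all()   # every point belongs to exactly one subset
```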
2.1. Clustering Algorithm
CM is the most widely used clustering algorithm, owing to its simplicity and high efficiency. The detailed steps of CM are illustrated in Algorithm 1.
Algorithm 1. Flowchart of the CM algorithm.
Input: detected dataset X with n data points and the number of partitioned clusters c.
Output: c partitioned clusters and the cluster label of each data point in X.
Steps:
(1) Randomly select c data points in X as the initial cluster centers v1, v2, …, vc;
(2) Repeat;
(3) Assign each data point to the single cluster whose center is nearest;
(4) Update every cluster center by the following equation:

$$v_i=\frac{1}{|S_i|}\sum_{x_j\in S_i}x_j,\quad i=1,2,\ldots,c;$$

(5) Stop if a convergence criterion is met and output the clustering results;
(6) Otherwise, go back to Step (2).
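A minimal NumPy sketch of Algorithm 1 is given below. It is an illustrative implementation under simple assumptions (Euclidean distance, a fixed iteration cap), not the authors' code:

```python
import numpy as np

def c_means(X, c, max_iter=100, tol=1e-6, rng=None):
    """Plain CM clustering: assign each point to the nearest center, then update centers."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = X[rng.choice(n, size=c, replace=False)]                  # Step (1): random initial centers
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                      # Step (3): nearest-center assignment
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(c)])    # Step (4): centroid update
        if np.linalg.norm(new_centers - centers) < tol:                # Step (5): convergence check
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```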
The key parameter in CM is the number of clusters c, which must be determined in advance. In the absence of prior information, the number of clusters is usually determined with one or several validity indexes. However, CM struggles to identify the cluster structure of datasets with arbitrarily shaped clusters; density-based clustering addresses this case, and DBSCAN is a typical density-based clustering algorithm.
Let ε be a uniform neighborhood radius of any point, p, in X, Nε(p) be the neighborhood of p, and Minpts be the minimum number of points in Nε(p). DBSCAN is based on the following notations:
- (1) Point density: The density of any point p in X is measured by the number of points in Nε(p), termed den(p);
- (2) Core point: A point p in X is termed a core point if den(p) is larger than Minpts;
- (3) Directly density-reachable point: A point p is directly density-reachable from a point q if p ∈ Nε(q) and q is a core point;
- (4) Density-reachable point: A point p is density-reachable from a point q if there exists a chain of points p1, …, pn in X, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi;
- (5) Density-connected point: A point p is density-connected to a point q in X if there is a core point O ∈ X such that both p and q are density-reachable from O in X;
- (6) Cluster and noise: A cluster C with respect to ε and Minpts in X is a nonempty subset of X such that for any p, q ∈ C, p is density-connected to q in X. Noise consists of the objects in X that do not belong to any cluster.
DBSCAN starts with an arbitrary object p in X and continuously retrieves all objects of X that are density-reachable from p with respect to ε. All clusters are found in this way. DBSCAN presents a commonly accepted notion of a cluster that has been adopted by almost all density-based algorithms. However, DBSCAN requires users to specify a global density defined by the two values Minpts and ε, and different values can generate very different clustering results, which is undesirable in practice. So far, no effective method can determine the two values of Minpts and ε in a general way.
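In practice, the DBSCAN procedure described above is available in standard libraries; a hedged usage example with placeholder parameter values (not recommendations from the paper) is:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 2)                                  # any detection dataset
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)     # eps plays the role of ε, min_samples of Minpts
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
print(n_clusters)
```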
The density peak clustering (DPC) algorithm combines the advantages of CM and DBSCAN. DPC is based on the following two assumptions. First, the center of any cluster is surrounded by neighbors with lower local density. Second, cluster centers are at a relatively large distance from any point with higher local density. The distance between any pair of points xi and xj in X is calculated as follows:

$$d_{ij}=\lVert x_i-x_j\rVert.$$

For each data point xi, DPC computes two quantities. First, the local density ρi of data point xi is defined as follows:

$$\rho_i=\sum_{j\neq i}\chi(d_{ij}-d_c),\qquad \chi(x)=\begin{cases}1, & x<0\\ 0, & \text{otherwise,}\end{cases}$$

where dc is a cutoff radius. Specifically, ρi equals the number of points that are closer than dc to point xi. The density in DPC is thus the same notion as in DBSCAN, and like DBSCAN, DPC can detect arbitrarily shaped clusters even when noisy data are present. Unlike DBSCAN, however, DPC also measures a second quantity δi, the minimum distance between point xi and any point with higher local density:

$$\delta_i=\min_{j:\,\rho_j>\rho_i}d_{ij},$$

with δi taken as the maximum distance maxj dij for the point of highest density. A point with a high value of δi is a local maximum of density around point i, so points with relatively high δi and ρi are regarded as cluster centers. After determining the number of centers, all remaining points are assigned to these centers to construct clusters by scanning all data points in X once.
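Both DPC quantities can be computed directly from the pairwise distance matrix; the sketch below follows the definitions above, with the cutoff radius dc supplied by the user:

```python
import numpy as np

def dpc_quantities(X, dc):
    """Compute the DPC local density rho_i and distance delta_i for every point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances d_ij
    rho = (d < dc).sum(axis=1) - 1                              # cutoff density, excluding the point itself
    order = np.argsort(-rho)                                    # indices sorted by decreasing density
    delta = np.zeros(len(X))
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = d[i].max()                               # highest-density point: maximal distance
        else:
            delta[i] = d[i, order[:rank]].min()                 # minimum distance to any denser point
    return rho, delta
```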
2.2. Typical Validity Index
The validity index is a function designed to maximize the inter-cluster distance while minimizing the intra-cluster distance, where the intra-cluster distance measures the compactness of a cluster and the inter-cluster distance evaluates the separation among clusters [15,16]. Usually, a clustering validity index is a function f(c) that takes the number of clusters c as its variable, and the evaluation can be represented as follows:

$$c^{*}=\arg\,\underset{c_{\min}\le c\le c_{\max}}{\operatorname{opt}}\; f(c),\qquad (3)$$

where opt denotes a maximum or a minimum, depending on the index.
A trial-and-error strategy can be used to find the optimal number of clusters in Equation (3). First, a possible range for the number of clusters c must be determined. Let cmin be the minimum value of c and cmax the maximum value. Usually, cmin is taken as 2 and, in the absence of prior knowledge about a detection dataset X with n data points, cmax is taken as √n [17]. In general, a selected clustering algorithm is applied to X with the value of c varied from cmin to cmax. After computing the value of Equation (3) for every possible number of clusters in [cmin, cmax], the c that attains the maximum or minimum value is regarded as the optimal number of clusters.
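As an illustration of this trial-and-error search, the following generic loop (our own sketch; cluster_fn and index_fn are placeholders for any clustering algorithm and validity index) evaluates every candidate c and returns the optimizing one:

```python
import numpy as np

def search_optimal_c(X, cluster_fn, index_fn, c_min=2, c_max=None, minimize=True):
    """Run the clustering algorithm for each candidate c and pick the c optimizing the index."""
    c_max = int(np.sqrt(X.shape[0])) if c_max is None else c_max  # common rule of thumb for c_max
    scores = {}
    for c in range(c_min, c_max + 1):
        labels = cluster_fn(X, c)          # any clustering routine returning one label per point
        scores[c] = index_fn(X, labels)    # the validity index f(c) evaluated on that partition
    best = (min if minimize else max)(scores, key=scores.get)
    return best, scores
```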
The searching process of a validity index thus takes a trial-and-error strategy, where the points of maximal or minimal value are associated with the optimal number of clusters; the evaluation starts at cmin = 2 and ends at a sufficiently large cmax. Different validity indexes combine intra- and inter-cluster distances in different ways and, thus, lead to different evaluation results. We take three typical validity indexes as examples, as explained below.
Let si and vi be the intra-cluster distance measure and the cluster center of the ith cluster, respectively.
- (1) Davies–Bouldin (DB) index [8]. Let dij denote the inter-cluster distance measure between clusters Ci and Cj, with c ranging over [cmin, cmax]. The DB index is defined as follows (a computational sketch is given after this list):

$$DB(c)=\frac{1}{c}\sum_{i=1}^{c}\max_{j\neq i}\frac{s_i+s_j}{d_{ij}},$$

and the c that minimizes DB(c) is taken as the optimal number of clusters.
- (2) Dual-Center (DC) index [18]. For any cluster center vi determined by a partitional clustering algorithm, let vi′ denote the center closest to vi; the dual center is then calculated as v̄i = (vi + vi′)/2. From these, a validity index is constructed by comparing ni(c) with n̄i(c), where ni(c) and n̄i(c) are the numbers of points of the ith cluster when the prototypes are taken as vi and v̄i, respectively. Among the existing validity indexes, DC shows relatively high accuracy and robustness.
- (3) Gap Statistic (GS) index. The GS index first computes an intra-cluster measure (see also the sketch after this list):

$$W_c=\sum_{i=1}^{c}\frac{1}{2\,|C_i|}\sum_{x_j,\,x_k\in C_i}\lVert x_j-x_k\rVert^{2}.$$

GS compares log(Wc) with its expectation under a null reference distribution, giving the gap statistic:

$$\mathrm{Gap}(c)=E^{*}\!\left[\log(W_c)\right]-\log(W_c),$$

where E* denotes the expectation under a null reference distribution. Denoting by sc the standard error of the estimate of E*[log(Wc)], the optimal number of clusters is determined as the smallest c such that Gap(c) ≥ Gap(c + 1) − s(c + 1).
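Since the DB and GS indexes above follow standard formulas, two small sketches may help make them concrete. The first computes the DB index directly from a labeled dataset (a generic implementation, not code from the paper):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average over clusters of the worst (s_i + s_j) / d_ij ratio (lower is better)."""
    ids = np.unique(labels)
    centers = np.array([X[labels == i].mean(axis=0) for i in ids])
    s = np.array([np.linalg.norm(X[labels == i] - centers[k], axis=1).mean()
                  for k, i in enumerate(ids)])                       # intra-cluster scatter s_i
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)  # inter-center distances d_ij
    ratios = (s[:, None] + s[None, :]) / np.where(d > 0, d, np.inf)
    np.fill_diagonal(ratios, -np.inf)                                # exclude j = i from the max
    return ratios.max(axis=1).mean()
```

The second sketch estimates Gap(c) with a uniform reference distribution over the bounding box of X; the number of reference datasets and the clustering routine are illustrative assumptions:

```python
def log_wc(X, labels):
    """log of the pooled within-cluster dispersion W_c (within-cluster sum-of-squares form)."""
    return np.log(sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
                      for i in np.unique(labels)))

def gap_statistic(X, cluster_fn, c, n_refs=10, rng=None):
    """Gap(c) = E*[log W_c] - log W_c and its standard error s_c."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    refs = (rng.uniform(lo, hi, size=X.shape) for _ in range(n_refs))   # null reference datasets
    ref_logs = np.array([log_wc(ref, cluster_fn(ref, c)) for ref in refs])
    gap = ref_logs.mean() - log_wc(X, cluster_fn(X, c))
    s = ref_logs.std() * np.sqrt(1 + 1.0 / n_refs)   # used in the rule Gap(c) >= Gap(c+1) - s(c+1)
    return gap, s
```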
In summary, the existing indexes all use the cluster center to measure the intra-cluster distance and depend on a particular clustering algorithm. They account neither for clustering features such as different-size and arbitrary-shape clusters, nor for overlapped and high-dimensional clusters. Consequently, their evaluations of clustering results are very limited. Hence, an efficient and comprehensive method is needed that can evaluate clustering results from any clustering algorithm and with arbitrary cluster features.
3. Clustering Evaluation Based on Grid Structure
To evaluate different clustering results, we apply grid partitioning to any detection dataset, whereby a novel clustering evaluation index is constructed below.
3.1. Grid Partition and Clustering Center
All data points in a clustered dataset are represented by a set X = [x1, x2, …, xn], where the ith data point xi = [xi1, xi2, …, xid] is a single point in a d-dimensional data space.
We partition all data points in X according to the steps of the fast bisecting grid (BG) algorithm [19,20], so that all data points are assigned to a set of grids as follows:
- (1) Solving the minimal grid that encloses all data objects in X. Let lmin-i = min{x1i, x2i, …, xni} and rmax-i = max{x1i, x2i, …, xni}, i = 1, 2, …, d, so that GRID = [lmin-1, rmax-1] × [lmin-2, rmax-2] × … × [lmin-d, rmax-d]. Any edge of a grid is its bounding interval in the corresponding dimension.
- (2) Successively bisecting GRID in the following way:
The first round of bisecting: The BG algorithm bisects the edge of GRID in a chosen dimension so that GRID is split into two equal-volume new grids, denoted GRID11 and GRID12. Accordingly, all data objects in GRID are assigned to GRID11 and GRID12;
The second round of bisecting: BG bisects the edge of every GRID1k (k = 1, 2) in a uniformly chosen dimension so that each GRID1k is split into two equal-volume new grids, yielding GRID21, GRID22, GRID23, and GRID24. Accordingly, all data objects in each GRID1k are assigned to its two sub-grids;
The jth round of bisecting: BG bisects every grid of the (j − 1)th round in a uniformly chosen dimension into two equal-volume new grids. The grids obtained in the jth round are denoted GRIDj1, GRIDj2, …, GRIDj,2^j, where 2^j is the total number of grids after the jth round of bisecting;
Solving an optimal grid size: BG orders all grids generated at the jth round of bisecting into three sets of grids D(s, j), s = 1, 2, 3, such that the density of any grid in D(t, j) is larger than that of any grid in D(t + 1, j), t = 1, 2. Let |·| denote the number of data objects of the set in the bracket. The optimal grid size in BG, denoted SIZE, is then characterized by Equation (12), where SIZE is a bisecting index; Equation (12) aims to maximize the differences among all grids.
The bisecting stops when the number of bisecting rounds equals OPT + q, where OPT is the round corresponding to the optimal grid size and q is the minimum number of additional rounds beyond OPT; the grid mapping is illustrated in the sketch below.
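To make the grid mapping concrete, the sketch below performs a fixed number of bisection rounds by cycling through the dimensions and returns an integer cell key for every point; the dimension-selection rule and the number of rounds are simplifying assumptions, since the paper chooses them via Equation (12):

```python
import numpy as np

def bisect_grid_assign(X, rounds):
    """Assign each point to a grid cell after `rounds` rounds of bisection.

    Each round halves every current grid along one dimension; here dimensions are
    chosen cyclically, so dimension k is split roughly rounds/d times.
    """
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    splits = np.array([(rounds + d - 1 - k) // d for k in range(d)])  # bisections per dimension
    n_cells = 2 ** splits                                             # cells along each dimension
    width = (hi - lo) / n_cells
    idx = np.floor((X - lo) / np.where(width > 0, width, 1)).astype(int)
    idx = np.clip(idx, 0, n_cells - 1)            # points on the upper boundary fall in the last cell
    # Encode the d-dimensional cell index as a single integer key
    keys = np.zeros(n, dtype=np.int64)
    for k in range(d):
        keys = keys * int(n_cells[k]) + idx[:, k]
    return keys
```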
Figure 1 shows that all points in a two-dimensional dataset are assigned to a set of grids by the BG algorithm, and accordingly the four clusters in the dataset are contained in four groups of mutually connected grids.
When using CM to cluster all points in X, the ith cluster center is

$$v_i=\frac{1}{|C_i|}\sum_{x_j\in C_i}x_j.$$

However, not all clusters produced by the various algorithms have a cluster center or manifest as spherical clusters. To normalize them, we first compute an equivalent center vi of Ci from the sum of pairwise intra-cluster distances:

$$\sum_{x_j\in C_i}\sum_{x_k\in C_i}\lVert x_j-x_k\rVert^{2}=2\,|C_i|\sum_{x_j\in C_i}\lVert x_j-v_i\rVert^{2}.\qquad (14)$$

Equation (14) shows that the sum of pairwise distances in a cluster equals a scaled sum of distances from all data points to an equivalent center. When a cluster is spherical, the equivalent center coincides with the cluster center computed above.
3.2. Cluster Normalization and Validity Index
Partition all points in X into c clusters C1, C2, …, Cc. For the ith cluster, which may have an arbitrary shape, we normalize all of its points so that its shape approaches a spherical distribution. First, we compute its equivalent center vi by Equation (14), i = 1, 2, …, c; then, we construct a spherical neighborhood centered at vi with radius Ri, determined by Equation (15), and randomly insert |Ci| uniformly distributed data points into this neighborhood.
Figure 2 shows the normalization process for two non-spherical clusters, where the red dotted circles refer to their individual means. Figure 2a shows the original non-spherical clusters, while Figure 2b shows the normalized clusters. As seen, after the normalization operation, the shape of each cluster tends towards a spherical structure.
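A sketch of the normalization step is given below. The radius choice used here (the mean distance from the equivalent center) is only an illustrative stand-in for Equation (15):

```python
import numpy as np

def normalize_cluster(points, rng=None):
    """Replace an arbitrarily shaped cluster by |C_i| points drawn uniformly in a sphere.

    The sphere is centered at the equivalent center v_i; the radius R_i used here
    (mean distance from v_i to the cluster points) is an illustrative assumption.
    """
    rng = np.random.default_rng(rng)
    n, d = points.shape
    v = points.mean(axis=0)                         # equivalent center (centroid)
    R = np.linalg.norm(points - v, axis=1).mean()   # assumed radius R_i
    # Draw n points uniformly inside a d-dimensional ball of radius R around v
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = R * rng.uniform(size=n) ** (1.0 / d)    # uniform-in-volume radial distribution
    return v + directions * radii[:, None]
```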
Note that after the normalization operation, all points have again been assigned to a set of grids. Hereafter, we call the sphere centered at the equivalent center vi with radius Ri a cover; the grids inside a cover that contain no data point are called empty grids, and the grids shared by two intersecting covers are called intersecting grids. From the sum of the number of empty grids and the number of intersecting grids, we define a new validity index as follows:

$$z(c)=\sum_{i=1}^{c}E_i(c)+2\sum_{i=1}^{c-1}\sum_{j=i+1}^{c}I_{ij}(c),\qquad (16)$$

where Ei(c) is the number of empty grids in Ci when partitioning X into c clusters, and Iij(c) is the number of intersecting grids in the two covers of Ci and Cj. The coefficient “2” arises because any intersecting grid falls into two clusters.
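The counting of background grids can be sketched as follows. The cover radius and the exact form of Equation (16) used here follow our reading of the text (empty grids inside a single cover, plus twice the grids shared by two covers) and are assumptions rather than the authors' exact implementation:

```python
import numpy as np

def gpvi_score(norm_points, labels, lo, hi, n_cells):
    """Compute z(c) = (# empty grids inside covers) + 2 * (# grids shared by two covers).

    norm_points : normalized data points (after the spherical normalization step)
    lo, hi, n_cells : bounding box and number of grid cells per dimension
    """
    width = (hi - lo) / n_cells
    # Cell index of every point, and the set of occupied (non-empty) cells
    idx = np.clip(np.floor((norm_points - lo) / width).astype(int), 0, n_cells - 1)
    occupied = {tuple(row) for row in idx}
    # Cover (center, radius) of every cluster; the radius choice mirrors the normalization sketch
    covers = []
    for i in np.unique(labels):
        pts = norm_points[labels == i]
        v = pts.mean(axis=0)
        covers.append((v, np.linalg.norm(pts - v, axis=1).max()))
    # Walk over all grid cells and classify them as empty or intersecting
    empty, intersecting = 0, 0
    for cell in np.ndindex(*[int(m) for m in n_cells]):
        center = lo + (np.array(cell) + 0.5) * width
        inside = sum(np.linalg.norm(center - v) <= R for v, R in covers)
        if inside >= 2:
            intersecting += 1
        elif inside == 1 and cell not in occupied:
            empty += 1
    return empty + 2 * intersecting
```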
Figure 3a–e shows the clustering results when the number of clusters c is taken as 2, 4, 6, 8, and 10, respectively, where each cluster is enclosed by a cover (circle) of minimal radius.
Looking closely at the empty grids (in green), their number decreases as c increases. On the other hand, when the number of clusters c is larger or smaller than the actual one, different numbers of intersecting grids appear among clusters. When the number of clusters is close to the real one, the number of intersecting grids (in pink) is smallest, but it increases quickly once c exceeds the real one. Hence, the sum of the two classes of grids reaches a minimum when the real number of clusters is met.
Figure 3f shows the minimum of the curve as c increases from 2 to 8. Note that the normalization of non-spherical clusters may turn two well-separated clusters into overlapped ones. However, since the proposed index does not exclude overlapping clusters from the count, the impact on its correctness is minimal.
Generally, the value of z(c) in Equation (16) exhibits two opposite monotonic trends:
- (1) As c increases towards the real number of clusters, the total number of empty grids in all covers decreases, while the number of intersecting grids decreases only slightly. When c is close to the real number of clusters, the number of empty grids tends to remain unchanged.
- (2) When c exceeds the real number of clusters, the number of empty grids keeps decreasing gradually, but the number of intersecting grids rises; the positive increment of intersecting grids exceeds the negative increment of empty grids, which produces an increase in z(c). Hence, the curve of z(c) has a minimum at the real number of clusters.
Note that, when searching for the real number of clusters, the maximal number of clusters is practically taken as √n [21]. Hereafter, we refer to the grid-partitioning-based validity index as GPVI. The GPVI steps are shown in Algorithm 2.
Algorithm 2. Flowchart of GPVI.
Input: a dataset X ∈ Rd with n points and clustering results from any algorithm at c = 1, 2, …, cmax.
Output: the suggested number of clusters.
Steps:
1. Partition the data space into a grid structure;
2. Assign all points to their corresponding grids;
3. Cluster all data points into c clusters by a given clustering algorithm;
4. Determine each cluster's equivalent center by Equation (14);
5. Normalize each cluster towards a spherical cluster by Equation (15);
6. Compute the value of z(c) by Equation (16) for c = 1, 2, …, cmax;
7. Determine the minimum of z(c);
8. Suggest the corresponding c as the optimal number of clusters.
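An end-to-end use of Algorithm 2 can be assembled from the sketches above; all helper names (c_means, normalize_cluster, gpvi_score) and the data file are our illustrative assumptions:

```python
import numpy as np

X = np.loadtxt("set1.txt")                      # assumed data file with one point per row
lo, hi = X.min(axis=0), X.max(axis=0)
n_cells = np.array([16, 16])                    # grid resolution chosen for illustration
scores = {}
for c in range(2, int(np.sqrt(len(X))) + 1):
    labels, _ = c_means(X, c)                   # Step 3: any clustering algorithm
    ids = np.unique(labels)
    norm = np.vstack([normalize_cluster(X[labels == i]) for i in ids])          # Steps 4-5
    norm_labels = np.concatenate([np.full((labels == i).sum(), i) for i in ids])
    scores[c] = gpvi_score(norm, norm_labels, lo, hi, n_cells)                  # Step 6
best_c = min(scores, key=scores.get)            # Steps 7-8: minimum of z(c)
```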
4. Experiment
We test the accuracy of GPVI on four artificial datasets and eight real datasets, respectively, and compare it with three existing validity indexes: DB, DC, and GS. In view of the different characteristics of the detected datasets, the clustering results are obtained with the CM, DPC, and DBSCAN algorithms.
4.1. Test on Four Artificial Datasets
Figure 4 shows four artificial datasets generated with the Matlab® toolbox (Matlab R2022 and Comsol 2016); each has a different number of data points and a different number of clusters, and they are denoted Set 1~4, respectively. Set 1 and Set 2 contain an additional 10% of noisy points (shown in red) to test the robustness of GPVI. The clusters in these sets have various densities, sizes, shapes, and distributions, and there are overlaps among the clusters in Set 4.
We use the three algorithms DPC, CM, and DBSCAN to cluster all data points in the four datasets and partition them into c clusters, with the number of clusters c taken as 1, 2, …, cmax, respectively, where the various numbers of clusters in DBSCAN are obtained by taking various values of ε and Minpts. The accuracy of the clustering evaluation of each validity index depends on whether the correct number of clusters can be found.
Figure 5 shows the curves of GPVI based on the three clustering algorithms, respectively. The points marked by small circles in these curves are the suggested optimal values by GPVI.
Set 1 contains clusters with different densities, which makes DBSCAN fail to find the correct number of clusters, whereas both CM and DPC find the correct one. In Set 2, CM cannot identify the line-shaped cluster located at the center. Both datasets are affected by the 10% noisy points, which verifies the robustness of GPVI to some extent. For Set 3, CM and DPC are erroneous. In Set 4, the CM algorithm cannot find most clusters correctly since there are overlaps among clusters. A remarkable advantage of GPVI is that it can point out the best clustering result across the three algorithms by comparing the minimal values of the corresponding curves. These results demonstrate that GPVI is capable of correctly identifying the best clustering results even when different clustering algorithms are applied and the clusters have various features.
We further compare GPVI with the three existing indexes: DB, GS, and DC.
Table 1 shows the numbers of clusters that are evaluated by the four validity indexes.
The three existing validity indexes are constructed for the CM algorithm, whereas CM was originally designed to partition spherical clusters rather than arbitrarily shaped ones. Therefore, they may incorrectly estimate the number of clusters in the four datasets. In particular, in Set 2 and Set 4 with CM and DBSCAN, the error is very large. In contrast, GPVI based on DPC shows the best performance on all four datasets and achieves nearly correct results when using CM and DBSCAN. For the various shapes in Set 3, GPVI works well based on DPC and DBSCAN but not on CM. Therefore, the merit of GPVI is to select the best one from any candidate clustering results, no matter which clustering algorithm is used. However, if none of the available clustering results contains the real number of clusters, GPVI cannot recover it either.
4.2. Test on the UCI Dataset
The UCI Machine Learning Repository [22] contains various kinds of benchmark datasets and is commonly used for evaluating machine learning algorithms. The UCI datasets, collected from the real world, cover a wide range of representative domains [23].
In this paper, eight representative UCI datasets containing clusters of various sizes (e.g., Ecoli, Wholesale), densities (e.g., Wine), and shapes (e.g., Satimage), as well as overlapped clusters (e.g., Banknote, Iris, Cancer, Pima, and Wine), are selected to validate the proposed index GPVI. The detailed characteristics of these datasets are listed in Table 2. The first column gives the names of the datasets; the second and third columns give the number of clusters and the dimensionality of each dataset, respectively; the fourth and fifth columns give the number of points in the whole dataset and in each cluster, respectively.
Since all these datasets are high-dimensional, we select two of their features to show the data distributions over all clusters in Table 3. Table 3 shows the original distributions of the data points with their actual clusters and the normalized distributions obtained by Equation (14). As seen, the clustering structure is unclear in the original distributions, but after normalizing all data points it becomes much clearer. This comparison shows that the normalization operation is feasible and effective for evaluating clustering results.
Furthermore, Table 4 shows the numbers of clusters selected by the four validity indexes when using the three clustering algorithms CM, DBSCAN, and DPC. Compared with the three existing indexes DB, DC, and GS, the evaluation results of GPVI are nearest to the real cluster numbers, and GPVI finds the correct number of clusters in all eight datasets except Ecoli. Table 3 shows that the Ecoli dataset includes small clusters containing only 5, 2, and 2 data points, which all clustering algorithms find difficult to identify. When all clusters in a dataset are close to spherical, as in Iris and Cancer, all validity indexes find the correct number of clusters with the different algorithms. In the other datasets, only DPC combined with GPVI suggests the correct number of clusters, whereas the other three indexes make errors on some datasets, such as Banknote and Segmentation. Note also that DBSCAN is very unstable; on some datasets, such as Satimage and Wine, it produces very large errors. Consequently, we conclude that GPVI outperforms the other three existing indexes across the various clustering algorithms.
Nevertheless, the proposed GPVI is applied externally to the clustering results; therefore, compared with the time required to execute the clustering algorithm once, the index must be evaluated multiple times to find the best clustering result, which increases the computational cost proportionally. In practice, to improve time efficiency, the number of index evaluations should be reduced as much as possible by exploiting prior knowledge, when available.
5. Conclusions
A correct clustering result depends on correctly determining the number of clusters, but doing so across different clustering algorithms and clusters with different features remains very difficult. In this paper, we propose a new clustering validity index that is independent of both the clustering algorithm and the data distribution. Different from existing validity indexes, which mainly depend on direct computation over the clustered data points, the proposed index focuses on measuring the background of each cluster. Combined with fast grid-based partitioning and fast computation, the new index shows stronger generalization ability and stability. Extensive experiments validate the proposed index in terms of accuracy and efficiency. The index is unsupervised and outperforms most of the existing indexes on some benchmark datasets.
There are two possible directions for future research. First, the normalization of arbitrarily shaped clusters to spherical ones can be improved; the method used in this paper is only one particular instance. Second, identifying the points in overlapped areas remains a challenge in clustering analysis: the transformation process may misclassify some points in the overlapped area and thus introduce deviation. How to correct the deviation caused by overlapped areas remains one of our future research focuses.
Author Contributions
Conceptualization, J.W. and Z.Z.; Data curation, Z.Z.; Resources, J.W.; Software, J.W.; Supervision, S.Y.; Algorithm design, S.Y.; Writing, J.W. and Z.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
All data are available in the attachment of the submission.
Conflicts of Interest
The authors declare no conflicts of interest.
Correction Statement
This article has been republished with a minor correction to the existing affiliation information and author name. This change does not affect the scientific content of the article.
References
- Zhu, S.; Zhao, Y.; Yue, S. Double-Constraint Fuzzy Clustering Algorithm. Appl. Sci. 2024, 14, 1649. [Google Scholar] [CrossRef]
- Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
- Ng, M.K. A note on constrained k-means algorithms. Pattern Recognit. 2000, 33, 515–519. [Google Scholar]
- Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, Portland, OR, USA, 2–4 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 226–231. [Google Scholar]
- Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1497. [Google Scholar]
- Wang, Y.; Yue, S.; Hao, Z.; Ding, M.; Li, J. An unsupervised and robust validity index for clustering analysis. Soft Comput. 2019, 23, 10303–10319. [Google Scholar]
- Masud, M.A.; Huang, J.Z.; Wei, C.; Wang, J.; Khan, I.; Zhong, M. I-nice: A new approach for identifying the number of clusters and initial cluster centres. Inf. Sci. 2018, 466, 129–151. [Google Scholar] [CrossRef]
- Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 2, 224–227. [Google Scholar] [CrossRef]
- Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2001, 63, 411–423. [Google Scholar]
- Xie, X.L.; Beni, G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 841–847. [Google Scholar] [CrossRef]
- Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256. [Google Scholar]
- Rezaei, M.; Fränti, P. Set Matching Measures for External Cluster Validity. IEEE Trans. Knowl. Data Eng. 2016, 28, 2173–2180. [Google Scholar] [CrossRef]
- Preedasawakul, O.; Wiroonsri, N. A Bayesian cluster validity index. Comput. Stat. Data Anal. 2025, 202, 1734–1740. [Google Scholar] [CrossRef]
- Akhanli, S.E.; Hennig, C. Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. Stat. Comput. 2020, 30, 1523–1544. [Google Scholar] [CrossRef]
- Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2014, 2, 267–279. [Google Scholar] [CrossRef]
- Du, M.; Ding, S.; Xue, Y. A novel density peaks clustering algorithm for mixed data. Pattern Recognit. Lett. 2017, 97, 46–53. [Google Scholar] [CrossRef]
- Ding, S.; Du, M.; Sun, T.; Xu, X.; Xue, Y. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl. Based Syst. 2017, 133, 294–313. [Google Scholar] [CrossRef]
- Yue, S.; Wang, J.; Bao, X. A new validity index for evaluating the clustering results by partitional clustering algorithms. Soft Comput. 2016, 20, 127–1138. [Google Scholar] [CrossRef]
- Yue, S.; Wei, M.; Wang, J.S.; Wang, H. A general grid-clustering approach. Pattern Recognit. Lett. 2008, 29, 1372–1384. [Google Scholar]
- Bandaru, S.; Ng, A.H.; Deb, K. Data mining methods for knowledge discovery in multi-objective optimization: Part a—Survey. Expert Syst. Appl. 2017, 70, 139–159. [Google Scholar] [CrossRef]
- Kwon, S.H.; Kim, J.H.; Son, S.H. Improved cluster validity index for fuzzy clustering. Electron. Lett. 2021, 57, 792–794. [Google Scholar] [CrossRef]
- UCI Dataset. Available online: http://archive.ics.uci.edu/ml/datasets.php (accessed on 17 March 2025).
- Ma, E.W.; Chow, T.W. A new shifting grid clustering algorithm. Pattern Recognit. 2004, 37, 503–514. [Google Scholar]