A Validity Index for Clustering Evaluation by Grid Structures

Wang, Jiachen; Zhang, Zuojing; Yue, Shihong

doi:10.3390/math13061017

Open AccessArticle

A Validity Index for Clustering Evaluation by Grid Structures

by

Jiachen Wang

,

Zuojing Zhang

and

Shihong Yue

^*

College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling 712100, China

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(6), 1017; https://doi.org/10.3390/math13061017

Submission received: 6 February 2025 / Revised: 14 March 2025 / Accepted: 18 March 2025 / Published: 20 March 2025

Download

Browse Figures

Versions Notes

Abstract

The evaluation of clustering results plays an important role in clustering analysis. Most existing indexes are designed for the evaluation of results from the most-used K-means clustering algorithm; it can identify only spherical clusters rather than arbitrary clusters. However, in recent decades, various algorithms have been proposed to cluster arbitrary clusters that are nonspherical, such as ones with arbitrary shapes, different sizes, distinct densities, and instances where there is overlap among clusters. To effectively solve these issues, in this paper, we propose a new validity index based on a grid-partitioning structure. First, all data points in a dataset are assigned to a group of partitioned grids. Then, each cluster is normalized towards a spherical shape, and the number of empty and intersecting grids in all clusters is computed. The two groups of grids serve as the background of each cluster. Finally, according to various clustering results, the optimal number of clusters is obtained when the number of total grids reaches its minimal value. Experiments are performed on real and synthetic datasets for any algorithms and datasets, revealing the generalization and effectiveness of the new index.

Keywords:

clustering evaluation; validity index; grid structure; clustering algorithm; clustering features

MSC:

91C20

1. Introduction

Clustering is an unsupervised data processing technique that can effectively uncover the distribution structure and key features hidden in a detection dataset [1,2]. Due to various clustering tasks, a large number of clustering algorithms have been proposed, such as the C-Means (CM) [3], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [4], and recently Density Peak Clustering (DPC) [5]. Different algorithms can yield various clusters with various features; therefore, evaluating clustering results is very important in clustering analysis. Except the prior information, the clustering results generally are evaluated by a clustering validity index (function) or several [6]. So far, several dozen validity indexes with different purposes have been proposed [7]. The core task of these validity indexes is to determine the optimal number of clusters, which decides the correctness of clustering results. The typical validity indexes include the Davies–Bouldin measure [8], Tibshirani Gap statistics [9], Xie-Beni’s separation norm [10], etc. Arbelaitz et al. [11] compared the existing validity indexes by their applicable range. In recent years, some indexes have been proposed, such as set matching measure for external validity [12]. More reviews can be found in [13,14].

These indexes have their respective application ranges and limitations. However, the following issues remain unsolved:

(1): Various features. The detected clusters in a dataset can have various features, such as densities, sizes, shapes, overlaps among clusters, and high dimensionality. However, most typical indexes are designed to assess the clustering results from the most used CM algorithm and cannot evaluate the clustering results from other clustering algorithms. Therefore, these clustering results except CM cannot be accurately evaluated;
(2): Evaluation criterion. Almost all the validity indexes are constructed based on the principle of maximizing the inter-cluster distance as opposed to minimizing the intra-cluster distances. Therefore, different constructions of the two distances lead to different indexes. Essentially, most validity indexes directly compute the mutual distances among data points but neglect their backgrounds. Hence, their evaluation methods fail to fully use the available information.

To address these issues, in this paper, we propose a grid partitioning-based validity index to evaluate the clustering results from any algorithms and arbitrary clustering features. The use of grid-based partitioning can quickly and easily map all data points into a grid structure. To measure the clusters with different features, all data points are normalized towards a spherical shape. Different from the existing validity indexes, we compute the number of empty and intersecting grids that serve as the background of each cluster, whereas the existing indexes only compute the mutual distances among data points. Extensive experiments are performed on real and synthetic datasets, validating the promising efficacy and effectiveness of the proposed validity index.

2. Related Work

To better understand the context and challenges of clustering result evaluation, in this section, three typical clustering algorithms widely used in the field of data clustering, namely CM, DBSCAN, and DPC, are introduced. Subsequently, three representative validity indexes are examined and summarized to provide a basis for the following analysis of the proposed grid partitioning-based validity index.

Let

X = {x_{1}, x_{2}, \dots, X_{n}}

be a detection dataset with n data points in a d-dimensional data space, and

S_{1}, S_{2}, \dots, S_{c}

be c independent subsets of X. If the data x_j is assigned to the ith subset S_i, u_ij = 1, or else 0, where u_ij is a binary membership function and is computed as follows:

u_{i j} = \{\begin{cases} 1, x_{j} \in S_{i} \\ 0, x_{j} \notin S_{i} \end{cases}, i = 1, 2, \dots, c; j = 1, 2, \dots, n

(1)

When performing a clustering algorithm to X, all data in X are assigned to subsets: S₁, S₂, …, S_c, which is called a hard partitioning, satisfying the following:

X = S_{1} \cup S_{2} \cup \dots \cup S_{c}, S_{i} \cap S_{j} = ϕ, i \neq j, i, j = 1, 2, \dots, c

(2)

2.1. Clustering Algorithm

The CM is the most-used clustering algorithm, owing to its simplicity and high efficiency. Detailed steps of CM are illustrated in Algorithm 1.

Algorithm 1. Flowchart of CM algorithm

Input: detected dataset X with n data points and the number of partitioned clusters c.
Output: c partitioned clusters and the cluster label of each data point in X.
Steps:
(1) Randomly select c data point in X as initial cluster centers: v₁, v₂, …, v_c.
(2) Repeat;
(3) Assign any data point to the only cluster according to the distance to the nearest center;
(4) Update every cluster center by the following equation:

v_{i} = \sum_{j = 1}^{n} u_{i j} x_{j} / \sum_{j = 1}^{n} u_{i j} for i = 1 ~ c;

(5) Stop if a convergence criterion is met and output the clustering results;
(6) Otherwise, go back to Step (2).

The key parameter in CM is the number of clusters (c), which must be determined in advance. In the absence of prior information, the number of clusters is usually solved by a validity index or several. However, CM algorithms struggle to identify the cluster structure of datasets with arbitrary shapes, the density-based clustering algorithms, DBSCAN is a typical density-based clustering algorithm.

Let ε be a uniform neighborhood radius of any point, p, in X, N_ε(p) be the neighborhood of p, and Minpts be the minimum number of points in N_ε(p). DBSCAN is based on the following notations:

(1): Point density: The density of any point p in X is measured by the number of points in N_ε(p), termed as den(p);
(2): Core point: A point p in X is termed as a core point if den(p) is larger than Minpts;
(3): Directly density-reachable point: A point p is directly density-reachable from a point q if p ∈ N_ε(q) and q is a core point;
(4): Density-reachable point: A point p is density-reachable from a point q, if there exists a chain of core points in X, p₁, …, p_n, p₁ = q, p_n = p, such that p_i and p_i+₁ is directly density-reachable;
(5): Density-connected point: A point p is density-connected to a point q in X if there is a core point O ∈ X such that both p and q are density-reachable from O in X;
(6): Cluster and noise: A cluster C with ε and Minpts in X is a nonempty subset of X such that for any p, q ∈ C, p is density-connected to q in X. Noise are these objects in X not belonging to any cluster.

DBSCAN starts with an arbitrary object p in X and continuously retrieves all objects of X density-reachable from p with ε. All clusters are, thus, found in the way. The DBSCAN presents a commonly accepted notation of a cluster which has been used in almost all density-based algorithms. But DBSCAN requires the users to specify a global density defined by the two values of Minpts and ε. Their different values can generate very different clustering results. This is undesired in practice. So far, no effective method can determine the two values of Minpts and ε in a general way.

The density peak clustering (DPC) algorithm combines the advantages of CM and DBSCAN. DPC is based on the following two assumptions. First, the cluster center of any cluster must be surrounded by neighbors with lower local density. Second, the cluster prototypes are at relatively large distance from any points with higher local density. The distance between any pair of points x_i and x_j in X is calculated as follows:

d (x_{i}, x_{j}) = | | x_{i} - x_{j} | |

(3)

For each data point i, DPC computes two quantities. Firstly, the local density ρ_i of data point x_i is defined as follows:

ρ_{i} = \sum_{j} χ (d (x_{i}, x_{j} - d_{c})), s . t ., χ (x) = 1 (i f x \leq 0) o r o (i f x > 0)

(4)

where d_c is a cutoff radius. Specifically, ρ_i is equal to the number of points that are closer than d_c to point x_i. The density in DPC is the same as that in DBSCAN. Like DBSCAN, DPC can detect arbitrary-shape clusters even though noisy data are contained. However, unlike DBSCAN, another key quantity δ_i in DPC is measured by computing the minimum distance between point, x_i, and other higher density points,

δ_{i} = \min_{i : ρ_{j} > ρ_{i}} d (x_{i}, x_{j})

(5)

A point with a high value of δ_i must be a local maximum of density around point i. These points with relatively high δ_i and ρ_i are regarded as the cluster centers. After determining the number of centers, all points are assigned to these centers to construct clusters by scanning once all data points in X.

2.2. Typical Validity Index

The validity index is a function designed to maximize the inter-cluster distance instead of minimizing the intra-cluster distance, where the intra-cluster distances measures the compactness of any cluster, while inter-cluster distances evaluate the separation among various clusters [15,16]. Usually, a clustering validity is a function f(c) that uses the number of clusters c, as, its variable, and it can be represented as follows:

max(min) f(c), c = 1, 2, …, C

(6)

The trial-and-error strategy can be used to find the optimal number of clusters in Equation (3). First, a possible range of the number of clusters, c, is required to be determined. Let c_min be the minimum value of c and c_max be the maximum value of c. Usually, c_min is taken as 2 and c_max is taken as

\sqrt{n}

if there is no prior knowledge [17] in a detection dataset, X, with n data points. In general, a selected clustering algorithm is performed to X with the value of c set from c_min to c_max. After computing the value of Equation (3) for all possible number of clusters in [c_min, c_max], the obtained maximum or minimum values is regarded as the optimal number of clusters.

The searching process of a validity index takes a trial-and-error strategy, where the points of maximal or minimal values are associated with the optimal number of clusters and the evaluation process starts with c_min = 2 and ends with a c_max that is large enough. Different validity indexes consist of different combinations of intra- and inter-cluster distances and, thus, lead to different evaluation results. We take three typical validity indexes as examples, as explained below.

Let

Δ_{i}

and v_i be the intra-cluster distance measure and cluster center of the ith cluster, respectively.

(1): Davies–Bouldin (DB) index [8]. Let $δ_{i j}$ denote the inter-cluster distance measure between clusters of C_i and C_j, and c ranges in [c_min, c_max]. The DB index is defined as follows:

$D B = \sum_{i = 1}^{c} R_{i} / c, s . t ., R_{i} = \max_{j, j \neq i} (Δ_{i} + Δ_{j}) / δ_{i j}, δ_{i j} = | v_{i} - v_{j} |, Δ_{i} = \sum_{x \in C_{i}} | x - v_{i} | / | C_{i} |$

(7)
(2): Dual-Center (DC) index [18]. For any clustering center, v_i, determined by a partitional clustering algorithm, assume $v_{i}^{'}$ is the closest center to v_i, then the dual center is calculated as $v_{i}^{″}$ = (v_i + $v_{i}^{'}$ )/2. Finally, a novel validity index can be constructed, i.e.,

$D C_{c} = \sum_{i = 1}^{c} Δ_{i} (c) / \sum_{i = 1}^{c} δ_{i} (c), s . t ., Δ_{i} = \sum_{k = 1}^{n_{i} (c)} {(x_{i} - v_{i})}^{2}, δ_{i} = \sum_{k = 1}^{n_{i}^{″} (c)} {(x_{i} - v_{i}^{″})}^{2}$

(8)

where n_i(c) and $n_{i}^{″}$ (c) are the number of points of the ith cluster when the prototypes are regarded as v_i and $v_{i}^{″}$ , respectively. Among the existing validity indexes, DC has higher accuracy and robustness.
(3): Gap Statistic (GS) index. The GS index firstly computes an intra-cluster measure as follows:

W_{c} = \sum_{i = 1}^{c} D_{i} / (2 | C_{i} |), s . t ., D_{i} = 2 | C_{i} | \sum_{j \in C_{i}} | | x_{j} - \bar{x} | |, \bar{x} = \sum_{i = 1}^{| C_{i} |} x_{i} / | C_{i} |

(9)

GS uses the notions of log(W_c) and the expectation of log(W_c), and, thus, the gap statistics was proposed as follows:

G a p_{c} : = E * [\log (W_{c})] - \log (W_{c}), s . t ., W_{c} = \sum_{i = 1}^{c} D_{i} / (2 |C_{i}|)

(10)

where E* denotes the expectation under a null reference distribution. By denoting

\bar{l} = (1 / B) \sum_{b} \log_{c} b

, the optimal number of clusters is determined as follows:

c * = \min c, s . t ., G a p (c) \geq G a p (c + 1) - s_{c + 1}, s_{c + 1} = s d_{c + 1} {(1 + 1 / B)}^{1 / 2}, s d_{c} = {[(1 / B) \sum_{b} \log {(W_{c b} - \bar{l})}^{2}]}^{1 / 2}

(11)

In summary, the existing indexes all use the center to measure intra-cluster distance and have to depend on a selected clustering algorithm. They neither involve the clustering features such as different-size and arbitrary-shape clusters, nor those of the overlapped and high-dimensional clusters. Consequently, their evaluations for the clustering results are very limited. Hence, an efficient and comprehensive method is necessary, which can evaluate clustering results for any clustering algorithm and arbitrary clustering parameters.

3. Clustering Evaluation Based on Grid Structure

To evaluate different clustering results, we apply grid partitioning to any detection dataset, whereby a novel clustering evaluation index is constructed below.

3.1. Grid Partition and Clustering Center

All data points in any clustered dataset are represented by a set X = [x₁, x₂, …, x_n]

\in R^{d \times n}

, where the ith data point in X refers to x_i = , which is a single data point in a d-dimensional data space.

We partition all data points in M according to the step of a fast bisecting grid (BG) algorithm [19,20], and, thus, all data points are assigned to a set of grids based on the following steps:

(1): Solving the minimal grid that encloses all data objects in X.

Let l_min-i = min{x_1i, x_2i,…,x_ni} and r_max-i = max{x_1i, x_2i, …,x_ni}, i = 1, 2, …, d, and so GRID = [l_min₋₁, r_max₋₁] × [l_min₋₂, r_max₋₂] ×…× [l_min-d, r_max-d]. Let any edge of a grid be its bounded interval in the related dimension.

(2)

Successively bisecting GRID in the following ways:

The first round of bisecting: The BG algorithm bisects the edge of GRID in a chosen dimension so that GRID is bisected into two equal-volume new grids, denoted as GRID¹¹, GRID¹². Accordingly, all data objects in GRID are assigned into GRID¹¹, GRID¹²;
The second round of bisecting: BG bisects an edge of GRID^1k (k = 1, 2) in a uniform chosen dimension so that GRID^1k is bisected into two volume-equal new grids, denoted as GRID²¹, GRID²². Hence, all data objects in GRID^1k are assigned into GRID²¹, GRID²²;
The jth round of bisecting: BG bisects each grid in (j − 1)-th round in a uniform dimension into two volume-equal new grids. All obtained grids in the j-th round of bisecting are denoted as GRID^j¹, GRID^j², GRID^j³, …, ${GRID}^{j 2^{j}}$ , where 2^j is the total number of grids in the jth round of bisecting;
Solving an optimal grid size: BG orders all grids generated at jth round of bisecting into three sets of grid: D(s, j), s = 1, 2, 3, satisfying that the density of any grid in D(t, j) is larger than any other grid in D(t+1, j), t = 1, 2. Let |●| be the number of data objects of the set in the bracket. The optimal grid size in BG, say SIZE, is characterized as follows:

SIZE = min_j {|D(2,j)|/|D(1,j)|} and OPT = arg min_j {|D(2,j)|/|D(1,j)|},

(12)

where SIZE is a bisecting index. Equation (12) aims to maximize the differences among all grids.
The bisecting stops if the bisected rounds equal OPT + q, where q represents the minimum number of additional rounds beyond OPT.

Figure 1 shows that all points in a two-dimensional dataset are assigned to a set of grids by the BG algorithm, and accordingly the four clusters in the dataset are contained in four groups of mutually connected grids.

When using CM to cluster all points in X, the ith cluster center,

v_{i} = \sum^{x_{j} \in C_{i}} u_{i j} x_{i} / \sum^{x_{j} \in C_{i}} u_{i j}, i = 1, 2, \dots, c

(13)

However, not all clusters from various algorithms have a clustering center and manifest as spherical clusters. To normalize them, we first compute center v_i in C_i by the sum of pairwise intra-cluster distances as follows:

\begin{array}{l} D_{i} & = \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} | | x_{p} - x_{q} | |^{2} \\ = \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} < x_{p} - x_{q}, x_{p} - x_{q} > \\ = \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} < x_{p} - v_{r} + v_{r} - x_{q}, x_{p} - v_{r} + v_{r} - x_{q} > \\ = \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} {< x_{p} - v_{r}, x_{p} - v_{r} > + 2 < x_{p} - v_{r}, v_{r} - x_{q} > + < v_{r} - {\bar{x}}_{q}, v_{r} - x_{q} >} \\ = \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} < x_{p} - v_{r}, x_{p} - v_{r} > + 2 \sum_{x_{p} \in C_{r}} \sum_{x_{q} \in C_{r}} < x_{q} - v_{r}, v_{r} - x_{q} > + \sum_{x_{p} \in C_{r}} \sum_{x_{q} \in C_{r}} < v_{r} - x_{q}, v_{r} - x_{q} > \\ = 2 \sum_{x_{p} \in C_{i}} \sum_{x_{q} \in C_{i}} < x_{p} - v_{r}, v_{r} - x_{q} > \\ = 2 | C_{i} | \sum_{x_{j} \in C_{i}} < x_{q} - v_{r}, v_{r} - x_{q} > \end{array}

That is,

D_{i} = 2 | C_{i} | \sum_{x_{j} \in C_{i}} | | x_{j} - v_{i} | |^{2}, i = 1, 2, \dots, c

(14)

Equation (14) shows that the sum of pairwise distances in a cluster is equal to the sum of distances from all data points to an equivalent center. When a cluster is spherical, the equivalent center is just that, recalculated by Equation (10).

3.2. Cluster Normalization and Validity Index

Partition all points in M into c clusters C₁, C₂, …, C_c. For any ith cluster with arbitrary shapes, we normalize all points in the cluster to make its shape approach spherical distribution. First, we compute its equivalent center v_i by Equation (14), i = 1, 2, …, c; then, we construct a spherical neighborhood and randomly insert |C_i| distributed data points into the neighborhood. The neighborhood is centralized at v_i with radius R_i, which satisfies the equation,

R_{i} = \max d_{p q}, s . t, d_{p q} = | x_{p} - x_{q} |, x_{p}, x_{q} \in C_{i}, i = 1, 2, \dots, c

(15)

Figure 2 shows the normalizing process for two non-spherical clusters, and the red dotted circles refer to their individual means. Figure 2a represents the original non-spherical clusters, while Figure 2b represents the normalized clusters. As seen, after normalization operation, the shape of each cluster tends to be in a spherical structure.

Note that after the normalization operation, all points have been assigned to a set of grids. Hereafter, we call any sphere a cover that is located at the equivalent center v_i with radius R_i, these grids in a cover that contain no data point are known as empty grids, and these grids between any two intersecting covers are called intersecting grids. According to the sum of both the number of empty grids and the number of intersecting grids, we define a new validity index as follows:

\min z (c) = \sum_{i = 1}^{n} | E m p t y g r i d (c, i) | + 2 \sum_{i = 1}^{n} \sum_{j = i + 1}^{n} | C r o s s g r i d (c, i, j) |

(16)

where

| E m p t y g r i d (c, i) |

is the number of empty grids in C_i when partitioning X to c clusters, and

| C r o s s g r i d (c, i, j) |

is the number of intersecting grids in the two covers of C_i and C_j. The coefficient “2” results from any intersecting grid that falls into two clusters.

Figure 3a–e shows the clustering results when the number of clusters, c, is taken as 2, 4, 6, 8, and 10, respectively, where each cluster is enclosed by a cover (circle) that has the minimal radius.

With a close look at these empty grids in green, it is seen that the number of empty grids decreases as c increases. On the other hand, when the number of clusters c is larger or smaller than the actual one, different numbers of intersecting grid appear among clusters. When the number of clusters is close to the real one, the number of intersecting grids in pink is least. However, it increases quickly when c is larger than the real one. Hence, their sum of the two classes of grids appears to be a minimum value when the real number of clusters is met. Figure 3f shows the minimum at the curve as c increases from 2 to 8. Specially, the normalizing process for two non-spherical clusters possibly makes two well separated clusters into lapped clusters. However, since our proposed index does not exclude counting of overlapping clusters, its impact on the correctness of our proposed index is, thus, minimal.

Generally, there are two opposite monotonic trends on the value of z(c) in (16) as follows:

(1): As c increases before the real number of clusters, the number of empty grids in all covers totally decreases. Specially, the intersecting grids reduce but have little. When c is close to the real number of clusters, the number of empty grids tends to remain unchanged.
(2): When c is larger than the number of real clusters, the number of empty grids keeps decreasing gradually, but the number of intersecting grids rises. The positive increment of the number of intersecting grids is larger than the negative increments of the number of empty grids, which is an increase in z(c). Hence, there is minimum value at the real number of clusters along the curve of G(c).

Note that the maximal number of clusters is practically taken as

\sqrt{n}

[21] when searching for the real number of clusters. Hereafter, we can name the grid-partitioning-based valid index as GPVI. The GPVI step is shown in Algorithm 2.

Algorithm 2. Flowchart of GPVI.

Flowchart of GPVI. Input: A dataset X ∈ R^d with n points and clustering results from any algorithm at c = 1, 2, …, c_max.
Output: The suggested number of clusters.
Steps:
1. Partition the data space to a grid structure;
2. Assign all points to their corresponding grids;
3. Cluster all data points to c cluster by a give clustering algorithm;
4. Determine each clustering center by Equation (14);
5. Normalize each cluster toward a spherical cluster by Equation (15);
6. Compute the value of z(c) by Equation (16) at c = 1, 2, …, c_max;
7. Determine the minimum of z(c);
8. Suggest an optimal number of clusters.

4. Experiment

We test the accuracy of GPVI on four artificial datasets and eight real datasets, respectively, and compare it with three existing validity indexes, DB, DC and GS. In view of different characteristics of the detected datasets, the clustering results are obtained using CM, DPC, and DBSCAN algorithms, respectively.

4.1. Test on Four Artificial Datasets

Figure 4 shows four artificial datasets generated by the Matlab^® toolbox (Matlab R2022 and Comsol 2016), and each has various numbers of data points and different numbers of clusters, which are denoted as Set 1~4, respectively. Set 1 and Set 2 are added to 10% red noisy points to test the robustness of GPVI. Clusters in these sets have various densities, sizes, shapes, and distributions, and there are overlaps among clusters in Set 4.

We use the three algorithms DPC, CM, and DBSCAN to cluster all data points in the three datasets and partition them into c clusters when the number of clusters, c, is taken as 1, 2, …, c_max, respectively, where various numbers of clusters in DBSCAN are obtained by taking various values of

ε

. The accuracy of clustering evaluation of each validity index depends on whether the correct number of clusters can be found. Figure 5 shows the curves of GPVI based on the three clustering algorithms, respectively. The points marked by small circles in these curves are the suggested optimal values by GPVI.

Set 1 contains density-different clusters that make DBSCAN fail to find the correct number of clusters, but both CM and DPC can find the correct one. In Set 2, CM cannot identify the cluster that is located at the center with lined shaped. The two datasets are affected by these 10% noisy points, verifying GPVI robustness to some extent. For Set 3, CM and DPC are erroneous. In Set 4, the CM algorithm cannot find most clusters correctly since there are overlaps among clusters. A remarkable advantage of GPVI is that it can point out the best clustering results on the three algorithms by comparing their minimal values of the relative curve. These results demonstrate that GPVI is capable of correctly finding the best clustering results even though different clustering algorithms and clusters with various features are applied.

We further compare the GPVI with the three existing indexes: DBI, GS, and DC. Table 1 shows the numbers of clusters that are evaluated by the four validity indexes.

The three existing validity indexes are constructed for the CM algorithm, whereas CM was originally designed to partition spherical clusters rather than arbitrary-shape ones. Therefore, they may incorrectly evaluate the number of clusters in the four datasets. Essentially, in both Set 2 and Set 4 using the CM and DBSCAN, the error is very large. In contrast, GPVI based on DPC shows the best performance among all four datasets and achieves nearly correct results when using CM and DBSCAN in terms of accuracy. In terms of various shapes in Set 3, GPVI works well based on DPC and DBSCAN but CM cannot. Therefore, the merit of GPVI is to select the best one from any candidates of clustering results, no matter which clustering algorithm is used. However, if all available clustering results cannot contain the real number of clusters, GPVI can find it as well.

4.2. Test on the UCI Dataset

The UCI Machine Learning Repository [22] contains various kinds of benchmark datasets, which is usually used for evaluating various machine learning algorithms. The UCI datasets, collected from the real world, cover a wide range of representative domains [23].

In this paper, eight representative UCI datasets containing clusters of various sizes (e.g., Ecoli, Wholesale), densities (e.g., Wine), shapes (e.g., Satimage), and overlapped clusters (e.g., Banknote, Iris, Cancer, Pima, and Wine) are selected for validating our proposed index GPVI. The detailed characteristics of these datasets are listed in Table 2. The first column denotes the names of datasets; the second and third columns represent the number of clusters and dimension of each dataset, respectively; the fifth and fourth columns denote the number of points in each cluster and the whole dataset, respectively.

Since all these datasets have high-dimensional features, we select two among these features to show the data distributions over all clusters in Table 3.

Table 3 shows the original distributions of data points along with their actual clusters and the normalized distribution by Equation (14). As seen, the clustering structures among the original distribution are unclear, but after normalizing all data points, the clustering structure becomes clearer. The comparison shows the normalization operation is feasible and effective in evaluating the clustering results.

Furthermore, Table 4 shows the selected numbers of clusters from the four validity indexes when using the three clustering algorithms: CM, DBSCAN, and DPC. Compared with the three existing indexes DB, DC, and GS, the evaluation results of GPVI are nearest to the real cluster numbers, and it is capable of finding the correct numbers of clusters among all eight datasets along all eight datasets except for Ecoli. However, Table 3 shows that the Ecoli dataset includes two small clusters that have the number of data points 5, 2, and 2. However, all clustering algorithms find it difficult to identify the two clusters. When all clusters in a dataset are close to the sphere-distribution such as Iris and Cancer, all validity indexes can find the correct number of clusters along with different algorithms. However, in other datasets, only DPC can suggest the correct number of clusters by GPVI, but the other three indexes have errors among partial datasets such as Banknote and Segmentation. Noting that the DBSCAN is very unstable, in some datasets, it can generate very large error such as Satimage and Wine. Consequently, we conclude the GPVI outperforms the other three existing indexes using various clustering algorithms.

Nevertheless, the GPVI we propose is a commonly used external one; therefore, compared to the time required to execute the clustering algorithm once, this index must be executed multiple times to find the best clustering result, resulting in an increase in computational cost by the multiple times. Therefore, in practice, in order to improve time efficiency, the execution frequency of index should be minimized as much as possible through prior knowledge if possible.

5. Conclusions

The correct clustering process results from the correct solution of the number of clusters, but it remains very difficult for evaluating various clustering algorithms and different clusters with different features. In this paper, we propose a new clustering validity index that is independent of clustering algorithms and data distributions. Different from the existing validity index that mainly depends on the direct computation of clustered data points, our proposed validity focuses on the background measurement of each cluster. Along with a fast grid-based partitioning and a fast computation, the new index manifests a stronger generalization and stableness. Extensive experiments validate the proposed index according to accuracy and efficiency. Therefore, this index is unsupervised and outperforms most of the existing indices on some benchmark datasets.

There are two possible opportunities for the future research. Firstly, normalizing the arbitrary shape cluster to a spherical one can be improved. The method used in this paper only is a special example. Secondly, identifying the points in the overlapped area is still a challenge in clustering analysis. The transformation process may misclassify some points in the overlapped area and result in deviation. How to correct the deviation caused by the overlapped area remains one of our research focuses in the future.

Author Contributions

Conceptualization, J.W. and Z.Z.; Data curation, Z.Z.; Resources, J.W.; Software, J.W.; Supervision, S.Y.; Algorithm design, S.Y.; Writing, J.W. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are available in the attachment of the submission.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the existing affiliation information and author name. This change does not affect the scientific content of the article.

References

Zhu, S.; Zhao, Y.; Yue, S. Double-Constraint Fuzzy Clustering Algorithm. Appl. Sci. 2024, 14, 1649. [Google Scholar] [CrossRef]
Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
Ng, M.K. A note on constrained k-means algorithms. Pattern Recognit. 2000, 33, 515–519. [Google Scholar]
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, Portland, Oregon, OR, USA, 2–4 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 226–231. [Google Scholar]
Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1497. [Google Scholar]
Wang, Y.; Yue, S.; Hao, Z.; Ding, M.; Li, J. An unsupervised and robust validity index for clustering analysis. Soft Comput. 2019, 23, 10303–10319. [Google Scholar]
Masud, M.A.; Huang, J.Z.; Wei, C.; Wang, J.; Khan, I.; Zhong, M. I-nice: A new approach for identifying the number of clusters and initial cluster centres. Inf. Sci. 2018, 466, 129–151. [Google Scholar] [CrossRef]
Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 2, 224–227. [Google Scholar] [CrossRef]
Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2001, 63, 411–423. [Google Scholar]
Xie, X.L.; Beni, G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 23, 841–847. [Google Scholar] [CrossRef]
Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256. [Google Scholar]
Rezaei, M.; Fr€anti, P. Set Matching Measures for External Cluster Validity. IEEE Trans. Knowl. Data Eng. 2016, 28, 2173–2180. [Google Scholar] [CrossRef]
Preedasawakul, O.; Wiroonsri, N. A Bayesian cluster validity index. Comput. Stat. Data Anal. 2025, 202, 1734–1740. [Google Scholar] [CrossRef]
Akhanli, S.E.; Hennig, C. Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. Stat. Comput. 2020, 30, 1523–1544. [Google Scholar] [CrossRef]
Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2014, 2, 267–279. [Google Scholar] [CrossRef]
Du, M.; Ding, S.; Xue, Y. A novel density peaks clustering algorithm for mixed data. Pattern Recognit. Lett. 2017, 97, 46–53. [Google Scholar] [CrossRef]
Ding, S.; Du, M.; Sun, T.; Xu, X.; Xue, Y. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl. Based Syst. 2017, 133, 294–313. [Google Scholar] [CrossRef]
Yue, S.; Wang, J.; Bao, X. A new validity index for evaluating the clustering results by partitional clustering algorithms. Soft Comput. 2016, 20, 127–1138. [Google Scholar] [CrossRef]
Yue, S.; Wei, M.; Wang, J.S.; Wang, H. A general grid-clustering approach. Pattern Recognit. Lett. 2008, 29, 1372–1384. [Google Scholar]
Bandaru, S.; Ng, A.H.; Deb, K. Data mining methods for knowledge discovery in multi-objective optimization: Part a—Survey. Expert Syst. Appl. 2017, 70, 139–159. [Google Scholar] [CrossRef]
Kwon, S.H.; Kim, J.H.; Son, S.H. Improved cluster validity index for fuzzy clustering. Electron. Lett. 2021, 57, 792–794. [Google Scholar] [CrossRef]
UCI Dataset. Available online: http://archive.ics.uci.edu/ml/datasets.php (accessed on 17 March 2025).
Ma, E.W.; Chow, T.W. A new shifting grid clustering algorithm. Pattern Recognit. 2004, 37, 503–514. [Google Scholar]

Figure 1. A dataset with four clusters is partitioned to grids and intersecting grids. (a) A dataset with four clusters. (b) A set of grids in the 2D data space.

Figure 2. Different-shaped clusters are normalized to spherical clusters. (a) Two arbitrary-shape clusters. (b) Two spherical clusters by normalization operation.

Figure 3. A dataset with four clusters is partitioned to 2~10 clusters, and the curve of z(c) as c increases. (a) Two partitioned clusters. (b) Four partitioned clusters. (c) Six partitioned clusters. (d) Eight partitioned clusters. (e) Ten partitioned clusters. (f) Curve of z(c).

Figure 4. Four groups of artificial datasets. (a) Three clusters in Set 1. (b) Six clusters in Set 2. (c) Six clusters in Set 3. (d) Twelve clusters in Set 4.

Figure 5. Evaluation results of GPVI obtained using DPC, CM, and DBSCAN. (a) GPVI curve in Set 1 with noisy points (b) GPVI curve in Set 2 with noisy points. (c) GPVI curve in Set 3 (d) GPVI curve in Set 4.

Table 1. Evaluation results of three indexes for the four artificial datasets.

	DB			GS			DC			GPVI
Datasets	CM	DBSCAN	DPC	CM	DBSCAN	DPC	CM	DBSCAN	DPC	CM	DBSCAN	DPC	NC
Set 1	2×	2×	3√	5×	2×	2×	2×	5×	2×	2×	3√	3√	3
Set 2	3×	5×	6√	3×	4×	6√	6√	5×	8×	6√	6√	6√	6
Set 3	3×	4×	5×	4×	7×	5×	6√	6√	6√	6√	6√	6√	6
Set 4	2×	4×	12√	10×	18×	12√	10×	10×	12√	12√	12√	12√	12

Note: NC refers to the number of clusters. The number marked by “√” and “×” indicate that the result evaluated by the corresponding index is correct and incorrect, respectively. The possible range of cluster number is [2,20].

Table 2. Characteristics of eight real-world datasets from UCI.

Datasets	Number of Clusters	Dimension	Number of Points	Number of Each Cluster
Banknote	2	4	1372	762/610
Cancer	2	9	683	444/239
Iris	3	4	150	50/50/50
Ecoli	8	7	336	143/77/52/35/20/5/2/2
Satimage	6	36	2000	1533/1508/1358/707/703/626
Seeds	3	7	210	70/70/70
Wine	3	13	178	71/59/48
Wholesale	2	7	440	298/142

Table 3. Data distributions of eight datasets before and after normalization operation.

Banknote	Cancer	Iris	Ecoli	Satimage	Seeds	Wine	Wholesale

Table 4. Evaluation results of three indices for eight UCI datasets.

	DB			GS			DC			GPVI
Datasets	CM	DBSCAN	DPC	CM	DBSCAN	DPC	CM	DBSCAN	DPC	CM	DBSCAN	DPC	NC
Banknote	2√	3×	2√	2√	6×	4×	2√	4×	2√	2√	3×	2√	2
Cancer	2√	2√	2√	2√	4×	2√	2√	3×	2√	2√	2√	2√	2
Iris	2×	2×	3√	2×	2×	3√	2×	2×	3√	3√	3√	3√	3
Ecoli	5×	6×	5×	4×	4×	5×	6×	4×	8√	5×	6×	5×	8
Satimage	4×	4×	5×	8×	7×	6√	5×	4×	3×	6√	6√	6√	6
Seeds	2×	2×	2×	2×	2×	3√	5×	5×	3√	3√	3√	3√	3
Wine	3√	2×	2×	3√	5×	3√	2×	2×	3√	3√	3√	3√	3
Wholesale	2√	3×	3×	3×	3×	2√	2√	4×	2√	2√	2√	2√	2

Note: NC refers to the real number of clusters. The number marked by “√” and “×” indicates that the result evaluated by the corresponding index is correct and incorrect, respectively. The possible range of cluster number is outlined in [2,15].

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, J.; Zhang, Z.; Yue, S. A Validity Index for Clustering Evaluation by Grid Structures. Mathematics 2025, 13, 1017. https://doi.org/10.3390/math13061017

AMA Style

Wang J, Zhang Z, Yue S. A Validity Index for Clustering Evaluation by Grid Structures. Mathematics. 2025; 13(6):1017. https://doi.org/10.3390/math13061017

Chicago/Turabian Style

Wang, Jiachen, Zuojing Zhang, and Shihong Yue. 2025. "A Validity Index for Clustering Evaluation by Grid Structures" Mathematics 13, no. 6: 1017. https://doi.org/10.3390/math13061017

APA Style

Wang, J., Zhang, Z., & Yue, S. (2025). A Validity Index for Clustering Evaluation by Grid Structures. Mathematics, 13(6), 1017. https://doi.org/10.3390/math13061017

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Validity Index for Clustering Evaluation by Grid Structures

Abstract

1. Introduction

2. Related Work

2.1. Clustering Algorithm

2.2. Typical Validity Index

3. Clustering Evaluation Based on Grid Structure

3.1. Grid Partition and Clustering Center

3.2. Cluster Normalization and Validity Index

4. Experiment

4.1. Test on Four Artificial Datasets

4.2. Test on the UCI Dataset

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Correction Statement

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI