A Robust Adaptive Clustering Validity Index for Overlapping Data

Yan, Bin; Zhao, Juan

doi:10.3390/axioms15050366

by Bin Yan ¹ and Juan Zhao ²,*

¹ School of Mathematics and Statistics, Hunan First Normal University, Changsha 410205, China

² School of Elementary Education, Hunan First Normal University, Changsha 410205, China

* Author to whom correspondence should be addressed.

Axioms 2026, 15(5), 366; https://doi.org/10.3390/axioms15050366

Submission received: 13 March 2026 / Revised: 18 April 2026 / Accepted: 30 April 2026 / Published: 14 May 2026

Abstract

Cluster Validity Indices (CVIs) act as a pivotal tool in machine learning for assisting in the determination of the optimal number of clusters. Nevertheless, traditional CVIs often exhibit subpar performance when confronted with the complex characteristics prevalent in real-world data, such as inter-cluster overlap, outliers and uneven density distribution. To address this challenge, this paper proposes a multiplicative, adaptive and robust Cluster Validity Index, designated as the Robust Adaptive (RA) index. This index takes the kernel density function of sample points as the fundamental tool and reconstructs its two core components: in the measurement of intra-cluster compactness, the concept of density quantiles is incorporated, which markedly enhances its robustness against outliers; in the measurement of inter-cluster separability, a density-based Jeffrey divergence method is developed to effectively characterize inter-cluster differences in overlapping datasets. To mitigate the impact of bandwidth selection on kernel density estimation, this study adopts strategies including Scott’s and Silverman’s heuristic algorithms, thus enabling adaptive learning of the inherent distribution characteristics of data. For experimental validation, a comprehensive set of experiments was conducted on both synthetic and real-world datasets. The results show that, in comparison with the classical indices (CH, DB, SIL, I) that demonstrate prominent performance on overlapping datasets, the RA index delivers superior performance in scenarios involving mild to moderate overlap, uneven density distribution and the presence of outliers. Among nine synthetic datasets, the RA index correctly identified the optimal number of clusters in eight cases, achieving a high success rate of 88.89% and outperforming all the comparative indices. 
On eight real-world datasets with diverse scales, dimensionalities and inherent structural features, the RA index was also verified to be the most robust and effective metric among the five participating indices for comparison. Meanwhile, its failure on complex datasets such as S-set4 and Iris, which contain both severe inter-cluster overlap and outliers, also indicates that density-based CVIs have inherent limitations when faced with data structures characterized by high overlap and faint cluster boundaries. This finding points to a clear direction for future research: constructing novel CVIs from the perspective of sparse matrices may serve as a feasible breakthrough path to address such limitations.

Keywords:

kernel density estimation; Jeffrey divergence; robustness; overlapping clustering

MSC:

62H30; 90C59

1. Introduction

Clustering Validity Indices (CVIs) serve as the primary gauge for assessing the alignment between algorithmic outputs and the underlying data structure. Consequently, the robustness of a CVI dictates the reliability of knowledge discovery in unsupervised learning tasks [1]. However, in complex data environments, the limitations of existing CVIs have emerged as a significant bottleneck, hindering the broader deployment of clustering analysis [2,3].

In practical applications, real-world datasets rarely exhibit an ideal well-separated structure; instead, overlapping and anomalous characteristics are prevalent [4,5,6]. Overlapping datasets refer to those where some samples simultaneously possess features of multiple clusters, with no clear boundaries between clusters. Typical examples include user interest tag datasets (a single user may be classified as both a “sports enthusiast” and a “music lover”) and medical image datasets (pixels corresponding to lesion areas and normal regions overlap with each other). Datasets containing outliers are defined as those with a small number of samples that deviate from the overall data distribution [7,8]. Such samples may arise from factors like measurement errors, noise interference, and anomalous events, as illustrated by fraudulent data in financial transactions and abnormal sensor data in environmental monitoring.

Traditional CVIs, such as the Calinski-Harabasz (CH) index [9] and the Davies-Bouldin (DB) index [10], are mostly based on two fundamental dimensions—intra-cluster compactness and inter-cluster separation—when evaluating the quality of clustering results. However, these conventional indices have inherent limitations in their design and calculation; their effectiveness is severely compromised, especially when dealing with modern complex datasets (e.g., overlapping datasets, datasets with outliers, or those that exhibit both characteristics) [11,12,13,14].

These CVIs perform satisfactorily on well-separated datasets. On overlapping datasets and datasets with outliers, however, their evaluation accuracy and stability drop sharply, and misjudgments may occur [15]. Traditional CVIs struggle to quantify the ambiguous memberships in overlapping regions, so overlapping samples are easily misidentified as outliers or noise [16]. Meanwhile, outliers skew the estimates of intra-cluster compactness and inter-cluster separation, masking the true clustering structure [7]. The problem is compounded when overlapping samples and outliers coexist: the two sources of interference interact, and standard indices cannot disentangle the resulting distortion. Lacking the discriminative power to distinguish boundary samples from noise, traditional CVIs tend to mislabel one as the other, so their evaluations are biased and unstable and cannot provide a reliable basis for optimizing clustering algorithms.

Developing robust and adaptive CVIs that overcome these limitations and improve the accuracy of clustering evaluation on overlapping and outlier-contaminated datasets is therefore the fundamental motivation of this study.

2. Related Work

CVIs are a vital tool for evaluating the quality of outputs generated by clustering algorithms, and their design directly determines how reliably the number of clusters and the clustering pattern can be selected. Confronted with two prevalent challenges in practical applications, namely data overlap and outlier interference, the evolution of validity indices reflects researchers’ continuous exploration of the inherent complexity of real-world data.

Evaluating the clustering validity of overlapping datasets has long been a challenging issue in academia. Early studies have indicated that traditional indices such as the SIL (Silhouette Index) [17] and COP (Clustering Optimization Proximity) [18] exhibit a notable performance decline when handling overlapping datasets. The SIL index was proposed by Rousseeuw in 1987; while it is applicable to various types of datasets, it encounters inherent difficulties in processing overlapping ones, as the ideal assumption of “intra-cluster compactness and inter-cluster separation” underlying it no longer holds in the overlapping regions of data [17]. The COP index delivers favorable performance on convex datasets and partially overlapping datasets, yet its performance deteriorates drastically with the increase in the overlap degree of datasets [18]. To address this inherent drawback, researchers have developed a variety of innovative approaches. In the field of fuzzy clustering, Kim et al. [19] proposed a fuzzy clustering validity index based on inter-cluster proximity in 2003. This index assesses clustering quality by quantifying the degree of inter-cluster overlap, thereby offering a novel perspective for handling overlapping data. From the perspective of data structure, some researchers have re-examined the computational approaches for compactness and separability. For instance, addressing the issue of unstable cluster centers in overlapping datasets, Wu et al. [20] introduced the median distance of cluster centers as a penalty term, which serves to mitigate the impact of cluster centers on separability. Said et al. [21] also pointed out that the distance between cluster centers could not always reflect the partition quality, and then introduced the Jeffrey divergence. 
The improved methods proposed by the above two groups of scholars have addressed the instability of cluster center distances in traditional clustering when measuring compactness and separability to a certain extent, yet such methods are only applicable to clustering for Gaussian-distributed datasets. In addition, their time cost rises remarkably as the number of samples increases [22].

The interference of outliers with CVIs manifests mainly in the calculation of cluster centers. Improper outlier handling causes a significant deviation between the derived cluster centers and the actual ones, so that clusters are “distorted” by outliers. To tackle this problem, relevant studies have developed three primary technical approaches. The first focuses on improvements based on noise-processing algorithms. Tang et al. [23] proposed a fuzzy cardinality measure based on maximum membership strength and integrated it with the weighted intra-cluster sum of squared errors to construct a compactness measure with enhanced robustness to outliers. Such methods mitigate the perturbation of outliers on the clustering process by dynamically adjusting the initial cluster centers. The second centers on developing robust statistical metrics to replace traditional centroids. Bezdek and Pal [24] introduced the Generalized Dunn Index, which reduces sensitivity to outliers by optimizing the way inter-cluster distance is measured.

While traditional CVIs have laid a solid foundation, the landscape of clustering is rapidly evolving towards multi-view and multimodal learning. Recent studies, such as Enhanced latent multi-view subspace clustering and Trusted Multi-view Learning for Long-tailed Classification, have demonstrated significant progress in handling complex data correlations and noise. These methods often rely on sophisticated mechanisms, such as hierarchical opinion aggregation, to ensure robustness against outliers and data imbalance.

However, the evaluation of such complex structures remains a challenge. Most existing internal indices are ill-equipped to assess the non-convex and heterogeneous clusters often produced by multi-view algorithms. This highlights the necessity of the proposed RA index. By shifting the paradigm from ‘distance-to-centroid’ to ‘density-distribution’, the RA index provides a robust validation tool that aligns with the goals of these advanced frameworks: to uncover reliable structures amidst noise and overlap, regardless of the data’s view or modality.

Data in practical applications often exhibit both overlap and outliers, and such mixed datasets pose more stringent challenges to the evaluation of clustering validity. Although some scholars [22,25] have conducted preliminary investigations into this issue from the perspective of density, their indices are geared more toward the robust handling of outliers and pay less attention to evaluating the degree of overlap. Their density-based perspective nevertheless lays a mathematical foundation for the framework constructed in this paper, which is tailored to overlapping and outlier-contaminated datasets. In addition, the I index proposed in [21], which is based on the Jeffrey divergence, performs well in evaluating the separability of overlapping data; yet it relies heavily on the sample covariance matrix when calculating inter-cluster separability, which restricts its application in high-dimensional, and especially ultra-high-dimensional, datasets. Drawing on the favorable properties of density in the compactness component of CVIs and the strong performance of the Jeffrey divergence in the separability component, we introduce the kernel density estimation function. We use this function to modify the Jeffrey divergence-based separability measure proposed in [21], making it applicable to high-dimensional scenarios, and on this basis propose a fully density-based adaptive and robust CVI.

3. The Design Rationale of the RA Index

The quality of a cluster should not be determined by aggregation around a single central point (e.g., a cluster centroid), but rather by whether a coherent high-density region is formed within it. Accordingly, developing a centroid-free and purely density-based CVI constitutes our core objective.

Inspired by density peak clustering [26,27], we can incorporate Kernel Density Estimation (KDE) into the construction of the RA Index. This function offers two key advantages: first, the bandwidth parameter of KDE can be locally adjusted, enabling accurate density estimations across regions with varying densities and thus allowing adaptive handling of datasets with uneven density distributions; second, instead of setting a global density threshold, separation is evaluated based on the relative divergence of density distributions. This design enables the RA Index to identify not only tightly compact clusters, but also relatively sparse yet structurally well-defined ones, thereby endowing it with stronger universality and accuracy when addressing complex real-world datasets.

3.1. Design of Robust Intra-Cluster Compactness

For an ideal cluster with high intra-cluster compactness, all data points within it should exhibit highly similar local density values, forming a narrow and concentrated density distribution. This implies that the density variation inside the cluster is smooth, with no obvious substructures or voids. Conversely, if a cluster has a wide density distribution or shows a multimodal pattern, it indicates a complex internal structure—there may be multiple sub-clusters or ambiguous cluster boundaries—and thus its compactness is poor.

To construct a robust metric for intra-cluster compactness, the RA index first employs Kernel Density Estimation (KDE) to model the density distribution within each cluster. KDE is a non-parametric probability density function estimation method that does not require any prior assumptions about the data distribution, making it well-suited for handling non-Gaussian and multimodal datasets. For a given cluster $C_{i}$ containing $N_{i}$ data points, denoted as $\{x_{1}, x_{2}, \dots, x_{N_{i}}\}$, the density estimate $\hat{p}_{H}(x_{i})$ at an arbitrary point $x_{i}$ can be computed using Equation (1):

$$\hat{p}_{H}(x_{i}) = \frac{1}{N_{i}} \sum_{j=1}^{N_{i}} K_{H}(x_{i} - x_{j}) \tag{1}$$

where $K_{H}(u) = |H|^{-1/2} K(H^{-1/2} u)$ denotes the kernel function, and $H$ represents the bandwidth matrix, which governs the level of smoothing applied. Typically, a Gaussian kernel function can be adopted, expressed as $K(u) = (2\pi)^{-d/2} \exp(-\frac{1}{2} u^{T} u)$, where $d$ stands for the dimensionality of the dataset. Through KDE, the discrete data points within a cluster can be transformed into a continuous, smooth density field. This density field intuitively characterizes the aggregation pattern of intra-cluster data (high-density regions correspond to dense groups of data points, and low-density regions to sparse areas or cluster boundaries), and its inherent smoothing property naturally suppresses the interference from isolated noise points. In contrast to simple distance-based metrics, KDE delivers a more comprehensive and robust characterization of intra-cluster structures, thereby laying a solid foundation for the subsequent calculation of intra-cluster compactness.

After constructing a density distribution model for each cluster via KDE, the RA index needs to quantify the consistency or stability of this distribution, which serves as the metric for intra-cluster compactness. For an ideal cluster, its internal density distribution should be relatively homogeneous and consistent, free from abrupt fluctuations or voids. Multiple strategies can be adopted to measure such consistency. One intuitive approach is to calculate the statistical dispersion of local density values across all points within the cluster. However, to avoid the use of variance—a statistic that is highly sensitive to outliers—we can employ more robust statistical measures, such as the Median Absolute Deviation (MAD) or Interquartile Range (IQR) of density values for all intra-cluster points. A smaller MAD or IQR value indicates that the density values of most points in the cluster are concentrated around the median, signifying a more consistent density distribution and thus a higher level of intra-cluster compactness.

To further enhance the robustness of the RA index against outliers, we explicitly recommend employing quantiles instead of the conventional mean and variance when calculating the specific metrics of intra-cluster compactness. For instance, when evaluating the degree of “compactness” among intra-cluster points, we can abandon the calculation of the variance of distances from all points to a certain “centroid”, and instead compute specific quantiles of the local density values of all points within the cluster, such as the 10th percentile ($P_{10}$) or the 25th percentile ($P_{25}$). A high $P_{25}$ density value indicates that at least 75% of the points in the cluster are located in a relatively high-density region, which directly reflects the compactness of the main body of the cluster. This approach inherently filters out those isolated noise points with extremely low density, since they only affect the lowest few percentiles without significantly lowering the values of $P_{25}$ or higher quantiles.

Similarly, when measuring the dispersion of the density distribution, the interdecile range ($P_{90} - P_{10}$) can be used in place of variance. This range covers only the middle 80% of the data and is therefore largely insensitive to extreme values (outliers) at both ends. In this way, the intra-cluster compactness metric of the RA index achieves dual robustness: KDE smooths the local density estimation, while the use of quantiles limits the influence of extreme values on the final aggregated metrics.
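The contrast between variance and the 10th-to-90th percentile range can be illustrated with a small sketch. The density values below are hypothetical numbers chosen for illustration, and the linear-interpolation `percentile` helper is an assumption, since the text does not specify a quantile convention.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_spread(values):
    """Range between the 10th and 90th percentiles (middle 80% of the values)."""
    return percentile(values, 90) - percentile(values, 10)


# Hypothetical local-density values of one cluster (illustrative only).
clean = [0.50 + 0.01 * i for i in range(20)]
# The same cluster plus one isolated noise point with near-zero density.
contaminated = clean + [0.001]

var_clean = statistics.pvariance(clean)
var_cont = statistics.pvariance(contaminated)
idr_clean = percentile_spread(clean)
idr_cont = percentile_spread(contaminated)

print(f"variance:          {var_clean:.5f} -> {var_cont:.5f}")
print(f"percentile spread: {idr_clean:.5f} -> {idr_cont:.5f}")
```

In this toy case the single low-density point inflates the variance several times over, while the percentile spread changes only slightly.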

3.2. Design of Robust Inter-Cluster Separation

Inter-cluster separation is a metric that quantifies the dissimilarity between different clusters. In traditional methods, separation is typically defined as the distance between cluster centroids. However, this definition can be misleading when clusters exhibit irregular shapes or overlap with each other. From a density-based perspective, the core tenet for well-separated clusters is as follows: if two clusters are effectively separated, there should exist a distinct “density gap” in the region between them, i.e., a low-density transition zone. Reflected in their density distributions, this means that the overlapping portion of the two distributions should be extremely small. Therefore, we can measure the inter-cluster separation by quantifying the “dissimilarity” or “distance” between the density distributions of the two clusters.

To quantify the dissimilarity between the density distributions of two clusters more precisely, the RA index incorporates the concept of divergence from information theory, such as the Kullback-Leibler (KL) divergence or the Jeffrey divergence. The KL divergence measures the “information loss” between two probability distributions, yet it is not symmetric. As a symmetric variant of the KL divergence, the Jeffrey divergence is more suitable for use as a dissimilarity measure. For two clusters $C_{i}$ and $C_{j}$ with respective density distributions $\hat{p}_{i}$ and $\hat{p}_{j}$, the Jeffrey divergence $JD(C_{i}, C_{j})$ between them is defined as Equation (2):

$$JD(C_{i}, C_{j}) = \frac{1}{2} \int \left(\hat{p}_{i}(x) - \hat{p}_{j}(x)\right) \log\left(\frac{\hat{p}_{i}(x)}{\hat{p}_{j}(x)}\right) dx. \tag{2}$$

A larger value of this metric indicates a greater dissimilarity between the two density distributions, i.e., a higher level of inter-cluster separation. The advantage of using divergence measures from information theory lies in the fact that they provide a global metric based on the shape of the entire distribution, rather than merely relying on the distances between a few key points. This enables them to capture subtle differences in distribution patterns and renders them insensitive to outliers.
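As a brief check, added here for illustration and not part of the original derivation, expanding the integrand of Equation (2) shows that it equals the symmetrized KL divergence. The closed form in the last line assumes two univariate Gaussian densities with a common variance, purely as an example.

```latex
\begin{align*}
JD(C_i, C_j)
  &= \frac{1}{2}\int \hat{p}_i(x)\,\log\frac{\hat{p}_i(x)}{\hat{p}_j(x)}\,dx
   + \frac{1}{2}\int \hat{p}_j(x)\,\log\frac{\hat{p}_j(x)}{\hat{p}_i(x)}\,dx \\
  &= \frac{1}{2}\left[ D_{KL}(\hat{p}_i \,\|\, \hat{p}_j) + D_{KL}(\hat{p}_j \,\|\, \hat{p}_i) \right], \\[4pt]
\text{and if } \hat{p}_i = \mathcal{N}(\mu_i, \sigma^2),\ \hat{p}_j = \mathcal{N}(\mu_j, \sigma^2):\quad
JD(C_i, C_j) &= \frac{(\mu_i - \mu_j)^2}{2\sigma^2}.
\end{align*}
```

Under that assumption, two clusters whose means lie two standard deviations apart give $JD = 2$, and the value grows quadratically with the gap between the means.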

It is important to note the asymmetric nature of our robust design. For intra-cluster compactness, we adopt a quantile-based approach to resist the influence of low-density fringe points within the cluster. Conversely, for inter-cluster separability, we adopt the Jeffrey divergence to capture the global discrepancy between distributions, ensuring that even minor overlaps at the tails (boundaries) are penalized. This combination allows the RA index to tolerate internal noise while remaining sensitive to inter-cluster interference.

4. Robust Adaptive Cluster Validity Index: RA

To provide a rigorous mathematical definition of the RA index, we first establish a clear and consistent symbol system. Let dataset $X$ consist of $N$ data points, denoted as $X = \{x_{1}, x_{2}, \dots, x_{N}\}$, where each data point $x_{i}$ is a $d$-dimensional vector. Suppose that dataset $X$ is partitioned into $K$ clusters via a certain clustering algorithm (e.g., Gaussian mixture clustering, spectral clustering, etc.), represented as $C = \{C_{1}, C_{2}, \dots, C_{K}\}$. Each cluster $C_{k}$ is a subset of dataset $X$, satisfying the conditions $C_{k} \cap C_{l} = \emptyset$ ($\forall k \neq l$) and $\bigcup_{k=1}^{K} C_{k} = X$. We use $n_{k} = |C_{k}|$ to denote the number of data points contained in cluster $C_{k}$. This notation constitutes a standard paradigm in cluster analysis, laying the foundation for our subsequent definition of density-based metrics. It is worth noting that the design of the RA index allows the number of clusters $K$ to be variable; thus, the index can be used to evaluate clustering results under different values of $K$, thereby facilitating the determination of the optimal number of clusters.

4.1. Local Density Estimation Based on KDE

The core of the RA index lies in density-based analysis; hence, we need to define the local density for each data point. We employ Kernel Density Estimation (KDE) as the primary method for density estimation.

Definition 1.

For an arbitrary point $x_{i}$ in the dataset, its local density $\rho(x_{i})$ is defined as:

$$\rho(x_{i}) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{h} K\left(\frac{x_{i} - x_{j}}{h}\right). \tag{3}$$

In Definition 1, $K(\cdot)$ denotes the kernel function. The standard Gaussian kernel is typically adopted, $K(u) = \phi(u)$, where $\phi$ represents the standard normal probability density function, in which case the equation simplifies to:

$$\rho(x_{i}) = \frac{1}{N h \sqrt{2\pi}} \sum_{j=1}^{N} \exp\left(-\frac{1}{2} \left\|\frac{x_{i} - x_{j}}{h}\right\|^{2}\right). \tag{4}$$

Here $h$ represents the bandwidth parameter (a scalar in the univariate case), governing the level of smoothing.

The selection of $h$ is of crucial importance. A commonly used heuristic approach involves Scott’s rule or Silverman’s rule. For each cluster $C_{k}$, we can calculate the density values of all points within it, thereby forming a density set denoted as $P_{k} = \{\rho(x_{i}) \mid x_{i} \in C_{k}\}$. This density set $P_{k}$ lays the groundwork for the subsequent calculation of intra-cluster compactness. Through KDE, we transform discrete, distance-based data points into continuous, density-based distributions, and this transformation highlights the fundamental distinction between the RA index and traditional CVIs.
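The local density of Equation (4) and the per-cluster density sets can be sketched as follows. This is a minimal univariate illustration and not the authors' implementation: the bandwidth uses the normal-reference rule $h = 1.06\,\sigma\,n^{-1/5}$, one common variant of the heuristics named above, and the data, labels and function names are hypothetical.

```python
import math
import statistics


def normal_reference_bandwidth(values):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n**(-1/5) for 1-D data (assumed variant)."""
    n = len(values)
    sigma = statistics.stdev(values)
    return 1.06 * sigma * n ** (-0.2)


def local_density(values, h):
    """Gaussian-kernel density at every sample point, following Equation (4) in 1-D."""
    n = len(values)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((xi - xj) / h) ** 2) for xj in values)
        for xi in values
    ]


# Hypothetical 1-D data with two groups; the last point is isolated.
data = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

h = normal_reference_bandwidth(data)
rho = local_density(data, h)

# Density set P_k: the local densities of the points assigned to cluster k.
density_sets = {}
for value, label in zip(rho, labels):
    density_sets.setdefault(label, []).append(value)

print(f"h = {h:.3f}")
for label, densities in density_sets.items():
    print(label, [round(v, 4) for v in densities])
```

The isolated point receives a markedly lower density than the points in the dense group, which is the property the quantile-based compactness relies on.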

4.2. Intra-Cluster Compactness Index Based on Density Distribution

Based on Definition 1, we can construct a robust intra-cluster compactness index $\mathrm{Com}_{k}$ for each cluster $C_{k}$. As mentioned earlier, we do not intend to use the variance, which is sensitive to outliers. Therefore, we adopt a quantile-based approach to quantify the dispersion of the density distribution.

Definition 2.

The intra-cluster compactness of cluster $C_{k}$ is defined as:

$$\mathrm{Com}_{k} = \frac{1}{Q_{90}(P_{k}) - Q_{10}(P_{k}) + \varepsilon}. \tag{5}$$

In Definition 2, $Q_{90}(P_{k})$ and $Q_{10}(P_{k})$ denote the 90th and 10th percentiles of the density values of all data points within cluster $C_{k}$, respectively. A smaller value of this difference (i.e., the range covering the middle 80% of the density values) indicates a higher level of concentration and consistency in the intra-cluster density, and thus corresponds to a larger compactness $\mathrm{Com}_{k}$ for cluster $C_{k}$. $\varepsilon$ is a tiny constant (e.g., $10^{-6}$) employed to avoid a zero denominator. From Figure 1 we can see that this compactness measurement is robust in that it only focuses on the density fluctuations of the middle 80% of data points in the cluster, automatically filtering out the extreme density values at the upper and lower tails, which are likely induced by noise or outliers. The overall intra-cluster compactness $\mathrm{Com}(K)$ for the entire clustering result can be defined as the weighted average of the compactness indices of all clusters, with the weight assigned as the cluster size ratio $n_{k} / N$.

Definition 3.

The total compactness $\mathrm{Com}(K)$ of a partition into $K$ clusters is defined as:

$$\mathrm{Com}(K) = \sum_{k=1}^{K} \frac{n_{k}}{N} \mathrm{Com}_{k}. \tag{6}$$
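Equations (5) and (6) can be sketched directly from per-cluster density sets. The density values below are hypothetical, and the linear-interpolation percentile is an assumed convention, since the definitions do not fix one.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cluster_compactness(densities, eps=1e-6):
    """Com_k of Equation (5): reciprocal of the 10th-to-90th percentile spread."""
    spread = percentile(densities, 90) - percentile(densities, 10)
    return 1.0 / (spread + eps)


def total_compactness(density_sets, eps=1e-6):
    """Com(K) of Equation (6): average of Com_k weighted by cluster size n_k / N."""
    total = sum(len(d) for d in density_sets)
    return sum(len(d) / total * cluster_compactness(d, eps) for d in density_sets)


# Hypothetical density sets: one homogeneous cluster, one heterogeneous cluster.
tight = [0.40, 0.41, 0.42, 0.41, 0.40, 0.42, 0.41, 0.40, 0.42, 0.41]
loose = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.10, 0.45, 0.70, 0.30]

com_tight = cluster_compactness(tight)
com_loose = cluster_compactness(loose)
com_total = total_compactness([tight, loose])

print(f"Com(tight) = {com_tight:.3f}, Com(loose) = {com_loose:.3f}, Com(K) = {com_total:.3f}")
```

Because the index is a reciprocal of the spread, a cluster with nearly uniform density can dominate the weighted average, which is worth keeping in mind when interpreting $\mathrm{Com}(K)$.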

4.3. Inter-Cluster Separation Index Based on Information Divergence

For inter-cluster separability, we adopt an information-theoretic divergence-based approach. To begin with, we construct a probability density function $p_{k}(x)$ for each cluster $C_{k}$, which can be estimated from the cluster's own data points in accordance with Equation (3). Subsequently, we define the separability $\mathrm{Sep}_{kl}$ between cluster $C_{k}$ and cluster $C_{l}$. To circumvent the asymmetry of the Kullback-Leibler (KL) divergence, we employ the symmetric Jeffrey divergence:

$$\mathrm{Sep}_{kl} = 0.5 \left(D_{KL}(p_{k} \,\|\, p_{l}) + D_{KL}(p_{l} \,\|\, p_{k})\right), \tag{7}$$

where $D_{KL}(p_{k} \,\|\, p_{l}) = \int p_{k}(x) \log\left(\frac{p_{k}(x)}{p_{l}(x)}\right) dx$. In practical computation, this integral can be approximated by a discretized summation over the data points. A larger value of $\mathrm{Sep}_{kl}$ indicates a greater discrepancy in the density distributions between the two clusters, or equivalently, a higher level of separability.
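One possible discretization of Equation (7) is sketched below. It is an assumption, since the text does not state which summation is used: each KL term is replaced by the average log-density ratio over the points of the first cluster. The data are one-dimensional, both clusters share one bandwidth `h`, and the density floor is a hypothetical safeguard against taking the logarithm of zero.

```python
import math


def kde(points, h):
    """Return a function x -> Gaussian kernel density estimate built from `points` (1-D)."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points)

    return density


def jeffrey_separation(cluster_k, cluster_l, h, floor=1e-300):
    """Sample-average approximation of Sep_kl in Equation (7)."""
    p_k, p_l = kde(cluster_k, h), kde(cluster_l, h)

    def kl(sample, p, q):
        # D_KL(p || q) approximated by the mean of log(p(x) / q(x)) over x in `sample`.
        ratios = (math.log(max(p(x), floor) / max(q(x), floor)) for x in sample)
        return sum(ratios) / len(sample)

    return 0.5 * (kl(cluster_k, p_k, p_l) + kl(cluster_l, p_l, p_k))


# Hypothetical clusters: `near` overlaps `base`, `far` is well separated from it.
base = [0.0, 0.2, 0.4, 0.6, 0.8]
near = [0.5, 0.7, 0.9, 1.1, 1.3]
far = [5.0, 5.2, 5.4, 5.6, 5.8]

sep_near = jeffrey_separation(base, near, h=0.3)
sep_far = jeffrey_separation(base, far, h=0.3)

print(f"Sep(base, near) = {sep_near:.3f}")
print(f"Sep(base, far)  = {sep_far:.3f}")
```

The overlapping pair yields a much smaller value than the well-separated pair, and identical clusters give exactly zero. Other discretizations, such as evaluating the integrand on a grid, would also be consistent with the description.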

For any given cluster partition that yields $K$ clusters, there exist $K(K-1)/2$ pairwise separability values between these clusters. The overall inter-cluster separation performance is dominated by the smallest separability value across all cluster pairs. Thus, the global separability of the $K$ clusters can be defined as the minimum value of $\mathrm{Sep}_{kl}$ for all $k \neq l$.

Definition 4.

The total separation $\mathrm{Sep}(K)$ of a partition into $K$ clusters is defined as:

$$\mathrm{Sep}(K) = \min_{k, l \in \{1, 2, \dots, K\},\ k \neq l} \mathrm{Sep}_{kl}. \tag{8}$$

4.4. RA Index

Commonly used construction forms of Cluster Validity Indices (CVIs) fall into two categories: quotient models and product models. As indicated in Definitions 3 and 4, both the intra-cluster compactness and the inter-cluster separation defined in this paper are such that a larger value corresponds to better performance. Therefore, the product model is adopted herein for constructing the CVI.

Definition 5.

The RA index is defined as:

$$\mathrm{RA}(K) = \mathrm{Com}(K) \cdot \mathrm{Sep}(K). \tag{9}$$

Given that both the intra-cluster compactness $\mathrm{Com}(K)$ and the inter-cluster separation $\mathrm{Sep}(K)$ improve as their values increase, the cluster number $K$ that maximizes $\mathrm{RA}(K)$ is regarded as the optimal cluster number for the dataset.
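The selection rule can be sketched as follows, assuming the compactness and pairwise separation values have already been computed for each candidate $K$. All numbers and function names are hypothetical and serve only to show how Equations (8) and (9) combine.

```python
from itertools import combinations


def total_separation(pairwise_sep, n_clusters):
    """Sep(K) of Equation (8): the smallest pairwise separation over all cluster pairs."""
    return min(pairwise_sep[(k, l)] for k, l in combinations(range(n_clusters), 2))


def ra_index(com, sep):
    """RA(K) of Equation (9): product of total compactness and total separation."""
    return com * sep


def best_k(scores):
    """Return the candidate K with the largest RA value."""
    return max(scores, key=scores.get)


# Hypothetical pairwise separations for a 3-cluster partition.
pairwise = {(0, 1): 4.2, (0, 2): 1.3, (1, 2): 6.0}
sep_3 = total_separation(pairwise, 3)

# Hypothetical (Com(K), Sep(K)) pairs for each candidate K.
candidates = {2: (10.0, 0.8), 3: (12.0, sep_3), 4: (15.0, 0.4)}
scores = {k: ra_index(com, sep) for k, (com, sep) in candidates.items()}
chosen = best_k(scores)

print(scores)
print("selected K:", chosen)
```

In this toy example $K = 4$ has the highest compactness but the lowest separation, so the product favors $K = 3$.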

5. Experimental Studies

To evaluate the performance of the newly proposed index on datasets with overlap and outliers, we conduct tests on it using synthetic datasets and real-world datasets, respectively.

The clustering method adopted in the experiments is the Parametric Finite Gaussian Mixture Model, which exhibits promising performance in the clustering of overlapping data [28,29]. Using this clustering algorithm, we can partition the datasets into $k$ clusters, where the cluster number $k \in \{2, 3, \dots, K\}$. In this experiment, $K$ is set to 15 if the true number of clusters of a dataset is less than or equal to 7, and to 20 when the true number is greater than 7 and less than 10. It should be noted that due to the constraints of visualization effectiveness, we did not select datasets with a true cluster number greater than 20 in the experiments. Nevertheless, experiments beyond the scope of this paper demonstrate that our method is not restricted by the true number of clusters.

The objective of the experiments is to evaluate the performance of the newly proposed index in determining the optimal number of clusters across various data scenarios. For this reason, we select four indices (CH, DB, SIL, and I) as the comparison indices; these were mentioned earlier and have been shown to perform well on overlapping and outlier-contaminated datasets. These four indices were already compared with traditional classic indices when they were proposed, and the comparison results demonstrated that their performance is superior to that of classic indices (e.g., PC, PE, XB, etc.) in most scenarios. Thus, classic indices are not incorporated into the comparison experiments in this study. Table 1 summarizes the information about the four comparison indices.

5.1. Artificial Datasets

To enable an intuitive visual demonstration of CVI performance, all synthetic datasets employed consist of 2-D observational data. For synthetic datasets, experiments are designed to verify the effectiveness of the newly proposed index from two aspects: (a) datasets with overlap, including slight overlap (Figure 2) and moderate overlap (Figure 3); (b) datasets with both overlap and outliers, as shown in Figure 4. For both types of datasets, we adopt Clustering-Datasets, a widely used benchmark collection for clustering tests, which is publicly available for download on GitHub (https://github.com/milaan9/Clustering-Datasets (accessed on 13 March 2026)). All experiments in this paper are implemented in R 4.5.1.

5.1.1. Overlapping Artificial Datasets

Figure 2 presents three datasets with slight overlap. Among them, the Size1 dataset (Figure 2a) consists of four clusters that are well-separated overall with slight pairwise overlap, and each cluster contains tightly aggregated internal data points. The 2d-4c-2 dataset (Figure 2b) also comprises four clusters, with significant differences in shape, density, and scale across clusters; Cluster 1 and Cluster 3 show slight overlap, while the remaining clusters are well-separated from one another. The Ds577 dataset (Figure 2c) contains three clusters that are distributed in a strip-like or linear pattern; these clusters are not tightly packed internally and have a highly skewed structure, with slight overlap existing between every pair of clusters.

Three datasets with moderate overlap are illustrated in Figure 3. In the 2d-3c dataset (Figure 3a), the three clusters feature well-distinguishable spatial positions with a certain degree of mutual overlap. The Engytime dataset (Figure 3b) contains two elliptical clusters with two more distinct overlapping regions between them. The Square4 dataset (Figure 3c) consists of four clusters with severe pairwise overlap, where the cluster boundaries are almost indistinguishable.

5.1.2. Mixed Datasets with Outliers

Evaluating CVIs on datasets with outliers, uneven density distribution, and those with both outliers and severe overlap is also a crucial aspect of CVI validation. Figure 4 presents three categories of such datasets: The Wingnut dataset (Figure 4a) comprises two clusters with an extremely uneven density distribution, where data points exhibit low density in the bottom-left and top-right regions, and these sparse regions can also be regarded as outliers; the Cure-t2-4k dataset (Figure 4b) features six clusters, where Cluster 6 overlaps with Cluster 3 and Cluster 5 at its two ends while the remaining clusters are well-separated. Outliers are scattered throughout the entire spatial domain of this dataset, making it a hybrid dataset characterized by the coexistence of both outliers and overlap. The S-set4 dataset (Figure 4c) stands as a typical representative of hybrid datasets, which comprises 15 clusters with severe pairwise overlap, and in whose peripheral regions scatter unevenly distributed outliers.

5.1.3. Experimental Results on Artificial Datasets

From the performance of CVIs on slightly overlapping datasets (Table 2, Table 3 and Table 4), all five indices exhibit favorable performance in the presence of slight data overlap. Specifically, the DB, SIL, I, and RA indices accurately identify the true number of clusters for the Size1, 2d-4c-2, and Ds577 datasets, while the CH index only incurs a minor misjudgment on the 2d-4c-2 dataset—falsely assigning the optimal partition to 5 clusters even though the dataset’s true cluster number is 4. An in-depth observation of the 2d-4c-2 dataset shows that its four clusters have notable discrepancies in density and scale, which indicates that the performance of the CH index under slight overlap scenarios is susceptible to the scale and density distribution of observed clusters. In contrast, the other four indices can well address this issue and maintain robust performance regardless of such cluster heterogeneities.

The CVI values presented in Table 2, Table 3 and Table 4 indicate that the final calculated results of different indices exhibit substantial numerical discrepancies, which pose practical challenges for the graphical visualization of clustering performance. To enable the clustering effectiveness reflected by the five indices to be displayed on a single plot for intuitive comparison, we perform a standardization process on the raw calculated results of all indices in accordance with Equation (10). It is noteworthy that the standardized results may generate both positive and negative values, yet this standardization process does not alter the original magnitude order of the raw numerical values. Thus, the evaluation criteria for each index can still be strictly followed in line with the rules specified in Table 1.

CVI^{*} = \frac{{CVI}_{value} - {\bar{X}}_{{CVI}_{value}}}{S_{{CVI}_{value}}}

(10)

where {\bar{X}}_{{CVI}_{value}} and S_{{CVI}_{value}} are the mean and standard deviation of the CVI's results, respectively.
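As a minimal sketch of Equation (10), the snippet below standardizes the CH values from Table 2 and checks that the position of the maximum is unchanged. The function name `standardize` is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form is used.

```python
import statistics

def standardize(values):
    """Z-score a list of CVI values as in Equation (10)."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# CH values for K = 2..10 on the Size1 dataset (Table 2).
ch_raw = [775.37, 1046.87, 1937.99, 1590.75, 1554.50,
          1412.64, 1329.65, 1318.50, 1270.45]
ch_std = standardize(ch_raw)

# A positive affine rescaling keeps the ordering, so the selected K is the same.
best_raw = ch_raw.index(max(ch_raw)) + 2
best_std = ch_std.index(max(ch_std)) + 2
print(best_raw, best_std)
```

The standardized series contains both positive and negative values, yet the K selected by the Max criterion does not move, which is the property the text relies on.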

Experiments on slightly overlapping datasets presented above reveal that all five indices deliver satisfactory performance, with the only exception of the CH index that incurs occasional misjudgments. This naturally raises the question of how each of these five indices would perform as the degree of data overlap increases. Accordingly, we further conduct experimental investigations on moderately overlapping datasets. Figure 5 illustrates the performance of the five indices on the 2d-3c dataset, from which it can be clearly observed that the DB, SIL, and the newly proposed RA indices yield favorable performance on this dataset, accurately identifying the optimal number of clusters as 3. In contrast, the I index exhibits a minor deviation by determining the optimal cluster number as 2, while the CH index suffers from a considerable deviation, with its identified optimal number of clusters being as high as 8.

Figure 6 presents the performance of all CVIs on the Engytime dataset. It should be noted that when the mixture model performs clustering on this dataset, some clusters contain only a single observation if the cluster number is greater than or equal to 10, which causes the SIL index to become invalid under such circumstances. Thus, only the evaluation results for cluster numbers ranging from 2 to 9 are illustrated in this experiment. As can be seen from the figure, among the five indices, only the newly proposed RA index accurately identifies the true optimal cluster number, while all the other indices fail to make a correct judgment. Among these ineffective indices, the SIL index is the relatively best performer, which determines 3 as the optimal cluster number; the DB index identifies 4 clusters as the optimal partition, the CH index gives 5, and the I index performs the poorest by assigning 8 as the optimal cluster number.

A comparison of the Engytime dataset with the four aforementioned datasets reveals that the Engytime dataset boasts only two clusters yet features a markedly higher degree of overlap between them, which even reaches a state of high-density overlap. Under such circumstances, the RA index still succeeds in accurately identifying the true optimal number of clusters, which demonstrates that the newly proposed index exhibits superior adaptability to the diverse scenarios of data overlap.

The Square4 dataset is composed of four pairwise overlapping clusters, yet its overlap degree is lower than that of the Engytime dataset, and clear contours of the four clusters can be distinctly observed in Figure 3c. Figure 7 depicts the performance of the five indices on the Square4 dataset, from which it can be concluded that all indices except the I index can effectively identify the true number of clusters under moderate overlap conditions, irrespective of the specific pattern of inter-cluster overlap.

Synthesizing the performance results of the five CVIs presented in Figure 5, Figure 6 and Figure 7, we can draw the following conclusions: The newly proposed RA index achieves the optimal performance in identifying the true optimal number of clusters for datasets with varying overlap degrees, and it adapts excellently to scenarios featuring diverse cluster scales, densities, and overlap degrees. Next come the DB and SIL indices, which can cope with clusters characterized by moderate overlap and distinct contours, yet their judgment performance deteriorates as the density of overlapping regions increases. The CH index is applicable to scenarios where the overlap degree is not excessively high and certain fuzzy cluster boundaries exist. In contrast, the I index fails completely in all tests on moderately overlapping datasets, mainly because its measurement of intra-cluster compactness still relies on the distance from samples to cluster centers. As is well known, cluster centers are inherently unstable; the performance of the I index gradually becomes invalid with the increase in both the overlap degree and the complexity of data structures.

Heatmaps of CVIs also serve as an important presentation method for their performance. For the convenience of comparison, we first perform a normalization process on the calculated results of CVIs in accordance with Equation (11). The normalized results range from 0 to 1, with the maximum value being exactly 1 and the minimum value exactly 0. As we know from the criteria for determining the optimal number of clusters specified in Table 1, smaller values of the DB and I indices indicate better clustering performance, while larger values are preferable for the other three indices. To facilitate unidirectional comparison, we process the DB and I indices using Equation (12), after which the new indices DB_t and I_t are derived. At this point, the values of DB_t and I_t still fall within the range of 0 to 1, and larger values of these two indices imply smaller original values of the DB and I indices, which in turn indicate a more optimal clustering partition. Thus, the CH, DB_t, SIL, I_t, and RA indices are all considered to indicate the optimal number of clusters when their values reach 1.

CVI^{*} = \frac{{CVI}_{value} - \min\{{CVI}_{value}\}}{\max\{{CVI}_{value}\} - \min\{{CVI}_{value}\}}.

(11)

CVI_{t} = 1 - CVI^{*}.

(12)
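The two-step rescaling can be sketched as follows, using the DB and SIL values from Table 4. The sketch assumes that the normalization is ordinary min-max scaling in which the largest raw value maps to 1, which matches the description in the text; the function names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; the largest raw value maps to 1 (assumed form)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def flip(normalized):
    """Equation (12): reverse the direction of a smaller-is-better index."""
    return [1 - v for v in normalized]

ks = list(range(2, 11))
# DB (smaller is better) and SIL (larger is better) on Ds577, from Table 4.
db = [0.85, 0.60, 1.14, 1.23, 1.59, 4.98, 1.06, 2.06, 1.16]
sil = [0.61, 0.80, 0.68, 0.60, 0.55, 0.51, 0.57, 0.53, 0.55]

db_t = flip(min_max(db))
sil_n = min_max(sil)

# After the transformation, both indices select the K whose score equals 1.
k_db = ks[db_t.index(max(db_t))]
k_sil = ks[sil_n.index(max(sil_n))]
print(k_db, k_sil)
```

With both indices oriented the same way, a single color scale in a heatmap can be read as "brighter is better" for every row.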

The performance of CVIs on hybrid datasets can be obtained from Figure 8, Figure 9 and Figure 10. Specifically, Figure 8 illustrates that all indices except the I index successfully identify the true optimal number of clusters, which further validates the conclusion that the discriminative power of the I index deteriorates in scenarios with uneven density distribution.

Figure 4b,c both depict hybrid datasets that combine overlap and outliers. Specifically, the Cure-t2-4k dataset in Figure 4b features a low degree of overlap, with only three clusters exhibiting marginal overlap at their edges. In contrast, the S-set4 dataset in Figure 4c contains fifteen clusters that overlap with one another extensively, and the dataset is also subject to interference from outliers in its peripheral regions, rendering it a hybrid dataset with an extremely complex structural configuration. Figure 9 presents the CVI evaluation results for the Cure-t2-4k dataset, from which it can be observed that the RA and DB indices accurately identify the true optimal number of clusters as 6. The I index deems either 6 or 7 clusters to be the optimal partition, while both the SIL and CH indices identify 8 clusters as the optimal clustering result. This demonstrates that the RA and DB indices exhibit good applicability for datasets with uneven density distribution; the I index also delivers a decent performance, whereas the performance of the SIL and CH indices deteriorates in such scenarios.

For the S-set4 dataset with an even more complex structural configuration, all five indices fail to accurately identify the true optimal number of clusters (Figure 10). Specifically, the DB, I, and RA indices determine 2 as the optimal clustering number, which deviates significantly from the true value of 15. The CH and SIL indices identify 17 clusters as the optimal partition; while this value has a relatively smaller deviation from the true cluster number, these two indices still do not succeed in the accurate recognition of the actual optimal number. This indicates that there remains considerable room for improvement in the modeling of CVI indices when confronted with datasets characterized by an extremely complex structure.

A preliminary explanation for the misjudgment of the newly proposed RA index on the S-set4 dataset can be derived from its fundamental definition. Specifically, the RA index employs kernel density quantile estimation for intra-cluster compactness measurement, and this estimation method exhibits strong robustness against outliers and is less affected in the presence of data overlap. However, its separability metric based on Jeffrey divergence requires that the overlap degree between different clusters should not be excessively high; otherwise, this metric tends to categorize these clusters into a single distribution, which ultimately leads to the failure of the RA index in this scenario.
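The sensitivity of a Jeffrey-divergence separability term to heavy overlap can be illustrated with a small numerical sketch. It is not the RA separability term itself; it only evaluates the standard Jeffrey divergence J(p, q) = ∫ (p − q) ln(p/q) dx between two unit-variance Gaussian densities on a grid. The Gaussian model, the grid limits, and the step size are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, sd=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def jeffrey_divergence(mean_p, mean_q, lo=-15.0, hi=20.0, step=0.01):
    """Riemann-sum approximation of the integral of (p - q) * ln(p / q)."""
    total = 0.0
    n = int(round((hi - lo) / step))
    for i in range(n + 1):
        x = lo + i * step
        p = gaussian_pdf(x, mean_p)
        q = gaussian_pdf(x, mean_q)
        total += (p - q) * math.log(p / q) * step
    return total

# Two unit-variance clusters whose centers move closer together.
for gap in (4.0, 2.0, 0.5):
    print(gap, round(jeffrey_divergence(0.0, gap), 4))
```

For two Gaussians with equal unit variance, the closed form of J is the squared distance between the means, so the divergence falls toward zero as the clusters merge. This is consistent with the explanation above: under severe overlap, a divergence-based separability term has little signal left to distinguish neighboring clusters.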

Table 5 summarizes the test results of the five CVIs on the nine artificial datasets. The newly proposed RA index demonstrates the best overall performance, achieving an accuracy rate of 88.89% (eight of nine datasets) and failing only on the S-set4 dataset, which attests to its strong generalizability and robustness. The DB and SIL indices rank second and third, with accuracy rates of 77.78% and 66.67%, respectively. These two indices perform stably on most datasets yet have limitations on datasets with complex characteristics: both fail on Engytime and S-set4, and SIL additionally fails on Cure-t2-4k. In contrast, the CH and I indices show inferior overall performance, both reaching an accuracy rate of only 44.44%. This is because they impose more stringent assumptions on the underlying data distributions and are thus prone to failure in the presence of complex data structures. In conclusion, the RA index demonstrates favorable adaptability and robust performance in the clustering validity evaluation of datasets with complex structures involving overlap and outliers, and it holds promising application prospects for practical clustering tasks.

5.2. Real Datasets

Table 6 compiles the key statistical information of the ten real-world datasets utilized for experimental testing, including the number of instances, attribute dimensions, and pre-defined cluster numbers. All the datasets can be downloaded from the UCI Machine Learning Repository [30]. In terms of data scale, these datasets exhibit remarkable variations in instance volume, ranging from small-scale datasets (e.g., Diabetes with 145 instances) to large-scale ones (e.g., Eyes with 14,980 instances). This design facilitates a systematic evaluation of the computational efficiency, stability, and noise sensitivity of cluster validity indices across datasets with different scales. In terms of feature dimensions, the number of attributes spans from low-dimensional cases (e.g., Faithful and Wreath with only 2 dimensions) to relatively high-dimensional ones (e.g., Musk with 166 dimensions). Such a dimensional design enables a comprehensive verification of each index's capability in capturing clustering structures in low-dimensional spaces, addressing the sparsity and the curse of dimensionality in high-dimensional data, as well as its robustness and generalizability in adapting to varying levels of data structural complexity. By integrating diverse characteristics in data scale and feature dimension, this dataset suite provides a well-layered and comprehensively challenging test benchmark for the evaluation of cluster validity.

Table 7 presents the optimal number of clusters determined by the five cluster validity indices (CH, DB, SIL, I, and RA) across ten real-world datasets, where boldface values denote successful identification of the true cluster number for each dataset. The performance of these indices varies substantially across datasets: all indices accurately identified the true cluster number for the Faithful dataset, while only the RA index achieved a correct judgment for Diabetes. For the Olive, Wreath, Dietary_survey, and Wpbc datasets, the RA index consistently yielded correct results, whereas the other indices showed inconsistent performance: DB and SIL matched the true cluster number for Olive and Wreath (the I index also matched for Wreath), CH and SIL were correct for Wpbc, and CH, DB, and SIL succeeded in identifying the optimal clusters for Dietary_survey. In contrast, no index determined the true cluster number for either the Iris or the Glass dataset. In the context of large-scale datasets, such as Musk and Eyes, the proposed RA index demonstrates robust discriminative capability, performing on par with the established DB and SIL indices, whereas the CH and I indices each identified the correct cluster structure for only one of these two datasets (Eyes). This disparity underscores the strong applicability and reliability of the new method when applied to large-scale data. Overall, the RA index outperformed all other indices with an accuracy rate of 80.00%, followed by SIL (70.00%), DB (60.00%), and CH (40.00%), while the I index exhibited the poorest performance with an accuracy rate of only 30.00%. This outcome demonstrates the superior generalizability of the RA index in evaluating clustering validity on real-world datasets with diverse characteristics.

6. Conclusions and Future Work

6.1. Conclusions

Cluster Validity Indices (CVIs) serve as a pivotal tool in machine learning, which assists clustering algorithms in identifying the optimal number of clusters. Against the backdrop of the demand for accurate cluster number identification in datasets with overlap, outliers, or hybrid datasets integrating both of these characteristics, this study takes the kernel density function of observed sample points as the fundamental tool to reconstruct the measurement methods for intra-cluster compactness and inter-cluster separability, respectively. In the intra-cluster compactness measurement function, the concept of density quantiles is incorporated, which notably enhances the robustness against outliers. For inter-cluster separability, a newly proposed density-based Jeffrey divergence method is developed, which can effectively quantify the inter-cluster differences in overlapping datasets. Considering that kernel density estimation is highly susceptible to the choice of bandwidth, heuristic algorithms including Scott’s and Silverman’s classic methods are adopted to accommodate the features of diverse density distributions, thus enabling the adaptive calculation of density values for all sample points in a dataset. Ultimately, a multiplicative, adaptive, and robust Cluster Validity Index—the RA index—is proposed in this study.

In the validation of the RA index, experiments were conducted on both synthetic and real-world datasets, with four indices—CH, DB, SIL and I—which exhibit prominent performance on overlapping datasets, included as the control group for comparative analysis. Experimental results demonstrate that the RA index delivers favorable performance on both mildly and moderately overlapping datasets, and it still maintains superior performance in the presence of uneven density distribution and outliers. Among the nine synthetic datasets, the RA index achieved correct cluster number identification in eight cases, outperforming all comparative indices with a success rate of 88.89%. For real-world datasets featuring diverse scales, dimensions and inherent structural characteristics, the RA index also emerged as the most robust and effective metric among the five participating indices. This superior performance of the RA index is not only consistent with its outstanding performance on synthetic datasets with varying overlap degrees, density distributions and outlier ratios, but also validates the rationality of its design mechanism: the kernel density quantile estimation for intra-cluster compactness endows it with strong adaptability to non-uniform data distributions and latent noise in real-world data, while the Jeffrey divergence-based separability metric effectively captures the inter-cluster structural differences of practical datasets without being overly constrained by rigid assumptions on data distribution.

Meanwhile, the failure of the RA index on datasets such as S-set4 and Iris, which contain both outliers and severe inter-cluster overlap, also indicates that as a density-based CVI, the RA index experiences a significant performance decline when confronted with datasets with complex structures characterized by high overlap and faint cluster boundaries. For such complex datasets, future research may explore the construction of novel cluster validity indices from the perspective of sparse matrices.

6.2. Future Work

While this paper establishes a robust framework for single-view clustering validation, we acknowledge the growing complexity of modern datasets. A critical gap identified in this study is the lack of validity indices specifically designed for multi-view clustering scenarios. Unlike traditional single-view data, multi-view data often contains heterogeneous structures and complex correlations between views. Therefore, our primary future direction will focus on extending the proposed RA index to handle multi-view data, aiming to develop a validity index capable of evaluating clustering performance across diverse modalities and uncovering reliable structures amidst multi-source noise.

Author Contributions

Conceptualization, B.Y.; methodology, J.Z. and B.Y.; software, B.Y.; data curation, B.Y.; writing—original draft preparation, J.Z. and B.Y.; writing—review and editing, J.Z. and B.Y.; funding acquisition, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Key Projects of Hunan Provincial Department of Education (No. 25A0674), General Project of Hunan Provincial Social Science Achievements Evaluation Committee (No. XSP26YBC707), and the Project of Social Science Popularization Base in Hunan province (No. XJK22ZDJD35).

Data Availability Statement

The artificial data sets are publicly available from GitHub https://github.com/milaan9/Clustering-Datasets (accessed on 13 March 2026), and the real data sets in experiments are available in the UCI Machine Learning Repository https://archive.ics.uci.edu/ (accessed on 13 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rezaee, B. A cluster validity index for fuzzy clustering. Fuzzy Sets Syst. 2010, 161, 3014–3025.
2. Wiroonsri, N.; Preedasawakul, O. A correlation-based fuzzy cluster validity index with secondary options detector. Fuzzy Sets Syst. 2025, 523, 109632.
3. Todeschini, R.; Ballabio, D.; Termopoli, V.; Consonni, V. Extended multivariate comparison of 68 cluster validity indices. A review. Chemom. Intell. Lab. Syst. 2024, 251, 105117.
4. Le Capitaine, H.; Frelicot, C. A cluster-validity index combining an overlap measure and a separation measure based on fuzzy-aggregation operators. IEEE Trans. Fuzzy Syst. 2011, 19, 580–588.
5. Lin, P.L.; Huang, P.H.; Huang, P.W. A cluster validity indexing method based on entropy for solving cluster overlapping problem. In New Trends on System Sciences and Engineering; IOS Press: Amsterdam, The Netherlands, 2015; pp. 557–569.
6. Yan, B.; Xie, Z. A robust fuzzy cluster validity index based on local distances. Expert Syst. Appl. 2026, 295, 128883.
7. Liu, Y.; Jiang, Y.; Hou, T.; Liu, F. A new robust fuzzy clustering validity index for imbalanced data sets. Inf. Sci. 2021, 547, 579–591.
8. Su, X.; Yu, B.; Liu, Y. A robust clustering algorithm based on weighted optimization of fuzzy feature difference using shadow sets. Fuzzy Sets Syst. 2025, 521, 109575.
9. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27.
10. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
11. Petrovic, S. A comparison between the silhouette index and the davies-bouldin index in labelling ids clusters. In Proceedings of the 11th Nordic Workshop of Secure IT Systems; Citeseer: Princeton, NJ, USA, 2006; Volume 2006, pp. 53–64.
12. Wang, X.; Xu, Y. An improved index for clustering validation based on Silhouette index and Calinski-Harabasz index. IOP Conf. Ser. Mater. Sci. Eng. 2019, 569, 052024.
13. Ros, F.; Riad, R.; Guillaume, S. PDBI: A partitioning Davies-Bouldin index for clustering evaluation. Neurocomputing 2023, 528, 178–199.
14. Chicco, D.; Campagner, A.; Spagnolo, A.; Ciucci, D.; Jurman, G. The Silhouette coefficient and the Davies-Bouldin index are more informative than Dunn index, Calinski-Harabasz index, Shannon entropy, and Gap statistic for unsupervised clustering internal evaluation of two convex clusters. PeerJ Comput. Sci. 2025, 11, e3309.
15. Zhu, E.; Wang, X.; Liu, F. A new cluster validity index for overlapping datasets. J. Phys. Conf. Ser. 2019, 1168, 032070.
16. Ouchicha, C.; Ammor, O.; Meknassi, M. A new validity index in overlapping clusters for medical images. Autom. Control Comput. Sci. 2020, 54, 238–248.
17. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
18. Rezaee, M.R.; Lelieveldt, B.P.; Reiber, J.H. A new cluster validity index for the fuzzy c-mean. Pattern Recognit. Lett. 1998, 19, 237–246.
19. Kim, D.W.; Lee, K.H.; Lee, D. Fuzzy cluster validation index based on inter-cluster proximity. Pattern Recognit. Lett. 2003, 24, 2561–2574.
20. Wu, C.H.; Ouyang, C.S.; Chen, L.W.; Lu, L.W. A new fuzzy clustering validity index with a median factor for centroid-based clustering. IEEE Trans. Fuzzy Syst. 2014, 23, 701–718.
21. Said, A.B.; Hadjidj, R.; Foufou, S. Cluster validity index based on Jeffrey divergence. Pattern Anal. Appl. 2017, 20, 21–31.
22. Yan, B.; Yin, Y.; Liu, P. A New Cluster Validity Index Based on Local Density of Data Points. Axioms 2025, 14, 578.
23. Tang, Y.; Sun, F.; Sun, Z. Improved validation index for fuzzy clustering. In Proceedings of the 2005, American Control Conference, 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 1120–1125.
24. Bezdek, J.C.; Pal, N.R. Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B Cybern. 1998, 28, 301–315.
25. Moulavi, D.; Jaskowiak, P.A.; Campello, R.J.; Zimek, A.; Sander, J. Density-based clustering validation. In Proceedings of the 2014 SIAM International Conference on Data Mining; SIAM: Philadelphia, PA, USA, 2014; pp. 839–847.
26. Matioli, L.; Santos, S.; Kleina, M.; Leite, E. A new algorithm for clustering based on kernel density estimation. J. Appl. Stat. 2018, 45, 347–366.
27. Scaldelai, D.; Matioli, L.C.; Santos, S.; Kleina, M. MulticlusterKDE: A new algorithm for clustering based on multivariate kernel density estimation. J. Appl. Stat. 2022, 49, 98–121.
28. Fraley, C.; Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97, 611–631.
29. Scrucca, L.; Fop, M.; Murphy, T.B.; Raftery, A.E. mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016, 8, 289.
30. Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. 1 February 2026. Available online: https://uci-ics-mlr-prod.aws.uci.edu/ (accessed on 29 April 2026).

Figure 1. Geometric Meaning of Compactness: The red and blue curves represent the 10th and 90th percentiles of density, respectively.

Figure 2. Datasets 1. Slightly overlapping artificial data sets: (a) Size1; (b) 2d-4c-2; (c) Ds577.

Figure 3. Datasets 2. Artificial data sets with moderate overlap: (a) 2d-3c; (b) Engytime; (c) Square4.

Figure 4. Datasets 3. Mixed datasets: (a) Wingnut; (b) Cure-t2-4k; (c) S-set4: 15 clusters.

Figure 5. CVIs’ performance on 2d-3c.

Figure 6. CVIs’ performance on Engytime.

Figure 7. CVIs’ performance on Square4.

Figure 8. CVIs’ performance on Mixed datasets: Wingnut.

Figure 9. CVIs’ performance on Mixed datasets: Cure-t2-4k.

Figure 10. CVIs’ performance on Mixed datasets: S-set4.

Table 1. Information of the comparison indices.

Index	Source	Criterion
CH	[9]	Max
DB	[10]	Min
SIL	[17]	Max
I	[21]	Min

Table 2. CVIs’ performance on Size1.

Index	K=2	K=3	K=4	K=5	K=6	K=7	K=8	K=9	K=10
CH	775.37	1046.87	1937.99	1590.75	1554.50	1412.64	1329.65	1318.50	1270.45
DB	1.11	0.82	0.59	0.89	0.98	0.87	0.94	1.06	1.24
SIL	0.60	0.64	0.79	0.71	0.60	0.61	0.59	0.52	0.46
I	15.38	14.24	7.21	16.35	25.30	16.72	18.28	24.05	23.02
RA	52.41	58.72	75.12	24.90	30.09	31.02	30.52	25.85	22.60

Bold numbers indicate the best partition.

Table 3. CVIs’ performance on 2d-4c-2.

Index	K=2	K=3	K=4	K=5	K=6	K=7	K=8	K=9	K=10
CH	59.72	666.59	1353.36	1536.61	1405.14	1329.87	1166.48	1393.80	1309.34
DB	2.40	1.24	0.53	0.72	0.81	0.99	1.01	0.99	1.03
SIL	0.62	0.65	0.77	0.73	0.67	0.66	0.52	0.56	0.54
I	3.37	4.04	2.59	3.87	4.50	5.29	7.90	11.57	10.09
RA	45.28	49.35	66.70	31.43	31.48	24.77	26.27	26.15	23.57

Bold numbers indicate the best partition.
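The selection rule behind the best partition in Table 3 can be sketched in a few lines. The criterion for CH, DB, SIL, and I follows Table 1; RA is treated as larger-is-better, as stated in the text. The dictionary and function names are hypothetical.

```python
# Criterion per index: Table 1 for CH, DB, SIL, I; RA is larger-is-better.
CRITERION = {"CH": max, "DB": min, "SIL": max, "I": min, "RA": max}

ks = list(range(2, 11))
# Values for the 2d-4c-2 dataset, copied from Table 3 (K = 2..10).
table3 = {
    "CH": [59.72, 666.59, 1353.36, 1536.61, 1405.14, 1329.87, 1166.48, 1393.80, 1309.34],
    "DB": [2.40, 1.24, 0.53, 0.72, 0.81, 0.99, 1.01, 0.99, 1.03],
    "SIL": [0.62, 0.65, 0.77, 0.73, 0.67, 0.66, 0.52, 0.56, 0.54],
    "I": [3.37, 4.04, 2.59, 3.87, 4.50, 5.29, 7.90, 11.57, 10.09],
    "RA": [45.28, 49.35, 66.70, 31.43, 31.48, 24.77, 26.27, 26.15, 23.57],
}

def best_k(index_name, values):
    """Return the K at which the index attains its preferred extreme."""
    pick = CRITERION[index_name]
    return ks[values.index(pick(values))]

selected = {name: best_k(name, vals) for name, vals in table3.items()}
print(selected)
```

Running this on the Table 3 values reproduces the pattern discussed in Section 5.1.3: CH selects 5 clusters, while the other four indices select the true value of 4.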

Table 4. CVIs’ performance on Ds577.

Index	K=2	K=3	K=4	K=5	K=6	K=7	K=8	K=9	K=10
CH	501.89	1114.60	813.99	656.86	583.18	485.20	797.79	479.59	772.42
DB	0.85	0.60	1.14	1.23	1.59	4.98	1.06	2.06	1.16
SIL	0.61	0.80	0.68	0.60	0.55	0.51	0.57	0.53	0.55
I	0.41	0.35	0.40	0.60	0.41	0.66	0.83	0.79	1.03
RA	24.01	26.10	7.57	5.48	5.81	2.10	11.04	4.69	6.59

Bold numbers indicate the best partition.

Table 5. CVIs’ performance on artificial datasets.

CVIs	Size1	2d-4c-2	Ds577	2d-3c	Engytime	Square4	Wingnut	Cure-t2-4k	S-set4	Accuracy Rate
CH	✓	×	✓	×	×	✓	✓	×	×	44.44%
DB	✓	✓	✓	✓	×	✓	✓	✓	×	77.78%
SIL	✓	✓	✓	✓	×	✓	✓	×	×	66.67%
I	✓	✓	✓	×	×	×	×	✓	×	44.44%
RA	✓	✓	✓	✓	✓	✓	✓	✓	×	88.89%

✓ means success and × means failure.
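The accuracy rates in the last column of Table 5 are simply the share of the nine datasets on which an index succeeds. A minimal sketch, with successes encoded as 1 and failures as 0 in the column order of Table 5 (the variable names are hypothetical):

```python
# 1 = success, 0 = failure, in the dataset order of Table 5:
# Size1, 2d-4c-2, Ds577, 2d-3c, Engytime, Square4, Wingnut, Cure-t2-4k, S-set4
results = {
    "CH":  [1, 0, 1, 0, 0, 1, 1, 0, 0],
    "DB":  [1, 1, 1, 1, 0, 1, 1, 1, 0],
    "SIL": [1, 1, 1, 1, 0, 1, 1, 0, 0],
    "I":   [1, 1, 1, 0, 0, 0, 0, 1, 0],
    "RA":  [1, 1, 1, 1, 1, 1, 1, 1, 0],
}

def accuracy_rate(marks):
    """Percentage of datasets on which the true cluster number was found."""
    return round(100.0 * sum(marks) / len(marks), 2)

rates = {name: accuracy_rate(marks) for name, marks in results.items()}
print(rates)
```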

Table 6. Summary information for real datasets.

Datasets	Instances	Attributes	Number of Clusters
Iris	150	4	3
Faithful	272	2	2
Diabetes	145	3	3
Olive	572	8	9
Wreath	1000	2	14
Dietary_survey	400	43	2
Wpbc	198	33	2
Glass	214	9	6
Musk	6598	166	2
Eyes	14,980	15	2

Table 7. The performances of CVIs on real datasets.

Datasets	CH	DB	SIL	I	RA
Iris	4	4	2	8	2
Faithful	2	2	2	2	2
Diabetes	2	2	2	2	3
Olive	5	8	8	3	8
Wreath	18	14	14	14	14
Dietary_survey	2	2	2	3	2
Wpbc	2	3	2	4	2
Glass	2	3	7	4	8
Musk	3	2	2	3	2
Eyes	2	2	2	2	2
Accuracy rate	40.00%	60.00%	70.00%	30.00%	80.00%

Bold number means success.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
