5.1.3. Experimental Results on Artificial Datasets
The results on the slightly overlapping datasets (Table 2, Table 3 and Table 4) show that all five indices perform well under slight data overlap. Specifically, the DB, SIL, I, and RA indices correctly identify the true number of clusters for the Size1, 2d-4c-2, and Ds577 datasets, while the CH index makes a single minor misjudgment on the 2d-4c-2 dataset, selecting 5 clusters as the optimal partition although the true number is 4. A closer look at the 2d-4c-2 dataset shows that its four clusters differ notably in density and scale, which suggests that, under slight overlap, the CH index is sensitive to the scale and density distribution of the clusters. In contrast, the other four indices handle this heterogeneity well and remain robust.
The CVI values presented in Table 2, Table 3 and Table 4 show that the raw results of the different indices differ substantially in numerical scale, which makes it difficult to visualize them together. To display the clustering effectiveness reflected by the five indices on a single plot for intuitive comparison, we standardize the raw results of each index in accordance with Equation (10). The standardized results may take both positive and negative values, but standardization does not change the ordering of the raw values. Thus, the evaluation criteria for each index can still be applied exactly as specified in Table 1.
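$$z = \frac{x - \mu}{\sigma} \tag{10}$$
where $x$ is a raw CVI value, $z$ is its standardized value, and $\mu$ and $\sigma$ are the mean and standard deviation of the CVI's results, respectively.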
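The standardization referred to as Equation (10) can be sketched in a few lines. This is a minimal illustration that assumes the usual z-score form (subtract the mean, divide by the standard deviation); the function name `standardize`, the use of the population standard deviation, and the sample values are assumptions made for the example and are not taken from the paper's tables.

```python
import numpy as np

def standardize(values):
    """Z-score a sequence of CVI values: (x - mean) / std."""
    x = np.asarray(values, dtype=float)
    sigma = x.std()  # population std (ddof=0); the source does not say which variant is used
    if sigma == 0:
        # All values identical: return zeros instead of dividing by zero.
        return np.zeros_like(x)
    return (x - x.mean()) / sigma

# Hypothetical raw values of one index for k = 2..6 (illustrative only).
raw = np.array([120.0, 340.0, 910.0, 560.0, 430.0])
z = standardize(raw)

# Standardized values can be negative, but their ordering matches the raw ordering,
# so the "largest is best" or "smallest is best" rule still picks the same k.
same_order = bool((np.argsort(raw) == np.argsort(z)).all())
print(z, same_order)
```

Because the transformation is a shift followed by division by a positive constant, it is strictly increasing, which is why the ordering of the raw values is preserved.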
The experiments on slightly overlapping datasets presented above show that all five indices perform satisfactorily, the only exception being the CH index, which misjudges the 2d-4c-2 dataset. This naturally raises the question of how each of the five indices performs as the degree of data overlap increases. Accordingly, we further conduct experiments on moderately overlapping datasets.
Figure 5 illustrates the performance of the five indices on the 2d-3c dataset. The DB and SIL indices and the newly proposed RA index perform well on this dataset, correctly identifying the optimal number of clusters as 3. In contrast, the I index deviates slightly by selecting 2 as the optimal cluster number, while the CH index deviates considerably, selecting as many as 8 clusters.
Figure 6 presents the performance of all CVIs on the Engytime dataset. When the mixture model clusters this dataset with 10 or more clusters, some clusters contain only a single observation, which invalidates the SIL index. Thus, only the evaluation results for cluster numbers from 2 to 9 are shown in this experiment. As can be seen from the figure, only the newly proposed RA index correctly identifies the true optimal cluster number, while the other four indices all fail. Among the failing indices, the SIL index comes closest, selecting 3 as the optimal cluster number; the DB index selects 4, the CH index selects 5, and the I index performs worst, selecting 8.
Compared with the four aforementioned datasets, the Engytime dataset contains only two clusters, yet the overlap between them is markedly higher and occurs in a high-density region. Even so, the RA index correctly identifies the true optimal number of clusters, which demonstrates that the newly proposed index adapts better to diverse data overlap scenarios.
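The restriction of the Engytime evaluation to cluster numbers without single-observation clusters can be expressed as a simple filter. The sketch below is only an illustration: the helper names `has_singleton` and `usable_ks` and the toy label vectors are hypothetical, and the way the paper actually performs this check is not described in the text.

```python
import numpy as np

def has_singleton(labels):
    """Return True if any cluster in the label vector has exactly one member."""
    _, counts = np.unique(labels, return_counts=True)
    return bool((counts < 2).any())

def usable_ks(labels_by_k):
    """Keep the cluster numbers whose partitions contain no singleton clusters."""
    return [k for k, labels in sorted(labels_by_k.items())
            if not has_singleton(labels)]

# Hypothetical toy partitions of six observations (not the Engytime data).
labels_by_k = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],  # clusters 2 and 3 each hold a single observation
}
print(usable_ks(labels_by_k))
```

Only the cluster numbers returned by such a filter would then be passed on to the index computation, which mirrors the choice to report results for 2 to 9 clusters only.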
The Square4 dataset is composed of four pairwise overlapping clusters, yet its degree of overlap is lower than that of the Engytime dataset, and the contours of the four clusters can be clearly seen in Figure 3c. Figure 7 depicts the performance of the five indices on the Square4 dataset: all indices except the I index identify the true number of clusters under moderate overlap, irrespective of the specific pattern of inter-cluster overlap.
Taking together the results of the five CVIs presented in Figure 5, Figure 6 and Figure 7, we can draw the following conclusions. The newly proposed RA index performs best at identifying the true number of clusters across datasets with varying degrees of overlap, and it adapts well to clusters of diverse scales, densities, and overlap degrees. Next come the DB and SIL indices, which cope with clusters that overlap moderately and have distinct contours, but whose judgments deteriorate as the density of the overlapping regions increases. The CH index is applicable to scenarios where the overlap degree is not excessively high and certain fuzzy cluster boundaries exist. In contrast, the I index fails in all tests on moderately overlapping datasets, mainly because its measure of intra-cluster compactness still relies on the distance from samples to cluster centers. Since cluster centers are inherently unstable, the I index gradually loses its validity as both the overlap degree and the complexity of the data structure increase.
Heatmaps offer another useful way to present CVI performance. For ease of comparison, we first normalize the calculated results of each CVI in accordance with Equation (11). The normalized results range from 0 to 1, with the maximum value being exactly 1 and the minimum value exactly 0. According to the criteria for determining the optimal number of clusters specified in Table 1, smaller values of the DB and I indices indicate better clustering performance, while larger values are preferable for the other three indices. To make all indices comparable in the same direction, we transform the DB and I indices using Equation (12), which yields the new indices DB_t and I_t. The values of DB_t and I_t still fall within the range of 0 to 1, and larger values correspond to smaller original values of the DB and I indices, that is, to a better clustering partition. Thus, the CH, DB_t, SIL, I_t, and RA indices all indicate the optimal number of clusters where their values reach 1.
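The normalization and direction flip described above can be sketched as follows. Equations (11) and (12) are not reproduced in this section, so the code assumes the standard min-max form for Equation (11) and the form 1 - x for Equation (12), which is consistent with the stated properties (range 0 to 1, larger transformed value for smaller original value). The function names and the sample DB values are hypothetical.

```python
import numpy as np

def min_max(values):
    """Assumed form of Equation (11): rescale so that min -> 0 and max -> 1."""
    x = np.asarray(values, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("all values are identical; min-max normalization is undefined")
    return (x - x.min()) / span

def flip(normalized):
    """Assumed form of Equation (12): turn 'smaller is better' into 'larger is better'."""
    return 1.0 - np.asarray(normalized, dtype=float)

ks = np.arange(2, 7)                                 # candidate cluster numbers 2..6
db_raw = np.array([0.90, 0.55, 0.40, 0.70, 0.85])    # hypothetical DB values
db_t = flip(min_max(db_raw))                         # transformed DB index

# After the transformation, the optimal k is where the value reaches 1.
best_k_db = int(ks[np.argmax(db_t)])
print(db_t, best_k_db)
```

With every index mapped to the same 0-to-1 scale and the same direction, one heatmap row per index can be read uniformly: the cell equal to 1 marks that index's preferred number of clusters.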
The performance of the CVIs on hybrid datasets is shown in Figure 8, Figure 9 and Figure 10. Specifically, Figure 8 shows that all indices except the I index successfully identify the true optimal number of clusters, which further supports the conclusion that the discriminative power of the I index deteriorates when the density distribution is uneven.
Figure 4b,c depict hybrid datasets that combine overlap and outliers. The Cure-t2-4k dataset in Figure 4b features a low degree of overlap, with only three clusters overlapping marginally at their edges. In contrast, the S-set4 dataset in Figure 4c contains fifteen clusters that overlap with one another extensively, and it is also affected by outliers in its peripheral regions, which makes it a hybrid dataset with an extremely complex structure.
Figure 9 presents the CVI evaluation results for the Cure-t2-4k dataset. The RA and DB indices correctly identify the true optimal number of clusters as 6. The I index rates 6 and 7 clusters as equally optimal, while both the SIL and CH indices select 8 clusters. This demonstrates that the RA and DB indices are well suited to datasets with uneven density distribution; the I index also performs reasonably well, whereas the performance of the SIL and CH indices deteriorates in such scenarios.
For the S-set4 dataset, whose structure is even more complex, all five indices fail to identify the true optimal number of clusters (Figure 10). Specifically, the DB, I, and RA indices select 2 as the optimal cluster number, which deviates significantly from the true value of 15. The CH and SIL indices select 17 clusters; although this is much closer to the true number, these two indices still do not recover it. This indicates that there remains considerable room for improvement in the design of CVIs for datasets with an extremely complex structure.
A preliminary explanation for the misjudgment of the newly proposed RA index on the S-set4 dataset can be derived from its definition. The RA index measures intra-cluster compactness by kernel density quantile estimation, which is highly robust against outliers and is little affected by data overlap. However, its separability metric, which is based on Jeffrey divergence, requires that the overlap between different clusters not be excessively high; otherwise, the metric tends to treat these clusters as a single distribution, which ultimately causes the RA index to fail in this scenario.
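The sensitivity of a divergence-based separation measure to heavy overlap can be illustrated with a small example. The sketch below does not reproduce the RA index; it only evaluates the Jeffrey divergence, taken here in its common form as the symmetrized Kullback-Leibler divergence, between two univariate Gaussians using the closed-form KL expression. The Gaussian assumption, the function names, and the parameter values are all chosen for illustration.

```python
import math

def kl_gauss(mu_p, sd_p, mu_q, sd_q):
    """Closed-form KL(p || q) for univariate Gaussians p and q."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)

def jeffreys_gauss(mu1, sd1, mu2, sd2):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_gauss(mu1, sd1, mu2, sd2) + kl_gauss(mu2, sd2, mu1, sd1)

# Two unit-variance clusters whose centers are far apart vs. close together.
well_separated = jeffreys_gauss(0.0, 1.0, 6.0, 1.0)
heavily_overlapping = jeffreys_gauss(0.0, 1.0, 0.5, 1.0)
print(well_separated, heavily_overlapping)
```

For equal variances the divergence reduces to the squared distance between the means in units of the variance, so it shrinks toward zero as the clusters overlap more. A separation term that small makes heavily overlapping clusters look like one distribution, in line with the explanation given for S-set4.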
Table 5 summarizes the test results of the five cluster validity indices (CVIs) on nine distinct datasets. The newly proposed RA index performs best overall, achieving an accuracy of 88.89% and failing only on the S-set4 dataset, which attests to its strong generalizability and robustness. The DB and SIL indices rank next, with accuracies of 77.78% and 66.67%, respectively. These two indices perform stably and effectively on most datasets, yet they have inherent limitations when handling datasets with complex characteristics such as Engytime and Cure-t2-4k. In contrast, the CH and I indices perform worse overall, each reaching an accuracy of only 44.44%. This is because they impose more stringent assumptions on the underlying data distributions and are thus prone to failure in the presence of complex data structures. In conclusion, the RA index shows good adaptability and robust performance in the clustering validity evaluation of datasets with complex structures involving overlap and outliers, and it holds promise for practical clustering tasks.
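The reported accuracies follow directly from the number of datasets on which each index recovers the true cluster number. As a quick check, the counts below are inferred from the stated percentages over nine datasets (for example, 8 of 9 for the RA index); they are not copied from Table 5, and the variable names are chosen for the example.

```python
# Number of correctly judged datasets per index, inferred from the reported rates.
correct = {"RA": 8, "DB": 7, "SIL": 6, "CH": 4, "I": 4}
n_datasets = 9

# Accuracy as a percentage, rounded to two decimals as in the text.
accuracy = {name: round(100 * hits / n_datasets, 2) for name, hits in correct.items()}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.2f}%")
```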