This section first presents the experimental settings and the datasets used for evaluation. Then, the comparison methods and evaluation metrics are introduced. Finally, a series of comprehensive experiments are conducted to assess the effectiveness of the proposed method.
4.1. Experimental Settings
The proposed method consists mainly of a Siamese autoencoder, an unsupervised metric learning module, and a manifold learning layer. The architectures of the encoder and decoder are detailed in Table 1. The encoder stacks three convolutional layers and one Dense layer, and the second Siamese encoder has exactly the same structure. The decoder mirrors the encoder. The size of the Dense1 layer equals the number of clusters. The size of the Dense2 layer is
.
The manifold learning layer applies the UMAP method with the following parameters: n_neighbors = 10, min_dist = 0.01, metric = ‘euclidean’, and n_components = 2. These settings were chosen to capture local structure through the 10 nearest neighbors, to keep the embedding well distributed with a minimum distance of 0.01, to use the Euclidean metric for distance calculation, and to embed the data in a two-dimensional space. Without pre-training, the model is trained for 300 iterations. With pre-training, the autoencoder is pre-trained for 200 epochs and then fine-tuned for 50 epochs. For both pre-training and fine-tuning of the encoder, the minibatch size is set to 256. Because deep clustering results vary with initialization, all reported performance metrics for the proposed DRSE method are averages over 5 independent runs with random initializations.
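The settings above can be collected in a small configuration sketch. This is only an illustration under stated assumptions: the class name `UMAPSettings`, the dictionary keys in `SCHEDULE`, and the helper names are hypothetical, and the reducer is built with the third-party umap-learn package, which may differ from the implementation actually used. The sample standard deviation is included only as a convenience; the text reports averages.

```python
from dataclasses import dataclass, asdict
from statistics import mean, stdev


@dataclass(frozen=True)
class UMAPSettings:
    """UMAP hyperparameters quoted in the text."""
    n_neighbors: int = 10      # size of the local neighborhood
    min_dist: float = 0.01     # minimum spacing of embedded points
    metric: str = "euclidean"  # distance used in the input space
    n_components: int = 2      # dimensionality of the embedding


# Training schedule quoted in the text; the key names are hypothetical.
SCHEDULE = {
    "iterations_without_pretraining": 300,
    "pretrain_epochs": 200,
    "finetune_epochs": 50,
    "batch_size": 256,
    "independent_runs": 5,
}


def build_umap(settings: UMAPSettings = UMAPSettings()):
    """Create a UMAP reducer (assumes umap-learn is installed)."""
    import umap  # imported lazily so the rest of the sketch runs without it
    return umap.UMAP(**asdict(settings))


def summarize_runs(scores):
    """Return the mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)
```

A call such as `summarize_runs(acc_per_run)`, where `acc_per_run` holds one score per random initialization, gives the averaged value reported for a metric.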
4.2. Datasets
The proposed method is evaluated on five commonly used image datasets of varying scales, feature dimensions, and category numbers, to comprehensively assess clustering performance. Sample images from the experimental datasets are presented in
Figure 3. Details of the datasets are summarized in
Table 2. Since clustering is an unsupervised task, the training and testing samples are merged during the training phase.
MNIST-full: The dataset is a 10-category handwritten digit dataset containing 70,000 samples, and each sample is 28 × 28 gray-scale pixels.
MNIST-test: This dataset is the test split of MNIST-full. It contains 10,000 samples, and each sample is 28 × 28 gray-scale pixels.
USPS: The dataset is a 10-category handwritten digit dataset containing 9298 samples, and each sample is 16 × 16 gray-scale pixels.
COIL20: The dataset contains 20 objects, each imaged from 72 viewpoints, for a total of 1440 samples. In our experiments, the images are resized from 32 × 32 to 18 × 18.
Letters: The dataset is a 10-category set of color images of English letters containing 10,000 samples. We randomly choose 1000 images from each of the categories A to J. Each sample is a 28 × 28 color image.
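Merging the training and test splits, as described above, can be sketched as follows. The arrays are tiny placeholders rather than real data, and the helper names `merge_splits` and `flatten` are hypothetical. For reference, MNIST-full's 70,000 samples correspond to the standard 60,000 training and 10,000 test images, and COIL20's 1440 samples to 20 objects × 72 viewpoints.

```python
import numpy as np


def merge_splits(train_x, test_x):
    """Stack the training and test images into one unlabeled pool."""
    return np.concatenate([train_x, test_x], axis=0)


def flatten(images):
    """Reshape (n, h, w[, c]) images into (n, features) vectors."""
    return images.reshape(len(images), -1)


# Placeholder splits standing in for real data (6 + 2 images of 28 x 28).
train_x = np.zeros((6, 28, 28), dtype=np.float32)
test_x = np.ones((2, 28, 28), dtype=np.float32)

pool = merge_splits(train_x, test_x)
print(pool.shape)           # (8, 28, 28)
print(flatten(pool).shape)  # (8, 784)
```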
4.3. Comparison Methods
DRSE is evaluated against several conventional clustering algorithms, including K-means, spectral clustering (SC) [29], Gaussian mixture models (GMMs) [30], and Representative Point-based Clustering with Neighborhood Information (RPC-NI) [31]. In addition, we compare our model with state-of-the-art clustering methods based on deep learning, including deep embedded clustering (DEC), deep clustering with convolutional autoencoders (DCEC) [10], K-deep-autoencoder (K-DAE) [12], structural deep clustering network (SDCN) [32], learning the precise feature for cluster assignment (LPFCA) [33], deep spectral clustering using dual autoencoder network (DSCDAN) [34], Not too deep clustering (N2D) [35], bi-directional discriminative representation learning clustering (BDRC) [36], and the Deep Embedding Clustering algorithm based on Residual Autoencoder (DECRA) [14]. For a fair comparison, the selected deep clustering algorithms are all simple network models, most of them autoencoders. Their loss functions are mainly based on reconstruction loss and KL divergence. DEC, DCEC, K-DAE, SDCN, DSCDAN, N2D, and BDRC are all autoencoder structures, so their loss functions include a reconstruction loss. DEC, DCEC, SDCN, DSCDAN, and BDRC all use KL divergence as the clustering loss. K-DAE trains K autoencoders, and the group with the smallest reconstruction loss is selected as the best clustering result. SDCN adds a graph neural network to the fully connected autoencoder to obtain the structural information of the data. N2D applies manifold learning to the embedded features to protect the manifold structure of the data. Both DSCDAN and BDRC have dual decoders and add mutual information to learn more discriminative representations. Baseline results were cited from their respective original papers, ensuring consistent evaluation settings.
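The KL-divergence clustering loss shared by the DEC-style baselines can be illustrated with a short NumPy sketch. It follows the formulation published for DEC (Student's t soft assignments and a sharpened target distribution). The function names are hypothetical, and individual baselines may differ in details such as the degrees of freedom `alpha`.

```python
import numpy as np


def soft_assignment(z, centers, alpha=1.0):
    """Student's t similarity q_ij between embeddings z_i and centers mu_j."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)  # each row sums to 1


def target_distribution(q):
    """Sharpened targets p_ij proportional to q_ij^2 / sum_i q_ij."""
    weight = q ** 2 / q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over samples and clusters."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Minimizing this loss pulls each embedding toward the center it is already most confident about, which is why the quality of the initial assignments matters for these methods.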
4.4. Evaluation Metrics
Accuracy (ACC) measures clustering quality as the best match between the true labels and the cluster labels. The mathematical formula is shown in (19), where the ground-truth label of each sample is compared with its predicted cluster assignment, n is the total number of samples, and m denotes the optimal mapping function between cluster assignments and labels.
Normalized Mutual Information (NMI) measures the degree of correlation between two random variables. The mathematical formula of NMI is shown in (20), where I is the mutual information, Y is the true label, C is the cluster label, and H is the entropy.
Adjusted Rand Index (ARI) measures the similarity between the clustering results and the true labels. The mathematical formula of ARI is shown in (21), where RI represents the Rand index, E[RI] represents the expected value of the Rand index when the clustering is purely random, and max(RI) represents the maximum possible value of the Rand index.
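The three metrics can be computed from a contingency table, as in the sketch below. It makes two assumptions: the optimal mapping m in ACC is found with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`), and NMI is normalized by the arithmetic mean of the two entropies, which is common in the deep clustering literature but may differ from the normalization used in (20). The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency(y_true, y_pred):
    """Count matrix: rows are predicted clusters, columns are true labels."""
    _, c = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    _, y = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    table = np.zeros((c.max() + 1, y.max() + 1), dtype=np.int64)
    np.add.at(table, (c, y), 1)
    return table


def clustering_acc(y_true, y_pred):
    """ACC: fraction of samples matched under the best cluster-to-label map."""
    table = contingency(y_true, y_pred)
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / table.sum()


def _entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def nmi(y_true, y_pred):
    """NMI = I(Y; C) / (0.5 * (H(Y) + H(C))); the normalization is assumed."""
    table = contingency(y_true, y_pred).astype(float)
    joint = table / table.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mutual_info = float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())
    mean_entropy = 0.5 * (_entropy(table.sum(axis=1)) + _entropy(table.sum(axis=0)))
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


def ari(y_true, y_pred):
    """ARI = (RI - E[RI]) / (max(RI) - E[RI]), written with pair counts."""
    table = contingency(y_true, y_pred).astype(float)

    def pairs(x):
        return x * (x - 1) / 2.0

    index = pairs(table).sum()
    row_pairs = pairs(table.sum(axis=1)).sum()
    col_pairs = pairs(table.sum(axis=0)).sum()
    expected = row_pairs * col_pairs / pairs(table.sum())
    maximum = 0.5 * (row_pairs + col_pairs)
    return 1.0 if maximum == expected else (index - expected) / (maximum - expected)
```

All three functions are invariant to a relabeling of the clusters, so a prediction that merely permutes the label names still scores 1.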
4.5. Results Analysis
We compare our method with 13 clustering methods, comprising four conventional clustering methods and nine state-of-the-art deep clustering methods. The clustering results are shown in Table 3. Bold and underlined results indicate the best and second-best clustering performance on each dataset, respectively. The “-” indicates that the source code was not available. DRSE+ denotes the clustering results obtained with the pre-training stage.
It should be noted that, to keep the comparison as fair as possible, the performance metrics of the baseline methods in Table 3 are cited directly from their original publications. Since the official source code for several baselines is not publicly available, reporting standard deviations across multiple runs for all methods was not feasible. However, the substantial performance margins achieved by our method, particularly on the COIL20 and USPS datasets, indicate the robustness and superiority of the proposed framework.
From
Table 3, we have the following observations and conclusions:
First, deep clustering methods significantly outperform traditional ones, highlighting the importance of effective representation learning. For instance, DEC surpasses all conventional clustering methods by over 10% on large-scale datasets. However, on small-sample datasets like COIL20, the traditional SC algorithm still performs well, even outperforming several deep methods (DEC, DCEC, K-DAE, SDCN) on NMI, suggesting that classical methods remain valuable in limited-data scenarios.
Second, the proposed DRSE method achieves promising performance on all datasets, which supports its effectiveness. On the MNIST datasets, most deep clustering methods achieve strong clustering performance because the images are simple and clean. In particular, DSCDAN achieves state-of-the-art results on MNIST, benefiting from the dataset’s high quality and its use of spectral clustering. It is worth noting that even though DRSE+ uses the simplest K-means method to initialize the network, its ACC is lower by less than 0.3%. Moreover, on more complex image datasets such as USPS and COIL20, DRSE+’s feature extraction capability is evident: its ACC is higher by nearly 24% and 15%, respectively. Overall, the clustering performance of the proposed DRSE method is superior on each dataset. On COIL20, DRSE is the only method whose clustering performance exceeds 0.92, and it outperforms the recently proposed BDRC by 6%. Because Letters consists of color images, most deep clustering algorithms have an advantage on it, especially in terms of the NMI and ARI metrics. However, these algorithms are not robust across metrics: some scores are high and others are low, and only DRSE achieves the highest result on every metric. The results indicate that DRSE generalizes well, which we attribute to the Siamese encoders.
Third, DRSE performs nearly as well as DRSE+ across three datasets, despite requiring no pre-training. This is due to the inclusion of a discriminative representation learning module, which enables the network to learn compact, discriminative features end-to-end. The Siamese encoder ensures robust representations and reliable pseudo-labels, allowing the self-supervised clustering module to optimize effectively without pre-training. Notably, DRSE still achieves the best results on COIL20, where small sample sizes typically hinder deep clustering methods.
To demonstrate the performance of the algorithm more clearly, the error rates of DRSE and several state-of-the-art (SOTA) methods are reported on three datasets, along with the Relative Error Reduction (REC). The experimental results are shown in Table 4. Since almost all comparison algorithms achieve high results on the MNIST dataset, the decrease in error rate there is limited. However, the error rates on the USPS and COIL20 datasets both dropped by more than 20% in relative terms. On the COIL20 dataset in particular, the REC of ACC reached 68.8% and the REC of ARI reached 45.5%. In addition, a stacked bar chart of the error rates across all three datasets is presented in Figure 4. As shown, DRSE consistently achieves lower error rates across all three metrics compared with the BDRC method, demonstrating its superior overall performance.
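The relative error reduction can be sketched as follows. This assumes the usual definition: the error rate is one minus the metric value, and the reduction is measured relative to the baseline error. The text does not state the formula, so this is an assumption, and the numbers in the example are illustrative placeholders, not values from Table 4.

```python
def error_rate(score):
    """Error rate of a metric reported on a 0-1 scale (assumed: 1 - score)."""
    return 1.0 - score


def relative_error_reduction(baseline_score, new_score):
    """Fraction of the baseline error removed by the new method."""
    base_err = error_rate(baseline_score)
    if base_err == 0:
        raise ValueError("baseline has zero error; the reduction is undefined")
    return (base_err - error_rate(new_score)) / base_err


# Illustrative only: a baseline ACC of 0.90 and a new ACC of 0.95
# halve the error rate, so the relative reduction is about 0.5.
print(relative_error_reduction(0.90, 0.95))
```

Under this definition, a small absolute gain on a nearly saturated dataset such as MNIST can still correspond to a sizeable relative reduction, while the same absolute gain on a harder dataset yields a smaller one.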
To show the clustering results intuitively, we draw a stacked bar chart of the improvement of DRSE over DCEC, as shown in Figure 5. In addition, we conduct visualization experiments on the MNIST-test, USPS, and COIL20 datasets. For comparison and analysis, we also provide visualization results of DCEC. The visualization results are shown in Figure 6, Figure 7, and Figure 8.
Figure 5 shows the cumulative clustering performance of DCEC and DRSE on four datasets. There is a large gap between the results of DCEC and DRSE, which demonstrates that our contributions have a significant positive effect on clustering performance.
Figure 6, Figure 7, and Figure 8 show that, with DRSE, the samples within each cluster are more compact and the samples of different clusters are farther apart. This demonstrates the positive impact of the discriminative representation learning layer. In addition, the introduced manifold learning layer better preserves the data structure and filters redundant information, which is evident on the COIL20 dataset.
4.6. Ablation Study
An ablation study is conducted to assess the impact of the Siamese encoders, manifold learning, and the discriminative representation learning layer on model performance. To verify the effectiveness of the UMAP method used in our model, the method is compared with t-SNE using three evaluation metrics. The impact of center loss on the convergence behavior of the model is also investigated.
4.6.1. Contributions of Each Module
In this section, the contribution of each module is analyzed in detail. DRSE mainly contains two components: Siamese encoders (SE) and representation learning (RL). The representation learning module contains a manifold learning layer and a discriminative representation learning layer. The impact of each module on clustering performance is detailed in Table 5. The bold font indicates the best clustering performance. For comparison, the experiments are performed with the pre-training stage.
Representation learning aims to learn a clustering-“friendly” representation. It has two goals: the first is to filter redundant feature information while protecting the manifold structure of the data, and the second is to employ center loss to increase intra-cluster homogeneity. The results show that this module plays a pivotal role in the clustering results. On the USPS dataset, both ACC and ARI increase by more than 10%. Most importantly, on the COIL20 dataset, the gains in ACC and ARI exceed 27.1% and 30.3%, respectively. Siamese encoders aim to improve the generalization of the model and learn a robust representation. Even on the simple and clean MNIST-test dataset, ACC, NMI, and ARI improve by nearly 2%, 3%, and 4%, respectively. Because the COIL20 dataset has a limited number of samples, the role of the Siamese encoders becomes crucial there, and the most significant performance improvements are observed on this dataset: ACC, NMI, and ARI improve by nearly 7%, 3%, and 6%, respectively.
Furthermore, it is highly unlikely to obtain comparable results using a simple encoder trained with a standard cross-entropy loss on pseudo labels. A simple encoder lacks the structural regularization provided by the Siamese consistency. More importantly, applying standard cross-entropy to pseudo labels inevitably leads to confirmation bias. Since the predicted labels are noisy during the initial training phases, the network will overfit to these errors and amplify them. Our framework mitigates this issue by jointly leveraging the Siamese consistency to learn inherent data representations independent of labels and the Center Loss to pull samples to cluster centers directly in the metric space. This joint strategy ensures a much more accurate and stable convergence than a simple cross entropy baseline.
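The pull toward cluster centers can be illustrated with a minimal NumPy sketch of a center loss. It assumes the common form, half the mean squared distance between each embedding and the center of its pseudo-label, which may differ in scaling from the loss defined earlier in the paper. The function names are hypothetical, and `centers_from_labels` assumes every cluster has at least one sample.

```python
import numpy as np


def centers_from_labels(embeddings, pseudo_labels, n_clusters):
    """Mean embedding of each pseudo-labeled cluster (assumes none is empty)."""
    return np.stack([embeddings[pseudo_labels == k].mean(axis=0)
                     for k in range(n_clusters)])


def center_loss(embeddings, pseudo_labels, centers):
    """Half the mean squared distance of each embedding to its assigned center."""
    diffs = embeddings - centers[pseudo_labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Unlike a cross-entropy loss on pseudo labels, this term acts directly on distances in the embedding space, so a mislabeled sample contributes a large penalty only when it is also far from the center it was assigned to.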
4.6.2. Clustering Time Analysis of Each Module
To compare the clustering time of each module, we conducted an ablation study on clustering time. The experimental results are shown in Table 6. Table 6 shows that DRSE w/o RL has the shortest clustering time and that the gap between DRSE and DRSE w/o SE is small. This indicates that the discriminative representation learning branch increases the clustering time by about 30 s, which is acceptable. More importantly, DRSE w/o SE improves ACC by more than 10% compared with DRSE w/o RL on the USPS and COIL20 datasets, which shows that DRSE achieves a good balance between clustering performance and time complexity.
4.6.3. Comparison of Different Manifold Learning Methods
To compare the manifold learning methods t-SNE and UMAP, clustering results on the USPS dataset are visualized, as shown in Figure 9. A quantitative analysis is also conducted on the MNIST-test, USPS, and COIL20 datasets. The clustering results are shown in Table 7.
Figure 9 shows that both methods group similar samples into the same clusters, while with UMAP the clusters of different categories are farther apart and the samples of the same category are more compact. This indicates that both methods protect the local structure of the data, and that UMAP protects the global structure as well.
From
Table 7, the following observations can be made: First, UMAP outperforms t-SNE on all three datasets. Second, due to the small number of samples and a larger number of categories in the COIL20 dataset, higher requirements are placed on the deep clustering model. Because of this, the advantages of the UMAP algorithm are highlighted. The three metrics have been significantly improved. To sum up, we choose the UMAP algorithm in our model.
This visual evidence confirms our theoretical motivation discussed earlier. UMAP protects not only the local neighborhood but also the meaningful global distances between different clusters. This synergy with the Center Loss explains the quantitative superiority of UMAP over t-SNE across all metrics in
Table 7. The bold font indicates the best clustering performance.
4.6.4. Analysis of Balance Parameters
The performance of DRSE is further examined under different settings of the balance parameters. The purpose of the reconstruction loss is to make the features learned by the autoencoder as consistent as possible with the data, and the purpose of the KL divergence is to make the probability distribution of the features learned by the encoder as consistent as possible with the data. The goals of these two loss functions are consistent, and they have the same trend and magnitude. We use TensorBoard to visualize the two losses, and the results are shown in
Figure 10. The horizontal axis is “RELATIVE”, which represents training time. The vertical axis is the value of loss. From
Figure 10, it can be found that the two losses change in the same order of magnitude within a relative period of time, between
and
. Hence, their coefficients are fixed at 1.
The reconstruction loss and the clustering loss are the basis of deep joint clustering. More importantly, the deep metric learning loss improves discrimination through intra-cluster compactness. It can be treated as an effective cluster-promoting regularizer for the joint clustering problem. We tune its balance parameter in the range {0.2, 0.4, 0.6, 0.8, 1, 10}. The effect of the parameter on the USPS dataset is shown in Figure 11. As Figure 11 shows, DRSE maintains promising results when the parameter is relatively small. Accordingly, this balance parameter is also set to 1.
It is important to note that the tuning of the balance parameter was not dataset-specific. As illustrated in Figure 11, the model exhibits strong robustness and maintains stable clustering performance across the evaluated range of {0.2, 0.4, 0.6, 0.8, 1}. Due to this low sensitivity, we did not perform complex per-dataset fine-tuning; instead, we uniformly adopted a single, consistent value across all experiments and datasets to demonstrate the generalization capability of the proposed framework.
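The weighting described in this subsection can be summarized by a combined objective of the following form. The symbols are placeholders chosen for illustration, since the original notation is not reproduced here: $\mathcal{L}_{r}$ is the reconstruction loss, $\mathcal{L}_{c}$ the KL-divergence clustering loss, $\mathcal{L}_{m}$ the deep metric learning loss, and $\lambda$ the balance parameter. Any further terms defined earlier in the paper are omitted.

```latex
\mathcal{L} \;=\; \mathcal{L}_{r} \;+\; \mathcal{L}_{c} \;+\; \lambda\,\mathcal{L}_{m},
\qquad \lambda \in \{0.2,\ 0.4,\ 0.6,\ 0.8,\ 1,\ 10\}
```

The coefficients of the first two terms are fixed at 1 because the two losses evolve on the same order of magnitude, so only $\lambda$ needs to be examined.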
4.6.5. Convergence Analysis
To verify model convergence, the variations in the three evaluation metrics during the optimization process are analyzed. In addition, the impact of center loss on clustering performance is reported. In each figure, the first column shows DRSE, and the second shows DRSE without center loss (DRSE_cl). Note that these are the convergence results without pre-training.
As illustrated in Figure 12, Figure 13, and Figure 14, DRSE converges quickly on all datasets: the metrics increase rapidly in the first few epochs and then grow slowly. On MNIST-test, DRSE converges at the same speed as DRSE_cl and the metric values improve only slightly, because the images in this dataset are clear and convergence is easy. However, on the USPS dataset, DRSE converges nearly 20 epochs faster than DRSE_cl. On the COIL20 dataset, ACC and ARI are nearly 10% higher and NMI is nearly 5% higher than with DRSE_cl, and the results of DRSE are also more stable. These results show that discriminative representation learning is critical to the clustering results. A discriminative representation enables fast clustering, which in turn leads the network to fast convergence.