Discriminative Representation Learning for Fast and Accurate Clustering

Hou, Haiwei; Wang, Lijuan

doi:10.3390/app16062887


by Haiwei Hou ¹ and Lijuan Wang ¹,²,*

¹ School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
² School of Information Engineering, Xuzhou College of Industrial Technology, Xuzhou 221400, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(6), 2887; https://doi.org/10.3390/app16062887

Submission received: 15 February 2026 / Revised: 11 March 2026 / Accepted: 11 March 2026 / Published: 17 March 2026

(This article belongs to the Special Issue Advances in AI Platform Infrastructure: Databases, Knowledge Management and Hardware Acceleration)


Abstract

Deep clustering aims to boost clustering performance by learning powerful representations via deep learning. Despite their superiority over conventional shallow algorithms, autoencoder-based methods are typically hindered by heavy dependencies on large datasets and computationally expensive pre-training phases. Moreover, they often struggle to learn representations that are sufficiently discriminative for complex clustering tasks. To bridge this gap, we introduce a novel discriminative clustering framework utilizing Siamese encoders. By jointly training a Siamese encoder and a discriminative learning module, our method simultaneously captures robust features from data augmentations and imposes intra-cluster compactness. This dual optimization yields highly discriminative representations, which obviates the necessity for pre-training while ensuring rapid convergence and high accuracy. Extensive experiments on multiple benchmarks validate the superiority of our approach over state-of-the-art baselines.

Keywords:

machine learning; deep clustering; image clustering; self-supervised deep clustering

1. Introduction

In the era of big data, mining the latent value of data through classification has become a focal point of machine learning. It has been widely used in data security, medical information, and personalized recommendations [1]. Labeling data is a time-consuming and laborious task, especially given the current scale of data. To address this challenge, clustering techniques have been extensively studied and applied. Clustering aims to divide a set of data into clusters according to similarity: samples within a cluster are similar, while samples from different clusters are dissimilar [2]. There are various clustering algorithms for different datasets and different applications. Traditional clustering algorithms can be divided into partition-based clustering algorithms, such as K-means [3]; hierarchical clustering algorithms, such as agglomerative clustering [4]; grid-based clustering algorithms, such as the Statistical Information Grid (STING) algorithm [5]; density-based clustering algorithms, such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [6]; and graph-based clustering algorithms, such as the spectral clustering algorithm [7]. At present, data are increasingly high-dimensional and heterogeneous. With the advancement of deep learning, neural networks have been incorporated into clustering, leading to substantial improvements in performance. Consequently, deep clustering has attracted considerable attention in recent years.

Autoencoder-based methods have been widely explored in deep clustering owing to their capacity for unsupervised representation learning. Among them, Deep Embedded Clustering (DEC) [8] is a seminal work that jointly optimizes feature learning and clustering assignments via a KL-divergence loss, establishing a foundation for subsequent deep clustering approaches [9,10,11,12,13]. Recently, several methods have been proposed to enhance representation quality or simplify training procedures. Deep Embedded Clustering with Residual Autoencoder (DECRA) [14] introduces a residual autoencoder with adaptive weighting to improve robustness and generalization. Deep Embedding clustering algorithm driven by sample Stability (DECS) [15] eliminates the reliance on pseudo-labels by enforcing stable sample-centroid relationships. Deep Clustering with Hypersphere Similarity Supervision (DCSS) [16] adopts a hypersphere-based clustering strategy and supervises pairwise similarities to better model latent structures. For high-dimensional data, the 3D Attention Convolutional Autoencoder (3D-ACAE) [17] uses a 3D Attention Convolutional Autoencoder to extract spatial-spectral features in hyperspectral image clustering. Despite their progress, most of these methods still require pre-training stages or lack mechanisms to learn discriminative representations effectively.

Although the aforementioned autoencoder-based methods have achieved significant clustering performance, they still face several limitations. First, poor discriminability of representations hinders clustering effectiveness. Second, autoencoder-based deep clustering methods generally require pre-training, which prevents end-to-end training. To address these issues, we propose a Discriminative Representation Learning clustering framework based on Siamese Encoders (DRSE), a joint training framework that combines Siamese encoders with a discriminative representation learning module. Specifically, the Siamese encoders learn robust representations, and the discriminative representation learning module promotes intra-cluster compactness to yield more discriminative representations. By producing robust and discriminative representations, the deep clustering model can converge rapidly without a pre-training stage and without degrading clustering performance. While pseudo-supervision has been explored in deep clustering, to the best of our knowledge, this work presents a novel configuration that directly integrates supervised metric learning objectives (specifically, center loss) with Siamese autoencoders and manifold learning. By strategically decoupling topological refinement at the epoch level from mini-batch metric learning constraints, this specific combination prevents trivial solutions and effectively tightens intra-cluster variance, all without requiring pre-training. The main contributions are summarized as follows:

A novel joint framework that eliminates the pre-training phase of autoencoders while enhancing clustering performance.
A representation learning module that effectively captures discriminative features. Redundant features are filtered while preserving the manifold structure of the embedding space. Subsequently, center loss is employed to improve intra-cluster compactness.
Extensive experiments on benchmark datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches, highlighting its effectiveness and generalization capability.

The remainder of this paper is organized as follows: Section 2 reviews the related works on autoencoder and deep metric learning. Section 3 describes the proposed DRSE framework. Section 4 presents the experimental results and analysis. Finally, the conclusion is reported in Section 5.

2. Related Works

This section provides a brief overview of relevant studies on autoencoders and deep metric learning, which serve as a foundation for the proposed approach. Furthermore, it is worth noting that the paradigm of unsupervised and self-supervised representation learning is rapidly expanding beyond traditional image datasets. Recent studies have demonstrated its significant potential in diverse data modalities, such as sensor-derived human movement data. For instance, recent work has successfully explored self-supervised representation learning for gait event detection using smartphone IMUs [18], highlighting the broad applicability of learning robust representations from unlabeled sensor signals.

2.1. Autoencoder for Clustering

The autoencoder was first proposed as a Multi-Layer Perceptron (MLP) for denoising. In the same period, Bourlard used the MLP autoencoder for dimensionality reduction [19]. Hinton proposed a deep autoencoder to improve representation learning performance [20]. To mitigate the curse of dimensionality in clustering tasks, a deep autoencoder usually embeds raw data into a low-dimensional embedding space, and a traditional clustering algorithm is then run in that space. The quality of the extracted features is crucial for clustering performance. Therefore, the primary goal of embedding is to learn the most relevant information from raw data. The deep autoencoder achieves this by minimizing the reconstruction loss. Specifically, the encoder performs nonlinear feature extraction and the decoder reconstructs the data from the embedding space. Researchers typically employ a greedy layer-wise algorithm to pre-train the autoencoder, enabling the network to extract useful features that are subsequently clustered using traditional algorithms. They then fine-tune the network and update its parameters using a specific clustering loss together with the reconstruction loss. The standard architecture of autoencoders for clustering is shown in Figure 1.

2.2. Deep Metric Learning

The final goal of clustering is to assign similar samples to the same cluster. Intuitively, increasing intra-cluster homogeneity improves clustering performance. Metric learning combines the characteristics of the data with the target problem to learn an effective distance measure [21]. Deep metric learning [22] learns a mapping from samples to features through a specific loss function. Under this mapping, the feature distance between samples within a class is minimized, and the feature distance between samples from different classes is maximized.

The first deep metric learning objective is the contrastive loss [23]. The contrastive loss only constrains the distance between intra-class samples to be short and the distance between inter-class samples to be long. The triplet loss [24] further constrains the relative relationship between an intra-class pair and an inter-class pair: it fixes an anchor sample and requires the feature distance to samples within the same class to be smaller than the distance to samples from other classes. However, the triplet loss has two problems. One is that, because triplets must be selected, the complexity for a dataset containing n samples can reach $O(n^{3})$. The other is that the triplet loss is sensitive to anchor selection. Next, the coupled cluster loss [25] was proposed, which clusters similar samples together and pushes dissimilar samples far apart. To make the sample features not only separable but also discriminative, the center loss [26] was proposed to gather similar features around a class center, thereby expanding the inter-class distance and reducing the intra-class distance to learn more discriminative features.
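To make the pairwise and triplet constraints discussed above concrete, the following minimal sketch implements common textbook forms of the contrastive loss and the triplet loss on plain Python lists. The margin value, the use of unsquared Euclidean distance in the triplet loss, and the function names are illustrative assumptions rather than details taken from the cited works.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, same_class, margin=1.0):
    """Pull same-class pairs together; push other pairs beyond the margin."""
    d = euclidean(a, b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Require d(anchor, positive) + margin <= d(anchor, negative)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy features: the positive lies close to the anchor, the negative far away.
anchor, positive, negative = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
pair_loss = contrastive_loss(anchor, positive, same_class=True)
satisfied = triplet_loss(anchor, positive, negative)   # constraint already met
violated = triplet_loss(anchor, negative, positive)    # roles swapped on purpose
```

Enumerating every (anchor, positive, negative) combination over n samples is what leads to the cubic growth mentioned above, which is one motivation for losses that use a single center per class.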

This paper leverages the center loss from deep metric learning to enhance feature discriminability, making the representations more conducive to clustering. However, deep metric learning is a supervised learning method. The clustering task involves unlabeled data, so we first obtain pseudo-labels. The two tasks complement each other.

3. The Proposed Method

As illustrated in Figure 2, the proposed DRSE framework comprises three core components: Siamese encoders, a discriminative representation learning module, and a self-supervised cluster prediction module. An original image and its augmented version are simultaneously passed through the Siamese encoder to obtain embedding features. Each branch serves a distinct purpose—one is optimized to maintain clustering consistency, while the other is tailored to enhance the discriminative power of the representations. To further refine the embeddings, manifold learning is employed to suppress redundant features while preserving the intrinsic structural characteristics of the data. On this foundation, a center loss is incorporated to improve the compactness and separability of feature clusters. The overall model is jointly trained with a composite objective that integrates reconstruction loss, clustering loss, and center loss, promoting the formation of compact and well-separated clusters.

3.1. Siamese Encoders for Clustering

Convolutional autoencoders can exploit the structural information of 2D images and extract hierarchical features from them, so we replace the fully connected autoencoders with convolutional autoencoders for image clustering. In addition, we introduce Siamese encoders into our model. Compared with traditional encoders, the Siamese encoder has the following advantages: (1) It can better capture the similarity between original images and transformed images, that is, the inherent representation of the data. (2) It can learn more robust representations. (3) It can improve the performance of clustering. The Siamese encoders share weights with each other, but their inputs differ: one receives the original images and the other receives randomly transformed images. Given the input

X = {x_{1}, x_{2}, x_{3}, \dots x_{n}}

, n is the number of samples. The randomly transformed image is defined by

\tilde{x} = T (x)

(1)

The encoder attempts to learn a function that maps the x and

\tilde{x}

into the embedding vectors

Z = {z_{1}, z_{2}, z_{3}, \dots z_{n}}

and

\tilde{Z} = {{\tilde{z}}_{1}, {\tilde{z}}_{2}, {\tilde{z}}_{3}, \dots {\tilde{z}}_{n}}

, which is defined as:

z = f_{w} (x)

(2)

\tilde{z} = f_{w} (\tilde{x})

(3)

The decoder attempts to learn a function that maps the learned embedding space back to the original input space

X^{'} = {x_{1}^{'}, x_{2}^{'}, x_{3}^{'}, \dots x_{n}^{'}}

, which is defined by

x^{'} = g_{U} (\tilde{z})

(4)

The decoder is parameterized by U and its mapping function is denoted as g. The reconstruction loss is computed as the mean squared error between the decoder’s output and the original input of the encoder, which is defined by

L_{r} = \frac{1}{n} \sum_{i = 1}^{n} {‖ x_{i}^{'} - {\tilde{x}}_{i} ‖}_{2}^{2}

(5)
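Equations (1)–(5) can be traced with a small sketch in which both branches reuse one encoder weight matrix. This is only an illustration under stated assumptions: the encoder and decoder are single linear maps instead of convolutional networks, the augmentation `augment` is additive noise, and all names are hypothetical.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def augment(x, rng, scale=0.05):
    """Hypothetical T(x): small additive noise stands in for image transforms."""
    return [xi + rng.uniform(-scale, scale) for xi in x]

def siamese_forward(x, W, U, rng):
    """Run both branches with the same encoder weights W, then decode."""
    x_aug = augment(x, rng)      # Eq. (1)
    z = matvec(W, x)             # Eq. (2): original branch
    z_aug = matvec(W, x_aug)     # Eq. (3): augmented branch, shared W
    x_rec = matvec(U, z_aug)     # Eq. (4): decode the augmented embedding
    return z, z_aug, x_aug, x_rec

def reconstruction_loss(X, W, U, rng):
    """Eq. (5): mean squared error between reconstruction and augmented input."""
    total = 0.0
    for x in X:
        _, _, x_aug, x_rec = siamese_forward(x, W, U, rng)
        total += sum((r - a) ** 2 for r, a in zip(x_rec, x_aug))
    return total / len(X)

X = [[0.2, 0.8], [0.9, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
loss_perfect = reconstruction_loss(X, identity, identity, random.Random(0))
loss_collapsed = reconstruction_loss(X, identity, zeros, random.Random(0))
```

With identity weights the decoder reproduces its input exactly, so the loss vanishes; a decoder that outputs zeros is penalized by the squared norm of the augmented input.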

3.2. Discriminative Representation Learning

Although the features extracted by the autoencoder contain valuable information, further embedding refinement is necessary for two main reasons. First, the embedding space may include redundant or highly similar features, which require filtering. Second, the extracted features often lack sufficient exploration of the underlying manifold structure. For these reasons, we incorporate a manifold learning layer into our framework.

t-distributed Stochastic Neighbor Embedding (t-SNE) [27] is one of the most widely used techniques for dimensionality reduction. It is a nonlinear method that focuses on preserving local distances. First, Stochastic Neighbor Embedding (SNE) converts the similarity between points (measured by Euclidean distance) into a conditional probability. Second, t-SNE replaces the optimization over conditional probabilities with an optimization over a joint distribution.

Uniform Manifold Approximation and Projection (UMAP) [28] is a more recently proposed manifold learning method. UMAP aims to make the data in the low-dimensional embedding space similar to the data in the high-dimensional space. To do so, a data graph is constructed in both the input space and the embedding space, and the difference between the two distributions is then minimized. UMAP and t-SNE stand out among manifold learning methods, with the following advantages over other methods: (a) they are suitable for high-dimensional data, and (b) they can capture nonlinear relationships in the data. Therefore, we have chosen these two methods for discussion. In Section 4, we compare these two methods.

We specifically select UMAP over other topology-preserving techniques for the bottleneck layer because it offers a superior balance of local and global structural preservation. While techniques like t-SNE excel at capturing nonlinear local relationships, they often distort the global structure and inter-cluster distances. UMAP captures local neighborhoods effectively while significantly better preserving the global topology. This global structural fidelity provides meaningful relative data density and a robust Euclidean distance metric. Therefore, it is highly compatible with our subsequent center loss optimization that relies on Euclidean distances to enforce intra-cluster compactness.

The first step is to construct the K-Nearest Neighbors graph of the data in the input space.

N_{i} = \{{\tilde{z}}_{i, 1}, \dots, {\tilde{z}}_{i, k}\}

is the set of neighbors of ${\tilde{z}}_{i}$, where k is the predefined number of neighbors. UMAP uses an exponentially decaying kernel to measure the similarity between data points, which is defined by

m_{j | i} = \begin{cases} \exp (- \frac{{‖ {\tilde{z}}_{i} - {\tilde{z}}_{j} ‖}_{2} - ρ_{i}}{σ_{i}}) & \text{if } {\tilde{z}}_{j} \in N_{i} \\ 0 & \text{otherwise} \end{cases}

(6)

m_{j | i}

indicates the conditional probability that

{\tilde{z}}_{j}

is a neighbor of

{\tilde{z}}_{i}

. The

ρ_{i}

is the distance from

{\tilde{z}}_{i}

to its nearest neighbor:

ρ_{i} = min {{‖ {\tilde{z}}_{i} - {\tilde{z}}_{i, j} ‖}_{2} | 1 \leq j \leq k}

(7)

To make the measure symmetric,

m_{j | i}

is symmetrized as:

m_{i j} = m_{j | i} + m_{i | j} - m_{j | i} m_{i | j}

(8)
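As a rough illustration of Equations (6)–(8), the sketch below builds the neighbor-restricted similarities and then symmetrizes them. One simplification is assumed throughout: the scale σ_i is a fixed constant, whereas UMAP calibrates it separately for each point. Function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def conditional_similarities(Z, k, sigma=1.0):
    """Return M with M[i][j] = m_{j|i}; entries outside the k-NN set stay 0."""
    n = len(Z)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: dist(Z[i], Z[j]))[:k]
        rho = dist(Z[i], Z[neighbours[0]])              # Eq. (7)
        for j in neighbours:
            M[i][j] = math.exp(-(dist(Z[i], Z[j]) - rho) / sigma)   # Eq. (6)
    return M

def symmetrize(M):
    """Eq. (8): m_ij = m_{j|i} + m_{i|j} - m_{j|i} * m_{i|j}."""
    n = len(M)
    return [[M[i][j] + M[j][i] - M[i][j] * M[j][i] for j in range(n)]
            for i in range(n)]

Z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
M = conditional_similarities(Z, k=2)
S = symmetrize(M)
```

Subtracting ρ_i gives the nearest neighbor of every point a similarity of exactly 1, so each point is connected to at least one other point regardless of local density.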

The second step is to construct the data graph in the embedding space. The probability that point

{\tilde{z}}_{d j}

is a neighbor of point

{\tilde{z}}_{d i}

can be calculated by the similarity between the two points:

e_{i j} = {(1 + a {‖ {\tilde{z}}_{d i} - {\tilde{z}}_{d j} ‖}_{2}^{2 b})}^{- 1}

(9)

e_{i j}

is symmetric with respect to i and j. As a rule of thumb, a and b are set to 1. Finally, UMAP uses the fuzzy cross-entropy to measure the difference between the two graph similarities, which is defined as:

f_{c e} = \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} (m_{i j} ln (\frac{m_{i j}}{e_{i j}}) + (1 - m_{i j}) ln (\frac{1 - m_{i j}}{1 - e_{i j}}))

(10)
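Equations (9) and (10) can be checked numerically with the following sketch. The clipping constant `eps` is an assumption added for numerical safety and is not part of the formulation above; the function names are illustrative.

```python
import math

def low_dim_similarity(zi, zj, a=1.0, b=1.0):
    """Eq. (9): e_ij = 1 / (1 + a * ||zi - zj||_2^(2b))."""
    squared = sum((x - y) ** 2 for x, y in zip(zi, zj))
    return 1.0 / (1.0 + a * squared ** b)

def fuzzy_cross_entropy(M, E, eps=1e-12):
    """Eq. (10): compare input-space similarities M with embedding similarities E."""
    n = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Clip to (0, 1) so that every logarithm is defined.
            m = min(max(M[i][j], eps), 1.0 - eps)
            e = min(max(E[i][j], eps), 1.0 - eps)
            total += m * math.log(m / e) + (1.0 - m) * math.log((1.0 - m) / (1.0 - e))
    return total

M = [[0.0, 0.9], [0.9, 0.0]]        # two points that are neighbours in the input space
E_close = [[0.0, 0.9], [0.9, 0.0]]  # embedding keeps them close
E_far = [[0.0, 0.1], [0.1, 0.0]]    # embedding pushes them apart
loss_close = fuzzy_cross_entropy(M, E_close)
loss_far = fuzzy_cross_entropy(M, E_far)
```

The first term of the sum penalizes neighbors that are placed far apart, while the second term penalizes non-neighbors that are placed close together.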

Clustering aims to group similar samples into the same cluster. A natural way to enhance clustering performance is to increase intra-cluster compactness. To this end, deep metric learning is incorporated to encourage feature representations to aggregate around their corresponding cluster centers. This work introduces a novel configuration of a supervised metric learning strategy into an unsupervised clustering framework. The core idea is to assign a learnable center to each cluster and iteratively minimize the distance between features and their associated centers during optimization, while simultaneously updating the center positions. The corresponding loss function is defined as follows:

L_{d} = \frac{1}{2} \sum_{i = 1}^{m} {‖ {\tilde{z}}_{d i} - u_{y_{i}} ‖}_{2}^{2}

(11)

${\tilde{z}}_{d i}$ represents the embedded feature, $u_{y_{i}}$ denotes the center of the $y_{i}$-th class, and m indicates the mini-batch size.

Note that the class label $y_{i}$ must be known before the center loss can be computed. To obtain more reliable pseudo-labels, we add the manifold learning layer to reduce the dimensionality and explore the manifold structure of the data. Then, we perform the K-means algorithm on the embedding space ${\tilde{z}}_{d i}$ to obtain the pseudo-label (class).

Ideally, updating cluster centers requires averaging the features of each class over the entire dataset. However, this approach becomes impractical for large-scale datasets. To address this, the update is instead performed using samples within each mini-batch. The revised update mechanism is defined in Equations (12) and (13).

\frac{\partial L_{d}}{\partial {\tilde{z}}_{d i}} = {\tilde{z}}_{d i} - u_{y_{i}}

(12)

Δ u_{j} = \frac{\sum_{i = 1}^{m} δ (y_{i} = j) (u_{j} - {\tilde{z}}_{d i})}{1 + \sum_{i = 1}^{m} δ (y_{i} = j)}

(13)

where $δ (\text{condition}) = 1$ if the condition is satisfied, and $δ (\text{condition}) = 0$ otherwise.
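The center loss of Equation (11) and the mini-batch update of Equations (12) and (13) can be sketched as follows. The step size `lr` and the rule "new center = old center minus lr times delta" follow the original center loss formulation and are assumptions here, since the text above does not state them; the function names are illustrative.

```python
def center_loss(Z, labels, centers):
    """Eq. (11): half the summed squared distance of each feature to its center."""
    return 0.5 * sum(
        sum((z - u) ** 2 for z, u in zip(Z[i], centers[labels[i]]))
        for i in range(len(Z))
    )

def center_gradient(z, center):
    """Eq. (12): gradient of the loss with respect to one embedded feature."""
    return [zi - ui for zi, ui in zip(z, center)]

def center_deltas(Z, labels, centers):
    """Eq. (13): mini-batch update direction for every center."""
    deltas = []
    for j, u in enumerate(centers):
        members = [Z[i] for i in range(len(Z)) if labels[i] == j]
        # The "1 +" in the denominator keeps the update finite for empty clusters.
        deltas.append([sum(u[d] - z[d] for z in members) / (1 + len(members))
                       for d in range(len(u))])
    return deltas

def update_centers(centers, deltas, lr=0.5):
    """Assumed update rule: move each center against its delta with step lr."""
    return [[u - lr * d for u, d in zip(center, delta)]
            for center, delta in zip(centers, deltas)]

Z = [[1.0, 0.0], [3.0, 0.0], [10.0, 10.0]]   # mini-batch of embedded features
labels = [0, 0, 1]                            # pseudo-labels
centers = [[0.0, 0.0], [10.0, 10.0]]
loss_before = center_loss(Z, labels, centers)
centers_new = update_centers(centers, center_deltas(Z, labels, centers))
loss_after = center_loss(Z, labels, centers_new)
```

In this toy batch the first center moves toward the mean of its two members, which lowers the loss, while the second center already coincides with its only member and stays in place.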

To clarify the interaction among these components within the training pipeline, UMAP is applied periodically at the beginning of each epoch to map the original latent embeddings z into a refined lower-dimensional space

{\tilde{z}}_{d}

. K-means clustering is then performed on this entire refined space to generate reliable pseudo-labels y and cluster centers u. Finally, during the mini-batch optimization, the center loss utilizes these generated pseudo-labels and centers to continuously pull the augmented embeddings

\tilde{z}

toward their corresponding targets. This specific design ensures that UMAP provides stable structural guidance for clustering without participating directly in the computationally heavy mini-batch backpropagation of the network.

3.3. Self-Supervised Cluster Prediction

We employ the Student’s t-distribution as the cluster prediction, which is defined by

q_{i j} = \frac{{(1 + {‖ {\tilde{z}}_{d i} - u_{j} ‖}^{2} / α)}^{- \frac{α + 1}{2}}}{\sum_{j^{'}} {(1 + {‖ {\tilde{z}}_{d i} - u_{j^{'}} ‖}^{2} / α)}^{- \frac{α + 1}{2}}}

(14)

{\tilde{q}}_{i j} = \frac{{(1 + {‖ {\tilde{z}}_{d i} - {\tilde{u}}_{j} ‖}^{2} / α)}^{- \frac{α + 1}{2}}}{\sum_{j^{'}} {(1 + {‖ {\tilde{z}}_{d i} - {\tilde{u}}_{j^{'}} ‖}^{2} / α)}^{- \frac{α + 1}{2}}}

(15)

$q_{i j}$ and ${\tilde{q}}_{i j}$ indicate the probability of assigning sample i to cluster j, $z_{i}$ and ${\tilde{z}}_{d i}$ are the embedding points, and $u_{j}$ and ${\tilde{u}}_{j}$ are the initial cluster centroids. $α$ is the degrees of freedom of the Student’s t-distribution, which is set to 1 in all experiments.

In clustering tasks, achieving unsupervised training is essential. Therefore, we optimize a KL divergence objective function to train the network. In the case of Siamese encoders, we follow two principles: First, the cluster prediction should be as accurate as possible. Second, clusters are formed by learning from high-confidence assignments. Therefore, we select the encoder of the original image to generate the target distribution. The target distribution P is defined by

p_{i j} = \frac{q_{i j}^{2} / \sum_{i} q_{i j}}{\sum_{j^{'}} (q_{i j^{'}}^{2} / \sum_{i} q_{i j^{'}})}

(16)

The target distribution guides the cluster prediction more reliably through its own high-confidence predictions. So the KL divergence between cluster prediction and target distributions is defined by

L_{c} = KL (P ‖ \tilde{Q}) = \sum_{i} \sum_{j} p_{i j} \log \frac{p_{i j}}{{\tilde{q}}_{i j}}

(17)
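The chain from soft assignments to the clustering loss in Equations (14)–(17) is sketched below on toy data. For brevity the same embeddings are used to compute both P and the prediction, whereas the method described above derives P from the original-image branch and compares it with the augmented-branch prediction; the function names are illustrative.

```python
import math

def soft_assign(Z, centers, alpha=1.0):
    """Eqs. (14)/(15): Student's t kernel between embeddings and centroids."""
    Q = []
    for z in Z:
        weights = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(z, u)) / alpha)
            ** (-(alpha + 1.0) / 2.0)
            for u in centers
        ]
        total = sum(weights)
        Q.append([w / total for w in weights])
    return Q

def target_distribution(Q):
    """Eq. (16): square the assignments and normalise by soft cluster frequency."""
    n, c = len(Q), len(Q[0])
    freq = [sum(Q[i][j] for i in range(n)) for j in range(c)]
    P = []
    for i in range(n):
        weights = [Q[i][j] ** 2 / freq[j] for j in range(c)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def kl_divergence(P, Q):
    """Eq. (17): KL(P || Q) summed over samples and clusters."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)

Z = [[0.0, 0.0], [0.5, 0.0], [4.0, 0.0], [4.5, 0.0]]
centers = [[0.0, 0.0], [4.0, 0.0]]
Q = soft_assign(Z, centers)
P = target_distribution(Q)
loss = kl_divergence(P, Q)
```

Squaring sharpens confident assignments, and dividing by the soft cluster frequency down-weights clusters that already attract many samples, which is the normalization that discourages a single dominant cluster.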

To summarize, the overall objective function is formulated as follows:

L = L_{r} + L_{c} + λ L_{d}

(18)

A common challenge in deep clustering without pre-training is the risk of the model collapsing to a trivial solution, such as assigning all samples to a single cluster. Our framework naturally avoids this through the joint optimization objective. Specifically, the target distribution P in the KL divergence incorporates a cluster frequency normalization term. This inherently penalizes overly large clusters and forces the model to maintain a balanced assignment distribution. Concurrently, while the target distribution ensures cluster diversity, the Center Loss prevents spatial collapse by pulling samples toward their respective, distinctly updated local centers. Furthermore, the consistency learning from the Siamese encoder acts as a structural regularizer, ensuring the feature space retains meaningful semantic variations rather than degenerating into a single point.

To summarize the overall training pipeline, the model operates through the following sequence in each epoch: (1) Embedding extraction, where Siamese autoencoders generate robust representations from the augmented data; (2) Manifold refinement, where UMAP is periodically applied to preserve topological structures and filter redundant features; (3) Clustering and Pseudo-labels generation, where K-means is applied to the refined embeddings to produce cluster assignments; and finally, (4) Mini-batch optimization, where the network parameters are updated via gradient descent using the combined reconstruction, clustering, and center loss objectives. The full training process is summarized in Algorithm 1.

Algorithm 1 DRSE.

Input: Dataset X; batch size m; cluster number c; number of nearest neighbors k; training epochs E
Output: Cluster assignments s

for $epoch = 1$ to E do
    Sample a mini-batch ${x_{i}}_{i = 1}^{m}$ from X
    Apply random augmentation through Equation (1)
    Compute embedding features through Equations (2) and (3)
    Compute reconstruction loss through Equation (5)
    Filter redundant features through Equations (6)–(10)
    Apply K-means on ${\tilde{z}}_{d i}$ to obtain pseudo-labels $y_{i}$ and cluster centers u
    Compute center loss through Equation (11)
    Compute clustering loss through Equations (14)–(17)
    Compute overall loss L through Equation (18)
    Update parameters via gradient descent to minimize L
end for
return s
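The pseudo-label step of Algorithm 1 can be illustrated with the following simplified sketch. It rests on two loudly stated assumptions: the manifold refinement (UMAP in the paper) is replaced by an identity placeholder named `refine`, and K-means is a plain Lloyd iteration initialized deterministically from the first k points. Neither choice is taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def refine(Z):
    """Placeholder for the manifold learning layer; returns Z unchanged."""
    return Z

def kmeans(Z, k, iterations=20):
    """Plain Lloyd's algorithm with the first k points as initial centers."""
    centers = [list(z) for z in Z[:k]]
    labels = [0] * len(Z)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: squared_distance(z, centers[j]))
                  for z in Z]
        # Update step: mean of the members; empty clusters keep their center.
        for j in range(k):
            members = [z for z, y in zip(Z, labels) if y == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def pseudo_labels_for_epoch(Z, k):
    """Refine the embeddings, then cluster them to obtain labels and centers."""
    return kmeans(refine(Z), k)

Z = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
labels, centers = pseudo_labels_for_epoch(Z, k=2)
```

The returned labels and centers play the roles of $y_{i}$ and u in the center loss, and they are regenerated each time the refinement step is repeated.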

4. Experiment

This section first presents the experimental settings and the datasets used for evaluation. Then, the comparison methods and evaluation metrics are introduced. Finally, a series of comprehensive experiments are conducted to assess the effectiveness of the proposed method.

4.1. Experimental Settings

The proposed method consists mainly of the Siamese autoencoder, unsupervised metric learning, and a manifold learning layer. The architectures of the encoder and decoder are detailed in Table 1. The encoder stacks three convolutional layers and one dense layer. The other Siamese encoder shares exactly the same structure. The decoder is a mirrored version of the encoder. The size of the Dense1 layer is the number of clusters. The size of the Dense2 layer is $(dim ∖ 8) \times (dim ∖ 8) \times 128$.

The manifold learning layer performs the UMAP method. We set the UMAP parameters as follows: n_neighbors = 10, min_dist = 0.01, metric = ‘euclidean’, and n_components = 2. These settings were specifically chosen to capture local structure by considering the 10 nearest neighbors, ensure a well-distributed representation with a minimum distance of 0.01, employ the Euclidean metric for distance calculation, and visualize the data in a two-dimensional space. In the case of no pre-training, we iterate 300 times. In the case of pre-training, we pre-train the autoencoder for 200 epochs and fine-tune for 50 epochs. For both pre-training and fine-tuning of the encoder, the mini-batch size is set to 256. To ensure a fair and robust evaluation, considering the variability inherent in deep clustering initializations, all reported performance metrics for our proposed DRSE method are the average results of 5 independent runs with random initializations.
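The UMAP settings listed above map directly onto constructor arguments of the umap-learn Python package. Whether the authors used that package is an assumption; the sketch below only records the stated values and imports the library lazily so the settings can be inspected without it installed. The helper name `build_reducer` is hypothetical.

```python
# Settings quoted from the experimental setup above.
UMAP_PARAMS = {
    "n_neighbors": 10,      # size of the local neighbourhood
    "min_dist": 0.01,       # minimum spacing of points in the embedding
    "metric": "euclidean",  # distance used in the input space
    "n_components": 2,      # dimensionality of the refined embedding
}

def build_reducer(params=UMAP_PARAMS):
    """Create a UMAP reducer with the stated settings (assumes umap-learn)."""
    import umap  # imported here so this module loads without umap-learn
    return umap.UMAP(**params)
```

A reducer built this way would be applied to the latent embeddings with `fit_transform` before K-means is run on the result.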

4.2. Datasets

The proposed method is evaluated on five commonly used image datasets of varying scales, feature dimensions, and category numbers, to comprehensively assess clustering performance. Sample images from the experimental datasets are presented in Figure 3. Details of the datasets are summarized in Table 2. Since clustering is an unsupervised task, the training and testing samples are merged during the training phase.

MNIST-full: The dataset is a 10-category handwritten digit dataset containing 70,000 samples, and each sample is 28 × 28 gray-scale pixels.

MNIST-test: The dataset is the testing subset of the MNIST-full dataset. Specifically, it contains 10,000 samples, and each sample is 28 × 28 gray-scale pixels.

USPS: The dataset is a 10-category handwritten digit dataset containing 9298 samples, and each sample is 16 × 16 gray-scale pixels.

COIL20: The dataset contains 20 image objects, and each is imaged from 72 viewpoints. So there are 1440 samples in the dataset. In our experiment, we resize the image from 32 × 32 to 18 × 18.

Letters: The dataset is a 10-category dataset of color English letters containing 10,000 samples. We randomly choose 1000 images from each category from A to J. Each sample is 28 × 28 color pixels.

4.3. Comparison Methods

DRSE is evaluated against several conventional clustering algorithms, including K-means, spectral clustering (SC) [29], Gaussian mixture models (GMMs) [30], and Representative Point-based Clustering with Neighborhood Information (RPC-NI) [31]. In addition, we compare our model with other state-of-the-art clustering methods based on deep learning, including deep embedded clustering (DEC), deep clustering with convolutional autoencoders (DCEC) [10], K-deep-autoencoder (K-DAE) [12], structural deep clustering network (SDCN) [32], learning the precise feature for cluster assignment (LPFCA) [33], deep spectral clustering using dual autoencoder network (DSCDAN) [34], Not too deep clustering (N2D) [35], bi-directional discriminative representation learning clustering (BDRC) [36], and Deep Embedding Clustering algorithm based on Residual Autoencoder (DECRA) [14]. For a fair comparison, the deep clustering algorithms we selected are all simple network models, most of them autoencoders, whose loss functions are based on reconstruction loss and KL divergence. DEC, DCEC, K-DAE, SDCN, DSCDAN, N2D, and BDRC are all autoencoder structures, so their loss functions include a reconstruction loss. DEC, DCEC, SDCN, DSCDAN, and BDRC all use KL divergence as the clustering loss. K-DAE trains K autoencoders and selects the group with the smallest reconstruction loss as the best clustering result. SDCN adds a graph neural network to the fully connected autoencoder to obtain the structural information of the data. N2D applies manifold learning to the embedded features to preserve the manifold structure of the data. Both DSCDAN and BDRC have dual decoders and add mutual information to learn more discriminative representations. Baseline results were cited from their respective original papers, ensuring consistent evaluation settings.

4.4. Evaluation Metrics

Accuracy (ACC) is used to measure the accuracy of clustering, taking the maximum matching value of the true label and the cluster label. The mathematical formula is shown in (19).

ACC = \max_{m} \frac{\sum_{i = 1}^{n} 1 \{y_{i} = m (c_{i})\}}{n}

(19)

y_{i}

represents the ground-truth label,

c_{i}

is the predicted cluster assignment, n is the total number of samples, and m denotes the optimal mapping function.
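Equation (19) can be evaluated directly for small problems by trying every one-to-one mapping from cluster ids to labels, as in the sketch below. The exhaustive search is an illustrative choice that grows factorially with the number of clusters; practical implementations usually solve the same matching with the Hungarian algorithm. The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Eq. (19): best accuracy over one-to-one cluster-to-label mappings."""
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    # Pad with dummy labels when there are more clusters than classes.
    classes = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]   # cluster ids are arbitrary; one sample is misplaced
acc = clustering_accuracy(y_true, y_pred)
```

Because cluster ids carry no meaning, a prediction that merely permutes the true labels still reaches an accuracy of 1.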

Normalized Mutual Information (NMI) is used to measure the degree of correlation between two random variables. The mathematical formula of NMI is shown in (20).

NMI (Y, C) = \frac{I (Y, C)}{\frac{1}{2} [H (Y) + H (C)]}

(20)

I is mutual information; Y is the true label, C is the cluster label, and H is the entropy. Adjusted Rand Index (ARI) is used to measure the similarity between clustering results and true labels. The mathematical formula of ARI is shown in (21).

ARI = \frac{RI - E [RI]}{\max (RI) - E [RI]}

(21)

RI represents the Rand index, E[RI] represents the expected value of the Rand index when the clustering is purely random, and max(RI) represents the maximum possible value of the Rand index.
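Both metrics can be computed from label counts alone, as the following sketch shows. NMI follows Equation (20) with natural logarithms, and ARI uses the standard pair-counting form, which is assumed here to be the intended reading of Equation (21). Function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(y, c):
    """Mutual information I(Y, C) from the joint label counts."""
    n = len(y)
    count_y, count_c = Counter(y), Counter(c)
    return sum((nij / n) * math.log(nij * n / (count_y[a] * count_c[b]))
               for (a, b), nij in Counter(zip(y, c)).items())

def nmi(y, c):
    """Eq. (20): mutual information divided by the mean entropy."""
    denom = 0.5 * (entropy(y) + entropy(c))
    return mutual_information(y, c) / denom if denom > 0 else 1.0

def ari(y, c):
    """Eq. (21) in pair-counting form."""
    pairs_joint = sum(math.comb(v, 2) for v in Counter(zip(y, c)).values())
    pairs_y = sum(math.comb(v, 2) for v in Counter(y).values())
    pairs_c = sum(math.comb(v, 2) for v in Counter(c).values())
    expected = pairs_y * pairs_c / math.comb(len(y), 2)
    maximum = 0.5 * (pairs_y + pairs_c)
    if maximum == expected:
        return 1.0
    return (pairs_joint - expected) / (maximum - expected)

y_true = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]     # same partition with swapped ids
unrelated = [0, 1, 0, 1]   # splits every true class in half
```

Unlike accuracy, neither metric needs an explicit mapping between cluster ids and labels, and ARI can become negative when agreement is worse than expected by chance.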

4.5. Results Analysis

We compare our method with 13 clustering methods, including four conventional clustering methods and nine state-of-the-art deep clustering methods. The clustering results are shown in Table 3. Bold and underlined results represent the best and second-best clustering performance for each dataset, respectively. The “-” indicates that the source code was not available. DRSE+ denotes clustering results obtained with a pre-training stage.

It should be noted that to ensure the fairest comparison, the performance metrics of the baseline methods in Table 3 are directly cited from their original publications. Since the official source codes for several baselines are not publicly available, reporting standard deviations across multiple runs for all methods was not feasible. However, the substantial performance margins achieved by our method, particularly on the COIL20 and USPS datasets, clearly demonstrate the robustness and superiority of the proposed framework.

From Table 3, we have the following observations and conclusions:

First, deep clustering methods significantly outperform traditional ones, highlighting the importance of effective representation learning. For instance, DEC surpasses all conventional clustering methods by over 10% on large-scale datasets. However, on small-sample datasets like COIL20, the traditional SC algorithm still performs well, even outperforming several deep methods (DEC, DCEC, K-DAE, SDCN) on NMI, suggesting that classical methods remain valuable in limited-data scenarios.

Second, the proposed method DRSE achieves promising performance on all datasets, which supports the effectiveness of our method. On the MNIST datasets, most deep clustering methods perform very well because the images are simple and clear. In particular, DSCDAN achieves the state-of-the-art results on MNIST-test, benefiting from the dataset’s high quality and its use of spectral clustering. It is worth noting that even though DRSE+ uses only the simple K-means method to initialize the network, its ACC is lower by only about 0.3%. Moreover, on more complex image datasets such as USPS and COIL20, the feature extraction capability of DRSE+ is evident: its ACC is higher than that of DSCDAN by nearly 24% and 15%, respectively. The proposed DRSE method, which requires no pre-training, also performs strongly on every dataset. On COIL20, its ACC exceeds 0.92, which none of the compared methods reaches, and DRSE+ exceeds the recently proposed BDRC by 6% in ACC. Because Letters consists of color images, most deep clustering algorithms have an advantage on it, especially in terms of NMI and ARI. However, the compared deep clustering algorithms are not consistent across metrics, scoring high on some and low on others, whereas only DRSE achieves the highest results on every metric. These results indicate that DRSE generalizes well, which we attribute to the Siamese encoders.

Third, DRSE performs nearly as well as DRSE+ across three datasets, despite requiring no pre-training. This is due to the inclusion of a discriminative representation learning module, which enables the network to learn compact, discriminative features end-to-end. The Siamese encoder ensures robust representations and reliable pseudo-labels, allowing the self-supervised clustering module to optimize effectively without pre-training. Notably, DRSE still achieves the best results on COIL20, where small sample sizes typically hinder deep clustering methods.

To demonstrate the performance of the algorithm more clearly, the error rates of DRSE and several state-of-the-art (SOTA) methods are reported on three datasets, along with the Relative Error Reduction (REC). The experimental results are shown in Table 4. Since almost all compared algorithms achieve high results on the MNIST dataset, the reduction in error rate there is limited. However, the error rates on the USPS and COIL20 datasets both dropped by more than 20% in relative terms. On the COIL20 dataset in particular, the REC of ACC reached 68.9% and the REC of ARI reached 45.5%. In addition, a stacked bar chart of the error rates across all three datasets is presented in Figure 4. As shown, DRSE consistently achieves lower error rates than the BDRC method across all three metrics, demonstrating its superior overall performance.
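The REC values can be checked with a short calculation. The sketch below assumes that REC is the baseline error rate minus the new error rate, divided by the baseline error rate and expressed as a percentage; this definition is inferred from the numbers in Table 4 rather than stated in the text, and the function name is hypothetical.

```python
def relative_error_reduction(baseline_error, method_error):
    """Percentage by which `method_error` is lower than `baseline_error`.

    Assumption: REC = 100 * (baseline - method) / baseline, inferred from
    the values reported in Table 4.
    """
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - method_error) / baseline_error


# Error rates (%) taken from the USPS rows of Table 4: (BDRC, DRSE+).
usps = {"ACC": (2.5, 1.9), "NMI": (6.7, 5.3), "ARI": (4.9, 3.8)}
for metric, (bdrc, drse_plus) in usps.items():
    print(metric, round(relative_error_reduction(bdrc, drse_plus), 1))
# prints: ACC 24.0, NMI 20.9, ARI 22.4 (one metric per line)
```

Under this definition the USPS rows of the REC column in Table 4 are reproduced exactly, which suggests that the assumed formula matches the one used by the authors.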

To show the clustering results intuitively, we draw a stacked bar chart of the improvement of DRSE over DCEC, as shown in Figure 5. In addition, we conduct visualization experiments on the MNIST-test, USPS, and COIL20 datasets. For comparison and analysis, we also provide visualization results of DCEC. The visualization results are shown in Figure 6, Figure 7 and Figure 8.

Figure 5 shows the cumulative clustering performance of DCEC and DRSE on four datasets. There is a clear gap between the results of DCEC and DRSE, which demonstrates that our contributions have a significant positive effect on clustering performance.

As Figure 6, Figure 7 and Figure 8 show, with DRSE the samples within a cluster are more compact and the samples of different clusters are farther apart. This strongly demonstrates the positive impact of the discriminative representation learning layer. In addition, the introduced manifold learning layer preserves the data structure better and filters redundant information, which is evident on the COIL20 dataset.

4.6. Ablation Study

An ablation study is conducted to assess the impact of the Siamese encoders, manifold learning, and the discriminative representation learning layer on model performance. To verify the effectiveness of the UMAP method used in our model, the method is compared with t-SNE using three evaluation metrics. The impact of center loss on the convergence behavior of the model is also investigated.

4.6.1. Contributions of Each Module

In this section, the contribution of each module is analyzed in detail. DRSE mainly contains two components: Siamese encoders (SE) and representation learning (RL). The representation learning module contains a manifold learning layer and a discriminative representation learning layer. The impact of each module on clustering performance is detailed in Table 5, where bold font indicates the best clustering performance. For comparison, the experiments are performed with the pre-training stage.

Representation learning aims to learn a “friendly” representation for clustering. It does so in two ways: first, it filters redundant feature information while preserving the manifold structure of the data; second, it employs center loss to increase intra-cluster homogeneity. The results show that the module plays a pivotal role in the clustering results. On the USPS dataset, both ACC and ARI increase by more than 10%. Most importantly, the ACC and ARI on the COIL20 dataset increase by 27.1% and 30.3%, respectively. Siamese encoders aim to improve the generalization of the model and to learn a robust representation. Even on the simple and clear MNIST-test dataset, ACC, NMI, and ARI improve by nearly 2%, 3%, and 4%, respectively. Because the COIL20 dataset contains a limited number of samples, the role of the Siamese encoders becomes especially crucial there, and the most significant performance improvements are observed on this dataset: ACC, NMI, and ARI improve by nearly 7%, 3%, and 6%, respectively.

Furthermore, it is highly unlikely to obtain comparable results using a simple encoder trained with a standard cross-entropy loss on pseudo labels. A simple encoder lacks the structural regularization provided by the Siamese consistency. More importantly, applying standard cross-entropy to pseudo labels inevitably leads to confirmation bias. Since the predicted labels are noisy during the initial training phases, the network will overfit to these errors and amplify them. Our framework mitigates this issue by jointly leveraging the Siamese consistency to learn inherent data representations independent of labels and the Center Loss to pull samples to cluster centers directly in the metric space. This joint strategy ensures a much more accurate and stable convergence than a simple cross entropy baseline.
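To make the role of the center loss concrete, the following minimal sketch computes it on toy embeddings and applies one center update. It assumes the standard form of center loss (half the mean squared distance between each embedding and the center of its assigned cluster) and a simple moving-average center update with a hypothetical rate `alpha`; the exact formulation and update rule used in DRSE are those given in Section 3.2, which this sketch does not reproduce.

```python
def center_loss(embeddings, assignments, centers):
    """Half the mean squared distance from each embedding to its assigned center."""
    total = 0.0
    for z, k in zip(embeddings, assignments):
        total += sum((zi - ci) ** 2 for zi, ci in zip(z, centers[k]))
    return 0.5 * total / len(embeddings)


def update_centers(embeddings, assignments, centers, alpha=0.5):
    """Move each center a fraction `alpha` toward the mean of its members.

    `alpha` is a hypothetical update rate chosen for illustration.
    """
    new_centers = [list(c) for c in centers]
    for k in range(len(centers)):
        members = [z for z, a in zip(embeddings, assignments) if a == k]
        if not members:
            continue  # a center without members is left unchanged
        mean = [sum(col) / len(members) for col in zip(*members)]
        new_centers[k] = [c + alpha * (m - c) for c, m in zip(centers[k], mean)]
    return new_centers


# Toy 2-D embeddings with pseudo-labels for two clusters.
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
pseudo_labels = [0, 0, 1, 1]
centers = [[0.0, 0.0], [4.0, 4.0]]

before = center_loss(embeddings, pseudo_labels, centers)   # 1.0
centers = update_centers(embeddings, pseudo_labels, centers)
after = center_loss(embeddings, pseudo_labels, centers)    # 0.625
print(before, after)
```

Unlike cross-entropy on pseudo-labels, this term acts directly on distances in the embedding space, so a wrongly labelled sample only pulls its center slightly instead of being fitted as a hard target.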

4.6.2. Clustering Time Analysis of Each Module

To compare the clustering time of each module, we conducted an ablation study on clustering time. The experimental results are shown in Table 6. DRSE w/o RL has the shortest clustering time, and the gap between DRSE and DRSE w/o SE is small, indicating that the discriminative representation learning branch accounts for most of the additional clustering time, about 30 s on MNIST-test and USPS, which is acceptable. More importantly, DRSE w/o SE improves ACC by more than 10% compared with DRSE w/o RL on the USPS and COIL20 datasets, which shows that DRSE achieves a good balance between clustering performance and running time.

4.6.3. Comparison of Different Manifold Learning Methods

To compare the manifold learning methods t-SNE and UMAP, clustering results on the USPS dataset are visualized in Figure 9. A quantitative analysis is also conducted on the MNIST-test, USPS, and COIL20 datasets, and the clustering results are shown in Table 7.

From Figure 9, it is evident that both methods group similar samples into the same clusters, whereas with UMAP the clusters of different categories are farther apart and the samples of the same category are more compact. This indicates that both methods preserve the local structure of the data, while UMAP also preserves the global structure.

From Table 7, the following observations can be made: First, UMAP outperforms t-SNE on all three datasets. Second, due to the small number of samples and a larger number of categories in the COIL20 dataset, higher requirements are placed on the deep clustering model. Because of this, the advantages of the UMAP algorithm are highlighted. The three metrics have been significantly improved. To sum up, we choose the UMAP algorithm in our model.

This visual evidence confirms our theoretical motivation discussed earlier. UMAP protects not only the local neighborhood but also the meaningful global distances between different clusters. This synergy with the Center Loss explains the quantitative superiority of UMAP over t-SNE across all metrics in Table 7. The bold font indicates the best clustering performance.

4.6.4. Analysis of Balance Parameters

The performance of DRSE is further examined under different settings of the balance parameters. The purpose of the reconstruction loss is to make the features learned by the autoencoder as consistent as possible with the data, and the purpose of the KL divergence is to make the probability distribution of the features learned by the encoder as consistent as possible with the data. The goals of these two loss functions are consistent, and they share the same trend and magnitude. We use TensorBoard to visualize the two losses, and the results are shown in Figure 10. The horizontal axis is “RELATIVE”, which represents training time, and the vertical axis is the loss value. Figure 10 shows that the two losses stay within the same order of magnitude over the same relative period of time, between $10^{-1}$ and $10^{-2}$. Hence, their coefficients are fixed at 1.

The reconstruction loss $L_{r}$ and the clustering loss $L_{c}$ are the basis of deep joint clustering. More importantly, the deep metric learning loss $L_{d}$ improves discrimination by enforcing intra-cluster compactness. It can be treated as an effective cluster-promoting regularizer for the joint clustering problem. We tune its balance parameter over the values {0.2, 0.4, 0.6, 0.8, 1, 10}. The effect of the parameter on the USPS dataset is shown in Figure 11. As Figure 11 shows, DRSE maintains promising results when the parameter is relatively small. We therefore set the balance parameter to 1.

It is important to note that the tuning of the balance parameter $\lambda$ was not dataset-specific. As illustrated in Figure 11, the model exhibits strong robustness and maintains stable clustering performance across the evaluated range of {0.2, 0.4, 0.6, 0.8, 1}. Due to this low sensitivity, we did not perform complex per-dataset fine-tuning; instead, we uniformly adopted a consistent value of $\lambda = 1$ across all experiments and datasets to demonstrate the generalization capability of the proposed framework.
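Read together, the settings described in this subsection suggest an overall objective of the following form. This is only a reading of the text above, not a restatement of the objective defined in Section 3; in particular, identifying the KL-divergence term with the clustering loss $L_{c}$ and attaching $\lambda$ to $L_{d}$ alone are assumptions.

```latex
L = L_{r} + L_{c} + \lambda L_{d}, \qquad \lambda = 1
```

With the coefficients of $L_{r}$ and $L_{c}$ fixed at 1 because the two losses have the same order of magnitude, $\lambda$ is the only remaining weight to tune, which is consistent with the single-parameter sweep reported for USPS.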

4.6.5. Convergence Analysis

To verify model convergence, the variations in the three evaluation metrics during the optimization process are analyzed. In addition, the impact of center loss on clustering performance is reported. The first column shows DRSE, and the other shows the variant of DRSE without center loss (DRSE_cl). It should be noted that these are the convergence results without pre-training.

As illustrated in Figure 12, Figure 13 and Figure 14, DRSE converges on all datasets, and it does so quickly: the metrics increase rapidly within a few epochs and then grow slowly. On MNIST-test, DRSE converges at the same speed as DRSE_cl and the metric values improve only slightly, because the images in this dataset are clear and convergence is easy. On the USPS dataset, however, DRSE converges nearly 20 epochs earlier than DRSE_cl. On the COIL20 dataset, ACC and ARI are nearly 10% higher and NMI is nearly 5% higher, and the results of DRSE are also more stable. These clear differences show that discriminative representation learning is critical to the clustering results. Discriminative representations enable fast clustering, which in turn leads the network to converge quickly.

5. Conclusions and Future Work

In this paper, a deep image clustering framework based on Siamese encoders and a discriminative representation learning module is proposed. In this framework, the Siamese encoders share weights but receive different inputs: one takes the original images, and the other takes transformed images. The Siamese encoders learn a robust representation through the reconstruction loss and the KL loss. The discriminative representation learning module operates on the embedded features and their centers. Before it, we add a manifold learning layer that filters redundant features and explores the manifold of the embedding space, which facilitates the subsequent steps. Together, these components constitute the representation learning (RL) module. The RL module effectively learns discriminative representations, as demonstrated by the visualization results and the ablation study. To verify the impact of center loss, we plot the changes in the evaluation metrics during the optimization process and systematically compare and analyze their trends. Moreover, we evaluate our method on datasets with different sample sizes, numbers of categories, and dimensions. In these experiments, our method outperforms state-of-the-art methods, which demonstrates its superiority.

All experiments in this study were conducted on standard benchmark datasets (MNIST, USPS, COIL20, etc.). However, applying this method to more complex natural image datasets (e.g., CIFAR10, STL10, or Tiny ImageNet) with high intraclass variance remains a limitation of the current architecture. For scenarios involving massive scale or high variability, the shallow convolutional autoencoder used in this study would likely struggle to capture sufficient semantic features. Future work could address this by replacing the backbone with more powerful architectures, such as deep ResNets or Vision Transformers (ViTs), to extend the framework’s applicability to complex visual domains. Importantly, the core methodology proposed in this work is fundamentally independent of any specific backbone. The structural guidance and decoupling mechanisms we designed can be seamlessly integrated with any feature extractor, allowing the framework to flexibly scale with the advancement of underlying architectures.

Author Contributions

Conceptualization, H.H.; methodology, H.H. and L.W.; software, H.H.; validation, H.H.; formal analysis, H.H.; investigation, H.H.; resources, H.H. and L.W.; data curation, L.W.; writing – original draft preparation, H.H.; writing – review and editing, H.H. and L.W.; visualization, H.H.; supervision, H.H.; project administration, H.H.; funding acquisition, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Cyan and Blue project of Jiangsu, China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2023, 56, 6439–6475.
Zhou, S.; Xu, H.; Zheng, Z.; Chen, J.; Li, Z.; Bu, J.; Wu, J.; Wang, X.; Zhu, W.; Ester, M. A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. ACM Comput. Surv. 2024, 57, 1–38.
Zhang, Z.; Chen, X.; Wang, C.; Wang, R.; Song, W.; Nie, F. Structured multi-view k-means clustering. Pattern Recognit. 2025, 160, 111113.
Cai, Z.; Yang, X.; Huang, T.; Zhu, W. A new similarity combining reconstruction coefficient with pairwise distance for agglomerative clustering. Inf. Sci. 2020, 508, 173–182.
Chen, X.; Hou, D.; Han, Y.; Ding, X.; Hua, P. Clustering analysis of grid nanoindentation data for cementitious materials. J. Mater. Sci. 2021, 56, 12238–12255.
Kim, J.; Lee, H.; Ko, Y.M. Constrained Density-Based Spatial Clustering of Applications with Noise (DBSCAN) using hyperparameter optimization. Knowl.-Based Syst. 2024, 303, 112436.
Kong, L.; Xue, J.; Nie, F.; Li, X. Direct Spectral Clustering with New Graph Learning for Better Fitting. IEEE Trans. Knowl. Data Eng. 2025, 37, 3991–4002.
Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia, 19–25 August 2017.
Guo, X.; Liu, X.; Zhu, E.; Yin, J. Deep clustering with convolutional autoencoders. In Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2014; pp. 373–382.
Ghasedi Dizaji, K.; Herandi, A.; Deng, C.; Cai, W.; Huang, H. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
Opochinsky, Y.; Chazan, S.E.; Gannot, S.; Goldberger, J. K-autoencoders deep clustering. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020.
Guo, X.; Zhu, E.; Liu, X.; Yin, J. Deep embedded clustering with data augmentation. In Proceedings of the Asian Conference on Machine Learning, Beijing, China, 14–16 November 2018.
Li, M.; Cao, C.; Li, C.; Yang, S. Deep embedding clustering based on residual autoencoder. Neural Process. Lett. 2024, 56, 127.
Cheng, Z.; Li, F.; Wang, J.; Qian, Y. Deep embedding clustering driven by sample stability. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024.
Sadeghi, M.; Armanfard, N. Deep clustering with self-supervision using pairwise similarities. arXiv 2024, arXiv:2405.03590.
Zheng, Z.; Zhang, S.; Song, H.; Yan, Q. Deep clustering using 3D attention convolutional autoencoder for hyperspectral image analysis. Sci. Rep. 2024, 14, 4209.
Mănescu, A.M.; Mănescu, D.C. Self-Supervised Gait Event Detection from Smartphone IMUs for Human Performance and Sports Medicine. Appl. Sci. 2025, 15, 11974.
Bourlard, H.; Kamp, Y. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biol. Cybern. 1988, 59, 291–294.
Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
Seung, H.S.; Lee, D.D. The Manifold Ways of Perception. Science 2000, 290, 2268–2269.
Xu, Y.; Chen, Z.; Hu, J. Deep metric learning in projected-hypersphere space. Pattern Recognit. 2024, 161, 111245.
Cao, X.; Lin, H.; Guo, S.; Xiong, T.; Jiao, L. Transformer-based masked autoencoder with contrastive loss for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5524312.
Boutros, F.; Damer, N.; Kirchbuchner, F.; Kuijper, A. Self-restrained triplet loss for accurate masked face recognition. Pattern Recognit. 2022, 124, 108473.
Li, X.; Yang, X.; Ma, Z.; Xue, J.H. Deep metric learning for few-shot image classification: A review of recent developments. Pattern Recognit. 2023, 138, 109381.
Farzaneh, A.H.; Qi, X. Facial expression recognition in the wild via deep attentive center loss. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021.
Cieslak, M.C.; Castelfranco, A.M.; Roncalli, V.; Lenz, P.H.; Hartline, D.K. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Mar. Genom. 2019, 51, 100723.
Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; Walton, M. Application of Uniform Manifold Approximation and Projection (UMAP) in Spectral Imaging of Artworks. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 252, 119547.
Jia, H.; Ding, S.; Xu, X.; Nie, R. The latest research progress on spectral clustering. Neural Comput. Appl. 2014, 24, 1477–1486.
Wang, R.; Han, S.; Zhou, J.; Chen, Y.; Wang, L.; Du, T.; Ji, K.; Zhao, Y.o.; Zhang, K. Transfer-Learning-Based Gaussian Mixture Model for Distributed Clustering. IEEE Trans. Cybern. 2022, 53, 7058–7070.
Shang, Z.; Dang, Y.; Wang, H.; Liu, S. Representative Point-Based Clustering With Neighborhood Information for Complex Data Structures. IEEE Trans. Cybern. 2025, 55, 1620–1633.
Bo, D.; Wang, X.; Shi, C.; Zhu, M.; Lu, E.; Cui, P. Structural deep clustering network. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020.
Gan, Y.; Dong, X.; Zhou, H.; Gao, F.; Dong, J. Learning the Precise Feature for Cluster Assignment. IEEE Trans. Cybern. 2021, 52, 8587–8600.
Yang, X.; Deng, C.; Zheng, F.; Yan, J.; Liu, W. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
McConville, R.; Santos-Rodriguez, R.; Piechocki, R.J.; Craddock, I. N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding. In Proceedings of the 25th International Conference on Pattern Recognition, Virtual, 10–15 January 2021.
Wang, Y.; Chang, D.; Fu, Z.; Zhao, Y. Learning a bi-directional discriminative representation for deep clustering. Pattern Recognit. 2023, 137, 109237.

Figure 1. Standard structure of autoencoder used for deep clustering.

Figure 2. Overall framework of the proposed method.

Figure 3. Image samples from datasets used in our experiments.

Figure 4. The stacked bar chart of error rate with BDRC.

Figure 5. The stacked bar chart for intuitive comparison of the clustering performance with DCEC.

Figure 6. Visualization of MNIST-test Dataset.

Figure 7. Visualization of USPS dataset.

Figure 8. Visualization of COIL20 Dataset.

Figure 9. Visualization to show the difference in embedding using t-SNE and UMAP on USPS dataset.

Figure 10. Visualization of loss values.

Figure 11. Clustering results with different balance parameters.

Figure 12. MNIST-test Dataset.

Figure 13. USPS dataset.

Figure 14. COIL20 Dataset.

Table 1. The structure of the autoencoder in our method.

Encoder	Decoder
Conv(LeakyReLU, kernel = 5, channel = 32)	Dense2
Conv(LeakyReLU, kernel = 5, channel = 64)	DeConv(LeakyReLU, kernel = 3, channel = 64)
Conv(LeakyReLU, kernel = 3, channel = 128)	DeConv(LeakyReLU, kernel = 5, channel = 32)
Dense1	DeConv(LeakyReLU, kernel = 3, channel = 1)

Table 2. The summary of datasets.

Datasets	Clusters	Samples	Dimensions
MNIST-full	10	70,000	28 × 28 × 1
MNIST-test	10	10,000	28 × 28 × 1
USPS	10	9297	16 × 16 × 1
COIL20	20	1440	32 × 32 × 1
Letters	10	10,000	32 × 32 × 3

Table 3. Clustering results of compared methods on five datasets in terms of ACC, NMI and ARI.

Datasets	MNIST			MNIST-Test			USPS			COIL20			Letters
Metrics	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
Kmeans	0.532	0.500	0.365	0.543	0.505	0.384	0.668	0.627	0.545	0.401	0.640	0.381	0.557	0.549	0.414
SC	0.668	0.782	0.617	0.660	0.704	0.523	0.656	0.796	0.650	0.703	0.831	0.612	0.557	0.727	0.554
GMM	0.449	0.377	0.249	0.554	0.505	0.394	0.327	0.627	0.492	0.442	0.645	0.369	0.540	0.552	0.417
DEC	0.869	0.848	0.814	0.776	0.724	0.660	0.733	0.731	0.653	0.601	0.745	0.525	0.558	0.593	0.432
DCEC	0.888	0.882	0.841	0.852	0.809	0.773	0.789	0.821	0.740	0.725	0.797	0.641	0.554	0.653	0.494
KDAE	0.772	0.800	0.702	0.754	0.730	0.628	0.765	0.793	0.712	0.713	0.796	0.635	0.583	0.641	0.505
SDCN	0.802	0.761	0.715	0.756	0.699	0.614	0.781	0.795	0.718	0.610	0.786	0.596	0.628	0.670	0.562
LPFCA	0.936	0.924	0.914	-	-	-	0.910	0.879	0.845	0.876	0.891	0.795	-	-	-
DSCDAN	0.976	0.941	0.945	0.980	0.946	0.956	0.739	0.846	0.722	0.826	0.922	0.786	0.622	0.521	0.685
N2D	0.977	0.939	0.950	0.957	0.900	0.907	0.960	0.905	0.923	0.806	0.916	0.793	0.533	0.717	0.583
BDRC	0.979	0.943	0.954	-	-	-	0.975	0.933	0.951	0.913	0.960	0.901	-	-	-
DECRA	0.964	0.917	0.924	0.890	0.879	0.846	0.951	0.885	0.907	0.735	0.829	0.681	-	-	-
RPC-NI	-	-	-	0.786	0.713	0.656	0.827	0.767	0.694	0.829	0.879	0.780	-	-	-
DRSE	0.979	0.943	0.953	0.975	0.936	0.947	0.980	0.945	0.959	0.927	0.951	0.901	0.883	0.789	0.759
DRSE+	0.981	0.948	0.958	0.977	0.940	0.950	0.981	0.947	0.962	0.973	0.971	0.946	0.904	0.825	0.792

Table 4. Comparison of error rates with the BDRC algorithm.

Dataset	Metric (%)	BDRC	DRSE+	REC (%)
MNIST-full	ACC	2.1	1.9	9.5
	NMI	5.7	5.2	8.8
	ARI	4.6	4.2	8.7
USPS	ACC	2.5	1.9	24.0
	NMI	6.7	5.3	20.9
	ARI	4.9	3.8	22.4
COIL20	ACC	8.7	2.7	68.9
	NMI	4.0	2.9	27.5
	ARI	9.9	5.4	45

Table 5. Clustering performance with different modules on three datasets.

Datasets	Methods	ACC	NMI	ARI
MNIST-test	DRSE w/o RL	0.970	0.933	0.934
	DRSE w/o SE	0.966	0.917	0.927
	DRSE	0.981	0.948	0.962
USPS	DRSE w/o RL	0.846	0.899	0.827
	DRSE w/o SE	0.972	0.927	0.945
	DRSE	0.981	0.947	0.961
COIL20	DRSE w/o RL	0.702	0.812	0.643
	DRSE w/o SE	0.902	0.939	0.881
	DRSE	0.973	0.971	0.946

Table 6. Clustering time with different modules on three datasets.

Datasets	Methods	Clustering Time (s)
MNIST-test	DRSE w/o RL	13.13
	DRSE w/o SE	44.90
	DRSE	52.06
USPS	DRSE w/o RL	9.48
	DRSE w/o SE	39.75
	DRSE	45.31
COIL20	DRSE w/o RL	2.94
	DRSE w/o SE	7.99
	DRSE	8.89

Table 7. Clustering results of different manifold learning methods.

Datasets	Methods	ACC	NMI	ARI
MNIST-test	DRSE (t-SNE)	0.971	0.931	0.939
	DRSE	0.981	0.948	0.962
USPS	DRSE (t-SNE)	0.978	0.940	0.956
	DRSE	0.981	0.947	0.961
COIL20	DRSE (t-SNE)	0.856	0.910	0.820
	DRSE	0.973	0.971	0.946

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

