Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure

Zheng, Aiyu; Cai, Jianghui; Yang, Haifeng; Xun, Yalin; Zhao, Xujun

doi:10.3390/math13223578

Open AccessArticle

Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure

by

Aiyu Zheng

^1,2

,

Jianghui Cai

^1,3,*,

Haifeng Yang

^2,3,*,

Yalin Xun

² and

Xujun Zhao

²

¹

School of Electronic Information and Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China

²

School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

³

Shanxi Key Laboratory of Big Data Analysis and Parallel Computing, Taiyuan 030024, China

^*

Authors to whom correspondence should be addressed.

Mathematics 2025, 13(22), 3578; https://doi.org/10.3390/math13223578

Submission received: 8 October 2025 / Revised: 30 October 2025 / Accepted: 31 October 2025 / Published: 7 November 2025

Download

Browse Figures

Versions Notes

Abstract

Deep neural network-based deep clustering has achieved remarkable success by unifying representation learning and clustering. However, conventional representation modules are typically not tailored for clustering, resulting in conflicting objectives that hinder the model’s ability to capture semantic structures with high intra-cluster cohesion and low inter-cluster separation. To overcome this limitation, we propose a novel framework called Triple-stream Contrastive Deep Embedding Clustering via Semantic Structure (TCSS). TCSS is composed of representation and clustering modules, with its innovation rooted in several key designs that ensure their synergistic interaction for modeling semantic structures. First, TCSS introduces a triple-stream input framework that processes the raw instance along with its limited and aggressive augmented views. This design enables a new triple-stream self-training clustering loss, which uncovers implicit cluster structures by contrasting the three input streams. Second, within this loss, a dynamic clustering structure factor is developed to represent the evolving semantic structure in the representation space, thereby constraining the clustering-prediction distribution. Third, TCSS integrates semantic structure-aware techniques, including a clustering-oriented negative sampling strategy and a triple-stream alignment scheme based on k-nearest neighbors and centroids, to refine semantic structures both locally and globally. Extensive experiments on five benchmark datasets demonstrate that TCSS outperforms state-of-the-art methods.

Keywords:

deep clustering; triple-stream self-training clustering loss; clustering-oriented contrastive learning; clustering structure factor

MSC:

68T07; 62H30

1. Introduction

Clustering is a fundamental unsupervised data-mining technique [1,2]. However, traditional algorithms such as K-Means [3] and spectral clustering [4] often yield unreliable results on high-dimensional data. To address this challenge, deep neural networks have been introduced as powerful representation learners. By jointly learning latent representations and cluster assignments, they substantially outperform conventional methods [5,6,7,8].

Early work from Facebook AI Lab [9] proposed an iterative framework that alternates between feature extraction from pretrained neural networks and cluster assignment using K-Means. JULE [10] modeled the sequential merging steps of agglomerative clustering through a recurrent CNN. DAC [11] formulated deep clustering as an optimization problem that alternately refines label assignments and similarity estimation. DEC [12] modeled instance similarity using the Student-t distribution and introduced a sharpened target distribution to form a KL-divergence-based self-training clustering loss. DGG [13] unified model- and similarity-based clustering by combining a Gaussian-mixture VAE with stochastic graph embedding, jointly optimizing likelihood and Jensen–Shannon divergence for global–local representation learning. SCAN [14] refined pretrained contrastive features via K-nearest neighbors and pseudo-label fine-tuning, creating a self-improving clustering model. DeepCluster-v2 [15] (SwAV) further improved scalability by replacing pairwise contrastive comparisons with a swapped prediction mechanism enforcing view-consistent cluster assignments, enabling scalable unsupervised learning.

As deep clustering evolved, the integration of clustering and representation learning modules deepened, and researchers began analyzing their mutual dependencies and optimization objectives. Following DEC, DCC [16] extended the framework by introducing together/apart constraints derived from side information such as labels, continuous attributes, or domain knowledge, enabling semi-supervised and constrained deep clustering. DCCM [17] exploited pseudo-label correlations and modeled triplet mutual information among features to enhance cluster consistency and robustness. PICA [18] introduced a differentiable partition uncertainty index that alleviates error accumulation in iterative training. IDFD [19] unified instance discrimination and SoftMax-based feature decorrelation to promote more discriminative and cluster-friendly representations.

More recently, augmentation- and structure-driven approaches have emerged. SACC [20] was the first to jointly exploit strong (aggressive or distorted) and weak (structure-preserving) augmentations for contrastive deep clustering. IcicleGCN [21] bridged instance-level contrastive learning with multi-scale graph structure modeling to capture both local and global dependencies. WEC [22] integrated Wasserstein-based optimal transport into latent space learning to achieve more stable and expressive representation–clustering coupling. DCHD [23] enhanced graph node clustering through hard-positive debiasing and adaptive pair weighting, effectively mitigating false-negative pairs. DLTE [24] employed deep nonnegative matrix factorization with weighted low-rank tensor and graph Laplacian constraints to capture nonlinear, high-order, and geometric correlations across multiple views.

The ultimate goal of deep clustering is to construct a semantic structure characterized by high intra-cluster cohesion and low inter-cluster coupling. Despite substantial progress, existing methods still face several key challenges. (1) They often overemphasize clustering structure within the clustering module while underutilizing the complementary role of representation learning. (2) They tend to optimize either the clustering distribution or the feature space in isolation, rather than integrating them into a unified objective. (3) They typically capture either local structure (intra-cluster cohesion) or global structure (inter-cluster separation), but rarely both simultaneously. Furthermore, the inconsistency between structural representations learned by the feature and clustering modules frequently causes optimization conflicts, leading to degraded performance [5,6,7].

To address these limitations, this paper proposes Triple-stream Contrastive Deep Embedding Clustering via Semantic Structure (TCSS), a novel deep clustering framework that jointly optimizes representation learning and clustering in a unified semantic space. First, TCSS augments each instance into three streams: a raw instance (anchor), a weakly augmented view, and a strongly augmented view. This design increases semantic diversity and enables richer structural understanding through cross-stream interactions within both modules. Second, the representation module employs triple-stream instance-level contrastive learning to capture local invariant semantics and adopts a hard negative filtering strategy to alleviate class collision, thereby improving representation discriminability. Third, the clustering module incorporates a dynamic clustering structure factor that adapts to the evolving geometry of the triple-stream embeddings, regularizing the self-training clustering objective. Finally, TCSS fine-tunes both local and global structures via cross-stream k-nearest-neighbor consistency and centroid alignment, ensuring structural coherence across all augmentations. In summary, TCSS effectively integrates contrastive representation learning and dynamic clustering optimization, offering a principled framework for semantic structure-aware deep clustering. The main contributions of TCSS are summarized as follows:

We propose a novel triple-stream deep clustering framework termed TCSS, which leverages the interaction and fusion among three complementary streams (raw, weakly augmented, and strongly augmented) to enrich structural information in both the contrastive and clustering modules.
We introduce a novel dynamic clustering structure factor as a degree-of-freedom regulator in the self-training loss, capturing intrinsic geometric relations in the embedding space and adaptively guiding cluster formation.
We design two structure alignment strategies, local (via k-nearest-neighbor alignment) and global (via centroid alignment), along with a clustering-aware negative instance selection mechanism to refine the learned semantic structure.
Extensive experiments on multiple benchmark datasets validate that TCSS consistently outperforms recent state-of-the-art deep clustering methods.

The rest of this paper is structured as follows. Section 2 provides a review of related studies on self-training and contrastive deep clustering. Section 3 introduces the proposed TCSS framework in detail. Section 4 presents the experimental results and corresponding analyses. Section 5 discusses the scalability of TCSS. Finally, Section 6 summarizes the conclusions and outlines potential future work.

2. Related Works

In this section, we review the history and recent advances of Self-Training Deep Clustering and Contrastive Deep Clustering, which form the foundation of this work. We also provide a brief analysis and classification of these two research directions.

2.1. Self-Training Deep Clustering

The concept of self-training originates from [25], where the Student-t distribution was first introduced to model inter-sample similarity in the embedding space. DEC [12] later replaced the Gaussian assumption with a sharpened Student-t distribution to express global clustering relations. Building on this, DBC [26] jointly learns image representations and cluster centers using a fully convolutional auto-encoder and soft K-means, while DTC [27] extends DEC to transfer learning with a representation bottleneck, temporal ensembling, and consistency constraints.

Early self-training approaches achieved promising results by jointly learning representations and clusters, but the clustering loss often distorted the feature space, producing uninformative representations. IDEC [28] addressed this by combining a self-training loss with an under-complete auto-encoder to preserve local structure. DEC-DA [29] introduced augmented training, computing targets from clean data and outputs from augmented data to stabilize the initial feature space.

More recent works extend self-training to broader applications. DCC [16] integrates side-information constraints for semi-supervised clustering. DFC [30] incorporates fairness and structural preservation losses to mitigate bias across protected subgroups. DVA [31] employs attention-based classifiers for robust, background-invariant features. WEC [22] embeds Wasserstein distance optimization in latent space for stable end-to-end clustering, and DLTE [24] combines deep nonnegative matrix factorization with tensor and graph Laplacian constraints to capture higher-order correlations across multiple views. G-CEALS [32] learns Gaussian cluster embeddings in auto-encoder latent space, achieving state-of-the-art clustering performance on tabular data. CEDECC [33] enhances clustering ensembles by leveraging cluster-confidence-guided deep embeddings to generate high-quality base clusterings and improve consensus performance on complex data.

Overall, self-training has proven to be a versatile clustering paradigm that integrates seamlessly with diverse frameworks, including emerging deep graph clustering [34]. It typically uses squared assignment probabilities with cluster-wise normalization to emphasize high-confidence samples. However, self-training clustering still faces limitations, notably its sensitivity to cluster centers and the absence of fine-grained semantic structural modeling in the representation space.

2.2. Deep Contrastive Clustering

Contrastive learning (e.g., SimCLR [35], MoCo [36])—a leading paradigm in unsupervised representation learning—has been widely adopted in deep clustering [37].

PCL [38] extracts prototypical semantic structures through an encoder during clustering. CC [39] outputs category-dimensional features, treating the rows of the feature matrix as soft labels and the columns as cluster representations, representing early work on prediction independence. SCAN [14] leverages pretrained contrastive representations and K-nearest neighbors, refining them via pseudo-labeling. Later, semi-supervised extensions such as TCL [40] and SPICE [40] adopted this boosting strategy but retained clustering modules similar to CC. DCSC [41] adds an instance-level contrastive loss to CC, partially alleviating the category-collision problem.

Recent works have also explored multiview contrastive clustering. CLSA [42] measures distribution divergence between weakly and strongly augmented views using a MoCo-based mechanism. SACC [20] employs a dual-contrastive loss between one strong and two weak views, along with a clustering loss enforcing prediction independence. EMVCC [43] introduces a spatial–spectral multiview framework with probabilistic contrastive and alignment losses to improve hyperspectral image clustering. ECCT [44] uses a pseudo-Siamese Vision Transformer with aggressive multiview augmentation to enhance global–local semantic consistency. DCHD [23] targets graph data, combining hard-positive debiasing and adaptive pair weighting to reduce false negatives and improve node clustering accuracy. All these methods adopt dual-contrastive losses using weak and strong augmentations, unlike SPICE, which includes raw data. Their clustering modules, however, remain limited.

Further developments combine contrastive clustering with other paradigms. DCHL [45] decomposes cluster-level contrastive learning into fine- and coarse-grained branches to model interactions across semantic levels. DeepCluE [46] uses a shared-weight CNN for feature extraction followed by an ensemble clustering step, bridging deep and ensemble clustering. SCGC [47] uses dynamic soft structure fusion and influence-based contrastive learning for efficient graph clustering. CCMVC [48] aligns feature-, cluster-, and view-level representations for superior multiview clustering.

Overall, while contrastive learning has driven major progress in deep clustering, existing contrastive clustering frameworks—constrained by localized instance relations or independence-based objectives—still struggle to capture the global structural and semantic coherence of the representation space. Combined with persistent category-collision issues, these limitations leave significant room for improvement in contrastive deep clustering.

3. Proposed Method

Given an unlabeled dataset

D = x_{1}, x_{2}, \dots, x_{N}

, where each

x_{i} \in R^{d}

, TCSS encodes the data into a low-dimensional representation space characterized by high intra-cluster cohesion and low inter-cluster coupling, and partitions it into C clusters. As shown in Figure 1, TCSS first applies a triple-stream contrastive head (

L_{c o n - t r i}

) to learn clustering-friendly features. These representations are then fed into a self-training clustering head (

L_{c l u - t r i}

) to produce clustering assignments. Meanwhile, alignment mechanisms with

L_{a l i}

and

L_{n e i g}

further refine the learning process. All objectives are jointly optimized within a unified framework.

The network architecture of TCSS is shown in Figure 2. A ResNet backbone extracts features for three streams—weakly augmented, strongly augmented, and raw instances. The clustering head (Equations (4) and (10)) includes three sets of trainable cluster centers, one per stream, while the contrastive head is a multilayer MLP. The cluster-center alignment loss and neighbor alignment loss are applied to the cluster centers and feature embeddings, respectively. Other loss computations are illustrated in the figure.

3.1. Motivation and Distinction from Dual-Stream Models

Most existing dual-stream deep clustering methods are built on contrastive learning frameworks. They emphasize instance- and semantic-level consistency but still face several limitations:

Lack of a cluster-oriented multiview setting. In dual-stream frameworks, both streams follow the contrastive setup and are treated equally, leaving no spatial anchor in the representation space. As a result, instance representations can vary sharply under augmentations, especially for samples near cluster boundaries.
Limited inter-view interaction in the clustering head. Due to the two-stream design, most methods only enforce inter-view consistency and intra-view redundancy reduction, without enabling genuine cross-dimensional interactions across views.
Weak modeling of spatial structure. Each instance forms a single augmented pair per batch, corresponding to a “hyper-line segment” in the representation space. Interactions thus occur only among such segments, limiting cross-view intersections.
Implicit misalignment in inter-view contrastive mechanisms. Dimension-wise contrastive losses can distort cluster structures when stream-specific cluster distributions are not aligned. For instance, if the a-view of an instance belongs to class i but its b-view falls into class j, the contrastive loss may exaggerate this inconsistency.

To address the above limitations of dual-stream architectures, TCSS introduces a tri-stream deep clustering design with the following improvements:

Anchor-based stabilization and cross-view negative sampling. TCSS designates the source sample as an anchor stream and introduces a cross-view negative sampling strategy (Section 3.2.3) together with a soft-label alignment loss (Equation (7), Section 3.3). This stabilizes sample-space representations while maintaining augmentation diversity, improving clustering robustness.
Triple-stream information fusion and distribution alignment. TCSS employs a Triple-stream fusion strategy with a corresponding distribution alignment loss (Equation (7)) to enhance representation consistency across views.
Enhanced spatial structural expressiveness. Each training batch forms a “hyper-triangle” in the representation space, where each vertex represents one stream. This extends inter-view interaction from a line to a plane, enriching the model’s spatial representation capacity.
Cluster-center alignment for stable self-training. A cluster-center alignment loss enforces consistent soft assignments across views, resolving the inter-view alignment issue and improving clustering stability.

For clearer conceptual and empirical comparison, several representative dual-stream deep clustering methods are evaluated against TCSS in terms of design philosophy and performance, as summarized in Table 1. The performance results are cited directly from the original papers.

As shown in the table, CC, DCSC, DeepCluE, and IcicleGCN all employ similar cross-view contrastive clustering losses. Despite incorporating clustering-oriented mechanisms, their performance remains limited by alignment and structural deficiencies. In contrast, IDFD, which adopts an independence-based approach and avoids alignment issues, achieves better results, supporting our analysis. The clear performance gain of TCSS over IDFD further highlights the necessity and effectiveness of the triple-stream design.

3.2. Triple-Stream Contrastive Learning

Contrastive learning (CL) extracts discriminative knowledge by leveraging instance-level information [8,49]. Through data augmentation, CL implicitly builds a semantic hierarchy among neighboring instances [35,36]. However, CL-based deep clustering frameworks face two main issues: (1) Category Collision, where instances from the same class are mistakenly treated as negatives [38]. (2) Limited exploration of the view structure due to weak augmentation. Since CL is not tailored for clustering, it overlooks cluster-structure learning, leading to these problems. This section introduces the Triple-stream Contrastive Learning module of TCSS, which addresses these issues by explicitly modeling and exploiting clustering structures.

3.2.1. Introduction of Contrastive Learning

Two weakly augmented views of the raw sample

x_{i}^{r a w}

are generated by random augmentation functions, denoted as

x_{i}^{w} = A (x_{i}^{r a w})

and

x_{i}^{w^{'}} = A^{'} (x_{i}^{r a w})

, where

A (\cdot)

represents a stochastic augmentation operation. A neural network encoder

ϕ_{θ}

then produces feature representations

f_{i}^{w} = ϕ_{θ} (x_{i}^{w})

and

f_{i}^{w^{'}} = ϕ_{θ} (x_{i}^{w^{'}})

. Each feature is passed through a projection head P (typically a 2- or 3-layer MLP) to obtain the latent vectors

h_{i} = P (f_{i})

. The contrastive loss for the augmented instance

x_{i}^{w}

is then formulated based on the latent pair

(h_{i}^{w}, h_{i}^{w^{'}})

as follows:

l_{i}^{w} = - l o g \frac{e x p (S (h_{i}^{w}, h_{i}^{w^{'}}) / τ)}{\sum_{j = 1}^{N} I_{j \neq i} (e x p (S (h_{i}^{w}, h_{j}^{w}) / τ) + e x p (S (h_{i}^{w}, h_{j}^{w^{'}}) / τ))},

(1)

where

I_{j \neq i}

is an indicator function, and

τ

is a temperature hyperparameter controlling distribution smoothness. The similarity function

S (\cdot, \cdot)

is typically defined as cosine similarity,

S (a, b) = \frac{a^{T} b}{| a | | b |}

. In Equation (1), vectors indexed by j represent negative samples, while those indexed by i represent positive pairs. The symmetric contrastive loss over all instances is then given by:

L_{c o n (w, w^{'})} = \frac{1}{2 N} \sum_{i = 1}^{N} (l_{i}^{w} + l_{i}^{w^{'}}),

(2)

Equation (2) forces

h_{i}^{w}

to move closer to its positive view

h_{i}^{w^{'}}

while simultaneously forcing

h_{i}^{w}

to move away from other negative instances.

Augmented views used in CL loss implicitly build a semantic hierarchy among neighboring instances [5]. Hence, contrastive learning serves as a suitable backbone for clustering tasks.

3.2.2. Triple-Stream Contrastive Learning

Strong augmentations—with more aggressive or even distorted transformations—are used in prior works [42,50] to expand the semantic domain and enrich representations. TCSS adopts this idea to build a triple-stream structure. For each raw sample

x_{i}^{r a w}

, a weakly augmented view

x_{i}^{w}

and a strongly augmented view

x_{i}^{s}

are generated to form a triple-stream group. Consequently, the standard dual-stream contrastive loss is reformulated for the triple-stream setting. Unlike [42,50], TCSS constructs three positive pairs: (1) raw–weak; (2) raw–strong; (3) weak–strong. Based on Equations (1) and (2), the triple-stream contrastive loss is defined as:

L_{c o n - t r i} = L_{c o n (r a w, w)} + L_{c o n (w, s)} + L_{c o n (r a w, s)},

(3)

This triple-stream design is motivated by three factors: (1) The unaugmented raw data provides the richest semantic information, serving as an anchor for feature learning. (2) The strong view improves generalization and feature diversity by enforcing invariance under heavy perturbations. (3) Interactions among the three views mitigate detail bias in the raw data and preserve invariant features.

Geometrically, each instance can be viewed as a point in a hyperspace, with its augmented views forming neighboring points. Dual weak views in conventional CL extend a point into a variable line segment, whereas adding a strong view expands it into a triangular plane. Because the strong view lies farther from the raw and weak views, these planes intersect more readily, enabling better exploration of intra-cluster cohesion.

3.2.3. Tag Bank for Clustering

The discriminative nature of contrastive learning (CL) encourages low coupling between instances. However, at the cluster level, the use of negative samples in CL can cause category collision [38], where instances from the same class are mistakenly treated as negatives. To address this, the quality of negative samples must be improved. Inspired by MoCo’s memory bank, we design a Tag Bank that selectively filters negative examples for each positive instance, thereby reducing false negatives. This mechanism depends on the process described in Section 3.3.

Within this bank, each row in the Tag Bank stores the output of the contrastive head for an instance along with its tag, which is derived from the clustering head’s probability output

Q^{m}

(see Section 3.3). The tag corresponds to the category index of the instance’s maximum predicted probability. During contrastive training, for each instance, the tags of all negative samples used in the loss are retrieved. If a category conflict occurs—i.e., the maximum prediction index of a negative sample matches that of the anchor instance—the conflicting negative is replaced by a randomly selected non-conflicting sample from the bank. After each batch update, the Tag Bank is refreshed accordingly.

3.3. Triple-Stream Clustering Learning

CL-based clustering methods often rely on inter-cluster contrastive learning or cluster independence, overlooking structural relations between samples and centroids. The Self-Training Clustering (STC) module [12] models sample–centroid relations using the Student-t distribution [25], yet its objective lacks an explicit centroid-oriented constraint. Moreover, extending the dual-view structure to a unified triple-stream form is necessary. Therefore, TCSS introduces a new triple-stream STC framework to address these limitations.

3.3.1. Introduction of Self-Training Clustering

We briefly review the basic Self-Training Clustering (STC) method under a single augmented view. It first defines a learnable set of cluster centroids

μ_{j}^{v i e w} {j = 1}^{C}

for each view, typically initialized by applying K-Means [3] on pretrained representations [12]. The similarity between samples and centroids in the representation space is modeled by the Student-t distribution. Specifically, the conditional probability of the i-th sample belonging to the j-th centroid is defined as

q^{v i e w} j | i = {(1 + s i m (f_{i}^{v i e w}, μ_{j}^{v i e w}) / ω)}^{- \frac{ω + 1}{2}}

, where

s i m (\cdot, \cdot)

denotes a similarity measure (e.g., Euclidean distance), and

ω

is the degree-of-freedom parameter discussed in Section 3.3.3. Assuming uniform sampling (

q_{i}^{r a w} = \frac{1}{N}

), the joint probability of instance i belonging to centroid j is given by:

q_{i j}^{v i e w} = \frac{{(1 + s i m (f_{i}^{v i e w}, μ_{j}^{v i e w}) / ω)}^{- \frac{ω + 1}{2}}}{\sum_{j^{'}}^{C} ({(1 + s i m (f_{i}^{v i e w}, μ_{j^{'}}^{v i e w}) / ω)}^{- \frac{ω + 1}{2}})} .

(4)

Regarding the target distribution

P_{i j}^{r a w}

, if we assume the soft predictions

q_{i j}^{v i e w}

are mostly accurate (which is highly significant for the results), the target probability

p_{i j}

should be an enhanced version of

q_{i j}^{v i e w}

:

p_{i j}^{v i e w} = \frac{{(q_{i j}^{v i e w})}^{2} / f r e q_{j}^{v i e w}}{\sum_{j^{'}}^{C} {(q_{i j}^{v i e w})}^{2} / f r e q_{j}^{v i e w}},

(5)

where

f r e q_{j}^{v i e w} = \sum_{i} q_{i j}^{v i e w}

are soft cluster frequencies. Lastly, the Kullback–Leibler (KL) divergence is used to minimize the distance between the two distributions Q and P:

L_{c l u (P^{v i e w}, Q^{v i e w})} = \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{C} p_{i j}^{v i e w} l o g \frac{p_{i j}^{v i e w}}{q_{i j}^{v i e w}} .

(6)

STC iteratively optimizes the cluster-centroid

{\{μ_{j}\}}_{j = 1}^{C}

and P, eventually obtaining convergent cluster assignments.

3.3.2. Clustering with Triple-Stream

The ladder effect described in Section 3.2, induced by the weak, raw, and strong views, can also enhance cluster-structure learning. Thus, all three views are used to improve clustering performance. Based on the triple-stream representations, the corresponding source distributions are computed via Equation (4) as

Q^{r a w}

,

Q^{w}

, and

Q^{s}

.

Since the weak view is only a mild perturbation of the raw view, their representations are expected to be closely aligned and stable. We therefore average them to form a smoothed distribution

Q^{m} = \frac{Q^{w} + Q^{r a w}}{2}

. The target distribution

P^{m}

is then obtained using Equation (5) and

Q^{m}

. Finally, the overall triple-stream clustering loss is formulated according to Equation (6) as follows:

L_{c l u - t r i} = L_{c l u (P^{m}, Q^{w})} + L_{c l u (P^{m}, Q^{r a w})} + L_{c l u (P^{m}, Q^{s})} .

(7)

This design is motivated by two key considerations: (1) Stability. Compared with the strong view, cluster assignments from the weak and raw views are more stable, reducing assignment ambiguity. (2) Robustness. Averaging promotes smoother and more consistent optimization. Moreover, maintaining consistent cluster assignments across all three views preserves semantic integrity. From the triple-stream perspective, the model learns a ladder-like clustering structure that progresses from near (weak) to far (strong) views, leading to improved performance.

3.3.3. Clustering Structure Factor $ω$

We now address the choice of the degrees of freedom factor

ω

. Traditional self-training methods, starting from DEC, typically fix

ω = 1

[25]. However, this practice introduces several issues:

Dependence on crowding. As noted in [25], the proper value of $ω$ should depend on the degree of crowding—a phenomenon where low-dimensional embeddings become compressed due to dimensionality mismatch with the input space. The literature explicitly recommends adapting $ω$ to the crowding level, as fixed values perform worse in experiments.
Scaling effect. From Equation (4), $ω$ acts as a similarity-scaling factor. A smaller $ω$ yields heavier distribution tails, producing compact intra-cluster and dispersed inter-cluster structures, whereas a larger $ω$ flattens the distribution and weakens cluster boundaries.
Lack of task adaptivity. While $ω$ can be fixed or learned, the fixed setting lacks flexibility and relies on heuristics, whereas a learnable $ω$ can cause instability. More critically, neither directly reflects the evolving cluster structure.

To address this, we introduce a dynamic concentration smoothing factor that adapts to the real-time clustering structure of the embedding space. Specifically, in the single-view case, the intra-cluster factor for cluster j is defined as

d_{j}^{i n t r a - v i e w}

:

d_{j}^{i n t r a - v i e w} = \frac{\sum_{i = 1}^{N_{j}} | | f_{i j}^{v i e w} - μ_{j}^{v i e w} {| |}_{2}}{l o g N_{j}},

(8)

where

N_{j}

and

f_{i j}^{v i e w}

denote the number and features of samples in cluster j. The denominator mitigates the influence of cluster size on

d_{j}^{i n t r a - v i e w}

. Each

d_{j}^{i n t r a - v i e w}

is normalized to obtain the mean intra-cluster factor

d^{i n t r a - v i e w}

.

The inter-cluster factor for cluster j is defined as

d_{j}^{i n t e r - v i e w} = m i n_{i \neq j, i \in [1, C]} (| | μ_{i} - μ_{j} {| |}_{2}),

(9)

and all

d_{j}^{i n t e r - v i e w}

are normalized, with the minimum taken as the global inter-cluster factor

d^{i n t e r - v i e w}

. This minimum centroid distance reflects the degree of crowding in the embedding space.

Finally, the clustering structure factor for a single view is defined as:

ω^{v i e w} = {(\frac{d^{i n t r a - v i e w}}{d^{i n t e r - v i e w}})}^{2} .

(10)

In Equation (10),

d^{i n t r a - v i e w}

represents the smoothed average intra-cluster distance, while

d^{i n t e r - v i e w}

denotes the minimum inter-cluster distance. Both capture the structural characteristics of the representation space. The resulting

ω^{v i e w}

reflects the core clustering principle of high intra-cluster cohesion and low inter-cluster coupling. When clusters are compact and well separated,

ω^{v i e w}

becomes small, increasing the model’s sensitivity to similarity. This amplifies similarity within clusters and produces heavier distribution tails, promoting better inter-cluster separation. Conversely, when clusters are dispersed and overlapping,

ω^{v i e w}

grows large, relaxing the similarity scaling and reducing penalties for negatives. Since intra-cluster distances are generally smaller than inter-cluster distances, squaring their ratio accentuates this difference, allowing

ω^{v i e w}

to adaptively guide training across scenarios. In practice,

ω^{v i e w}

is computed per epoch using global representations

F^{v i e w} = {f_{i}^{v i e w}}_{i = 1}^{N}

via K-means. TCSS calculates

ω^{r a w}

,

ω^{w}

, and

ω^{s}

separately for use in Equation (4).

The Derivation of $ω$ : Assume the similarity between a feature vector

f_{i}

and a cluster center

μ_{j}

is modeled by the Student’s t-kernel:

q (r) = {(1 + \frac{r^{2}}{ω})}^{- \frac{ω + 1}{2}},

(11)

where

r = | f_{i} - μ_{j} |_{2}

and

ω

is the scaling parameter.

We seek

ω

such that: (1) For intra-cluster pairs (

r \sim d^{i n t r a}

), the kernel remains high (

q (d^{i n t r a}) \approx 1

). (2) For inter-cluster pairs (

r \sim d^{i n t e r}

), the kernel decays rapidly. This implies that:

\frac{{(d^{i n t r a})}^{2}}{ω} ≪ 1 and \frac{{(d^{i n t e r})}^{2}}{ω} \approx 1,

(12)

leading to an approximate choice:

ω \approx {(d^{i n t e r})}^{2} .

(13)

However, this ignores the influence of intra-cluster distance. When

d^{i n t r a}

is large, a wider kernel is required to maintain within-cluster similarity. A balanced formulation considers both, motivating the ratio-based design in Equation (10).

This setting ensures: (1) When clusters are compact and well separated (

d^{i n t r a} ≪ d^{i n t e r}

),

ω

is small, producing a sharper and more discriminative kernel. (2) When

d^{i n t r a}

and

d^{i n t e r}

are comparable,

ω

grows larger, yielding a smoother kernel that prevents premature hard partitioning. Thus, the clustering structure factor

ω

adaptively tunes the Student’s t-kernel to balance intra-cluster compactness and inter-cluster separation according to the data structure.

3.4. Triple-Stream Structure Alignment

To strengthen the clustering-oriented triple-stream structure, we introduce two alignment losses: cluster-centroid alignment and neighbor-relationship alignment.

3.4.1. Cluster-Centroid Alignment

The STC process relies heavily on effective centroid optimization; constraining centroid updates within a proper range helps reduce the solution space. To this end, we introduce a cluster-centroid alignment loss for each augmented view:

L_{a l i} = \frac{1}{C} \sum_{j = 1}^{C} | | μ_{j}^{r a w} - r_{j} {| |}_{2} + | | μ_{j}^{w} - r_{j} {| |}_{2} + | | μ_{j}^{s} - r_{j} {| |}_{2},

(14)

where

μ_{j}^{r a w}

,

μ_{j}^{w}

, and

μ_{j}^{s}

are the learnable centroids for each view, and

r_{j}

is the reference centroid obtained via K-means on the combined weak and raw representation spaces. The final loss

L_{a l i}

aggregates contributions from all three views.

This design follows the same rationale as the triple-stream clustering scheme in Section 3.3.2. It constrains learnable centroids to remain close to the true centers in representation space and guides their updates along the evolving feature distribution, ensuring stable and consistent clustering within the STC framework.

3.4.2. Neighbor Alignment

Introducing strong views enables exploration of regions farther from the sample, thus expanding the semantic scope through nearest-neighbor relationships and improving model generalization. For each representation

f_{i}^{v i e w}

, we collect its global K nearest neighbors

f_{z}^{v i e w} = N_{K} (f_{i}^{v i e w})

and encourage their consistency via the following loss:

L_{n e i g} = \frac{1}{N} \sum_{i = 1}^{N} \sum_{z = 1}^{K} [, l (f_{i}^{r a w}, f_{z}^{w + r a w}) + l (f_{i}^{w}, f_{z}^{w + r a w}) + l (f_{i}^{s}, f_{z}^{s}),],

(15)

where

l (\cdot, \cdot)

measures feature similarity. For the weak and raw views, K neighbors are selected based on the smallest combined distance across both views, forming

f_{z}^{w + r a w}

to jointly guide their learning. For the strong view,

f_{z}^{s}

is computed independently from its own representation space. As in

L_{a l i}

, the total

L_{n e i g}

aggregates contributions from all three views.

Equation (15) is flexible and can adopt cosine, MSE, or consistency losses, or be reformulated to incorporate nearest-neighbor relations in both representation and clustering-prediction spaces.

3.5. Overall Loss and Training Strategy

The practical implementation of deep clustering should address several key considerations: (1) The clustering loss must be applied to high-quality representations to ensure correct optimization. (2) The representation and clustering losses should be trained jointly to prevent semantic forgetting. (3) Minibatch and epoch operations differ in scale and update frequency. Thus, loss computation and update schedules must be adapted accordingly.

Different losses may target distinct objectives or features, potentially causing conflicts [5,6,7,37]. In TCSS, however, all objectives are unified around the clustering structure, preventing such conflicts. Still, as in most high-performing methods [21,40,41,50], some components depend on reliable representations. Therefore, training is divided into two phases by epoch:

In the first (warm-up) phase, only the

L_{c o n - t r i}

loss is optimized to establish stable and discriminative representations.

After the warm-up phase, the model can generate stable representations. In the main training phase, the framework is jointly optimized with the overall objective:

L_{o v e r a l l} = L_{c o n - t r i} + λ_{1} L_{c l u - t r i} + λ_{2} L_{a l i} + λ_{3} L_{n e i g},

(16)

where

λ_{1}

,

λ_{2}

, and

λ_{3}

are weighting coefficients. Premature joint optimization—especially of the clustering and alignment terms—can cause samples to converge to local optima. The Tag Bank is also activated during this phase.

Global quantities such as the K-nearest neighbors, clustering structure factor

ω

, and cluster centroids are recomputed each epoch, since relying on batch-level data may degrade performance. The

L_{a l i}

loss operates at the epoch level, while the remaining losses are computed at the batch level.

The complete training process is summarized in Algorithm 1. The TCSS backbone is denoted as

ϕ_{θ}

, the contrastive head as

ϕ_{con}

, the clustering head as

μ^{view}

, and the augmentation strategy as

(A, a)

. TCSS is trained following Algorithm 1, and final clustering assignments

{y_{i}}_{i = 1}^{N}

are obtained by applying arg max to Equation (4).

Algorithm 1: TCSS Algorithm

input: Dataset D, strong and weak augmentation denoted as

(A, a)

, model
structure: backbone

ϕ_{θ}

, contrastive head

ϕ_{c o n}

, clustering heads

μ^{v i e w}

for

r a w, w, s

views

1: warm-up:
2: for $e = 1$ to $w a r m - u p$ $e p o c h e s$ do
3: for $b = 1$ to B do
4: Sampling a minibatch data X from D;
5: $X^{r a w} = X$ ; $X^{w} = a (X)$ ; $X^{s} = A (X)$ ;
6: Compute $L_{c o n - t r i}$ according to Equation (3);
7: Update $ϕ_{θ}$ and $ϕ_{c o n}$ ;
8: end
9: end
10: Initialize centroids set $μ^{v i e w}$ , $N_{K} (f_{i}^{v i e w})$ , $r_{j}$ , $ω^{v i e w}$ and tag bank for $r a w, w, s$ views
by conducting k-means on representations;
11: Main Phase:
12: for $e = 1$ to $m a i n e p o c h e s$ do
13: for $b = 1$ to B do
14: Conduct the operation of pseudo-code 5-6;
15: Compute $L_{c l u - t r i}$ according to Equations (6), (7) and (10);
16: Compute $L_{n e i g}$ according to Equation (15);
17: Update $ϕ_{θ}$ , $ϕ_{c o n}$ and clustering heads;
18: Update tag bank;
19: end
20: Update $N_{K} (f_{i}^{v i e w})$ , $r_{j}$ , $ω^{v i e w}$ for each view by conducting k-means on
representations;
21: Compute $L_{a l i}$ according to Equation (14), then update $ϕ_{θ}$ and clustering heads;
22: end
output: Cluster Assignment ${y_{i}}_{i = 1}^{N}$ according to Equation (4) on raw view

4. Experiments

In this section, comprehensive experiments are conducted on the TCSS to evaluate and discuss its performance and properties.

4.1. Datasets Evaluation Metrics

Five benchmark datasets are utilized in the experiments. The brief introduction of each dataset is described below.

CIFAR-10&CIFAR-100 [51]: CIFAR-10 is a natural image dataset with 50,000/10,000 (train/test) samples from 10 classes for training and testing, respectively. CIFAR-100 contains 20 super-classes, which can be further divided into 100 classes. Moreover, it has the same number of samples and image size (32 × 32) as CIFAR-10. Please note that we use the 20 super-classes as the ground-truth during experiments.
STL-10 [52]: STL-10 is an ImageNet-sourced dataset containing 5000/8000 (train/test) images with a size of 96 × 96 from each of 10 classes.
ImageNet-10&ImageNet-Dogs [11]: ImageNet-10 is a subset of ImageNet with 10 classes, each of which consists of 1300 samples with varying image sizes. ImageNet-Dogs is constructed similarly to ImageNet-10, but it selects a total of 19,500 dog images of 15 breeds from the ImageNet dataset.

Evaluation Metrics: We used three popular metrics to evaluate clustering results, including Normalized Mutual Information (NMI) [53], Clustering Accuracy (ACC) [54], and Adjusted Rand Index (ARI) [55]. Please note that higher values of the three evaluation metrics indicate better clustering performances.

4.2. Implementations

All experiments were conducted on an NVIDIA RTX 3090 Ti GPU using PyTorch 1.13 and CUDA 11.7. Random seed = 42 for reproducibility. For the strong augmentation method, we employ the transformations from [42], including AutoContrast, Brightness, Color, Contrast, Equalize, Identity, Posterize, Rotate, Sharpness, ShearX/Y, Solarize, and TranslateX/Y. The

L_{n e i g}

is set as consistence loss as [14]. The backbone network is set to ResNet-34 for all datasets. The batch size is set to 256. The epoch is set to 1000, where 50 for the first phase, and 950 for the second phase. The Adam optimizer with a weight decay of

1 \times 10^{- 4}

. The learning rate was set to

4 \times 10^{- 4}

and halved every 300 epochs after the first 100 epochs, with a weight decay of

1 \times 10^{- 4}

. The nearest-neighbor parameter K is set to 3 for all datasets.

λ_{1}

is set to 1.

λ_{2}

is set to 5.

λ_{3}

is set to 5.

τ

is set to 0.5. For the clustering head, we use a size of 512 × C, where C denotes the class number. For the contrastive head, we use an MLP with one hidden layer of size 512, and the output size is 128. The images in the ImageNet dataset are resized to a size of 224 × 224 pixels.

4.3. Compared to Other Methods

We compared TCSS with other competing clustering methods, including traditional clustering methods, Variational Auto-Encoder (VAE) [56], Jointly Unsupervised Learning (JULE) [10], Deep Convolution Generative Adversarial Network (DCGAN) [53], early deep clustering methods Deep Embedding Clustering (DEC) [12], Deep Adaptive image Clustering (DAC) [11], advancing and competing deep clustering methods Deep Comprehensive Correlation Mining (DCCM) [17], Partition Confidence mAximization (PICA) [18], Contrastive Clustering (CC) [39], Semantic Pseudo-labeling-based Image ClustEring (SPICE) [50], Twin Contrastive Learning (TCL) [40], and recent 3-years deep clustering methods Deep Clustering model based on Semantic Consistency (DCSC) [41], Strongly Augmented Contrastive Clustering (SACC) [20], Deep Clustering via Ensembles (DeepCluE) [46], Image clustering with contrastive learning and multi-scale Graph Convolutional Networks (IcicleGCN) [21], Deep Clustering with Hybrid-Grained Contrastive and Discriminative Learning (DCHL) [45]. Table 2 illustrates the clustering metric results on five image datasets, where the highest and second-highest values are tagged in red and blue, respectively. The results are reported as the mean and standard deviation from 10 different runs. We additionally present paired t-tests between TCSS and the second-best baseline for each dataset. The results show that the improvements of TCSS in ACC and NMI are statistically significant (p < 0.05) on all datasets, with stronger significance (p < 0.01) observed on CIFAR-100. These findings confirm that the superior performance of TCSS is statistically reliable rather than caused by random variation.

TCSS consistently outperforms other clustering methods across nearly all datasets, confirming its effectiveness. For fine-grained datasets such as CIFAR-100 and ImageNet-Dogs, contrastive-based approaches often suffer from severe class collision, where samples from the same class are incorrectly pushed apart. In contrast, TCSS achieves competitive results on both datasets—improving NMI, ACC, and ARI by 7.9%, 9.8%, and 12.1%, respectively, over the second-best method on ImageNet-Dogs. TCL and SPICE also perform well on these datasets, likely because they employ confidence-based boosting strategies derived from SCAN, which act as semi-supervised plug-ins that significantly enhance clustering performance. Recently, similar text-supervised extensions have shown strong results as well [58,59]. On other datasets, TCSS improves NMI, ACC, and ARI by 4.4%, 3.1%, and 3.5% on CIFAR-10, and by 2.8%, 1.4%, and 1.0% on STL-10. While performance gains on ImageNet-10 are smaller, TCSS still achieves high absolute scores. Overall, these results demonstrate the robustness and effectiveness of the TCSS framework.

4.4. Clustering Quality

In this section, we analyze the clustering quality of TCSS through confusion matrices and case studies.

As shown in Figure 3, TCSS produces clear block-diagonal patterns on CIFAR-10, STL-10, and ImageNet-10, indicating high clustering accuracy. The matrices for CIFAR-100 and ImageNet-Dogs exhibit slightly weaker diagonal structures due to the large number of classes and the small image size in CIFAR-100, as well as the fine-grained subclasses in ImageNet-Dogs, which are difficult even for humans to distinguish. Nonetheless, the diagonals remain evident, and—together with the results in Table 2—these findings confirm TCSS’s strong performance on all datasets.

We further evaluate TCSS through a case study on the challenging ImageNet-Dogs and CIFAR-100 datasets. The clustering results are categorized into three scenarios: (1) CC—correctly clustered samples. (2) IC-this—samples belonging to this cluster but incorrectly assigned elsewhere. (3) IC-other—samples from other clusters incorrectly assigned to this cluster. The results are shown in Figure 4.

In Figure 4a, many misclassified samples (IC-this and IC-other) contain multiple objects, introducing noise due to the absence of a clear visual focus. In such cases, clustering can still be considered reasonable if any object in the image is correctly categorized. Other errors arise from fine-grained ambiguities that are difficult even for humans to distinguish. In Figure 4b, the people class shows similar multi-object confusion. Additional errors mainly result from low image resolution, which degrades visual discriminability. For instance, images from the vehicles2 class—originally skyscraper samples—were misclassified as rockets (a subclass of vehicles2).

4.5. Analysis of Training Cost

We evaluate the training cost of TCSS by comparing its runtime with a baseline model on the CIFAR-10 dataset using an NVIDIA RTX 3090 Ti GPU. The results are summarized in Table 3.

The backbone of TCSS is ResNet34 with 22.13 M parameters. TCSS introduces only the clustering heads, adding

3 \times K

parameters (where K is the number of clusters), which is negligible. Thus, the total parameter count of TCSS remains 22.13 M. It can be observed that since all methods adopt ResNet-34 as the backbone, the parameter counts of TCSS and the other baseline models are comparable.

In terms of runtime, the basic model with the triple-stream setup requires 18.48 h, while TCSS takes approximately 20.42 h under identical conditions. The additional cost mainly arises from the repeated K-means operations (including nearest-neighbor computation). Even with GPU-accelerated FAISS-K-means, each clustering step takes about 5 s, contributing significantly to the overall training time. Compared with the other baseline models, TCSS is slightly less efficient in terms of training time.

Therefore, when TCSS is applied to large-scale datasets, the main computational cost lies in the forward computation of the triple-stream inputs and the K-means clustering process. Since the feature dimensionality remains fixed, a larger dataset leads to a longer K-means execution time. Overall, the long training time and high computational complexity constitute a major limitation of TCSS.

4.6. Ablation Studies

In this section, we conduct ablation studies to evaluate the individual contributions of each component in TCSS.

4.6.1. Ablation of Losses

TCSS introduces four loss components. This section examines their effects on CIFAR-10, STL-10, and CIFAR-100. Ablation experiments are conducted under four settings: (1) Without the

L_{c l u - t r i}

loss. (2) Without the

L_{a l i}

loss. (3) Without the

L_{n e i g}

loss. (4) Without only the

L_{c o n - t r i}

loss.

When

L_{c l u - t r i}

is removed,

L_{a l i}

becomes inactive. Using only

L_{c o n - t r i}

degenerates the model into a pure triple-stream contrastive learner. As shown in Figure 5a, omitting the clustering term degrades performance, while removing all but the contrastive term causes a more pronounced decline. The neighbor alignment loss

L_{n e i g}

proves highly beneficial, likely due to the abundance of nearest neighbors in CIFAR-10 and CIFAR-100. Yet, even without

L_{n e i g}

, performance remains strong, indicating that the semantic clustering loss

L_{c l u - t r i}

provides the main contribution, while

L_{n e i g}

serves a fine-tuning role. Similarly, excluding both

L_{a l i}

and

L_{n e i g}

causes only minor degradation, suggesting that

L_{a l i}

and

L_{n e i g}

primarily refine convergence when combined with

L_{c l u - t r i}

and

L_{c o n - t r i}

.

As shown in Figure 5b, on CIFAR-10, the model without

L_{n e i g}

achieves higher accuracy than without

L_{a l i}

at epoch 200, demonstrating that

L_{a l i}

accelerates convergence during training.

4.6.2. Ablation of Triple-Augmented-View on TCSS

To evaluate the importance of triple-stream augmentation, we analyze the effects of different augmentation strategies, as shown in Table 4 (where W, R, and S denote weak, raw, and strong augmentations, respectively).

In the dual-view setting: (1) The weak + weak combination slightly outperforms weak + raw, indicating that raw data contributes little in a dual-view context. (2) The strong + weak combination achieves higher performance than other dual-view settings, as strong augmentation improves invariance learning.

In the triple-stream setting: (1) Triple-stream combinations outperform all dual-view strategies. (2) Using strong + strong with raw or weak views significantly reduces performance because the semantic “ladder” gap between strong augmentations and the others becomes too large. (3) The strong + weak + raw configuration used in TCSS slightly surpasses strong + weak + weak, forming a smoother “ladder” structure and benefiting from the raw view’s anchoring effect.

We also tested a variant without triple-stream clustering (denoted w/o tri-clu), where the clustering module degenerates into a single-view self-training setup, leading to clear performance degradation. These findings further support and extend the conclusions on data augmentation reported in [20,50].

4.6.3. Sensitivity of Hyperparameters

The sensitivity analysis of the hyperparameters

λ_{1 - 3}

is presented in Figure 6a–c. As these parameters increase, model performance generally improves with only minor fluctuations, demonstrating strong robustness. This stability stems from the reliability of the triple-stream contrastive learning framework and the low interference among loss functions. After the warm-up phase, TCSS produces confident representations, allowing a stronger emphasis on the clustering term without damaging representation quality. However, excessively large

λ_{1}

values may eventually degrade performance. As shown in Figure 5a,

λ_{1}

achieves the highest ACC at 1 on CIFAR-10 and 5 on CIFAR-100, indicating those dataset-specific tuning yields optimal results.

For the number of nearest neighbors K, Figure 6d shows that TCSS remains robust for K values between 1 and 5, benefiting from dense neighborhood structures and the stabilized representations from warm-up. Beyond

K = 5

, performance declines rapidly, likely due to an increased influence of unstable boundary samples.

To more intuitively observe the interactions and sensitivities among the parameters, we constructed the three-dimensional bar chart according to the ACC values shown in Figure 7. It can be observed from Figure 7a that when

λ_{1}

is set to 0.5, 1, or 5, the performance of TCSS remains both superior and stable regardless of changes in

λ_{2}

. However, when

λ_{1}

is either too small or too large, the performance fluctuates significantly. Meanwhile,

λ_{2}

shows a certain auxiliary effect: when

λ_{2}

maintains a relatively high value, the model can still achieve stable performance, whereas excessively low

λ_{2}

values lead to performance collapse. This indicates that the cluster-centroid alignment loss provides substantial support for the overall triple-stream clustering loss.

It can also be seen from Figure 7a that the correlation between

λ_{2}

and

λ_{3}

is relatively low, with performance variations occurring within a narrow range and showing no apparent regularity. This is because the two corresponding loss terms operate on different computational targets, resulting in weak coupling. Overall,

λ_{1}

and

λ_{2}

exhibit distinct optimal choices, whereas the selection of

λ_{3}

is independent of the first two parameters.

4.6.4. Influence of Clustering Loss with Different Settings of $ω$

This section examines the impact of different

ω

configurations on STL-10, CIFAR-100, and ImageNet-Dogs. Following [25], we test

ω = 1

,

ω = d - 1

, and a learnable

ω

. We also include the Concentration Factor (

ω

-CF) from PCL [38], which computes the log-smooth average intra-cluster distance and is originally designed for contrastive clustering. The results are summarized in Table 5.

On STL-10, the performance differences across settings are relatively small (max gaps: NMI = 7.7%, ACC = 5.9%, ARI = 10.1%). In contrast, on CIFAR-100 and ImageNet-Dogs, the proposed adaptive

ω

markedly improves performance (CIFAR-100: NMI + 7.0%, ACC + 7.3%, ARI + 9.8%; ImageNet-Dogs: NMI + 21.1%, ACC + 23.4%, ARI + 20.4%). Among traditional configurations,

ω = 1

yields the weakest results, while the learned

ω

slightly outperforms

ω = d - 1

, consistent with [25]. Combined with TCSS and TCSS-semi results (Table 2 and Table 6), these findings confirm that the design of

ω

is crucial for clustering performance.

Although

ω

-CF surpasses traditional settings, its focus solely on intra-cluster relationships limits performance compared with our formulation, which integrates both intra- and inter-cluster information. Nonetheless, its effectiveness further supports the idea that incorporating structural information into temperature-type parameters enhances clustering robustness.

To examine the influence of the proposed factor-based clustering loss on representation learning, we apply t-SNE [60] to visualize the learned representations of TCSS with and without the clustering loss on CIFAR-10 and CIFAR-100, as shown in Figure 8. Comparing the first and second rows reveals that TCSS with the clustering loss produces representations exhibiting clear clustering patterns—characterized by larger inter-cluster separations and tighter intra-cluster cohesion.

4.6.5. Investigation of Boosting Strategies

Several existing methods integrate boosting strategies by cascading a semi-supervised clustering module after the main clustering algorithm. Since this work focuses on pure clustering, TCSS employs only a simple tag bank mechanism. To assess the impact of boosting, we incorporate the SPICE-semi strategy into TCSS and evaluate it on CIFAR-10 and ImageNet-Dogs. As shown in Table 6, the confidence-based approach significantly improves performance. Confidence predictions become more reliable as training progresses, suggesting that semi-supervised modules are beneficial once pure clustering reaches a performance plateau.

Even without boosting, TCSS achieves the highest performance among state-of-the-art methods (Table 2). Moreover, the tag bank in TCSS, as part of the triple-stream contrastive learning framework, can be trained jointly with the pure clustering process.

5. Discussion

5.1. Extension of TCSS to Non-Image Domains

Although this study primarily evaluates TCSS on visual datasets, its core formulation is not restricted to image data. In principle, as long as the raw data can be mapped into a shared and comparable representation space, TCSS can be extended to non-visual domains such as time-series, tabular, or multimodal data. In this representation space, similarity between samples can be measured using standard metrics such as Euclidean or cosine distance, enabling the contrastive and clustering objectives to remain effective beyond the visual domain.

However, such extensions are non-trivial and largely depend on the ability to construct semantically consistent augmentations and representation mechanisms. For time-series data, following the observations in SoftCLT [61] and TS-TCC [62], strong augmentation can be achieved through permutation + jitter, while weak augmentation can be implemented via jitter + scale. For tabular data, the concept of “strong augmentation” has been less frequently discussed, mainly due to the higher semantic consistency requirements during transformation. Nevertheless, the various corruption rates proposed in Reference [63] can, in some cases, be regarded as a form of strong augmentation, while the MTR [64] introduced substantial perturbations at the column-embedding level can also be considered to be strong augmentation. Regardless of modality, these transformations must preserve the semantic identity of each sample to avoid undermining clustering reliability. A comprehensive summary of augmentation strategies for different data types can be found in Survey [65].

Furthermore, although the Student-t-based self-training clustering loss (Equation (7)) is mathematically applicable to any continuous embedding space, its behavior may vary when the encoder captures highly anisotropic or sequential structures. In such cases, the distance measure or “kernel function” used in Equation (7) may require adaptation, and the related scaling factors (Equation (10)) should be correspondingly adjusted. Therefore, extending TCSS to non-image modalities remains a promising yet open research direction, whose success depends on domain-specific encoder design and augmentation strategy optimization.

5.2. Relation to Multimodal and Text-Guided Clustering

Recent studies have increasingly incorporated textual or semantic priors into clustering frameworks, giving rise to multimodal or vision–language clustering paradigms such as SIC [59] and MAC [58]. In contrast, TCSS remains a purely self-supervised clustering framework that does not rely on external semantics. Conceptually, TCSS focuses on discovering intrinsic structures purely from instance-level relations, whereas multimodal methods inject additional alignment constraints across modalities.

Nevertheless, the core formulation of TCSS is not inherently limited to a single modality. As long as multimodal data can be mapped into a shared and comparable representation space where similarity can be measured (e.g., via Euclidean or cosine distance), the same contrastive and clustering principles of TCSS remain applicable. From this perspective, a CLIP-style [66] cross-modal alignment can be interpreted as a semantic form of augmentation within the TCSS framework: textual descriptions act as alternative “semantic views” of visual instances. By replacing or augmenting the contrastive head with a cross-modal projection module trained against text embeddings, one can achieve semantic alignment between textual and visual representations while preserving the triple-stream self-training mechanism. Both SIC and MAC adopt strategies along this line.

The current limitation is that TCSS does not yet leverage such auxiliary semantic information, which may restrict its performance when semantically rich side data are available. Future research could explore extending the triple-stream contrastive design to incorporate lightweight textual priors or semantic regularizers, allowing TCSS to naturally generalize toward multimodal self-supervised clustering.

5.3. Summary of Extensibility

In summary, the foundation of TCSS’s extensibility lies in metric learning, particularly the broad success of contrastive representation learning across diverse domains, and the strong representational capacity of self-training mechanisms under the Student-t distribution assumption. Together, these two aspects provide the theoretical groundwork that enables TCSS to generalize beyond the visual domain. However, in practical applications, necessary adjustments must still be made according to the specific data modality, structural characteristics, and task requirements.

6. Conclusions

This paper proposes Triple-stream Contrastive Deep Embedding Clustering via Semantic Structure (TCSS), a novel deep clustering framework. Extensive experiments on multiple challenging datasets demonstrate its strong performance. From both the conceptual design and empirical results, we conclude that: (1) The integration of representation learning and clustering learning is highly effective; (2) The clustering semantic structure factor markedly improves clustering quality; (3) The triple-stream architecture plays a vital role in enhancing clustering learning.

We also examined common boosting strategies in deep clustering. Semi-supervised and cross-modal text-guided methods can further improve clustering accuracy but often rely on substantial prior knowledge and face practical issues such as uncertain confidence thresholds or limited textual information. In future work, we plan to extend TCSS concepts to other learning tasks and domains to further evaluate its generality and effectiveness. The implementation code and pretrained models will be organized and publicly released to facilitate reproducibility and further research.

Author Contributions

Conceptualization, A.Z. and J.C.; methodology, A.Z.; software, A.Z.; validation, H.Y.; formal analysis, Y.X.; investigation, X.Z.; resources, A.Z.; writing—original draft preparation, A.Z.; writing—review and editing, A.Z.; visualization, A.Z.; supervision, J.C.; project administration, H.Y.; funding acquisition, J.C. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported by the Projects of Science and Technology Cooperation and Exchange of Shanxi Province (Grant Nos. 202204041101037, 202204041101033). The central government guides local funds for science and technology development (YDZJSX2024D049) and Guanghe Fund (No. ghfund202407027490).

Data Availability Statement

The original data presented in the study are openly available in CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 9 February 2025), STL-10 (https://cs.stanford.edu/~acoates/stl10/, accessed on 9 February 2025), ImageNet (https://image-net.org/download-images.php, accessed on 9 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TCSS	Triple-stream Contrastive Deep Embedding Clustering via Semantic Structure
CL	Contrastive Learning
STC	Self-Training Clustering module

References

Yang, H.F.; Yin, X.N.; Cai, J.H.; Yang, Y.Q.; Luo, A.L.; Bai, Z.R.; Zhou, L.C.; Zhao, X.J.; Xun, Y.L. An in-depth exploration of LAMOST unknown spectra based on density clustering. Res. Astron. Astrophys. 2023, 23, 055006. [Google Scholar] [CrossRef]
Xun, Y.; Wang, Y.; Zhang, J.; Yang, H.; Cai, J. Higher-order embedded learning for heterogeneous information networks and adaptive POI recommendation. Inf. Process. Manag. 2024, 61, 103763. [Google Scholar] [CrossRef]
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965; Volume 1, pp. 281–297. [Google Scholar]
Ng, A.; Jordan, M.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 14, 1–8. [Google Scholar]
Zhou, S.; Xu, H.; Zheng, Z.; Chen, J.; Bu, J.; Wu, J.; Wang, X.; Zhu, W.; Ester, M. A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. arXiv 2022, arXiv:2206.07579. [Google Scholar] [CrossRef]
Károly, A.I.; Fullér, R.; Galambos, P. Unsupervised clustering for deep learning: A tutorial survey. Acta Polytech. Hung. 2018, 15, 29–53. [Google Scholar] [CrossRef]
Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; Long, J. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access 2018, 6, 39501–39514. [Google Scholar] [CrossRef]
Saunshi, N.; Plevrakis, O.; Arora, S.; Khodak, M.; Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5628–5637. [Google Scholar]
Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 132–149. [Google Scholar]
Yang, J.; Parikh, D.; Batra, D. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5147–5156. [Google Scholar]
Chang, J.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep Adaptive Image Clustering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5880–5888. [Google Scholar]
Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 478–487. [Google Scholar]
Yang, L.; Cheung, N.M.; Li, J.; Fang, J. Deep clustering by gaussian mixture variational autoencoders with graph embedding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6440–6449. [Google Scholar]
Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Scan: Learning to classify images without labels. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 268–285. [Google Scholar]
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924. [Google Scholar]
Zhang, H.; Zhan, T.; Basu, S.; Davidson, I. A framework for deep constrained clustering. Data Min. Knowl. Discov. 2021, 35, 593–620. [Google Scholar] [CrossRef]
Wu, J.; Long, K.; Wang, F.; Qian, C.; Li, C.; Lin, Z.; Zha, H. Deep Comprehensive Correlation Mining for Image Clustering. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8149–8158. [Google Scholar]
Huang, J.; Gong, S.; Zhu, X. Deep Semantic Clustering by Partition Confidence Maximisation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 8846–8855. [Google Scholar]
Tao, Y.; Takagi, K.; Nakata, K. Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv 2021, arXiv:2106.00131. [Google Scholar] [CrossRef]
Deng, X.; Huang, D.; Chen, D.H.; Wang, C.D.; Lai, J.H. Strongly augmented contrastive clustering. Pattern Recognit. 2023, 139, 109470. [Google Scholar] [CrossRef]
Xu, Y.; Huang, D.; Wang, C.D.; Lai, J.H. Deep image clustering with contrastive learning and multi-scale graph convolutional networks. Pattern Recognit. 2024, 146, 110065. [Google Scholar] [CrossRef]
Cai, J.; Zhang, Y.; Wang, S.; Fan, J.; Guo, W. Wasserstein embedding learning for deep clustering: A generative approach. IEEE Trans. Multimed. 2024, 26, 7567–7580. [Google Scholar] [CrossRef]
Zhang, X.; Xu, H.; Zhu, X.; Chen, Y. Deep contrastive clustering via hard positive sample debiased. Neurocomputing 2024, 570, 127147. [Google Scholar] [CrossRef]
Liu, Z.; Song, P. Deep low-rank tensor embedding for multi-view subspace clustering. Expert Syst. Appl. 2024, 237, 121518. [Google Scholar] [CrossRef]
Van Der Maaten, L. Learning a parametric embedding by preserving local structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, USA, 16–18 April 2009; pp. 384–391. [Google Scholar]
Li, F.; Qiao, H.; Zhang, B. Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognit. 2018, 83, 161–173. [Google Scholar] [CrossRef]
Han, K.; Vedaldi, A.; Zisserman, A. Learning to discover novel visual categories via deep transfer clustering. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8401–8409. [Google Scholar]
Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia, 19–25 August 2017; pp. 1753–1759. [Google Scholar]
Guo, X.; Zhu, E.; Liu, X.; Yin, J. Deep embedded clustering with data augmentation. In Proceedings of the 10th Asian Conference on Machine Learning, Beijing, China, 14–16 November 2018; pp. 550–565. [Google Scholar]
Li, P.; Zhao, H.; Liu, H. Deep fair clustering for visual learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 9070–9079. [Google Scholar]
Gunari, A.; Kudari, S.V.; Nadagadalli, S.; Goudnaik, K.; Tabib, R.A.; Mudenagudi, U.; Jamadandi, A. Deep visual attention based transfer clustering. In Advances in Computing and Network Communications: Proceedings of CoCoNet 2020; Springer: Singapore, 2021; Volume 2, pp. 357–366. [Google Scholar]
Rabbani, S.B.; Medri, I.V.; Samad, M.D. Deep clustering of tabular data by weighted Gaussian distribution learning. Neurocomputing 2025, 623, 129359. [Google Scholar] [CrossRef]
Zeng, L.; Yao, S.; Liu, X.; Xiao, L.; Qian, Y. A clustering ensemble algorithm for handling deep embeddings using cluster confidence. Comput. J. 2025, 68, 163–174. [Google Scholar] [CrossRef]
Wang, S.; Yang, J.; Yao, J.; Bai, Y.; Zhu, W. An overview of advanced deep graph node clustering. IEEE Trans. Comput. Soc. Syst. 2023, 11, 1302–1314. [Google Scholar] [CrossRef]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 2020 International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar] [CrossRef]
Geiping, J.; Garrido, Q.; Fernandez, P.; Bar, A.; Pirsiavash, H.; LeCun, Y.; Goldblum, M. A Cookbook of Self-Supervised Learning. arXiv 2023, arXiv:2304.12210. [Google Scholar] [CrossRef]
Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C. Prototypical contrastive learning of unsupervised representations. arXiv 2020, arXiv:2005.04966. [Google Scholar]
Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J.T.; Peng, X. Contrastive clustering. In Proceedings of the 2021 AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8547–8555. [Google Scholar]
Li, Y.; Yang, M.; Peng, D.; Li, T.; Huang, J.; Peng, X. Twin contrastive learning for online clustering. Int. J. Comput. Vis. 2022, 130, 2205–2221. [Google Scholar] [CrossRef]
Zhang, F.; Li, L.; Hua, Q.; Dong, C.R.; Lim, B.H. Improved deep clustering model based on semantic consistency for image clustering. Knowl.-Based Syst. 2022, 253, 109507. [Google Scholar] [CrossRef]
Wang, X.; Qi, G.J. Contrastive learning with stronger augmentations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5549–5560. [Google Scholar] [CrossRef] [PubMed]
Luo, F.; Liu, Y.; Gong, X.; Nan, Z.; Guo, T. EMVCC: Enhanced multi-view contrastive clustering for hyperspectral images. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 6288–6296. [Google Scholar]
Wei, X.; Hu, T.; Wu, D.; Yang, F.; Zhao, C.; Lu, Y. ECCT: Efficient contrastive clustering via pseudo-Siamese vision transformer and multi-view augmentation. Neural Netw. 2024, 180, 106684. [Google Scholar] [CrossRef] [PubMed]
Huang, D.; Deng, X.; Chen, D.H.; Wen, Z.; Sun, W.; Wang, C.D.; Lai, J.H. Deep clustering with hybrid-grained contrastive and discriminative learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9472–9483. [Google Scholar] [CrossRef]
Huang, D.; Chen, D.H.; Chen, X.; Wang, C.D.; Lai, J.H. Deepclue: Enhanced deep clustering via multi-layer ensembles in neural networks. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 1582–1594. [Google Scholar] [CrossRef]
Kulatilleke, G.K.; Portmann, M.; Chandra, S.S. SCGC: Self-supervised contrastive graph clustering. Neurocomputing 2025, 611, 128629. [Google Scholar] [CrossRef]
Shi, F.; Wan, S.; Wu, S.; Wei, H.; Lu, H. Deep contrastive coordinated multi-view consistency clustering. Mach. Learn. 2025, 114, 81. [Google Scholar] [CrossRef]
Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
Niu, C.; Shan, H.; Wang, G. Spice: Semantic pseudo-labeling for image clustering. IEEE Trans. Image Process. 2022, 31, 7264–7278. [Google Scholar] [CrossRef] [PubMed]
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
Coates, A.; Ng, A.; Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 215–223. [Google Scholar]
Radford, A. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. In Proceedings of the NIPS 2017 Workshop, Long Beach, CA, USA, 8–9 December 2017. [Google Scholar]
Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 2006, 19, 1–8. [Google Scholar]
Qiu, L.; Zhang, Q.; Chen, X.; Cai, S. Multi-level cross-modal alignment for image clustering. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 14695–14703. [Google Scholar]
Cai, S.; Qiu, L.; Chen, X.; Zhang, Q.; Chen, L. Semantic-enhanced image clustering. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 6869–6878. [Google Scholar]
Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Lee, S.; Park, T.; Lee, K. Soft contrastive learning for time series. arXiv 2023, arXiv:2312.16424. [Google Scholar]
Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Time-series representation learning via temporal and contextual contrasting. arXiv 2021, arXiv:2106.14112. [Google Scholar] [CrossRef]
Khoeini, A.; Peng, S.; Ester, M. Informed Augmentation Selection Improves Tabular Contrastive Learning. In Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 June 2025; pp. 306–318. [Google Scholar]
Onishi, S.; Meguro, S. Rethinking data augmentation for tabular data in deep learning. arXiv 2023, arXiv:2305.10308. [Google Scholar] [CrossRef]
Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A comprehensive survey on data augmentation. IEEE Trans. Knowl. Data Eng. 2025, 1–20. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 2021 International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]

Figure 1. The overall framework of the proposed TCSS. Circular points denote representations: red/orange for centroids, gray/blue/indigo for triple-stream features, and green rectangles for contrastive vectors. TCSS comprises three main modules: Triple-stream Contrastive Learning for representation discrimination and invariance across weak, strong, and raw views; Triple-stream Clustering Learning for stable and robust cluster formation via self-training; and Triple-stream Structure Alignment for maintaining consistency among views through centroid and neighborhood alignment. TCSS employs a triple-stream contrastive head (

L_{c o n - t r i}

) using weakly, strongly, and raw instances to learn clustering-friendly features. It then computes structure factors

ω

in the representation space (red dashed box, with red and white arrows for

d^{i n t e r}

and

d^{i n t r a}

). Based on these, a self-training clustering head (

L_{c l u - t r i}

) produces predictions, while alignment losses

L_{a l i}

(green box) and

L_{n e i g}

(yellow box) refine the model.

Figure 1. The overall framework of the proposed TCSS. Circular points denote representations: red/orange for centroids, gray/blue/indigo for triple-stream features, and green rectangles for contrastive vectors. TCSS comprises three main modules: Triple-stream Contrastive Learning for representation discrimination and invariance across weak, strong, and raw views; Triple-stream Clustering Learning for stable and robust cluster formation via self-training; and Triple-stream Structure Alignment for maintaining consistency among views through centroid and neighborhood alignment. TCSS employs a triple-stream contrastive head (

L_{c o n - t r i}

) using weakly, strongly, and raw instances to learn clustering-friendly features. It then computes structure factors

ω

in the representation space (red dashed box, with red and white arrows for

d^{i n t e r}

and

d^{i n t r a}

). Based on these, a self-training clustering head (

L_{c l u - t r i}

) produces predictions, while alignment losses

L_{a l i}

(green box) and

L_{n e i g}

(yellow box) refine the model.

Figure 2. The network architecture diagram of TCSS. TCSS first applies a triple-stream contrastive head (

L_{c o n - t r i}

) to learn clustering-friendly features. These representations are then fed into a self-training clustering head (

L_{c l u - t r i}

) to produce clustering assignments. Meanwhile, alignment mechanisms with

L_{a l i}

and

L_{n e i g}

further refine the learning process. All objectives are jointly optimized within a unified framework.

Figure 2. The network architecture diagram of TCSS. TCSS first applies a triple-stream contrastive head (

L_{c o n - t r i}

) to learn clustering-friendly features. These representations are then fed into a self-training clustering head (

L_{c l u - t r i}

) to produce clustering assignments. Meanwhile, alignment mechanisms with

L_{a l i}

and

L_{n e i g}

further refine the learning process. All objectives are jointly optimized within a unified framework.

Figure 3. The confusion matrices of TCSS on five datasets, where the x-axes are the ground-truth labels and the y-axes are the predicted labels. The clearer the diagonal structure in the confusion matrix, the better it is represented.

Figure 4. The case studies of clustering quality of TCSS on ImageNet-Dogs and CIFAR-100 datasets. The clustering results are categorized into three scenarios: (1) CC—correctly clustered samples. (2) IC-this—samples belonging to this cluster but incorrectly assigned elsewhere. (3) IC-other—samples from other clusters incorrectly assigned to this cluster.

Figure 5. Influence of terms in overall losses. (a) Illustrates the ablations of losses on three datasets. (b) Illustrates the effect of

L_{a l i}

and

L_{n e i g}

in CIFAR-10.

Figure 5. Influence of terms in overall losses. (a) Illustrates the ablations of losses on three datasets. (b) Illustrates the effect of

L_{a l i}

and

L_{n e i g}

in CIFAR-10.

Figure 6. Sensitivity of hyperparameters. (a–c) Illustrates the sensitivity of

λ_{1 - 3}

. (d) Illustrates the sensitivity of nearest-neighbor number K.

Figure 6. Sensitivity of hyperparameters. (a–c) Illustrates the sensitivity of

λ_{1 - 3}

. (d) Illustrates the sensitivity of nearest-neighbor number K.

Figure 7. The interactions and sensitivities among the parameters. (a) Illustrates the interactions and sensitivities on

l a m b d a_{1}

and

l a m b d a_{2}

, (b) Illustrates the interactions and sensitivities on

l a m b d a_{2}

and

l a m b d a_{3}

.

Figure 7. The interactions and sensitivities among the parameters. (a) Illustrates the interactions and sensitivities on

l a m b d a_{1}

and

l a m b d a_{2}

, (b) Illustrates the interactions and sensitivities on

l a m b d a_{2}

and

l a m b d a_{3}

.

Figure 8. The t-SNE visualizations of TCSS on CIFAR-10 and CIFAR-100 datasets with ground-truth labels. TCSS with the clustering loss produces representations exhibiting clear clustering patterns—characterized by larger inter-cluster separations and tighter intra-cluster cohesion.

Table 1. Conceptual and performance comparison between dual-stream and the proposed triple-stream (TCSS) models.

Aspect	SM ³	CSD ³	CSIS ³	SE ³	AP ³	Perf. ^1,3 (NMI/ACC/ARI) ²
CC-2021 [39]	weak, weak	None	Consistency	hyper-line segment	Existence	0.705, 0.790, 0.637
IDFD-2021 [19]	weak, weak	None	Independence	hyper-line segment	Nonexistence	0.711, 0.815, 0.663
DCSC-2022 [41]	weak, weak	None	Consistency	hyper-line segment	Existence	0.704, 0.798, 0.644
DeepCluE-2024 [46]	weak, weak	None	Consistency	hyper-line segment	Existence	0.727, 0.764, 0.646
IcicleGCN-2024 [21]	weak, weak	None	Consistency	hyper-line segment	Existence	0.729, 0.807, 0.660
TCSS	raw, weak, strong	Yes	Fusion + Alignment	hyper-triangle	Nonexistence	0.834, 0.896, 0.787

¹ Performance results are averaged over the CIFAR-10 dataset. ² NMI, ACC, and ARI denote normalized mutual information, clustering accuracy, and adjusted Rand index, respectively. ³ Abbreviations: SM—Stream Members; CSD—Clustering-oriented Stream Design; CSIS—Cluster-level Stream Interaction Strategies; SE—Structural Expressiveness; AP—Alignment problem; Perf.—Performance.

Table 2. The clustering performance by different competing clustering algorithm on five datasets. The highest and second-highest values are tagged in bold and underline, respectively.

Datasets	CIFAR-10			CIFAR-100			STL-10			ImageNet-10			ImageNet-Dogs
Meric	NMI	ACC	ARI	NMI	ACC	ARI	NMI	ACC	ARI	NMI	ACC	ARI	NMI	ACC	ARI
K-Means [3]	0.087	0.229	0.049	0.084	0.130	0.028	0.125	0.192	0.061	0.119	0.241	0.057	0.055	0.105	0.020
SC [4]	0.103	0.247	0.085	0.090	0.136	0.022	0.098	0.159	0.048	0.151	0.274	0.076	0.038	0.111	0.013
AE [57]	0.239	0.314	0.169	0.100	0.165	0.048	0.250	0.303	0.161	0.210	0.317	0.152	0.104	0.185	0.073
VAE [56]	0.245	0.291	0.167	0.108	0.152	0.040	0.200	0.282	0.146	0.193	0.334	0.168	0.107	0.179	0.079
JULE [10]	0.192	0.272	0.138	0.103	0.137	0.033	0.182	0.277	0.164	0.175	0.300	0.138	0.054	0.138	0.028
DCGAN [53]	0.265	0.315	0.176	0.120	0.151	0.045	0.210	0.298	0.139	0.225	0.346	0.157	0.121	0.174	0.078
DEC [12]	0.257	0.301	0.161	0.136	0.185	0.050	0.276	0.359	0.186	0.282	0.381	0.203	0.122	0.195	0.079
DAC [11]	0.396	0.522	0.306	0.185	0.238	0.088	0.366	0.470	0.257	0.394	0.527	0.302	0.219	0.275	0.111
DCCM [17]	0.496	0.623	0.408	0.285	0.327	0.173	0.376	0.482	0.262	0.608	0.710	0.555	0.321	0.383	0.182
PICA [18]	0.591	0.696	0.512	0.310	0.337	0.171	0.611	0.713	0.531	0.802	0.870	0.761	0.352	0.352	0.201
CC [39]	0.705	0.790	0.637	0.431	0.429	0.266	0.764	0.850	0.726	0.859	0.893	0.822	0.445	0.429	0.274
SPICE [50]	0.734	0.838	0.705	0.448	0.468	0.294	0.817	0.908	0.812	0.840	0.921	0.836	0.498	0.546	0.362
TCL [40]	0.790	0.865	0.752	0.529	0.531	0.357	0.799	0.868	0.757	0.875	0.895	0.837	0.518	0.549	0.381
DCSC [41]	0.704	0.798	0.644	0.452	0.469	0.293	0.792	0.865	0.749	0.867	0.904	0.838	0.462	0.443	0.299
SACC [20]	0.765	0.851	0.724	0.448	0.443	0.282	0.691	0.759	0.626	0.877	0.905	0.843	0.455	0.437	0.285
DeepCluE [46]	0.727	0.764	0.646	0.472	0.457	0.288	-	-	-	0.882	0.924	0.856	0.448	0.416	0.273
IcicleGCN [21]	0.729	0.807	0.660	0.459	0.461	0.311	-	-	-	0.904	0.955	0.905	0.456	0.415	0.279
DHCL [45]	0.710	0.801	0.654	0.432	0.446	0.275	0.726	0.821	0.680	-	-	-	0.495	0.511	0.359
TCSS	0.834	0.896	0.787	0.511	0.536	0.383	0.845	0.922	0.822	0.910	0.938	0.907	0.597	0.647	0.502
Std.	±0.006	±0.019	±0.021	±0.017	±0.021	±0.013	±0.005	±0.013	±0.006	±0.019	±0.026	±0.031	±0.036	±0.026	±0.017
p-value	0.0287	0.0324	0.0351	0.0071	0.0063	0.0068	0.0198	0.0215	0.0242	0.0335	0.0389	0.0412	0.0234	0.0276	0.0251

Table 3. Comparison with baseline models in terms of parameters and training time on the CIFAR-10 dataset.

Methods ¹	Parameter Count ²	Training Cost ²
TCL [40]	22.21 M ³	17.63 H
SACC [20]	22.21 M ³	18.56 H
SPICE [50]	22.19 M	23.40 H
IcicleGCN [21]	25.1 M	21.14 H
TCSS	22.13 M	20.42 H

¹ All baseline models utilizes ResNet34 as their backbone. ² The unit of parameter count is million, denoted as M; the unit of training cost is hour, denoted as H. ³ Since TCL and SACC use the same network architecture, they have identical parameter counts, though their algorithms differ.

Table 4. The clustering results using different combinations of augmentations on the CIFAR-10 dataset. The best values are tagged in bold.

Augmentations	NMI	ACC	ARI
W + W	0.701	0.813	0.639
W + R	0.689	0.794	0.612
W + S	0.757	0.851	0.710
W + W + S	0.827	0.889	0.764
W + S + R (TCSS)	0.834	0.896	0.787
S + S + R	0.668	0.767	0.595
S + S + W	0.625	0.749	0.574
w/o tri-clu	0.771	0.862	0.732

Table 5. The influence of different setting of

ω

in TCSS on three datasets. The best values are tagged in bold.

Table 5. The influence of different setting of

ω

in TCSS on three datasets. The best values are tagged in bold.

Datasets	STL-10			CIFAR-100			ImageNet-Dogs
Meric	NMI	ACC	ARI	NMI	ACC	ARI	NMI	ACC	ARI
$ω = 1$	0.768	0.863	0.721	0.441	0.463	0.285	0.386	0.413	0.298
$ω = d - 1$	0.782	0.886	0.745	0.461	0.480	0.317	0.438	0.461	0.397
learned $ω$	0.791	0.881	0.749	0.453	0.474	0.321	0.445	0.487	0.408
$ω$ -CF	0.818	0.893	0.758	0.475	0.491	0.318	0.503	0.559	0.412
$ω$ -TCSS	0.845	0.922	0.822	0.511	0.536	0.383	0.597	0.647	0.502

Table 6. The clustering results using different boosting strategies on CIFAR-10 and ImageNet-Dogs dataset. The best values are tagged in bold.

Datasets	CIFAR-10			ImageNet-Dogs
Boosting Strategy	NMI	ACC	ARI	NMI	ACC	ARI
w/o boosting	0.793	0.882	0.756	0.579	0.630	0.467
TCSS-tag	0.834	0.896	0.787	0.597	0.647	0.502
TCSS-semi	0.889	0.937	0.872	0.654	0.685	0.536
SPICE-self	0.734	0.838	0.705	0.498	0.546	0.362
SPICE-semi	0.865	0.926	0.852	0.504	0.554	0.343

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zheng, A.; Cai, J.; Yang, H.; Xun, Y.; Zhao, X. Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure. Mathematics 2025, 13, 3578. https://doi.org/10.3390/math13223578

AMA Style

Zheng A, Cai J, Yang H, Xun Y, Zhao X. Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure. Mathematics. 2025; 13(22):3578. https://doi.org/10.3390/math13223578

Chicago/Turabian Style

Zheng, Aiyu, Jianghui Cai, Haifeng Yang, Yalin Xun, and Xujun Zhao. 2025. "Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure" Mathematics 13, no. 22: 3578. https://doi.org/10.3390/math13223578

APA Style

Zheng, A., Cai, J., Yang, H., Xun, Y., & Zhao, X. (2025). Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure. Mathematics, 13(22), 3578. https://doi.org/10.3390/math13223578

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Triple-Stream Contrastive Deep Embedding Clustering via Semantic Structure

Abstract

1. Introduction

2. Related Works

2.1. Self-Training Deep Clustering

2.2. Deep Contrastive Clustering

3. Proposed Method

3.1. Motivation and Distinction from Dual-Stream Models

3.2. Triple-Stream Contrastive Learning

3.2.1. Introduction of Contrastive Learning

3.2.2. Triple-Stream Contrastive Learning

3.2.3. Tag Bank for Clustering

3.3. Triple-Stream Clustering Learning

3.3.1. Introduction of Self-Training Clustering

3.3.2. Clustering with Triple-Stream

3.3.3. Clustering Structure Factor ω

3.4. Triple-Stream Structure Alignment

3.4.1. Cluster-Centroid Alignment

3.4.2. Neighbor Alignment

3.5. Overall Loss and Training Strategy

4. Experiments

4.1. Datasets Evaluation Metrics

4.2. Implementations

4.3. Compared to Other Methods

4.4. Clustering Quality

4.5. Analysis of Training Cost

4.6. Ablation Studies

4.6.1. Ablation of Losses

4.6.2. Ablation of Triple-Augmented-View on TCSS

4.6.3. Sensitivity of Hyperparameters

4.6.4. Influence of Clustering Loss with Different Settings of ω

4.6.5. Investigation of Boosting Strategies

5. Discussion

5.1. Extension of TCSS to Non-Image Domains

5.2. Relation to Multimodal and Text-Guided Clustering

5.3. Summary of Extensibility

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.3.3. Clustering Structure Factor $ω$

4.6.4. Influence of Clustering Loss with Different Settings of $ω$