View-Driven Multi-View Clustering via Contrastive Double-Learning

Multi-view clustering requires simultaneous attention to both the consistency and the diversity of information between views. Deep learning techniques have shown impressive abilities to learn complex features from extensive datasets; however, existing deep multi-view clustering methods often focus on either consistency information or diversity information alone, making it difficult to balance the two. Therefore, this paper proposes a view-driven multi-view clustering method based on contrastive double learning (VMC-CD), aiming to generate better clustering results. The method first adopts a view-driven approach that considers information from other views to encourage diversity, thus guiding feature learning. Additionally, it presents the idea of dual contrastive learning to enhance the alignment of views at both the clustering and feature levels. The superiority of VMC-CD over various cutting-edge methods is substantiated by experimental findings across three datasets, affirming its effectiveness.


Introduction
Multi-view data usually include representations from diverse features or sources, where each view contains shared semantic information inherent in the multi-view data. The insights derived from multiple views tend to complement each other [1,2]. Visual information, for instance, can be characterized through diverse techniques such as SIFT, HOG, and LBP. Likewise, environmental data such as temperature and humidity can be gathered by several sensors positioned across a specified region; while the details of these data might vary, there are overarching similarities in the cluster patterns when viewed from a broader perspective. Multi-view clustering seeks to categorize data into various groups by leveraging insights from all accessible viewpoints [3][4][5][6]. However, acquiring knowledge from multiple sources simultaneously is challenging [7].
To tackle these challenges, a plethora of deep learning techniques have been introduced [18][19][20][21][22]. Deep multi-view clustering endeavours to improve performance by harnessing the feature representation capabilities inherent in deep models. In essence, these methods strive to learn consistent consensus representations by transforming the data with specialized encoder networks for each viewpoint.
Recently, contrastive learning has been integrated into deep learning frameworks to obtain unique representations from various viewpoints [23,24]. Most contrastive learning methods, however, concentrate on cross-view consistency alone. The main contributions of this paper are summarized as follows:

• Introduction of the VMC-CD technique, which incorporates valuable information from other views while learning feature representations across diverse viewpoints. It provides guidance information in an attention-driven manner, effectively integrating multiple views into a discriminative common representation to guide feature learning.
• Introduction of dual contrastive learning, which conducts contrastive learning at both the clustering and feature levels, encouraging consistency in clustering across multiple views while preserving their feature diversity.

• Experiments on three multi-view datasets, demonstrating the effectiveness of the VMC-CD method.

Related Work
This section delves into recent advancements in pertinent areas, specifically focusing on multi-view clustering and contrastive learning.

Multi-View Clustering
Multi-view clustering and classification techniques can generally be categorized into two main types: conventional methods and deep learning-based approaches. Conventional multi-view clustering methods can be subdivided into five distinct categories. Firstly, some methods rely on non-negative matrix factorization, such as that of Liu et al. [31], who explored common latent factors between multiple views and established a deep structure [32] to find more consistent shared features.
Approaches in the second category utilize self-representation to illustrate the relationships among samples [33]. In research conducted by [5], a self-representation layer was employed to hierarchically reconstruct view-specific subspaces and encoding layers, thereby enhancing the consistency of cross-view subspaces.
Approaches in the third category employ dimensionality reduction to convert multi-view data into a common, low-dimensional space, enabling a uniform representation; clustering outcomes are then derived using established clustering methods [34]. Canonical Correlation Analysis (CCA) [35] is a notable technique within this branch. In a recent study [36], a versatile framework was introduced for reducing the dimensionality of multi-view data, enabling the handling of multi-view feature representations within kernel space.
Methods in the fourth category employ graph models for multi-view clustering [37,38]. The central concept is to identify a common graph among the various perspectives and then apply spectral graph techniques (such as spectral clustering) to this shared graph to derive clustering outcomes. Moreover, a study by [39] introduced graph autoencoders for learning multi-view representations, and the study [40] focused on extracting valuable insights from complex multi-view data dispersed across various high-dimensional spaces; through graph learning, the fundamental correlations between different views are uncovered, thereby enabling effective multi-view collaboration.
The last category tackles the problem with kernel function strategies [41,42], frequently utilizing predefined kernel functions such as Gaussian kernels to handle diverse views. These methods then linearly or nonlinearly blend the kernel functions to establish a uniform kernel. The primary challenge with this approach, however, is identifying appropriate kernel functions.
These statistical models share a common limitation in their ability to capture intricate structures within the data. As a result, deep multi-view clustering has garnered considerable attention within the community and has demonstrated effectiveness across various practical scenarios.
In early research, Wang et al. [18] employed a deep autoencoder design to acquire a consolidated representation of multi-view data, yielding commendable results in speech and visual analysis tasks. Subsequently, Andrew et al. [27] introduced Deep Canonical Correlation Analysis (DCCA), which creates a unified representation of multi-view data by maximizing the correlation between deep features extracted from the views via CCA. Abavisani et al. [43] introduced a deep multi-view subspace clustering network aimed at revealing a unified affinity matrix across all viewpoints. Moreover, Zhu et al. [44] utilized deep autoencoders for self-representation learning and incorporated diversity and ubiquitous regularization to capture meaningful interconnections among different viewpoints.
While existing algorithms typically prioritize either maximizing view correlation for consistency or maximizing view independence for complementarity, this paper advocates emphasizing diversity while maintaining consistency. This balanced approach aims to achieve improved results by striking a harmonious equilibrium between the two.

Contrastive Learning
Contrastive learning has significantly progressed in the realm of self-supervised representation learning [24]. Fundamentally, contrastive learning strives to enhance the feature space of raw data by amplifying similarities among positive pairs (similar instances) while reducing similarities among negative pairs (dissimilar instances) [45]. Positive pairs generally consist of data from the identical instance, while negative pairs consist of data from different instances.
For instance, Chen et al. [24] introduced a framework for contrastive learning of visual representations. This framework seeks to maximize the agreement between different augmented views of the same example within the latent feature space.
Recently, an approach named COMPLETER [46], based on contrastive prediction, has advanced significantly by combining reconstruction, cross-view contrastive learning, and cross-view dual prediction. This method stands out not only for its effectiveness in incomplete multi-view clustering but also for its ability to simultaneously handle data recovery and consistency learning in incomplete multi-view datasets.
These methods contribute to learning high-quality representations from data. However, determining invariant representations across multiple views remains a challenging problem.

Methods
In this section, we first present a clear formulation of the problem and delineate its particulars. Next, we propose a network framework to address this problem. We then delve into each module of the proposed network, including the deep autoencoder module, the dual contrastive learning module, and the attention weight learning module, in detail.

Problem Formulation
Given multi-view data with n_v views and N samples, let X^(v) denote the v-th view. Each view X^(v) ∈ R^{d_v×N}, where d_v is the feature dimension of that view; note that different views may have different feature dimensions. Given K as the cluster count, instances with identical semantic labels should be grouped together into a shared cluster. Hence, the N samples must be partitioned into K distinct clusters.

Overview of the Network Architecture
According to Figure 1, the VMC-CD method aims to directly extract semantic labels for end-to-end clustering from raw data instances spanning multiple perspectives. We achieve this by applying the dual contrastive learning module to feature representation learning, introducing an end-to-end deep clustering network structure. Additionally, the encoder receives special treatment through an integrated view-driven attention mechanism. As shown in Figure 1, the proposed VMC-CD network architecture consists of three main modules: the deep autoencoder module, the dual contrastive learning module, and the attention weight learning module (AT BLOCK). The core of the architecture is the deep autoencoder module, which learns features conducive to clustering across multiple perspectives through unsupervised representation learning. The dual contrastive learning module is divided into two parts: one performs contrastive learning on the discriminative feature representation learned by the encoder-decoder (the feature-level contrastive learning part), and the other optimizes parameters through contrastive soft clustering assignment (the clustering-level contrastive learning part). The attention weight learning module primarily enhances the clustering quality of the discriminative feature representation by leveraging information from other views.

Deep Autoencoder Module
Our network architecture primarily relies on a deep autoencoder module comprising multi-view feature encoders and multi-view feature decoders. When learning feature representations, our attention mechanism incorporates information from other views; thus, during the feature encoding phase, we take into account relevant information from alternate views. As illustrated in Figure 2, the multi-view feature encoder used in this study comprises two components: a view-specific autoencoder module and an attention module influenced by other views, namely the attention weight learning module (discussed in detail in Section 3.5).
The view-specific encoder module comprises three initial blocks: a linear layer, a batch normalization layer, and an activation layer (ReLU). The feature encoder module primarily aims to convert view-specific data into a discriminative feature representation. This is achieved by integrating the output of the view-specific autoencoder module with the output of the attention module, which is influenced by information from other perspectives. Subsequently, a softmax function is applied to generate the feature representation. The feature decoder module performs the opposite operation, converting the discriminative feature representation back into the original view information. Each decoder block is constructed identically to the corresponding encoder block.
The overall steps of the autoencoder module are as follows. First, given the input feature data X_i^(v), the encoder learns a compact representation Z_i^(v) = f_E(X_i^(v)), where X_i^(v) denotes the i-th sample of the v-th view, Z_i^(v) denotes its low-dimensional representation, and f_E(·) broadly refers to the series of operations the encoder applies to the input. For the v-th view with v = m, the attention module driven by the other views produces attention weights o_mi (their computation is detailed in Section 3.5). This module embeds the information of interest into the attention weights, which are then element-wise multiplied with the view-specific feature representation, resulting in a view-specific discriminative representation Z_i^(m) = ξ(o_mi) ⊙ f_E(X_i^(m)), where ξ(·) denotes the sigmoid function. Subsequently, the decoder transforms the features Z_i^(m) back into the original input space by expanding the hidden representations, X̂_i^(m) = f_D(Z_i^(m)). Here, r_mi^(e) denotes the latent representation of the i-th sample of the m-th view after the e-th decoder layer, and W and b denote the weights and biases of the decoder part.
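The view-driven gating described above can be sketched in NumPy. This is a minimal illustration rather than the paper's actual network: a single ReLU linear map stands in for the encoder blocks, `w_att` plays the role of the attention weight learning module, and all names and dimensions are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_view(x_v, x_other, w_enc, w_att):
    """Encode one view; attention weights derived from the other view
    gate the view-specific features element-wise (then softmax, as in the text)."""
    z = relu(x_v @ w_enc)          # view-specific low-dimensional features
    o = sigmoid(x_other @ w_att)   # attention weights driven by the other view
    return softmax(o * z)          # discriminative representation

n, d1, d2, dz = 5, 8, 6, 4
w_enc = rng.normal(size=(d1, dz))
w_att = rng.normal(size=(d2, dz))
x1 = rng.normal(size=(n, d1))
x2 = rng.normal(size=(n, d2))
r1 = encode_view(x1, x2, w_enc, w_att)  # each row is a probability vector
```

Because of the final softmax, each sample's representation sums to one, matching the encoder output described above.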

X̂_i^(m) represents the reconstructed data of the i-th sample of the m-th view, which allows us to construct the reconstruction loss function. In this study, the autoencoder network's objective is attained by minimizing the reconstruction error. Extending the loss of the m-th view to all views, the total reconstruction loss is
ℓ_rec = Σ_{v=1}^{n_v} Σ_{i=1}^{N} ∥X_i^(v) − X̂_i^(v)∥².
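The total reconstruction loss summed over views can be sketched as follows; this assumes the standard squared-error criterion, since the paper's exact norm is not shown in the extracted text.

```python
import numpy as np

def reconstruction_loss(views, reconstructions):
    """Total reconstruction error: squared error summed over all views.
    `views` and `reconstructions` are lists of (samples x features) arrays."""
    return sum(float(np.sum((x - x_hat) ** 2))
               for x, x_hat in zip(views, reconstructions))

# perfect reconstruction gives zero loss
views = [np.ones((3, 2)), np.zeros((3, 4))]
assert reconstruction_loss(views, views) == 0.0
```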

Dual Contrastive Learning Module
The dual contrastive learning module is divided into two parts. One part performs contrastive learning on the discriminative feature representation learned by the encoder-decoder, referred to as feature-level contrastive learning in this paper. The other part optimizes parameters through contrastive clustering assignment, referred to as clustering-level contrastive learning.
Feature-level contrastive learning is performed within the latent space of the autoencoder representation to explore the common information representation across various views. This process focuses on learning the alignment between different views by maximizing their mutual information. The loss function for feature-level contrastive learning is
ℓ_ch = − Σ_{m} Σ_{v≠m} [ I(Z^(m), Z^(v)) + ∂ (H(Z^(m)) + H(Z^(v))) ],
where I denotes mutual information, H denotes entropy, and the parameter ∂ regularizes the entropy. According to information theory, entropy measures information content; hence, a higher entropy H(Z_i^(m)) signifies a larger information content within X_i^(m), ensuring diversity of information across different views. Additionally, maximizing mutual information maintains information coherence across diverse perspectives during feature acquisition.
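The extracted text omits the exact formula, but the stated objective (maximize cross-view mutual information plus an entropy term weighted by ∂) can be sketched with the common InfoNCE lower bound on mutual information. Here `alpha` stands in for ∂, the entropy term assumes rows are probability vectors (the encoder output passes through a softmax), and all function names are our illustrative stand-ins.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row, assuming rows are probability vectors."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def info_nce(z_a, z_b, tau=0.5):
    """Contrastive estimate of mutual information between two views:
    matching samples are positives; all other pairs in the batch are negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def feature_level_loss(z_a, z_b, alpha=0.1):
    """Minimizing this maximizes cross-view MI while the entropy term
    (weighted by alpha, standing in for the paper's parameter) keeps
    per-view information content high."""
    h = entropy(z_a).mean() + entropy(z_b).mean()
    return info_nce(z_a, z_b) - alpha * h
```

Aligned views (same sample order) yield a lower contrastive loss than misaligned ones, which is exactly the behavior the alignment objective relies on.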
The contrastive clustering assignment used here is a soft clustering assignment method. Unlike hard clustering, it allows data points to be assigned to multiple categories with different probabilities or membership degrees. Soft clustering assigns each data point a membership value for every category, denoting the degree to which the data point pertains to that category. These membership values form a membership matrix, with data points as rows and categories as columns, reflecting the membership of data points to each category. In contrast, hard clustering assignment requires each data point to be explicitly and uniquely assigned to a single category, without sharing or ambiguity. The specific application in this paper is as follows: for any view v, after obtaining r_vi, a separate branch is opened for all views for further processing. The representation is first passed through two linear layers for dimensionality reduction, making its dimension equal to the number of clusters K so that the next step of soft clustering assignment can proceed.
If all samples are processed uniformly as above, we obtain a matrix H^(v), whose element H_ij^(v) (the j-th element of the i-th row) indicates the likelihood that sample i in view v belongs to cluster j.
To enhance the diversity between cluster assignments and thereby strengthen the soft clustering results, a matrix Q^(v) is used to reinforce H^(v). Let Q_j^(v) be the j-th column of Q^(v); each element Q_ij^(v) represents the soft clustering assignment of sample i to cluster j, so Q_j^(v) denotes the clustering assignment of samples belonging to the same semantic cluster. The same samples across different views share the same semantic information. The similarity between two clustering assignments Q_j^(v1) and Q_j^(v2) for cluster j can then be measured, where v1 and v2 denote two different views; the clustering assignment probabilities of instances between different views are similar because those instances represent the same samples, whereas instances from multiple views that describe different samples are uncorrelated with each other. The similarity between cluster assignments within a cluster should therefore be maximized, while the similarity between cluster assignments across clusters should be minimized. We cluster samples concurrently, ensuring coherence in the clustering assignments. In the cross-view contrastive loss between Q^(v1) and Q^(v2), τ is a temperature parameter, (Q_j^(v1), Q_j^(v2)) denote positive clustering assignment pairs between the two views, and (Q_j^(v1), Q_k^(v2)) with j ≠ k denote negative pairs. The cross-view contrastive loss induced across multiple views explicitly compares clustering assignment pairs: it pulls together pairs from the same cluster assignment and pushes apart pairs from different cluster assignments. To avoid the degenerate case where all instances are grouped into a single sub-cluster, we introduce a regularization term P, defined as the
cross-view consistency loss, which prevents all instances from belonging to the same cluster j. The total loss for the clustering-level contrast, ℓ_cl, combines the cross-view contrastive loss with this regularization term.
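The cluster-level loss described above can be sketched as follows, treating each cluster-assignment column as an instance for contrastive learning. The cosine similarity and temperature follow the text; the near-one-hot toy assignments are our own illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_contrastive_loss(q1, q2, tau=1.0):
    """Cross-view contrastive loss over cluster-assignment columns:
    column j of view 1 paired with column j of view 2 is a positive pair;
    columns of different clusters (j != k) are negative pairs."""
    a = q1 / np.linalg.norm(q1, axis=0, keepdims=True)  # normalize columns
    b = q2 / np.linalg.norm(q2, axis=0, keepdims=True)
    logits = (a.T @ b) / tau                            # K x K cosine similarities
    log_prob = np.diag(logits) - np.log(np.exp(logits).sum(axis=1))
    return -np.mean(log_prob)

# toy assignments: 6 samples, 3 clusters, both views agree
labels = np.array([0, 0, 1, 1, 2, 2])
q = softmax(5.0 * np.eye(3)[labels])          # near one-hot soft assignments
aligned = cluster_contrastive_loss(q, q)
shuffled = cluster_contrastive_loss(q, q[:, ::-1])  # mismatched cluster columns
```

Agreeing cluster assignments across views produce a lower loss than mismatched ones, which is the pull-together/push-apart behavior the text describes.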

The Attention Weight Learning Module (AT BLOCK)
As shown in Figure 2, when learning the feature representations of multiple views, attention is generated from the other views, incorporating their information of interest during the feature encoding process. We constructed the AT BLOCK from fully connected layers with ReLU and a transformer, connecting the transformer's input and output to the two ReLU fully connected layers via skip connections. The primary purpose of the AT BLOCK is to map complex data into spaces corresponding to different views, obtaining attention weights for the different views to guide feature learning.
The attention module is structured as a sequence of fully connected layers followed by ReLU activation. Through the sigmoid function, the attention module calculates attention weights that encapsulate relevant information within the dataset.
In the multi-view feature encoder input, with two views, the feature learning procedure feeds the other view's data into the attention-driven module to support feature learning; this is symbolized as A_1 = X_2 and A_2 = X_1. In the attention module, W_o^(e) and b_o^(e) denote the weights and biases of the linear layer at the e-th layer.
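A simplified sketch of the AT BLOCK forward pass is given below. The transformer sub-block is omitted here for brevity, and the layer sizes and names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def at_block(a, w1, b1, w2, b2):
    """Simplified AT BLOCK: two fully connected ReLU layers with a skip
    connection (the paper's transformer sub-block is omitted), ending
    in a sigmoid that yields attention weights in (0, 1)."""
    h = relu(a @ w1 + b1)
    h = relu(h @ w2 + b2) + h   # skip connection; requires w2 to be square
    return sigmoid(h)

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 6))     # input from the other view (e.g., A_1 = X_2)
w1, b1 = rng.normal(size=(6, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
weights = at_block(a, w1, b1, w2, b2)
```

The sigmoid keeps every attention weight strictly between 0 and 1, so the element-wise gating in the encoder can only attenuate or pass features, never flip their sign.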

Total Loss Function
After introducing all the losses and their computation methods, we can obtain the total loss function of VMC-CD:
ℓ_vmc = ℓ_rec + λ_1 ℓ_cl + λ_2 ℓ_ch,
where λ_1 and λ_2 are weighting hyperparameters, ℓ_cl is the loss function for cluster-level contrastive learning, ℓ_ch is the loss function for feature-level contrastive learning, and ℓ_rec is the reconstruction loss. The weighted sum of these three constitutes the total loss function ℓ_vmc.

Complexity Analysis
Let α and β represent the mini-batch size and the maximum number of neurons in a hidden layer of the proposed network architecture, respectively, and let d_z denote the dimensionality of the view feature representation. The overall complexity of the model is O(αβn_v d_v), while the complexities of the reconstruction loss, feature-level contrastive learning loss, and cluster-level contrastive learning loss are O(αn_v d_v), O(αd_z n_v), and O(α²Kn_v((n_v − 1) + n_v(K − 1)) + n_v K), respectively. Therefore, the overall complexity of the proposed method is O(Tαβn_v d_v), where T represents the maximum number of iterations during training.

Algorithm Flow
This algorithm flow (Algorithm 1) is shown below.

Algorithm 1 View-driven dual-contrastive learning in multi-view clustering
Requirements: multi-view data samples X = {X^(v) ∈ R^{d_v×N}}_{v=1}^{n_v}, maximum number of iterations T_max
1: Initialize the parameters of the autoencoder network and set t = 0
2: while t < T_max and loss function ℓ_vmc has not converged do
3:   Compute the loss and update the parameters of the entire network
4:   t = t + 1
5: Obtain discriminative feature representations for all views
6: Concatenate the feature representations of the same sample across views to form [Z^(1); ...; Z^(n_v)] and pass it through the k-means clustering algorithm, yielding the clustering outcome Q
7: Output: clustering result Q

Experiment
In this section, we conducted comprehensive experiments to evaluate the efficacy of the VMC-CD method proposed in this study. We performed experiments on five commonly utilized multi-view datasets, comparing the performance of our method against other established multi-view clustering techniques. The source code of VMC-CD is implemented in Python 3.7. All experiments were carried out on a system with a GeForce RTX 3080 Ti GPU with 16 GB of memory, a 12th Gen Intel Core i9-12900H CPU, and 32 GB of RAM. The five datasets are Caltech101-20 [47], Scene-15 [48], LandUse-21 [49], MNIST-USPS [50,51], and BDGP [52]. The Caltech101-20 dataset comprises 2386 images representing 20 subjects and incorporates HOG and GIST features as distinct perspectives. The LandUse-21 dataset includes 2100 satellite images across 21 classes, utilizing PHOG, LBP, and GIST features. The Scene-15 dataset comprises 4485 images showcasing 15 scenes and incorporates PHOG, LBP, and GIST features. The MNIST-USPS dataset is a handwritten digit image dataset with two different styles, each view containing 10 categories with 500 examples per category. The BDGP dataset contains 2500 drosophila embryo images in five categories, each image described by 1750-dimensional visual and 79-dimensional textual features for clustering.

Evaluation Metrics
In this study, we utilize accuracy (ACC), normalized mutual information (NMI), and the adjusted Rand index (ARI) as the primary metrics to assess clustering performance.Improved clustering outcomes are indicated by higher values on these metrics.
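Of the three metrics, ACC requires matching predicted cluster ids to ground-truth labels before counting hits. A brute-force sketch of that matching is shown below (practical implementations use the Hungarian algorithm instead, and NMI/ARI are typically taken from a library such as scikit-learn).

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over all one-to-one mappings from predicted
    cluster ids to ground-truth labels (brute force; fine for small K)."""
    labels = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(clusters, perm))
        acc = np.mean([mapping[c] == t for c, t in zip(y_pred, y_true)])
        best = max(best, float(acc))
    return best

# cluster ids are arbitrary: a relabeled-but-perfect clustering scores 1.0
print(clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))
```

Because cluster ids carry no inherent meaning, ACC is invariant to relabeling, which is why the mapping step is essential.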

Network Architecture and Parameter Settings
The VMC-CD model was trained using the Adam optimizer with an initial learning rate of 0.0001.The batch size was fixed at 256, and the number of training iterations varied depending on the dataset: 200 iterations for Caltech101-20, MNIST-USPS, and BDGP, 700 iterations for LandUse-21, and 500 iterations for Scene-15.For all datasets, the entropy parameter in the feature-level contrastive learning, denoted as ∂, is set to 9, while the temperature coefficient, denoted as τ, is set to 1.The hyperparameters λ 1 , λ 2 , and µ are chosen from the range [0.05, 0.1, 0.2, 0.5, 1] based on different datasets.For cluster-level contrastive learning, two linear layers are established.The dimension of the first linear layer is selected from the range of [32,64] depending on the dataset, while the dimension of the second linear layer is configured to match the number of clusters in the dataset.
As shown in Table 2, we also conducted experiments on two additional datasets and compared our model with other models that perform well on these datasets, achieving excellent results. The comparison methods include Deep Embedded Clustering (DEC) [22], Improved Deep Embedded Clustering (IDEC) [61], Binary Multi-View Clustering (BMVC) [62], and Multi-View Clustering via Late Fusion Alignment Maximization (MVC-LFA) [63], among others. Our model was therefore evaluated on a total of five datasets. In previous studies, models typically demonstrated excellent performance on only a limited number of datasets. In contrast, the VMC-CD model exhibited outstanding performance across all five datasets, showcasing its exceptional generalization capabilities. This broad applicability highlights the robustness and versatility of the model.

Ablation Studies
The overall loss equation includes three components: the reconstruction loss for obtaining the consensus representation, the feature-level contrastive learning loss, and the cluster-level contrastive learning loss. To validate the importance of these components in VMC-CD, we conducted ablation studies under the same experimental settings to isolate external interference. Specifically, we considered two special cases: one where only the cluster-level loss is used during end-to-end training, without the feature-level loss, and another where only the feature-level loss is used, without the cluster-level loss. Tables 3-5 display the results of these two special cases alongside the three metrics of our full model; the clustering outcomes in the first two rows of each table correspond to the two scenarios. As anticipated, optimal performance is attained when both feature-level and cluster-level contrastive learning are incorporated simultaneously.
In terms of accuracy, VMC-CD with dual contrastive learning outperformed the model without the feature-level loss by 9.22%, 0.98%, and 0.09% on the three datasets, and the NMI and ARI metrics also improved. VMC-CD with dual contrastive learning likewise outperformed the model without the cluster-level loss, with accuracy improvements of 24.85%, 3.41%, and 2.38% on the three datasets. Therefore, dual contrastive learning plays a crucial role in learning invariant representations across views and is indispensable. The specific experimental results are shown in Tables 3-5.

Parameter Sensitivity Analysis
As shown in Figure 3, we conducted experiments on the Caltech101-20 dataset to study the sensitivity of the parameters λ_1 and λ_2 in the proposed VMC-CD method. The λ_1 parameter is selected from {0.1, 0.5, 1, 2}, and the λ_2 parameter from {0.01, 0.05, 0.1, 0.5, 1}. The chart shows the clustering performance of the VMC-CD method, measured by ACC, NMI, and ARI, across the various combinations of λ_1 and λ_2. The results exhibit relative stability in ACC and NMI but sensitivity to λ_1 with respect to ARI; specifically, increasing λ_1 correlates with a notable decrease in ARI. Beyond these visualizations, we also include t-SNE [67] visualizations of the learned unified representation on the Caltech101-20 dataset. As shown in Figure 5, as the number of epochs increases, the learned representation becomes more condensed and distinctive. Additionally, Table 6 presents the number of iterations and the runtime of the model on the five datasets, further illustrating the speed and efficiency of the proposed model.

Discussion
The VMC-CD method effectively addresses the challenge of multi-view clustering by balancing consistency and diversity of information.It not only advances multi-view clustering techniques but also aligns with trends in deep learning and data clustering research.Emphasizing a view-driven approach and dual contrastive learning, it improves clustering performance and feature alignment.Future directions may include exploring dynamic dataset handling and high-dimensional data applications.VMC-CD represents significant progress in multi-view clustering, inspiring research in deep learning and data clustering.

Conclusions
This paper introduces a view-driven dual-contrastive learning approach for multi-view clustering. The method incorporates relevant information from other views during the feature representation learning phase, promoting view diversity and facilitating consensus feature learning. The concept of dual-contrastive learning is introduced, which promotes view consistency at both the clustering level and the feature level, with the two levels complementing each other.

Figure 1 .
Figure 1. Architecture of the view-driven dual contrastive learning for multi-view clustering.
r_mi^(e) = f(BN(W_r^(e)T r_mi^(e−1) + b_r^(e))), e = 2, 3, 4. (2)
Here, r_mi^(e) represents the latent representation of the i-th sample of the m-th view after the e-th encoder layer, W_r^(e) denotes the weights and b_r^(e) the biases of the encoder segment, BN represents the batch normalization operation, and f(·) represents the ReLU activation function. The attention module driven by other views obtains attention weights through the sigmoid function, denoted by o_mi.

Figure 2 .
Figure 2. Architecture of the attention weight learning module.
A_mi denotes the input to the attention module for the i-th sample of the m-th view, O_mi^(e) represents the feature representation after the e-th layer of the attention module, and t(·) represents the result after the transformer block. W and b denote the corresponding layer weights and biases.

Figure 4
Figure 4 displays the curves depicting the evolution of each clustering metric with respect to the iteration count on the Scene-15 dataset. The illustrated curves showcase the exceptional stability of the method proposed in this paper, consistently delivering robust clustering performance.

Figure 4 .
Figure 4. Change curve of clustering metrics with iteration count on Scene-15 dataset.

Table 3 .
Ablative study of main components of the proposed VMC-CD method on Caltech101-20 datasets.

Table 4 .
Ablative study of main components of the proposed VMC-CD method on Scene-15 datasets.

Table 5 .
Ablative study of main components of the proposed VMC-CD method on LandUse-21 datasets.

Table 6 .
Number of iterations and runtime on the aforementioned five datasets.