Article

Cluster Complementarity and Consistency Mining for Multi-View Representation Learning

1 School of Management, Xi’an Jiaotong University, Xi’an 710049, China
2 Business School, Qinghai Institute of Technology, Xining 810016, China
3 School of Mathematics and Statistics, Fuyang Normal University, Fuyang 236037, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2521; https://doi.org/10.3390/math13152521
Submission received: 1 July 2025 / Revised: 29 July 2025 / Accepted: 4 August 2025 / Published: 5 August 2025
(This article belongs to the Special Issue Artificial Intelligence and Data Science, 2nd Edition)

Abstract

With the advent of the big data era, multi-view clustering (MVC) methods have attracted considerable attention for their capability to handle the multifaceted nature of data, achieving impressive results across various fields. However, two significant challenges persist in MVC methods: (1) They resort to learning view-invariant information of samples to bridge the heterogeneity gap between views, which may result in the loss of view-specific information that contributes to pattern mining. (2) They aggregate complementary and consistent information with fusion strategies that are susceptible to the discriminability of views, i.e., the concatenation or weighted fusion of cross-view representations, making it difficult to guarantee the semantic robustness of the fusion representations. To this end, a simple yet effective cluster complementarity and consistency learning framework (CommonMVC) is proposed for mining patterns of multi-view data. Specifically, cluster complementarity learning is devised to endow fusion representations with discriminative information by nonlinearly aggregating view-specific information. Meanwhile, cluster consistency learning is introduced by modeling instance-level and cluster-level partition invariance to coordinate the clustering partitions of the various views, which ensures the robustness of multi-view pattern mining. Seamless collaboration between the two components effectively enhances multi-view clustering performance. Finally, comprehensive experiments on four real-world datasets demonstrate that CommonMVC establishes a new state-of-the-art baseline for the MVC task.

1. Introduction

With the proliferation of multimedia technologies, a vast amount of data generated by various sensors and extractors has emerged in recent years [1,2]. For instance, in traffic analysis, data are usually collected from ultrasonic radars, optical cameras, and infrared radars. In healthcare, data can be obtained from various diagnostic tools such as MRI scans, CT scans, and X-rays. Such data are commonly called multi-view data, and each view exhibits consistency, which expresses the common properties of samples, and complementarity, which expresses the specific properties of samples. Effective aggregation of consistent and complementary information can endow data with discriminative and expressive abilities, which greatly facilitates the mining of data patterns in downstream tasks [3,4].
As one of the most important multi-view information fusion paradigms, multi-view clustering (MVC) has received increasing attention. Based on differences among fusion strategies for consistent and complementary information, MVC methods are divided into two categories, i.e., the multi-view concatenation fusion method [5,6,7,8] and the multi-view weighted fusion method [9,10,11,12]. The former directly concatenates features from different views into a single fusion representation to fuse consistent and complementary information for clustering. The latter assigns different weights to the features of each view based on their importance and reliability when fusing consistent and complementary information, ensuring that the most informative views contribute more to the final clustering result.
Although existing MVC methods have made encouraging progress in mining patterns of multi-view data, two issues remain. Firstly, most MVC methods [13,14,15,16] impose alignment constraints, such as contrastive alignment and adversarial alignment, on view-specific representations to bridge the heterogeneity gap between views for aggregating consistent and complementary information. For example, MGCC proposes a multi-view granular-ball contrastive clustering method that maximizes similarities between data within the same granular ball and minimizes similarities between data in different granular balls, which enforces local structure invariance for clustering [17]. However, the essence of these alignment constraints is to explore inter-view invariant information, i.e., consistency, which may suppress the mining of complementary information in multi-view data and cause the loss of useful information. Secondly, most MVC methods [12,18,19,20] utilize fusion strategies, i.e., the direct concatenation or the weighted combination of view-specific representations, to obtain view-common representations of multi-view data. For example, DMAC-SI concatenates view-specific representations with semantic invariance in a mirror encoder-decoder framework and then imposes a data-driven partitioning strategy on the fusion representations to obtain clustering results [21]. Such strategies are susceptible to the representation discriminability of each view and may not guarantee the semantic robustness of the view-common representations.
In this paper, a simple and effective cluster complementarity and consistency learning framework (CommonMVC) is proposed to solve the above problems. Specifically, CommonMVC defines a view-specific representation learning and cluster partitioning component by deploying encoding-decoding networks with clustering layers for each view to obtain vanilla latent representations and soft assignments of data. Then, CommonMVC designs a cluster complementarity learning via nonlinear fusion functions to learn view-common representations, which fully exchanges complementary information between views and avoids the nuisance of views with fuzzy clustering structures. These nonlinear fusion functions are derived from maximizing the mutual information between views and are implemented using multi-layer perceptron networks. Meanwhile, CommonMVC designs a cluster consistency learning via modeling instance-level and cluster-level partition invariance. The former ensures that the same sample from different views belongs to the same cluster, while the latter ensures that the same cluster in different views has a similar sample distribution. This dual-level invariance enhances the consistency and robustness of multi-view pattern mining. The seamless collaboration of the three components yields improved multi-view clustering performance with more accurate, comprehensive, and consistent clustering results on multi-view data. Ultimately, experimental results on four real-world datasets demonstrate that CommonMVC achieves state-of-the-art performance in MVC.
The main contributions of CommonMVC are threefold:
  • A complementarity-learning strategy is proposed to nonlinearly aggregate view-specific information, which ensures the robustness of fusion representations.
  • A consistency-learning strategy is devised via modeling instance-level and cluster-level partition invariance to alleviate the nuisances of views with fuzzy cluster structures, which effectively enhances intrinsic pattern mining.
  • Extensive evaluations conducted on four benchmarks verify that CommonMVC sets a new standard for MVC tasks compared with baseline methods.
The paper proceeds as follows: Section 2 reviews work related to CommonMVC. Section 3 presents the methodological framework, detailing the proposed architecture and its components. Section 4 reports a multidimensional experimental evaluation, including benchmark comparisons and ablation studies. Section 5 summarizes the core contributions, discusses practical implications, and outlines future research directions.

2. Related Work

Multi-view clustering refers to the methodology in data analysis and machine learning that utilizes information from multiple sources or feature sets to enhance clustering accuracy and robustness. Based on the paradigm of information fusion, MVC methods are divided into multi-view concatenation fusion methods and multi-view weighted fusion methods.

2.1. Multi-View Concatenation Fusion Method

Multi-view concatenation fusion methods combine features from different views into a single comprehensive representation by directly concatenating them, enabling unified data processing and clustering. For instance, 2CMV concatenates consistent and complementary information extracted via the matrix factorization to learn fusion representations and then performs vanilla k-means on fusion representations to obtain clustering results of multi-view data [22]. MFK concatenates consistent and complementary information within the mutual guidance between view-common cluster partitioning and view-specific coefficient matrices learning, and then employs k-means on fusion representations to obtain clustering results of multi-view data [23]. MvAGC concatenates smooth view-specific representations learned via self-expression layers with the graph filter and then utilizes spectral clustering on the consensus matrix of fusion representations for capturing clustering structures [24]. CSMSC concatenates consistent and complementary information between views disentangled via view-common and view-specific self-expression matrices for performing spectral clustering to explore data structures [25]. DMCAG defines an anchor multi-view subspace clustering to learn the consensus matrix between views, and then utilizes spectral self-supervised learning to align view-specific clustering assignment for mining patterns of data [26]. DMIM designs an over-clustering encoding-decoding network via maximizing the dependence between representations in different views and minimizing the dependence between representations and data in the same views and then performs k-means on weighted fusion representations to explore clustering patterns [11]. SDMVC concatenates view-specific representations learned via a self-supervised clustering consistency strategy to fuse complementary and consistent information, along with an end-to-end pattern mining [12].

2.2. Multi-View Weighted Fusion Method

Multi-view weighted fusion methods assign different weights to each view based on its importance and reliability, combining them to capture both consistent and complementary information for more accurate and robust clustering. For instance, CoMVC conducts a contrastive encoding-decoding network via weighted alignment of view-specific complementary information to learn view-common representations and then utilizes the Cauchy-Schwarz divergence to explore structure divergences between data [27]. MVaDE conducts a variational generation network from hidden categories to real data for weighted aggregation of consistent and complementary information, where a mixture of Gaussian distributions is used to infer cluster partitions of multi-view data [7]. MSEC obtains an inter-view consensus matrix for clustering via weighted fusion of view-specific consensus matrices that are learned via view-specific metric functions, where inter-view and intra-view structure-invariant constraints are used to enhance the aggregation of consistent and complementary information [8]. CMSC fuses information of views based on view importance to learn the consensus matrix in a shared subspace where the correlations between views are maximized via canonical correlation analysis, and then performs spectral clustering to explore data structures [28]. COMIC aligns view-specific graphs to aggregate consistent and complementary information and then conducts weighted partitions on the view-specific graphs to mine patterns of multi-view data [10]. CMIB designs an information bottleneck encoding-decoding network to learn robust view-common representations via weighting complementary, consistent, and superfluous information among views and then performs k-means on the view-common representations to obtain final clustering results [9]. AE2-Nets designs an encoding-decoding degradation network to fuse consistent and complementary information for learning fusion representations and then performs k-means to obtain final clustering results [29]. DealMVC designs a dual contrastive multi-view clustering network via aligning inter-view and intra-view feature-level similarity graphs and class-level pseudo-label graphs and then enforces intra-view pseudo-label graphs to be as similar as possible to obtain clustering results [13]. VCGA devises an information-decoupled multi-view graph clustering method via exploring view-specific graphs and consensus graphs with the help of local and global structure exploration, which fully exploits the local manifold for mining patterns of multi-view data [18]. HCLS CGL designs a confidence-neighbor multi-view weighted fusion via weighting the group-wise structures between data, which boosts the aggregation of consistent and complementary structures across different views by leveraging the confidence of neighboring relationships [30]. STMVC proposes an uncertainty refinement strategy to trustily separate multi-view representations significantly belonging to different clusters and to fuse the well-separated view-specific representations into high-quality discriminative representations for clustering, where the discriminative representations are weighted on the basis of view-specific uncertainties according to the Dempster-Shafer theory of evidence [14].
DMAC proposes an anchor graph alignment strategy to match multi-view representations of the view-average discriminative representations for clustering by introducing a perturbation-injected learnable anchor graph, where the anchor graph convolution clustering distributions instructed the consistency refinement of discriminative multi-view representations for efficient cluster structure mining [15].
In multi-view clustering, existing fusion strategies mainly fall into concatenation fusion and weighted fusion. Concatenation fusion directly connects features from different views, which raises the dimensionality and introduces redundant information. Weighted fusion assigns weights to views based on their importance and then combines them linearly, but its weight-allocation process is subjective and sensitive to view-quality differences. Nonlinear fusion, as in CommonMVC, surpasses both. Using nonlinear transformation functions, CommonMVC can uncover complex relationships between views, making the model more robust and better at fitting data structures, which improves clustering accuracy. Thus, nonlinear fusion is a more effective, flexible, and robust solution than concatenation and weighted fusion.

3. The Proposed Method

Consider the task of clustering a multi-view dataset of $n$ samples with $V$ views, $\{x_i\}_{i=1}^{n} = \{x_i^1 \in \mathbb{R}^{D_1}, \ldots, x_i^V \in \mathbb{R}^{D_V}\}_{i=1}^{n}$, where $x_i^v$ denotes the $v$-th view data of the $i$-th sample $x_i$ and $D_v$ denotes the feature dimension of the $v$-th view. A cluster complementarity and consistency learning framework (CommonMVC) is proposed to achieve MVC, which consists of view-specific representation learning and cluster partitioning, cluster complementarity learning, and cluster consistency learning, as shown in Figure 1.
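As a concrete picture of this notation, the following minimal snippet constructs a toy two-view dataset of the assumed shape; the sample count and feature dimensions here are arbitrary illustrations, not values from the paper:

```python
import torch

# Toy multi-view data matching the notation: n samples, V = 2 views,
# with per-view feature dimensions D_1 and D_2 (arbitrary values).
n, D1, D2 = 2000, 240, 216
views = [torch.randn(n, D1), torch.randn(n, D2)]  # views[v-1] holds X^v
```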

3.1. View-Specific Representation Learning and Cluster Partitioning

Generally, each view, describing different aspects of the same samples, contains view-specific complementary information, which can greatly enhance the clustering performance. Thus, view-specific representation learning and cluster partitioning are designed to fully explore and capture intrinsic information hidden in each view.
Representation Learning. Inspired by their effectiveness in previous works [31,32,33], deep autoencoders are utilized to learn latent representations of heterogeneous, high-dimensional multi-view data. The deep autoencoder for each view works as follows: the encoder compresses the original data into a lower-dimensional latent space, capturing the essential features while discarding noise and redundancy; the decoder then attempts to reconstruct the original data from this latent representation. This process helps the model learn robust and meaningful features.
Specifically, given the $v$-th view data of the $i$-th sample $x_i^v$, the encoder $\mathrm{Enc}^v$ transforms $x_i^v$ into the latent representation $z_i^v$, i.e., $z_i^v = \mathrm{Enc}^v(x_i^v; \theta^v)$, where $\theta^v$ denotes the encoder parameters of the $v$-th view. Meanwhile, the decoder $\mathrm{Dec}^v$ generates the reconstruction $\bar{x}_i^v$ from $z_i^v$, i.e., $\bar{x}_i^v = \mathrm{Dec}^v(z_i^v; \phi^v)$, where $\phi^v$ denotes the decoder parameters of the $v$-th view. To ensure that the learned representations retain as much useful information as possible from the original data, the mutual information between the original data and the corresponding representations is maximized in each view, as follows:
$$\mathcal{L}_{base} = \sum_{v=1}^{V} \max I(X^v; Z^v) = \sum_{v=1}^{V} \iint p(x^v, z^v) \log \frac{p(x^v, z^v)}{p(x^v)\,p(z^v)}\, dx^v\, dz^v \geq \sum_{v=1}^{V} \int p(x^v) \int p(z^v \mid x^v) \log q(x^v \mid z^v)\, dz^v\, dx^v \quad (1)$$

where $I(X^v; Z^v)$ represents the mutual information between the original data $X^v$ and its latent representation $Z^v$ in the $v$-th view. Mutual information quantifies the amount of information obtained about one random variable through the other. By maximizing it, we ensure that the latent representations $Z^v$ retain as much information as possible from the original data $X^v$, which is crucial for preserving the essential characteristics and patterns in the data.
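As a concrete illustration, under the common assumption of a Gaussian decoder $q(x^v \mid z^v)$, maximizing the lower bound in Equation (1) reduces to minimizing a per-view reconstruction error. The following PyTorch sketch shows one view-specific autoencoder and the resulting base loss; the layer widths and latent dimension are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class ViewAutoencoder(nn.Module):
    """One encoder-decoder pair (Enc^v, Dec^v) for a single view."""

    def __init__(self, input_dim: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # z_i^v = Enc^v(x_i^v; theta^v)
        x_rec = self.decoder(z)    # x_bar_i^v = Dec^v(z_i^v; phi^v)
        return z, x_rec

def base_loss(views, autoencoders):
    """L_base: per-view reconstruction error, a standard variational
    surrogate for maximizing I(X^v; Z^v) under a Gaussian decoder."""
    loss = 0.0
    for x_v, ae in zip(views, autoencoders):
        _, x_rec = ae(x_v)
        loss = loss + torch.mean((x_rec - x_v) ** 2)
    return loss
```

One such autoencoder would be instantiated per view, e.g. `autoencoders = [ViewAutoencoder(D) for D in (D1, D2)]` for the two-view toy data above.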
Cluster Partitioning. To avoid representations that are suboptimal for clustering, parameterized nonlinear mappings based on Student's t-distribution are constructed on top of the encoders of each view, which generate soft cluster assignments of samples in an end-to-end manner, i.e., $f^v(Z^v; C^v): Z^v \to S^v$, where $C^v = \{c_1^v, c_2^v, \ldots, c_k^v\}$ denotes the learnable cluster centers and $S^v$ denotes the soft cluster assignments of samples in the $v$-th view. The generation process of $S^v$ is as follows:

$$s_{ij}^v = \frac{\left(1 + \|z_i^v - c_j^v\|^2\right)^{-1}}{\sum_{l=1}^{k} \left(1 + \|z_i^v - c_l^v\|^2\right)^{-1}} \quad (2)$$

where $s_{ij}^v$ is the $j$-th element of $s_i^v$ and represents the probability of assigning the $i$-th sample to the $j$-th cluster in the $v$-th view. Cluster partitioning utilizes Student's t-distribution to produce smooth and robust cluster assignments, where samples closer to a cluster center receive higher assignment probabilities.
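Equation (2) amounts to a few tensor operations. A minimal sketch, assuming the latent representations and learnable cluster centers are plain tensors:

```python
import torch

def soft_assignment(z, centers):
    """Student's t soft assignments (Equation (2)).
    z: (n, d) latent representations; centers: (k, d) cluster centers.
    Returns s: (n, k) row-stochastic assignment probabilities."""
    dist_sq = torch.cdist(z, centers) ** 2   # ||z_i - c_j||^2, shape (n, k)
    q = 1.0 / (1.0 + dist_sq)                # (1 + ||z_i - c_j||^2)^(-1)
    return q / q.sum(dim=1, keepdim=True)    # normalize over the k clusters
```

Usage per view would look like `s_v = soft_assignment(z_v, centers_v)`, with one learnable `centers_v` tensor per view.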

3.2. Cluster Complementarity Learning

In general, aggregation strategies such as concatenation and weighted combination are used to generate fusion representations between views by aggregating view-specific representations. However, the concatenated combination, i.e., $Z = [Z^1, Z^2, \ldots, Z^V]$, has two significant drawbacks that may distort the final fused representation and ultimately degrade performance in downstream tasks. First, it introduces view redundancy, leading to the inclusion of repetitive or irrelevant information. Second, it is susceptible to the influence of noisy views with fuzzy data structures. The weighted combination, i.e., $Z = W_1 Z^1 + W_2 Z^2 + \cdots + W_V Z^V$, where $W_v$ denotes the weight matrix of the $v$-th view, is essentially a linear mapping from view-specific representations to fusion representations. Such a linear mapping between variables cannot capture complementary information, i.e., clusters that are inseparable in one variable remain inseparable in the others. Conversely, if complementarity exists between variables, the complementary information can be delivered by a nonlinear mapping.
Rooted in the above discussion, simple and effective nonlinear transformation functions are designed to learn fusion representations, i.e., $Z^v = g^v(Z)$, where $g^v$ denotes the nonlinear mapping function of the $v$-th view. Then, the mutual information between $Z$ and $Z^v$ is maximized to optimize the parameters of $g^v$, which effectively aggregates the complementary clustering structures across views, as follows:

$$\mathcal{L}_{com} = \sum_{v=1}^{V} \max I(Z; Z^v) = \sum_{v=1}^{V} \iint p(z^v, z) \log \frac{p(z^v, z)}{p(z^v)\,p(z)}\, dz^v\, dz = \sum_{v=1}^{V} \iint p(z^v, z) \log \frac{p(z^v \mid z)}{p(z^v)}\, dz^v\, dz = \sum_{v=1}^{V} H(Z^v) + \int p(z) \int p(z^v \mid z) \log p(z^v \mid z)\, dz^v\, dz \geq \sum_{v=1}^{V} H(Z^v) + \int p(z^v) \int p(z \mid z^v) \log q(z^v \mid z)\, dz\, dz^v \quad (3)$$

where $q(z^v \mid z)$ is the variational distribution that infers $z^v$ from $z$, implemented by the nonlinear mapping function $g^v$.
Cluster complementarity learning utilizes the nonlinear transformation functions to obtain fusion representations with more discriminative clustering structures than the view-specific representations, and it effectively relieves the interference of views with fuzzy clustering structures when fusing complementary patterns, ensuring cluster robustness in learning multi-view fusion representations.
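A minimal sketch of the complementarity pathway follows. The paper specifies that $g^v$ implements $q(z^v \mid z)$ via multi-layer perceptrons; how the fusion representation $Z$ itself is produced is not spelled out here, so the fusion network below (an MLP over concatenated view representations) is an illustrative assumption. Under a Gaussian $q(z^v \mid z)$, maximizing the bound in Equation (3) reduces to regressing each $Z^v$ from $Z$:

```python
import torch
import torch.nn as nn

class NonlinearFusion(nn.Module):
    """Hypothetical fusion module. The per-view heads g^v implement
    q(z^v | z) as MLPs, per the paper; the fusion network producing Z
    from the view representations is an illustrative choice."""

    def __init__(self, latent_dim: int = 64, num_views: int = 2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(latent_dim * num_views, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.g = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, latent_dim))
            for _ in range(num_views)
        ])

    def forward(self, z_views):
        # Nonlinear aggregation of view-specific representations into Z
        return self.fuse(torch.cat(z_views, dim=1))

    def com_loss(self, z, z_views):
        """Surrogate for -I(Z; Z^v): with a Gaussian q(z^v | z), maximizing
        the bound in Equation (3) reduces to regressing Z^v from Z via g^v."""
        loss = 0.0
        for g_v, z_v in zip(self.g, z_views):
            loss = loss + torch.mean((g_v(z) - z_v) ** 2)
        return loss
```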

3.3. Cluster Consistency Learning

Due to the absence of label information in multi-view clustering, it is impossible to determine which view's clustering partition is precise. However, multi-view clustering consistency can mitigate this challenge: it ensures that different views provide similar clustering predictions, which reduces the likelihood of relying on any single, potentially inaccurate view. By aligning the clustering results across multiple views, the overall accuracy and robustness of the clustering outcomes are enhanced. To ensure this consistency, we propose a novel cluster consistency learning that utilizes a global clustering partition to harmonize the clustering predictions across different views. Specifically, multi-view cluster consistency learning consists of instance-level partition invariance and cluster-level partition invariance. The former requires that the same sample from different views belong to the same cluster, while the latter requires that the same cluster in different views exhibit the same sample distribution.
To achieve the two partition invariances in cluster consistency learning, a global semantic distribution is obtained as follows. A parameterized $\mathrm{MLP}$ is constructed to generate the global semantics of the fusion representations, i.e., $\mathrm{MLP}(Z; C): Z \to S$, where $C = \{c_1, c_2, \ldots, c_k\}$ stands for the cluster prototypes of the fusion representations and $S$ denotes the global clustering partition:
$$S = \mathrm{MLP}(Z; C) = \delta(\varepsilon(Z, C)) \quad (4)$$

where $\varepsilon(\cdot)$ stands for the membership relationship between data and clusters:

$$q_{ij} = \varepsilon(z_i, c_j) = \frac{1}{1 + \|z_i - c_j\|_2^2} \quad (5)$$

where $q_{ij}$ stands for the similarity degree between the $i$-th sample and the $j$-th cluster. $\delta(\cdot)$ transforms the similarities into the clustering partition:

$$s_{ij} = \delta(q_i)_j = \frac{\left(q_{ij} / \sum_{j'} q_{ij'}\right)^2}{\sum_{j'} \left(q_{ij'} / \sum_{j''} q_{ij''}\right)^2} \quad (6)$$
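The global partition of Equations (4)-(6) can be sketched as follows; the square-and-renormalize sharpening in Equation (6) plays the same role as the target distribution in deep embedded clustering, and the tensor shapes are assumptions:

```python
import torch

def global_partition(z, prototypes):
    """Global clustering partition S = delta(eps(Z, C)) (Equations (4)-(6)).
    z: (n, d) fusion representations; prototypes: (k, d) cluster prototypes."""
    # eps: unnormalized memberships q_ij = 1 / (1 + ||z_i - c_j||_2^2)  (Eq. (5))
    q = 1.0 / (1.0 + torch.cdist(z, prototypes) ** 2)
    # delta: normalize per sample, square, and renormalize  (Eq. (6))
    p = q / q.sum(dim=1, keepdim=True)
    p = p ** 2
    return p / p.sum(dim=1, keepdim=True)   # row-stochastic global partition S
```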
The Kullback–Leibler divergence between the global clustering partition $S_{i,\cdot}$ of the fusion representations and the local clustering partitions $S_{i,\cdot}^v$ of the view-specific representations is minimized to perform instance-level partition invariance learning, where $S_{i,\cdot}$ and $S_{i,\cdot}^v$ denote the $i$-th rows of the global and local clustering partitions, respectively:

$$\mathcal{L}_{ipi} = \sum_{v=1}^{V} \sum_{i=1}^{n} \sum_{j=1}^{k} s_{ij} \log \frac{s_{ij}}{s_{ij}^v} \quad (7)$$

Meanwhile, the Kullback–Leibler divergence between the global sample distribution $S_{\cdot,j}$ of each cluster and the local sample distributions $S_{\cdot,j}^v$ is minimized to perform cluster-level partition invariance learning, where $S_{\cdot,j}$ and $S_{\cdot,j}^v$ denote the $j$-th columns of the global and local clustering partitions, respectively:

$$\mathcal{L}_{cpi} = \sum_{v=1}^{V} \sum_{j=1}^{k} \sum_{i=1}^{n} s_{ij} \log \frac{s_{ij}}{s_{ij}^v} \quad (8)$$

After training the model, the final clustering results of the samples are obtained as follows:

$$y_i = \operatorname*{argmax}_j \, s_{ij} \quad (9)$$
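A sketch of the two consistency losses and the final label rule of Equations (7)-(9). The column normalization used for the cluster-level term is our reading of "sample distribution of clusters", and the small constant inside the logarithms is a numerical-stability assumption the equations leave implicit:

```python
import torch

def consistency_losses(s_global, s_views, eps=1e-8):
    """Instance-level (Eq. (7)) and cluster-level (Eq. (8)) KL losses.
    s_global: (n, k) global partition S; s_views: list of (n, k) partitions S^v."""
    l_ipi, l_cpi = 0.0, 0.0
    for s_v in s_views:
        # Row-wise KL: the same sample should join the same cluster in every view.
        l_ipi = l_ipi + torch.sum(
            s_global * torch.log((s_global + eps) / (s_v + eps)))
        # Column-wise KL on per-cluster sample distributions: columns are
        # normalized so each cluster's distribution over samples is compared.
        g_col = s_global / (s_global.sum(dim=0, keepdim=True) + eps)
        v_col = s_v / (s_v.sum(dim=0, keepdim=True) + eps)
        l_cpi = l_cpi + torch.sum(
            g_col * torch.log((g_col + eps) / (v_col + eps)))
    return l_ipi, l_cpi

def predict_labels(s_global):
    """Final clustering results: y_i = argmax_j s_ij (Equation (9))."""
    return torch.argmax(s_global, dim=1)
```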

3.4. The Loss Function

CommonMVC conducts the loss function $\mathcal{L}$ to jointly optimize the cluster complementarity learning and cluster consistency learning, as follows:

$$\mathcal{L} = \mathcal{L}_{base} + \alpha \mathcal{L}_{com} + \beta \mathcal{L}_{ipi} + \gamma \mathcal{L}_{cpi} \quad (10)$$

where $\alpha$, $\beta$, and $\gamma$ are trade-off parameters. $\mathcal{L}_{com}$ is the complementarity loss that guides cluster complementarity learning. $\mathcal{L}_{ipi}$ and $\mathcal{L}_{cpi}$ are the consistency losses that guide cluster consistency learning. That is, $\mathcal{L}$ trains CommonMVC to fully explore the complementarity and consistency hidden in multi-view data, which effectively enhances intrinsic pattern mining.
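Putting the pieces together, one forward pass computing the total loss of Equation (10) might look like the following sketch; all helper functions and the `NonlinearFusion` module refer to the illustrative snippets above, and the trade-off values are placeholders:

```python
# Hypothetical composition of Equation (10); the helpers are the
# illustrative sketches given earlier in this section.
alpha, beta, gamma = 0.1, 0.1, 0.1   # trade-off parameters (placeholder values)

def total_loss(views, autoencoders, fusion, centers_views, prototypes):
    z_views = [ae.encoder(x_v) for x_v, ae in zip(views, autoencoders)]
    z = fusion(z_views)                              # fusion representation Z
    s_views = [soft_assignment(z_v, c_v)             # view partitions S^v
               for z_v, c_v in zip(z_views, centers_views)]
    s_global = global_partition(z, prototypes)       # global partition S
    l_ipi, l_cpi = consistency_losses(s_global, s_views)
    return (base_loss(views, autoencoders)           # L_base
            + alpha * fusion.com_loss(z, z_views)    # alpha * L_com
            + beta * l_ipi + gamma * l_cpi)          # consistency terms
```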
Model analysis. The following shows why $\mathcal{L}_{ipi}$ and $\mathcal{L}_{cpi}$ guarantee inter-view-consistent clustering partitions. For the $v$-th and the $l$-th views, given the optimization errors $\varepsilon_v$ and $\varepsilon_l$ of $\mathcal{L}_{ipi}^v$ and $\mathcal{L}_{ipi}^l$, respectively, we have:

$$0 \leq \mathrm{KL}(S_{i,\cdot} \,\|\, S_{i,\cdot}^v) = \sum_{i=1}^{n} \sum_{j=1}^{K} s_{ij} \log \frac{s_{ij}}{s_{ij}^v} \leq \varepsilon_v \quad (11)$$

$$0 \leq \mathrm{KL}(S_{i,\cdot} \,\|\, S_{i,\cdot}^l) = \sum_{i=1}^{n} \sum_{j=1}^{K} s_{ij} \log \frac{s_{ij}}{s_{ij}^l} \leq \varepsilon_l \quad (12)$$

Subtracting Equation (11) from Equation (12) yields:

$$-\varepsilon_v \leq \sum_{i=1}^{n} \sum_{j=1}^{K} s_{ij} \log \frac{s_{ij}^v}{s_{ij}^l} \leq \varepsilon_l \quad (13)$$

When $s_{ij}$ is fixed and $\varepsilon_v$ and $\varepsilon_l$ decrease during optimization:

$$\sum_{i=1}^{n} \sum_{j=1}^{K} s_{ij} \log \frac{s_{ij}^v}{s_{ij}^l} \to 0 \quad (14)$$

Thus, $S_{i,\cdot}^v$ and $S_{i,\cdot}^l$ tend to be consistent.
Similarly, consider $\xi_v$ and $\xi_l$ as the optimization errors of $\mathcal{L}_{cpi}^v$ and $\mathcal{L}_{cpi}^l$, respectively; then:

$$0 \leq \mathrm{KL}(S_{\cdot,j} \,\|\, S_{\cdot,j}^v) = \sum_{j=1}^{K} \sum_{i=1}^{n} s_{ij} \log \frac{s_{ij}}{s_{ij}^v} \leq \xi_v \quad (15)$$

$$0 \leq \mathrm{KL}(S_{\cdot,j} \,\|\, S_{\cdot,j}^l) = \sum_{j=1}^{K} \sum_{i=1}^{n} s_{ij} \log \frac{s_{ij}}{s_{ij}^l} \leq \xi_l \quad (16)$$

Subtracting Equation (15) from Equation (16) yields:

$$-\xi_v \leq \sum_{j=1}^{K} \sum_{i=1}^{n} s_{ij} \log \frac{s_{ij}^v}{s_{ij}^l} \leq \xi_l \quad (17)$$

When $s_{ij}$ is fixed and $\xi_v$ and $\xi_l$ gradually decrease during optimization:

$$\sum_{j=1}^{K} \sum_{i=1}^{n} s_{ij} \log \frac{s_{ij}^v}{s_{ij}^l} \to 0 \quad (18)$$

Thus, $S_{\cdot,j}^v$ and $S_{\cdot,j}^l$ tend to be consistent. Cluster consistency learning therefore ensures multi-view clustering consistency by utilizing $S$ to align the rows and columns of each view's partition $S^v$.

4. Experimental Evaluation

4.1. Setup

Datasets. Four public multi-view datasets, i.e., Handwritten, ORL, LandUse-21, and Scene-15, are utilized to evaluate the performance of CommonMVC. Handwritten contains 2000 samples from 10 classes, where pixel averages in 2 × 3 windows and profile correlations are used as the two views. ORL comprises 400 samples belonging to 40 classes, where gray-level intensity and Gabor descriptors are used as the two views. LandUse-21 includes 2100 samples categorized into 21 classes, where PHOG and LBP features are used as the two views. Scene-15 consists of 4485 samples spanning 15 classes, where PHOG and GIST features are used as the two views. The detailed information is shown in Table 1.
Evaluation Metrics. To evaluate the clustering performance, three common metrics, i.e., ACC, NMI, and ARI, are used in the experiments [31]. ACC represents the rate at which samples are correctly partitioned. NMI represents the similarity between the prediction distribution and the ground-truth distribution of the data. ARI represents the purity of predictions. Larger values of ACC, NMI, and ARI indicate superior clustering performance.
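For reference, ACC requires a one-to-one matching between predicted clusters and ground-truth labels, conventionally found with the Hungarian algorithm, while NMI and ARI are available directly in scikit-learn. A standard implementation sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one match between predicted clusters and labels,
    found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # count co-occurrences
    row, col = linear_sum_assignment(-cost)   # maximize matched pairs
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"ACC": clustering_accuracy(y_true, y_pred),
            "NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred)}
```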
Comparison methods. In the experiments, fourteen comparison methods are used, including MVaDE [7], CoMVC [27], DealMVC [13], GCFAggMVC [11], SDMVC [12], DCCA [34], DCCAE [35], AE²-Nets [29], MSEC [8], MGCC [17], DMAC [15], and STMVC [14]. To ensure a fair comparison, all experiments are conducted under identical hardware and software environments, with hyperparameter configurations strictly following the specifications reported in each method's original publication.
Implementation details. CommonMVC is implemented in PyTorch 2.0.1. The vector features of the samples were first normalized to the [0, 1] range before being input into the model for training. The training process used the Adam optimizer with a batch size of 100 and a learning rate of $10^{-3}$ across all CommonMVC modules. The trade-off parameters $\alpha$, $\beta$, and $\gamma$ were selected from the range [0.1, 1] for all datasets. To ensure optimal performance of the comparison methods, a grid search strategy was applied to their trade-off parameters. Each experiment was conducted five times, with the average results reported as the final outcomes to ensure a fair comparison.
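Following the stated configuration (min-max normalization to [0, 1], Adam, batch size 100, learning rate $10^{-3}$), a training loop could be assembled as below; the epoch count and the treatment of cluster centers as free parameters are assumptions, and `total_loss` refers to the sketch in Section 3.4:

```python
import itertools
import torch
from torch.utils.data import DataLoader, TensorDataset

def min_max_normalize(x):
    """Scale each feature to [0, 1], matching the reported preprocessing."""
    x_min = x.min(dim=0).values
    x_max = x.max(dim=0).values
    return (x - x_min) / (x_max - x_min + 1e-12)

def train(views, autoencoders, fusion, centers_views, prototypes, epochs=100):
    loader = DataLoader(
        TensorDataset(*[min_max_normalize(v) for v in views]),
        batch_size=100, shuffle=True)              # batch size as reported
    params = itertools.chain(
        *[ae.parameters() for ae in autoencoders],
        fusion.parameters(), centers_views, [prototypes])
    optimizer = torch.optim.Adam(params, lr=1e-3)  # learning rate as reported
    for _ in range(epochs):
        for batch in loader:                       # batch: list of view tensors
            loss = total_loss(batch, autoencoders, fusion,
                              centers_views, prototypes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Here `centers_views` is a list of learnable per-view center tensors and `prototypes` the learnable global prototypes, both created with `requires_grad=True`.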

4.2. Clustering Performance Evaluation

Table 2 and Table 3 present the quantitative clustering results of CommonMVC and the fourteen comparison methods on the four datasets in terms of ACC, NMI, and ARI. The results show that CommonMVC achieves the best clustering performance and sets a new baseline for the multi-view clustering task. The reasons are twofold: (1) CommonMVC designs a cluster complementarity learning via nonlinear fusion functions to learn view-common representations, which fully captures complementary information between views and avoids the nuisance of views with fuzzy clustering structures. (2) CommonMVC designs a cluster consistency learning via a relaxed alignment at the semantic level rather than the representation level, which alleviates the tendency of view-specific representations to collapse into sameness while still capturing consistent information between views during optimization. In addition, two further observations can be made. First, the clustering results of MVC methods are better than those of FeatConcate, which confirms the necessity of developing multi-view clustering methods. Second, one-stage MVC methods outperform two-stage MVC methods in pattern mining. The reason is that one-stage MVC methods capture clustering-friendly information of multi-view data, rather than the generic common information captured by two-stage methods.
Figure 2 presents the outcomes of the Nemenyi test. The figure shows that the average rank of CommonMVC surpasses those of the other methods across all four datasets. This ranking advantage is statistically meaningful and strongly suggests that CommonMVC holds a significant edge over its counterparts. Its consistent excellence across the datasets makes CommonMVC stand out among the compared approaches and a valuable option for related studies and applications.

4.3. Ablation Analysis

This section conducts loss ablation and view ablation on Handwritten and ORL in terms of ACC, NMI, and ARI to evaluate the contribution of each component in CommonMVC. (1) The loss ablation contains three variants. Variant_1 denotes the removal of the mutual information $I(X^v; Z^v)$. Variant_2 denotes the removal of $D_{KL}(S_{i,\cdot}, S_{i,\cdot}^v)$ and $D_{KL}(S_{\cdot,j}, S_{\cdot,j}^v)$. Variant_3 denotes the removal of the mutual information $I(Z; Z^v)$. From the ablation results in Table 4, we observe that removing any component of the loss function degrades performance. This observation underscores the necessity of each term in the loss function for ensuring optimal performance and validates the rationality and efficacy of the loss design. (2) The view ablation is conducted on the Handwritten and ORL datasets. As shown in Table 5, View_1 and View_2 represent the clustering results using only the first or second view, respectively. CommonMVC, utilizing both views with the nonlinear fusion strategy, outperforms the individual views on all metrics. This shows the effectiveness of integrating complementary information from multiple views and demonstrates that the nonlinear fusion of multiple views in CommonMVC captures more comprehensive patterns than relying on a single view.

4.4. Convergence Analysis

To validate the convergence behavior of CommonMVC, we analyzed the loss values across different training epochs on four distinct datasets, as depicted in Figure 3. The graph reveals a consistent and rapid decrease in the loss across all datasets, indicating that CommonMVC efficiently optimizes its objective functions during training. Notably, the convergence trend is evident early in the training process, with the loss stabilizing after approximately 60 epochs. The rapid convergence suggests that CommonMVC not only effectively captures the underlying data structure but also requires relatively few epochs to reach an optimal solution, making it a robust and efficient model for various datasets.

4.5. Parameter Analysis

In the hyperparameter analysis, we examine the impact of the three key parameters $\alpha$, $\beta$, and $\gamma$ on the performance of CommonMVC. These parameters balance the terms of the overall loss function $\mathcal{L}$. We perform a systematic grid search over $\{10, 1, 0.1, 0.01, 0.001\}$ for all parameters. The experiments are carried out on the Handwritten dataset, with ACC as the performance metric. Figure 4 shows that the optimal result is obtained when $\alpha$ is in the range of 0.1 to 1 and $\gamma$ and $\beta$ are within the range of 0.01 to 0.1. Under these settings, the model successfully balances common-view learning with the preservation of individual and common features, which enables it to extract cross-view common features while retaining individual characteristics, leading to the best ACC on the Handwritten dataset. These findings highlight the importance of selecting appropriate hyperparameters for effective feature learning and clustering performance in CommonMVC.

5. Conclusions

In this paper, we introduce CommonMVC, an innovative framework for multi-view clustering designed to mine patterns within multi-view data. CommonMVC disentangles and learns the complementary and consistent information hidden in multi-view data by defining view-specific representation learning and cluster partitioning components, along with nonlinear fusion functions. The approach designs a nonlinear fusion mechanism to enhance the complementary information between views and employs a relaxed semantic-level alignment strategy to ensure similar clustering predictions across different views, thereby improving the accuracy and robustness of the clustering outcomes. Comprehensive experiments conducted on four datasets validate that CommonMVC establishes a new baseline for MVC tasks and outperforms existing baseline methods across various evaluation metrics. However, our method has certain limitations. First, the computational complexity of the nonlinear fusion mechanism grows with the number of views and the complexity of the data, which could limit its scalability to very large multi-view datasets. Second, the current framework assumes that all views are available during training and testing; in scenarios with missing views, the performance of CommonMVC could be affected. Developing strategies to handle missing views and further optimizing the computational efficiency of the nonlinear fusion process are important directions for future research.

Author Contributions

Methodology, Y.W.; Validation, H.L.; Investigation, H.L.; Data curation, H.L.; Writing—original draft, Y.W.; Writing—review & editing, Y.W. and H.L.; Supervision, H.L.; Funding acquisition, Y.W. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Qinghai Institute of Technology Doctoral Extension Program (202301lwys028), the Qinghai Kunlun Talented Person Project (2023-QLGKL YCZX-031), the Ministry of Education Humanities and Social Sciences Project (22XJC630006) and the Doctoral Foundation of Fuyang Normal University under Grant 2025KYQD0031.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Cai, J.; Wu, Z.; Wang, P.; Ng, S.K. Mixture of Experts as Representation Learner for Deep Multi-View Clustering. Proc. AAAI Conf. Artif. Intell. 2025, 39, 22704–22713.
  2. Jiang, H.; Ma, W.; Dai, J.; Ding, J.; Tong, X.; Wang, Y.; Du, X.; Jiang, D.; Luo, Y.; Zhang, J. Cross representation subspace learning for multi-view clustering. Expert Syst. Appl. 2025, 286, 128007.
  3. Long, Z.; Wang, Q.; Ren, Y.; Liu, Y.; Zhu, C. TLRLF4MVC: Tensor Low-Rank and Low-Frequency for Scalable Multi-View Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6900–6911.
  4. Bian, J.; Xie, X.; Wang, C.; Yang, L.; Lai, J.; Nie, F. Angular Reconstructive Discrete Embedding with Fusion Similarity for Multi-View Clustering. IEEE Trans. Knowl. Data Eng. 2025, 37, 45–59.
  5. Yu, Y.; Lu, Z.; Nie, F.; Yu, W.; Miao, Z.; Li, X. Pseudo-Label Guided Bidirectional Discriminative Deep Multi-View Subspace Clustering. IEEE Trans. Knowl. Data Eng. 2025, 37, 4213–4224.
  6. Zhang, P.; Pan, Y.; Wang, S.; Yu, S.; Xu, H.; Zhu, E.; Liu, X.; Tsang, I.W. Max-Mahalanobis Anchors Guidance for Multi-View Clustering. Proc. AAAI Conf. Artif. Intell. 2025, 39, 22488–22496.
  7. Yin, M.; Huang, W.; Gao, J. Shared Generative Latent Representation Learning for Multi-view Clustering. Proc. AAAI Conf. Artif. Intell. 2020, 34, 6688–6695.
  8. Cui, J.; Li, Y.; Fu, Y.; Wen, J. Multi-view Self-Expressive Subspace Clustering Network. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 417–425.
  9. Wan, Z.; Zhang, C.; Zhu, P.; Hu, Q. Multi-view Information-bottleneck Representation Learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 10085–10092.
  10. Peng, X.; Huang, Z.; Lv, J.; Zhu, H.; Zhou, J.T. COMIC: Multi-view Clustering Without Parameter Selection. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5092–5101.
  11. Mao, Y.; Yan, X.; Guo, Q.; Ye, Y. Deep Mutual Information Maximin for Cross-modal Clustering. Proc. AAAI Conf. Artif. Intell. 2021, 35, 8893–8901.
  12. Xu, J.; Ren, Y.; Tang, H.; Yang, Z.; Pan, L.; Yang, Y.; Pu, X.; Yu, P.S.; He, L. Self-supervised Discriminative Feature Learning for Deep Multi-view Clustering. IEEE Trans. Knowl. Data Eng. 2022, 35, 7470–7482.
  13. Yang, X.; Jin, J.; Wang, S.; Liang, K.; Liu, Y.; Wen, Y.; Liu, S.; Zhou, S.; Liu, X.; Zhu, E. DEALMVC: Dual Contrastive Calibration for Multi-view Clustering. In Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 337–346.
  14. Hu, S.; Tian, B.; Liu, W.; Ye, Y. Self-supervised Trusted Contrastive Multi-view Clustering with Uncertainty Refined. Proc. AAAI Conf. Artif. Intell. 2025, 39, 17305–17313.
  15. Wang, B.; Zeng, C.; Chen, M.; Li, X. Towards Learnable Anchor for Deep Multi-View Clustering. Proc. AAAI Conf. Artif. Intell. 2025, 39, 21044–21052.
  16. Fei, L.; He, J.; Zhu, Q.; Zhao, S.; Wen, J.; Xu, Y. Deep Multi-View Contrastive Clustering via Graph Structure Awareness. IEEE Trans. Image Process. 2025, 34, 3805–3816.
  17. Su, P.; Huang, S.; Ma, W.; Xiong, D.; Lv, J. Multi-view Granular-ball Contrastive Clustering. Proc. AAAI Conf. Artif. Intell. 2025, 39, 20637–20645.
  18. Gu, Z.; Feng, S. Individuality Meets Commonality: A Unified Graph Learning Framework for Multi-view Clustering. ACM Trans. Knowl. Discov. Data 2023, 17, 7.
  19. Chen, Z.; Wu, X.; Xu, T.; Li, H.; Kittler, J. Deep Discriminative Multi-View Clustering. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6974–6978.
  20. Xu, J.; Meng, M.; Liu, J.; Wu, J. Deep multi-view clustering with diverse and discriminative feature learning. Pattern Recognit. 2025, 161, 111322.
  21. Gao, J.; Liu, M.; Li, P.; Zhang, J.; Chen, Z. Deep Multiview Adaptive Clustering with Semantic Invariance. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 12965–12978.
  22. Luong, K.; Nayak, R. A Novel Approach to Learning Consensus and Complementary Information for Multi-view Data Clustering. In Proceedings of the International Conference on Data Engineering, Dallas, TX, USA, 20–24 April 2020; pp. 865–876.
  23. Zheng, X.; Tang, C.; Liu, X.; Zhu, E. Multi-view Clustering via Matrix Factorization Assisted K-means. Neurocomputing 2023, 534, 45–54.
  24. Lin, Z.; Kang, Z. Graph Filter-based Multi-view Attributed Graph Clustering. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 2723–2729.
  25. Luo, S.; Zhang, C.; Zhang, W.; Cao, X. Consistent and Specific Multi-view Subspace Clustering. In Proceedings of the Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3730–3737.
  26. Cui, C.; Ren, Y.; Pu, J.; Pu, X.; He, L. Deep Multi-view Subspace Clustering with Anchor Graph. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 3577–3585.
  27. Trosten, D.J.; Lokse, S.; Jenssen, R.; Kampffmeyer, M. Reconsidering Representation Alignment for Multi-view Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1255–1265.
  28. Gao, Q.; Lian, H.; Wang, Q.; Sun, G. Cross-modal Subspace Clustering via Deep Canonical Correlation Analysis. Proc. AAAI Conf. Artif. Intell. 2020, 34, 3938–3945.
  29. Zhang, C.; Liu, Y.; Fu, H. AE2-nets: Autoencoder in Autoencoder Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2577–2585.
  30. Wen, J.; Liu, C.; Xu, G.; Wu, Z.; Huang, C.; Fei, L.; Xu, Y. Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-view Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15712–15721.
  31. Gao, J.; Liu, M.; Li, P.; Laghari, A.A.; Javed, A.R.; Victor, N.; Gadekallu, T.R. Deep Incomplete Multiview Clustering via Information Bottleneck for Pattern Mining of Data in Extreme-environment IoT. IEEE Internet Things J. 2023, 11, 26700–26712.
  32. Gao, J.; Li, P.; Laghari, A.A.; Srivastava, G.; Gadekallu, T.R.; Abbas, S.; Zhang, J. Incomplete Multiview Clustering via Semidiscrete Optimal Transport for Multimedia Data Mining in IoT. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 158.
  33. Li, P.; Laghari, A.A.; Rashid, M.; Gao, J.; Gadekallu, T.R.; Javed, A.R.; Yin, S. A Deep Multimodal Adversarial Cycle-consistent Network for Smart Enterprise System. IEEE Trans. Ind. Inform. 2022, 19, 693–702.
  34. Andrew, G.; Arora, R.; Bilmes, J.; Livescu, K. Deep Canonical Correlation Analysis. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1247–1255.
  35. Wang, W.; Arora, R.; Livescu, K.; Bilmes, J. On Deep Multi-view Representation Learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1083–1092.
Figure 1. The illustration of CommonMVC. Given multi-view data $\{x_i^1, \ldots, x_i^V\}_{i=1}^n$, CommonMVC generates view-specific representations $\{z_i^1, \ldots, z_i^V\}_{i=1}^n$ via deep autoencoders, i.e., $z_i^v = \mathrm{Enc}^v(x_i^v; \theta^v)$ and $\bar{x}_i^v = \mathrm{Dec}^v(z_i^v; \phi^v)$, and then obtains view-specific cluster partitions $\{s_i^1, \ldots, s_i^V\}_{i=1}^n$ from the view-specific representations via the clustering networks $\{f^v\}_{v=1}^V$. Meanwhile, CommonMVC obtains the fusion representations $\{z_i\}_{i=1}^n$ via the nonlinear transformation functions $\{g^v\}_{v=1}^V$ and generates the global partition $S$ via the clustering network $\mathrm{MLP}(\cdot)$. Finally, CommonMVC maximizes the mutual information $I(X^v; Z^v)$ to optimize the view-specific representation learning and cluster partitioning, maximizes the mutual information $I(Z; Z^v)$ to achieve the cluster complementarity learning, and performs instance-level and cluster-level partition invariance learning via minimizing $D_{KL}(S_{i,\cdot}, S_{i,\cdot}^v)$ and $D_{KL}(S_{\cdot,j}, S_{\cdot,j}^v)$, respectively.
Figure 2. The Nemenyi test on four datasets. (a) Nemenyi test on Handwritten; (b) Nemenyi test on ORL; (c) Nemenyi test on LandUse-21; (d) Nemenyi test on Scene-15.
Figure 3. Convergence analysis on four datasets.
Figure 4. Parameter analysis on the Handwritten dataset. (a) α vs. β; (b) γ vs. α; (c) γ vs. β.
Table 1. The information of datasets.

Dataset     | Samples | Views | Classes | Content
Handwritten | 2000    | 2     | 10      | Handwritten Digits
ORL         | 400     | 2     | 40      | Human Faces
LandUse-21  | 2100    | 2     | 21      | Satellite Images
Scene-15    | 4485    | 2     | 15      | Scene Images
Table 2. The average clustering results on the Handwritten and ORL datasets.

Method      | Handwritten ACC / NMI / ARI | ORL ACC / NMI / ARI
FeatConcate | 0.610 / 0.607 / 0.553       | 0.571 / 0.753 / 0.477
DCCA        | 0.663 / 0.660 / 0.614       | 0.597 / 0.778 / 0.502
DCCAE       | 0.692 / 0.670 / 0.633       | 0.594 / 0.775 / 0.499
AE²-Nets    | 0.815 / 0.714 / 0.667       | 0.688 / 0.757 / 0.514
MVaDE       | 0.888 / 0.808 / 0.776       | 0.695 / 0.736 / 0.504
SiMVC       | 0.830 / 0.761 / 0.698       | 0.692 / 0.756 / 0.526
SDMVC       | 0.899 / 0.821 / 0.801       | 0.610 / 0.756 / 0.504
GCFAggMVC   | 0.828 / 0.717 / 0.666       | 0.650 / 0.835 / 0.509
MSEC        | 0.862 / 0.827 / 0.698       | 0.647 / 0.770 / 0.551
DealMVC     | 0.813 / 0.718 / 0.642       | 0.137 / 0.329 / 0.044
SGDMC       | 0.904 / 0.837 / 0.818       | 0.702 / 0.825 / 0.551
MGCC        | 0.893 / 0.820 / 0.806       | 0.688 / 0.819 / 0.552
DMAC        | 0.888 / 0.817 / 0.798       | 0.686 / 0.804 / 0.527
STMVC       | 0.910 / 0.840 / 0.820       | 0.692 / 0.818 / 0.546
CommonMVC   | 0.912 / 0.848 / 0.823       | 0.711 / 0.848 / 0.580
Table 3. The average clustering results on the LandUse-21 and Scene-15 datasets.

Method      | LandUse-21 ACC / NMI / ARI | Scene-15 ACC / NMI / ARI
FeatConcate | 0.123 / 0.161 / 0.036      | 0.208 / 0.304 / 0.116
DCCA        | 0.155 / 0.232 / 0.044      | 0.362 / 0.289 / 0.109
DCCAE       | 0.156 / 0.244 / 0.044      | 0.364 / 0.298 / 0.115
AE²-Nets    | 0.248 / 0.304 / 0.104      | 0.261 / 0.304 / 0.121
MVaDE       | 0.225 / 0.225 / 0.094      | 0.378 / 0.299 / 0.118
SiMVC       | 0.245 / 0.258 / 0.096      | 0.377 / 0.294 / 0.126
SDMVC       | 0.238 / 0.229 / 0.120      | 0.386 / 0.213 / 0.126
GCFAggMVC   | 0.240 / 0.242 / 0.115      | 0.286 / 0.205 / 0.124
MSEC        | 0.234 / 0.253 / 0.098      | 0.285 / 0.233 / 0.154
DealMVC     | 0.180 / 0.192 / 0.065      | 0.278 / 0.226 / 0.140
SGDMC       | 0.243 / 0.266 / 0.110      | 0.293 / 0.281 / 0.151
MGCC        | 0.204 / 0.220 / 0.098      | 0.253 / 0.274 / 0.128
DMAC        | 0.238 / 0.255 / 0.116      | 0.300 / 0.301 / 0.142
STMVC       | 0.223 / 0.242 / 0.103      | 0.292 / 0.307 / 0.146
CommonMVC   | 0.259 / 0.280 / 0.120      | 0.310 / 0.312 / 0.152
Table 4. Loss analysis on the Handwritten and ORL datasets in terms of ACC, NMI, and ARI.

Method    | Handwritten ACC / NMI / ARI | ORL ACC / NMI / ARI
Variant_1 | 0.722 / 0.728 / 0.621       | 0.538 / 0.653 / 0.403
Variant_2 | 0.725 / 0.709 / 0.614       | 0.630 / 0.745 / 0.497
Variant_3 | 0.909 / 0.831 / 0.810       | 0.658 / 0.771 / 0.524
CommonMVC | 0.912 / 0.848 / 0.823       | 0.711 / 0.848 / 0.580
Table 5. View analysis on the Handwritten and ORL datasets in terms of ACC, NMI, and ARI.

Method    | Handwritten ACC / NMI / ARI | ORL ACC / NMI / ARI
View_1    | 0.822 / 0.755 / 0.723       | 0.633 / 0.737 / 0.498
View_2    | 0.722 / 0.620 / 0.636       | 0.578 / 0.684 / 0.421
CommonMVC | 0.912 / 0.848 / 0.823       | 0.711 / 0.848 / 0.580

Share and Cite

MDPI and ACS Style

Wen, Y.; Li, H. Cluster Complementarity and Consistency Mining for Multi-View Representation Learning. Mathematics 2025, 13, 2521. https://doi.org/10.3390/math13152521
