Adaptive Weighted Graph Fusion Incomplete Multi-View Subspace Clustering

With the enormous amount of multi-source data produced by various sensors and feature extraction approaches, multi-view clustering (MVC) has attracted growing research attention and is widely exploited in data analysis. Most existing multi-view clustering methods rely on the assumption that all views are complete. However, in many real scenarios, multi-view data are often incomplete for many reasons, e.g., hardware failure or incomplete data collection. In this paper, we propose an adaptive weighted graph fusion incomplete multi-view subspace clustering (AWGF-IMSC) method to solve the incomplete multi-view clustering problem. First, to eliminate the noise existing in the original space, we transform the complete original data into latent representations, which contribute to better graph construction for each view. Then, we incorporate feature extraction and incomplete graph fusion into a unified framework, in which the two processes can negotiate with each other, serving the graph learning task. A sparse regularization is imposed on the complete graph to make it more robust to view inconsistency. Besides, the importance of different views is automatically learned, further guiding the construction of the complete graph. An effective iterative algorithm with guaranteed convergence is proposed to solve the resulting optimization problem. Experiment results on several real-world datasets demonstrate the effectiveness and superiority of our method compared with existing state-of-the-art methods.


Introduction
Traditional clustering methods [1][2][3][4] usually use a single view to measure the similarity of samples. With the rapid progress of data collection, individual features are no longer sufficient to describe data points, while multiple views usually contain supplementary information that helps explore the underlying structure of the data. With the development of information technology, data mining and other technologies, many real-world datasets can be presented from different perspectives, called multi-view data. For example, the same text can be expressed in various languages. In the scope of biometric recognition, faces, fingerprints, palm prints and irises can form the different views of multi-view data. In the field of medical diagnosis, different examinations of a patient can be regarded as different views. Multi-view data provides richer information than a traditional single feature representation for revealing the underlying clustering structure. Furthermore, distinct views contain specific intra-view information and complementary inter-view information, which can negotiate with each other to boost the clustering performance [5][6][7][8][9][10][11][12][13][14].
Based on different mechanisms, we can divide the existing multi-view clustering methods into four categories. The first category refers to multi-kernel clustering; these methods usually combine multiple pre-defined kernels to reach optimal clustering results [12,15-17].

Figure 1. Framework of the proposed adaptive weighted graph fusion incomplete multi-view subspace clustering (AWGF-IMSC). It is a novel incomplete multi-view clustering method that fuses local-structure-preserving graphs with adaptive view-importance learning. Incomplete graphs of different scales are fused into a complete graph with automatically learned weights. In addition, the constructed complete graph further guides the learning process of the incomplete graphs and latent representations.
Compared with existing methods, the proposed adaptive weighted graph fusion incomplete multi-view subspace clustering (AWGF-IMSC) algorithm has the following contributions:
• It induces similarity graph fusion after obtaining the latent spaces, extracting the local structure within each view. By virtue of this, noise existing in the original space can be eliminated in the latent space, contributing to better graph construction.
• It incorporates the relations between missing samples and complete samples into the complete graph. The sparse constraint imposed on the complete graph alleviates view inconsistency and reduces the disagreements between views, making the proposed method more robust in most cases.
• The importance of each view is automatically learned and adaptively optimized during the optimization, so the more important views exert stronger guidance on the learning process. Moreover, there is no limitation on the number of views; the proposed method is applicable to any multi-view dataset.
The rest of the paper is organized as follows. Section 2 introduces the notations and symbols used in this paper. Section 3 reviews the methods most related to our work. The proposed algorithm and its optimization process are formulated in Section 4, where we also analyze the convergence and complexity of the proposed algorithm. Extensive experiment results and analysis are presented in Section 5, followed by conclusions and future perspectives.

Notation
For clarity, we give the notation used throughout the paper at the beginning. We use bold letters to represent matrices and vectors. For a matrix A, A_{:,j} and A_{i,j} represent its j-th column and (i, j)-th element, respectively. A^T, Tr(A) and A^{-1} denote the transpose, trace and inverse of A, respectively. ‖·‖_F denotes the Frobenius norm, and the ℓ_{2,1} norm is denoted as ‖A‖_{2,1} = ∑_j ‖A_{:,j}‖_2. For multi-view data {A^(1), A^(2), ..., A^(m)}, the superscript (i) represents the i-th view. In each A^(i) ∈ R^{d_i×N}, every column indicates an instance, N is the number of samples and d_i is the feature dimension of the i-th view.

Related Work
In this section, we will present the work most relevant to the proposed method, i.e., semi-non-negative matrix factorization and subspace learning.

Semi-Non-Negative Matrix Factorization for Single View
Non-negative matrix factorization (NMF) is a significant branch of matrix factorization. NMF aims at finding two non-negative matrices U ∈ R_+^{d×K} and V ∈ R_+^{K×N} that roughly approximate the original data matrix, i.e., X ≈ UV. Since many real-world datasets are high-dimensional, NMF methods have been widely applied in image analysis [40], data mining, speech denoising [41], population genetics, etc. Semi-NMF [42] is an extension of traditional NMF which only requires the coefficient matrix to be non-negative. Specifically, given the data matrix X ∈ R^{d×N}, semi-NMF utilizes a base matrix U ∈ R^{d×k} and a non-negative coefficient matrix V ∈ R^{k×N} to approximate X:

min_{U,V} ‖X − UV‖_F^2, s.t. V ≥ 0.

Ding et al. [42] further propose an iterative optimization algorithm to find a local optimal solution. The updating strategy can be summarized as follows. With V fixed, U is updated by U = XV^T(VV^T)^{-1}.
With U fixed, V is updated by the multiplicative rule

V_{i,j} ← V_{i,j} · sqrt( [(U^T X)^+ + (U^T U)^- V]_{i,j} / [(U^T X)^- + (U^T U)^+ V]_{i,j} ),

where the positive and negative parts of a matrix M are denoted as M^+_{i,j} and M^-_{i,j}, satisfying M_{i,j} = M^+_{i,j} − M^-_{i,j}. NMF and semi-NMF are also widely employed in multi-view clustering. Many multi-view clustering (MVC) methods utilize NMF to reduce the dimension of each view or to directly reach a consistent latent representation [28,43]. Especially in the incomplete multi-view scenario, NMF and semi-NMF play significant roles in achieving a consistent representation from different incomplete views. Li et al. [35] learn a shared representation for the paired instances and view-specific representations for the unpaired instances via NMF; the complete latent representation is attained by combining the shared and view-specific representations. The method in [44] utilizes weighted semi-NMF to reach a consensus representation, and then ℓ_{2,1}-norm regularized regression is imposed to align the different basis matrices. Although these NMF-based methods can learn a consensus representation from incomplete views, the restriction on the number of views and the absence of local structure limit their performance.
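The alternating updates above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code; the random non-negative initialization of V (rather than the k-means initialization used later in Algorithm 1) is an assumption for simplicity.

```python
import numpy as np

def semi_nmf(X, k, n_iter=100, seed=0):
    """Semi-NMF sketch: X (d x N) ~ U V with V >= 0, Ding-style updates."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    V = rng.random((k, N)) + 0.1          # non-negative coefficient matrix
    pos = lambda M: (np.abs(M) + M) / 2   # M^+ : positive part
    neg = lambda M: (np.abs(M) - M) / 2   # M^- : negative part
    for _ in range(n_iter):
        # U-step: closed form U = X V^T (V V^T)^{-1}
        U = X @ V.T @ np.linalg.pinv(V @ V.T)
        # V-step: multiplicative update that keeps V >= 0
        UX, UU = U.T @ X, U.T @ U
        num = pos(UX) + neg(UU) @ V
        den = neg(UX) + pos(UU) @ V + 1e-12
        V *= np.sqrt(num / den)
    return U, V
```

Because the V-step multiplies by a non-negative factor, non-negativity is preserved at every iteration while the reconstruction error is non-increasing.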

Subspace Clustering
Subspace clustering is an extension of traditional clustering which aims at grouping data lying in different subspaces [45,46]. The self-representation property [47] of subspace clustering represents each data point as a linear combination of the other points. The formulation can be expressed as:

min_Z ‖X − XZ‖_F^2 + β Ω(Z),

where X ∈ R^{d×N} is the original data, Z is the self-representation coefficient matrix whose columns are new representations of the corresponding data points, Ω(·) is a regularizer on Z, and β > 0 is a trade-off parameter. Since Z reflects the correlations among samples, it can be regarded as a graph, on which a spectral clustering algorithm is performed to get the final clustering result.
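As a concrete sketch, choosing the Frobenius norm as the regularizer Ω(Z) gives a closed-form solution for Z (the paper's exact regularizer may differ; this choice is only for illustration). The symmetrized |Z| then serves as the affinity for spectral clustering:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def self_representation_graph(X, beta=0.1):
    """min ||X - XZ||_F^2 + beta*||Z||_F^2 has the closed form
    Z = (X^T X + beta I)^{-1} X^T X; symmetrize |Z| into an affinity."""
    N = X.shape[1]
    G = X.T @ X
    Z = np.linalg.solve(G + beta * np.eye(N), G)
    return (np.abs(Z) + np.abs(Z).T) / 2

# usage: spectral clustering on the learned affinity graph
X = np.random.default_rng(0).random((5, 30))
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(self_representation_graph(X))
```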

Incomplete Multi-View Spectral Clustering with Adaptive Graph Learning (IMSC-AGL)
In [48], a novel graph-based multi-view clustering method termed incomplete multi-view spectral clustering with adaptive graph learning (IMSC-AGL) is proposed to deal with incomplete multi-view scenarios. IMSC-AGL optimizes a shared graph from the low-dimensional representations formed individually by each view. Moreover, a nuclear-norm constraint is introduced to ensure the low-rank property of the ideal graph. In its formulation, Y^(v) represents the complete samples in the v-th view, Z^(v) denotes the graph of the v-th view, F^(v) represents the clustering indicator matrix of proper size, and M refers to the final shared clustering indicator matrix. Although IMSC-AGL achieves considerable performance in various applications, it can still be improved in terms of the number of hyper-parameters and by fusing the multiple sources of information in a weighted manner.

Adaptive Weighted Graph Fusion Incomplete Multi-View Subspace Clustering
In this section, we present our adaptive graph fusion incomplete multi-view subspace clustering method (AWGF-IMSC) in detail and give a unified objective function.
For incomplete multi-view data, we remove the missing instances and reform the data as X^(i) ∈ R^{d_i×n_i}, where d_i and n_i represent the feature dimension and the number of visible samples of the i-th view, respectively. Semi-NMF factorizes the input data X^(i) ∈ R^{d_i×n_i} into a base matrix U^(i) ∈ R^{d_i×k} and a coefficient matrix V^(i) ∈ R^{k×n_i}, where k is the dimension of the target space and is commonly set to the number of clusters of X^(i). Considering that the missing samples differ in each view, we learn latent representations of the corresponding visible samples in each view. Therefore, semi-NMF for an individual view can be formulated as:

min ‖X^(i) − U^(i)V^(i)‖_F^2, s.t. V^(i) ≥ 0.

To further exploit the intra-view similarity structure and the underlying subspace structure, we utilize the self-representation property [47] on the k × n_i-dimensional latent representation V^(i) to construct a graph. Thus, we obtain the graph Z^(i) of each individual view by solving the following problem:

min ‖V^(i) − V^(i)Z^(i)‖_F^2, s.t. 0 ≤ Z^(i)_{j,k} ≤ 1, (Z^(i))^T 1 = 1,

where the constraints guarantee a good probabilistic interpretation of Z^(i). After obtaining the graphs on each view, a natural idea is to integrate the multiple incomplete graphs into a complete one. In order to establish the correspondence between the incomplete and complete graphs, we introduce the index matrix O^(i) ∈ R^{n_i×N}, which extracts the visible instances of view i from the complete graph: O^(i)_{j,k} = 1 if the j-th visible instance of view i is the k-th sample among all N samples, and 0 otherwise. Through the index matrix, we can transform between complete and incomplete graphs: O^(i) Z* (O^(i))^T extracts from the complete graph Z* the sub-graph over the samples visible in view i, while (O^(i))^T Z^(i) O^(i) expands the graph Z^(i) into Ẑ*, where Ẑ* has the same size as Z* but the entries irrelevant to view i are zero.
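The extraction and expansion via the index matrix can be sketched directly in NumPy. The encoding below (one row per visible sample) is an illustrative realization of O^(i); the function and variable names are ours, not the paper's.

```python
import numpy as np

def index_matrix(visible_idx, N):
    """Build O^(i) in {0,1}^{n_i x N}: row j selects the j-th visible sample.

    `visible_idx` lists, in order, the positions of view i's visible samples
    among all N samples.
    """
    O = np.zeros((len(visible_idx), N))
    O[np.arange(len(visible_idx)), visible_idx] = 1
    return O

# extraction: O Z* O^T restricts the complete graph to view i's visible samples;
# expansion:  O^T Z O pads an incomplete graph back to N x N with zeros elsewhere.
O = index_matrix([0, 2, 3], N=5)       # view i sees samples 0, 2 and 3
Z_star = np.arange(25.0).reshape(5, 5)  # toy complete graph
Z_sub = O @ Z_star @ O.T                # 3 x 3 incomplete graph
Z_hat = O.T @ Z_sub @ O                 # 5 x 5, zeros at the missing rows/cols
```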
Owing to the graph size and the similarity magnitude differing among views, it is unreasonable to directly add up the multiple graphs. Consequently, we aim to integrate the multiple sources of information into a complete graph with adaptively learned weights {α_i}_{i=1}^m. With the help of the index matrices, the relevant elements can be extracted from Z*. Then, we adaptively fuse {Z^(i)}_{i=1}^m into a complete graph with automatically learned weights, as illustrated in Equation (6),
where α_i is the weight of the i-th view, automatically learned and optimized to reflect the importance of that view. In this manner, the complete graph is learned as a weighted combination of the incomplete graphs. However, along with the beneficial information, the inconsistencies between different views and the noise and outliers of individual views are also fused into the complete graph.
Considering this, an additional sparse constraint is added on Z*. Integrating the above parts into a unified objective function, we obtain our optimization goal in Equation (7), where λ_1 and λ_2 are non-negative trade-off parameters. The proposed framework consists of four terms: semi-NMF to obtain the latent representations, graph construction via self-representation, adaptive graph fusion, and a sparse regularizer. Finally, we obtain a full-size graph Z* incorporating all sample information in the latent subspace.

Optimization Algorithm for AWGF-IMSC
The constraint problem in Equation (7) is not jointly convex with regard to all the variables. In this section, we propose an alternating iterative algorithm to solve this optimization problem.

Update U (i)
With V^(i), Z^(i), α_i and Z* fixed, we need to solve a sub-problem for each U^(i). Each U^(i) can be solved separately since the views are independent of each other; the optimization reduces to a least-squares problem, and the solution for U^(i) is easily obtained by setting the derivative w.r.t. U^(i) to zero.
This yields the optimal closed-form solution U^(i) = X^(i) (V^(i))^T (V^(i) (V^(i))^T)^{-1}.

Update V (i)
Fixing U^(i), Z^(i), α_i and Z*, the minimization problem for V^(i) can be simplified as in Equation (12). We can update V^(i) in Equation (12) following the update strategy of semi-NMF, since the semi-NMF and subspace learning processes are isolated from each other. The partial derivative of L(V^(i)) with respect to V^(i) is obtained accordingly, and, according to the optimization of semi-NMF and the KKT conditions, we achieve the multiplicative updating rule for V^(i).

Update Z (i)
With U^(i), V^(i), α_i and Z* fixed, the optimization for Z^(i) can be simplified into an equivalent column-wise problem. Setting the derivative with respect to Z^(i)_{:,j} to zero, we obtain a closed-form solution for each column, and hence for each view.

Update α_i
Fixing U^(i), V^(i), Z^(i) and Z* and removing the irrelevant terms, the optimization for α_i can be transformed into solving Equation (18).
For each view, we can obtain the corresponding Lagrange function, where γ is the Lagrange multiplier. Setting the derivative of L(α_i) w.r.t. α_i to zero expresses α_i in terms of γ. According to the constraint ∑_{i=1}^m α_i = 1, we can compute γ and further obtain each α_i.
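As a sketch of this kind of Lagrange-based weight update, assume the common squared-weight form ∑_i α_i² h_i with ∑_i α_i = 1, where h_i ≥ 0 is view i's fusion residual; the stationarity condition 2 α_i h_i = γ then gives α_i ∝ 1/h_i. The exact weighting in the paper's Equation (18) may differ; this is only an illustrative assumption.

```python
import numpy as np

def update_alpha(h):
    """Closed-form weights under the assumed objective sum_i alpha_i^2 * h_i
    with sum_i alpha_i = 1: alpha_i = (1/h_i) / sum_j (1/h_j)."""
    inv = 1.0 / (np.asarray(h, dtype=float) + 1e-12)  # guard against h_i = 0
    return inv / inv.sum()

alpha = update_alpha([2.0, 1.0, 4.0])  # smaller residual -> larger weight
```

Views with smaller fusion residuals (i.e., graphs closer to the current Z*) automatically receive larger weights, which matches the adaptive view-importance behavior described in the text.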

Update Z *
With U^(i), V^(i), Z^(i) and α_i fixed, we need to minimize the following objective for Z*.
We can then derive the equivalent element-wise equation. Note that the indices j, p differ across views; r denotes the number of views in which samples j and p are visible simultaneously. As can be seen in Equation (23), the optimal Z* is a weighted combination of the self-representation graphs over the views in which the corresponding samples are visible. Moreover, noise and outliers are assigned very small values, making Z* sparse. Therefore, we obtain a robust and complete graph revealing all the relationships among the samples.
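The element-wise fusion can be sketched as follows. Each entry of Z* averages the weighted, expanded incomplete graphs over the views in which both samples are visible (the count r in the text); the sparsity shrinkage of the paper's Equation (23) is omitted here, so this is only an illustrative simplification with names of our own choosing.

```python
import numpy as np

def fuse_graphs(Z_list, O_list, alpha):
    """Entry-wise weighted fusion sketch for the complete graph Z*."""
    N = O_list[0].shape[1]
    num = np.zeros((N, N))
    cnt = np.zeros((N, N))                    # r: co-visibility count per entry
    for Z, O, a in zip(Z_list, O_list, alpha):
        mask = (O.sum(0) > 0).astype(float)   # which samples this view sees
        num += a * (O.T @ Z @ O)              # expand and weight the view graph
        cnt += np.outer(mask, mask)           # both endpoints visible here?
    # average only where at least one view covers the pair; zero elsewhere
    return np.divide(num, cnt, out=np.zeros_like(num), where=cnt > 0)
```

Entries of Z* corresponding to sample pairs never observed together in any view stay zero, which is consistent with the sparsity of the fused graph described above.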

Convergence and Computational Complexity
We end this section by analyzing the convergence and computational complexity of our proposed method.
Convergence analysis: Each sub-problem in Algorithm 1 is convex in the variable being updated and admits a globally optimal solution, so the value of the objective function is non-increasing until convergence. Experiment results in the next section demonstrate this in practice.
Computational complexity analysis: With the optimization process outlined in Algorithm 1, the total time complexity consists of five parts corresponding to the alternate steps. In the incomplete multi-view setting, the dimensionality and the number of visible samples vary across views. The first stage computes U^(i) and V^(i), where q is the number of iterations. The time cost of updating Z^(i) is O(n_i(n_i k^2 + n_i^3)), the time cost of updating α_i is O(m n_i^3), and solving Z* requires O(m n_i^2). Overall, the time complexity of our algorithm is O(m n_i(n_i k^2 + n_i^3 + n_i + n_i k)).

Algorithm 1: AWGF-IMSC
Input: incomplete multi-view data {X^(i)}_{i=1}^m, the number of clusters k, hyper-parameters λ_1 and λ_2.
Initialize: Initialize V^(i) ∈ R^{k×n_i} by preliminary k-means: if a sample belongs to a cluster, the corresponding element of V^(i) equals 1, and 0 otherwise. Initialize Z^(i) with the k-nearest neighbors on the visible data points of each view. Initialize α_i with 1/m, where m is the number of views. Initialize Z* with the average of the Z^(i).
while not converged do
  Update U^(i) by Equation (11);
  Update V^(i) by solving Equation (14);
  Update Z^(i) by Equation (17);
  Update α_i by Equation (20);
  Update Z* by Equation (23);
end
Output: Z*. Perform spectral clustering on Z* to get the clustering results.

Datasets
To demonstrate the effectiveness of our proposed algorithm AWGF-IMSC, we compare it with six baseline methods on four benchmark datasets. The statistical information of the datasets is displayed in Table 1. Detailed introductions are as follows:
• BUAA-visnir face database (BUAA) [49]. The BUAA dataset used in this paper contains 1350 instances of 150 categories. Each instance has a visible image (VIS) and a near-infrared image (NIR), which naturally form a two-view dataset. Both VIS and NIR images are 640 × 480 pixels; they are resized to 10 × 10 and vectorized into 100-dimensional features.
• Caltech7 [50]. The Caltech7 dataset is a subset of the Caltech101 dataset, containing seven categories (Face, Motorbikes, Dollar-Bill, Garfield, Snoopy, Stop-Sign and Windsor-Chair) and 1474 instances. The original images of Caltech7 differ in size. Following the work in [48], we select two of the five given features as the multi-view dataset: 512-dimensional GIST features [51] and 928-dimensional local binary pattern (LBP) features [51].
• One-hundred plant species leaves dataset (100Leaves) [52]. The 100Leaves dataset contains 1600 instances from 100 categories. The original images of 100Leaves also differ in size. Shape descriptor, fine-scale margin and texture histogram features constitute three views depicting the samples from different perspectives.
• Mfeat handwritten digit dataset (Mfeat) [53]. This dataset contains 2000 samples; the original images are 891 × 702 pixels. The public multi-view version of it has six views. In our experiments, we select the 76-dimensional Fourier coefficients of the character shapes and the 240-dimensional pixel averages.
As shown in Figure 2, we randomly select six pictures of different categories from the four original datasets for display.

Baselines
We conduct extensive experiments, comparing with several state-of-the-art incomplete multi-view clustering methods. Brief introductions are given below.
• Best single view (BSV). BSV first fills the missing samples with the average feature values of the corresponding view. Affinity matrices are then constructed with a Gaussian kernel, a spectral clustering algorithm is performed on the similarity matrix of each view, and the best clustering performance is reported.
• Partial multi-view clustering (PVC) [35]. This method supposes that instances available in both views should have a common representation, while view-specific instances missing in the other view should maintain their specific information. Based on NMF, this method integrates the common and view-specific representations in the latent space to form a unified representation.
• Multiple incomplete view clustering via weighted non-negative matrix factorization with ℓ_{2,1} regularization (MIC) [38]. This method first fills the missing instances with the average feature values and then learns an ℓ_{2,1}-regularized latent subspace by weighted NMF.
• Incomplete multi-modal visual data grouping (IMG) [36]. IMG uses the latent representation to generate a complete graph, which establishes a connection between the missing data of different views.
• Doubly aligned incomplete multi-view clustering (DAIMC) [37]. This method first aligns the samples into a common representation by semi-NMF and then aligns the base matrices with the help of an ℓ_{2,1}-regularized regression model.
• Incomplete multi-view spectral clustering with adaptive graph learning (IMSC-AGL) [48]. IMSC-AGL induces a co-regularization term to learn the common representation, integrating graph learning and spectral clustering.
For the compared methods, we run their demos with the suggested or default parameters and repeat five times to obtain average results. Note that the PVC and IMG methods can only deal with two-view scenarios; in our experiments, we combine two different views and report the best result.
Following most existing works, we utilize accuracy (ACC) and normalized mutual information (NMI) to measure the clustering results; higher values represent better clustering performance. Denoting the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), ACC is defined as

ACC = (TP + TN) / (TP + TN + FP + FN),

indicating the percentage of correctly predicted results among the total samples. NMI quantifies the amount of information one random variable contains about another. It can be formulated as

NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y)),

where the mutual information I(X, Y) is ∑_x ∑_y p(x, y) log( p(x, y) / (p(x) p(y)) ), p(x, y) is the joint probability distribution of X and Y, and p(x) is the marginal probability distribution of X. H(X) = −∑_i p(x_i) log p(x_i) is the information entropy, regarded as the uncertainty of a random variable.
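The NMI definition above can be implemented directly from the joint and marginal label distributions. The sketch below uses the square-root normalization; note that other normalizations (e.g., the arithmetic mean of the entropies) also appear in the literature, so this is one common convention rather than the only one. We sanity-check it against scikit-learn's implementation with geometric averaging.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi(labels_true, labels_pred):
    """NMI = I(X, Y) / sqrt(H(X) H(Y)), computed from label counts."""
    x, y = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(x)

    def entropy(z):
        p = np.bincount(z) / n
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))   # joint probability p(x, y)
            if pxy > 0:
                mi += pxy * np.log(pxy / ((x == a).mean() * (y == b).mean()))
    return mi / np.sqrt(entropy(x) * entropy(y))

# sanity check against scikit-learn (geometric averaging = sqrt normalization)
t, p = [0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]
assert abs(nmi(t, p) - normalized_mutual_info_score(
    t, p, average_method="geometric")) < 1e-8
```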

Experiment Setting
In our experiments, we generate incomplete data from complete multi-view datasets in the One-complete manner: we randomly select one view to be complete, and the remaining views suffer incomplete ratios (IR) of 10%, 20%, 30%, 40%, 50%, 60% and 70%. For the two-view datasets BUAA, Caltech7 and Mfeat, we randomly select one view as the complete view and randomly remove 10-70% of the samples from the other view. For datasets with more than two views, one view is chosen randomly to be complete and the rest suffer 10-70% missing samples. The datasets used in this paper can be found on our Github (https://github.com/Jeaninezpp/Incomplete-multi-view-datasets). For fairness, we run our proposed method five times, as with the compared methods. Our code is available at https://github.com/Jeaninezpp/AWGF-code.
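The One-complete masking protocol described above can be sketched as follows; the function returns, for each view, the indices of the samples that remain visible. Function and parameter names are illustrative, not those of the released code.

```python
import numpy as np

def make_incomplete(views, ir, complete_view=0, seed=0):
    """'One-complete' masking sketch: keep one view complete and randomly
    drop a fraction `ir` of the samples from every other view.

    `views` is a list of (d_i x N) feature matrices sharing the same N.
    """
    rng = np.random.default_rng(seed)
    N = views[0].shape[1]
    visible = []
    for i, X in enumerate(views):
        if i == complete_view:
            visible.append(np.arange(N))                 # fully observed view
        else:
            keep = rng.choice(N, size=int(round(N * (1 - ir))), replace=False)
            visible.append(np.sort(keep))                # visible-sample indices
    return visible
```

The returned index lists are exactly what the index matrices O^(i) of Section 4 are built from.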

Experiment Results and Analysis
Experiment results of the compared methods on the different datasets are enumerated in Tables 2-5. These four tables present the ACC results of the above algorithms on the four benchmark datasets. Each row shows the accuracy of the compared methods under a certain incomplete ratio, with the best results highlighted in bold. Each column shows how the accuracy of the corresponding method evolves as the incomplete ratio increases. Under each incomplete ratio, we sort the accuracies from high to low and regard the resulting sequence numbers as the ranks of the algorithms; averaging these ranks per algorithm yields the average rank, which illustrates the robustness of a method with respect to the incomplete ratio. Furthermore, we depict the NMI metric in Figure 3 with line charts. Based on these results, we have the following observations:
• Compared with the proposed method, BSV yields worse clustering performance. This is mainly because directly filling the missing instances with the average features leads them to be clustered into the same group. The weighted NMF methods DAIMC and MIC perform better than BSV at low missing rates, since NMF-based methods learn a shared representation to exploit the complementary information across views; besides, the weighted manner reduces the negative impact of the missing instances. However, with an increasing incomplete ratio, these two methods suffer a sharp decline, especially apparent on the Mfeat dataset. Methods involving graph construction, such as IMG and IMSC-AGL, perform better. Our proposed method integrates the advantages of NMF-based and graph-based methods, adaptively fusing the graphs learned from the embedding spaces. Therefore, our AWGF-IMSC method reaches the best clustering performance in most cases.
• On the dataset reported in Table 3, our method surpasses the second-best method by 7.94%, 7.58%, 7.18%, 7.53%, 14.93%, 9.24% and 7%, respectively. More significant improvements can be seen on the Caltech7 dataset: in Table 2, the ACCs of the proposed method are 23.42%, 13.10%, 15.02%, 9.74%, 10.48%, 10.8% and 11.43% higher than those of the second-best IMSC-AGL method. These significant results verify the effectiveness of the proposed adaptive weighted graph-based fusion learning for incomplete multi-view clustering. Our method achieves the best average rank on Caltech7, BUAA and 100Leaves. On Mfeat, the average rank of the proposed method is second best, only 0.43 more than the best but 0.85 less than the third.
• As shown in Figure 3, the proposed algorithm outperforms the other methods on all datasets under various incomplete ratios. Besides, our method exhibits a relatively stable trend as the missing rate increases. The abnormal behavior of BSV on the Caltech7 dataset (Figure 3a) may occur because the preserved complete view retains an excellent structure when generating the large-missing-rate datasets, so the results of BSV are outstanding while the other methods suffer from the negative impact of the missing samples. Our method is still superior to all compared methods by a significant margin, further illustrating the effectiveness and superiority of the proposed method.

Analysis of the Parameter Sensitivity
In this section, we analyze the impact of the hyper-parameters λ_1 and λ_2 of AWGF-IMSC on the clustering performance. The parameters are chosen from {10^{-3}, 10^{-2}, ..., 10^3} by grid search. Figures 4 and 5 plot the NMI results obtained by varying λ_1 and λ_2 over this large range on BUAA, Caltech7, 100Leaves and Mfeat.
We have the following observations from Figures 4 and 5: (i) both hyper-parameters are effective in improving the clustering performance; (ii) AWGF-IMSC is practically stable with respect to these parameters, achieving competitive performance over a wide range of parameter settings at low missing rates; (iii) as the incomplete ratio increases, combinations of a relatively smaller λ_1 and a larger λ_2 tend to achieve better performance, since λ_2 controls the impact of the sparse regularizer on the complete graph, and imposing sparseness within a specific range filters out the noise in the graph and the inconsistency between views; (iv) across the different datasets, the performance remains relatively stable over most parameter settings.

Convergence Analysis
As illustrated in the optimization section, our algorithm is theoretically guaranteed to converge to a local minimum. For the empirical convergence analysis, we conduct experiments on the four datasets with all incomplete ratios over the whole suggested parameter range. We randomly select settings from each dataset and draw the evolution of the objective value, as shown in Figure 6. We observe that the objective values monotonically decrease at each iteration and usually converge within 20 iterations. These results verify the convergence of our proposed algorithm.

Conclusions
This article proposes a novel incomplete multi-view clustering method that fuses local-structure-preserving graphs with adaptive view-importance learning. We incorporate representation learning and incomplete graph fusion into a unified framework, in which the two processes can negotiate with each other, serving the graph learning task. A sparse regularization is imposed on the complete graph to make it more robust to view inconsistency. Moreover, the importance of different views is automatically learned, further guiding the construction of the complete graph. We conduct experiments to illustrate the effectiveness and superiority of the proposed method. Some recent works utilize deep neural networks, such as generative adversarial networks (GANs), to generate the missing features in incomplete multi-view clustering. The utilization of neural networks for multi-view and incomplete multi-view clustering will be a consideration in our future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: