Adaptive Kernel Graph Nonnegative Matrix Factorization

Abstract: Nonnegative matrix factorization (NMF) is an efficient method for feature learning in the fields of machine learning and data mining. To investigate the nonlinear characteristics of datasets, kernel-method-based NMF (KNMF) and its graph-regularized extensions have received much attention from researchers due to their promising performance. However, the graph similarity matrix of the existing methods is often predefined in the original space of the data and kept unchanged during the matrix-factorization procedure, which leads to non-optimal graphs. To address these problems, we propose a kernel-graph-learning-based, nonlinear, nonnegative matrix-factorization method in this paper, termed adaptive kernel graph nonnegative matrix factorization (AKGNMF). In order to automatically capture the manifold structure of the data in the nonlinear feature space, AKGNMF learns an adaptive similarity graph. We formulate a unified objective function in which global similarity graph learning is optimized jointly with the matrix-decomposition process. A local graph Laplacian is further imposed on the learned feature subspace representation. The proposed method thus relies on both a factorization that respects the geometric structure and the mapped high-dimensional subspace feature representations. In addition, an efficient iterative solution is derived to update all variables of the resulting objective problem in turn. Experiments on synthetic datasets visually demonstrate the ability of AKGNMF to separate nonlinear data with high clustering accuracy. Experiments on real-world datasets verify the effectiveness of AKGNMF in three aspects: clustering performance, parameter sensitivity and convergence. Comprehensive experimental findings indicate that the proposed AKGNMF algorithm is effective and superior to various classic and state-of-the-art methods.


Introduction
In pattern recognition, machine learning and data mining, clustering can help to decipher the data distribution and the characteristics of data clusters. The key point of clustering tasks is to find the internal structural information of the original data and make the data more discriminative [1-4]. Several methods have been developed for clustering, including spectral clustering and k-means [5-7], which rely on a metric of data similarity [8-10]. Due to its potential for clustering representation, nonnegative matrix factorization, an effective method for data dimensionality reduction and feature extraction, has been widely used in clustering tasks [11,12]. NMF expresses data in a parts-based manner by finding two nonnegative matrices whose product is close to the raw data, allowing only additive combinations of data. Unlike other matrix-factorization methods such as singular value decomposition (SVD) [13], independent component analysis (ICA) [14] and principal component analysis (PCA) [15], whose negative decomposition components limit interpretability, NMF provides a nonnegative factorization of multivariate data [16,17].
NMF also offers ease of implementation and decomposition, as well as interpretable factorization results.
Many studies have shown that real-world data can be considered as samples from a nonlinear low-dimensional manifold [18-20]. Ordinary NMF [21] ignores the intrinsic manifold structure of the data space. Therefore, Cai et al. [22] proposed graph-regularized NMF (GNMF), which explores the geometric relationships of the data through a graph-regularization method to improve clustering performance. However, GNMF is a linear method that fails to exploit the nonlinear characteristics of the data. In the work of Tolić et al. [23], the nonlinear graph-regularized KNMF (KOGNMF) was further proposed, in which the nonlinear properties of the manifold and its global geometric structure are induced. However, the graph adjacency matrix employed in KOGNMF is predefined by the k-nearest-neighbor (kNN) or ball-graph technique. Graphs constructed by these methods are sensitive to noise and outliers in the data [24-26]. To overcome this drawback, a learning method can be utilized to generate the graph similarity matrix, which then regularizes the NMF method [27]. Nevertheless, graph learning and matrix factorization are still performed as two separate steps, which leads to suboptimal clustering performance.
Recently, Peng et al. [28] proposed a flexible NMF with adaptively learned graph regularization (FNMFG), where the graph is learned jointly with the matrix factorization. Analogously, the adaptive graph-regularized NMF (NMFAN) method was proposed [29]. Yi et al. [30] proposed NMF with a locality-constrained adaptive graph (NMF-LCAG), which integrates nonnegative matrix factorization and adaptive graph learning with two locality constraints. Instead of predefined graph-based manifold-regularization terms, these unified formulations simultaneously optimize the similarity matrix and the data representation, resulting in better performance. To avoid situations in which the non-convexity of NMF models frequently yields poor local solutions in the presence of noise or outliers, Chen et al. [31] proposed sparsity-constrained graph nonnegative matrix factorization (SGNMF) to enhance robustness and eliminate noise; sparse graph nonnegative matrix factorization was formulated as a global optimization problem by applying a sum of different smooth functions to approximate the l0 norm. Yang et al. [32] proposed self-paced nonnegative matrix factorization with adaptive neighbors (SPLNMFAN), in which self-paced regularization is introduced to find a better factorized matrix by sequentially selecting data, and the adaptive graph learns the data graph by selecting the locally optimal neighbors for each data point. However, since the existing graph-based NMF models are essentially linear, they are not suitable for tasks that deal with nonlinear data.
In this paper, we propose a novel graph-regularized NMF model, referred to as the adaptive kernel graph nonnegative matrix factorization (AKGNMF) model, to overcome the above limitations from a new point of view and to explore the manifold structure of nonlinear subspaces by adaptively learning the kernel similarity of high-dimensional mappings. Compared with traditional nonnegative matrix-factorization methods, AKGNMF maps the original data to high-dimensional subspaces and learns global similarity through adaptive graphs, further introducing a flexible graph-regularization term that preserves local manifold structures. This strategy is able to exploit nonlinear structural information and obtain factorizations with efficient feature representations. In order to mine the potential structural information of nonlinear data, we use the idea of subspace clustering based on Gaussian kernels to project the original data. Specifically, we acquire the global kernel similarity between the original high-dimensional feature space and the mapped subspace; decompose the sample matrix of the nonlinear mapping to obtain the feature matrix and the coefficient matrix; and constrain them with the manifold structure obtained by adaptive learning to obtain the features in the high-dimensional nonlinear space. Importantly, we present a unified framework in which graph learning and matrix factorization are performed simultaneously. The learned graph is optimized by combining the kernel matrix and the coefficient matrix in joint alternating iterations, so that the global structural information of the similarity matrix and the local topology of the coefficient representation can be used at the same time. During the learning procedure, the factorization matrix and the similarity matrix negotiate with each other to find the optimal subspace that maintains the original structural information to the greatest extent. Moreover, an efficient iterative solution is derived to optimize our problem, and the convergence of the solution is also demonstrated. The major contributions are summarized as follows:
(1) We perform learning with an optimal graph that most closely approximates the initial kernel matrix and attempt to preserve the samples' similarity. This adaptive strategy can better accomplish manifold-structure learning.
(2) Both the similarity matrix of the graph and the decomposition matrices of the high-dimensional nonlinear mapping features of the original data can be learned in the proposed model. All variables are reciprocally updated in an alternating iterative optimization algorithm, and we simultaneously obtain similarity information and a valid feature representation.
(3) Our method takes nonlinear mapping into consideration, meaning it is more capable of handling both linear and nonlinear data. Instead of using a previously constructed and fixed graph-regularization term, the adaptively learned similarity preserves the ideal local geometric structure for feature representation. Moreover, the kernel matrix itself contains the global similarity information of the data points; hence, it is feasible to conserve the overall relations by learning a graph close to the kernel.
(4) Comprehensive experiments were conducted on both synthetic and real-world datasets to exhibit the effectiveness of the proposed algorithm and demonstrate its superiority.
AKGNMF has potential application value in real-world scenarios such as face recognition, speech recognition and biomedical engineering. From the perspective of pattern recognition, NMF is essentially a method of dimensionality reduction and feature extraction. For the feature extraction of faces or voices, the aim is a matrix-factorization method with sparser decomposition results, more obvious local features, less redundancy between data and a faster decomposition speed. Such real-world data are often nonlinear, and AKGNMF can capture the structural information of the data in a high-dimensional space and has the potential to obtain more effective features with higher precision. In addition, for managing complex data in biomedicine and chemical engineering, AKGNMF can provide efficient and fast preprocessing for data analysis. As the decomposition produced by NMF contains no negative values, combining the structural information of adaptive learning to analyze gene (DNA) molecular sequences can make the analysis results more reliable.
The rest of this paper is organized as follows: In Section 2, we briefly introduce NMF, GNMF and similarity-preserving clustering for graph learning. In Section 3, we propose the AKGNMF model framework and algorithm and discuss their solutions. Section 4 presents the comparison and analysis of the experimental results of our method and other nonnegative-factorization clustering methods on seven datasets. Finally, conclusions are provided in Section 5.

Related Work
In this work, all vectors are denoted by boldface lowercase letters and all matrices by boldface uppercase letters. The important notation used in the following is summarized in Table 1.

Graph-Regularized Nonnegative Matrix Factorization
NMF is a matrix-decomposition method under the constraint that all elements of the matrices are nonnegative. The main idea is that, for any given nonnegative matrix X, the NMF algorithm finds two nonnegative matrices W and H, thereby decomposing the nonnegative matrix into left and right nonnegative factors. Let X = [x_1, ..., x_n] be a matrix with column vectors x_i ∈ R^m; the NMF problem can then be formulated as follows:

X ≈ WH, s.t. W ≥ 0, H ≥ 0, (1)

where W = (w_1, ..., w_k) ∈ R^{m×k} and H = (h_1, ..., h_n) ∈ R^{k×n} are two nonnegative matrices, and k is a prespecified rank parameter. The column vectors of W are called the basis vectors, and the column vectors of H are called the encoding vectors. Two mechanisms have been proposed to estimate the quality of the NMF in Equation (1). The former is based on the Euclidean distance, and the latter on the divergence distance [33]. In this paper, we focus on the former, and the corresponding objective function can be formulated as follows:

min_{W≥0, H≥0} ||X − WH||_F^2, (2)

where ||·||_F denotes the Frobenius norm of a matrix. When performed in the Euclidean space, NMF-based methods are, in common cases, inappropriate for revealing the intrinsic geometric structure of the data space. Cai et al. [22] proposed a graph-regularized nonnegative matrix-factorization algorithm that, in addition to learning a parts-based representation, also incorporates a geometry-based regularizer. Thus, the intrinsic geometrical and discriminating structures of the data space can be discovered. GNMF is effective at solving clustering problems, since the intrinsic geometrical structure is revealed by incorporating a Laplacian regularization term.
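The Euclidean-distance objective above is classically minimized with the Lee-Seung multiplicative updates; a minimal NumPy sketch follows (the shapes W ∈ R^{m×k}, H ∈ R^{k×n}, the random initialization and the iteration count are illustrative choices, not the paper's implementation):

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - WH||_F^2 s.t. W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))  # basis vectors (columns of W)
    H = rng.random((k, n))  # encoding vectors (columns of H)
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Usage: factorize a random nonnegative matrix and measure the relative fit.
X = np.random.default_rng(1).random((20, 30))
W, H = nmf_multiplicative(X, k=5)
residual = np.linalg.norm(X - W @ H, "fro") / np.linalg.norm(X, "fro")
```

Because each step multiplies the current factor by a ratio of nonnegative terms, the updates never introduce negative entries, which is why the nonnegativity constraints need no explicit projection.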
The GNMF objective function based on the Euclidean distance is minimized as follows:

min_{W≥0, H≥0} ||X − WH||_F^2 + β Tr(HLH^T), (3)

where β > 0 is the regularization parameter, Tr(·) denotes the trace of a matrix and L is the graph Laplacian, which satisfies L = D − A, where A is the graph affinity (weight) matrix and D is a diagonal matrix whose entries are the column (or row, since A is symmetric) sums of A.
The aim of GNMF is to find a parts-based representation space in which two data points are sufficiently close to each other when they are connected in the graph. The geometric information is encoded by constructing a nearest-neighbor graph; however, graph-based methods can easily be affected by the input affinity matrix and the use of Laplacian graphs. Namely, these methods are affected by several elements, such as the neighborhood size, the choice of weighting metric, etc.
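The predefined nearest-neighbor graph that GNMF relies on can be built as follows; a minimal pure-NumPy sketch with binary (0/1) weights and Euclidean distances (heat-kernel or cosine weightings are equally common choices):

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Symmetrized k-NN affinity (0/1 weights) and unnormalized Laplacian L = D - A.
    Rows of X are samples."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-neighbors
    A = np.zeros((n, n))
    idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest neighbors per row
    A[np.arange(n)[:, None], idx] = 1.0
    A = np.maximum(A, A.T)                               # symmetrize the affinity
    L = np.diag(A.sum(axis=1)) - A                       # unnormalized Laplacian
    return L, A

X = np.random.default_rng(0).random((30, 4))
L, A = knn_laplacian(X, k=5)
```

The sensitivity mentioned above is visible here: changing `k` or the weighting rule changes A, and hence the Laplacian regularizer, before factorization even begins.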

Graph Learning
Recently, graph learning methods, including adaptive local structure learning and adaptive global structure learning techniques, have been proposed to obtain the structural information of the data automatically.To preserve the local manifold structure, adaptive-neighbor-based methods [34,35] have been proposed to obtain an optimal graph of input data in several machine learning tasks.
The self-expression method is a global similarity-learning approach in graph learning [36]. It assumes that each data point can be expressed as a linear combination of the other data points of the dataset. If data points x_i and x_j are similar to each other, the corresponding weight z_ij (and z_ji) is assigned a larger value. Therefore, Z can be regarded as a similarity matrix, and the self-expression-based graph-learning problem can be formulated as follows:

min_Z ||X − XZ||_F^2 + µ ρ(Z), (4)

where µ > 0 is a trade-off parameter and ρ(Z) is a regularizer on Z. Two commonly used assumptions about Z are the low-rank and sparse assumptions, under which the learned Z can reveal the low-dimensional structure of the data and is also robust to the data scale. Equation (4) can identify the neighbors automatically during the optimization process and utilizes all data points to capture the global structural information. In this way, the pairwise similarity information hidden in the data is explored [34], and the graph similarity matrix can be obtained automatically from the data.
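For instance, with the Frobenius-norm regularizer ρ(Z) = ||Z||_F^2 (one admissible choice among the sparse and low-rank options named above), the self-expression problem has a closed-form solution; a small sketch, where columns of X are samples:

```python
import numpy as np

def self_expression_graph(X, mu=1.0):
    """Solve min_Z ||X - XZ||_F^2 + mu * ||Z||_F^2 in closed form:
    Z = (X^T X + mu I)^(-1) X^T X, then symmetrize into a similarity matrix."""
    G = X.T @ X                                    # n x n Gram matrix of samples
    Z = np.linalg.solve(G + mu * np.eye(G.shape[0]), G)
    return 0.5 * (np.abs(Z) + np.abs(Z.T))         # symmetric nonnegative similarity

X = np.random.default_rng(0).random((5, 40))       # 5 features, 40 samples (columns)
S = self_expression_graph(X, mu=1.0)
```

Because every sample is reconstructed from all the others, the resulting similarity captures global structure, in contrast to the purely local k-nearest-neighbor graph.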
The graph learning techniques lead to a better structural representation of data relationships than the traditional method in many tasks of machine learning [8].However, most existing methods are linear models for original data, which ignore the nonlinear hidden structures in data.

Adaptive Kernel Graph Nonnegative Matrix Factorization
In this section, we propose a nonlinear adaptive graph-regularized NMF algorithm that jointly performs nonnegative matrix factorization and graph-similarity learning in kernel spaces. The nonlinear relationships between input samples are also explored, which improves clustering performance.

Kernel Nonnegative Matrix Factorization Review
In order to handle nonlinear data, we consider the basic idea of kernel subspace clustering. The data can then be mapped by a nonlinear transformation into a higher, D-dimensional space by performing the kernel trick. We assume that Φ(x_i) represents the image of x_i in the kernel space, and let K ∈ R^{n×n} be a kernel matrix whose elements are computed as follows:

K_ij = ker(x_i, x_j) = Φ(x_i)^T Φ(x_j), (5)

where ker : R^m × R^m → R is the kernel function and Φ is the corresponding nonlinear mapping. The nonlinear NMF problem aims to find two nonnegative matrices, W and H, whose product approximates the mapping of the original matrix:

Φ(X) ≈ WH, (6)

where W is the basis in the feature space and H is the clustering matrix. As Φ yields a representation in a high-dimensional space, it is unreasonable to decompose Φ(X) directly [37,38]. According to [37], we express W as a linear combination of the transformed input data points to solve this problem. Thus, we assume that W lies in the column space of Φ(X):

W = Φ(X)F. (7)

With this substitution, Equation (2) can be interpreted as a simple conversion to the new basis, and the minimization problem can be generalized as follows:

min_{F≥0, H≥0} ||Φ(X) − Φ(X)FH||_F^2. (8)
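A sketch of this kernel NMF: the Gaussian kernel matrix is formed once, and F and H are then updated multiplicatively so that Φ(X) ≈ Φ(X)FH without ever materializing Φ. The update rules below are derived from the gradients of the kernelized objective in the spirit of [37]; treat them as an illustration rather than the paper's exact code:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2); columns of X are samples."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / sigma**2)

def kernel_nmf(K, k, n_iter=300, eps=1e-10, seed=0):
    """Multiplicative updates for min ||Phi(X) - Phi(X) F H||_F^2 over F, H >= 0,
    written entirely in terms of the kernel matrix K = Phi(X)^T Phi(X)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    F = rng.random((n, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (F.T @ K) / (F.T @ K @ F @ H + eps)
        F *= (K @ H.T) / (K @ F @ (H @ H.T) + eps)
    return F, H

X = np.random.default_rng(2).random((4, 50))   # 4 features, 50 samples
K = gaussian_kernel(X, sigma=1.0)
F, H = kernel_nmf(K, k=3)
```

Cluster labels can then be read off as `np.argmax(H, axis=0)`; only K enters the iteration, so the nonlinear map never has to be computed explicitly.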

Adaptive Kernel Graph Nonnegative Matrix Factorization
To exploit the geometric structure of the data in the nonlinear feature space, a kernel-graph-regularization term is integrated into the KNMF method. As mentioned previously, KNMF aims to identify the best approximating basis vectors applied to the data, Φ(X) ≈ WH. Let h_i = [h_{i1}, ..., h_{ik}]^T denote the i-th column of H, i.e., h_i represents the i-th data point with respect to the basis W = Φ(X)F. According to the local-invariance assumption underlying graph-regularization terms [39-41], if two data points Φ(x_i) and Φ(x_j) have significant similarity in the original geometry, then the low-dimensional representations h_i and h_j should retain the same relationship. This can be measured as follows:

(1/2) ∑_{i,j} ||h_i − h_j||^2 S̄_ij (9)
= Tr(H D_S H^T) − Tr(H S̄ H^T) = Tr(H L_S H^T), (10)

where the Laplacian matrix L_S = D_S − S̄, S̄ = (S + S^T)/2 is the symmetrized similarity matrix and D_S is a diagonal matrix whose elements are the column sums of S̄. We can then obtain the following KNMF model with graph regularization:

min_{F≥0, H≥0} ||Φ(X) − Φ(X)FH||_F^2 + β Tr(H L_S H^T). (11)

The fixed similarity matrix in Equation (11) is predefined from the original input data, which may be sub-optimal for the embedded representation H. We also note that the representation matrix H is computed from the nonlinear feature space of the input data, but the graph is obtained from the original input-data space. To exploit the nonlinear graph information of the input data in kernel spaces, we introduce the self-expression-based global graph-learning term into Equation (11) in the kernel space. Our objective function is then formulated as follows:

min_{F,H,S≥0} ||Φ(X) − Φ(X)FH||_F^2 + β Tr(H L_S H^T) + γ ||Φ(X) − Φ(X)S||_F^2 + µ ||S||_F^2, (12)

where γ is a trade-off parameter. In this way, we unify graph-similarity learning and NMF in the nonlinear space. We note that the graph similarity in Equation (12) considers both the nonlinear mapping feature Φ(X) and the embedded representation H, which leads to a much more flexible regularization of the first error term. By substituting the quadratic terms with the kernel matrix, Equation (12) can be further transformed into the following formulation:

min_{F,H,S≥0} Tr(K − 2KFH + H^T F^T KFH) + β Tr(H L_S H^T) + γ Tr(K − 2KS + S^T KS) + µ ||S||_F^2, (13)

where Tr(·) is the trace operator and K is the kernel matrix of the dataset X. The graph similarity matrix S can thus be optimized jointly with the matrix factorization. The linear relations among the data in the high-dimensional space are recovered through this model, which is equivalent to discovering the nonlinear characteristics of the original data. Considering that the kernel matrix K itself contains the similarity information of the data points, the graph S is expected to be close to K [8]; i.e., the learning of the graph S will benefit from the preservation of the manifold geometric structure in the kernel space. Mathematically, we can optimize the following objective function:

min_S ||S − K||_F^2. (14)

By introducing a coefficient θ > 1, we combine Equations (13) and (14); formally, our proposed objective function is as follows:

min_{F,H,S≥0} Tr(K − 2KFH + H^T F^T KFH) + β Tr(H L_S H^T) + γ Tr(K − 2KS + S^T KS) + θ ||S − K||_F^2 + µ ||S||_F^2, (15)

where β, γ, θ and µ are the parameters that balance the representation structure, the global data structure, the kernel similarity and the regularization, respectively. We refer to the model in Equation (15) as AKGNMF. Since the representation matrix H and the similarity matrix S are used for extracting features and capturing the data structure, respectively, the proposed approach performs matrix factorization and graph-structure learning simultaneously. In the next section, we propose a novel algorithm to solve Equation (15), optimizing its objective function with alternating update rules.

Optimization
Solving Equation (15) by providing every variable with an optimized solution at once is challenging, since all the variables in the loss function are coupled. Here, we develop an alternating iterative algorithm to solve Equation (15) efficiently.

Update H and F
With S fixed, Equation (15) becomes

min_{F≥0, H≥0} Tr(K − 2KFH + H^T F^T KFH) + β Tr(H L_S H^T). (16)

Although the optimization problem of Equation (16) is convex in H alone or in F alone, it is not convex in both variables together, which means that the algorithm can only converge to a local minimum. To solve Equation (16), a two-step iterative strategy can be adopted to alternately optimize F and H. Here, the kernel matrix K ∈ R^{n×n} is defined as K ≡ Φ(X)^T Φ(X) [42]. Ψ = [ψ_ij] is defined as the Lagrange multiplier matrix for the constraint H ≥ 0, which gives the KKT condition [43] ψ_ij h_ij = 0; the Lagrange multiplier matrix for the constraint F ≥ 0 is defined in the same way. By repeatedly fixing the matrices F and H alternately and applying the same derivation, the multiplicative update rules for H and F are obtained as follows:

H ← H ⊙ (F^T K + β H S̄) ⊘ (F^T KFH + β H D_S), (17)
F ← F ⊙ (K H^T) ⊘ (KFH H^T), (18)

where ⊙ and ⊘ denote element-wise multiplication and division. When H and F are fixed, Equation (15) reduces to the following S-subproblem:

min_{S≥0} (β/2) ∑_{i,j} ||h_i − h_j||^2 S_ij + γ Tr(S^T KS − 2KS) + θ ||S − K||_F^2 + µ ||S||_F^2. (19)

Collecting the distances (d_F)_{ij} = ||h_i − h_j||^2 into d_F, Equation (19) can be reformulated column-wise as Equation (20), and the closed-form solution of each column is given in Equation (21). We summarize the detailed updating procedure of AKGNMF in Algorithm 1.
Algorithm 1: AKGNMF.
Output: H, F, S.
Calculate the kernel matrix K.
repeat
  Update H by Equation (17).
  Update F by Equation (18).
  For each i, update the i-th column of S according to Equation (21).
until the stopping criterion is met.
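The S step admits a closed-form, column-wise solution; the complexity analysis later in the paper names the inverse (γK + µI)^{-1}. Since Equation (21) itself is not reproduced in this text, the right-hand-side matrix B (combining K, θ and the representation distances d_F) is a placeholder assumption in the following hedged sketch:

```python
import numpy as np

def update_S(K, B, gamma=1.0, mu=1.0):
    """Hypothetical column-wise closed form s_i = (gamma*K + mu*I)^(-1) b_i.
    B stacks the right-hand-side columns b_i; their exact construction from
    K, theta and the distances d_F (Equation (21)) is assumed, not quoted."""
    n = K.shape[0]
    M = gamma * K + mu * np.eye(n)     # positive definite for K PSD and mu > 0
    S = np.linalg.solve(M, B)          # one factorization solves all n columns
    S = np.maximum(S, 0.0)             # keep the similarity nonnegative
    return 0.5 * (S + S.T)             # symmetrize, as in S_bar = (S + S^T)/2

rng = np.random.default_rng(0)
A = rng.random((10, 10))
K = A @ A.T + np.eye(10)               # a PSD stand-in for a kernel matrix
S = update_S(K, rng.random((10, 10)))
```

The design point the sketch illustrates is that all n columns share the same system matrix, so a single O(n^3) factorization serves the whole S update.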

Convergence Analysis
Here, we investigate the convergence of the proposed algorithm to a feasible solution and conclude with the following theorem:

Theorem 1. For H ≥ 0, F ≥ 0, the objective in Equation (16) is non-increasing under the updating rules in Equations (17), (18) and (21); hence, it converges.
A detailed proof of the above theorem is provided in Appendix A. The proof follows the arguments in Lee's [21] and Cai's [22] papers for NMF and GNMF.

Complexity
The updates of H, F and S dominate the computational cost of Algorithm 1. Updating H has complexity O(kn^2); updating F has the same complexity, since both updates are dominated by matrix products costing O(kn^2). To update S, the matrix-inverse operation (i.e., (γK + µI)^{-1}) has complexity O(n^3). Therefore, the overall run-time complexity is O(t(kn^2 + n^3)) for clustering n data points into k clusters, where t is the number of iterations.

Experiment
In this section, the performance of the proposed AKGNMF algorithm in clustering tasks on both synthetic and real-world datasets is presented and compared with that of classical approaches.

Datasets and the Evaluation Metrics
We conducted our experiments on seven datasets, including UCI, corpus and face datasets. The UCI datasets were Soybean [44], Dermatology [45], Glass [46] and Vehicle [47]. The corpus dataset was the NIST Topic Detection and Tracking (TDT2) corpus [48]. YALE [49] and JAFFE [50] are face databases, in which the images of the same person correspond to the same cluster. Factors such as capture time, illumination conditions and the presence or absence of glasses lead to the various facial expressions and configurations illustrated in the images. The YALE face database has 165 grayscale images of 15 individuals, and the JAFFE face database contains 213 images of 7 facial expressions posed by 10 Japanese females. The specification of these datasets, including the numbers of instances, features and clusters, is listed in Table 2. Clustering tasks were used to verify the performance of our proposed method, and its effectiveness was quantitatively evaluated using the following three widely used metrics: accuracy (ACC), normalized mutual information (NMI) and purity.
Accuracy is the percentage of data points that are correctly clustered with respect to the external ground-truth labels. For each data point x_i, let g_i and c_i be the clustering result and the ground-truth cluster label, respectively. The ACC is then defined as follows:

ACC = (∑_{i=1}^n δ(c_i, f(g_i))) / n,

where n is the overall number of data points and f(·) is the best permutation mapping function, based on the Kuhn-Munkres algorithm, that maps each cluster index to a true class label. The Kronecker delta function δ is defined as follows:

δ(x, y) = 1 if x = y, and 0 otherwise.

The NMI is intended to assess the quality of clustering. Let L and L̂ be two sets of clusters; the marginal probability distribution functions p(l) and p(l̂) can be induced from the joint distribution p(l, l̂), and H(·) is the entropy function. The NMI is then defined as the mutual information between L and L̂ normalized by their entropies.
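The optimal permutation f(·) in the ACC definition can be computed with the Kuhn-Munkres (Hungarian) algorithm; a compact sketch using SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of points whose predicted cluster maps to the right class
    under the best one-to-one label permutation (Kuhn-Munkres)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    # contingency[p, t] = number of points with predicted label p, true label t
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(-contingency)  # maximize matched points
    return contingency[rows, cols].sum() / len(y_true)
```

For example, the prediction [1, 1, 0, 0] against ground truth [0, 0, 1, 1] scores 1.0, since the two labelings differ only by a permutation of cluster indices.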
Purity measures the prevalence of the most common category in each cluster. The purity of a clustering is a weighted sum of the per-cluster purity values, defined as follows:

Purity = ∑_{i=1}^{c} (n_i / n) max_j (n_i^j / n_i),

where n_i is the number of points in cluster C_i, n_i^j is the number of points in cluster C_i that belong to the j-th ground-truth class and c is the number of clusters.
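Equivalently, purity can be computed by counting the majority ground-truth class within each predicted cluster; a short sketch:

```python
import numpy as np

def purity(y_true, y_pred):
    """Purity: weighted sum over clusters of the largest true-class fraction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    majority = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]            # true labels inside cluster c
        majority += np.bincount(members).max()   # size of the dominant class
    return majority / len(y_true)
```

A prediction that merges two true classes into one cluster is penalized accordingly: `purity([0, 0, 1, 1], [0, 0, 0, 0])` is 0.5.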

Comparison Methods
To investigate the performance of our clustering method, we compared it with eight recent clustering approaches with significant performance. In general, these methods can be classified into direct clustering approaches and (graph/kernel) nonnegative-matrix-factorization-based clustering methods.
• K-means [51]. The most famous and commonly used clustering algorithm, based on the Euclidean distance and widely adopted because of its simplicity and efficiency.

• Nonnegative matrix factorization (NMF) [21]. A classical multivariate analysis method; extra constraints, such as locality, can be incorporated to improve the decomposition, identify better local features or provide a sparser representation.
• Graph-regularized nonnegative matrix factorization (GNMF) [22]. An affinity graph is constructed to encode the geometric information, providing greater discriminating power than the standard NMF algorithm.
• Kernel-based nonnegative spectral clustering methods KNSC-Ncut and KNSC-Rcut [23]. The kernel matrix under the kernel-based NMF multiplicative update rules corresponds to the nonlinear graph affinity matrix in Ncut and Rcut spectral clustering.
• KOGNMF [23]. The nonlinear graph-regularized KNMF, in which the graph adjacency matrix is predefined from the original data.
• Clustering with adaptive neighbors (CAN) [34]. An adaptive-neighbor graph-learning method that focuses on local similarity.
• Clustering with similarity preserving (SPC) [8]. A single-kernel, similarity-preserving clustering method.
• AKGNMF. Our proposed nonnegative matrix-factorization method explores the graph structure in the nonlinear feature space, and the similarity matrix is automatically learned from the nonlinearly mapped data, jointly with the matrix decomposition.
We further present the computational complexity of other competing methods, as shown in Table 3.

For a fair comparison, we meticulously tuned the key parameters of the compared methods, using grid search to obtain the best-performing parameters for each algorithm.

Clustering Results on Synthetic Data
To verify the performance of our method more intuitively, we visualize the clustering results on two synthetic datasets in Figures 1 and 2. The test dataset in Figure 1 was generated with 200 points distributed in the pattern of two moons, with the points of each moon forming one cluster. In Figure 2, a 4-cluster dataset with 200 points was generated. As Figures 1 and 2 show, AKGNMF handles the nonlinear data well; notably, it separates the nonlinear clusters with higher clustering accuracy than the other methods.
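A two-moons set like that of Figure 1 can be generated as follows (the exact generator and noise level used in the paper are unknown, so these are illustrative choices):

```python
import numpy as np

def two_moons(n=200, noise=0.05, seed=0):
    """Two interleaving half-circles ('moons') with Gaussian jitter."""
    rng = np.random.default_rng(seed)
    t = np.pi * rng.random(n // 2)
    upper = np.column_stack([np.cos(t), np.sin(t)])              # upper half-circle
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # shifted, flipped
    X = np.vstack([upper, lower]) + noise * rng.standard_normal((2 * (n // 2), 2))
    y = np.repeat([0, 1], n // 2)                                # one cluster per moon
    return X, y

X, y = two_moons()
```

The two clusters are not linearly separable, which is exactly why linear factorizations struggle here while kernel-graph methods do not.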

Clustering Results on Real Data
For each of the compared methods, we followed the recommended parameter range in the original paper and used the optimal parameter group.We present the best performance and the mean value after 20 independent runs.
Figure 3 compares the accuracy of the kernel-based clustering methods on all datasets. AKGNMF achieves the best accuracy on most datasets, showing its advantage in capturing nonlinear manifold structures. Table 4 records the results of all methods on all datasets using the evaluation metrics of accuracy, NMI and purity. The best results for every dataset are highlighted in boldface, and the average performances are shown in parentheses. AKGNMF outperformed the other methods in most cases, as follows.
(1) In all the experiments except that with the JAFFE dataset, AKGNMF performed better than the other NMF-based and graph-based clustering approaches. On the JAFFE dataset, AKGNMF also achieved competitive clustering accuracy. (2) Compared with NMF and GNMF, the accuracy of AKGNMF on the Glass dataset increased by 39.72% and 15.42%, respectively; on the TDT2 dataset, accuracy improved by 45.48% and 28.94%. This demonstrates the ability of graph learning to adaptively capture structural information.
(3) With respect to k-means and the recently proposed kernel-based nonnegative spectral clustering methods KNSC-Ncut and KNSC-Rcut, the improvement is promising: against these three methods, AKGNMF achieved the highest accuracy on all datasets. (4) Instead of directly constructing a linear graph adjacency matrix as in KOGNMF, AKGNMF obtains the optimal similarity matrix in the same nonlinear feature space as the matrix factorization. A better graph structure boosts the data-representation performance of KNMF, which leads to the better clustering performance of the AKGNMF method. For example, compared with KOGNMF, the best accuracy of AKGNMF on the TDT2, Glass and YALE datasets improved by 11.02%, 7% and 6.06%, respectively. (5) In terms of similarity preservation, CAN mainly focuses on local similarity, which may ignore global similarity and lead to suboptimal results. The global structural information obtained by AKGNMF from high-dimensional mappings is more advantageous on most datasets. Compared with SPC, we learn the nonlinear graph structure combined with the inherent latent features of NMF and consider both the kernelized input data and the factorized representation, thereby achieving better clustering performance. As the results show, the best performance of AKGNMF on the Dermatology dataset improved by 16.94%, 18.19% and 16.40% in terms of the accuracy, NMI and purity metrics, respectively, and the average performance improved by 11.62%, 14.59% and 11.8%, respectively.

Parameter Analysis
The multiplicative rules of the AKGNMF algorithm involve the following five parameters: β, γ, µ, θ and the width σ of the Gaussian kernel. We used the Gaussian kernel defined as K(x_i, x_j) = exp(−||x_i − x_j||^2/σ^2), where σ is the kernel width. To choose an appropriate value of σ, a grid search was performed over 40 values of σ in the range [0.1, 4] with a step size of ∆σ = 0.1 for the Dermatology, Glass, Soybean, JAFFE and YALE datasets; for the Vehicle and TDT2 datasets, the search range was σ ∈ [10, 100] with ∆σ = 10. For the trade-off parameter γ, we likewise used a grid search in the range [0.001, 100]. The parameters mainly analyzed and discussed here are θ, β and µ. As mentioned previously, θ represents the similarity-preserving capability, β corresponds to the graph regularization and µ is the trade-off parameter for the regularization term on S. Taking the JAFFE dataset as an example, Figure 4 demonstrates the sensitivity of our model to these parameters; the model works well over a relatively wide range of values.

Convergence Study
The proposed updating rules for minimizing the objective function of AKGNMF are essentially iterative. We performed a theoretical analysis to prove the convergence of the proposed optimization algorithm; in this subsection, we investigate some examples to further verify it empirically.
Figure 5 shows the convergence curves of AKGNMF on all datasets.For each figure, the y-axis is the value of the objective function, and the x-axis denotes the iteration number.It can be observed that the objective function indeed decreases its value and the objective value sequences tend to converge within about 100 iterations on most datasets, which verifies the convergence and effectiveness of the proposed AKGNMF method.

Conclusions
In this paper, we proposed a novel kernel-graph-regularized nonlinear nonnegative matrix-factorization method, termed adaptive kernel graph nonnegative matrix factorization (AKGNMF). We formulated a novel framework to jointly learn an optimal graph similarity matrix and perform nonnegative matrix factorization in the kernel space. The learning process effectively helps to discover the nonlinear characteristics of the input data. Moreover, an efficient iterative algorithm to solve the problem was developed. Extensive experiments were conducted on seven benchmark datasets, and the results demonstrate the superior performance of AKGNMF compared with state-of-the-art methods.
There are many research issues worth exploring in future work. For example, considering the high computational complexity of graph-learning operations, it is worth trying to further enhance efficiency. In addition, we will consider combining graph-learning-based NMF with deep neural networks for improved performance in the nonlinear representation of data.
Appendix A

The objective function of AKGNMF can be rewritten as Equation (16), keeping only the terms related to H and F (Equation (A5)). Since the multiplicative update rules are essentially element-wise, it is sufficient to show that each B_ab is non-increasing under the update step given in Equation (17). For an entry h_ab of H, we use B_ab to denote the part of the objective relevant to h_ab. Then we obtain

B'_ab = (2F^T KFH − 2F^T K + 2βH L_S)_ab, (A4)
B''_ab = 2(F^T KF)_aa + 2β(L_S)_bb.

The function A(h, h^(t)_ab) in Equation (A7) is an auxiliary function for B_ab, the part of Equation (15) that is only relevant to h_ab. Proof. Obviously, A(h, h) = B_ab(h); thus, we only need to show that A(h, h^(t)_ab) ≥ B_ab(h). To this end, we compare the auxiliary function given in Equation (A7) with the Taylor expansion of B_ab(h). Equation (A10) then holds, and A(h, h^(t)_ab) ≥ B_ab(h). From Lemma 2, A(h, h^(t)_ab) is an auxiliary function of B_ab(h_ab), which establishes the convergence stated in Theorem 1.

Table 1. Summary of the important notation.

K ∈ R^{n×n}   the kernel matrix
X ∈ R^{m×n}   the input data matrix
L ∈ R^{n×n}   the graph Laplacian matrix
Φ(·)          the nonlinear mapping function into R^D
W ∈ R^{m×k}   the basis matrix in the input space
H ∈ R^{k×n}   the representation (encoding) matrix

Figure 3. The clustering accuracy of kernel methods for the independent number of iterations on 7 datasets.

Figure 4. The influences of parameters on the JAFFE dataset.

Table 2. Description of the datasets.

Table 3. Comparison of computational complexity.

Table 4. Clustering results measured on benchmark datasets.