ℓ2,1 Norm and Hessian Regularized Non-Negative Matrix Factorization with Discriminability for Data Representation

Abstract: Matrix factorization based methods have been widely used in data representation. Among them, Non-negative Matrix Factorization (NMF) is a promising technique owing to its psychological and physiological interpretation of naturally occurring data. On one hand, although traditional Laplacian regularization can enhance the performance of NMF, it still suffers from weak extrapolating ability. On the other hand, standard NMF disregards the discriminative information hidden in the data and cannot guarantee the sparsity of the factor matrices. In this paper, a novel algorithm called ℓ2,1 norm and Hessian Regularized Non-negative Matrix Factorization with Discriminability (ℓ2,1HNMFD) is developed to overcome these problems. In ℓ2,1HNMFD, Hessian regularization is introduced into the framework of NMF to capture the intrinsic manifold structure of the data. ℓ2,1 norm constraints and approximate orthogonality constraints are added to ensure the group sparsity of the encoding matrix and to characterize the discriminative information of the data simultaneously. An efficient optimization scheme is developed to solve the objective function. Our experimental results on five benchmark data sets demonstrate that ℓ2,1HNMFD can learn a better data representation and provide better clustering results.


Introduction
In many real-world applications, the input data is usually high-dimensional. On one hand, this poses a serious challenge for storage and computation. On the other hand, it makes many machine learning algorithms unworkable due to the curse of dimensionality [1]. Obtaining a concise and informative representation of high-dimensional data has therefore become a highly significant focus. Matrix factorization is a popular and effective model for data representation: it finds two or more low-rank matrix factors whose product well approximates the data matrix. Various matrix factorization methods have been proposed, adopting different constraints on the matrix factors. Classical matrix factorization models include Principal Component Analysis (PCA) [2], Singular Value Decomposition (SVD), QR decomposition and vector quantization.
Among the various matrix factorization approaches, Non-negative Matrix Factorization (NMF) [3] is a promising one. In NMF, the data matrix X is decomposed into a non-negative basis matrix U, which reveals the latent semantic structure, and a non-negative encoding matrix V, which provides a new representation with respect to the basis matrix. Because of the non-negativity constraints, NMF only allows purely additive combinations, and thus leads to a parts-based representation. Due to its psychological and physiological interpretation, NMF and its variants have been widely used in computer vision [3], pattern recognition [4], image processing [5] and document analysis [6].
Standard NMF performs the factorization in Euclidean space, so it is unable to discover geometrical structure in the data space, which is critical in real-world applications. Therefore, much recent work has focused on preserving the intrinsic geometry of the data space by adding different constraints to the objective function of NMF. Cai et al. [7] proposed graph regularized NMF (GNMF), which constructs a nearest neighbor graph to preserve the local geometrical information of the data space. Lu et al. [8] proposed Manifold Regularized Sparse NMF for hyperspectral unmixing, in which manifold regularization was introduced into sparsity-constrained NMF. Gu et al. [9] proposed Neighborhood-Preserving Non-negative Matrix Factorization, which imposes the additional constraint on NMF that each data point can be represented as a linear combination of its neighbors. All of these graph regularized NMF methods construct a graph to encode the geometrical information and use the graph Laplacian as a smoothing operator. Despite the successful application of the graph Laplacian in semi-supervised and unsupervised learning, it still suffers from the problems that the solution is biased towards a constant and that it lacks extrapolating power [10].
Sparsity regularization methods, which focus on selecting the input variables that best describe the output, have been widely investigated. Hoyer [11] proposed a sparsity-constrained NMF that adds an ℓ1 norm constraint on the basis and encoding matrices, which is able to discover sparser representations than those given by standard NMF. Cai et al. [12] proposed Unified Sparse Subspace Learning (USSL) for learning sparse projections using an ℓ1 norm regularizer. The limitation of the ℓ1 norm penalty is that it cannot guarantee successful models in the case of categorical predictors, because each dummy variable is selected independently [13]; thus the ℓ1 norm is not well suited to feature selection. To settle this issue, Nie et al. [14] proposed a robust feature selection approach by imposing the ℓ2,1 norm on the loss function. Yang et al. [15] proposed ℓ2,1 norm regularized discriminative feature selection for unsupervised learning. Gu et al. [16] combined feature selection and subspace learning in a joint framework, which imposes the ℓ2,1 norm on the projection matrix to achieve feature selection. The ℓ2,1 norm penalty encourages row sparsity while accounting for the correlations of all the features. Recently, some researchers have proposed ℓ1/2 norm [17] regularized NMF [18,19] and low-rank regularized NMF [20,21] with improved performance for special purposes. The ℓ1/2 norm can usually induce sparser solutions than its ℓ1 counterpart, but it is usually unstable, while the limitation of the low-rank constraint is that it is not suited to feature selection in general.
Moreover, discriminative information is very important for learning a better representation. For example, by exploiting partial label information as hard constraints of NMF, Liu [22] developed a semi-supervised Constrained NMF (CNMF), which obtains better discriminating power. Li et al. [23] proposed robust structured NMF, a semi-supervised NMF learning algorithm, which learns a robust discriminative data representation by pursuing a block-diagonal structure and an ℓ2,p norm loss function. However, in the unsupervised scenario, label information is unavailable. In fact, approximate orthogonality constraints can provide some discriminative information under unsupervised conditions. Unfortunately, standard NMF ignores this important information.
To address these flaws, a novel NMF algorithm, called ℓ2,1 norm and Hessian Regularized Non-negative Matrix Factorization with Discriminability (ℓ2,1HNMFD), is developed in this paper. It is designed to preserve local geometrical structure, enforce row sparsity and exploit discriminative information at the same time. First, Hessian regularization is introduced into the framework of NMF to preserve the intrinsic manifold of the data. Then, ℓ2,1 norm constraints are added on the coefficient matrix to ensure that the representation vectors are row sparse. Furthermore, approximate orthogonality constraints are added to capture some discriminative information in the data. An optimization scheme is developed to solve the objective function.
The rest of the paper is organized as follows: In Section 2, we give a brief review of related works. In Section 3, we introduce our ℓ2,1HNMFD algorithm and the optimization scheme. Experimental results are presented in Section 4. Finally, we draw a conclusion and point out future work in Section 5.

Related Works
This section presents a brief review of related works. First, we describe the notation used throughout the paper.

Common Notations
In this paper, we use lowercase boldface letters and uppercase boldface letters to denote vectors and matrices, respectively. For a matrix M, we denote its (i, j)-th element by M_ij. The i-th element of a vector b is denoted by b_i. Given a set of N items, we use the matrix X ∈ R^{M×N}_+ to represent the non-negative original data matrix, where the i-th column is the feature vector of the i-th item. Throughout this paper, ||M||_F denotes the Frobenius norm of the matrix M.

NMF
NMF is an effective decomposition for multivariate non-negative data. Given a non-negative matrix X = [x_1, . . . , x_N] ∈ R^{M×N}, each column of X is a data vector. The goal of NMF is to find two low-rank non-negative matrices U ∈ R^{M×K} and V ∈ R^{K×N} that minimize the following objective function [3]:

J_NMF = ||X − UV||²_F, s.t. U ≥ 0, V ≥ 0.

When both U and V are treated as variables simultaneously, the objective function J_NMF is not convex. However, when V is fixed, J_NMF is convex in U, and vice versa. Lee and Seung [24] therefore developed the following iterative multiplicative update rules:

U_ik ← U_ik (XV^T)_ik / (UVV^T)_ik,  V_kj ← V_kj (U^T X)_kj / (U^T U V)_kj.

By constructing auxiliary functions, J_NMF is proved to be non-increasing under the above update rules [24].
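As a concrete illustration, the multiplicative update rules above can be sketched in a few lines of numpy. The function name `nmf` and the small constant `eps` (a guard against division by zero) are our own choices, not part of the original formulation.

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-10):
    """Standard NMF via the Lee-Seung multiplicative updates.
    X: (M, N) non-negative data matrix; returns U (M, K) and V (K, N)."""
    M, N = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((M, K))
    V = rng.random((K, N))
    for _ in range(n_iter):
        # U_ik <- U_ik (X V^T)_ik / (U V V^T)_ik
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        # V_kj <- V_kj (U^T X)_kj / (U^T U V)_kj
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V
```

Because each factor in the updates is non-negative, U and V stay non-negative throughout, which is what yields the parts-based representation.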

GNMF
In [7], Cai et al. developed graph regularized non-negative matrix factorization (GNMF) to obtain a compact data representation that uncovers hidden concepts while respecting the intrinsic geometric structure. GNMF minimizes the following objective function:

J_GNMF = ||X − UV||²_F + λ Tr(V L V^T), s.t. U ≥ 0, V ≥ 0,

where L = D − W is the graph Laplacian, W is the weight matrix constructed by finding the k nearest neighbors of each data point, and D is a diagonal matrix whose entries are the column sums of W, i.e., D_jj = Σ_i W_ij. The objective function J_GNMF is also not convex when both U and V are treated as variables simultaneously, so it is unlikely that the global minimum can be found. Using the following update rules [7], a local minimum of J_GNMF can be obtained:

U_ik ← U_ik (XV^T)_ik / (UVV^T)_ik,  V_kj ← V_kj (U^T X + λVW)_kj / (U^T U V + λVD)_kj.

Cai et al. [7] proved that J_GNMF is non-increasing under these update rules.

ℓ2,1 Norm and Hessian Regularized Non-Negative Matrix Factorization with Discriminability
In this section, a novel ℓ2,1 norm and Hessian Regularized Non-negative Matrix Factorization with Discriminability (ℓ2,1HNMFD) model is developed, which performs Hessian regularized Non-negative Matrix Factorization (HNMF), preserves discriminative information, and maintains row sparsity of the encoding matrix simultaneously. An alternating optimization scheme is then developed to solve its objective function.

Hessian Regularized Non-Negative Matrix Factorization
Hessian energy is motivated by the Eells energy for mappings between manifolds [25]. Given a smooth manifold M ⊂ R^n and a mapping f : M → R^r, the Eells energy of f can be written as [10]:

S_Eells(f) = ∫_M ||∇_a ∇_b f||²_{T*_x M ⊗ T*_x M} dV(x),

where ∇_a ∇_b f is the second covariant derivative of f, T_x M is the tangent space at the point x ∈ M, and dV(x) is the natural volume element. In normal coordinates x_r, the integrand can be written as:

||∇_a ∇_b f||² = Σ_{r,s} (∂²f / ∂x_r ∂x_s)²,

so at a given point x_i, the norm of the second covariant derivative is just the Frobenius norm of the Hessian of f in normal coordinates. The resulting functional is called the Hessian regularizer S_Hess(f), and it can be approximated over the sample points as:

Ŝ_Hess(f) = Σ_i Σ_{r,s} (∂²f / ∂x_r ∂x_s (x_i))².

The estimate of the Hessian at x_i is computed by fitting a second-order polynomial p(x) in normal coordinates to the values f(x_j), j = 1, . . . , k, over the k nearest neighbors of x_i. Let V_ki = f_k(x_i); the estimated Frobenius norm of the Hessian of f at x_i is then a quadratic form in the values of f on the neighborhood, with an associated matrix B^(i). The total estimated Hessian energy is the sum over all data points:

Ŝ_Hess(f) = Σ_k v_k. B v_k.^T = Tr(V B V^T),

where B, the Hessian regularization matrix, is the accumulated matrix summing up all the matrices B^(i). Applying the Hessian energy as a regularization term in NMF to capture the local manifold structure, Hessian regularized NMF (HNMF) can be formulated as:

J_HNMF = ||X − UV||²_F + λ Tr(V B V^T), s.t. U ≥ 0, V ≥ 0,

where λ is the regularization parameter.
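The construction of B can be sketched as follows. This is a simplified, hypothetical implementation of the polynomial-fitting procedure described above (uniform neighbor weights, a fixed local dimension d, and local PCA for the normal coordinates), not the authors' exact code.

```python
import numpy as np

def hessian_energy_matrix(X, k=10, d=2):
    """Estimate the N x N Hessian regularization matrix B so that
    Tr(V B V^T) approximates the total Hessian energy of the rows of V.
    X: (M, N) data matrix with points as columns; d: manifold dimension."""
    M, N = X.shape
    B = np.zeros((N, N))
    for i in range(N):
        # k nearest neighbors of x_i (including x_i itself)
        dists = np.linalg.norm(X - X[:, [i]], axis=0)
        idx = np.argsort(dists)[:k]
        Z = X[:, idx] - X[:, [i]]              # centered patch, (M, k)
        # normal coordinates via local PCA (top-d directions)
        U_loc, _, _ = np.linalg.svd(Z, full_matrices=False)
        T = U_loc[:, :d].T @ Z                 # (d, k) local coordinates
        # design matrix: constant, linear and quadratic monomials
        cols = [np.ones(k)] + [T[r] for r in range(d)]
        quad_index = []
        for r in range(d):
            for s in range(r, d):
                cols.append(T[r] * T[s])
                quad_index.append((r, s))
        Phi = np.stack(cols, axis=1)           # (k, 1 + d + d(d+1)/2)
        # least-squares operator mapping f-values on the patch to coefficients
        P = np.linalg.pinv(Phi)
        Bi = np.zeros((k, k))
        for q, (r, s) in enumerate(quad_index):
            h = P[1 + d + q]                   # row giving coefficient c_rs
            # ||H||_F^2 = sum_r (2 c_rr)^2 + 2 sum_{r<s} c_rs^2
            scale = 4.0 if r == s else 2.0
            Bi += scale * np.outer(h, h)
        B[np.ix_(idx, idx)] += Bi              # accumulate B^(i)
    return B
```

By construction each B^(i) is a sum of outer products, so B is symmetric positive semi-definite and the regularizer Tr(V B V^T) is non-negative; a function that is linear on a flat patch incurs (near-)zero energy.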

Sparseness Constraints
To distinguish the importance of different features, we encourage the significant features to take non-zero values and the insignificant features to become zero after the iterative updates. Since each row of the encoding matrix V corresponds to a feature in the original space, we add ℓ2,1 norm regularization on V, which encourages some rows of V to shrink to zero. In this way, we preserve the important features and remove the irrelevant ones. The ℓ2,1 norm of the matrix V is defined as:

||V||_{2,1} = Σ_j ||v_j.||_2,

where v_j. denotes the j-th row of V.
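The definition above is straightforward to compute; a one-line numpy helper (the name `l21_norm` is ours):

```python
import numpy as np

def l21_norm(V):
    """||V||_{2,1}: the sum of the Euclidean norms of the rows of V."""
    return np.linalg.norm(V, axis=1).sum()
```

Because a whole row contributes through its Euclidean norm, the penalty drives entire rows (i.e., features) to zero at once, which is exactly the group sparsity the text describes.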

Discriminative Constraints
To characterize some discriminative information in the learned representation matrix V, we follow the work in [26,27], in which a scaled indicator matrix was developed. Given an indicator matrix Y ∈ {0, 1}^{N×K}, where Y_ij = 1 if the i-th data point belongs to the j-th category, the scaled indicator matrix F is defined column-wise as F_.j = Y_.j / √n_j, where n_j is the number of samples in the j-th group. We encourage the new representation V to capture the discriminative information in F. Intuitively, we only need V to approximate F^T, i.e., ||V − F^T||²_F ≤ ε for some small constant ε. Unfortunately, in unsupervised scenarios we cannot obtain any label information in advance. However, the scaled indicator matrix is strictly orthogonal, i.e., F^T F = I_K, where I_K is a K × K identity matrix. Since F is orthogonal, V should be orthogonal too. This constraint is too strict, so we relax it and only require V to be approximately orthogonal, i.e., we penalize ||VV^T − I_K||²_F. Combining the Hessian regularization term, the ℓ2,1 norm term and the approximate orthogonality term yields the overall objective function of ℓ2,1HNMFD (Equation (15)):

O = ||X − UV||²_F + λ Tr(V B V^T) + γ ||V||_{2,1} + μ ||VV^T − I_K||²_F, s.t. U ≥ 0, V ≥ 0.
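The orthogonality property of the scaled indicator matrix is easy to verify numerically; a small sketch (the helper name `scaled_indicator` is ours):

```python
import numpy as np

def scaled_indicator(labels, K):
    """Scaled indicator matrix F (N x K): F_ij = 1/sqrt(n_j) if sample i
    belongs to class j, where n_j is the size of class j."""
    N = len(labels)
    F = np.zeros((N, K))
    for i, j in enumerate(labels):
        F[i, j] = 1.0
    return F / np.sqrt(F.sum(axis=0))   # divide each column by sqrt(n_j)
```

Because distinct classes never share a row and each column has n_j entries of 1/√n_j, the columns are orthonormal, i.e., F^T F = I_K, which is the motivation for the approximate orthogonality constraint on V.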

Optimization
In this section, we introduce an iterative algorithm that solves Equation (15). The objective function of ℓ2,1HNMFD is not convex in U and V jointly, so a closed-form solution cannot be obtained. In the following, we present an alternating scheme that finds a local minimum. Let ψ_ik and Φ_kj be the Lagrange multipliers for the constraints U_ik ≥ 0 and V_kj ≥ 0, respectively; the Lagrange function L can then be written as:

L = ||X − UV||²_F + λ Tr(V B V^T) + γ ||V||_{2,1} + μ ||VV^T − I_K||²_F + Tr(ΨU^T) + Tr(ΦV^T).

Updating U
The partial derivative of L with respect to U is:

∂L/∂U = −2XV^T + 2UVV^T + Ψ.

Using the Karush-Kuhn-Tucker (KKT) condition ψ_ik U_ik = 0, we get

−(XV^T)_ik U_ik + (UVV^T)_ik U_ik = 0,

which leads to the following updating formula (Equation (20)):

U_ik ← U_ik (XV^T)_ik / (UVV^T)_ik.

Updating V
The partial derivative of L with respect to V is:

∂L/∂V = −2U^T X + 2U^T U V + 2λVB + 2γRV + 4μ(VV^T V − V) + Φ,

where R is a diagonal matrix with the i-th diagonal element

R_ii = 1 / (2 ||v_i.||_2).

Using the KKT condition Φ_kj V_kj = 0, Equation (22) leads to the following updating formula (Equation (23)):

V_kj ← V_kj (U^T X + 2μV)_kj / (U^T U V + λVB + γRV + 2μVV^T V)_kj.

The whole procedure is summarized in Algorithm 1.
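A minimal numpy sketch of one V update follows, assuming the objective combines the reconstruction error with the Hessian, ℓ2,1 and orthogonality terms as above. As an implementation guard not stated in the text, we split B into its positive and negative parts so that every factor stays non-negative even when B has mixed-sign entries; `eps` is likewise our own safeguard.

```python
import numpy as np

def update_V(X, U, V, B, lam, gamma, mu, eps=1e-10):
    """One multiplicative update of V for the objective
    ||X - UV||_F^2 + lam*Tr(V B V^T) + gamma*||V||_{2,1}
                   + mu*||V V^T - I||_F^2,  with U, V >= 0."""
    Bp = np.maximum(B, 0.0)                 # positive part of B
    Bn = np.maximum(-B, 0.0)                # negative part of B
    # R: diagonal with 1/(2*||v_i.||) from the subgradient of the l2,1 norm
    R = np.diag(1.0 / (2.0 * np.linalg.norm(V, axis=1) + eps))
    num = U.T @ X + lam * V @ Bn + 2.0 * mu * V
    den = U.T @ U @ V + lam * V @ Bp + gamma * R @ V \
          + 2.0 * mu * V @ V.T @ V + eps
    return V * num / den
```

Since the numerator and denominator are both non-negative, non-negativity of V is preserved automatically, mirroring the KKT-derived rule in the text.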

Computational Complexity Analysis
In this section, we discuss the computational cost of the proposed algorithm. ℓ2,1HNMFD needs O(N²M) operations to construct the nearest neighbor graph. Suppose the multiplicative updates stop after t iterations; the complexity of the updates of ℓ2,1HNMFD is O(tNMK). Thus the overall complexity of ℓ2,1HNMFD is O(tNMK + N²M), which is similar to that of GNMF.

Proof of Convergence
Theorem 1. The objective function value in Equation (15) is non-increasing under the update rules in Equations (20) and (23).
The updating rule for U is the same as in classical NMF, so O in Equation (15) is non-increasing under Equation (20). Next, we prove that O is non-increasing under Equation (23). The proof uses the notion of an auxiliary function [24], defined as follows.

Definition 1. G(v, v') is an auxiliary function for F(v) if G(v, v') ≥ F(v) and G(v, v) = F(v) hold for any v and v'.

Lemma 1. If G is an auxiliary function for F, then F is non-increasing under the updating rule

v^(t+1) = arg min_v G(v, v^(t)).   (24)

Proof for Lemma 1. F(v^(t+1)) ≤ G(v^(t+1), v^(t)) ≤ G(v^(t), v^(t)) = F(v^(t)).

Next, we show that the updating rule for V in Equation (23) is exactly the rule in Equation (24) with a proper auxiliary function. We use F_ab to denote the part of O that is only relevant to v_ab.

Lemma 2. The function G(v, v^(t)_ab) in Equation (25) is an auxiliary function for F_ab.

Proof for Lemma 2. By construction, G(v, v) = F_ab(v) and G(v, v^(t)_ab) ≥ F_ab(v). A similar proof can be seen in [7].

Proof for Theorem 1. By substituting G(v, v^(t)_ab) in Equation (24) with Equation (25), we obtain an updating rule that is identical to Equation (23). Since G(v, v^(t)_ab) is an auxiliary function of F_ab(v), F_ab(v) is non-increasing under this updating rule. Therefore O in Equation (15) is non-increasing under Equation (23).

Experiment
In this section, we evaluate the performance of ℓ2,1HNMFD. To demonstrate the advantages of the proposed method, we compare its results with those of related state-of-the-art methods. All statistical significance tests were performed with Student's t-tests at a significance level of 0.05.
To perform data clustering with an NMF-based method, the original data are first transformed by the different NMF algorithms to generate new representations. The new representations are then fed to the Kmeans clustering algorithm to obtain the final clustering result.

Data Sets
We use five real-world data sets to evaluate the proposed method. These datasets are described below. The Yale face dataset consists of 165 gray-scale face images of 15 persons. There are 11 images per subject, each with a different facial expression or configuration: center-light, with/without glasses, normal, right-light, sad, sleepy, surprised and wink.
The ORL face dataset contains 10 different face images of each of 40 distinct persons; each of the 400 images was taken against a dark, homogeneous background, with the subjects in an upright, frontal position, with some tolerance for side movement.
The UMIST face dataset contains 575 images of 20 people, each covering a range of poses from profile to frontal views. The subjects cover a range of race, sex and appearance.
The COIL20 dataset contains 32 × 32 gray-scale images of 20 objects viewed from varying angles. The CMU PIE face dataset contains 32 × 32 gray-scale face images of 68 people; each person has 42 facial images taken under various light and illumination conditions.
The important statistics of these datasets are summarized in Table 1.

Evaluation Metrics
In our experiments, we set the number of clusters equal to the number of classes for all algorithms.To evaluate the performance of clustering, we use Accuracy and Normalized Mutual Information (NMI) to measure the clustering results.
Accuracy is defined as follows:

ACC = (Σ_{i=1}^N δ(s_i, map(r_i))) / N,

where r_i and s_i are the cluster labels of item i in the clustering result and the ground truth, respectively; δ(x, y) equals 1 if x = y and 0 otherwise; and map(r_i) is the permutation mapping function that maps r_i to the equivalent cluster label in the ground truth.
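The best permutation map(·) is commonly found with the Hungarian algorithm; below is a sketch using scipy's `linear_sum_assignment` (our choice of implementation, since the paper does not specify one).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(truth, pred):
    """Clustering accuracy under the best one-to-one cluster-label mapping."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    K = max(truth.max(), pred.max()) + 1
    cost = np.zeros((K, K), dtype=int)
    for s, r in zip(truth, pred):
        cost[r, s] += 1                        # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[rows, cols].sum() / len(truth)
```

Negating the co-occurrence matrix turns the assignment solver's minimization into the maximization of correctly mapped samples.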
The NMI is defined as follows:

NMI(C, C†) = MI(C, C†) / max(H(C), H(C†)),

where MI(C, C†) is the mutual information between the two cluster sets C and C†, and H(·) denotes entropy. If the two cluster sets are completely independent, NMI(C, C†) = 0.
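A direct sketch of this metric is given below, normalizing by the maximum entropy as in the formula above (a common convention in NMF clustering papers; other normalizers exist).

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two labelings,
    normalized by the maximum of the two entropies."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    mi = 0.0
    for x in np.unique(a):
        for y in np.unique(b):
            pxy = np.sum((a == x) & (b == y)) / n
            px, py = np.sum(a == x) / n, np.sum(b == y) / n
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    def H(z):
        _, cnt = np.unique(z, return_counts=True)
        p = cnt / n
        return -np.sum(p * np.log(p))
    return mi / max(H(a), H(b))
```

NMI is invariant to permutations of the cluster labels, which is why it needs no map(·) function, and it reaches 1 only when the two partitions coincide.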

Baseline
To demonstrate how clustering performance can be enhanced by ℓ2,1HNMFD, we compare the following state-of-the-art clustering algorithms: (1) the traditional Kmeans clustering algorithm (Kmeans); (2) Normalized Cut (Ncut); (3) standard NMF; (4) graph regularized NMF (GNMF); and (5) the proposed ℓ2,1HNMFD.

Clustering Results
Table 2 presents the clustering accuracy of all the algorithms on each data set, while Table 3 presents the normalized mutual information. The observations are as follows. First, the NMF-based methods, including NMF, GNMF and ℓ2,1HNMFD, outperform the Kmeans method. This suggests the superiority of parts-based data representation for uncovering the hidden factors.
Second, Ncut and GNMF exploit geometrical information and achieve better performance than the Kmeans and NMF methods. This suggests that geometrical information is very important for learning the hidden factors.
Finally, on all the data sets, ℓ2,1HNMFD outperforms the other clustering methods. This demonstrates that, by exploiting Hessian regularization, group sparse regularization and discriminative information, the new method can learn a more meaningful representation.

Parameter Sensitivity
ℓ2,1HNMFD has three parameters: λ, µ and γ. We investigated their influence on the performance of ℓ2,1HNMFD by varying one parameter at a time while fixing the other two. For each specific setting, we ran ℓ2,1HNMFD 10 times and recorded the average performance.
We plot the performance of ℓ2,1HNMFD with respect to λ in Figure 1a. The parameter λ measures the importance of the graph embedding regularization term of ℓ2,1HNMFD. If λ is too small, the graph regularization is so weak that the local geometrical information of the data cannot be effectively characterized, while a λ that is too large may cause a trivial solution. ℓ2,1HNMFD shows superior performance when λ equals 0.01, 0.001 and 0.1 for YALE, ORL and UMIST, respectively.
We plot the performance of ℓ2,1HNMFD with respect to µ in Figure 1b. The parameter µ controls the orthogonality of the learned representation. When µ is too small, the orthogonality constraint is too weak and ℓ2,1HNMFD may be ill-defined. When µ is too large, the constraint may dominate the objective function, and the learned representation becomes too sparse, which is also unfaithful to the real-world situation. We observe that ℓ2,1HNMFD achieves encouraging performance when µ equals 0.001, 0.001 and 0.1 for YALE, ORL and UMIST, respectively.
We plot the performance of ℓ2,1HNMFD with respect to γ in Figure 1c. The parameter γ controls the degree of sparsity of the encoding matrix. Sparsity constraints that are too weak or too strong both harm the learned representation. We find that ℓ2,1HNMFD consistently outperforms the best baseline methods on the three datasets when γ = 1.

Convergence Analysis
The updating rules for minimizing the objective function of ℓ2,1HNMFD are essentially iterative. We have provided the convergence proof above; next, we analyze how fast the rules converge.
We investigate the empirical convergence properties of both GNMF and ℓ2,1HNMFD on three datasets. In each figure, the x-axis denotes the iteration number and the y-axis is the objective function value on a log scale. Figure 2a-c show the objective function value against the number of iterations performed for YALE, ORL and UMIST, respectively. We observed that, at the beginning, the objective function values of both GNMF and ℓ2,1HNMFD dropped drastically, and both converged very fast, usually within 100 iterations.

Conclusions and Future Work
In this paper, we have presented a novel matrix factorization method, called ℓ2,1 norm and Hessian Regularized Non-negative Matrix Factorization with Discriminability (ℓ2,1HNMFD), for data representation. On one hand, ℓ2,1HNMFD uses Hessian regularization to preserve the local manifold structure of the data. On the other hand, ℓ2,1HNMFD exploits the ℓ2,1 norm constraint to obtain a sparse representation, and uses an approximate orthogonality constraint to characterize the discriminative information of the data. Experimental results on five real-world datasets suggest that ℓ2,1HNMFD is able to learn a better parts-based representation. This paper only considers the single-view case; in the future, we will consider multi-view cases and learn a meaningful representation for multi-view data.

Figure 1. Influence of different parameter settings on the performance of ℓ2,1HNMFD on three datasets: (a) varying λ while fixing µ and γ; (b) varying µ while fixing λ and γ; (c) varying γ while fixing λ and µ.

Table 1. Statistics of the datasets.