Group Sparsity and Graph Regularized Semi-Nonnegative Matrix Factorization with Discriminability for Data Representation

Abstract: Semi-Nonnegative Matrix Factorization (Semi-NMF), a variant of NMF, inherits the parts-based representation of NMF while being able to process mixed sign data, and has therefore attracted extensive attention. However, standard Semi-NMF suffers from three limitations. First, Semi-NMF fits data in a Euclidean space, which ignores the geometrical structure in the data. Second, Semi-NMF does not incorporate discriminative information in the learned subspace. Third, the basis learned by Semi-NMF is not necessarily parts-based, because there are no explicit constraints to ensure that the representation is parts-based. To settle these issues, we propose a novel Semi-NMF algorithm, called Group sparsity and Graph regularized Semi-Nonnegative Matrix Factorization with Discriminability (GGSemi-NMFD). GGSemi-NMFD adds a graph regularization term to Semi-NMF, which preserves the local geometrical information of the data space. To obtain discriminative information, approximate orthogonal constraints are added in the learned subspace. In addition, $\ell_{2,1}$-norm constraints are adopted for the basis matrix, which encourages the basis matrix to be row sparse. Experimental results on six datasets demonstrate the effectiveness of the proposed algorithm.


Introduction
Nonnegative Matrix Factorization (NMF) [1] is a useful data representation technique for finding compact and low dimensional representations of data. NMF decomposes the nonnegative data matrix X into a basis matrix U and an encoding matrix V whose product can approximate the original data matrix X. Due to the nonnegative constraint, NMF only allows additive combination, which leads to the parts-based representation. The parts-based representation is consistent with the psychological intuition of combining parts to form a whole, so NMF has been widely used in data mining and pattern-recognition problems. The nonnegative constraints distinguish NMF from many other traditional matrix factorization algorithms, such as Principal Component Analysis (PCA) [2], Independent Component Analysis (ICA) [3] and Singular Value Decomposition (SVD). However, the major limitation of NMF is that it cannot deal with mixed sign data.
To address the limitation of NMF while inheriting all its merits, Ding et al. [4] proposed Semi-Nonnegative Matrix Factorization (Semi-NMF), which can handle mixed sign items in data matrix X. Specifically, Semi-NMF only imposes non-negative constraints on encoding matrix V and allows mixed signs in both the data matrix X and the basis matrix U. This allows Semi-NMF to learn a new representation from any signed data and extends the range of application of NMF ideas.
Numerous studies [5][6][7] have demonstrated that data are usually drawn from sampling a probability distribution that has support on or near a submanifold of the ambient space. Some manifold learning algorithms, such as ISOMAP [5], Locally Linear Embedding (LLE) [6] and Laplacian Eigenmap [7], have been proposed to detect the hidden manifold structure. All these algorithms use the locally invariant idea [8], i.e., nearby points are very likely to have similar embeddings. If the geometrical structure is utilized, the learning performance can be noticeably enhanced.
On the other hand, the discriminative information of data is very important in computer vision and pattern recognition. Usually, discriminative information is obtained by exploiting label information in the framework of NMF. For example, Liu et al. [9] proposed Constrained Nonnegative Matrix Factorization (CNMF), which imposes the label information on the objective function as hard constraints. Li et al. [10] developed a semi-supervised robust structured NMF, which exploited the block-diagonal structure in the framework of NMF. Unfortunately, under the unsupervised scenario, no label information is available. However, through reformulation of the scaled indicator matrix, we find that there is an approximately orthogonal discriminability in the learned subspace. By adding approximate orthogonal constraints to the new representation, we can acquire some discriminative information in the learned subspace.
Donoho and Stodden [11] theoretically proved that NMF cannot guarantee decomposing an object into parts. In other words, NMF may fail to produce a parts-based representation for some datasets. To ensure the parts-based representation, sparse constraints have been introduced to NMF. Hoyer [12] proposed sparse constrained NMF, which added the $\ell_1$-norm penalty on the basis and encoding matrices and obtained a sparser representation than standard NMF. However, the $\ell_1$-norm regularization cannot guarantee that all the data vectors are sparse in the same features [13], so it is not suitable for feature selection. To settle this issue, Nie [14] proposed a robust feature selection method emphasizing joint $\ell_{2,1}$-norm minimization on both the loss function and regularization. Yang et al. [15], Hou et al. [16] and Gu et al. [17] used the $\ell_{2,1}$-norm regularization in discriminant feature selection, sparse regression and subspace learning, respectively. The $\ell_{2,1}$-norm regularization is regarded as a powerful model for sparse feature selection and has attracted increasing attention [14,15,18].
The goal of this paper is to preserve the local geometrical structure of the mixed sign data and characterize the discriminative information in the learned subspace under the framework of Semi-NMF. In addition, we encourage the basis matrix to be group sparse, which is suitable for retaining the important basis vectors and removing the irrelevant ones. We propose a novel algorithm, called Group sparsity and Graph regularized Semi-Nonnegative Matrix Factorization with Discriminability (GGSemi-NMFD), for data representation. Graph regularization [19] has been introduced to encode the local structure of non-negative data in the framework of NMF. We apply it to preserve the intrinsic geometric structure of mixed sign data in the framework of Semi-NMF. To incorporate the discriminative information of the data, we add approximate orthogonal constraints in the learned latent subspace, and thus improve the performance of Semi-NMF in clustering tasks. We further constrain the learned basis matrix to be row sparse. This is inspired by the intuition that different dimensions of basis vectors have different importance. For model optimization, we develop an effective iterative updating scheme for GGSemi-NMFD. Experimental results on six real-world datasets demonstrate the effectiveness of our approach.
To summarize, it is worthwhile to highlight three aspects of the proposed method here:
1. While the standard Semi-NMF models the data in the Euclidean space, GGSemi-NMFD exploits the intrinsic geometrical information of the data distribution and adds it as a regularization term. Hence, our algorithm is especially applicable when the data are sampled from a submanifold of a high-dimensional space.
2. To incorporate the discriminative information of the data, we add approximate orthogonal constraints in the learned space. With these constraints, our algorithm can have more discriminative power than the standard Semi-NMF.
3. Our algorithm adds $\ell_{2,1}$-norm constraints on the basis matrix, which can shrink some rows of basis matrix U to zero, making basis matrix U suitable for feature selection. By preserving the group sparse structure in the basis matrix, our algorithm can acquire more flexible and meaningful semantics.
The remainder of this paper is organized as follows: Section 2 presents a brief overview of related works. Section 3 introduces our GGSemi-NMFD algorithm and the optimization scheme. Experimental results on six real-world datasets are presented in Section 4. Finally, we draw the conclusion in Section 5.

Related Work
In this section, we briefly review the works most closely related to ours.

NMF
Given a non-negative data matrix $X \in \mathbb{R}^{M \times N}$, the goal of NMF is to find a non-negative basis matrix $U \in \mathbb{R}^{M \times K}$ and a non-negative encoding matrix $V \in \mathbb{R}^{K \times N}$ whose product can well approximate the non-negative data matrix $X$. Here, $K$ denotes the desired reduced dimension.
The least squares objective function of NMF is formulated as follows:
$$\min_{U, V} \; \|X - UV\|_F^2 \quad \text{s.t.} \quad U \ge 0, \; V \ge 0. \qquad (1)$$
It is clear that Equation (1) is not convex when both U and V are taken as variables. However, it is convex in U when V is fixed and vice versa. Lee and Seung [20] presented iterative multiplicative updating rules as follows:
$$U_{ik} \leftarrow U_{ik} \frac{(XV^T)_{ik}}{(UVV^T)_{ik}}, \qquad V_{kj} \leftarrow V_{kj} \frac{(U^TX)_{kj}}{(U^TUV)_{kj}}. \qquad (2)$$
Using the above updating rules, we can find a locally optimal solution of Equation (1).
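The multiplicative rules of Lee and Seung can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `nmf_multiplicative`, the iteration count, the random initialization and the small constant `eps` (added to the denominators to avoid division by zero) are all assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||X - UV||_F^2 with U >= 0, V >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps  # random non-negative initialization
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled entrywise by a non-negative ratio,
        # so non-negativity is preserved automatically.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# Toy usage on a random non-negative matrix (hypothetical data).
rng = np.random.default_rng(1)
X = rng.random((20, 30))
U, V = nmf_multiplicative(X, K=5)
err = np.linalg.norm(X - U @ V)
```

Because every update multiplies by a non-negative ratio, no projection step is needed to keep U and V feasible.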

Semi-NMF
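One limitation of NMF is that it cannot handle mixed sign data. To settle this issue, Ding et al. [4] proposed Semi-Nonnegative Matrix Factorization (Semi-NMF). Specifically, Semi-NMF drops the non-negative constraints on data matrix X and basis matrix U and only imposes non-negative constraints on encoding matrix V. In this way, Semi-NMF can process mixed sign matrices and inherit all the merits of NMF. The objective function of Semi-NMF is written as follows:
$$\min_{U, V} \; \|X - UV\|_F^2 \quad \text{s.t.} \quad V \ge 0. \qquad (3)$$
To solve Equation (3), Ding et al. [4] proposed the updating rule as follows:
$$U = XV^T (VV^T)^{-1}, \qquad V_{kj} \leftarrow V_{kj} \sqrt{\frac{\left[(U^TX)^+ + (U^TU)^- V\right]_{kj}}{\left[(U^TX)^- + (U^TU)^+ V\right]_{kj}}},$$
where we separate the positive and negative parts of a matrix A as:
$$A^+_{ik} = \frac{|A_{ik}| + A_{ik}}{2}, \qquad A^-_{ik} = \frac{|A_{ik}| - A_{ik}}{2}.$$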
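The Semi-NMF scheme of Ding et al. [4] alternates a least-squares solve for U with a square-root multiplicative update for V. The sketch below writes it in this paper's convention (X ≈ UV with V of size K × N). The helper names, the use of a pseudo-inverse in place of a plain inverse, the initialization and the constant `eps` are assumptions made for the example.

```python
import numpy as np

def pos(A):
    """Positive part A^+ = (|A| + A) / 2."""
    return (np.abs(A) + A) / 2

def neg(A):
    """Negative part A^- = (|A| - A) / 2."""
    return (np.abs(A) - A) / 2

def solve_U(X, V):
    """Unconstrained least-squares basis U = X V^T (V V^T)^+."""
    return X @ V.T @ np.linalg.pinv(V @ V.T)

def semi_nmf(X, K, n_iter=100, eps=1e-10, seed=0):
    """Alternate the closed-form U step with the multiplicative V step."""
    rng = np.random.default_rng(seed)
    V = rng.random((K, X.shape[1])) + 0.1  # strictly positive start
    for _ in range(n_iter):
        U = solve_U(X, V)
        UtX, UtU = U.T @ X, U.T @ U
        num = pos(UtX) + neg(UtU) @ V
        den = neg(UtX) + pos(UtU) @ V + eps
        V = V * np.sqrt(num / den)  # stays non-negative
    return solve_U(X, V), V

# Toy usage on mixed sign data (hypothetical).
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 30))
U, V = semi_nmf(X, K=4)
err = np.linalg.norm(X - U @ V)
```

Only V is constrained, so X and U may contain negative entries while the encoding remains non-negative.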

Model
In this section, we propose a novel algorithm, called Group sparsity and Graph regularized Semi-Nonnegative Matrix Factorization with Discriminability (GGSemi-NMFD), which considers the group sparsity of the basis matrix, preserves the local geometric structure of the data, and incorporates discriminative information in the learned subspace.

Graph Regularized Semi-NMF
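Spectral graph theory [21] and manifold learning theory [7] have demonstrated that the local geometrical structure can be effectively modeled through a nearest neighbor graph on a scatter of data points. For each data point $x_i$, we find its $k$ nearest neighbors $N_i$ and put an edge between $x_i$ and each of its neighbors in the adjacency matrix $W$. Thus, we can use the graph regularization term to measure the smoothness of the low-dimensional representation:
$$R = \frac{1}{2} \sum_{i,j=1}^{N} \|v_i - v_j\|^2 W_{ij} = \mathrm{Tr}(VDV^T) - \mathrm{Tr}(VWV^T) = \mathrm{Tr}(VLV^T),$$
where $v_i$ is the $i$-th column of $V$, $D$ is a diagonal matrix whose entries are the column sums of $W$ (or row sums, since $W$ is symmetric), i.e., $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is the graph Laplacian. There are many ways to define the adjacency weight matrix $W$; here, we use the 0-1 weighting, since it is simple and effective. It is defined as:
$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N_j \text{ or } x_j \in N_i, \\ 0, & \text{otherwise}, \end{cases}$$
where $N_i$ denotes the set of $k$ nearest neighbors of data point $x_i$. Combining the graph regularization term with the standard Semi-NMF objective function, we obtain the graph regularized Semi-NMF, which can be written as follows:
$$\min_{U, V} \; \|X - UV\|_F^2 + \alpha \, \mathrm{Tr}(VLV^T) \quad \text{s.t.} \quad V \ge 0, \qquad (7)$$
where the regularization parameter $\alpha \ge 0$ controls the smoothness of the new representation.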
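The graph construction can be illustrated with a short NumPy sketch. It assumes that data points are the columns of X, that neighbors are found with Euclidean distance, and that an edge is kept when either point is among the other's k nearest neighbors. The function names are hypothetical. The last lines check numerically that the trace form Tr(V L V^T) equals the weighted sum of pairwise squared distances between columns of V.

```python
import numpy as np

def knn_graph(X, k):
    """0-1 weighted k-nearest-neighbor graph over the columns of X."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2 * X.T @ X  # squared distances
    np.fill_diagonal(dist, np.inf)                  # exclude self-neighbors
    nbrs = np.argsort(dist, axis=1)[:, :k]          # k nearest per point
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)  # edge if i is near j OR j is near i

def laplacian(W):
    """Return L = D - W and the degree matrix D."""
    D = np.diag(W.sum(axis=0))
    return D - W, D

# Toy check of the identity on random data (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 12))   # 12 points in 5 dimensions
V = rng.random((3, 12))            # some 3-dimensional encoding
W = knn_graph(X, k=3)
L, D = laplacian(W)
trace_form = np.trace(V @ L @ V.T)
pairwise_form = 0.5 * sum(
    W[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
    for i in range(12) for j in range(12)
)
```

The symmetrization step matters: the trace identity relies on W being symmetric.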
In [19], Cai et al. proposed Graph regularized Non-negative Matrix Factorization (GNMF), which considers the local invariance in the framework of NMF. The major difference between GGSemi-NMFD and GNMF is that GGSemi-NMFD constructs the data graph for data of any sign, whereas GNMF constructs the data graph only for non-negative data. What is more, GNMF ignores the discriminative information and cannot guarantee parts-based representations. Therefore, GGSemi-NMFD extends GNMF and has some novel properties, which are presented in detail as follows.

Discriminative Constraints
If we could obtain the discriminative information hidden in the data, it would help in learning a better representation. To this end, we follow the works in [22,23], where an indicator matrix is used. First, we introduce the indicator matrix $Y \in \{0, 1\}^{N \times K}$, where $Y_{ij} = 1$ if the $i$-th data point belongs to the $j$-th group and $Y_{ij} = 0$ otherwise. Then, the scaled indicator matrix can be defined as follows:
$$F = Y (Y^TY)^{-1/2},$$
where each column in $F$ is:
$$f_j = \Big[0, \dots, 0, \underbrace{1, \dots, 1}_{n_j}, 0, \dots, 0\Big]^T \Big/ \sqrt{n_j},$$
where $n_j$ is the number of samples in the $j$-th group. If the new representation $V$ shared the structure of the scaled indicator matrix, it would be discriminative. Unfortunately, under the unsupervised scenario, the label information is not available in advance. However, we find that the scaled indicator matrix is strictly orthogonal:
$$F^TF = (Y^TY)^{-1/2} \, Y^TY \, (Y^TY)^{-1/2} = I_K,$$
where $I_K$ is the $K \times K$ identity matrix. Hence, $V^T$ should also be orthogonal, i.e., $VV^T = I_K$. However, the orthogonal constraint is too strict. Therefore, we relax it and let $V$ be approximately orthogonal, i.e.,
$$\min_{V} \; \|VV^T - I_K\|_F^2. \qquad (10)$$
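The orthogonality of the scaled indicator matrix is easy to verify numerically. The sketch below builds F from a small hypothetical label vector, using the fact that Y^T Y is the diagonal matrix of group sizes, and also defines the relaxed penalty ||V V^T - I||_F^2. The function name `orth_penalty` is an assumption made for the example.

```python
import numpy as np

# Hypothetical labels for six points in three groups of sizes 2, 1 and 3.
labels = np.array([0, 0, 1, 2, 2, 2])
N, K = labels.size, 3

Y = np.zeros((N, K))
Y[np.arange(N), labels] = 1.0       # indicator matrix
counts = Y.sum(axis=0)              # n_j, the diagonal of Y^T Y
F = Y / np.sqrt(counts)             # Y (Y^T Y)^{-1/2}, column j scaled by 1/sqrt(n_j)
gram = F.T @ F                      # should be the K x K identity

def orth_penalty(V):
    """Relaxed orthogonality term ||V V^T - I_K||_F^2 for a K x N matrix V."""
    K = V.shape[0]
    return np.linalg.norm(V @ V.T - np.eye(K), "fro") ** 2

penalty_ideal = orth_penalty(F.T)             # zero for a perfect indicator
penalty_flat = orth_penalty(np.ones((K, N)))  # large for identical rows
```

An encoding whose rows look alike is penalized heavily, while one that assigns each point to a single group is not.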

Group Sparse Constraints
Usually, basis matrix U has redundant and irrelevant basis vectors. Removing the non-significant basis vectors and keeping the important ones leads to a better representation. To achieve this aim, we choose the third regularization term to distinguish the importance of different dimensions of basis vectors. Specifically, we encourage the significant dimensions of basis vectors to be non-zero values and the non-significant ones to be zero. Motivated by [14,15], we add the $\ell_{2,1}$-norm constraint on the basis matrix U, which drives some rows in U towards zero. Thus, we can retain the important dimensions of basis vectors (i.e., rows with non-zero values) and remove the unimportant ones (i.e., rows with zero values). The $\ell_{2,1}$-norm is defined as follows:
$$\|U\|_{2,1} = \sum_{j=1}^{M} \|u_{j\cdot}\|_2 = \sum_{j=1}^{M} \sqrt{\sum_{k=1}^{K} U_{jk}^2}, \qquad (11)$$
where $u_{j\cdot}$ represents the $j$-th row of U, whose norm reveals the importance of the $j$-th dimension of the basis vectors.
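As a small worked example of the row-wise definition, the sketch below computes the ℓ2,1-norm of a toy matrix and compares it with the entrywise ℓ1-norm. The function name `l21_norm` and the matrix values are made up for illustration.

```python
import numpy as np

def l21_norm(U):
    """Sum of the Euclidean norms of the rows of U."""
    return np.sum(np.linalg.norm(U, axis=1))

# Toy matrix: one dense row, one all-zero row, one single-entry row.
U = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

value = l21_norm(U)                                  # 5 + 0 + 1
l1_value = np.abs(U).sum()                           # 3 + 4 + 1
zero_rows = int(np.sum(np.linalg.norm(U, axis=1) == 0))
```

The penalty is an ℓ1-norm over row norms, so it pushes whole rows to zero at once, whereas the entrywise ℓ1-norm zeroes individual entries without coordinating them across columns.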

Objective Function
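By integrating Equations (7), (10) and (11), the overall loss function of GGSemi-NMFD is defined as:
$$\min_{U, V} \; \|X - UV\|_F^2 + \alpha \, \mathrm{Tr}(VLV^T) + \beta \, \|VV^T - I_K\|_F^2 + \lambda \, \|U\|_{2,1} \quad \text{s.t.} \quad V \ge 0, \qquad (12)$$
where $\alpha$, $\beta$ and $\lambda$ are regularization parameters. Parameter $\alpha$ measures the smoothness of the learned representation; parameter $\beta$ controls the orthogonality of $V$; and parameter $\lambda$ controls the degree of sparsity in basis matrix $U$.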
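For monitoring purposes, the overall loss can be evaluated directly from its four terms. The sketch below assumes that the reconstruction, graph, orthogonality and group sparsity terms are combined with weights 1, α, β and λ and no other constant factors. The function name `ggsemi_nmfd_objective` is hypothetical.

```python
import numpy as np

def ggsemi_nmfd_objective(X, U, V, L, alpha, beta, lam):
    """Reconstruction + graph + orthogonality + group sparsity terms."""
    K = V.shape[0]
    fit = np.linalg.norm(X - U @ V, "fro") ** 2
    graph = alpha * np.trace(V @ L @ V.T)
    orth = beta * np.linalg.norm(V @ V.T - np.eye(K), "fro") ** 2
    sparse = lam * np.sum(np.linalg.norm(U, axis=1))
    return fit + graph + orth + sparse

# Toy case where only the sparsity term is non-zero:
# X = UV exactly, the rows of V are orthonormal, and the graph is empty.
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
U = np.array([[3.0, 4.0],
              [0.0, 0.0]])
X = U @ V
L = np.zeros((3, 3))
value = ggsemi_nmfd_objective(X, U, V, L, alpha=1.0, beta=1.0, lam=0.5)
```

In this toy case the loss reduces to λ times the ℓ2,1-norm of U, that is 0.5 × 5.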

Optimization
In this section, we give the solution to Equation (12). The objective function in Equation (12) is not jointly convex in U and V, so a closed-form solution is not available.
We give an alternating scheme to optimize the objective function, which reaches a locally optimal solution. For ease of presentation, we denote the objective function in Equation (12) by $O$.

Updating Rule for U
Optimizing Equation (12) with respect to U is equivalent to optimizing:
$$\min_{U} \; \|X - UV\|_F^2 + \lambda \, \|U\|_{2,1}.$$
Inspired by [14], the derivative of the objective function with respect to U is as follows:
$$\frac{\partial O}{\partial U} = -2XV^T + 2UVV^T + 2\lambda EU,$$
where $E$ is a diagonal matrix with $e_{kk} = \frac{1}{2\|u_{k\cdot}\|_2}$ and $u_{k\cdot}$ is the $k$-th row of U. Letting $\frac{\partial O}{\partial U} = 0$, we get the following condition, from which the updating rule for U follows:
$$UVV^T + \lambda EU = XV^T.$$
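One way to turn the stationarity condition U V V^T + λ E U = X V^T into an update is to hold E fixed at its value for the current U. Because E is diagonal, the condition then decouples into one small linear system per row of U. The sketch below is derived from the definitions in this section and is not necessarily the exact rule used by the authors. The function name `update_U` and the guard `eps` against zero rows are assumptions.

```python
import numpy as np

def update_U(X, V, U, lam, eps=1e-8):
    """Solve U V V^T + lam * E U = X V^T with E fixed from the current U."""
    K = V.shape[0]
    row_norms = np.maximum(np.linalg.norm(U, axis=1), eps)
    e = 1.0 / (2.0 * row_norms)       # diagonal entries of E
    XVt = X @ V.T
    VVt = V @ V.T
    U_new = np.empty_like(U)
    for j in range(U.shape[0]):
        # Row j satisfies u_j (V V^T + lam * e_j * I) = (X V^T)_j.
        A = VVt + lam * e[j] * np.eye(K)   # symmetric positive definite
        U_new[j] = np.linalg.solve(A, XVt[j])
    return U_new

# Toy usage (hypothetical sizes and values).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))
V = rng.random((3, 10))
U0 = rng.standard_normal((6, 3))
lam = 0.5
U1 = update_U(X, V, U0, lam)
```

Rows of U with a small norm receive a large weight in E and are shrunk more strongly on the next pass, which is how the reweighting produces row sparsity.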

Updating Rule for V
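Let $\Phi$ be the Lagrange multiplier for the constraint $V \ge 0$. Keeping the part of $O$ that is related to $V$, the Lagrange function $\mathcal{L}(V)$ is defined as:
$$\mathcal{L}(V) = \|X - UV\|_F^2 + \alpha \, \mathrm{Tr}(VLV^T) + \beta \, \|VV^T - I_K\|_F^2 + \mathrm{Tr}(\Phi V^T).$$
The partial derivative of $\mathcal{L}(V)$ with respect to $V$ is:
$$\frac{\partial \mathcal{L}}{\partial V} = -2U^TX + 2U^TUV + 2\alpha VL + 4\beta VV^TV - 4\beta V + \Phi.$$
By using the Karush-Kuhn-Tucker condition, i.e., $\Phi_{jl} V_{jl} = 0$, we get the following equations:
$$\left(-2U^TX + 2U^TUV + 2\alpha VL + 4\beta VV^TV - 4\beta V\right)_{jl} V_{jl} = 0,$$
where we separate the positive and negative parts of matrix $A$ as:
$$A^+_{jl} = \frac{|A_{jl}| + A_{jl}}{2}, \qquad A^-_{jl} = \frac{|A_{jl}| - A_{jl}}{2}.$$
Then, with $L = D - W$, we obtain the following multiplicative updating rule:
$$V_{jl} \leftarrow V_{jl} \sqrt{\frac{\left[(U^TX)^+ + (U^TU)^- V + \alpha VW + 2\beta V\right]_{jl}}{\left[(U^TX)^- + (U^TU)^+ V + \alpha VD + 2\beta VV^TV\right]_{jl}}}.$$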
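A multiplicative step for V can be sketched by moving the non-negative terms of the gradient into a denominator and the remaining terms into a numerator, in the style of the Semi-NMF rule. The square-root form, the function name `update_V` and the constant `eps` are assumptions made for this illustration, and no monotonicity guarantee is claimed for it here.

```python
import numpy as np

def pos(A):
    """Positive part A^+ = (|A| + A) / 2."""
    return (np.abs(A) + A) / 2

def neg(A):
    """Negative part A^- = (|A| - A) / 2."""
    return (np.abs(A) - A) / 2

def update_V(X, U, V, W, D, alpha, beta, eps=1e-10):
    """One multiplicative step on V; every factor is non-negative."""
    UtX = U.T @ X
    UtU = U.T @ U
    num = pos(UtX) + neg(UtU) @ V + alpha * V @ W + 2 * beta * V
    den = (neg(UtX) + pos(UtU) @ V + alpha * V @ D
           + 2 * beta * V @ V.T @ V + eps)
    return V * np.sqrt(num / den)

# Toy usage with a ring graph over 8 points (hypothetical data).
rng = np.random.default_rng(0)
N = 8
X = rng.standard_normal((6, N))
U = rng.standard_normal((6, 3))
V = rng.random((3, N))
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0
D = np.diag(W.sum(axis=0))
V_new = update_V(X, U, V, W, D, alpha=1.0, beta=0.1)
```

The graph Laplacian is split as L = D - W so that both V D and V W are non-negative, which keeps the ratio well defined.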

Experimental Section
To demonstrate the effectiveness of GGSemi-NMFD, we carried out extensive experiments on six public datasets: ORL, YALE, UMIST, Ionosphere, USPST and Waveform. All statistical significance tests were performed using Student's t-tests with a significance level of 0.05. All the NMF-based methods use random initialization.

Datasets and Metrics
In our experiments, we use six datasets that are widely used as benchmarks in the clustering literature. The statistics of these datasets are summarized in Table 1.
ORL: The ORL face dataset contains face images of 40 distinct persons. Each person has ten different images, taken at different times, totaling 400. All images are cropped to 32 × 32 pixel grayscale images, and we reshape each of them into a 1024-dimensional vector.
YALE: The YALE face database contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per different facial expression or configuration. All images are cropped to 32 × 32 pixel grayscale images, and we reshape each of them into a 1024-dimensional vector.
UMIST: The UMIST face database contains 575 images from 20 individuals. All images are cropped to 23 × 28 pixel grayscale images, and we reshape each of them into a 644-dimensional vector.
Ionosphere: Ionosphere is from the UCI repository. The data were collected by a radar system consisting of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The dataset consists of 351 instances with 34 numeric attributes.
USPST: The USPST dataset comes from the USPS system, and each image in USPST is presented at a resolution of 16 × 16 pixels. It is the test split of USPS.
Waveform: Waveform is available from the UCI repository. It has three categories with 21 numerical attributes and 2746 instances.
We utilize clustering performance to evaluate the effectiveness of data representation. Clustering Accuracy (ACC) and Normalized Mutual Information (NMI) are two widely-used metrics for clustering performance, whose definitions are as follows:
$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta(s_i, \mathrm{map}(r_i))}{N}, \qquad \mathrm{NMI}(C, C^\dagger) = \frac{\mathrm{MI}(C, C^\dagger)}{\max(H(C), H(C^\dagger))},$$
where $N$ is the number of items; $r_i$ and $s_i$ are cluster labels of item $i$ in the clustering results and in the ground truth, respectively; $\delta(x, y)$ equals 1 if $x = y$ and equals 0 otherwise; and $\mathrm{map}(r_i)$ is the permutation mapping function that maps $r_i$ to the equivalent cluster label in the ground truth. $H(C)$ denotes the entropy of cluster set $C$. $\mathrm{MI}(C, C^\dagger)$ is the mutual information between $C$ and $C^\dagger$:
$$\mathrm{MI}(C, C^\dagger) = \sum_{c_i \in C, \, c^\dagger_j \in C^\dagger} p(c_i, c^\dagger_j) \log \frac{p(c_i, c^\dagger_j)}{p(c_i) \, p(c^\dagger_j)},$$
where $p(c_i)$ is the probability that a randomly-selected item from all testing items belongs to cluster $c_i$, and $p(c_i, c^\dagger_j)$ is the joint probability that a randomly-selected item is in $c_i$ and $c^\dagger_j$ simultaneously. If $C$ and $C^\dagger$ are identical, $\mathrm{NMI}(C, C^\dagger) = 1$. $\mathrm{NMI}(C, C^\dagger) = 0$ when the two cluster sets are completely independent.
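Both metrics can be computed with the Python standard library alone, as sketched below. Two simplifications are assumed. The permutation mapping in ACC is found by brute force over all relabelings, which is only practical for a handful of clusters (an assignment solver would normally be used instead). NMI is normalized by the larger of the two entropies and returned as 0.0 when both entropies are zero. All function names and the toy label lists are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(pred, truth):
    """ACC under the best one-to-one relabeling of predicted clusters."""
    pred_labels = sorted(set(pred))
    true_labels = sorted(set(truth))
    # Pad with None so that surplus predicted clusters map to no class.
    pad = max(0, len(pred_labels) - len(true_labels))
    candidates = true_labels + [None] * pad
    best = 0
    for perm in permutations(candidates, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(mapping[r] == s for r, s in zip(pred, truth))
        best = max(best, hits)
    return best / len(truth)

def entropy(labels):
    """Shannon entropy (base 2) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (base 2) between two label lists."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def nmi(a, b):
    """MI normalized by the larger entropy; 0.0 if both entropies vanish."""
    denom = max(entropy(a), entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1, 2, 2]
acc_same = clustering_accuracy([1, 1, 0, 0, 2, 2], truth)  # relabeled copy
acc_one_off = clustering_accuracy([0, 0, 0, 1, 2, 2], truth)
nmi_same = nmi([1, 1, 0, 0, 2, 2], truth)
nmi_indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Both metrics are invariant to how the clusters are named, which is why the relabeled copy of the ground truth scores perfectly.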

Compared Algorithms
To demonstrate how the clustering performance can be improved by our method, we compare the following popular clustering algorithms: Kmeans, PCA [2], NMF [1], Semi-NMF [4], GNMF [19] and the proposed GGSemi-NMFD.

Parameter Settings
Baseline methods have several parameters to be tuned. To compare these methods fairly, we perform a grid search in the parameter space for each method and record the best average results.
For datasets ORL, YALE and UMIST, we set K, the dimension of the latent space, to the number of true classes of the dataset [19], for all NMF-based methods. For the dataset Ionosphere, since its class number is too small (only 2 classes), we set K = 20 for NMF-based methods. We applied the compared methods to learn a new representation, and then Kmeans was applied for data clustering on the new data representation. For a given cluster number, 10 test runs were conducted on different classes of data randomly chosen from the dataset.
Note that there is no parameter selection for Kmeans, PCA, NMF and Semi-NMF, given the number of clusters.
In the following section, we repeat clustering 10 times, and the mean and the standard error are computed. Additionally, we report the best average result for each method.

Performance Comparison
Tables 2-7 show the clustering results on the ORL, YALE, UMIST, Ionosphere, USPST and Waveform datasets, respectively.
The experiments reveal some important points.

1. The NMF-based methods, including NMF, Semi-NMF, GNMF and GGSemi-NMFD, outperform the PCA and Kmeans methods, which demonstrates the merit of the parts-based representation in discovering the hidden factors.
2. On nonnegative datasets, NMF demonstrates somewhat superior performance over Semi-NMF.
3. For nonnegative datasets, methods considering the local geometrical structure of data, such as GNMF and GGSemi-NMFD, significantly outperform NMF and Semi-NMF, which suggests the importance of exploiting the intrinsic geometric structure of data.
4. When the dataset has mixed signs, NMF and GNMF cannot be applied. Semi-NMF tends to outperform Kmeans and PCA, which indicates the advantage of the parts-based representation in finding the hidden matrix factors even in mixed sign data.
5. Regardless of the dataset, our GGSemi-NMFD always achieves the best performance. This shows that by leveraging the power of parts-based representation, graph Laplacian regularization, group sparse constraints and discriminative information simultaneously, GGSemi-NMFD can learn a more compact and meaningful representation.

Parameter Study
GGSemi-NMFD has four parameters: α, β, λ and the number of nearest neighbors k. Parameter α measures the weight of the graph Laplacian; parameter β controls the orthogonality of the learned representation; parameter λ controls the degree of sparsity of the basis matrix; and k controls the complexity of the graph. We investigated their influence on GGSemi-NMFD's performance by varying one parameter at a time while fixing the others. For each specific setting, we ran GGSemi-NMFD 10 times, and the average performance was recorded.
The results are shown in Figures 1-4 for ORL, YALE, UMIST and Ionosphere, respectively (results for USPST and Waveform were similar to Ionosphere). We found that the four parameters have the same behavior: as the parameter increases from a very small value, the performance curves first rise and then descend. This indicates that, when assigned proper values, the graph Laplacian, approximate orthogonality and sparseness constraints, as well as the number of nearest neighbors, are helpful for learning a better representation. For dataset ORL, we set α = β = 10, λ = 1. For dataset YALE, we set α = β = 1, λ = 0.1. For dataset UMIST, we set α = 1000, β = 0.1 and λ = 100. For dataset Ionosphere, we set α = β = 0.01, λ = 0.1. For the number of nearest neighbors k, we can observe from the results that GGSemi-NMFD consistently outperforms the best baseline algorithms on the four datasets when k ∈ [3,7].

Convergence Analysis
The updating rules for minimizing the objective function of GGSemi-NMFD in Equation (12) are iterative, and it can be proven that these rules are convergent. Figure 5a-d shows the convergence curves of GGSemi-NMFD on datasets ORL, YALE, UMIST and Ionosphere, respectively. For each figure, we use the objective function values with log scale (blue line) and the values of the objective function in the next two iterates (green line) to measure the convergence of GGSemi-NMFD. As can be seen, the multiplicative update rules for GGSemi-NMFD converge quickly, usually within dozens of iterations.

Conclusions
In this work, we proposed Group sparsity and Graph regularized Semi-Nonnegative Matrix Factorization with Discriminability (GGSemi-NMFD), a novel algorithm for learning latent representations from data of any sign. GGSemi-NMFD learns a semantic latent subspace of items by exploiting the graph Laplacian, discriminative information and sparse constraints simultaneously. The graph Laplacian term encourages items of the same category to be near each other. Approximate orthogonal constraints are introduced to incorporate some discriminative information in the learned subspace. Another novel property of GGSemi-NMFD is that it allows each dimension of the basis matrix to be related or unrelated to the new representation by imposing the $\ell_{2,1}$-norm penalty on basis matrix U. Therefore, GGSemi-NMFD is able to learn a richer and more flexible semantic latent subspace. We proposed an efficient optimization method for GGSemi-NMFD. Experimental results on six real-world datasets indicate that GGSemi-NMFD is effective and outperforms the baselines significantly. In future work, we will investigate the multi-view case [25], in which a more accurate representation can be learned from multi-view data.

Figure 5. Convergence analysis of GGSemi-NMFD on: (a) ORL; (b) YALE; (c) UMIST; and (d) Ionosphere. The y-axes for objective function values are in the log scale.

Table 1. Statistics of the datasets.

Table 5. Clustering performance on Ionosphere.

Table 7. Clustering performance on Waveform.