A Block Coordinate Descent-based Projected Gradient Algorithm for Orthogonal Non-negative Matrix Factorization

This article utilizes the projected gradient method (PG) for a non-negative matrix factorization problem (NMF), where one or both matrix factors must have orthonormal columns or rows. We penalise the orthonormality constraints and apply the PG method via a block coordinate descent approach. This means that at a certain time one matrix factor is fixed and the other is updated by moving along the steepest descent direction computed from the penalised objective function and projecting onto the space of non-negative matrices. Our method is tested on two sets of synthetic data for various values of penalty parameters. The performance is compared to the well-known multiplicative update (MU) method from Ding (2006), and with a modified global convergent variant of the MU algorithm recently proposed by Mirzal (2014). We provide extensive numerical results coupled with appropriate visualizations, which demonstrate that our method is very competitive and usually outperforms the other two methods.


Motivation
Many machine learning applications require processing large and high dimensional data. The data could be images, videos, kernel matrices, spectral graphs, etc., represented as an m × n matrix R. The data size and the amount of redundancy increase rapidly when m and n grow. To make the analysis and the interpretation easier, it is favorable to obtain compact and concise low rank approximation of the original data R. This low-rank approximation is known to be very efficient in a wide range of applications, such as: text mining [2,27,30], document classification [3], clustering [19,32], spectral data analysis [2,12], face recognition [35], and many more.
There exist many different low rank approximation methods. For instance, two well-known strategies, broadly used for data analysis, are singular value decomposition (SVD) [9] and principle component analysis (PCA) [11]. Much of real-world data are non-negative, and the related hidden parts express physical features only when the non-negativity holds. The factorizing matrices in SVD or PCA can have negative entries, making it hard or impossible to put a physical interpretation on them. Non-negative matrix factorization was introduced as an attempt to overcome this drawback, i.e., to provide the desired low rank non-negative matrix factors.

Problem formulation
A non-negative matrix factorization problem (NMF) is a problem of factorizing the input non-negative matrix R into the product of two lower rank non-negative matrices G and H: where R ∈ R m×n + usually corresponds to the data matrix, G ∈ R m×p + represents the basis matrix, and H ∈ R p×n + is the coefficient matrix. With p we denote the number of factors for which it is desired that p min(m, n). If we consider each of the n columns of R being a sample of m-dimensional vector data, the factorization represents each instance (column) as a non-negative linear combination of the columns of G, where the coefficients correspond to the columns of H. The columns of G can be therefore interpreted as the p pieces that constitute the data R. To compute G and H, condition (1) is usually rewritten as a minimization problem using the Frobenius norm: It is demonstrated in certain applications that the performance of the standard NMF in (NMF) can often be improved by adding auxiliary constraints which could be sparseness, smoothness, and orthogonality. Orthogonal NMF (ONMF) was introduced by Ding et al., [8]. To improve the clustering capability of the standard NMF, they imposed orthogonality constraints on columns of G or on rows of H. Considering the orthogonality on columns of G, it is formulated as follows: If we enforce orthogonality on the columns of G and on rows of H, we obtain the bi-orthogonal ONMF (bi-ONMF), which is formulated as where I denotes the identity matrix.

Related work
The NMF was firstly studied by Paatero et al., [26,1] and was made popular by Lee and Seung [17,18]. There are several different existing methods to solve (NMF). The most used approach to minimize (NMF) is a simple MU method proposed by Lee and Seung [17,18]. In Chu et al., [23], several gradient-type approaches have been mentioned. Chu et al., reformulated (NMF) as an unconstrained optimization problem, and then applied the standard gradient descent method. Considering both G and H as variables in (NMF), it is obvious that f (G, H) is a non-convex function. However, considering G and H separately, we can find two convex sub-problems. Accordingly, a block-coordinate descent (BCD) approach [18] is applied to obtain values for G and H that correspond to a local minimum of f (G, H). Generally, the scheme adopted by BCD algorithms is to recurrently update blocks of variables only, while the remaining variables are fixed. NMF methods which adopt this optimization technique are, e.g., the MU rule [17], the active-set-like method [15], or the PG method for NMF [20]. In [20], two PG methods were proposed for the standard NMF. The first one is an alternating least squares (ALS) method using projected gradients. This way, H is fixed first and a new G is obtained by PG. Then, with the fixed G at the new value, the PG method looks for a new H. The objective function in each least squares problem is quadratic. This enabled the author to use Taylor's extension of the objective function to obtain an equivalent condition with the Armijo rule, while checking the sufficient decrease of the objective function as a termination criterion in a step-size selection procedure. The other method proposed in [20] is a direct application of the PG method to (NMF). There is also a hierarchical ALS method for NMF which was originally proposed in [6,10] as an improvement to the ALS method. It consists of a BCD method with single component vectors as coordinate blocks.
As the original ONMF algorithms in [19,32] and their variants [33,34,5] are all based on the MU rule, there has been no convergence guarantee for these algorithms. For example, Ding et al., [8] only prove that the successive updates of the orthogonal factors will converge to a local minimum of the problem. Because the orthogonality constraints cannot be rewritten into a non-negatively constrained ALS framework, convergent algorithms for the standard NMF (e.g., see [20,14,13,16]) cannot be used for solving the ONMF problems. Thus, no convergent algorithm was available for ONMF until recently. Mirzal [24] developed a convergent algorithm for ONMF. The proposed algorithm was designed by generalizing the work of Lin [21] in which a convergent algorithm was provided for the standard NMF based on a modified version of the additive update (AU) technique of Lee [18]. Mirzal [24] provides the global convergence for his algorithm solving the ONMF problem. In fact, he first proves the non-increasing property of the objective function evaluated by the sequence of the iterates. Secondly, he shows that every limit point of the generated sequence is a stationary point, and finally he proves that the sequence of the iterates possesses a limit point.

Our contribution
In this paper, we consider the penalty reformulation of (bi-ONMF), i.e., we add the orthogonality constraints multiplied with penalty parameters to the objective function to obtain reformulated problems (ONMF) and (bi-ONMF). The main contributions are: • We develop an algorithm for (ONMF) and (bi-ONMF), which is essentially a BCD algorithm, in literature also known as alternating minimization, coordinate relaxation, the Gauss-Seidel method, subspace correction, domain decomposition, etc., see e.g. [4,29]. For each block optimization, we use a PG method and Armijo rule to find a suitable step-size.
• We construct synthetic data sets of instances for (ONMF) and (bi-ONMF), for which we know the optimum value by construction.
• We use MATLAB [31] to implement our algorithm and two well-known (MU-based) algorithms: the algorithm of Ding [8] and of Mirzal [24]. The code is available upon request.
• The implemented algorithms are compared on the constructed synthetic data-sets in terms of: (i) the accuracy of the reconstruction, and (ii) the deviation of the factors from orthonormality. Accuracy is measured by the so-called root-square error (RSE), defined as and deviations from orthonormality are computed using formulas (17) and (18) from Sect. 4. Our numerical results show that our algorithm is very competitive and almost always outperforms the MU algorithms.

Notations
Some notations used throughout our work are described here. We denote scalars and indices by lower-case Latin letters, vectors by lowercase boldface Latin letters, and matrices by capital Latin letters. R m×n denotes the set of m by n real matrices, and I symbolizes the identity matrix. We use the notation ∇ to show the gradient of a real-valued function. We define ∇ + and ∇ − as the positive and (unsigned) negative parts of ∇, respectively, i.e., = + − − . and denote the element-wise multiplication and the element-wise division, respectively.

Structure of the paper
The rest of our work is organized as follows. In Sect. 2, we review the well-known MU method and the rules being used for updating the factors per iteration in our computations. We also outline the global convergent MU version of Mirzal [24]. We then present our PG method and discuss the stopping criteria for it. Sect. 4 presents the synthetic data and the result of implementation of the three decomposition methods presented in Sect. 3. This implementation is done for both the problem (ONMF), as well as (bi-ONMF). Some concluding results are presented in Sect. 5.
2 Existing methods to solve (NMF) 2.1 MU method of Ding [8] Several popular approaches to solve (NMF) are based on so-called MU algorithms, which are simple to implement and often yield good results. The MU algorithms originate from the work of Lee and Seung [18]. Various MU variants were later proposed by several researchers, for an overview see [7]. At each iteration of these methods, the elements of G and H are multiplied by certain updating factors.
As already mentioned, (ONMF) was proposed by Ding et al., [8] as a tool to improve the clustering capability of the associated optimization approaches. To adapt the MU algorithm for this problem, they employed standard Lagrangian techniques: they introduced the Lagrangian multiplier Λ (a symmetric matrix of size p × p) for the orthogonality constraint, and minimized the Lagrangian function where the orthogonality constraint is moved to the objective function as the penalty term Trace(Λ(G T G − I)). The complementarity conditions from the related KKT conditions can be rewritten as a fixed point relation, which finally can lead to the following MU rule for (ONMF): (3) They extended this approach to non-negative three factor factorization with demand that two factors satisfy orthogonality conditions, which is a generalization of (bi-ONMF). The MU rules (28)-(30) from [8], adapted to (bi-ONMF), are the main ingredients of Algorithm 1, which we will call Ding's algorithm. Algorithm 1 converges in the sense that the solution pairs G and Algorithm 1. Ding's MU algorithm for (bi-ONMF) INPUT: R ∈ R m×n + , p ∈ N 1. Initialize: generate G ≥ 0 as an m × p random matrix and H ≥ 0 as a p × n random matrix.
If R has zero vector as columns or rows, a division by zero may occur. In contrast, denominators close to zero may still cause numerical problems. To escape this situation, we follow [28] and add a small positive number δ to the denominators of the MU terms (4). Note that Algorithm 1 can be easily adapted to solve (ONMF) by replacing the second MU rule from (4) with the second MU rule of (3).

MU method of Mirzal [24]
In [24], Mirzal proposed an algorithm for (ONMF) which is designed by generalizing the work of Lin [21]. Mirzal used the so-called modified additive update rule (the MAU rule), where the updated term is added to the current value for each of the factors. This additive rule has been used by Lin in [21] in the context of a standard NMF. He also provided in his paper a convergence proof, stating that the iterates generated by his algorithm converge in the sense that RSE is decreasing and the limit point is a stationary point. In [24], Mirzal discussed the orthogonality constraint on the rows of H, while in [25] the same results are developed for the case of (bi-ONMF).
Here we review the Mirzal's algorithm for (bi-ONMF), presented in the unpublished paper [25]. This algorithm actually solves the equivalent problem (pen-ONMF) where the orthogonality constraints are moved into the objective function (the so-called penalty approach), and the importance of the orthogonality constraints are controlled by the penalty parameters α, β: The gradients of the objective function with respect to G and H are: For the objective function in (pen-ONMF), Mirzal proposed the MAU rules along with the use ofḠ = (ḡ) ij andH = (h) ij , instead of G and H, to avoid the zero locking phenomenon [24, Section 2]:ḡ where ν is a small positive number.
Note that, the algorithms working with the MU rules for (pen-ONMF) must be initialized with positive matrices to avoid zero locking from the start, but non-negative matrices can be used to initialize the algorithm working with the MAU rules (see [25]).
Mirzal [25] used the MAU rules with some modifications by consideringḠ andH in order to guarantee the non-increasing property, with a constant step to make δ G and δ H grow in order to satisfy the property. Here, δ G and δ H are the values added within the MAU terms to the denominator of update terms for G and H, respectively. The proposed algorithm by Mirzal [25] is summarised as Algorithm 2 below.

Main steps of PG method
In this subsection we adapt the PG method proposed by Lin [20] to solve both (ONMF) as well as (bi-ONMF). Lin applied PG to (NMF) in two ways. The first approach is actually a Algorithm 2. Mirzal's algorithm for bi-ONMF [25] INPUT: inner dimension p, maximum number of iterations: maxit; small positive δ, small positive step to increase δ.
BCD method. This method consecutively fixes one block of variables (G or H) and minimizes the simplified problem in the other variable. The second approach by Lin directly minimizes (NMF). Lin's main focus was on the first approach and we follow it. We again try to solve the penalised version of the problem (pen-ONMF) by the block coordinate descent method, which is summarised in Algorithm 3.
The objective function in (pen-ONMF) is not quadratic any more, so we lose the nice properties about Armijo's rule that represent advantages for Lin. We managed to use the Armijo rule directly and still obtained good numerical results, see Sect. 4.
We refer to (8) or (9) as sub-problems. Obviously, solving these sub-problems in every iteration could be more costly than Algorithms 1-2. Therefore, we must find effective methods for solving these sub-problems. Similarly to Lin, we apply the PG method to solve the subproblems (8) - (9). Algorithm 4 contains the main steps of the PG method for solving the latter and can be straightforwardly adapted for the former.
For the sake of simplicity, we denote by F H the function that we optimize in (8), which is actually a simplified version (pure H terms removed) of the objective function from (pen-ONMF) for H fixed: . Similarly, for G is fixed, the objective function from (9) will be denoted by:

Repeat
Fix H := H k and compute new G as follows: Fix G := G k+1 and compute new H as follows: In Algorithm 4, P is the projection operator which projects the new point (matrix) on the cone of non-negative matrices (we simply put negative entries to 0). Inequality (10) shows the Armijo rule to find a suitable step-size guaranteeing a sufficient decrease. Searching for λ k is a time-consuming operation, therefore we strive to do only a small number of trials for new λ in Step 3.1. Similarly to Lin [20], we allow for λ any positive value. More precisely, we start with λ = 1 and if the Armijo rule (10) is satisfied, we increase the value of λ by dividing it with γ < 1. We repeat this until (10) is no longer satisfied or the same matrix H λ as in the previous iteration is obtained. If the starting λ = 1 does not yield H λ which would satisfy the Armijo rule (10), then we decrease it by a factor γ and repeat this until (10) is satisfied. The numerical results obtained using different values of parameters γ (updating factor for λ) and σ (parameter to check (10)) are reported in the following subsections.

Stopping criteria for Algorithms 3 and 4
As practiced in the literature (e.g. see [22]), in a constrained optimization problem with the non-negativity constraint on the variable x, a common condition to check whether a point x k is close to a stationary point is Algorithm 4. PG method using Armijo rule to solve sub-problem (9) INPUT: 0 < σ < 1, γ < 1, and initial H 0 .
Find a λ (using updating factor γ) such that for 3. Until some stopping criteria is satisfied.
where f is the differentiable function that we try to optimize and ∇ P f (x k ) is the projected gradient defined as and ε is a small positive tolerance. For Algorithm 3, (11) becomes We impose a time limit in seconds and a maximum number of iterations for Algorithm 4 as well. Following [20], we also define stopping conditions for the sub-problems. The matrices G k+1 and H k+1 returned by Algorithm 4, respectively, must satisfy whereε and ε is the same tolerance used in (13). If the PG method for solving the sub-problem (8) or (9) stops after the first iteration, then we decrease the stopping tolerance as follows: where τ is a constant smaller then 1.

Numerical results
In this section we demonstrate, how the PG method described in Sect. 3, performs compared to the MU-based algorithms of Ding and Mirzal, which were described in Subsections 2.1 and 2.2, respectively.

Artificial data
We created two sets of synthetic data using MATLAB [31]. The first set we call bi-orthonormal set (BION). It consists of instances of matrix R ∈ R n×n + , which were created as products of G and H, where G ∈ R n×k + has orthonormal columns while H ∈ R k×n + has orthonormal rows. We created five instances of R, for each pair (n, k 1 ) and (n, k 2 ) from Table 1.
Matrices G were created in two phases: firstly, we randomly (uniform distribution) selected a position in each row; secondly, we selected a random number from (0, 1) (uniform distribution) for the selected position in each row. Finally, if it happens that after this procedure some column of G is zero or has a norm below 10 −8 , we find the first non-zero element in the largest column of G (according to Euclidean norm) and move it into the zero column. We created H similarly. Each triple (R, G, H) was saved as a triple of txt files. For example, NMF BIOG data R n=200 k=80 id=5.txt contains 200 × 200 matrix R obtained by multiplying matrices G ∈ R 200×80 and H ∈ R 80×200 , which were generated as explained above. With id=5, we denote that this is a 5th matrix corresponding to this pair (n, k). The second set contains similar data to BION, but only one factor (G) is orthonormal, while the other (H) is nonnegative but not necessarily orthonormal. We call this dataset uni-orthonormal (UNION). All computations are done using MATLAB [31] and a high performance computer available at Faculty of Mechanical Engineering of University of Ljubljana. This is Intel Xeon X5670 (1536 hyper-cores) HPC cluster and an E5-2680 V3 (1008 hyper-cores) DP cluster, with an IB QDR interconnection, 164 TB of LUSTRE storage, 4.6 TB RAM and with 24 TFlop/s performance.

Numerical results for UNION
In this subsection, we present numerical results, obtained by Ding's, Mirzal's, and our algorithm for a uni-orthogonal problem (ONMF), using the UNION data, introduced in the previous subsection. We have adapted the last two algorithms (Algorithms 2, 3) for UNION data by setting α = 0 in the problem formulation (bi-ONMF) and in all formulas underlying these two algorithms.
Recall that for UNION data we have for each pair n, k from Table 1 five symmetric matrices R for which we try to solve (ONMF) by Algorithms 1, 2 and 3. Note that all these algorithms demand as input the internal dimension k, i.e. the number of columns of factor G, which is in general not known in advance. Even though, we know this dimension by construction for UNION data, we tested the algorithms using internal dimensions p equal to 20%, 40%, . . . , 100% of k. For p = k, we know the optimum of the problem, which is 0, so for this case we can also estimate how good are the tested algorithms in terms of finding the global optimum.
The first question we had to answer was which value of β to use in Mirzal's and PG algorithms. It is obvious that larger values of β moves the focus from optimizing the RSE to guaranteeing the orthonormality, i.e., feasibility for the original problem. We decided not to fix the value of β but to run both algorithms for β ∈ {1, 10, 100, 1000} and report the results.
For each solution pair G, H returned by all algorithms, the non-negativity constraints are held by the construction of algorithms, so we only need to consider deviation of G from orthonormality, which we call infeasibility and define it as The computational results that follow in the rest of this subsection were obtained by setting the tolerance in the stopping criterion to ε = 10 −10 , the maximum number of iterations to 1000 in Algorithm 3 and to 20 in Algorithm 4. We also set a time limit to 3600 seconds. Additionally, for σ and γ (updating parameter for λ in Algorithm 4) we choose 0.001 and 0.1, respectively. Finally, for τ from (16) we set a value of 0.1.
In general, Algorithm 3 converges to a solution in early iterations and the norm of the projected gradient falls below the tolerance shortly after running the algorithm. Tables 2 and 3

Numerical results for bi-orthonormal data (BION)
In this subsection we provide the same type of results as in the previous subsection, but for the BION dataset.
We used almost the same setting as for UNION dataset: ε = 10 −10 , maxit = 1000, σ = 0.001 and time limit = 3600s. Parameters γ, τ were slightly changed (based on experimental observations): γ = 0.75 and τ = 0.5. Additionally, we decided to take the same values for α, β in Algorithms 2 and 3, since the matrices R in BION dataset are symmetric and both orthogonality constraints are equally important. We computed the results for values of α = β from {1, 10, 100, 1000}. In Tables 4-5 we report average RSE and average infeasibility, respectively, of the solutions obtained by Algorithms 1, 2, and 3. Since for this dataset we need to monitor how orthonormal are both matrices G and H, we adapt the measure for infeasibility as follows:  not have a big impact on RSE and infeasibility for Algorithm 3, a significant difference can be observed only when the internal dimension is equal to the real internal dimension, i.e., when p = 100%. Based on these numerical results, we can conclude that smaller β achieve better RSE and almost the same infeasibility, so it would make sense to use β = 1.
For Algorithm 2 these differences are bigger and it is less obvious which β is appropriate. Again, if RSE is more important then smaller values of β should be taken, otherwise larger values.

Concluding remarks
We presented a projected gradient method to solve the orthogonal non-negative matrix factorization problem. We penalized the deviation from orthonormality with some positive parameters and added the resulted terms to the objective function of the standard non-negative matrix factorization problem. Then, we considered minimizing the resulted objective function under the non-negativity conditions only, in a block coordinate decent approach.
The method was tested on two sets of synthetic data, one containing uni-orthonormal matrices and the other containing bi-orthonormal matrices. Different values for the adjusting parameters of orthogonality were applied in the implementation to determine good pairs of such values. The performance of our algorithm was compared with two algorithms based on multiplicative updates rules. Algorithms were compared regarding the quality of factorization   Table 2. It contains six plots which illustrate the quality of Algorithms 1, 2 and 3 regarding RSE on UNION instances with n = 100, 500, 1000, for β ∈ {1, 10, 100, 1000}. We can see that regarding RSE the performance of these algorithms on this dataset does not differ a lot. As expected, larger values of β yield larger values of RSE, but the differences are rather small. However, when p approached 100 % of k, Algorithm 3 comes closest to the global optimum RSE = 0.
(RSE) and how much the resulting factors deviate from orthonormality.
We provided an extensive list of numerical results which demonstrate that our method is very competitive and outperforms the others.    Table 3. It contains six plots which illustrate the quality of Algorithms 1, 2 and 3 regarding infeasibility on UNION instances with n = 100, 500, 1000, for β ∈ {1, 10, 100, 1000}. We can see that regarding infeasibility the performance of these algorithms on this dataset does not differ a lot. As expected, larger values of β yield smaller values of infeasG, but the differences are rather small.     Table 4 RSE obtained by Algorithms 1, 2 and 3 on the BION data. For the latter two algorithms, we used α = β ∈ {1, 10, 100, 1000}. For each n ∈ {50, 100, 200, 500, 1000} we take all ten matrices R (five of them corresponding to k = 0.2n and five to k = 0.4n). We run all three algorithms on these matrices with inner dimensions p ∈ {0.2k, 0.4k, . . . , 1.0k} with all possible values of α = β. Like before, each row represents the average (arithmetic mean value) of RSE obtained on instances corresponding to given n and given p as a percentage of k. We can see that the larger the β, the worse the RSE, which is consistent with expectations.
[5] Choi, S. Algorithms for orthogonal nonnegative matrix factorization.   Table 4. We can observe that with these settings of all algorithms we can bring infeasibility to order of 10 −3 very often, for all values of β.