Block-Active ADMM to Minimize NMF with Bregman Divergences

Over the last ten years, there has been significant interest in employing nonnegative matrix factorization (NMF) to reduce dimensionality and enable more efficient clustering analysis in machine learning. The technique has been applied in various image processing applications within the fields of computer vision and sensor-based systems. Many algorithms exist to solve the NMF problem; among them, the alternating direction method of multipliers (ADMM) and its variants are among the most popular in practice. In this paper, we propose a block-active ADMM method to minimize the NMF objective with general Bregman divergences. The subproblems in the ADMM are solved iteratively by a block-coordinate-descent-type (BCD-type) method in which each block is chosen directly from the stationary condition. As a result, we use far fewer auxiliary variables, and the proposed algorithm converges faster than previously proposed algorithms. From a theoretical point of view, the proposed algorithm is proved to converge to a stationary point sublinearly. We also conduct a series of numerical experiments to demonstrate the superiority of the proposed algorithm.


Overview of the Matrix Factorization Algorithms
Unsupervised learning is a form of machine learning in which models are trained on unlabeled data to classify patterns or make inferences without any external guidance or supervision. The key advantage of unsupervised learning is its ability to uncover hidden structures and relationships within datasets that may not be readily apparent through manual inspection, enabling the automatic discovery of insights and patterns in large datasets. However, working with large datasets can be challenging because they are often noisy and high-dimensional, which makes their processing and analysis difficult. To address this challenge, researchers often use dimensionality reduction techniques to extract meaningful features from the data. By reducing the dimensionality of the data, the computational cost of training and classification algorithms can be reduced. Dimensionality reduction is achieved by reducing the size of the feature vector, which is the input data used to train and test machine learning models. Unsupervised learning methods factor the data matrix subject to various constraints to achieve such dimensionality reduction. Depending on the constraints, the resulting factors give significantly different data representations. Principal component analysis (PCA) [1] enforces no constraints on the factorization. Consequently, PCA achieves an optimal low-dimensional approximation to the data matrix while retaining as much of the original variation as possible. For this reason, PCA has been widely applied in various applications such as face recognition [2][3][4] and document representation [5][6][7].
In parallel, previous studies have shown that there is psychological and physiological evidence for parts-based representation in the human brain [8][9][10]. Nonnegative matrix factorization (NMF) [8] has been proposed to learn the parts of objects by enforcing nonnegativity constraints. In particular, NMF approximates the data matrix by a product of two nonnegative matrices. The nonnegativity constraints are useful for learning a parts-based representation of the data because they allow only additive combinations. It has been shown that NMF is superior to PCA in fields that use nonnegative datasets, such as face recognition [11][12][13][14], document clustering [15,16], audio signal processing [17,18], and recommendation systems [19,20].

Nonnegative Matrix Factorization
Nonnegative Matrix Factorization (NMF) is a technique for factorizing a matrix into two nonnegative matrices, denoted as W and H. This method is distinct because it constrains all elements in W and H to be nonnegative. To grasp the concept of NMF, it is essential to comprehend the underlying intuition behind matrix factorization.
As shown in Figure 1, suppose we have a matrix V of size m × n, where each element is greater than or equal to zero. With NMF, we can decompose V into two matrices: W of size m × k and H of size k × n, where k is a chosen rank, and both W and H have only nonnegative elements. That is, V is approximated as V ≈ WH. The primary goal of NMF is dimensionality reduction and feature extraction: by specifying a lower dimension k, NMF identifies two matrices, W ∈ R^{m×k} and H ∈ R^{k×n}, containing only nonnegative elements, as illustrated in Figure 1.
Specifically, by setting k ≤ min(m, n), the factorization breaks the original matrix V into two smaller matrices. Dimensionality reduction occurs because the original matrix V of size m × n is represented as the product of a matrix W of size m × k and a matrix H of size k × n, reducing the storage from m × n entries to m × k + k × n entries. Note that m × k + k × n ≤ m × n whenever k ≤ mn/(m + n), which holds for the small ranks used in practice. Typical machine learning algorithms, including statistical machine learning and deep learning methods such as convolutional neural networks (CNNs), take time superlinear in the input size, and both training time and classification time depend heavily on the size of the feature space. NMF reduces both the input size and the feature vector size; since training and classification times are superlinear, the efficiency of both improves significantly with NMF.
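As a quick illustration of the factorization and the storage reduction, the following sketch runs the classical multiplicative updates of Lee and Seung for the Euclidean loss (a baseline discussed later in the paper, not the algorithm proposed here; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 5              # illustrative sizes with k << min(m, n)
V = rng.random((m, n))            # nonnegative data matrix

# Random nonnegative initialization of the factors.
W = rng.random((m, k))
H = rng.random((k, n))

err0 = np.linalg.norm(V - W @ H)  # initial approximation error

# Multiplicative updates for the Euclidean loss.
eps = 1e-12                       # guard against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# The factors store m*k + k*n numbers instead of m*n.
print(m * k + k * n, "<", m * n)
print(np.linalg.norm(V - W @ H) < err0)   # the error has decreased
```

The updates keep W and H nonnegative by construction, since each step multiplies the current iterate by a ratio of nonnegative quantities.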
The underlying assumption of NMF is that the input comprises a set of latent features, each of which is represented by a column in the W matrix. Moreover, each column in the H matrix represents the "coordinates of a data point" with respect to the W matrix, essentially holding the weights applied to the columns of W. In essence, each data point represented by a column of V can be approximated by a summation of the nonnegative columns of W weighted by the corresponding column of H.

Image Processing-Facial Feature Extraction
In order to better understand the intuition behind the NMF algorithm, we consider real-world scenarios, specifically the application of the algorithm to image processing. Suppose we have an input image consisting of pixels that form matrix X. NMF produces two factors (W, H) such that each image X(:, j) is approximated as a linear combination of the columns in W. As shown in Figure 2, for facial images, the columns of W can be interpreted as basic images consisting of features such as eyes, noses, mustaches, and lips. The columns of H indicate the presence of these features in the corresponding X image.

Contributions
This paper makes the following contributions revolving around a novel algorithm for tackling NMF challenges:

1.
We present a coordinate descent approach coupled with an innovative strategy for selecting coordinates to address the ADMM subproblems.

2.
In contrast to the classic ADMM and multiplicative update methods, our proposed algorithm attains a notably lower error level and exhibits better convergence behavior: greater stability, smoother trajectories, and faster convergence.

3.
We establish the effectiveness of our approach through a rigorous theoretical analysis and substantiate our claims via an array of comprehensive experiments conducted on synthetic and real datasets in Section 6. These experiments collectively serve to underscore the superior performance and potential of our novel methodology.

1.
Comparing the proposed method in Algorithm 3 with the classical ADMM, we use far fewer primal and dual decision variables. Specifically, the ADMM in Algorithm 1 introduces new primal variables W^+ and H^+ and dual variables α_X, α_W, and α_H, whereas the proposed method in Algorithm 3 introduces no new primal variables and only one dual variable, α_X. This improves the efficiency of the algorithm.

2.
In Section 4, we introduce a new approach, termed the "block-active method", designed to tackle the problem formulated in (14). Our central result, established in Theorem 4, rigorously demonstrates that under reasonable assumptions, our proposed method converges to a stationary point x^* of Equation (15) at a sublinear rate.
To expound on this, we demonstrate that the error, defined by f(x^k) − f(x^*) on the left-hand side of the inequality in Theorem 4, consistently diminishes. This reduction is characterized by the relation f(x^k) − f(x^*) = O(k^{-1}), indicating that the error declines to zero as the iteration count k approaches infinity. This type of convergence is called sublinear [22] because the per-iteration error reduction diminishes over the iterations. It stands in contrast to linear convergence, typified by expressions such as γ^k for some constant 0 < γ < 1, where the reduction factor remains constant.
3.
NMF finds applications in tasks such as face recognition, document clustering, audio signal processing, and recommendation systems. When NMF is employed in these applications to solve analogous optimization problems, the theoretical results carry over unchanged.

4.
The image resolution may or may not affect the results. Since NMF is a nonconvex optimization problem, a global minimum cannot be guaranteed in the general setup. The quality of the solution a method converges to depends on several factors, such as the initialization of W, H, and X and the penalty parameter ρ. Thus, improving the resolution or the quality of the data may or may not improve the result.

Paper Organization
This paper is organized as follows. Section 2 introduces the NMF problem and places it within the context of related existing research. Section 3 provides a concise overview of the widely recognized ADMM. Section 4 presents the proposed block-active method, a coordinate descent approach coupled with an innovative strategy for selecting coordinates to address the ADMM subproblems. Section 5 introduces an ADMM-style approach for addressing the NMF problem (2). Section 6 assesses the experimental outcomes on both synthetic and real datasets, and Section 7 concludes the paper.

NMF Problem and Previous Work
The NMF problem can be formulated as follows:

min_{W ≥ 0, H ≥ 0} D(V | WH),    (2)

where D(V|Ṽ) represents some measure of divergence between V and its approximation Ṽ = WH. A general family of divergence functions is the β-divergence, denoted by D_β. The β-divergence between two matrices is defined as the sum of the elementwise divergences, i.e., D_β(V|Ṽ) = Σ_{ij} d_β(V_ij | Ṽ_ij). The three β-divergence functions most commonly used with NMF in practice are the Euclidean distance (β = 2), the Kullback-Leibler (KL) divergence (β = 1), and the Itakura-Saito (IS) divergence (β = 0), which model Gaussian noise, Poisson noise, and multiplicative gamma noise, respectively. In particular, we have d_2(x|y) = (x − y)²/2, d_1(x|y) = x log(x/y) − x + y, and d_0(x|y) = x/y − log(x/y) − 1. In the literature on NMF, many algorithms have been proposed to solve Problem (2) for β = 2, including multiplicative updates (MU) [23][24][25][26], projected gradient descent (PGD) [27], hierarchical alternating least squares (HALS) [28][29][30], the alternating direction method of multipliers (ADMM) [31,32], and alternating nonnegative least squares (ANLS) [33]. Unfortunately, few works have addressed the NMF problem with a general Bregman divergence. In this paper, we propose an ADMM that can solve the NMF problem with a general Bregman divergence. We are not the first to propose an ADMM method for the NMF problem; for example, reference [31] also proposed an ADMM method in which each subproblem has a closed-form solution. However, our method introduces far fewer auxiliary variables, and we use an iterative method to solve each subproblem. As a result, the proposed algorithm converges much faster than previously proposed algorithms. We provide a theoretical analysis of the proposed algorithm and conduct a series of numerical experiments to demonstrate its performance.
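The three special cases above can be collected into a single elementwise function. The following sketch is ours; the general-β branch uses the standard β-divergence formula for β outside {0, 1}:

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of elementwise beta-divergences d_beta(x|y).

    beta = 2: Euclidean distance, beta = 1: KL divergence,
    beta = 0: Itakura-Saito divergence.  Entries must be
    strictly positive for beta <= 1.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 2:
        return 0.5 * np.sum((x - y) ** 2)
    if beta == 1:
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:
        return np.sum(x / y - np.log(x / y) - 1)
    # General formula for beta outside {0, 1}.
    return np.sum((x ** beta + (beta - 1) * y ** beta
                   - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))

print(beta_divergence([1.0, 2.0], [1.0, 2.0], 1))   # 0.0 when x == y
```

Each divergence is nonnegative and vanishes exactly when the two arguments agree elementwise, which is what makes them usable as a reconstruction loss in (2).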

Alternating Direction Method of Multipliers
In this section, we provide a brief review of the well-known alternating direction method of multipliers (ADMM). We consider an optimization problem formulated as follows:

min_x f(x)  subject to  Ax = b,    (5)

where f : R^n → R is the objective function, x ∈ R^n is the decision variable, and A ∈ R^{m×n} and b ∈ R^m are a given matrix and vector. Since the objective function f could be nonconvex, searching for a global solution is not easy. Instead, the common pursuit is to find a stationary point of the problem. A stationary point of (5) is a vector x^* for which there exists a multiplier y^* such that ∇f(x^*) + A^T y^* = 0 and Ax^* = b. The ADMM method can help us find such a vector x^*. In particular, the augmented Lagrangian function is given by

L_ρ(x, y) = f(x) + y^T(Ax − b) + (ρ/2)‖Ax − b‖²,

where y ∈ R^m is the Lagrange multiplier and ρ > 0 is a penalty parameter. The ADMM method then updates x and y by optimizing L_ρ(x, y) alternately. In particular, suppose we are given feasible vectors (x^k, y^k) at the kth iteration. Then,

x^{k+1} = argmin_x L_ρ(x, y^k),
y^{k+1} = y^k + ρ(Ax^{k+1} − b),

where x^{k+1} is a global minimizer of L_ρ(x, y^k) for fixed y^k, and y^{k+1} is the result of a one-step gradient ascent with step size ρ. The step size for the y-update could differ from the penalty parameter ρ, i.e., y^{k+1} = y^k + α(Ax^{k+1} − b) for some α ≠ ρ; however, it is common to use α = ρ. We then obtain a sequence {x^k, y^k} of vectors, and [34] shows that this sequence converges to a stationary point (x^*, y^*) that satisfies (3). The paper [31] adapts the ADMM framework to solve the NMF problem with the Bregman divergence. They first reformulate problem (2) by introducing additional auxiliary variables as follows:

min D(V|X)  subject to  X = WH, W = W^+, H = H^+, W^+ ≥ 0, H^+ ≥ 0,

where X, W^+, and H^+ are additional auxiliary variables. The corresponding augmented Lagrangian function of (2) involves the Lagrange multipliers α_X, α_W, and α_H. The updates minimize L_ρ alternately with respect to each primal variable, followed by a gradient-ascent step in each of the Lagrange multipliers. The algorithm is summarized in Algorithm 1.
Here, the notation A \ b denotes the Matlab backslash operator, which solves the linear system Ax = b, that is, it returns A^{-1}b (computed without explicitly forming the inverse).
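As a toy illustration of the two ADMM updates above (not the NMF algorithm of this paper), consider minimizing ½‖x − c‖² subject to Ax = b, where the x-update of the augmented Lagrangian has a closed form obtained by solving (I + ρAᵀA)x = c − Aᵀy + ρAᵀb; all names below are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, rho = 5, 2, 1.0
A, b, c = rng.random((m, n)), rng.random(m), rng.random(n)

x, y = np.zeros(n), np.zeros(m)
M = np.eye(n) + rho * (A.T @ A)        # x-update system matrix
for _ in range(500):
    # x-update: exact minimizer of L_rho(x, y) for the current y
    x = np.linalg.solve(M, c - A.T @ y + rho * (A.T @ b))
    # y-update: one gradient-ascent step with step size rho
    y = y + rho * (A @ x - b)

print(np.linalg.norm(A @ x - b))       # feasibility residual, near zero
```

The feasibility residual ‖Ax − b‖ driven to zero is exactly the quantity that the dual ascent step penalizes at each iteration.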

Block-Active Method
In this section, we consider a constrained optimization problem formulated as follows:

min_x f(x)  subject to  x ≥ 0,    (14)

where f is a convex objective function and x ∈ R^n is the decision variable. Section 5 proposes a new ADMM-type method in which problem (14) appears as an important subproblem. We propose to use a block coordinate descent (BCD) method to solve problem (14). In general, a BCD method picks a block of coordinates of the decision variable and minimizes the objective function only with respect to the selected block of coordinates.
In particular, let x^k be the current feasible point at the kth iteration and let i_k be the selected coordinate. The update rule then takes a step along e_{i_k}, where e_{i_k} is a vector whose entries are all zeros except the i_k th entry, which equals 1. How to select the block is significant in a BCD method. In general, there are three selection rules: cyclic, random, and greedy. In the cyclic rule, blocks are selected cyclically; in the random rule, each block is selected at random. In the greedy rule, the block with the largest magnitude of partial derivative is selected, that is, i_k = argmax_i |∇_i f(x^k)|.
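The greedy (Gauss-Southwell) rule can be sketched on an unconstrained strongly convex quadratic, with exact minimization along the chosen coordinate; the setup is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
R = rng.random((n, n))
Q = R @ R.T + n * np.eye(n)            # positive definite Hessian
p = rng.random(n)
grad = lambda x: Q @ x - p             # gradient of f(x) = x'Qx/2 - p'x

x = np.zeros(n)
for _ in range(1000):
    g = grad(x)
    i = int(np.argmax(np.abs(g)))      # greedy: largest |partial derivative|
    x[i] -= g[i] / Q[i, i]             # exact minimization along coordinate i

print(np.max(np.abs(grad(x))))         # near zero at the minimizer
```

For a quadratic, minimizing along coordinate i has the closed form shown (step −g_i/Q_ii), so each iteration only needs one row of Q, which is what makes coordinate methods cheap per iteration.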

Block-Active Method
Here, we propose a new block coordinate method, called the block-active method, to solve problem (14), where the block is selected based on the stationary condition. Note that a vector x^* is a stationary point of (14) if it satisfies

x^* ≥ 0, ∇f(x^*) ≥ 0, and x_i^* ∇_i f(x^*) = 0 for all i.    (15)

Equation (15) is called the stationary condition. At each iteration, we collect the coordinates that do not satisfy the stationary condition. In particular, let x ≥ 0 be a feasible point. We construct an index set F of such coordinates as in (16). Note that F also includes some extra coordinates beyond those violating the stationary condition. Later on, we show that if x is already a stationary point, including these extra coordinates does not make the block-active method move away from the stationary point. Instead, if x is not a stationary point, then, with the help of a scale matrix H, including these extra coordinates makes the proposed method converge faster.
Given the index set F, we define vectors g and d as in (17), where g ∈ R^{|F|}, d ∈ R^{|F|}, and H ∈ R^{|F|×|F|} is a strictly positive definite (p.d.) matrix. Given a scalar α > 0, we define a single-variable function x(α) as in (18). From Equation (18), we can see that only part of the vector x is selected and updated; the subvector of x is selected based on the stationary condition (16). The algorithm is summarized in Algorithm 2. As noted by [35], our method has the potential to be extended in a distributed manner.

Algorithm 2:
Block-active method to minimize (14)
Initialize x^0
for iteration = 1, 2, · · · do
    Select index set F^k based on the stationary condition (16)
    Compute g and d according to (17)
    Choose an appropriate step size α^k
    x^{k+1} = x^k(α^k) by (18)
end
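For concreteness, here is a minimal Python sketch of the block-active idea applied to the nonnegative least-squares objective f(x) = ½‖Ax − b‖². The active set follows the stationary condition, and the scale matrix is the submatrix of AᵀA on the active set; the step-size rule (simple backtracking with a projected-gradient fallback) is our simplification, not the paper's exact rule:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 8
A, b = rng.random((m, n)), rng.random(m)
AtA, Atb = A.T @ A, A.T @ b
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

x = np.zeros(n)
for _ in range(500):
    g = AtA @ x - Atb                  # gradient of f
    # Active set from the stationary condition: coordinates with x_i > 0,
    # or sitting at the boundary with a negative partial derivative.
    F = np.where((x > 1e-12) | (g < -1e-12))[0]
    if F.size == 0:
        break                          # stationary point reached
    d = np.zeros(n)
    d[F] = -np.linalg.solve(AtA[np.ix_(F, F)], g[F])   # scaled step on F
    # Backtracking projected line search; fall back to a plain projected
    # gradient step if the scaled direction yields no decrease.
    moved = False
    for direction in (d, -g):
        alpha = 1.0
        while alpha > 1e-12:
            x_new = np.maximum(x + alpha * direction, 0.0)
            if f(x_new) < f(x):
                x, moved = x_new, True
                break
            alpha *= 0.5
        if moved:
            break
    if not moved:
        break                          # numerically stationary

g = AtA @ x - Atb
print(g.min(), np.abs(x * g).max())    # stationarity and complementarity
```

At termination, the iterate satisfies the three parts of the stationary condition (15) up to numerical tolerance: nonnegativity of x, nonnegativity of the gradient, and complementarity.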
Theorem 1 (Descent property). If x is not a stationary point, then there exists ᾱ > 0 such that f(x(α)) < f(x) for all 0 < α ≤ ᾱ.
Definition 1. A function g is called L-smooth if ∇g is Lipschitz continuous with constant L > 0; in particular, there exists a strictly positive scalar L > 0 for which ‖∇g(x) − ∇g(y)‖ ≤ L‖x − y‖ for all x, y.
Given that a function g is L-smooth, we have the well-known descent lemma: g(y) ≤ g(x) + ∇g(x)^T(y − x) + (L/2)‖y − x‖² for all x, y.
Suppose the objective function f in (14) is L-smooth. It then follows from the descent lemma that the sequence {f(x^k)} generated by the block-active method decreases monotonically and converges to f(x^*) sublinearly.

Theorem 2 (Convergence result). Assume the objective function f is L-smooth and the smallest eigenvalue λ_min of the scale matrix H is strictly positive. Then the sequence {f(x^k)} generated by the block-active method satisfies f(x^k) − f(x^*) = O(k^{-1}),
where ‖z‖²_H := Σ_{i,j∈F} z_i H_ij z_j.
In the above theorem, we demonstrate that the error term f(x^k) − f(x^*) consistently diminishes. This reduction is characterized by the relation f(x^k) − f(x^*) = O(k^{-1}), indicating a gradual decline to zero as the iteration count k approaches ∞. This type of convergence behavior is called sublinear [22]. It stands in contrast to a linear convergence rate of the form γ^k for some constant γ ∈ (0, 1), where the reduction factor is constant. On the other hand, the L-smoothness assumption on the objective function can be dropped while convergence is still ensured, using the arguments introduced in [36,37] for non-Lipschitz optimization. The proofs of Theorems 1 and 2 can be found in Appendices A and B.
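The gap between the two rates is easy to see numerically: an O(1/k) error needs roughly ten times as many iterations per extra digit of accuracy, whereas a γ^k error gains a fixed number of digits per iteration (the numbers below are illustrative):

```python
# Error after k iterations: sublinear C/k versus linear gamma^k.
gamma = 0.9
for k in (10, 100, 1000):
    print(k, 1.0 / k, gamma ** k)
```

Note that the linear rate can start out slower (0.9^10 is larger than 1/10) yet eventually dominates by an enormous margin, which is why sublinear guarantees are considered weaker.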

Block-Active ADMM
In this section, we propose a new ADMM-type method to solve the NMF problem (2) by using the block-active method to solve the subproblems. In particular, since the intermediate quantity X = WH needs to be updated repeatedly whenever the matrices W and H are updated, we directly introduce this quantity as a new variable in the optimization problem. Thus, the NMF problem becomes

min D(V|X)  subject to  X = WH, W ≥ 0, H ≥ 0.    (23)

Since the ADMM framework is good at dealing with equality constraints, we propose a new algorithm based on the ADMM framework by introducing one dual variable α_X. The corresponding augmented Lagrangian function is given by

L_ρ(X, W, H, α_X) = D(V|X) + ⟨α_X, X − WH⟩ + (ρ/2)‖X − WH‖²_F.

The updates alternately optimize L_ρ with respect to each of the three primal variables, followed by one update on the dual variable. The updates are summarized as follows.
Since the optimization with respect to X does not have any constraint, X^+ has a closed-form solution obtained by setting the gradient of L_ρ with respect to X to zero. The subproblems in W and H carry the nonnegativity constraints, so they can be solved by the method proposed in the previous section, the block-active method. In particular, we can choose the scaling matrix Q as the submatrix of W^T W corresponding to the index set F. Since W ∈ R^{M×K} and M ≫ K, the matrix W^T W is highly likely to be strictly positive definite, so Q is a submatrix of W^T W. Moreover, we denote by nnls_blockactive(A, B) the procedure proposed in the previous section for solving nonnegativity-constrained least-squares problems. Algorithm 3 is provided as an example of the proposed block-active ADMM method for the case β = 1 in the β-divergence distance.
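As a concrete instance of the unconstrained X-update, for β = 2 (Euclidean distance) setting the gradient of L_ρ with respect to X to zero gives the closed form X^+ = (V + ρWH − α_X)/(1 + ρ); the following check of this algebra is our own derivation from the augmented Lagrangian above:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k, rho = 6, 5, 3, 1.0
V = rng.random((m, n))
W, H = rng.random((m, k)), rng.random((k, n))
alpha_X = rng.standard_normal((m, n))

# Closed-form X-update for beta = 2 (X is unconstrained here).
X = (V + rho * (W @ H) - alpha_X) / (1.0 + rho)

# The gradient of the augmented Lagrangian with respect to X vanishes:
# grad = -(V - X) + alpha_X + rho * (X - W H) = 0 at the update.
grad = -(V - X) + alpha_X + rho * (X - W @ H)
print(float(np.abs(grad).max()))       # ~ 0 up to rounding
```

For other values of β the X-update is no longer linear in X, which is one reason an iterative inner solver is useful.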

Algorithm 3: Block active ADMM.
Input: V
Initialize X, W, H, α_X
for iteration = 1, 2, . . . do
    (update X, W, H, and α_X alternately as described above)
end

As established in ([32], Theorem 2), the ADMM converges to a stationary point when a descent is achieved for each subproblem within each iteration, even though the global problem is nonconvex. In our context, we are addressing a nonconvex optimization problem, namely the NMF problem defined in Equation (2) and reformulated in Equation (23). Using the ADMM framework, we partition the problem into a series of subproblems, each solvable within an iteration. By the convergence guarantee of Theorem 2, a descent is achieved within each subproblem. Consequently, invoking ([32], Theorem 2) in our setting yields a convergence result for the proposed method in Algorithm 3: the iterates converge to a stationary point.

Synthetic Datasets
We first tested the proposed algorithm on a moderately sized synthetic dataset with m = 500, n = 500, and k = 150. We generated the ground truth W_0 and H_0 and set V = W_0 H_0. We examined the performance of the proposed algorithm against the standard multiplicative update [38] and the ADMM [34]. We set ρ = 1 and the maximum number of iterations to 1000, and we record the objective value on a log scale. The performance results are shown in Figure 3. We can see that the proposed block method achieves a much lower error level given the same amount of running time; in other words, the block method reaches a specified accuracy faster than the other two methods.
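The synthetic setup can be reproduced along the following lines (smaller sizes here for speed; only the multiplicative-update baseline is shown, since it is compact, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 100, 100, 20             # the paper uses m = n = 500, k = 150
W0, H0 = rng.random((m, k)), rng.random((k, n))
V = W0 @ H0                        # exactly factorizable ground truth

# Multiplicative-update baseline, tracking the objective on a log scale.
W, H = rng.random((m, k)), rng.random((k, n))
eps = 1e-12
log_obj = []
for _ in range(300):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    log_obj.append(np.log10(0.5 * np.sum((V - W @ H) ** 2)))

print(log_obj[0], log_obj[-1])     # the objective decreases monotonically
```

Because V is exactly factorizable at rank k, the optimal objective is zero, so the log-scale objective trace directly visualizes how far each method is from an exact factorization.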

Real Datasets
We evaluated the proposed method against both the multiplicative update and ADMM algorithms using real datasets. These datasets were generated using either a 2D imaging sensor or a near-infrared (NIR) imaging sensor.

1.
UMist (https://cs.nyu.edu/~roweis/data.html, accessed on 2 January 2022): This dataset is an image dataset containing 575 images of 20 people, which consist of images of individuals captured in various poses, ranging from profile to frontal views. All files in the dataset are in the PGM format, have a resolution of approximately 220 × 220 pixels, and are 8-bit grayscale images (256 gray levels).

2.
ORL: a face image dataset. The results on the real datasets are shown in Figures 4-8. Based on these results, we can see that the objective value under the multiplicative update decreases faster at the beginning, but later the proposed block method converges to a better solution with a much lower error level. In addition, compared to the ADMM, the proposed block method is much more stable. In Figure 7, the objective value under the ADMM does not decrease consistently; this is because ρ = 1 is too small for the YaleB dataset. However, with ρ = 1, the proposed block method does not diverge, and the objective value decreases continuously. The source code for the proposed algorithms is available on GitHub at https://github.com/Xinyao90/Block-active-ADMM-to-Minimize-NMF-with-Bregman-Divergences.git, accessed on 10 August 2023.

Conclusions
In this paper, we proposed a new block method for solving the nonnegative matrix factorization problem with a general Bregman divergence distance metric. Nonnegative matrix factorization is a widely used technique in various fields such as image processing, speech analysis, and bioinformatics. In particular, image processing systems heavily rely on databases generated by existing imaging sensors, and the efficiency and accuracy of these systems depend on the performance of the nonnegative matrix factorization method used to process these databases.
Our proposed block method was built on the framework of the alternating direction method of multipliers (ADMM), which is a popular algorithm used to solve optimization problems. However, instead of following the traditional approach of the ADMM to solve the subproblems, we introduced a new method that employed a block coordinate method. In this approach, we selected a block based on the stationary condition, which allowed us to converge faster and to a solution with a lower error level compared to the previous ADMM method.
To demonstrate the effectiveness of our proposed method, we conducted a series of numerical experiments. The experiments included comparisons of our block method with the traditional ADMM method and other state-of-the-art methods in terms of runtime and overall accuracy. Our numerical results demonstrated the superiority of the proposed block method over the other methods, highlighting its effectiveness in solving the nonnegative matrix factorization problem with a general Bregman divergence distance metric.
In summary, our proposed block method provides an efficient and accurate solution to the nonnegative matrix factorization problem using the general Bregman divergence distance metric. Its unique approach to solving subproblems using a block coordinate method has proven to be faster and more accurate than traditional methods, as demonstrated by our numerical experiments. Our proposed method can help improve the performance of image processing systems and other applications that rely on nonnegative matrix factorization.
Appendix A. Proof of Theorem 1
Proof. Then, F = W_1 ∪ W_2. Moreover, if i ∈ W_2, then x(α)_i = x_i for all α > 0. Since x is not a stationary point, W_1 is nonempty. Let i ∈ W_1, so that either x_i = 0 and d_i > 0, or x_i > 0. Define ᾱ = sup{α : x_i + α d_i ≥ 0, ∀ i ∈ W_1}.
Note that ᾱ is either ∞ or a strictly positive number. We then define the direction d ∈ R^{|F|} accordingly. Therefore, for all 0 < α ≤ ᾱ, we have x(α)_i = x_i + α d_i for all i ∈ F. Moreover, d is a feasible descent direction, so for any 0 < α ≤ ᾱ, we have f(x(α)) < f(x).

Appendix B. Proof of Theorem 2
Proof. Let x be the feasible point at the kth iteration. Let G(x) = (x − x^+)/α denote the gradient mapping. Then, the descent lemma implies that