An Efficient Orthonormalization-Free Approach for Sparse Dictionary Learning and Dual Principal Component Pursuit

Sparse dictionary learning (SDL) is a classic representation learning method and has been widely used in data analysis. Recently, ℓm-norm (m ≥ 3, m ∈ N) maximization has been proposed to solve SDL, which reshapes the problem as an optimization problem with orthogonality constraints. In this paper, we first propose an ℓm-norm maximization model for solving dual principal component pursuit (DPCP), based on the similarities between DPCP and SDL. We then propose a smooth unconstrained exact penalty model and show its equivalence with the ℓm-norm maximization model. Based on this penalty model, we develop an efficient first-order algorithm (PenNMF) and establish its global convergence. Extensive experiments illustrate the high efficiency of PenNMF compared with other state-of-the-art algorithms for solving ℓm-norm maximization with orthogonality constraints.


Introduction
In this paper, we focus on solving the optimization problem with orthogonality constraints

(1)  max_W ‖YᵀW‖_m^m,  s.t. WᵀW = I_p,

where W is the variable, Y ∈ R^{n×N} is a given data matrix, and I_p denotes the identity matrix in R^{p×p}. Besides, the ℓm-norm with m ∈ (2, 4] is defined element-wise by ‖A‖_m = (Σ_{i,j} |A_{ij}|^m)^{1/m}. For brevity, the orthogonality constraints WᵀW = I_p in (1) can be expressed as W ∈ S_{n,p} := {W ∈ R^{n×p} | WᵀW = I_p}. Here, S_{n,p} denotes the Stiefel manifold in real matrix space, and we simply call it the Stiefel manifold in the rest of this paper.

Sparse dictionary learning (SDL) exploits the low-dimensional features within a set of unlabeled data, and therefore plays an important role in unsupervised representation learning. More specifically, given a data set Y = [y_1, y_2, ..., y_N] ∈ R^{n×N} that contains N samples in R^n, SDL aims to compute a full-rank matrix D ∈ R^{n×p}, called the dictionary, and an associated sparse representation X = [x_1, ..., x_N] that satisfies Y ≈ DX.

Existing approaches [25–28] for solving the convex problem (4) are not scalable and are not competent in high-dimensional cases [29]. On the other hand, the Random Sampling and Consensus (RANSAC) algorithm [30] has been one of the most popular methods in computer vision for the high-relative-dimension setting. RANSAC alternates between fitting a subspace to a randomly sampled minimal number of points (n − 1 in the case of DPCP) and measuring the quality of the selected subspace by the number of data points close to it. In particular, as described in [29], RANSAC can be extremely effective when the probability of sampling outlier-free samples within the allocated time budget is large. Recently, Tsakiris and Vidal [31] introduced Denoised-DPCP (DPCP-d), which minimizes ‖y − YᵀW‖_F² + γ‖y‖_1 over the constraints WᵀW = 1, y ∈ R^N. In the same paper, Tsakiris and Vidal [31] proposed an Iteratively-Reweighted-Least-Squares algorithm (DPCP-IRLS) for solving the non-convex DPCP problem (4).
The authors illustrate that DPCP-IRLS can successfully handle 30% to 50% outliers and show its high efficiency compared with RANSAC. In addition, Zhu et al. [32] propose a projected subgradient-based algorithm named DPCP-PSGM, which exhibits great efficiency in reconstructing the road plane in the KITTI dataset. There are also approaches using smoothing techniques to approximate the ℓ1-norm term, such as Logcosh [8,33], the Huber loss [34], pseudo-Huber [5], etc. Then, algorithms for minimizing a smooth objective function on a sphere can be applied. Nonetheless, these smoothing techniques often introduce approximation errors, as the smooth objective functions usually lead to dense solutions. Qu et al. [35] and Sun et al. [8] propose a rounding step as postprocessing to achieve exact recovery [16] by solving a linear program, which leads to additional computational cost.
The main difficulties in developing efficient algorithms are the nonsmoothness and nonconvexity of DPCP models. Observing the similarity between SDL and DPCP, we adopt ℓm-norm maximization to reformulate DPCP as a smooth optimization problem on the sphere.

Contribution
In this paper, we first point out that the DPCP problem can be formulated as the ℓm-norm (m ∈ (2, 4]) maximization (1) with p = 1. Therefore, both SDL and DPCP can be unified as a smooth optimization problem on the Stiefel manifold.
Motivated by PenC [21], we propose a novel penalty function h(W), where β > 0 is the penalty parameter and Φ is the operator that symmetrizes a square matrix, defined by Φ(M) = (M + Mᵀ)/2. We show that h(W) is bounded from below, so the convex compact constraint in PenC can be avoided. Therefore, we propose the smooth unconstrained penalty model (6) for ℓm-norm maximization (PenNM). We prove that (6) with m ∈ (2, 4] is an exact penalty function of (1) under some mild conditions. Moreover, when p = 1, we verify that PenNM does not introduce any first-order stationary point other than those of (1) and W = 0. Based on the new exact penalty model, we propose an efficient orthonormalization-free first-order algorithm named PenNMF with no additional constraint. In PenNMF, we adopt an approximate gradient at each iteration instead of the exact one, which involves the second-order derivative of the original objective. The global convergence of PenNMF can be established under mild conditions.
Numerical experiments on synthetic and real imagery data demonstrate that PenNMF outperforms PenCF and MSP/GPM in solving SDL, especially in large-scale cases. As an infeasible method, PenNMF shows superior performance compared with MSP and GPM, which invoke an orthonormalization process to maintain feasibility. Moreover, compared with PenCF, PenNMF also performs better, implying the benefit of avoiding the constraints in PenC. In our numerical experiments on DPCP, our proposed model (1) with p = 1 achieves accuracy comparable to the ℓ1-norm-based model (4) in road-plane recovery on the KITTI dataset; in some test examples, (1) is even more accurate than (4). Besides, PenNMF takes less CPU time than other state-of-the-art algorithms such as DPCP-PSGM and DPCP-d while achieving comparable accuracy in reconstructing the road plane.

Notations and Terminologies
Norms: In this paper, ‖·‖_m denotes the element-wise ℓm-norm of a vector or matrix, i.e., ‖A‖_m = (Σ_{i,j} |A_{ij}|^m)^{1/m}. Besides, ‖·‖_F denotes the Frobenius norm, and ‖·‖ denotes the operator 2-norm, i.e., ‖A‖ equals the largest singular value of A. We denote by σ_min(A) the smallest singular value of a given matrix A. The operator A ∘ B stands for the Hadamard product of matrices A and B of the same size. |A| and A^{∘l} represent the component-wise absolute value and l-th power of the matrix A, respectively. Besides, for two symmetric matrices A and B, A ⪰ B denotes that A − B is positive semidefinite, and A ≻ B denotes that A − B is positive definite.
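As an illustration, the norms and element-wise operations above translate directly into numpy (a sketch for the reader; the paper itself provides no code):

```python
import numpy as np

# Numpy sketch of the notation above (illustrative only; not the paper's code).
A = np.array([[1.0, -2.0], [3.0, -4.0]])
B = np.array([[2.0, 0.5], [1.0, 2.0]])

m = 3
lm_norm = np.sum(np.abs(A) ** m) ** (1.0 / m)        # element-wise l_m norm ||A||_m
fro = np.linalg.norm(A, 'fro')                       # Frobenius norm ||A||_F
op = np.linalg.norm(A, 2)                            # operator 2-norm: largest singular value
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]   # smallest singular value

hadamard = A * B         # Hadamard product A ∘ B
abs_A = np.abs(A)        # component-wise absolute value |A|
pow_A = np.abs(A) ** m   # component-wise m-th power |A|^{∘m}
```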
Optimality Condition: W is a first-order stationary point of (1) if and only if WᵀW = I_p and ∇f(W) = WΦ(Wᵀ∇f(W)). Besides, W is a first-order stationary point of PenNM if and only if ∇h(W) = 0.

Model Description
In this section, we first discuss how to reformulate DPCP as an ℓm-norm maximization with orthogonality constraints. To construct an orthonormalization-free algorithm, we minimize h(W) rather than directly solving (1). As an unconstrained penalty problem for (1), the model (6) may introduce additional infeasible first-order stationary points. Therefore, in this section, we characterize the equivalence between (1) and (6) to provide theoretical guarantees for our approach.

ℓm-Norm Maximization for DPCP Problems
Based on the fact that maximizing a higher-order norm promotes spikiness and sparsity, we maximize the ℓm-norm of ŴᵀY over the constraint ŴᵀYYᵀŴ = 1. Although its constraint differs from that of (1), this model can be reshaped into the formulation of (1). Let Y = RZ be the rank-revealing QR decomposition of Y, where Z ∈ R^{n×N} has orthonormal rows and R ∈ R^{n×n} is an upper-triangular matrix, and denote W = RᵀŴ; then the optimization model can be reshaped as maximizing ‖ZᵀW‖_m over WᵀW = 1, which we refer to as (8). Clearly, problem (8) is a special case of (1) with p = 1. Moreover, if W* is a global solution of (8), the solution of the DPCP problem can be recovered by Ŵ* = R^{−T}W*. The detailed framework for solving DPCP by ℓm-norm maximization is presented in Algorithm 1.

Algorithm 1 Framework for Solving DPCP by ℓm-Norm Maximization.
Require: Data matrix Y ∈ R^{n×N}. 1: Perform the QR factorization Y = RZ, where R ∈ R^{n×n} is an upper-triangular matrix and Z ∈ R^{n×N} has orthonormal rows; 2: Compute a solution W* of (1) with p = 1 and data matrix Z; 3: Return Ŵ* = R^{−T}W*
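A numpy sketch of Algorithm 1 follows. The helper `solve_lm_max` is a hypothetical placeholder standing in for any solver of the ℓm-norm maximization (1), e.g., PenNMF; numpy's `qr` produces a lower-triangular factor here rather than the paper's upper-triangular one, which is immaterial to the reformulation:

```python
import numpy as np

# Sketch of Algorithm 1. `solve_lm_max` is a hypothetical stand-in for any
# solver of (1) with p = 1 (e.g., PenNMF) applied to the transformed data Z.
def dpcp_via_lm_max(Y, solve_lm_max):
    """Y: n x N data matrix. Returns a unit normal vector of the recovered hyperplane."""
    # Factor Y = R Z with R (n x n) triangular and Z (n x N) having orthonormal
    # rows. numpy's qr gives Y.T = Qt @ Rt, so Y = Rt.T @ Qt.T (Rt.T is
    # lower-triangular; the triangular shape plays no role below).
    Qt, Rt = np.linalg.qr(Y.T)        # Qt: N x n, Rt: n x n
    R = Rt.T
    Z = Qt.T                          # orthonormal rows
    W_hat = solve_lm_max(Z)           # solution of (1) on the transformed data
    # Map back: W* = R^{-T} W_hat, then normalize for convenience.
    W_star = np.linalg.solve(R.T, W_hat)
    return W_star / np.linalg.norm(W_star)
```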

Equivalence
In this subsection, we first derive the expressions for ∇f(W) and ∇h(W).

Proposition 1. The gradient of f(W) can be expressed as ∇f(W) = Y(sign(YᵀW) ∘ |YᵀW|^{∘(m−1)}), and the Hessian acts on a direction D by ∇²f(W)[D] = (m − 1)Y(|YᵀW|^{∘(m−2)} ∘ (YᵀD)). Moreover, the gradient of h(W) can be formulated from these expressions; note that it involves the second-order derivative of f.

Proof. The expression for ∇f(W) follows from the work in [17]. Based on the expression for ∇f(W), the Hessian of f is obtained by differentiating once more.

With the expression for ∇h(W), we can establish the equivalence between (1) and our proposed model (6). The equivalence is stated in Theorem 2, and the main body of the proofs is presented in Appendix A.
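A quick finite-difference sanity check of the gradient expression can be run in numpy. The objective form f(W) = (1/m)‖YᵀW‖_m^m is an assumption here (the sign flips if (1) is treated as a minimization of −f), and the check compares the analytic directional derivative against a central difference:

```python
import numpy as np

# Finite-difference check of the gradient formula. Assumption (not fixed by the
# excerpt): f(W) = (1/m) * ||Y^T W||_m^m, so grad f(W) = Y (sign(Y^T W) ∘ |Y^T W|^{∘(m-1)}).
def f(W, Y, m):
    return np.sum(np.abs(Y.T @ W) ** m) / m

def grad_f(W, Y, m):
    T = Y.T @ W
    return Y @ (np.sign(T) * np.abs(T) ** (m - 1))

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 40))
W = rng.standard_normal((6, 2))
D = rng.standard_normal(W.shape)     # random test direction
eps = 1e-6
fd = (f(W + eps * D, Y, 3) - f(W - eps * D, Y, 3)) / (2 * eps)   # central difference
an = np.sum(grad_f(W, Y, 3) * D)                                 # <grad f(W), D>
```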

Theorem 2. (First-order equivalence)
Suppose β ≥ 4(m + 2)‖Y‖_F^m and W̄ is a first-order stationary point of (6). Then either W̄ᵀW̄ = I_p holds, which further implies that W̄ is a first-order stationary point of (1), or σ_min(W̄ᵀW̄) ≤ (2m + 4)‖Y‖_F^m/β.

Theorem 2 characterizes the relationship between the first-order stationary points of (1) and those of (6). Namely, the penalty model only yields first-order stationary points other than those of the original model (1) far away from the Stiefel manifold. When p = 1, we can derive a stronger result on these additional first-order stationary points of the penalty model in Corollary 3.

Corollary 3. (Stronger first-order equivalence for p = 1)
Suppose β ≥ 4(m + 2)‖Y‖_F^m, p = 1, and W̄ is a first-order stationary point of (6). Then either W̄ᵀW̄ = I_p holds, which further implies that W̄ is a first-order stationary point of problem (1), or W̄ = 0.

Theorem 2 characterizes the equivalence between (1) and (6) in the sense that all the infeasible first-order stationary points of (6) are relatively far away from the constraint WᵀW = I_p. Besides, Corollary 3 shows that when p = 1, the only infeasible first-order stationary point of (6) is 0. Therefore, when we obtain a solution near the constraint WᵀW = I_p by solving (6), we can conclude that it is a first-order stationary point of (1). Instead of directly solving (1), we can compute a first-order stationary point of (6) and thus avoid intensive orthonormalization in the computation.

Global Convergence
In this section, we focus on developing an infeasible approach for solving (6). The calculation of the gradient of h(W) involves the second-order derivative of f, which is typically even more expensive than the iterations of MSP/GPM. Therefore, we consider solving (6) by an approximate gradient descent algorithm. Let D(W) := ∇f(W) − WΦ(Wᵀ∇f(W)) + βW((WᵀW)² − I_p) be the approximation of the gradient of h(W); we present the detailed algorithm as Algorithm 2.

Algorithm 2 First-Order Method for Solving (6). (PenNMF)
Require: f : R^{n×p} → R, β > 0; 1: Randomly choose W₀ satisfying W₀ᵀW₀ = I_p, and set k = 0; 2: while not terminated do 3: Compute the inexact gradient D(W_k); 4: Compute the stepsize η_k; 5: W_{k+1} = W_k − η_k D(W_k); 6: k ← k + 1; 7: end while 8: Return W_k

Next, we establish the convergence of PenNMF in Theorem 4, which illustrates the global convergence and worst-case convergence rate of PenNMF under mild conditions. The main body of the proof is presented in Appendix B.

Theorem 4. Suppose δ ∈ (0, 1/3] and β ≥ max{228m‖Y‖_F^m, (32/δ)‖Y‖_F^m}. Let {W_k} be the iterate sequence generated by PenNMF, starting from any initial point W₀ satisfying ‖W₀ᵀW₀ − I_p‖_F² ≤ δ/8, with stepsize η_k ∈ [η̄/2, η̄], where η̄ ≤ 1/(2M₁). Then, {W_k} weakly converges to a first-order stationary point of (1). Moreover, for any k = 1, 2, ..., the convergence rate of PenNMF can be estimated, as detailed in Appendix B.
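The loop above can be sketched in numpy as follows. Several choices here are assumptions rather than the paper's prescriptions: the minimized objective is taken as f(W) = −(1/m)‖YᵀW‖_m^m, a small constant stepsize replaces the (A)BB rule of the practical-settings section, and the final polar factorization mirrors the postprocessing described there:

```python
import numpy as np

# Sketch of Algorithm 2 (PenNMF). Assumptions (not fixed by the excerpt):
# minimized objective f(W) = -(1/m)||Y^T W||_m^m; constant stepsize eta.
def sym(M):
    """Phi(M) = (M + M^T)/2, the symmetrization operator."""
    return 0.5 * (M + M.T)

def grad_f(W, Y, m):
    T = Y.T @ W
    return -Y @ (np.sign(T) * np.abs(T) ** (m - 1))

def pennmf(Y, p, m=3, beta=10.0, eta=1e-5, max_iter=100, tol=1e-8):
    n = Y.shape[0]
    # random feasible initial point W0 with W0^T W0 = I_p
    W, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((n, p)))
    for _ in range(max_iter):
        G = grad_f(W, Y, m)
        WtW = W.T @ W
        # inexact gradient D(W) = grad f - W Phi(W^T grad f) + beta W((W^T W)^2 - I_p)
        D = G - W @ sym(W.T @ G) + beta * W @ (WtW @ WtW - np.eye(p))
        if np.linalg.norm(D) < tol:
            break
        W = W - eta * D
    # polar-factorization postprocessing to restore feasibility
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```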

Some Practical Settings
As illustrated in Algorithm 2, the hyperparameters of PenNMF are the penalty parameter β and the stepsize η_k. In the theoretical analysis of PenNMF, the upper bound on η_k adopted in Theorem 4 is too restrictive in practice. There are many adaptive stepsizes for first-order algorithms; here we consider the Barzilai-Borwein (BB) stepsize [36] and the alternating Barzilai-Borwein (ABB) stepsize [37]. With S_{k−1} := W_k − W_{k−1} and J_{k−1} := D(W_k) − D(W_{k−1}), the two BB stepsizes are η_k^{BB1} = ‖S_{k−1}‖_F² / |⟨S_{k−1}, J_{k−1}⟩| and η_k^{BB2} = |⟨S_{k−1}, J_{k−1}⟩| / ‖J_{k−1}‖_F², and the ABB stepsize alternates between them. We suggest choosing the stepsize η_k as the ABB stepsize in PenNMF, and we test PenNMF with the ABB stepsize in our numerical experiments. Another parameter is β, which controls the smooth penalty term in h(W). Similarly, the lower bound for β in Theorem 4 is too large to be practical. In our numerical examples, we use the constant s := ‖∇f(W₀)‖_F, which approximates ‖∇²f(W₀)‖_F, as an upper bound for β, and, following [21], choose the penalty parameter accordingly. Additionally, to achieve high accuracy in feasibility, we perform a polar factorization on the final solution generated by PenCF and PenNMF as the default postprocessing. More precisely, once PenNMF returns the final iterate W_k, we compute its singular value decomposition W_k = U_kΣ_kV_kᵀ and return Ŵ_k := U_kV_kᵀ. Using the same proof techniques as in [21], this postprocessing decreases both the feasibility violation and the function value. Moreover, the numerical experiments in [19] show that this orthonormalization step results in little change in ∇h(W). Therefore, we suggest performing the described postprocessing for PenNMF.
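The ABB rule can be sketched compactly; the safeguard bounds `eta_min`/`eta_max` below are hypothetical additions for numerical robustness, not prescribed by the paper:

```python
import numpy as np

# Sketch of the alternating Barzilai-Borwein (ABB) stepsize: s = W_k - W_{k-1},
# j = D(W_k) - D(W_{k-1}); BB1 and BB2 are the two classic BB formulas, and ABB
# alternates between them with the iteration counter k.
def abb_stepsize(s, j, k, eta_min=1e-10, eta_max=1e2):
    sj = abs(np.sum(s * j))
    if sj == 0:
        return eta_min                     # degenerate case: fall back to the floor
    if k % 2 == 0:
        eta = np.sum(s * s) / sj           # BB1: ||s||_F^2 / |<s, j>|
    else:
        eta = sj / np.sum(j * j)           # BB2: |<s, j>| / ||j||_F^2
    return float(np.clip(eta, eta_min, eta_max))
```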

Numerical Examples
In this section, we present our preliminary numerical examples. We compare our algorithm with some state-of-the-art algorithms on SDL and DPCP problems, which are formulated as (1) and (8), respectively. Then, we observe the performance of our algorithm under different parameter selections and choose the default setting. All the numerical experiments in this section are run on an Intel(R) Xeon(R) Silver 4110 CPU @ 2.1 GHz, with 32 cores and 394 GB of memory, under Ubuntu 18.04 and MATLAB R2018a.

Numerical Results on Sparse Dictionary Learning
In this subsection, we mainly compare the numerical performance of PenNMF with some state-of-the-art algorithms on SDL. As illustrated in Table 2 of [17], MSP is significantly faster than the Riemannian subgradient method [3] and the Riemannian trust-region method [8]. Therefore, to better illustrate the performance of PenNMF, we compare it with state-of-the-art algorithms for solving (1), which is a smooth optimization problem with orthogonality constraints. We first select two such algorithms. One is Manopt [38,39], a projection-based feasible method; in our numerical tests, we choose its nonlinear conjugate gradient solver with an inexact line-search strategy to accelerate Manopt. The other is PenCF [21], an infeasible approach for optimization problems with orthogonality constraints. We apply the alternating Barzilai-Borwein stepsize to accelerate PenNMF and use the default parameter settings described in [21]. Besides, we test the MSP algorithm [17] and the GPM algorithm [18]. It is worth mentioning that when m = 4, MSP and GPM coincide. According to the numerical examples in [18], m = 3 yields better recovery quality than m = 4. Therefore, in our numerical experiments, we test the mentioned algorithms with m = 3.
The stopping criterion for Manopt follows its default setting, while the stopping criterion for PenCF and PenNMF is ‖∇h(W_k)‖_F ≤ 10⁻². Besides, the maximum number of iterations for all compared algorithms is set to 200.
In all test examples, we randomly generate the sparse representation X by X* = randn(n, N) .* (randn(n, N) < 0.3) and the dictionary W* by randomly selecting a point on the Stiefel manifold. Then, the clean data matrix is constructed by Y_o = W*X*. To test the performance of all compared algorithms, we add different types of noise to Y_o. We first fix the noise level θ = 0.3 and choose n from 20 to 100. Then, we test the compared algorithms with different types of noise while fixing n = 50. In our numerical tests, "Noise" denotes Gaussian noise, where Y is constructed by Y = Y_o + θ·randn(n, N). Besides, the term "Outliers" denotes Gaussian outliers, where outliers = randn(n, round(θ·N)); Y = cat(2, Y_o, outliers). Additionally, the term "Corruption" refers to Gaussian corruption of Y_o, which uses a Rademacher mask rademacher = (rand(n, N) < 0.5). Besides, the term "CPU time" denotes the averaged run time, while "Error" denotes 1 − ‖ŴᵀW*‖₄⁴, where Ŵ denotes the final output of each compared algorithm. The numerical results are reported in Figure 1. From Figure 1d-f,j-l, we conclude that all the compared algorithms achieve almost the same accuracy in all cases. Besides, for Gaussian noise, the performance of PenNMF is comparable to MSP/GPM and outperforms Manopt. Moreover, with Gaussian outliers and Gaussian corruption, PenNMF performs better than PenCF, MSP/GPM, and Manopt. One possible explanation is that Manopt computes the Riemannian gradient and performs a line search in each iteration, resulting in higher computational complexity than MSP/GPM. Besides, the infeasible approaches overcome the bottleneck of the orthonormalization process in Manopt and MSP/GPM, and thus achieve comparable performance to MSP/GPM. Additionally, PenCF solves a constrained model by taking approximate gradient descent steps, while PenNMF solves an unconstrained one; the absence of constraints helps to improve the performance of PenNMF.
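The three corruption models above can be sketched in Python (a hedged translation of the MATLAB fragments; the sparsity mask is rendered as a Bernoulli(0.3) mask, and the exact corruption formula is an assumption):

```python
import numpy as np

# Python sketch of the synthetic SDL data generation and the three
# corruption models described above, with noise level theta.
rng = np.random.default_rng(0)

def make_data(n=20, N=200):
    X_star = rng.standard_normal((n, N)) * (rng.random((n, N)) < 0.3)  # sparse codes
    W_star, _ = np.linalg.qr(rng.standard_normal((n, n)))              # random point on the Stiefel manifold
    return W_star @ X_star, W_star, X_star                             # clean data Y_o = W* X*

Yo, W_star, _ = make_data()
n, N = Yo.shape
theta = 0.3

Y_noise = Yo + theta * rng.standard_normal((n, N))        # "Noise": additive Gaussian
outliers = rng.standard_normal((n, round(theta * N)))     # "Outliers": extra Gaussian columns
Y_outl = np.concatenate([Yo, outliers], axis=1)
mask = rng.random((n, N)) < 0.5                           # Rademacher-style mask
Y_corr = Yo + theta * mask * rng.standard_normal((n, N))  # "Corruption" (assumed form)
```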
Besides testing on synthetic data sets, we also perform extensive experiments to verify the performance of PenNMF on real imagery data. A classic application of dictionary learning is learning sparse representations of image patches [40]. In this paper, we extend the experiments in [17] to learn patches from grayscale and color images. Based on the 512 × 512 grayscale image "Barbara", we construct the clean data matrix Y_o by vectorizing each of its 16 × 16 patches. Then, we use the same approach to construct the clean data matrices from the 512 × 512 grayscale images "Boat" and "Lena", together with the 256 × 256 grayscale image "House". For "Barbara", "Boat", and "Lena", the clean data matrix satisfies Y ∈ R^{256×247,009}, while the data matrix from "House" satisfies Y ∈ R^{256×58,081}. Besides, we construct the matrix Y ∈ R^{192×62,001} by vectorizing the 8 × 8 × 3 patches of the 256 × 256 RGB image "Duck". In this setting, all the compared algorithms recover the dictionary for all three channels simultaneously rather than learning one dictionary per channel. The same approach is also applied to generate the data matrix in R^{192×146,633} from the 338 × 450 RGB image "Chateau". We run MSP/GPM, PenNMF, PenCF, and Manopt with m = 3 to compute the dictionary from Y = Y_o + θ·randn(n, N) with different noise levels, where Y_o is generated in the same manner as in our first numerical experiment and has the same size as the corresponding patch matrix. The numerical results are presented in Figures 2 and A1. In all experiments, PenNMF takes less time than PenCF, MSP/GPM, and Manopt, which further illustrates the high efficiency of PenNMF on real imagery data, especially in large-scale cases.

Dual Principal Component Pursuit
In this subsection, we first verify the recovery property of our proposed model (8), which is a special case of (1) with p = 1. We first compare the distance between the global minimizer of (8) and the ground truth of the DPCP problem. We fix n = 30 and randomly select W* ∈ R^n. Then, we randomly generate N₁ inliers in the hyperplane whose normal vector is W*. Besides, we randomly generate N₂ outliers in R^n following the Gaussian distribution. Additionally, the data is corrupted by Gaussian noise by adding (θ/√n)·randn(n, N) to Y. Then, we normalize each sample in Y. The range of N₁ is [10, 500], whereas the range of N₂ is [10, 3000]. We run each test problem for 5 instances. Moreover, in each instance, we run DPCP-PSGM to solve (4) and PenNMF to solve (8) with m = 3 and 4, and obtain the solution W̄ for each model. We plot the principal angle between W̄ and W* in Figure 3. From Figure 3a,d, we can conclude that (4) can tolerate O(N₁²) outliers while achieving exact recovery, which coincides with the theoretical results presented in [32]. For model (8), the numerical experiments do not show exact recovery for m = 3 or 4. However, with some tolerance on the principal angle, we also observe that (8) can tolerate O(N₁²) outliers. Moreover, we conclude that with m = 3, (8) recovers the normal vector better than with m = 4. As a result, in the rest of this subsection, we only test (8) with m = 3.
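The synthetic DPCP setup above can be sketched in numpy. The inlier construction (Gaussian samples projected onto the hyperplane) is an assumption consistent with the description, and the principal angle is used as the error metric:

```python
import numpy as np

# Sketch of the synthetic DPCP instance: N1 inliers on the hyperplane with unit
# normal w_star, N2 Gaussian outliers, Gaussian noise, and column normalization.
def dpcp_instance(n=30, N1=200, N2=100, theta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(n)
    w_star /= np.linalg.norm(w_star)
    # inliers: Gaussian samples projected onto the hyperplane w_star^T x = 0
    X = rng.standard_normal((n, N1))
    inliers = X - np.outer(w_star, w_star @ X)
    outliers = rng.standard_normal((n, N2))
    Y = np.concatenate([inliers, outliers], axis=1)
    Y = Y + (theta / np.sqrt(n)) * rng.standard_normal(Y.shape)  # Gaussian corruption
    Y /= np.linalg.norm(Y, axis=0, keepdims=True)                # normalize each sample
    return Y, w_star

def principal_angle(w, w_star):
    """Angle between the spans of w and w_star (sign-invariant)."""
    c = abs(w @ w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return np.arccos(np.clip(c, -1.0, 1.0))
```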
In addition, we analyze the number of successfully recovered instances, where 1 − ⟨W̄, W*⟩² is less than 0.1 or 0.2. The results are presented in Figure 4, from which we can conclude that, with some tolerance on the error, the ℓm-norm maximization model can successfully recover the normal vector. Moreover, in model (8), m = 3 performs better than m = 4, which coincides with the numerical experiments in [18]. Therefore, when applying the ℓm-norm maximization model to DPCP problems, we suggest choosing m = 3 in (8).

In the rest of this subsection, we test the numerical performance of PenNMF on the DPCP problem, which plays an important role in autonomous driving applications. DPCP is applied to recover the road plane, which can be regarded as the inliers, from the 3D point clouds in the KITTI dataset [22], recorded from a moving platform while driving in and around Karlsruhe, Germany. This dataset consists of image data together with corresponding 3D points collected by a rotating 3D laser scanner [32]. DPCP uses only the 3D point clouds, with the objective of determining the 3D points that lie on the road plane (inliers) and those off the plane (outliers): given a 3D point cloud of a road scene, the DPCP problem focuses on reconstructing an affine plane {x ∈ R³ | aᵀx − b = 0} as a representation of the road. Equivalently, this task can be converted to a linear subspace learning problem by embedding the affine plane into a linear hyperplane H ⊆ R⁴ with normal vector b̃ = [aᵀ, −b]ᵀ, through the mapping x ↦ [xᵀ, 1]ᵀ [29]. We use the experimental setup in [29,32] to further compare Equations (4) and (8), RANSAC, and other alternative methods in the task of 3D road-plane detection on the KITTI dataset. Each point cloud contains over 10⁵ samples with approximately 50% outliers. Besides, the samples are homogenized and normalized to unit ℓ2-norm.
We use 11 frames annotated in [29,32] from the KITTI dataset. We compare DPCP-PSGM [29], DPCP-IRLS, and DPCP-d [31], which solve the ℓ1-norm minimization model (4). Besides, we test RANSAC and ℓ2,1-RPCA [25]. Additionally, we test PenNMF and MSP/GPM on our proposed model (8), which is a special case of (1). For DPCP-PSGM, DPCP-d, DPCP-IRLS, and ℓ2,1-RPCA, all parameters are set following the suggestions in [32]. Figure 5 illustrates the numerical performance of all the compared algorithms, with the numerical results presented in Figure 5d-f. Moreover, we draw the performance profiles proposed by Dolan and Moré [41] in Figure 5a-c to present an illustrative comparison of all compared algorithms; performance profiles can be regarded as distribution functions of a performance metric for benchmarking and comparing optimization algorithms. Besides, we present the recovery results of frames 328 and 441 in KITTI-CITY-71 in Figure 6. Here, the term "AUC" denotes the area under the ROC curve, and "iterations" denotes the total number of iterations taken by the compared algorithms. Besides, "Prob" in Figure 5d-f denotes the indices of the tested frames, which are presented in Table 1. From Figure 5a, we can conclude that PenNMF and MSP/GPM successfully recover the hyperplanes with comparable accuracy. Moreover, in problems 3, 7, and 9, PenNMF and MSP produce better classification accuracy than the other approaches. Besides, in terms of CPU time, PenNMF and MSP cost much less than the other compared algorithms in most cases. Moreover, from Figure 5c, we can conclude that PenNMF takes less time than MSP as well as the other compared algorithms in almost all cases. As a result, we can conclude that our proposed model (1), together with PenNMF, is both efficient and accurate for DPCP problems.

Conclusions
Sparse dictionary learning (SDL) and dual principal component pursuit (DPCP) are two powerful tools in data science. In this paper, we formulate DPCP as a special case of the ℓm-norm maximization on the Stiefel manifold proposed for SDL. Then, we propose a novel smooth unconstrained penalty model, PenNM, for the original optimization problem with orthogonality constraints. We show that PenNM is an exact penalty function of (1) under mild assumptions. We develop a novel approximate gradient approach, PenNMF, for solving PenNM. The global convergence of PenNMF as well as its sublinear convergence rate are established. Numerical experiments illustrate that our proposed approach enjoys better performance than MSP/GPM [17,18] on various test problems.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing-original draft preparation, writing-review and editing, visualization, supervision, project administration, X.H. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript.

Appendix A. Proof for Theorem 2 and Corollary 3

In this section, we present the proofs of Theorem 2 and Corollary 3. As (6) is an unconstrained optimization problem, an upper bound for ‖∇h(W)‖_F should be estimated. Before estimating the upper bounds of ‖∇f(W)‖_F and ‖∇²f(W)[W(WᵀW − I_p)]‖_F, we first present two linear algebraic inequalities:

Proof.

Lemma A2. For any A ∈ R^{n×N} and any m ≥ 3, we have a corresponding bound on A^{∘(m)}.

Proof. This lemma directly follows from the stated fact; the last inequality follows from it as well.

Lemma A4. For any W ∈ R^{n×p}, the quantity ‖∇²f(W)[W(WᵀW − I_p)]‖_F can be bounded.

Proof. From the expression of ∇²f(W) in Proposition 1, the claimed bound follows. Here, the second inequality directly uses Lemma A1, and the last inequality follows from Lemma A2.
In the rest of this section, we consider the equivalence between (1) and (6). We first establish the relationship between the first-order stationary points of (6) and those of problem (1).
From the optimality condition of (6), we derive an important equality in Lemma A5.
Lemma A5. For any first-order stationary point W̄ of (6) and any symmetric matrix T ∈ R^{p×p} that satisfies TW̄ᵀW̄ = W̄ᵀW̄T, we have ⟨W̄T, ∇h(W̄)⟩ = 0.

Proof. Suppose W̄ is a first-order stationary point of (6); by the first-order optimality condition, ∇h(W̄) = 0. Then, for any symmetric matrix T ∈ R^{p×p} that satisfies TW̄ᵀW̄ = W̄ᵀW̄T, ⟨W̄T, ∇h(W̄)⟩ = 0.
As described in Proposition 1, ∇f(W) can be separated into three parts; we estimate their inner products with W̄T respectively.
Here, the second equality follows from the fact that tr(BC) = 0 holds for any symmetric B and skew-symmetric C. Besides, the last equality follows from (W̄ᵀW̄ − I_p)T = T(W̄ᵀW̄ − I_p). As a result, we obtain the first equality. Additionally, we estimate the inner product of W̄T with the penalty term and obtain the equality ⟨W̄T, βW̄(W̄ᵀW̄W̄ᵀW̄ − I_p)⟩ = 0. Based on the above two equations, multiplying (W̄ᵀW̄ − I_p)W̄ᵀ on both sides of 0 = ∇h(W̄) results in 0 = ⟨W̄T, ∇h(W̄)⟩, and thus we complete the proof.
Then, based on the equality in Lemma A5, the following proposition shows that all first-order stationary points of (6) are uniformly bounded.
Suppose, to the contrary, that W̄ is a first-order stationary point violating the stated bound; this leads to a contradiction and establishes the claim. Here, the second equality directly follows from Lemma A5. The first inequality uses Lemma A4 and the fact that ‖W̄(W̄ᵀW̄ − I_p)‖ ≤ ‖W̄‖³. Besides, the second inequality follows from W̄ᵀW̄ ⪯ ‖W̄‖² I_p. The fourth inequality uses the fact that tr(W̄ᵀW̄ − I_p) ≤ ‖W̄‖_F², and the last inequality follows from the choice of β. Combining Lemma A5 and Proposition A6, we restate Theorem 2 as Theorem A7 and obtain the equivalence between (1) and (6).
Theorem A7. Suppose β ≥ 4(m + 2)‖Y‖_F^m, and W̄ is a first-order stationary point of (6). Then either W̄ᵀW̄ = I_p holds, which further implies that W̄ is a first-order stationary point of problem (1), or the inequality σ_min(W̄ᵀW̄) ≤ (2m + 4)‖Y‖_F^m/β holds.

Proof. When β ≥ 4(m + 2)‖Y‖_F^m, any first-order stationary point W̄ of (6) satisfies ‖W̄‖² ≤ 2.
Then, from Lemma A5, we have tr(β(W̄ᵀW̄ − I_p)² W̄ᵀW̄(W̄ᵀW̄ + I_p)) = 0. Then, by the positive definiteness of W̄ᵀW̄, we can conclude that W̄ᵀW̄ − I_p = 0.
As a result, we have that either W̄ᵀW̄ − I_p = 0 or σ_min(W̄ᵀW̄) ≤ (2m + 4)‖Y‖_F^m/β, which completes the proof.
Corollary A8. Suppose β ≥ 4(m + 2)‖Y‖_F^m, p = 1, and W̄ is a first-order stationary point of (6). Then either W̄ᵀW̄ = I_p holds, which further implies that W̄ is a first-order stationary point of problem (1), or W̄ = 0.
Proof. By the same routine as in Theorem 2, when β ≥ 4(m + 2)‖Y‖_F^m, any first-order stationary point W̄ of (6) satisfies ‖W̄‖² ≤ 2. Then, following the same proof routine as in Lemma A5 and Theorem 2, the corresponding inequality holds for any W ∈ R^n \ {0}. Then, for any W̄ ≠ 0, we can conclude that ‖W̄‖² − 1 = 0. As a result, from (A1), when W̄ is a first-order stationary point of (6), either W̄ = 0 or W̄ᵀW̄ − I_p = 0.

Appendix B. Proof for Theorem 4
In this section, we present the main body of the proof for Theorem 4. To show the convergence of PenNMF, we first present some preliminary lemmas. Then, we show that the updating direction D(W_k) is a descent direction and thus h(W_{k+1}) ≤ h(W_k), as illustrated in Lemma A12. Together with Lemma A10, we show that the iterate sequence is restricted to a neighborhood of the constraints, and we obtain the global convergence of PenNMF in Theorem 4. We first estimate an upper bound for the term f(W) − (1/2)⟨WᵀW − I_p, Λ(W)⟩ in h(W).
Lemma A9. For any W ∈ R^{n×p}, the term f(W) − (1/2)⟨WᵀW − I_p, Λ(W)⟩ is bounded.

Proof. We first estimate the upper bound for |f(W)|. Besides, from Lemma A3, we bound the second term. Combining the above two bounds completes the proof.
We then show that the penalty term ψ(W) := (1/6)tr((WᵀW)³) − (1/2)tr(WᵀW) builds a barrier around S_{n,p}, i.e., those points that are sufficiently far from S_{n,p} have higher function values than those close to S_{n,p}.
Besides, ‖WᵀW − I_p‖_F² ≤ δ implies that W lies in a bounded neighborhood of S_{n,p}. Lemma A10 shows that the smooth penalty term builds a barrier around S_{n,p}. Moreover, we characterize the relation between ‖D(W)‖_F and ‖WᵀW − I_p‖_F in the following lemma.
Proof. First, we present two linear algebra facts. The first is the inequality ‖A‖_F ≥ ‖(A + Aᵀ)/2‖_F, which holds for any square matrix A and whose proof is omitted. The second is the equality ‖AB + BA‖_F = 2‖AB‖_F, which holds for commuting symmetric matrices A and B, where the last equality uses the fact involving σ_min, for any k = 0, 1, .... Proof. By the explicit expression of ∇h(W_k), we first obtain the bound (A13). Here, the first inequality follows from Lemmas A3 and A4. Besides, by the definition of M₁, we obtain (A14). Substituting W_{k+1} − W_k = −ηD(W_k) and (A14) into (A13) yields (A15). Then, by Lemma A10, since h(W_{k+1}) ≤ h(W_k), we can conclude that ‖W_{k+1}ᵀW_{k+1} − I_p‖_F² ≤ δ. By induction, ‖W_kᵀW_k − I_p‖_F² ≤ δ holds for k = 1, 2, 3, .... Then, by (A15) again, we conclude that the desired bound holds for k = 1, 2, 3, ..., which completes our proof.
The following lemma shows that if the algorithm stops at W̄, i.e., D(W̄) = 0, then W̄ is a first-order stationary point of (1).
Proof. Suppose D(W̄) = 0. Then, by the same proof routine as in Theorem 2, we consider the inner product of D(W̄) and W̄(W̄ᵀW̄ − I_p). Here, the fourth equality follows from the definition Λ(W) := Φ(Wᵀ∇f(W)), and the first inequality uses the fact that ‖W̄‖² ≤ 2; together with Lemma A3, we can conclude that Λ(W̄) ⪯ (4m + 8)‖Y‖_F^m · I_p. Besides, the last inequality uses the fact that (β/2)W̄ᵀW̄(W̄ᵀW̄ + I_p) ⪰ (4m + 8)‖Y‖_F^m · I_p. Then we can conclude that W̄ᵀW̄ = I_p. By the definition of D(W̄), we have ∇f(W̄) − W̄Φ(W̄ᵀ∇f(W̄)) = 0. Together with Theorem 2, we can conclude that W̄ is a first-order stationary point of (1).
Based on Lemmas A10–A13, we restate Theorem 4 as Theorem A14 and show the global convergence of PenNMF in the following theorem.
Theorem A14. Suppose δ ∈ (0, 1/3] and β ≥ max{228m‖Y‖_F^m, (32/δ)‖Y‖_F^m}. Let {W_k} be the iterate sequence generated by PenNMF, starting from any initial point W₀ satisfying ‖W₀ᵀW₀ − I_p‖_F² ≤ δ/8, with stepsize η_k ∈ [η̄/2, η̄], where η̄ ≤ 1/(2M₁). Then, {W_k} weakly converges to a first-order stationary point of (1). Moreover, for any k = 1, 2, ..., the convergence rate of PenNMF can be estimated as follows. Proof. By Lemma A12, it holds that h(W_{k+1}) ≤ h(W_k). If W* is a cluster point of {W_k}, we have W_{k+1} − W_k → 0. Together with W*ᵀW* = I_p, implied by Lemma A11, we can conclude that W* is a first-order stationary point of problem (1).
Summing the above inequalities from k = 0 to N − 1 shows that lim inf_{k→+∞} ‖D(W_k)‖_F = 0, which further implies that D(W*) = 0. By Lemma A13, D(W*) = 0 implies that W* is a first-order stationary point of (1). Moreover, by (A19), we obtain the stated convergence rate, which completes the proof.

Appendix C. Additional Experimental Results
In this section, we present some additional numerical results. Figure A1 shows the top 12 bases computed by PenNMF from the test instances in Section 4.1. As described in [17], the top bases are those with the largest coefficients in terms of the ℓ1-norm.