Stable Tensor Principal Component Pursuit: Error Bounds and Efficient Algorithms

The rapid development of sensor technology gives rise to the emergence of huge amounts of tensor (i.e., multi-dimensional array) data. For various reasons such as sensor failures and communication loss, the tensor data may be corrupted by not only small noises but also gross corruptions. This paper studies the Stable Tensor Principal Component Pursuit (STPCP) which aims to recover a tensor from its corrupted observations. Specifically, we propose a STPCP model based on the recently proposed tubal nuclear norm (TNN) which has shown superior performance in comparison with other tensor nuclear norms. Theoretically, we rigorously prove that under tensor incoherence conditions, the underlying tensor and the sparse corruption tensor can be stably recovered. Algorithmically, we first develop an ADMM algorithm and then accelerate it by designing a new algorithm based on orthogonal tensor factorization. The superiority and efficiency of the proposed algorithms is demonstrated through experiments on both synthetic and real data sets.


Introduction
In recent years, different types of tensor data have emerged with the significant progress of modern sensor technology, such as color images [1], videos [2], functional MRI data [3], hyper-spectral images [4], point could data [5], traffic stream data [6], etc. Thanks to its multi-way nature, tensor-based methods have natural superiority over vector and matrix-based methods in analyzing and processing ubiquitous modern multi-way data, and have found extensive applications in computer vision [1,7], data mining [5], machine learning [2], signal processing [8], to name a few. In real applications, the acquired tensor data may often suffer from noises and gross corruptions owing to many different reasons such as sensor failure, lens pollution, communication interference, occlusion in videos, or abnormalities in a sensor network [9], etc. At the same time, many real-world tensor data, such as face images or videos, have been shown to have some low-dimensional structure and can be well approximated by a smaller number of "principal components" [8]. Then, a question naturally arises: how to pursue the principal components of an observed tensor data in the presence of both noises and gross corruptions? We will answer this question in this paper and refer to the proposed methodology as Stable Tensor Principal Component Pursuit (STPCP).
The tensor low-rankness is an ideal model of the property that a tensor data can be well approximated by a small number of principal components [8]. In the last decade, low-rank tensor models have attracted much attention in many fields [10]. There are multiple low-rank tensor models since there exist different definitions of tensor rank. Among these models, the low CP rank model [11] and the low Tucker rank model [1] should be the most famous two. The low CP rank model approximates the underlying tensor by the sum of a small number of rank-1 tensors, whereas the low Tucker rank model assumes the unfolding matrix along each mode are low rank. To estimate an unknown low-rank tensor from corrupted observations, it is a natural option to consider the rank minimization problem which chooses the tensor of lowest rank as the solution from a certain feasible set. However, tensor rank minimization, even in its 2-way (matrix) case, is generally NP-hard [12] and even harder in higher-way cases [13]. For tractable solutions, researchers turn to a variety of convex surrogates for tensor rank [1,[14][15][16][17][18] to replace the tensor rank in rank minimization problem. Methods based on surrogates for the tensor CP rank and Tucker rank have been extensively explored in both the theoretical side and the application side [14,17,[19][20][21][22][23][24].
Recently, the low-tubal-rank model [16,25] has shown better performance than traditional tensor low-rank models in many tensor recover tasks such as image/video inpainting/denoising/ sensing [2,25,26], moving object detection [27], multi-view learning [28], seismic data completion [29], WiFi fingerprint [30], MRI imaging [16], point cloud data inpainting [31], and so on. The tubal rank is a new complexity measure of tensor defined through the framework of tensor singular value decomposition (t-SVD) [32,33]. At the core of existing low-tubal-rank models is the tubal nuclear norm (TNN) which is a convex surrogate for the tubal rank. In contrast to CP-based tensor nuclear norms or Tucker-based tensor nuclear norms which models low-rankness in the original domain, TNN models low-rankness in the Fourier domain. It is pointed out in [25,34,35] that TNN has superiority over traditional tensor nuclear norms in exploiting the ubiquitous "spatial-shifting" property in real-world tensor data.
Inspired by the superior performance of TNN, this paper adopts TNN as a low-rank regularizer in the proposed STPCP model. Specifically, the proposed STPCP aims to estimate the underlying tensor data L 0 ∈ R n 1 ×n 2 ×n 3 from an observation tensor M polluted by both small dense noises and sparse gross corruptions as follows: where S 0 is a tensor denoting the sparse corruptions and E 0 is a tensor representing small dense noises. Model (1) is also known as robust tensor decomposition in [36,37]. Our STPCP model is first formulated as a TNN-based convex problem. Then, our theoretical analysis gives upper bound on the estimation error of L 0 and S 0 . In contrast to the analysis in [37], the proposed STPCP can exactly recovery the underlying tensor L 0 and the sparse corruption tensor S 0 when the noise term E 0 vanishes. For efficient solution of the proposed STPCP model, we develop two algorithms with extensions to a more challenging scenario where missing observations are also considered. The first algorithm is an ADMM algorithm and the second algorithm accelerates it using tensor factorization. Experiments show the effectiveness and the efficiency of the designed algorithms.
We organize the rest of this paper as follows. In Section 2, we briefly introduce basic preliminaries for t-SVD and some related works. The proposed STPCP model is formulated and analyzed theoretically in Section 3. We design two algorithms in Section 4 and report experimental results in Section 5. This work is concluded in Section 6. The proofs of theorems, propositions, and lemmas are given in the appendix.

Preliminaries and Related Works
In this section, some preliminaries of t-SVD are first introduced. Then, the related works are presented.
Notations. We denote vectors by bold lower-case letters, e.g., a ∈ R n , matrices by bold upper-case letters, e.g., A ∈ R n 1 ×n 2 , and tensors by underlined upper-case letters, e.g., A ∈ R n 1 ×n 2 ×n 3 . For a given 3-way tensor, we define its fiber as a vector given through fixing all indices but one, and its slice as a matrix obtained by fixing all indices but two. For a given 3-way tensor A, we use A ijk to denote its (i, j, k)-th element; A (k) := A(:, :, k) is used to denote its k-th frontal slice. A is used to denote the tensor after performing 1D Discrete Fourier Transformation (DFT) on all tube fibers A(i, j, :) of T, ∀i = 1, 2, · · · , n 1 , j = 1, 2, · · · , n 2 , which can be efficiently computed by the Matlab command A = fft(A, [], 3). We use dft3(·) and idft3(·) to represent the 1D DFT and inverse DFT along the tube fibers of 3-way tensors, i.e., dft3(A) := fft(A, [], 3), idft3(A) := ifft(A, [], 3).
For a given matrix M ∈ R n 1 ×n 2 , define the nuclear norm and spectral norm of M respectively as: where p = min{n 1 , n 2 }, and σ 1 (M) ≥ · · · ≥ σ p (M) are the singular values of M in a non-ascending order. The l 0 -norm, l 1 -norm, Frobenius norm, l ∞ -norm of a tensor A ∈ R n 1 ×n 2 ×n 3 is defined as: where 1(C) is an indicator function whose value is 1 if the condition C is true, and 0 otherwise. Given two matrices A = (a ij ) ∈ C n 1 ×n 2 , B = (b ij ) ∈ C n 1 ×n 2 , we define their inner product as follows: where A H denotes conjugate transpose of matrix A andā ij denotes the conjugation of complex number a ij . Given two 3-way tensors A, B ∈ R n 1 ×n 2 ×n 3 , we define their inner product as follows:

Tensor Singular Value Decomposition
We first define 3 operators based on block matrices which are introduced in [33]. For a given 3-way tensor A ∈ R n 1 ×n 2 ×n 3 , we define its block vectorization bvec(·) and the inverse operation bvfold(·) in the following equation: (2) . . .
We further define the block circulant matrix bcirc(·) of any 3-way tensor A ∈ R n 1 ×n 2 ×n 3 as follows: Equipped with above defined operators, we are now in a position to define the t-product of 3-way tensors.
Definition 1 (t-product [33]). Given two tensors A ∈ R n 1 ×n 2 ×n 3 and B ∈ R n 2 ×n 4 ×n 3 , the t-product of A and B is a new 3-way tensor C with size n 1 × n 4 × n 3 : A more intuitive interpretation of t-SVD is as follows [33]. If we treat a 3-way tensor A ∈ R n 1 ×n 2 ×n 3 as a matrix of size n 1 × n 2 whose entries are the tube fibers, then the tensor t-product can be analogously understood as the "matrix multiplication" where the standard scalar product is replaced with the vector circular convolution between the tubes (i.e., vectors): where represent the operation of circular convolution [33] of two vectors a, b ∈ R n 3 defined as 3 . We also define the block diagonal matrix bdiag(·) of any 3-way tensor A ∈ R n 1 ×n 2 ×n 3 and its inverse bdfold(·) as follows bdiag(A) :=     A (1) . . .
We also use A (or A) to denote the block diagonal matrix of tensor A = dft3(A) (i.e., the Fourier version of A) i.e., (1) . . .
Then the relationship between DFT and circular convolution further indicates that the conducting t-product in the original domain is equivalent to performing standard matrix product on the Fourier block diagonal matrices [33]. Since matrix product on the Fourier block diagonal matrices can be parallel written as matrix product of all the frontal slices in the Fourier domain, we have the following relationships: The relationship between the t-product and FFT also indicates that the inner product of two 3-way tensors A, B ∈ R n 1 ×n 2 ×n 3 and the inner product of their corresponding Fourier block diagonal matrices A, B ∈ C n 1 n 3 ×n 2 n 3 satisfy the following relationship: When A = B = X, one has: We further define the concepts of tensor transpose, identity tensor, f-diagonal tensor and orthogonal tensor as follows.
Definition 2 (tensor transpose [33]). Given a tensor A ∈ R n 1 ×n 2 ×n 3 , then define its transpose tensor A of size n 2 × n 1 × n 3 which can be formed through first transposing all the frontal slices of A and then exchanging each k-th transposed frontal slice with the (n 3 + 2 − k)-th transposed frontal slice for all k = 2, 3, · · · , n 3 .
Definition 4 (f-diagonal tensor [33]). We call a 3-way tensor f-diagonal if all the frontal slices of it are diagonal matrices.
Definition 5 (orthogonal tensor [33]). We call a tensor Q ∈ R n×n×n 3 an orthogonal tensor if the following equations hold: Then, the tensor singular value decomposition (t-SVD) can be given as follows.
Definition 6 (Tensor singular value decomposition, and Tensor tubal rank [38]). Given any 3-way tensor X ∈ R n 1 ×n 2 ×n 3 , then it has the following factorization called tensor singular value decomposition (t-SVD): where the left and right factor tensors U ∈ R n 1 ×n 1 ×n 3 and V ∈ R n 2 ×n 2 ×n 3 are orthogonal, and the middle tensor Σ ∈ R n 1 ×n 2 ×n 3 is a rectangular f -diagonal tensor.
A visual illustration for the t-SVD is shown in Figure 1. It can be computed efficiently by FFT and IFFT in the Fourier domain according to Equation (4). For more details, see [2]. Definition 7 (Tensor tubal rank [38]). The tensor tubal rank of any 3-way tensor X ∈ R n 1 ×n 2 ×n 3 is defined as the number of non-zero tubes of Σ in its t-SVD shown in Equation (5), i.e., Definition 8 (Tubal average rank [38]). The tubal average rank r a (A) of any 3-way tensor A ∈ R n 1 ×n 2 ×n 3 is defined as the averaged rank of all frontal slices of A as follows, Definition 9 (Tensor operator norm [2,38]). The tensor operator norm F op of any 3-way tensor F ∈ R n 1 ×n 2 ×n 3 is defined as follows: The relationship between t-product and FFT indicates that Definition 10 (Tensor spectral norm [38]). The tensor spectral norm A sp of any 3-way tensor F ∈ R n 1 ×n 2 ×n 3 is defined as the matrix spectral norm of A, i.e., We further define the tubal nuclear norm.
Definition 11 (Tubal nuclear norm [2]). For any tensor A ∈ R n 1 ×n 2 ×n 3 with t-SVD A = U * Σ * V , the tubal nuclear norm (TNN) of A is defined as: where r = r tubal (A).
To understand the tubal nuclear norm, first note that where (i) holds because of the definition of DFT [2], (ii) holds by the property of l 1 -norm, and (iii) is a result of DFT [2]. Thus, the tubal rank of A is also the number of non-zero diagonal elements of Σ(i, i, 1), i.e., the first frontal slice of tensor Σ in the t-SVD of A. Similar to the matrix singular values, the values Σ(i, i, 1), i = 1, 2, · · · , n 3 are also called the singular values of tensor A. As the matrix nuclear norm is the sum of matrix singular values, the tubal nuclear norm can be similarly understood as the sum of tensor singular values. One can also verify by the property of DFT [2] that: which indicates that the TNN of A ∈ R n 1 ×n 2 ×n 3 is also the averaged nuclear norm all frontal slices of A. Thus, TNN indeed models the low-rankness of Fourier domain. Now, we will show that the low-tubal-rank model is ideal to some real-world tensor data, such as color images and videos.
First, we consider a natural image of size 256 × 256 × 3, shown in Figure 2a. In Figure 2b, we plot the distribution of its singular values, i.e., the values of Σ(i, i, 1) along with the index i. As can be seen from Figure 2b, there are only a small number of singular values with large magnitude, and most of the singular values are close to 0. Then, we can say that some natural color images are approximately low tubal rank. Then, consider a commonly used YUV sequence Mother-daughter_qcif (These data can be download from the following link https://sites.google.com/site/subudhibadri/fewhelpfuldownloads.) whose first frame is shown in Figure 3a. We use the Y components of the first 30 frames, and get a tensor of size 144 × 176 × 30 and show the distribution of tensor singular values in Figure 3b. We can see from Figure 3b that similar to Figure 2b, there are only a small number of singular values with large magnitude, and most of the singular values are close to 0. Then, we can say that some videos can be well approximately low tubal rank. For TNN and tensor spectral norm, we highlight the following two lemmas.

Lemma 1.
[2] TNN is the convex envelop of the tensor average rank in the unit ball of tensor spectral norm {T ∈ R n 1 ×n 2 ×n 3 | T sp ≤ 1}.

Lemma 2. [2]
The TNN and the tensor spectral norm are dual norms to each other.

Related Works
In this subsection, we briefly introduce some related works. The proposed STPCP is tightly related to the Tensor Robust Principal Component Analysis (TRPCA) which aims to recover a low-rank tensor L 0 and a sparse tensor S 0 from their sum M = L 0 + S 0 . This is a special case of our measurement Model (1) where the noise tensor E 0 is a zero tensor.
In [39], the SNN-based TRPCA model is proposed by modeling the underlying tensor as a low Tucker rank one min where SNN (Sum of Nuclear Norms) is defined as L SNN := the mode-k matricization of L [40]. Model (14) indeed assumes the underlying tensor to be low Tucker rank, which can be too strong for some real tensor data. The TNN-based TRPCA model uses TNN to impose low-rankness in the final solution L as follows As shown in [2], when the underlying tensor L 0 satisfy the tensor incoherent conditions, by solving Problem (15), one can exactly recover the underlying tensor L 0 and S 0 with high probability with parameter λ = 1/ max{n 1 , n 2 }n 3 .
When the noise tensor E 0 is not zero, the robust tensor decomposition based on SNN is proposed in [36] as follows: where λ 1 and λ 2 are positive regularization parameters. The estimation error on L and S is analyzed with an upper bound in [36].
In [37], the TNN-based RTD model is proposed as follows: where α is an upper estimate of l ∞ -norm of the underlying tensor L 0 . An upper bound on the estimation error is also established. However, in the analysis of Model (17), the error does not vanish as the noise tensor E 0 vanishes which means the analysis cannot guarantee exact recovery in the noiseless setting (which can be provided by the analysis of TNN-based TRPCA (15) by Lu et al. [2]). The Bayesian approach is also used for robust tensor recovery. The CP decomposition under sparse corruption and small dense noise is considered [41], and tensor rank estimation is achieved using Bayesian approach. In [42], CP decomposition under missing value and small dense noise is considered with rank estimation similar to [41]. A sparse Bayesian CP model is proposed in [43] to recover a tensor with missing value, outliers and noises. In [44], a fully Bayesian treatment is proposed to recover a low-tubal-rank tensor corrupted by both noises and outliers.

Theoretical Guarantee for Stable Tensor Principal Component Pursuit
In this section, we formulate the proposed STPCP model and give the main theoretical result which upper bounds the estimation error and guarantees exact recovery in the noiseless setting.

The Proposed STPCP
As for the measurement Model (1), we further assume that the noise tensor E 0 has bounded energy measured in F-norm, i.e., E 0 F ≤ δ. Please note that the limited energy assumption is very mild, since most signals are of limited energy.
To recover the low-rank tensor L 0 and the sparse tensor S 0 , we first produce the following optimization problem: where λ is a positive parameter balancing the two regularizers. The motivation is to use TNN as a low-rank regularization term to exploit the low-dimensional structure in the signal tensor, whereas tensor l 1 -norm is used to impose sparsity in the corruption tensor (since we assumes it to be sparse).
The relationship between Model (18) and existing models are discussed in Remark 1 and Remark 2.

Remark 1.
The following models can be seen as special cases as the proposed STPCP Model (18); (I). When δ = 0, i.e., in the noiseless case, the proposed model degenerates to the TRPCA Model (15) [2]. (II). When n 3 = 1, then the stable tensor PCP Model (18) degenerates to the Stable Principal Component Pursuit (SPCP) [45] which aims to pursuit the principal components modeled by low-rank matrix L 0 from it observation M corrupted by both noises E 0 and sparse corruptions S 0 . The SPCP is formulated as follows: min (III). When n 3 = 1 and δ = 0, the proposed STPCP further degenerates to Robust Principal Component Analysis (RPCA) [46] given as follows Remark 2. The differences from the proposed Model (18) and TNN-based RTD Model ( (17) [37]) is as follows. First, our model does not need to upper estimate the l ∞ -norm of the underlying tensor. Second, our model is a constrained optimization problem, whereas Model (17) is an unconstrained optimization problem.

A Theorem for Stable Recovery
To analyze the statistical performance of Model (18), we should assume on the underlying low-rank tensor L 0 that it is not sparse. Only by this assumption, L 0 can be identified from its mixture with sparse S 0 . Such an assumption can be described by the tensor incoherence condition [2,47], which is used to provide an identifiablility for low-rank L 0 .
Definition 12 (Tensor incoherence condition [2,47]). Given a 3-way tensor T ∈ R n 1 ×n 2 ×n 3 with tubal rank r, suppose it has the skinny t-SVD T = U * Λ * V , where U ∈ R n 1 ×r×n 3 , V ∈ R r×n 2 ×n 3 are orthogonal tensors, and Λ ∈ R r×r×n 3 is an f -diagonal tensor. Then, T is said to satisfy the tensor incoherent condition (TIC) with parameter µ(T) if the following inequalities hold: wheree i ∈ R n 1 ×1×n 3 is a tensor column basis with only the (i, 1, 1)-th element being 1 and all the others being 0, ande j ∈ R n 2 ×1×n 3 is also a tensor column basis with only the (j, 1, 1)-th element being 1 and all the others being 0.

Assumption 1.
Suppose the true tensor L 0 in the measurement model (1) satisfies tensor incoherence condition with parameter µ.
Assumption 1 intrinsically ensures that the row bases and column bases of L 0 do not align well with the canonical row and column bases. Thus, the low-rank L 0 is not sparse, which avoids the ambiguity that low-rank component can also be sparse in the measurement Model (1).
We should also force the sparse component in Model (1) is not low rank.

Assumption 2.
Assume the support Ω of S 0 is drawn uniformly at random. (18).

Now we can establish an upper bound on the estimation error ofL andŜ in Problem
Theorem 1 (An Upper Bound on the Estimation Error). Suppose L 0 and S 0 satisfy Assumption 1 and Assumption 2, respectively. If the tubal rank r of L 0 and the sparsity (i.e., the l 0 -norm) s of S 0 are respectively upper bounded as follows: r ≤ c r min{n 1 , n 2 } µ log 2 (n 3 max{n 1 , n 2 }) , and s ≤ c s n 1 n 2 n 3 (24) where c l and c s are two sufficiently small numerical constants independent on the dimensions n 1 , n 2 and n 3 . Then the estimator defined in Model (18) satisfy the following inequalities: with probability at least 1 − c 1 (n 3 max{n 1 , n 2 }) −c 2 (over the choice of support of S 0 ), where c 1 and c 2 are positive constants independent on the dimensions n 1 , n 2 and n 3 .
The proof of Theorem 1 are given in the appendix. In Theorem 1, estimation errors on L 0 and S 0 are separately established. It indicates that the estimation error scales linearly with the noise level δ, which is in consistence with the result in [37].

Remark 3.
A significant progress over [37] is that in the noiseless setting where E 0 vanishes, our analysis can provide exact recovery guarantee of L 0 and S 0 . This is because the tensor incoherence condition adopted in our analysis intrinsically ensures that the low-rank tensor L 0 is not sparse and thus can be separated from the sparse corruption tensor, whereas the non-spiky condition adopted in [37] fails to provide identifiability in the measurement Model (1).
For Theorem 1, we also give the following remark. (I). When δ = 0, i.e., in the noiseless case, the error bounds in Theorem 1 will vanish, which means exact recovery of L 0 and S 0 can be guaranteed. This result is consistent with the analysis in [2] for TNN-based TRPCA Model (15). (II). When n 3 = 1, the error bound on the sparse component in Theorem 1 is consistent with the error bound shown in Equation (8) of [45]. The upper bound on error of the low-rank component in Theorem 1 is sharper than that given in Equation (8) of [45]. (III). When n 3 = 1 and δ = 0, the proposed STPCP has consistent theoretical guarantee with the analysis of RPCA [46].

Algorithms
In this section, we design two algorithms. The first algorithm is based on the framework of ADMM [48] which has been extensively used in convex optimization with good convergence behavior. However, ADMM requires full SVDs on large matrices in each iteration which is high computational burden in high-dimensional settings. Thus, the second algorithm is proposed to solve this issue by using a factorization trick which can instead conducting SVDs on much smaller matrices.

An ADMM Algorithm
The proposed estimator (18) is equivalent to the following unconstrained problem: where γ is a positive parameter balancing the data fidelity term and the regularization term. Besides being corrupted by noises and outliers, the observed tensor M may also suffer from missing entries which can be taken as outliers with known positions in many applications. Thus, it is more practical to consider the recovery of L 0 against outliers S 0 , noises E 0 and missing entries shown in the following measurement model: where tensor B ∈ R n 1 ×n 2 ×n 3 denote the missing mask where B ijk = 1, if the (i, j, k)-th entry of L is observed and B ijk = 0 otherwise, and denotes element-wise multiplication. Taking into consideration of missing entries, Model (26) can be further modified as: By adding auxiliary variables to Problem (28), we obtain: The Augmented Lagrangian (AL) of Problem (29) is given as follows: where Y 1 , Y 2 ∈ R n 1 ×n 2 ×n 3 are Lagrangian multipliers and ρ is a penalty parameter. According the strategy of ADMM, we update prime variables (L, S) and (K, R) by alternative minimization of AL in Problem (29) as follows: • Update (L, S). We update (L, S) by minimizing L ρ with other variables fixed as follows: = argmin (L,S) Taking derivatives of the right-hand side of Equation (31) with respect to L and S respectively, and setting the results zero, we obtain: Resolving the above equation group yields: where denotes entry-wise division and 1 denotes the tensor all whose entries are 1. • Update (K, R). We update (K, R) by minimizing L ρ with other variables fixed as follows: Please note that Problem (34) can further be solved separately as follows: and where S · TNN τ (·) is the proximity operator of TNN [5]. and S · 1 τ (·) is the proximity operator of tensor l 1 -norm given as follows [49]: In [5], a closed-form expression of S τ (·) is given as follows: Lemma 3 (Proximity operator of TNN [5]). For any 3D tensor A ∈ R n 1 ×n 2 ×n 3 with reduced t-SVD A = U * Λ * V , where U ∈ R n 1 ×r×n 3 and V ∈ R n 2 ×r×n 3 are orthogonal tensors and Λ ∈ R r×r×n 3 is the f-diagonal tensor of singular tubes, the proximity operator S · TNN τ (A) at A can be computed by: . The Lagrangian multipliers are updated by gradient ascent as follows: The algorithm is summarized in Algorithm 1. The convergence analysis of Algorithm 1 is established in Theorem 2.
The proof of Theorem 2 is given in the Appendix A. In a single iteration of Algorithm 1, the main cost comes from updating L t which involves computing FFT, IFFT and n 3 SVDs of n 1 × n 2 matrices [47]. Hence Algorithm 1 has per-iteration complexity of order O n 1 n 2 n 3 (n 1 ∧ n 2 + log n 3 ) . Thus, if the total iteration number is T, then the total computational complexity is:

A Faster Algorithm
To reduce the cost of computing TNN which is a main cost of Algorithm 1, we propose the following lemma which indicates that TNN is orthogonal invariant. Lemma 4. Given a tensor X ∈ R r×r×n 3 , let Q ∈ R n 1 ×r×n 3 a two semi-orthogonal tensors, i.e., Q * Q = I ∈ R r×r×n 3 and r ≤ min{n 1 , n 2 }. Then, we have the following relationship: The proof of Lemma 4 can be found in the appendix. Equipped with Lemma 4, we decompose the low-rank component in Problem (28) as follows: where I r ∈ R r×r×n 3 is an identity tensor. The similar strategy has been used in low-rank matrix recovery from gross corruptions by [50]. Furthermore, we propose the following model for Problem (28): where r is an upper estimation of tubal rank of the underlying tensor r * = r tubal (L 0 ). In contrast to Model (28), the proposed Model (39) is a non-convex optimization problem. That means Model (39) may have many local minima. We establish a connection between the proposed Model (39) with Model (28) in the following theorem. (39) and Model (28)). Let (Q * , X * , S * ) be a global optimal solution to Problem (39). Furthermore, let (L , S ) be the solution to Problem (28), and r tubal (L ) ≤ r, where r is the initialized tubal rank. Then (Q * * X * , S * ) is also the optimal solution to Problem (28).

Theorem 3 (Connection between Model
The proof of Theorem 3 can be found in the appendix. Theorem 3 states that the global optimal point of the (non-convex) Model (39) coincides with solution of the (convex) Model (28). It further indicates that the accuracy of Model (39) cannot exceed Model (28), which can be validated numerically in the experiment section.
To solve Model (39), we also use the ADMM framework. First, by adding auxiliary variables, we have the following problem: min L,S,R,Q,X The augmented Lagrangian of Problem (40) is: According the strategy of ADMM, we update prime variables (L, S) and (Q, X, R) by alternative minimization of AL in Problem (41)  = argmin (L,S) Taking derivatives of the right-hand side with respect to L and S respectively, and setting the results zero, we obtain: Resolving the above equation group yields: • Update Q. We update Q by minimizing L ρ with other variables fixed as follows: where operator P(·) is defined in Lemma 5 as follows.

Lemma 5 ([51]
). Given any tensors A ∈ R r×n 2 ×n 3 , B ∈ R n 1 ×n 2 ×n 3 , suppose tensor B * A has t-SVD B * A = U * Λ * V , where U ∈ R n 1 ×r×n 3 and V ∈ R r×r×n 3 . Then, the problem: has a closed-form solution as: • Update (X, R):We update (X, S) by minimizing L ρ with other variables fixed as follows: Please note that Problem (48) can further be solved separately as follows: and The equality (i) in Equation (49) holds because according to Q * Q = I, we have: • Update (Y 1 , Y 2 ). The Lagrangian multipliers are updated by gradient ascent as follows: The algorithmic steps are summarized in Algorithm 2. The complexity analysis is given as follows.
In each iteration of Algorithm 2, the update of L requires FFT/IFFT, and n 3 multiplications of n 1 -by-r and r-by-n 2 matrices, which costs O (n 1 n 2 + rn 1 + r n 2)n 3 log n 3 + rn 1 n 2 n 3 ; updating S costs O n 1 n 2 n 3 ; updating of Q involves FFT/IFFT and n 3 SVDs of n 1 -by-r matrices, which costs O rn 1 n 3 log n 3 + r 2 n 1 n 3 ; updating X involves FFT/IFFT and n 3 SVDs of r-by-n 2 , which costs O rn 2 n 3 log n 3 + r 2 n 2 n 3 ) . Then, the per-iteration computational complexity of Algorithm 2 is dominated by: O max n 1 n 2 n 3 log n 3 , r 2 (n 1 + n 2 )n 3 .
Since the low-tubal-rank assumption r min{n 1 , n 2 } is adopted in this paper, the per-iteration of Algorithm 2 is much lower than Algorithm 1.

Synthetic Data
We first verify the correctness of Theorem 1. Specifically, we check whether the following two statements indicated in Theorem 1 hold in experiments on synthetic data sets: (I). (Exact recovery in the noiseless setting.) Our analysis guarantees that the underlying low-rank tensor L 0 and sparse tensor S 0 can be exactly recovered in the noiseless setting. This statement will be checked in Section 5.1.1. (II). (Linear scaling of errors with the noise level.) In Theorem 1, the estimation errors on L 0 and S 0 scales linearly with the noise level δ. This statement will be checked in Section 5.1.2.
Signal Generation. With a given tubal rank r 0 , we first generate the underlying tensor L 0 ∈ R n 1 ×n 2 ×n 3 by L 0 = A * B/n 3 , where tensors A ∈ R n 1 ×r 0 ×n 3 and B ∈ R r 0 ×n 2 ×n 3 are generated with i.i.d. standard Gaussian elements. Then, the sparse corruption tensor S 0 is generated by choosing its support uniformly at random. The non-zero elements of S 0 will be i.i.d. sampled from a certain distribution that will be specified afterwards. Furthermore, the noise tensor E 0 is generated with entries sampled i.i.d. from N (0, σ 2 ) with σ = c L 0 F / √ n 1 n 2 n 3 , where we set constant c is to control the signal noise ratio. Finally, the observed tensor M is formed by M = L 0 + S 0 + E 0 .

Exact Recovery in the Noiseless Setting
We first check Statement (I), i.e., exact recovery in the noiseless setting. Specifically, we will show that Algorithm 1 and Algorithm 2 can exactly recover the underlying tensor L 0 and the sparse corruption S 0 . We first test the recovery performance of different tensor sizes by setting n = n 1 = n 2 ∈ {100, 160, 200} and n 3 = 20, with (r tubal (L 0 ), S 0 0 ) = (0.05n, 0.05n 2 n 3 ). The non-zero elements of tensor S 0 is sampled from i.i.d. symmetric Bernoulli distribution, i.e., the possibility of being 1 or −1 are 1/2. The results are shown in Table 1. It can be seen that both Algorithm 1 and Algorithm 2 can obtain relative standard error (RSE) smaller than 1e − 5 by which we can say that L 0 and S 0 are exact recovered. We can also see that Algorithm 2 runs much faster than Algorithm 1. Table 1. Performance of Algorithm 1 and Algorithm 2 in both accuracy and speed for different tensor sizes when the gross corruption. Outliers from symmetric Bernoulli, observation tensor M ∈ R n×n×n 3 , n 3 = 30, r tubal (L 0 ) = 0.05n, S 0 1 = 0.05n 2 n 3 , noise level c = 0, r = max 2r tubal (L 0 ) , 15 . We then test whether the recovery performance can be affected by the distribution of the corruptions. This is done by choosing the non-zeros elements of S 0 from i.i.d. standard Gaussian distribution. The experimental results are reported in Table 2. We can find that both Algorithm 1 and Algorithm 2 can exactly recover the true L 0 and S 0 and Algorithm 2 runs much faster than Algorithm 1.
We also conduct STPCP by Algorithm 1 and Algorithm 2 with missing entries. After generating L 0 , S 0 and E 0 , we get the observation by Model (27). We choose the support of B uniformly at random with possibility 0.8 and then set elements in the chosen support to be 1. Thus, %20 of the entries are missing. The corrupted observation M is then formed by M = B (L 0 + S 0 + E 0 ). We show the recover results in Table 3. We can see that the underlying low-rank tensor L 0 can be exactly recovered and the observed part of the corruption tensor B S 0 can also be exactly recovered (Please note that it is impossible to recover the unobserved entries of a sparse tensor S 0 [52]). Table 2. Performance of Algorithm 1 and Algorithm 2 in both accuracy and speed for different tensor sizes when the gross corruption. Outliers from standard Gaussian distribution, observation tensor M ∈ R n×n×n 3 , n 3 = 30, r tubal (L 0 ) = 0.05n, S 0 1 = 0.05n 2 n 3 , noise level c = 0, r = max 2r tubal (L 0 ) , 15 .  Table 3. Performance of Algorithm 1 and Algorithm 2 in both accuracy and speed for different tensor sizes when the gross corruption. Outliers from symmetric Bernoulli, observation tensor M ∈ R n×n×n 3 , n 3 = 30, r tubal (L 0 ) = 0.05n, S 0 1 = 0.05n 2 n 3 , noise level c = 0, r = max 2r tubal (L 0 ) , 15 , with %20 random missing entries.  Figure 4. We can see that the estimation error has linear scaling behavior along with the noise level as Theorem 1 indicates. Since the results for n = 80 and n = 100 are quite similar to the case of n = 60, they are simply omitted.

Real Data Sets
In this section, experiments on real data sets (color images and videos) are carried out to evaluate the effectiveness and efficiency of the proposed Algorithms 1 and 2. Besides noises and sparse corruptions, we also consider missing values which is more challenging. The proposed algorithms are compared with the following typical models: (I). NN-I: tensor recovery based on matrix nuclear norms of frontal slices formulated as follows: This model will be used for image restoration in Section 5.2.1. Please note that Model (53) is equivalent to parallel matrix recovery on each frontal slice. (II). NN-II: tensor recovery based on matrix nuclear norm formulated as follows: where L = [l 1 , l 2 , · · · , l n 3 ] ∈ R n 1 n 2 ×n 3 with l k := vec(L (k) ) ∈ R n 1 n 2 defined as the vectorization [40] of frontal slices L (k) , for all k = 1, 2, · · · , n 3 . This model will be used for video restoration in Section 5.2.2. (III). SNN: tensor recovery based on SNN formulated as follows: where L (i) ∈ R n i ×∏ j =i n j is the mode-i matriculation of tensor L ∈ R n 1 ×n 2 ×n 3 , for all i = 1, 2, 3.
We solve the above Model (53) Please note that a larger PSNR value indicates higher quality ofL.

Color Images
Color images are the most commonly used 3-way tensors. We test the twenty 256-by-256-by-3 color images which have been used in [37], and carry out robust tensor recovery with missing entries (see Figure 5). Following [37], for a color image L 0 ∈ R n×n×3 , we choose its support uniformly at random with ratio ρ s and fill in the values with i.i.d. symmetric Bernoulli variables to generate S 0 . The noise tensor E 0 is generated with i.i.d. zero-mean Gaussian entries whose standard deviation is given by σ = 0.05 L 0 F / √ 3n 2 . Then, we form the binary observation mask B by choosing its support uniformly at random with ratio ρ obs . Finally, the partially observed corruption M = B (L 0 + S 0 + E 0 ) are formed. We consider two scenarios by setting (ρ obs , ρ s ) ∈ {(0.9, 0.1), (0.8, 0.2)}. For NN (Model (53)), we set the regularization parameters λ = 1/ √ nρ obs (suggested by [46]), and set parameter γ = E 0 sp where E 0 sp is estimated as 6.5σ 3ρ obs n log(6n) (suggested by [5]). For SNN, the parameters are chosen to satisfy γ = 0.05, α 1 = α 2 = 3nρ obs , α 3 = 0.01 3nρ obs . For Algorithm 1 and Algorithm 2, we set γ = 0.3 E 0 sp , and λ = 1/ 3nρ obs . The initialized rank r in Algorithm 2 is set as 60.
In each setting, we test each color image for 10 times and report the averaged PSNR and time.
For quantitative comparison, we show the PSNR values and running times in Figures 6 and 7 for settings of (ρ obs , ρ s ) = (0.9, 0.1) and (ρ obs , ρ s ) = (0.8, 0.2), respectively. Several visual examples are shown in Figure 8 for qualitative comparison for the setting of (ρ obs , ρ s ) = (0.8, 0.2). We can see from Figures 6-8 that the proposed Algorithm 1 has the highest recovery quality and the proposed Algorithm 2 has the second highest quality but the fastest running time.

Videos
In this subsection, video restoration experiments are conducted on four broadly used YUV videos (They can be downloaded from https://sites.google.com/site/subudhibadri/fewhelpfuldownloads: Akiyo_qcif, Scilent_qcif, Carphone_qcif, and Claire_qcif.) Owing to computational limitation, we simply use the first 32 frames of the Y components of all the videos which results in four 144-by-176-by-30 tensors. For a 3-way data tensor L 0 ∈ R n 1 ×n 2 ×n 3 , To generate corruption S 0 , the support is chosen uniformly at random with ratio ρ s and then elements in the support are filled in with i.i.d. symmetric Bernoulli variables. The noise tensor E 0 is also generated with i.i.d. zero-mean Gaussian entries whose standard deviation is given by σ = 0.05 L 0 F / √ n 1 n 2 n 3 . Then, the binary observation mask B is formed thorough choosing its support uniformly at random with ratio ρ obs . Finally, the partially observed corruption M = B (L 0 + S 0 + E 0 ) are formed.
We also consider two scenarios by setting (ρ obs , ρ s ) ∈ {(0.9, 0.1), (0.8, 0.2)}. NN-II Model (54) is used in video restoration. For NN-II, we set the regularization parameters λ = 1/ √ n 1 n 2 ρ obs (suggested by [46]), and set parameter γ = E 0 sp where E 0 sp is estimated as 6.5σ ρ obs n 1 n 3 log((n 1 + n 2 )n 3 ) (suggested by [5]). For SNN, the parameters are chosen to satisfy γ = 0.05, α 1 = α 2 = √ n 1 n 3 ρ obs , α 3 = 5 √ n 1 n 3 ρ obs . For Algorithm 1 and Algorithm 2, we set γ = 0.3 E 0 sp , and λ = 1/ max{n 1 , n 2 }n 3 ρ obs after careful parameter tuning. The initialized rank r in Algorithm 2 is set as 60. In each setting, we test each video for 10 times and report the averaged PSNR and time. For quantitative comparison, we show the PSNR values and running times in Table 4. It can be seen that Algorithm 1 has the highest recovery quality and the proposed Algorithm 2 has the second highest quality but the fastest running time.

Conclusions
This paper studied the problem of stable tensor principal component pursuit which aims to recover a tensor from noises and sparse corruptions. We proposed a constrained tubal nuclear norm-based model and established upper bounds on the estimation error. In contrast to prior work [37], our theory can guarantee exact recovery in the noiseless setting. We also designed two algorithms, the first ADMM algorithm can be accelerated by the second Algorithm which adopts a factorization strategy. We validated the correctness of our theory by simulations on synthetic data, and evaluated the effectiveness and efficiency of the proposed algorithms via experiments on color images and videos.
For future directions, it is a natural and interesting extension to consider recovery of 4-way tensors [35] with arbitrary linear transformation [53,54]. It is also interesting to use tensor factorization-based methods [55,56] for STPCP. Another challenging future direction is developing tools to verify whether the unknown tensor satisfies the tensor incoherence condition from its incomplete or corrupted observations. For extensions of the proposed approach to higher-way tensors, we produce the following two ideas: 1.
By recursively applying DFT over successive modes higher than 3 and then unfolding the obtained tensor into 3-way [57], the proposed algorithms and theoretical analysis can be extended to higher-way tensors.

2.
By using the overlapped orientation invariant tubal nuclear norm [58], we can extend the proposed algorithm to higher-order cases and obtain orientation invariance. Before Proving Theorem 1, we should define some notations and operators first. Suppose L 0 ∈ R n 1 ×n 2 ×n 3 with tubal rank r has the skinny t-SVD L 0 = U * Λ * V , where U ∈ R n 1 ×r×n 3 , V ∈ R r×n 2 ×n 3 are orthogonal tensors, and Λ ∈ R r×r×n 3 is an f -diagonal tensor. Define the following set: T := U * A + B * V | A ∈ R r×n 2 ×n 3 , B ∈ R n 1 ×r×n 3 ⊂ R n 1 ×n 2 ×n 3 . (A1) Then, define the projector onto T for any tensor T ∈ R n 1 ×n 2 ×n 3 as follows: Let Ω ⊥ be the complement of Ω ⊂ [n 1 ] × [n 2 ] × [n 3 ] which is the support of S 0 . Then, define two operators P Ω , P Ω ⊥ as follows: for any T ∈ R n 1 ×n 2 ×n 3 . Define two sets Γ and Γ ⊥ as follows: Then, for any tensors X ι , X s ∈ R n 1 ×n 2 ×n 3 , the projectors of the tensor X = (X ι , X s ) into the sets Γ and Γ ⊥ are given as follows, respectively: For any tensors X ι , X s ∈ R n 1 ×n 2 ×n 3 , define two operators on X = (X ι , X s ) as follows: (P T × P Ω )(X) = (P T (X ι ), P Ω (X s )), (P T ⊥ × P Ω ⊥ )(X) = (P T ⊥ (X ι ), P Ω ⊥ (X s )).
Also define two norms as follows: where µ is a constant that will be determined afterwards. We first give Lemma A1 which can be seen as a modified version of Lemma C.1 in [2].
In this way, the proof of Theorem 1 is completed.
Appendix A.4. Proof of Theorem 3 Proof. Please note that (Q * * X * , S * ) is a feasible point of Problem (28), then we have: By the assumption that r tubal (L ) ≤ r, there exists a decomposition L = Q * X , such that (Q , X , S ) is also a feasible point of Problem (39).
By L = Q * X , we have: Thus, we deduce: According to Equations (A41) and (A43), we further have: In this way, (Q * * X * , S * ) is also the optimal solution to Problem (28).