Article

Guaranteed Robust Tensor Completion via ∗L-SVD with Applications to Remote Sensing Data

1 School of Automation, Guangdong University of Technology, Guangzhou 510006, China
2 Tensor Learning Team, RIKEN AIP, Tokyo 103-0027, Japan
3 Key Laboratory of Intelligent Detection and The Internet of Things in Manufacturing, Ministry of Education, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(18), 3671; https://doi.org/10.3390/rs13183671
Submission received: 30 July 2021 / Revised: 3 September 2021 / Accepted: 9 September 2021 / Published: 14 September 2021
(This article belongs to the Special Issue Remote Sensing Image Denoising, Restoration and Reconstruction)

Abstract

This paper conducts a rigorous analysis for the problem of robust tensor completion, which aims at recovering an unknown three-way tensor from incomplete observations corrupted by gross sparse outliers and small dense noises simultaneously due to various reasons such as sensor dead pixels, communication loss, electromagnetic interferences, cloud shadows, etc. To estimate the underlying tensor, a new penalized least squares estimator is first formulated by exploiting the low-rankness of the signal tensor within the framework of the tensor $L$-Singular Value Decomposition ($L$-SVD) and leveraging the sparse structure of the outlier tensor. Then, an algorithm based on the Alternating Direction Method of Multipliers (ADMM) is designed to compute the estimator in an efficient way. Statistically, a non-asymptotic upper bound on the estimation error is established and further proved to be optimal (up to a log factor) in a minimax sense. Simulation studies on synthetic data demonstrate that the proposed error bound can predict the scaling behavior of the estimation error with the problem parameters (i.e., the tubal rank of the underlying tensor, the sparsity of the outliers, and the number of uncorrupted observations). Both the effectiveness and efficiency of the proposed algorithm are evaluated through experiments for robust completion on seven different types of remote sensing data.

Graphical Abstract

1. Introduction

Despite the broad adoption of advanced sensors in various remote sensing tasks, data quality remains a critical issue and can significantly influence the actual performance of back-end applications. Many types of modern remote sensing data in the modalities of optical, hyperspectral, multispectral, thermal, Light Detection and Ranging (LiDAR), Synthetic Aperture Radar (SAR), etc., are typically multi-way and can be readily stored, analyzed, and processed by tensor-based models [1,2,3,4,5,6,7]. In some extreme circumstances, the data tensor may suffer from missing entries, gross sparse outliers, and small dense noises at the same time, as a result of partial sensor failures, communication errors, occlusion by obstacles, and so on [8,9]. To robustly complete a partially observed data tensor corrupted by outliers and noises, the problem of robust tensor completion arises.
When only a fraction of partially corrupted observations is available, the crucial point of robust tensor completion lies in the assumption that the underlying data tensor is highly redundant, such that its main components are only slightly suppressed by missing information, outliers, and noises, and thus can be effectively reconstructed by exploiting the intrinsic redundancy. Tensor low-rankness is an ideal tool to model the redundancy of tensor data, and has gained extensive attention in remote sensing data restoration [5,10,11].
As higher-order extensions of low-rank matrix models [12], low-rank tensor models are typically formulated as minimization problems involving a tensor rank function [13]. However, there are multiple definitions of tensor ranks, such as the CP rank [14], Tucker rank [15], TT rank [16], TR rank [17], etc., which focus on low-rank structures in the original domain (like the pixel domain of optical images) [18,19]. Recently, a remarkably different example named the low-tubal-rank tensor model [20,21] was proposed within the algebraic framework of the tensor Singular Value Decomposition (t-SVD) [20,22], which captures low-rankness in the frequency domain defined via the Discrete Fourier Transform (DFT). As discussed in [18,19,21,23], low-tubal-rank tensor models are capable of exploiting both the low-rankness and the smoothness of tensor data, making them quite suitable for analyzing and processing diverse remote sensing imagery data, which are often simultaneously low-rank and smooth [5,10].
Motivated by the advantages of low-tubal-rankness in modeling remote sensing data, we resolve the robust tensor completion problem by utilizing a generalized low-tubal-rank model based on the tensor $L$-Singular Value Decomposition ($L$-SVD) [24], which leverages low-rankness in more general transformed domains rather than only the DFT domain. It is worth pointing out that the $L$-SVD has very recently become a research focus in tensor-based signal processing, computer vision, and machine learning [18,23,25,26]. Given the theoretical focus of this paper, we only introduce several representative works with statistical analysis as follows. For tensor completion in the noiseless setting, Lu et al. [26] proposed an $L$-SVD-based model which can exactly recover the underlying tensor under mild conditions. For tensor completion from partial observations corrupted by sparse outliers, Song et al. [27] designed an $L$-SVD-based algorithm with an exact recovery guarantee. Zhang et al. [25] developed a theoretically guaranteed $L$-SVD-based approach for tensor completion under Poisson noise. The problem of tensor recovery from noisy linear observations is studied in [18] based on the $L$-SVD with guaranteed statistical performance.
In this paper, we focus on statistically guaranteed approaches in a more challenging setting than the aforementioned $L$-SVD-based models, where the underlying signal tensor suffers from missing entries, sparse outliers, and small dense noises simultaneously. Specifically, we resolve the robust tensor completion problem by formulating an $L$-SVD-based estimator whose non-asymptotic estimation error bound is established and further proved to be minimax optimal (up to a log factor). We propose an algorithm based on the Alternating Direction Method of Multipliers (ADMM) [28,29] to compute the estimator, and evaluate both its effectiveness and efficiency on seven different types of remote sensing data.
The remainder of this paper proceeds as follows. We first introduce some notation and preliminaries in Section 2. Then, the proposed estimator for robust tensor completion is formulated in Section 3. We compute the estimator by using an ADMM-based algorithm described in Section 4. The statistical performance of the proposed estimator is analyzed in Section 5, and connections and differences with previous works are discussed in Section 6. Experimental results on both synthetic and real datasets are reported in Section 7. We summarize this paper and briefly discuss future directions in Section 8. The proofs of the theoretical results are given in Appendix A.

2. Preliminaries

In this section, we first introduce some notations and then give a brief introduction to the L -SVD framework.

2.1. Notations

Main notations are listed in Table 1. Let $[d] := \{1, \dots, d\}$ for $d \in \mathbb{N}_+$. Let $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$ for $a, b \in \mathbb{R}$. For $i \in [d]$, $\mathbf{e}_i \in \mathbb{R}^d$ denotes the standard vector basis whose $i$-th entry is 1 and whose other entries are 0. For $(i, j, k) \in [d_1] \times [d_2] \times [d_3]$, the outer product $\mathbf{e}_i \circ \mathbf{e}_j \circ \mathbf{e}_k$ denotes a standard tensor basis in $\mathbb{R}^{d_1 \times d_2 \times d_3}$, whose $(i, j, k)$-th entry is 1 and whose other entries are 0. For a 3-way tensor, a tube is a vector obtained by fixing the indices of the first two modes and varying the third one; a slice is a matrix obtained by fixing all but two indices. For any set $\Theta$, $|\Theta|$ denotes its cardinality and $\Theta^{\perp}$ its complement. Absolute positive constants are denoted by $C, c, c_0$, etc., whose values may vary from line to line. When the field and size of a tensor are not shown explicitly, it is assumed to be in $\mathbb{R}^{d_1 \times d_2 \times d_3}$. The spectral norm $\|\cdot\|$ and nuclear norm $\|\cdot\|_*$ of a matrix are the maximum and the sum of its singular values, respectively.

2.2. Tensor L -Singular Value Decomposition

The tensor L -SVD is a generalization of the t-SVD [22]. To get a better understanding of L -SVD, we first introduce several basic notions of t-SVD as follows. For any tensor T R d 1 × d 2 × d 3 , its block circulant matrix bcirc ( T ) is defined as
$$\operatorname{bcirc}(\mathcal{T}) := \begin{bmatrix} T^{(1)} & T^{(d_3)} & \cdots & T^{(2)} \\ T^{(2)} & T^{(1)} & \cdots & T^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ T^{(d_3)} & T^{(d_3 - 1)} & \cdots & T^{(1)} \end{bmatrix}.$$
We also define the block vectorization operator and its inverse operator for any T R d 1 × d 2 × d 3 by:
$$\operatorname{bvec}(\mathcal{T}) := \begin{bmatrix} T^{(1)} \\ T^{(2)} \\ \vdots \\ T^{(d_3)} \end{bmatrix}, \qquad \operatorname{bvfold}\big(\operatorname{bvec}(\mathcal{T})\big) = \mathcal{T}.$$
Then, based on the operators defined above, we are able to give the definition of the tensor t-product.
Definition 1
(T-product [22]). For any tensors A R d 1 × d 2 × d 3 and B R d 2 × d 4 × d 3 , their t-product is a tensor C of size d 1 × d 4 × d 3 computed as follows:
$$\mathcal{C} = \mathcal{A} * \mathcal{B} := \operatorname{bvfold}\big(\operatorname{bcirc}(\mathcal{A}) \cdot \operatorname{bvec}(\mathcal{B})\big).$$
If we view the 3-way tensor $\mathcal{C} \in \mathbb{R}^{d_1 \times d_4 \times d_3}$ as a $d_1$-by-$d_4$ “matrix” of tubes $\mathcal{C}(i, j, :) \in \mathbb{R}^{d_3}$, then the t-product can be carried out analogously to matrix multiplication, with scalar multiplication replaced by the circular convolution between tubes (i.e., vectors), as follows:
$$\mathcal{C}(i, j, :) = \sum_{k=1}^{d_2} \mathcal{A}(i, k, :) \circledast \mathcal{B}(k, j, :),$$
where the symbol ⊛ denotes the circular convolution of two tubes a , b R d 3 defined as follows [22]:
$$(\mathbf{a} \circledast \mathbf{b})_j = \sum_{k=1}^{d_3} a_k\, b_{1 + (j - k) \bmod d_3},$$
where mod ( · ) is the modulus operator. According to the well-known relationship between circular convolution and DFT, the t-product is equivalent to matrix multiplication between all the frontal slices in the Fourier domain [22], i.e., 
$$\overline{\mathcal{C}}^{(i)} = \overline{\mathcal{A}}^{(i)}\, \overline{\mathcal{B}}^{(i)}, \quad i \in [d_3],$$
where T ¯ denotes the tensor obtained by conducting DFT on all the mode-3 fibers of any tensor T , i.e., 
$$\overline{\mathcal{T}} = \mathcal{T} \times_3 \mathbf{F}_{d_3},$$
where $\mathbf{F}_{d_3}$ is the transform matrix of the DFT [22], and $\times_3$ denotes the tensor mode-3 product [30].
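To make the slice-wise Fourier-domain computation concrete, the following minimal MATLAB sketch (not the authors' released code) implements the t-product through Equation (3); the function name tprod_fft and all variable names are our own.

function C = tprod_fft(A, B)
% T-product C = A * B of A (d1 x d2 x d3) and B (d2 x d4 x d3): transform
% along mode 3, multiply the frontal slices, and transform back.
    d3   = size(A, 3);
    Abar = fft(A, [], 3);                              % DFT along the third mode
    Bbar = fft(B, [], 3);
    Cbar = zeros(size(A, 1), size(B, 2), d3);
    for k = 1:d3
        Cbar(:, :, k) = Abar(:, :, k) * Bbar(:, :, k); % slice-wise matrix products
    end
    C = real(ifft(Cbar, [], 3));                       % back to the original domain
end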
In [24], Kernfeld et al. extended the t-product to the tensor $L$-product by replacing the DFT with any invertible linear transform $L(\cdot)$ induced by a non-singular transformation matrix $\mathbf{L}$, and established the framework of the $L$-SVD. In recent studies, the transformation matrix $\mathbf{L}$ defining the transform $L$ is restricted to be orthogonal [18,26,31,32] (or unitary [25,27]) for better properties; this paper follows the same convention.
Given any orthogonal matrix $\mathbf{L} \in \mathbb{R}^{d_3 \times d_3}$ (although we restrict $\mathbf{L}$ to be orthogonal for simplicity, our analysis still holds, with simple extensions, for unitary $\mathbf{L}$ [27]), define the associated linear transform $L(\cdot)$ with inverse $L^{-1}(\cdot)$ on any $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ as
$$\overline{\mathcal{T}} = L(\mathcal{T}) := \mathcal{T} \times_3 \mathbf{L}, \qquad \text{and} \qquad L^{-1}(\mathcal{T}) := \mathcal{T} \times_3 \mathbf{L}^{-1}.$$
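A minimal MATLAB sketch of the transform in Equation (4), assuming only that L is an orthogonal d3-by-d3 matrix; the helper name mode3_transform is ours, not part of the paper.

function Tbar = mode3_transform(T, L)
% Mode-3 product Tbar = T x_3 L for a d1 x d2 x d3 tensor T and a d3 x d3
% matrix L: unfold along mode 3, multiply by L, and fold back.
    [d1, d2, d3] = size(T);
    T3   = reshape(permute(T, [3 1 2]), d3, d1 * d2);   % mode-3 unfolding
    Tbar = ipermute(reshape(L * T3, d3, d1, d2), [3 1 2]);
end

% Example usage with a random orthogonal matrix (the "ROM" choice of Section 7.1);
% since L is orthogonal, L' implements the inverse transform:
% [L, ~] = qr(randn(d3));
% Tbar   = mode3_transform(T, L);
% Trec   = mode3_transform(Tbar, L');   % recovers T up to round-off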
Definition 2
(Tensor $L$-product [24]). The $L$-product of any $\mathcal{A} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ and $\mathcal{B} \in \mathbb{R}^{d_2 \times d_4 \times d_3}$ under the invertible linear transform $L$ in Equation (4), denoted by $\mathcal{A} *_L \mathcal{B}$, is defined as the tensor $\mathcal{C} \in \mathbb{R}^{d_1 \times d_4 \times d_3}$ whose transformed frontal slices satisfy $L(\mathcal{C})^{(i)} = L(\mathcal{A})^{(i)}\, L(\mathcal{B})^{(i)}$ for all $i \in [d_3]$.
Definition 3
( L –block-diagonal matrix [18]). For any T R d 1 × d 2 × d 3 , its L –block-diagonal matrix, denoted by T ¯ , is defined as the block diagonal matrix whose i-th diagonal block is the i-th frontal slice T ¯ ( i ) of T ¯ = L ( T ) , i.e.,
$$\overline{\mathbf{T}} := \operatorname{bdiag}(\overline{\mathcal{T}}) := \begin{bmatrix} \overline{T}^{(1)} & & \\ & \ddots & \\ & & \overline{T}^{(d_3)} \end{bmatrix} \in \mathbb{R}^{d_1 d_3 \times d_2 d_3}.$$
Based on the notions of tensor L –transpose, L -identity tensor, L -orthogonal tensor, and f-diagonal tensor [24], the L –SVD (illustrated in Figure 1) is given.
Theorem 1
(Tensor L –SVD, L -tubal rank [24]). Any T R d 1 × d 2 × d 3 has a tensor L –Singular Value Decomposition ( L –SVD) under any L in Equation (4), given as follows
$$\mathcal{T} = \mathcal{U} *_L \mathcal{D} *_L \mathcal{V}^{\top},$$
where U R d 1 × d 1 × d 3 , V R d 2 × d 2 × d 3 are L -orthogonal, and  D R d 1 × d 2 × d 3 is f-diagonal.
The L -tubal rank of T R d 1 × d 2 × d 3 is defined as the number of non-zero tubes of D in its L –SVD in Equation (5) i.e.,
$$r_{tb}(\mathcal{T}) := \#\{ i \mid \mathcal{D}(i, i, :) \neq \mathbf{0},\; i \in [d_1 \wedge d_2] \},$$
where # counts the number of elements of a given set.
For any T R d 1 × d 2 × d 3 , we have the following equivalence between its L -SVD and the matrix SVD of its L –block-diagonal matrix T ¯ :
$$\mathcal{T} = \mathcal{U} *_L \mathcal{D} *_L \mathcal{V}^{\top} \;\Longleftrightarrow\; \overline{\mathbf{T}} = \overline{\mathbf{U}} \cdot \overline{\mathbf{D}} \cdot \overline{\mathbf{V}}^{\top}.$$
Considering the block-diagonal structure of $\overline{\mathbf{T}}$, the tensor $L$-multi-rank of $\mathcal{T}$ is defined as the vector whose $i$-th entry is the rank of the $i$-th diagonal block $\overline{T}^{(i)}$. Based on $\overline{\mathbf{T}}$, the tensor norms used in this paper are defined as follows:
Definition 4
(Tensor L –nuclear norm, tensor L -spectral norm [26]). The tensor L –nuclear norm ( L -TNN) and L -spectral norm of any T R d 1 × d 2 × d 3 under any L in Equation (4) are defined as the matrix nuclear norm and matrix spectral norm of T ¯ , respectively, i.e.,
$$\|\mathcal{T}\|_{\star} := \|\overline{\mathbf{T}}\|_{*}, \qquad \|\mathcal{T}\|_{sp} := \|\overline{\mathbf{T}}\|.$$
As proved in [26,27], the $L$-TNN is the convex envelope of the $l_1$-norm of the $L$-multi-rank within the unit tensor $L$-spectral-norm ball. Thus, the $L$-TNN encourages a low $L$-multi-rank structure, i.e., low-rankness in the spectral (transformed) domain. When the linear transform $L$ represents the DFT along the third mode (although we restrict the $\mathbf{L}$ in Equation (4) to be orthogonal, we still consider TNN as a special case of the $L$-TNN up to constants and the real/complex domain), the $L$-TNN and the tensor $L$-spectral norm degenerate to the Tubal Nuclear Norm (TNN) and the tensor spectral norm, respectively, up to a constant factor $d_3^{-1}$ [26,33].
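For illustration, the $L$-TNN of Definition 4 can be evaluated as the sum of the nuclear norms of the transformed frontal slices; the MATLAB sketch below reuses the hypothetical mode3_transform helper introduced above and assumes an orthogonal transform matrix L.

function tnn = l_tnn(T, L)
% L-TNN of T under the orthogonal transform matrix L: nuclear norm of the
% block-diagonal matrix bdiag(L(T)), i.e., the sum of the nuclear norms of
% the frontal slices of the transformed tensor.
    Tbar = mode3_transform(T, L);
    tnn  = 0;
    for k = 1:size(T, 3)
        tnn = tnn + sum(svd(Tbar(:, :, k)));   % nuclear norm of each slice
    end
end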

3. Robust Tensor Completion

In this section, we will formulate the robust tensor completion problem. The  observation model will be shown first.

3.1. The Observation Model

Consider an underlying signal tensor $\mathcal{L} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ which possesses an intrinsically low-dimensional structure characterized by low-tubal-rankness, that is, $r_{tb}(\mathcal{L}) \ll d_1 \wedge d_2$. Suppose we obtain $N$ scalar observations $y_i$ of $\mathcal{L}$ from the noisy observation model:
$$y_i = \langle \mathcal{L} + \mathcal{S}, \mathcal{X}_i \rangle + \xi_i, \quad i \in [N],$$
where the tensor $\mathcal{S} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ represents gross corruptions (e.g., outliers, errors, etc.) added to the signal $\mathcal{L}$ and is assumed to be element-wise sparse (the presented theoretical analysis and optimization algorithm can be generalized to other sparsity patterns of the corruptions, e.g., tube-wise sparsity [20,34] and slice-wise sparsity [34,35], by using the tools developed for robust matrix completion in [36] and robust tensor decomposition in [34]; for simplicity, we only consider the most common element-wise sparse case), the $\xi_i$'s are random noises sampled i.i.d. from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$, and the $\mathcal{X}_i$'s are known random design tensors in $\mathbb{R}^{d_1 \times d_2 \times d_3}$ satisfying the following assumptions:
Assumption A1.
We make two natural assumptions on the design tensors:
I.
All the corrupted positions of $\mathcal{L}$ are observed; that is, the (unknown) support $\Theta_s = \operatorname{supp}(\mathcal{S}) := \{(i, j, k) \mid \mathcal{S}_{ijk} \neq 0\}$ of the corruption tensor $\mathcal{S}$ is fully observed. Formally speaking, there exists an unknown subset $\mathbb{X}_s \subseteq \{\mathcal{X}_i\}_{i=1}^N$ drawn from an (unknown) distribution $\Pi_{\Theta_s}$ on the set $\mathbb{X}_{\Theta_s} := \{\mathbf{e}_j \circ \mathbf{e}_k \circ \mathbf{e}_l,\ (j, k, l) \in \Theta_s\}$, such that each element in $\mathbb{X}_{\Theta_s}$ is sampled at least once.
II.
All uncorrupted positions of $\mathcal{L}$ are sampled uniformly with replacement, for simplicity of exposition. Formally speaking, each element of the set $\mathbb{X}_\iota := \{\mathcal{X}_i\}_{i=1}^N \setminus \mathbb{X}_s$ is sampled i.i.d. from a uniform distribution $\Pi_{\Theta_s^{\perp}}$ on the set $\mathbb{X}_{\Theta_s^{\perp}} := \{\mathbf{e}_j \circ \mathbf{e}_k \circ \mathbf{e}_l,\ (j, k, l) \in \Theta_s^{\perp}\}$.
According to the observation model (6), the true tensor $\mathcal{L}$ is first corrupted by a sparse tensor $\mathcal{S}$ and then sampled into $N$ scalar observations $\{y_i\}$ with additive Gaussian noises $\{\xi_i\}$ (see Figure 2). Assumption A1 further requires that the corrupted positions of $\mathcal{L}$ are totally observed through the design tensors in $\mathbb{X}_s \subseteq \{\mathcal{X}_i\}_{i=1}^N$, and that the remaining uncorrupted positions are sampled uniformly through the design tensors in $\mathbb{X}_\iota = \{\mathcal{X}_i\}_{i=1}^N \setminus \mathbb{X}_s$.
Let $\mathbf{y} = (y_1, \dots, y_N)^{\top} \in \mathbb{R}^N$ and $\boldsymbol{\xi} = (\xi_1, \dots, \xi_N)^{\top} \in \mathbb{R}^N$ be the vectors of observations and noises, respectively. Define the design operator $\mathfrak{X}: \mathbb{R}^{d_1 \times d_2 \times d_3} \to \mathbb{R}^N$ as $\mathfrak{X}(\cdot) := (\langle \cdot, \mathcal{X}_1 \rangle, \dots, \langle \cdot, \mathcal{X}_N \rangle)^{\top}$, and its adjoint operator as $\mathfrak{X}^*(\mathbf{z}) := \sum_{i=1}^N z_i \mathcal{X}_i$ for any $\mathbf{z} \in \mathbb{R}^N$. Then the observation model (6) can be rewritten in the compact form
$$\mathbf{y} = \mathfrak{X}(\mathcal{L} + \mathcal{S}) + \boldsymbol{\xi}.$$
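As an illustration of the observation model (6) and Assumption A1, the following MATLAB sketch generates a toy instance; the sizes, the corruption ratio, the sampling ratio, and the reuse of the tprod_fft helper sketched in Section 2.2 are assumptions of this example, not settings prescribed by the paper.

d1 = 50; d2 = 50; d3 = 10; r = 3; sigma = 0.1;
L0 = tprod_fft(randn(d1, r, d3), randn(r, d2, d3));      % toy low-tubal-rank signal
S0 = zeros(d1, d2, d3);
corrIdx = randperm(d1*d2*d3, round(0.05 * d1*d2*d3));    % support of the outliers
S0(corrIdx) = rand(1, numel(corrIdx));                   % sparse corruption tensor

clean = setdiff(1:d1*d2*d3, corrIdx);                    % uncorrupted positions
Nl    = round(0.5 * numel(clean));                       % number of clean samples
idx   = [corrIdx, clean(randi(numel(clean), 1, Nl))];    % Assumption A1 (I) + (II)
y     = L0(idx) + S0(idx) + sigma * randn(1, numel(idx)); % noisy observations y_i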

3.2. The Proposed Estimator

The aim of robust tensor completion is to reconstruct the unknown low-rank $\mathcal{L}$ and sparse $\mathcal{S}$ from the incomplete and noisy measurements $\{(\mathcal{X}_i, y_i)\}_{i=1}^N$ generated by the observation model (6). It can be treated as a robust extension of tensor completion in [33], and as a noisy, partially observed variant of tensor robust PCA [37].
To reconstruct the underlying low-rank tensor L and sparse tensor S , it is natural to consider the following minimization model:
$$\min_{\mathcal{L}, \mathcal{S}} \; \frac{1}{2N}\|\mathbf{y} - \mathfrak{X}(\mathcal{L} + \mathcal{S})\|_2^2 + \lambda_\iota\, r_{tb}(\mathcal{L}) + \lambda_s \|\mathcal{S}\|_0,$$
where we use the least squares term as the data fidelity for Gaussian noises, the tubal rank as the regularizer imposing a low-rank structure on $\mathcal{L}$, and the tensor $l_0$-(pseudo)norm as the regularizer for the sparsity of $\mathcal{S}$; $\lambda_\iota, \lambda_s \ge 0$ are tunable regularization parameters balancing the regularizers and the fidelity term.
However, general rank and $l_0$-norm minimization is NP-hard [12,38], making it extremely hard to solve Problem (7) exactly. For tractable low-rank and sparse optimization, we follow the most common idea of relaxing the non-convex functions $r_{tb}(\cdot)$ and $\|\cdot\|_0$ to their convex surrogates, i.e., the $L$-TNN $\|\cdot\|_{\star}$ and the tensor $l_1$-norm $\|\cdot\|_1$, respectively. Specifically, the following estimator is defined:
$$(\hat{\mathcal{L}}, \hat{\mathcal{S}}) := \underset{\|\mathcal{L}\|_\infty \le a,\; \|\mathcal{S}\|_\infty \le a}{\arg\min} \; \frac{1}{2N}\|\mathbf{y} - \mathfrak{X}(\mathcal{L} + \mathcal{S})\|_2^2 + \lambda_\iota \|\mathcal{L}\|_{\star} + \lambda_s \|\mathcal{S}\|_1,$$
where $a > 0$ is a known constant constraining the magnitude of the entries in $\mathcal{L}$ and $\mathcal{S}$. The additional constraints $\|\mathcal{L}\|_\infty \le a$ and $\|\mathcal{S}\|_\infty \le a$ are very mild, since most signals and corruptions have limited energy in real applications. They also provide a theoretical benefit by excluding “spiky” tensors, which is important in controlling the separability of $\mathcal{L}$ and $\mathcal{S}$. Such “non-spiky” constraints are also imposed in the previous literature [36,39,40], playing a key role in bounding the estimation error.
Then, it is natural to ask the following questions:
Q1:
How to compute the proposed estimator?
Q2:
How well can the proposed estimator estimate L and S ?
We first discuss Q1 in Section 4 and then answer Q2 in Section 5.

4. Algorithm

In this section, we answer Q1 by designing an algorithm based on ADMM to compute the proposed estimator.
To solve Problem (8), the first step is to introduce auxiliary variables $\mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}$ to decouple $\mathfrak{X}(\cdot)$, $\|\cdot\|_2$, $\|\cdot\|_{\star}$, $\|\cdot\|_1$, and $\|\cdot\|_{\infty}$ as follows:
$$\min_{\mathbf{g}, \mathcal{L}, \mathcal{S}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}} \; \frac{1}{2N}\|\mathbf{g}\|_2^2 + \lambda_\iota \|\mathcal{K}\|_{\star} + \lambda_s \|\mathcal{T}\|_1 + \delta_a(\mathcal{M}) + \delta_a(\mathcal{N}), \quad \text{s.t.} \;\; \mathbf{g} = \mathbf{y} - \mathfrak{X}(\mathcal{L} + \mathcal{S}), \;\; \mathcal{L} = \mathcal{K} = \mathcal{M}, \;\; \mathcal{S} = \mathcal{T} = \mathcal{N},$$
where $\delta_a(\cdot)$ is the indicator function of the tensor $l_\infty$-norm ball, defined as
$$\delta_a(\mathcal{M}) = \begin{cases} 0, & \|\mathcal{M}\|_\infty \le a \\ +\infty, & \|\mathcal{M}\|_\infty > a \end{cases}.$$
We then give the augmented Lagrangian of Equation (9) with Lagrangian multipliers $\mathbf{z}$ and $\{\mathcal{Z}_i\}_{i=1}^4$ and penalty parameter $\rho > 0$:
$$\begin{aligned} \mathcal{L}_\rho(\mathcal{L}, \mathcal{S}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}, \mathcal{Z}_1, \mathcal{Z}_2, \mathcal{Z}_3, \mathcal{Z}_4) = {}& \frac{1}{2N}\|\mathbf{g}\|_2^2 + \lambda_\iota \|\mathcal{K}\|_{\star} + \lambda_s \|\mathcal{T}\|_1 + \delta_a(\mathcal{M}) + \delta_a(\mathcal{N}) \\ & + \langle \mathbf{z}, \mathbf{g} + \mathfrak{X}(\mathcal{L} + \mathcal{S}) - \mathbf{y} \rangle + \frac{\rho}{2}\|\mathbf{g} + \mathfrak{X}(\mathcal{L} + \mathcal{S}) - \mathbf{y}\|_2^2 \\ & + \langle \mathcal{Z}_1, \mathcal{L} - \mathcal{K} \rangle + \frac{\rho}{2}\|\mathcal{L} - \mathcal{K}\|_F^2 + \langle \mathcal{Z}_2, \mathcal{L} - \mathcal{M} \rangle + \frac{\rho}{2}\|\mathcal{L} - \mathcal{M}\|_F^2 \\ & + \langle \mathcal{Z}_3, \mathcal{S} - \mathcal{T} \rangle + \frac{\rho}{2}\|\mathcal{S} - \mathcal{T}\|_F^2 + \langle \mathcal{Z}_4, \mathcal{S} - \mathcal{N} \rangle + \frac{\rho}{2}\|\mathcal{S} - \mathcal{N}\|_F^2. \end{aligned}$$
Following the framework of the standard two-block ADMM [41], we separate the primal variables into two blocks $(\mathcal{L}, \mathcal{S})$ and $(\mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N})$, and update them alternately as follows:
Update the first block ( L , S ) : After the t-th iteration, we first update ( L , S ) by keeping the other variables fixed as follows:
$$\begin{aligned} (\mathcal{L}^{t+1}, \mathcal{S}^{t+1}) &= \underset{\mathcal{L}, \mathcal{S}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}, \mathcal{S}, \mathbf{g}^t, \mathcal{K}^t, \mathcal{T}^t, \mathcal{M}^t, \mathcal{N}^t, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathcal{L}, \mathcal{S}}{\arg\min}\; \langle \mathbf{z}^t, \mathbf{g}^t + \mathfrak{X}(\mathcal{L} + \mathcal{S}) - \mathbf{y} \rangle + \frac{\rho}{2}\|\mathbf{g}^t + \mathfrak{X}(\mathcal{L} + \mathcal{S}) - \mathbf{y}\|_2^2 \\ &\quad + \langle \mathcal{Z}_1^t, \mathcal{L} - \mathcal{K}^t \rangle + \frac{\rho}{2}\|\mathcal{L} - \mathcal{K}^t\|_F^2 + \langle \mathcal{Z}_2^t, \mathcal{L} - \mathcal{M}^t \rangle + \frac{\rho}{2}\|\mathcal{L} - \mathcal{M}^t\|_F^2 \\ &\quad + \langle \mathcal{Z}_3^t, \mathcal{S} - \mathcal{T}^t \rangle + \frac{\rho}{2}\|\mathcal{S} - \mathcal{T}^t\|_F^2 + \langle \mathcal{Z}_4^t, \mathcal{S} - \mathcal{N}^t \rangle + \frac{\rho}{2}\|\mathcal{S} - \mathcal{N}^t\|_F^2. \end{aligned}$$
By taking the derivatives with respect to $\mathcal{L}$ and $\mathcal{S}$, respectively, and setting them to zero, we obtain the following system of equations:
$$\begin{aligned} \mathfrak{X}^*\mathbf{z}^t + \rho\, \mathfrak{X}^*\big(\mathfrak{X}(\mathcal{L} + \mathcal{S}) + \mathbf{g}^t - \mathbf{y}\big) + \mathcal{Z}_1^t + \rho(\mathcal{L} - \mathcal{K}^t) + \mathcal{Z}_2^t + \rho(\mathcal{L} - \mathcal{M}^t) &= \mathbf{0}, \\ \mathfrak{X}^*\mathbf{z}^t + \rho\, \mathfrak{X}^*\big(\mathfrak{X}(\mathcal{L} + \mathcal{S}) + \mathbf{g}^t - \mathbf{y}\big) + \mathcal{Z}_3^t + \rho(\mathcal{S} - \mathcal{T}^t) + \mathcal{Z}_4^t + \rho(\mathcal{S} - \mathcal{N}^t) &= \mathbf{0}. \end{aligned}$$
Through solving the system of equations in Equation (12), we obtain
$$\begin{aligned} \mathcal{L}^{t+1} &= \frac{1}{4\rho}\Big[ \mathfrak{X}^*\mathfrak{X}\big(\mathfrak{X}^*\mathfrak{X} + \mathcal{I}\big)^{-1}\big(2\mathcal{A} + \mathcal{B}_\iota + \mathcal{B}_s\big) - 2\big(\mathcal{A} + \mathcal{B}_\iota\big) \Big], \\ \mathcal{S}^{t+1} &= \frac{1}{4\rho}\Big[ \mathfrak{X}^*\mathfrak{X}\big(\mathfrak{X}^*\mathfrak{X} + \mathcal{I}\big)^{-1}\big(2\mathcal{A} + \mathcal{B}_\iota + \mathcal{B}_s\big) - 2\big(\mathcal{A} + \mathcal{B}_s\big) \Big], \end{aligned}$$
where $\mathcal{I}$ denotes the identity operator, and the intermediate tensors are given by $\mathcal{A} = \mathfrak{X}^*\big(\mathbf{z}^t + \rho(\mathbf{g}^t - \mathbf{y})\big)$, $\mathcal{B}_\iota = \mathcal{Z}_1^t + \mathcal{Z}_2^t - \rho(\mathcal{K}^t + \mathcal{M}^t)$, and $\mathcal{B}_s = \mathcal{Z}_3^t + \mathcal{Z}_4^t - \rho(\mathcal{T}^t + \mathcal{N}^t)$.
Update the second block ( g , K , T , M , N ) : According to the special form of the Lagrangian in Equation (10), the variables g , K , T , M , N in the second block can be updated separately as follows.
We first update g with fixed ( L , S ) :
$$\begin{aligned} \mathbf{g}^{t+1} &= \underset{\mathbf{g}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}^{t+1}, \mathcal{S}^{t+1}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathbf{g}}{\arg\min}\; \frac{1}{2N}\|\mathbf{g}\|_2^2 + \frac{\rho}{2}\big\|\mathbf{g} + \mathfrak{X}(\mathcal{L}^{t+1} + \mathcal{S}^{t+1}) - \mathbf{y} + \rho^{-1}\mathbf{z}^t\big\|_2^2 \\ &= \frac{N\rho}{1 + N\rho}\big(\mathbf{y} - \mathfrak{X}(\mathcal{L}^{t+1} + \mathcal{S}^{t+1}) - \rho^{-1}\mathbf{z}^t\big). \end{aligned}$$
We then update K with fixed ( L , S ) :
$$\begin{aligned} \mathcal{K}^{t+1} &= \underset{\mathcal{K}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}^{t+1}, \mathcal{S}^{t+1}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathcal{K}}{\arg\min}\; \lambda_\iota \|\mathcal{K}\|_{\star} + \langle \mathcal{Z}_1^t, \mathcal{L}^{t+1} - \mathcal{K} \rangle + \frac{\rho}{2}\|\mathcal{L}^{t+1} - \mathcal{K}\|_F^2 \\ &= \operatorname{Prox}_{\lambda_\iota \rho^{-1} \|\cdot\|_{\star}}\big(\mathcal{L}^{t+1} + \rho^{-1}\mathcal{Z}_1^t\big), \end{aligned}$$
where $\operatorname{Prox}_{\tau\|\cdot\|_{\star}}(\cdot)$ is the proximality operator of the $L$-TNN, given in the following lemma.
Lemma 1
(A modified version of Theorem 3.2 in [26]). Let $\mathcal{L}_0 \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ be any tensor with $L$-SVD $\mathcal{L}_0 = \mathcal{U} *_L \mathcal{D} *_L \mathcal{V}^{\top}$. Then the proximality operator of the $L$-TNN at $\mathcal{L}_0$ with constant $\tau > 0$, defined as $\operatorname{Prox}_{\tau\|\cdot\|_{\star}}(\mathcal{L}_0) := \arg\min_{\mathcal{L}} \tau\|\mathcal{L}\|_{\star} + \frac{1}{2}\|\mathcal{L} - \mathcal{L}_0\|_F^2$, can be computed by
$$\operatorname{Prox}_{\tau\|\cdot\|_{\star}}(\mathcal{L}_0) = \mathcal{U} *_L \mathcal{D}_{\tau} *_L \mathcal{V}^{\top},$$
where
$$\mathcal{D}_{\tau} = L^{-1}\big((L(\mathcal{D}) - \tau)_{+}\big),$$
and $t_{+}$ denotes the positive part of $t$, i.e., $t_{+} = \max(t, 0)$.
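A minimal MATLAB sketch of Lemma 1, assuming an orthogonal transform matrix L and reusing the hypothetical mode3_transform helper from Section 2.2: singular value thresholding is applied to each frontal slice in the transformed domain and the result is transformed back.

function X = prox_ltnn(L0, tau, L)
% Proximality operator of the L-TNN at L0 with threshold tau.
    Xbar = mode3_transform(L0, L);                    % go to the transformed domain
    for k = 1:size(Xbar, 3)
        [U, S, V] = svd(Xbar(:, :, k), 'econ');
        Xbar(:, :, k) = U * max(S - tau, 0) * V';     % shrink the singular values
    end
    X = mode3_transform(Xbar, L');                    % inverse transform (L orthogonal)
end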
We update T with fixed ( L , S ) :
$$\begin{aligned} \mathcal{T}^{t+1} &= \underset{\mathcal{T}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}^{t+1}, \mathcal{S}^{t+1}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathcal{T}}{\arg\min}\; \lambda_s \|\mathcal{T}\|_1 + \langle \mathcal{Z}_3^t, \mathcal{S}^{t+1} - \mathcal{T} \rangle + \frac{\rho}{2}\|\mathcal{S}^{t+1} - \mathcal{T}\|_F^2 \\ &= \operatorname{Prox}_{\lambda_s \rho^{-1} \|\cdot\|_1}\big(\mathcal{S}^{t+1} + \rho^{-1}\mathcal{Z}_3^t\big), \end{aligned}$$
where $\operatorname{Prox}_{\tau\|\cdot\|_1}(\mathcal{T})$ is the proximality operator [19] of the tensor $l_1$-norm at the point $\mathcal{T}$, given by $\operatorname{Prox}_{\tau\|\cdot\|_1}(\mathcal{T}) = \operatorname{sign}(\mathcal{T}) \odot (|\mathcal{T}| - \tau)_{+}$, where $\odot$ denotes the element-wise product.
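For completeness, this proximality operator is the element-wise soft-thresholding; a one-line MATLAB sketch is:

% Element-wise soft-thresholding: prox of tau*||.||_1 at T.
soft_threshold = @(T, tau) sign(T) .* max(abs(T) - tau, 0);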
We then update M with fixed ( L , S ) :
$$\begin{aligned} \mathcal{M}^{t+1} &= \underset{\mathcal{M}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}^{t+1}, \mathcal{S}^{t+1}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathcal{M}}{\arg\min}\; \delta_a(\mathcal{M}) + \langle \mathcal{Z}_2^t, \mathcal{L}^{t+1} - \mathcal{M} \rangle + \frac{\rho}{2}\|\mathcal{L}^{t+1} - \mathcal{M}\|_F^2 \\ &= \operatorname{Proj}_{a}\big(\mathcal{L}^{t+1} + \rho^{-1}\mathcal{Z}_2^t\big), \end{aligned}$$
where $\operatorname{Proj}_{a}(\cdot)$ is the projector onto the tensor $l_\infty$-norm ball of radius $a$, which is given by $\operatorname{Proj}_{a}(\mathcal{M}) = \operatorname{sign}(\mathcal{M}) \odot \min(|\mathcal{M}|, a)$ [19].
Similarly, we update N as follows:
$$\begin{aligned} \mathcal{N}^{t+1} &= \underset{\mathcal{N}}{\arg\min}\; \mathcal{L}_\rho(\mathcal{L}^{t+1}, \mathcal{S}^{t+1}, \mathbf{g}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}^t, \mathcal{Z}_1^t, \mathcal{Z}_2^t, \mathcal{Z}_3^t, \mathcal{Z}_4^t) \\ &= \underset{\mathcal{N}}{\arg\min}\; \delta_a(\mathcal{N}) + \langle \mathcal{Z}_4^t, \mathcal{S}^{t+1} - \mathcal{N} \rangle + \frac{\rho}{2}\|\mathcal{S}^{t+1} - \mathcal{N}\|_F^2 \\ &= \operatorname{Proj}_{a}\big(\mathcal{S}^{t+1} + \rho^{-1}\mathcal{Z}_4^t\big). \end{aligned}$$
Update the dual variables z and { Z i } : According to the update strategy of dual variables in ADMM [41], the variables z and { Z i } can be updated using dual ascent as follows:
$$\begin{aligned} \mathbf{z}^{t+1} &= \mathbf{z}^t + \rho\big(\mathbf{g}^{t+1} + \mathfrak{X}(\mathcal{L}^{t+1} + \mathcal{S}^{t+1}) - \mathbf{y}\big), \\ \mathcal{Z}_1^{t+1} &= \mathcal{Z}_1^{t} + \rho(\mathcal{L}^{t+1} - \mathcal{K}^{t+1}), \\ \mathcal{Z}_2^{t+1} &= \mathcal{Z}_2^{t} + \rho(\mathcal{L}^{t+1} - \mathcal{M}^{t+1}), \\ \mathcal{Z}_3^{t+1} &= \mathcal{Z}_3^{t} + \rho(\mathcal{S}^{t+1} - \mathcal{T}^{t+1}), \\ \mathcal{Z}_4^{t+1} &= \mathcal{Z}_4^{t} + \rho(\mathcal{S}^{t+1} - \mathcal{N}^{t+1}). \end{aligned}$$
The algorithm for solving Problem (8) is summarized in Algorithm 1.
Algorithm 1 Solving Problem (8) using ADMM.
Input: The design tensors $\{\mathcal{X}_i\}$ and observations $\{y_i\}$; the regularization parameters $\lambda_\iota, \lambda_s$; the $l_\infty$-norm bound $a$; the penalty parameter $\rho$ of the Lagrangian; the convergence tolerance $\delta$; the maximum iteration number $T_{\max}$.
1: Initialize $t = 0$, $\mathbf{g}^0 = \mathbf{z}^0 = \mathbf{0} \in \mathbb{R}^N$, $\mathcal{L}^0 = \mathcal{S}^0 = \mathcal{K}^0 = \mathcal{T}^0 = \mathcal{M}^0 = \mathcal{N}^0 = \mathcal{Z}_1^0 = \mathcal{Z}_2^0 = \mathcal{Z}_3^0 = \mathcal{Z}_4^0 = \mathbf{0} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$.
2: for $t = 0, \dots, T_{\max}$ do
3:     Update $(\mathcal{L}^{t+1}, \mathcal{S}^{t+1})$ by Equation (13);
4:     Update $(\mathbf{g}^{t+1}, \mathcal{K}^{t+1}, \mathcal{T}^{t+1}, \mathcal{M}^{t+1}, \mathcal{N}^{t+1})$ by Equations (14)–(20), respectively;
5:     Update $(\mathbf{z}^{t+1}, \mathcal{Z}_1^{t+1}, \mathcal{Z}_2^{t+1}, \mathcal{Z}_3^{t+1}, \mathcal{Z}_4^{t+1})$ by Equation (21);
6:     Check the convergence criteria:
7:     (i) convergence of the primal variables: $\|A^{t+1} - A^t\| \le \delta$ for all $A \in \{\mathbf{g}, \mathcal{L}, \mathcal{S}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}\}$;
       (ii) convergence of the constraints: $\max\{\|\mathcal{L}^{t+1} - \mathcal{K}^{t+1}\|, \|\mathcal{L}^{t+1} - \mathcal{M}^{t+1}\|\} \le \delta$, $\max\{\|\mathcal{S}^{t+1} - \mathcal{T}^{t+1}\|, \|\mathcal{S}^{t+1} - \mathcal{N}^{t+1}\|\} \le \delta$, and $\|\mathbf{g}^{t+1} + \mathfrak{X}(\mathcal{L}^{t+1} + \mathcal{S}^{t+1}) - \mathbf{y}\| \le \delta$;
8: end for
Output: $(\hat{\mathcal{L}}, \hat{\mathcal{S}}) = (\mathcal{L}^{t+1}, \mathcal{S}^{t+1})$.
Complexity Analysis: The time complexity of Algorithm 1 is analyzed as follows. Due to the special structure of the design tensors $\{\mathcal{X}_i\}$, the operators $\mathfrak{X}$ and $\mathfrak{X}^*$ can be implemented with time cost $O(N)$, and $(\mathfrak{X}^*\mathfrak{X} + \mathcal{I})^{-1}$ with cost $O(d_1 d_2 d_3 + N)$. The cost of updating $\mathcal{L}, \mathcal{S}, \mathcal{T}, \mathcal{M}, \mathcal{N}$ and $\{\mathcal{Z}_i\}$ is $O(d_1 d_2 d_3)$. The main time cost of Algorithm 1 lies in the update of $\mathcal{K}$, which needs the $L$-SVD of $d_1 \times d_2 \times d_3$ tensors, involving the $L$-transform (costing $O(d_1 d_2 d_3^2)$ in general) and $d_3$ matrix SVDs on $d_1 \times d_2$ matrices (costing $O(d_1 d_2 d_3 (d_1 \wedge d_2))$). Thus, the one-iteration cost of Algorithm 1 is
$$O\big(d_1 d_2 d_3 ((d_1 \wedge d_2) + d_3)\big)$$
in general, and can be reduced to $O\big(d_1 d_2 d_3 ((d_1 \wedge d_2) + \log d_3)\big)$ for linear transforms $L$ which have fast implementations (like the DFT and DCT).
Convergence Analysis: According to [28], the convergence rate of general ADMM-based algorithms is O ( 1 / t ) , where t is the iteration number. The convergence analysis of Algorithm 1 is established in Theorem 2.
Theorem 2
(Convergence of Algorithm 1). For any positive constant ρ, if the unaugmented Lagrangian function L 0 ( g , L , S , K , T , M , N , z , Z 1 , Z 2 , Z 3 , Z 4 ) has a saddle point, then the iterations ( g t , L t , S t , K t , T t , M t , N t , z t , Z 1 t , Z 2 t , Z 3 t , Z 4 t ) in Algorithm 1 satisfy the residual convergence, objective convergence and dual variable convergence (defined in [41]) of Problem (9) as t .
Proof. 
The key idea is to rewrite Problem (9) into a standard two-block ADMM problem. For notational simplicity, let
$$f(\mathbf{u}) = 0, \qquad g(\mathbf{v}) = \frac{1}{2N}\|\mathbf{g}\|_2^2 + \lambda_\iota \|\mathcal{K}\|_{\star} + \lambda_s \|\mathcal{T}\|_1 + \delta_a(\mathcal{M}) + \delta_a(\mathcal{N}),$$
with u , v , w , c and A defined as follows
$$\mathbf{u} = \begin{bmatrix} \operatorname{vec}(\mathcal{L}) \\ \operatorname{vec}(\mathcal{S}) \end{bmatrix} \in \mathbb{R}^{2 d_1 d_2 d_3}, \quad \mathbf{v} = \begin{bmatrix} \mathbf{g} \\ \operatorname{vec}(\mathcal{K}) \\ \operatorname{vec}(\mathcal{T}) \\ \operatorname{vec}(\mathcal{M}) \\ \operatorname{vec}(\mathcal{N}) \end{bmatrix}, \;\; \mathbf{w} = \begin{bmatrix} \mathbf{z} \\ \operatorname{vec}(\mathcal{Z}_1) \\ \operatorname{vec}(\mathcal{Z}_2) \\ \operatorname{vec}(\mathcal{Z}_3) \\ \operatorname{vec}(\mathcal{Z}_4) \end{bmatrix} \in \mathbb{R}^{N + 4 d_1 d_2 d_3}, \quad \mathbf{c} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \\ \mathbf{0} \\ \mathbf{0} \\ \mathbf{0} \end{bmatrix} \in \mathbb{R}^{N + 4 d_1 d_2 d_3},$$
and
$$\mathbf{A} = \begin{bmatrix} \mathbf{X} \\ \begin{matrix} \mathbf{I}_{d_1 d_2 d_3} & \mathbf{0} \\ \mathbf{I}_{d_1 d_2 d_3} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{d_1 d_2 d_3} \\ \mathbf{0} & \mathbf{I}_{d_1 d_2 d_3} \end{matrix} \end{bmatrix} \in \mathbb{R}^{(N + 4 d_1 d_2 d_3) \times (2 d_1 d_2 d_3)}, \quad \text{with} \;\; \mathbf{X} = \begin{bmatrix} \operatorname{vec}(\mathcal{X}_1)^{\top} & \operatorname{vec}(\mathcal{X}_1)^{\top} \\ \vdots & \vdots \\ \operatorname{vec}(\mathcal{X}_N)^{\top} & \operatorname{vec}(\mathcal{X}_N)^{\top} \end{bmatrix} \in \mathbb{R}^{N \times (2 d_1 d_2 d_3)},$$
where vec ( · ) denotes the operation of tensor vectorization (see [30]).
It can be verified that f ( · ) and g ( · ) are closed, proper convex functions. Then, Problem (9) can be re-written as follows:
$$\min_{\mathbf{u}, \mathbf{v}} \; f(\mathbf{u}) + g(\mathbf{v}) \qquad \text{s.t.} \;\; \mathbf{A}\mathbf{u} - \mathbf{v} = \mathbf{c}.$$
According to the convergence analysis in [41], we have:
$$\begin{aligned} \text{objective convergence:} \quad & \lim_{t \to \infty} f(\mathbf{u}^t) + g(\mathbf{v}^t) = f^{\star} + g^{\star}, \\ \text{dual variable convergence:} \quad & \lim_{t \to \infty} \mathbf{w}^t = \mathbf{w}^{\star}, \\ \text{constraint convergence:} \quad & \lim_{t \to \infty} \mathbf{A}\mathbf{u}^t - \mathbf{v}^t = \mathbf{c}, \end{aligned}$$
where $f^{\star}$ and $g^{\star}$ are the optimal values of $f(\mathbf{u})$ and $g(\mathbf{v})$, respectively, and $\mathbf{w}^{\star}$ is a dual optimal point defined as
$$\mathbf{w}^{\star} = \begin{bmatrix} \mathbf{z}^{\star} \\ \operatorname{vec}(\mathcal{Z}_1^{\star}) \\ \operatorname{vec}(\mathcal{Z}_2^{\star}) \\ \operatorname{vec}(\mathcal{Z}_3^{\star}) \\ \operatorname{vec}(\mathcal{Z}_4^{\star}) \end{bmatrix},$$
where $(\mathbf{z}^{\star}, \mathcal{Z}_1^{\star}, \mathcal{Z}_2^{\star}, \mathcal{Z}_3^{\star}, \mathcal{Z}_4^{\star})$ is the dual-variable component of a saddle point $(\mathbf{g}^{\star}, \mathcal{L}^{\star}, \mathcal{S}^{\star}, \mathcal{K}^{\star}, \mathcal{T}^{\star}, \mathcal{M}^{\star}, \mathcal{N}^{\star}, \mathbf{z}^{\star}, \mathcal{Z}_1^{\star}, \mathcal{Z}_2^{\star}, \mathcal{Z}_3^{\star}, \mathcal{Z}_4^{\star})$ of the unaugmented Lagrangian $\mathcal{L}_0(\mathbf{g}, \mathcal{L}, \mathcal{S}, \mathcal{K}, \mathcal{T}, \mathcal{M}, \mathcal{N}, \mathbf{z}, \mathcal{Z}_1, \mathcal{Z}_2, \mathcal{Z}_3, \mathcal{Z}_4)$. □

5. Statistical Performance

In this section, we answer Q2 by studying the statistical performance of the proposed estimator $(\hat{\mathcal{L}}, \hat{\mathcal{S}})$. Specifically, the goal is to upper bound the squared F-norm error $\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2 + \|\hat{\mathcal{S}} - \mathcal{S}\|_F^2$. We first give a non-asymptotic upper bound on the estimation error, and then prove that this upper bound is minimax optimal up to a logarithm factor.

5.1. Upper Bound on the Estimation Error

We establish an upper bound on the estimation error in the following theorem. For notational simplicity, let $N_s = |\mathbb{X}_s|$ and $N_\iota = |\mathbb{X}_\iota|$ denote the numbers of corrupted and uncorrupted observations of $\mathcal{L}$ in the observation model (6), respectively.
Theorem 3
(Upper bounds on the estimation error). If the number of uncorrupted observations in the observation model (6) satisfies
$$N_\iota \ge c_1\, d_1 d_3 \log(d_1 d_3 + d_2 d_3) \log^2(d_1 + d_2),$$
and the regularization parameters in Problem (8) are set as
$$\lambda_\iota = c_2 (\sigma \vee a) \sqrt{\frac{\log(d_1 d_3 + d_2 d_3)}{d_1 \wedge d_2}}, \qquad \text{and} \qquad \lambda_s = c_3 (\sigma \vee a) \sqrt{\frac{\log(d_1 d_3 + d_2 d_3)}{N}},$$
then it holds with probability at least $1 - c_5 (d_1 d_3 + d_2 d_3)^{-1}$ that:
$$\frac{\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2 + \|\hat{\mathcal{S}} - \mathcal{S}\|_F^2}{d_1 d_2 d_3} \le C \left[ r_{tb}(\mathcal{L}) \cdot (\sigma^2 \vee a^2)\, \frac{(d_1 \vee d_2)\, d_3 \log \tilde{d}}{N_\iota} + \frac{N_s \log(d_1 d_3 + d_2 d_3)}{N_\iota} + \frac{\|\mathcal{S}\|_0 \cdot a^2}{d_1 d_2 d_3} \right].$$
Theorem 3 implies that, if the noise level $\sigma$ and spikiness level $a$ are fixed, and all the corrupted positions are observed exactly once (i.e., the number of corrupted observations $N_s = \|\mathcal{S}\|_0$), then the estimation error in Equation (26) is bounded by
$$O\!\left( r_{tb}(\mathcal{L}) \cdot \frac{(d_1 \vee d_2)\, d_3 \log(d_1 d_3 + d_2 d_3)}{N_\iota} + \|\mathcal{S}\|_0 \cdot \left( \frac{\log(d_1 d_3 + d_2 d_3)}{N_\iota} + \frac{1}{d_1 d_2 d_3} \right) \right).$$
Note that, the bound in Equation (27) is intuition-consistent: if the underlying tensor L gets more complex (i.e., with higher tubal rank), then the estimation error will be larger; if the corruption tensor S gets denser, then the estimation error will also become larger; if the number of uncorrupted observations N ι gets larger, then the estimation error will decrease. The scaling behavior of the estimation error in Equation (27) will be verified through experiments on synthetic data in Section 7.1.
Remark 1
(Consistency with prior models for robust low-tubal-rank tensor completion). According to Equation (27), our $L$-SVD-based estimator in Equation (8) allows the tubal rank $r_{tb}(\mathcal{L})$ to take the order $O(d_2 / \log(d_1 d_3 + d_2 d_3))$ and the corruption ratio $\|\mathcal{S}\|_0 / (d_1 d_2 d_3)$ to be $O(1)$ while still yielding approximate estimation with small error. This is slightly better, by a logarithm factor, than the result for the t-SVD-based robust tensor completion model in [8], which allows $r_{tb}(\mathcal{L}) = O(d_2 / \log^2(d_1 d_3 + d_2 d_3))$ and $\|\mathcal{S}\|_0 / (d_1 d_2 d_3) = O(1)$.
Remark 2
(Consistency with prior models for noisy low-tubal-rank tensor completion). If $\|\mathcal{S}\|_0 = 0$, i.e., the corruption $\mathcal{S}$ vanishes, then we obtain
$$\frac{\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2}{d_1 d_2 d_3} = O\!\left( \frac{r_{tb}(\mathcal{L})\, (d_1 \vee d_2)\, d_3 \log(d_1 d_3 + d_2 d_3)}{N} \right),$$
which is consistent with the error bound for t-SVD-based noisy tensor completion [42,43,44], and L –SVD-based tensor Dantzig Selector in [18].
Remark 3
(Consistency with prior models for robust low-tubal-rank tensor decomposition). In the setting of Robust Tensor Decomposition (RTD) [34], the fully observed model is considered instead of our partially observed model in Equation (6). For the RTD problem, our error bound in Equation (27) is consistent with the t-SVD-based bound for RTD in [34] (up to a logarithm factor).
Remark 4
(No exact recovery guarantee). According to Theorem 3, when $\sigma = 0$ and $\|\mathcal{S}\|_0 = 0$, i.e., in the noiseless case, the estimation error is upper bounded by $O\big(a (d_1 \vee d_2) d_3\, r_{tb}(\mathcal{L}) \log \tilde{d} / N\big)$, which is not zero. Thus, no exact recovery is guaranteed by Theorem 3. This can be seen as a trade-off for not assuming the low-tubal-rank tensor $\mathcal{L}$ to satisfy the tensor incoherence conditions [8,35,37], which essentially ensure the separability between $\mathcal{L}$ and $\mathcal{S}$.

5.2. A Minimax Lower Bound for the Estimation Error

In Theorem 3, we established an upper bound on the estimation error for Model (8). One may then ask the complementary questions: how tight is this upper bound? Are there fundamental (model-independent) limits on the estimation error in robust tensor completion? In this section, we answer these questions.
To analyze the optimality of the proposed upper bound in Theorem 3, a minimax lower bound on the estimation error is established for tensor pairs $(\mathcal{L}, \mathcal{S})$ belonging to the following class $\mathcal{A}(r, s, a)$:
$$\mathcal{A}(r, s, a) := \Big\{ (\mathcal{L}, \mathcal{S}) \;\Big|\; r_{tb}(\mathcal{L}) \le r,\; \|\mathcal{S}\|_0 \le s,\; \|\mathcal{L}\|_\infty < a,\; \|\mathcal{S}\|_\infty \le a \Big\}.$$
We then define the associated element-wise minimax error as follows
$$\mathcal{M}(\mathcal{A}(r, s, a)) := \inf_{(\hat{\mathcal{L}}, \hat{\mathcal{S}})} \; \sup_{(\mathcal{L}, \mathcal{S}) \in \mathcal{A}(r, s, a)} \; \mathbb{E}\, \frac{\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2 + \|\hat{\mathcal{S}} - \mathcal{S}\|_F^2}{d_1 d_2 d_3},$$
where the infimum ranges over all pairs of estimators ( L ^ , S ^ ) , the supremum ranges over all pairs of underlying tensors ( L , S ) in the given tensor class A ( r , s , a ) , and the expectation is taken over the design tensors { X i } and i.i.d. Gaussian noises { ξ i } in the observation model (6). We come up with the following theorem.
Theorem 4
(Minimax lower bound). Suppose the dimensions $d_1, d_2 \ge 2$, the rank and sparsity parameters satisfy $r \in [d_1 \wedge d_2]$ and $s \le d_1 d_2 d_3 / 2$, the number of uncorrupted entries satisfies $N_\iota \ge r\, d_1 d_3$, and the number of corrupted entries satisfies $N_s \le \tau\, r\, \tilde{d}$ for a constant $\tau > 0$. Then, under Assumption A1, there exist absolute constants $b \in (0, 1)$ and $c > 0$ such that
$$\mathcal{M}(\mathcal{A}(r, s, a)) \ge b\, \phi(N, r, s),$$
where
$$\phi(N, r, s) := (\sigma \wedge a)^2 \left( \frac{r (d_1 + d_2) d_3 + N_s}{N_\iota} + \frac{s}{d_1 d_2 d_3} \right).$$
The lower bound in Equation (30) implies that the upper bound in Theorem 3 is optimal (up to a log factor) in the minimax sense for tensors belonging to the set $\mathcal{A}(r, s, a)$. That is to say, no estimator can achieve a substantially more accurate estimate than our estimator in Equation (8) (up to a log factor) for $(\mathcal{L}, \mathcal{S}) \in \mathcal{A}(r, s, a)$, thereby showing the optimality of the proposed estimator.

6. Connections and Differences with Previous Works

In this section, we discuss the connections and differences with existing nuclear norm based robust matrix/tensor completion models, where the underlying matrix/tensor suffers from missing values, gross sparse outliers, and small dense noises at the same time.
First, we briefly introduce and analyze the two most related models, i.e., the matrix nuclear norm based model [36] and the sum of mode-wise matrix nuclear norms based model [45] as follows.
(1)
The matrix Nuclear Norm (NN) based model [36]: If the underlying tensor is a 2-way tensor, i.e., a matrix, then the observation model in Equation (6) reduces to the setting of robust matrix completion, and the proposed estimator in Equation (8) degenerates to the matrix nuclear norm based estimator in [36]. In both model formulation and statistical analysis, this work can be seen as a 3-way generalization of [36].
Moreover, by conducting robust matrix completion on each frontal slice of a 3-way tensor, we can obtain the matrix nuclear norm based robust tensor completion model as follows:
$$\min_{\mathcal{L}, \mathcal{S}} \; \frac{1}{2N}\|\mathbf{y} - \mathfrak{X}(\mathcal{L} + \mathcal{S})\|_2^2 + \sum_{k=1}^{d_3} \Big( \lambda_\iota \|\mathcal{L}^{(k)}\|_{*} + \lambda_s \|\mathcal{S}^{(k)}\|_1 \Big).$$
(2)
The Sum of mode-wise matrix Nuclear Norms (SNN) based model [45]: Huang et al. [45] proposed a robust tensor completion model based on the sum of mode-wise nuclear norms deduced by the Tucker decomposition as follows
$$\min_{\mathcal{L}, \mathcal{S}} \; \frac{1}{2N}\|\mathbf{y} - \mathfrak{X}(\mathcal{L} + \mathcal{S})\|_2^2 + \sum_{k=1}^{3} \alpha_k \|\mathbf{L}_{(k)}\|_{*} + \lambda_s \|\mathcal{S}\|_1,$$
where $\mathbf{L}_{(k)} \in \mathbb{R}^{d_k \times \prod_{j \neq k} d_j}$ is the mode-$k$ matricization of the tensor $\mathcal{L} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, for $k = 1, 2, 3$.
The main differences between SNN and this work are two-fold: (i) SNN is based on the Tucker decomposition [15], whereas this work is based on the recently proposed tensor $L$-SVD [24]; (ii) the theoretical analysis for SNN cannot guarantee the minimax optimality of the model in [45], whereas a rigorous proof of the minimax optimality of the proposed estimator is established in Section 5 of this work.
Then, we discuss the following related works which can be seen as special cases of this work.
(1)
The robust tensor completion model based on t-SVD [46]: In a short conference presentation [46] (whose first author is the same as that of this paper), the t-SVD-based robust tensor completion model is studied. As the t-SVD can be viewed as a special case of the $L$-SVD (when the DFT is used as the transform $L$), the model in [46] can be seen as a special case of ours.
(2)
The robust tensor recovery models with missing values and sparse outliers [8,27]: In [8,27], the authors considered the robust reconstruction of an incomplete tensor polluted by sparse outliers, and proposed t-SVD (or $L$-SVD) based models with theoretical guarantees for exact recovery. As they did not consider small dense noises, their settings are a special case of our observation model (6) when the dense noise vanishes.
(3)
The robust tensor decomposition based on t-SVD [34]: In [34], the authors studied the t-SVD-based robust tensor decomposition, which aims at recovering a tensor corrupted by both gross sparse outliers and small dense noises. Comparing with this work, Ref. [34] can be seen as a special case when there are no missing values.

7. Experiments

In this section, experiments on synthetic datasets will be first conducted to validate the sharpness of the proposed upper bounds in Theorem 3. Then, both effectiveness and efficiency of the proposed algorithm will be demonstrated through experiments on seven different types of remote sensing datasets. All codes are written in Matlab, and all experiments are performed on a Windows 10 laptop with AMD Ryzen 3.0 GHz CPU and 8 GB RAM.

7.1. Sharpness of the Proposed Upper Bound

The sharpness of the proposed upper bound in Theorem 3 will be validated. Specifically, we check whether the upper bound in Equation (27) can reflect the true scaling behavior of the estimation error. As predicted in Equation (27), if the upper bound is “sharp”, then the Mean Square Error (MSE) $(\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2 + \|\hat{\mathcal{S}} - \mathcal{S}\|_F^2)/(d_1 d_2 d_3)$ is expected to scale approximately linearly w.r.t. the tubal rank of the underlying tensor $\mathcal{L}$, the $l_0$-norm of the corruption tensor $\mathcal{S}$, and the reciprocal of the number of uncorrupted observations $N_\iota$. We examine whether this expectation holds in simulation studies on synthetic datasets.
The synthetic datasets are generated as follows. Similar to [26], we consider three choices of the linear transform $L$ with orthogonal matrix $\mathbf{L}$: (1) the Discrete Fourier Transform (DFT); (2) the Discrete Cosine Transform (DCT) [24]; and (3) a Random Orthogonal Matrix (ROM) [26]. The underlying low-rank tensor $\mathcal{L} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ with $L$-tubal rank $r$ is generated by $\mathcal{L} = \mathcal{P} *_L \mathcal{Q}$, where the entries of $\mathcal{P} \in \mathbb{R}^{d_1 \times r \times d_3}$ and $\mathcal{Q} \in \mathbb{R}^{r \times d_2 \times d_3}$ are i.i.d. sampled from $\mathcal{N}(0, 1)$; $\mathcal{L}$ is then normalized such that $\|\mathcal{L}\|_\infty = 1$. Second, to generate the sparse corruption tensor $\mathcal{S}$, we uniformly select $\gamma d_1 d_2 d_3$ entries and fill them with values drawn i.i.d. from the uniform distribution $\mathrm{Uni}(0, 1)$; thus the number of corrupted entries is $\|\mathcal{S}\|_0 = \gamma d_1 d_2 d_3$. Third, we uniformly select $N_\iota$ elements from the uncorrupted positions of $(\mathcal{L} + \mathcal{S})$. Finally, the noises $\{\xi_i\}$ are sampled i.i.d. from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1 \|\mathcal{L}\|_F / \sqrt{d_1 d_2 d_3}$. We consider tensors with $d_1 = d_2 = d \in \{80, 100, 120\}$, $d_3 = 30$, and tubal rank $r_{tb}(\mathcal{L}) \in \{3, 6, 9, 12, 15\}$. We choose the corruption ratio $\gamma \in \{0.01 : 0.01 : 0.1\}$ and the uncorrupted observation ratio $N_\iota / (d_1 d_2 d_3 - N_s) \in \{0.4 : 0.1 : 0.9\}$. In each setting, the MSE averaged over 30 trials is reported.
In Figure 3, we report the results for $100 \times 100 \times 30$ tensors when the DFT is adopted as the linear transform $L$ in Equation (4). According to sub-plots (a), (b), and (d) in Figure 3, it can be seen that the MSE scales approximately linearly w.r.t. $r_{tb}(\mathcal{L})$, $\|\mathcal{S}\|_0$, and $N_\iota^{-1}$. These results accord well with our expectation for the size $100 \times 100 \times 30$ and the linear transform $L = \mathrm{DFT}$. As very similar phenomena are observed in all the other settings where $d \in \{80, 120\}$ and $L \in \{\mathrm{DCT}, \mathrm{ROM}\}$, we omit them. Thus, it can be verified that the scaling behavior of the estimation error is approximately predicted by the proposed upper bound in Equation (27).

7.2. Effectiveness and Efficiency of the Proposed Algorithm

In this section, we evaluate both the effectiveness and efficiency of the proposed Algorithm 1 by conducting robust tensor completion on seven different types of datasets collected from several remote sensing related applications in Section 7.2.1–Section 7.2.7.
Following [25], we adopt three different transformations $L$ in Equation (4) to define the $L$-TNN: the first two are the DFT and the DCT (denoted by TNN (DFT) and TNN (DCT), respectively), and the third one, named TNN (Data), depends on the given data and is motivated by [27,31]. We first perform the SVD of the mode-3 unfolding matrix of $\mathcal{L}$ as $\mathbf{L}_{(3)} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top}$, and then use $\mathbf{U}$ as the desired transform matrix in the $L$-product (4). The proposed algorithm is compared with the aforementioned models NN [36] in Equation (32) and SNN [45] in Equation (33) from Section 6. Both Model (32) and Model (33) are solved using ADMM with our own Matlab implementations.
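A minimal MATLAB sketch of the data-dependent transform used by TNN (Data), under the assumption that a zero-filled version of the observed tensor (here called Lobs) is used to build the transform; whether U or its transpose is stored is only a convention of this sketch.

[d1, d2, d3] = size(Lobs);
L3 = reshape(permute(Lobs, [3 1 2]), d3, d1 * d2);   % mode-3 unfolding L_(3)
[U, ~, ~] = svd(L3, 'econ');                         % L_(3) = U * S * V'
Lmat = U';                                           % orthogonal transform matrix for (4)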
We conduct robust tensor completion on the datasets in Figure 4 with similar settings to [47]. For a $d_1 \times d_2 \times d_3$ data tensor $\mathcal{L}$ re-scaled such that $\|\mathcal{L}\|_\infty = 1$, we choose a support uniformly at random with ratio $\rho_s$ and fill in the values with i.i.d. standard Gaussian variables to generate the corruption $\mathcal{S}$. Then, we randomly sample the entries of $\mathcal{L} + \mathcal{S}$ uniformly with observation ratio $\rho_{\mathrm{obs}}$. The noises $\{\xi_i\}$ are further generated with i.i.d. zero-mean Gaussian entries whose standard deviation is $\sigma = 0.05 \|\mathcal{L}\|_F / \sqrt{d_1 d_2 d_3}$, yielding the observations $\{y_i\}$. The goal of the experiments is to estimate the underlying signal $\mathcal{L}$ from $\{y_i\}$. The effectiveness of the algorithms is measured by the Peak Signal-to-Noise Ratio (PSNR) and the structural similarity (SSIM) [48]. Specifically, the PSNR of an estimator $\hat{\mathcal{L}}$ is defined as
$$\mathrm{PSNR} := 10 \log_{10} \frac{d_1 d_2 d_3 \|\mathcal{L}\|_\infty^2}{\|\hat{\mathcal{L}} - \mathcal{L}\|_F^2},$$
for the underlying tensor $\mathcal{L} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$. The SSIM is computed via
$$\mathrm{SSIM} := \frac{\big(2\mu_{\mathcal{L}}\mu_{\hat{\mathcal{L}}} + (0.01\,\bar{\omega})^2\big)\big(2\sigma_{\mathcal{L},\hat{\mathcal{L}}} + (0.03\,\bar{\omega})^2\big)}{\big(\mu_{\mathcal{L}}^2 + \mu_{\hat{\mathcal{L}}}^2 + (0.01\,\bar{\omega})^2\big)\big(\sigma_{\mathcal{L}}^2 + \sigma_{\hat{\mathcal{L}}}^2 + (0.03\,\bar{\omega})^2\big)},$$
where $\mu_{\mathcal{L}}, \mu_{\hat{\mathcal{L}}}, \sigma_{\mathcal{L}}, \sigma_{\hat{\mathcal{L}}}, \sigma_{\mathcal{L}, \hat{\mathcal{L}}}$, and $\bar{\omega}$ denote the local means, standard deviations, cross-covariance, and dynamic range of the magnitudes of the tensors $\mathcal{L}$ and $\hat{\mathcal{L}}$, respectively. Larger PSNR and SSIM values indicate a higher quality of the estimate $\hat{\mathcal{L}}$.
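For reference, the PSNR of Equation (34) can be evaluated in MATLAB as below, where Lhat and Ltrue denote the estimated and ground-truth tensors (names of this sketch); SSIM follows [48] and can be computed, e.g., slice by slice with standard toolbox routines.

% PSNR in dB between the estimate Lhat and the ground truth Ltrue.
psnr_db = 10 * log10( numel(Ltrue) * max(abs(Ltrue(:)))^2 / ...
                      norm(Lhat(:) - Ltrue(:))^2 );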

7.2.1. Experiments on an Urban Area Imagery Dataset

Area imagery data processing plays a key role in many remote sensing applications, such as land-use mapping [49]. We adopt the popular area imagery dataset UCMerced [50], which is a 21 class land use image dataset meant for research purposes. The images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the country. The pixel resolution of this public domain imagery is 1 foot, and each RGB image measures 256 × 256 pixels. There are 100 images for each class, and we chose the 85-th image to form a dataset of 21 images as shown in Figure 4.
We consider two scenarios by setting $(\rho_{\mathrm{obs}}, \rho_s) \in \{(0.3, 0.2), (0.8, 0.3)\}$ for the $d \times d \times 3$ images. For NN (Model (32)), we set the regularization parameter $\lambda_s = \lambda_\iota / \sqrt{d \rho_{\mathrm{obs}}}$ (suggested by [38]), and tune the parameter $\lambda_\iota$ around $6.5 \sigma \sqrt{\rho_{\mathrm{obs}}\, d \log(6d)}$ (suggested by [51]). For SNN, the parameter $\lambda_s$ is tuned in $\{0.01, 0.05, 0.1, 1\}$ for better performance in most cases, and the weights $\alpha$ are set by $\alpha_1 = \alpha_2 = \lambda_s \sqrt{3 d \rho_{\mathrm{obs}}}$ and $\alpha_3 = 0.01 \lambda_s \sqrt{3 d \rho_{\mathrm{obs}}}$. For Algorithm 1, we tune $\lambda_\iota$ around $2 \sigma \sqrt{3 \rho_{\mathrm{obs}}\, d \log(6d)}$, and let $\lambda_s = \lambda_\iota / \sqrt{3 d \rho_{\mathrm{obs}}}$ for TNN (DFT) and $\lambda_s = \lambda_\iota / \sqrt{d \rho_{\mathrm{obs}}}$ for TNN (DCT) and TNN (Data). In each setting, we test each image for 10 trials and report the averaged PSNR (in dB), SSIM, and running time (in seconds).
We present the PSNR and SSIM values and the running time in Figure 5 and Figure 6 for the settings $(\rho_{\mathrm{obs}}, \rho_s) = (0.3, 0.2)$ and $(\rho_{\mathrm{obs}}, \rho_s) = (0.8, 0.3)$, respectively, for quantitative evaluation, with visual examples shown in Figure 7 and Figure 8. It can be seen from Figure 5, Figure 6, Figure 7 and Figure 8 that the proposed TNN (Data) achieves the highest recovery quality in most cases and possesses a running time comparable to NN. We attribute the promising performance of the proposed algorithm to the representation power of low-tubal-rank models: low-tubal-rankness can exploit both low-rankness and smoothness simultaneously, whereas traditional models like NN and SNN can only exploit low-rankness in the original domain [18].

7.2.2. Experiments on Hyperspectral Data

Benefiting from its fine spectral and spatial resolutions, hyperspectral image processing has been extensively adopted in many remote sensing applications [10,52]. In this section, we conduct robust tensor completion on subsets of two representative hyperspectral datasets, described as follows:
  • Indian Pines: This dataset was collected by AVIRIS sensor in 1992 over the Indian Pines test site in North-western Indiana and consists of 145 × 145 pixels and 224 spectral reflectance bands. We use the first 30 bands in the experiments due to the trade-off between the limitation of computing resources and the efforts for parameter tuning.
  • Salinas A: The data were acquired by AVIRIS sensor over the Salinas Valley, California in 1998, and consists of 224 bands over a spectrum range of 400–2500 nm. This dataset has a spatial extent of 86 × 83 pixels with a resolution of 3.7 m. We use the first 30 bands in the experiments too.
We consider three settings, i.e., Setting I $(\rho_{\mathrm{obs}} = 0.3, \rho_s = 0.2)$, Setting II $(\rho_{\mathrm{obs}} = 0.6, \rho_s = 0.25)$, and Setting III $(\rho_{\mathrm{obs}} = 0.8, \rho_s = 0.3)$ for robust completion of hyperspectral data. For NN, we set the regularization parameter $\lambda_s = \lambda_\iota / \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2)}$ (suggested by [38]), and tune the parameter $\lambda_\iota$ around $6.5 \sigma \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2) \log(d_1 d_3 + d_2 d_3)}$ (suggested by [51]). For SNN, the parameter $\lambda_s$ is tuned in $\{0.01, 0.05, 0.1, 1\}$ for better performance in most cases, and we choose the weights $\alpha$ by $\alpha_1 = \alpha_2 = \alpha_3 = \lambda_s \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2) d_3}$ (suggested by [47]). For Algorithm 1, we tune the parameter $\lambda_\iota$ around $2 \sigma \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2) d_3 \log(d_1 d_3 + d_2 d_3)}$, and let $\lambda_s = \lambda_\iota / \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2) d_3}$ for TNN (DFT) and $\lambda_s = \lambda_\iota / \sqrt{\rho_{\mathrm{obs}} (d_1 \vee d_2)}$ for TNN (DCT) and TNN (Data). In each setting, we test each image for 10 trials and report the averaged PSNR (in dB), SSIM, and running time (in seconds).
For quantitative evaluation, we report the PSNR and SSIM values and the running time in Table 2 and Table 3 for the Indian Pines and Salinas A datasets, respectively. The visual examples are shown in Figure 9 and Figure 10, respectively. It can be seen that the proposed TNN (Data) achieves the highest recovery quality in most cases and has a running time comparable to NN, indicating the effectiveness and efficiency of low-tubal-rank models in comparison with the original-domain-based models NN and SNN.

7.2.3. Experiments on Multispectral Images

Multispectral imaging captures image data within specific wavelength ranges across the electromagnetic spectrum, and has become one of the most widely utilized data types in remote sensing. This section presents simulated experiments on multispectral images. The original data are two multispectral images, Beads and Cloth, from the Columbia MSI Database (available at http://www1.cs.columbia.edu/CAVE/databases/multispectral accessed on 28 July 2021), containing scenes of a variety of real-world objects. Each MSI is of size $512 \times 512 \times 31$ with intensity range scaled to [0, 1].
We also consider three settings, i.e., Setting I ( ρ obs = 0.3 , ρ s = 0.2 ) , Setting II ( ρ obs = 0.6 , ρ s = 0.25 ) , and Setting III ( ρ obs = 0.8 , ρ s = 0.3 ) for robust completion of multi-spectral data. We tune the parameters in the same way as Section 7.2.2. In each setting, we test each image for 10 trials and report the averaged PSNR (in db), SSIM and running time (in seconds).
For quantitative evaluation, we report the PSNR and SSIM values and the running time in Table 4 and Table 5 for the Beads and Cloth datasets, respectively. The visual examples for the Cloth dataset are shown in Figure 11. We can also see that the proposed TNN (Data) achieves the highest accuracy in most cases and has a running time comparable to NN, which demonstrates both the effectiveness and efficiency of low-tubal-rank models.

7.2.4. Experiments on Point Cloud Data

With the rapid advances of sensor technology, the emerging point cloud data provide better performance than 2D images in many remote sensing applications due to their flexible and scalable geometric representation [53]. In this section, we also conduct experiments on a dataset (scenario B from http://www.mrt.kit.edu/z/publ/download/velodynetracking/dataset.html, accessed on 28 July 2021) for Unmanned Ground Vehicles (UGVs). The dataset contains a sequence of point cloud data acquired from a Velodyne HDL-64E LiDAR. We select 30 frames (Frame Nos. 65–94) from the data sequence. The point cloud data are formatted into two tensors of size $64 \times 870 \times 30$ representing the distance data (named SenerioB Distance) and the intensity data (named SenerioB Intensity), respectively.
We also consider three settings, i.e., Setting I $(\rho_{\mathrm{obs}} = 0.3, \rho_s = 0.2)$, Setting II $(\rho_{\mathrm{obs}} = 0.6, \rho_s = 0.25)$, and Setting III $(\rho_{\mathrm{obs}} = 0.8, \rho_s = 0.3)$ for robust completion of point cloud data. We tune the parameters in the same way as in Section 7.2.2. In each setting, we test each image for 10 trials and report the averaged PSNR (in dB), SSIM, and running time (in seconds). For quantitative evaluation, we report the PSNR and SSIM values and the running time in Table 6 and Table 7 for the SenerioB Distance and SenerioB Intensity datasets, respectively. We can also see that the proposed TNN (Data) achieves the highest accuracy in most cases and has a running time comparable to NN, which demonstrates both the effectiveness and efficiency of low-tubal-rank models.

7.2.5. Experiments on Aerial Video Data

Aerial videos (or time sequences of images) are broadly used in many computer vision based remote sensing tasks [54]. We experiment on a 180 × 320 × 30 tensor which consists of the first 30 frames of the Sky dataset (available at http://www.loujing.com/rss-small-target, accessed on 28 July 2021) for small object detection [55].
We also consider three settings, i.e., Setting I $(\rho_{\mathrm{obs}} = 0.3, \rho_s = 0.2)$, Setting II $(\rho_{\mathrm{obs}} = 0.6, \rho_s = 0.25)$, and Setting III $(\rho_{\mathrm{obs}} = 0.8, \rho_s = 0.3)$. We tune the parameters in the same way as in Section 7.2.2. In each setting, we test each image for 10 trials and report the averaged PSNR (in dB), SSIM, and running time (in seconds). For quantitative evaluation, we report the PSNR and SSIM values and the running time in Table 8. It is also found that the proposed TNN (Data) achieves the highest accuracy in most cases and runs as fast as NN.

7.2.6. Experiments on Thermal Imaging Data

Thermal infrared data can provide important measurements of surface energy fluxes and temperatures in various remote sensing applications [7]. In this section, we experiment on two infrared datasets as follows:
  • The Infrared Detection dataset [56]: this dataset was collected for infrared detection and tracking of dim-small aircraft targets under ground/air backgrounds (available at http://www.csdata.org/p/387/, accessed on 28 July 2021). It consists of 22 subsets of infrared image sequences of aircraft targets. We use the first 30 frames of data3.zip to form a $256 \times 256 \times 30$ tensor due to the trade-off between the limitation of computing resources and the effort required for parameter tuning.
  • The OSU Thermal Database [3]: The sequences were recorded on the Ohio State University campus during the months of February and March 2005, and show several people, some in groups, moving through the scene. We use the first 30 frames of Sequences 1 and form a tensor of size 320 × 240 × 30 .
Similar to Section 7.2.2, we test in three settings, i.e., Setting I $(\rho_{\mathrm{obs}} = 0.3, \rho_s = 0.2)$, Setting II $(\rho_{\mathrm{obs}} = 0.6, \rho_s = 0.25)$, and Setting III $(\rho_{\mathrm{obs}} = 0.8, \rho_s = 0.3)$, and use the same strategy for parameter tuning. In each setting, we test each image for 10 trials and report the averaged PSNR (in dB), SSIM, and running time (in seconds). For quantitative evaluation, we report the PSNR and SSIM values and the running time in Table 9 and Table 10 for the Infrared Detection and OSU Thermal Database datasets, respectively. The visual examples are shown in Figure 12 and Figure 13, respectively. It can be seen that the proposed TNN (Data) achieves the highest recovery quality in most cases and has a running time comparable to NN, showing both the effectiveness and efficiency of low-tubal-rank models in comparison with the original-domain-based models NN and SNN.

7.2.7. Experiments on SAR Data

Polarimetric Synthetic Aperture Radar (PolSAR) has attracted a lot of attention from remote sensing scientists because of its various advantages, e.g., all-weather and all-time operation, penetrating capability, and multi-polarimetry [57]. In this section, we adopt the PolSAR UAVSAR Change Detection Images dataset. It is a dataset of single-look quad-polarimetric SAR images acquired by the UAVSAR airborne sensor in L-band over an urban area in San Francisco on 18 September 2009 and 11 May 2015. Dataset #1 has a length and width of 200 pixels, and we use its first 30 bands.
We also consider three settings, i.e., Setting I ( ρ obs = 0.3 , ρ s = 0.2 ) , Setting II ( ρ obs = 0.6 , ρ s = 0.25 ) , and Setting III ( ρ obs = 0.8 , ρ s = 0.3 ) . We tune the parameters in the same way as Section 7.2.2. In each setting, we test each image for 10 trials and report the averaged PSNR (in db), SSIM and running time (in seconds) in Table 11. It is also found that the proposed TNN (Data) achieves the highest accuracy in most cases, and can run as fast as NN.

8. Conclusions

In this paper, we resolve the challenging robust tensor completion problem by proposing an $L$-SVD-based estimator to robustly reconstruct a low-rank tensor in the presence of missing values, gross outliers, and small dense noises simultaneously. Specifically, this work can be summarized in the following three aspects:
(1)
Algorithmically, we design an algorithm within the framework of ADMM to efficiently compute the proposed estimator with guaranteed convergence behavior.
(2)
Statistically, we analyze the statistical performance of the proposed estimator by establishing a non-asymptotic upper bound on the estimation error. The proposed upper bound is further proved to be minimax optimal (up to a log factor).
(3)
Experimentally, the correctness of the upper bound is first validated through simulations on synthetic datasets. Then both effectiveness and efficiency of the proposed algorithm are demonstrated by extensive comparisons with state-of-the-art nuclear norm based models (i.e., NN and SNN) on seven different types of remote sensing data.
However, from a critical point of view, the proposed method has the following two limitations:
(1)
The orientational sensitivity of the $L$-SVD: Despite the promising empirical performance of the $L$-SVD-based estimator, a typical defect is its orientation sensitivity: low-rankness is strictly defined along the tubal orientation, which makes it fail to simultaneously exploit transformed low-rankness in multiple orientations [19,58].
(2)
The difficulty of finding the optimal transform $L(\cdot)$ for the $L$-SVD: Although a direct use of fixed transforms (like the DFT and DCT) may produce fairly good empirical performance, it is still unclear how to find the optimal transformation $L(\cdot)$ for a given tensor $\mathcal{L}$ when only partial and corrupted observations are available.
According to the above limitations, it is interesting to consider higher-order extensions of the proposed model in an orientation invariant way like [19] and discuss the statistical performance. It is also interesting to consider the data-dependent transformation learning like [31,59]. Another future direction is to consider more efficient solvers of Problem (8) using the factorization strategy or Frank–Wolfe method [47,60,61,62].

Author Contributions

Conceptualization, A.W. and Q.Z.; Data curation, Q.Z.; Formal analysis, G.Z.; Funding acquisition, A.W., G.Z. and Q.Z.; Investigation, A.W.; Methodology, G.Z.; Project administration, A.W. and Q.Z.; Resources, Q.Z.; Software, A.W.; Supervision, G.Z. and Q.Z.; Validation, A.W., G.Z. and Q.Z.; Visualization, Q.Z.; Writing—original draft, A.W. and G.Z.; Writing—review & editing, A.W., G.Z. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61872188, 62073087, 62071132, U191140003, 6197309, in part by the China Postdoctoral Science Foundation under Grant 2020M672536, in part by the Natural Science Foundation of Guangdong Province under Grants 2020A1515010671, 2019B010154002, 2019B010118001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In this paper, all the data supporting our experimental results are publicly available with references or URL links.

Acknowledgments

The authors are grateful to the editor and reviewers for their valuable time in processing this manuscript. The first author would like to thank Shasha Jin for her kind understanding in these months, and Zhong Jin in Nanjing University of Science and Technology for his long time support. The authors are also grateful to Xiongjun Zhang for sharing the code for [25] and Olga Klopp for her excellent theoretical analysis in [36,51] which is so helpful.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theoretical Results

Appendix A.1. Additional Notations and Preliminaries

For the ease of exposition, we first list the additive notations often used in the proofs in Table A1.
Table A1. Additional notations in the proofs.
Table A1. Additional notations in the proofs.
NotationsDescriptionsNotationsDescriptions
Δ ι : = L L ^ estimation error of L Δ s : = S S ^ estimation error of S
r : = r t b ( L ) L -tubal-rank of L s : = S 0 sparsity of S
ϱ : = N / N ι inverse uncorrupted ratioa l -norm bound in Equation (8)
d ˜ ( d 1 + d 3 ) d 3
Ω ι : = { i [ N ] | X i , S = 0 } index set of design tensors { X i } corresponding to uncorrupted entries
Ω s : = { i [ N ] | X i , S 0 } index set of design tensors { X i } corresponding to corrupted entries
E : = 1 N i Ω ι ξ i X i stochastic tensor defined to lower bound parameters λ ι and λ s
W : = 1 N i Ω ι X i stochastic tensor defined to lower bound parameter λ s
R Σ : = 1 N ι i Ω ι ε i X i random tensor defined in bounding Δ ι F with i.i.d. Rademacher { ε i }
T Π : = E X i X i , T Θ s 2 expectation of X i , · 2 for i Ω ι defined to establish the RSC condition
We then introduce the decomposability of L -TNN and tensor l 1 -norm which plays a key role in the analysis.
Decomposability of L -TNN. Suppose L has reduced L -SVD as L = U L D L V , where U R d 1 × r × d 3 and V R d 2 × r × d 3 are orthogonal and D R r × r × d 3 is f-diagonal. Define projectors P ( · ) and P ( · ) as follows:
P ( T ) = U L U L T + T L V L V U L U L T L V L V , P ( T ) = ( I U L U ) L T L ( I V L V ) .
Then, it can be verified that:
(I).
T = P ( T ) + P ( T ) , T R d 1 × d 2 × d 3 ;
(II).
P ( A ) , P ( B ) = 0 , A , B R d 1 × d 2 × d 3 .
(III).
r t b ( P ( T ) ) 2 r t b ( L ) , T R d 1 × d 2 × d 3 .
In the same way to the results in supplementary material of [43], it can also be shown that the following equations hold:
(I).
(Decomposability of L –TNN) For any A , B R d 1 × d 2 × d 3 satisfying A L B = 0 , A L B = 0 ,
A + B = P ( A ) + P ( B ) .
(II).
(Norm compatibility inequality) For any T R d 1 × d 2 × d 3
T = d 3 r t b ( T ) T F .
Decomposability of tensor l 1 -norm [63]. Let Ω e [ d 1 ] × [ d 2 ] × [ d 3 ] denote any index set and Ω e its complement. Then for any tensor T , define two tensors T Ω e and T Ω e as follows
T Ω e ( i , j , k ) : = T i j k , ( i , j , k ) Ω e 0 , ( i , j , k ) Ω e , T Ω e : = T T Ω e .
Then, one has
(I).
(Decomposability of l 1 -norm) For any T R d 1 × d 2 × d 3 , T 1 = T Ω e 1 + T Ω e 1 .
(II).
(Norm compatibility inequality) For any T R d 1 × d 2 × d 3 , T Ω e 1 = | Ω e | T Ω e F .

Appendix A.2. The Proof for Theorem 3

The proof follows the lines of [34,36]. For notational simplicity, we define the following two sets
Ω ι : = { i [ N ] | X i , S = 0 } , Ω s : = { i [ N ] | X i , S 0 }
which denote the index set of design tensors { X i } corresponding to uncorrupted/corrupted entries, respectively.

Appendix A.2.1. Mainstream of Proving Theorem 3

Proof of Theorem 3.
Let F ( L , S ) = 1 N i = 1 N ( y i L + S , X i ) 2 + λ ι L + λ s S 1 for simplicity. Then, according to the optimality of ( L ^ , S ^ ) to Problem (8), it holds that
F ( L ^ , S ^ ) F ( L , S )
and
Δ ι = L ^ L L ^ + L 2 a Δ s = S ^ S S ^ + S 2 a
Equation (A5) indicates that
1 N i = 1 N ( ξ i + Δ ι + Δ s , X i ) 2 + λ ι L ^ + λ s S ^ 1 1 N i = 1 N ξ i 2 + λ ι L + λ s S 1
which leads to
1 N i Ω ι Δ ι + Δ s , X i 2 2 N i Ω s | ξ i X i , Δ ι + Δ s | 1 N i Ω s X i , Δ ι + Δ s 2 : = I + 2 | E , Δ ι | + λ ι ( L L ^ ) : = I I + 2 E , Δ Θ s s + λ s ( S 1 S ^ 1 ) : = I I I
where E : = 1 N ι i Ω ι ξ i X i , and the equality E , Δ s = E , Δ Θ s s holds. Now each item in the right hand side of (A8) will be upper bounded separately as follows. Following the idea of [36], the upper bound will be analyzed upon the following event
E : = max 1 i N | ξ i | C σ log d ˜
According to the tail behavior of the maximum in a sub-Gaussian sequence, it holds with an absolute constant C > 0 such that P [ E ] 1 1 / ( 2 d ˜ ) .
Bound I . On the event E , we get
I 1 N i Ω s ξ i 2 C σ 2 | Θ s | log d ˜ N .
Bound I I . Note that according to the properties of L -TNN, we have
L L ^ = L L Δ ι = L L P ( Δ ι ) P ( Δ ι ) L ( L P ( Δ ι ) P ( Δ ι ) ) = L ( L + P ( Δ ι ) P ( Δ ι ) ) = P ( Δ ι ) P ( Δ ι )
Thus, we can bound term I I by
I I 2 E s p Δ ι + λ ι P ( Δ ι ) P ( Δ ι )
By letting λ ι 4 E s p , it holds that
I I 3 2 λ ι P ( Δ ι ) 3 2 λ ι 2 r d 3 Δ ι F .
Bound I I I : Note that since S Θ s = 0 , we have Δ Θ s s = S ^ Θ s , leading to
I I I 2 E S ^ Θ s 1 + λ s ( S 1 S ^ 1 ) ( 2 E λ s ) S ^ Θ s 1 + λ s S 1
Letting λ s 4 E yields
I I I λ s S 1
Thus, putting Equations (A10)–(A12) together, we have the following inequality on the event E :
1 N i Ω ι X i , Δ ι + Δ s 2 3 λ ι r d 3 2 Δ ι F + λ s S 1 + C σ 2 | Θ s | log d ˜ N
Then, we follow the line of [36] to specify a kind of Restricted Strong Convexity (RSC) for the random sampling operator formed by the design tensors { X i } on a carefully chosen constrained set. The RSC will show that when the error tensors ( Δ ι , Δ s ) belong to the constrained set, the following relationship:
1 N i Ω ι X i , Δ ι + Δ s 2 κ 0 Δ ι + Δ s Π 2 τ ,
holds with an appropriate residual τ with high probability.
Before explicitly defining the constrained set, we first consider the following set where Δ s should lie:
B ( δ 1 , δ 2 ) : = { B R d 1 × d 2 × d 3 | B Π 2 δ 1 2 , B 1 δ 2 }
with two positive constants δ 1 and δ 2 whose values will be specified later. We also define the following set of tensor pairs:
D ( r , κ , β ) : = ( A , B ) | A + B Π 2 β , A + B 1 , A r d 3 A Θ s F + κ
We then define the constrained set as the intersection:
D ( r , κ , β ) { R d 1 × d 2 × d 3 × B ( δ 1 , δ 2 ) }
To bound the estimation error in Equation (26), we will upper bound Δ ι F and Δ s F separately.
Note that
Δ ι F 2 = Δ Θ s ι F 2 + Δ Θ s ι F 2 Δ Θ s ι F 2 + 4 a 2 | Θ s | = | Θ s | Δ Θ s ι Π 2 + 4 a 2 | Θ s |
and similarly
Δ s F 2 | Θ s | Δ Θ s s Π 2 + 4 a 2 | Θ s |
We will bound Δ Θ s ι Π 2 and Δ Θ s s Π 2 separately. We first bound Δ Θ s ι Π 2 in what follows.
Case 1: If Δ ι + Δ s Π 2 16 a 2 128 log d ˜ N ι , then we use the following inequality
Δ ι + Δ s Π 2 1 2 Δ ι Π 2 Δ s Π 2
which holds due to ( x + y ) 2 = x 2 + y 2 + 2 x y x 2 + y 2 + 2 · x / 2 · 2 y x 2 + y 2 ( x 2 / 2 + 2 y 2 ) = x 2 / 2 y 2 .
Thus, we can upper bound Δ ι Π with an upper bound of Δ s Π in Lemma A1
Δ ι Π 2 16 a 2 128 log d ˜ N ι + Δ s Π 2
Case 2: Suppose Δ ι + Δ s Π 2 16 a 2 128 log d ˜ N ι . First, according to Lemma A3, it holds on the event E defined in Equation (A9) that
Δ ι ( i ) P ( Δ ι ) + P ( Δ ι ) ( i i ) 4 P ( Δ ι ) + 2 N s N λ ι ( a N λ s + C σ 2 log d ˜ ) ( i i i ) 4 2 r d 3 P ( Δ ι ) F + 2 N s N λ ι ( a N λ s + C σ 2 log d ˜ ) ( i v ) 4 2 r d 3 Δ ι F + 2 N s N λ ι ( a N λ s + C σ 2 log d ˜ ) ( v ) 4 2 r d 3 ( Δ Θ s ι F + Δ Θ s ι F ) + 2 N s N λ ι ( a N λ s + C σ 2 log d ˜ ) ( v i ) 32 r d 3 Δ Θ s ι F + a 128 r d 3 | Θ s | + 2 N s N λ ι ( a N λ s + C σ 2 log d ˜ )
where ( i ) holds due to the triangular inequality; ( i i ) is a direct consequence of Lemma A3, and the definition of event E ; ( i i i ) holds because r t b ( P ( T ) ) 2 r , and T r t b ( T ) d 3 T F for any T R d 1 × d 2 × d 3 ; ( i v ) stems from the inequality P ( T ) F T F = P ( T ) F 2 + P ( T ) F 2 ; ( v ) is due to the triangular inequality; ( v i ) holds since Δ ι 2 a .
Note that according to Lemma A1, we have with probability at least 1 2.5 / d ˜ :
Δ s 4 a B δ 1 , δ 2 w i t h δ 1 = Δ 1 4 a , δ 2 = N s 4 a N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ )
where Δ 1 is defined in Lemma A4.
Together with Equation (A22), we have
1 4 a ( Δ ι , Δ s ) D ( r , κ , β ) ( R d 1 × d 2 × d 3 × B δ 1 , δ 2 )
with the following parameters
r = 32 r a n d κ = 2 a 2 r d 3 | Θ s | + N s 2 a N λ ι ( a N λ s + C σ 2 log d ˜ )
Then, according to Lemma A6, it holds with probability at least 1 2 / d ˜ that
1 16 a 2 N ι i Ω ι X i , Δ ι + Δ s 2 1 32 a 2 Δ ι + Δ s Π 2 τ ( r , κ , δ 1 , δ 2 )
where τ ( r , κ , δ 1 , δ 2 ) is defined in Equation (A64).
Recall that Equation (A13) writes:
1 N i Ω ι X i , Δ ι + Δ s 2 3 λ ι r d 3 2 Δ ι F + N s ( a λ s + C σ 2 log d ˜ N )
Letting ϱ : = N / N ι , we further obtain
1 2 Δ ι + Δ s Π 2 3 ϱ λ ι 2 r d 3 Δ ι F + C τ 9 8 ϱ 2 λ ι 2 r d 3 · 4 d 1 d 2 d 3 + Δ ι F 2 4 d 1 d 2 d 3 + C τ 9 2 ϱ 2 λ ι 2 r d 1 d 2 d 3 2 + Δ Θ s ι F 2 4 d 1 d 2 d 3 + a 2 | Θ s | d 1 d 2 d 3 + C τ
Thus, by using Equation (A20) and Lemma A1, we have
Δ Θ s ι F 2 d 1 d 2 d 3 C ϱ 2 λ ι 2 r d 1 d 2 d 3 2 + a 2 | Θ s | d 1 d 2 d 3 + τ
where
τ = ϱ N s ( a λ s + C σ 2 log d ˜ N ) + 16 a 2 τ ( r , κ , δ 1 , δ 2 )
Note that the bound on Δ s Π is given in Lemma A4, and the values of λ ι and λ s can be set according to Lemmas A8 and A9, respectively. Then, by putting Equations (A18), (A19) and (A52) together, and using Lemmas A8 and A9 to bound associated norms of the stochastic quantities E , W , and R Σ in the error term, we can obtain the bound on Δ ι F 2 + Δ s F 2 and complete the proof. □

Appendix A.2.2. Lemmas for the Proof of Theorem 3

Lemma A1.
Letting λ s 4 ( E + 2 a W ) , it holds that
Δ Θ s s 1 3 Δ Ω s s 1 + 1 N λ s 4 a 2 N s + i Ω s ξ i 2
Proof of Lemma A1.
By the standard condition for optimality over a convex set, it holds that for any feasible ( L , S )
( L , S ) , F ( L ^ , S ^ ) 0
which further leads to
2 N i = 1 N ( y i X i , L ^ + S ^ ) X i , L + S L ^ S ^ + λ ι L ^ , L L ^ + λ s S ^ 1 , S S ^ 0 .
Letting ( L , S ) ( L ^ , S ) , we have
2 N i = 1 N ( y i X i , L ^ + S ^ ) X i , Δ s + λ s S ^ 1 , Δ s 0 .
Note that
2 N i = 1 N ( y i X i , L ^ + S ^ ) X i , Δ s = 2 N i = 1 N ( ξ i + X i , Δ ι + Δ s ) X i , Δ s = 2 N i = 1 N X i , Δ s 2 2 N i = 1 N ξ i X i , Δ s 2 N i = 1 N X i , Δ ι X i , Δ s = 2 N i = 1 N X i , Δ s 2 2 N i Ω s ξ i X i , Δ s + 2 N i Ω ι ξ i X i , Δ s 2 N i Ω s X i , Δ ι X i , Δ s + 2 N i Ω ι X i , Δ ι X i , Δ s = 2 N i = 1 N X i , Δ s 2 + 2 N i Ω s ξ i X i , Δ s + 2 N i Ω s X i , Δ ι X i , Δ s 2 E , Δ s 2 N i Ω ι X i , Δ ι X i , Δ s 2 N i = 1 N X i , Δ s 2 1 N i Ω s ( ξ i 2 + X i , Δ s 2 ) 1 N i Ω s ( X i , Δ ι 2 + X i , Δ s 2 ) 2 E , Δ s 2 N i Ω ι X i , Δ ι X i , Δ s = 1 N i Ω s ξ i 2 + 1 N i Ω s ( X i , Δ ι 2 2 E , Δ s 2 N i Ω ι X i , Δ ι X i , Δ s 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N 2 E , Δ s 2 N i Ω ι X i , Δ ι X i , Δ s
Thus, we have
λ s S ^ 1 , S ^ S 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N 2 E , Δ s 2 N i Ω ι X i , Δ ι X i , Δ s 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + 2 | E , Δ s | + 2 N i Ω ι X i , Δ ι X i , Δ s 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + 2 E Δ s 1 + 2 1 N ι i Ω ι X i , Δ ι X i Δ s 1 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + 2 E Δ s 1 + 4 a W Δ s 1
One the other hand, the definition of sub-differential indicates
S 1 S ^ 1 S S ^ , S ^ 1
which implies
λ s ( S ^ 1 S 1 ) 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + 2 E Δ s 1 + 2 W Δ s 1
Also note that
S ^ 1 S 1 = S Δ s 1 S 1 = S Θ s ( Δ Θ s s + Δ Θ s s ) 1 S Θ s 1 S Θ s Δ Θ s s 1 Δ Θ s s 1 S Θ s 1 = S Θ s 1 + Δ Θ s s 1 Δ Θ s s 1 S Θ s 1 = Δ Θ s s 1 Δ Θ s s 1
which implies
λ s ( Δ Θ s s 1 Δ Θ s s 1 ) 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + 2 E Δ s 1 + 2 W Δ s 1
Since λ s 4 ( E + 2 a W ) , we have
λ s ( Δ Θ s s 1 Δ Θ s s 1 ) 1 N i Ω s ξ i 2 + 4 a 2 | Θ s | N + λ s 2 ( Δ Θ s s 1 + Δ Θ s s 1 )
Thus, it holds that
Δ Θ s s 1 3 Δ Θ s s 1 + 1 N λ s ( 4 a 2 | Θ s | + i Ω s ξ i 2 )
which complete the proof. □
Lemma A2.
It holds that
1 N ι i Ω ι X i , Δ ι X i 2 a W
Proof of Lemma A2.
Note that
1 N ι i Ω ι X i , Δ ι X i ( i ) sup T 1 1 1 N ι i Ω ι X i , Δ ι X i , T 2 a sup T 1 1 1 N ι i Ω ι X i , Δ ι 2 a X i , T ( i i ) 2 a sup T 1 1 1 N ι i Ω ι X i , T 2 a W
where ( i ) hold since · is the dual norm of · 1 ; ( i i ) holds since | X i , Δ ι | Δ ι 2 a and the tensor 1 -norm · 1 is invariant to changes in sign. □
Lemma A3.
By letting λ ι 4 E s p and λ s E , we have
P ( Δ ι ) 3 P ( Δ ι ) + 2 a λ s λ ι N s + 2 N λ ι i Ω s ξ i 2
Proof of Lemma A3.
In Equation (A33), letting ( L , S ) ( L , S ) , we obtain
2 N i = 1 N ( y i X i , L ^ + S ^ ) X i , Δ ι + Δ s + λ ι L ^ , Δ ι + λ s S ^ 1 , Δ s 0
First, note that
2 N i = 1 N ( y i X i , L ^ + S ^ ) X i , Δ ι + Δ s = 2 N i = 1 N ( ξ i + X i , Δ ι + Δ s ) X i , Δ ι + Δ s = 2 N i = 1 N X i , Δ ι + Δ s 2 2 N i = 1 N ξ i X i , Δ ι + Δ s = 2 N i = 1 N X i , Δ ι + Δ s 2 2 N i Ω s ξ i X i , Δ ι + Δ s 2 E , Δ ι 2 E , Δ Θ s s
Also, we have according to the convexity of · and · 1 that
L L ^ L L ^ , L ^ , and S 1 S ^ 1 S S ^ , S ^ 1
Thus, we have
λ ι ( L ^ L ) + λ s ( S ^ 1 S 1 ) 2 E s p Δ ι + 2 E Δ Θ s s 1 + 1 N i Ω s ξ i 2 .
Moreover, it is often used that
L ^ L P ( Δ ι ) P ( Δ ι ) ,
Since we set λ ι 4 E s p and λ s 4 E , we have
λ ι ( P ( Δ ι ) P ( Δ ι ) ) + λ s ( S ^ 1 S 1 ) λ ι 2 ( P ( Δ ι ) + P ( Δ ι ) ) + λ s 2 S ^ Θ s 1 + 1 N i Ω s ξ i 2
where we use S ^ Θ s = Δ Θ s s . It implies
λ ι 2 P ( Δ ι ) + λ s S ^ Θ s 1 + λ s 2 S ^ Θ s 1 3 λ ι 2 P ( Δ ι ) + λ s S 1 + 1 N i Ω s ξ i 2
Note that, S 1 a N s . Thus, we have
P ( Δ ι ) 3 λ ι P ( Δ ι ) + 2 a λ s λ ι N s + 2 N λ ι i Ω s ξ i 2
Lemma A4.
If N ι d ˜ and λ s 4 ( E + 2 a W ) , then on the event E defined in Equation (A9), we have
Δ s 1 N s N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ )
and it holds with probability at least 1 2.5 / d ˜ that
Δ s Π Δ 1 : = C ( ϱ 2 N s N ι ( 4 a 2 + 2 a N λ ι + C N s σ 2 log d ˜ ) + 16 a N s N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ ) E [ E ] )
Proof of Lemma A4.
We first prove Equation (A51), and then prove Equation (A52).
(I) The proof of Equation (A51): Recall that Lemma A1 implies
Δ Θ s s 1 3 Δ Ω s s 1 + 1 N λ s 4 a 2 N s + i Ω s ξ i 2
Then, we have on event E :
Δ s 1 = Δ Θ s s 1 + Δ Θ s s 1 ( i ) Δ Θ s s 1 + Δ Ω s s 1 ( i i ) 4 Δ Ω s s 1 + 1 N λ s 4 a 2 N s + i Ω s ξ i 2 ( i i i ) N s N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ )
where ( i ) holds due to Assumption A1.I; ( i i ) is a direct use of Lemma A1; ( i i i ) stems from the facts that Δ s 2 a and the definition of event E in Equation (A9).
Thus, Equation (A51) is proved.
(II) The proof of Equation (A52): According to the optimality of ( L ^ , S ^ ) to Problem (8), we have
F ( L ^ , S ^ ) F ( L ^ , S )
which implies
1 N i = 1 N ι ( ξ i + X i , Δ ι + Δ s ) 2 + λ s S ^ 1 1 N i = 1 N ι ( ξ i + X i , Δ ι ) 2 + λ s S 1
which further leads to
1 N i Ω ι X i , Δ s 2 + 1 N i Ω s X i , Δ s 2 + 2 N i Ω s ξ i X i , Δ s + 2 N i Ω s X i , Δ ι X i , Δ s + 2 N i Ω ι X i , Δ ι X i , Δ Θ s s + 2 i Ω ι E , Δ Θ s s + λ s S ^ 1 λ s S 1
Note that by using 2 a b > ( 1 / 2 a 2 + 2 b 2 ) , we have
1 N i Ω s X i , Δ s 2 + 2 N i Ω s ξ i X i , Δ s + 2 N i Ω s X i , Δ ι X i , Δ s ( 2 N i Ω s ξ i 2 + 2 N i Ω s X i , Δ ι 2 )
Thus on the event E defined in Equation (A9), we have
1 N i Ω ι X i , Δ s 2 | 2 N i Ω ι X i , Δ ι X i , Δ Θ s s | + | 2 i Ω ι E , Δ Θ s s | + λ s ( S 1 S ^ 1 ) + 2 N i Ω s ξ i 2 + 2 N i Ω s X i , Δ ι 2 ( i ) ( 4 a W + 2 E ) Δ Θ s s 1 + λ s ( S 1 S ^ 1 ) + 2 N i Ω s ξ i 2 + 2 N i Ω s X i , Δ ι 2 ( i i ) λ s 2 S ^ Θ s 1 + λ s ( S 1 S ^ 1 ) + 2 N i Ω s ( ξ i 2 + X i , Δ ι 2 ) ( i i i ) λ s S 1 + 2 N i Ω s ( ξ i 2 + 4 a 2 ) ( i v ) 2 N s N ( 4 a 2 + 2 a N λ ι + C N s σ 2 log d ˜ )
where ( i ) holds due to Lemma A2; ( i i ) holds because λ s 4 ( 2 a W + E ) , and Δ Θ s s = S ^ Θ s ; ( i i i ) holds because S ^ 1 = S ^ Θ s 1 + S ^ Θ s 1 S ^ Θ s 1 , and | X i , Δ ι | Δ ι 2 a ; ( i v ) holds as a consequence of S a , | Θ s | N s (due to Assumption A1.I), and the definition of event E .
Now, we discuss the bound of Δ s Π in two cases.
  • Case 1. If Δ s Π 2 β = 4 a 2 128 log d ˜ N ι , then Equation (A52) holds trivially.
  • Case 2. If Δ s Π 2 4 a 2 128 log d ˜ N ι , then we have
    Δ s 2 a D N s 2 a N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ ) , δ
    due to the fact Δ s / ( 2 a ) 1 and Equation (A51). Then according to Lemma A5, it holds with probability at least 1 - 1 / d ˜ 2 that
    1 N ι i Ω ι X i , Δ s 2 1 2 Δ s Π 2 16 a N s N λ s ( 4 a 2 + 8 a N λ s + C σ 2 log d ˜ ) E [ E ] .
    Combing Equations (A57) and (A58) yields the bound on Δ s Π .
Lemma A5.
Define the following set
D ( δ , β ) : = B R d 1 × d 2 × d 3 | B 1 , B Π 2 β , B 1 δ
Then, it holds with probability at least 1 2 / d ˜ 3 that
1 N ι i Ω ι X i , B 2 1 2 B Π 2 8 δ E [ R Σ ]
for any B D ( δ , β ) .
Proof of Lemma A5.
We prove this lemma using a standard peeling argument. First, define the following
G : = B D ( δ , β ) s u c h t h a t | 1 N ι i Ω ι X i , B B Π 2 | 1 2 B Π 2 + 8 δ E [ R Σ ]
We partition this set to simpler events with l N + :
G l : = { B D ( δ , β ) C ( t ) w i t h t [ α l 1 β , α l β ] s u c h t h a t | 1 N ι i Ω ι X i , B B Π 2 | 1 2 B Π 2 + 8 δ E [ R Σ ] }
Note that according to Lemma A7, we have with t [ α l 1 β , α l β ) :
P [ G l ] P sup B C ( t ) | 1 N ι i Ω ι X i , B B Π 2 | 1 2 α α l β + 8 δ E [ R Σ ] exp ( n ( α l β ) 2 32 α 2 )
Thus, we have
P G P l = 1 G l l = 1 P [ G l ] l = 1 exp ( N ι ( α l β ) 2 32 α 2 ) = exp ( N ι β 2 32 ) + l = 2 exp ( N ι β 2 32 α 2 ( l 1 ) ) ( i ) exp ( N ι β 2 32 ) + l = 2 exp ( N ι β 2 32 · 2 ( l 1 ) log α ) exp ( N ι β 2 32 ) + exp ( N ι β 2 16 log α ) 1 exp ( N ι β 2 16 log α )
where ( i ) is due to x log x for positive x. By setting α = e and recalling the value of β = 128 log d ˜ N ι , the lemma is proved. □
Lemma A6.
For any ( A , B ) C ( r , κ , β ) R d 1 × d 2 × d 3 × B ( δ 1 , δ 2 ) , it holds with probability at least 1 2 / d ˜ that
1 N ι i Ω ι X i , A + B 2 1 2 A + B Π 2 τ ( r , κ , δ 1 , δ 2 )
where
τ ( r , κ , δ 1 , δ 2 ) = 4 ( 16 α + 1 ) r d 3 | Θ s | E [ R Σ s p ] 2 + 8 κ E [ R Σ s p ] + 8 δ 2 E [ R Σ ] + 4 δ 1 2
Proof of Lemma A6.
The proof is very similar to that of Lemma A5, and we simply omit it. □
Lemma A7.
Define the set
C ( t ) : = { T R d 1 × d 2 × d 3 | T Π 2 t , T 1 }
and
Z t : = sup T C ( t ) 1 N ι i Ω ι X i , T 2 T Π 2 .
Then, it holds that
P [ Z t E [ Z t ] + t 4 α ] exp ( N ι t 2 128 α 2 ) , a n d E [ Z t ] 8 E sup T C ( t ) R Σ , T
Proof of Lemma A7.
First, we study the tail behavior of Z t by directly using the Massart’s inequality in Theorem 14.2 of [64]. According to the Massart’s inequality, it holds for any s > 0
P [ Z t E [ Z t ] + s ] exp ( N ι s 2 8 )
By letting s = t / ( 4 α ) , the first inequality in Equation (A67) is proved.
Then, we will upper bound the expectation of Z t . By standard symmetrization argument [65], we have
E [ Z t ] = E sup T C ( t ) 1 N ι i Ω ι X i , T 2 E X , T 2 = 2 E sup T C ( t ) 1 N ι i Ω ι ε i X i , T 2
where ε i ’s are i.i.d. Randemacher variables. Further, according to the contraction principle [66], it holds that
E [ Z t ] 8 E sup T C ( t ) 1 N ι i Ω ι ε i X i , T = 8 E sup T C ( t ) R Σ , T
In the following, we consider the two cases:
Case 1. Consider T D ( δ , β ) C ( t ) , we have
E [ Z t ] 8 E sup T R Σ , T 8 E R Σ T 1 8 δ E [ R Σ ] .
By letting s = t / ( 2 α ) in Equation (A68), we obtain
P Z t 8 δ E [ R Σ ] + t 2 α exp ( N ι t 2 32 α 2 )
when T D ( δ , β ) C ( t ) .
Case 2. Consider T = A + B , where ( A , B ) C ( r , κ , β ) , B B ( δ 1 , δ 2 ) , and T Π 2 t . The goal in this case is to upper bound
E [ Z t ] 8 E sup A , B R Σ , A + B ( i ) 8 E sup A R Σ , A + 8 E sup B R Σ , B ( i i ) 8 E sup A R Σ s p A + 8 E sup B R Σ B 1 ( i i i ) 8 E sup A R Σ s p A + 8 δ 2 E [ R Σ ]
where ( i ) holds as a property of the sup operation; ( i i ) holds due to the definition of dual norm; ( i i i ) stems from the condition B B ( δ 1 , δ 2 ) .
It remains to upper bound A in Equation (A72). First, according to the definition of C ( r , κ , β ) , we have
A r d 3 A Θ s F + κ
We also have
A Π ( i ) A + B Π + B Π ( i i ) t + δ 1 .
where ( i ) holds due to the triangular inequality, and ( i i ) is a result of conditions T Π 2 t and B B ( δ 1 , δ 2 ) . Since Assumption A1.II indicates A Π 2 = | Θ s | 1 A Θ s F 2 , combing Equations (A74) and (A75) further yields an upper bound on A as follows:
A r d 3 | Θ s | ( t + δ 1 ) + κ
which further leads to
8 E sup A R Σ s p A 8 E [ R Σ s p ] ( r d 3 | Θ s | ( t + δ 1 ) + κ )
The application of 2 a b a / c + b c is also used to further relax the above inequality:
8 E [ R Σ s p ] ( r d 3 | Θ s | t t 4 α + 64 α r d 3 | Θ s | E [ R Σ s p ] 2 8 E [ R Σ s p ] ( r d 3 | Θ s | δ 1 4 δ 1 2 + 4 r d 3 | Θ s | E [ R Σ s p ] 2
Thus, putting things in Equations (A72), (A76) and (A77) together, we obtain
E [ Z t ] t 4 α + 4 ( 16 α + 1 ) r d 3 | Θ s | E [ R Σ s p ] 2 + 8 κ E [ R Σ s p ] + 8 δ 2 E [ R Σ ] + 4 δ 1 2 = : τ
which further gives
P Z t t 2 α + τ exp ( N ι t 2 128 )
for T = A + B , where ( A , B ) C ( r , κ , β ) , B B ( δ 1 , δ 2 ) , and T Π 2 t . □
Lemma A8.
Under Assumption A1, there exists an absolute constant C > 0 such that the following bounds on the tensor spectral norm of stochastic tensors E and R Σ hold:
(I)
For tensor E , we have
P E s p C σ max t + log d ˜ ϱ N ( d 1 d 2 ) + log ( d 1 d 2 ) ( t + log d ˜ ) N 1 e t
(II)
For tensor R Σ , we have
E R Σ log d ˜ N ι ( d 1 d 2 ) + log 2 d ˜ N ι
Proof of Lemma A8.
Equation (A79) can be seen as a special case of Lemma 8 in [18] when k = 1 . Equation (A80) can be proved very similarly to Equation (A79) followed by tricks used in proof of Lemma 6 in [51]. We omit the details due to the high similarity. □
Lemma A9.
Under Assumption A1, there exists an absolute constant C > 0 such that the following bounds on the l -norm of stochastic tensors E , W , and R Σ hold:
(I) 
For tensor E , we have
P E C σ t + log d ˜ ϱ N d 1 d 2 d 3 + t + log d N 1 e t
E E C σ log d ˜ ϱ N d 1 d 2 d 3 + log d N
(II) 
For tensor W , we have
P W C 1 ϱ d 1 d 2 d