Nonparametric Tensor Completion Based on Gradient Descent and Nonconvex Penalty

Abstract: Existing tensor completion methods all require some hyperparameters. However, these hyperparameters determine the performance of each method, and it is difficult to tune them. In this paper, we propose a novel nonparametric tensor completion method, which formulates tensor completion as an unconstrained optimization problem and designs an efficient iterative method to solve it. In each iteration, we not only calculate the missing entries by the aid of data correlation but also consider the low-rank structure of the tensor and the convergence speed of the iteration. Our iteration is based on the gradient descent method and approximates the gradient descent direction with tensor matricization and singular value decomposition. Considering the symmetry of every dimension of a tensor, the optimal unfolding direction in each iteration may be different, so we select the optimal unfolding direction by the scaled latent nuclear norm in each iteration. Moreover, we design a formula for the iteration step-size based on a nonconvex penalty. During the iterative process, we store the tensor in sparse form and adopt the power method to compute the maximum singular value quickly. Experiments on image inpainting and link prediction show that our method is competitive with six state-of-the-art methods.


Introduction
Real-world data are often sparse but rich in structures and can be stored in arrays. Tensors are K-way arrays that can be used to store multimodal data, image/video data, complex relationship network data, etc. At present, tensors have been successfully applied in many fields, such as image restoration [1], recommendation systems [2], signal processing [3], and high-order web link analysis [4]. Moreover, tensors have also been applied in clustering and classification in some recent studies [5,6]. A comprehensive survey of the applications of tensors can be found in [7]. In these applications, a decisive work is to fill in the missing values of the tensor, namely, tensor completion.
For matrix completion, a common method is to decompose the matrix into two factor matrices and then use them to calculate the missing data [8,9]. Another method is to turn it into a Rank Minimization (RM) problem. Analogous to matrix completion, the methods of tensor completion can also be divided into two categories: tensor decomposition and RM. Tucker decomposition and CANDECOMP/PARAFAC (CP) [10] are two classic methods of tensor decomposition; they decompose a high-order tensor into a kernel tensor and some factor vectors. Reference [11] proposed tensor Singular Value Decomposition (t-SVD), but it can only be used for small-scale tensors because the tensor will be expanded to a large-scale matrix. In [12], t-SVD was applied to an image deblurring problem. Reference [13] proposed a low-order tensor decomposition for tagging recommendation.
The main contributions of this paper are as follows:

1.
Unlike existing methods, our method has no parameters and is easy to use.

2.
In each iteration, we use tensor matricization and SVD to approximate the gradient descent direction, so the entries outside the observation range can also be updated.

3.
Considering the symmetry of every dimension of a universal tensor, we select the optimal gradient tensor via the scaled latent nuclear norm in each iteration.

4.
We design the formula of the iteration step-size elaborately, which enables our iteration to achieve a higher convergence speed and a lower error.
The rest of the paper is organized as follows. Section 2 introduces the background knowledge. Our method is proposed in Section 3. Section 4 gives the experimental results and analysis. Finally, the conclusions are given in Section 5.

Symbols and Formulas
In this paper, vectors and matrices are denoted by lowercase and uppercase bold italic letters, respectively, and tensors are denoted by bold calligraphic letters. The relevant symbols and formulas involved in this paper are as follows: the inner product of two tensors X and Y of the same size is ⟨X, Y⟩ = Σ_{i_1,...,i_D} X_{i_1···i_D} Y_{i_1···i_D}, where X_{i_1···i_D} and Y_{i_1···i_D} are the elements of X and Y, respectively.
To make tensor matricization and matrix tensorization easy to understand, a 3-dimensional tensor is provided below as an example. Suppose X is a tensor of size 3 × 4 × 2 as Figure 1 shows.
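As a concrete illustration of mode-d matricization, the sketch below unfolds a 3 × 4 × 2 tensor along each mode. It is an illustrative sketch only: the column ordering (remaining indices flattened with the first varying fastest) is one common convention and may differ from the ordering shown in Figure 1.

```python
import itertools

def unfold(X, shape, d):
    """Mode-d matricization of a dense tensor stored as nested lists.

    The row index is the mode-d index; the remaining indices are
    flattened (first varying fastest) into the column index.
    """
    rows = shape[d]
    cols = 1
    for k, s in enumerate(shape):
        if k != d:
            cols *= s
    M = [[0.0] * cols for _ in range(rows)]
    for idx in itertools.product(*(range(s) for s in shape)):
        col, stride = 0, 1
        for k, i in enumerate(idx):
            if k == d:
                continue
            col += i * stride
            stride *= shape[k]
        v = X
        for i in idx:          # nested indexing X[i1][i2][i3]
            v = v[i]
        M[idx[d]][col] = v
    return M

# 3 x 4 x 2 example: the entry value encodes its position
shape = (3, 4, 2)
X = [[[i + 3 * j + 12 * k + 1 for k in range(2)] for j in range(4)]
     for i in range(3)]
X1 = unfold(X, shape, 0)   # 3 x 8 matrix
X3 = unfold(X, shape, 2)   # 2 x 12 matrix
```

Matrix tensorization (the `fold_d` operation used later) is simply the inverse mapping: each matrix entry is written back to the multi-index it came from.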

Related Algorithms
SVD needs to consume a substantial amount of memory and CPU resources. In the real world, data usually have a large scale but are very sparse. For a large sparse matrix, the power method [21] can compute the maximum singular value quickly. Gradient descent is a simple but effective method for solving unconstrained convex optimization problems. It is used to minimize a function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. In this paper, we use the gradient descent method to solve the optimization problem of tensor completion and use the power method to improve computational efficiency.
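The power iteration underlying this strategy can be sketched as follows. This is a minimal dense-matrix version for illustration (the iteration count and random starting vector are arbitrary choices); a practical implementation would exploit sparsity in the two matrix-vector products, which are the only expensive operations.

```python
import math, random

def top_singular(A, iters=200, seed=0):
    """Estimate the largest singular value of A (list of rows) and the
    corresponding singular vectors via power iteration on A^T A."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        # u = A v ; w = A^T u : one power-iteration step on A^T A
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, u, v = top_singular(A)   # sigma converges to 3
```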

Problem Description
The goal of tensor completion is to fill in the missing entries of a partially known tensor. A usual approach is to find a tensor that is close enough to the original tensor at the positions of the known entries. Suppose A is a D-order tensor, A ∈ R^{I_1×I_2×···×I_D}; the positions of the observed (namely, known) entries in A are indicated by Ω. Tensor completion can be formulated as the following optimization problem:

min_X ‖P_Ω(X − A)‖_F^2, (1)

where (P_Ω(X))_{i_1···i_D} = X_{i_1···i_D} if (i_1, ..., i_D) ∈ Ω, and 0 otherwise. Problem (1) reduces to matrix completion when D = 2. Problem (1) is an unconstrained optimization problem. If X has identical entries to A in the range of Ω but arbitrary entries outside the range of Ω, then it makes ‖P_Ω(X − A)‖_F^2 attain the minimum value 0. However, such an X is meaningless because it ignores the inner structure and correlation of the data. The most common remedy is to constrain X with the rank or nuclear norm [22,23]. However, this introduces some hyperparameters, e.g., the upper limit for the rank or norm. For unsupervised learning, it is difficult to choose appropriate hyperparameters. To address this issue, we propose an efficient nonparametric iterative method to solve the unconstrained optimization problem (1).
Although we do not add constraints for problem (1), in each iteration, we not only calculate the missing entries by the aid of data correlation, but also consider the low-rank of tensor and the convergence speed of iteration.

Iterative Calculation Based on Gradient Descent
For simplicity, we convert (1) into the following form:

min_X F(X) = (1/2) ‖P_Ω(X − A)‖_F^2. (2)

F(X) is a continuously differentiable convex function, and its derivative is

F′(X) = P_Ω(X − A). (3)

We use the gradient descent method to solve problem (2), and the iterative formula is as follows:

X^{n+1} = X^n − λ F′(X^n) = X^n + λ P_Ω(A − X^n), (4)

where n is the number of iterations and λ is the iteration step-size. Note that P_Ω(A − X^n) only has values in the range of Ω, so (4) cannot update the entries of X outside the range of Ω. Therefore, we hope to find an approximation of P_Ω(A − X^n) that has values outside the range of Ω so that (4) can update the entries of X outside the range of Ω. Based on tensor matricization and SVD, we have

(P_Ω(A − X^n))_d = Σ_{i=1}^{r^n_d} σ^n_{d,i} u^n_{d,i} (v^n_{d,i})^T, (5)

where d ∈ {1, 2, ..., D}; r^n_d is the rank of the matrix (P_Ω(A − X^n))_d; σ^n_{d,i}, u^n_{d,i}, and v^n_{d,i} are the ith singular value, left singular vector, and right singular vector of (P_Ω(A − X^n))_d, respectively; and σ^n_{d,1} > σ^n_{d,2} > ... > σ^n_{d,r^n_d}. P_Ω(A − X^n) can be approximated by selecting the first m (m < r^n_d) singular values in (5). In theory, the larger the value of m is, the better the approximation of P_Ω(A − X^n). However, (i) too many singular values will increase the computational complexity, and (ii) small singular values usually represent noise. Therefore, we only use the largest singular value and the corresponding left and right singular vectors to approximate P_Ω(A − X^n):

P_Ω(A − X^n) ≈ fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T), (6)

where fold_d denotes mode-d tensorization, so the iterative formula (4) becomes

X^{n+1} = X^n + λ fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T). (7)

Here, (P_Ω(A − X^n))_d is the mode-d unfolding of P_Ω(A − X^n) and is usually a large-scale matrix, so traditional SVD cannot be used for efficient calculation. We adopt the power method to calculate the maximum singular value and the corresponding singular vectors quickly.
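One step of the update (7) can be sketched as follows for the matrix case (D = 2). The residual and rank-1 update mirror the operations described above; the example values, and the assumption that the top singular triplet (sigma, u, v) has already been computed (e.g., by the power method), are illustrative.

```python
def residual(A, X, omega):
    """P_Omega(A - X): zero everywhere except the observed positions."""
    R = [[0.0] * len(A[0]) for _ in A]
    for (i, j) in omega:
        R[i][j] = A[i][j] - X[i][j]
    return R

def gd_step(X, sigma, u, v, lam):
    """X^{n+1} = X^n + lam * sigma * u v^T: the rank-1 term updates
    every entry, including those outside the observed range."""
    return [[X[i][j] + lam * sigma * u[i] * v[j]
             for j in range(len(X[0]))] for i in range(len(X))]

# Tiny example: a fully observed rank-1 matrix A = a b^T with X^0 = 0,
# so the top singular triplet of the residual is known in closed form.
a, b = [1.0, 2.0], [3.0, 4.0]
A = [[ai * bj for bj in b] for ai in a]
omega = [(i, j) for i in range(2) for j in range(2)]
X0 = [[0.0, 0.0], [0.0, 0.0]]
R = residual(A, X0, omega)                # equals A here
sigma = (1 + 4) ** 0.5 * (9 + 16) ** 0.5  # |a| * |b|
u = [ai / (5 ** 0.5) for ai in a]
v = [bj / 5.0 for bj in b]
X1 = gd_step(X0, sigma, u, v, 0.5)        # = 0.5 * A
```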
Theorem 1. For the iteration (7) with λ ∈ (0, 1), F(X^{n+1}) < F(X^n).

Proof. For a tensor X, matricization preserves the Frobenius norm:

‖X‖_F = ‖X_d‖_F. (8)

In (5), the u^n_{d,i} are orthonormal to each other, and the v^n_{d,i} are orthonormal to each other. Then, according to (8) and (5), we have

⟨P_Ω(A − X^n), fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T)⟩ = (σ^n_{d,1})^2. (9)

According to (8), (7), and (5), we can deduce

F(X^{n+1}) = (1/2) ‖P_Ω(X^n − A) + λ P_Ω(fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T))‖_F^2 ≤ F(X^n) − λ(σ^n_{d,1})^2 + (λ^2/2)(σ^n_{d,1})^2 = F(X^n) − λ(1 − λ/2)(σ^n_{d,1})^2. (10)

Since λ ∈ (0, 1), F(X^{n+1}) < F(X^n). □

Selection of the Unfolding Direction
In (7), there exist D different directions along which to unfold P_Ω(A − X^n), since A is a D-order tensor. When D = 2, tensor completion reduces to matrix completion. For a 2-order tensor M ∈ R^{I_1×I_2}, M_1 = M and M_2 = M^T. The two unfolding matrices have identical singular values and exchange left and right singular vectors, so the choice of unfolding direction does not affect the final results. When D > 2, different unfolding directions may lead to different accuracies and convergence speeds, so we need to select the optimal unfolding direction. Considering the symmetry of every dimension of a tensor, the optimal unfolding direction may differ from iteration to iteration. Below, we discuss how to choose the unfolding direction d in each iteration.
Real-world data tensors often exhibit low-rank structures, and tensor completion usually attempts to recover a low-rank tensor that best approximates a partially observed data tensor [24]. In tensor completion, rank is often surrogated by the nuclear norm. The overlapped nuclear norm [25] and the scaled latent nuclear norm [26] are two commonly used tensor nuclear norms, but the latter is more appropriate than the former for tensor completion [27]. For a D-order tensor X, its scaled latent nuclear norm is defined as follows:

‖X‖_{scaled} = min_{X^(1)+···+X^(D)=X} Σ_{d=1}^{D} (1/√I_d) ‖(X^(d))_d‖_*, (11)

where ‖·‖_* is the matrix nuclear norm. In (11), if we let X^(d) = X and X^(i) = 0 (i ≠ d), then

‖X‖_{scaled} ≤ (1/√I_d) ‖X_d‖_*. (12)

Therefore, we have

‖X‖_{scaled} ≤ min_{d∈{1,2,...,D}} (1/√I_d) ‖X_d‖_*. (13)

The nuclear norm is a convex surrogate for rank [28]. In (4), if X^n and P_Ω(A − X^n) are both low-rank tensors, then X^{n+1} will also tend to be a low-rank tensor. Therefore, we choose the unfolding direction d that minimizes the scaled latent nuclear norm of the rank-1 approximation of P_Ω(A − X^n), i.e.,

d = argmin_{d∈{1,2,...,D}} (1/√I_d) σ^n_{d,1}. (14)

The calculations of σ^n_{d,1} (d ∈ {1, 2, ..., D}) are independent, and we can perform them in parallel.
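Under this rank-1 surrogate, direction selection reduces to comparing σ^n_{d,1}/√I_d across modes. The sketch below assumes the maximum singular values of the D unfoldings have already been computed; the numeric values are made up for illustration, and the argmin form follows the reconstruction of (14) from the surrounding text.

```python
import math

def best_mode(sigmas, dims):
    """Pick the unfolding direction d minimizing sigma_{d,1}/sqrt(I_d),
    the rank-1 surrogate of the scaled latent nuclear norm."""
    return min(range(len(dims)),
               key=lambda d: sigmas[d] / math.sqrt(dims[d]))

# Hypothetical top singular values of the three unfoldings of a
# 30 x 40 x 5 tensor
d = best_mode([12.0, 10.0, 9.0], [30, 40, 5])
```

Each call to the scoring function depends on one unfolding only, which is why the D computations of σ^n_{d,1} can run in parallel.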

Design of the Iteration Step-Size
The remaining part of (7) to determine is the step-size λ of each iteration. We consider the following three points.

1.
In the gradient descent method, −F′(X) is the steepest descent direction, and we approximate it using only the maximum singular value (and corresponding singular vectors) of its unfolding matrix. If the maximum singular value is very large, we may be ignoring other large singular values, and the approximation of the steepest descent direction may be unsatisfactory. Therefore, we should adopt a small step-size to avoid excessive errors. Conversely, if the maximum singular value is very small, the approximation may be more accurate, and we can adopt a large step-size. In other words, the larger the maximum singular value is, the smaller the step-size should be.

2.
In the gradient descent method, −F′(X) becomes increasingly smaller during the iterative process; thus, the maximum singular value σ^n_{d,1} of each iteration presents a downtrend as a whole. Then, according to Point 1, the step-size should show an upward trend during the iterative process. However, the traditional approach and some related approaches [29,30] all make the step-size increasingly smaller during the iterative process, which does not meet our requirements.

3.
The step-size λ can also be viewed as a penalty on the singular value. In matrix completion, nonconvex functions are used to penalize the singular values and achieve a better effect than the direct use of the nuclear norm [31,32]. Reference [32] penalizes the singular values with the function f(x) = log(x + 1) for matrix completion and does not introduce additional parameters.
Based on the above three points, the formula of the iteration step-size is designed as follows:

λ = log(σ^n_{d,1} + 1) / σ^n_{d,1}. (15)

In (15), the larger σ^n_{d,1} is, the smaller λ is, and the singular value after the penalty is λσ^n_{d,1} = log(σ^n_{d,1} + 1), which is a nonconvex function. In addition, it is easy to prove that λ ∈ (0, 1).
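The three required properties can be checked numerically. The sketch below assumes the step-size takes the form λ = log(σ + 1)/σ implied by the penalized singular value log(σ + 1):

```python
import math

def step_size(sigma):
    """lambda = log(sigma + 1)/sigma, so the penalized singular value
    lambda * sigma equals log(sigma + 1); lambda lies in (0, 1) for
    every sigma > 0 and increases as sigma shrinks."""
    return math.log(sigma + 1.0) / sigma

# As sigma decreases across iterations, lambda trends upward
lams = [step_size(s) for s in (100.0, 10.0, 1.0, 0.1)]
```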

Optimization of Calculation
If we directly used (7) to compute X^{n+1}, we would need to store and update every entry of X^n during the iterative process, and the time and space costs would be prohibitive. We split (7) into the following two parts:

P_Ω(X^{n+1}) = P_Ω(X^n) + λ P_Ω(fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T)), (16)
P_Ω̄(X^{n+1}) = P_Ω̄(X^n) + λ P_Ω̄(fold_d(σ^n_{d,1} u^n_{d,1} (v^n_{d,1})^T)), (17)

where Ω̄ denotes the positions of the missing entries in tensor A. The goal of tensor completion is to calculate the entries in the range of Ω̄. To achieve this, we need to calculate the values of σ^n_{d,1}, u^n_{d,1}, and v^n_{d,1} in each iteration, and these values depend only on P_Ω(X^n), not on P_Ω̄(X^n). That is, we just need to store and update the entries of X^n in the range of Ω (namely, P_Ω(X^n)) during the iterative process. Furthermore, P_Ω(X^n) can be stored as a sparse tensor.
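The sparse bookkeeping can be sketched as follows for the matrix case: only the observed entries P_Ω(X) are stored (here in a dict keyed by index pairs), and an entry outside Ω is materialized on demand from the accumulated rank-1 steps. The single-step example and all names are illustrative, not the paper's implementation.

```python
def sparse_step(X_obs, sigma, u, v, lam):
    """Apply X <- X + lam * sigma * u v^T restricted to the observed
    positions: only P_Omega(X) is stored and updated between iterations."""
    return {(i, j): x + lam * sigma * u[i] * v[j]
            for (i, j), x in X_obs.items()}

def off_support_entry(steps, i, j):
    """An unobserved entry is the accumulated sum of all rank-1 steps,
    evaluated at (i, j) on demand rather than stored densely."""
    return sum(lam * sigma * u[i] * v[j] for (lam, sigma, u, v) in steps)

# Observed positions of a 2 x 2 example, with X^0 = 0
X_obs = {(0, 0): 0.0, (1, 1): 0.0}
step = (0.5, 2.0, [1.0, 0.0], [1.0, 1.0])  # (lam, sigma, u, v) -- made up
lam, sigma, u, v = step
X_obs = sparse_step(X_obs, sigma, u, v, lam)
```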

Analysis of Time Complexity
According to (14), in each iteration we need to calculate the maximum singular value of (P_Ω(A − X^n))_d for each dimension d (d ∈ {1, 2, ..., D}). The power method is used to calculate the maximum singular value of a matrix and is itself an iterative method. In each iteration of the power method, we need to calculate a matrix-vector multiplication, where the size of the matrix is I_d × (I_1 × ··· × I_{d−1} × I_{d+1} × ··· × I_D). Thus, the time complexity of the matrix-vector multiplication is at most I_d × (I_1 × ··· × I_{d−1} × I_{d+1} × ··· × I_D) = I_1 × I_2 × ··· × I_D. Based on the above analysis, the time complexity of our method is O(N × D × P × I_1 × I_2 × ··· × I_D), where N is the number of iterations in the gradient descent method and P is the number of iterations in the power method.

Experiments
In this section, we first compare the performance of our NTC against some recently proposed methods and then demonstrate the effectiveness of the step-size design in NTC. Experiments are performed on a PC with an Intel i7 CPU and 32 GB of RAM. The software environment is MATLAB R2017a on Windows 10. We use the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) as performance indicators:

RMSE = sqrt( (1/|Ω_T|) Σ_{(i_1,...,i_D)∈Ω_T} (X_{i_1···i_D} − A_{i_1···i_D})^2 ),
MAE = (1/|Ω_T|) Σ_{(i_1,...,i_D)∈Ω_T} |X_{i_1···i_D} − A_{i_1···i_D}|,

where Ω_T represents the entries in the testing set and |Ω_T| is the number of entries in Ω_T.
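The two indicators follow directly from their definitions; a minimal sketch (matrix case for brevity, with illustrative names):

```python
import math

def rmse(pred, truth, omega_t):
    """Root Mean Square Error over the test entries Omega_T."""
    se = sum((pred[i][j] - truth[i][j]) ** 2 for (i, j) in omega_t)
    return math.sqrt(se / len(omega_t))

def mae(pred, truth, omega_t):
    """Mean Absolute Error over the test entries Omega_T."""
    ae = sum(abs(pred[i][j] - truth[i][j]) for (i, j) in omega_t)
    return ae / len(omega_t)

P = [[1.0, 2.0], [3.0, 4.0]]   # predictions
T = [[1.0, 4.0], [3.0, 2.0]]   # ground truth
omega_t = [(0, 1), (1, 1)]     # test positions
```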

Performance Comparison
We use image inpainting and link prediction to perform the experiments for performance comparison. Our competitors include (i) SPC [33], which applies a total variation constraint to the low-rank model for tensor completion; (ii) TR-LS [16], which adopts a dual framework to solve low-rank tensor completion; (iii) Rprecon [17], which is a Riemannian manifold preconditioning approach for tensor completion; (iv) GeomCG [18], which applies Riemannian optimization to the completion of fixed-rank tensors; (v) FFW [28], which uses the Frank-Wolfe algorithm and the scaled latent nuclear norm for tensor completion; and (vi) CTD-S [34], which accelerates tensor decomposition by removing redundancy. All of them need to tune several parameters.
The experimental data are divided into two parts: the training set and the testing set. For the methods that need to tune some parameters, we take a subset of the data out of the training set; this portion, called the verification set, is used to learn the optimal parameters. Our method has no parameters, so it does not need a verification set. Each experiment is repeated five times, and the results are averaged. As mentioned, in (14), we can calculate σ^n_{d,1} (d ∈ {1, 2, ..., D}) in parallel. However, considering that (i) 3-dimensional tensors are used in our tests, (ii) the power method is used to calculate the maximum singular value σ^n_{d,1} quickly, and (iii) parallelization itself incurs a certain amount of overhead, we do not parallelize these calculations in our tests. Figure 2 shows three RGB images with 720 × 1280, 600 × 960, and 1024 × 1024 pixels, respectively. Each image can be stored in a tensor of size "height" × "width" × 3. For each image, we experiment with two training-test ratios: 25%:75% and 40%:60%. For the methods with parameters, we select 20% of the training set as the verification set (for parameter tuning). The test results are shown in Tables 1 and 2, where we highlight the best result(s) in bold and underline the second-best result(s) in each column.
It can be seen from Tables 1 and 2 that, on the whole, our NTC has the smallest errors and performs best. TR-LS and FFW can also achieve good results. CTD-S is a noniterative tensor completion method, and its performance is not satisfactory.
Next, we use the latest dataset on MovieLens [35] to build three three-dimensional tensors of "user-movie-category" to perform the experiments. Table 3 presents the related information of the three datasets. The known entries in each tensor (dataset) are randomly divided into a training set and a testing set. For each dataset, we experiment with two training-test ratios: 80%:20% and 60%:40%. Similarly, we select 20% of the training set as a verification set for the methods with parameters. Tables 4 and 5 give the results. In Tables 4 and 5, each method yields a total of 12 error values. Overall, NTC performs best, as six of its errors are the smallest and four are the second-smallest. In particular, for the 2400T and 4000T datasets, our NTC achieves the smallest RMSE. TR-LS, Rprecon, GeomCG, and FFW achieve better results on some datasets, and CTD-S has the highest RMSE and MAE. Unfortunately, 32 GB of RAM cannot support SPC on the 4000T dataset. In summary, our method is competitive with the other methods.

Effectiveness of Our Step-Size Design
We choose 10,532 users and 22,157 movies from the latest dataset on MovieLens to build a two-dimensional tensor. The tensor contains 1,048,575 known entries, of which 80% are selected as the training set and the rest as the testing set. We compare the performance of our step-size design against three other designs. Our design is denoted as D0, and the other three designs are denoted as D1, D2, and D3. The four designs make the singular value after penalization (namely, λ × σ^n_{d,1}) equal to log(σ^n_{d,1} + 1), 1, σ^n_{d,1}, and σ^n_{d,1}/2^n, respectively. The results are shown in Figure 3. From Figure 3, we can see that (i) D3 performs the worst, and its RMSE is significantly higher than those of the other designs; (ii) D0 and D2 perform better than D1; and (iii) D2 converges very fast in the first 300 s but is then overtaken by D0. Overall, our design (D0) achieves a higher convergence speed and a lower error.


Conclusions
This paper proposes a new Nonparametric Tensor Completion (NTC) method based on gradient descent and a nonconvex penalty. Our method formulates tensor completion as an unconstrained optimization problem and designs an efficient iterative method to solve it. We use gradient descent to solve the optimization problem of tensor completion and build a gradient tensor with tensor matricization and SVD. We select the optimal unfolding direction based on the scaled latent nuclear norm and design the formula of the iteration step-size elaborately. Furthermore, during the iterative process, we store the tensor in sparse form and adopt the power method to compute the maximum singular value quickly. Unlike existing methods, our method has no parameters and is easy to use. We use image inpainting and link prediction to compare NTC against six state-of-the-art methods, and the test results demonstrate that NTC is competitive with them. In addition, an experiment on two-dimensional tensor completion shows the effectiveness of our step-size design.
In this paper, we select the optimal unfolding direction by scaled latent nuclear norm to construct an iterative formula. Next, we will try to select more than one unfolding directions and weight these directions so as to achieve lower error and faster convergence speed.
Author Contributions: The authors contributed equally to the theoretical framing and algorithms, and the corresponding author took principal responsibility for writing the article.