Correntropy Based Matrix Completion

This paper studies the matrix completion problems when the entries are contaminated by non-Gaussian noise or outliers. The proposed approach employs a nonconvex loss function induced by the maximum correntropy criterion. With the help of this loss function, we develop a rank constrained, as well as a nuclear norm regularized model, which is resistant to non-Gaussian noise and outliers. However, its non-convexity also leads to certain difficulties. To tackle this problem, we use the simple iterative soft and hard thresholding strategies. We show that when extending to the general affine rank minimization problems, under proper conditions, certain recoverability results can be obtained for the proposed algorithms. Numerical experiments indicate the improved performance of our proposed approach.


Introduction
Arising from a variety of applications such as online recommendation systems [1,2], image inpainting [3,4] and video denoising [5], the matrix completion problem has drawn tremendous and continuous attention over recent years [6][7][8][9][10][11][12]. The matrix completion aims at recovering a low rank matrix from partial observations of its entries [7]. The problem can be mathematically formulated as: where X, B ∈ R m×n and Ω is an index set. Due to the nonconvexity of the rank function rank(·), solving this minimization problem is NP-hard in general. To obtain a tractable convex relaxation, the nuclear norm heuristic was proposed [7]. Incorporated with the least squares loss, the nuclear norm regularization was proposed to solve (1) when the observed entries are contaminated by Gaussian noise [13][14][15][16]. In real-world applications, datasets might be contaminated by non-Gaussian noise or sparse gross errors, which can appear in both explanatory and response variables. However, it has been well understood that the least squares loss cannot be resistant to non-Gaussian noise or outliers.
To address this problem, some efforts have been made in the literature. Ref. [17] proposed a robust approach by using the least absolute deviation loss. Huber's criterion was adopted in [18] to introduce robustness into matrix completion. Ref. [19] proposed to use an L p (0 < p ≤ 1) loss to enhance the robustness. However, as explained later, the approaches mentioned above cannot be robust to impulsive errors. In this study, we propose to use the correntropy-induced loss function in matrix completion problems when pursuing robustness.

Related Work and Discussions
In matrix completion, solving the optimization problem in Model (1) is NP-hard, and a usual remedy is to consider the following nuclear norm convex relaxation: Theoretically, it has been demonstrated in [7,8] that under proper assumptions, with an overwhelming probability, one can reconstruct the original matrix. Situations of the matrix completion with noisy entries have been also considered; see, e.g., [6,9]. In the noisy setting, the corresponding observed matrix turns out to be: where B Ω denotes the projection of B onto Ω, and E refers to the noise. The following two models are frequently adopted to deal with the noisy case: min X∈R m×n s.t. rank(X) ≤ R, and its convex relaxed and regularized heuristic: where λ > 0 is a regularization parameter. Similar theoretical reconstruction results have been also derived in the noiseless case under technical assumptions. Along this line, various approaches have been proposed [14][15][16]23,24]. Among others, Refs. [10,25] interpreted the matrix completion problem as a specific case of the trace regression problem endowed with an entry-wise least squares loss, · 2 F . In the above-mentioned settings, the noise term E is usually assumed to be Gaussian or sub-Gaussian to ensure the good generalization ability, which certainly excludes the heavily-tailed noise and/or outliers.

Existing Robust Matrix Completion Approaches
It has been well understood that the least squares estimator cannot deal with non-Gaussian noise or outliers. To alleviate this limitation, some efforts have been made.
In a seminal work, Ref. [17] proposed a robust matrix completion approach, in which the model takes the following form: min X,E∈R m×n The above model can be further formulated as: where λ > 0 is a regularization parameter. The robustness of the model (4) results from using the least absolute deviation loss (LAD). This model was later applied to the column-wise robust matrix completion problem in [26]. By further decomposing E into E = E 1 + E 2 , where E 1 refers to the noise and E 2 stands for the outliers, Ref. [18] proposed the following robust reconstruction model: where λ, γ > 0 are regularization parameters. They further showed that the above estimator is equivalent to the one obtained by using Huber's criterion when evaluating the data-fitting risk. We also note that [19] adopted an L p (0 < p ≤ 1) loss to enhance the robustness.

Our Proposed Nonconvex Relaxation Approach
As stated previously, matrix completion models based on the least squares loss cannot perform well with non-Gaussian noise and/or outliers. Accordingly, robustness can be pursued by using a robust loss as mentioned earlier. Associated with a nuclear norm penalization term, they are essentially regularized M-estimator. However, note that the LAD loss and the L p loss penalize the small residuals strongly and hence cannot lead to accurate prediction for unobserved entries from the trace regression viewpoint. Moreover, robust statistics reminds us that models based on the above three mentioned loss functions cannot be robust to impulsive errors [27,28]. These limitations encourage us to employ more robust surrogate loss functions to address this problem. In this paper, we present a nonconvex relaxation approach to deal with the matrix completion problem with entries heavily contaminated by noise and/or outliers.
In our study, we propose the robust matrix completion model based on a robust and nonconvex loss, which is defined by: with σ > 0 a scale parameter. To give an intuitive impression, plots of loss functions mentioned above are given in Figure 1. As mentioned above, this loss function is induced by the correntropy, which measures the similarity between two random variables [20,21] and has found many successful applications [29][30][31]. Recently, it was shown in [22] that regression with the correntropy-induced losses regresses towards the conditional mean function with a diverging scale parameter σ when the sample size goes to infinity. It was also shown in [32] that when the noise variable admits a unique global mode, regression with the correntropy-induced losses regresses towards the conditional mode. As argued in [22,32], learning with correntropy-induced losses can be resistant to non-Gaussian noise and outliers, while ensuring good prediction accuracy simultaneously with properly chosen σ.
Associated with the ρ σ loss, our rank-constraint robust matrix completion problem is formulated as: min where the data-fitting risk σ (X) is given by: The nuclear norm heuristic model takes the following form: where λ > 0 is a regularization parameter.

Affine Rank Minimization Problem
In this part, we will show that our robust matrix completion approach can be extended to deal with the robust affine rank minimization problems.
It is known that the matrix completion problem (1) is a special case of the following affine rank minimization problem: where b ∈ R p is given, and A : R m×n → R p is a linear operator defined by: where A i ∈ R m×n for each i. Introduced and studied in [33], this problem has drawn much attention in recent years [14][15][16]23]. Note that (7) can be reduced to the matrix completion problem (1) if we set p = |Ω| (the cardinality of Ω), and let A (i−1)n+j = e i (m)e j (n) T for each (i, j) ∈ Ω, where e i (m), i = 1, . . . , m and e j (n), j = 1, . . . , n are the canonical basis vector of R m and R n , respectively. In fact, (5) and (6) can be naturally extended to handle cases with noise and outliers of (7). Denote the risk as follows: The rank constrained model can be formulated as: and the nuclear norm regularized heuristic takes the form: Referring to computational considerations presented below, we will focus on the more general optimization problems (8) and (9), which can be directly applied to (5) and (6).

Algorithms and Analysis
We consider using gradient descent-based algorithms to solve the proposed models. It is usually admitted that gradient descent is not very efficient. However, in our experiments, we find that gradient descent is still efficient, and comparable with some state-of-the-art methods. On the other hand, we present recoverability and convergence rate results for gradient descent applied to the proposed models. Such results and analysis may help us better understand the models and such a nonconvex loss function from the algorithmic aspects.
We first consider gradient descent with hard thresholding for solving (8). The derivation is standard. Denote S R := {X ∈ R m×n | rank(X) ≤ R}. By the differentiability of σ , when Y is sufficiently close to X, σ can be approximated by: Here, α > 0 is a parameter, and ∇ σ (Y), the gradient of σ at Y, is equal to: Now, the iterates can be generated as follows: with: We simply write (11) as X (k+1) = P S R (Y (k+1) ), where P S R denotes the hard thresholding operator, i.e., the best rank-R approximation to Y (k+1) . The algorithm is presented in Algorithm 1.

end while
The algorithm starts from an initial guess X (0) and continues until some stopping criterion is satisfied, e.g., where is a certain given positive number. Indeed, such a stopping criterion makes sense, as Proposition A3 shows that X (k) − X (k+1) F → 0. To ensure the convergence, the step-size should satisfy α > L := A 2 2 , where A 2 denotes the spectral norm of A. For matrix completion, the spectral norm is smaller than one, and thus, we can set α > 1. In Appendix A, we have shown the Lipschitz continuity of ∇ σ (·), which is necessary for the convergence of the algorithm. α can also be self-adaptive by using a certain line-search rule. Algorithm 2 is the line-search version of Algorithm 1. (9) is similar, with only the hard thresholding P R replaced by the soft thresholding S τ , which can be derived as follows. Denote

Algorithm 3 Gradient descent iterative soft thresholding for (9).
Input: linear operator A : R m×n → R p , initial guess X (0) ∈ R m×n , parameter λ > 0, σ > 0 Output: the recovered matrix X (k+1) while a certain stopping criterion is not satisfied do 1: Choose a fixed step-size α −1 > 0, or choose it via the line-search rule.

Convergence
With the Lipschitz continuity of ∇ σ presented in Appendix A, it is a standard routine to show the convergence of Algorithms 1 and 3, i.e., let {X (k) } be a sequence generated by Algorithm 1 or 3. Then, every limit point of the sequence is a critical point of the problem. In fact, the results can be enhanced to the statement that "the entire sequence converges to a critical point", namely one can prove that lim k→∞ X (k) = X * where X * is a critical point. This can be achieved by verifying the so-called Kurdyka-Łojasiewicz (KL) property [34] of the problems (8) and (9). As this is not the main concern of this paper, we omit the verification here.

Recoverability and Linear Convergence Rate
For affine rank minimization problems, the convergence rate results have been obtained in the literature; see, e.g., [23,24]. However, all the existing results are obtained for algorithms that solve the optimization problems incorporating the least squares loss. In this part, we are concerned with the recoverability and convergence rate of Algorithm 1. These results give the understanding of this loss function from the algorithmic aspect, which is in accordance with and extends our previous work [22].
It has been known that the convergence rate analysis requires the matrix RIPcondition [33]. In our context, instead of using the matrix RIP, we adopt the concept of the matrix scalable restricted isometry property (SRIP) [24].
Definition 1 (SRIP [24]). For any X ∈ S r , there exist constants ν r , µ r > 0 such that: Due to the scalability of ν r , µ r on the operator A, SRIP is a generalization of the RIP [33] as commented in [24]. We point out that the results of Algorithm 1 for the affine rank minimization problem (8) rely on the SRIP condition. However, in the matrix completion problem (5), this condition cannot be met, since ν r in this case is zero. Consequently, the results provided below cannot be applied directly to the matrix completion problem (5). However, similar results might be established for (5), if some refined RIP conditions are assumed to hold for the operator A in the situation of matrix completion [23]. To obtain the convergence rate results, besides the SRIP condition, we also need to make some assumptions. Assumption 1.

The spectral norm of A is upper bounded as
Based on Assumption 1, the following results for Algorithm 1 can be derived.
The proof of Theorem 1 relies on the following lemmas, which reveal certain properties of the loss function˜ σ . Lemma 1. For any σ > 0 and t ∈ R, it holds: This completes the proof.
Proof. It is not hard to check that h is nonnegative on σ > 0.
Proof of Theorem 1. By the fact that X * is rank-R and X (k+1) is the best rank-R approximation to Y (k+1) , we have: we know that: where the last inequality follows from: To verify our first assertion, it remains to bound the first two terms by means of X (k) − X * 2 F . We consider the first term. Denoting y k i = A i , X (k) − X * , we know that: The choice of σ k+1 tells us that: and consequently: Then, by the fact that Λ 2 2 ≤ 1 and the choice of the step-size α, we observe that the second term of (13) can be upper bounded by: Combining (14) and (15) and denoting γ = 1 − 2 exp (−2(1 − β)), we come to the following conclusion: where the last inequality follows from the SRIP condition and the fact that γ < 0 by the range of β.
We now proceed to show the linear convergence in the˜ σ sense. Following from the inequality Combining with Inequality (A1), we see that˜ σ k+1 (X (k+1) ) can be upper bounded by: We need to upper bound ∇˜ σ k+1 (X (k) ), X * − X (k) and α 2 X (k) − X * 2 F in terms of˜ σ k+1 (X (k) ). We first consider the second term. Under the SRIP condition, we have: . Lemma 2 tells us that: Summing the above inequalities over i from 1 to p, we have: Therefore, α 2 X (k) − X * 2 F can be bounded as follows: We proceed to bound ∇˜ σ k+1 (X (k) ), X * − X (k) . It follows from (14) and Lemma 1 that: Combining (17)- (19) together, we get: where the last inequality follows from α ≤ 6 5 ν 2 2R . By Lemma 3, the function σ 2 (1 − exp(−t 2 /σ 2 )) is nondecreasing with respect to σ > 0. This in connection with the fact that: , and consequently, q 3 ∈ (0.2, 0.2620). We thus have: The proof is now completed.
The above results show that it is possible that Algorithm 1 will find X * if the magnitude of the noise is not too large. Moreover, the results also imply that the algorithm is safe when there is no noise.

Numerical Experiments
This section presents numerical experiments to illustrate the effectiveness of our methods. Empirical comparisons with other methods are implemented on synthetic and real data contaminated by outliers or non-Gaussian noise.
The following 4 algorithms are implemented. RMCσ -IHTand RMCσ -ISTare denoted as Algorithms 1 and 3 incorporated with the line-search rule, respectively. The approach proposed in [16] is denoted as MC-2 -IST, which is an iterative soft thresholding algorithm based on the least squares loss. The robust approach based on the LAD loss proposed in [17] is denoted by RMC-1 -ADM. Empirically, the σ value of σ is set to be 0.5; the tuned parameter λ of RMCσ -IST and MC-2 -IST is set to λ = min{m,n} 10 √ max{m,n} , while for RMC-1 -ADM, λ = 1/ max{m, n}, as suggested in [17]. All the numerical computations are conducted on an Intel i7-3770 CPU desktop computer with 16 GB of RAM. The supporting software is MATLAB R2013a. Some notations used frequently in this section are introduced first in Table 1. Bold number in the tables of this section means that it is the best among the competitors.

Evaluation on Synthetic Data
The synthetic datasets are generated in the following way: 1. Generating a low rank matrix: We first generate an m × n matrix with i.i.d. Gaussian entries ∼N(0,1), where m = n = 1000. Then, a ρ r m -rank matrix M is obtained from the above matrix by rank truncation, where ρ r varies from 0.04-0.4. 2. Adding outliers: We create a zero matrix E ∈ R m×n and uniformly randomly sample ρ o m 2 entries, where ρ o varies from 0-0.6. These entries are randomly drawn from the chi-square distribution, with four degrees of freedom. Multiplied by 10, the matrix E is used as the sparse error matrix. 3. Missing entries: ρ m m 2 of the entries are randomly missing, with ρ m varying between {0, 10%, 20%, 30%}. Finally, the observed matrix is denoted as B = P Ω (M + E).
RMCσ -IHT (Algorithm 1), RMCσ -IST (Algorithm 3) and RMC-1 -ADM [17] are implemented respectively on the matrix completion problem with the datasets generated above. For these three algorithms, the same initial guess with the all-zero matrix X 0 = 0 is applied. The stopping criterion is X (k+1) − X (k) F ≤ 10 −3 , or restrictions on the number of iterations, which is set to be 500. For each tuple (ρ m , ρ r , ρ o ), we repeat 10 runs. The algorithm is regarded as successful if the relative error of the resultX satisfies X − M F / M F ≤ 10 −1 .
Experimental results of RMCσ -IHT (top), RMCσ -IST (middle) and RMC-1 -ADM (bottom) are reported in Figure 2, which are given in terms of phase transition diagrams. In Figure 2, the white zones denote perfect recovery in all the experiments, while the black ones denote failure for all the experiments. In each diagram, the x-axis represents the ratio of rank, i.e., we let ρ r = rank m ∈ [0.04, 0.4], and the y-axis represents the level of outliers, i.e., we let ρ o = outliers m 2 ∈ [0, 0.6]. The level of missing entries ρ m varies from left to right in each row. As shown in Figure 2, our approach outperforms RMC-1 -ADM when ρ o and ρ r increase. We also observe that RMCσ -IHT performs better than RMCσ -IST when the level of outliers increases, while RMCσ -IST outperforms RMCσ -IHT when the ratio of missing entries increases.
Comparison of the computational time and the relative error are also reported in Table 2. In this experiment, the level of missing entries ρ m = {20%, 30%}, the ratio of rank ρ r = 0.1 and the level of outliers ρ o varies between {0.1, 0.15, 0.2, 0.25, 0.3}. For each ρ o , we randomly generate 20 instances and then average the results. In the table, "time" denotes the CPU time, with the unit being second, and "rel.err" represents the relative error introduced in the previous paragraph. The results also demonstrate the improved performance of our methods in most of the cases on CPU time and relative error, especially for RMCσ -IHT.

Image Inpainting and Denoising
One typical application of matrix completion is the image inpainting problem [4]. The datasets and the experiment are conducted as follows: 1. We first choose five gray images, named "Baboon", "Camera Man", "Lake", "Lena" and "Pepper" (the size of each image is 512 × 512), each of which is stored in a matrix M. 2. The outliers matrix E is added to each M, where E is generated in the same way as the previous experiment, and the level of outliers ρ o varies among {0.3, 0.4, 0.5, 0.6, 0.7}. 3. The ratio of the missing entries is set to 30%. RMCσ -IST, RMC-1 -ADM and MC-2 -IST, are tested in this experiment. In addition, we also test the Cauchy loss-based model min X c (X) + λ X * , which is denoted as RMCc -IST, where: where c > 0 is a parameter controlling the robustness. Empirically, we set c = 0.15. Other parameters are set to the same as those of RMCσ -IST. The above model is also solved by soft thresholding similar to Algorithm 3. Note that Cauchy loss has a similar shape as that of Welsch loss and also enjoys the redescending property; such a loss function is also frequently used in the robust statistics literature. The initial guess is X 0 = 0. The stopping criterion is X (k+1) − X (k) F ≤ 10 −2 , or the iterations exceed 500.
Detailed comparison results in terms of the relative error and CPU time are listed in Table 3, from which one can see the efficiency of our method. Indeed, experimental results show that our method can be terminated within 80 iterations. According to the relative error in Table 3, our method performs the best in almost all cases, followed by RMCc -IST. This is not surprising because the Cauchy loss-based model enjoys similar properties as the proposed model. We also observe that the RMC-1 -ADM algorithm cannot deal with situations when images are heavily contaminated by outliers. This illustrates the robustness of our method. Table 2. Comparison of RMCσ -IHT(Algorithm 1), RMCσ -IST(Algorithm 3) and RMC-1 -ADM [17] on CPU time and the relative error on synthetic data. ρ m = 0.3, ρ r = 0.1. rel.err, relative error. To better illustrate the robustness of our method empirically, we also attach images recovered by the three methods in Figure 3. For the sake of saving space, we merely list the recovery results for the case ρ o = 0.6 with 30% missing entries. In Figure 3, the first column represents five original images, namely, "Baboon", "Camera Man", "Lake", "Lena" and "Pepper". Images in the second column are contaminated images with 60% outliers and 30% missing entries. Recovered results of each image are report in the remaining columns respectively by using RMCσ -IST, RMC-1 -ADM, MC-2 -IST and RMCc -IST. One can observe that the images recovered by our method retain most of the important information, followed by RMCc -IST. Our next experiment is designed to show the effectiveness of our method in dealing with the non-Gaussian noise. We assume that the entries of the noise matrix E are i.i.d drawn from Student's t distribution, with three degrees of freedom. We then scale E by a factor s n , and we denote the corresponding E := s n · E. The noise scale factor s n varies in {0.01, 0.05, 0.1}, and ρ m varies in {0.1, 0.3, 0.5}. The results are shown in Table 4, where the image "Building" is used. We list the recovered images in Figure 4 with the case s n = 0.05. From the table and the recovered images, we can see that our method also performs well when the image is only contaminated by non-Gaussian noise. Table 3. Experimental results of RMCσ -IST (Algorithm 3), RMC-1 -ADM [17] and MC-2 -IST [16] on different images with ρ r = 0.1, ρ m = 0.3 and ρ o varying from 0.3 to 0.7.

Background Subtraction
Background subtraction, also known as foreground detection, is one of the major tasks in computer vision, which aims at detecting changes in image or video sequences and finds application in video surveillance, human motion analysis and human-machine interaction from static cameras [35].
Given a sequence of images, one can cast them into a matrix B by vectorizing each image and then stacking row by row. In many cases, it is reasonable to assume that the background varies little. Consequently, the background forms a low rank matrix M, while the foreground activity is spatially localized and can be seen as the error matrix E. Correspondingly, the image sequence matrix B can be expressed as the sum of a low rank background matrix M and a sparse error matrix E, which represents the activity in the scene.
In practice, it is reasonable to assume that some entries of the image sequence are missing and the images are contaminated by noise or outliers. Therefore, the foreground object detection problem can be formulated as a robust matrix completion problem. Ref. [36] proposed to use the LAD-loss-based matrix completion approach to separate M and E. The data of this experiment were downloaded from http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html.
Our experiment in this scenario is implemented as follows: 1. We choose the sequence named "Restaurant" for our experiment, which consists of 3057 color images. Each image of "Restaurant" is 160 × 120 in size. From the sequence, we pick 100 continuous images and convert them to gray images to form the original matrix B, which is 100 × 19200 in size, where each row is a vector converted from an image. 2. Two types of non-Gaussian noise are added to B. The first type of noise is drawn from the chi-square distribution, with four degree of freedom; the second type of noise is drawn from Student's t distribution, with three degrees of freedom. Then, the two types of noise are simultaneously rescaled by s n = {0.01, 0.02, 0.05}. The last 50% of the entries are missing randomly. 3. RMCσ -IHT and RMC-1 -ADM are used to deal with this problem. We set R = 1 in RMCσ -IHT.
The initial guess is the zero matrix. The stopping criterion is X (k+1) − X (k) F ≤ 10 −2 , or the iterations exceed 200.
The running time and relative error are reported in Table 5. From the table, we see that the proposed approach is faster and gives smaller relative errors. To give an intuitive impression, we choose five frames from each image sequence, as shown in Figure 5, from which we can observe that when the image sequences are corrupted by noise (s n = 0.05) and missing entries, both of the methods can successfully extract the background and foreground images, and it seems that our method performs better because the details of the background images are recovered well, whereas the LAD-based approach does not seem to perform as well as ours where some details of the background are added to the foreground. It can be also observed that none of the two methods can recover the missing entries in the foreground. In order to achieve this, maybe more effective approaches are needed.

Concluding Remarks
The correntropy loss function has been studied in the literature [20,21] and has found many successful applications [29][30][31]. Learning with correntropy-induced losses could be resistant to non-Gaussian noise and outliers while ensuring good prediction accuracy simultaneously with properly chosen parameter σ. This paper addressed the robust matrix completion problem based on the correntropy loss. The proposed approach was shown to be efficient to deal with non-Gaussian noise and sparse gross errors. The nonconvexity of the proposed approach was due to using the σ loss. Based on the above approach, we proposed two nonconvex optimization models and extend them to the more general robust affine rank minimization problems. Two gradient-based iterative schemes to solve the nonconvex optimization problems were offered, with convergence rate results being obtained under proper assumptions. It would be interesting to investigate similar convergence and recoverability results for other redescending-type loss functions-based models. Numerical experiments verified the improved performance of our methods, where empirically, the parameter σ for σ is set to 0.5 and λ for the nuclear norm model (6)  where Λ X and Λ Y are the diagonal matrices corresponding to ∇˜ σ (X) and ∇˜ σ (Y). It remains to show that: By letting z 1 = Avec(X) − b and z 2 = Avec(Y) − b, we observe that: Combining with the fact that for any t 1 , t 2 ∈ R and σ > 0, | exp(−t 2 1 /σ 2 )t 1 − exp(−t 2 2 /σ 2 )t 2 | ≤ |t 1 − t 2 |, we have: As a result, ∇ σ (X) −∇ σ (Y) F ≤ A 2 2 X − Y F . This completes the proof.
The following conclusion is a consequence of Proposition A1.