Deep Matrix Factorization Based on Convolutional Neural Networks for Image Inpainting

In this work, we formulate image in-painting as a matrix completion problem. Traditional matrix completion methods are generally based on linear models and assume that the matrix is low rank. When the original matrix is large and few of its elements are observed, they easily over-fit and their performance decreases significantly. Recently, researchers have tried to apply deep learning and nonlinear techniques to matrix completion. However, most existing deep learning-based methods restore each column or row of the matrix independently, which loses the global structure information of the matrix and therefore does not achieve the expected results in image in-painting. In this paper, we propose a deep matrix factorization completion network (DMFCNet) for image in-painting that combines deep learning with a traditional matrix completion model. The main idea of DMFCNet is to map the iterative updates of the variables of a traditional matrix completion model into a neural network of fixed depth. The potential relationships among the observed matrix data are learned in a trainable end-to-end manner, which leads to a high-performance and easy-to-deploy nonlinear solution. Experimental results show that DMFCNet provides higher matrix completion accuracy than state-of-the-art matrix completion methods in a shorter running time.


Introduction
Matrix completion (MC) [1][2][3][4][5] aims to recover a matrix from incomplete data with missing elements. It has been successfully applied to a wide range of signal processing and image analysis tasks, including collaborative filtering [6,7], image in-painting [8][9][10], image denoising [11,12], and image classification [13,14]. MC methods assume that the original matrix is low rank, so that the missing elements can be estimated by rank minimization. It should be noted that the rank minimization problem is generally non-convex and NP-hard [15]. A typical approach to address this issue is to establish a convex approximation of the original non-convex objective function.
Existing approaches for solving the MC problem are mainly based on nuclear norm minimization (NNM) and matrix factorization (MF). The NNM approach [16][17][18] minimizes the sum of the matrix singular values, which is a convex relaxation of the matrix rank. Nuclear norm minimization can be solved by singular value thresholding (SVT) algorithms [19], inexact augmented Lagrange multiplier (IALM) methods [16], and alternating direction methods (ADM) [17,20]. One major disadvantage of the NNM approach is that a singular value decomposition (SVD) must be performed in each iteration of the optimization process, which has very high computational complexity when the matrix is large. To avoid this problem, matrix factorization (MF), which does not need SVD, has been proposed to solve the MC problem [6,[21][22][23]. Assuming that the rank of the original matrix is known, the MF method decomposes the matrix into the product of two much smaller factor matrices. The main contributions of this paper are as follows. (1) Compared with existing methods, our proposed method addresses the nonlinear data model problem faced by traditional MC methods, as well as the global structure problem in existing deep learning-based MC methods. (2) The proposed method can be pre-trained to learn the global image structure and the underlying relationship between the input matrix with missing elements and the recovered output. Once trained, the network does not need to be optimized again in subsequent image in-painting tasks, thereby providing a high-performance and easy-to-deploy nonlinear matrix completion solution. (3) To improve performance, a new algorithm for pre-filling the missing elements of the image is proposed. This pre-filling method performs a global analysis of the matrix data to predict initial values for the missing elements, which improves the performance of matrix completion and image in-painting.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents our approach of deep matrix factorization and completion for image in-painting. Experimental results are presented in Section 4. Section 5 concludes the paper with a discussion of future research work.

Related Work
In this section, we review existing work related to our proposed method. Specifically, the mathematical models of the low-rank Hankel matrix factorization (LRHMF) method [35] and the deep Hankel matrix factorization (DHMF) method [36] are introduced. The LRHMF method is a low-rank matrix factorization method that avoids singular value decomposition to achieve fast signal reconstruction. The DHMF method [36], inspired by LRHMF, is a complex exponential signal recovery method based on deep learning and Hankel matrix factorization. The image in-painting method proposed in this paper is inspired by both.
As shown in [35], the rank of the Hankel matrix is equal to the number of exponentials in x, which is a vector of exponential functions. Thus, the low-rank Hankel matrix completion (LRHMC) problem can be solved by exploiting the low-rank property of the Hankel matrix. Its mathematical formulation can be written as
min_x ‖Rx‖_* + (λ/2)‖y − Ux‖_2^2, (1)
where x is the signal to be recovered from the undersampled data y, R is the operator that converts the signal into the Hankel matrix Rx, U denotes the undersampling matrix, and λ is the balance parameter. ‖·‖_* is the nuclear norm of the matrix, which is used to restrict its rank, and the second term measures data consistency. However, solving this problem is very time-consuming because of the frequent singular value decompositions (SVD). To avoid this problem, the LRHMF method uses matrix factorization [37,38] instead of nuclear norm minimization. For any matrix X, the nuclear norm can be expressed through its factorizations as
‖X‖_* = min_{P,Q: X = PQ^H} (1/2)(‖P‖_F^2 + ‖Q‖_F^2), (2)
where P ∈ R^{n_1×r}, Q ∈ R^{n_2×r}, ‖·‖_F^2 denotes the squared Frobenius norm, and the superscript H denotes the conjugate transpose. Substituting Equation (2) into optimization problem (1), the problem can be reformulated as
min_{x,P,Q} (1/2)(‖P‖_F^2 + ‖Q‖_F^2) + (λ/2)‖y − Ux‖_2^2 s.t. Rx = PQ^H. (3)
Since the nuclear norm of Rx is replaced by the Frobenius norms of its factors, it is no longer necessary to compute singular value decompositions. To solve this problem effectively, the alternating direction method of multipliers (ADMM) is adopted in LRHMF [35], and the corresponding augmented Lagrangian function is
L(x, P, Q, D) = (1/2)(‖P‖_F^2 + ‖Q‖_F^2) + (λ/2)‖y − Ux‖_2^2 + ⟨D, Rx − PQ^H⟩ + (γ/2)‖Rx − PQ^H‖_F^2, (4)
where D denotes the Lagrange multiplier, ⟨·, ·⟩ denotes the inner product operator, and λ > 0 and γ > 0 are balance parameters.
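The factorization surrogate in Equation (2) can be checked numerically: for the balanced factorization built from the SVD, half the sum of squared Frobenius norms of the factors equals the nuclear norm exactly. The snippet below is an illustrative NumPy check (real-valued, so the conjugate transpose reduces to the ordinary transpose), not part of the LRHMF implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4

# Nuclear norm: sum of singular values.
nuc = np.linalg.svd(X, compute_uv=False).sum()

# Balanced factorization P = U sqrt(S), Q = V sqrt(S) attains the minimum
# of 0.5 * (||P||_F^2 + ||Q||_F^2) over all factorizations X = P Q^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U * np.sqrt(s)
Q = Vt.T * np.sqrt(s)
surrogate = 0.5 * (np.linalg.norm(P, 'fro') ** 2 + np.linalg.norm(Q, 'fro') ** 2)

assert np.allclose(P @ Q.T, X)       # the factors reproduce X
assert np.isclose(surrogate, nuc)    # and attain the nuclear norm
```

Any other factorization X = PQ^T can only increase the surrogate, which is why minimizing it acts as a rank penalty without computing an SVD at every iteration.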
To solve the signal reconstruction problem, Huang et al. [36] derived the k-th iteration of the solution of (3) by minimizing (4), as shown in (5). Based on this iterative formulation, they designed a deep Hankel matrix factorization network for fast reconstruction of the signal.

The Proposed Method
In this section, we construct a deep matrix factorization completion network (DMFC-Net) for matrix completion and image in-painting. We derive the mathematical model for our DMFCNet method, discuss the network design, and then introduce two network structures based on different prediction methods for missing elements. Finally, we introduce the loss functions and explain the network training process.

Mathematical Model of the DMFCNet Method
The proposed DMFCNet is based on low-rank matrix factorization [35,36]. The optimization objective function of our proposed model can be formulated as
min_X ‖X‖_* + (λ/2)‖Ψ ⊙ (Y − X)‖_F^2, (6)
where Y ∈ R^{m×n} is the observation matrix with missing elements whose initial values are set to a predefined constant, X ∈ R^{m×n} is the matrix to be recovered from Y, and λ is a regularization parameter. ‖X‖_* is the nuclear norm of X, which is used to restrict the rank of X. ‖Ψ ⊙ (Y − X)‖_F^2 denotes the reconstruction error of Y, where ⊙ is the Hadamard product. Ψ ∈ {0, 1}^{m×n} is a mask indicating the positions of missing data: if Y is missing the entry at position (i, j), then Ψ_ij = 0; otherwise, Ψ_ij = 1. We use matrix factorization instead of the traditional nuclear norm minimization, and the proposed model can be formulated as
min_{X,U,V} (1/2)(‖U‖_F^2 + ‖V‖_F^2) + (λ/2)‖Ψ ⊙ (Y − X)‖_F^2 s.t. X = UV^T, (7)
where U ∈ R^{m×r} and V ∈ R^{n×r}. The augmented Lagrangian function for (7) is given by
L(U, V, X, S) = (1/2)(‖U‖_F^2 + ‖V‖_F^2) + (λ/2)‖Ψ ⊙ (Y − X)‖_F^2 + (η/2)‖X − UV^T + S‖_F^2, (8)
where η > 0 is the penalty parameter and S ∈ R^{m×n} is the (scaled) Lagrangian multiplier corresponding to the constraint X = UV^T. Since it is difficult to solve for U, V, S and X simultaneously in (8), following the idea of the alternating direction method of multipliers (ADMM), we minimize the Lagrangian function with respect to each block variable at a time while fixing the other blocks at their latest values. The optimization process thus becomes
U_{k+1} = argmin_U L(U, V_k, X_k, S_k),
V_{k+1} = argmin_V L(U_{k+1}, V, X_k, S_k),
S_{k+1} = S_k + μ(X_k − U_{k+1}V_{k+1}^T),
X_{k+1} = argmin_X L(U_{k+1}, V_{k+1}, X, S_{k+1}), (9)
where μ > 0 is the step size of the optimization process.
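As a concrete (and purely illustrative) instance of this alternating scheme, each update in (9) has a closed form and can be run directly in NumPy. The function name, the default parameter values, and the closed forms under a scaled multiplier are our own sketch of the traditional iterative solver the network unrolls, not the paper's trained model:

```python
import numpy as np

def mf_admm_complete(Y, mask, r, lam=100.0, eta=1.0, mu=1.0, iters=200):
    """Alternating (ADMM-style) updates for
        min 0.5 (||U||_F^2 + ||V||_F^2) + lam/2 ||mask * (Y - X)||_F^2
        s.t. X = U V^T,
    with a scaled multiplier S. Illustrative sketch; names/defaults are ours."""
    m, n = Y.shape
    rng = np.random.default_rng(0)
    X = Y * mask                       # missing entries start at zero
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    S = np.zeros((m, n))
    I = np.eye(r)
    for _ in range(iters):
        # Ridge-regularized least-squares updates for the factors.
        U = (X + S) @ V @ np.linalg.inv(V.T @ V + I / eta)
        V = (X + S).T @ U @ np.linalg.inv(U.T @ U + I / eta)
        # Dual ascent on the constraint X = U V^T.
        S = S + mu * (X - U @ V.T)
        # Element-wise closed form for X (data term acts on observed entries).
        X = (lam * mask * Y + eta * (U @ V.T - S)) / (lam * mask + eta)
    return X
```

With lam large, the observed entries are reproduced almost exactly while the Frobenius penalty on the factors keeps the completion low rank.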
However, solving this problem directly with traditional algorithms has many limitations, so we propose to solve it with a deep learning approach. The main idea is to update the variables using neural network modules. As shown in Figure 1, we construct a deep neural network based on (9), which has the three updating modules shown in Figure 1b. A complete restoration module contains the U updating module and the V updating module for updating the matrices U and V, and the X updating module for restoring the incomplete matrix.

U and V Updating Modules
In our proposed DMFCNet method, the input matrix is first processed by the U and V updating modules. According to the analysis in [36], U and V are updated as follows:
U_{k+1} = (X_k + S_k)V_k(V_k^T V_k + (1/η)I)^{−1},
V_{k+1} = (X_k + S_k)^T U_{k+1}(U_{k+1}^T U_{k+1} + (1/η)I)^{−1}. (10)
Note that the variables (X_0 + S_0)V_0 and V_0 appear in the update formula of U, so we add them to the input of the U updating module. To learn richer convolutional features, U_0 is also added as an input to the U updating module. The auxiliary matrix variable S_0 in (10) is initialized as a zero matrix; thus, it can be omitted in the U and V updating modules. Based on (10), the U updating module concatenates the variables X_0V_0, V_0 and U_0 in the channel dimension as input and uses a convolutional neural network to update the variable. The updating of the V matrix follows a similar procedure.
Once U_1 is updated, we concatenate X_0^T U_1, U_1 and V_0 channel-wise to obtain V_1.
Thus, the updating formulas for the U and V matrices are
U_1 = C_U([X_0V_0, V_0, U_0]), V_1 = C_V([X_0^T U_1, U_1, V_0]), (11)
where C denotes a convolutional neural network and [·, ·, ·] denotes channel-wise concatenation. We observe that the final matrix recovery performance is sensitive to the initialization of U and V. To address this issue, we perform the following SVD of X_0 ∈ R^{m×n} to initialize U and V:
X_0 = UΣV^T, (12)
where Σ ∈ R^{d×d} is a diagonal matrix with σ̂_1, . . . , σ̂_d on the diagonal and zeros elsewhere, d = min(m, n), and σ̂_i > 0 is the i-th singular value of X_0. U ∈ R^{m×d} and V ∈ R^{n×d} contain the left and right singular vectors, respectively. Then, U_0 ∈ R^{m×r} and V_0 ∈ R^{n×r} are initialized by
U_0 = Ũ Σ̃^{1/2}, V_0 = Ṽ Σ̃^{1/2}, (13)
where Ũ ∈ R^{m×r} consists of the first r columns of U, Ṽ ∈ R^{n×r} consists of the first r columns of V, and Σ̃ ∈ R^{r×r} consists of the first r rows and first r columns of Σ. In this paper, we set m = n.
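The truncated-SVD initialization of (12) and (13) can be sketched in a few lines of NumPy (the function name init_uv is ours; the factors are balanced so that U_0 V_0^T is the best rank-r approximation of X_0):

```python
import numpy as np

def init_uv(X0, r):
    """Initialize U0, V0 from the top-r SVD of X0:
    U0 = U_r sqrt(S_r), V0 = V_r sqrt(S_r), so U0 @ V0.T is the best
    rank-r approximation of X0 (illustrative sketch of Eqs. (12)-(13))."""
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    root = np.sqrt(s[:r])        # split each singular value between factors
    U0 = U[:, :r] * root
    V0 = Vt[:r].T * root
    return U0, V0
```

When the true rank of X0 is at most r, the product U0 V0^T reconstructs X0 exactly (up to floating-point error); otherwise it is the rank-r truncation.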
To preserve as much information as possible during the matrix completion process, a dense convolutional structure is used in the network, and a residual structure is added to improve the stability of the training process. The Mish function is chosen as the activation function due to its smoothness at almost all points of the curve, which allows more information to flow through the neural network. A batch normalization (BN) layer is added between convolution layers to speed up convergence.
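For reference, Mish is x·tanh(softplus(x)); a short NumPy version is below (computing softplus via logaddexp for numerical stability is our implementation choice):

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)). Smooth and non-monotonic;
    small negative inputs are attenuated rather than zeroed as in ReLU.
    softplus(x) = log(1 + exp(x)) is computed as logaddexp(0, x) to avoid
    overflow for large x."""
    return x * np.tanh(np.logaddexp(0.0, x))
```

For large positive inputs Mish approaches the identity, and for large negative inputs it approaches zero, which matches the smooth gating behavior described above.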

X Updating Module
After obtaining U_1 and V_1 from the U and V updating modules, the Lagrange multiplier S_1 is updated by
S_1 = S_0 + μ(X_0 − U_1V_1^T). (14)
Then, U_1, V_1 and S_1 are fed into the X updating module, and X̂_1 is obtained by minimizing (8) with respect to X, which gives the element-wise solution
X̂_1 = (λΨ ⊙ Y + η(U_1V_1^T − S_1)) ⊘ (λΨ + ηE), (15)
where ⊘ denotes element-wise division and E is the all-ones matrix. To improve the reconstruction performance, we further process X̂_1 with an autoencoder network. As shown in Figure 1b, the network contains four convolution layers, a batch normalization module, and a final activation layer with the tanh function. For image in-painting applications, to enhance the smoothness of the recovered image, we incorporate the following weighted averaging operation into the network:
X_1 = (1 − Ψ) ⊙ X̃_1 + Ψ ⊙ (γX_0 + X̃_1)/(γ + 1), (16)
where X̃_1 is the output of the autoencoder, X_0 is the initial matrix, and γ is a weighting parameter. When a pixel value is missing at a point in the image, the output of the network is assigned directly to the value at the corresponding location. Otherwise, a weighted average of the network output and the pixel value at the corresponding location of the input image is used to obtain the final reconstructed pixel value of that location.
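The weighted-averaging step can be sketched as follows. This is our reading of the paper's description (the function name x_update and the exact blending rule, with gamma normalized as (γ·X0 + X̂)/(γ+1), are assumptions): missing pixels take the network output directly, observed pixels take a weighted average of the network output and the input pixel.

```python
import numpy as np

def x_update(X0, Xhat, mask, gamma=10.0):
    """Blend the network output Xhat with the input X0 (sketch of the
    weighted-averaging step; gamma and the normalization are our reading
    of the paper). mask==1 marks observed pixels, mask==0 missing ones."""
    blended = (gamma * X0 + Xhat) / (gamma + 1.0)   # observed: weighted average
    return mask * blended + (1 - mask) * Xhat       # missing: network output
```

With a large gamma, observed pixels stay close to their input values while the network output dominates only at the missing positions, which is what keeps the recovered image smooth.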

Pre-Filling
Note that the initialization of U and V in the network is obtained from the original incomplete matrix using SVD, so the missing entries of the matrix need to be filled with predefined constants before the singular value decomposition. However, the network is extremely sensitive to these pre-filled constants, which will directly degrade the in-painting performance if the entries are not filled properly.
To reduce the effect of filling with arbitrary constants, we first obtain X_0 by replacing the missing values of the observation matrix with a predefined constant such as 255. Then, singular value decomposition is performed on X_0 to obtain U_0 and V_0, which are input into the restoration module for a preliminary restoration yielding X_1, and the output matrix X_new is calculated by
X_new = Ψ ⊙ Y + (1 − Ψ) ⊙ X_1, (17)
that is, the values predicted by the restoration module are filled into the missing positions of the observation matrix. This step is the preliminary inference of the missing values by the restoration module. The filled X_new is then used as the input to the second restoration module. Since U_0 and V_0 in the second restoration module are obtained by the singular value decomposition of this new X_0, which largely eliminates the negative effects of constant filling, better restoration results can be obtained by the second restoration module. Algorithm 1 summarizes the DMFCNet-1 algorithm with network-based pre-filling.

Algorithm 1 DMFCNet-1
Require: X_Ω: original incomplete image matrix; Ω: the positions of the observed entries; non-negative parameters r, μ and λ.
Ensure: the restored matrix X_new.
1: Init: X_0 ∈ R^{m×n}: the matrix obtained by replacing the missing values of X_Ω with the constant 255;
2: for each restoration module do
3: Compute U_0 ∈ R^{m×r} and V_0 ∈ R^{n×r} from X_0 using (12) and (13);
4: Update U_1 and V_1 with the U and V updating modules;
5: Update S_1 and X_1 with the X updating module;
6: X_new ← the matrix obtained by filling the missing positions of X_Ω with the corresponding entries of X_1;
7: X_0 ← X_new;
8: end for
9: return X_new;
However, this pre-filling-based network requires a singular value decomposition, which takes a relatively long time. To improve the running time, a new pre-filling algorithm that exploits the structural characteristics of the image, called Nearest Neighbor Mean Filling (NNMF), is presented. It infers each missing value from the observed data near its location.
Assume we need to fill the missing value at location (i, j). Let V_l(i, j) be the value of the first non-missing position traversed from (i, j) to the left, V_r(i, j) the first to the right, and similarly let V_t(i, j) be the first non-missing value traversed from (i, j) to the top and V_b(i, j) the first to the bottom. Then, the value V(i, j) filled at location (i, j) is
V(i, j) = (V_l(i, j) + V_r(i, j) + V_t(i, j) + V_b(i, j)) / 4.
Traversing each missing location and searching for the four values in turn is very time-consuming. In this paper, we design the calculation procedure shown in Figure 2, which efficiently computes the fill values at all missing locations by dynamic programming. As shown in Figure 2, the missing values at the edges are first filled in a clockwise direction; then, four matrices are generated by sweeps in the four directions, and finally the four matrices are averaged to obtain the filled matrix. Algorithm 2 summarizes the DMFCNet-2 algorithm, which uses NNMF for the pre-filling operation: the matrix obtained by pre-filling the observation matrix with the NNMF algorithm is used as the input of the restoration module. Based on the two pre-filling methods, the network frameworks of the DMFCNet-1 and DMFCNet-2 algorithms are shown in Figure 1a. During training, only the weights of the convolutional networks C_U and C_V and the autoencoder are optimized.
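A simplified NNMF sketch is given below. It is our reading of the scheme (function names are ours, and we replace the clockwise edge pass of Figure 2 by simply skipping directions with no observed value in the average); the dynamic-programming idea of four linear sweeps instead of per-pixel searches is preserved:

```python
import numpy as np

def _nearest_left(X, mask):
    # For each position, the most recently seen observed value when
    # scanning each row left-to-right (NaN until the first observed value).
    m, n = X.shape
    out = np.full((m, n), np.nan)
    for i in range(m):
        last = np.nan
        for j in range(n):
            if mask[i, j]:
                last = X[i, j]
            out[i, j] = last
    return out

def nnmf_fill(X, mask):
    """Nearest Neighbor Mean Filling (illustrative sketch): each missing
    entry becomes the mean of the first observed values to its left,
    right, top and bottom; directions with no observed value are skipped.
    Four sweeps (via flips/transposes of the left sweep) replace the
    per-pixel searches."""
    left = _nearest_left(X, mask)
    right = _nearest_left(X[:, ::-1], mask[:, ::-1])[:, ::-1]
    top = _nearest_left(X.T, mask.T).T
    bottom = _nearest_left(X[::-1].T, mask[::-1].T).T[::-1]
    fill = np.nanmean(np.stack([left, right, top, bottom]), axis=0)
    out = X.astype(float).copy()
    missing = mask == 0
    out[missing] = fill[missing]
    return out
```

Each sweep is O(mn), so the whole fill is linear in the image size regardless of the missing rate.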

Loss Function
A general convolutional neural network is internally a black box and can only be optimized globally by constraining the final output of the network. In contrast, each variable in the interpretable network built on the iterative model in this paper has practical significance. Therefore, in addition to constraining the final output X_1 of the restoration module, the loss function also constrains the intermediate variables in the module, which makes training more stable and efficient. The Frobenius norm is used to constrain the variables in the network, from which the loss function of a restoration module is derived as
L(Θ) = (1/B) Σ_{b=1}^{B} (‖X_b − Y_b‖_F^2 + α‖X̂_b − Y_b‖_F^2 + β‖U_bV_b^T − Y_b‖_F^2), (18)
where Θ denotes the network parameters of the restoration module, B is the number of samples input to the network, and α and β are the regularization coefficients. X_b denotes the output X_1 of the b-th sample of the restoration module, X̂_b is the input X̂_1 of the autoencoder in the X updating module for the b-th sample, U_b and V_b are the outputs U_1 and V_1 for the b-th sample, and Y_b is the complete image corresponding to the b-th sample.
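A single-sample (B = 1) version of this composite loss can be written as follows. The function name and the NumPy form are illustrative (actual training would use a framework's autograd), and applying α, β to the autoencoder input and the factorization respectively is our reading of the description:

```python
import numpy as np

def module_loss(X, X_hat, U, V, Y, alpha=0.1, beta=0.01):
    """Composite loss of one restoration module (illustrative sketch):
    the final output X, the autoencoder input X_hat, and the factorization
    U V^T are each pulled toward the ground-truth image Y by squared
    Frobenius norms; alpha and beta weight the intermediate terms."""
    fro2 = lambda A: float(np.sum(A ** 2))          # squared Frobenius norm
    return (fro2(X - Y)
            + alpha * fro2(X_hat - Y)
            + beta * fro2(U @ V.T - Y))
```

Supervising the intermediate variables in this way gives every module a well-defined target, which is what makes the unrolled network easier to train than an end-to-end black box.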

Training
Owing to its diversity, the VOC dataset [39,40] is selected as the source of training samples so that the model can adapt to the recovery of more complex images. First, each image is converted into a grayscale image of size 256 × 256; then, randomly chosen pixel values in the image are replaced by 255.
The hyperparameters in training are set as follows. The first 50 singular values are used when initializing the U and V matrices. Adam is chosen as the optimizer for training the network, with the learning rate set to 1 × 10^−3, reduced to 1 × 10^−4 after stabilization, and set to 1 × 10^−5 for global fine-tuning. μ is set to 1 × 10^−3 and γ is set to 10 in the X updating module. The regularization coefficients α and β in the loss function are set to 0.1 and 0.01, respectively. The autoencoder in the X updating module contains three hidden layers with dimensions (H/2, W/2, 32), (H/4, W/4, 64) and (H/2, W/2, 32). To better target the recovery of images with different missing rates, two models are trained for each of the DMFCNet-1 and DMFCNet-2 networks. The first model is trained on a dataset of images with a 30% to 50% missing rate and is mainly used for recovering images with missing rates of 50% and below. The second model is trained on a dataset of images with a 50% to 70% missing rate and is used to recover images in that range.
Specifically, DMFCNet-1 is trained with one restoration module as the training unit. The first restoration module is trained, and the weights of the first restoration module are frozen after the training is completed. Then, the second restoration module is added and trained, and the weights of the first repair module are unfrozen for global fine-tuning when the training of the second restoration module is completed. Figure 3 shows the loss convergence during the training period of the two models and the reconstruction results of the test data.

Experiments
In this section, we first compare the two versions of the DMFCNet model proposed in this paper on image in-painting tasks, and then we compare them with six popular matrix completion methods: matrix factorization (MF) by LMaFit [21], nuclear norm minimization (NNM) by IALM [16], truncated nuclear norm minimization (TNNM) by ADMM [8], the deep learning-based DLMC method [28], the NC-MC method [41], and the LNOP method by ADMM [42]. The peak signal-to-noise ratio (PSNR) [43] and structural similarity (SSIM) [44] were used in the experiments to evaluate the quality of the restored images.

Datasets
In this part, we discuss how to select the dataset for training the model. The proposed method requires pre-training the network model parameters, which requires a large amount of training data. We hope that the proposed algorithm is not limited to simple low-rank images but also handles more complex images. Therefore, two datasets are chosen to train the model and to test the effect of different datasets on the image restoration performance. The first is the CelebFaces Attributes Dataset (CelebA) [45], a large-scale face attribute dataset with over 200,000 celebrity images. These images contain some degree of pose variation but remain relatively simple and homogeneous overall. The second is the VOC dataset [39,40], which has a more diverse set of images, including simple low-rank images as well as complex images.
Both datasets are used to train the DMFCNet-2 model, where the images are converted to grayscale images of size 256 × 256 and 30% of the pixels are discarded at random. The test results on complex images of the models trained with the different datasets are shown in Figure 4. It can be seen from Figure 4 that the training loss when using the CelebA dataset is smaller than when using the VOC dataset because of its relative simplicity. However, the network trained with the VOC dataset outperformed the network trained with the CelebA dataset in both loss and reconstruction performance when tested on complex images. Therefore, to improve the image restoration performance, we recommend using a more targeted dataset.

Experimental Settings
To obtain the best performance from the six comparison methods, the hyperparameters of each method were chosen as follows. In MF, since automatic rank estimation often leads to poor performance in image restoration problems, a fixed rank is chosen for each missing rate: 30 for restoring images with 20% to 30% missing rates, 20 for 40% to 50% missing rates, and 10 for 60% to 70% missing rates and text masks. In TNNM, the parameter r is uniformly set to 10. In DLMC, the weight decay penalty is set to 0.01, and the network contains three hidden layers with [100, 50, 100] hidden units. In LNOP, p is set to 0.7. Other parameters follow the settings in the original papers.

Image In-Painting
First, DMFCNet-1 and DMFCNet-2 are compared, including their preliminary restoration results (pre-filling results) and final restoration results. Figures 5 and 6 show the results of the two models on images with 40% and 60% missing rates. As shown in Figures 5e and 6e, the image pre-filled with NNMF achieves relatively good results in the smooth areas of the image but produces obvious vertical stripes in areas with large variations in pixel values. Figures 5f and 6f show that most of the vertical stripes disappear in the image restored by the DMFCNet-2 model and the overall image is smoother, although some spots left by the NNMF pre-filling remain. The DMFCNet-1 model uses the restoration module for the preliminary restoration, as shown in Figures 5b and 6b. Although there are no vertical stripes, the image has some white spots and is rougher overall. Figures 5c and 6c show the final restoration result of DMFCNet-1: after the second restoration module, the image is refined based on the preliminary restoration, and the white spots essentially disappear, but the result is still slightly rougher than that of DMFCNet-2. In addition, Table 1 shows the recovery performance of the two methods at different missing rates, including a comparison of their preliminary restoration ability. It can be seen from Table 1 that pre-filling with NNMF (DMFCNet-2) is better at low missing rates, while the preliminary restoration of DMFCNet-1 is stronger at high missing rates. Overall, however, the recovery of DMFCNet-2 is better than that of DMFCNet-1 because the images obtained by NNMF are smoother. The next step is to compare the proposed methods with other matrix completion methods.
Five images, shown in Figure 7, are selected for the comparison experiments. Two masks are considered: the first is a random pixel mask, where 20% to 70% of the pixels in the image are removed at random; the second is a text mask containing English words. Although the DMFCNet-1 and DMFCNet-2 models are not trained to restore images with text masks, the DMFCNet-2 network is used in the text-mask comparisons because of the characteristics of NNMF. Here, 30%, 50% and 70% of the pixels in the original image are removed at random, respectively. From the images, it can be seen visually that the images restored by the MF method are rougher than those of the other methods, and that DMFCNet-1 and DMFCNet-2 give the best restoration results. We also conducted more comprehensive tests on other images, with the results shown in Table 2, which lists the PSNR and SSIM values for the five images with 20% to 70% missing rates. Figure 11 illustrates the average recovery performance for the five images with different missing rates using the eight methods. Figure 12 shows the execution time of the eight methods for recovering grayscale images of size 256 × 256 with different missing rates, and Table 3 shows the average running times of the eight methods for recovering images of different sizes and missing rates. The data show that MF (LMaFit) takes the shortest time, followed by the methods proposed in this paper. Owing to the advantages of deep learning, using a trained network model for image restoration significantly reduces the time required for the restoration task; even as the missing rate increases, the restoration time does not grow.
In contrast, the deep learning-based DLMC method takes the longest time because it needs to optimize the network weights during restoration, and its running time grows fastest among all methods as the image size increases. Overall, the data show that DMFCNet-1 and DMFCNet-2 achieve better recovery performance than the competing methods in the shortest time, for images with both small and large missing rates. Especially when the missing rate is large, the recovery performance of the other methods degrades faster, but the proposed methods can still achieve satisfactory results. Figure 13 shows examples of images containing text masks and grid masks and the corresponding images recovered by the seven methods. Table 4 shows the restoration results of the seven methods on the five images with text masks and grid masks. The data in Table 4 show that the proposed DMFCNet-2 network, even though it is not trained on text-masked or grid-masked images, still performs well due to the characteristics of NNMF.

Conclusions
In this work, a new end-to-end neural network structure for image restoration, called DMFCNet, is proposed by combining deep learning with traditional matrix completion algorithms. Experimental results on data containing random masks and other masks show that DMFCNet outperforms currently popular methods in image restoration and remains stable even at high missing rates.
Although the proposed methods perform well, there is still room for further improvement. For example, when restoring images with a high missing rate, the restoration result of DMFCNet-1 contains white spots and the restoration result of DMFCNet-2 contains vertical stripes. How to combine these two restoration results to obtain a better restoration is a problem to be investigated in the future. In addition, the tuning of the hyperparameters in this method is an important direction for future work. A variety of further experiments will be carried out in order to apply the methods of this paper more widely, such as to larger image sizes (e.g., 512 × 512 pixels) or to data missing due to other factors (e.g., image processing or transmission).