Information
  • Article
  • Open Access

1 February 2022

Semantic Residual Pyramid Network for Image Inpainting

1 School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
2 Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Recent Advances in Video Compression and Coding

Abstract

Existing image inpainting methods based on deep learning have made great progress. However, these methods tend to generate either contextually and semantically consistent images or visually excellent images, overlooking the fact that both semantic consistency and visual quality matter. In this article, we propose a Semantic Residual Pyramid Network (SRPNet) based on a deep generative model for image inpainting at the image and feature levels. The method encodes a masked image with a residual semantic pyramid encoder and then decodes the encoded features into an inpainted image with a multi-layer decoder. At this stage, a multi-layer attention transfer network gradually fills in the missing regions of the image. To generate semantically consistent and visually superior images, multi-scale discriminators are added to the network structure: a global discriminator identifies the global consistency of the inpainted image, and a local discriminator judges the consistency of the inpainted missing regions. Finally, we conducted experiments on four different datasets and achieved strong performance in filling both regular and irregular missing regions.

1. Introduction

Image inpainting originated as a manual technique in which artists restored a damaged painting so that it matched the original as closely as possible [1]. In computer vision, it is realized by filling in the missing pixels of damaged images. This technique is now widely applied in many areas, such as old-photo restoration [1], object removal [2], photo modification [3], and text removal [1].
Existing image inpainting methods can be divided into two categories. The first category comprises traditional methods, which are diffusion-based [1,4,5] or patch-based texture-synthesis [6,7] techniques that operate on low-level image features. Lacking a high-level understanding of the image, such approaches cannot generate semantically reasonable results. To address this problem, the second category [8,9,10,11] tackles inpainting with learning-based approaches, which predict the pixels of the missing regions by training deep convolutional networks and mainly operate on the deep features of the images. However, although these methods can generate semantically relevant images, obtaining visually realistic results remains challenging.
To obtain visually realistic and semantically consistent images, we propose the Semantic Residual Pyramid Network (SRPNet), which fills the missing regions of images at both the image and feature levels. Our work builds on the Pyramid-Context Encoder Network (PEN-NET) [12], proposed in 2019, which uses U-Net [13] as its backbone. However, U-Net has shallow layers and fewer parameters than many current networks, so it overfits easily [14] during training. In principle, deeper networks yield more high-level semantic features, but arbitrarily increasing the depth is not always feasible because vanishing gradients can prevent the network from converging. Therefore, our model introduces residual blocks [11,15] to address the accuracy degradation caused by increasing the depth of the neural network, and we use instance normalization [16] to accelerate the convergence of the model. We then add multi-scale discriminators [10] to obtain semantically consistent and visually superior images; they refine the inpainting results by judging whether the image is consistent with the ground truth. The multi-scale discriminators include a global discriminator and a local discriminator. The global discriminator takes the complete image as input to identify the global consistency of the image, whereas the local discriminator takes the missing regions of the completed image as input to judge their consistency. Experiments demonstrate that this design yields richer image details and more realistic images.
We evaluated our approach on four datasets covering different scenes: DTD [17], Facade [18], CELEBA-HQ [19], and Places2 [20]. The experimental results show that the method performs well across these datasets and produces good visual effects. Some of the results are shown in Figure 1.
Figure 1. Masked images and the corresponding inpainting results generated by the Semantic Residual Pyramid Network (SRPNet).
In brief, the main contributions of this study are as follows:
  • We designed a novel residual pyramid encoder to obtain high-level semantic features by adding the residual blocks to the semantic pyramid encoder;
  • We introduced multi-scale discriminators based on generating adversarial networks to judge whether the semantic features of images at different scales are consistent. Thus, we can obtain richer texture details and semantic features that are more consistent with the ground truth.
The rest of this article is organized as follows. We discuss related work in Section 2, describe the proposed framework in Section 3, and present the experimental settings and the analysis of the experimental results in Section 4. Section 5 concludes the article and outlines future directions.

3. Semantic Residual Pyramid Network

The Semantic Residual Pyramid Network (SRPNet) uses the principle of the generative adversarial network to inpaint images. SRPNet consists of three parts: a residual pyramid encoder, a multi-layer decoder, and multi-scale discriminators. The residual pyramid encoder and the multi-layer decoder constitute the generator, which produces the inpainted image; an overview of its network structure is shown in Table A1. The multi-scale discriminators judge whether the inpainted image is "real"; an overview of their network structure is shown in Table A2. During training, the generator and the discriminators compete with each other and eventually reach an equilibrium.
We describe the residual pyramid encoder in Section 3.1, introduce the attention transfer model in Section 3.2, detail the multi-layer decoder in Section 3.3, and describe the multi-scale discriminators in Section 3.4. Figure 2 illustrates the network architecture of the proposed SRPNet.
Figure 2. Model structure diagram of the Semantic Residual Pyramid Network (SRPNet). The SRPNet consists of a residual pyramid encoder, a multi-layer decoder, and multi-scale discriminators. The residual pyramid encoder and multi-layer decoder constitute the generator of the network, and the multi-scale discriminators serve as the discriminator of the network. The generator takes the masked image as input and inpaints the image layer by layer using the attention transfer model; the inpainted image is the final output. The multi-scale discriminators include a global discriminator and a local discriminator. The global discriminator takes the completed image as input, and the local discriminator takes the inpainted region of the output as input. The model is optimized with the pyramid loss and the reconstruction loss to obtain finer images. (Best viewed with zoom-in).

3.1. Residual Pyramid Encoder

To obtain deeper semantic features, the usual approach is to deepen the network. However, doing so can cause problems such as slow convergence and increased error, so simply increasing the network depth is not always suitable. To address these problems, our model uses instance normalization [16] to speed up the convergence of the model, and residual blocks are added to each layer of the pyramid network to counter the accuracy degradation caused by increasing the network depth. We therefore present a residual pyramid encoder built on the semantic pyramid structure [27]. It encodes the damaged image into compact latent features, which are later decoded back into an image. The missing regions in the latent features are filled progressively toward the low-level feature layers, which have higher resolution and richer detail, further improving the effectiveness of the encoding. In addition, before decoding, the model fills the missing regions by repeatedly applying the attention transfer model (ATM) [12] from high-level semantic features down to low-level features. A minimal sketch of the residual building block is given below.
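As an illustration, the following PyTorch sketch shows the kind of residual block with instance normalization that can be inserted into each encoder layer. It is a minimal sketch under our own assumptions: the exact layer ordering, strides, and block layout in SRPNet may differ (Table A1 lists the layer widths), and this block simply preserves the spatial size while adding an identity shortcut.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with instance normalization (illustrative, not the exact SRPNet block).

    The identity shortcut lets each encoder layer learn only a residual,
    which eases optimization as the pyramid encoder gets deeper.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # identity shortcut + learned residual

# Example: a block operating on the 64-channel feature map of one encoder level.
features = torch.randn(1, 64, 64, 64)
out = ResidualBlock(64)(features)   # same shape as the input
```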
Suppose the depth of the encoder is n, and its feature maps from low to high are denoted as f1, …, fn−2, fn−1, fn, respectively. The reconstructed feature maps of the ATM at each layer are represented as Fn−1, Fn−2, …, F1 from high to low:
Fn−1 = ψ (fn−1, fn),
Fn−2 = ψ (fn−2, Fn−1),
…,
F1 = ψ (f1, F2) = ψ (f1, ψ (f2 …, ψ (fn−1, fn))),
where ψ represents the operation of the ATM. The missing pixels of the image are filled according to the semantic pyramid mechanism and the layer-by-layer attention transfer model, which ensures the semantic consistency of the inpainted images. The structure is shown in Figure 3, and the specific operation of the ATM is detailed in Section 3.2; the cross-layer recursion itself reduces to a short loop, as sketched below.
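The following sketch, written under the assumption of a callable `atm(low, high)` standing in for ψ (a hypothetical name, not taken from the paper's code), shows how the recursion Fn−1 = ψ(fn−1, fn), Fn−2 = ψ(fn−2, Fn−1), … unrolls into a loop over the encoder features:

```python
def fill_pyramid(features, atm):
    """Unroll the cross-layer recursion of the ATM.

    features: [f1, ..., fn] ordered from low-level to high-level.
    atm(low, high): callable standing in for psi; returns the reconstructed
        map for `low` using attention learned on `high`.
    Returns [F1, ..., F_{n-1}] aligned with f1 ... f_{n-1}.
    """
    filled = []
    higher = features[-1]               # start from the top-level map fn
    for f in reversed(features[:-1]):   # f_{n-1}, f_{n-2}, ..., f1
        higher = atm(f, higher)         # F_k = psi(f_k, F_{k+1}), with F_n := f_n
        filled.append(higher)
    return list(reversed(filled))
```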
Figure 3. Residual Pyramid Encoder Model. First, the feature maps (f1, …, fn−2, fn−1, fn) of the original masked image are obtained by the residual pyramid encoder. Then, the missing pixels are filled by the Attention Transfer Model. Finally, the inpainted feature maps (Fn−1, Fn−2, …, F1) are obtained. (Best viewed in color).

3.2. Attention Transfer Model

First, the ATM learns region affinity from the high-level semantic feature map (fn). It extracts patches (p) inside and outside the missing region of fn and computes the cosine similarity (Similarity^n) between them:

$$\mathrm{Similarity}^{n}_{i,o} = \left\langle \frac{p_o^{n}}{\lVert p_o^{n} \rVert},\; \frac{p_i^{n}}{\lVert p_i^{n} \rVert} \right\rangle,$$

where p_o^n denotes the o-th patch extracted from outside the missing region of fn, and p_i^n denotes the i-th patch extracted from inside the missing region of fn. A softmax over the similarity scores then gives the attention score (Attention^n) of each patch:

$$\mathrm{Attention}^{n}_{i,o} = \frac{\exp\left(\mathrm{Similarity}^{n}_{i,o}\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{Similarity}^{n}_{i,j}\right)},$$

where N is the number of patches outside the missing region. After obtaining the attention scores from the high-level semantic feature map (fn), the missing region of the adjacent low-level feature map (fn−1) is filled with context weighted by these scores:

$$p_i^{\,n-1} = \sum_{o=1}^{N} \mathrm{Attention}^{n}_{i,o}\, p_o^{\,n-1},$$

where p_i^{n−1} denotes the i-th patch inside the missing region of fn−1 that is to be filled, and p_o^{n−1} denotes the o-th patch outside the missing region of fn−1.
This operation is repeated over all patches until the missing region of fn−1 is completely filled, which yields the completed feature map. The network structure of the ATM is shown in Figure 4. Applying the ATM across the layers of the semantic pyramid produces finer semantic features and ensures the context consistency of the filled feature maps, further improving the inpainting results.
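To make the fill step concrete, here is a minimal PyTorch sketch that operates on patches that have already been extracted and flattened. Patch extraction and re-assembly (e.g., with unfold/fold) and the handling of partially masked patches are omitted, and the function name is ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def attention_transfer(inside_hi, outside_hi, outside_lo):
    """Fill step of the ATM on pre-extracted, flattened patches.

    inside_hi : (I, D) patches inside the missing region of fn
    outside_hi: (O, D) patches outside the missing region of fn
    outside_lo: (O, D) corresponding patches outside the missing region of fn-1
    Returns (I, D) patches that fill the missing region of fn-1.
    """
    # Cosine similarity between every inside patch and every outside patch of fn.
    sim = F.normalize(inside_hi, dim=1) @ F.normalize(outside_hi, dim=1).t()  # (I, O)
    # Softmax over the context patches gives the attention scores.
    attn = torch.softmax(sim, dim=1)                                          # (I, O)
    # Each hole patch of fn-1 is an attention-weighted sum of its context patches.
    return attn @ outside_lo                                                  # (I, D)
```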
Figure 4. Attention Transfer Model. Patches (p) are extracted from the feature map (fn), and their attention scores are computed by learning the region affinity. The attention scores are then used to weight the context that fills the feature map (fn−1). Finally, the inpainted feature map (fn−1) is obtained. (Best viewed in color).

3.3. Multi-Layer Decoder

The multi-layer decoder first takes the highest-level semantic feature map, fn, as input and decodes it to obtain the feature map ηn−1. Next, the inpainted feature map from the ATM and the decoded feature map ηn−1 are combined to obtain new feature maps, represented as ηn−2, ηn−3, …, η1. These feature maps are fed to the decoder in turn to produce the predicted image of each layer, X1, X2, …, Xn. Finally, the pyramid loss [12] is used to optimize the output: it computes the normalized L1 distance between the output at each scale and the correspondingly scaled original image, gradually refining the final result and improving the prediction of the missing regions at every scale. The network structure of the multi-layer decoder is shown in Figure 2.
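As an illustrative sketch (not the authors' exact implementation), the pyramid loss can be written as one L1 term per decoder scale, comparing each scale's prediction with a correspondingly resized ground truth; the uniform weighting across scales below is an assumption:

```python
import torch
import torch.nn.functional as F

def pyramid_loss(predictions, target):
    """L1 reconstruction loss accumulated over every decoder scale.

    predictions: list of tensors (B, 3, Hk, Wk), one prediction per pyramid level.
    target:      ground-truth image (B, 3, H, W), resized to each level's resolution.
    """
    loss = 0.0
    for pred in predictions:
        gt = F.interpolate(target, size=pred.shape[-2:],
                           mode='bilinear', align_corners=False)
        loss = loss + F.l1_loss(pred, gt)   # mean absolute error at this scale
    return loss
```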

3.4. Multi-Scale Discriminators

Image inpainting is an ill-posed problem: many different completions are plausible for the same missing region. Therefore, we use a GAN [22] to select the image closest to the real one. A GAN contains at least one generator (G) and one discriminator (D). The generator produces images based on the learned features, and the discriminator judges whether the image produced by the generator is "real." Through constant updating, the generator learns to produce "fake" images that the discriminator can no longer distinguish from real ones.
Our discriminator consists of a global discriminator and a local discriminator. The global discriminator takes the inpainted image as input to judge the consistency between the generated image and the ground truth. The local discriminator takes the inpainted missing regions as input to judge the semantic consistency of the local details. Compared with [10], our local discriminator operates on the inpainted regions, which can be either a central square or an irregularly shaped missing region. The proposed method thus ensures not only the overall consistency of the image but also the consistency of the inpainted regions. The experiments show that our method can generate results that are excellent both semantically and visually.
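Following the layer summary in Table A2, a discriminator of this kind can be sketched as five stride-2, 5 × 5 convolutions with LeakyReLU activations; the same structure is shared by the global and local discriminators. The 0.2 negative slope is our assumption, since the table does not specify it:

```python
import torch
import torch.nn as nn

class SRPNetDiscriminator(nn.Module):
    """Discriminator sketch following the layer summary in Table A2.

    Five 5x5 convolutions with stride 2 map a 3-channel input to a single
    score map; the same structure is used for the global discriminator
    (whole image) and the local discriminator (inpainted region only).
    """
    def __init__(self, in_channels: int = 3):
        super().__init__()
        widths = [in_channels, 64, 128, 256, 512, 1]
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.2, inplace=True))  # slope 0.2 is our assumption
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# The global discriminator scores the whole completed image; the local
# discriminator scores only the inpainted region.
global_d = SRPNetDiscriminator()
local_d = SRPNetDiscriminator()
```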

4. Experiments and Analysis

We present the experimental settings in Section 4.1, report the experimental results in Section 4.2, and analyze the effectiveness of our model in Section 4.3.

4.1. Experimental Settings

We trained and tested on four datasets with different characteristics: CELEBA-HQ [19], DTD [17], Facade [18], and Places2 [20]. Their characteristics are as follows, and further details are given in Table 1. CELEBA-HQ [19] is a high-quality face dataset derived from CELEBA [28]. DTD [17] is a texture dataset containing 5640 images, classified into 47 categories according to human perception of texture diversity, with 120 images per category. Facade [18] is a collection of images of highly structured buildings from around the world. Places2 [20] is a natural-scene dataset with 1,839,960 images, divided into 365 categories according to the scene type. Additionally, we used the irregular mask dataset proposed by the NVIDIA research team [24], which divides the masks into six categories based on hole size. Each category contains 1000 masks with and 1000 masks without boundary constraints, for a total of 12,000 masks.
Table 1. The details of the four different scene datasets.
All of our experiments were trained and tested on 256 × 256 images. Our model ran on an NVIDIA GeForce RTX 3090 with a batch size of 64. We used a learning rate of 10−4 with a decay rate of 0.1 and the Adam optimizer [29] with betas (0.5, 0.999). We used TensorboardX [30] to monitor model convergence. The code was implemented in PyTorch [31].
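For concreteness, the optimizer setup reported above can be written as follows. This is only a sketch: the modules are placeholders standing in for the SRPNet generator and discriminators, and the paper does not state when the 0.1 decay is applied, so the schedule's step size is purely illustrative.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the SRPNet generator and discriminators.
generator = nn.Conv2d(4, 3, kernel_size=3, padding=1)
discriminators = nn.ModuleList(
    [nn.Conv2d(3, 1, kernel_size=5, stride=2, padding=1) for _ in range(2)]  # global + local
)

# Settings reported above: learning rate 1e-4, Adam betas (0.5, 0.999),
# batch size 64, 256 x 256 inputs, decay rate 0.1.
batch_size, image_size = 64, 256
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminators.parameters(), lr=1e-4, betas=(0.5, 0.999))

# Hypothetical schedule: the step size below is not from the paper.
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=100_000, gamma=0.1)
```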
We chose three classical image inpainting methods as baselines: pluralistic image completion (PIC-NET) [25], GLCIC [10], and the pyramid-context encoder network (PEN-NET) [12]. PIC-NET [25] inpaints images from a diversity perspective and can generate multiple visually realistic completions; we used the code released by the authors for training and testing. GLCIC [10] is a deep learning model that was the first to combine a global discriminator and a local discriminator for image inpainting, and it yields semantically consistent results; we used the official code for testing. PEN-NET [12] is the first pyramid-context encoder network to inpaint images at both the semantic and visual levels, producing semantically consistent and visually realistic results; we used the officially released pretrained model for testing.

4.2. Experimental Results

Qualitative Comparisons. We tested the images with 128 × 128 regular square masks and with different irregular masks. Compared with the classical baseline models, our model exhibits semantic consistency and excellent visual quality. As shown in Figure 5, PIC-NET [25] can generate clear and complete images for some datasets but lacks semantic consistency with the real images, and it does not generalize to all scene types. GLCIC [10] generates blurred images under large-area mask occlusion and produces results that are clearly inconsistent with the ground truth when inpainting irregular regions. The images generated by PEN-NET [12] exhibit blurred visual effects and partial distortion. In contrast, our method generates results that are both visually realistic and semantically consistent when inpainting square and irregular missing regions. The comparisons of our experimental results for the regular and irregular masks are shown in Figure 5. (The remaining experimental results are given in Appendix A Figure A1.)
Figure 5. Qualitative comparisons for image inpainting with square mask and irregular mask on four different datasets. From left to right: the original image, the input image, the results of the baseline model, and our results. (Best viewed with zoom-in). (a) GT; (b) Input; (c) PIC-NET; (d) GLCIC; (e) PEN-NET; (f) Ours.
Quantitative Comparisons. We used the L1 loss, the peak signal-to-noise ratio (PSNR) [32], and the multi-scale structural similarity index measure (MS-SSIM) [33] as evaluation metrics. The L1 loss is the mean absolute difference between the pixel values of the predicted image and those of the ground truth, and it reflects the actual error between the predicted and real images. PSNR [32] is the most widely used objective image-quality metric and measures the difference between the pixel values of two images. MS-SSIM [33] measures the similarity of images at different resolutions. We performed this quantitative comparison on the four datasets: DTD [17], Facade [18], CELEBA-HQ [19], and Places2 [20]. All models were trained with a 128 × 128 regular square mask and with the irregular masks [24]. The tested images were output directly from our trained models without any post-processing of the inpainted results. The inpainting results with the regular square mask are shown in Table 2, and those with the irregular masks are shown in Table 3. We also evaluated the ability of the models to inpaint irregular masks of different sizes; part of these results is shown in Table 4, and the full results are given in Appendix A Table A3, Table A4 and Table A5. As seen from Table 2, our method numerically outperforms the comparison methods on all four datasets. Among the metrics, MS-SSIM [33] improves the most, which verifies the effectiveness of the added multi-scale components. Our results on the Facade [18] and Places2 [20] datasets are particularly strong, indicating that our method is well suited to inpainting natural-scene images. As seen from Table 3, although the inpainting results of our method on the DTD [17] dataset are still insufficient, its results on the other datasets significantly outperform the comparison methods. We conclude that our model performs better with both regular and irregular masks. From Table 4, our method achieves better inpainting results on larger missing areas, which also verifies that our model is better suited to inpainting large holes.
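For reference, the L1 error and PSNR used above can be computed as in the following minimal sketch; MS-SSIM requires a dedicated multi-scale SSIM implementation and is omitted here.

```python
import torch

def l1_error(pred, target):
    """Mean absolute pixel difference (reported as a percentage in the tables)."""
    return (pred - target).abs().mean()

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio for images scaled to [0, max_val]."""
    mse = ((pred - target) ** 2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example on random tensors standing in for a predicted and a real image.
pred, gt = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(l1_error(pred, gt).item(), psnr(pred, gt).item())
```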
Table 2. The inpainting-effect data of the central square mask. It includes a quantitative comparison between L1, PSNR [32], and MS-SSIM [33] on four different datasets. ↑ higher is better; ↓ lower is better.
Table 3. The inpainting-effect data of the irregular mask. It includes a quantitative comparison between L1, PSNR [32], and MS-SSIM [33] on four different datasets. ↑ higher is better; ↓ lower is better.
Table 4. Quantitative comparison of L1, PSNR [32], and MS-SSIM [33] on CELEBA-HQ [19] under different mask ratios. ↑ higher is better; ↓ lower is better.

4.3. Ablation Study

We conducted ablation experiments to verify the effectiveness of the proposed network. To assess the residual blocks and the multi-scale discriminators, we added them to the baseline model in turn and trained and tested the inpainting of the central square missing region on the Facade [18] and CELEBA-HQ [19] datasets. The test data are shown in Table 5, where we can observe that the inpainting ability of the models gradually improved as components were added. The test results of the different models are shown in Figure 6. As seen from the figure, the baseline output shows varying degrees of blurring. After adding the discriminator, some of the blurred areas were eliminated. Finally, when the residual blocks were also added, the output image had a good visual effect. These results confirm the effectiveness of the added components for image inpainting.
Table 5. Ablation comparison of the discriminator and residual blocks over Facade [18] and CELEBA-HQ [19]. ↑ higher is better; ↓ lower is better.
Figure 6. Comparison of different training-model outputs. From left to right: the original image, the input image, the results of the baseline model, the results of adding the discriminator model, and our results. (Best viewed with zoom-in). (a) GT; (b) Input; (c) Baseline; (d) +Discriminator; (e) +Residual blocks.

5. Conclusions

In this article, we proposed the Semantic Residual Pyramid Network (SRPNet), a deep generative model for inpainting the missing regions of images at the image and feature levels. Our approach is based on a GAN: the generator produces an image, and the discriminator judges whether the generated image is "real." To acquire more semantic features, we designed a residual pyramid encoder and incorporated it into the generator network. We also introduced multi-scale discriminators to improve the model's ability to judge the local semantics of the image. The experiments showed that our approach can generate images with consistent semantics and realistic visual effects on different datasets. In the future, we will focus on high-resolution image inpainting and on further improving the visual quality of the inpainting results.

Author Contributions

Conceptualization, H.L. and Y.Z.; methodology, H.L.; software, H.L.; validation, H.L. and Y.Z.; formal analysis, H.L.; investigation, H.L.; resources, Y.Z.; data curation, H.L.; writing—original draft preparation, H.L. and Y.Z.; writing—review and editing, H.L. and Y.Z.; visualization, H.L.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grants 61972206 and 62011540407; in part by the Natural Science Foundation of Jiangsu Province under Grant BK20211539; in part by the 15th Six Talent Peaks Project in Jiangsu Province under Grant RJFW-015; in part by the Qing Lan Project; and in part by the PAPD fund.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. Our code can be found at https://github.com/luobo348/SRPNet.git (accessed on 4 January 2022).

Acknowledgments

I want to thank my teacher for supporting and encouraging my research project.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Generator network structure summary of SRPNet. The generator consists of the Residual Pyramid Encoder and the Multi-layer Decoder.
Type | Structure
Input | 4 × 256 × 256 (Image + Mask)
Conv | input: 4, kernel: 3, stride: 2, padding: 1, LReLU, output: 32
Residual_blocks | input: 32, kernel: 3, stride: 2, padding: 0, ReLU, output: 32
Conv | input: 32, kernel: 3, stride: 2, padding: 1, LReLU, output: 64
Residual_blocks | input: 64, kernel: 3, stride: 2, padding: 0, ReLU, output: 64
Conv | input: 64, kernel: 3, stride: 2, padding: 1, LReLU, output: 128
Residual_blocks | input: 128, kernel: 3, stride: 2, padding: 0, ReLU, output: 128
Conv | input: 128, kernel: 3, stride: 2, padding: 1, LReLU, output: 256
Residual_blocks | input: 256, kernel: 3, stride: 2, padding: 0, ReLU, output: 256
Conv | input: 256, kernel: 3, stride: 2, padding: 1, LReLU, output: 512
Residual_blocks | input: 512, kernel: 3, stride: 2, padding: 0, ReLU, output: 512
Conv | input: 512, kernel: 3, stride: 2, padding: 1, LReLU, output: 512
Residual_blocks | input: 512, kernel: 3, stride: 2, padding: 0, ReLU, output: 512
ATMConv | input: 512, kernel: 1, stride: 1, output: 512
ATMConv | input: 256, kernel: 1, stride: 1, output: 256
ATMConv | input: 128, kernel: 1, stride: 1, output: 128
ATMConv | input: 64, kernel: 1, stride: 1, output: 64
ATMConv | input: 32, kernel: 1, stride: 1, output: 32
DeConv | input: 512, kernel: 3, stride: 1, padding: 1, ReLU, output: 512
DeConv | input: 1024, kernel: 3, stride: 1, padding: 1, ReLU, output: 256
DeConv | input: 512, kernel: 3, stride: 1, padding: 1, ReLU, output: 128
DeConv | input: 256, kernel: 3, stride: 1, padding: 1, ReLU, output: 64
DeConv | input: 128, kernel: 3, stride: 1, padding: 1, ReLU, output: 32
Output1 | input: 1024, kernel: 1, stride: 1, padding: 0, Tanh, output: 3
Output2 | input: 512, kernel: 1, stride: 1, padding: 0, Tanh, output: 3
Output3 | input: 256, kernel: 1, stride: 1, padding: 0, Tanh, output: 3
Output4 | input: 128, kernel: 1, stride: 1, padding: 0, Tanh, output: 3
Output5 | input: 64, kernel: 1, stride: 1, padding: 0, Tanh, output: 3
Output6 | input: 64, kernel: 3, stride: 1, padding: 1, ReLU, output: 32; input: 32, kernel: 3, stride: 1, padding: 1, Tanh, output: 3
Table A2. Discriminator network structure summary of SRPNet. Our model uses the same network structure for the global discriminator and the local discriminator.
Type | Structure
Conv | input: 3, kernel: 5, stride: 2, padding: 1, LReLU, output: 64
Conv | input: 64, kernel: 5, stride: 2, padding: 1, LReLU, output: 128
Conv | input: 128, kernel: 5, stride: 2, padding: 1, LReLU, output: 256
Conv | input: 256, kernel: 5, stride: 2, padding: 1, LReLU, output: 512
Conv | input: 512, kernel: 5, stride: 2, padding: 1, LReLU, output: 1
Table A3. Quantitative comparison of L1, PSNR [32], and MS-SSIM [33] on DTD [17] under different mask ratios. ↑ higher is better; ↓ lower is better.
Mask | Methods | L1 (↓) | PSNR (↑) | MS-SSIM (↑)
[0.01, 0.1] | PIC-NET | 0.83% | 31.14 | 95.83%
[0.01, 0.1] | GLCIC | 1.01% | 32.67 | 96.52%
[0.01, 0.1] | PEN-NET | 0.67% | 32.37 | 96.57%
[0.01, 0.1] | Ours | 0.66% | 32.75 | 96.55%
(0.1, 0.2] | PIC-NET | 2.10% | 26.07 | 90.88%
(0.1, 0.2] | GLCIC | 2.87% | 26.36 | 89.82%
(0.1, 0.2] | PEN-NET | 1.79% | 26.72 | 91.91%
(0.1, 0.2] | Ours | 1.76% | 26.78 | 92.01%
(0.2, 0.3] | PIC-NET | 3.60% | 23.49 | 83.44%
(0.2, 0.3] | GLCIC | 5.19% | 22.60 | 78.60%
(0.2, 0.3] | PEN-NET | 3.27% | 23.68 | 84.29%
(0.2, 0.3] | Ours | 3.13% | 24.05 | 85.32%
(0.3, 0.4] | PIC-NET | 5.22% | 21.58 | 75.08%
(0.3, 0.4] | GLCIC | 7.41% | 20.08 | 66.32%
(0.3, 0.4] | PEN-NET | 4.90% | 21.65 | 75.42%
(0.3, 0.4] | Ours | 4.59% | 22.23 | 77.89%
(0.4, 0.5] | PIC-NET | 7.11% | 19.94 | 65.36%
(0.4, 0.5] | GLCIC | 9.51% | 18.39 | 54.08%
(0.4, 0.5] | PEN-NET | 6.61% | 20.17 | 65.90%
(0.4, 0.5] | Ours | 6.24% | 20.70 | 69.23%
(0.5, 0.6] | PIC-NET | 9.90% | 17.85 | 51.65%
(0.5, 0.6] | GLCIC | 11.51% | 16.95 | 40.43%
(0.5, 0.6] | PEN-NET | 8.61% | 18.81 | 54.91%
(0.5, 0.6] | Ours | 8.18% | 19.26 | 57.48%
Table A4. Quantitative comparison of L1, PSNR [32], and MS-SSIM [33] on Facade [18] under different mask ratios. ↑ higher is better; ↓ lower is better.
Mask | Methods | L1 (↓) | PSNR (↑) | MS-SSIM (↑)
[0.01, 0.1] | PIC-NET | 0.86% | 29.28 | 96.28%
[0.01, 0.1] | GLCIC | 0.98% | 31.48 | 97.92%
[0.01, 0.1] | PEN-NET | 0.58% | 31.98 | 97.60%
[0.01, 0.1] | Ours | 0.54% | 32.20 | 97.85%
(0.1, 0.2] | PIC-NET | 2.68% | 22.89 | 90.81%
(0.1, 0.2] | GLCIC | 2.71% | 25.38 | 84.75%
(0.1, 0.2] | PEN-NET | 1.75% | 24.94 | 93.43%
(0.1, 0.2] | Ours | 1.45% | 26.34 | 95.19%
(0.2, 0.3] | PIC-NET | 4.81% | 20.04 | 82.11%
(0.2, 0.3] | GLCIC | 4.76% | 19.98 | 76.06%
(0.2, 0.3] | PEN-NET | 3.40% | 21.67 | 86.66%
(0.2, 0.3] | Ours | 2.71% | 23.41 | 89.33%
(0.3, 0.4] | PIC-NET | 6.85% | 18.48 | 72.83%
(0.3, 0.4] | GLCIC | 6.67% | 19.98 | 76.06%
(0.3, 0.4] | PEN-NET | 5.10% | 19.84 | 79.35%
(0.3, 0.4] | Ours | 4.33% | 20.92 | 82.36%
(0.4, 0.5] | PIC-NET | 9.11% | 17.02 | 63.72%
(0.4, 0.5] | GLCIC | 9.06% | 18.21 | 66.51%
(0.4, 0.5] | PEN-NET | 6.89% | 18.45 | 70.21%
(0.4, 0.5] | Ours | 6.16% | 19.19 | 74.51%
(0.5, 0.6] | PIC-NET | 11.88% | 15.53 | 50.26%
(0.5, 0.6] | GLCIC | 11.16% | 16.65 | 53.59%
(0.5, 0.6] | PEN-NET | 9.27% | 16.95 | 59.52%
(0.5, 0.6] | Ours | 8.20% | 17.67 | 64.01%
Table A5. Quantitative comparison of L1, PSNR [32], and MS-SSIM [33] on Places2 [20] under different mask ratios. ↑ higher is better; ↓ lower is better.
Mask | Methods | L1 (↓) | PSNR (↑) | MS-SSIM (↑)
[0.01, 0.1] | PIC-NET | 0.45% | 34.71 | 97.53%
[0.01, 0.1] | GLCIC | 0.74% | 32.11 | 95.64%
[0.01, 0.1] | PEN-NET | 0.49% | 33.61 | 97.06%
[0.01, 0.1] | Ours | 0.41% | 34.35 | 97.54%
(0.1, 0.2] | PIC-NET | 1.16% | 29.37 | 93.48%
(0.1, 0.2] | GLCIC | 2.10% | 28.29 | 91.87%
(0.1, 0.2] | PEN-NET | 1.35% | 27.49 | 91.47%
(0.1, 0.2] | Ours | 1.06% | 29.38 | 93.84%
(0.2, 0.3] | PIC-NET | 2.02% | 26.40 | 88.11%
(0.2, 0.3] | GLCIC | 3.77% | 24.45 | 83.00%
(0.2, 0.3] | PEN-NET | 2.46% | 24.36 | 84.25%
(0.2, 0.3] | Ours | 1.89% | 26.47 | 88.73%
(0.3, 0.4] | PIC-NET | 2.99% | 24.28 | 81.80%
(0.3, 0.4] | GLCIC | 5.52% | 21.80 | 72.63%
(0.3, 0.4] | PEN-NET | 3.76% | 22.28 | 76.27%
(0.3, 0.4] | Ours | 2.82% | 24.47 | 82.64%
(0.4, 0.5] | PIC-NET | 4.16% | 22.41 | 73.96%
(0.4, 0.5] | GLCIC | 7.17% | 19.90 | 62.01%
(0.4, 0.5] | PEN-NET | 5.42% | 20.55 | 66.94%
(0.4, 0.5] | Ours | 3.98% | 22.56 | 74.94%
(0.5, 0.6] | PIC-NET | 6.45% | 19.55 | 58.67%
(0.5, 0.6] | GLCIC | 9.03% | 18.01 | 49.70%
(0.5, 0.6] | PEN-NET | 7.76% | 19.13 | 56.79%
(0.5, 0.6] | Ours | 5.73% | 20.44 | 63.27%
Figure A1. Qualitative comparisons for image inpainting with irregular masks on four different datasets. From left to right: the original image, the input image, the results of the baseline model, and our results. (Best viewed with zoom-in). (a) GT; (b) Input; (c) PIC-NET; (d) GLCIC; (e) PEN-NET; (f) Ours.

References

  1. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  2. Qureshi, M.A.; Deriche, M. A bibliography of pixel-based blind image forgery detection techniques. Signal Process. Image Commun. 2015, 39, 46–74. [Google Scholar] [CrossRef]
  3. Qureshi, M.A.; Deriche, M.; Beghdadi, A.; Amin, A. A critical survey of state-of-the-art image inpainting quality assessment metrics. J. Vis. Commun. Image Represent. 2017, 49, 177–191. [Google Scholar] [CrossRef]
  4. Shen, J.; Chan, T.F. Mathematical Models for Local Nontexture Inpaintings. SIAM J. Appl. Math. 2002, 62, 1019–1043. [Google Scholar] [CrossRef]
  5. Chan, T.F.; Shen, J. Nontexture Inpainting by Curvature-Driven Diffusions. J. Vis. Commun. Image Represent. 2001, 12, 436–449. [Google Scholar] [CrossRef]
  6. Efros, A.A.; Freeman, W.T. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA; 2001; pp. 341–346. [Google Scholar]
  7. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  8. Jain, V.; Seung, S. Natural image denoising with convolutional networks. Adv. Neural Inf. Processing Syst. 2008, 21, 769–776. [Google Scholar]
  9. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 2536–2544. [Google Scholar]
  10. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. 2017, 36, 1–14. [Google Scholar] [CrossRef]
  11. Yi, Z.; Tang, Q.; Azizi, S.; Jang, D.; Xu, Z. Contextual residual aggregation for ultra high-resolution image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7508–7517. [Google Scholar]
  12. Zeng, Y.; Fu, J.; Chao, H.; Guo, B. Learning pyramid-context encoder network for high-quality image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1486–1494. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  14. Ying, X. An Overview of Overfitting and its Solutions. J. Phys. Conf. Ser. 2019, 1168, 022022. [Google Scholar] [CrossRef]
  15. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  16. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
  17. Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  18. Tyleček, R.; Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proceedings of the German Conference on Pattern Recognition, Saarbrücken, Germany, 3–6 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 364–374. [Google Scholar]
  19. Karras, T.; Aila, T.; Laine, S.; Lehtine, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  20. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 Million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  21. Kim, P. Convolutional neural network. In MATLAB Deep Learning; Apress: Berkeley, CA, USA, 2017; pp. 121–147. [Google Scholar]
  22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  23. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
  24. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  25. Zheng, C.; Cham, T.J.; Cai, J. Pluralistic image completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1438–1447. [Google Scholar]
  26. Adelson, E.H.; Anderson, C.H.; Bergen, J.R.; Burt, P.J.; Ogden, J.M. Pyramid methods in image processing. RCA Eng. 1984, 29, 33–41. [Google Scholar]
  27. Shocher, A.; Gandelsman, Y.; Mosseri, I.; Yarom, M.; Irani, M.; Freeman, W.T.; Dekel, T. Semantic pyramid for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7457–7466. [Google Scholar]
  28. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  30. Dillon, J.V.; Langmore, I.; Tran, D.; Brevdo, E.; Vasudevan, S.; Moore, D.; Saurous, R.A. Tensorflow distributions. arXiv 2017, arXiv:1711.10604. [Google Scholar]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Chintala, S. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Processing Syst. 2019, 32, 8026–8037. [Google Scholar]
  32. Johnson, D.H. Signal-to-noise ratio. Scholarpedia 2006, 1, 2088. [Google Scholar] [CrossRef]
  33. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
