Non-Local and Multi-Scale Mechanisms for Image Inpainting

Recently, deep learning-based techniques have shown great power in image inpainting, especially when dealing with square holes. However, they fail to generate plausible results inside irregular and large missing regions because they do not adequately model the relationship between missing regions and their known surroundings. To overcome this limitation, we combine two non-local mechanisms, a contextual attention module (CAM) and an implicit diversified Markov random fields (ID-MRF) loss, with a multi-scale architecture that uses several dense fusion blocks (DFB), built on dense combinations of dilated convolutions, to guide the generative network in restoring both discontinuous and continuous large masked areas. To prevent color discrepancies and grid-like artifacts, we apply the ID-MRF loss, which improves visual appearance by comparing the similarities of long-distance feature patches. To further capture long-range relationships between different regions in large missing areas, we introduce the CAM. Although the CAM can create plausible results by reconstructing refined features, it depends on the initial predicted results. Hence, we employ the DFB to obtain larger and more effective receptive fields, which helps predict more precise and fine-grained information for the CAM. Extensive experiments on two widely used datasets demonstrate that our proposed framework significantly outperforms state-of-the-art approaches both quantitatively and qualitatively.


Introduction
Image inpainting, which synthesizes semantically reasonable and visually plausible content in damaged regions from the existing areas, has attracted great attention in recent years. High-quality image inpainting can benefit a wide range of applications such as unwanted-object removal [1,2] and photo restoration [3]. It is necessary not only to reconstruct textures and content, but also to understand the scene and the objects to be completed. Despite many years of research, image inpainting remains a challenging task in computer vision because it is an ill-posed inverse problem [4].
Generally, image inpainting approaches can be classified into three categories: diffusion-based methods [5], patch-based methods [6] and deep-learning-based methods [7]. The first two depend on spreading and copying known information; hence they have an inferior ability to acquire high-level semantic features. Recently, deep-learning-based approaches such as convolutional neural networks (CNN) [8] and generative adversarial networks (GAN) [9] have exhibited a powerful capability of reconstructing target regions from surrounding areas.
Pathak et al. [10] designed a CNN-based model termed the context encoder, which consists of an encoder that captures the context of an image in a compact latent feature representation and a decoder that uses this representation to predict the target region. Although it achieves promising results, some fine details are lost because the model concentrates on recovering structural information. To tackle this issue, many methods adopted a two-stage architecture. For instance, Yu et al. [11] introduced a coarse-to-fine framework in which a coarse network roughs out the missing content and a refinement module captures high-level features from known areas. Based on this network, Iizuka et al. [12] designed a global discriminator and a local discriminator to distinguish between real images and repaired images, which maintains the coherence between missing areas and surrounding regions. The above-mentioned methods mainly focus on rectangular holes and assume that the missing area lies in the middle of the image. However, this assumption rarely holds outside of research settings and limits practical use. Recently, many methods have been proposed to deal with this problem. For example, Liu et al. [13] first put forward partial convolution (PConv), which replaces ordinary convolutional layers with a re-normalized convolution and a mask-update operation. Yu et al. [14] presented gated convolution for free-form image completion. Furthermore, Ma et al. [15] designed a region-wise network to boost the capability of the generative model to adaptively learn feature representations in different regions. Those methods achieve promising performance when a small proportion of the image is corrupted, but fall short when the incomplete regions occupy a large proportion, where they inevitably produce artifacts such as color discrepancies and blurriness.
To overcome the above-mentioned problems, we focus on arbitrary and large image defects. Our work builds upon the recently proposed region-wise approach [15], which employs a region-wise convolution mechanism and trains the network with a joint loss consisting of reconstruction losses, a style loss, a correlation loss and an adversarial loss. It performs well when the missing region is discontinuous, but suffers from obvious grid-shaped artifacts in larger, continuous occluded regions. This phenomenon is probably caused by the correlation loss and the style loss, since both use the Gram matrix, which captures pixel-wise correlations rather than global consistency. The authors of [15] also proposed an adversarial network to mitigate large-area artifacts, but it brings no obvious improvement in visual quality. Another limitation is that many results contain over-smoothed and incomplete structures when the damaged area is large. We speculate that the original model does not have enough capacity to learn feature representations of different regions. Our proposed approach addresses these points and achieves desirable results by focusing on non-local relationships and multi-scale information. More specifically, we retain the original region-wise convolutions and the two-stage structure of [15] for the following reasons. Region-wise convolution can perform different operations for different regions. However, a framework that adopts region-wise convolution alone produces blurry results, so it is indispensable to incorporate a refinement network to infer more precise details. Building on this structure, we first use a pretrained VGG model combined with implicit diversified Markov random fields in place of Gram-matrix matching, which alleviates grid-like artifacts.
However, the repaired results are still blurred when the missing regions occupy a large proportion. Hence, we introduce the contextual attention module (CAM) to capture features from background patches and propagate the spatial coherency of attention. With the CAM alone, however, we find that the results contain incorrect textures and incomplete structures. Our analysis is that the original cascaded dilated convolutions cannot provide abundant and accurate information for the CAM. To tackle this problem, we introduce several dense fusion blocks (DFB) to replace the original cascaded dilated convolutions and simultaneously extract multi-scale features for the CAM. In conclusion, the combination of the CAM and the DFB improves the visual authenticity and perceptual plausibility of the results.
We evaluate and analyze our proposed method on two standard datasets, CelebA-HQ and Paris StreetView. We also compare our model with state-of-the-art schemes and provide experimental results to verify the effectiveness of our proposed method.
The main contributions of this paper are summarized as follows:
• We aim to solve image inpainting tasks for randomly missing regions with a large proportion and employ an ID-MRF loss to tackle grid-shape artifacts and color discrepancy caused by style loss and correlation loss.
• We innovatively combine the CAM with the DFB module to assist our network to generate precise and fine-grained contents by borrowing features from distant spatial location and extracting multi-scale features.
• Experiments on multiple benchmark datasets intuitively show that our method is able to achieve competitive results.

Related Work
Recently, a great deal of literature has proposed numerous methods for image inpainting. In this section, we will mainly introduce a few methods related to non-local and multi-scale mechanisms.

Non-Local Mechanisms
For discontinuous missing regions, semantic information can be easily inferred from the background. However, repairing large and continuous masked areas is challenging because of the huge gap between the empty missing regions and the possible recovered content. Attention mechanisms based on the relationship between contextual and missing regions have often been used in image inpainting. For instance, Yu et al. [11] proposed a coarse-to-fine generative adversarial network and appended a contextual attention module to learn feature representations by matching patches from background information. However, once wrong information is captured in the first stage, the error propagates. On this basis, Sagong et al. [16] proposed a parallel extended-decoder path with a modified contextual attention module to reduce the number of convolution operations while producing higher-quality inpainting results. For capturing long-range spatial dependencies, the self-attention mechanism [17] based on the non-local network [18] has been widely adopted. For instance, Uddin et al. [19] designed a global and a local attention architecture to obtain globally and locally coherent information. Yang et al. [20] exploited a self-attention mechanism and integrated it with structural information. Liu et al. [21] proposed a non-local module to capture deeper relationships between different regions by using a self-attention framework. However, the non-local module was originally designed for classification, and this operation is not sufficient to significantly improve the performance of our framework. In addition to attention mechanisms, a pre-trained VGG network [22] has been widely adopted to extract non-local features by calculating a style loss. The essence of the style loss is to learn the relationship between existing and unknown regions by using the Gram matrix to calculate pixel relevance.
Although it can preserve high-frequency details, it tends to produce grid-shaped artifacts and content that is inconsistent with the background. To generate higher-quality images, several studies have used patch similarities as loss functions to learn non-local features. For example, Yang et al. [23] proposed an MRF-based style transfer method to promote feature fusion, structure completion and texture reconstruction. In particular, Wang et al. [24] proposed an MRF-based non-local loss to encourage the network to produce high-quality results by considering content consistency and texture similarity.

Multi-Scale Mechanisms
Multi-scale methods have recently been developed extensively for image inpainting. For instance, Wang et al. [25] introduced a Laplacian-pyramid model to progressively restore images at different resolutions. Mo et al. [26] introduced several multi-scale discriminators to generate results containing more multi-scale information. Wang et al. [24] proposed a multi-column convolutional neural network to enlarge the receptive fields by applying different convolutional kernel sizes. However, those methods tend to be resource-consuming and introduce additional parameters. A common way to aggregate multi-scale information while reducing resource consumption is to enlarge the receptive fields with dilated convolutions. In [12], the original channel-wise fully connected layer was replaced by cascaded dilated convolutions to broaden the receptive field. To further learn multi-scale features, Hui et al. [27] used dense combinations of dilated convolutions with different dilation rates to learn larger and more effective receptive fields, which is vital for inferring reasonable structures and content.
Sensors 2021, 21, 3281

Proposed Methods
In this section, we first describe the pipeline of our method, which is based on a generative adversarial network. Then we introduce the details of the contextual attention mechanism and the dense connection architecture of dilated convolutions. Finally, the objective loss functions, including the reconstruction loss, the ID-MRF loss and the adversarial loss, are presented in detail. The overall framework of our method is displayed in Figure 1.

Figure 1. The overall architecture of our method. Region-wise convolution indicates using different convolution filters for different regions, more details can be found in paper [15]. In this architecture, 256 by 256 and 32 denote the size and channels of the feature map respectively.

The Architecture of Our Framework
As depicted in Figure 1, we take the architecture proposed by Ma et al. [15] as the backbone of our generator, which is composed of several region-wise convolutions and cascaded dilated convolutions arranged in a coarse-to-fine structure. On this basis, we introduce the contextual attention module (CAM) and use the dense connection of dilated convolutions as a dense fusion block (DFB) to replace the original cascaded dilated convolutions. The CAM is not suitable for the coarse stage, as this phase cannot provide sufficiently accurate and delicate information for the CAM to borrow and propagate. Moreover, ordinary cascaded dilated convolutions cannot extract multi-scale features of the image. Based on these observations, we integrate DFBs into both the coarse and refinement stages and employ the CAM only in the refinement stage. In addition, we embed only one CAM, as it is sufficient for feature borrowing and reconstruction. Together, the DFB and the CAM synthesize more fine-grained results. Moreover, in [11] and [16], the attention module is used in one branch of a parallel network. However, we find this design unsuitable for our architecture, since our model depends strongly on skip connections and dilated convolutions. Based on this observation, we design a novel refinement framework to improve the robustness of the inpainting model and synthesize realistic content simultaneously. It is worth emphasizing that the input of the CAM is the concatenation of the convolutional layers before and after the dilated convolutions. As shown in Figure 1, we are given a ground-truth image $I_{gt}$ and a binary mask $M$ that marks damaged areas (0 for missing regions, 1 for known regions). The corrupted image is
$$I_{in} = I_{gt} \odot M,$$
where the symbol $\odot$ denotes element-wise multiplication of two matrices. We feed the concatenation of $I_{in}$ and $M$ to the coarse network instead of $I_{in}$ alone, which helps the network concentrate on valid pixels. The coarse network then produces a predicted image $I^{1}_{pred}$ with the same resolution as the original image. We combine the masked area of $I^{1}_{pred}$ with the known background regions to form the composite image of the first stage:
$$I^{1}_{comp} = I_{gt} \odot M + I^{1}_{pred} \odot (1 - M).$$
$I^{1}_{comp}$ is then sent to the refinement network to provide more information, and the model yields the refined image $I^{2}_{pred}$, which is used to obtain
$$I^{2}_{comp} = I_{gt} \odot M + I^{2}_{pred} \odot (1 - M).$$
Moreover, we consider only the locally predicted regions in the adversarial phase. Specifically, the two local predictions $I^{1}_{pred} \odot (1 - M)$ and $I^{2}_{pred} \odot (1 - M)$, one from each stage, are fed together into the discriminator to enhance the capability of the generator. In addition, we adopt spectral normalization [28], which controls the Lipschitz constant of the discriminator.
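The masking and compositing steps can be illustrated with a few lines of NumPy. This is a minimal sketch, not the TensorFlow implementation used in the experiments; the function names `corrupt` and `composite`, the toy 4 by 4 image and the random stand-in for the network prediction are assumptions made for the example. The mask convention (1 for known pixels, 0 for missing pixels) follows the text.

```python
import numpy as np

def corrupt(image, mask):
    """Zero out the missing pixels; mask is 1 for known pixels and 0 for holes."""
    return image * mask

def composite(prediction, ground_truth, mask):
    """Keep known pixels from the ground truth and fill the holes from the prediction."""
    return ground_truth * mask + prediction * (1.0 - mask)

rng = np.random.default_rng(0)
ground_truth = rng.random((4, 4, 3))      # toy RGB image
mask = np.ones((4, 4, 1))
mask[1:3, 1:3] = 0.0                      # a 2 by 2 hole in the middle
prediction = rng.random((4, 4, 3))        # hypothetical stand-in for a network output

corrupted = corrupt(ground_truth, mask)
# The network input is the corrupted image concatenated with the mask along channels.
network_input = np.concatenate([corrupted, mask], axis=-1)
completed = composite(prediction, ground_truth, mask)
```

Known pixels in `completed` are copied from the ground truth, so only the hole is ever filled with generated content.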

Contextual Attention Mechanism
Recently, the attention mechanism has been widely used in image inpainting and exhibits great potential for generating high-quality images. Since a non-local mechanism is strongly needed to deal with continuous and large missing regions, we employ a contextual attention mechanism (CAM) in the refinement network to enhance the ability of the generator to obtain sharper and more pleasing results from the initial prediction. Yu et al. [11] first proposed the contextual attention module, which borrows patches from the background to fill holes. However, it uses the cosine similarity to match patches, which may affect feature extraction because of the normalization operation. Sagong et al. [16] modified this module by replacing the cosine similarity with the Euclidean distance. This makes it more feasible to match and propagate reasonable content, since the Euclidean distance takes into account not only the angle between patches but also their magnitudes. We follow this method to propagate non-local features. The attention mechanism proceeds as follows. The first step is to divide the feature maps into background and foreground regions: the background is the known region and the foreground is the missing region. We obtain the background area by multiplying the feature maps by the mask. Then we extract patches from both regions and reshape the background patches into convolutional filters. Next, we measure the similarity score $d_{(x,y),(x',y')}$ between the foreground patch $f_{x,y}$ and the background patch $b_{x',y'}$ by the function
$$d_{(x,y),(x',y')} = \tanh\left(-\frac{\tilde{d}_{(x,y),(x',y')} - m}{\sigma}\right),$$
where
$$\tilde{d}_{(x,y),(x',y')} = \left\| f_{x,y} - b_{x',y'} \right\|$$
and $m$ and $\sigma$ are constants. Finally, the foreground region is reconstructed as a weighted sum of background patches, where the weight of each background patch is determined by its similarity score.
With the assistance of the tanh function, our model can accurately distinguish background from foreground, so as to better match and propagate features. Moreover, this module plays an important role in alleviating the influence of redundant information and synthesizing satisfactory results by adaptively differentiating and fusing long-range spatial information.
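The matching step can be sketched in NumPy on flattened patch vectors instead of convolutional feature maps. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the function name `contextual_attention`, the softmax sharpening factor `scale`, and the choice to default $m$ and $\sigma$ to the mean and standard deviation of the distances are all hypothetical.

```python
import numpy as np

def contextual_attention(foreground, background, m=None, sigma=None, scale=10.0):
    """Rebuild each foreground patch as a weighted sum of background patches.

    foreground: (Nf, D) flattened patches from the missing region
    background: (Nb, D) flattened patches from the known region
    """
    # Pairwise Euclidean distances between foreground and background patches.
    diff = foreground[:, None, :] - background[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                      # (Nf, Nb)
    # Assumption: m and sigma default to the mean and std of the distances.
    m = dist.mean() if m is None else m
    sigma = dist.std() + 1e-8 if sigma is None else sigma
    # Small distances map to scores near +1, large distances to scores near -1.
    score = np.tanh(-(dist - m) / sigma)
    # Softmax over background patches turns scores into attention weights.
    logits = scale * score
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ background, weights

background = np.array([[0.0, 0.0, 0.0, 0.0],
                       [1.0, 1.0, 1.0, 1.0],
                       [5.0, 5.0, 5.0, 5.0]])
foreground = np.array([[0.9, 1.1, 1.0, 1.0],
                       [4.8, 5.1, 5.0, 5.2]])
reconstructed, weights = contextual_attention(foreground, background)
```

Because tanh is monotonic, the closest background patch always receives the largest weight; the squashing only limits how strongly outlying distances influence the softmax.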

Dense Connection of Dilated Convolution
Although the CAM delivers a remarkable improvement in the reconstruction of structures and content, it depends on the accuracy of the initially predicted images. In addition, it has been shown that if the coarse network performs poorly, the refinement phase will match and attend to irrelevant information and feature patches [16]. We also find that the skip connections in our framework help the network learn more valid and deeper information. Inspired by those observations, we design a dense connection of dilated convolutions similar to the structure in [27]. As illustrated in Figure 1, the middle layers of the network consist of a series of dense fusion blocks (DFB) based on the dense connection of dilated convolutions, and the concrete structure of every DFB is presented in Figure 2. Unlike the common cascaded dilated convolutions [11] with various dilation rates, which may restrict the flexibility of the generator, our dense combination has a large receptive field and can adaptively learn more effective information. Specifically, a 3 by 3 convolution is employed to reduce the parameters of the input feature maps and concentrate on more valid features by decreasing the channels to a quarter of the original number. Then there are four dilated convolution branches with dilation rates of 2, 4, 8 and 16, respectively, each using a 3 by 3 kernel. Suppose $x_i$ ($i = 1, 2, 3, 4$) denote the four branches, $C(\cdot)$ denotes the convolutional operator and $y_i$ ($i = 1, 2, 3, 4$) denotes the output after $C(\cdot)$. The dense connection can be expressed as
$$y_i = \begin{cases} x_i, & i = 1, \\ C(x_{i-1} + x_i), & i = 2, \\ C(y_{i-1} + x_i), & 2 < i \le 4. \end{cases}$$
We obtain multi-scale information by concatenating all the outputs $y_1, y_2, y_3, y_4$. Then a 1 by 1 convolution is adopted to aggregate the features. It is worth noting that all the convolution layers in the DFB have the same structure as the other convolution layers in our architecture.
A series of DFBs preserves multi-scale information and increases the richness of the extracted features by enlarging the receptive fields.
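To make the data flow of one block concrete, the following NumPy sketch runs a toy single-channel version of a DFB. It rests on several assumptions: the dense connection is taken to follow the cumulative pattern $y_1 = x_1$, $y_2 = C(x_1 + x_2)$, $y_i = C(y_{i-1} + x_i)$ used in [27]; the initial channel-reducing convolution is omitted because there is only one channel; and the helper names `conv3x3` and `dense_fusion` and the random kernels are hypothetical.

```python
import numpy as np

def conv3x3(x, kernel, rate=1):
    """3x3 convolution (cross-correlation) with dilation `rate` and zero padding."""
    h, w = x.shape
    padded = np.pad(x, rate)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i * rate:i * rate + h, j * rate:j * rate + w]
    return out

def dense_fusion(x, branch_kernels, fuse_kernels, mix, rates=(2, 4, 8, 16)):
    """Toy single-channel dense fusion block (channel reduction omitted)."""
    # x_i: four parallel dilated branches with increasing dilation rates.
    branches = [conv3x3(x, k, r) for k, r in zip(branch_kernels, rates)]
    # y_1 = x_1; y_2 = C(x_1 + x_2); y_i = C(y_{i-1} + x_i) for i = 3, 4.
    outputs = [branches[0]]
    for i in range(1, 4):
        outputs.append(conv3x3(outputs[-1] + branches[i], fuse_kernels[i - 1]))
    stacked = np.stack(outputs)                  # connect y_1..y_4 -> (4, H, W)
    fused = np.tensordot(mix, stacked, axes=1)   # 1x1 convolution across the four maps
    return stacked, fused

rng = np.random.default_rng(0)
feature = rng.random((32, 32))
branch_kernels = rng.normal(size=(4, 3, 3))
fuse_kernels = rng.normal(size=(3, 3, 3))
mix = np.full(4, 0.25)
stacked, fused = dense_fusion(feature, branch_kernels, fuse_kernels, mix)
```

Each later output sees the sum of all earlier branches, so features gathered at small dilation rates are progressively merged with those gathered at larger rates before the 1 by 1 aggregation.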

Loss Functions
The task of image inpainting is an ill-posed problem in that there are a number of possible results. Therefore, it is crucial to use loss functions to select the most reasonable and realistic one. In our experiment, we rely on several loss functions to optimize networks in the training process.

Reconstruction Loss
Reconstruction loss is a straightforward way to measure pixel-wise differences between the predicted region and the associated surroundings. We adopt the $L_1$ distance rather than the $L_2$ distance to calculate the reconstruction loss, as the latter produces blurrier images [29]. We will verify this conclusion in the ablation studies. In our two-stage model, the overall reconstruction loss can be expressed as
$$L_{re} = \left\| I^{1}_{pred} - I_{gt} \right\|_1 + \left\| I^{2}_{pred} - I_{gt} \right\|_1.$$
The pixel reconstruction loss guarantees the consistency of content between the generated area and the background area. Moreover, it helps reconstruct the initial structural information, although some high-frequency details are ignored.
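As a small illustration, a two-stage $L_1$ reconstruction loss can be computed as follows. This is a sketch under assumptions: the function name is hypothetical, both stages are compared against the ground truth with equal weight, and the norm is averaged over pixels, which may differ from the normalization used in the experiments.

```python
import numpy as np

def l1_reconstruction_loss(coarse_pred, refined_pred, ground_truth):
    """Sum of the mean absolute errors of the coarse and refined predictions."""
    coarse_term = np.mean(np.abs(coarse_pred - ground_truth))
    refined_term = np.mean(np.abs(refined_pred - ground_truth))
    return coarse_term + refined_term

ground_truth = np.zeros((8, 8, 3))
coarse_pred = np.full((8, 8, 3), 0.2)    # toy coarse output, off by 0.2 everywhere
refined_pred = np.full((8, 8, 3), 0.1)   # toy refined output, off by 0.1 everywhere
loss = l1_reconstruction_loss(coarse_pred, refined_pred, ground_truth)
```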

ID-MRF Loss
The style loss and correlation loss in [15] are prone to cause grid-shaped artifacts and color discrepancy when repairing large continuous masked regions. We also find that the repaired target region becomes very blurry if the loss functions that focus on the relationship between different parts are removed. To address those problems, and motivated by [30] and [24], we adopt an ID-MRF loss to capture complex image layouts and provide plausible content whose visual appearance and style match the ground truth.
Let $I^{1}_{pred}$ and $I_{gt}$ be the predicted image and the ground-truth image, respectively, and let $\Phi^{l}_{p}$ be the feature maps of the predicted image derived from the $l$-th feature layer of a pretrained VGG model. Similarly, $\Phi^{l}_{GT}$ denotes the feature maps of the original image. Let $m$ and $n$ be patches extracted from $\Phi^{l}_{p}$ and $\Phi^{l}_{GT}$, respectively. The relative similarity between $m$ and $n$ is defined as
$$RS(m, n) = \exp\left( \frac{1}{s} \cdot \frac{\mu(m, n)}{\max_{r \in R_n(\Phi^{l}_{GT})} \mu(m, r) + \delta} \right),$$
where $\mu(\cdot, \cdot)$ is the cosine similarity, $R_n(\Phi^{l}_{GT})$ denotes all the neural patches in $\Phi^{l}_{GT}$ except $n$, and $\delta$ and $s$ are positive constants. We then normalize $RS(m, n)$ to
$$\overline{RS}(m, n) = \frac{RS(m, n)}{\sum_{r \in R_n(\Phi^{l}_{GT})} RS(m, r)}.$$
The ID-MRF loss of the $l$-th feature layer between $\Phi^{l}_{p}$ and $\Phi^{l}_{GT}$ is defined as
$$L_M(l) = -\log\left( \frac{1}{h} \sum_{n \in \Phi^{l}_{GT}} \max_{m \in \Phi^{l}_{p}} \overline{RS}(m, n) \right),$$
where $h$ is a normalization constant. Unlike the plain cosine similarity, the ID-MRF loss concentrates on relative distances, which helps find high-quality patches in the neighborhood. Minimizing $L_M(l)$ makes each patch $m$ in $\Phi^{l}_{p}$ seek non-local similar candidates in $\Phi^{l}_{GT}$, which constrains the network to generate images closer to the real ones.
Let the predicted image be $I^{2}_{pred}$. It is then projected into a higher-level feature space using a VGG16 network pre-trained on ImageNet. We use conv3_2 and conv4_2 to represent image texture and conv4_2 to describe semantic structures. The ID-MRF loss is defined as
$$L_{mrf} = L_M(\mathrm{conv4\_2}) + \sum_{t=3}^{4} L_M(\mathrm{conv}t\_2).$$
Whereas the correlation loss and style loss are pixel-wise rather than patch-wise, our ID-MRF loss can establish relationships between distant content.
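The relative-similarity computation for a single feature layer can be sketched in NumPy on flattened patch vectors. This is an illustrative sketch, not the implementation used in the experiments. It assumes the form of the loss given in [24]: $RS(m,n) = \exp\big((\mu(m,n) / (\max_{r \ne n} \mu(m,r) + \delta)) / s\big)$, normalized by the sum of $RS(m,r)$ over $r \ne n$, followed by $-\log$ of the average over ground-truth patches of the best match. The function name, the values of `delta` and `s`, and the choice of $h$ as the number of ground-truth patches are hypothetical, and the features are assumed non-negative, as after a ReLU.

```python
import numpy as np

def id_mrf_loss(pred_patches, gt_patches, delta=1e-5, s=0.5):
    """ID-MRF loss for one layer; inputs are (N, D) arrays of flattened patches."""
    pred = pred_patches / (np.linalg.norm(pred_patches, axis=1, keepdims=True) + 1e-8)
    gt = gt_patches / (np.linalg.norm(gt_patches, axis=1, keepdims=True) + 1e-8)
    mu = pred @ gt.T                               # cosine similarity, (Np, Ng)

    # max over r != n of mu[m, r]: the row maximum, or the runner-up when n is the argmax.
    order = np.sort(mu, axis=1)
    top1, top2 = order[:, -1:], order[:, -2:-1]
    is_argmax = np.arange(mu.shape[1])[None, :] == mu.argmax(axis=1)[:, None]
    max_excl = np.where(is_argmax, top2, top1)

    rs = np.exp((mu / (max_excl + delta)) / s)     # relative similarity RS(m, n)
    # Normalize by the sum over all ground-truth patches except n.
    rs_bar = rs / (rs.sum(axis=1, keepdims=True) - rs)
    h = gt_patches.shape[0]                        # assumed normalization constant
    return -np.log(rs_bar.max(axis=0).sum() / h)

gt_patches = np.eye(6) + 0.1                       # six clearly distinct toy patches
loss_matched = id_mrf_loss(gt_patches, gt_patches)
loss_uniform = id_mrf_loss(np.ones((6, 6)), gt_patches)
```

In this toy setting, predictions that reproduce the ground-truth patches give a lower loss than featureless predictions that are equally similar to every patch, which is the behavior the relative measure is designed to reward.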

Adversarial Loss
Relying on the generator alone cannot guarantee plausible results. The experiments in [31] have confirmed that an adversarial network helps remove grid-like artifacts. Motivated by this research and aiming to produce pleasing results, we adopt a discriminator to encourage the generator to synthesize visually consistent results. Given $I_{gt}$, $I^{1}_{pred}$ and $I^{2}_{pred}$, we penalize the predicted missing regions rather than the entire image and concatenate those areas with the corresponding mask as inputs to the discriminator. To sum up, the learning objective for the discriminator in our experiments is formulated as: where $D$ and $M$ denote the discriminator and the mask, respectively. We set $\alpha$ to 0.01. The generator strives to improve the quality of the synthesized results to fool the discriminator, while the discriminator judges whether an image is real or fake, until it can no longer distinguish real images from those produced by the generator.

Overall Loss
Finally, we obtain the hybrid loss function, which is a linear combination of the reconstruction loss $L_{re}$, the ID-MRF loss $L_{mrf}$ and the adversarial loss $L_{adv}$:
$$L = \lambda_1 L_{re} + \lambda_2 L_{mrf} + \lambda_3 L_{adv},$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the different loss components. Our joint loss function encourages the generator to produce semantically reasonable and visually realistic results.

Experiments
In this section, we present the datasets used in this work and our experimental implementation. We also compare our approach with several state-of-the-art image inpainting methods to evaluate the effectiveness of our model qualitatively and quantitatively. Finally, we conduct ablation studies to examine the effect of different components in our model.

Datasets and Masks
We validate our method on two public and widely used datasets: CelebA-HQ and Paris StreetView. The former focuses on human faces and contains 30,000 images; the latter, collected from street views of Paris, contains 14,900 training images and 100 test images. We use the original training, test and validation splits for these two datasets. Moreover, we adopt an algorithm to generate irregular masks during training and testing, which better reflects naturally damaged images and helps avoid over-fitting.
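The mask-generation algorithm is not specified here, so the following is only one possible way to produce irregular masks with a controlled missing ratio, not the procedure used in the experiments. It stamps a square brush along a random walk until the requested fraction of pixels is missing; the function name, the brush size, the jump probability and the random-walk strategy are all assumptions.

```python
import numpy as np

def random_irregular_mask(size=256, target_ratio=0.2, brush=9, seed=None):
    """Random-walk brush mask: 1 marks known pixels, 0 marks missing pixels."""
    rng = np.random.default_rng(seed)
    mask = np.ones((size, size))
    half = brush // 2
    y, x = (int(v) for v in rng.integers(0, size, size=2))
    while 1.0 - mask.mean() < target_ratio:
        # Stamp a square brush at the current position.
        mask[max(y - half, 0):y + half + 1, max(x - half, 0):x + half + 1] = 0.0
        if rng.random() < 0.05:
            # Occasionally jump to start a new, disconnected stroke.
            y, x = (int(v) for v in rng.integers(0, size, size=2))
        else:
            # Otherwise take a random step and stay inside the image.
            dy, dx = (int(v) for v in rng.integers(-brush, brush + 1, size=2))
            y = min(max(y + dy, 0), size - 1)
            x = min(max(x + dx, 0), size - 1)
    return mask

mask = random_irregular_mask(size=64, target_ratio=0.3, brush=5, seed=0)
missing_ratio = 1.0 - mask.mean()
```

The random jumps produce discontinuous holes, while long uninterrupted walks produce large continuous ones, so both kinds of damage discussed in this paper can appear in the same generator.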

Implementation Details
Our proposed framework is implemented in TensorFlow. The input images are 256 by 256 and the batch size is 4. Our model is optimized by the Adam algorithm with a learning rate of $1 \times 10^{-4}$, $\beta_1 = 0.5$ and $\beta_2 = 0.9$. We train our model on an NVIDIA 2070 GPU (8 GB) and an NVIDIA 2080Ti GPU (11 GB). To stabilize training, we divide it into two steps. Specifically, we train our model without the adversarial loss for the first 20 epochs and add it for the last 10 epochs. We find that a larger weight on the MRF loss causes incorrect content to propagate, whereas a smaller weight yields over-smoothed and blurry results. Motivated by this observation, we set the trade-off parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ to 20, 1 and 0 in the first step. In the second step of training, we deploy the adversarial network and set $\lambda_1$ to 10, $\lambda_2$ to 1 and $\lambda_3$ to 1. The masked image includes missing regions of varying number, size, shape and location in every iteration. In particular, the proportion of arbitrarily damaged areas varies from 0% to 40% during training. In addition, we experimentally set the number of DFBs to 8. It takes 2 s to predict missing regions of any shape and 1 day to train 20 epochs on 28,000 high-resolution images.
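The two-step schedule of loss weights can be written down compactly. The sketch below only encodes the weights and epoch counts stated above; the function names `loss_weights` and `total_loss`, the zero-based epoch index and the scalar loss values in the example are assumptions made for illustration.

```python
def loss_weights(epoch):
    """Return (lambda_1, lambda_2, lambda_3) for a zero-based epoch index."""
    if epoch < 20:
        # Step 1: reconstruction and ID-MRF losses only, no adversarial term.
        return 20.0, 1.0, 0.0
    # Step 2: the adversarial loss is switched on for the remaining epochs.
    return 10.0, 1.0, 1.0

def total_loss(l_re, l_mrf, l_adv, epoch):
    """Weighted sum of the reconstruction, ID-MRF and adversarial losses."""
    w_re, w_mrf, w_adv = loss_weights(epoch)
    return w_re * l_re + w_mrf * l_mrf + w_adv * l_adv

# Hypothetical loss values, evaluated in each training step.
first_step = total_loss(0.1, 0.5, 2.0, epoch=0)
second_step = total_loss(0.1, 0.5, 2.0, epoch=25)
```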

Comparative Experiments
We perform qualitative and quantitative comparisons with several state-of-the-art methods, including contextual attention (CA) [11], partial convolution (PConv) [13], GatedConv (GC) [14], EdgeConnect (EC) [32], Pluralistic Image Completion (PIC) [33] and Region-Wise Conv (RWC) [15]. For a fair evaluation, we carry out the experiments on discontinuous and continuous missing regions, and every test image has a corresponding mask. For CA, GC and RWC, we train the models on the CelebA-HQ and Paris StreetView datasets with the released code. For EC and PIC, we directly adopt the pre-trained models, as our own training runs did not perform as well as the released results. As for PConv, since no official code is public, we refer to the implementation on GitHub (https://github.com/MathiasGruber/PConv-Keras, accessed on 15 December 2020) and follow the authors' suggestions for training. To make a comprehensive and objective comparison, we explore the effect of mask ratios ranging from 0% to 40%, which is consistent with the training phase. As Figures 3 and 4 show, the inpainting results of CA suffer from visible distortions and inconsistency because it was originally designed for restoring regular holes. Among the remaining algorithms, EC reconstructs images with more accurate and intact structures when the missing region is narrow, but it still exhibits artifacts compared to the ground truth. PIC is designed to generate reasonable and diverse results; hence it has difficulty approximating the true distribution of images. Although PConv and GC are designed to deal with irregular missing regions, they fail to repair some structural information, such as eyes and some details of buildings. RWC is designed to cope with arbitrarily masked images.
It successfully repairs content when the missing region is discontinuous, but produces strong grid-like artifacts and incomplete structures in large continuous missing parts. For face images, our model is able to repair missing glasses, synthesize symmetrical eyes and predict vivid results. For the Paris StreetView dataset, our method exhibits superior performance with more intact structures and finer details. From the above discussion, we conclude that our model achieves competitive results with fine-grained details, pleasing textures and consistent structures, owing to the combination of the ID-MRF loss, the dense fusion blocks and the contextual attention module.

Quantitative Comparison
We evaluate our model with five commonly used image quality metrics: the L1 loss, the L2 loss, the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) and the Fréchet Inception Distance (FID). Specifically, PSNR measures the difference in pixel values between two images, and SSIM measures the similarity between the reconstructed image and the original image. Larger PSNR and SSIM values indicate smaller gaps between the generated image and the ground truth. The L1 loss reflects how closely the generated image matches its real counterpart, and the L2 loss is the mean squared error. FID computes the Wasserstein-2 distance between the distributions of real and generated images, using features from the pre-trained Inception-V3 model [34].
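As a rough illustration of the three pixel-wise metrics, the following sketch computes the L1 error, the L2 error (mean squared error) and PSNR for two images given as flat sequences of pixel values. It is a minimal sketch under stated assumptions rather than the evaluation code behind the reported tables: the function names are hypothetical, both errors are assumed to be averaged over all pixels, and the peak value defaults to 255 for 8-bit images. SSIM and FID are omitted because they need windowed statistics and a pre-trained network, respectively.

```python
import math
from typing import Sequence


def l1_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized images."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = l2_error(pred, target)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is a decreasing function of the mean squared error, a lower L2 error always corresponds to a higher PSNR for a fixed peak value, which is why the two metrics tend to rank methods in the same order.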
Tables 1 and 2 list full comparisons of all discussed methods in terms of the five metrics under different ratios of irregular masks. As the tables illustrate, it is more difficult to repair the missing regions on the Paris StreetView dataset: its PSNR and SSIM values are generally lower than those on the CelebA-HQ dataset, and its L1 loss, L2 loss and FID scores are higher. For discontinuous damaged regions, our method shows a relative improvement on all indicators. CA shows a competitive PSNR value at a mask ratio of 0-10%, but its performance degrades strongly as the damaged regions grow. PConv, EC and PIC show almost the same performance on the two datasets and obtain inferior indicators compared with GC, RWC and ours, since they lack a deep understanding of semantic information and of the correlation between existing regions and surroundings. The quantitative results demonstrate that our model achieves better scores in most cases. It is worth noting that the average PSNR value increased by 1.28 dB for continuous missing regions on the Paris StreetView dataset.
Table 1. Quantitative comparisons on discontinuous missing regions, where bold indicates the best performance and underline denotes the second-best result; + indicates that higher is better, while − indicates that lower is better.

Ablation Studies
In this section, we analyze the contribution of each component of our proposed model to the final performance by presenting the inpainting results quantitatively and qualitatively. First, we compare the results obtained under the L1 and the L2 reconstruction losses. Then we present the results obtained with the correlation loss (CL), the style loss (SL), the ID-MRF loss (IM) and the combination of IM and the adversarial loss (IM+AD). Subsequently, we use IM+AD as the baseline (BL) and append the attention module to BL (BL+AT) to validate its effectiveness. Finally, we replace the cascaded dilated convolution in BL with several dense fusion blocks (DFB) to identify their role in the whole model; this configuration is denoted BL(Rcdc)+DFB.
It can be seen from Figures 5 and 6 that the L2 reconstruction loss suffers from more severe blurring and shadow-like artifacts than the L1 loss. Moreover, CL incurs obvious color discrepancies, and SL tends to produce images containing more high-resolution details that are inconsistent with the non-missing regions, such as grid-like and aliasing artifacts. The reason for this phenomenon may be that those losses lack high-level semantic features and a reasonable extraction of spatial information to guide image synthesis. With the ID-MRF loss, the grid-like artifacts are alleviated, but over-smooth and blurry results are obtained in large missing regions. The adversarial loss further mitigates the blurriness, but the results are still far from meeting the visual requirements. Moreover, pleasing contents and reasonable structures cannot be guaranteed by using either AT or DFB alone. When the two are integrated, the repaired images show an obvious visual enhancement in structures and textures.
As shown in Tables 3 and 4, the L2 loss performs far worse than the L1 loss on all five indicators, which is consistent with the visual appearance. Compared with the correlation loss and the style loss, most of the metrics are improved by a large margin under the ID-MRF loss constraint. This indicates that a loss based on the non-local mechanism is more appropriate for improving the relevance between hole and background regions than counterparts based on pixel-wise correlations. By gradually embedding the adversarial loss and the attention module and using the BL(Rcdc)+DFB module, the performance of our model improves steadily. In particular, the combination of CAM and DFB achieves better performance than either of them alone. We attribute this to the joint effect of the enlarged receptive field and the reconstruction of non-local relevant features.
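The receptive-field enlargement mentioned above can be made concrete with a short calculation. For a sequential stack of stride-1 convolutions with kernel size k, each layer with dilation rate d widens the receptive field by (k − 1)·d pixels. The sketch below applies this rule; the function name and the dilation rates (1, 2, 4, 8) are hypothetical, since the rates used inside a DFB are not stated in this section, and a plain sequential stack is only an approximation of the dense combination used in the DFB.

```python
from typing import Iterable


def receptive_field(kernel_size: int, dilations: Iterable[int]) -> int:
    """Receptive field (in pixels, along one axis) of stacked stride-1
    convolutions sharing one kernel size, one dilation rate per layer."""
    if kernel_size < 1:
        raise ValueError("kernel_size must be positive")
    field = 1  # a single pixel sees only itself before any convolution
    for rate in dilations:
        if rate < 1:
            raise ValueError("dilation rates must be positive")
        field += (kernel_size - 1) * rate
    return field


# Four ordinary 3x3 layers versus four dilated 3x3 layers
# (rates below are illustrative, not taken from the paper).
plain = receptive_field(3, [1, 1, 1, 1])    # 9
dilated = receptive_field(3, [1, 2, 4, 8])  # 31
```

With the same number of layers and parameters, the dilated stack covers a 31-pixel span instead of 9, which illustrates why dilated convolutions are attractive for providing wider context when filling large missing regions.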

Conclusions and Future Work
This paper combines two non-local mechanisms, the ID-MRF loss and the contextual attention module (CAM), with a multi-scale method named the dense fusion block (DFB), which relies on the dense connection of dilated convolutions. Through the interaction of these mechanisms, our model can repair both large continuous and discontinuous missing regions. Specifically, the ID-MRF loss suppresses the color discrepancies and grid-like artifacts caused by the correlation loss and the style loss in image inpainting. On this basis, we integrate the CAM with the DFB to predict high-quality results with finer details. The former captures long-term spatial information by borrowing or copying feature information from known background patches, and the latter extracts multi-scale features by enlarging receptive fields. Experimental results demonstrate that our proposed model outperforms state-of-the-art methods both quantitatively and qualitatively.
Although the DFBs improve the performance of our model, their contribution is still limited, and they require considerable computational resources because of their densely connected structure. Hence, further improvements can be achieved by reducing the number of network parameters. In addition, the receptive-field-aware module [35] demonstrates a strong capability of enlarging the receptive field in image segmentation, and combining this technique with our model would be meaningful work. Thus, future research will focus on how to apply the receptive-field-aware framework to image inpainting.
Author Contributions: X.H. and Y.Y. proposed the research idea of this paper. X.H. was responsible for the experiments, data analysis and interpretation of the results. Y.Y. was responsible for the verification of the research plan. The paper was mainly written by X.H. and the manuscript was revised and reviewed by Y.Y. All authors have read and agreed to the published version of the manuscript.