Semantic Image Inpainting with Multi-Stage Feature Reasoning Generative Adversarial Network

Most existing image inpainting methods have achieved remarkable progress on small image defects. However, repairing large missing regions with insufficient context information remains an intractable problem. In this paper, a Multi-stage Feature Reasoning Generative Adversarial Network (MFR-GAN) is proposed to gradually restore irregular holes. Specifically, dynamic partial convolution is used to adaptively adjust the restoration proportion during the inpainting process, which strengthens the correlation between valid and invalid pixels. In the decoding phase, the statistics of features in the masked areas differ from those in the unmasked areas. To this end, a novel decoder is designed which not only dynamically assigns a scaling factor and bias on a per-feature-point basis using point-wise normalization, but also utilizes skip connections to mitigate the information loss between the encoder and decoder layers. Moreover, in order to eliminate gradient vanishing and increase the number of reasoning steps, a hybrid weighted merging method consisting of a hard weight map and a soft weight map is proposed to ensemble the feature maps generated during the whole reconstruction process. Experiments on CelebA, Places2, and Paris StreetView show that the proposed model generates results with a PSNR improvement of 0.3 dB to 1.2 dB compared to other methods.


Introduction
Image inpainting aims to reconstruct the missing regions of damaged images and make the repaired image reasonable in both structure and texture. This technology shows promising performance in many applications, such as image restoration, error concealment and photo retouching [1][2][3]. Traditional inpainting methods [4][5][6][7] can usually synthesize relatively reasonable stationary textures. However, without a semantic understanding of images, they cannot generate visually realistic content when the scenes are complex.
Recently, methods based on the encoder-decoder architecture [8][9][10][11][12][13] have greatly improved inpainting performance; the U-Net structure in particular has demonstrated a strong ability to generate detailed images. These methods encode the input image into a latent high-level feature space, and then decode it back to low-level pixels to fill the missing area in one shot. They are more suitable for filling images with small holes, since the pixels within a local area are strongly correlated. However, as the damaged regions become larger, the model lacks effective features to infer the missing contents, and one-shot filling generates semantically ambiguous results. An alternative solution is progressive inpainting. These methods divide the whole restoration process into several phases, each of which employs the information of previous phases as clues to restore the missing area step by step. For example, PGN [14] progressively inpaints the missing regions from the hole boundary to the center, but reconstruction at the image level suffers from high computational cost and information distortion. In order to reduce the amount of calculation and strengthen the connection between features, RFR-Net [15] proposes to share the attention scores of adjacent recurrences. However, several limitations still affect the performance of existing progressive inpainting solutions; they are summarized as follows. Firstly, current methods ignore the fact that the hole regions gradually shrink. Because the receptive field used to update the mask is fixed across phases, the correlation between valid and invalid pixels is weakened. Secondly, the mean and variance of pixels in the valid regions can differ from those of the hole regions, so using batch normalization results in covariate shift. Finally, progressive image inpainting repairs the image through multiple recursions, so early generated signals become corrupted after long-term propagation, and using adaptive average merging degrades the quality of the generated features.
To address these issues, a novel image inpainting framework named Multi-stage Feature Reasoning Generative Adversarial Network (MFR-GAN) is proposed in this paper, which generates more realistic and visually pleasing results, as shown in Figure 1. Because the missing regions gradually narrow during progressive inpainting, we use dynamic partial convolution, which regulates the restoration scale according to the extent of the damaged areas. By this means, the correlation between known and unknown pixels is enhanced. In the process of progressive image inpainting, the input features are encoded and decoded multiple times. Since the mean and variance of valid and invalid pixels are different, we design a new decoding structure which not only leverages point-wise normalization instead of batch normalization, but also uses skip connections to minimize the loss of context information during decoding. To avoid the covariate shift mentioned above, point-wise normalization adaptively assigns scale factors and biases to each feature point in the up-sampling process. When the reconstruction process is completed, the intermediate information would be ignored if the feature map of the last recurrence were used directly as the final result. Moreover, pixels of the recovered regions are changed during subsequent inpainting steps, so it is difficult to guarantee that correct clues are always synthesized in intermediate restorations. If wrong information is generated at a certain step, it is inherited and becomes worse in the following steps. To tackle these issues, we propose a hybrid weighted merging method, which consists of a hard weight map and a soft weight map. The hard weight map is obtained by analyzing mask characteristics and enhances the influence of signals generated in the earlier reconstruction phases. Moreover, the model should be learnable and able to pay attention to certain areas of a reconstructed feature map. To this end, the soft weight map is designed to achieve more realistic content. For the image discriminator, we use the PatchGAN [16] architecture, which enables the model to pay more attention to image details during training.
The main novelties and contributions of this work can be summarized as follows: (1) We design a novel Multi-stage Feature Reasoning Generative Adversarial Network, which adaptively adjusts the inpainting scope in the recursive process through dynamic partial convolution, and leverages point-wise normalization to avoid the covariate shift caused by batch normalization. (2) In the image fusion phase, we propose a hybrid weighted merging method that accurately merges the feature maps generated in each recurrence. By this means, we eliminate the problems of gradient vanishing and of the destruction of content generated in previous recurrences. (3) Experiments on the benchmark datasets show that our MFR-GAN effectively boosts inpainting performance and generates semantically reasonable content.
The rest of the paper is organised as follows. Section 2 introduces the related work of image inpainting; Section 3 describes methods of the present study; Section 4 is concerned with the main results and the ablation study; Section 5 presents the conclusions.

Related Work
Traditional image inpainting methods fall mainly into two categories: patch-based and diffusion-based. The patch-based inpainting methods [17] fill missing areas by calculating the similarity between patches and transferring similar areas from the background to the hole. The diffusion-based inpainting methods [18] propagate neighboring information into the corrupted areas. Owing to their limited capability and lack of semantic understanding of the image, these methods suffer from blurring artifacts when restoring relatively large regions.
Recently, deep learning based methods [19][20][21][22][23][24][25][26] have improved the capability of models to repair complex semantic environments. Context-Encoder [27] firstly employed the deep learning based method, which adopted an encoder-decoder based structure and used GAN [28] for image restoration. Shift-Net [29] introduced a special shift connection layer with the U-Net structure to fill arbitrary masked regions. PEN-Net [30] filled the holes from low resolution to high resolution with a U-shaped pyramid structure to boost the inpainting result. PConv [31] only used valid pixels to infer corrupted regions. GConv [12] further generalized partial convolution to gated convolution that learns to select features for feature maps at each level. Iizuka et al. [32] used a global and local discriminator for adversarial training to obtain coherent filling content. Liu et al. [33] designed region normalization to eliminate the influence of damaged regions on normalization. These methods cannot effectively settle the problems of semantic ambiguity, because they try to reconstruct the entire target without a strong correlation between the hole center pixels and the hole boundary pixels. SPG-Net [34] factorized image restoration into segmentation prediction and guidance. EdgeConnect [9] utilized the hallucinated edge of the missing area for restoration to ensure structural consistency. Similarly, Xiong et al. [35] used contour and image completion to gradually recover the missing regions. StructureFlow [36] consisted of a texture and structure generator. The structure reconstructor removed high-frequency textures to restore the global structure, and then used appearance flow to synthesize image details. Li et al. [37] designed the visual structure reconstruction layer to restore part edges of a missing area for assisting the completion tasks. Yu et al. [13] devised the contextual attention and leveraged a coarse-to-fine framework to restore damaged images. 
These methods try to guide image inpainting by adding structural constraints, but they still lack adequate information to reconstruct the central area of the hole. Zhang et al. [14] leveraged a U-Net generator with LSTM to concatenate all sub-tasks, and progressively filled the image with the corresponding output sequence. Guo et al. [38] used continuous full-resolution residual blocks to directly fill the missing area of the original-size image. Li et al. [15] employed Knowledge Consistent Attention to adaptively combine the attention scores of different recurrences to improve the accuracy of image restoration. Although these methods have achieved considerable progress, they still suffer from the limitations described in the introduction.

Figure 2 shows the overall architecture of the Multi-stage Feature Reasoning Generative Adversarial Network, whose inputs are the damaged image $X_{in}$ and the corresponding binary mask $M_{in}$, which indicates the missing regions. The proposed model is composed of three components: a feature generator to fill the holes in the feature maps, a feature merging model to accurately fuse the pixels synthesized in every recurrence, and a discriminator used for detail generation. We empirically use two parallel encoders to acquire semantic information and global image structure information during the generation process. Firstly, we utilize dynamic partial convolution to identify the region to be reconstructed in each recurrence. Next, the above operations are performed repeatedly to generate feature maps of different inpainting stages. When the corrupted images are completely filled, the feature maps of each stage are fused by the hybrid weighted merging model to generate the repaired results. Finally, the repaired image is sent to the discriminator, which evaluates whether each patch belongs to the real or the fake distribution, so as to improve the quality of image inpainting.
Figure 2. Illustration of our framework. Firstly, we use dynamic partial convolution to update the input mask $m_l$, then use the inpainting network to generate the pixels for the missing areas. Next, the above operations are performed repeatedly to generate the feature maps of different inpainting stages. Finally, the feature maps of each stage are fused by hybrid weighted merging to generate the repaired results.

Hybrid Weighted Merging
After a specific number of recurrences, the corrupted image has passed through the feature generation module several times. If we directly use the feature map generated by the last recurrence, gradients may vanish and the intermediate features are lost. If average merging is adopted, the defective areas in early reconstructed images damage subsequently generated information. Because early-stage feature information is propagated farther in subsequent steps to infer the central content, erroneous information may be generated. Therefore, the signals generated in early stages at the same location should be more deterministic, and the influence of signals generated in later recurrences should be decreased. However, adaptively averaging the valid pixels affects the early generated feature information. To address this problem, we propose a hybrid weighted merging method to fuse the feature maps generated during the hole restoration process. It is composed of hard weighting and soft weighting, as shown in Figure 3. For hard weighting, on the premise of fusing non-hole region feature values, the model adaptively generates the weight proportion for the current recurrence feature map according to the generation order of valid pixels. Firstly, the feature map is divided as shown in the following formula:
where $mask_i$ is the mask after the $i$-th recurrence update and $mask_{i-1}$ is the mask after the $(i-1)$-th recurrence update. Then, a weight map $W_i$ is constructed for each recursively generated feature map $F_i$:
where $i$ is the recurrence index, $j$ indexes the regions, and $mask^j$ is the $j$-th region of $mask_i$. $W_i^j$ represents the $j$-th region of the weight map corresponding to the $i$-th recursive feature map $F_i$, and $S$ is the number of feature maps generated in this inpainting process. Then, we use the softmax function to generate the proportion $w_{x,y}^i$ of each component for the pixel at position $(x, y)$. Let $f_{x,y,z}^i$ represent the feature value of feature map $F_i$ at location $(x, y, z)$; the value of the output feature map $\bar{F}$ at location $(x, y, z)$ can be expressed as follows:
The hard weight map directly generates the weights by analyzing the mask without a learning process, which would limit the network's performance. To this end, we propose a soft weight map to assist the hard weighting and achieve better inpainting results. The soft weight map is an adaptive map learned from the input feature and the average-fused output feature map. As shown in Figure 4, we concatenate $F_i$ and the input feature map $F_{in}$ to obtain a soft weight map:
where $\sigma$ is the sigmoid activation function. The value of the feature map at location $(x, y)$ after soft weighting can be expressed as follows:
where $\delta$ represents a Leaky ReLU activation function. By fusing feature maps in the above manner, gradient vanishing can be effectively avoided, improving the ability of MFR-GAN to restore large damaged areas.
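As a rough illustration of the hard-weighting step, the following Python sketch normalizes per-recurrence weight maps with a softmax at each pixel and fuses the feature maps with the resulting proportions. The weight maps are taken as given inputs, since their construction formula is not reproduced in the text. The function names `softmax` and `fuse_feature_maps`, the single-channel list-of-lists layout, and the weighted-sum fusion are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_feature_maps(feature_maps, weight_maps):
    """Fuse S single-channel feature maps with per-pixel softmax weights.

    feature_maps[i][y][x] is the feature value from recurrence i.
    weight_maps[i][y][x] is the unnormalized weight of recurrence i.
    Assumption: the fused value is the weighted sum over recurrences.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Proportion of each recurrence at this pixel.
            proportions = softmax([w[y][x] for w in weight_maps])
            fused[y][x] = sum(
                p * f[y][x] for p, f in zip(proportions, feature_maps)
            )
    return fused


# Two recurrences on a 1x2 map: equal weights at x=0, and a weight
# gap of ln(3) at x=1, which gives proportions 0.25 and 0.75.
maps = [[[2.0, 4.0]], [[6.0, 8.0]]]
weights = [[[0.0, 0.0]], [[0.0, math.log(3.0)]]]
fused = fuse_feature_maps(maps, weights)
```

With equal weights the fusion reduces to plain averaging, which is the baseline that the hard weight map is meant to improve on.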

Dynamic Partial Convolution
The significance of the restoration proportion is most noticeable when a model is applied to progressively inpaint the masked image. During the mask updating phase, in order to adaptively identify the area to be repaired in each recurrence, we introduce dynamic partial convolution. It first calculates the hole ratio according to the input mask, and the formula is as follows:
where $D$ is the scale factor, and $E_{1,w}$ and $E_{h,1}$ represent the $1 \times W$ row vector and the $H \times 1$ column vector whose elements are all 1. $\Psi_j^i$ represents the mask corresponding to the input image $F_{in}$, $W$ and $H$ are the width and height of $\Psi_j^i$, and $N$ represents the batch size. Next, the receptive field $\gamma$ of the convolution kernel is obtained from $D$; its formula is as follows:
After obtaining the value of $\gamma$, the next step is to update the area to be repaired. We set the stride to 1 and the padding to $(\gamma - 1)/2$, which ensures that the mask size is consistent with the feature map size; the formula is as follows:
Here, $f_{x,y,z}$ is the feature value at location $(x, y)$ of layer $z$; $W^T$ denotes the weight of the convolution layer filter, $f_{x,y}$ is the input feature patch of the current sliding window, $m_{x,y}$ is the input mask patch corresponding to $f_{x,y}$, $\mathbf{1}$ refers to an $H \times W$ matrix with all elements being 1, and $\mathrm{sum}(\mathbf{1})/\mathrm{sum}(m_{x,y})$ is the scale factor, which adjusts the output when the number of valid inputs to the convolution changes. The values of the new mask are expressed as:
After dynamic partial convolution, the updated mask $M_l$ and the feature map $F_l$ are sent to the feature generation module. The difference between the updated mask and the input mask is defined as the area to be repaired in this iteration, and the updated mask remains unchanged until the next recurrence.
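The effect of the kernel size $\gamma$ on the mask update can be sketched in plain Python. The code below is a minimal illustration under stated assumptions: masks use 1 for valid pixels and 0 for holes, the mask update and the renormalized response follow the standard partial convolution rule of [31] (bias omitted), and $\gamma$ is passed in directly because the mapping from the hole ratio $D$ to $\gamma$ is not reproduced in the text. All function names are hypothetical.

```python
def hole_ratio(mask):
    """Fraction of missing pixels in a 2-D mask (1 = valid, 0 = hole)."""
    total = sum(len(row) for row in mask)
    valid = sum(sum(row) for row in mask)
    return 1.0 - valid / total


def same_padding(gamma):
    """Padding (gamma - 1) / 2 that keeps the size fixed at stride 1."""
    if gamma % 2 == 0:
        raise ValueError("gamma must be odd for symmetric padding")
    return (gamma - 1) // 2


def update_mask(mask, gamma):
    """Mark a pixel valid if its gamma x gamma window holds any valid pixel."""
    pad = same_padding(gamma)
    height, width = len(mask), len(mask[0])
    updated = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window_sum = 0
            for dy in range(-pad, pad + 1):
                for dx in range(-pad, pad + 1):
                    yy, xx = y + dy, x + dx
                    # Positions outside the mask act as zero padding.
                    if 0 <= yy < height and 0 <= xx < width:
                        window_sum += mask[yy][xx]
            updated[y][x] = 1 if window_sum > 0 else 0
    return updated


def renormalized_response(features, mask_patch, weights):
    """Partial-convolution output for one flattened window, without bias."""
    valid = sum(mask_patch)
    if valid == 0:
        return 0.0  # no valid input: the location stays a hole
    raw = sum(w * f * m for w, f, m in zip(weights, features, mask_patch))
    # Scale factor sum(1) / sum(m) compensates for the masked-out inputs.
    return raw * len(mask_patch) / valid


# A 7x7 mask with a 5x5 hole in the middle.
mask = [[1] * 7 for _ in range(7)]
for y in range(1, 6):
    for x in range(1, 6):
        mask[y][x] = 0

small_step = update_mask(mask, gamma=3)  # hole shrinks from 5x5 to 3x3
large_step = update_mask(mask, gamma=5)  # hole shrinks from 5x5 to 1x1
```

A larger $\gamma$ repairs a wider band around the hole boundary in a single recurrence, which is why choosing it from the current hole ratio changes how quickly the hole closes.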

Feature Generation
A well-designed generator is vital to infer the missing content of the image. In order to fill the hole regions with high-quality features, two parallel encoders are used after down-sampling. The first encoder $E_A$ uses an attention mechanism to synthesize visually realistic textures, and the second encoder $E_D$ uses dilated convolution to collect the spatial features of the feature map. For encoder $E_A$, we use several convolution layers bridged by a skip connection, and apply Knowledge Consistent Attention to control the inconsistencies between adjacent attention feature maps. For encoder $E_D$, we directly stack four dilated convolution layers. After the input feature maps pass through $E_A$ and $E_D$, the outputs of the two encoders are concatenated and sent into a single decoder for up-sampling.

Attention Module
In image inpainting, the attention mechanism can search for possible textures in the background and use them to replace textures in unknown areas. It thus ensures that the filled contents are meaningful in both structure and texture. When the feature map $F_i$ is input into the attention module, we first calculate the cosine similarity between each pair of feature pixels:
where $sim_{x,y,x',y'}^i$ represents the similarity between the feature at background location $(x, y)$ and the feature at foreground location $(x', y')$. Then, utilizing the similarity of the target pixels in adjacent areas, we carry out $k \times k$ filtering to smooth the attention score:
After that, the softmax function is used to generate the attention score, which is denoted ${score'}_{x,y,x',y'}^{i}$. If the feature at position $(x, y)$ was valid in the previous recurrence, its attention score is adaptively combined with the present score to produce the score of the current recurrence: $score_{x,y,x',y'}^{i} = \lambda\, {score'}_{x,y,x',y'}^{i} + (1 - \lambda)\, score_{x,y,x',y'}^{i-1}$.
If the feature at $(x, y)$ was invalid in the previous recurrence, the attention score obtained in this recurrence is the final score: $score_{x,y,x',y'}^{i} = {score'}_{x,y,x',y'}^{i}$.
Next, the attention score is used to reconstruct the feature map $\hat{F}$. The feature value of $\hat{F}$ at $(x, y)$ is $\hat{f}_{x,y}$, and $\hat{f}_{x,y}$ is calculated as follows:
Finally, after concatenating the feature map $\hat{F}$ and the input feature map $F_{in}$, a pixel-wise convolution is performed to generate the reconstructed feature map $F'$.
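The score computation described in this subsection can be sketched as follows. This is a simplified illustration rather than the authors' code: it works on flat feature vectors, skips the $k \times k$ smoothing step, and treats `lam` (standing for $\lambda$, whose value is not given in the text) as a plain argument. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero features
    return dot / (norm_u * norm_v)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query, candidates):
    """Softmax-normalized similarity of one feature to each candidate."""
    return softmax([cosine_similarity(query, c) for c in candidates])


def combine_scores(current, previous, was_valid, lam):
    """Blend with the previous recurrence only where the pixel was valid."""
    if not was_valid:
        return list(current)
    return [lam * c + (1.0 - lam) * p for c, p in zip(current, previous)]


scores = attention_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
blended = combine_scores(scores, [0.5, 0.5], was_valid=True, lam=0.5)
```

Blending with the previous recurrence pulls the score toward what was already inferred, while pixels that were still holes in the previous recurrence rely on the fresh score alone.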

Point Wise Normalization
After the recovery decoder, the generated feature maps are sent for up-sampling. We infer that the statistical characteristics of pixels in the hole regions and in the non-hole regions are different. Traditional batch normalization ignores this difference and causes covariate shift. To tackle this issue, we utilize point-wise normalization in the decoding phase to dynamically produce the mask-aware scale and bias of batch normalization. The input feature is first normalized in a channel-wise manner, and then modulated with the learned scale and bias:
where $f_{x,y,z}$ is the feature before normalization, and $\mu_z$ and $\sigma_z$ are the mean and standard deviation of the activations in channel $z$.
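A minimal sketch of this normalize-then-modulate step is given below for a single channel of a single sample. It assumes the common form in which the normalized value is multiplied by a per-point scale and shifted by a per-point bias; the exact equation is not reproduced in the text, so this form, the `eps` stabilizer, and the name `pointwise_normalize` are illustrative assumptions. In the model the scale and bias maps would be learned, whereas here they are supplied by hand.

```python
import math


def pointwise_normalize(channel, scale, bias, eps=1e-5):
    """Normalize one channel, then apply a per-point scale and bias.

    channel, scale and bias are 2-D lists of the same shape.
    Assumed form: scale[y][x] * (f - mean) / std + bias[y][x].
    """
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)
    return [
        [scale[y][x] * (v - mean) / std + bias[y][x]
         for x, v in enumerate(row)]
        for y, row in enumerate(channel)
    ]


# The two points share the channel statistics (mean 2, std about 1)
# but receive different scale and bias values.
normalized = pointwise_normalize(
    channel=[[1.0, 3.0]],
    scale=[[2.0, 1.0]],
    bias=[[0.0, 0.5]],
)
```

Because the scale and bias vary per location, points inside and outside the hole can be modulated differently even though they share the same channel statistics.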

Loss Function and Model Architecture
The entire training procedure is illustrated in Algorithm 1. We use a PatchGAN [16] discriminator for image restoration learning. The PatchGAN discriminator computes the adversarial loss on the output of the generator. The loss function consists of L1 loss, perceptual loss, style loss, adversarial loss and total variation (TV) loss. The L1 loss ensures the accuracy of feature map reconstruction. Given the binary mask with zeros indicating missing pixels, we define the L1 loss as follows:
where $I_{gt}$ and $I_{out}$ are the ground-truth image and the output of the network, and $N_{I_{gt}}$ is the total number of elements in the image, which equals $C \times H \times W$. The perceptual loss proposed by Gatys et al. [39] is used to force the filled image and the ground-truth image to have similar feature representations. It can be written as follows:
where $\Psi_n$ represents the $n$-th feature layer selected from the fixed VGG network, and $I_{com}$ is composed of the hole pixels of the raw output image and the non-hole pixels of the ground-truth image. $N_{\Psi_n^{I_{gt}}}$ is the number of elements in $\Psi_n^{I_{gt}}$ and is used as a normalization factor. The VGG-based [40] style loss is similar to the perceptual loss, except that the autocorrelation of each feature map is calculated before measuring the L1 distance. The style loss is computed as follows:
In (19) and (20), $K_n$ is the normalization factor for the $n$-th selected layer, equal to $1/(C_n H_n W_n)$. The final loss term, the total variation loss, is expressed as follows:
where $N_{I_{com}}$ represents the number of pixels in $I_{com}$. The total loss $L_{total}$ is the combination of the above loss functions:
$$L_{total} = \lambda_{valid} L_{valid} + \lambda_{hole} L_{hole} + \lambda_{perc} L_{perc} + \lambda_{style} L_{style} + \lambda_{adv} L_{adv} + \lambda_{tv} L_{tv}. \tag{22}$$
In this paper, the weight parameters of each loss function in Equation (22) are set as follows: 1 for $\lambda_{valid}$, 6 for $\lambda_{hole}$, 0.1 for $\lambda_{tv}$, 120 for $\lambda_{style}$, 0.1 for $\lambda_{adv}$ and 0.05 for $\lambda_{perc}$.
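To make the weighting in Equation (22) concrete, the sketch below computes the two masked L1 terms on flattened pixel lists and combines loss terms with the weights listed above. The split into a valid term and a hole term normalized by the total element count follows the usual partial convolution formulation [31] and is an assumption here, as are the function names. The perceptual, style, adversarial and TV terms are passed in as precomputed numbers.

```python
# Weights taken from the values stated for Equation (22).
LOSS_WEIGHTS = {
    "valid": 1.0,
    "hole": 6.0,
    "perc": 0.05,
    "style": 120.0,
    "adv": 0.1,
    "tv": 0.1,
}


def masked_l1_losses(output, target, mask):
    """Return (L_valid, L_hole) for flattened images.

    mask uses 1 for valid pixels and 0 for missing pixels. Both terms
    are divided by the total number of elements (assumed convention).
    """
    count = len(target)
    valid = sum(m * abs(o - t) for o, t, m in zip(output, target, mask))
    hole = sum((1 - m) * abs(o - t) for o, t, m in zip(output, target, mask))
    return valid / count, hole / count


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


l_valid, l_hole = masked_l1_losses(
    output=[1.0, 2.0, 3.0, 4.0],
    target=[1.0, 1.0, 1.0, 1.0],
    mask=[1, 1, 0, 0],
)
loss = total_loss({"valid": l_valid, "hole": l_hole,
                   "perc": 0.0, "style": 0.0, "adv": 0.0, "tv": 0.0})
```

With these weights an error inside the hole costs six times as much as the same error in the valid region, which pushes the generator to prioritize the missing content.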

Experiments
This section first introduces the detailed experimental settings. We then compare our model with other methods in terms of both visual quality and quantitative measurements to demonstrate the effectiveness of the proposed method. Finally, we conduct an ablation study to examine the design details of MFR-GAN.

Datasets
We used three well-known public image datasets and a mask dataset [31]

Training Settings
For MFR-GAN, we used the Adam optimizer to optimize the generator and the discriminator. We trained the model with a batch size of 6 on two NVIDIA RTX 2080 Ti GPUs (11 GB each). At the beginning, we trained the model with a learning rate of $1 \times 10^{-4}$. We then set the learning rate to $1 \times 10^{-5}$ to fine-tune the model, and kept it unchanged until the model converged. For CelebA, we first trained for 200 K iterations, then fine-tuned for 500 K iterations until convergence. PyTorch was used as the deep learning framework, and the CUDA version was v10.0.

Comparison Method
We compared our model with several recent state-of-the-art methods: PConv [31], GConv [12], EdgeConnect (EC) [9], LBAM [44] and RFR-Net [15]. For LBAM, EdgeConnect, RFR-Net and GConv, we directly used the officially released pretrained models. Since the source code of partial convolution was not available, we implemented it with the experimental settings given in the paper. Compared with PConv, GConv, LBAM and EC, the proposed method strengthens the constraint on the center of the hole by progressively reasoning about the content of the hole from its edge. Compared with RFR-Net, the proposed method addresses covariate shift by using point-wise normalization, and introduces a hybrid fusion method to make full use of the feature maps generated at each stage, which effectively improves the ability of the model to repair large holes.

Qualitative Comparisons
For qualitative comparisons, we compared the irregular-hole inpainting results of our method with five existing methods on the Places2, CelebA and Paris StreetView datasets. Figure 4 shows that there are varying degrees of blurry boundaries and distorted structures when PConv and EdgeConnect repair natural images. Moreover, when PConv repairs semantically complex natural images, the color distortion is relatively serious, and neither structural connectivity nor smooth and reasonable color can be achieved in the generated content. GConv preserves the source image contents well, but there are still color inconsistencies in some areas of the result. The result of LBAM suffers from center blur due to the lack of information for restoring pixels deep inside the holes. Although RFR-Net can generate meaningful content by leveraging the learned intermediate signals for further restoration, the results still have unreasonable textures. In contrast, the results generated by our model have reasonable semantics and visual authenticity. Figure 5 shows the inpainting results on CelebA; the content synthesized by PConv, LBAM and GConv is relatively vague, and the color of some areas differs from the original image. When the holes are relatively large, the image edges generated in the first stage of EdgeConnect include erroneous information, which leads to the failure to generate a correct structure in the second stage. RFR-Net can generate a plausible structure, but the results still contain unreasonable textures. Through the flexible combination of point-wise normalization and skip connections during the decoding process, our model makes full use of contextual feature information and generates content with reasonable semantics and rich details.
The Paris StreetView dataset contains images with highly complex structures. As shown in Figure 6, when too much valid information is missing, there are obvious artifacts in the inpainting results of PConv and GConv. LBAM, EC and RFR-Net can generate natural structures, but the results still contain content that is not smooth. Compared with the above methods, the texture of the inpainted regions is more natural in our results. These results suggest that our method can learn to synthesize better signals by making full use of mask information to gradually fill the missing contents. Besides, in the feature fusion phase, in addition to eliminating gradient vanishing, the hybrid weighted merging method enables the model to adaptively fuse the feature maps with a learning process, which greatly enhances the reasoning capability of the model.

Quantitative Comparisons
Compared with other datasets, Places2 contains more scenarios, so it can better verify the authenticity of different methods. For the evaluation metrics, we use the peak signal-to-noise ratio (PSNR), which is based on the pixel-wise L2 (mean squared) error. The larger the PSNR value, the better the image restoration. Usually, when the PSNR value is greater than 28, there is no significant difference in image quality. SSIM is used to measure structural similarity; its value ranges between 0 and 1, and a higher value means less image distortion. The Fréchet Inception Distance (FID) measures the Wasserstein-2 distance between the feature distributions of generated and real images. A lower FID score indicates that the two sets of images are more similar, and a score of zero, the best case, indicates that the two sets of images are the same. We use the same irregular masks as PConv for testing. The masks are divided into six categories according to the proportion of holes. As shown in Table 1, when the missing ratio is (0.2, 0.3], the average PSNR of the repair results is 26.23 dB, the SSIM is 0.922, and the FID is 12.79. When the hole ratio is (0.5, 0.6], the PSNR is 19.53 dB, the SSIM is 0.669, and the FID is 34.27. Although the effective information is insufficient, our model can still generate clear content through multiple inferences, as shown in Figure 7 and Table 1. This further validates the effectiveness of our method.
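For reference, PSNR can be computed from the mean squared error as in the sketch below. It uses the standard definition $10 \log_{10}(MAX^2 / MSE)$ on flattened pixel lists; the function name `psnr` and the default peak value of 255 for 8-bit images are assumptions for illustration, and the evaluation code used for the reported numbers may differ in details such as color handling.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# An error of one gray level per pixel versus four gray levels per pixel.
close_match = psnr([10.0, 20.0, 30.0], [11.0, 21.0, 31.0])
poor_match = psnr([10.0, 20.0, 30.0], [14.0, 24.0, 34.0])
```

Because the mean squared error sits in the denominator, doubling the per-pixel error lowers the PSNR by about 6 dB, so the sub-1 dB differences between methods correspond to fairly small changes in average error.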

Ablation Study
In this section, we conduct ablation experiments on the Places2 dataset, and illustrate the effectiveness of dynamic partial convolution, hybrid weighted merging and point-wise normalization, as well as the effect of the recurrence number, based on PSNR, SSIM, and FID.
Dynamic partial convolution. As demonstrated in Figure 8, we visualize the changes of the mask during the whole restoration process. The first and third rows show the mask variation of our model during the inpainting process, and the second and fourth rows show that of RFR-Net. The hole changes more smoothly in our method than in RFR-Net. Specifically, our model has learned to dynamically adjust the size of the receptive field according to the mask ratio when updating the mask, which is beneficial for strengthening the connection between corrupted-area pixels and valid pixels during progressive completion.

Hybrid weighted merging
The hybrid weighted merging method effectively increases the number of reasoning steps. For large holes, the model can perform multiple inferences to generate more realistic content. Figure 9 shows the visualization of the hard weight map and the soft weight map. The weights of the hard weight map are determined by the order in which the signals are generated, as shown in Figure 9a; because the center is repaired in later recurrences, the weight gradually increases from the hole boundary to the center. As shown in Figure 9b, the weight distribution of the soft weight map is uniform, and the weights are adaptively changed by borrowing information from the input features. BN+G AW is a U-Net architecture using average merging, which achieves a PSNR of 21.83 dB. We replace the average merging with hybrid weighted merging and name the result BN+G HW. As shown in Table 2, under the same training strategies and experimental settings, BN+G HW achieves an average PSNR of 21.89 dB, SSIM of 0.808 and FID of 23.82, so the quantitative scores improve after using the hybrid weighted merging method. As shown in Figure 10, we compared the hybrid weighted merging method with other feature fusion methods on the Paris StreetView dataset, which has more repetitive structures such as gates. It can be observed that the images generated by our method have clear textures and consistent contextual structures.
Point wise normalization. The results in the third row and fourth row demonstrate the performance with point-wise normalization. Compared to the batch normalization counterpart, the PSNR and SSIM are significantly improved, which means that point-wise normalization is able to capture the discrepancy between valid and invalid pixels during decoding, and to learn adaptive scale and bias parameters that assist batch normalization.
The effect of iteration numbers. As a hyper-parameter, IterNums is set to seven in our model. In order to demonstrate the influence of different IterNums, we conduct experiments on the Places2 dataset with large continuous holes. As shown in Table 3, the results for IterNums 5 differ markedly from the others, because it is difficult to completely restore large continuous holes with a small IterNums. Too many recurrences are also not conducive to image restoration, because features are propagated too far.

Conclusions
In this paper, we propose a novel Multi-stage Feature Reasoning Generative Adversarial Network that uses recurrent filling to restore arbitrary image defects. The hybrid weighted merging method fuses the feature maps based on mask characteristics and a learning process; it takes full advantage of the signals generated in every recurrence and thus eliminates gradient vanishing and increases the number of reasoning steps of the model. Through dynamic partial convolution, the image restoration range is adaptively adjusted according to the mask ratio. By this means, the correlation between hole boundary pixels and center area pixels is gradually strengthened, which is especially suitable for progressive image completion. Furthermore, skip connections and point-wise normalization are combined to minimize the loss of valid information in the up-sampling process; thus, the structure of the generated result is clearer and the content is more natural. Extensive experiments on the Places2, CelebA and Paris StreetView datasets have demonstrated that MFR-GAN is competitive with other methods in both subjective quality and objective quantitative metrics.