Large Mask Image Completion with Conditional GAN

Abstract: Recently, learning-based image completion methods have made encouraging progress on square or irregular masks, and generative adversarial networks (GANs) can produce visually realistic and semantically correct results. However, much texture and structure information is lost during completion: when the missing region is too large to provide useful context, the results suffer from ambiguity, residual shadows, and object confusion. To complete images with large masks, we present a novel conditional GAN model called the coarse-to-fine conditional GAN (CF CGAN). We use a coarse-to-fine generator with a symmetric structure and a new perceptual loss based on VGG-16. For large mask image completion, our method produces visually realistic and semantically correct results, and the model generalizes well. We evaluate our model on the CelebA dataset using FID, LPIPS, and SSIM as metrics. Experiments demonstrate superior performance in terms of both quality and realism in free-form image completion.


Introduction
Image completion (also known as inpainting) has long been a hot topic in computer vision; it aims to fill missing regions with plausible and meaningful content. It is widely used in image editing [1], image re-targeting [2], and object removal [3]. Image completion originated in the restoration of artwork, returning damaged pieces to their original appearance. Using the remaining pixel information in the image as a reference, the missing or damaged parts are filled in so that the restored image shows no recognizable signs of damage to an observer. Beyond restoring damaged artwork, image completion also plays an important role in police work, for example identifying the appearance of suspects by completing face regions that are only partially captured by surveillance cameras because of masks, sunglasses, or other occlusions.
The subject was researched before the deep learning era [4][5][6], and progress has accelerated in recent years through deep and wide neural networks [7,8] and adversarial learning [9][10][11][12]. The traditional strategy is to train the completion system on a large, automatically generated dataset created by randomly masking real images. A single-stage network is common, as in [9,13]. In this work, we achieve state-of-the-art results with a two-stage network.
In this paper, we discuss a new method for image completion based on conditional generative adversarial networks. This method has a wide range of applications: for example, we can use it to remove objects in an image or to change the appearance of existing objects. To complete images with large masks, one can use the LaMa [13] method, a completion network architecture that leverages generative adversarial networks (GANs) [14] in a conditional setting. Recently, Zhao et al. [9] observed that a serious challenge remains: all existing algorithms tend to fail when handling large-scale missing regions. They instead use co-modulation to improve the generative capability, but such approaches often struggle to generate plausible image structures.
Here, we address two main issues of the above methods: the difficulty of generating plausible image structures with GANs and the lack of details and realistic textures in completion results. We show that, with a new, robust coarse-to-fine generator and discriminator architecture, we can complete images with large masks in a way that is more visually appealing than previous methods. We first obtain results by adversarial training alone, without relying on a perceptual loss based on VGG-16, and then show that adding a perceptual loss based on VGG-16 [15] slightly improves the results. Through evaluation, we find that CF CGAN generalizes to plausible image structures, captures and generates details and realistic textures, and is robust to large masks. Our contributions are summarized below:
1. We propose a large mask completion method based on conditional GAN (CF CGAN);
2. We use the traditional generator as a global generator and introduce a local enhancer generator;
3. We introduce a new perceptual loss based on VGG-16. Using pre-trained VGG-16 as a feature extractor reduces the model parameters and extracts rich detailed information while preserving the receptive field.
The remaining sections of this paper are organized as follows: Section 2 introduces related work; Section 3 presents our proposed method in detail; Section 4 conducts a systematic experimental evaluation; Section 5 concludes, emphasizing the limitations of this work and discussing future work.

Related Work
Early patch-based methods search for and copy-paste patches from known regions to gradually fill target holes, while diffusion-based methods describe and solve color propagation within the holes through partial differential equations. These methods can generate high-quality static textures when completing simple shapes, but they lack mechanisms for modeling high-level semantics and therefore cannot synthesize new semantic structures within holes. In recent years, with the development of deep learning [16,17], scholars worldwide have pursued image completion [18,19] through deep learning. Deep-learning-based methods significantly improve completion results by learning semantic information from large-scale image datasets. These methods usually train a convolutional neural network as an end-to-end mapping from the original image to the completed image.
The pioneering method for semantic completion is the context encoder (CE) proposed by Pathak et al. [7], arguably the first deep learning method for image completion; it adopts a generative adversarial network with an encoder-decoder architecture as the main body of the generator. This algorithm is fast and completes the structural information of the image reasonably well. However, the completed images often lack fine texture details, resulting in visual artifacts or blur. Yang C. et al. [20] proposed using a convolutional neural network for damaged image completion, introducing a new idea that jointly optimizes the content and texture information of the image. The algorithm generates realistic and reasonable images in terms of content and texture, especially for high-resolution images, but its hardware requirements are relatively high.
CGAN-based image completion methods have made a series of outstanding advances. Isola et al. [10] designed a general framework for the application of CGAN methods, namely PIX2PIX. Dolhansky et al. [21] generated high-quality restoration images using example information hidden in reference images based on GAN. Liao et al. [22] proposed a collaborative generative adversarial network based on CGAN, which consists of three branch networks: a face feature point generation network, an image inpainting network, and an image segmentation network. The framework inductively improves the main generation task by embedding additional information into the other tasks. Conditional generative adversarial network frameworks have been proven feasible for image completion.
Many losses have been proposed for training completion networks. Pixelwise and adversarial losses are typically used. Some methods apply a spatial discount weighting strategy to the pixel loss [7,23]. Simple convolutional discriminators or PatchGAN discriminators have been used to implement the adversarial loss [18]. Other popular choices are the Wasserstein adversarial loss with gradient-penalized discriminators and spectral normalization discriminators. Following previous work [13], we used the R1 gradient penalty [24] with the PatchGAN discriminator. Completion frameworks usually also contain a feature matching loss and a perceptual loss; we used a perceptual loss based on VGG-16.
Recently, researchers have turned their attention to more challenging settings, of which the most representative problem is large hole filling. Suvorov et al. [13] suggested that the main obstacle is the lack of an effective receptive field in both the completion network and the loss function, and proposed a method based on fast Fourier convolutions [25]. However, such methods often fail to generate plausible image structures with GANs, and details and realistic textures are lacking in the completion results. In this paper, we propose a novel conditional GAN method called CF CGAN to simultaneously achieve high-quality pluralistic generation and large hole filling. CF CGAN can capture and generate details and realistic textures and is robust to large masks.

Method
Image conditional GANs solve the problem of transforming a conditional input x in the form of an image into an output image x̂. We assume that pairwise correspondences between input conditions and output images are available in the training data. The generator takes an image x and a latent vector z as the input and produces an output x̂; the discriminator takes a pair (x, x̂) as the input and tries to distinguish fake generated pairs from the real distribution. Image completion can be viewed as a constrained image conditional generation problem, where known pixels are constrained to be invariant.

Overall Architecture
Our proposed image completion framework consists of a generator and a discriminator. Given a binary mask of unknown pixels m and an input image x, the network input is formulated as x′ = stack(x ⊙ m, m). We use a feed-forward completion network G, which we also refer to as the generator. The completion network processes the input x′ and returns a completed three-channel color image x̂. The discriminator then processes the input image x and the output image x̂ and classifies each patch of x̂ as real or fake. The architecture is given in Figure 1.
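As a concrete illustration, the input formulation above can be sketched in a few lines of NumPy; the channel layout and the convention that m = 1 marks known pixels (so the hole is zeroed out) are assumptions of this sketch, not details fixed by the paper.

```python
import numpy as np

def make_generator_input(x: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Build x' = stack(x * m, m): mask the image, then append the mask
    as a fourth channel. x has shape (3, H, W); m has shape (1, H, W)
    with 1 for known pixels and 0 inside the hole (assumed convention)."""
    masked = x * m                                  # zero out the unknown region
    return np.concatenate([masked, m], axis=0)      # shape (4, H, W)

x = np.random.rand(3, 256, 256).astype(np.float32)
m = np.ones((1, 256, 256), dtype=np.float32)
m[:, 96:160, 96:160] = 0.0                          # a square hole
x_prime = make_generator_input(x, m)
```

The generator thus always sees the mask explicitly, so it can distinguish true black pixels from unknown ones.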

Generator
We decompose the generator into two sub-networks, G1 and G2: G1 is the global generator network and G2 is the local enhancer network. The global generator network runs at a resolution of 128 × 128, and the local enhancer network runs at a resolution of 256 × 256. The generator is denoted as (G1, G2), as visualized in Figure 2.

Global Generator
Our global generator G 1 is built on the architecture proposed by Suvorov et al. [13], which has been proven successful for image completion with a periodic structure. It uses a ResNet-like architecture with 3 downsampling blocks, 9 residual blocks, and 3 upsampling blocks. The residual blocks use FFC [25]. The global generator runs on low-resolution images, which greatly reduces the computation while obtaining the overall structure and low-frequency semantic information of the image.

Local Enhancer Generator
The local enhancer generator also consists of three components: a convolutional front-end G2^f, three residual blocks G2^r, and a transposed convolutional back-end G2^b. Unlike in the global generator network, the input to the residual blocks G2^r is the elementwise sum of two feature maps: the output feature map of G2^f and the last feature map of the back-end G1^b of the global generator network. In current image completion methods, images are often down-sampled to reduce computation; this loses much high-frequency information, which greatly affects completion quality. Our local enhancer generator processes images at a resolution of 256 × 256, so it can preserve the high-frequency information of the image well, and it contains only three residual blocks.
During training, we first train the global generator and the local enhancer generator, and then jointly fine-tune all the networks together. This generator design efficiently aggregates global and local information for large mask image completion. We note that such a joint multi-resolution design is a well-established practice in computer vision [26].

Discriminator
As shown in Figure 3, the output of the PatchGAN discriminator [10] is a matrix, and each element of the matrix corresponds to a local area of the input image: if the local area is real, the element is 1, otherwise 0.
PatchGAN was first used in image style transfer. Compared with the discriminator in a typical GAN, the output of PatchGAN is not a scalar but a two-dimensional matrix. A typical GAN discriminator judges whether a picture is synthetic or real by considering the image globally, so it may ignore local details. Each value in the PatchGAN output matrix judges one local area of the image, so the local texture details of the image can be attended to and enhanced.
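The size of the local area judged by each output element is the discriminator's receptive field. As a worked example, the sketch below computes the receptive field of the common 70 × 70 PatchGAN configuration (five 4 × 4 convolutions with strides 2, 2, 2, 1, 1); this is a typical setup from the pix2pix literature, not necessarily the exact configuration used in this paper.

```python
def receptive_field(layers):
    """Walk backwards from one output unit; each (kernel, stride) layer
    expands the field as rf = rf * stride + (kernel - stride)."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = rf * stride + (kernel - stride)
    return rf

# A typical PatchGAN: five 4x4 conv layers with strides 2, 2, 2, 1, 1.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan))  # -> 70
```

Because the receptive field is far smaller than the image, the matrix of patch verdicts forces the generator to make every local region look realistic rather than only the image as a whole.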

Loss Function
The completion problem is inherently ambiguous: there can be many plausible fillings for the same missing region, especially when the region becomes wider. We discuss the components of the proposed loss, which together deal with the complexity of the problem. We define our generator G as f_θ(·) and our discriminator D as D_ξ(·). To improve the quality and realism of the generation, we adopted the non-saturating adversarial loss instead of a pixelwise MAE or MSE loss, which usually leads to averaged, blurry results. We also used the R1 = E_x‖∇D_ξ(x)‖² gradient penalty [24]. Besides, we adopted a perceptual loss.

Adversarial Loss
To ensure that the generator f_θ(·) generates natural-looking local details, we used an adversarial loss. Our discriminator D_ξ(·) works on a local patch level, discriminating between "real" and "fake" patches. Only patches that intersect the masked area receive the "fake" label. Finally, we used the non-saturating adversarial loss:

L_D = −E_x[log D_ξ(x)] − E_{x,m}[log D_ξ(x̂)] ⊙ m − E_{x,m}[log(1 − D_ξ(x̂))] ⊙ (1 − m),
L_G = −E_{x,m}[log D_ξ(x̂)],
L_Adv = sg_θ(L_D) + sg_ξ(L_G) → min_{θ,ξ},

where x is a sample from the dataset and m is a binary mask of unknown pixels. x̂ = f_θ(x′) is the completion result for x′ = stack(x ⊙ m, m), and E[·] denotes the expected value of the distribution. D_ξ(x̂) is the output of the discriminator for the completion result x̂, and D_ξ(x) is its output for the original sample x. sg_var stops gradients with respect to var, and L_Adv is the joint loss to optimize.
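Under the convention above (fake labels only for patches that intersect the hole; patches over known pixels of the completed image are treated as real), the patch-level losses can be sketched in NumPy as follows. The logit shapes and the hole indicator on the patch grid are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_loss(real_logits, fake_logits, hole):
    """Discriminator loss on a patch grid. `hole` is 1 where a patch
    intersects the masked area (those patches are labeled fake); patches
    of the completed image outside the hole are treated as real."""
    loss_real = -np.log(sigmoid(real_logits) + 1e-8).mean()
    per_patch = -(hole * np.log(1.0 - sigmoid(fake_logits) + 1e-8)
                  + (1.0 - hole) * np.log(sigmoid(fake_logits) + 1e-8))
    return loss_real + per_patch.mean()

def g_loss(fake_logits):
    """Non-saturating generator loss: push all patches toward 'real'."""
    return -np.log(sigmoid(fake_logits) + 1e-8).mean()
```

In a real training loop these functions would receive the PatchGAN's output logits, and the stop-gradient bookkeeping is handled automatically by optimizing L_D and L_G with separate optimizers.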

Perceptual Loss
Image completion can be regarded as an image transformation task, in which an input image is converted into an output image. Existing methods often solve such tasks by training a feed-forward network in a supervised fashion using the pixel-level error between images. This is very efficient at test time, because only one forward pass is required. However, the pixel-level error does not capture the perceptual difference between the output and the ground-truth image.
Recent studies [27,28] have found that the features extracted by a pre-trained VGG-16 network can be used as a loss in image synthesis tasks, measuring the perceptual similarity of images by comparing the differences between their features. This retains more scene structure information and improves the visual quality of image completion.
In this paper, a perceptual loss function is introduced to bring the completion result closer to the ground truth in terms of visual effect. Compared with other networks, VGG-16 has deeper layers and smaller convolution kernels. Using the pre-trained VGG-16 ψ(·) as a feature extractor reduces the model parameters and extracts rich detailed information while preserving the receptive field. In the pre-trained VGG-16, we replaced max pooling with average pooling and removed the three fully connected layers. The resulting network has thirteen parameterized layers, all convolutional, each followed by a ReLU activation. Since the perceptual loss does not require paired images to train the network, this reduces the number of parameters during network training. We introduce the rich detailed information perceptual loss (RDPL), which uses the VGG-16 model ψ(·):

L_RDPL = M([ψ(x) − ψ(x̂)]²),

where [· − ·]² is an elementwise operation and M is the interlayer mean of intralayer means.
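The reduction M (a mean over layers of per-layer means) can be sketched as follows; the random arrays here are only stand-ins for the VGG-16 feature maps ψ(x), and the shapes are illustrative.

```python
import numpy as np

def rdpl(feats_real, feats_fake):
    """L_RDPL = M([psi(x) - psi(x_hat)]^2): squared elementwise difference
    for each feature layer, averaged within each layer, then across layers."""
    per_layer = [np.mean((a - b) ** 2) for a, b in zip(feats_real, feats_fake)]
    return float(np.mean(per_layer))

# Stand-ins for feature maps from two VGG-16 stages (arbitrary shapes).
rng = np.random.default_rng(0)
feats_x = [rng.standard_normal((64, 128, 128)),
           rng.standard_normal((256, 32, 32))]
feats_xh = [f + 0.1 for f in feats_x]
```

Averaging within each layer before averaging across layers keeps deep, low-resolution layers from being drowned out by the much larger shallow feature maps.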

Overall Loss
Our overall loss combines the GAN loss, the perceptual loss, and R1 = E_x‖∇D_ξ(x)‖²:

L = α L_Adv + β L_RDPL + λ R1,

where α, β, and λ control the weight of each part.

Experiments
In this section, we demonstrate that our method outperforms a range of strong baselines with standard low resolutions, and the difference is even more pronounced when completing wider holes.

Dataset
For ease of evaluation, we used the CelebA-HQ dataset [29], a large-scale facial dataset derived from CelebA that contains 30,000 1024 × 1024 images of celebrity faces. We used CelebA-HQ at a resolution of 256 × 256 to reduce computation. The images cover large pose variations and background clutter, and the feature distribution of the dataset is approximately symmetric, making it well suited as the training dataset for evaluating our method.

Evaluation Indicators
We used the learned perceptual image patch similarity (LPIPS), Fréchet inception distance (FID) [27], and structural similarity (SSIM) [30] metrics. Compared to the pixel-level L1 and L2 distances, LPIPS, FID, and SSIM are more suitable for measuring the performance of large mask completion, where multiple natural completions are plausible.
The learned perceptual image patch similarity (LPIPS), also known as "perceptual loss", measures the difference between two images by comparing deep features, and is more in line with human perception than traditional metrics. The lower the LPIPS value, the more similar the two images; conversely, a higher value means a greater difference. LPIPS is calculated as follows:

d(x, x̂) = Σ_l (1 / (H_l W_l)) Σ_{h,w} ‖w_l ⊙ (ŷ^l_{h,w} − ŷ^l_{0,h,w})‖²₂,

where x is the real image, x̂ is the generated image, and d is the distance between them. x and x̂ are fed into the VGG network for feature extraction; the activations of each layer l are unit-normalized in the channel dimension and denoted ŷ^l, ŷ^l_0 ∈ R^{H_l × W_l × C_l}. The L2 distance is computed after scaling channelwise by the learned layer weights w_l, then averaged spatially and summed over layers.
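A simplified NumPy sketch of this computation follows; random arrays stand in for the VGG activations, and the learned per-channel weights w_l are set to ones, so the numbers are illustrative rather than true LPIPS values.

```python
import numpy as np

def lpips_distance(feats_x, feats_xh, weights):
    """d(x, x_hat) = sum over layers of the spatial mean of
    || w_l * (y_l - y0_l) ||_2^2, with channel-unit-normalized
    activations of shape (C, H, W) per layer."""
    d = 0.0
    for y, y0, w in zip(feats_x, feats_xh, weights):
        yn = y / (np.linalg.norm(y, axis=0, keepdims=True) + 1e-10)
        y0n = y0 / (np.linalg.norm(y0, axis=0, keepdims=True) + 1e-10)
        sq = (w[:, None, None] * (yn - y0n)) ** 2    # (C, H, W)
        d += sq.sum(axis=0).mean()                   # sum over C, mean over H, W
    return float(d)

rng = np.random.default_rng(1)
feats = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
w = [np.ones(8), np.ones(16)]
```

In practice one would use the reference `lpips` package, which supplies the trained weights w_l; the sketch only makes the formula concrete.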
The FID measures the similarity between two image datasets. It has been shown to correlate well with human judgments of visual quality and is the most common metric for evaluating the quality of generative adversarial network samples. Lower scores correspond to higher-quality images: smaller values indicate that the features of the two sets of images are more similar. The FID is computed as the Fréchet distance between two Gaussians fit to the feature representations of the Inception network:

FID(x, x̂) = ‖µ_x − µ_x̂‖² + Tr(Σ_x + Σ_x̂ − 2(Σ_x Σ_x̂)^{1/2}),

where Tr is the trace of a matrix (the sum of the elements on the main diagonal), µ is the mean and Σ the covariance matrix of the features, x denotes the real images, and x̂ the generated images.
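The formula can be implemented directly. The sketch below fits the two Gaussians to raw feature arrays and computes the matrix square root via an eigendecomposition of a symmetric matrix, which avoids external dependencies; in practice the features would come from an Inception network, here they are arbitrary arrays.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(feats_real, feats_fake):
    """FID = ||mu_x - mu_xh||^2 + Tr(S_x + S_xh - 2 (S_x S_xh)^(1/2)).
    Tr((S_x S_xh)^(1/2)) equals Tr((A S_xh A)^(1/2)) with A = S_x^(1/2),
    and A S_xh A is symmetric PSD, so an eigendecomposition suffices."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    a = _sqrtm_psd(s1)
    tr_covmean = np.trace(_sqrtm_psd(a @ s2 @ a))
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_covmean)
```

Identical feature sets give an FID of (numerically) zero, and a pure mean shift of the features contributes only the squared-distance term.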
SSIM measures the similarity of two digital images. It treats image degradation as a perceptual change in structural information, while also incorporating important perceptual phenomena such as luminance masking and contrast masking, and is well suited as a verification metric for image completion. SSIM is calculated as follows:

SSIM(x, x̂) = ((2 µ_x µ_x̂ + c₁)(2 σ_xx̂ + c₂)) / ((µ_x² + µ_x̂² + c₁)(σ_x² + σ_x̂² + c₂)),

where µ_x and µ_x̂ are the means of images x and x̂, σ_x² and σ_x̂² are their variances, σ_xx̂ is the covariance of x and x̂, and c₁ and c₂ are small constants that stabilize the division.
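A minimal sketch of the formula, computed over whole images rather than the sliding windows used by the practical metric; c₁ = (0.01 L)² and c₂ = (0.03 L)² are the standard stabilizers for dynamic range L.

```python
import numpy as np

def ssim_global(x, xh, data_range=1.0):
    """Single-window SSIM over whole images. The practical metric averages
    this quantity over local (e.g., 11x11 Gaussian-weighted) windows."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_xh = x.mean(), xh.mean()
    var_x, var_xh = x.var(), xh.var()
    cov = ((x - mu_x) * (xh - mu_xh)).mean()
    num = (2.0 * mu_x * mu_xh + c1) * (2.0 * cov + c2)
    den = (mu_x ** 2 + mu_xh ** 2 + c1) * (var_x + var_xh + c2)
    return float(num / den)
```

An image compared with itself scores 1, and any structural disagreement (negative covariance) pulls the score down, which matches the intuition behind using SSIM to verify completions.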

Implementation Details
We conducted experiments on the CelebA-HQ dataset at a 256 × 256 resolution. For CelebA-HQ, the train and validation splits contained 26,000 and 2000 images, respectively. The experimental platform was an Ubuntu 18.0 system; the learning framework was PyTorch 1.7 with CUDA 11.3 and cuDNN 8.0; the GPU was a GeForce RTX 3090. We used the Adam optimizer with fixed learning rates of 0.001 and 0.0001 for the generator and discriminator networks, respectively, and a batch size of 16. We set the weights α = 10, β = 30, and λ = 0.001. The overall loss included the GAN loss, the perceptual loss, and R1. We trained our model for 40 epochs, recording the training loss every 250 steps. After each epoch, validation was performed on the validation set, and the FID, LPIPS, and SSIM were calculated. We visualize these data in Figures 4 and 5. In the early stage of training, the loss curve decreases rapidly, and the FID, LPIPS, and SSIM indicators also improve quickly. As training enters the middle and late stages, the loss function begins to converge and the loss curve flattens; the FID and LPIPS values then oscillate at low points while SSIM oscillates at high points. Since the LPIPS values lie in the 0.09-0.30 range, we scaled LPIPS up by a factor of 10 to plot it against the FID, as shown in Figure 5d.

Comparisons to the Baselines
We selected classic and representative deep learning image completion methods as baselines: CoModGAN [9] and LaMa-Fourier [13]. The comparison with these two strong baselines is presented in Table 1. CF CGAN (ours) consistently outperformed them. The results correlate well with the quantitative evaluation and demonstrate that the completions produced by our method are preferable to those of the other methods. In Table 1, the markers denote the deterioration or improvement of a score compared to our CF CGAN model (presented in the last row). Note that the "40-50% masked" column contains metrics on the most difficult samples from the test sets: samples with more than 40% of the image covered by masks.
The results produced by our method are presented in Figure 6. We standardized the masking by using square or rectangular masks. We then used FaceNet [31] to calculate the distance between the original faces and the reconstructions. FaceNet directly embeds face images into a Euclidean space in which distance corresponds directly to face similarity: a distance of 0.0 means the faces are identical, 4.0 corresponds to two opposite, different identities, and a threshold of 1.1 correctly classifies pairs. Table 2 and Figure 7 show the distances between the original and reconstructed faces. The distances are all below the 1.1 threshold under different mask sizes, which shows that our model completes faces well.
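The verification rule above can be sketched as follows: FaceNet-style embeddings are L2-normalized, so the squared Euclidean distance between two embeddings lies in [0, 4], and pairs under the 1.1 threshold are declared the same identity. The random embedding here is only a stand-in for a real network output.

```python
import numpy as np

def squared_distance(e1, e2):
    """Squared Euclidean distance between two L2-normalized embeddings;
    ranges from 0.0 (identical) to 4.0 (antipodal)."""
    return float(((e1 - e2) ** 2).sum())

def same_identity(e1, e2, threshold=1.1):
    """FaceNet-style verification: accept pairs below the threshold."""
    return squared_distance(e1, e2) < threshold

def normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
emb_original = normalize(rng.standard_normal(128))  # stand-in for FaceNet(x)
emb_antipode = -emb_original                        # maximally distant embedding
```

Applying `same_identity` to the embeddings of an original face and its reconstruction is exactly the check reported in Table 2.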

Ablation Study
We performed a set of ablation experiments to show the importance of each component of our model. All ablation models were trained and evaluated on the CelebA dataset. We started from a baseline with a simple conditional generative adversarial network. The ablation results are shown in Table 3.
Local enhancer generator: We started with a simple conditional generative adversarial network baseline. We compared the baseline and the baseline with the local enhancer generator. From the results, the baseline with the local enhancer generator improved the numerical metrics.
Perceptual loss based on VGG-16: We verified that the high receptive field of the perceptual loss based on VGG-16 indeed improved the quality of the completion results.
PatchGAN discriminator: We used the PatchGAN discriminator and a global discriminator, respectively, in the baseline to verify the effectiveness of the PatchGAN discriminator. The global discriminator, also known as the general discriminator, judges the authenticity of the entire image and returns a scalar of 0 or 1. From the results, the PatchGAN discriminator is more effective than the global discriminator, as it focuses on more areas of the image.
Table 3. Ablation study of our model design, including the local enhancer generator (LEG), the perceptual loss based on VGG-16 (PL), and the PatchGAN discriminator (PD). We report FID, LPIPS, and SSIM scores.

Conclusions
In this paper, an image completion method, CF CGAN, based on conditional generative adversarial networks was proposed, including a coarse-to-fine generator with structural symmetry and a new perceptual loss based on VGG-16. The generator network based on generative adversarial loss ensures the consistency of the overall semantics of the completed image. Perceptual loss based on VGG-16 extracts rich detailed information on the premise of ensuring the receptive field. The coarse-to-fine completion method expands the information receptive field of the completion network, and the completion effect is very prominent, even in the case of large holes.
However, our current work has some limitations. At present, we can only process images at a resolution of 256 × 256; results on high-resolution images are often unsatisfactory. When our model encounters face images outside the dataset, it often fails.
Although this paper achieved good results in face image completion with conditional generative adversarial networks, shortcomings remain that need further research. Generative adversarial network training is a dynamic game: when either the generator or the discriminator becomes too powerful, the problem of vanishing gradients easily arises. In addition, the method in this paper was only applied to face image completion; its effectiveness on other types of datasets is unknown and needs further research.

Conflicts of Interest:
The authors declare no conflict of interest.