Face Completion Based on Symmetry Awareness with Conditional GAN

Abstract: Face completion is an important topic in the field of computer vision and image processing. Its core task is to restore missing image information so that the completed result is as consistent as possible with the ground truth. Existing methods place no strong constraint on the consistency between the completion result and the ground truth, and they ignore the symmetry of the face, so they cannot generate natural and consistent completions for arbitrary positions, and in particular for symmetrical positions, of the face. In response to these problems, we propose a novel method called the face completion generative adversarial network (FC GAN). Our generator uses a U-Net-like structure, and the discriminator combines a global discriminator with a local discriminator. We use a new perceptual loss based on VGG-19 to constrain the consistency between the completion result and the ground truth. We also introduce symmetry awareness, which takes full advantage of facial symmetry to optimize face completion. For irregular masks, our method produces visually realistic and semantically correct results. We evaluate our model on the CelebA dataset using FID and SSIM as metrics. Compared with existing methods, the proposed method improves both the visual quality and the evaluation metrics.


Introduction
Face completion, also known as face inpainting, is an important topic in the fields of computer vision and image processing. Its core task is to use the existing image information to fill in the missing content, so that the completed result looks real and natural and is consistent with how an ordinary face image looks. Face recognition is the most natural, noninvasive and user-friendly means of identification, as it does not require interference with the person being identified. However, most existing systems can recognize faces successfully only under restricted conditions. Performance degrades significantly when occlusion occurs, for example from sunglasses, thick beards or scarves. Occluded facial images easily lead to recognition failure, which can in turn cause serious economic losses and security problems.
Because of the unique semantic structure and symmetry of a face image, its constituent elements are not independent but interrelated. The face completion task is therefore more difficult than general image completion. Nevertheless, after sustained research, face completion methods have made great progress.
Early completion methods, such as multiscale neural patch synthesis [1], Shift-Net [2], boosted semantic inpainting (BSI) [3], GMCNN [4] and SPGNet [5], fall into two main categories. The first category comprises completion methods based on nontextured structures, which fill the missing regions by propagating the surrounding image information through partial differential equations. Such methods are usually used only to fill small holes, such as scratches in old photos. The second category comprises completion methods based on texture structures, such as block-matching methods, which search known regions or external image databases for suitable image blocks and use the contextual information of the image to complete it. These methods cannot produce reasonable results when no matching image patch exists in the known region or in the external database. Compared with nontexture-based methods, however, texture-based methods can complete larger missing regions. Since both categories use only the context of the missing area, they cannot understand the semantics, structure and appearance of the image in a fine-grained manner, so they are suitable only for general scene images, such as natural scenery. For images of specific objects, especially faces, they cannot achieve good completion results.
In recent years, with the rapid development of deep learning, many learning-based completion methods have emerged and achieved remarkable results. In face reconstruction, well-known methods include HiFaceGAN [6], DFDNET [7], PSFRGAN [8] and Super-FAN [9]. At first, learning-based methods could complete only small missing regions, and their results were seriously blurred. Later, Goodfellow et al. proposed a new network framework, the generative adversarial network [10]. This framework uses a discriminative network to capture the semantic structure, appearance and other information of the image and then guides the completion through adversarial training. As a result, the completion quality improves: the semantics of the result become more reasonable and the content more realistic.
The emergence of generative adversarial networks has also spurred great progress in completing images of specific objects, such as faces. Learning-based face completion methods such as EGAN [11], GLCIC [12], GFC [13], SC-FEGAN [14], DCGANs [15] and dilated convolutions [16] alleviate, to a certain extent, the inability of traditional methods to perceive global semantic structure, making the completion result more semantically reasonable and more realistic in content.
Here, we address two main issues of the above methods. First, existing face completion methods do not impose strong constraints on the consistency of the completion results with the ground truth, so they cannot generate completion results consistent with the ground truth. Second, existing face completion methods do not make full use of the symmetry of face images, so they cannot generate reasonable completion results for symmetrical parts of the face. In this paper, we present a new face completion method, FC GAN, based on conditional generative adversarial networks. Our contributions are summarized below:
1. We propose a large-mask face completion method based on GAN (FC GAN);
2. We introduce symmetry awareness into our method, which takes full advantage of facial symmetry to optimize face completion;
3. We introduce a new perceptual loss based on VGG-19.

Related Work
This section introduces the work and techniques related to the face completion method in this paper. Our method builds on existing completion methods based on conditional generative adversarial networks, so this section reviews image completion methods based on generative adversarial networks in detail.
In recent years, with the rapid development of deep learning, many learning-based completion methods such as PatchMatch [12], GFC-Net [13] and MSINet [17] have emerged and achieved remarkable results. Initially, learning-based image completion methods failed to produce clear, realistic completion results. Later, Goodfellow et al. [10] proposed a new network framework, the GAN, which effectively improved the authenticity of the completed image and made the completion result semantically reasonable, realistic and natural.
The most representative deep-learning-based image completion method is the context encoder (CE) method [18] proposed by Pathak et al. This method was the first to apply generative adversarial networks to image completion. It adopted an autoencoder as the generator: the encoder extracted the structural features of the input image, and the decoder generated the completion result from those features. To give the completion results a reasonable semantic structure, the method used a discriminative network to obtain the semantic structure features of the completed images and of the ground truth, discriminated between real and generated results, and guided the generation through adversarial training. Although optimizing the generator with a single global discriminator yields semantically reasonable completion results, it cannot guarantee that they are clear and natural.
To obtain more realistic completion results, Iizuka et al. further proposed the GL method [12]. The method used two discriminators to optimize the generator. The global discriminator constrained the global semantic features of the completion result, making the semantics of the completed content more reasonable. The local discriminator constrained the authenticity of the completed image block, so that the generated block was clearer and more natural. Similar global and local discriminative networks are also employed in our generative face completion method to optimize the face completion model. Building on the GL method, Li et al. [13] proposed the GFC face completion method. Exploiting the structural features of the face, the GFC method added a semantic parsing network to GL [12]. This network extracted semantic parsing results from the completed image and from the ground-truth image, and the method then minimized the difference between the two parsing results, making the semantic structure of the completed image as consistent as possible with that of the ground-truth image. Yu et al. also proposed the GntIpt method [19] based on GL [12]. GntIpt improved the color and texture of the completed image through a coarse-to-fine completion process. First, an initial completion model produced an initial completion result. Then, a refinement model optimized this initial result using a context-aware mechanism similar to traditional block matching: it used the initial completion result to find similar feature blocks in the context, replaced the initial feature blocks with them, and finally generated the completion result. This refinement ensured that the completion result had color and texture consistent with the context. Subsequently, researchers proposed improvements concerning the shape of the missing region and the form of completion. Earlier completion methods were suitable only for rectangular or near-rectangular missing regions and struggled with irregular, hole-shaped missing regions. To handle missing regions of any shape, Liu et al. proposed a new completion method [20]. The method used masks of arbitrary shape for training and used a style loss instead of a local discriminative loss for the optimization, so that missing regions of any shape could be completed with real and natural results.
The previous completion methods were all fully automatic and used no manual prior. Later, new forms of completion were gradually proposed in which users could intervene in the completion process. For example, Yu et al. proposed the free-form image completion method [21], in which users could set the contour of the completed image to guide the generation of the completion result. Many new completion schemes have emerged under this form of completion, enabling the user to influence multiple features of the completed image. The field of image completion now contains many excellent completion schemes. This paper draws on the strengths of existing completion methods and proposes a completion scheme specifically for face images, based on their semantic structural features.

Method
This section proposes a generative face completion method to solve the problem that existing face completion methods cannot generate completion results consistent with the ground truth. Our method uses a generative adversarial approach to build the completion network. The generator produces the completion results, and a global discriminator and a local discriminator are used in the training phase. Each discriminator serves a different optimization purpose, and together they assist the training of the generator and guide it to generate semantically reasonable, realistic and natural completion results. Both discriminators act on the generator in the form of a discriminative loss. Because the generator alone cannot generate symmetrical and consistent completion results for the symmetrical parts of the face, a symmetry perception module is added to guide its training toward such results. In addition, to ensure that the completion results are as similar as possible to the ground truth, we also use an R1 loss and a perceptual loss to optimize the completion network. The R1 loss can constrain the pixel similarity between the completion result and the ground truth, while the perceptual loss can ensure the perceptual consistency between the completion result and the ground truth. The two work together so that the completion results are as consistent as possible with the ground truth. The overall architecture is given in Figure 1.

Generator
As shown in Figure 2, the generator of our proposed FC GAN consists of a downsampling head, a residual body and an upsampling tail. Specifically, the downsampling head extracts features, and the body, made up of nine stages of residual blocks, interacts contextually with the downsampling head. A convolution-based upsampling module then upsamples the output features of the body back to the spatial resolution of the input.

Discriminator
If only the generator is trained with supervised learning, it is impossible to obtain real and natural completion results. To guide the generator toward semantically reasonable, realistic and natural completion results, a discriminator is needed to assist training. The generator and the discriminator improve each other through adversarial training. The discriminator judges whether a completion result is real or generated and feeds the judgment back to the generator, which optimizes itself accordingly. The generator tries to deceive the discriminator through continuous optimization, and the better completion results it produces in turn push the discriminator to improve its discrimination ability. Through this continuous confrontation, the generator and the discriminator can finally reach a local or global optimum, which greatly improves the completion quality of the generator. The generative face completion network in this section uses two discriminators, called the global discriminator and the local discriminator. The global discriminator judges the authenticity of the entire image, so as to ensure that the global semantics of the completed image are reasonable. Its input is therefore the whole completed image as well as the ground-truth image. The local discriminator judges the authenticity of the image block covering the completed area and its surrounding area, and its input is this image block together with the corresponding block of the ground-truth image. Discriminating the authenticity of image blocks makes the completion results clearer and more realistic. In addition, discriminating the completed area together with its neighborhood lets the completion model make full use of the neighborhood information, so that the generated completion result connects seamlessly with the surrounding pixels. Under the joint optimization of the global discriminator and the local discriminator, the generator can generate semantically reasonable, realistic and natural completion results.
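As an illustration of how the input of the local discriminator could be prepared, the following minimal sketch crops the image block that covers the masked area plus a margin of surrounding pixels. The function names, the list-of-lists image representation and the `margin` parameter are assumptions made for this example; the paper does not specify how the block is extracted.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region of a 2-D 0/1 mask.

    bottom and right are exclusive, so they can be used directly in slices.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_local_patch(image, mask, margin=1):
    """Crop the masked region plus `margin` pixels of surrounding context.

    `margin` is a hypothetical parameter; the crop is clamped to the image border.
    """
    height, width = len(mask), len(mask[0])
    top, left, bottom, right = mask_bounding_box(mask)
    top, left = max(0, top - margin), max(0, left - margin)
    bottom, right = min(height, bottom + margin), min(width, right + margin)
    return [row[left:right] for row in image[top:bottom]]


# Example: a 6 x 6 image with a 2 x 2 hole in the middle.
image = [[10 * r + c for c in range(6)] for r in range(6)]
mask = [[1 if 2 <= r <= 3 and 2 <= c <= 3 else 0 for c in range(6)] for r in range(6)]
patch = crop_local_patch(image, mask, margin=1)  # 4 x 4 block around the hole
```

The same crop would be applied to the completed image and to the ground-truth image so that the local discriminator compares corresponding blocks.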

Symmetry Awareness Module
The symmetry-aware face completion method is obtained by adding a symmetry-aware module to the generative face completion method. The symmetry-aware module guides the training of the generator so that it can generate symmetrical and consistent completion results. It works in two steps: first, it detects the symmetric elements in the completion results and in the ground truth; second, it uses the detection results to construct a loss function that optimizes the completion of the symmetric elements. For the detection of symmetric elements, we propose a heuristic detection method to improve the detection accuracy. For the optimization of symmetric elements, the module uses a symmetric discriminator.
To optimize the completion of the symmetric elements of the face, the symmetric elements must first be detected in the completion result and in the ground truth during the training phase. Since both ears are not visible in most face images, paired ears are ignored in the detection and optimization process. Moreover, since the eyebrows are always close to the eyes, this module treats the eyebrows as part of the eyes. In summary, the only symmetric elements involved in the detection and optimization are the eyes, nose and mouth. We used the YOLOv5 [22] object detection algorithm to detect the eyes, nose and mouth.

Loss Function
The loss function is the driving force of network training. Both the generator and the discriminator optimize their network parameters by minimizing a loss function. The loss functions used in this paper include a nonsaturating adversarial loss, the R1 gradient penalty [23], a perceptual loss and a symmetric loss. The nonsaturating adversarial loss is used in the training phase of the generator to guide it toward more realistic images. The R1 regularization loss (R1) is used in the training phase of the discriminator to prevent the vanishing-gradient problem. The perceptual loss and the symmetric loss are used in the training phase of the generator, mainly to guide it toward more realistic, detailed and natural images.
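For reference, the R1 gradient penalty is commonly written in the GAN regularization literature in the form below. This is the generic textbook form rather than a formula taken from this paper: the weight $\gamma$ is an assumed symbol, and $Y$ denotes a real image as in the rest of this section.

```latex
R_1 = \frac{\gamma}{2}\, \mathbb{E}_{Y}\left[ \left\lVert \nabla_{Y} D(Y) \right\rVert_2^2 \right]
```

The penalty is applied to the discriminator on real images only, which discourages steep discriminator gradients around the real data.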

Adversarial Loss
The adversarial loss is a loss function in generative adversarial networks that is mainly used to train the generator. It is based on an adversarial process between the generator and the discriminator. The basic idea is to make the image produced by the generator fool the discriminator into mistaking it for a real image. Specifically, given an image, the task of the discriminator is to judge whether it is real or generated. The goal of the adversarial loss is therefore to minimize the gap between the generated image and the real image, so that the discriminator cannot accurately distinguish between the two. This confrontation can be formalized as a min-max game. Among the various adversarial losses, we used the nonsaturating adversarial loss, where $L_{adv}$ is the generator's adversarial loss. Specifically, $D$ represents the discriminator, $X_{in}$ represents the input noise of the generator, $Y$ represents the real image, $M$ represents the mask image, $\odot$ represents element-wise multiplication, and $\mathbb{E}$ represents the expected value. $D(X_{in})$ represents the output of the discriminator for the image generated by the generator, and $D(Y \odot (1 - M))$ represents the output of the discriminator for the real image after the masked region has been removed. The goal of the generator is to maximize $L_{adv}$, so that the generated image moves closer to the real image and the performance of the generator improves.
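To make the idea of a nonsaturating adversarial loss concrete, the sketch below implements the standard textbook form on plain lists of discriminator outputs (probabilities in (0, 1)). It is not the exact formula of this paper, which also involves the mask $M$; the function names are assumptions for illustration. The generator loss is written as a quantity to minimize, which is equivalent to maximizing the mean of log D on generated images.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss: -E[log D(real)] - E[log(1 - D(fake))]."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_nonsaturating_loss(d_fake):
    """Nonsaturating generator loss: -E[log D(fake)].

    Unlike E[log(1 - D(fake))], this keeps a useful gradient when the
    discriminator confidently rejects the generated images.
    """
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# A generator that fools the discriminator more often gets a lower loss.
weak = generator_nonsaturating_loss([0.1, 0.2])
strong = generator_nonsaturating_loss([0.8, 0.9])
```

In a real training loop these values would be computed on tensors inside a deep learning framework; the list-based version only shows the arithmetic.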

Perceptual Loss
Perceptual loss refers to using a pretrained convolutional neural network to extract features from the generated image and the target image and then calculating the error between the two sets of features. The VGG convolutional neural network [24] is a model proposed by Oxford University in 2014, which has shown very good results in both image classification and object detection tasks. Because of its deep network structure and strong feature expression ability, VGG-19 is often used as the base model of a perceptual loss. In this paper, we introduce a perceptual loss to make the completed image visually closer to the ground-truth image. We introduce the perceptual loss (RDPL), which uses the VGG-19 model $\psi(\cdot)$:

$$L_{RDPL}(X_{in}, Y) = \mathcal{M}\left(\left[\psi(X_{in}) - \psi(Y)\right]^2\right)$$

where $L_{RDPL}(X_{in}, Y)$ represents the feature reconstruction error calculated for the input image $X_{in}$ and the target image $Y$, and $\mathcal{M}$ represents a specific metric function (written in calligraphic type to distinguish it from the mask $M$). Specifically, $\psi(X_{in})$ and $\psi(Y)$ represent the features extracted from the input image $X_{in}$ and the target image $Y$, respectively, by the pretrained convolutional neural network. The shape of the extracted features is usually $H \times W \times C$, where $H$, $W$ and $C$ represent the height, width and number of channels of the feature map, respectively. $[\psi(X_{in}) - \psi(Y)]^2$ represents the element-wise squared difference between the two features, so that the error between each pair of feature values can be scaled and balanced. Finally, the metric function $\mathcal{M}$ combines all error values into a scalar representing the entire feature reconstruction error.
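The structure of this loss can be sketched in a few lines, assuming the features have already been extracted. In the sketch below, each image is represented by a list of flattened feature maps (one per network layer), standing in for the output of VGG-19, and the metric function is assumed to be a mean within each layer followed by a mean across layers. Both the representation and this choice of metric are assumptions for illustration, since the paper does not state them.

```python
def perceptual_loss(features_x, features_y):
    """Mean squared feature difference, averaged within and then across layers.

    features_x, features_y: lists of flattened feature maps (lists of floats),
    one entry per layer of a hypothetical pretrained feature extractor.
    """
    if len(features_x) != len(features_y):
        raise ValueError("both images need the same number of feature layers")
    layer_errors = []
    for fx, fy in zip(features_x, features_y):
        if len(fx) != len(fy):
            raise ValueError("feature maps of a layer must have equal size")
        # Element-wise squared difference, reduced to one scalar per layer.
        layer_errors.append(sum((a - b) ** 2 for a, b in zip(fx, fy)) / len(fx))
    # Combine the per-layer errors into a single scalar.
    return sum(layer_errors) / len(layer_errors)


# Two layers of toy features for a completed image and its ground truth.
completed = [[0.0, 0.0], [1.0]]
ground_truth = [[2.0, 0.0], [3.0]]
loss = perceptual_loss(completed, ground_truth)
```

Because the comparison happens in feature space, two images can differ pixel by pixel and still have a small loss if the network responds to them similarly.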

Symmetric Loss
The symmetry discriminator acts on the third stage of the generative completion network and is trained jointly with the generative model, the global discriminator and the local discriminator, which finally yields a symmetry-aware face completion network. The symmetric discriminator acts on the completion network in the form of a loss function named the symmetric loss. The symmetric discriminant loss takes the form of the classic cross-entropy loss, and its input is the symmetric element to be optimized in the ground-truth image and in the completed image. If the element to be optimized is a pair of symmetric elements, such as the eyes, $P_l$ represents the left part of the pair, such as the left eye, and $P_r$ represents the right part, such as the right eye. When $P_l$ is occluded by a mask and $P_r$ is not occluded, the symmetric loss is defined on the completed left part, where $P_l$ denotes the left part of the symmetric element as it appears in the completion result and $M$ is the mask of the original image.

Overall Loss
Our overall loss is a weighted sum of the nonsaturating adversarial loss, the perceptual loss, the R1 gradient penalty and the symmetric loss, where α, β, δ and λ control the weight of each term.

Experiments
We evaluated our FC GAN method on the CelebA dataset and compared it with mainstream baselines in various aspects. Experiments demonstrated that our method outperformed the baselines. We also verified through experiments that the symmetric discriminator was effective in optimizing the symmetrical features of faces and that the perceptual loss based on VGG-19 effectively improved the plausibility of the restoration results.

Dataset
For our experiments, the CelebA face dataset was used as the training dataset. The CelebA dataset contains 202,599 face images of 10,177 celebrities. Because some face images in the dataset are rotated or occluded, the MTCNN algorithm [25] was first used for face detection and alignment, and the images were then uniformly resized to 256 × 256. There were 160,000 processed images that met the experimental requirements: 20,000 images were randomly selected as the training set, and 10,000 images outside the training set were selected as the test set. First, the model was trained on the training set to obtain the weight parameters; then, the generalization ability of the network was checked on the validation set, and the hyperparameters were adjusted according to the model performance.
After completing these two steps, we tested the performance of the network on the test set.
Real environments are complex, and the occlusions they cause are varied, so there is currently no standard occluded-face dataset. During model training, we therefore added black pixel blocks at random to the face images in the dataset to simulate occlusion, targeting two occlusions that are common in daily life: face masks and glasses. The glasses occlusion covered about 20% of the image area, and the face-mask occlusion covered about 30% of the image area. Two mixed-occlusion datasets were also created. In addition, black pixel blocks were used to randomly cover 10%, 20%, 30% and 40% of the original image area to test the performance of the algorithm.
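The occlusion simulation described above can be sketched as follows, under the assumption that each occlusion is a single axis-aligned black rectangle placed at a random position on a single-channel image. The function names and the rectangle-sampling strategy are hypothetical; the paper does not describe how the blocks were shaped or placed.

```python
import math
import random


def random_block_mask(height, width, area_fraction, rng=None):
    """Return a height x width 0/1 mask with one random rectangle
    covering roughly `area_fraction` of the image."""
    if not 0.0 < area_fraction <= 1.0:
        raise ValueError("area_fraction must be in (0, 1]")
    rng = rng or random.Random()
    target_area = area_fraction * height * width
    # The block must be tall enough that the required width still fits.
    min_block_h = min(height, max(1, math.ceil(target_area / width)))
    block_h = rng.randint(min_block_h, height)
    block_w = min(width, max(1, round(target_area / block_h)))
    top = rng.randint(0, height - block_h)
    left = rng.randint(0, width - block_w)
    return [
        [1 if top <= r < top + block_h and left <= c < left + block_w else 0
         for c in range(width)]
        for r in range(height)
    ]


def apply_mask(image, mask):
    """Black out the masked pixels: image * (1 - mask), element-wise."""
    return [[p * (1 - m) for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Cover about 30% of a 256 x 256 single-channel image.
mask = random_block_mask(256, 256, 0.30, random.Random(0))
image = [[255] * 256 for _ in range(256)]
occluded = apply_mask(image, mask)
```

Calling `random_block_mask` with fractions 0.1, 0.2, 0.3 and 0.4 would give masks for the four test settings; glasses-like or mask-like occlusions would additionally need the block constrained to the eye or mouth region.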

Evaluation Indicators
We used the Fréchet inception distance (FID) [26] and structural similarity (SSIM) [27] metrics to measure the performance of large-mask completion.

FID
The FID is a metric for evaluating the quality of images generated by generative adversarial networks. To compute the FID, we used the Inception network. The Inception network is a feature extraction network whose last layer outputs the category of the image. We removed the final fully connected layer, so that we obtained a 2048-dimensional feature. The feature vectors extracted from all real images follow one distribution, and the feature vectors of the images generated by the GAN follow another. If the two distributions are the same, the GAN generates highly realistic images. The FID calculation therefore uses the Inception network to extract the features of the real image set and the generated image set, estimates their statistics, and evaluates the quality of the generated images by computing the Fréchet distance between the two distributions. The FID is calculated as follows:

$$\mathrm{FID}(X, Y) = \lVert \mu_X - \mu_Y \rVert_2^2 + \mathrm{Tr}\left(A_X + A_Y - 2\,(A_X A_Y)^{1/2}\right)$$

where $\mu_X$ and $\mu_Y$ represent the mean feature vectors of the two image sets in the Inception network, and $\lVert \mu_X - \mu_Y \rVert_2^2$ represents the squared Euclidean distance between these means. $A_X$ and $A_Y$ represent the covariance matrices of the feature vectors of the real image set $X$ and the generated image set $Y$, respectively. $\mathrm{Tr}$ represents the trace of a matrix, and $(A_X A_Y)^{1/2}$ represents the matrix square root of the product of the two covariance matrices.
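The Fréchet distance between two Gaussian feature distributions can be illustrated with a simplified sketch. To keep it dependency-free, the code below assumes diagonal covariance matrices, in which case the trace term reduces to a sum over per-dimension variances; the full FID uses complete covariance matrices and a matrix square root. The feature vectors are toy values, not real Inception features, and the function names are assumptions for this example.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in features) / (n - 1) for i in range(dim)]
    return mean, var


def fid_diagonal(real_features, generated_features):
    """Fréchet distance between two Gaussians with diagonal covariances.

    For diagonal covariances, Tr(A_X + A_Y - 2 (A_X A_Y)^(1/2)) equals
    sum_i (a_i + b_i - 2 sqrt(a_i * b_i)) over the per-dimension variances.
    """
    mu_x, var_x = mean_and_variance(real_features)
    mu_y, var_y = mean_and_variance(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_x, mu_y))
    trace_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_x, var_y))
    return mean_term + trace_term


# Two toy feature sets with equal spread but shifted means.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]
score = fid_diagonal(real, generated)
```

A lower score means the two feature distributions are closer; identical feature sets give a score of zero.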

SSIM
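SSIM (structural similarity) is a measure of the similarity between two images. Compared with PSNR [27], SSIM is more in line with human visual characteristics when evaluating image quality. The SSIM is calculated as follows:

$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$

where $\mu_X$ and $\mu_Y$ represent the means of images $X$ and $Y$, respectively, $\sigma_X^2$ and $\sigma_Y^2$ represent the variances of images $X$ and $Y$, respectively, $\sigma_{XY}$ represents the covariance of images $X$ and $Y$, and $C_1$ and $C_2$ are small constants that stabilize the division.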
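As a worked illustration of the SSIM statistics, the sketch below computes a single global SSIM value over two flattened grayscale images. It assumes the commonly used constants $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$, with $L$ the dynamic range of the pixel values; in practice SSIM is usually computed over local windows and averaged, which this simplified version omits. The function name is an assumption for this example.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """SSIM computed once over whole images given as flat lists of pixel values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("images must have the same size and at least two pixels")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Stabilizing constants (commonly used defaults, assumed here).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# An image compared with itself scores 1; a reversed gradient scores far lower.
gradient = [0.0, 50.0, 100.0, 150.0, 200.0, 250.0]
same_score = global_ssim(gradient, gradient)
reversed_score = global_ssim(gradient, gradient[::-1])
```

Higher values indicate greater structural similarity between the completed image and the ground truth.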

Implementation Details
We used images from CelebA [28] to construct our training data. Our model was trained using 28,000 images from CelebA. For in-training validation, we extracted 2000 images from the training set and generated random masks for each image in this validation subset. The experimental platform was an Ubuntu 20.0 system; the deep learning framework was PyTorch 1.8; the CUDA version was 11.3; the cuDNN version was 8.0; and the GPU was a GeForce RTX™ 3090. The loss functions were optimized using the Adam optimizer with fixed learning rates of 0.001 and 0.0001 for the generator and discriminator networks, respectively. We set the training batch size to 16 and the weight values α = 10, β = 20, σ and λ = 0.001. We trained our model for 40 epochs and logged the training state every 200 steps. At the end of each epoch, we evaluated the model on the validation set and calculated the FID and SSIM. The results produced by our method are presented in Figure 3. To evaluate the generalization ability of our model, that is, its applicability to different datasets and its behavior in real scenarios, we conducted cross-dataset testing. We drew two samples each from the FFHQ dataset [29] and the LFW dataset [30], and the model then produced inpainting results for them under different masks. The results are presented in Figures 4 and 5.

Comparisons to the Baselines
We selected two classic and representative image completion methods as baselines: DeepFill v2 [21] and LaMa-Fourier [31]. The comparison with these two strong baselines is presented in Table 1. FC GAN (ours) consistently outperformed both baselines. The qualitative results were in good agreement with the quantitative evaluation, showing that our method produced better completions than the other methods. Note that the "40-50% masked" column contains metrics on the most difficult samples from the test sets, namely samples with more than 40% of the image covered by masks.

Conclusions
In this paper, we proposed a face image completion network based on symmetry awareness. We provided symmetric semantic information to the network to help it complete images. In the context of face completion, we made full use of the symmetry of the key parts of the face and showed its advantages over the baseline methods. Table 1 shows that our proposed method was significantly better than the other methods on both indicators, FID and SSIM. In particular, for SSIM, a 33.3% improvement over LaMa-Fourier [31] and a 60.6% improvement over DeepFill v2 [21] indicated that our method could generate more realistic images while preserving image information. In addition, to address completion results that are visually inconsistent, we used a perceptual loss based on VGG-19, which further strengthened the completion model in terms of image quality and diversity.
However, our method has some limitations. Face completion based on symmetry perception is limited by the quality of the symmetry. If the symmetry of the face is poor, the completion quality may suffer to a certain extent. In addition, since the symmetry of the face varies from individual to individual, the applicability of this technique is also limited. A large quantity of training data is required to improve the completion quality, so a high-quality face dataset is required for training. If the quality of the dataset is not high, the results of this technique will also be affected to some extent. As Figure 5 shows, our model performed poorly on some samples of the LFW dataset. In the future, we will improve our method regarding the following two points:
1. In addition to symmetry, faces carry many other kinds of prior knowledge, such as shape and texture. We will therefore explore how to make better use of this prior knowledge to improve the completion quality.
2. Face completion based on symmetry perception requires a large quantity of training data to improve the completion quality. We will use a higher-quality face dataset such as the FFHQ dataset [29].

Figure 1. The overall architecture of our framework. The framework consists of a generator and three discriminators.

Figure 2. The scheme of our generator.

Figure 3. The results of our method. The first row is a mask, and the white rectangles or squares represent the invisible parts of the face. The bottom four rows are the completion results of our method for different faces.

Figure 4. The results of our method for samples from the FFHQ dataset.

Figure 5. The results of our method for samples from the LFW dataset.

Table 1. Quantitative evaluation of completion on the CelebA-HQ dataset.