A Reference-Guided Double Pipeline Face Image Completion Network

Abstract: Existing face image completion approaches cannot plausibly complete damaged face images whose identity information is entirely lost under a center mask. Hence, in this paper, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is designed within the framework of the generative adversarial network (GAN), restoring the identity information of damaged images by using reference images that share the same identity. To reasonably integrate the identity information of the reference image into the completed image, the reference image is decoupled into identity features (e.g., the contours of the eyes, eyebrows and nose) and posture features (e.g., the orientation of the face and the positions of the facial features), and the resulting identity features are then fused with the posture features of the damaged image. Specifically, a lightweight landmark predictor is used to extract the posture features; an identity extraction module is designed to compress and globally extract the identity features of the reference images; and an identity transfer module is proposed to fuse identity and posture features effectively by performing identity rendering over different receptive fields. Furthermore, quantitative and qualitative evaluations are conducted on the public dataset CelebA-HQ. Compared to state-of-the-art methods, the evaluation metrics peak signal-to-noise ratio (PSNR), structure similarity index (SSIM) and L1 loss improve by 2.22 dB, 0.033 and 0.79%, respectively. The results indicate that RG-DP-FICN can generate completed images with plausible identity, with a superior completion effect compared to existing completion approaches.


Introduction
Face image completion, a research hotspot in the field of computer vision, has been widely applied in areas such as the film industry and detection tasks. Face image completion should effectively restore the plausible identity of a damaged image, i.e., ensure that the completed image has the same identity as the original image. Traditional image completion approaches are mainly based on diffusion or patches. Diffusion-based approaches [1,2] iteratively propagate low-level features from known regions to unknown areas along the mask boundaries, and are suited to completing small holes with weak structure. Patch-based approaches [3][4][5][6][7][8] search for similar patches in the same dataset or image to fill the unknown areas. Both approaches assume that unknown areas share similar content with known regions; they can therefore directly match, copy and realign known patches to complete unknown areas, but they cannot predict unique content that is absent from the known regions. Thus, these methods are only appropriate for filling natural images with similar textures. Face images, however, have a stronger topological structure than natural images and hence cannot be completed well by such traditional approaches.
Presently, approaches based on deep learning and the generative adversarial network (GAN) [9] have attracted extensive research attention in face image completion. In such approaches, a trainable completion network, usually built on convolutional neural networks, serves as the generator and is trained adversarially against a discriminator to produce natural and realistic completed images. A context encoder was proposed by Pathak et al. [10] using the generative adversarial loss for training, based on which Iizuka et al. [11] proposed local and global discriminant losses. Yu et al. [12] put forward a coarse-to-fine completion scheme that achieves a high-definition completion effect through long-range feature transfer using a semantic attention layer. Moreover, some scholars proposed partial convolution [13] and gated convolution [14] to enhance network robustness to the mask shape. To address the problem of a single completion result, Zhang et al. [15] mapped damaged images to a probability distribution using a modified variational encoder and GAN, so that multiple completion results can be constructed for one damaged image by sampling the probability distribution.
The completion results of dataset-guided face image completion schemes [10][11][12][13][14][15] are natural and realistic even on face images whose identity information is completely lost. However, the results are essentially average faces over the entire dataset, as their identities are uncertain. Therefore, Li et al. [16] proposed a semantic parsing loss, effectively enhancing the similarity between the completion result and the topological structure of the original image. Some scholars [17][18][19][20][21] used landmarks and edges as prior knowledge to guide the completion process. Compared with the approach of [16], the visual quality of the completion results of these approaches is greatly improved and the training is more stable. However, such approaches only provide posture information to guide the completion procedure, and the completion results still lack a definite identity.
To solve the identity uncertainty problem, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is proposed in this paper to guide the entire completion process using the identity information of reference images. The aim of the network is to transfer the identity information from the reference images to completed images, thereby improving the identity plausibility of the completed images.
The key contributions of this paper are as follows: (1) A reference-guided completion network is used to ensure the identity plausibility of completion results. (2) A double-pipeline GAN is proposed to realize the decoupling and fusion of identity and posture features. (3) An attention feature fusion module is designed to restore the background information lost during fusion.

Dataset and Evaluation Metrics
All the experiments conducted in this paper use the high-quality human face dataset CelebA-HQ [22], which contains 30,000 high-quality face images from 6217 identities. The dataset is processed for training and testing as follows. First, images with the same identity are grouped, yielding 6217 image lists, each containing a different number of face images. Then, lists containing only one image are eliminated. Two images are selected in order from each list as the original image and the reference image, respectively, yielding 28,299 image pairs. Finally, 27,299 image pairs are chosen as the training set, while the remaining 1000 image pairs form the test set. All images are scaled to 256 × 256 × 3.
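As an illustrative sketch of this preprocessing, the pair construction could look as follows. The exact pairing rule is inferred from the description above (consecutive images in each identity list are paired); the function and variable names are our own.

```python
from collections import defaultdict

def build_image_pairs(samples):
    """Group (image_path, identity) samples by identity, drop identities
    with a single image, and pair images of each identity in order as
    (original, reference). The consecutive-pairing rule is an assumption."""
    by_identity = defaultdict(list)
    for path, identity in samples:
        by_identity[identity].append(path)
    pairs = []
    for paths in by_identity.values():
        if len(paths) < 2:
            continue  # identities with a single image cannot form a pair
        # consecutive images form (original, reference) pairs
        for orig, ref in zip(paths, paths[1:]):
            pairs.append((orig, ref))
    return pairs
```

For example, an identity with three images contributes two pairs, while single-image identities are discarded, mirroring the filtering step described above.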
The evaluation metrics include peak signal-to-noise ratio (PSNR) [23], structure similarity index (SSIM) [23] and L1 loss. These metrics measure the pixel-level difference and overall similarity between the completed image and the original image. Higher PSNR and SSIM values indicate better performance, while for L1 loss, lower is better. The metrics are calculated as follows:

PSNR = 10 log10((2^n − 1)^2 / MSE), with MSE = (1/(H·W)) Σ_{i,j} (I_s(i,j) − I_g(i,j))^2

SSIM(I_s, I_g) = l(I_s, I_g) · c(I_s, I_g) · s(I_s, I_g), with
l = (2 µ_s µ_g + C1)/(µ_s^2 + µ_g^2 + C1), c = (2 σ_s σ_g + C2)/(σ_s^2 + σ_g^2 + C2), s = (σ_gs + C3)/(σ_s σ_g + C3)

L1 = (1/(H·W)) Σ_{i,j} |I_s(i,j) − I_g(i,j)|

where I_s represents the original image and I_g the completed image; i and j are the pixel coordinates in I_s and I_g; H and W are the height and width of the image, respectively; l, c and s measure the similarity of I_s and I_g in terms of luminance, contrast and structure, respectively; µ_s, µ_g and σ_s, σ_g are the means and standard deviations of I_s and I_g, and σ_gs is their covariance; C1, C2 and C3 are constants that prevent the denominators from being zero, generally taken as C1 = (K1 × L)^2, C2 = (K2 × L)^2 and C3 = C2/2, where L is the dynamic range of the pixel values.
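As a minimal NumPy sketch, PSNR and L1 loss for 8-bit images can be computed as below (SSIM involves windowed statistics and is commonly taken from a library such as scikit-image; the function names here are our own):

```python
import numpy as np

def psnr(img_s, img_g, peak=255.0):
    """Peak signal-to-noise ratio between original I_s and completed I_g."""
    mse = np.mean((img_s.astype(np.float64) - img_g.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

def l1_loss(img_s, img_g, peak=255.0):
    """Mean absolute pixel difference, reported as a percentage of the
    dynamic range (matching the percent figures quoted in the paper)."""
    return np.mean(np.abs(img_s.astype(np.float64)
                          - img_g.astype(np.float64))) / peak * 100.0
```

Two identical images give infinite PSNR and 0% L1 loss; maximally different images give 0 dB and 100%.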

Double-Pipeline GAN
Considering the identity uncertainty problem in the completed image, a reference image I_a with the same identity as the damaged image I_m is incorporated to guide the completion procedure. In this study, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is proposed (Figure 1). Two pipelines (reconstruction and completion) are constructed in this network to reasonably incorporate the identity information of reference images. To realize the identity transfer from the reference image to the damaged image, the posture features (e.g., the orientation of the face and the positions of the facial features) and identity features (e.g., the contours of the eyes, eyebrows and nose) of reference images are decoupled through this double-pipeline design, and the identity features of the reference images are fused with the posture features of the damaged images as follows.
The reconstruction pipeline possesses all the information of the reference image, which is jointly obtained by the encoder E and the identity encoder E_id. The reconstruction process is as follows: first, the damaged reference image I_ma is obtained by applying the mask M to the reference image I_a; then the reconstructed image is generated as

I_rec = G(I_ma, L_a, E_id(I_a)) (5)

where I_s represents the original image; M is a mask; and L_a and L_g denote the face topological structures obtained by connecting the landmarks of I_a and I_m predicted by the landmark predictor [20].

Figure 1. Overview of RG-DP-FICN with two parallel pipelines. The reconstruction pipeline (blue line), consisting of generator G and discriminator D_2, possesses all the information of the reference image I_a and is used only for training. The completion pipeline (yellow line), consisting of generator G and discriminator D_1, is used for both training and testing. The two pipelines share the same generator G and identity encoder E_id; the generator consists of an encoder E and a decoder F. The structure of the residual encoder and residual decoder is the same as in [15]. The identity encoder module and the identity transfer module are further described in Section 2.3; the attention feature fusion module is further described in Section 2.4.
The generator G is trained to enhance the visual quality of I_g and the plausibility of its topological structure through adversarial training with two discriminators D_1 and D_2. For D_1, the real and fake samples are the image pairs (L_g, I_s) and (L_g, I_g), respectively. For D_2, the real and fake samples are the image pairs (L_a, I_a) and (L_a, I_rec), respectively. This sample-pair design effectively enhances the image quality of I_g and I_rec and ensures the plausibility of their topological structures.

Identity Encoder and Identity Transfer Module
Fusing the identity information of reference images with the original information of damaged images is a challenging problem. First, reference images present different visual appearances due to gender, lighting conditions and makeup, which makes it difficult to extract identity information. Second, faces differ in posture, and the facial features occupy different proportions and positions in the image, which makes it difficult to fuse identity information plausibly. To solve this problem, an identity encoder and an identity transfer module are designed in this paper (Figure 2) for identity extraction and feature fusion, respectively. The transfer process in the completion pipeline is introduced next as an example. In the identity encoder, residual blocks and a self-attention module [24] first extract and compress the identity information of the reference image (256 × 256 × 3) into a 512 × 4 × 4 feature map. This feature map is then mapped into the identity feature z_id (512 × 1 × 1) through a fully connected layer. As a global operation, each unit of z_id is a weighted sum of all 512 × 4 × 4 feature values, which allows each unit in z_id to reason about the identity of the reference image. Hence, z_id can represent the identity of the reference image well. In the identity transfer module, the codes z_m and z_am, which represent high-dimensional semantics, are globally adjusted step by step using style rendering blocks to realize identity transfer. The identity transfer module consists of four style rendering blocks, each of which contains two AdaIN layers.
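The encoder described above can be sketched as follows. This is a hedged approximation: the channel progression and layer counts are our own assumptions chosen only to reproduce the stated 256 × 256 × 3 → 512 × 4 × 4 → 512 × 1 × 1 shapes, the residual blocks are replaced by plain strided convolutions, and the self-attention module is omitted for brevity.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Sketch of the identity encoder: strided convolutions compress the
    256x256x3 reference image to a 512x4x4 feature map, and a fully
    connected layer maps it to a global identity code z_id (512x1x1)."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512, 512]  # illustrative channel widths
        layers = []
        for c_in, c_out in zip(chans, chans[1:]):
            # six stride-2 convolutions halve the spatial size: 256 -> 4
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.features = nn.Sequential(*layers)
        # the FC layer is the global operation: each output unit is a
        # weighted sum of all 512x4x4 feature values
        self.fc = nn.Linear(512 * 4 * 4, 512)

    def forward(self, x):
        f = self.features(x)          # (B, 512, 4, 4)
        z_id = self.fc(f.flatten(1))  # (B, 512)
        return z_id.view(-1, 512, 1, 1)
```

The key design point survives the simplification: because the final mapping is fully connected rather than convolutional, every unit of z_id sees the entire compressed feature map, matching the paper's argument for why z_id can represent global identity.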
The specific structure of the style rendering block is represented in Figure 3; adaptive instance normalization (AdaIN) [25] is employed twice per block, as follows:

AdaIN(f_{i,j}) = α_{i,j} · (f_{i,j} − µ(f_{i,j})) / σ(f_{i,j}) + β_{i,j}

where f_{i,j} represents the input feature, µ(f_{i,j}) and σ(f_{i,j}) denote the mean and standard deviation of f_{i,j}, respectively, and (α_{i,j}, β_{i,j}) are the AdaIN affine parameters obtained by mapping the identity feature z_id through a fully connected layer. To improve the rendering effect, the receptive field is expanded using dilated convolution [26] in the style rendering blocks; identity rendering over various receptive fields is thus achieved by applying the style rendering blocks successively. The entire style rendering process is completed by four style rendering blocks, whose AdaIN affine parameters are all obtained by mapping the identity feature z_id extracted by the identity encoder. Thus, the adjusted code z_u possesses identity plausibility. Similarly, the code z_au of the reconstruction pipeline also possesses a plausible identity, as it uses the same identity transfer operation as the completion pipeline.
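The AdaIN operation itself is standard and can be sketched directly; the function below normalizes each channel by its own statistics and re-styles it with the affine parameters (α, β) that, in this paper, are mapped from z_id.

```python
import torch

def adain(f, alpha, beta, eps=1e-5):
    """Adaptive instance normalization. f: (B, C, H, W) input feature;
    alpha, beta: (B, C, 1, 1) affine parameters (here, mapped from the
    identity code z_id by a fully connected layer)."""
    # per-sample, per-channel statistics over the spatial dimensions
    mu = f.mean(dim=(2, 3), keepdim=True)
    sigma = f.std(dim=(2, 3), keepdim=True)
    # normalize, then re-scale and re-shift with the style parameters
    return alpha * (f - mu) / (sigma + eps) + beta
```

With α = 1 and β = 0 this reduces to plain instance normalization; non-trivial (α, β) inject the reference identity's statistics into the damaged image's feature map.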

Attention Feature Fusion Module
After identity transfer through the identity transfer module, the codes z_u and z_au are obtained with plausible identity. However, AdaIN in the identity transfer module globally adjusts the feature map during transfer, which leads to the loss of background information. Therefore, an attention feature fusion module is designed in this paper, which combines the pre-transfer features carrying the undamaged background information with the identity-plausible codes to restore the background information of face images. Taking the completion pipeline as an example, f_m and f_u, the latter obtained from z_u, are fused. Figure 4 shows the structure of the attention feature fusion module. First, the attention score δ is calculated from f_u (Equation (10)). Then, f_m and f_u are fused under the guidance of δ (Equations (11)-(14)). Finally, γ_m and γ_u are concatenated and the fused feature f_g is obtained through a residual block (Equation (15)). Similarly, the fused feature f_rec is obtained by fusing f_ma and f_au with this module.
where N denotes the number of pixels in f_u, and P and Q are 1 × 1 convolutions.
where O represents a 1 × 1 convolution, γ_m and γ_u denote equilibrium parameters, and S represents a residual block. To verify the effectiveness of the attention feature fusion module, the channels of the feature maps f_u and f_g before and after fusion are summed and visualized in Figure 5. The information before fusion is mainly concentrated in the face area, whereas the background information is restored after fusion, indicating that the module fuses the features effectively.
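Since Equations (10)-(15) are referenced but not reproduced in this excerpt, the module can only be sketched under assumptions: the sketch below implements a standard self-attention score δ from f_u via the 1 × 1 convolutions P and Q, attention-weighted rearrangement via O, learnable equilibrium parameters γ_m and γ_u, and a final convolution standing in for the residual block S. Treat the exact fusion arithmetic as our interpretation, not the paper's equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFeatureFusion(nn.Module):
    """Hedged sketch of the attention feature fusion module: an attention
    score computed from the transferred feature f_u guides the fusion of
    f_u with the pre-transfer feature f_m, restoring background detail."""
    def __init__(self, channels):
        super().__init__()
        self.P = nn.Conv2d(channels, channels // 8, 1)  # 1x1 conv
        self.Q = nn.Conv2d(channels, channels // 8, 1)  # 1x1 conv
        self.O = nn.Conv2d(channels, channels, 1)       # 1x1 conv
        # learnable equilibrium parameters gamma_m, gamma_u
        self.gamma_m = nn.Parameter(torch.zeros(1))
        self.gamma_u = nn.Parameter(torch.zeros(1))
        # stands in for the residual block S applied after concatenation
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_m, f_u):
        b, c, h, w = f_u.shape
        # attention score delta over the N = h*w pixels of f_u
        q = self.Q(f_u).flatten(2)                        # (B, C//8, N)
        p = self.P(f_u).flatten(2)                        # (B, C//8, N)
        delta = F.softmax(q.transpose(1, 2) @ p, dim=-1)  # (B, N, N)
        # rearrange both features under the guidance of delta
        att_m = (self.O(f_m).flatten(2) @ delta.transpose(1, 2)).view(b, c, h, w)
        att_u = (self.O(f_u).flatten(2) @ delta.transpose(1, 2)).view(b, c, h, w)
        g_m = self.gamma_m * att_m + f_m
        g_u = self.gamma_u * att_u + f_u
        return self.fuse(torch.cat([g_m, g_u], dim=1))    # fused feature f_g
```

Initializing γ_m and γ_u at zero (a common choice for attention residuals) lets the module start as an identity-like mapping and learn how much attention-guided content to mix in.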

Figure 5. The feature maps before and after attention feature fusion. x and y represent the coordinates of the feature map, and the color bar represents the range of the feature map values.

Loss Function
According to the respective tasks of the completion pipeline and the reconstruction pipeline, their loss functions are designed so that the reconstructed images closely resemble the reference images, and the completed images are realistic and natural with reasonable semantics.
First, a reconstruction loss is introduced for the reconstruction pipeline to decrease the pixel-level difference between I_rec and I_a. The completion pipeline has no complete face image as supervision, so forcing the same reconstruction loss on it would produce erroneous training results. Therefore, the spatially discounted reconstruction loss [12] is introduced, whose spatial discount weight W_sd applies weaker pixel-level supervision to pixels farther from the nearest known pixel. The reconstruction loss and the spatially discounted reconstruction loss can be written as:

L_rec = ||I_rec − I_a||_1

L_sd = ||W_sd ⊙ (I_g − I_s)||_1, with W_sd = γ^l

where ||·||_1 denotes the L1 norm, W_sd represents the weight of each pixel, l is the distance of the pixel to the nearest known pixel, and γ is set to 0.9 in all experiments.
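The spatial discount weight W_sd = γ^l can be computed with a breadth-first search from the known pixels; the sketch below uses a 4-connected distance, which is one of several reasonable choices (the original spatially discounted loss in [12] does not pin down the metric).

```python
from collections import deque
import numpy as np

def spatial_discount_weight(mask, gamma=0.9):
    """Spatially discounted weight W_sd = gamma**l, where l is the
    4-connected distance of each unknown pixel to the nearest known pixel.
    mask: 2D binary array, 1 = unknown (masked), 0 = known."""
    h, w = mask.shape
    dist = np.full((h, w), -1, dtype=np.int64)
    queue = deque()
    for i in range(h):
        for j in range(w):
            if mask[i, j] == 0:
                dist[i, j] = 0          # known pixels are the BFS sources
                queue.append((i, j))
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and dist[ni, nj] < 0:
                dist[ni, nj] = dist[i, j] + 1
                queue.append((ni, nj))
    return gamma ** dist.astype(np.float64)  # known pixels get weight 1
```

Known pixels receive weight γ^0 = 1; a masked pixel two steps inside the hole receives γ^2 = 0.81 with the paper's γ = 0.9, realizing the "weaker supervision farther from known pixels" behavior described above.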
Then, the perceptual loss [27] is introduced for the reconstruction pipeline to decrease the perceptual difference between I_rec and I_a. The distance between the outputs of a pre-trained network at different layers for the reconstructed image I_rec and the reference image I_a is penalized to enhance their perceptual similarity. The perceptual loss is computed from the feature maps φ_i(·) of a VGG19 network pre-trained on the ImageNet dataset [28]:

L_pc = Σ_i ||φ_i(I_rec) − φ_i(I_a)||_1

where φ_i(I_rec) and φ_i(I_a) denote the feature maps of the i-th layer for I_rec and I_a, respectively. Finally, to decrease the difference in data distribution between the generated and real images and enhance the visual quality of the generated images, a discriminant loss is introduced for each of the two pipelines. The discriminant loss of the completion pipeline is designed based on LSGAN [29] (Equation (20)), which effectively improves the authenticity of the completed image I_g. Inspired by [30], the LSGAN-based discriminant loss is modified (Equation (21)) for the reconstruction pipeline. This design encourages the original features D_2(I_a, L_a) and the reconstructed features D_2(I_rec, L_a) in the discriminator to be close together, effectively enhancing the semantic similarity between I_rec and the reference image I_a while ensuring the authenticity of the reconstructed image I_rec.
The total losses of the completion pipeline and the reconstruction pipeline are defined as follows: where λ_rec, λ_adv and λ_pc denote the weighting factors.

Implementation Details
Our proposed model is implemented in PyTorch v1.2.0 and optimized using the Adam optimizer [31] with decay rates β_1 = 0, β_2 = 0.9 and learning rate l_r = 0.0001. The weights λ_rec, λ_adv and λ_pc of the loss function are set to 1, 0.1 and 0.1, respectively. On a single NVIDIA 2080 Ti (11 GB), we trained our model on CelebA-HQ for three days with a batch size of 4. The convergence of the loss function and the test-set evaluation metrics is shown in Figures 6 and 7; the model achieved its best results at the 38th epoch.
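The optimizer configuration above translates directly into PyTorch; in the sketch below, the module passed in stands in for the actual RG-DP-FICN generator or discriminators.

```python
import torch

def make_optimizer(module, lr=1e-4, betas=(0.0, 0.9)):
    """Adam optimizer with the paper's hyperparameters:
    learning rate 1e-4, beta1 = 0, beta2 = 0.9."""
    return torch.optim.Adam(module.parameters(), lr=lr, betas=betas)
```

Setting β_1 = 0 is a common choice in GAN training (it reduces the momentum carried across the alternating generator/discriminator updates).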

Completion Consistency
A comparison is made between the proposed RG-DP-FICN and generative image inpainting with contextual attention (CA) [12], generative landmark guided face inpainting (LaFIn) [20] and pluralistic image completion (PIC) [15] on the same dataset, CelebA-HQ. Note that CA, LaFIn and PIC provide models pre-trained on CelebA-HQ; these methods are tested using the code and network weights released by the original authors.
To visually demonstrate the superiority of the proposed face image completion approach, a qualitative evaluation is performed on the completion results of this approach, CA, LaFIn and PIC (Figure 8). The completion results of PIC and CA are essentially average faces over the entire dataset, as their identities are uncertain. Their completion results therefore possess a certain realism, but the identity clearly differs from the ground truth. LaFIn introduces landmarks as prior knowledge of the topological structure; nevertheless, its completion results still lack a plausible identity owing to the absence of identity-information guidance, which particularly manifests as contours of the eyes, eyebrows and nose that differ from those of the original image. RG-DP-FICN introduces a reference image of the same identity to guide the completion process: the identity transfer module transfers the identity of the reference image to the damaged image by fusing the decoupled identity features with the posture features of the damaged image.
Moreover, the background information lost during identity transfer is restored by the attention feature fusion module. In conclusion, RG-DP-FICN generates completed images with a plausible topological structure and identity compared to the above three schemes. To objectively demonstrate the superiority of the proposed approach, the completion results of this approach, CA, LaFIn and PIC are assessed quantitatively, with results presented in Table 1. As the numbers in Table 1 show, LaFIn outperforms PIC and CA in most cases, as it employs landmark information to guide the completion process. The PSNR and SSIM of RG-DP-FICN are far higher than those of the other schemes, while its L1 loss is lower. It can be inferred that, compared with existing advanced schemes, the proposed RG-DP-FICN has a stronger face image completion capability.

Completion Diversity
To further validate the model's identity transfer capability, reference face images of different identities are chosen for damaged images with a center mask (Figures 9 and 10). Since the mask completely covers the facial attributes, the completion results show obvious differences in the position and contour of the eyes and the shape of the nose depending on the reference face image. This shows that the proposed completion network effectively guides the identity of the completion result.

Conclusions
Considering the problem of identity uncertainty in the results of face image completion, RG-DP-FICN is proposed in this paper. First, an identity transfer module is designed to realize identity guidance of the face image completion results. Then, an attention feature fusion module is designed to effectively restore the background information of the face image. Compared with various advanced approaches, the completion results of RG-DP-FICN are more realistic and natural in subjective visual effect and possess significantly enhanced objective pixel-level and overall similarity. Hence, the proposed network effectively solves the identity uncertainty problem.

Conflicts of Interest:
The authors declare no conflict of interest.