Recovery-Based Occluded Face Recognition by Identity-Guided Inpainting

Occlusion in facial photos poses a significant challenge for machine detection and recognition. Consequently, occluded face recognition for camera-captured images has emerged as a prominent and widely discussed topic in computer vision. Current standard face recognition methods achieve remarkable performance on unoccluded faces but perform poorly when applied directly to occluded face datasets. The main reason is the absence of identity cues caused by occlusions. A direct idea is therefore to recover the occluded areas with an inpainting model. However, existing inpainting models based on an encoder-decoder structure are limited in preserving inherent identity information. To solve this problem, we propose ID-Inpainter, an identity-guided face inpainting model that preserves identity information to the greatest extent through a more accurate identity sampling strategy and a GAN-like fusing network. We conduct recognition experiments on occluded face photographs from the LFW, CFP-FP, and AgeDB-30 datasets. The results indicate that our method achieves state-of-the-art performance in identity-preserving inpainting and dramatically improves the accuracy of normal recognizers in occluded face recognition.


Introduction
In recent years, occluded face recognition has become a research hotspot in computer vision. Unlike unoccluded faces, occluded faces suffer from incomplete visual components and insufficient identity cues, which degrade the recognition accuracy of normal recognizers [1][2][3][4]. Inspired by the recovery mechanism of the nervous system, researchers have proposed two types of approach, i.e., occlusion-robust and occlusion-recovery.
The occlusion-robust approach attempts to improve the robustness of recognizers on occluded faces by improving the "representation". The latest work, FROM [5], proposed an end-to-end occluded face recognition model that learns the feature masks and deep occlusion-robust features simultaneously. However, compared with normal recognizers, it generalizes less well to datasets with wide age and angle differences, such as CFP-FP [6] and AgeDB-30 [7].
Unlike the occlusion-robust approach, the occlusion-recovery approach recovers the occluded regions before recognition. GAN-based inpainting methods [8,9] have remarkably improved realistic content generation. At the same time, identity-preserving inpainting models [10][11][12][13][14][15] have been demonstrated to be effective for occluded face recognition. These methods often adopt encoder-decoder-structured networks but use different identity losses during training, as Figure 1 shows. Dolhansky et al. [10] introduced identity features to preserve identity information in eye regions through an L2 feature loss, as Figure 1b shows. Inspired by the perceptual loss, [11,12,16] used an identity loss that combines a perceptual item and an identity feature item, as Figure 1c shows. The perceptual item is computed with semantic features from a low-level layer of the pretrained recognizer, while the identity feature item is taken from the output of the top-level layer. Ge et al. [15] proposed an identity-diversity loss that combines a perceptual loss and an identity-centered triplet loss to guide face recovery, which achieved state-of-the-art performance in identity-preserving inpainting, as Figure 1d shows. Duan et al. [13] designed two-stage GAN models to deal with face completion and frontalization simultaneously. However, these methods are also limited by the challenge of preserving the inherent identity information against large occlusions. They often use incomplete datasets to learn the identity distribution under the supervision of identity and reconstruction loss functions, which makes the learned distribution deviate from the real one. The decoder then generates a new face by sampling from the biased identity space, which further enlarges the identity offset of the generated image.
This work uses a GAN-like identity-guided inpainting model to solve occluded face recognition. We refer to our method as ID-Inpainter for brevity. Instead of starting from a Gaussian distribution, our model samples from an identity distribution learned with an unoccluded dataset, which lies closer to the real distribution than one learned with an occluded dataset. The difference is shown in Figure 2. Our ID-Inpainter consists of a content inpainting process and an identity fusing process. In the content inpainting process, we train a content inpainter to implement a coarse recovery with structure consistency. In the fusing process, we design a GAN-like identity fusor consisting of a series of adaptive identity fusion blocks (AIFBs) to fuse the identity and attribute features. Through the GAN-like fusor and the specifically designed AIFBs, we achieve more efficient identity fusing and obtain better attribute-consistent inpainting results.

Related Work

Occluded Face Recognition
Face recognition is a computer vision task that recognizes the identity among multiple face images. It is closely related to feature extraction, classification [17], and detection [18] technology. As one of the most successful practical cases, face recognition has a long history of research that has extended to various application scenarios [15,19,20]. Traditional face models are designed for unoccluded face images (see, for example, [1,2]). When they are applied directly to occluded datasets, their accuracy drops dramatically. There are two main approaches to solving the problem: occlusion-robust and occlusion-recovery.
The occlusion-robust approach reduces the accuracy drop by improving the robustness of recognizers on occluded faces. One idea is to improve the "representation". Refs. [21][22][23] report various kinds of representation methods for facial features. The latest work, FROM [5], is an end-to-end occluded face recognition model that learns the feature masks and deep occlusion-robust features simultaneously and achieved the SOTA result on the occluded LFW dataset.
Unlike the occlusion-robust approach, the occlusion-recovery approach recovers the occluded facial regions and then performs recognition on the recovered faces. Ge et al. [15] proposed an identity-diversity inpainting network to facilitate occluded face recognition. It improved the recovery step by integrating a GAN with a novel CNN network, which used identity-centered features as supervision so that the inpainted faces cluster towards their identity centers. In [14], occlusions were removed with a CNN-based deep inpainting network. However, these methods are also limited by the challenge of preserving the inherent identity information against large occlusions. The core reason lies in the insufficient transfer of identity information. So, if we can improve the transfer of identity information in the inpainting phase, we can further improve the performance of occluded face recognition.

Identity-Preserving Face Inpainting
A simple approach for face inpainting is to borrow general deep learning inpainting methods directly, which are good at rebuilding the overall structure of the face. For example, generative inpainting methods [9,24] design attention layers to improve the global structure consistency and fidelity and have performed well in face inpainting. Although these methods have been shown to maintain the consistency of facial structure, they bring limited improvement in occluded face recognition. So, some researchers have turned their attention to identity-preserving face inpainting.
Identity-preserving face inpainting attempts to perceive the identity information from the uncorrupted region. Some attempts, e.g., [14,15,25], introduced an identity loss to solve the problem and were shown to help occluded face recognition, though not significantly. For example, Ge et al. [15] proposed an identity-preserving face completion model that combined a CNN network and a third recognizer player to complete identity-diversity inpainting. It was designed explicitly for occluded face recognition but failed to improve performance on large occlusions. The main reason is that a traditional encoder-decoder network trained on occluded datasets cannot build the real identity space, which leads to a prominent identity offset in the inpainting process. Li et al. [26] creatively combined a general inpainting network with the AAD-generator [27] to solve identity-guided inpainting tasks, regenerating missing content from a pretrained identity distribution. However, there is still a certain distance in style and structure between the generated face and the ground truth face. Although an additional Poisson blending module is used to repair the style difference, the structure bias cannot be erased.

Normalization Layers
GAN is powerful in generating photo-realistic results based on distribution sampling. There have been broad investigations of the normalization layers [26][27][28][29] in GANs to improve the prediction performance. Among them, spatially adaptive denormalization (SPADE) [28] and adaptive attentional denormalization (AAD) [30] are related to our AIFB. By relying on the prelearned identity distribution and AIFBs, our method can effectively fuse the identity information into the missing area and maintain a high degree of structural consistency.

Proposed Method
For occlusion-recovery face recognition, the recovery model inpaints the occluded face to meet structure consistency and identity preservation. Instead of using a traditional encoder-decoder generator, we utilize a GAN-like identity-guided face inpainting network for the inpainting, as shown in Figure 3. Our method consists of two phases: the verification phase and the training phase. In the training phase, we use occlusion-free faces as the reference image while adopting the masked face as the reference in the verification phase.

Problem Definition
Given a ground truth face $x_g$ and its occluded version $x_m$, our goal is to inpaint the occluded image with structure-consistent and identity-preserving content so that it is easier for normal recognizers to recognize. During the inpainting process, we use a mask $M$ to indicate the occluded areas and a reference face $x_s$ to guide the identity-preserving inpainting. As Figure 3 shows, our ID-Inpainter $I$ consists of a content inpainter $C$, an identity sampler $S$, an attribute extractor $A$, and an identity fusor $F$. In the training phase, we obtain the content-recovered outputs $X_a$ by $X_a = C(X_m, M)$, the identity embeddings $z_{id}$ by $z_{id} = S(X_s)$, and the multi-scaled attribute embeddings $z_a$ by $z_a = A(X_a)$. Then, $\{z_a, z_{id}, M, X_a\}$ are delivered to the identity fusor $F$ to obtain $Y_f$. According to our goal, we need to maintain the structure consistency between $Y_f$ and $X_g$ while maximizing the identity similarity between $Y_f$ and $X_g$. The process is formulated as Equation (1).
However, $X_g$ is unknown in the verification phase. Assuming that we can find an alternative $X_s$ that is very similar in identity to $X_g$, we can update Equation (1c) by using $X_s$ in place of $X_g$, which gives Equation (2). Now, the questions are how to find such a similar $X_s$ and how to transmit more identity information to the fused result $Y_f$ with high structural consistency.
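For readers who prefer symbols, one possible way to write the goals stated above is sketched here. The generic structure distance $d$, the use of cosine similarity for the identity term, and the layout of the lines are assumptions made for illustration and may differ from the exact formulation of Equations (1) and (2).

```latex
\begin{align}
X_a &= C(X_m, M), \qquad z_{id} = S(X_s), \qquad z_a = A(X_a), \\
Y_f &= F(z_a, z_{id}, M, X_a), \\
\min_{F}\; & d\big(Y_f, X_g\big)
  \quad \text{(structure consistency)}, \\
\max_{F}\; & \cos\big(S(Y_f), S(X_g)\big)
  \quad \text{(identity similarity, training)}, \\
\max_{F}\; & \cos\big(S(Y_f), S(X_s)\big)
  \quad \text{(identity similarity, verification, } X_g \text{ unknown)}.
\end{align}
```

The last line only differs from the one before it in that the reference $X_s$ stands in for the unavailable ground truth $X_g$, which is why the quality of $X_s$ matters in the verification phase.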

Identity-Guided Inpainting
To keep structural consistency, we implement the content inpainting module $C$ by rebuilding the network of DeepFill [8] to match the input size of 112 × 112. Inspired by SPADE [31] and SwapInpaint [26], we utilize a GAN-like identity fusor for identity-guided inpainting. To fuse more identity information into the recovered result, we replace the Gaussian space of the traditional GAN with the identity space and adopt a recognizer trained on an occlusion-free dataset as the identity sampler. Here, we use an ArcFace model built on ResNet50-IR [2] with a feature dimension of 256, trained on unoccluded CASIA-WebFace [32]. The identity fusor contains a series of modulation blocks with upsampling layers. If we define the k-th modulation block as $f_k$, the k-th fused output $Y_f^k$ is produced by $f_k$, whose inputs are resized to match the k-th level. $Y_f^0$ is the output of a 2× deconvolution on $z_{id}$. Similar to SwapInpaint [26], the attribute extractor $A$ is a UNet that converts $X_a$ into the multi-scaled attributes $z_a$.
To decrease the structure and style differences in inpainting scenarios, we extend the AAD [27] into the attribute and identity fusing block (AIFB), which combines SPADE and AAD in a residual block. As Figure 4 illustrates, each AIFB is divided into an ID-fusion path and a reconstruction path. The ID-fusion path consists of two AADs responsible for the fusion of $z_{id}$ and $z_a^k$, while the reconstruction path utilizes a SPADE module to rebuild the unoccluded region of the input image $X_a$.
Note that, according to Equation (2), in the verification phase we need to find a reference image $x_s$ that is as close as possible to the ground truth $x_g$ in identity space. From the quantitative comparisons, we find that some normal recognizers still generalize to occluded images to a degree; for example, ArcFace [2] reaches a verification accuracy of 85.28% on LFW [33] with random 64 × 64 block occlusions. Therefore, it is reasonable to infer that various occluded versions of the same image remain cohesive in identity space and can be used directly as the reference image in the verification phase.
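The cohesion argument above can be made concrete with a small sketch of how face verification is typically scored: two embeddings are compared by cosine similarity and declared the same identity when the score passes a threshold. The function names, the toy 3-D embeddings, and the threshold value of 0.3 are hypothetical choices for illustration; they are not taken from ArcFace or from our experiments.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_identity(emb_a, emb_b, threshold=0.3):
    """Decide whether two embeddings belong to one identity.

    The threshold is a hypothetical value; in practice it is tuned
    on a held-out validation split.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


def verification_accuracy(pairs, threshold=0.3):
    """Fraction of (emb_a, emb_b, is_same) triples classified correctly."""
    correct = sum(
        same_identity(a, b, threshold) == is_same for a, b, is_same in pairs
    )
    return correct / len(pairs)


# Toy embeddings: an occluded view drifts away from the clean one
# but stays far closer to it than to a different identity.
clean = [1.0, 0.0, 0.0]
occluded = [0.8, 0.6, 0.0]
other = [0.0, 0.0, 1.0]
pairs = [(clean, occluded, True), (clean, other, False)]
accuracy = verification_accuracy(pairs)
```

In this toy setting the occluded embedding remains on the correct side of the threshold, which is the property that lets an occluded face serve as its own identity reference.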

Training Process
For the content inpainter $C$, the training process is the same as DeepFill [8]. For the identity fusor, which we call ID-Fuser for short, we train the attribute extractor $A$ and the fusor $F$ jointly. The training set is $\{X_g, X_s, M\}$, where $X_s$ is randomly set to be the same as or different from $X_g$. As for the loss function, we use a reconstruction loss to train the attribute extractor and the reconstruction path when the reference images are the same as the ground truth images. For the ID-fusion path, we use an L2 loss between the attribute embeddings to maintain attribute consistency. At the same time, an identity loss is used to fuse the identity information of the reference face; it is computed from $\cos(\cdot, \cdot)$, the cosine similarity of two embeddings. Furthermore, we need a multi-scale GAN loss [27] to make the result realistic. The final loss is a weighted combination of these terms.
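A minimal sketch of how these loss terms could be combined is given below. It is an illustration under stated assumptions rather than our exact formulation: the identity loss is taken as one minus the cosine similarity, the attribute loss as a summed squared L2 distance over scales, the reconstruction loss as a mean squared error that is switched off for cross-identity pairs, and the assignment of the default weights (10, 10, 5) to particular terms is hypothetical. Features are plain Python lists so that the example runs without any deep learning library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def identity_loss(id_fused, id_reference):
    """Assumed form: 1 - cos(identity of fused face, identity of reference)."""
    return 1.0 - cosine_similarity(id_fused, id_reference)


def attribute_loss(attrs_fused, attrs_inpainted):
    """Assumed form: squared L2 distance summed over all attribute scales.

    Each argument is a list with one flat feature list per scale.
    """
    return sum(
        sum((u - v) ** 2 for u, v in zip(scale_f, scale_a))
        for scale_f, scale_a in zip(attrs_fused, attrs_inpainted)
    )


def reconstruction_loss(y_fused, x_ground_truth, same_pair):
    """Mean squared error, applied only when reference == ground truth."""
    if not same_pair:
        return 0.0
    diffs = [(u - v) ** 2 for u, v in zip(y_fused, x_ground_truth)]
    return sum(diffs) / len(diffs)


def total_loss(l_adv, l_attr, l_id, l_rec, w_attr=10.0, w_id=10.0, w_rec=5.0):
    """Weighted sum; which weight belongs to which term is an assumption."""
    return l_adv + w_attr * l_attr + w_id * l_id + w_rec * l_rec


# Toy usage with hand-made features.
l_id = identity_loss([1.0, 0.0], [1.0, 0.0])
l_attr = attribute_loss([[1.0, 1.0], [2.0]], [[0.0, 1.0], [4.0]])
l_rec = reconstruction_loss([1.0, 2.0], [1.0, 4.0], same_pair=True)
loss = total_loss(l_adv=1.0, l_attr=l_attr, l_id=l_id, l_rec=l_rec)
```

Gating the reconstruction term on same-identity pairs matters because a pixel-level target only exists when the reference and the ground truth are the same face; for cross-identity pairs the identity and attribute terms carry the supervision.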

Experiment Settings
We take CelebA [34], a large-scale face attributes dataset with more than 200K camera-captured celebrity photos, as the training dataset for all compared models, while LFW [33], CFP-FP [6], AgeDB-30 [7], and FaceScrub [35] are used as the test datasets. The faces in all datasets are aligned and cropped to 112 × 112 resolution. The occluded versions are synthesized as in [9]. We hold out 2K images for validation; the others are used for training. The loss weights are set by default to $\lambda_1 = \lambda_2 = 10$ and $\lambda_3 = 5$, and we gradually increase $\lambda_3$ from 5 to 10 during training. When training, the ratio of same-identity to cross-identity pairs is set to 1:1. All models use the Adam optimizer with the beta parameters set to [0.1, 0.999] and a learning rate of $10^{-4}$. ID-Fuser is trained for 100,000 iterations in total, while the content inpainter and the other inpainting models used for comparison are all trained for 500,000 iterations. We implement our model with PyTorch 1.7.1 on a single NVIDIA V100 with a batch size of 16.
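Two of the settings above, the gradual increase of $\lambda_3$ and the 1:1 pair ratio, can be sketched as follows. The linear shape of the ramp is an assumption, since only the start and end values are specified, and the sampler treats every other image as a cross-identity candidate, which is a simplification. The function names are hypothetical.

```python
import random


def lambda3_at(step, total_steps=100_000, start=5.0, end=10.0):
    """Weight lambda_3 at a given iteration, assuming a linear ramp."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def sample_reference(index, num_images, rng, same_prob=0.5):
    """Pick the reference image index for ground-truth image `index`.

    With probability `same_prob` the reference is the ground truth itself
    (same-identity pair); otherwise a different image is drawn uniformly.
    same_prob=0.5 corresponds to the 1:1 ratio.
    """
    if num_images < 2 or rng.random() < same_prob:
        return index
    other = rng.randrange(num_images - 1)
    # Shift indices at or above `index` by one so `index` is never returned.
    return other if other < index else other + 1


rng = random.Random(0)
references = [sample_reference(3, 10, rng) for _ in range(1000)]
same_count = sum(r == 3 for r in references)
```

Over many draws roughly half of the references coincide with the ground truth, so the reconstruction loss and the cross-identity fusion objective are exercised about equally often.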

Comparison Experiments

Face Inpainting
We compare the proposed ID-Inpainter, built on the content inpainters of PIC and CA, with PIC [9], CA [8], CA with cosine identity loss (the same as ExGAN [10]), and CA with central-diversity loss (the same as ID-GAN [15]) on face inpainting in Figure 5. Our ID-Inpainters achieve better visual quality than the others. Moreover, our models achieve better inpainting quality and higher identity similarity, as shown in Table 1.

Face Recognition
We evaluate the recognition performance of PIC [9], CA [8], CA-cos, CA-div, and ID-Inpainter on the occluded LFW dataset. All experiments are performed on the random block of 48 × 48, the random block of 64 × 64, and the random-part occlusions. The random block is implemented by placing block occlusion at a random location, including the mouth, left eye, right eye, nose, left face, right face, upper face, two eyes, and lower face. The results in Table 2 support several essential observations. First, structure consistency plays a role in improving the recognition accuracy. For content inpainting, CA performs better than the VAE-based PIC. Second, the area of the missing blocks has a significant influence on recognition. Lastly, compared with the CAs built with an encoder-decoder network, our ID-Inpainter achieves a higher score for occluded face recognition.
From existing research, we know that different occluded areas affect recognition differently. In this experiment, we quantitatively evaluate this influence on the LFW dataset. We explore occlusion types of the left eye, right eye, mouth, nose, two eyes, left face, right face, upper face, and lower face. The results in Table 3 show that the occluded area affects our method as well. For example, our method achieves high accuracy when the mouth area is occluded but suffers from sharp degradation when the eye areas are occluded. At the same time, the results demonstrate that our ID-Inpainter contributes to an accuracy increase for every part.
We propose an AIFB to shorten the distance between the inpainted result and the ground truth in style and structure. Here, we compare our results with the AAD-Generator [27], which uses the ID-fusion path only, and SwapInpaint [26] without post-processing. As shown in Figure 6, AAD-Generator and SwapInpaint effectively transfer identity information but cannot keep the unoccluded region unchanged.
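The random block occlusions used in these experiments can be sketched with a short mask generator. This is an illustrative sketch, not the synthesis code of [9]: it assumes that the block lies fully inside the 112 × 112 image, that its top-left corner is drawn uniformly, and that a mask value of 1 marks an occluded pixel. The function names are hypothetical.

```python
import random


def random_block_mask(image_size=112, block_size=64, rng=None):
    """Binary mask (list of rows) with one square block of ones.

    Assumptions: the block stays inside the image and 1 means occluded.
    """
    rng = rng or random.Random()
    top = rng.randint(0, image_size - block_size)
    left = rng.randint(0, image_size - block_size)
    mask = [[0] * image_size for _ in range(image_size)]
    for row in range(top, top + block_size):
        for col in range(left, left + block_size):
            mask[row][col] = 1
    return mask


def apply_mask(image, mask, fill=0):
    """Replace occluded pixels of a single-channel image with `fill`."""
    return [
        [fill if m else pixel for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy usage: occlude a constant image with a 48 x 48 block.
mask = random_block_mask(image_size=112, block_size=48, rng=random.Random(0))
image = [[1] * 112 for _ in range(112)]
occluded = apply_mask(image, mask)
```

A 64 × 64 block covers about 33% of a 112 × 112 face crop and a 48 × 48 block about 18%, which gives a sense of how much identity evidence each setting removes.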

Identity Space
To explore the influence of ID-Inpainter on occluded face recognition, we compare the identity distributions among four test datasets, i.e., the ground truth (GT), occluded (Occ.), CA, and ID-Inpainter. Five classes with 20 samples each are randomly picked from FaceScrub [35] and projected to a 256D identity space by ArcFace [2]. After that, we use t-SNE [36] to reduce the dimensions from 256 to 2 and visualize them after normalization, as in Figure 7. The highly aggregated features on the ground truth are dispersed due to occlusions. CA mitigates some of the dispersion but still fails to tell these classes apart. In contrast, ID-Inpainter, building on CA, makes the features more cohesive and separates these classes with clearer margins.

More Test Datasets
We report the verification experiment results for LFW-112, CFP-112, and AgeDB-112 in Table 4. On each dataset, we compare FROM [5], ArcFace [2], and our ID-Inpainter under different occlusions. These results demonstrate that our approach still works for test datasets that vary widely in age and angle.

Conclusions
We proposed ID-Inpainter, a new identity-guided face inpainting network for occluded face recognition. It achieves maximum identity preservation through a GAN-like fusing network. However, many challenges remain to be tackled when it is applied in real-world scenarios. For example, we cannot use it directly on real occlusions. For real occlusion datasets, such as RMFRD [37], Bus Violence [38], and CrowdSim2 [39], we must combine it with an automatic occlusion detector. At the same time, existing face occlusion detectors do not always produce accurate occlusion masks, which may negatively impact the subsequent inpainting process. Most occlusion detectors are built on a segmentation model and trained with synthesized datasets, and they perform poorly on real images. Appropriate improvements in datasets and algorithm strategies can significantly improve the accuracy of occlusion masks and thus help ensure recognition performance. For example, one could increase the proportion of real occluded images in the training dataset or improve the algorithm to obtain the occlusions indirectly by detecting the face background. Another obvious challenge is the balance between structure consistency and identity preservation. Appropriate loss weight settings and an appropriate ratio of same-identity pairs in the training dataset are needed to obtain optimal performance.
Combined with occlusion detectors, our model can play an essential role in various occluded face recognition scenarios, such as suspect retrieval and access verification. In the future, we plan to extend our work to blind inpainting, which will rely little on the occlusion detector and is anticipated to be more effective in practical applications.

Figure 1. Encoder-decoder-structured identity-preserving inpainting networks with different identity training losses. C is an encoder-decoder-structured content inpainting network, and R is a pretrained recognizer. $f_{id}$, $f_o$, and $f_r$ are identity-centered features, occlusion-recovered features, and real face features, respectively.

Figure 2. We get the recovered result closer to the ground truth by sampling from a closer distribution, which is learned with an unoccluded dataset.

Figure 3. The overall pipeline of our approach. It is divided into verification and training phases. The verification phase consists of two modules: ID-Inpainter $I$ and recognizer $R$. ID-Inpainter $I$ consists of three sub-networks, i.e., content inpainter $C$, identity sampler $S$, and identity fusor $F$. In the training phase, ground truth faces $X_g$, occlusion masks $M$, and reference images $X_s$ are put into $I$ to train an identity-guided inpainting model. In the verification phase, the masked face is used as the reference face to implement identity-preserving inpainting. Finally, the inpainted result is recognized by a normal recognizer $R$.

Figure 4. The structure of the k-th AIFB. Each block consists of an ID-fusion path and a reconstruction path.

Figure 5. Inpainting results generated by different models. In each row, from left to right: the masked face, the inpainting results of PIC [9], CA [8], CA with cosine identity loss (CA-cos), CA with central-diversity loss [15] (CA-div), ID-Inpainter on PIC (PIC-F), and ID-Inpainter on CA (CA-F), and the ground truth (GT).

Figure 7. Visualization of feature distributions obtained by reducing the 256D features to 2D with t-SNE [36], followed by normalization. Markers of different colors represent different classes. Zoom in for a better view.

Table 1. Quantitative performance on inpainting results. Arrows indicate whether larger is better or smaller is better, and bold indicates the optimal value.

Table 2. Verification accuracy (%) of occlusion-recovery methods. Bold indicates the best value.

Table 3. Results for the effect of our ID-Inpainter with different recognizers. The results are measured by verification accuracy (%).

Table 4. Results for LFW, CFP-FP, and AgeDB-30. The results are measured by verification accuracy (%). Bold indicates the best value.