Sensors
  • Article
  • Open Access

23 February 2022

AFD-StackGAN: Automatic Mask Generation Network for Face De-Occlusion Using StackGAN

1 College of Computer Science, Zhejiang University, Hangzhou 310027, China
2 Department of Software Engineering, University of Science and Technology, Bannu 28100, Pakistan
3 Department of Biomedical Engineering, College of Engineering, Princess Nourah Bint Abdulrahman University, Riyadh 11671, Saudi Arabia
4 Department of Computer Science, College of Computer and Information Sciences, Prince Sultan University, Riyadh 12435, Saudi Arabia
This article belongs to the Special Issue Big Data Analytics in Internet of Things Environment

Abstract

To address the problem of automatically detecting and removing the mask without user interaction, we present a GAN-based automatic approach for face de-occlusion, called Automatic Mask Generation Network for Face De-occlusion Using Stacked Generative Adversarial Networks (AFD-StackGAN). In this approach, we decompose the problem into two primary stages (i.e., Stage-I Network and Stage-II Network) and employ a separate GAN in both stages. Stage-I Network (Binary Mask Generation Network) automatically creates a binary mask for the masked region in the input images (occluded images). Then, Stage-II Network (Face De-occlusion Network) removes the mask object and synthesizes the damaged region with fine details while retaining the restored face’s appearance and structural consistency. Furthermore, we create a paired synthetic face-occluded dataset using the publicly available CelebA face images to train the proposed model. AFD-StackGAN is evaluated using real-world test images gathered from the Internet. Our extensive experimental results confirm the robustness and efficiency of the proposed model in removing complex mask objects from facial images compared to the previous image manipulation approaches. Additionally, we provide ablation studies for performance comparison between the user-defined mask and auto-defined mask and demonstrate the benefits of refiner networks in the generation process.

1. Introduction

Face occlusion, a growing worldwide trend in recent years, is one of the leading causes of difficulty in computer vision problems such as face recognition, identification, tracking, detection, classification, face parsing, and contour extraction. Faces play the most substantial role in describing human facial characteristics, identity, expression, and emotion. People therefore use several methods, such as wearing fancy masks, painting the face with makeup, or pasting on a tattoo, to hide these characteristics from the public, video surveillance cameras, or face verification systems, because content replacement by serious occlusion with non-face objects always produces a partial appearance and an ambiguous representation. Obtaining high-resolution, non-occluded face images from occluded ones is essential but challenging for face analysis because faces usually contain few repetitive structures. For a successful face recognition system (FRS), or for guessing someone’s identity, removing an occluding object that covers most of the face and correctly restoring the face’s missing contents without destroying the existing data distribution is very important. The performance of an FRS model may degrade considerably in the presence of unknown occlusions or disguises. Removing the mask object covering the discriminative region of a human face and then correctly restoring the face’s missing contents might help reveal someone’s hidden identity.
Over the last several years, researchers have made significant progress on image synthesis algorithms that turn an occluded face image into an occlusion-free one. These methods achieve promising results for removing an object from an image; however, they exhibit noticeable defects in the affected regions, such as a lack of high-frequency and perceptual information, when they must deal with occlusion masks of large, complex objects that vary significantly in structure, size, shape, type, and position within the face image. This is primarily because these methods are trained on occlusion masks, such as medical masks, sunglasses, eyeglasses, microphones, scarves, cups, hands, and flowers, that exhibit less variation in structure, size, shape, type, and position. Their results also show severe deformations and aliasing flaws, especially in the regions around the eyes. Such degraded results severely affect many computer vision systems, such as recognition, identification, tracking, detection, and classification.
The main motivation behind this research is to de-occlude the occluded parts of an image while keeping the image smoothness unaffected, focusing on the facial area, i.e., to remove superimposed non-face (foreground) occluding objects and fill the holes left behind in facial images with visually plausible content. This involves automatically creating varied binary masks for the occluded regions after detecting them in the input (occluded) images and then inpainting the holes left behind after removing the unwanted objects with plausible, correct content and fine detail. A wide variety of occluded regions is observed in real face images. Thus, automatic face de-occlusion poses a challenging task because:
  • The result heavily depends on the accuracy of detection of the occluded region (i.e., failing to detect an occluded region properly may cause the generation of a poor binary mask that severely affects the de-occlusion task);
  • It is not easy to recover the complex semantics of the face under the detected occluded region due to its significant variations (i.e., occluding objects/non-face items vary widely in structure, size, color, shape, type, and position within facial images);
  • Training data, i.e., facial image pairs with and without mask objects, are sparse or non-existent.
We propose an interaction-free approach (i.e., the proposed approach can perform face de-occlusion without requiring a manual occlusion mask) that first detects the occluded region and generates a binary mask for it, whatever its size, shape, color, or structure, and then removes the non-face object from the foreground of the input occluded facial image while maintaining the face’s overall coherency.
An example result of the GAN [1] based automatic mask generation network for face de-occlusion using StackGAN (AFD-StackGAN) is shown in Figure 1. Following the well-known “coarse-to-fine structure recovery method,” the proposed model’s Stage-I Network (Binary Mask Generation Network) generates a binary mask for the masked region after detecting the mask object in the input facial images. Then, the Stage-II Network (Face De-occlusion Network) removes the mask object and synthesizes the damaged region with plausible content while retaining the global coherency of the face structure. Furthermore, we trained the proposed model on a synthetically created facial image dataset. Since there are no facial image pairs with and without mask objects, we created a paired synthetic dataset using the CelebA dataset. We assessed the proposed model on real-world test images gathered from the Internet, containing non-face items with wide variations in structure, size, color, shape, type, and position within the facial images. We compared the performance of the proposed model with previous face recovery methods; several experiments illustrate that the proposed AFD-StackGAN outperforms them.
Figure 1. The proposed AFD-StackGAN results on real-world images.
The main contributions of an automatic mask removal network for face de-occlusion are summarized as follows:
  • This work proposes a novel GAN-based inpainting method by employing an automatic mask generation network for face de-occlusion without human interaction. This work automatically eliminates challenging mask objects from the face and synthesizes the damaged area with fine details while preserving the restored face’s appearance and structural consistency;
  • This work attempts to alleviate the manual mask selection burden by creating a straightforward method that can intelligently and automatically generate the occluded region’s binary mask in facial images;
  • One potential application of an automatic mask generation network could be a video where mask objects continuously conceal the face’s structural semantics;
  • We experimentally show that the proposed model with an automatically generated mask is more effective than those with manually generated masks for removing mask objects and generating realistic semantics of face images.
The structure of this research work is as follows. Section 2 reviews the work related to image editing. The proposed approach, as well as the loss function, is described in Section 3. The proposed scheme’s implementation and training details are discussed in Section 4. Results and comparisons are discussed in Section 5. Section 6 concludes the paper.

3. Our Approach

The general architecture of the proposed AFD-StackGAN is shown in Figure 2. The Stage-I Network and Stage-II Network are the two major networks; the following sections consider each in detail. Our task is to simultaneously generate the binary mask and remove the non-face object from the occluded image. To implement this, we propose a two-stage approach in which each stage focuses on one aspect: Stage-I generates a binary mask, and Stage-II removes the mask object from the input facial image.
Figure 2. The architecture of the automatic mask removal network for face de-occlusion. It consists of Stage-I Network that generates a binary mask and Stage-II Network that removes the mask object from input facial images.

3.1. Stage-I Network: Binary Mask Generation Network

Stage-I Network (Binary Mask Generation Network) generates a binary mask after detecting the mask object in the input occluded facial image. The generator G_1 at Stage-I takes the input (occluded) image I_c and generates a binary mask I_pre_mask.
Generator G_1. The encoder of generator G_1 takes the facial image I_c as input and maps it to a low-dimensional latent representation (bottleneck layer). The decoder then maps this latent representation back to image space to generate a binary mask I_pre_mask of the same size as the input facial image. The architecture we design has three convolution layers in the encoder and three transposed convolution (de-convolution) layers in the decoder, as shown in Figure 2. Each convolution layer takes the form of a ReLU + a convolution + a normalization layer, except the first and last layers, which use a tanh in place of a ReLU. The decoder of G_1 mirrors the encoder, except that de-convolution layers substitute for convolution layers and gradually up-sample the latent representation to image scale. The last decoder layer uses a tanh activation without a normalization layer.
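A minimal sketch of such an encoder-decoder mask generator in tf.keras is shown below; the filter counts, kernel sizes, and normalization choice are illustrative assumptions rather than the authors’ exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mask_generator_g1(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)

    # Encoder: three strided convolutions, progressively down-sampling.
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="tanh")(inp)  # first layer: tanh
    x = layers.Conv2D(128, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(256, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    # Decoder: three transposed convolutions, up-sampling back to image scale.
    x = layers.Conv2DTranspose(128, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Last layer: single-channel mask, tanh activation, no normalization.
    mask = layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh")(x)

    return tf.keras.Model(inp, mask, name="G1_mask_generator")
```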
Loss Function. An L_l1 loss is used to train the Stage-I Network. It calculates the pixel-wise difference between the predicted binary mask I_pre_mask and the target binary mask I_gt_mask, matching the details of I_pre_mask with I_gt_mask. The L_l1 loss between I_pre_mask and I_gt_mask is expressed as:
L_{l1} = \left\| I_{pre\_mask} - I_{gt\_mask} \right\|_1
where ||·||_1 denotes the l_1-norm, I_pre_mask is the predicted binary mask, and I_gt_mask is the target binary mask.
The binary masks I_pre_mask generated by G_1 are rough and contain noise at some locations. To obtain a clean binary mask I_m, we apply additional erosion and dilation morphological image-processing operations as a mask refiner network: erosion removes salt noise from the generated mask I_pre_mask, and dilation fills in its holes.
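The morphological refinement step can be sketched as follows, assuming OpenCV; the kernel size and iteration counts are illustrative.

```python
import cv2
import numpy as np

def refine_mask(pre_mask, kernel_size=5):
    """Clean a rough binary mask: erosion removes salt noise,
    dilation fills small holes in the detected mask region."""
    # Threshold the generator output (tanh range [-1, 1]) to a binary mask.
    binary = (pre_mask > 0).astype(np.uint8) * 255
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(binary, kernel, iterations=1)    # remove isolated noise pixels
    refined = cv2.dilate(eroded, kernel, iterations=2)  # close holes, restore mask extent
    return refined
```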

3.2. Stage-II Network: Face De-Occlusion Network

Stage-II Network (Face De-occlusion Network) aims to remove the occlusion mask from facial images and complete the region left behind with plausible content and fine details. Stage-II consists of two generator-discriminator pairs: G_2 + D_2 and G_3 + D_3. Generator G_2 takes the occluded input image I_c and the binary mask I_m as a combined input and generates an occlusion-free image I_oi. Generator G_3 takes the input image I_c, the binary mask I_m, and I_oi (the G_2 output) as a combined input and generates the final occlusion-free image I_if. The two discriminators D_2 and D_3 force generators G_2 and G_3 to produce visually plausible, natural-looking images by classifying I_oi (the G_2 output) and I_if (the G_3 output) as real or fake faces. The following sections consider each network in detail.
Generator G_2. Generator G_2 at Stage-II uses a CNN-based encoder-decoder architecture. This encoder-decoder follows the idea of U-Net [11] with skip connections to prevent the loss of spatial detail at higher resolutions during the down-sampling and up-sampling performed by the encoder and decoder. The encoder takes the image I_o, a concatenation of the occluded image I_c (Stage-I input) and the refined binary mask I_m (Stage-I output), and maps it to a low-dimensional latent representation. The decoder then maps the low-dimensional latent representation back to image space and generates the initial coarse output facial image I_oi. The encoder of G_2 is composed of five convolution layers (for simplicity, only three encoder layers are shown in Figure 2) that progressively down-sample the latent representation. Each convolution layer takes the form of a ReLU + a convolution + an instance normalization layer, except the first and last layers, which use a tanh in place of a ReLU.
The decoder of G_2 mirrors the encoder, except that de-convolution layers substitute for convolution layers and gradually up-sample the latent representation to image scale. A combination of dilated convolution (DC) [30] and Squeeze-and-Excitation (SE) blocks [31], as shown in Figure 2, is used in the middle of the encoder-decoder. DC enlarges the receptive field without increasing the computational cost or the number of network parameters, making the recovered area under the occlusion mask more consistent with its surroundings. The SE block is an addition to the fully convolutional network (FCN) that enhances the network’s representative power by learning a weight for each feature map channel; in other words, SE blocks recalibrate the feature maps channel-wise.
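A minimal sketch of a dilated-convolution + SE bottleneck of this kind in tf.keras is given below; channel counts, dilation rates, and the reduction ratio are assumptions, not the paper’s exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation: recalibrate feature maps channel-wise."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                        # squeeze
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)           # excitation weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                              # rescale channels

def dilated_se_bottleneck(x, filters=256, dilation_rates=(2, 4, 8)):
    """Stack dilated convolutions to grow the receptive field without extra
    down-sampling, then recalibrate the result with an SE block."""
    for rate in dilation_rates:
        x = layers.Conv2D(filters, 3, padding="same", dilation_rate=rate)(x)
        x = layers.ReLU()(x)
    return se_block(x)
```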
Discriminator D_2. A PatchGAN discriminator D_2, which only penalizes structure at the scale of patches [32], is used instead of a regular GAN discriminator [1] to focus on reconstructing high-frequency content. Discriminator D_2 tries to decide whether each 32 × 32 patch of the image I_oi (de-occluded image) is real or fake. We run D_2 convolutionally across the image I_oi and average all responses to obtain its final output.
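The patch-level discriminator can be sketched as follows in tf.keras; the depth and filter counts are illustrative, and the effective patch size depends on the number of strided layers rather than being fixed to 32 × 32.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patch_discriminator(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # 1-channel map of patch-level real/fake logits; responses are averaged
    # over this map to obtain the final discriminator output.
    patch_logits = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model(inp, patch_logits, name="patch_discriminator")
```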
Loss Function. To minimize artifacts and ensure better visual quality, a careful combination of reconstruction loss L_rc, perceptual loss L_per, and adversarial loss L_adv (i.e., we unite the three losses for each part of the Stage-II Network) is used to produce realistic and perceptually correct content for the occlusion-free face image. The joint loss function used to train the Stage-II Network (Face De-occlusion Network) is defined as:
L_{joint} = \alpha L_{rc} + \beta L_{per} + L_{adv}
where α and β are constants that adjust the weights of the reconstruction loss and perceptual loss, respectively.
The reconstruction loss is composed of the pixel-wise reconstruction loss L_l1 and the structure-level similarity loss L_SSIM. It can be written as:
L_{rc} = L_{l1} + L_{SSIM}
The pixel-wise reconstruction loss L_l1 measures the per-pixel difference between the generated occlusion-free face image I_oi and the ground truth I_gt. We calculate it with the l_1-norm in place of the l_2-norm because the l_1-norm encourages less blurring and fewer glaring errors. The pixel-wise reconstruction loss L_l1 can be defined as:
L_{l1} = \left\| I_{oi} - I_{gt} \right\|_1
where ||·||_1 is the l_1-norm and I_oi = G_2(I_o) is the output image of the generator G_2, i.e., the face image without occlusion.
The structure-level similarity loss L_SSIM [33], which measures the structure-level difference between the generated occlusion-free face image I_oi and the ground truth I_gt, can be defined as:
L_{SSIM} = 1 - \mathrm{SSIM}(I_{oi}, I_{gt})
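A minimal TensorFlow sketch of the reconstruction loss L_rc = L_l1 + L_SSIM, assuming images scaled to [0, 1]:

```python
import tensorflow as tf

def reconstruction_loss(generated, target):
    l1 = tf.reduce_mean(tf.abs(generated - target))                    # pixel-wise L1 term
    ssim = tf.reduce_mean(tf.image.ssim(generated, target, max_val=1.0))
    l_ssim = 1.0 - ssim                                                # structure-level term
    return l1 + l_ssim
```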
The perceptual loss L_per, which encourages the generator’s output to have a representation identical to the ground truth, measures the feature-level difference between feature maps of the generated occlusion-free face image I_oi and the ground truth I_gt, extracted by a VGG-19 network [34] pre-trained on ImageNet [35]. Let φ_i be the activation map of the i-th layer of the VGG-19 network; the feature matching loss is then defined as:
L_{per} = \sum_{i} \left\| \varphi_i(I_{oi}) - \varphi_i(I_{gt}) \right\|_1
We exploit the intermediate convolution layer feature maps (conv_3, conv_4 and conv_5) of the VGG-19 network to obtain rich structural information, which helps in recovering a plausible structure for the face semantics.
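A hedged sketch of this perceptual loss using the tf.keras VGG-19 model is shown below; the layer names block3_conv4, block4_conv4, and block5_conv4 are assumptions standing in for the conv_3, conv_4, and conv_5 feature maps mentioned above.

```python
import tensorflow as tf

_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_feature_layers = ["block3_conv4", "block4_conv4", "block5_conv4"]
_extractor = tf.keras.Model(
    _vgg.input, [_vgg.get_layer(name).output for name in _feature_layers])
_extractor.trainable = False

def perceptual_loss(generated, target):
    # VGG-19 expects inputs preprocessed to its training distribution;
    # generated/target are assumed to be in [0, 1].
    gen_feats = _extractor(tf.keras.applications.vgg19.preprocess_input(generated * 255.0))
    tgt_feats = _extractor(tf.keras.applications.vgg19.preprocess_input(target * 255.0))
    return tf.add_n([tf.reduce_mean(tf.abs(g - t))
                     for g, t in zip(gen_feats, tgt_feats)])
```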
In addition to the reconstruction loss L_rc and perceptual loss L_per, the adversarial loss L_adv, used to render the repaired image I_oi as realistic as possible, is expressed in Equation (7):
L_{adv} = \min_{G_2} \max_{D_2} \; \mathbb{E}\left[\log D_2(I_{oi}, I_{gt})\right] + \mathbb{E}\left[\log\left(1 - D_2(G_2(I_o))\right)\right]
where I_gt represents the real sample (ground truth), I_oi represents the initially generated de-occluded image, I_o is the concatenated input of G_2, E denotes the expectation, and L_adv represents the adversarial loss of the base network. The term log D_2(I_oi, I_gt) is the loss for D_2, and log(1 − D_2(G_2(I_o))) is the loss for G_2.
Generator G_3. Generator G_3 at Stage-II is quite similar to generator G_2. We propose G_3 to bring the initial result I_oi (the G_2 result) closer to the ground truth by rectifying what is missing or wrong in it. To achieve this, we feed I_c and I_m (the G_2 inputs) again, together with I_oi (the G_2 output), as a concatenated input I_of into G_3, which generates the final result I_if with more photorealistic details in the recovered area. Feeding I_c and I_m again enforces edge consistency at the boundary of the affected region, further increasing the visual quality of the generated face image.
Discriminator D_3. The PatchGAN discriminator D_3 at Stage-II shares the same architecture as D_2. Discriminator D_3 tries to classify whether each 32 × 32 patch of the image I_if (final de-occluded image) is real or fake. We run D_3 convolutionally across the image I_if and average all responses to obtain its final output.
Loss Function. We incorporate the same reconstruction loss L_rc and perceptual loss L_per to produce the final de-occluded image, so we do not restate them here. The adversarial loss L_adv, used to make the repaired image I_if as realistic as possible and generate realistic results, is expressed in Equation (8):
L_{adv} = \min_{G_3} \max_{D_3} \; \mathbb{E}\left[\log D_3(I_{if}, I_{gt})\right] + \mathbb{E}\left[\log\left(1 - D_3(G_3(I_{of}))\right)\right]
where I_gt represents the real sample (ground truth), I_if represents the finally generated de-occluded image, I_of is the concatenated input of G_3, E denotes the expectation, and L_adv represents the adversarial loss of the refiner network. The term log D_3(I_if, I_gt) is the loss for D_3, and log(1 − D_3(G_3(I_of))) is the loss for G_3.

3.3. Total Loss Function

The total loss function used to train the whole module is a weighted sum of L_l1 (Equation (1)) and L_joint (Equation (2)), defined as:
L_{total} = L_{l1} + \alpha L_{rc} + \beta L_{per} + L_{adv}
where α and β are the constants for weighting the reconstruction and perceptual losses. For the first part of Stage-II (G_2 + D_2), we used α = 100 and β = 33 to capture better structure, and for the second part of Stage-II (G_3 + D_3), we used α = 10 and β = 3.3 to yield natural-looking results.
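The weighted objective can be assembled as in the following sketch, which reuses the reconstruction and perceptual loss helpers sketched earlier and uses the common sigmoid cross-entropy (non-saturating) form of the generator’s adversarial term as a practical stand-in for the min-max expressions in Equations (7) and (8):

```python
import tensorflow as tf

def adversarial_generator_loss(disc_logits_fake):
    # Non-saturating generator term: push the discriminator toward "real" on fakes.
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(disc_logits_fake), logits=disc_logits_fake))

def joint_loss(generated, target, disc_logits_fake, alpha, beta):
    l_rc = reconstruction_loss(generated, target)   # sketched earlier
    l_per = perceptual_loss(generated, target)      # sketched earlier
    return alpha * l_rc + beta * l_per + adversarial_generator_loss(disc_logits_fake)

# Weights reported in the paper:
#   first part of Stage-II  (G2 + D2): alpha = 100, beta = 33
#   second part of Stage-II (G3 + D3): alpha = 10,  beta = 3.3
```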

4. Experiments

In this section, we first describe the training and implementation details of the proposed approach. We then introduce the competing baseline models. Finally, we explain the synthetic dataset created for training and the real-world dataset used for evaluation.

4.1. Training and Implementation Details

For training of the Stage-I Network, we input facial images I_c into the mask generation network, which generates a binary mask I_pre_mask close to the target binary mask I_gt_mask. I_pre_mask is then fed into the mask refiner network, which produces the final binary mask I_m. For training of the Stage-II Network, we input the facial image I_c (Stage-I input) and the binary mask I_m (Stage-I output) and generate an occlusion-free facial image I_oi. Then I_oi (the initially generated de-occluded image), I_c, and I_m are fed into the image refiner network (G_3), which produces the final occlusion-free facial image I_if.
TensorFlow [36] is used to implement the proposed model, which is trained on an Nvidia GTX 1080Ti GPU. We trained the proposed model with a batch size of 10 using the Adam optimizer [37] for 1000 iterations. We used TTUR [38] for training, with a learning rate of 0.0001 for the generator and 0.0004 for the discriminator in both stages; GAN training becomes more stable when different learning rates are used for generator and discriminator updates.
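One TTUR training step with the two Adam optimizers could look like the following sketch; the generator, discriminator, and joint_loss objects are the hypothetical pieces sketched in Section 3, not the released implementation.

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)    # generator: 0.0001
disc_opt = tf.keras.optimizers.Adam(learning_rate=4e-4)   # discriminator: 0.0004 (TTUR)

@tf.function
def train_step(generator, discriminator, occluded_with_mask, ground_truth, alpha, beta):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(occluded_with_mask, training=True)
        real_logits = discriminator(ground_truth, training=True)
        fake_logits = discriminator(fake, training=True)

        bce = tf.nn.sigmoid_cross_entropy_with_logits
        d_loss = tf.reduce_mean(bce(labels=tf.ones_like(real_logits), logits=real_logits)) + \
                 tf.reduce_mean(bce(labels=tf.zeros_like(fake_logits), logits=fake_logits))
        g_loss = joint_loss(fake, ground_truth, fake_logits, alpha, beta)  # sketched earlier

    gen_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return g_loss, d_loss
```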

4.2. Competing Methods

After reviewing various related approaches in Section 2, GLCIC (Iizuka et al. [21]), GCA (Yu et al. [25]), EdgeConnect (Nazeri et al. [27]), and MRGAN (Din et al. [28]) are the closest approaches to our work. MRGAN is a GAN-based two-stage framework for removing a medical face mask and reconstructing the mask-covered region. While it produces impressive results for medical masks, its network cannot automatically detect and remove multiple types of complex objects. In contrast, the proposed model (AFD-StackGAN) can automatically detect and remove multiple complex objects of various sizes, shapes, colors, and structures. EdgeConnect also uses a two-stage adversarial approach in which it generates guidance information in the first stage and edits the image in the second; it successfully recovers the image from the edge information hallucinated by an edge generator network. Unlike EdgeConnect, which generates an edge map of the complete image, the proposed model generates a binary mask of the non-face object (i.e., the masked region). Moreover, EdgeConnect uses a GAN setup with one discriminator in each stage, whereas the proposed model employs two separate discriminators together with two separate generators. Our generators use a CNN-based encoder-decoder architecture with skip connections, which strengthen the predictive ability of the generator and prevent the gradient vanishing caused by a deep network; the results show that images completed by the encoder-decoder architecture with skip connections are more realistic.
In contrast, GLCIC and GCA train both discriminators jointly with a single generator to learn global consistency and fill the missing region, relying on a post-processing step such as Poisson image blending, whereas we train both discriminators with two separate generators and do not use any supplementary pre- or post-processing step. The GLCIC and GCA models show noticeable artifacts and blurriness in the generated regions because they predict the missing regions from high-level features only. Unlike them, the proposed model predicts the missing regions from both low-level and high-level features (pixel-wise l_1 loss for low-level features and structural similarity (SSIM) loss for high-level features). These schemes are not suitable for our problem because they cannot overcome the complexity of the task and produce artifacts for large missing regions of arbitrary shape.

4.3. Datasets

4.3.1. Synthetic Generated Dataset

For supervised training of our model, no publicly accessible dataset provides face image pairs with and without mask objects. We therefore created a synthetic dataset using the publicly available CelebA face dataset [39]. With more than 200k celebrity images, CelebA is a vast face attribute collection. To create synthetic samples, we randomly placed mask objects of various sizes, shapes, colors, and structures in the images using Adobe Photoshop CC 2018, as shown in row two of Figure 3. Then, we created the binary masks of the corresponding mask objects, as shown in row three. All input images and masks in our synthetic dataset have a resolution of 256 × 256. Figure 3 shows some sample images of our synthetic dataset. Further descriptions of our synthetic dataset are given in Table 1.
Figure 3. Some images of our synthetic dataset.
Table 1. A summary of dataset feature description used in experiments.
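For illustration, a paired sample of this kind could also be composited programmatically, as in the sketch below (the authors used Photoshop); the RGBA mask-object image, placement, and thresholds are illustrative assumptions.

```python
import numpy as np

def composite_sample(face_rgb, object_rgba, top_left):
    """Paste a non-face object onto a 256x256 face image and return the
    occluded image plus its binary ground-truth mask."""
    occluded = face_rgb.copy()
    gt_mask = np.zeros(face_rgb.shape[:2], dtype=np.uint8)

    h, w = object_rgba.shape[:2]
    y, x = top_left
    alpha = object_rgba[:, :, 3:4].astype(np.float32) / 255.0       # object opacity
    region = occluded[y:y + h, x:x + w].astype(np.float32)
    occluded[y:y + h, x:x + w] = (alpha * object_rgba[:, :, :3] +
                                  (1.0 - alpha) * region).astype(np.uint8)
    gt_mask[y:y + h, x:x + w] = (alpha[:, :, 0] > 0.5).astype(np.uint8) * 255
    return occluded, gt_mask
```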

4.3.2. Real-World Generated Dataset

A dataset of occluded facial images downloaded from the Internet was formed to demonstrate the proposed method’s effectiveness on real-world data. While creating this occluded facial image dataset, we took care to ensure that the collected images were diverse in the size, shape, structure, and position of the occlusion masks. Additionally, binary masks of the corresponding occluded regions were created for the real-world data using Adobe Photoshop 2018, since manually generated binary masks for the occluded region are provided along with the input occluded facial images at the training and inference stages. This dataset is used for evaluation (test) purposes only. Each image in the real-world data has a resolution of 256 × 256.

4.4. Performance Evaluation Metrics

Although GAN-based models have achieved great success in numerous computer vision applications, it is still difficult to judge which methods are better than others because there is no standard function for quantitative evaluation of GAN performance. Nevertheless, to quantitatively and objectively analyze the accuracy and effectiveness of the proposed system, several numerical evaluation metrics are used: Structural Similarity (SSIM) [33], which estimates the overall similarity between the reconstructed and target face images; Peak Signal-to-Noise Ratio (PSNR), one of the most widely used full-reference quality metrics, which measures the difference in pixel values between the reconstructed and target face images; Mean Squared Error (MSE), which calculates the average squared difference between the reconstructed and target face images; the Naturalness Image Quality Evaluator (NIQE) [40], which measures image quality; and the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [41], which measures the naturalness of an image.
Greater PSNR and SSIM values indicate closer distances between the synthetic and real data distributions (i.e., better performance of the generative model), while lower PSNR and SSIM values indicate greater distances (i.e., worse performance). Conversely, lower MSE, NIQE, and BRISQUE values indicate closer distances between the synthetic and real data distributions (better performance), while higher values indicate greater distances (worse performance).
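The full-reference metrics can be computed with scikit-image as in the sketch below; NIQE and BRISQUE are no-reference metrics and are typically computed with separate implementations (e.g., the original MATLAB code), so they are omitted here.

```python
from skimage.metrics import (structural_similarity,
                             peak_signal_noise_ratio,
                             mean_squared_error)

def full_reference_scores(generated, target):
    """generated/target: uint8 RGB images of identical size."""
    return {
        # channel_axis=-1 for color images (multichannel=True on older scikit-image)
        "SSIM": structural_similarity(generated, target, channel_axis=-1),
        "PSNR": peak_signal_noise_ratio(target, generated),
        "MSE": mean_squared_error(target, generated),
    }
```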

5. Results and Comparisons

We designed an automatic mask generation network for face de-occlusion to remove mask objects. This network automatically detects mask objects, generates binary masks, and then removes the mask objects. This section covers the results of the Stage-I Network and the Stage-II Network. We also discuss and compare the qualitative and quantitative performance of the proposed model with the baseline models.

5.1. Results of Stage-I Network

Figure 4 shows the results of the Stage-I Network on real-world images. The first row contains input images with mask objects. The mask generation network successfully generated binary masks, as listed in the second row. The third row displays the results of the mask refiner network, which improves the results by rectifying what is wrong or missing in the mask generator network results. Finally, these masks are used as input to Stage-II Network.
Figure 4. The results of Stage-I Network on real-world images.

5.2. Results of Stage-II Network

Figure 5 shows the results of the Stage-II Network on real-world images. The first row contains input images, the second row shows the corresponding binary masks generated by the mask generation network, the third row contains the refined masks produced by the mask refiner network, and the last two rows show the output of the Stage-II Network (Face De-occlusion Network). It can be seen that the proposed Face De-occlusion Network successfully generates correct face semantic structure and texture without any interaction. Therefore, this fully automatic approach can be used in practical applications, such as live video.
Figure 5. The results of AFD-StackGAN (Stage-I Network + Stage-II Network) on real-world images.

5.3. Qualitative Comparisons

In the absence of a consistent and robust assessment method, sample quality is primarily evaluated based on the visual fidelity of the images generated by the GAN-based frameworks. Figure 6 shows the results of the proposed AFD-StackGAN and the baseline models (Iizuka et al. [21], Yu et al. [25], Nazeri et al. [27], and Din et al. [28]) on real-world images, displaying the input facial images and the output occlusion-free facial images from the qualitative test set. As seen in Figure 6, the results of the proposed AFD-StackGAN are smoother and more realistic than those generated by the baseline models on real data. The qualitative results show that the proposed AFD-StackGAN can handle occluded facial images under challenging conditions, e.g., complex occlusions with variations in size, structure, type, shape, and position in the facial image.
Figure 6. Visual assessment of the proposed AFD-StackGAN with the baseline models on real-world images.
  • Hard Examples. Although the proposed AFD-StackGAN can remove occlusion masks of various shapes, sizes, colors, and structures, even in images not used to train the network, there are examples, shown in Figure 7, in which AFD-StackGAN fails to remove the occlusion masks altogether. Common failure cases occur when the Stage-I Network (Binary Mask Generation Network) cannot produce a good binary mask of the mask object because it fails to detect it correctly, as shown in the first row of Figure 7. This happens when occlusion masks differ from those in our synthetic dataset in shape, position, and structure, for example when they mainly cover the regions around both eyes. As seen in the first row of Figure 7, the mask objects’ shapes, colors, positions, and structures differ from the mask types used in our synthetic dataset. Moreover, the proposed model was trained on images from the CelebA dataset, which are roughly cropped and aligned, while the other images (e.g., real-world images) are not processed in this manner, as shown in the first row of Figure 7. Our model cannot handle unaligned faces well and fails to generate the missing regions of images with unaligned faces. As expected, AFD-StackGAN produces worse results overall in these cases, as seen in the third row.
    Figure 7. AFD-StackGAN performance for real face images with occlusion masks that have very different structures and locations in the face images than the occlusion masks used in the synthetic dataset. The first row shows occluded input facial images, and the second row shows de-occluded output face images.

5.4. Quantitative Comparisons

To quantitatively compare the performance of the proposed model and the baseline models, we use the following five performance evaluation metrics: (1) SSIM, (2) PSNR, (3) MSE, (4) NIQE, and (5) BRISQUE (as explained in Section 4.4). The quantitative scores in terms of SSIM, PSNR, and MSE are evaluated on the synthetic test dataset, because no ground truth exists for the real occluded face images downloaded from the Internet, while the scores in terms of NIQE and BRISQUE are evaluated on the results from the real test samples. For MSE, NIQE, and BRISQUE, smaller values indicate superior performance, while for PSNR and SSIM, the higher, the better. The quantitative scores in terms of SSIM, PSNR, MSE, NIQE, and BRISQUE of the proposed AFD-StackGAN and the baseline models are shown in Table 2, which reports the averaged scores over the individual test images. It has been observed that AFD-StackGAN generates semantically consistent and visually plausible face images without occlusion masks, which can help improve the performance of many computer vision algorithms for face identification/recognition purposes in future studies.
Table 2. Performance comparison of different methods in terms of SSIM, MSE, PSNR, NIQE, and BRISQUE. For PSNR and SSIM, higher values show superior performance, while for BRISQUE and NIQE, the lower, the better.

5.5. Ablation Studies

This section presents ablation studies to understand the usefulness of an automatically generated mask compared with a manually generated mask, and the role of the refiner networks in both stages.

5.5.1. Performance Comparison between Using User-Defined Mask and Auto-Defined Mask

To evaluate the effectiveness of the proposed method, we compared the performance of directly using a user-defined, manually generated binary mask with that of using an automatically generated binary mask. The first column in Figure 8 contains the input images, the second column shows the editing results obtained with the user-defined manually generated binary mask, and the third column shows the editing results obtained with the automatically generated binary mask. We can see that the results obtained with the automatically generated binary mask are better than those obtained with the user-defined manually generated binary mask. Table 3 shows the quantitative scores of the proposed method with the user-defined mask and the auto-defined mask.
Figure 8. Visual comparison of the automatic mask removal network (used auto-generated mask) with FD-StackGAN (used user-defined mask).
Table 3. Performance comparison between using user-defined mask and auto-defined mask in SSIM, PSNR, MSE, NIQE, and BRISQUE.
Note that the editing result with the user-defined manually generated binary mask is obtained by running only the Stage-II Network, without the Stage-I Network; the user-defined binary mask and the input image are fed directly into the Stage-II Network.

5.5.2. Role of Refiner Networks

We performed an ablation study to show the effectiveness of the refiner networks in the proposed multi-stage approach. For this, we drew a qualitative comparison by training the proposed model with and without the refiner networks. As shown in Figure 9, each stage of the proposed model trained with a refiner network generates more photorealistic, artifact-free results with minimal deformation compared with the results of the corresponding stage trained without a refiner network.
Figure 9. Results of the refiner networks on real-world images, which further improve the results by rectifying what is missing or wrong in the base network results.
In the first stage of our model, the mask generation network generates a binary mask automatically. The results generated by the mask generation network (i.e., the binary masks) contain noise at some locations (red circles indicate the locations of some noise artifacts). The refiner network removes this noise (blue circles indicate the areas and locations of some refinement corrections). With the help of the refiner network, the Stage-I Network can generate a cleaner, noise-free binary mask.
In the second stage of our model, the face de-occlusion network removes the mask object and completes the area left behind with plausible content and fine details. The initially generated results are generally blurry, with missing details and several defects, especially in the masked areas (red circles indicate the locations of some undesired artifacts). The refiner network corrects what is missing or wrong in the initially generated results (blue circles indicate the areas and locations of some refinement corrections) and produces results that contain more photorealistic details with minimal undesired artifacts. With the help of the refiner network, the Stage-II Network can generate more natural-looking images.

6. Conclusions

This work proposed a two-stage GAN-based model that successfully recovers the de-occluded facial image after automatically generating the mask of the non-face object in the occluded input facial image. Previous approaches cannot adequately address the removal of numerous mask objects covering large discriminative regions of a person’s face. In contrast, the proposed model can successfully remove numerous types of large, complex mask objects covering most of the person’s face by creating semantically applicable and visually plausible content for the missing regions. The performance on real-world data is quite satisfactory even though we train our network on the synthetic dataset only. We analyzed the proposed model quantitatively and qualitatively and showed that it can produce structurally consistent results of higher perceptual quality. The proposed model is flexible enough to handle vast missing or covered regions that vary in structure, size, color, and shape.
Since AFD-StackGAN is trained on a synthetic dataset, there may be a domain discrepancy between real-world test facial images and synthetic training facial images. To manage this issue, domain adaptation could be used to reduce the domain distance between real and synthetic images. We plan to address this in future work.

Author Contributions

A.J. developed the method; A.J., X.L., M.A. (Muhammad Assam) and J.A.K. performed the experiments and analysis, and M.O., M.A.A., F.N.A.-W. and M.A. (Muhammad Assad) wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under grant number (RGP.1/14/43). Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R203), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would like to acknowledge the support of Prince Sultan University, Riyadh, Saudi Arabia, for partially supporting this project and for paying the Article Processing Charges (APC) of this publication.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

GAN: Generative Adversarial Network
CNN: Convolutional Neural Network
FCN: Fully Convolutional Network
SE: Squeeze-and-Excitation block
DC: Dilated Convolution
TTUR: Two Time-scale Update Rule
Notations
I_c: Occluded image
I_gt: Ground truth image
I_pre_mask: Generated binary mask
I_m: Noise-free (refined) binary mask
I_o: Concatenated input of occluded image I_c and refined binary mask I_m
I_oi: Initially generated de-occluded facial image
I_of: Concatenated input of occluded image I_c, refined binary mask I_m, and initially generated de-occluded facial image I_oi
I_if: Finally generated de-occluded facial image

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 28 June 2014; pp. 580–587. [Google Scholar]
  3. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  8. He, D.; Yang, X.; Liang, C.; Zhou, Z.; Ororbi, A.G.; Kifer, D.; Giles, C.L. Multi-scale with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3519–3528. [Google Scholar]
  9. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  10. Chen, L.C.; Papandreou, G.; Kokkinos, l.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  12. Ehsani, K.; Mottaghi, R.; Farhadi, A. Segan: Segmenting and generating the invisible. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6144–6153. [Google Scholar]
  13. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230. [Google Scholar]
  14. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. Sod-mtgan: Small object detection via a multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 206–221. [Google Scholar]
  15. Prakash, C.D.; Karam, L.J. It gan do better: Gan based detection of objects on images with varying quality. arXiv 2019, arXiv:1912.01707. [Google Scholar] [CrossRef] [PubMed]
  16. Criminisi, A.; Perez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, J.; Lu, K.; Pan, D.; He, N.; Bao, B.-K. Robust object removal with an exemplar-based image inpainting approach. Neurocomputing 2014, 123, 150–155. [Google Scholar] [CrossRef]
  18. Hays, J.; Efros, A.A. Scene completion using millions of photographs. ACM Trans. Graph. 2007, 26, 4. [Google Scholar] [CrossRef]
  19. Park, J.-S.; Oh, Y.H.; Ahn, S.C.; Lee, S.-W. Glasses removal from facial image using recursive error compensation. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 805–811. [Google Scholar] [CrossRef] [PubMed]
  20. Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5892–5900. [Google Scholar]
  21. Iizuka, S.; Simo-serra, E.; Ishikawa, H. Globally and Locally Consistent Image Completion. ACM Trans. Graph. 2017, 36, 1–14. [Google Scholar] [CrossRef]
  22. Yeh, R.A.; Chen, C.; Lim, T.Y.; Schwing, A.G.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6882–6890. [Google Scholar]
  23. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  24. Liao, H.; Funka-Lea, G.; Zheng, Y.; Luo, J.; Zhou, S.K. Face Completion with Semantic Knowledge and Collaborative Adversarial Learning. In Lecture Notes in Computer Science; Springer: Cham, Germany, 2019; pp. 382–397. [Google Scholar]
  25. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  26. Song, L.; Cao, J.; Song, L.; Hu, Y.; He, R. Geometry-aware face completion and editing. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 2506–2513. [Google Scholar]
  27. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; Ebrahimi, M. EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  28. Din, N.U.; Javed, K.; Bae, S.; Yi, J. A novel GAN-based network for the unmasking of masked face. IEEE Access 2020, 8, 44276–44287. [Google Scholar] [CrossRef]
  29. Khan, K.; Din, N.U.; Bae, S.; Yi, J. Interactive removal of microphone object in facial images. Electronics 2019, 8, 1115. [Google Scholar] [CrossRef] [Green Version]
  30. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  32. Isola, P.; Efros, A.A.; Ai, B.; Berkeley, U.C. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 5967–5976. [Google Scholar]
  33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  35. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
  36. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the OSDI: Operating System Design and Implementation, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  37. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  38. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Adv. Neural Inf. Process. Syst. 2017, 6629–6640. [Google Scholar]
  39. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  40. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
  41. Mittal, A.; Moorthy, A.K.; Bovik, A.C. Blind/Referenceless Image Spatial Quality Evaluator. In Proceedings of the 45th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Pacific Grove, CA, USA, 6–9 November 2011; pp. 723–727. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
