Research on Improved Occluded-Face Restoration Network

Pang, Shangzhen; Thio, Tzer Hwai Gilbert; Siaw, Fei Lu; Chen, Mingju; Lin, Li

doi:10.3390/sym17060827

Open AccessArticle

Research on Improved Occluded-Face Restoration Network

by

Shangzhen Pang

^1,2,3,*,

Tzer Hwai Gilbert Thio

³

,

Fei Lu Siaw

³

,

Mingju Chen

¹ and

Li Lin

⁴

¹

School of Physics and Electronic Engineering, Sichuan University of Science and Engineering, Zigong 643000, China

²

Sichuan Province University Key Laboratory of Bridge Non-Destruction Detecting and Engineering Computing, Yibin 644000, China

³

Centre for Sustainability in Advanced Electrical and Electronic Systems (CSAEES), Faculty of Engineering, Built Environment and Information Technology, SEGi University, Petaling Jaya 47810, Malaysia

⁴

Graduate School of Business (GSB), SEGi University, Petaling Jaya 47810, Malaysia

^*

Author to whom correspondence should be addressed.

Symmetry 2025, 17(6), 827; https://doi.org/10.3390/sym17060827

Submission received: 25 February 2025 / Revised: 10 May 2025 / Accepted: 21 May 2025 / Published: 26 May 2025

(This article belongs to the Section Physics)

Download

Browse Figures

Versions Notes

Abstract

The natural features of the face exhibit significant symmetry. In practical applications, faces may be partially occluded due to factors like wearing masks or glasses, or the presence of other objects. Occluded-face restoration has broad application prospects in fields such as augmented reality, virtual reality, healthcare, security, etc. It is also of significant practical importance in enhancing public safety and providing efficient services. This research establishes an improved occluded-face restoration network based on facial feature points and Generative Adversarial Networks. A facial landmark prediction network is constructed based on an improved MobileNetV3-small network. On the foundation of U-Net, dilated convolutions and residual blocks are introduced to form an enhanced generator network. Additionally, an improved discriminator network is built based on Patch-GAN. Compared to the Contextual Attention network, under various occlusions, the improved face restoration network shows a maximum increase in the Peak Signal-to-Noise Ratio of 24.47%, and in the Structural Similarity Index of 24.39%, and a decrease in the Fréchet Inception Distance of 81.1%. Compared to the Edge Connect network, under various occlusions, the improved network shows a maximum increase in the Peak Signal-to-Noise Ratio of 7.89% and in the Structural Similarity Index of 10.34%, and a decrease in the Fréchet Inception Distance of 27.2%. Compared to the LaFIn network, under various occlusions, the improved network shows a maximum increase in the Peak Signal-to-Noise Ratio of 3.4% and in the Structural Similarity Index of 3.31%, and a decrease in the Fréchet Inception Distance of 9.19%. These experiments show that the improved face restoration network yields better restoration results.

Keywords:

face restoration; feature point prediction; GAN; face occlusion

1. Introduction

The human face is a significant feature of human identity and plays an essential role in social interaction. Facial features, such as the eyes, eyebrows, nose, and mouth, exhibit significant symmetry in terms of shape, texture, and position when viewed with a neutral expression and frontal angle. With the development of AI technology, facial recognition has seen widespread use in areas such as security, finance, and social media, where the quality and accuracy of facial images are of utmost importance. However, in real-world environments, facial images are often partially obstructed by various factors such as wearing masks, glasses, or hats, or the presence of hair or other objects. These occlusions lead to the loss of facial features, which negatively affects facial recognition and analysis, making face restoration an important challenge for improving recognition accuracy and human–computer interaction. This is a crucial issue in real-world scenarios.

Research on face restoration has evolved rapidly from traditional image processing techniques to deep learning methods. Early image restoration techniques were mostly based on traditional image processing, with images repaired by using the structure and texture information of known regions. Criminisi et al. proposed a sample-based image restoration method that uses texture information from a known region to fill in the missing parts, which proved effective in removing large occlusions. However, when dealing with irregular or complex missing regions, issues such as discontinuous boundaries or inconsistent structures in the restoration may arise [1]. Wu et al. applied the idea of diffusion to iteratively propagate low-level features in the occluded region, which is suitable for smaller, structure-rich regions [2].

With the development of deep learning, particularly the application of convolutional neural networks (CNNs) and Generative Adversarial Networks (GANs), significant progress has been made in face restoration technology. CNNs, by simulating the way the biological visual neural system works, can effectively perform image restoration tasks. U-Net, a typical convolutional neural network structure, uses an encoder–decoder architecture to repair facial images, preserving the continuity and consistency of facial features. However, it often falls short in restoring complex textures and details, such as facial expressions, skin textures, and hair [3]. GANs, by utilizing the concept of adversarial training, generate realistic images. Deepak et al. used an encoder–decoder structure combined with adversarial loss to restore damaged images and enhance the convenience of image repair [4]. Yu et al. proposed a GAN model based on Gated Convolution for image restoration tasks. This method can handle irregular missing areas and recover the details of facial images. The gating mechanism allows the generator to generate more precise restoration areas, improving the restoration quality. However, when dealing with large missing regions, the restoration results may exhibit unnatural or distorted areas [5]. In 2021, Dosovitskiy et al. introduced Vision Transformer (ViT) [6]. In 2022, Zhu et al. proposed a reference-guided image inpainting method based on Vision Transformer, capable of generating high-quality inpainting results by utilizing reference images [7]. Compared to a GAN, ViT has limited capability in detail restoration and higher computational complexity, making it more suitable for repairing large missing areas and capturing the global structure of images.

Determining how to restore complex textures and details in facial restoration, especially under large-area occlusions, remains a key research challenge. This study proposes an improved occluded-face restoration network based on facial landmarks and GANs, with the following main contributions:

(1): A lightweight network structure is constructed using MobileNet’s depthwise separable convolutions, forming a lightweight facial landmark prediction network.
(2): An enhanced GAN occluded-face restoration network is developed by improving upon the U-Net and Patch-GAN architectures.

The experimental results demonstrate that the proposed facial restoration network achieves high-quality restoration results under both free-form occlusion and 25% center occlusion scenarios.

2. GAN

GANs were proposed by Goodfellow et al. in 2014 [8]. The design of GANs is inspired by the game theory concept of adversarial play, with the generator and discriminator in deep neural networks simulating this idea. Figure 1 shows the GAN model, where the generator creates an image

G (z)

by capturing the data distribution in sample data

z

, and the discriminator uses real images to determine whether the generated image is correct. The discriminator is essentially a binary classifier, marking real images with a “1” and generated images with a “0”. When the discriminator classifies an image as fake, it sends the result back to the generator, which continues learning and optimizing based on the discriminator’s feedback to create more realistic images to deceive the discriminator. This process repeats until the generated image is indistinguishable from the real one according to the discriminator.

The training loss of the GAN includes adversarial loss and generation loss. The objective function

V (D, G)

is expressed as follows:

\underset{G}{m i n} \underset{D}{m a x} V (D, G) = E_{x ~ p_{d a t a} (x)} [\log D (x)] + E_{z ~ p_{z} (z)} [\log (1 - D (G (z)))]

(1)

where

x

represents real data,

p_{d a t a} (x)

is the pixel distribution of the real images,

z

represents the sample data,

P_{z} (z)

denotes the distribution of input random noise,

G (z)

is the image generated by the generator,

E

represents the expected value of the distribution function, and

D (x)

is the probability of classifying the generated sample as real.

E_{x ~ P_{d a t a} (x)} [\log D (x)]

represents the expected outcome of the discriminator’s judgment on real samples. To make the discriminator

D

optimal, the discriminator’s result for real samples,

D (x)

, should be 1, i.e.,

\log D (x)

should be 0. If the discriminator is not optimal,

\log D (x)

will be less than 0. In other words, to optimize the discriminator,

E_{x ~ P_{d a t a} (x)} [\log D (x)]

should be as large as possible. Similarly,

E_{z ~ P_{z} (z)} [\log (1 - D (G (z)))]

represents the expected outcome of the discriminator’s judgment on fake samples. To make the discriminator

D

optimal, the discriminator’s result for fake samples,

\log D (x)

, should be 0, meaning that

\log (1 - D (G (z)))

should be 0. If the discriminator is not optimal,

\log (1 - D (G (z)))

will be less than 0. In other words, to optimize the discriminator,

E_{z ~ P_{z} (z)} [\log (1 - D (G (z)))]

should be as large as possible. The value function

V (D, G)

is the sum of the two terms above and is essentially a cross-entropy loss function. The better the discriminator, the larger the value of

V (D, G)

.

At the same time, when the parameters of the generator

G

are fixed, the optimal discriminator can be represented as

D_{G}^{*} (x) = \frac{P_{d a t a} (x)}{P_{d a t a} (x) + P_{g} (x)}

(2)

where

P_{g} (x)

is the distribution of the generated samples

G (z)

when

z ~ P_{z}

, and

P_{d a t a} (x)

is the distribution of real samples.

To optimize both the generator and discriminator, the generator model and the discriminator model must be trained so that through continuous adversarial interactions, they reach their optimal states.

After training, the generator model

G

and the discriminator model

D

reach a balanced state, meaning the samples generated by the generator are very similar to the real samples. As a result, the discriminator finds it difficult to distinguish whether the samples generated by the generator are real or not, and thus assigns a score of 0.5 for any input, i.e.,

D (G (z)) = 0.5

. In most cases, the discriminator loss and generator loss do not converge simultaneously. For example, when the discriminator has very high accuracy, its loss may quickly drop to zero, preventing it from providing useful information for training the generator. This results in unstable GAN training and defeats the purpose of adversarial learning.

3. Methodology

To further enhance the restoration quality of the face restoration network, this study improves it based on GANs. The generator in the network is built upon a modified U-Net architecture. During the downsampling process, convolutional layers are used instead of max-pooling layers, and normalization layers are introduced before the activation functions to stabilize the training. Seven residual blocks with dilated convolutions are added to reduce the loss of fine details during the feature extraction process. The upsampling process employs three decoders, with a dilation block introduced before each decoder to expand the receptive field. Additionally, skip connections are added between the corresponding encoder–decoder layers and residual modules, fully utilizing the feature information extracted during the downsampling process.

The discriminator is based on the Patch-GAN structure and incorporates a spectral normalization (SN) module. Furthermore, a Self-attention Module is added to adaptively process the features. By using real feature points to evaluate the generated image, the network can better assess the restoration quality of details, thus improving the preservation of local details.

3.1. Overall Structure of the Restoration Network

An ideal facial image restoration network should generate restored images that maintain the logical structure between facial features and attributes, making the restored image appear more natural and closer to the real result. This study establishes a facial image restoration network based on facial feature points to achieve high-precision occluded-face restoration. The network structure is shown in Figure 2.

The face restoration network consists of two parts. The first part is the feature point prediction network, which is an improved facial feature point prediction network based on MobileNetV3-small. The image input to the network is the occluded image obtained by the Hadamard product of the real image and the masked image, and the output is the predicted facial feature points of the occluded image. The predicted feature points reflect facial expressions and the topological structure between facial features, which guide the restoration process of the occluded-face image. The second part is the occluded-face image restoration network, which consists of a generator and a discriminator. The generator is an improved version based on the U-Net structure. Its input is the image generated by combining the occluded image and the predicted facial feature points, and the output is the generated facial image. The discriminator is an improved version based on the Patch-GAN structure, and its input is the generated facial image and the real facial feature points.

The input to the facial image restoration network

M_{P}

is the occluded image

I^{M}

and the predicted feature points

\overset{\land}{L}

. Let

θ_{P}

be the network parameters, and the restored image can be defined as

\hat{I} : = M_{P} (I^{M}, \hat{L}; θ_{P})

(3)

3.2. Lightweight Facial Feature Point Prediction Network

Facial feature points can be seen as discrete points sampled from key areas of the face. These discrete points form a structure that preserves facial expression features and maintains the topological structure between facial features. The lightweight facial feature point prediction network is an improvement based on the MobileNetV3-small network [9]. It adopts a lightweight network structure formed of MobileNet’s depthwise separable convolutions and introduces SENet. To solve the issue of slow convergence when training key points with SENet, improvements such as batch normalization, reverse residual structures, linear bottleneck structures, and average pooling are applied. Feature point prediction loss

L_{lmk}

is used to train the feature point prediction module. A lightweight facial feature point prediction network for 256 × 256 facial images is constructed, and the network structure is shown in Figure 3.

In the figure, c represents the number of output channels, k is the kernel size, s is the convolution or deconvolution stride, p is the padding, f is the dilation factor, and n is the number of repetitions.

3.3. Generator Network Based on Dilated Convolutions

The generator is an improved version based on the U-Net structure [3]. This network consists of two parts: the first part is the downsampling encoder section, and the second part is the upsampling decoder section. The downsampling process is mainly used for feature extraction and feature compression, while the upsampling process is mainly used for feature concatenation and sampling. The structure of the generator is shown in Figure 4.

In the figure, c represents the number of channels, k is the kernel size, s is the stride of the convolution or deconvolution layer, and p is the padding. The image input to the generator is the occluded-face image, combined with the facial key points predicted by the key point prediction network.

First, three progressively downsampling encoder blocks are used, each consisting of a convolution layer, an Instance Normalization (IN) layer, and a nonlinear activation function (ReLU) connected sequentially [10,11]. Following this, there are seven residual blocks with dilated convolutions and one long short-term attention block.

The second half of the network is the upsampling process, with three decoder modules. Before the decoders, a stacked dilation block is introduced. The decoder modules have the same structure as the encoder. The dilation block is composed of a deconvolution layer, an IN layer, and a nonlinear activation function (ReLU) connected sequentially. The final decoder module removes the normalization layer and sets the activation function to tanh.

The dilated convolution-based generator network improves upon the U-Net network. The downsampling part has three key improvements. First, instead of using the max pooling layers found in the original U-Net network after each sampling step, three encoder blocks use convolution layers with strides of 1, 2, and 2 for image feature downsampling. The benefit of this approach, compared to the original U-Net, is that the network can learn through training to optimize feature downsampling. Second, an IN layer is added after each convolution layer and before the activation function, stabilizing training and accelerating model convergence. Third, dilated convolutions are used along with the introduction of residual blocks and long short-term attention layers. Dilated convolutions help preserve more details in the final feature map by preventing it from becoming too small. The residual blocks help reduce computational complexity, and the long-short term attention layer connects temporal feature maps [12].

In the upsampling part, dilation blocks are introduced to enlarge the receptive field, enabling consideration of a broader range of features. The final decoder module sets the activation function to tanh, which helps alleviate the gradient vanishing problem [13] and improves robustness. By combining low-level features, the final feature map is processed to restore the image to the same size as the input.

Additionally, to better extract semantic information from the missing parts of the image and fully utilize the extracted shallow image features, skip connections are added between the corresponding encoder–decoder layers and residual blocks. This allows for the reuse of low-level features, enhancing the network’s ability to leverage these features spatially and temporally. Skip connections efficiently pass features from the encoding modules, helping the network capture more useful information from the input image, which improves the detail restoration of the generated face image. Moreover, skip connections between residual blocks provide gradient flow information from shallow to deep layers, improving training speed and further enhancing training performance. Before each decoding layer, a 1 × 1 convolution operation is performed through channel attention to adjust the weights of the features from the skip connection and the last layer.

Table 1 shows the network architecture parameters for the generator. In the table, c represents the number of channels, k is the kernel size, s is the stride of the convolution or deconvolution layer, and p is the padding. The input image is a 256 × 256 four-channel image generated from the original image and the occlusion mask.

3.4. Discriminator Network Improved Based on Patch-GAN

The discriminator is responsible for distinguishing whether the input data are real and providing feedback to the generator based on the discrimination result. The ideal state at the end of training is that the data generated by the generator are very similar to the real data, and the discriminator’s output is unable to distinguish the authenticity of the generator’s output because the generated data are so realistic, resulting in a straight line with a value close to 0.5 (between real and fake).

The discriminator is improved based on a 70 × 70 Patch-GAN structure [14]. The first, second, fourth, and fifth convolutional modules are composed of a convolutional layer, a spectral normalization (SN) layer, and a nonlinear activation function LReLU layer [15]. The introduction of spectral normalization ensures the global structure of facial features and maintains attribute consistency. The sixth convolutional module uses the sigmoid activation function with a range of (0, 1) for judgment. The third module inserts an attention layer to adaptively process features. The discriminator discriminates the generated image based on real feature points, allowing it to better evaluate the quality of detail restoration in the repaired image. By focusing on the feature points of the image, the discriminator can better preserve local details. The structure of the discriminator is shown in Figure 5.

Table 2 shows the network structure of the discriminator. The structure of each module in the discriminator is the same, but the number of channels in the network varies due to the different resolutions of the input image patches.

3.5. Loss Function

The landmark prediction loss

L_{lmk}

is used to train the landmark prediction module

M_{L}

, which measures the fit between the generated landmark

\hat{L}

and the real landmark

L_{gt}

. The training loss is as follows:

L_{lmk} : = {∥ \hat{L} - L_{gt} ∥}_{2}^{2}

(4)

This loss is used to generate landmarks for facial image contours, ensuring recognition of the topological structure between facial features.

In the training of the image inpainting network generation model, the generator’s loss function is expressed using pixel-wise loss

L_{p i x e l}

, total variation loss

L_{t v}

, perceptual loss

L_{p e r c}

, style loss

L_{s t y l e}

, and adversarial loss

L_{a d v G}

. The loss function is formulated as follows:

L_{i n p} : = L_{p i x e l} + λ_{t v} L_{t v} + λ_{a d v} L_{a g v G} + λ_{p e r c} L_{p e r c} + λ_{s t y l e} L_{s t y l e}

(5)

where the pixel-wise loss

L_{p i x e l}

is defined as

L_{p i x e l} : = \frac{1}{N_{m}} {∥ \overset{\land}{I} - I ∥}_{1}

(6)

Here,

{∥ \cdot ∥}_{1}

denotes the

l_{1}

norm, and

N_{m}

is the mask size used to adjust the discriminator’s judgment requirements. This loss measures the difference between the image generated by the generator and the real image, helping to generate high-resolution images.

To add more texture structures to the inpainting results, adversarial loss

L_{a d v G}

is introduced. The adversarial loss

L_{a d v G}

is defined as

L_{a d v G} : = E [(D (M_{P} (I^{M}, \overset{\land}{L}), L_{g t}) - 1)^{2}]

(7)

where D is the discriminator’s output,

L_{g t}

is the real feature points,

M_{P}

is the face image inpainting network, and

I^{M}

is the occluded image, with

\overset{\land}{L}

predicting the feature points.

The total variation loss

L_{tv}

is introduced to improve the visual consistency between the repaired region pixels and the known region pixels. The total variation loss is defined as

L_{t v} : = \frac{1}{N_{I}} {∥ \nabla \overset{\land}{I} ∥}_{1}

(8)

where

{∥ \nabla \overset{\land}{I} ∥}_{1}

is the gradient magnitude of the image, and

N_{1}

is the number of pixels in the real image.

The perceptual loss

L_{p e r c}

measures the difference between the feature maps extracted from a pre-trained network, defined as

L_{p e r c} : = \sum_{p} \frac{∥ ϕ_{p} (\hat{I}) - ϕ_{p} (I) ∥}{N_{p} \times H_{p} \times W_{p}}

(9)

In the formula,

H_{p} \times W_{p}

and

N_{p}

represent the size and dimension of the feature map from the

p

-th layer of the pre-trained network, and

ϕ_{P} (\cdot)

denotes the

N_{p}

-dimensional feature map of size

H_{p} \times W_{p}

from the

p

-th layer of the pre-trained network.

L_{s t y l e} : = \sum_{p} \frac{1}{N_{p} \times N_{p}} {∥ \frac{G_{p} (\hat{I} \circ M) - G_{p} (I \circ M)}{N_{p} \times H_{p} \times W_{p}} ∥}_{1}

(10)

where

G_{P} (x) = ϕ {(x)}^{T} ϕ_{p} (x)

, and

ϕ_{p} (x)

corresponds to the Gram matrix.

The discriminator loss for the inpainting network is defined as

L_{a d v D} : = E [D {(\hat{I}, L_{g t})}^{2}] + E [{(D (I, L_{g t}) - 1)}^{2}]

(11)

In the deep learning-based inpainting network established in this paper, the parameters are set as

λ_{t v} = 0.1

,

λ_{a d v} = 0.01

,

λ_{p e r c} = 0.1

, and

λ_{s t y l e} = 250

for the experiment. Through the training of the inpainting network, the losses of the generator and discriminator are alternately minimized, and eventually, the generator is able to produce more natural images that closely resemble real images, thereby achieving the inpainting functionality.

4. Experiments and Results

4.1. Experimental Environment and Parameter Settings

The training and testing in this paper were conducted on a server running Windows 10 Professional. The main specifications of the server were as follows: the hardware configuration included a CPU (Intel Xeon, Hillsboro, OR, USA), 4 GPUs (NVIDIA TITAN Xp, Santa Clara, CA, USA), a Supermicro X10DRG-Q motherboard (San Jose, CA, USA), 256 GB of Micron RAM (Manassas, VA, USA), and a 2 TB hard drive. The software configuration included CUDA 10.0, Anaconda3, PyCharm 2019, Python 3.7, and others. All experiments in this paper were based on the PyTorch1.10.0 deep learning framework.

The experiment used 30,000 images from the CelebA-HQ dataset with a resolution of 256 × 256 for the network. The CelebA-HQ dataset is one of the most widely used datasets. Its facial image data are an extension of the CelebA dataset, resulting in a high-definition dataset that contains a total of 202,599 facial images. The downloaded CelebA-HQ dataset was divided into three parts: 29,500 images for the training set, 200 images for the validation set, and 300 images for the test set. During the training process, an Adam optimizer with exponential decay rates of

β 1 = 0

and

β 2 = 0.9

was used [16]. The learning rate for the feature point prediction network was set to 0.0001, and the batch size for the images was set to 16. The learning rate for the facial inpainting network was set to 0.00001, with a batch size of 4.

4.2. Evaluation Parameters for Experimental Results

To evaluate the quality of the face inpainting network, three objective metrics were used for quantitative assessment: the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Fréchet Inception Distance (FID) [17,18]. The PSNR is used to assess the error between corresponding pixel points of two images, where a higher value indicates less distortion. The SSIM evaluates the overall similarity of two images in terms of brightness, contrast, and structure; a value closer to 1 indicates higher similarity. The FID is used to assess the quality of two images, with lower values indicating better image quality.

The PSNR (Peak Signal-to-Noise Ratio) refers to the ratio of the maximum signal to the noise, and it is typically used to quantify the difference between the maximum signal and the background noise. In most cases, compressed images differ from the original images, so the PSNR is commonly used as a standard for evaluating image quality.

Given a reconstructed image

I

and an inpainted image

K

of size

m \times n

, the Mean-Squared Error (MSE) is defined as

M S E = \frac{1}{m n} \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} {[I (i, j) - K (i, j)]}^{2}

(12)

Furthermore, the definition of the PSNR is

P S N R = 10 \times \log_{10} (\frac{{M A X}_{I}^{2}}{M S E})

(13)

where

{M A X}_{I}^{2}

is the maximum possible pixel value of the image, and

M A X_{I} = 2^{n} - 1

, where

n

is the bit depth of the pixels. If the pixel values are represented by 8-bit numbers, then

M A X_{I} = 2^{8} - 1 = 255

. The unit of the PSNR is decibels (dB), and the smaller the MSE, the larger the PSNR value, indicating better inpainting performance. Conversely, the larger the MSE, the smaller the PSNR value, indicating worse inpainting performance.

The Structural Similarity Index (SSIM) is a metric used to measure the similarity between two images. When one image is a distortion-free image and the other is a distorted version, their structural similarity can be used as a measure of the image quality of the distorted image. Compared to the Peak Signal-to-Noise Ratio (PSNR), the SSIM is a better metric for evaluating image quality in terms of human visual perception. The SSIM measures image similarity based on three aspects, brightness, contrast, and structure, as follows:

L (I, K) = \frac{2 μ_{I} μ_{K} + C_{1}}{{μ_{I}}^{2} + {μ_{K}}^{2} + C_{1}}

(14)

C (I, K) = \frac{2 σ_{I} σ_{K} + C_{2}}{{σ_{I}}^{2} + {σ_{K}}^{2} + C_{2}}

(15)

S (I, K) = \frac{2 σ_{I K} + C_{3}}{σ_{I} σ_{K} + C_{3}}

(16)

where

I

is the reconstructed image,

K

is the inpainted image,

L (I, K)

is the luminance comparison,

C (I, K)

is the contrast comparison,

S (I, K)

is the structural comparison,

σ_{I K}

is the covariance between image

K

and image

I

,

σ_{K}

and

σ_{I}

are the standard deviations of image

I

and image

K

, respectively, and

μ_{I}

and

μ_{K}

are the mean pixel values of image

I

and image

K

.

C_{1}

,

C_{2}

, and

C_{3}

are small constant values. The constants

C_{1}

,

C_{2}

, and

C_{3}

are introduced to avoid division by zero in the formula. The SSIM is defined as

S S I M (I, K) = {[L (I, K)]}^{α} {[C (I, K)]}^{β} {[S (I, K)]}^{γ}

(17)

where

α

,

β

, and

γ

are constants greater than zero. The value of the SSIM ranges from 0 to 1, and a higher value indicates better quality of the inpainted image.

The Fréchet Inception Distance (FID) is a metric used to evaluate the quality of generated images, often employed to measure the similarity between images generated by Generative Adversarial Networks (GANs) and real images. The FID is widely used in image generation tasks, particularly in image synthesis, style transfer, and image inpainting. It measures the similarity between generated and real images by computing the Fréchet distance between their feature distributions in the InceptionV3 network’s feature space. The formula is as follows:

F I D = {∥ μ_{r} - μ_{g} ∥}^{2} + T r (\sum r + \sum g - 2 {(\sum r \sum g)}^{\frac{1}{2}})

(18)

where

μ_{r}

and

μ_{g}

are the feature means of the real and generated images,

\sum r

and

\sum g

are the covariance matrices of the real and generated images,

{∥ \cdot ∥}^{2}

denotes the squared Euclidean distance, and

T r (\cdot)

represents the trace of the matrix (the sum of its diagonal elements).

The smaller the FID value, the closer the distribution of the generated images is to that of the real images, indicating better generation quality. A larger value indicates a greater difference between the generated and real images, implying poorer image quality.

4.3. Experimental Results

The experiment used 300 images with a resolution of 256 × 256 from the CelebA-HQ dataset, which were not involved in training. The test images were generated by combining the original images with masked images. The masked images came from the test section of the mask dataset provided by Liu et al., along with 25% central masks. The masks were randomly distributed with coverage proportions of 10–20%, 20–30%, 30–40%, 40–50%, 50–60%, and 25% for the central mask. The experimental results were compared and analyzed against the CA, EC, and LaFIn face inpainting networks [19,20,21].

CA (Contextual Attention) is an image inpainting method based on a Contextual Attention mechanism that repairs the missing parts of an image by capturing its contextual information. EC (Edge Connect) is a facial image inpainting technique based on edge information that helps the network more accurately restore the details and contours of an image by incorporating edge conditions during the inpainting process. LaFIn (Latent Face Inpainting) is a face image inpainting method based on latent space that maps face images into latent space and utilizes deep generative models to learn the latent features of face images, generating more natural and high-quality inpainted images.

Figure 6 shows the face restoration process used in this study. The original image (a) and the masked image (b) are multiplied using the Hadamard product to obtain the occluded image (c). The feature point prediction module first predicts the feature points of the occluded image, resulting in predicted feature points (d). Then, the restoration module performs the restoration of the image generated by combining the predicted feature points with the occluded image.

Figure 7 shows the inpainting results of the model on facial frontal images with four different mask coverage percentages. In the figure, the first row shows the mask covering 30–40% of the image, the second row 40–50%, and the third row 50–60%, and the fourth row is a central mask with 25% coverage.

From Figure 7, it can be seen that all of the networks—CA, EC, LaFIn, and the improved face inpainting network—are able to complete face inpainting under partial occlusion. Among them, the CA algorithm only utilizes the unoccluded areas to inform the inpainting of the occluded areas, which leads to a lack of coherence in the semantic structure, resulting in obvious deformations in the inpainted image. For example, the mouth in the first and third rows; the left eye in the second row; and the eyes, nose, and mouth in the fourth row are either fully or mostly occluded, causing a large discrepancy between the inpainted face and the original image, with unnatural facial expressions. The EC algorithm introduces edge conditions to restore facial details and contours, providing stronger inpainting capability than the CA network, and is able to reconstruct the overall facial structure in various occlusion cases. However, the repaired expressions still show considerable differences from the original image. For example, the nose and mouth in the first row; the left eye in the second row; and the nose, mouth, and facial contours in the third and fourth rows show noticeable differences in the topological structure of the facial features compared to the original image. LaFIn and the improved face inpainting network outperform CA and EC in terms of inpainting results. Compared with LaFIn, the improved face inpainting network demonstrates better results in restoring facial feature shapes and expressions, and provides smoother transitions between inpainted areas. For example, the mouth and eye features in the first row are more realistic and clearer, and the eyes in the fourth row are closer to those in the original image, with smoother transitions around the mouth.

Figure 8 shows the inpainting results of the model on facial side-view images with four different mask coverage percentages. In the figure, the first row shows a mask covering 30–40% of the image, the second row 40–50%, and the third row 50–60%, and the fourth row is a central mask with 25% coverage.

From Figure 8, it can be seen that under partial occlusion of the face, CA, EC, and LaFIn and the improved face inpainting network can all successfully restore the facial image. Among them, the CA and EC algorithms show deficiencies in obtaining semantic information from the image, leading to blurry inpainted regions with significant structural differences from the original image. LaFIn and the improved face inpainting network achieve better results, with the latter producing more realistic and natural restorations, especially in terms of facial expressions and feature repair. For example, in the first and second rows with CA, and the third row with CA, EC, and LaFIn inpainting the occluded eyes, nose, and mouth, there is obvious blurriness, and the repair results are not smooth and natural. In the fourth row, CA and EC’s repair of the eyes and nose shows noticeable fuzziness and unnatural results. The inpainting results from the improved face inpainting network, while showing some differences from the original image, are clearer compared to those from CA, EC, and LaFIn, and the repair of facial features is more realistic and natural, being closest to the original image.

5. Discussion

Table 3 shows the quantitative comparison of the PSNR, SSIM, and FID for CA, EC, LaFIn, and the improved face inpainting network under different mask coverage percentages. The values in the table represent the average evaluation results of the generated images in the test set. From the data in Table 3, it can be observed that the improved facial restoration network achieves higher PSNR and SSIM values under various mask occlusions compared to CA, EC, and LaFIn, indicating that the improved facial restoration network delivers higher-quality facial image restoration. Additionally, the FID values of the improved facial restoration network are lower than those of CA, EC, and LaFIn under various mask occlusions, demonstrating that the distribution of the restored facial images is closer to that of the real images, further confirming the superior restoration performance of the improved network.

Table 4 shows the quantitative comparison analysis of the PSNR, SSIM, and FID between the improved face inpainting network and CA, EC, and LaFIn under different mask coverage percentages.

From the data in the table, it can be seen that the EC network outperformed the CA network in terms of the PSNR, SSIM, and FID, while the LaFIn network had better objective metrics than both CA and EC. Among all of the algorithms, the improved face inpainting network achieved the highest PSNR and SSIM, along with the lowest FID value. Compared to the CA network, under various occlusions, the improved face inpainting network’s PSNR increased by up to 24.47%, its SSIM improved by 24.39%, and its FID decreased by 81.1%. Compared to the EC network, the PSNR increased by up to 7.89%, the SSIM improved by 10.34%, and the FID decreased by 27.2%. Compared to the LaFIn network, the PSNR improved by up to 3.4%, the SSIM increased by 3.31%, and the FID decreased by 9.19%. These experiments demonstrate that the improved face inpainting network achieves better restoration results.

The Shapiro–Wilk test indicated that at a significance level of α = 0.05, the SSIM values did not follow a normal distribution (p = 0.03). The Kruskal–Wallis test yielded p = 0.012 (p < 0.05), demonstrating statistically significant differences in the SSIM among CA, EC, LaFIn, and the improved network. A post hoc analysis using Dunn’s test with Bonferroni correction demonstrated statistically significant improvements in our proposed network over EC (p = 0.015), CA (p = 0.008), and LaFIn (p = 0.043).

Table 5 presents the ablation study of different structures. From the table, it can be observed that both the long short-term attention layer (LSTA) and facial landmark guidance (FLG) are beneficial for improving the performance of the facial restoration network.

From the above discussion, it can be seen that the improved facial restoration network achieves better restoration results under both free-form occlusion and 25% center occlusion compared to the CA, EC, and LaFIn networks. However, the restoration results under large-area occlusion still exhibit some differences in their details compared to the original images.

6. Conclusions

The face has significant symmetry, and symmetry, as has been previously demonstrated, has guiding significance in facial restoration. This study constructs an improved face inpainting network with occlusion, consisting of a lightweight facial landmark prediction network and a GAN-based face inpainting network. The facial landmark prediction network is improved based on the MobileNetV3-small architecture and trained to predict facial landmarks under different occlusion conditions. The improved occluded-face inpainting network is composed of a U-Net-based generator and a Patch-GAN-based discriminator. Repair tests were conducted on face images with free occlusion coverage, including 10–20%, 30–40%, 40–50%, and 50–60%, as well as a 25% center mask. The experiments demonstrate that the improved occluded-face inpainting network achieves the highest PSNR and SSIM, with the lowest FID value, and provides better restoration results under various occlusions, producing face images that are more natural and clear, and closer to the original image. The improved facial restoration network, incorporating facial landmark guidance and a U-Net-based generator architecture, contains more computational parameters compared to the CA and EC networks. While achieving enhanced performance, this design inevitably increases computational costs. Significant FID metric degradation occurs when processing large-area occlusions, primarily due to insufficient availability of facial information to guide repair.

The improved facial restoration network enhances the inpainting effect for occluded faces, thereby increasing the accuracy of facial recognition systems and improving the reliability of identity verification systems. In security surveillance systems, it can assist in restoring damaged or occluded-face images. In augmented reality applications, it can generate more realistic virtual avatars, enabling more natural human–computer interaction. In the future, facial restoration can develop towards dynamic scenes, multimodal fusion, and personalized restoration. With advancements in deep learning and multimodal technologies, facial restoration techniques will be widely applied in more fields, becoming one of the important research directions in the field of artificial intelligence.

Author Contributions

Methodology, S.P., T.H.G.T. and F.L.S.; Software, L.L.; Validation, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Opening Project of Sichuan Province University Key Laboratory of Bridge Non-destruction Detecting and Engineering Computing (grant number 2023QYY07) and the Sichuan Provincial Natural Science Foundation (under project numbers 2025ZNSFSC0477 and 2024NSFSC2042).

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Criminisi, A.; Perez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef] [PubMed]
Wu, Y.; Gou, C.; Ji, Q. Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3471–3480. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Deepak, P.; Philipp, K.; Jeff, D.; Trevor, D.; Alexei, A.E. Context Encoders: Feature Learning by Inpainting. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-Form Image Inpainting with Gated Convolution. IEEE Trans. Image Process. 2019, 28, 5576–5589. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
Wan, Z.; Zhang, J.; Chen, D.; Liao, J. Image Inpainting with Vision Transformers. arXiv 2022, arXiv:2203.12132. [Google Scholar]
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML’10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
Zheng, C.; Cham, T.J.; Cai, J. Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1438–1447. [Google Scholar]
Fan, E. Extended tanh-function method and its applications to nonlinear equations. Phys. Lett. A 2000, 277, 212–218. [Google Scholar] [CrossRef]
Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957. [Google Scholar]
Diederik, P.K.; Jimmy, B. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6626–6637. [Google Scholar]
Nazeri, K.; Ng, E.; Joseph, T. EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative Image Inpainting with Contextual Attention. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
Yang, Y.; Guo, X.; Ma, J.; Ma, L.; Ling, H. LaFIn: Generative Landmark Guided Face Inpainting. arXiv 2019, arXiv:1911.11394. [Google Scholar]

Figure 1. The GAN model.

Figure 2. Overall structure of the face restoration network.

Figure 3. Structure of the lightweight facial feature point prediction network.

Figure 4. The structure of the generator.

Figure 5. The structure of the discriminator.

Figure 6. The process of the improved occluded-face restoration network. (a) Original image. (b) Masked image. (c) The occluded image formed by the Hadamard product of the masked image and the original image, used for restoration. (d) The result of facial feature point prediction by the feature point prediction module on the occluded image. (e) The restoration result using the model trained on the CelebA-HQ dataset.

Figure 7. Frontal-view image inpainting. (a) The original image. (b) The masked image formed by Hadamard product between the original image and the mask image for inpainting testing. (c) The CA network inpainting result. (d) The EC network inpainting result. (e) The LaFIn network inpainting result. (f) The inpainting result of the improved face inpainting network.

Figure 8. Side-view image inpainting comparison. (a) The original image. (b) The masked image formed by Hadamard product between the original image and the mask image for inpainting testing. (c) The CA network inpainting result. (d) The EC network inpainting result. (e) The LaFIn network inpainting result. (f) The inpainting result of the improved face inpainting network.

Table 1. Generator network architecture parameters.

Input	Operator	k	c	s	p	Out
256² × 4	Conv-IN-ReLU	7	64	1	3	E1
256² × 64	Conv-IN-ReLU	4	128	2	1	E2
128² × 128	Conv-IN-ReLU	4	256	2	1	E3
64² × 256	Dilated Residual Block	-	256	-	-	E4
64² × 256	Dilated Residual Block	-	256	-	-	E5
64² × 256	Dilated Residual Block	-	256	-	-	E6
64² × 256	Dilated Residual Block	-	256	-	-	R1
E6, R1 64² × 256	Dilated Residual Block	-	256	-	-	R2
E5, R2 64² × 256	Dilated Residual Block	-	256	-	-	R3
E4, R3 64² × 256	Dilated Residual Block	-	256	-	-	R4
E3, R4 64² × 512	Short- and Long-Term Attention	-	256	-	-	-
64² × 256	Deconv-IN-ReLU	4	128	2	1	R5
E2, R5 128² × 256	Conv-IN-ReLU	1	256	1	0	-
128² × 256	Deconv-IN-ReLU	4	64	2	1	R6
E1, R6 256² × 128	Conv-IN-ReLU	1	128	1	0	-
256² × 128	Conv-IN-tanh	7	3	1	3	-
256² × 3	-	-	3	-	-	-

Table 2. Discriminator network structure parameters.

Input	Operator	k	c	s	p
256² × 4	Conv-SN-LReLU	4	64	2	1
128² × 64	Conv-SN-LReLU	4	128	2	1
128² × 128	Attention	-	128	-	-
64² × 128	Conv-SN-LReLU	4	256	2	1
32² × 256	Conv-SN-LReLU	4	512	1	1
31² × 512	Conv-SN-Sigmoid	4	1	1	1

Table 3. Quantitative comparison under different mask categories.

Evaluation Indicators	Mask Category	CA	EC	LaFIn	Improved Network
PSNR	10–20%	27.82	30.65	31.40	31.92
	20–30%	24.39	27.62	28.37	28.91
	30–40%	22.27	25.61	26.42	26.71
	40–50%	20.31	23.52	24.30	24.70
	50–60%	18.35	21.17	22.09	22.84
	Center Mask	24.22	24.91	25.83	26.35
SSIM	10–20%	0.935	0.968	0.972	0.979
	20–30%	0.886	0.941	0.950	0.958
	30–40%	0.822	0.910	0.925	0.932
	40–50%	0.761	0.863	0.887	0.908
	50–60%	0.652	0.735	0.785	0.811
	Center Mask	0.851	0.873	0.912	0.937
FID	10–20%	7.35	2.48	2.18	2.07
	20–30%	14.67	3.95	3.41	3.22
	30–40%	25.12	6.32	5.44	4.94
	40–50%	36.46	8.97	7.27	6.89
	50–60%	48.68	14.52	11.31	10.57
	Center Mask	15.08	8.32	6.71	6.46

Table 4. Quantitative comparison analysis of improved face inpainting network with CA, EC, and LaFIn under different mask categories.

Evaluation Indicators	Mask Category	CA		EC		LaFIn
Evaluation Indicators	Mask Category	Value Added	Increase Multiplier	Value Added	Increase Multiplier	Value Added	Increase Multiplier
PSNR	10–20%	4.1	14.74%	1.27	4.14%	0.52	1.66%
	20–30%	4.52	18.53%	1.29	4.67%	0.54	1.90%
	30–40%	4.44	19.94%	1.1	4.30%	0.29	1.10%
	40–50%	4.39	21.61%	1.18	5.02%	0.4	1.65%
	50–60%	4.49	24.47%	1.67	7.89%	0.75	3.40%
	Center Mask	2.13	8.79%	1.44	5.78%	0.52	2.01%
SSIM	10–20%	0.044	4.71%	0.011	1.14%	0.007	0.72%
	20–30%	0.072	8.13%	0.017	1.81%	0.008	0.84%
	30–40%	0.11	13.38%	0.022	2.42%	0.007	0.76%
	40–50%	0.147	19.32%	0.045	5.21%	0.021	2.37%
	50–60%	0.159	24.39%	0.076	10.34%	0.026	3.31%
	Center Mask	0.086	10.11%	0.064	7.33%	0.025	2.74%
FID	10–20%	5.28	71.84%	0.41	16.53%	0.11	5.05%
	20–30%	11.45	78.05%	0.73	18.48%	0.19	5.57%
	30–40%	20.18	80.33%	1.38	21.84%	0.5	9.19%
	40–50%	29.57	81.10%	2.08	23.19%	0.38	5.23%
	50–60%	38.11	78.29%	3.95	27.20%	0.74	6.54%
	Center Mask	8.62	57.16%	1.86	22.36%	0.25	3.73%

Table 5. Ablation study.

Evaluation Indicators	LSTA	FLG	Improved Network
PSNR	26.18	25.33	26.35
SSIM	0.934	0.911	0.937
FID	8.39	7.04	6.46

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Pang, S.; Thio, T.H.G.; Siaw, F.L.; Chen, M.; Lin, L. Research on Improved Occluded-Face Restoration Network. Symmetry 2025, 17, 827. https://doi.org/10.3390/sym17060827

AMA Style

Pang S, Thio THG, Siaw FL, Chen M, Lin L. Research on Improved Occluded-Face Restoration Network. Symmetry. 2025; 17(6):827. https://doi.org/10.3390/sym17060827

Chicago/Turabian Style

Pang, Shangzhen, Tzer Hwai Gilbert Thio, Fei Lu Siaw, Mingju Chen, and Li Lin. 2025. "Research on Improved Occluded-Face Restoration Network" Symmetry 17, no. 6: 827. https://doi.org/10.3390/sym17060827

APA Style

Pang, S., Thio, T. H. G., Siaw, F. L., Chen, M., & Lin, L. (2025). Research on Improved Occluded-Face Restoration Network. Symmetry, 17(6), 827. https://doi.org/10.3390/sym17060827

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Research on Improved Occluded-Face Restoration Network

Abstract

1. Introduction

2. GAN

3. Methodology

3.1. Overall Structure of the Restoration Network

3.2. Lightweight Facial Feature Point Prediction Network

3.3. Generator Network Based on Dilated Convolutions

3.4. Discriminator Network Improved Based on Patch-GAN

3.5. Loss Function

4. Experiments and Results

4.1. Experimental Environment and Parameter Settings

4.2. Evaluation Parameters for Experimental Results

4.3. Experimental Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI