Applied Research on Face Image Beautification Based on a Generative Adversarial Network

Gan, Junying; Liu, Jianqiang

doi:10.3390/electronics13234780

Open AccessArticle

Applied Research on Face Image Beautification Based on a Generative Adversarial Network

by

Junying Gan

^* and

Jianqiang Liu

School of Electronic and Information Engineering, Wuyi University, Jiangmen 529020, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(23), 4780; https://doi.org/10.3390/electronics13234780

Submission received: 19 October 2024 / Revised: 27 November 2024 / Accepted: 30 November 2024 / Published: 3 December 2024

Download

Browse Figures

Review Reports Versions Notes

Abstract

Generative adversarial networks (GANs) are widely used in image conversion tasks and have shown unique advantages in the context of face image beautification, as they can generate high-resolution face images. When used alongside potential spatial adjustments, it becomes possible to control the diversity of the generated images and learn from small amounts of labeled data or unsupervised data, thus reducing the costs associated with data acquisition and labeling. At present, there are some problems in terms of face image beautification processes, such as poor learning of the details of a beautification style, the use of only one beautification effect, and distortions being present in the generated face image. Therefore, this study proposes the facial image beautification generative adversarial network (FIBGAN) method, in which images with different beautification style intensities are generated with respect to an input face image. First, a feature pyramid network is used to construct a pre-encoder to generate multi-layer feature vectors containing the details of the face image, such that it can learn the beautification details of the face images during the beautification style transmission. Second, the pre-encoder combines the separate style vectors generated with respect to the original image and the style image to transfer the beautification style, such that the generated images have different beautification style intensities. Finally, the weight demodulation method is used as the beautification style transmission module in the generator, and the normalization operation on the feature map is replaced with the convolution weight to eliminate any artifacts from the feature map and reduce distortions in the generated images. The experimental results show that the FIBGAN model not only transmits the beautification style to face images in a detailed manner but also generates face images with different beautification intensities while reducing the distortion of the generated face images. Therefore, it can be widely used in the beauty and fashion industry, advertising, and media production.

Keywords:

beautification style intensity; feature pyramid network; pre-encoder; style vector; weight demodulation

1. Introduction

In recent years, intelligent image beautification has become a research hotspot in the field of image processing, which has been widely used to improve user images for social media, video calls, and video conferences. However, it is difficult to accurately distinguish between the content of the image and the beautification style. For example, the performance of such processes is better on a model wearing makeup, but the results generated on other face images are often unnatural, the beautification effect does not meet expectations, and the details generated from the beautification style are of poor quality. Therefore, face image beautification technology is still facing challenges. Yadav et al. [1] proposed ISA-GAN for facial synthesis, which uses a self-attention encoder and decoder network based on an inception structure, segmented face images for training, and a self-attention mechanism. This model can generate high-quality and realistic face images, and it has significantly improved performance in face image synthesis tasks. Hu et al. [2] proposed a method for face restoration with attitude changes. Through determining which weak local features of a face image had attitude changes and using prior knowledge obtained from generative adversarial networks, the method achieved a significant performance improvement in face restoration tasks regarding attitude changes and the generation of high-quality restored images. Hatakeyama et al. [3] proposed a generative adversarial network based on key 3D face image points to control both head pose and expression. The adversarial network also introduced head pose and expression control to effectively generate key 3D face points with a specified pose and expression. Chen et al. [4] proposed that AEP-GAN could be used for the aesthetic synthesis of Asian face images, combining an aesthetic enhancement perception mechanism and a generative adversarial network to generate face images that are more in line with Asian aesthetic standards. Wang et al. [5] proposed De-Beauty GAN to restore the original features of face images, reduce the beautification effect on face images, and revert face images back to their original state. It can also be used to remove beautification filters or to reduce the impact of beauty processing on face images in order to achieve a more natural true-face image restoration effect. Chen et al. [6] constructed an Asian face image aesthetic enhancement network that uses the aesthetics of different face images to guide the aesthetic enhancement of Asian face images and meet Asian aesthetic standards. Fang et al. [7] used generative adversarial networks to carry out a makeup transfer process for face images of different ages. Li et al. [8] proposed SC-GAN for automatic expression manipulation. Chandaliya et al. [9] proposed PlasticGAN to provide personalized facial plastic surgery effects. Liu et al. [10] proposed a face image replacement method based on regional GAN inversion by generative adversarial networks to accurately replace face images, making them more natural and giving them more realistic details. Xiang et al. [11] proposed RamGAN to transfer region-level makeup and preserve true details within a face image.

The aforementioned methods have made remarkable progress in generating facial beautification images, but they also have some shortcomings. For example, some of their network structures are complex or they require large volumes of data for training, while some of the generated results are of poor quality and lack diversity. Therefore, in this paper, a facial image beautification generative adversarial network (FIBGAN) method based on an generative adversarial network framework is proposed. Furthermore, the performance of the face image beautification model is improved to achieve a more accurate face image beautification effect. The main achievements of this study are as follows:

(1): A face image beautification network that can generate images with different beautification styles through adjusting the beautification intensity for the target features of an image is constructed.
(2): A beautification style intensity loss function is designed to preserve the similarity of the generated image to the original image during the process of beautification in order to avoid the generated image being biased toward the original or style image.
(3): A weight demodulation method is proposed, and the generator is re-designed to effectively reduce feature artifacts in the generated images and to avoid distortions during style transmission.

2. Related Work

2.1. Generative Adversarial Networks

Generative adversarial networks are widely used in the image field. Goodfellow [12] first proposed the generative adversarial network to generate realistic data samples, such as images and audio, through using both a generator and a discriminator. Ghani [13] then used a GAN to generate synthetic facial images, storing them in a blockchain to protect user privacy. Subsequently, Luo [14] proposed the MaGAT method, which improved the generator’s attention to detail for facial images through adversarial training and enhanced the robustness of facial image editing and the GAN model. Wei [15] proposed a parallel GAN method for editing the attributes of facial images, which improved the editing effect and accuracy. Tian [16] proposed a causal representation learning method, which can more accurately evaluate the quality of facial images generated by GANs in order to improve the clarity, quality, accuracy, and stability of the generated images. Peng et al. [17] proposed the ISFB-GAN method, which can be used to beautify face images at the semantic level. This method combines a semantic information encoder and a decoder to improve the aesthetics of face images through learning semantic information while preserving the original features. Akram et al. [18] proposed US-GAN for face image expression synthesis, which achieved a significant performance improvement when improving the quality and realism of synthetic face images. This approach utilizes a generative adversarial network framework to facilitate information transfer and learning through final jump connections. Dubey et al. [19] proposed that transformer architecture can be combined with a GAN to improve the generated image quality and model performance. This paper explores the different transformer GAN variants, covering their applications in image generation, super resolution, image restoration and other tasks, and discusses the advantages and challenges of this approach, providing directions for future research. Yauri-Lozano et al. [20] proposed the performance of Spanish natural language processing (NLP) encoders in this task. They conducted a quantitative and qualitative analysis to evaluate the effectiveness of different NLP encoders for generating facial images from text descriptions and proposed the potential and challenges of optimizing GANs for processing Spanish text, advancing the development of facial synthesis techniques in multilingual environments.

However, because the generated adversarial network itself finds it difficult to modify the variables, the output results are relatively limited and cannot be applied to the beautifying model of face image.

2.2. AdaIN in Image Beautification Styles

Modules for learning beautification styles need to be added to the learning process of generative adversarial networks without beautification style characteristics. Among them, AdaIN [21] (adaptive instance normalization) is a technique commonly used for image style transformation. It applies one image style to another image through adjusting the statistical characteristics of the input image. First, AdaIN normalizes the features of the content images. Then, according to the statistical characteristics of style images (e.g., mean and variance), it is re-adjusted to learn the beautification style. However, the extensive normalization of image features in AdaIN usually results in the loss of some details from the original image.

2.3. Encoder

Face image beautification methods based on generative adversarial networks can be divided into two categories:

(1): Face image beautification generative adversarial network methods with no pre-encoder. These are networks that only improve on an existing GAN, such as CycleGAN [22] (cycle-consistent generative adversarial network), StyleGAN [23] (style generative adversarial network), or other similar models. GANs without pre-encoders have obvious advantages, such as the simpler network structure, reduced computational cost, and higher training speed. As no additional encoders are required for the extraction of features, the network is more lightweight and the training and reasoning process is more efficient. However, this simplification sometimes prevents the network from sufficiently extracting the features of the input data, affecting the quality and diversity of the generated images. In addition, some GANs without pre-encoders face the problem of training instability, thus requiring the use of additional methods. While this approach performs well in some cases, its feature extraction, training stability, and data requirements remain to be addressed.
(2): Face image beautification adversarial network methods with a pre-encoder. When an encoder is added to the start of a GAN, the network structure becomes more complex, but more realistic results can be obtained. StarGAN [24] (star generative adversarial network) is an example of such a network. GANs with pre-encoders make better use of the characteristics of the input data, thereby improving the quality and variety of the generated images. Through using pre-trained encoders to extract high-level features from the input data, the network is able to more accurately understand and learn the distribution of the data, which, in turn, allows for the production of more diverse images. In addition, GANs with pre-encoders generally have better training stability as they are able to learn the representation of the data more efficiently, reducing the instability during training. However, this approach increases the complexity and computational cost of the network, requiring additional encoders to extract features, thus resulting in increased time and resource consumption for training and inference processes.

In summary, the key problems faced by generative adversarial networks, beautifying style networks, and encoders applied in the context of face image beautification are as follows:

(1): It takes a lot of computing resources and time to train GANs. For complex tasks or high-resolution images, the computational cost is higher, which results in higher requirements for the hardware equipment and training time.
(2): At present, while face image beautification style learning networks can be converted between different styles, the conversion results are not always natural and distortions may be observed.
(3): Although GANs without pre-encoders simplify the structure and reduce the computing cost, they cannot fully extract the input data features, thus affecting the diversity of the generated images; furthermore, extensive normalization operations—as in the case of AdaIN—can lead to the loss of detail information. On the other hand, although GANs with pre-encoders improve the diversity, their complex network structures and pre-encoders increase the computing cost and, therefore, the training time. In order to solve the above problems, FIBGAN is proposed in this paper.

3. Methods

3.1. Network Model

The proposed FIBGAN model was constructed using three modules: reference-guided image synthesis, random vector-guided synthesis, and the generative adversarial network. The overall framework of the model is shown in Figure 1. In the reference-guided image synthesis module, a style encoder is used to extract beautification style information from an input image and to convert it into a corresponding style vector w for style transfer and image control. The random vector-guided synthesis module uses a mapping network to map the input random vector z to the style vector w in order to create a potential space for controlling the beautification style and image generation via the image generation model. In the generative adversarial network structure, the core task of the generator is to generate data that appear as real and high quality as possible through learning and iteration, in order to deceive the discriminator through imitating a real data distribution. The discriminator acts as an evaluation and feedback mechanism, which drives the generator to generate more realistic data through providing an evaluation of the authenticity of the generated data, thus improving the quality of the generator model.

In order to enhance the stability of the discriminator, the deep convolutional neural network is chosen as the structure of the discriminator. Leaky ReLU is also selected as the activation function instead of the traditional ReLU activation function to avoid the gradient disappearance problem and ensure that the gradient of the discriminator can be effectively transmitted during the training process. The network architecture of the discriminator also includes multiple convolution layers and batch normalization layers. The combination of these layers not only improves the discriminator’s ability to recognize detailed features but also helps to improve the stability of training and the quality of generated images.

The FIBGAN model takes a feature pyramid network (FPN) [25] as the main encoder structure, replacing the conventional ResNet encoder. The encoder needs to accept both the original image and the style image as inputs. The advantage of using FPNs as encoders is that they can capture and combine image features at different scales more effectively, which is especially important for beautification style learning tasks. When inputting the original and style images into the encoder, the network can process the features of both images at different levels and synthesize a face image with a beautification style in the subsequent generation process. At the same time, the FIBGAN model adjusts the generator and discriminator that comprise the adversarial network structure to make the generated face image more natural and clearer in the beautification process. The following sections discuss the reference-guided image synthesis, random vector-guided synthesis, and generative adversarial network modules.

3.2. Reference-Guided Image Synthesis

The beautification style image is input into the encoder, and a style vector that captures the features of the beautification style image is extracted through the encoder network. This method makes use of the characteristics of the encoder to transform the visual features of the beautification style image into a vector representation, such that the vector can effectively express the beautification style information of the image. This process is called reference-guided image synthesis, which is depicted in Figure 2. First, the original image and the style image are input into the style encoder. Then, the style vector w is obtained through feature extraction. Finally, the style vector w is entered into the generator to generate an image with the beautification style.

In the proposed model, an FPN is used to construct the style encoder. FPNs were initially applied in object detection and semantic segmentation tasks, effectively solving problems at different scales through the construction of top-down and horizontally connected feature pyramid structures. FPNs can also generate multi-level and differently scaled feature maps, effectively capturing the feature information from images at each level. Through extracting the feature map of the reconstructed part of the FPN in the style encoder, more abundant and diverse image features can be obtained. This not only improves the expression ability of the style vector but also effectively retains the style features of the original image. In particular, the FIBGAN model uses a six-layer FPN to build a style encoder. After sampling each layer, style vectors with a size of 1 × 512 are output, and a total of 6 style vectors are obtained (forming a 6 × 512 matrix). The subsequent reference-guided image synthesis experiment demonstrated that the generated images contain richer information with respect to the image beautification style, as well as presenting obvious differences between images with different beautification style intensities.

Figure 3 shows the process of learning the beautification style. In order to accurately control the beautification style intensity of the output image, the FIBGAN model introduces the combined input of the original image and beautification style image into the encoder. Thus, the encoder generates two style vectors: one is the style vector w₀ associated with the original image, while the other is the style vector w_s for the beautification style image. The detailed part of the bottom layer of w_s is combined with the contour part of the upper layer of w₀ to generate a diversified style vector w, which integrates the contour features of the original image and the detailed features of the style image. At the same time, through adjusting the level of distribution and the quantity of beautification style features, the beautification style intensity in the generated images can be effectively controlled.

3.3. Random Vector-Guided Synthesis

A random variable is input into the encoder and, as it is randomly generated, the style vector generated by the encoder is also random. To ensure a uniform distribution of style vectors in the beautify style space, the random variable z is usually sampled from the standard normal or uniform distribution. The standard normal distribution provides a balanced way to generate style vectors with good coverage in high-dimensional space. A uniform distribution, on the other hand, ensures a uniform distribution of random vectors within a specified range, which may better cover the entire style space. The purpose of selecting these distributions is to map these random variables z into style vectors in the style space through the mapping ability of the encoder, and these style vectors are evenly distributed in the space, so as to realize the randomization transformation of the image style. This process can be called random vector guided synthesis, as shown in Figure 4. First, the random vector z is entered into the mapping network. Secondly, feature extraction is carried out to generate style vector w. Finally, the style vector w is input into the generator at the same time as the style image to generate an image with a random style. This method enables the generator to generate images with diverse styles based on random style vectors sampled from the specified distribution, which enhances the style diversity and generation ability of the model.

3.4. The Adversarial Network Structure

3.4.1. Generator

Figure 5 shows the structure of the generator network. The generator takes the original image and the style vector as inputs and generates the result through learning the image beautification style according to the inherent characteristics of the style vector. The generator adopts the classical image reconstruction network structure, which is divided into two parts: downsampling and upsampling. Downsampling is used to extract the features of the input image, while upsampling is responsible for performing the image reconstruction process. The two parts are connected to each other through the horizontal connection, which includes mask and filter processing. The image information obtained through the horizontal connection only retains the outline of the image, ensuring that the low-level features of the input image are effectively transmitted to the generated image, thus ensuring the quality and accuracy of the generated image.

The generators in traditional style learning methods typically use the AdaIN module to extract feature information from style vector w, which is then applied in the image generation process. With the AdaIN method, the generated image can integrate the features of the original and style images. However, the use of the AdaIN module in traditional methods can lead to face image feature artifacts, as shown by the enlarged feature artifact regions of the generated face image in Figure 6. In order to address this problem, the weight demodulation method replaces the AdaIN method as the style transmission module in the proposed model. This method not only preserves the correlations between feature graphs but also eliminates artifacts while maintaining full controllability of the output. The weight demodulation method realizes style transmission by adjusting the weight of the convolution layer instead of the mean and variance of the feature graph. The method uses an independent network module to learn the statistics of style images and apply this information directly to the weight adjustment of the convolution kernel. This method not only preserves the correlation of feature graphs but also eliminates artifacts by avoiding the normalization of feature graphs. The advantage of weight demodulation is that it can provide higher detail fidelity and image quality while maintaining the style consistency of the generated image, thus overcoming the artifact problem caused by the AdaIN method.

The AdaIN architecture can be divided into two components: standardization and modulation. The standardization component is responsible for standardizing the input data, while the modulation component introduces the information of the target image by means of parameters, which is mainly achieved by replacing the standard deviation and mean of the standardized data. In this process, the mean and variance of each feature map are normalized, respectively, but the correlation between feature dimensions is destroyed, resulting in local feature peaks generated by the generator in the process of image generation, which leads to feature artifacts. In order to solve this problem, the weight demodulation algorithm abandons the normalization step in AdaIN and adopts a convolution operation instead, thus avoiding the loss of associated feature information and effectively preventing the generation of feature artifacts. Before the modulation step, the algorithm inserts a convolution layer and adjusts the scaling of the convolution weights in a specific way. The scaled convolution weight is calculated as follows:

ω'_{i j k} = s_{i} \times ω_{i j k}

(1)

where

s

is the scaling ratio of the

i

input feature map. After scaling and convolution, the standard deviation of output activation is:

μ_{i} = \sqrt{\sum_{i, k} {ω^{' 2}}_{i j k}}

(2)

After the standard deviation is obtained, the scaled weights are reversely adjusted to restore the standard deviation of the output feature map to the initial unit level. This reverse adjustment process is called demodulation, and the demodulation part is calculated as follows:

{ω^{″}}_{i j k} = \frac{{ω^{'}}_{i j k}}{\sqrt{\sum_{i, k} {ω^{'}}_{i j k} + θ}}

(3)

In the above formula, a minimal value

θ

is introduced to ensure that the denominator is not 0. The weight demodulation technique follows the principle of normalization and aims to make the output feature graph have the same distribution mean and variance, which can effectively prevent the appearance of feature artifacts while realizing style transfer.

3.4.2. Discriminator

The discriminator determines the authenticity of an input image and outputs a value between 0 and 1, where an output close to 1 indicates that the image is true (or authentic), while an output close to 0 indicates that the image is false (or fake). The discriminator structure in the proposed model consists of a six-layer FPN. The output contains several branches, each of which corresponds to a field, and the output of each branch represents the probability of the input image in this field.

3.5. Loss Function

In order to enable the generator to generate both original and beautification style images, the loss function of the generative adversarial network, the loss function of the image reconstruction, the loss function of the style image reconstruction, and the loss function of the style intensity are introduced.

The generative adversarial network loss function adjusts the adversarial training process between the generator and the discriminator, such that the discriminator can more accurately distinguish between real images and those generated by the generator. This formula is expressed by Equation (4):

\min_{G} \max_{D} L_{a d v} = E_{x \sim P_{d a t a} (x)} [\log D (x)] + E_{z \sim P_{z} (z)} [\log (1 - D (G (z)))]

(4)

where

E (*)

represents the expected value of the distribution function,

D

represents the discriminator,

G

represents the generator,

P_{d a t a} (x)

represents the real sample distribution,

P_{z} (z)

represents the low-dimensional noise distribution,

x

represents the input image, and

z

represents the random variable.

The image reconstruction loss function for the generated image passes through the two conversion processes of the network to ensure that the generated image is as consistent as possible with the original input image:

L_{c y c} = E_{x, y, z} [x - G (G (z), s)]

(5)

where

y

represents the variable of the style domain in which the image is located and

s

represents the generated style vector.

The style reconstruction loss function visually maintains the features and styles of the image generated, such that they are similar to those of the specified style vector:

L_{s t y} = E_{x, y, z} [x - G (G (x), s)]

(6)

The main purpose of this loss is to measure the difference between the generated image and the target style image. As the two losses L_cyc and L_sty are mutually exclusive, they are weighted and taken as a whole loss to ensure the quality of image style learning, to a certain extent, in order to adjust the beautification style of the generated image during the training process:

L_{s t r} = λ E_{x, y, z} [G (x) - y] + λ E_{x, y, z} [G (x) - x]

(7)

where

λ

represents the weight of the distribution function.

The above four loss functions are synthesized into a total loss function:

L_{F I B} = \min_{G} \max_{D} L_{a d v} + L_{c y c} + L_{s t y} + L_{s t r}

(8)

The total loss function not only guides the cooperative training of the generator and discriminator in the generative adversarial network, thus ensuring the clarity and diversity of the generated face image, but also effectively controls the beautification style intensity of the generated face image, making the generated face image more consistent and avoiding artifacts.

4. Experimental Results and Analysis

In order to ensure the effectiveness of the proposed method, detailed experiments are carried out to verify the designed method based on a number of objective evaluation indicators.

4.1. Experiment

4.1.1. Experimental Environment

For this study, a computer equipped with a 13th Generation Intel(R) Core(TM) i5-13600KF CPU and NVIDIA GeForce RTX 4060 Ti graphics card was used. Ubuntu 20.04 was selected as the computer’s operating system, and all the necessary software tools were installed on the computer, including Office software and the Python programming language, with the PyTorch deep learning framework used to build models and to conduct experiments.

In order to verify the reliability and validity of the proposed method, objective evaluation indices were utilized. Among them, the Fréchet inception distance (FID) [26] index mainly evaluates the similarities between the generated and original images: the smaller the value, the more similar the generated image is to the original image. The learned perceptual image patch similarity (LPIPS) [27] index mainly evaluates the diversity of generated images: the larger the value, the richer the visual diversity of the generated images. In addition to the FID and LPIPS, the inception score (IS) [28] and structural similarity index (SSIM) [29] were also used to comprehensively evaluate the quality and visual similarity of the generated images. The IS is used to measure the quality and diversity of generated images, where a model with a higher value is generally considered to generate higher-quality images; meanwhile, the SSIM evaluates the structural similarity between generated and original images. These indicators jointly demonstrate the superior performance and diversification ability of the model in the style transfer task: the closer the SSIM value is to one, the higher the structural similarity between the generated and original images. Therefore, the quality of the generated image is closer to that of the original image.

4.1.2. Experimental Data Set

This paper uses the Celeb-HQ data set provided by the Laboratory of the Chinese University of Hong Kong, which contains high-resolution images of 1024 × 1024 pixels that are sharper and more detailed than the traditional CelebA data set. For the experiment, 30,000 images were extracted from the data set with a resolution of 1024 × 1024, of which 10,500 were male images and 19,500 were female images. In addition to these two attribute tags, there are also age, race, non-celebrity, and celebrity attribute tags. The experiment only used the gender attribute labels.

In the pre-processing phase, the image resolution was adjusted to 512 × 512, which was chosen in order to balance computational resources and model performance, ensuring that the training process is completed in a reasonable computational time while maintaining sufficient image detail. In the training process, the data were divided into the training set and verification set, and the data were fully segmented. During training, 150,000 rounds were performed, using a batch size of two batches per round. The 512 × 512 resolution and this batch size were chosen to strike the best balance between maintaining training efficiency and image detail, and these settings have been widely validated as effective configurations for similar tasks. For the comparison of the methods, the benchmark method commonly used in the field was selected to ensure the reliability and comparability of the evaluation results.

In order to ensure the effectiveness of the training, the data set was divided into a training set and a verification set in the training and generation stages. Compared with other variants of the GAN, the proposed method performs better in face image conversion tasks and is widely used in related fields. In order to further verify the effectiveness of the proposed model, it will be compared with other methods.

4.2. Experimental Results of Reference-Guided Image Synthesis

Reference-guided image synthesis learns the style of an input style image, allowing an image with a beautified style to be generated from an original input image. In the proposed model, the style encoder structure is used to achieve this goal. As gender was introduced as a domain label when constructing the data set, the provided FID, IS, SSIM, and LPIPS indicators represent the averages over the two sets of gender domain labels, respectively. Table 1 shows the quantitative index results obtained with the StyleGAN, pix2pix (image-to-image translation with conditional adversarial networks) [30], CycleGAN, and StarGAN v2 [31] methods, as well as the proposed method. Among them, StyleGAN can generate face images with similar styles through learning the features and structures of style images. This method can beautify the generated face image through the specification of different beautification reference images; it not only improves the clarity of the generated image but also retains key features of the original image, such as detailed features and expressions of the face. pix2pix takes the beautification style image as a conditional input and maps the input image to the output image of the target beautification style to achieve personalized face beautification in a relatively simple manner, such as changing the makeup, stylizing, or enhancing details. Using unsupervised learning, CycleGAN can beautify original face images through adversarial training, learn the mapping relationship between different styles, and then convert the original images into face images containing beautification styles for output. This method does not require pairing data, just a large number of style images. StarGAN v2 can convert multiple styles through a single model and can convert original images into different beautification styles. Not only can the key features of the input image, such as the facial structure and expression, be effectively adjusted according to the specified style image but also details related to appearance.

As can be seen from Table 1, the indicators of StarGAN v2 were better than those of the other existing GAN variants. Compared with StarGAN v2, the FID for the FIBGAN model was decreased by 5, the IS was increased by 2.7, the SSIM was increased by 0.042, and the LPIPS was increased by 0.015. As such, all four measures were better than those of StarGAN v2. The style encoder structure used by StarGAN v2 is based on the ResNet network, which leads to difficulties in extracting image detail features. On the contrary, the FIBGAN model adopts an FPN, which can extract detailed features from the images more effectively and can further improve the quality of style learning through hierarchical processing. Therefore, compared with the other methods, the FIBGAN model not only extracts richer and more hierarchical style vectors but also learns the features of the style images at the detail level better. Figure 7 provides a visual comparison between the output images of the FIBGAN and StarGAN v2 models, from which it can be seen that the proposed model provides a better and more natural visual effect.

In terms of vision, the FIBGAN model significantly reduced artifacts in face images compared with StarGAN v2 and avoided overprocessing of face jaw position by StarGAN v2. With the FPN network, the FIBGAN model performs well in detail feature extraction and style learning, and the generated images are more natural and real.

4.3. Results of the Random Vector-Guided Synthesis Experiment

Random vector-guided synthesis is a method used to guide the generator in generating diverse images through inputting random variables without relying on style images. In particular, the mapping network structure is used to generate random style vectors in order to achieve random vector-guided synthesis. As mentioned above, the FID, IS, SSIM, and LPIPS indicators represent the average of the two sets of domain labels. Table 2 shows the quantitative indicators for the StyleGAN, deep convolutional generative adversarial network (DCGAN) [32], and Wasserstein generative network (WGAN) [33], as well as StarGAN v2. Among them, StyleGAN introduces a gradient mapper and style vector to gradually generate face images and control the generation of face images through the input of random vectors, effectively improving the beautification effect of the generated face images. In the DCGAN, the generator takes random vectors as input, a convolution layer and a transposed convolution layer are combined, and the resulting model can generate clear face images. In addition, the discriminator training of the DCGAN effectively improves the diversity of the generated face images and avoids the problem of pattern collapse through countering the loss function. In the WGAN, the generator typically receives a random vector as input without directly relying on a specific beautifying style image. To start the generator during the model construction process, the architecture of the generator and discriminator is initially adjusted, following which the loss function of the WGEN—namely, the Wasserstein distance—is used. This method helps to improve the stability of the generator and the quality of the generated face images. In StarGAN v2, the generator inputs a multivariate label vector. This approach is achieved through jointly optimizing a generator and a multi-task discriminator that accepts a random vector and conditional labels to generate face images that meet the specified conditions.

As can be seen from Table 2, the indicators of StarGAN v2 were better than those of the other existing GAN variants. Compared with StarGAN v2, the FID of the FIBGAN model was decreased by 1.6, the IS was increased by 2.2, the SSIM was increased by 0.058, and the LPIPS was increased by 0.103. As such, all four measures were better than those of StarGAN v2. StarGAN v2 uses the AdaIN normalization operation during its random vector-guided synthesis, resulting in feature artifacts during face image beautification style learning. In contrast, the FIBGAN model adopts the weight demodulation method, which effectively reduces image distortion and learns more detailed features from the original image, thus leading to better performance. Therefore, the FIBGAN converts the input random variables into style vectors that are evenly distributed in the beautification style space due to the mapping ability of the encoder, thus achieving a randomized transformation of the image beautification style. Figure 8 shows a comparison indicating the feature artifact elimination effects of the FIBGAN model compared to StarGAN v2.

According to the results of the control image guided synthesis experiment and random vector guided synthesis experiment, the FID index showed that the FIBGAN-generated images had a low similarity score with the original images, indicating that the FIBGAN could maintain the consistency of image content with high quality. In contrast, StyleGAN has a large content deviation in some complex scenes, the pix2pix series lacks diversity, StarGAN v2’s style conversion ability is limited, and earlier methods such as the DCGAN and WGAN are insufficient in image quality and diversity. The LPIPS index showed that the FIBGAN generated a higher diversity of images, thanks to its diversified loss function. StyleGAN’s diversity is limited to specific styles, the pix2pix series is limited to image-to-image conversion, StarGAN v2 lacks diversity in some style transitions, and earlier methods such as the DCGAN and WGAN have limited generation capabilities. The IS index shows that the FIBGAN performs well in both image quality and diversity, achieving a balance between the two. StyleGAN is of high quality but lacks diversity, the pix2pix series is not as good as the FIBGAN, StarGAN v2 lacks quality in some style transformations, and earlier methods such as the DCGAN and WGAN are obviously inadequate. The SSIM index showed that the FIBGAN images generated by the FIBGAN had high structural similarity to the original images due to its structure retention loss function. In some complex scenes, StyleGAN has a large structural deviation, the pix2pix series has a large structural difference, StarGAN v2 has a low structural similarity in some style transformations, and early methods such as the DCGAN and WGAN are inadequate in terms of structural similarity.

In summary, the FIBGAN performs well in relation to the beautification style and the generated images are superior to the existing methods in terms of the quality, diversity and structural similarity.

4.4. Experimental Results of the Beautification Style Intensity Experiment

Next, through adjusting the parameter k in the style encoder, the beautification style intensity of the output image was controlled. The value of k is an integer ranging from one to four. When k is small, the generated face image is close to the original image; meanwhile, as k increases, the generated face image is further enhanced while retaining the original image as its basis. Figure 9 shows an example of the beautification style intensity experiment, where the first image is the original image.

As can be seen from Table 1 and Table 2, StarGAN v2 obtained better performance indicators (FID, IS, SSIM, and LPIPS) when compared to the other existing GAN variants. Therefore, in the beautification style intensity experiment, the FIBGAN model was only compared with StarGAN v2. The image quality generated by the generative adversarial network was evaluated using the FID, and the difference between the generated and real images was measured by comparing the statistical distance of the generated images: the smaller the FID value, the better the quality of the generated image. Table 3 shows the FID results obtained in the beautification style intensity experiment.

As can be seen from Table 3, combining the style vector w₀ generated from the original image and the style vector w_s generated from the beautification style image through adjusting k and the style encoder to generate diversified style vectors w allowed the proposed model to more effectively control the beautification style intensity of the generated face images. Existing GAN variants usually ignore the detail features of the face image when extracting the overall style features of the image during face style learning, thus limiting fine control of the output image beautification style. In contrast, the style vector extracted by the style encoder module in the proposed model has a multi-layer structure, where each layer contains different style features, allowing for better learning of the detailed features of face images.

4.5. Ablation Results

For the ablation experiment, the FPN, the weight demodulation method, and the loss function were sequentially used to generate high-quality face images with beautification effects. The experiment used StarGAN v2 as a baseline. First, the FPN was added. Then, the weight demodulation method was introduced. Finally, the comparison model was reconstructed using the style loss function L_sty. Meanwhile, the proposed FIBGAN model adopted the FPN, the weight demodulation method, and the total loss function L_FIB proposed in this paper. The results are listed in Table 4.

As can be seen from Table 4, for the StarGAN v2 benchmark method, the performance in terms of the four indicators was poor; however, after the addition of the three improvement methods—namely, the FPN, the weight demodulation method, and the style reconstruction loss function L_sty—the four indicators improved slightly. Furthermore, the FIBGAN model performed better than the fully improved StarGAN v2 in all four indicators. This is because the proposed total loss function L_FIB not only guides the co-training of the generator and discriminator but also ensures the clarity and diversity of the generated face images and avoids artifacts in the images. At the same time, the size of the style vector extracted by the FPN is altered to 6×512, which is larger than the previous one (1 × 64), thus covering more abundant style feature information. In the process of face image beautification, maintaining a balance between the features of the face is necessary to avoid the generated face image being biased toward either the original or style image.

In order to further study the effect of the number of FPN layers on the performance and feature extraction quality, an ablation study was conducted. The style encoder is constructed with different number of FPN layers, and its performance in the beautification style learning task is evaluated. The experimental results show that with the increase of FPN layers, the style richness and beautification effect of the generated images are significantly improved. However, when the number of FPN layers increases to a certain extent, the performance improvement tends to be saturated and the computing cost increases significantly. Therefore, in this paper, we choose a six-layer FPN network to build a style encoder. After sampling on each layer, we output vectors with a size of 1 × 512, thus obtaining a total of 6 × 512 style vectors.

The experiments reported in Table 5 were performed to further verify the effectiveness of the method. The experimental results show that the generated images not only contain richer information of image beautification style but also show obvious difference between images with different beautification style intensity. This indicates that by setting the FPN layer number to 6 × 512, the image performance and feature extraction quality can be effectively balanced while ensuring the performance.

5. Conclusions

The FIBGAN model was proposed to solve the problems of poor detail beautification in current face image beautification methods. The proposed FIBGAN model not only controls the diversity of the beautification effect in generated face images through potential space adjustment but also generates high-resolution face images. The core idea of the model is to construct a pre-encoder using the feature pyramid network, extract the details of the face images using the feature classification method, and output a style vector in terms of the feature layers. The FIBGAN model can effectively control the beautification style intensity of the generated images through synthesizing the style vectors of the original and beautification style images at different levels. In addition, the weight demodulation method is used as the beautification style transmission module in the generator, which allows for the production of richer and more detailed face images, effectively eliminates artifacts in the feature map, and significantly improves the beautification effect in the generated face images. The experimental results demonstrated that the FIBGAN model obtained good results in all the considered quantitative indices when compared with various state-of-the-art models used in the field of face image beautification. The proposed model also showed significant advantages in terms of the transfer of details from the beautification style face image, in the generation of face images with different beautification intensities, and in the reduction of distortions in the face image. In future research, the encoder and generator will be optimized, such that the network can effectively decouple more detailed image features.

Although the FIBGAN model in this paper has achieved a certain effect in the beautification of face images, the encoder and generator still have room for further optimization. In future work, optimizing these components will be considered, and more detailed image features will be decoupled in the network, further improving the beautification.

Author Contributions

Conceptualization, J.G. and J.L.; methodology, J.L.; software, J.L.; validation, J.G. and J.L.; formal analysis, J.L. and J.G.; investigation, J.L. and J.G; resources, J.G.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.G.; visualization, J.L. and J.G.; supervision, J.L. and J.G.; project administration, J.L. and J.G.; funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 61771347).

Data Availability Statement

Celeb-HQ: Dataset utilized in this research is publicly available: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 1 October 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

Yadav, N.; Singh, S.K.; Dubey, S.R. Isa-gan: Inception-based self attentive encoder–decoder network for face synthesis using delin eated facial images. The Vis. Comput. 2024, 40, 8205–8225. [Google Scholar] [CrossRef]
Hu, K.; Liu, Y.; Liu, R.; Lu, W.; Yu, G.; Fu, B. Enhancing quality of pose-varied face restoration with local weak feature sensing and gan prior. Neural Comput. Appl. 2024, 36, 399–412. [Google Scholar] [CrossRef]
Hatakeyama, T.; Furuta, R.; Sato, Y. Simultaneous control of head pose and expressions in 3D facial keypoint-based GAN. Multimed. Tools Appl. 2024, 83, 79861–79878. [Google Scholar] [CrossRef]
Chen, H.; Li, W.; Gao, X.; Xiao, B. AEP-GAN: Aesthetic Enhanced Perception Generative Adversarial Network for Asian facial beauty synthesis. Appl. Intell. 2023, 53, 20441–20468. [Google Scholar] [CrossRef]
Wang, J.; Zhou, Z. De-Beauty GAN: Restore the original beauty of the face. In Proceedings of the 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Chengdu, China, 3–5 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 527–531. [Google Scholar]
Chen, H.; Li, W.; Gao, X.; Xiao, B.; Li, F.; Huang, Y. Facial Aesthetic Enhancement Network for Asian Faces Based on Differential Facial Aesthetic Activations. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3785–3789. [Google Scholar]
Fang, S.; Duan, M.; Li, K.; Li, K. Facial makeup transfer with GAN for different aging faces. J. Vis. Commun. Image Represent. 2022, 85, 103464. [Google Scholar] [CrossRef]
Li, S.; Liu, L.; Liu, J.; Song, W.; Hao, A.; Qin, H. SC-GAN: Subspace clustering based GAN for automatic expression manipulation. Pattern Recognit. 2023, 134, 109072. [Google Scholar] [CrossRef]
Chandaliya, P.K.; Nain, N. PlasticGAN: Holistic generative adversarial network on face plastic and aesthetic surgery. Multimed. Tools Appl. 2022, 81, 32139–32160. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Li, M.; Zhang, Y.; Wang, C.; Zhang, Q.; Wang, J.; Nie, Y. Fine-grained face swapping via regional gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Xiang, J.; Chen, J.; Liu, W.; Hou, X.; Shen, L. RamGAN: Region attentive morphing GAN for region-level makeup transfer. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. Available online: https://papers.nips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html (accessed on 1 October 2024).
Ghani, M.A.N.U.; She, K.; Rauf, M.A.; Alajmi, M.; Ghadi, Y.Y.; Algarni, A. Securing synthetic faces: A GAN-blockchain approach to privacy-enhanced facial recognition. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102036. [Google Scholar] [CrossRef]
Luo, S.; Huang, F. MaGAT: Mask-Guided Adversarial Training for Defending Face Editing GAN Models From Proactive Defense. IEEE Signal Process. Lett. 2024, 31, 969–973. [Google Scholar] [CrossRef]
Wei, J.; Wang, W. Facial attribute editing method combined with parallel GAN for attribute separation. J. Vis. Commun. Image Represent. 2024, 98, 104031. [Google Scholar] [CrossRef]
Tian, Y.; Wang, S.; Chen, B.; Kwong, S. Causal Representation Learning for GAN-Generated Face Image Quality Assessment. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7589–7600. [Google Scholar] [CrossRef]
Peng, T.; Li, M.; Chen, F.; Xu, Y.; Xie, Y.; Sun, Y.; Zhang, D. ISFB-GAN: Interpretable semantic face beautification with generative adversarial network. Expert Syst. Appl. 2024, 236, 121131. [Google Scholar] [CrossRef]
Akram, A.; Khan, N. US-GAN: On the importance of ultimate skip connection for facial expression synthesis. Multimed. Tools Appl. 2024, 83, 7231–7247. [Google Scholar] [CrossRef]
Dubey, S.R.; Singh, S.K. Transformer-based generative adversarial networks in computer vision: A comprehensive survey. IEEE Trans. Artif. Intell. 2024, 5, 4851–4867. [Google Scholar] [CrossRef]
Yauri-Lozano, E.; Castillo-Cara, M.; Orozco-Barbosa, L.; García-Castro, R. Generative Adversarial Networks for text-to-face synthesis & generation: A quantitative–qualitative analysis of Natural Language Processing encoders for Spanish. Inf. Process. Manag. 2024, 61, 103667. [Google Scholar]
Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html (accessed on 1 October 2024).
Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 694–711. [Google Scholar]
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. Available online: https://arxiv.org/abs/1606.03498 (accessed on 1 October 2024).
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
Choi, Y.; Uh, Y.; Yoo, J.; Ha, J.W. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8188–8197. [Google Scholar]
Gao, F.; Yang, Y.; Wang, J.; Sun, J.; Yang, E.; Zhou, H. A deep convolutional generative adversarial networks (DCGANs)-based semi-supervised method for object recognition in synthetic aperture radar (SAR) images. Remote Sens. 2018, 10, 846. [Google Scholar] [CrossRef]
Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]

Figure 1. Overall FIBGAN model framework.

Figure 2. Reference-guided image synthesis diagram.

Figure 3. Schematic diagram of the beautification style learning process.

Figure 4. Random vector-guided synthesis diagram.

Figure 5. Schematic diagram of the generator network structure.

Figure 6. Image of feature artifacts with the use of the AdaIN module.

Figure 7. Comparison between the FIBGAN and StarGAN v2 methods.

Figure 8. The artifact elimination effect of the FIBGAN model compared to StarGAN v2.