Article

Few-Shot Face Recognition: Leveraging GAN for Effective Data Augmentation

1 School of Automation and Intelligence, Beijing Jiaotong University, Beijing 100044, China
2 School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(10), 2003; https://doi.org/10.3390/electronics14102003
Submission received: 9 April 2025 / Revised: 8 May 2025 / Accepted: 13 May 2025 / Published: 15 May 2025
(This article belongs to the Special Issue Empowering IoT with AI: AIoT for Smart and Autonomous Systems)

Abstract

Face recognition technology is a prominent research area in the digital age, with significant applications in commerce and security. It relies on high-quality training data, which is difficult to obtain in practical engineering settings: the substantial collection cost and stringent privacy protection regulations aggravate sample scarcity and impose few-shot learning conditions. Collecting face data under specific conditions, such as extreme lighting and poses, is particularly challenging, and the resulting imbalance in sample distribution severely impairs the model's ability to generalize and recognize accurately. This paper addresses this issue by leveraging Generative Adversarial Networks (GANs) for effective data augmentation. We propose the SR-StarGAN and FPNSA-AttGAN architectures to generate diverse virtual face images across different feature domains, constructing a large-scale, widely distributed dataset that supports face recognition under various attributes and complex conditions and thereby enables effective few-shot face recognition. We detail the core algorithms and network frameworks of SR-StarGAN and FPNSA-AttGAN and describe the training of IdentiFace on the synthetic samples. The results show a significant improvement in face recognition accuracy, from 83.59% to 96.64%, providing a viable approach to data scarcity, enhanced generalization under data-constrained few-shot learning scenarios, and useful insights for future studies on generation-based face recognition.

1. Introduction

With the rapid advancement of data-driven technologies, face recognition has emerged as a significant driver of social development in the age of intelligence. Face recognition networks play an indispensable role in personal privacy protection, social security governance, and user authentication on smart terminals. However, as application scenarios continue to expand, the training of face recognition models faces unprecedented difficulties. The ability of these models to generalize to new situations hinges on the availability of a broad and varied dataset, yet collecting such data is hindered by numerous obstacles, chiefly privacy protection, compliance with security regulations, and efficient resource allocation. Consequently, there is a pressing need for models that achieve high performance when trained on a limited set of examples, a setting known as few-shot learning [1,2].
Currently, meta-learning [3], metric learning [4], and data augmentation [5] are the most popular techniques for few-shot learning. The best-known optimization-based meta-learning techniques include MAML [3] and its variants, such as Reptile [6], Meta-Transfer Learning [7], iMAML [8], and FAML [9]. These techniques address target tasks with only 1–20 categories, whereas face recognition tasks typically involve thousands of categories; hence, they must cope with greater generalization challenges and stricter application requirements. As a metric-learning model, Partial-FC [10] has a feature space that is fixed after training, which makes it hard to adapt to the feature distribution of new categories. Moreover, when only a small number of samples are available, such models struggle to adjust their feature representations to new data. Conventional data augmentation can expand the dataset by applying various transformations to existing data, but it cannot resolve the underlying problem of limited data availability: it focuses on affine transformations (such as translation, scaling, rotation, and shearing), color transformations (such as brightness and saturation adjustments), and perspective transformations, which produce only small, local changes to the original image and are insufficient to supply diverse data for the models. In contrast, Generative Adversarial Networks (GANs) [11,12] can learn to generate poses, expressions, makeup, and other features that do not appear in the dataset.
Given the above considerations, this study aims to construct a data-constrained yet high-accuracy face recognition network whose performance is comparable to that of networks trained on large-scale datasets, by means of image generation techniques. This essentially constitutes few-shot face recognition.
More specifically, cutting-edge methods such as GANs are employed in this study to provide high-quality data and overcome the limitations of few-shot training paradigms. In addition to generating a widely distributed face recognition dataset with sufficient samples, this approach substantially reduces the reliance on real data collection, satisfying the need for data richness and variety in model training while maintaining user privacy. To remove the technological barrier of multi-attribute joint optimization in identity networks, this study also investigates more effective attribute transfer techniques. Early GAN-based attempts have some inherent drawbacks. For instance, attribute-specific modeling techniques (e.g., the Dual Residual Learning Strategy and GeneGAN [13]) require a dedicated generator for each attribute; despite maintaining an acceptable level of attribute control, this discretized processing may cause the model's complexity to grow exponentially with the number of attributes, inducing serious scalability issues. Attribute-agnostic constraint techniques such as VAE [14], IcGAN [15], and Fader Networks [16] attempt to accomplish multi-attribute control within unified frameworks. Although these methods reduce computational complexity, in real-world applications they frequently deliver less-than-ideal generation quality, manifesting as unstable identity retention, blurry images, and inaccurate attribute control. The two families of methods thus exhibit a trade-off between computational efficiency and generation quality, and this fundamental contradiction has driven researchers to explore more efficient multi-attribute joint optimization solutions. It has been corroborated that simply relying on naive network stacking or global constraints cannot simultaneously meet the requirements for attribute editing precision and generation quality, necessitating systematic innovations at both the architectural and training-mechanism levels.
Inspired by StarGAN [17], we propose SR-StarGAN, a domain-transformation-based generation method. By introducing cyclic consistency constraints and an efficient mapping of target attribute labels, it preserves both the original identity information and the target attributes in the generated images. Compared with earlier methods, SR-StarGAN achieves multi-attribute joint optimization with a single model, which significantly improves training efficiency and practicality, and it can also effectively process overly smooth or distorted images.
Meanwhile, a latent space-based FPNSA-AttGAN developed from AttGAN [18] is also introduced in this study. It incorporates an attention mechanism to enhance the interplay among various features, thereby boosting the diversity and fidelity of the images it produces. Compared with previous techniques, FPNSA-AttGAN demonstrates a precise manipulation of desired attributes while maintaining the integrity of the original identity data. This contributes to more resilient performance of synthesized images in real-world scenarios and markedly enhances the visual richness and variety of output images.

2. Related Work

2.1. Data Augmentation

Data augmentation techniques used in this context fall mainly into two categories, geometric transformations and photometric transformations, both of which are extensively applied across various computer vision tasks.
Geometric transformations manipulate the spatial layout of an image by reassigning pixel values to alternate coordinates, whereas photometric transformations modify the RGB channels by changing pixel color values. Hsu et al. [19] and Taylor et al. [20], respectively, demonstrated that geometric and photometric transformations can effectively prevent model overfitting and enhance image classification performance compared with baseline methods. Their findings showed that data augmentation consistently improved the classification capability of the network across all scenarios, with the most substantial improvements observed for flipping and cropping. With the advancement of relevant research, a growing number of studies have proposed innovative approaches that extend the scope and effectiveness of data augmentation [21]. For example, AutoAugment [22] and Fast AutoAugment [23] use reinforcement learning to determine optimal combinations of augmentation operations; Mixup [24], CutMix [25], and GridMask [26] generate novel image data through techniques such as noise addition, image blending, and masking, thereby enriching the variety and complexity of datasets. Alomar et al. [27] proposed the Random Local Rotation method, which overcomes the boundary distortion caused by conventional global rotation by randomly selecting circular regions of the image for rotation. Concurrently, data augmentation approaches rooted in generative models, including GANs and VAEs, have been introduced to increase the volume and quality of training data through the synthesis of new imagery.

2.2. Generative Adversarial Networks

A typical GAN framework is composed of two components: a generator and a discriminator. The generator is tasked with producing counterfeit samples that closely mimic authentic ones, while the discriminator is trained to distinguish between genuine and fabricated samples. Within the realm of computer vision, GANs are extensively utilized for a variety of tasks, including image style transfer, image-to-image translation, and image generation, with the synthesis of facial images standing out as a particularly significant application.
The remarkable diversity and development potential of GAN-based face attribute synthesis have been demonstrated in many studies. As an important advancement, conditional GANs (cGANs) [28] introduce conditional constraints on top of conventional GANs, which significantly enhances the ability to generate face images with specific attributes and extends the approach to image style transfer. This offers a viable solution to the limited variability of face recognition datasets. Building upon this concept, CycleGAN [29] introduces an unsupervised image translation framework that uses cycle-consistency constraints, thereby removing the dependency on paired training samples and significantly enhancing the diversity of available training data for facial analysis tasks. Luo et al. [30] proposed EA2F-GAN, which uses eye information as a condition to dynamically modify the attribute vocabulary and thereby generate more realistic face images. Image translation GANs aim to transform images across different domains, which is commonly accomplished by training on aligned pairs of input and output images. Pix2pix [31] is the first conditional GAN-based image-to-image translation model trained with paired images. To address the challenge of acquiring paired datasets, Wang et al. [32] proposed the CP-EB model, which pairs audio signals with images to generate talking faces, while Xie et al. [33] proposed the BPFRe model, which removes and modifies facial blemishes through a two-phase framework that improves training under limited paired data. On this basis, StarGAN and AttGAN were introduced nearly simultaneously to address the issue of a single model corresponding to a single attribute. Although both networks yield strong generators, there is still room for improvement.
Rombach et al. [34] proposed applying diffusion and denoising models in the latent space, and StableSR [35] and DiffBIR [36] scaled such models up to accomplish super-resolution image restoration based on the principles of stable diffusion. Fully Convolutional Networks (FCNs) [37] replaced conventional deconvolution with convolution and upsampling, and later architectures such as PSPNet [38], HRNet [39], and LinkNet [40] adopted decoder structures that fuse multi-scale feature maps. In this study, an enhanced super-resolution component is integrated with the adversarial framework of StarGAN, which collectively elevates the perceptual quality and structural coherence of synthesized face images and enriches feature details. Attention mechanisms [41,42] were introduced into GANs by SAGAN [43], BigGAN [44], and Attention-GAN [45] to augment the discriminator's ability to extract features. In this study, FPN [46] and FCN structures are integrated into the generator of AttGAN to facilitate end-to-end pixel-level image generation, while the attention mechanism is embedded in the discriminator, enabling the model to grasp the global structure of images and substantially enhancing the output quality of the generator.
As a GAN for multi-domain image-to-image translation, StarGAN consists of a generator (G) and a discriminator (D). The generator transforms an input image $x$ into an output $y$ conditioned on a target domain label $c$, $G(x, c) \rightarrow y$, while the discriminator both distinguishes real images from generated ones and classifies domain labels, $D: x \rightarrow \{D_{src}(x), D_{cls}(x)\}$. StarGAN uses the adversarial loss $\mathcal{L}_{adv}$ to make generated images indistinguishable from real ones and the domain classification loss $\mathcal{L}_{cls}$ to ensure correct domain labeling. It also incorporates a reconstruction loss based on cycle consistency to maintain image similarity after domain transformation. To handle partially known label information in multi-dataset training, StarGAN uses a mask vector $m$ to focus on known labels and enable joint training. A series of improvements has been built on StarGAN. Zhang et al. developed MU-GAN [47], which features a symmetric U-Net with additive attention connections and self-attention mechanisms to improve attribute editing while preserving other facial details. Ko et al. constructed SuperstarGAN [48], which significantly enhances cross-domain image translation by incorporating an independent classifier and data augmentation techniques. The OMGD-StarGAN [49] developed by Gu integrates a PatchGAN discriminator, a dynamic training strategy, and a modulated ResNet generator to achieve lightweight multi-domain image editing. However, these improved GANs still have fundamental limitations, such as insufficient cross-domain generalization ability and discontinuous dynamic attribute control.
As a GAN for facial attribute editing, AttGAN uses a triple-collaborative mechanism: attribute classification, reconstruction learning, and adversarial learning. Unlike conventional methods, it directly applies an attribute classification constraint to ensure that target attributes are modified while non-target details are preserved. An auxiliary classifier ensures correct attribute modification, reconstruction learning maintains identity and illumination, and adversarial learning based on WGAN-GP enhances visual authenticity. The encoder–decoder architecture of AttGAN allows efficient multi-attribute editing and supports attribute intensity control and style manipulation. Relatively few improvements to AttGAN have been proposed; for example, Lin's MAGAN [50] innovatively combined the GRU structure with the AGU, and its discriminative attention mechanism contributed to the accurate localization of key facial regions.

3. Proposed Method

3.1. Network Architecture

SR-StarGAN enhances the generator of StarGAN by adding a super-resolution module, which improves the clarity of the generated images. In contrast, FPNSA-AttGAN incorporates a feature pyramid and self-attention structure into AttGAN. It also replaces deconvolution with a combination of the nearest neighbor interpolation, upsampling, and convolution to reduce artifacts in the generated images. Figure 1 shows the overall network architecture.

3.2. SR-StarGAN Face Generation Network

The SR-StarGAN face generation network consists of StarGAN and a super-resolution module. StarGAN provides an efficient means of data augmentation by using a single generator to learn mappings between multiple feature domains and to change facial attributes in a directed manner. Nonetheless, the images generated by StarGAN are still prone to insufficient resolution and a lack of facial clarity, so they do not yet constitute an excellent training set for face recognition networks. In this study, a super-resolution module is incorporated into the generator framework of StarGAN to improve the sharpness and overall quality of the produced images. This augmented network is termed SR-StarGAN, and its architecture is depicted in Figure 2.
The architecture of SR-StarGAN is characterized by a conditional generator designed to generate various attributes in tandem with a discriminator. This generator processes two inputs simultaneously: an original image and a merged product that integrates conditional labels. The original image is normalized before being passed to the generator, but its spatial structure remains unchanged. The image label, which is a binary vector that delineates the desired domain attributes, undergoes a spatial expansion to align with the image’s dimensions. After that, it is input into the generator. By employing this input method, the generator can effectively utilize the spatial configuration of the original image while incorporating the information on the target domain attributes furnished by the conditional label. This enables the generator to perform deep feature extraction and transformation on the input data. The conditional generator is constructed using fundamental components, including convolution layers, residual blocks, upsampling units, and super-resolution mechanisms. The convolutional layers methodically extract low-level features, such as textures and edges, before advancing to discern high-level, semantically relevant features, thereby extracting all the important aspects of the input image. Incorporating residual blocks along with instance normalization layers enhances the stability of the model during training, expedites the convergence process, and bolsters the model’s capacity to accommodate features across diverse scales and distributions. This process guarantees that the generated image adheres closely to the target specifications. Subsequently, the data emerging from the upsampling layer are channeled into the super-resolution module, which refines the image to yield a high-definition facial representation. The super-resolution module consists of a Surface Encoding Denoising Module (SEDM), Feature Space Affine Module (FSAM), Latent Prompt Parameter Module (LPPM), and Pretrained Denoising U-Net Module (PDUM).
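To make the conditioning step concrete, the following minimal PyTorch sketch shows how a binary target-attribute vector can be spatially expanded and concatenated with the input image before entering the generator. The function name, tensor shapes, and the five-attribute example are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def condition_input(image: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Concatenate a spatially expanded attribute label with the image.

    image: (N, 3, H, W) tensor normalized to [-1, 1]
    label: (N, C) binary attribute vector describing the target domain
    """
    n, c = label.shape
    _, _, h, w = image.shape
    # Replicate each label entry over the full spatial grid: (N, C) -> (N, C, H, W)
    label_map = label.view(n, c, 1, 1).expand(n, c, h, w)
    # Channel-wise concatenation gives the generator both pixels and target attributes
    return torch.cat([image, label_map], dim=1)

# Example: a batch of two 128x128 images with five target attributes
x = torch.randn(2, 3, 128, 128)
c = torch.tensor([[1., 0., 0., 1., 0.], [0., 1., 0., 0., 1.]])
print(condition_input(x, c).shape)  # torch.Size([2, 8, 128, 128])
```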
The Surface Encoding Denoising Module comprises an encoder as well as a Residual Block. The encoder is responsible for downscaling the image resolution, allowing it to isolate and retain the essential characteristics of the image. Then, the encoded image, F0, is combined with random noise, Zt, allowing the model to learn the behavior patterns of features under various noise levels. Subsequently, the output is fed into the ResBlock, which improves the model’s ability to handle noise. Additionally, it ensures the preservation of image features, enabling the model to refine these features while maintaining the essential information.
The Feature Space Affine Module receives the output feature F1 of SEDM and the temporal embedding obtained by the MLP encoding of the time step t as the inputs. The diffusion model leverages a U-Net framework, which is augmented with Transformer capabilities, for extracting features across multiple scales. The Transformer’s multi-head self-attention mechanism takes the input features F1 and transforms them into a series of distinct, reduced-dimensionality representations. Within each subspace, attention coefficients are independently calculated and subsequently aggregated through weighted summation, thus enabling the concurrent processing of diverse input sequence segments across multiple representation spaces. This parallelized attention mechanism significantly enriches feature encoding. Subsequently, a feed-forward network applies nonlinear transformations to the attention outputs, facilitating more discriminative feature mapping while progressively refining the hierarchical semantic representations. Temporal embedding is incorporated into various aspects of SEDM, which provides the model with important information about the current denoising stage and realizes the dynamic adjustment of the model to more accurately capture the feature changes in the image at different denoising stages.
The Latent Prompt Parameter Module first encodes the discrete time step t into a continuous time embedding through the MLP to capture the stage characteristics in the diffusion process. Then, the cross-attention mechanism is adopted to realize the feature fusion between the time embedding and the trainable parameters. Finally, the time-aware prompts are generated through the residual linkage with the MLP, which, according to the characteristics of different times, guides the network to more accurately predict and remove the noise in the image. Based on that, the module can more adeptly conform to the varied prerequisites of image reconstruction endeavors, consequently bolstering the model’s proficiency in generalization over assorted contexts.
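The sketch below illustrates the kind of time-aware prompt generation described for the LPPM: an MLP encodes the time step, trainable prompt parameters attend to that embedding via cross-attention, and a residual MLP refines the result. All layer sizes, the number of prompt tokens, and the module name are hypothetical; this is only a structural illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentPromptSketch(nn.Module):
    """Illustrative time-aware prompt module (hypothetical dimensions)."""

    def __init__(self, dim: int = 256, n_prompts: int = 8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim))   # trainable prompt tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.refine = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (N,) diffusion time steps -> continuous time embedding (N, 1, dim)
        t_emb = self.time_mlp(t.float().unsqueeze(-1)).unsqueeze(1)
        prompts = self.prompts.unsqueeze(0).expand(t.shape[0], -1, -1)
        # Prompts attend to the time embedding; a residual MLP then refines them
        fused, _ = self.cross_attn(prompts, t_emb, t_emb)
        return prompts + self.refine(fused)          # (N, n_prompts, dim)

out = LatentPromptSketch()(torch.tensor([10, 500]))
print(out.shape)  # torch.Size([2, 8, 256])
```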
The Pretrained Denoising U-Net Module receives the multi-scale features Fn output by the FSAM and the time-aware prompts generated by the LPPM as inputs. A pretrained U-Net structure performs feature extraction and dimensional mapping of Fn through a series of convolutions and downsampling steps, while the supplementary semantic guidance offered by the temporal prompts refines and enhances the noise prediction for further tuning and optimization.
Meanwhile, to guarantee the preservation of inherent identity attributes in the synthesized image, the generator is also tasked with reverting the image from the target domain to its original domain, thereby establishing a cyclic consistency check. This ensures the fidelity and veracity of the transformation. This constraint mechanism safeguards the retention of identity characteristics throughout the domain translation process.
The cyclic consistency loss can be expressed as the following equation:
$$\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}\left[ \left\| x - G(G(x, c), c') \right\|_1 \right]$$
The discriminator not only needs to judge the realism of the image, but also needs to classify the attribute properties of the image. Hence, it can offer guiding signals that steer the generator towards generating an image aligned with the desired attributes. This multi-task discriminator ensures that the generated image is realistic and has the expected attributes, thus improving the image conversion quality and multi-attribute control accuracy. The adversarial training architecture is conducive to a dynamic balance between the generative and discriminative models. Through iterative refinement, the generator continuously enhances the visual verisimilitude and the proficiency of attribute modification within its artificial creations. At the same time, the discriminator develops increasingly sophisticated discrimination boundaries to differentiate authentic samples from generated ones, thereby driving mutual improvement through competitive interaction. Ultimately, it realizes efficient image conversion and enhancement between multiple attribute domains, which makes image editing more flexible and precise. To enable the discriminator to effectively differentiate between authentic and fabricated images, the adversarial loss for both the generator and the discriminator can be expressed as the following equation:
$$\mathcal{L}_{adv} = \mathbb{E}_{x}\left[ \log D_{src}(x) \right] + \mathbb{E}_{x, c}\left[ \log\left(1 - D_{src}(G(x, c))\right) \right]$$
The loss function used for the domain classification of real images can be expressed as the following equation:
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x, c'}\left[ -\log D_{cls}(c' \mid x) \right]$$
The loss function for the domain classification of generated images can be expressed as the following equation:
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x, c}\left[ -\log D_{cls}(c \mid G(x, c)) \right]$$
The aggregated overall loss can be expressed as the following equation:
$$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{r}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec}$$
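As an illustration of how these terms can be assembled in training code, the PyTorch sketch below computes the discriminator and generator objectives, assuming a generator G(x, c) and a discriminator D(x) that returns a (source, classification) pair; the loss weights follow common StarGAN defaults and are assumptions rather than the settings used in this paper.

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    """Assemble L_D and L_G from adversarial, classification, and reconstruction terms.

    D is assumed to return (src_logits, cls_logits); labels are binary attribute vectors.
    """
    x_fake = G(x, c_trg)

    # Discriminator: real/fake scores plus domain classification on real images
    src_real, cls_real = D(x)
    src_fake, _ = D(x_fake.detach())
    d_adv = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) + \
            F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    d_cls = F.binary_cross_entropy_with_logits(cls_real, c_org)
    loss_D = d_adv + lambda_cls * d_cls

    # Generator: fool the discriminator, match the target domain, reconstruct the input
    src_fake, cls_fake = D(x_fake)
    g_adv = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
    g_cls = F.binary_cross_entropy_with_logits(cls_fake, c_trg)
    g_rec = torch.mean(torch.abs(x - G(x_fake, c_org)))   # cycle-consistency (L1)
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec
    return loss_D, loss_G
```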
The final images generated by the SR-StarGAN face generation network have achieved a remarkable breakthrough in their quality. This is manifested as not only a high degree of detail, clarity, and realism but also effective enhancement in the expression of facial features. By accurately capturing and reconstructing the subtle differences in the face, the network dramatically enriches the diversity of face images while enhancing visual realism, which in turn expands the coverage of the dataset.

3.3. FPNSA-AttGAN Face Generation Network

FPNSA-AttGAN is built on the basic structure of AttGAN, whose generator adopts a U-Net-like encoder–decoder structure that maps facial features into a latent space for attribute manipulation and generates diversified images. Despite its high flexibility, the details of the images generated by the AttGAN generator are not realistic enough and are accompanied by some artifacts. In this study, a feature pyramid network (FPN) is added to the encoder, the decoder replaces deconvolution with a combination of nearest-neighbor interpolation upsampling and convolution, and a self-attention mechanism is incorporated into the discriminator, forming the FPNSA-AttGAN face generation network, which produces more diverse images of higher quality. Figure 3 shows the overall network architecture of FPNSA-AttGAN.
The original encoder uses a multilayer convolution and normalization structure to map the input image to a latent space. This process downsamples the image and progressively extracts its features until a predefined maximum dimension is reached, eventually producing a compact latent representation that serves as the foundation for attribute manipulation and reconstruction. In this study, an FPN is incorporated into the encoder to effectively merge feature information across different scales through bottom–up paths, top–down paths, and lateral connections. In the bottom–up path, the encoder retains its multilayer convolution and normalization structure, sequentially downsampling the input image to generate feature maps at various scales. The top–down path, conversely, begins with the highest-level feature map of the encoder and applies progressive upsampling to increase the size of the feature maps. The lateral connections combine the feature maps from both paths, preserving the fine details of the lower-level features while incorporating the semantic information of the higher-level features. By introducing the FPN, the encoder's capacity to perceive attributes at multiple scales is significantly improved, providing more detailed information for subsequent attribute reconstruction.
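A minimal sketch of this top–down fusion with lateral connections is given below. The channel widths, number of stages, and class name are assumptions chosen for illustration and do not reflect the exact encoder configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    """Illustrative FPN fusion over encoder feature maps (channel sizes assumed)."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        # 1x1 lateral convolutions align every encoder stage to a common width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth each fused map
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        # feats: bottom-up encoder outputs, ordered from high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down path: upsample the coarser map and add the lateral connection
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]

# Example with four encoder stages of a 128x128 input
feats = [torch.randn(1, c, 128 // 2 ** i, 128 // 2 ** i)
         for i, c in enumerate((64, 128, 256, 512), start=1)]
pyramid = FPNSketch()(feats)
print([p.shape[-1] for p in pyramid])  # [64, 32, 16, 8]
```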
In the conventional decoder architecture, a series of deconvolutional layers with attribute vector concatenation are employed to perform feature upsampling, ultimately producing attribute-modified images via tanh activation. However, this approach may be restrained by parameterized deconvolution operations that may induce inconsistent weight updates across spatial positions in the feature maps, resulting in localized texture discontinuity and perceptually unnatural artifacts. To address these limitations, the method proposed in this study implements a hybrid upsampling strategy, which integrates nearest-neighbor interpolation with convolutional refinement. This alternative approach enables more efficient image enlargement while preserving spatial coherence, substantially reducing visual artifacts and enhancing the overall realism of synthesized outputs.
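The hybrid upsampling strategy can be sketched as a simple building block, shown below with assumed channel sizes and normalization choices: nearest-neighbor upsampling enlarges the feature map and a subsequent convolution refines it, avoiding the uneven kernel overlap that produces checkerboard artifacts in transposed convolutions.

```python
import torch.nn as nn

def upsample_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Nearest-neighbor upsampling followed by convolution, used in place of
    transposed convolution (layer choices are illustrative)."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),          # enlarge the feature map
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),   # refine without uneven overlap
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```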
At the same time, the decoder performs image reconstruction based on potential vectors and original attribute labels. Based on the conditional generation strategy, the decoder can precisely control the specific attributes of the generated images while preserving the latent spatial information. This framework achieves dual optimization of output diversity and attribute-aware generation, where the model can precisely modulate synthesized images according to target attribute conditions. This capability of the decoder significantly strengthens both the expressive power and attribute fidelity of the generated results, while maintaining rich variation across outputs.
The original discriminator shares its convolution structure with the attribute classifier. After downsampling is completed, the feature map is reshaped and flattened, branching through two independent fully connected layers to produce the outputs. In this study, the self-attention mechanism is introduced to enhance its ability to model global features, thus improving the accuracy of image discrimination. The self-attention operation first projects the input feature representation into three distinct subspaces through linear transformations, yielding the query, key, and value matrices. The pairwise similarity between query and key vectors then produces attention scores that quantify inter-positional feature dependency. These scores are normalized through softmax activation to generate attention weights, which are used to perform a weighted aggregation of the value vectors, ultimately producing a refined feature representation with enhanced contextual relationships. The resulting feature map contains abundant global information, so the discriminator and attribute classifier can make decisions based on more comprehensive information when branching through the two independent fully connected layers, thus augmenting the accuracy of image discrimination and attribute classification. The attribute classifier $C$ is designed to ensure that the generated image $\hat{x}_b$ reflects the target attribute $b$, namely $C(\hat{x}_b) \rightarrow b$, indicating that the classifier guides the output image to match the intended attribute.
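The self-attention operation described above can be sketched as a SAGAN-style module over a 2D feature map; the reduction factor, learnable residual weight, and class name are assumptions for illustration rather than the exact discriminator layer used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a feature map (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))      # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (N, HW, C/r) queries
        k = self.key(x).flatten(2)                     # (N, C/r, HW) keys
        v = self.value(x).flatten(2)                   # (N, C, HW)   values
        attn = F.softmax(q @ k, dim=-1)                # pairwise position similarity
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)
        return x + self.gamma * out                    # residual connection

print(SelfAttention2d(64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```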
The attribute classification objective for the generator can be expressed as follows:
$$\min_{G_{enc}, G_{dec}} \mathcal{L}_{cls_g} = \mathbb{E}_{x_a \sim p_{data},\, b \sim p_{attr}}\left[ \ell_g(x_a, b) \right]$$
where
$$\ell_g(x_a, b) = \sum_{i=1}^{n} -b_i \log C_i(\hat{x}_b) - (1 - b_i)\log\left(1 - C_i(\hat{x}_b)\right)$$
$p_{data}$ and $p_{attr}$ denote the distributions of the real images and the attributes, respectively; $C_i(\hat{x}_b)$ denotes the predicted value of the $i$-th attribute; and $\ell_g(x_a, b)$ is the sum of the binary cross-entropy losses over all attributes.
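A small sketch of this per-attribute binary cross-entropy sum (used in both $\ell_g$ and, analogously, $\ell_r$ below) is given here; the tensor shapes and reduction over the batch are assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_cls_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of binary cross-entropy losses over all attributes.

    pred:   (N, n_attr) classifier outputs C_i(x) in (0, 1)
    target: (N, n_attr) binary attribute labels
    """
    # Element-wise BCE, summed over attributes and averaged over the batch
    return F.binary_cross_entropy(pred, target, reduction="none").sum(dim=1).mean()

# Example with three attributes
pred = torch.tensor([[0.9, 0.2, 0.7]])
target = torch.tensor([[1.0, 0.0, 1.0]])
print(attribute_cls_loss(pred, target))
```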
The attribute classifier C is trained using the input image and its original attributes, with the following training objective:
$$\min_{C} \mathcal{L}_{cls_c} = \mathbb{E}_{x_a \sim p_{data}}\left[ \ell_r(x_a, a) \right]$$
where
$$\ell_r(x_a, a) = \sum_{i=1}^{n} -a_i \log C_i(x_a) - (1 - a_i)\log\left(1 - C_i(x_a)\right)$$
To enhance the preservation of non-target attributes during reconstruction learning, the decoder architecture is designed to faithfully reconstruct the original input image x a when processing the latent representation z under the strict constraint of maintaining all source attributes a unchanged. This constraint ensures that attribute-independent features remain intact throughout the encoding–decoding process while allowing targeted modifications only to specified attributes. The learning objective can be expressed as follows:
$$\min_{G_{enc}, G_{dec}} \mathcal{L}_{rec} = \mathbb{E}_{x_a \sim p_{data}}\left[ \left\| x_a - \hat{x}_a \right\|_1 \right]$$
where the L1 loss is used instead of the L2 loss to suppress blurriness in the reconstruction.
Adversarial learning is employed between the generator and discriminator to ensure that the generated image x b ^ appears visually realistic. Building upon WGAN, the adversarial loss functions for the discriminator and generator can be defined as follows:
$$\min_{\|D\|_L \le 1} \mathcal{L}_{adv_d} = -\mathbb{E}_{x_a \sim p_{data}}\left[ D(x_a) \right] + \mathbb{E}_{x_a \sim p_{data},\, b \sim p_{attr}}\left[ D(\hat{x}_b) \right]$$

$$\min_{G_{enc}, G_{dec}} \mathcal{L}_{adv_g} = -\mathbb{E}_{x_a \sim p_{data},\, b \sim p_{attr}}\left[ D(\hat{x}_b) \right]$$
where $D$ represents the discriminator, which satisfies the 1-Lipschitz condition.
By integrating attribute classification constraints, reconstruction loss, and adversarial loss, a unified AttGAN framework is constructed. This framework effectively edits the specified attributes while maintaining the details of non-target attributes. The overall goal of the encoder and decoder can be expressed as follows:
$$\min_{G_{enc}, G_{dec}} \mathcal{L}_{enc,dec} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{cls_g} + \mathcal{L}_{adv_g}$$
The overall objectives of the discriminator and attribute classifiers can be expressed as follows:
$$\min_{D, C} \mathcal{L}_{dis,cls} = \lambda_3 \mathcal{L}_{cls_c} + \mathcal{L}_{adv_d}$$
where the discriminator and the attribute classifier share most of their layers, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that balance the loss terms.
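For illustration, the sketch below assembles the two overall objectives, assuming a critic D that returns a scalar score and a classifier C that returns per-attribute probabilities; the WGAN-GP gradient-penalty term mentioned earlier is omitted for brevity, and the lambda values are assumptions rather than the paper's settings.

```python
import torch

def attgan_objectives(D, C, x_a, x_hat_a, x_hat_b, a, b,
                      lam1=100.0, lam2=10.0, lam3=1.0):
    """Sketch of the encoder/decoder and discriminator/classifier objectives."""
    bce = torch.nn.functional.binary_cross_entropy

    # Encoder/decoder: reconstruction + attribute classification + adversarial terms
    l_rec = torch.mean(torch.abs(x_a - x_hat_a))
    l_cls_g = bce(C(x_hat_b), b, reduction="none").sum(dim=1).mean()
    l_adv_g = -D(x_hat_b).mean()
    loss_enc_dec = lam1 * l_rec + lam2 * l_cls_g + l_adv_g

    # Discriminator/classifier: WGAN critic loss + classification on real images
    # (gradient penalty omitted here for brevity)
    l_adv_d = -D(x_a).mean() + D(x_hat_b.detach()).mean()
    l_cls_c = bce(C(x_a), a, reduction="none").sum(dim=1).mean()
    loss_dis_cls = lam3 * l_cls_c + l_adv_d
    return loss_enc_dec, loss_dis_cls
```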

3.4. IdentiFace Face Recognition Network

To validate whether the face images produced by the GANs can serve as a training dataset for face recognition networks, this study uses the generated images to train the IdentiFace face recognition network, which is based on FaceNet [51], thereby addressing the problems of limited sample size and the need for a broadly distributed dataset.
As a deep learning-based face recognition model, IdentiFace represents faces as high-dimensional vectors by learning an embedding space and then performs recognition and authentication on these embeddings. IdentiFace is developed from the basic structure of FaceNet, including Inception-ResNet v1 and Inception-ResNet v2. The network receives pre-processed face images at the input layer, which is the starting point for feature extraction. Convolutional layers form the core of the network, with successive layers progressively extracting features (such as edges, textures, and contours) from the image through weight sharing. The Inception module processes the input by applying convolutional kernels of different sizes in parallel, capturing multi-scale feature information, while a skip-connection mechanism in the ResNet module effectively mitigates the vanishing-gradient problem during deep network training and facilitates the flow of information between layers. The integration of the Inception and ResNet modules allows the network to capture feature representations more deeply and effectively. Max pooling is used in the pooling operations to reduce the size of the feature maps while preserving essential feature details. After the convolution and pooling stages, the feature maps are flattened into one-dimensional vectors, which are then processed by a fully connected layer to produce a unified embedding vector. To guarantee that distances between face embedding vectors correctly reflect identity similarity, the output feature vectors undergo L2 normalization, which maps them onto a unit sphere so that their magnitudes equal one.
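The sketch below illustrates this embedding-and-verification step: features are L2-normalized onto the unit sphere and a pair is accepted when the squared Euclidean distance falls below a threshold. The backbone, embedding dimension, input size, and threshold value are all illustrative assumptions, not IdentiFace's actual configuration.

```python
import torch
import torch.nn.functional as F

def embed(model: torch.nn.Module, faces: torch.Tensor) -> torch.Tensor:
    """Map pre-processed face images to L2-normalized embeddings on the unit sphere."""
    features = model(faces)                 # (N, d) raw feature vectors
    return F.normalize(features, p=2, dim=1)

def same_identity(e1: torch.Tensor, e2: torch.Tensor, threshold: float = 1.1) -> torch.Tensor:
    """Verify a pair by thresholding the squared Euclidean distance between embeddings.

    The threshold is illustrative; in practice it is tuned on a validation set.
    """
    dist = (e1 - e2).pow(2).sum(dim=1)
    return dist < threshold

# Toy example with a random linear 'backbone' producing 128-D embeddings
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 160 * 160, 128))
pairs = torch.randn(4, 3, 160, 160), torch.randn(4, 3, 160, 160)
print(same_identity(embed(backbone, pairs[0]), embed(backbone, pairs[1])))
```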

4. Experiments

4.1. Dataset

The base dataset is the CelebA dataset, which includes 202,599 facial images of 10,177 different celebrities. Each image is labeled with multiple binary facial attributes, such as hair color, gender, glasses, and smile. Additionally, CelebA provides location annotations for five key facial landmarks, including the eyes, nose, and mouth.

4.2. Face Generation Network

In this study, the experiments are conducted using ordinary data augmentation, classical GAN models, and improved GAN models as the face generation network, respectively, to expand the CelebA dataset, thus generating the training set for the face recognition network. In each of the experiments, the dataset was split into training, validation, and test sets to ensure proper evaluation and generalization of the models. Specifically, 80% of the images (161,759 images) were used for training, 10% (20,260 images) were used for validation, and 10% (20,260 images) were used for testing. The training set was employed for model training, the validation set for hyperparameter tuning, and the test set for evaluating the model’s final performance.

4.2.1. Image Data Augmentation

The original image is flipped horizontally to simulate left–right symmetry; the image is slightly deformed by pixel translation, scaling, rotation, and clipping; color variations are added by brightness fine-tuning and saturation perturbation; variations from different perspectives are simulated by perspective transformations. Eventually, 28 different augmented versions of each image are generated as the first dataset for the face recognition network.
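An illustrative torchvision pipeline covering these transformation families is sketched below; the exact parameter ranges used by the authors are not reported, so the values here are assumptions.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline; parameter ranges are assumptions
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                            # left-right symmetry
    T.RandomAffine(degrees=10, translate=(0.05, 0.05),
                   scale=(0.9, 1.1)),                         # translation, scaling, rotation
    T.RandomResizedCrop(128, scale=(0.9, 1.0)),               # slight cropping
    T.ColorJitter(brightness=0.2, saturation=0.2),            # photometric perturbation
    T.RandomPerspective(distortion_scale=0.2, p=0.5),         # viewpoint variation
    T.ToTensor(),
])

# Applying `augment` repeatedly to each PIL face image yields the multiple
# augmented versions per identity that form the first training set.
```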

4.2.2. Typical GAN

(1) The generative network of StarGAN is composed of two convolutional layers with a downsampling stride of 2, interspersed with six residual blocks, and capped by two transposed convolutional layers with an upsampling stride of 2. Meanwhile, instance normalization is also integrated in the generator’s design. The discriminator’s configuration is adopted from PatchGANs and is built entirely out of convolutional layers.
Each model undergoes training utilizing the Adam optimizer, which preserves a learning rate of 0.0001 across the initial 10 epochs and then linearly reduces the rate to zero for the subsequent duration of the training regimen. Before feeding the images into the network, they are resized to a fixed resolution of 128 × 128 pixels, and their pixel values are normalized to the range of [−1, 1] by dividing each pixel by 127.5 and subtracting 1. Furthermore, the generator executes five updates for every single update of the discriminator. The additional detailed parameter configurations are presented in Table 1. The training duration is approximately 10 h on a solitary NVIDIA RTX 2080Ti GPU. The output images are first rescaled back to the range of [0, 1] by adding 1 to the pixel values and dividing by 2, followed by scaling them to the range of [0, 255] for visualization. Moreover, no further post-processing is applied to the generated adversarial images. Operations such as smoothing, sharpening, or any other enhancement techniques are avoided to ensure the authenticity and consistency of the training data.
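The normalization to [-1, 1], the rescaling back for visualization, and the constant-then-linear learning-rate schedule described above can be sketched as follows; the total number of epochs and the Adam betas are assumptions, not values taken from Table 1.

```python
import torch

def to_model_range(img_uint8: torch.Tensor) -> torch.Tensor:
    """Map 8-bit pixel values to [-1, 1] as described above."""
    return img_uint8.float() / 127.5 - 1.0

def to_display_range(img: torch.Tensor) -> torch.Tensor:
    """Map generator outputs from [-1, 1] back to [0, 255] for visualization."""
    return ((img + 1.0) / 2.0).clamp(0, 1) * 255.0

def lr_lambda(epoch: int, warm_epochs: int = 10, total_epochs: int = 20) -> float:
    """Keep the base rate for the first 10 epochs, then decay linearly to zero.

    `total_epochs` is an assumed value; adjust to the actual training length.
    """
    if epoch < warm_epochs:
        return 1.0
    return max(0.0, 1.0 - (epoch - warm_epochs) / float(total_epochs - warm_epochs))

params = [torch.nn.Parameter(torch.zeros(1))]           # stand-in for model parameters
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.5, 0.999))  # betas assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```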
(2) The generator network of AttGAN is an amalgamation of an encoder and a decoder. Specifically, the encoder comprises five convolutional layers, and the decoder is built with five transposed convolutional layers, which are linked by skip connections akin to the U-Net architecture. The discriminator network is constructed with five convolutional layers, followed by a sequence of dense layers. The classifier within the network utilizes the same convolutional layers as the discriminator, feeding the outputs into the subsequent fully connected layers.
The models are trained using the Adam optimizer with a learning rate of 0.0002 for the first 30 epochs, followed by a linear decay to zero for the remainder of the training period. Three loss coefficients are established to ensure the loss values remain within the same order of magnitude. The data are resized to 143 × 143 pixels and cropped to 128 × 128 pixels before being fed into the network, with pixel values normalized to the range of [−1, 1]. The generator performs five updates for every single update of the discriminator. The generated images are post-processed by rescaling the pixel values back to the range of [0, 255] for visualization. As with StarGAN, no additional post-processing is applied. Further detailed parameter configurations are provided in Table 2. The entire training process spans approximately 18 h on a single NVIDIA RTX 2080Ti GPU.
(3) StarGAN and AttGAN are selected as the base GANs, and well-trained generators are used to expand the image samples. The experiments show that StarGAN performs well on attributes such as hair color and makeup, whereas AttGAN achieves better results for attributes related to the facial features, contours, and bone structure. Eventually, a total of 27 facial attributes are selected and combined with the original images to form a new dataset, so that every person has 28 face images with different morphological features.

4.2.3. Improving the GAN

While training SR-StarGAN and FPNSA-AttGAN, to ensure the rigor of the experiments, no modifications are made to the original parameter settings, data preprocessing methods, or training procedures of StarGAN and AttGAN. This principle is implemented to maintain consistency and comparability with the baseline models. The experimental results demonstrate that the parameters chosen for this study are both reasonable and effective, contributing to the improved performance of the proposed models.
(1) The face images generated by the StarGAN-trained generator exhibit some facial features but lack clarity and have blurred details. This study improves network performance by integrating a super-resolution module, consisting of the SEDM, FSAM, LPPM, and PDUM, into the StarGAN framework. Specifically, the SEDM removes shallow degradation from the input image to maintain quality; the FSAM adaptively extracts multi-scale features to capture key information at different resolutions; the LPPM generates semantic guidance at different time steps to strengthen the recovery process; and the PDUM combines the multi-scale features and prompt information for refined image recovery through step-by-step denoising. The loss curves recorded during training are shown below:
The training of StarGAN and SR-StarGAN is evaluated based on four key metrics: the generator loss on fake images (G/loss_fake), the reconstruction loss (G/loss_rec), the discriminator loss on fake images (D/loss_fake), and the discriminator classification loss (D/loss_cls). As shown in Figure 4, the generator loss on fake images of both models declines rapidly, then rises, and stabilizes near 0, whereas the discriminator loss on fake images shows the opposite trend, rising quickly, then declining, and stabilizing near 0. The losses of StarGAN fluctuate more strongly than those of SR-StarGAN. The reconstruction loss and classification loss of both models gradually decrease and approach 0, indicating that the generator produces increasingly realistic images, which makes it harder for the discriminator to distinguish real from generated images. The reconstruction loss of SR-StarGAN is smaller than that of StarGAN. These results indicate that SR-StarGAN trains more stably than StarGAN.
In this study, the generators of StarGAN and SR-StarGAN are used to generate face datasets with different facial features. Compared with StarGAN, SR-StarGAN significantly improves the clarity of the generated images by incorporating a super-resolution module, making facial details sharper and greatly enhancing recognition performance. This highlights the critical role that image clarity plays in the effectiveness of face recognition models. Figure 5 shows some of the face images generated by StarGAN and SR-StarGAN, clearly illustrating the superior clarity and detail in the images generated by SR-StarGAN.
(2) In the detail recovery stage of AttGAN, the deconvolution operation can blur image edges or produce unnatural textures and artifacts. In addition, the feature map fusion is relatively simple, lacks effective integration of image information at different scales, and performs poorly on fine image details. The FPN added in this study generates multi-scale feature maps, allowing the model to recover images at multiple scales after layer-by-layer fusion and thus enhancing the recovery of image details. The combination of nearest-neighbor upsampling and convolution, used instead of deconvolution, provides a simple and efficient way to restore image resolution while avoiding artifacts and unnatural textures. The self-attention module is placed in the deep feature extraction stage, which effectively mitigates the loss of contextual information caused by the limited receptive field of conventional convolution. On this basis, the model can adaptively focus on the important regions of the image, thereby enhancing feature representation.
The loss visualization results during training are shown below:
The performance of AttGAN and FPNSA-AttGAN is evaluated based on four key metrics: the gradient penalty of the discriminator (D/gp), adversarial loss (D/loss_gan), reconstruction loss of the generator (G/xa_loss_rec), and adversarial loss (G/xb_loss_gan). As shown in Figure 6, as iterations increase, the gradient penalty of both models decreases and stabilizes at smaller values, with FPNSA-AttGAN achieving a smaller stable value than AttGAN. The adversarial loss of the discriminator quickly drops and then stabilizes, with FPNSA-AttGAN showing a smaller and stable value. The generator’s reconstruction loss decreases, with FPNSA-AttGAN’s value being closer to 0. The generator’s adversarial loss increases, then decreases, and finally stabilizes around 0, with FPNSA-AttGAN’s value also closer to 0. These results indicate that FPNSA-AttGAN is more stable than AttGAN.
In this study, the generators of AttGAN and FPNSA-AttGAN are used to generate face datasets. Compared with AttGAN, FPNSA-AttGAN significantly reduces artifacts in the generated images by incorporating an FPN and a self-attention mechanism, resulting in more natural facial details. Specifically, AttGAN often generates images with noticeable artifacts or unnatural textures, which can affect the realism of certain facial features. In contrast, the improved generator structure of FPNSA-AttGAN can effectively mitigate these issues, resulting in more natural-looking images with abundant details. Figure 7 shows some of the face images generated by AttGAN and FPNSA-AttGAN, clearly illustrating the significant reduction of artifacts and better detail representation in the images generated by FPNSA-AttGAN.
After the two face generation networks are improved, the generated face images achieve significant improvements in clarity and detail. Specifically, the facial contours are clearer and more accurate; the details of the facial features, such as the texture of the eyes and the shape of the lips, become more distinct; and the skin texture is more realistic, which better meets the requirements of face recognition systems for input images.

4.3. IdentiFace Face Recognition Network

In this study, training is conducted on five datasets, and the models are evaluated on the Labeled Faces in the Wild (LFW) dataset, a standard benchmark for face recognition. The evaluation involves computing embedding vectors for the images, comparing pairs to determine whether they belong to the same person, and applying metrics such as accuracy, validation rate at a specific false acceptance rate (FAR), area under the ROC curve (AUC), and equal error rate (EER). The accuracy reflects the overall performance of a model; the validation rate measures its ability at a specific FAR; the AUC assesses classification performance; and the EER captures the model's performance at the point where the FAR and false rejection rate (FRR) are equal.
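The sketch below shows how AUC and EER can be computed from pair similarity scores with scikit-learn; the score and label arrays are toy data, and in the actual evaluation the scores would come from distances between IdentiFace embeddings of LFW image pairs.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def verification_metrics(scores: np.ndarray, labels: np.ndarray):
    """Compute AUC and EER from pair similarity scores (label 1 = same identity)."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER is the point where the false acceptance and false rejection rates cross
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]
    return auc, eer

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.7, 0.2])   # toy similarity scores
labels = np.array([1, 1, 0, 0, 1, 0])               # toy same/different labels
print(verification_metrics(scores, labels))
```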
The comparative analysis of results shows that the IdentiFace face recognition network trained on the dataset generated by the face generation network consisting of SR-StarGAN and FPNSA-AttGAN performs well in all the evaluation metrics on the LFW test set. As shown in Table 3, the accuracy of the model is 96.64%, indicating its efficiency in the image pair matching task; the validation rate is 82.48%; the AUC reaches 0.993, demonstrating excellent performance across matching thresholds; and the EER is 0.035, further demonstrating the robustness and balance of the model in face recognition tasks.

5. Summary

This paper proposes a data augmentation method to address the challenges of data scarcity, uneven distribution, and the difficulty of acquiring real-world samples for face recognition tasks. By leveraging SR-StarGAN and FPNSA-AttGAN to generate face samples with different attributes from the CelebA dataset and training the IdentiFace face recognition model using these generated virtual samples, the experimental results demonstrate that the model trained on the augmented dataset is significantly superior to conventional recognition algorithms in terms of accuracy and adaptability.
However, the proposed method still has some notable limitations. For example, the insufficient diversity of the generated samples remains a significant challenge, as the model can only modify one feature at a time. This limitation prevents the model from fully capturing the diversity of facial features in real-world scenarios. Despite this, 27 well-performing attributes are selected from 40 available attributes in the CelebA dataset. These attributes maintain high quality and stability during generation and achieve favorable performance in face recognition tasks. However, the selection of these attributes does not resolve the model’s issues under different lighting conditions. Under varying lighting conditions, the generated images may still suffer from artifacts, loss of detail, or unclear facial features, which pose a potential hazard to the model’s applicability in complex environments.
Additionally, the computational resources required to train such complex generative models cannot be overlooked. The trade-off between high accuracy and the computational resources needed for training and deployment may limit the practical feasibility of this method in certain applications. For example, both SR-StarGAN and FPNSA-AttGAN require substantial computational resources when generating high-resolution images, which may represent a bottleneck in resource-constrained environments. Therefore, future research should focus on optimizing the generative network architecture to improve generation efficiency, thus enabling the simultaneous editing of multiple features to enhance sample diversity.

Author Contributions

Conceptualization, H.Z.; methodology, S.L. and C.Y.; validation, S.L.; writing, S.L. and C.Y.; formal analysis, S.L.; visualization, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Training Program of Innovation and Entrepreneurship for Undergraduates, grant number 202510004165.

Data Availability Statement

The original data presented in this study are openly available in the CelebA dataset at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 15 June 2024). The dataset was introduced in the paper ‘Deep Learning Face Attributes in the Wild’ by Liu et al. (2015) [52], available at https://doi.org/10.1109/ICCV.2015.425 (accessed on 15 June 2024).

Acknowledgments

The authors acknowledge the equipment support from Beijing Jiaotong University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fei-Fei, L.; Fergus, R.; Perona, P. A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; IEEE Computer Society: Washington, DC, USA, 2003; Volume 2, p. 1134. [Google Scholar]
  2. Liu, Y.; Zhang, H.; Zhang, W.; Lu, G.; Tian, Q.; Ling, N. Few-Shot Image Classification: Current Status and Research Trends. Electronics 2022, 11, 1752. [Google Scholar] [CrossRef]
  3. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  4. Xing, E.P.; Ng, A.Y.; Jordan, M.I.; Russell, S. Distance Metric Learning with Application to Clustering with Side-Information. In Proceedings of the 16th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 1 January 2002; MIT Press: Cambridge, MA, USA, 2002; pp. 521–528. [Google Scholar]
  5. Hu, Y.; Sun, L.; Mao, X.; Zhang, S. EEG Data Augmentation Method for Identity Recognition Based on Spatial–Temporal Generating Adversarial Network. Electronics 2024, 13, 4310. [Google Scholar] [CrossRef]
  6. Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:1803.02999. [Google Scholar] [CrossRef]
  7. Sun, Q.; Liu, Y.; Chua, T.-S.; Schiele, B. Meta-Transfer Learning for Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 403–412. [Google Scholar]
  8. Rajeswaran, A.; Finn, C.; Kakade, S.M.; Levine, S. Meta-Learning with Implicit Gradients. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 113–124. [Google Scholar]
  9. Shin, C.; Lee, J.; Na, B.; Yoon, S. Personalized Face Authentication Based On Few-Shot Meta-Learning. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: New York, NY, USA, 2021; pp. 3897–3901. [Google Scholar]
  10. An, X.; Deng, J.; Guo, J.; Feng, Z.; Zhu, X.; Yang, J.; Liu, T. Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 4032–4041. [Google Scholar]
  11. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  12. Gan, J.; Liu, J. Applied Research on Face Image Beautification Based on a Generative Adversarial Network. Electronics 2024, 13, 4780. [Google Scholar] [CrossRef]
 13. Zhou, S.; Xiao, T.; Yang, Y.; Feng, D.; He, Q.; He, W. GeneGAN: Learning Object Transfiguration and Object Subspace from Unpaired Data. In Proceedings of the British Machine Vision Conference 2017, London, UK, 4–7 September 2017; British Machine Vision Association: Durham, UK, 2017; p. 111. [Google Scholar]
  14. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond Pixels Using a Learned Similarity Metric. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1558–1566. [Google Scholar]
  15. Perarnau, G.; van de Weijer, J.; Raducanu, B.; Álvarez, J.M. Invertible Conditional GANs for Image Editing. arXiv 2016, arXiv:1611.06355. [Google Scholar] [CrossRef]
  16. Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; Ranzato, M. Fader Networks: Manipulating Images by Sliding Attributes. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5969–5978. [Google Scholar]
  17. Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 8789–8797. [Google Scholar]
18. He, Z.; Zuo, W.; Kan, M.; Shan, S.; Chen, X. AttGAN: Facial Attribute Editing by Only Changing What You Want. IEEE Trans. Image Process. 2019, 28, 5464–5478. [Google Scholar] [CrossRef] [PubMed]
19. Hsu, W.H. Investigating Data Augmentation Strategies for Advancing Deep Learning Training. In Proceedings of the GPU Technology Conference (GTC) 2018, San Jose, CA, USA, 26 March 2018; Nvidia: Santa Clara, CA, USA, 2018. [Google Scholar]
  20. Taylor, L.; Nitschke, G. Improving Deep Learning with Generic Data Augmentation. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; IEEE: New York, NY, USA, 2018; pp. 1542–1547. [Google Scholar]
  21. Zeng, W. Image Data Augmentation Techniques Based on Deep Learning: A Survey. Math. Biosci. Eng. 2024, 21, 6190–6224. [Google Scholar] [CrossRef] [PubMed]
  22. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Strategies From Data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 113–123. [Google Scholar]
  23. Lim, S.; Kim, I.; Kim, T.; Kim, C.; Kim, S. Fast AutoAugment. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 6665–6675. [Google Scholar]
  24. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
  25. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6022–6031. [Google Scholar]
  26. Chen, P.; Liu, S.; Zhao, H.; Wang, X.; Jia, J. GridMask Data Augmentation. arXiv 2020, arXiv:2001.04086. [Google Scholar] [CrossRef]
  27. Alomar, K.; Aysel, H.I.; Cai, X. Data Augmentation in Classification and Segmentation: A Survey and New Strategies. J. Imaging 2023, 9, 46. [Google Scholar] [CrossRef] [PubMed]
28. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  29. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2242–2251. [Google Scholar]
30. Luo, X.; He, X.; Chen, X.; Qing, L.; Chen, H. Dynamically Optimized Human Eyes-to-Face Generation via Attribute Vocabulary. IEEE Signal Process. Lett. 2023, 30, 453–457. [Google Scholar] [CrossRef]
  31. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 5967–5976. [Google Scholar]
  32. Wang, J.; Deng, Y.; Liang, Z.; Zhang, X.; Cheng, N.; Xiao, J. CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding. In Proceedings of the 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Wuhan, China, 21–24 December 2023; IEEE: New York, NY, USA, 2023; pp. 752–757. [Google Scholar]
  33. Xie, L.; Xue, W.; Xu, Z.; Wu, S.; Yu, Z.; Wong, H.S. Blemish-Aware and Progressive Face Retouching with Limited Paired Data. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 5599–5608. [Google Scholar]
  34. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 10674–10685. [Google Scholar]
  35. Wang, J.; Yue, Z.; Zhou, S.; Chan, K.C.K.; Loy, C.C. Exploiting Diffusion Prior for Real-World Image Super-Resolution. Int. J. Comput. Vis. 2024, 132, 5929–5949. [Google Scholar] [CrossRef]
  36. Lin, X.; He, J.; Chen, Z.; Lyu, Z.; Dai, B.; Yu, F.; Qiao, Y.; Ouyang, W.; Dong, C. DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2025; Volume 15117, pp. 430–448. [Google Scholar]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 3431–3440. [Google Scholar]
  38. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 6230–6239. [Google Scholar]
  39. Wang, H.; Wang, Y.; Zhang, Q.; Xiang, S.; Pan, C. Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images. Remote Sens. 2017, 9, 446. [Google Scholar] [CrossRef]
  40. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: New York, NY, USA, 2017; pp. 1–4. [Google Scholar]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  42. Xiao, J.; Zhou, H.; Lei, Q.; Liu, H.; Xiao, Z.; Huang, S. Attention-Mechanism-Based Face Feature Extraction Model for WeChat Applet on Mobile Devices. Electronics 2024, 13, 201. [Google Scholar] [CrossRef]
43. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
44. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  45. Chen, X.; Xu, C.; Yang, X.; Tao, D. Attention-GAN for Object Transfiguration in Wild Images. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Munich, Germany, 2018; Volume 11206, pp. 167–184. [Google Scholar]
  46. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 936–944. [Google Scholar]
47. Zhang, K.; Su, Y.; Guo, X.; Qi, L.; Zhao, Z. MU-GAN: Facial Attribute Editing Based on Multi-Attention Mechanism. IEEE/CAA J. Autom. Sin. 2021, 8, 1614–1626. [Google Scholar] [CrossRef]
  48. Ko, K.; Yeom, T.; Lee, M. SuperstarGAN: Generative Adversarial Networks for Image-to-Image Translation in Large-Scale Domains. Neural Netw. 2023, 162, 330–339. [Google Scholar] [CrossRef] [PubMed]
  49. Li, R.; Gu, J. OMGD-StarGAN: Improvements to Boost StarGAN v2 Performance. Evol. Syst. 2024, 15, 455–467. [Google Scholar] [CrossRef]
  50. Lin, Z.; Xu, W.; Ma, X.; Xu, C.; Xiao, H. Multi-Attention Infused Integrated Facial Attribute Editing Model: Enhancing the Robustness of Facial Attribute Manipulation. Electronics 2023, 12, 4111. [Google Scholar] [CrossRef]
  51. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 815–823. [Google Scholar]
  52. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar] [CrossRef]
Figure 1. Overall network architecture.
Figure 2. Overview of SR-StarGAN.
Figure 3. Overview of FPNSA-AttGAN.
Figure 4. Loss visualization of StarGAN and SR-StarGAN.
Figure 5. Comparison of images generated by StarGAN and SR-StarGAN.
Figure 6. Loss visualization of AttGAN and FPNSA-AttGAN.
Figure 7. Comparison of images generated by AttGAN and FPNSA-AttGAN.
Table 1. StarGAN parameter settings.

Parameter           StarGAN
lambda_cls          1
lambda_rec          10
lambda_gp           10
g_lr/d_lr           0.0001
n_critic            5
beta1               0.5
beta2               0.999
num_iters           200,000
num_iters_decay     100,000
batch_size          16
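For readers mapping these settings onto the training objective, the lambda entries in Table 1 are the loss weights of the StarGAN formulation [17] on which SR-StarGAN builds. The exact losses used by SR-StarGAN are defined in the main text; as a sketch, the original weighting is

\[
\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls},
\qquad
\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec},
\]

with the WGAN-GP adversarial term

\[
\mathcal{L}_{adv} = \mathbb{E}_{x}\!\left[D_{src}(x)\right] - \mathbb{E}_{x,c}\!\left[D_{src}(G(x,c))\right] - \lambda_{gp}\,\mathbb{E}_{\hat{x}}\!\left[\big(\lVert \nabla_{\hat{x}} D_{src}(\hat{x}) \rVert_{2} - 1\big)^{2}\right].
\]

Thus lambda_cls = 1, lambda_rec = 10, and lambda_gp = 10 weight the domain-classification, reconstruction, and gradient-penalty losses, respectively, while g_lr/d_lr, n_critic, beta1/beta2, and the iteration counts configure the Adam optimizers and the training schedule.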
Table 2. AttGAN parameter design.

Parameter                       AttGAN
n_epochs                        60
epoch_start_decay               30
batch_size                      32
learning_rate                   0.0002
beta_1                          0.5
n_d                             5
d_gradient_penalty_weight       10
d_attribute_loss_weight         1
g_attribute_loss_weight         10
g_reconstruction_loss_weight    100
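The weights in Table 2 likewise correspond to the loss balance of the AttGAN formulation [18] underlying FPNSA-AttGAN. As a sketch of that formulation (not the exact FPNSA-AttGAN losses, which are given in the main text), the encoder-decoder and the discriminator/classifier pair are trained with

\[
\min_{G_{enc},\,G_{dec}}\; \lambda_{1}\,\mathcal{L}_{rec} + \lambda_{2}\,\mathcal{L}_{cls_g} + \mathcal{L}_{adv_g},
\qquad
\min_{D,\,C}\; \lambda_{3}\,\mathcal{L}_{cls_c} + \mathcal{L}_{adv_d},
\]

where \(\lambda_{1}\) is g_reconstruction_loss_weight (100), \(\lambda_{2}\) is g_attribute_loss_weight (10), \(\lambda_{3}\) is d_attribute_loss_weight (1), and the WGAN-GP penalty inside \(\mathcal{L}_{adv_d}\) is scaled by d_gradient_penalty_weight (10). The remaining entries (n_epochs, epoch_start_decay, learning_rate, beta_1, n_d) set the training schedule and the number of discriminator updates per generator update.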
Table 3. Comparison of assessment indicators.

Methodologies                    Accuracy (%)    Validation (%)    AUC (%)    EER (%)
Traditional data augmentation    83.59           21.34             90.5       17.6
StarGAN                          86.72           43.28             92.4       15.2
AttGAN                           87.33           47.75             93.9       13.8
StarGAN+AttGAN                   90.61           53.87             96.9       9.5
SR-StarGAN+FPNSA-AttGAN          96.64           82.48             99.3       3.5
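The indicators in Table 3 follow standard face-verification practice. The snippet below is a minimal sketch, not the authors' evaluation code, of how such metrics are typically computed from pairwise similarity scores; it assumes that "Validation" denotes the true-accept rate at a fixed false-accept rate (as in FaceNet [51]), and the function name verification_metrics and the far_target value are illustrative choices only.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def verification_metrics(scores, labels, far_target=1e-3):
    """Face-verification metrics from pairwise similarity scores.

    scores: similarity per face pair (higher = more likely same identity)
    labels: 1 for genuine (same-identity) pairs, 0 for impostor pairs
    far_target: false-accept rate at which the validation rate is read off
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # ROC curve and the area under it
    fpr, tpr, thresholds = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)

    # Equal Error Rate: operating point where false-accept and false-reject rates meet
    fnr = 1.0 - tpr
    eer_idx = np.nanargmin(np.abs(fnr - fpr))
    eer = 0.5 * (fpr[eer_idx] + fnr[eer_idx])

    # Verification accuracy at the best decision threshold among the ROC operating points
    accuracy = max(np.mean((scores >= t).astype(int) == labels) for t in thresholds)

    # Validation (true-accept) rate interpolated at the target false-accept rate
    val = float(np.interp(far_target, fpr, tpr))

    return accuracy, val, roc_auc, eer
```

Multiplying the returned fractions by 100 places them on the same percentage scale as the columns in the table.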