Article

Facial Expression Recognition with Contrastive Disentangled Generative Adversarial Network

School of Electrical and Electronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3795; https://doi.org/10.3390/electronics14193795
Submission received: 12 August 2025 / Revised: 4 September 2025 / Accepted: 22 September 2025 / Published: 25 September 2025

Abstract

In facial expression recognition, expression features overlap heavily with identity features in terms of spatial location, structure, and feature representation, which makes the task a major challenge for researchers. Moreover, most existing models lack a decoupling mechanism during feature extraction, making it difficult to capture the intrinsic factors of expressions and thereby limiting recognition accuracy. To separate expression features from identity features, this paper proposes a method called the Contrastive Disentangled Generative Adversarial Network (CD-GAN). In this study, a facial representation is defined as a combination of identity and expression, and separate encoders are used to extract each of these features. Unlike methods that extract features directly, this paper uses semi-supervised contrastive learning and adversarial training to disentangle identity and expression features, thereby obtaining a disentangled expression representation. Through this disentanglement, the model learns the differences between expression and identity features and achieves their separation. The experimental results show that the quantitative and qualitative results of the CD-GAN on in-the-wild and laboratory datasets are comparable to those of state-of-the-art methods.

1. Introduction

Human expression is a rich and complex form of nonverbal communication, with a powerful ability to convey emotions and intentions [1]. Facial expression recognition (FER), as an important research area in computer vision and pattern recognition, aims to enable computer systems to recognize and understand human expressions for better interaction with users [2]. With the rapid development of social media, virtual reality, and human–computer interaction technologies, FER holds broad application prospects in areas such as sentiment analysis, user experience enhancement, and intelligent assistive systems. Over the past decades, FER has been extensively studied in the fields of computer vision and machine learning. However, it remains a challenging task due to the requirement of a large number of facial expression images for training, particularly when dealing with in-the-wild datasets. In real-world applications, due to the high similarity between facial expression features and identity features, FER models are unavoidably affected by identity-related information embedded within expression images, which may lead to a decline in performance. Most existing FER approaches extract features directly from facial expression images without explicitly disentangling identity and expression representations. This limitation hinders the model’s ability to learn identity-invariant expression features, thereby compromising overall recognition accuracy and generalization capability.
Representation learning in deep learning aims to efficiently map data from a high-dimensional space to a low-dimensional space [3]. For the FER problem, it is necessary to obtain disentangled representations of expressions from raw facial expression images. In the embedding space, variations in a single expression factor lead to changes in the expression of facial images. Existing general approaches for FER involve inputting facial expression images containing basic expressions into a model for feature extraction, which is then used for expression classification. For instance, Zhang et al. [4] constructed discrete expression representations using semantic definitions and performed expression classification. However, this method fails to address the issue of entanglement between expression features and identity features. Consequently, most existing FER models lack the capability to acquire more distinctive expression representations, which hinders the development of tasks such as expression recognition and related applications.
Most models developed in the field of FER fail to consider identity robustness in the training data [5]. Some existing methods attempt to learn invariant features by introducing relevant constraints to account for variations in expression recognition. However, these methods are individual-specific, training a dedicated model for each person. This approach is not only difficult to implement in real-world scenarios but also significantly increases training costs.
Recent research indicates that, in the absence of inductive bias, unsupervised disentangled representation learning is theoretically infeasible. Furthermore, existing inductive biases and unsupervised learning methods do not permit the consistent acquisition of disentangled representations. The majority of disentanglement methods are based on the concept of InfoGAN [6], which controls image generation by maximizing the mutual information (MI) between images and latent variables. However, this approach is difficult to optimize, results in unstable training, and does not provide a way to exploit limited supervision [7]. The success of supervised contrastive learning has prompted us to propose a novel approach to disentanglement, in which features with similar labels serve as positive samples and the remaining features serve as negative samples for identifying the underlying structure [8]. The proposed approach integrates disentanglement mechanisms with conventional expression recognition frameworks, distinguishing it from traditional methods. The objective is to develop a new identity-free facial expression recognition method, the Contrastive Disentangled Generative Adversarial Network (CD-GAN), as illustrated in Figure 1. The model is designed to learn de-identified expression representations from expression images through adversarial training, in which the input expression images are subjected to feature extraction and reconstruction. Furthermore, a contrastive loss function is constructed based on the idea of contrastive learning, with the objective of ensuring that similar expression features are positioned close together and dissimilar features are positioned far apart. This enables the model to learn a disentangled expression representation from the expression image, addressing the entanglement between identity features and expression features. The main highlights of this paper are as follows:
(1)
To learn disentangled expression representations, a model called the CD-GAN is proposed. Through adversarial learning, the CD-GAN is encouraged to separate redundant identity information from expression information, resulting in better FER performance.
(2)
To integrate conventional facial expression recognition methods with Generative Adversarial Networks (GANs), the encoder component of the GAN is modified to incorporate a ResNet-18 architecture capable of loading pre-trained weights. By fine-tuning the model based on the pre-trained network, the performance of the expression recognition system is effectively enhanced.
(3)
A new supervised contrastive loss learning strategy is formulated for disentanglement, based on the concept of contrastive learning. This strategy brings similar expression features closer together and pushes other features apart in the embedding space, thereby maximizing the use of the limited supervision available.

2. Related Work

In this section, several general expression recognition approaches are briefly reviewed. In addition, as background for the proposed method, a brief overview is given of the key elements of disentangled expression recognition and contrastive-learning-based disentanglement.

2.1. Identity-Free Facial Expression Recognition

To meet the standards of real-world facial expression recognition, it is essential to utilize large-scale, unconstrained in-the-wild facial expression datasets that exhibit diverse data distributions and identity variations. Previous expression recognition methods have largely overlooked the potential impact of identity information on expression recognition, which may degrade model performance. Existing theories suggest that a facial expression image is a combination of a neutral facial image and an expression component. The majority of contemporary disentanglement methodologies are founded upon GAN-based disentanglement [9]. Adversarial training enables the generator to learn disentangled feature representations, with the discriminator evaluating the statistical authenticity of the synthesized features. In this manner, the generator can discern the interrelationships between the disparate attributes of the data and distinguish the expression and identity information within the image, thereby facilitating the decoupling of the data. DeRL [10] generates a corresponding neutral facial image for any input facial image, utilizing the identity information retained in the GAN model for classification purposes. In contrast, Yang et al. [11] first train a conditional generation model to generate six distinct facial expressions for a single identity; a pre-trained CNN is then employed to extract features from the query image and the regenerated image, and the decision is made by comparing distances in the feature subspace. Huang et al. [12] first utilize StarGAN to synthesize complete basic emotional face images for each identity for augmentation, and introduce metric learning to learn more distinguishing features. However, these methods require different expressions of the same individual to construct a database, which is challenging to achieve for in-the-wild expression recognition datasets.
Some more recent approaches to FER explicitly seek to improve individual-independent FER. Identity-aware CNNs have been proposed to mitigate identity-related variations through expression-sensitive and identity-sensitive contrastive losses. Cai et al. [13] proposed the IF-GAN, an end-to-end network that directly removes identity information and generates facial images for identity-free FER. Zhang et al. [14] proposed an expression embedding framework capable of learning a continuous and smooth expression space from face images, which effectively addresses the expression embedding problem by directly disentangling the facial identity attributes. In contrast to these methods, the CD-GAN improves the encoder of the generator on top of GAN-based disentanglement, thereby enabling the adaptation and fine-tuning of pre-trained models used in general expression recognition methods. In addition, the reconstructed image is fed into the generator once more and compared with the original features, allowing a disentangled expression representation to be obtained.

2.2. Contrastive Learning

Contrastive learning is a self-supervised representation learning paradigm that applies metric learning to optimize discriminative embeddings. Its core idea is to learn feature representations of data by comparing the similarities or differences between pairs of data. Specifically, the data pairs consist of positive examples (i.e., data points that are similar to each other) and negative examples (i.e., data points that are dissimilar). The goal of the model is to maximize the similarity between positive pairs and minimize the similarity between negative pairs. This type of learning helps extract meaningful features from the data, which can then be applied to downstream tasks.
Contrastive loss is inspired by noise contrastive estimation [15] and N-pair loss [16], and typically uses cross-entropy loss, L2 loss, or cosine similarity loss. Constructing positive and negative pairs for training is a prerequisite for contrastive learning. Caron et al. [17] proposed introducing clustering to enhance contrastive learning. Moreover, Li et al. [18] proposed a clustering-based method called PCL, which is suitable for large-scale classification tasks due to its high number of clustering centers. He et al. [19] introduced the MoCo method, which achieves good learning results with only small batches. In contrast, supervised contrastive learning (SCL) [20] uses images and their associated category labels as the basis for constructing positive and negative pairs. It has been shown that appropriately increasing the number of positive samples can enhance the performance of contrastive learning.
The work presented in this paper primarily draws inspiration from the SCL method [21]. It introduces a contrastive loss into a GAN framework, enabling the disentanglement of expression patterns across categories from identity information.

3. Proposed Method

This section provides a detailed explanation of the end-to-end CD-GAN method proposed in this paper. Given a set of image pairs consisting of expression images and identity images, the generator inputs the images into the identity branch and the expression branch for initial feature separation. Moreover, through an adaptive labeling method, pseudo-labels are assigned to the input features. Contrastive learning is applied to classify the extracted features in the embedding space, thereby separating dissimilar features and clustering similar features. The features in the embedding space are then fed into the decoder to generate images, which are subsequently fed back into the generator. Finally, the discriminator is used to distinguish between real and generated images, determining the authenticity of the generated images. Through adversarial training, the expression branch learns identity-invariant disentangled expression representations and performs expression classification.

3.1. Network Architecture of CD-GAN

Our framework is shown in Figure 2; the model follows the basic principles of Generative Adversarial Networks and is divided into a generator and discriminators.

3.1.1. Generator

The generator extracts specific features from the input images and fuses them to synthesize an image that transfers a given expression onto a given face. It follows an encoder–decoder structure; to better control feature extraction, the generator contains two encoders that extract expression and identity features, respectively.
The input to the generator shown in Figure 3 is a given pair of images, with each pair containing a face image $x_f$ and an expression image $x_e$ with corresponding labels $y_f$ and $y_e$. The image pairs are fed to the encoders to extract the expression and face features. Due to the complexity of faces caused by individual differences, a neural network with a large number of parameters is required to extract identity and expression information from faces. For this purpose, we adapt the ResNet network so that it can be embedded as the encoder of the generator. This network has a deep structure and resists overfitting, and most general expression recognition methods are based on a ResNet backbone, which can be exploited effectively by loading a pre-trained model and fine-tuning it. The output expression and identity features are fed into the image synthesis decoder after the contrastive learning process. For expression classification, a FER classification module is added after the expression branch, consisting of a fully connected layer and a softmax classifier, which takes the expression representations learned by the expression branch of the generator as the input to the FER task.
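To make this two-branch encoder–decoder design concrete, here is a minimal PyTorch sketch assuming ResNet-18 encoders with pre-trained weights, 512-dimensional features per branch, seven expression classes, and a simple transposed-convolution decoder; the module names, channel sizes, and decoder layout are illustrative assumptions rather than the exact CD-GAN architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoBranchGenerator(nn.Module):
    """Illustrative two-branch generator: ResNet-18 encoders for expression
    and identity, a small transposed-convolution decoder, and a FER head on
    the expression branch (all sizes are assumptions)."""

    def __init__(self, feat_dim=512, num_classes=7):
        super().__init__()
        # Expression / identity encoders: pre-trained ResNet-18 backbones
        # with the final classification layer removed (output: B x 512 x 1 x 1).
        self.enc_expr = nn.Sequential(*list(resnet18(weights="DEFAULT").children())[:-1])
        self.enc_id = nn.Sequential(*list(resnet18(weights="DEFAULT").children())[:-1])
        # FER head: fully connected layer; the softmax is folded into the
        # cross-entropy loss during training.
        self.fer_head = nn.Linear(feat_dim, num_classes)
        # Decoder: upsample the concatenated 1024-d feature to a 3 x 224 x 224 image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_dim, 512, 7), nn.ReLU(True),   # 1 -> 7
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(True),      # 7 -> 14
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(True),      # 14 -> 28
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(True),       # 28 -> 56
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(True),        # 56 -> 112
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),             # 112 -> 224
        )

    def forward(self, x_expr, x_id):
        e_e = self.enc_expr(x_expr).flatten(1)   # expression feature of the expression image
        e_f = self.enc_id(x_id).flatten(1)       # identity feature of the identity image
        fer_logits = self.fer_head(e_e)          # expression classification logits
        z = torch.cat([e_e, e_f], dim=1)[..., None, None]
        x_gen = self.decoder(z)                  # expression-transferred image
        return e_e, e_f, fer_logits, x_gen

# Example forward pass with a pair of 224 x 224 images.
G = TwoBranchGenerator()
e_e, e_f, fer_logits, x_gen = G(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```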

3.1.2. Discriminator

The discriminator of the CD-GAN consists of two parts, each performing a different task. As shown in Figure 4, during the training of the discriminator, one task is to determine the authenticity of the generated image, while the other is to evaluate whether the expression of the generated image is correct. These two discriminators have similar network structures, and their encoders are identical to the encoder of the generator. The advantage of this design is twofold: it ensures that the generator and the discriminator have roughly the same number of parameters and are trained within the basic GAN framework, and it allows the generator's encoder to be compared with the discriminator's encoder in terms of expression classification. By using expression classification based solely on cross-entropy loss as a baseline and comparing it with the expression classification results obtained from the generator after applying the disentanglement method, it becomes straightforward to see how the model improves upon the baseline.
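As a rough illustration of this two-part design, the sketch below (under the same assumptions as the generator sketch above) gives each discriminator a ResNet-18 encoder identical to the generator's, a real/fake head, and a classification head; the class counts and head layout are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Discriminator(nn.Module):
    """Illustrative discriminator: a ResNet-18 encoder identical to the
    generator's encoder, a real/fake head, and a classification head
    (expression or identity, depending on which discriminator this is)."""

    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(*list(resnet18(weights="DEFAULT").children())[:-1])
        self.real_fake = nn.Linear(feat_dim, 1)              # authenticity score
        self.classifier = nn.Linear(feat_dim, num_classes)   # class logits

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        return torch.sigmoid(self.real_fake(h)), self.classifier(h)

# One discriminator judges expression correctness; the other judges
# authenticity with an auxiliary identity classification task
# (the class counts are assumptions).
D_expr = Discriminator(num_classes=7)
D_id = Discriminator(num_classes=35)
```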

3.2. Contrastive Disentanglement Loss

Facial expression features and identity features depend on the geometric characteristics of the face, particularly key regions, such as the eyes and mouth. Therefore, when extracting expression features, some fundamental geometric information related to key point locations can resemble identity features. This causes the model to extract expression features that are similar to identity features, leading to overfitting as the model fails to learn the intrinsic factors of the expression. To address this issue, the recently proposed contrastive learning method provides a potential solution.
The premise of semi-supervised learning is that if the same class factors are present, the resulting images should belong to the same class. This assumption is used to assign pseudo-labels to the dataset. Consider an expression image $x_e$ and an identity image $x_f$:
$$e_e = E_e(x_e), \quad e_{fe} = E_f(x_e), \quad e_{ef} = E_e(x_f), \quad e_f = E_f(x_f),$$
where $E_e(\cdot)$ and $E_f(\cdot)$ represent the expression encoder and identity encoder, respectively. The terms $e_e$ and $e_{fe}$ refer to the expression and identity features of the expression image, while $e_{ef}$ and $e_f$ denote the expression and identity features of the identity image. The identity pseudo-label $y_f$ for $e_{fe}$ is obtained using KNN, and the trained expression classifier assigns the expression pseudo-label $y_e$ to $e_{ef}$. The features and their corresponding labels are combined as illustrated in Figure 5. By comparing positive and negative pairs, the contrastive loss function is expressed as follows:
$$\mathcal{L}_{\mathrm{CL}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z^{T}, z^{+})/\tau\right)}{\exp\left(\mathrm{sim}(z^{T}, z^{+})/\tau\right) + \sum_{j=1}^{K} \exp\left(\mathrm{sim}(z^{T}, z^{-})/\tau\right)},$$
where $\tau > 0$ is a scalar temperature hyperparameter, and $z$ is the combination of the identity image's and the expression image's expression features. $z^{+}$ is a positive sample and $z^{-}$ is a negative sample: the former is an expression feature of the same expression class with a different identity, while the latter is an expression feature of a different expression class with an arbitrary identity. For a given expression class, the expression features should be similar regardless of the identities involved; for different expression classes, the expression features are expected to be dissimilar, regardless of whether they share the same identity. By comparing positive and negative samples, the distance between positive samples in the embedding space is reduced, while the distance between negative samples is increased. As the model gradually learns to distinguish positive from negative samples, the expression features become independent of the identity features, allowing the model to learn disentangled, identity-independent expression representations.
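A minimal PyTorch sketch of such a supervised contrastive loss over a batch of expression features is given below, assuming the pairing and pseudo-labeling steps described above have already produced one (possibly pseudo-) expression label per feature; the function name, the use of cosine similarity, and the per-anchor averaging are implementation assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_disentanglement_loss(feats, expr_labels, tau=0.1):
    """Sketch of the contrastive loss: for each anchor expression feature,
    positives are batch features with the same (possibly pseudo-) expression
    label regardless of identity; all remaining features act as negatives.
    `tau` is the temperature hyperparameter."""
    z = F.normalize(feats, dim=1)        # cosine similarity via dot products
    sim = z @ z.t() / tau                # pairwise scaled similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (expr_labels[:, None] == expr_labels[None, :]) & ~eye
    # Log-softmax over every non-anchor entry of each row.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average over the positives of each anchor, then over anchors that have positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return per_anchor[pos_mask.any(dim=1)].mean()

# Example: a batch of 8 expression features with (pseudo-)expression labels.
loss = contrastive_disentanglement_loss(torch.randn(8, 512), torch.randint(0, 7, (8,)))
```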

3.3. Model Training

As a GAN-based model, the CD-GAN follows the principle of adversarial learning for training. In addition, a latent code consistency loss is introduced into the generator loss to constrain our model and improve performance. This loss enforces consistency of the embedding-space features before and after image reconstruction. The whole learning process encourages the model to learn a disentangled expression representation, which improves the accuracy of the FER task. The model is trained as follows:
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{(x,y)\sim p_{e}}\left[\log D^{e}_{y_{e}}(x_{e}, x_{f}, n)\right] + \mathbb{E}_{(x,y)\sim p_{f}}\left[\log\left(1 - D^{f}_{y_{f}}\left(G(x_{e}, x_{f}, n)\right)\right)\right],$$
where $p_{e}$ and $p_{f}$ denote the distributions of the real expression and identity samples, and the noise $n$ follows its prior distribution $p_{n}$.

3.3.1. Discriminator Loss

From the above, the extracted identity features are denoted as $e_f$ and the expression features as $e_e$; the resulting image is produced by the decoder of the generator:
$$x' = G_{d}(e_{e}, e_{f}).$$
Ideally, the identity information of the generated image $x'$ should match the identity image input, while the expression information should align with the expression image input. Therefore, the discriminator is designed to perform two tasks: one is to determine the authenticity of the generated image, and the other is to perform expression recognition by classifying expressions. Additionally, an auxiliary task is introduced, where identity recognition is performed in the discriminator responsible for authenticity determination. The purpose of this auxiliary task is to train the discriminator effectively, preventing the generator from being misled and learning incorrect directions, and encouraging the model to generate more accurate label-transferred images. Through adversarial training, the generator receives feedback from the discriminator; this feedback enables the identity branch of the generator to learn the identity information of the input image and the expression branch to learn the expression information, helping the model learn de-identified, disentangled expression representations. The objective loss function of the discriminator can be expressed as follows:
$$\mathcal{L}_{f} = \mathbb{E}_{(x_{f}, y_{f})\sim p_{f}}\left[\log D_{f}(x_{f})\right] + \mathbb{E}_{(x', y)\sim p'}\left[\log\left(1 - D_{f}(x')\right)\right], \qquad \mathcal{L}_{e} = \mathbb{E}_{(x_{e}, y_{e})\sim p_{e}}\left[\log D_{e}(x_{e})\right],$$
where $\mathcal{L}_{f}$ and $\mathcal{L}_{e}$ are the loss functions of the identity discriminator and the expression discriminator, and $p_{f}$, $p_{e}$, and $p'$ refer to the data distributions of the facial images, expression images, and generated images, respectively.
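A hedged sketch of these discriminator objectives, reusing the Discriminator sketch from Section 3.1.2, is shown below; the use of binary cross-entropy for the authenticity terms, the placement of the auxiliary identity classification term, and the equal weighting of the two losses are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_losses(D_id, D_expr, x_id, y_id, x_expr, y_expr, x_gen):
    """Sketch of the discriminator objectives: the identity discriminator
    scores authenticity (real identity images vs. generated images) with an
    auxiliary identity classification term, while the expression
    discriminator classifies expressions on real expression images."""
    real_score, id_logits = D_id(x_id)
    fake_score, _ = D_id(x_gen.detach())         # block gradients into the generator
    L_f = (F.binary_cross_entropy(real_score, torch.ones_like(real_score))
           + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
           + F.cross_entropy(id_logits, y_id))   # auxiliary identity recognition
    _, expr_logits = D_expr(x_expr)
    L_e = F.cross_entropy(expr_logits, y_expr)   # expression correctness on real images
    return L_f + L_e
```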

3.3.2. Generator Loss

For the generator, based on the principles of the GAN, the goal is to generate images that can fool the identity discriminator. Therefore, the classification loss of the generator is expressed as follows:
$$\mathcal{L}_{C} = -\left\{\mathbb{E}_{(x', y_{f})\sim p'}\left[\log D_{f}(x')\right] + \mathbb{E}_{(x', y_{e})\sim p'}\left[\log D_{e}(x')\right]\right\}.$$
With the above loss function, the generator is expected to learn expression features only from the expression branch and identity features only from the identity branch. At the same time, the generated image should retain the identity of $x_f$ while carrying the expression of $x_e$, which motivates the two branches of the generator to learn the core identity information and the core expression information, respectively. By minimizing this equation, the expression branch of the generator is expected to learn a disentangled expression representation, making the FER task more discriminative.
To better disentangle the features, the expression information and identity information should remain unchanged before and after image generation. For this reason, a feature consistency (FIC) loss is proposed, which constrains the features to remain unchanged before and after image generation in order to improve the learning of the model. In the first stage, the expression features $e_e$ and identity features $e_f$ are extracted through the generator's two-branch structure. In the second stage, the generated image $x'$ is fed into the expression branch and the identity branch, respectively:
$$e_{e}' = E_{e}(x'), \qquad e_{f}' = E_{f}(x').$$
Accordingly, the feature consistency loss between the features before and after image generation is expressed as follows:
$$\mathcal{L}_{F} = \lVert e_{e} - e_{e}' \rVert_{2} + \lVert e_{f} - e_{f}' \rVert_{2}.$$
By minimizing this loss function, the generator's expression branch is expected to learn the core information of the expression features, which drives the model to separate expression features from identity features.
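A minimal sketch of this feature consistency term, reusing the two-branch generator sketch from Section 3.1.1, might look as follows; accessing the branch encoders directly as enc_expr and enc_id is an assumption of that sketch.

```python
import torch

def feature_consistency_loss(G, x_gen, e_e, e_f):
    """Sketch of the feature consistency term: re-extract expression and
    identity features from the generated image and penalize their L2
    distance to the features obtained before generation."""
    e_e_prime = G.enc_expr(x_gen).flatten(1)   # expression feature of the generated image
    e_f_prime = G.enc_id(x_gen).flatten(1)     # identity feature of the generated image
    return (torch.norm(e_e - e_e_prime, p=2, dim=1).mean()
            + torch.norm(e_f - e_f_prime, p=2, dim=1).mean())
```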
Adding all the losses together, the total generator loss is as follows:
$$\mathcal{L}_{G} = \lambda_{C}\mathcal{L}_{C} + \lambda_{D}\mathcal{L}_{D} + \lambda_{F}\mathcal{L}_{F},$$
where $\lambda_{C}$, $\lambda_{D}$, and $\lambda_{F}$ are custom weighting parameters. Following the principle of adversarial GAN training, Equations (5) and (9) are optimized alternately, so that the CD-GAN is continuously updated and learns a disentangled expression representation.
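Putting the pieces together, the sketch below shows one alternating optimization step that combines the generator loss above with the discriminator update, reusing the generator, discriminators, and loss helpers sketched in the previous sections; interpreting the $\lambda_{D}$ term as the FER cross-entropy on the expression branch and the optimizer setup are assumptions, while the $\lambda$ values follow Section 4.2.

```python
import torch
import torch.nn.functional as F

def train_step(G, D_id, D_expr, opt_G, opt_D, batch, lambdas=(0.5, 1.0, 6.0)):
    """One alternating optimization step (sketch). `lambdas` follow the values
    reported in Section 4.2; treating the lambda_D term as the FER
    cross-entropy on the expression branch is an assumption."""
    lam_C, lam_D, lam_F = lambdas
    x_expr, y_expr, x_id, y_id = batch

    # --- discriminator update -------------------------------------------
    _, _, _, x_gen = G(x_expr, x_id)
    opt_D.zero_grad()
    discriminator_losses(D_id, D_expr, x_id, y_id, x_expr, y_expr, x_gen).backward()
    opt_D.step()

    # --- generator update -----------------------------------------------
    e_e, e_f, fer_logits, x_gen = G(x_expr, x_id)
    fake_id_score, _ = D_id(x_gen)
    _, fake_expr_logits = D_expr(x_gen)
    L_C = (F.binary_cross_entropy(fake_id_score, torch.ones_like(fake_id_score))
           + F.cross_entropy(fake_expr_logits, y_expr))   # fool / steer the discriminators
    L_D = F.cross_entropy(fer_logits, y_expr)             # FER classification term (assumption)
    L_F = feature_consistency_loss(G, x_gen, e_e, e_f)
    opt_G.zero_grad()
    (lam_C * L_C + lam_D * L_D + lam_F * L_F).backward()
    opt_G.step()

# Optimizers as described in Section 4.2 (Adam, learning rate 1e-4).
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(list(D_id.parameters()) + list(D_expr.parameters()), lr=1e-4)
```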

4. Experiment

In this section, we evaluate the proposed CD-GAN on the laboratory datasets CK+ and TFEID, and the in-the-wild dataset RAF-DB. In addition, the validity of the model is verified through the results of the quantitative analysis and visualization.

4.1. Dataset and Experimental Setup

The Extended Cohn–Kanade dataset (CK+) [22] is a widely used FER dataset created by researchers at Carnegie Mellon University and the University of Pittsburgh. The dataset contains 593 expression sequences recorded from 123 participants of different ages, genders, and ethnicities. These sequences cover six basic expressions (happiness, sadness, surprise, disgust, anger, and fear) as well as neutral expressions. The CK+ dataset is widely used in facial expression recognition research to evaluate and compare the performance of different algorithms and models, providing a standard benchmark for testing the accuracy and robustness of FER systems under different conditions. In this paper, the best three frames of each sequence are selected to construct the training and test sets, yielding a total of 1236 images for the experiments.
The Taiwan Facial Expression Image Database (TFEID) [23] comprises images of 40 models, each displaying eight facial expressions: the six basic expressions plus neutral and contempt. For our experiments, images with the six basic expressions and the neutral expression were selected, resulting in 380 TFEID images.
The Real-world Affective Faces Database (RAF-DB) [24] is a large-scale in-the-wild facial expression dataset containing around 30,000 facial images collected from the Internet. The images cover subjects of different ages, genders, and ethnicities and exhibit large variations in pose, illumination, and occlusion. Each image was annotated through crowdsourcing by about 40 trained annotators, and the dataset provides both basic and compound expression labels, including the six basic expressions (happiness, sadness, anger, disgust, fear, and surprise) and the neutral state. RAF-DB is widely used in facial expression recognition research, providing a rich resource for training and evaluating FER systems. In this paper, only the basic expression classes are used, comprising 12,271 training samples and 3068 test samples.
CASIA-WebFace [25] is a database for face recognition research created by the Institute of Automation, Chinese Academy of Sciences. It contains roughly 500,000 real-world face images of more than 10,000 subjects, collected from the Internet, and therefore reflects the diversity and complexity of real-world conditions, covering different backgrounds, poses, illumination, ages, genders, and ethnicities. The database has become one of the most important resources in the field of face recognition and is widely used in academic research and industrial practice. In this paper, we use the images of the first 35 subjects in CASIA-WebFace to train the model, involving a total of 6247 images. During the training phase, face images are randomly selected from CASIA-WebFace and the expression datasets, respectively, to form the input image pairs.

4.2. Experiment Setting

Prior to being fed into the model, both identity and expression images undergo a pre-processing stage. Facial images are detected with MTCNN and cropped to 3 × 224 × 224 (channel × height × width). In addition, to improve the generalization ability of the model, data augmentation (random cropping and horizontal flipping) is applied to the facial expression images during training. The model is trained on identity–expression image pairs with a batch size of 32, using the Adam optimizer with a learning rate of 1 × 10−4. The experiments are implemented with the PyTorch deep learning framework, and training is performed on four RTX 3090 GPUs hosted in a Dell T640 server (Dell Inc., USA).
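A minimal sketch of the training-time pre-processing described above is given below; MTCNN detection and cropping are assumed to have been applied upstream, and the resize margin before random cropping is an assumed value.

```python
from torchvision import transforms

# Training-time augmentation for the cropped face images (sketch).
train_transform = transforms.Compose([
    transforms.Resize(236),               # margin before random cropping (assumed value)
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                # 3 x 224 x 224 tensor with values in [0, 1]
])

# Test-time pre-processing: deterministic resize only.
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```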
For the trainable parameters, each custom parameter was adjusted individually while keeping the other two fixed. Through training, it was observed that the model achieved the highest accuracy when λ C = 0.5 , λ D = 1 , and λ F = 6 .

4.3. Quantitative Analysis

As shown in Table 1, the CD-GAN was evaluated on CK+ and TFEID, and its average accuracies on the two datasets reached 99.11% and 98.21%, respectively. As the table shows, the CD-GAN was compared with other methods and achieves accuracy comparable to or better than most of the compared methods on the laboratory datasets. In addition, the baseline model for our comparison is the discriminator, whose encoder structure is identical to that of the generator's expression branch. Compared to the baseline model, the CD-GAN achieves a higher average accuracy, indicating that the CD-GAN has a stronger expression discrimination ability and confirming the effectiveness of the model for the FER task on these datasets.
The average accuracy of the CD-GAN on RAF-DB is 88.21%, as shown in Table 2, which compares the CD-GAN with several state-of-the-art methods. The expression images in this in-the-wild dataset are taken from real scenes, involve different identities, and suffer from pose differences and occlusions, which makes RAF-DB more challenging. Nevertheless, the CD-GAN remains comparable to existing state-of-the-art methods. Furthermore, our model not only performs feature extraction but also image generation; in a GAN, the generated images may contain adversarial interference, which can confuse the encoder performing feature extraction and lead to unstable or erroneous features. Even so, the performance of the CD-GAN on RAF-DB is still much higher than that of the baseline model, suggesting that our model successfully separates the identity information of the face from the expression information through adversarial training, allowing the CD-GAN to effectively learn a disentangled expression representation.
The average accuracy and macro-F1 score are shown in Table 3. It can be seen that the model in this paper outperforms the ResNet-18-based baseline on the RAF-DB dataset. These results indicate that the proposed CD-GAN learns robustly on the in-the-wild FER dataset.
To assess the role of the FIC and contrastive constraints, this section analyzes them through ablation experiments on RAF-DB, evaluating each module's impact on the model with the ResNet-18 encoder backbone as the baseline.
As can be seen in Table 4, the FIC constraint enforces the consistency of the features before and after image generation, and adversarial training keeps the expression features consistent before and after generation, which tends to push the identity features into the identity branch. The contrastive constraint uses supervised learning to provide a training target: it pulls features of the same class closer together and pushes features of different classes apart, encouraging the model to learn a disentangled expression representation. The performance of the CD-GAN improves as the proposed modules are added, and the contrastive constraint contributes the largest gain. The ablation results confirm the effectiveness of the modules proposed in this paper.

4.4. Visualization and Analysis

The detailed performance of the CD-GAN on RAF-DB is further shown in Figure 6, where Figure 6a is the confusion matrix of the CD-GAN on RAF-DB, and Figure 6b is the confusion matrix of the baseline model on RAF-DB. The accuracy for each expression class was analyzed using the confusion matrix. It is evident that the CD-GAN has a higher discriminative ability than the baseline model for most of the classes on RAF-DB, indicating that the model has a stronger expression discrimination ability than the baseline model.
From the confusion matrix, it can be seen that the CD-GAN achieves its highest accuracy on the happy expression category, reaching 95.7%, but is less accurate on the fear category, with only 67.6%. This is due to the class imbalance in RAF-DB, where the fear category accounts for only 2.29% of the data while the happy category accounts for 38.3%. The large difference between these proportions causes the model to learn more about the happy class and less about the fear class; in other words, the model struggles to learn equally about each class under class imbalance. Using more evenly distributed FER data could reduce the bias towards particular expressions and improve the accuracy of expression recognition.
To further evaluate the model, a t-SNE visualization experiment was conducted on the RAF-DB dataset to analyze the influence of the modules. Figure 7a,b show the results of our model and the baseline model, respectively. It can be clearly seen that, in the embedding space, our model separates the different expression classes and each class is well aggregated, making the expression classes distinguishable and thus favorable for expression classification. As shown in Figure 7b, the network using only the baseline for expression classification makes the expression classes only weakly distinguishable in the embedding space: the class boundaries are fuzzy, the features near the boundaries are hard to separate, and in the center of the plot, features from different classes are intermingled. The comparison with the baseline shows that the expression features obtained by our model through adversarial training and contrastive disentanglement are more discriminative and more conducive to the FER task.
The CD-GAN was also trained for expression transfer, as illustrated in Figure 8. After training, the model is able to transfer expression embeddings onto a given face. The expression transfer results of the CD-GAN are clearer and more discernible than those of other models, further substantiating the efficacy of the CD-GAN in acquiring disentangled expressions. Additionally, the model generates images with poses consistent with the original images, demonstrating the robustness of the CD-GAN to pose variations and effectively reducing the impact of extraneous information on the model.

5. Conclusions

This paper proposes a novel model, the CD-GAN, which differs from traditional FER methods by employing adversarial training to explicitly disentangle expression features from identity information. In addition, contrastive learning is introduced to measure expression classes in the embedding space, guiding the model to learn discriminative, disentangled expression representations and thereby significantly improving FER accuracy. However, the approach still presents certain limitations; for instance, the disentanglement performance degrades when facial images exhibit large pose variations. In future work, we aim to extend the current framework to disentangle other facial attributes, such as pose and skin tone, with the goal of enabling more robust and nuanced understanding of human emotions by machines.

Author Contributions

Methodology, S.L.; Data Curation, H.C.; Writing—Original Draft Preparation, S.N.; Writing—Review and Editing, S.L.; Investigation, S.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Jilin Province (20240101336JC).

Data Availability Statement

The data presented in this study are openly available in [RAF-DB] at [10.1109/CVPR.2017.277], reference number [24].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, S.; Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215. [Google Scholar] [CrossRef]
  2. Sajjad, M.; Ullah, F.M.; Ullah, M.; Christodoulou, G.; Cheikh, F.A.; Hijji, M.; Muhammad, K.; Rodrigues, J.J.P.C. A comprehensive survey on deep facial expression recognition. Eng. J. 2023, 68, 817–840. [Google Scholar]
  3. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition. Electronics 2023, 12, 3595. [Google Scholar] [CrossRef]
  5. Jiang, M.; Yin, S. Facial expression recognition based on convolutional block attention module and multi-feature fusion. Int. J. Comput. Vis. Robot. 2023, 13, 21–37. [Google Scholar] [CrossRef]
  6. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  7. Jiang, J.; Deng, W. Disentangling identity and pose for facial expression recognition. IEEE Trans. Affect. Comput. 2022, 13, 1868–1878. [Google Scholar] [CrossRef]
  8. Liu, X.; Kumar, B.V.K.V.; You, J.; Jia, P. Adaptive Deep Metric Learning for Identity-Aware Facial Expression Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 522–531. [Google Scholar]
  9. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  10. Yang, H.; Ciftci, U.; Yin, L. Facial expression recognition by De-expression residue learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2168–2177. [Google Scholar]
  11. Yang, H.; Zhang, Z.; Yin, L. Identity-Adaptive Facial Expression Recognition through Expression Regeneration Using Conditional Generative Adversarial Networks. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 294–301. [Google Scholar]
  12. Choi, Y.; Choi, M.; Kim, M. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8789–8797. [Google Scholar]
  13. Cai, J.; Meng, Z.; Khan, A.S.; O’Reilly, J.; Li, Z.; Han, S.; Tong, Y. Identity-free facial expression recognition using conditional generative adversarial network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1344–1348. [Google Scholar]
  14. Zhang, W.; Ji, X.; Chen, K.; Ding, Y.; Fan, C. Learning a facial expression embedding disentangled from identity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6759–6768. [Google Scholar]
  15. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010; pp. 297–304. [Google Scholar]
  16. Kihyuk, S. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  17. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924. [Google Scholar]
  18. Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C.H. Prototypical Contrastive Learning of Unsupervised Representations. arXiv 2020, arXiv:2005.04966. [Google Scholar]
  19. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
  20. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  21. Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-supervised learning with deep generative models. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  22. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  23. Chen, L.; Yen, Y. Taiwanese Facial Expression Image Database; Brain Mapping Laboratory, Institute of Brain Science, National Yang-Ming University: Taipei, Taiwan, 2007. [Google Scholar]
  24. Li, S.; Deng, W.; Du, J.P. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Trans. Image Process. 2018, 28, 356–370. [Google Scholar] [CrossRef] [PubMed]
  25. Shen, W.; Liu, R. Learning residual images for face attribute manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4030–4038. [Google Scholar]
  26. Xie, S.; Hu, H. Facial expression recognition with two-branch disentangled generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2359–2371. [Google Scholar] [CrossRef]
  27. Vaidya, K.S.; Patil, P.M.; Alagirisamy, M. Hybrid CNN-SVM Classifier for Human Emotion Recognition Using ROI Extraction and Feature Fusion. Wirel. Pers. Commun. 2023, 132, 1099–1135. [Google Scholar] [CrossRef]
  28. Farajzadeh, N.; Hashemzadeh, M. Exemplar-based facial expression recognition. Inf. Sci. 2018, 460, 318–330. [Google Scholar] [CrossRef]
  29. Chirra, V.R.R.; Uyyala, S.R.; Kolli, V.K.K. Virtual facial expression recognition using deep CNN with ensemble learning. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 10581–10599. [Google Scholar] [CrossRef]
  30. Baygin, M.; Tuncer, I.; Dogan, S.; Barua, P.D.; Tuncer, T.; Cheong, K.H.; Acharya, U.R. Automated facial expression recognition using exemplar hybrid deep feature generation technique. Soft Comput. 2023, 27, 8721–8737. [Google Scholar] [CrossRef]
  31. Fard, A.P.; Mahoor, M.H. Ad-corre: Adaptive correlation-based loss for facial expression recognition in the wild. IEEE Access 2022, 10, 26756–26768. [Google Scholar] [CrossRef]
  32. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef]
  33. Saurav, S.; Gidde, P.; Saini, R.; Singh, S. Dual integrated convolutional neural network for real-time facial expression recognition in the wild. Vis. Comput. 2022, 38, 1083–1096. [Google Scholar] [CrossRef]
  34. Xue, F.; Wang, Q.; Guo, G. Transfer: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montréal, QC, Canada, 11–17 October 2021; pp. 3601–3610. [Google Scholar]
  35. Wu, H.; Jia, J.; Xie, L.; Qi, G.; Shi, Y.; Tian, Q. Cross-VAE: Towards Disentangling Expression from Identity For Human Faces. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4087–4091. [Google Scholar]
Figure 1. Disentanglement is a method of separating features in an embedded space. The CD-GAN will separate the expressions and identity features belonging to the face through the method of disentanglement.
Figure 2. A pair of expression and identity images is fed into the generator to produce expression-transferred images, which are then evaluated. A semi-supervised learning strategy is employed to disentangle identity and expression features. Additionally, adversarial learning is utilized to further decouple these features through expression transfer and image reconstruction. Finally, the disentangled expression representations are used for facial expression classification.
Figure 3. The generator consists of an identity branch and an expression branch, which extract identity and expression features, respectively. These extracted features are aggregated in a cascaded manner and then passed through a decoder to generate expression-transferred images. During the adversarial training process, the generated images become increasingly realistic, enabling the expression branch to learn disentangled expression representations.
Figure 4. The discriminator consists of an encoder and performs multiple tasks. It takes both real images and generated images from the generator as the input for discrimination and comparison. The discriminator’s tasks include identifying fake images and classifying the expression and identity information of the generated images. During adversarial training, it drives the generator to produce increasingly realistic images, achieving the separation of identity and expression information.
Figure 5. Positive samples are the expression features of distinct identities for a given expression class, whereas negative samples are the expression features of arbitrary identities for different expression classes. These are derived by integrating the expression features of expression images with those of identity images. By establishing such positive and negative samples, the model can discern the intrinsic factors influencing expression change, irrespective of identity.
Figure 6. Confusion matrices of the CD-GAN and the baseline approach on the RAF-DB test set.
Figure 7. The 2D t-SNE visualization of facial expression features obtained by different methods, including (a) the CD-GAN, and (b) the baseline. The models are trained on RAF-DB with 12,271 labels. The features are extracted from the RAF-DB test set.
Figure 8. Comparison of the images generated by our model with those generated by other models.
Table 1. Comparison results with other FER methods on the laboratory datasets.
Methods | CK+ | TFEID
DAM-CNN [25] | 95.88% | 93.20%
TD-GAN [26] | 97.53% | 97.20%
CNN-SVM [27] | 94.12% | 85.10%
Exemplar-based [28] | 97.14% | 98.90%
DCNN-VC [29] | 99.42% | 99.58%
Auto-FER [30] | 100% | 97.01%
ResNet-18 | 98.72% | 97.53%
CD-GAN (ours) | 99.11% | 98.21%
Table 2. Comparison results with other FER methods on RAF-DB.
Methods | RAF-DB
TD-GAN [26] | 81.91%
Ad-corre [31] | 86.96%
RAN [32] | 86.90%
gACNN [33] | 85.53%
DLN [14] | 86.40%
ViT-FER [34] | 90.91%
Cross-VAE [35] | 84.81%
ResNet-18 | 85.13%
CD-GAN (ours) | 88.21%
Table 3. Macro-F1 score and avg. accuracy on RAF-DB.
Methods | Avg. Accuracy | Macro-F1
ResNet-18 | 79.46% | 0.8013
CD-GAN (ours) | 82.35% | 0.8432
Table 4. Ablation experiments on RAF-DB dataset.
Method | RAF-DB
Baseline (ResNet-18) | 85.13%
Baseline + FIC | 85.82%
Baseline + CL | 86.54%
Baseline + FIC + CL (CD-GAN) | 88.21%


