Article

A Lightweight Model Enhancing Facial Expression Recognition with Spatial Bias and Cosine-Harmony Loss

by Xuefeng Chen 1,2 and Liangyu Huang 1,2,*
1 Guangxi Key Laboratory of Nuclear Physics and Technology, Guangxi Normal University, Guilin 541004, China
2 College of Physical Science and Technology, Guangxi Normal University, Guilin 541004, China
* Author to whom correspondence should be addressed.
Computation 2024, 12(10), 201; https://doi.org/10.3390/computation12100201
Submission received: 17 August 2024 / Revised: 11 September 2024 / Accepted: 18 September 2024 / Published: 4 October 2024

Abstract:
This paper proposes a facial expression recognition network called the Lightweight Facial Network with Spatial Bias (LFNSB). The LFNSB model effectively balances model complexity and recognition accuracy. It has two key components: a lightweight feature extraction network (LFN) and a Spatial Bias (SB) module for aggregating global information. The LFN introduces combined channel operations and depth-wise convolution techniques, effectively reducing the number of parameters while enhancing feature representation capability. The Spatial Bias module enables the model to focus on local facial features while capturing the dependencies between different facial regions. Additionally, a new loss function called Cosine-Harmony Loss is designed. This function optimizes the relative positions of feature vectors in high-dimensional space, resulting in better feature separation and clustering. Experimental results on the AffectNet and RAF-DB datasets demonstrate that the LFNSB model achieves competitive recognition accuracy, with 63.12% accuracy on AffectNet-8, 66.57% accuracy on AffectNet-7, and 91.07% accuracy on RAF-DB, while significantly reducing the model complexity.

1. Introduction

Facial expression recognition (FER) is an active research area with broad applications in interpersonal communication, education, medical rehabilitation, and safe driving [1]. Researchers have explored various frameworks and models for FER [2]. However, a key issue with existing FER techniques is their high model complexity, leading to significant computational resource requirements and large storage demands [3]. To address these issues, Ref. [4] proposed a facial expression recognition model based on MobileFaceNets called Mixed Feature Network (MFN). This model employs multi-level feature extraction and a complex network architecture to effectively capture expression information in images, achieving high recognition accuracy. Nevertheless, the network still has high computational complexity and remains unsuitable for resource-constrained mobile devices. Therefore, this paper further optimizes MFN and designs a more computationally efficient network called the Lightweight Facial Network (LFN). The LFN removes some large convolution kernels [5] and combines convolution operations with batch normalization into a single step, reducing computational complexity while maintaining good feature extraction capabilities. Additionally, the Spatial Bias (SB) module [6] is introduced into the LFN architecture to address the LFN’s shortcomings in capturing dependencies between different facial regions while maintaining computational efficiency.
Another key challenge in the field of FER is that different expressions may have similar facial features, making it relatively complex to distinguish between different expression categories [7]. To enhance the distinction between different categories and prevent overlap between classes, increasing the distance between class centers is essential [8]. Cosine-Harmony Loss measures the similarity between feature vectors and class centers using an adjusted cosine distance. Using this adjusted cosine distance also enhances the model’s robustness to variations in facial expression images, such as changes in lighting conditions and facial pose [9]. As a result, samples from the same class are more tightly clustered in the feature space, while samples from different classes are more distinctly separated, leading to improved recognition performance.
The contributions of our research can be summarized as follows:
  • This paper introduces the lightweight and efficient Lightweight Facial Network with Spatial Bias (LFNSB) model, which utilizes a deep convolutional neural network to capture both detailed and global features of facial images while maintaining high computational efficiency.
  • This paper introduces a new loss function called Cosine-Harmony Loss. It utilizes adjusted cosine distance to optimize the computation of class centers, balancing intra-class compactness and inter-class separation.
  • Experimental results show that the proposed LFNSB method achieves an accuracy of 63.12% on AffectNet-8, 66.57% on AffectNet-7, and 91.07% on RAF-DB.

2. Methods

This section begins with an overview of related work on three key aspects of FER: backbone architectures, attention mechanisms, and loss functions used in facial recognition research. Building on this foundation, it then turns to the method used to address the FER problem.

2.1. Related Work

2.1.1. FER

Facial expression recognition (FER) has seen significant advancements recently, particularly within computer vision and artificial intelligence, and is now widely applied in areas such as education, healthcare, and safe driving. Existing networks, such as VGG [10], ResNet [11], and Inception [12], utilize deep network architectures with multi-level feature extraction to effectively capture facial expression information from images. Recently, Vision Transformers (ViTs) have also been applied to FER: Xue et al. [13] addressed the ViT’s difficulty in converging and its tendency to focus on occluded and noisy regions in FER. However, these existing technologies typically require substantial computational resources and storage space [14]. To address these issues, several lightweight architectures have been proposed. For instance, MobileNetV2 [15] introduced an inverted residual structure that significantly reduces the number of parameters and computational cost while maintaining performance. Similarly, ShuffleNet [16] utilizes point-wise group convolution and channel shuffle operations to reduce the model size and complexity, making it more suitable for mobile and embedded devices. Ref. [4] proposed an improved facial expression recognition network based on MobileFaceNets [17] called MFN. MFN incorporates MixConv operations from [5], which naturally mix multiple kernel sizes within a single convolution, allowing the diversity and complexity of features to be captured and expressed more effectively. Additionally, a coordinate attention mechanism [18] can be introduced in each bottleneck to better capture the dependencies between different facial regions. Although MFN achieves high recognition accuracy, its computational complexity remains high, making it less suitable for resource-constrained environments. This paper further improves upon MFN, resulting in a network called LFN, which balances recognition accuracy with computational complexity.

2.1.2. Attention Mechanism

Attention mechanisms are implemented to enhance the model’s expressive power by emphasizing key areas or features in an image, thereby effectively capturing subtle changes and important features in expressions. In existing FER research, common attention mechanisms include spatial attention, channel attention, and local–global attention. Spatial attention mechanisms enhance the detection of subtle changes and important features in facial expressions by focusing on and highlighting localized areas related to emotions, such as the eyes and mouth [19]. However, this mechanism may be insufficient when processing global contexts, as focusing solely on local features might lead to the neglect of important overall features. Channel attention mechanisms reflect the interdependencies between different feature channels, which helps to better understand and capture semantic information in expressions. By enhancing important feature channels and suppressing irrelevant channels, the model can improve the accuracy of expression recognition [20]. However, using channel attention mechanisms alone also has limitations, as they cannot fully utilize spatial information. To overcome the above issues, some studies combine spatial and channel attention mechanisms, enabling the model to deeply understand expression details and improve the classification accuracy of emotional states [16]. Although these attention mechanisms significantly enhance FER accuracy, they often come with high computational complexity and a large number of parameters. To address these issues, a module known as Spatial Bias, as introduced in [6], is incorporated into the LFN. Unlike traditional attention mechanisms, the SB module is both lightweight and fast, adding a small amount of spatial bias to the convolutional feature maps through simple convolution operations, thereby effectively learning global knowledge. This module captures global features while preserving local information by reducing the spatial dimensions of the feature maps and compressing the number of channels. It can better establish connections among different facial regions, such as the mouth, eyes, nose, etc.

2.1.3. Loss Function

In recent years, various improved loss functions have emerged in the domain of facial recognition and expression recognition, aiming to enhance the discriminative power of facial features. Examples include center loss [21] and separate loss [22]. These improved loss functions share a common goal: to maximize inter-class variance and minimize intra-class variance. The center loss proposed in [23] enhances feature discriminability by reducing intra-class variation. However, it only focuses on intra-class compactness, neglecting inter-class sparsity, which limits its effectiveness in distinguishing different classes. To address the limitations of center loss, Ref. [24] proposes affinity loss. This loss function calculates the Euclidean distance between each sample and its class center. Additionally, it enlarges the inter-class boundaries by using the standard deviation among class centers, effectively preventing class overlap. However, the Euclidean distance is not suitable for high-dimensional image data such as facial expressions. Similarly, the Manhattan distance, which sums the absolute differences between coordinates, may also struggle with high-dimensional data as it does not effectively handle the angular relationships between vectors. In contrast, cosine similarity retains its properties, i.e., 1 for identical vectors, 0 for orthogonal vectors, and −1 for opposite vectors, even in high-dimensional spaces. This makes it more concise, efficient, and computationally less complex. Furthermore, Ref. [25] also applies cosine distance to measure the similarity between facial features in the field of facial recognition. Angular Softmax Loss, which uses angles as distance measures, was introduced in [26]. Subsequent improvements to angular loss functions have also been proposed [27]. Building on these methods, this paper proposes a new Cosine-Harmony Loss that optimizes the cosine distance to harmonize feature clustering and class separation.

2.2. Method

The architectural overview of the LFNSB model is illustrated in Figure 1, consisting of two main components: the Lightweight Feature Network (LFN) and the Spatial Bias (SB) module. Facial images are processed by the LFN, which includes multiple convolutional layers, residual bottleneck blocks, and Conv2d_BN units designed to capture detailed local features of the face. These components help ensure both feature extraction depth and efficiency. The RepVGGDW modules further improve the computational efficiency while maintaining high feature representation quality.
Once local features are extracted by the LFN, they are passed to the SB module, which enhances the global feature representation by introducing spatial bias. This step allows the model to capture broader facial expressions that go beyond local patches and integrate contextual information. The integration of local and global features allows the model to achieve a more comprehensive understanding of facial expressions.
After the feature maps are combined, they are flattened and input into a fully connected layer, which outputs the final predictions for the expression categories. Importantly, the LFN module is optimized using the proposed Cosine-Harmony Loss function. This loss function improves the discriminative power of the extracted feature vectors by adjusting the cosine similarity between feature vectors and their corresponding class centers, focusing on better inter-class separation and intra-class compactness.
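For readers who prefer code, the following is a minimal PyTorch sketch of how the two components could be wired together. The module and argument names (the LFN backbone, the SpatialBias module, the flattening step, and the number of classes) are illustrative assumptions rather than the authors’ exact implementation; the flattened feature vector is returned alongside the logits so that the Cosine-Harmony Loss introduced below can be applied to it.

```python
import torch
import torch.nn as nn

class LFNSB(nn.Module):
    """Sketch of the overall pipeline: LFN local features -> Spatial Bias -> classifier."""
    def __init__(self, backbone: nn.Module, spatial_bias: nn.Module, num_classes: int = 7):
        super().__init__()
        self.backbone = backbone          # lightweight local feature extractor (LFN)
        self.spatial_bias = spatial_bias  # appends global-context channels
        self.classifier = nn.LazyLinear(num_classes)  # input size inferred on first call

    def forward(self, x: torch.Tensor):
        feat_map = self.backbone(x)               # (B, C, H, W) local feature maps
        feat_map = self.spatial_bias(feat_map)    # (B, C + bias_channels, H, W)
        feat = torch.flatten(feat_map, 1)         # flatten before the fully connected layer
        logits = self.classifier(feat)            # expression category predictions
        return logits, feat                       # features kept for the Cosine-Harmony Loss
```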

2.2.1. LFN

The LFN architecture optimizes feature extraction capabilities through a hierarchical design and modular construction. It consists of the following components: a residual bottleneck, which captures complex features and facilitates information flow; non-residual blocks, which enhance the model’s representation capability; Conv2d_BN, which simplifies computation processes and improves inference efficiency; RepVGGDW, which reduces computational complexity and parameter count while maintaining robust feature extraction capabilities.
Integrating these components in the LFN architecture enables efficient facial feature extraction, leveraging the strengths of each module to optimize feature extraction and representation.

Improved Facial Expression Recognition Network LFN Based on MFN

MFN is a facial expression recognition architecture built on the lightweight MobileFaceNet [17]. It employs a combination of two primary building blocks: residual bottlenecks and non-residual blocks. The residual bottleneck block is designed to capture complex features and leverages residual connections to mitigate the degradation problem. The improved baseline presented in this paper is called LFN, as shown in Table 1. The first two Conv_block layers are replaced with Conv2d_BN, and the final Conv_block is replaced with RepVGGDW. The original MFN network depth is retained, with the large kernel sizes of 5 and 7 removed in the shallow layers and only a few large kernels used in the deeper layers.

Conv2d_BN

Convolutional layers have limitations in expressing complex features, which can lead to performance degradation when handling intricate expression recognition tasks. To maintain high feature extraction capability while keeping low computational complexity, we introduced the Conv2d_BN module [13]. Conv2d_BN performs only two operations: convolution and batch normalization. As shown in Figure 2, after fusion, the parameters of the batch normalization operation are directly integrated into the convolutional layer. This improves inference efficiency. The fused weight formula is as follows:
$\omega' = \omega \cdot \dfrac{\gamma}{\sqrt{\mathrm{running\_var} + \epsilon}}$
where ω is the weight of the original convolutional layer, ω′ is the fused weight, γ is the scale (weight) of the batch normalization layer, running_var is the running variance of the batch normalization layer, and ϵ is a small constant for numerical stability. Multiplying the original convolutional weight ω by the factor γ/√(running_var + ϵ) accounts for the scaling effect of the batch normalization layer. This ensures that the fused convolutional layer still reflects the adjustment of the data distribution made by batch normalization, thereby maintaining consistency and stability in the output. The fused bias formula is as follows:
$b' = \beta - \dfrac{\gamma \cdot \mathrm{running\_mean}}{\sqrt{\mathrm{running\_var} + \epsilon}}$
where β is the bias of the batch normalization layer, b′ is the fused bias, and running_mean is the running mean of the batch normalization layer. The fused bias subtracts the running mean, scaled by γ/√(running_var + ϵ), from the original bias β to maintain the overall translation effect.
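As an illustration of these two formulas, the following is a small PyTorch sketch of the fusion step. It assumes the standard convolution–BatchNorm folding (and a convolution that may or may not carry its own bias), not the authors’ exact code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution using the
    fused weight and bias formulas given above."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # gamma / sqrt(running_var + eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))  # fused weight
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)  # fused bias
    return fused
```

At inference time, each Conv2d_BN unit can then be replaced by the single convolution returned by fuse_conv_bn, removing the separate normalization step.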

RepVGGDW

When models become deeper, they may encounter issues such as gradient vanishing or explosion, which complicates training. Additionally, deeper networks require more computational resources, increasing training time and computational cost. Replacing the final ‘Conv_block’ layer with the ‘RepVGGDW’ module can help address these issues. The ‘RepVGGDW’ module consists of the previously described ‘Conv2d_BN’ (3 × 3 depth-wise convolution), a 1 × 1 convolution (‘Conv2d’), and a batch normalization layer (‘BatchNorm2d’). As illustrated in Figure 3, this module uses structural re-parameterization (fusion). The fusion process integrates the weights and biases of the depth-wise and point-wise convolutions with the batch normalization parameters, achieving parameter reduction and decreased complexity while maintaining effective feature extraction capabilities. Importantly, the ‘RepVGGDW’ module forms a residual connection by adding the input ‘x’ to the result of the convolution and batch normalization, facilitating gradient flow and preventing gradient vanishing or explosion in deep networks. Additionally, the ‘RepVGGDW’ module does not alter the number of output channels, preserving the 256-channel output, which makes integration into existing facial expression recognition models more convenient and efficient. Experimental validation shows that this replacement improves both recognition accuracy and inference efficiency, demonstrating the effectiveness of the ‘RepVGGDW’ module in optimizing deep network structures and enhancing model performance.
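The description above can be summarized in a short PyTorch sketch. The branch arrangement below (a 3 × 3 depth-wise Conv2d_BN and a 1 × 1 depth-wise convolution in parallel, an identity shortcut, and a shared BatchNorm) follows the re-parameterizable RepVGG-style block popularized by RepViT [14]; the exact ordering used by the authors may differ, so treat this as an assumption.

```python
import torch
import torch.nn as nn

class Conv2dBN(nn.Sequential):
    """Convolution followed by BatchNorm; the pair can later be fused as shown above."""
    def __init__(self, ch_in, ch_out, k=3, stride=1, pad=1, groups=1):
        super().__init__(nn.Conv2d(ch_in, ch_out, k, stride, pad, groups=groups, bias=False),
                         nn.BatchNorm2d(ch_out))

class RepVGGDW(nn.Module):
    """Sketch of the RepVGGDW block: the channel count (e.g. 256) is preserved and the
    identity shortcut keeps gradients flowing in deep networks."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.dw3x3 = Conv2dBN(channels, channels, k=3, stride=1, pad=1, groups=channels)
        self.dw1x1 = nn.Conv2d(channels, channels, 1, 1, 0, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel 3x3 and 1x1 depth-wise branches plus the identity shortcut, normalized
        # together; at inference the branches can be fused into a single 3x3 kernel.
        return self.bn(self.dw3x3(x) + self.dw1x1(x) + x)
```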

2.2.2. Spatial Bias

To ensure that the model can capture the dependencies between different regions of the face while maintaining computational efficiency, we introduce the Spatial Bias (SB) module. The channel and spatial size of the feature map are reduced through the 1 × 1 convolution and average pooling operations. Its workflow is illustrated in Figure 4 and operates as follows:
  • Input Feature Map Compression: The input feature map is first compressed through a 1 × 1 convolution, resulting in a feature map with fewer channels. Then, an adaptive average pooling layer is used to spatially compress the feature map, producing a smaller feature map.
  • Feature Map Flattening: The feature map for each channel is flattened into a one-dimensional vector, resulting in a transformed feature map.
  • Global Knowledge Aggregation: A 1D convolution is applied to the flattened feature map to encode global knowledge, capturing global dependencies and generating the spatial bias map.
  • Upsampling and Concatenation: The spatial bias map is upsampled to the same size as the original convolutional feature map using bilinear interpolation, and then concatenated with the convolutional feature map along the channel dimension.
In this way, the Spatial Bias module enables the network to learn both local and global information, improving feature representation and enhancing the overall effectiveness of the model.
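A minimal sketch of this workflow is given below. The compressed channel count, the pooled spatial size, and the 1D kernel size are illustrative assumptions; only the four steps listed above (compression, flattening, 1D convolution, upsampling and concatenation) are taken from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialBias(nn.Module):
    """Sketch of the Spatial Bias module: compress, flatten, aggregate global context
    with a 1D convolution, then upsample and concatenate with the original feature map."""
    def __init__(self, in_channels: int, bias_channels: int = 8, pooled_size: int = 7):
        super().__init__()
        self.pooled_size = pooled_size
        self.reduce = nn.Conv2d(in_channels, bias_channels, kernel_size=1)   # channel compression
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)                        # spatial compression
        self.global_conv = nn.Conv1d(bias_channels, bias_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.pool(self.reduce(x))                      # (B, bias_ch, p, p)
        s = self.global_conv(s.flatten(2))                 # 1D conv over the p*p positions
        s = s.view(b, -1, self.pooled_size, self.pooled_size)
        s = F.interpolate(s, size=(h, w), mode="bilinear", align_corners=False)
        return torch.cat([x, s], dim=1)                    # spatial bias appended as extra channels
```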

2.2.3. Cosine-Harmony Loss

In FER, features can vary significantly among different samples within the same class, while features across different classes may exhibit high levels of similarity. This paper introduces a new loss function called Cosine-Harmony Loss to address this issue. By using the adjusted cosine distance to calculate intra-class and inter-class distances separately, and balancing them through weighting, it is possible to optimize the separation and clustering of features to a certain extent.
Cosine distance: In face recognition, cosine distance is commonly used to compute the similarity between two facial images. The formula for calculating the cosine distance is as follows:
$\mathrm{cosine\_distance}(x, c) = 1 - \dfrac{x \cdot c}{\|x\| \, \|c\|}$
Here, x represents the input feature vector, and c denotes the class center. To mitigate the impact of global factors such as brightness and contrast on the measurement of feature similarity, this paper introduces an adjusted cosine distance. By subtracting the mean from both the input feature vectors and the class centers, the adjusted cosine distance reduces the bias in global feature distribution, allowing it to better capture the relative differences in local features. The formula for its calculation is as follows:
$\mathrm{adjusted\_cosine\_distance}(x, c) = \mathrm{dist}(x_i, c_j) = 1 - \dfrac{(x - \mathrm{mean}_x) \cdot (c - \mathrm{mean}_c)}{\|x - \mathrm{mean}_x\| \, \|c - \mathrm{mean}_c\|}$
With this adjustment, the model can more effectively resist interference caused by variations in image brightness and contrast, thereby improving the accuracy and robustness of feature clustering. In the following sections, we use $\mathrm{dist}(x_i, c_j)$ to represent the adjusted cosine distance.
Cosine-Harmony Loss: Cosine-Harmony Loss utilizes the adjusted cosine distance to separately compute intra-class distance (the distance between feature vectors and their corresponding class centers) and inter-class distance (the distance between feature vectors and other class centers), providing a more robust measure of feature similarity. The formula for its calculation is as follows:
$\mathrm{intra\_class\_distance} = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{dist}(x_i, c_{y_i})$
where N is the number of samples, $x_i$ is the feature vector of the i-th sample, and $c_{y_i}$ is its corresponding class center.
The inter-class distance is calculated by dividing the sum of distances by the number of classes minus one, to measure the dispersion of features between different classes. The formula for its calculation is as follows:
$\mathrm{inter\_class\_distance} = \dfrac{1}{N(C-1)} \sum_{i=1}^{N} \sum_{j \neq y_i} \mathrm{dist}(x_i, c_j)$
where C is the number of classes.
In addition, to standardize the intra-class distance and eliminate the impact of feature distribution differences, intra-class variance is used to improve the model’s stability and discriminative power, preventing errors caused by distribution discrepancies. The formula for its calculation is as follows:
$\mathrm{class\_variance} = \dfrac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu_{y_i} \right)^2$
where $\mu_{y_i}$ is the mean vector of the feature vectors belonging to class $y_i$. Furthermore, we incorporate a weight parameter α to flexibly balance the intra-class and inter-class distance contributions in the loss function. The final formula for the loss function is as follows:
$\mathrm{loss} = \alpha \cdot \dfrac{\mathrm{intra\_class\_distance}}{\mathrm{class\_variance}} + (1 - \alpha) \cdot \dfrac{1}{\mathrm{inter\_class\_distance}}$
The proposed Cosine-Harmony Loss function demonstrates significant effectiveness in enhancing the discriminative power of feature vectors. By minimizing intra-class distance and maximizing inter-class distance, Cosine-Harmony Loss more effectively clusters features, thereby improving model performance in tasks such as FER.
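The following PyTorch sketch implements the loss as reconstructed above. The use of the class centers as a stand-in for the per-class means $\mu_{y_i}$, the learnable-center setup, and the small ε terms are assumptions; the adjusted cosine distance and the α-weighted combination follow the formulas in this section.

```python
import torch
import torch.nn.functional as F

def adjusted_cosine_distance(x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Pairwise 1 - adjusted cosine similarity between features x (N, D)
    and class centers c (C, D); each vector is centred by its own mean first."""
    x = F.normalize(x - x.mean(dim=1, keepdim=True), dim=1)
    c = F.normalize(c - c.mean(dim=1, keepdim=True), dim=1)
    return 1.0 - x @ c.t()                                    # (N, C)

def cosine_harmony_loss(features, labels, centers, alpha=0.1, eps=1e-8):
    """Variance-normalised intra-class distance weighted by alpha, plus a term that
    grows when the inter-class distance shrinks, so minimising the loss pulls samples
    toward their own center and pushes the other centers away."""
    n, num_classes = features.size(0), centers.size(0)
    dist = adjusted_cosine_distance(features, centers)        # (N, C)
    onehot = F.one_hot(labels, num_classes).bool()

    intra = dist[onehot].mean()                               # distance to own class center
    inter = dist[~onehot].sum() / (n * (num_classes - 1))     # distance to the other centers
    class_var = ((features - centers[labels]) ** 2).sum(dim=1).mean()  # centers as class means (assumption)

    return alpha * intra / (class_var + eps) + (1 - alpha) / (inter + eps)
```

In practice the centers can be kept as a learnable parameter, e.g. torch.nn.Parameter(torch.randn(C, D)), optimised jointly with the network, and the loss is added to the cross-entropy term as described in Section 3.3.2.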

3. Results

In this section, we provide a detailed description of our experimental evaluation results. We demonstrate the superiority of the proposed method on two widely used benchmark datasets. The experimental evaluation began with a series of ablation experiments, analyzing the individual contributions of each component within LFNSB. This allowed us to assess the significance of each component in enhancing the overall performance of LFNSB. Subsequently, we conducted comparative analyses with other state-of-the-art networks.

3.1. Datasets

AffectNet [28]: AffectNet is a comprehensive facial expression recognition (FER) dataset, containing over 1 million facial images, with approximately 440,000 images manually annotated. Due to its extensive collection of labeled facial images, it is one of the largest publicly available FER datasets. It includes two benchmarks: AffectNet-7 and AffectNet-8. AffectNet-7 comprises 283,901 images for training and 3500 images for testing, with seven emotion categories including neutral (74,874), happy (134,415), sad (25,459), surprise (14,090), fear (6378), angry (24,882), and disgust (3803). AffectNet-8 introduces an additional category of “contempt” (3750) and expands the training set to 287,651 images, along with 4000 images for testing. In our experiments, we chose AffectNet for its large size and diverse emotion categories. Its broad range of emotional categories and detailed annotations provide robust support for the training and evaluation of FER models, ensuring that our results are based on a comprehensive and representative dataset.
RAF-DB [29]: RAF-DB (Real-world Affective Faces Database) is a large-scale dataset specifically designed to advance research in facial expression recognition. It includes 29,672 images with strong diversity, covering various factors such as age, gender, and ethnicity. Within RAF-DB, there are 12,271 training samples (surprise (1290), fear (281), disgust (717), happy (4772), sad (1982), angry (705), and neutral (2524)) and 3068 testing samples available for FER. The dataset features diverse conditions including lighting variations, head poses, and occlusions (such as glasses or facial hair). The primary reason for selecting this dataset is its widespread use and high recognition within the field, making it a reliable benchmark for our experiments.

3.2. Implementation Details

During the preprocessing stage, we utilized the RetinaFace model to detect facial regions in the AffectNet and RAF-DB datasets, identifying five key points: both eyes, the nose, and both corners of the mouth. All images were resized to 112 × 112 pixels. We employed several data augmentation techniques to mitigate overfitting, including random horizontal flipping, random rotations and cropping, color normalization, and random pixel erasure. These augmentation strategies enhanced the robustness and generalization capability of the LFNSB model during training.
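The augmentation pipeline described above can be expressed with torchvision transforms roughly as follows; the rotation angle, crop padding, erasing probability, and normalization statistics are illustrative assumptions, since the paper does not list exact values.

```python
from torchvision import transforms

# Training-time augmentation corresponding to the strategies listed above
# (horizontal flip, rotation and cropping, color normalization, random erasure).
train_transform = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomCrop(112, padding=8),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    transforms.RandomErasing(p=0.25),
])

# Evaluation uses only resizing and normalization.
eval_transform = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```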
To ensure a fair comparison with other backbone architectures, the MFN backbone was pre-trained on the Ms-Celeb-1M dataset [30]. All experiments were conducted using the PyTorch 1.8.0+ framework on a server equipped with an NVIDIA TESLA P40 GPU. Our code is open-sourced at https://github.com/1chenchen22/LFNSB (accessed on 21 September 2024).
Training for all tasks used a consistent batch size of 256 over 40 epochs. During training, we applied various optimization strategies to adjust model parameters. Specifically, for the AffectNet-7 and AffectNet-8 datasets, we started with an initial learning rate of 0.0001. On the RAF-DB dataset, we adjusted the learning rate to 0.01. These parameter selections were aimed at optimizing the model effectively to achieve better training efficiency and performance.
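A compact training-loop sketch with the reported hyper-parameters (batch size 256, 40 epochs, initial learning rate 1e-4 for AffectNet and 1e-2 for RAF-DB) is shown below. The choice of Adam, the absence of a learning-rate schedule, and the equal weighting of the two loss terms are assumptions; model, train_set, and cosine_harmony_loss refer to the earlier sketches.

```python
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model, train_set, centers, lr=1e-4, epochs=40, device="cuda"):
    """Minimal sketch: cross-entropy plus Cosine-Harmony Loss, as in Section 3.3.2.
    `centers` is an nn.Parameter of shape (C, D) created on the same device as the model."""
    loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
    ce = nn.CrossEntropyLoss()
    optimizer = optim.Adam(list(model.parameters()) + [centers], lr=lr)  # centers are learnable
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits, features = model(images)                  # the LFNSB sketch returns both
            loss = ce(logits, labels) + cosine_harmony_loss(features, labels, centers)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```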

3.3. Ablation Studies

To validate the effectiveness of each component in the LFNSB model, this section conducted ablation experiments across multiple datasets, demonstrating the generalization capability of the proposed LFNSB model and the effectiveness of its components.

3.3.1. Effectiveness of the LFN

To evaluate the effectiveness of the LFN, this section conducted multiple comparative experiments on the RAF-DB dataset. Table 2 presents the performance of the original MFN and the improved LFN. The LFN achieved 90.22% accuracy on RAF-DB. While its accuracy is slightly lower than MFN’s, the LFN significantly reduces complexity, with a 30.8% decrease in parameters and a 27.1% reduction in FLOPs. This demonstrates that LFN strikes a balance between model complexity and recognition accuracy, providing a lightweight and efficient foundational model for future facial expression recognition tasks.

3.3.2. Effectiveness of the Cosine-Harmony Loss

To validate the effectiveness of the Cosine-Harmony Loss in facial expression recognition tasks, we conducted a series of ablation experiments. Initially, as shown in Table 3, we employed only the CrossEntropyLoss in the proposed LFN backbone, resulting in accuracy rates of 89.57% and 64.26% on the RAF-DB and AffectNet-7 datasets, respectively. In subsequent experiments, we applied both the CrossEntropyLoss and the newly proposed Cosine-Harmony Loss function simultaneously. The accuracy rates improved to 90.22% on RAF-DB and 65.45% on AffectNet-7, representing increases of 0.65% and 1.19%, respectively. These results demonstrate that the combined use of CrossEntropyLoss and Cosine-Harmony Loss yields higher recognition accuracy compared to using CrossEntropyLoss alone.
To determine the optimal alpha value for the Cosine-Harmony Loss function and validate its reliability, we conducted multiple experiments, setting the epochs to 40. We tested different alpha values, including 0.1, 0.2, 0.3, 0.4, and 0.5. In each experiment, we evaluated the model’s performance on the validation set, primarily using validation accuracy and loss value as evaluation metrics. The impact of different α values in the loss function on model performance is shown in Table 4.
From the table, it can be seen that when the alpha value is 0.1, the validation accuracy reaches 90.22%, and the loss value decreases to 0.066, indicating the best model performance. This may be because the inter-class distance is more critical for facial expression recognition tasks, whereas focusing too much on the intra-class distance could lead to overfitting or excessive compression of the feature space. Therefore, we chose an alpha value of 0.1 as the final parameter for the Cosine-Harmony Loss function. Additionally, we plotted the loss curve for alpha = 0.1, as shown in Figure 5. The figure shows a significant decrease in the loss value with the increase in training epochs, indicating good convergence of the model under this setting.

3.3.3. Effectiveness of the LFNSB

To further validate the contribution of the Spatial Bias module, we integrated it into the LFN backbone and utilized the Cosine-Harmony Loss function as proposed, forming the LFNSB model. Experiments were conducted on the RAF-DB and AffectNet-7 datasets. The results, shown in Table 5, indicate performance improvements of 0.85% and 1.12% on RAF-DB and AffectNet-7, respectively. The effect is more pronounced on the larger AffectNet dataset. The Spatial Bias module enhances the model’s ability to capture complex expression features by effectively integrating global information, leading to improved accuracy.

3.4. Quantitative Performance Comparisons

This section presents a quantitative performance comparison of the LFNSB model with other existing models on the AffectNet and RAF-DB datasets, as shown in Table 6, Table 7 and Table 8. The results indicate that the proposed LFNSB model achieves higher recognition accuracy than the average of existing models. Specifically, the LFNSB model reached an accuracy of 66.57% on AffectNet-7, 63.12% on AffectNet-8, and 91.07% on RAF-DB. These results demonstrate the potential of the LFNSB model.

3.5. K-Fold Cross-Validation

To comprehensively evaluate the effectiveness and reliability of LFNSB, we conducted K-fold cross-validation on the RAF-DB and AffectNet-7 datasets, as shown in Table 9. In this process, the dataset is randomly divided into k mutually exclusive subsets of equal size. The model is then trained using k−1 of these subsets, while the remaining subset is used for testing. This procedure is repeated for each subset, and the results are collected to compute the average accuracy. This validation method ensures that all data points are used for both training and testing, effectively reducing the risk of overfitting.
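A sketch of this protocol using scikit-learn's KFold splitter is shown below; build_model, make_centers, train, and evaluate are hypothetical helpers standing in for the routines sketched earlier.

```python
import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import Subset

def k_fold_evaluation(full_dataset, k=10, seed=42):
    """Train on k-1 folds and test on the held-out fold, then average the accuracies."""
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    accuracies = []
    indices = np.arange(len(full_dataset))
    for fold, (train_idx, test_idx) in enumerate(kfold.split(indices), start=1):
        train_subset = Subset(full_dataset, train_idx.tolist())
        test_subset = Subset(full_dataset, test_idx.tolist())
        model = build_model()                                # hypothetical factory for a fresh LFNSB model
        train(model, train_subset, centers=make_centers())   # hypothetical helpers (see earlier sketch)
        acc = evaluate(model, test_subset)                   # hypothetical accuracy evaluation
        accuracies.append(acc)
        print(f"Fold {fold}: {acc:.2f}%")
    print(f"Average accuracy: {np.mean(accuracies):.2f}%")
```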
The results in Table 9 show that LFNSB achieved an average accuracy of 90.64% on the RAF-DB dataset and 65.72% on the AffectNet-7 dataset. These results indicate that the model maintains consistently high performance, further validating the reliability and robustness of the LFNSB model in facial expression recognition tasks.

3.6. Confusion Matrix

In classification tasks, a confusion matrix is commonly used to evaluate a classification model’s performance across different categories, providing a comprehensive understanding of the model’s effectiveness. This section presents the confusion matrices for the LFNSB model tested on three datasets, as shown in Figure 6a–c. From these confusion matrices, it can be observed that the “happy” category consistently exhibits the highest recognition rates across all three datasets, likely because it is the most prevalent category in each dataset.
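A per-class matrix such as those in Figure 6 can be produced with scikit-learn, for example as in the sketch below; model, test_loader, and class_names are placeholders for the trained LFNSB model and the evaluation data.

```python
import torch
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

@torch.no_grad()
def plot_confusion(model, test_loader, class_names, device="cuda"):
    """Collect predictions and render a row-normalized confusion matrix (in %)."""
    model.eval().to(device)
    preds, targets = [], []
    for images, labels in test_loader:
        logits, _ = model(images.to(device))          # the LFNSB sketch returns (logits, features)
        preds.append(logits.argmax(dim=1).cpu())
        targets.append(labels)
    cm = confusion_matrix(torch.cat(targets).numpy(), torch.cat(preds).numpy(), normalize="true")
    ConfusionMatrixDisplay(cm * 100, display_labels=class_names).plot(values_format=".1f")
```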
Figure 6b,c reveal that categories such as “disgust”, “angry”, and “contempt” perform poorly in the AffectNet dataset. This underperformance is due to the small number of samples for these categories, which are only in the thousands, compared to other categories with tens of thousands of samples. Additionally, the data quality for these categories is lower, with noticeable noise interference. Furthermore, the AffectNet dataset suffers from class imbalance, which may impact the recognition accuracy of various categories.
To address issues related to dataset imbalance and data quality, future research will explore methods such as label distribution learning and correction, as suggested in [10,17], to mitigate these problems. From Figure 6a, it can be observed that the recognition accuracy for the disgust and fear expressions in the RAF-DB dataset is relatively low, at 74.38% and 67.57%, respectively. Further analysis reveals that fear is often misclassified as surprise or angry, which may be due to the similarities in certain features among these expressions. Additionally, disgust is frequently misclassified as sad, indicating that the model has difficulty distinguishing between these two categories. To improve the model’s performance, we plan to implement further dual-view data augmentation for the fear and disgust categories in the future to increase the sample size.
The experimental results shown in Figure 6 validate the effectiveness of the method proposed in Section 2.2 and demonstrate the accuracy rates for each facial expression category. Regarding computational efficiency, when using a pre-trained model, training on the RAF-DB dataset requires approximately 15 min, while training on the AffectNet dataset takes about 8 h. This highlights the LFNSB model’s efficiency in terms of training time, which is crucial for practical applications.

4. Conclusions

The LFNSB model for facial expression recognition (FER) addresses the complexity issues of existing models, making it suitable for mobile devices with limited computational resources. It incorporates Conv2d_BN, RepVGGDW, and Spatial Bias modules to improve feature extraction while reducing the computational load. The proposed Cosine-Harmony Loss function optimizes class centers, enhancing feature clustering and model generalization. This allows the LFNSB model to maintain a good balance between parameter count and computational complexity. However, the model has limitations, such as susceptibility to class imbalance in datasets, which can impact its accuracy. There may also be other factors contributing to the model’s performance, which has not yet reached state-of-the-art (SOTA) levels. Future work will focus on improving accuracy through strategies like dual-view enhancement and exploring the model’s real-time performance on various devices. We will also further validate the model’s generalizability and adaptability, particularly in handling class imbalance and diverse data qualities. Ongoing research will aim to enhance the model’s capabilities, ensuring its effectiveness in real-world FER applications (Appendix A).

Author Contributions

Conceptualization, X.C. and L.H.; methodology, X.C. and L.H.; software, X.C.; validation, X.C.; formal analysis, X.C.; investigation, X.C.; resources, X.C.; data curation, X.C.; writing—original draft preparation, X.C.; writing—review and editing, X.C. and L.H.; visualization, X.C.; supervision, L.H.; project administration, X.C.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the General Program of the Natural Science Foundation of Guangxi (No. 2023GXNSFAA026347), the Central Government Guidance Funds for Local Scientific and Technological Development, China (No. Guike ZY22096024), and the University-Industry Collaborative Education Program of Ministry of Education, China (No. 230702496270001).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Application of the LFNSB Model in a Real-Time Music Recommendation System

Facial expressions are a crucial means of expressing and recognizing emotions, and emotional recognition and management are essential for mental health [41]. Negative emotions can have adverse effects on physical and mental well-being, while music can help relax the mind and improve sleep [42]. Therefore, we applied the LFNSB facial expression recognition model proposed in this paper to an open-source music recommendation system based on facial expression recognition for performance evaluation and testing. This system is based on an open-source project available at https://github.com/aj-naik/Emotion-Music-Recommendation (accessed on 16 August 2024) and uses the Flask framework for the backend and OpenCV for real-time capture of user facial expressions. Our integration work involved the following key steps:
System Integration: We replaced the original model in the open-source music recommendation system with our LFNSB model. This system utilizes the Flask framework to handle backend logic, and Flask, as a lightweight web framework, provides stable and flexible services.
Real-Time Expression Capture: The system uses OpenCV to process real-time video streams and capture user facial expressions. The use of the OpenCV library allows the system to efficiently handle video stream data and input the captured facial images into the LFNSB model for expression recognition.
Personalized Recommendations: After analyzing the user’s real-time expressions and identifying their emotional state, the LFNSB model enables the system to generate a customized music recommendation list. The user’s emotional state is used to adjust the recommendation algorithm, thereby providing more personalized music suggestions.
Results Display: The system dynamically updates music recommendations based on the real-time expressions captured by the camera and analyzed by the LFNSB model. The recommendations are derived from a predefined playlist within the recommendation system, ensuring that the music suggested aligns with the user’s current emotional state. Figure A1 shows a snapshot of the web page, displaying both the captured facial expression and the recommended music list from the playlist. This visual representation illustrates how the system integrates facial expression analysis to provide music suggestions. In future work, we plan to further explore and expand the application of facial expression recognition models in various domains, evaluate their performance in different real-world scenarios, and optimize both the model and the overall system performance.
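For completeness, the capture-and-classify loop can be sketched as follows. The emotion label list, the preprocessing, and the recommend_music stub are placeholders; a production system would also crop the detected face region (e.g. with RetinaFace) before classification, which is omitted here for brevity.

```python
import cv2
import torch
from torchvision import transforms

EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "angry"]  # ordering is an assumption
preprocess = transforms.Compose([transforms.ToPILImage(), transforms.Resize((112, 112)),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.5] * 3, [0.5] * 3)])

@torch.no_grad()
def run_webcam(model, recommend_music, device="cpu"):
    """Read frames, classify the expression, and pass the label to the recommender stub."""
    model.eval().to(device)
    cap = cv2.VideoCapture(0)                          # default webcam
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # face cropping omitted for brevity
        logits, _ = model(preprocess(rgb).unsqueeze(0).to(device))
        emotion = EMOTIONS[logits.argmax(dim=1).item()]
        recommend_music(emotion)                       # stub: updates the playlist shown on the web page
        cv2.putText(frame, emotion, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("LFNSB demo", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```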
Figure A1. The system recommends music that matches the user’s current emotions based on real-time facial expression analysis.

References

  1. Banerjee, R.; De, S.; Dey, S. A survey on various deep learning algorithms for an efficient facial expression recognition system. Int. J. Image Graph. 2023, 23, 2240005. [Google Scholar] [CrossRef]
  2. Sajjad, M.; Ullah, F.U.M.; Ullah, M.; Christodoulou, G.; Cheikh, F.A.; Hijji, M.; Muhammad, K.; Rodrigues, J.J. A comprehensive survey on deep facial expression recognition: Challenges, applications, and future guidelines. Alex. Eng. J. 2023, 68, 817–840. [Google Scholar] [CrossRef]
  3. Adyapady, R.R.; Annappa, B. A comprehensive review of facial expression recognition techniques. Multimed. Syst. 2023, 29, 73–103. [Google Scholar] [CrossRef]
  4. Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A dual-direction attention mixed feature network for facial expression recognition. Electronics 2023, 12, 3595. [Google Scholar] [CrossRef]
  5. Tan, M.; Le, Q.V. Mixconv: Mixed depthwise convolutional kernels. In Proceedings of the 30th British Machine Vision Conference 2019, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  6. Go, J.; Ryu, J. Spatial bias for attention-free non-local neural networks. Expert Syst. Appl. 2024, 238, 122053. [Google Scholar] [CrossRef]
  7. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215. [Google Scholar] [CrossRef]
  8. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 16–17 June 2019; pp. 4690–4699. [Google Scholar]
  9. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5265–5274. [Google Scholar]
  10. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Altaher, A.; Salekshahrezaee, Z.; Abdollah Zadeh, A.; Rafieipour, H.; Altaher, A. Using multi-inception CNN for face emotion recognition. J. Bioeng. Res. 2020, 3, 1–12. [Google Scholar]
  13. Xue, F.; Wang, Q.; Tan, Z.; Ma, Z.; Guo, G. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Trans. Affect. Comput. 2022, 14, 3244–3256. [Google Scholar] [CrossRef]
  14. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920. [Google Scholar]
  15. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  16. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  17. Chen, S.; Liu, Y.; Gao, X.; Han, Z. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In Biometric Recognition: 13th Chinese Conference, CCBR 2018, Urumqi, China, August 11–12, 2018, Proceedings 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 428–438. [Google Scholar]
  18. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  19. You, Q.; Jin, H.; Luo, J. Visual sentiment analysis by attending on local image regions. In Proceedings of the AAAI Conference on Artificial Intelligence 2017, San Francisco, CA, USA, 4–9 February 2017; pp. 231–237. [Google Scholar]
  20. Zhao, S.; Jia, Z.; Chen, H.; Li, L.; Ding, G.; Keutzer, K. PDANet: Polarity-consistent deep attention network for fine-grained visual emotion regression. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 192–201. [Google Scholar]
  21. Farzaneh, A.H.; Qi, X. Facial expression recognition in the wild via deep attentive center loss. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2021, Virtual, 5–9 January 2021; pp. 2402–2411. [Google Scholar]
  22. Li, Y.; Lu, Y.; Li, J.; Lu, G. Separate loss for basic and compound facial expression recognition in the wild. In Proceedings of the Asian Conference on Machine Learning 2019, Nagoya, Japan, 17–19 November 2019; pp. 897–911. [Google Scholar]
  23. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515. [Google Scholar]
  24. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. Biomimetics 2023, 8, 199. [Google Scholar] [CrossRef] [PubMed]
  25. Nguyen, H.V.; Bai, L. Cosine similarity metric learning for face verification. In Proceedings of the Asian Conference on Computer Vision 2010, Queenstown, New Zealand, 8–12 November 2010; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  26. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  27. Liu, Y.; Li, H.; Wang, X. Learning deep features via congenerous cosine loss for person recognition. arXiv 2017, arXiv:1702.06890. [Google Scholar]
  28. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Collecting large, richly annotated facial expression databases from movies. IEEE Multimed. 2012, 19, 34–41. [Google Scholar] [CrossRef]
  29. Li, S.; Deng, W. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Trans. Image Process. 2018, 28, 356–370. [Google Scholar] [CrossRef] [PubMed]
  30. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 87–102. [Google Scholar]
  31. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 6897–6906. [Google Scholar]
  33. Vo, T.-H.; Lee, G.-S.; Yang, H.-J.; Kim, S.-H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 2020, 8, 131988–132001. [Google Scholar] [CrossRef]
  34. Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
  35. Wagner, N.; Mätzler, F.; Vossberg, S.R.; Schneider, H.; Pavlitska, S.; Zöllner, J.M. CAGE: Circumplex Affect Guided Expression Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 4683–4692. [Google Scholar]
  36. Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. Mvt: Mask vision transformer for facial expression recognition in the wild. arXiv 2021, arXiv:2106.04520. [Google Scholar]
  37. Zhao, Z.; Liu, Q.; Wang, S. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Trans. Image Process. 2021, 30, 6544–6556. [Google Scholar] [CrossRef] [PubMed]
  38. Chen, Y.; Wang, J.; Chen, S.; Shi, Z.; Cai, J. Facial motion prior networks for facial expression recognition. In Proceedings of the 2019 IEEE Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2019; pp. 1–4. [Google Scholar]
  39. Farzaneh, A.H.; Qi, X. Discriminant distribution-agnostic loss for facial expression recognition in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2020, Seattle, WA, USA, 14–19 June 2020; pp. 406–407. [Google Scholar]
  40. Zhang, W.; Ji, X.; Chen, K.; Ding, Y.; Fan, C. Learning a Facial Expression Embedding Disentangled from Identity. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 6755–6764. [Google Scholar]
  41. Shaila, S.G.; Gurudas, V.R.; Rakshita, R.; Shangloo, A. Music therapy for mood transformation based on deep learning framework. In Computer Vision and Robotics: Proceedings of CVR 2021; Springer: Singapore, 2022; pp. 35–47. [Google Scholar]
  42. Shaila, S.G.; Rajesh, T.M.; Lavanya, S.; Abhishek, K.G.; Suma, V. Music therapy for transforming human negative emotions: Deep learning approach. In Proceedings of the International Conference on Recent Trends in Computing: ICRTC 2021, Delhi, India, 4–5 June 2021; Springer: Singapore, 2022; pp. 99–109. [Google Scholar]
Figure 1. A simplified view of the LFNSB model architecture, which includes two main components: LFN and SB. First, facial images are input into the LFN, which processes the images and extracts basic facial feature maps. Subsequently, these feature maps are processed by the Spatial Bias module, which accentuates the global features related to facial expressions. Additionally, the LFN module utilizes a loss function we proposed, the Cosine-Harmony Loss, to improve the discriminative power of the feature vectors. Finally, the enhanced features are combined and passed through a fully connected layer to predict the expression categories of the images.
Figure 2. The fuse method of Conv2d_BN is used to merge the convolutional and batch normalization layers, reducing the number of parameters and potentially enhancing computational efficiency.
Figure 3. The RepVGGDW module optimizes computational efficiency and reduces parameter count by integrating 3 × 3 depth-wise convolution, 1 × 1 convolution, and batch normalization while maintaining effective feature extraction capabilities.
Figure 4. A diagram of the Spatial Bias module, which reduces the feature map size through 1 × 1 convolution and pooling, followed by aggregating spatial information using 1D convolution.
Figure 5. Loss curve with α = 0.1 over epochs.
Figure 6. The confusion matrix of the LFNSB tested on different datasets. (a) RAF-DB; (b) AffectNet-7; (c) AffectNet-8.
Table 1. The proposed LFN architecture. In the table, n refers to the number of repetitions, c refers to output channels, t refers to the expansion factor, and s refers to the stride.
Input | Operator | t | c | n | s
112 × 112 × 3 | Conv2d_BN | - | 64 | 1 | 2
56 × 56 × 64 | depth-wise Conv2d_BN | - | 64 | 1 | 1
56 × 56 × 64 | bottleneck (MixConv 3 × 3, 5 × 5) | 2 | 64 | 1 | 2
28 × 28 × 64 | bottleneck (MixConv 3 × 3) | 2 | 128 | 9 | 1
28 × 28 × 128 | bottleneck (MixConv 3 × 3, 5 × 5) | 4 | 128 | 1 | 2
14 × 14 × 128 | bottleneck (MixConv 3 × 3) | 2 | 128 | 16 | 1
14 × 14 × 128 | bottleneck (MixConv 3 × 3, 5 × 5, 7 × 7) | 8 | 256 | 1 | 2
7 × 7 × 256 | bottleneck (MixConv 3 × 3, 5 × 5) | 2 | 256 | 6 | 1
7 × 7 × 256 | RepVGGDW | - | 256 | 1 | 1
7 × 7 × 256 | linear GDConv 7 × 7 | - | 256 | 1 | 1
1 × 1 × 256 | Linear | - | 256 | 1 | 1
Table 2. Evaluation (%) of the LFN and other networks on RAF-DB.
Methods | Accuracy (%) | Params | FLOPs
MobileFaceNet | 87.52 | 1.148 M | 230.34 M
MFN | 90.32 | 3.973 M | 550.74 M
LFN (ours) | 90.22 | 2.749 M | 401.48 M
Table 3. Ablation studies for the loss function in the LFN.
Methods | RAF-DB | AffectNet-7
CrossEntropyLoss | 89.57 | 64.26
CrossEntropyLoss + Cosine-Harmony Loss | 90.22 | 65.45
Table 4. The impact of different α values in the loss function on model performance.
α | Accuracy | Loss
0.1 | 90.22% | 0.066
0.2 | 90.12% | 0.083
0.3 | 89.96% | 0.137
0.4 | 90.03% | 0.096
0.5 | 89.86% | 0.161
Table 5. Evaluation (%) of the LFN and the LFNSB on RAF-DB and AffectNet-7.
Model | RAF-DB | AffectNet-7
LFN | 90.22 | 65.45
LFNSB | 91.07 | 66.57
These ablation experiments confirm the effectiveness and reliability of both the LFN and the Spatial Bias module.
Table 6. Performance comparison on the RAF-DB dataset.
Methods | Accuracy (%)
Separate-Loss [22] | 86.38
RAN [31] | 86.90
SCN [32] | 87.03
DACL [21] | 87.78
APViT [13] | 91.98
DDAMFN [4] | 91.34
DAN [24] | 89.70
LFNSB (ours) | 91.07
Table 7. Performance comparison on the AffectNet-8 dataset.
Methods | Accuracy (%)
PSR [33] | 60.68
Multi-task EfficientNet-B0 [34] | 61.32
DAN [24] | 62.09
CAGE [35] | 62.3
MViT [36] | 61.40
MA-Net [37] | 60.29
DDAMFN [4] | 64.25
LFNSB (ours) | 63.12
Table 8. Performance comparison on the AffectNet-7 dataset.
Methods | Accuracy (%)
Separate-Loss [22] | 58.89
FMPN [38] | 61.25
DDA-Loss [39] | 62.34
DLN [40] | 63.7
CAGE [35] | 67.62
DAN [24] | 65.69
DDAMFN [4] | 67.03
LFNSB (ours) | 66.57
Table 9. The results of K-fold cross-validation.
Dataset | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 | Average
RAF-DB | 90.48 | 90.22 | 90.61 | 90.35 | 90.48 | 90.12 | 91.07 | 90.65 | 90.32 | 90.16 | 90.34
AffectNet-7 | 65.71 | 66.57 | 66.11 | 64.69 | 65.65 | 65.45 | 65.61 | 65.25 | 65.12 | 66.03 | 65.72
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
