1. Introduction
Emotion recognition using artificial intelligence relies on computational methods that analyze human emotional states from various input sources. These include physical signals such as voice, facial expressions, and body posture, as well as physiological signals such as heart rate and brain activity, which are captured through microphones, cameras, and biosensors [
1]. Physiological signals such as electroencephalography (EEG) have shown promising results in emotion recognition because they can capture affective states through direct measurements of brain activity or by combining them with other modalities. Recent studies highlight the importance of spatiotemporal representation fusion to model the dynamic evolution of neural activation patterns associated with emotional states [
2]. Similarly, Liu et al. [
3] examined emotional recovery by analyzing variations in EEG frequency-band activity when individuals were exposed to natural sound environments, demonstrating the relevance of neural signals for monitoring affective responses. However, the acquisition of EEG and other physiological signals requires dedicated instrumentation, calibration, and noise-controlled environments, which restrict their deployment in everyday settings, affective computing frameworks, and real-world evaluations of occupational dissonance [
4]. Considering this, facial expression recognition has gained significant attention due to the ubiquity of imaging sensors and their non-intrusive nature, providing a widely applicable and accessible channel for emotion analysis. Although facial expressions can be consciously controlled or feigned, microexpressions provide reliable signals of genuine affective states [
5,
6,
7]. Recent studies provide empirical evidence that image sensors, combined with image processing and neural network models, are effective for detecting facial expressions and microexpressions, enabling fast, unobtrusive affective monitoring in real-world environments [
8]. Such approaches are being applied across various domains, including occupational health and emotional dissonance monitoring [
9,
10], as well as adaptive learning, automotive safety, and customer service systems [
5,
7].
The literature reports several real-time facial emotion recognition (FER) methods developed using image datasets that serve as benchmarks for evaluating algorithms that integrate computer vision and artificial intelligence with camera and sensor systems. These datasets capture facial expressions from individuals in both controlled laboratory conditions and real-world scenarios. Among the most widely used and well-established datasets are the Extended Cohn–Kanade Dataset (CK+) [
11], the Facial Expression Recognition 2013 (FER2013) [
12], and the Karolinska Directed Emotional Faces (KDEF) [
13].
For example, Yang et al. proposed an algorithm that interprets facial Action Units (expressions) to infer emotions using the Facial Action Coding System (FACS) and a Convolutional Neural Network (CNN) [
14]. They evaluated their algorithm in real time using the CK+ dataset and reported an average accuracy of 92.4%. The models proposed in [
15,
16,
17] are CNN-based architectures for real-time FER applications that achieved accuracies of 95.65% [
15], 98.5% [
16], and 98% [
17] using the CK+ dataset. Happy and Routray [
18] proposed a method based on Local Binary Patterns (LBP) and Support Vector Machine (SVM) for real-time FER, achieving an accuracy of 94% on the CK+ dataset. Gupta et al. designed a system to assess student engagement in real-time online learning environments through FER [
19]. Their approach employed Inception-V3, VGG-19, and ResNet-50, trained on the CK+ dataset. Among these networks, ResNet-50 achieved the highest accuracy (92.32%), followed by VGG-19 (90.14%) and Inception-V3 (89.11%). The models of [
20,
21] are CNN architectures trained and evaluated on FER2013, achieving accuracies of 75.1% [
20] and 63.2% [
21]. Sholikah et al. developed a real-time mobile application based on VGG16 to support children with autism [
22]. The application achieved an accuracy of 91% on FER2013, demonstrating its effectiveness in understanding emotional learning in autistic children. Shesu et al. introduced LEmo, a method for facial emotion categorization that is resistant to adversarial and distractor attacks while maintaining low computational requirements, making it suitable for near real-time applications [
23]. LEmo achieved an accuracy of 97.86% on CK+ and 85% on KDEF. Finally, Hussain et al. proposed a method based on the VGG16 architecture and the KDEF dataset, reporting an accuracy of 88% [
24].
Numerous approaches have explored methods based on CNN, traditional machine learning, or transformers for FER, which are also trained and evaluated on FER2013, CK+, or KDEF, achieving competitive performance in facial emotion recognition. Although these methods achieve strong results with only facial expressions, they are generally not designed for real-time processing or deployment on embedded or ubiquitous systems. Nevertheless, they provide valuable baselines and design references for the development of lightweight approaches that enable real-time facial expression recognition on embedded platforms. For example, Ullah et al. [
25] proposed a system composed of Canonical Correlation Analysis, a CNN, and a Long Short-Term Memory (LSTM) that achieved an accuracy of 91.42% on the CK+ dataset. The models in [
26,
27,
28] are CNNs that reported accuracies of 99.25%, 93.24%, and 98.38%, respectively. Other CNN-based approaches were trained and evaluated on the CK+ and FER2013 datasets [
29,
30,
31,
32]. The CNN models proposed in [
29,
30] achieved competitive results: [
29] reported accuracies of 94% on CK+ and 73% on FER2013, while [
30] reported accuracies of 96% on CK+ and 71% on FER2013. The method in [
31] integrates CNN, LBP, and an attention module, achieving accuracies of 98.9% on CK+ and 75.82% on FER2013. The algorithm in [
32] is a CNN that incorporates deep reinforcement learning, reaching accuracies of 94.1% on CK+ and 72.1% on FER2013. The model presented in [
33] is a CNN architecture that achieved an accuracy of 88.56% on FER2013. Haider et al. [
34] presented a hybrid model that extracts features using a CNN and classifies them with an SVM, achieving 74.64% accuracy on FER2013. Akhand et al. [
35] proposed a CNN-based pipeline developed through an evaluation of several pre-trained architectures, including VGG16/19, ResNet-18/50/152, Inception-V3, and DenseNet-161. Their experiments on the KDEF and JAFFE datasets using 10-fold cross-validation showed that DenseNet-161 performed best, achieving 96.51% accuracy on KDEF and 99.52% on JAFFE.
Finally, Khan et al. [
36] introduced an ensemble framework combining ResNet-50 and Inception-V3 within an attention-enhanced deep ensemble network, which reached 99.3% accuracy on KDEF and 78.6% on FER2013 under 10-fold cross-validation. Sen et al. [
37] proposed a method that combines LBP with an SVM classifier, achieving an accuracy of 91.11% on the CK+ dataset. Kumar et al. [
38] introduced a model that integrates the wavelet transform, multivariate analysis, and a Fuzzy SVM, which achieved 98.9% accuracy on CK+. More recently, Nawaz et al. [
39] proposed a transformer-based architecture that dynamically focuses on the most informative facial regions via adaptive attention mechanisms. Their model also implements dynamic token pruning to eliminate less relevant tokens during intermediate processing stages. This transformer-based approach achieved state-of-the-art performance, with 99.67% accuracy on FER2013 and 99.52% on CK+.
The reviews presented in [
1,
5], along with our literature analysis, highlight the continuous progress in facial emotion recognition using transformer-based and CNN-based approaches, such as Vision Transformers (ViT), attention mechanisms, Inception, VGG, ResNet, MobileNet, and Xception. Although these methods achieve high accuracy, they typically require large training datasets and rely on architectures with millions of parameters, as well as substantial computational resources, which limits their deployment on embedded platforms and other low-power, resource-constrained systems.
Other approaches aim to reduce model complexity, but they often report lower accuracy or are validated on a limited number of datasets, making it difficult to assess their generalization and robustness in real-world sensor-based environments. In addition, research addressing occupational environments remains scarce, despite their relevance for applications such as stress measurement and safety monitoring. These limitations can be addressed by developing novel architectures that achieve competitive recognition performance while maintaining low computational cost, thereby enabling real-time deployment on embedded platforms. Therefore, this paper proposes the Lightweight Expression Recognition Network (LiExNet), a novel neural architecture that achieves consistent learning and generalization across multiple datasets with heterogeneous facial information—including frontal faces captured under controlled conditions as well as multi-angle faces in unconstrained environments—attaining top accuracy while remaining suitable for real-time deployment on embedded and ubiquitous image-sensor systems [
40]. The main contributions of LiExNet are:
Presents a lightweight deep neural network with only 42,000 parameters, combining standard and depthwise separable convolutions with an efficient channel attention mechanism, specifically designed to enable real-time inference on embedded sensor platforms with minimal computational resources.
Provides an extensive evaluation on three widely used facial expression datasets (CK+, KDEF, FER2013) and on EMOTION-ITCH, a novel dataset focused on emotion recognition under occupational stress conditions.
Achieves state-of-the-art accuracy (99.5% on CK+) with only 86 MFLOPs, a 0.17 MB model memory footprint, a 2 MB peak inference memory, and real-time performance above 530 FPS on a Jetson TX2 (NVIDIA Corporation, Santa Clara, CA, USA).
Introduces EMOTION-ITCH as a new public dataset that includes facial expressions from both working and non-working participants, supporting the measurement and analysis of emotional responses in industrial environments and enabling fine-tuning experiments that demonstrate improvements when occupational stress information is incorporated.
The rest of the paper is organized as follows:
Section 2 presents the LiExNet model,
Section 3 describes the datasets used for experiments, and
Section 4 presents the results. Finally,
Section 5 provides the conclusions.
2. Lightweight Expression Recognition Network
According to the state-of-the-art analysis presented in Section 1, existing FER methods either achieve high accuracy at a computational cost that is prohibitive for embedded deployment, or reduce complexity at the expense of accuracy and cross-dataset validation.
Based on these observations, LiExNet was designed as a compact CNN architecture enhanced with an efficient channel attention mechanism to better focus on expression and microexpression features.
Figure 1 presents the overall architecture of LiExNet, organized into the following stages: input layer, feature extraction block, flattening layer, classification block, and output layer.
The input layer receives a facial expression image, either sourced from benchmark datasets or captured in real time through a camera or other imaging sensor. The feature extraction stage comprises four blocks, each containing an efficient channel attention (ECA) module and convolutional layers equipped with ReLU activations, batch normalization, and max pooling. The output of this stage is a set of feature representations, which are subsequently converted into a one-dimensional feature vector by the flatten layer. The classification block then processes this feature vector to determine the corresponding emotion class conveyed by the input image. The output layer produces the final predicted emotion, which can be utilized in real-time applications. The following subsections provide a detailed description of LiExNet.
2.1. Input
The input to LiExNet is a color image in 8-bit integer format, which is transformed into a normalized floating-point tensor, where $(m, n)$ denotes the pixel position and $r$ is the depth index. This tensor has values scaled to a normalized range and dimensions of $M \times N$ with a depth of $R = 3$, one channel per color component.
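As an illustration, this preprocessing step can be implemented in a few lines. The [0, 1] scaling and the channel-first layout below are assumptions chosen to match common CNN input conventions; the exact normalization used by the authors is not specified here.

```python
import numpy as np
import torch

def preprocess(frame_uint8: np.ndarray) -> torch.Tensor:
    """Convert an 8-bit RGB image (M x N x 3) into a float tensor for the network.

    The [0, 1] scaling and the channel-first (3, M, N) layout are assumptions.
    """
    x = torch.from_numpy(frame_uint8.astype(np.float32) / 255.0)  # scale to [0, 1]
    return x.permute(2, 0, 1)                                     # (M, N, 3) -> (3, M, N)
```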
2.2. Feature Extraction Blocks
The feature extraction stage consists of four sequential blocks that share the same architecture.
Each block begins with a standard convolutional sublayer that extracts local abstract features, mathematically defined as:

$$h^{(l)} = \phi\!\left( \sum_{r=1}^{R} W^{(l)}_{r} * h^{(l-1)}_{r} + b^{(l)} \right) \qquad (1)$$

where $l$ is the layer index, $R$ is the depth, $W^{(l)}$ represents the weights, $h^{(l-1)}$ is the output of the previous layer, and $b^{(l)}$ is the bias term. The activation function $\phi$ is the Rectified Linear Unit (ReLU), which helps prevent vanishing gradients. The input of the first block is the input tensor described in Section 2.1. The subsequent sublayer is batch normalization, denoted as $\mathrm{BN}(\cdot)$. The ReLU and batch normalization operations are described in [41]. Applying batch normalization after ReLU achieved faster convergence than other possible sublayer orderings within the block.
The next layer in each block is a depthwise convolution used for parameter reduction, defined as:

$$h^{(l)}_{r} = \phi\!\left( \hat{W}^{(l)}_{r} * h^{(l-1)}_{r} + b^{(l)}_{r} \right) \qquad (2)$$

where $\hat{W}^{(l)}_{r}$ are the weights for channel $r$. As in the previous layer, the activation function $\phi$ is ReLU, followed by batch normalization $\mathrm{BN}(\cdot)$.
The next layer is an efficient channel attention (ECA) module, a lightweight attention mechanism designed to emphasize the channels that carry the most relevant features by incorporating channel-wise attention with minimal computational overhead. This module consists of global average pooling, a one-dimensional convolution over channels, and channel-wise recalibration. The global average pooling operation, denoted as $\mathrm{GAP}(\cdot)$, is described in [41], and the one-dimensional convolution is expressed as follows:

$$\alpha = \sigma\!\left( \mathrm{Conv1D}_{k}\!\left( g \right) \right),$$

where $g = \mathrm{GAP}(h^{(l-1)})$ represents a column from the previous layer (one value per channel), and the activation function $\sigma$ is the sigmoid function. The following equation determines the kernel size:

$$k = \left| \frac{\log_{2}(R)}{\gamma} + \frac{f}{\gamma} \right|_{\mathrm{odd}},$$

where $\gamma$ and $f$ are hyperparameters that were set experimentally. Channel recalibration estimates the importance of each channel by reweighting it as follows:

$$\tilde{h}^{(l)}_{r} = \alpha_{r}\, h^{(l-1)}_{r},$$

where $\alpha_{r}$ is the scalar weight assigned to each channel $r$.
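A minimal PyTorch sketch of such an ECA module is shown below. The kernel-size heuristic follows the original ECA formulation with default hyperparameters (here gamma = 2 and f = 1), since the values used in this work are not listed above; the class name and defaults are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP -> 1-D conv across channels -> sigmoid gating."""

    def __init__(self, channels: int, gamma: int = 2, f: int = 1):
        super().__init__()
        # Kernel size grows with log2 of the channel count and is forced to be odd.
        t = int(abs((math.log2(channels) + f) / gamma))
        k = t if t % 2 else t + 1
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> one descriptor per channel, shape (B, 1, C)
        g = self.gap(x).squeeze(-1).transpose(-1, -2)
        alpha = self.sigmoid(self.conv(g))             # channel weights in (0, 1)
        alpha = alpha.transpose(-1, -2).unsqueeze(-1)  # reshape back to (B, C, 1, 1)
        return x * alpha                               # channel-wise recalibration
```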
The final layer of each block is a standard convolution used to extract local features from the channels identified as most relevant. This layer is defined in Equation (1), with the recalibrated output of the ECA module as its input. The activation function $\phi$ is ReLU, followed by a max pooling operation, represented as $\mathrm{MP}(\cdot)$. Max pooling is used to reduce overfitting and computational cost, as described in [41].
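The sublayer ordering described above can be sketched as a single reusable block. The 3 x 3 kernels and the 2 x 2 pooling window below are assumptions (they are not specified in the text); the ReLU-before-batch-normalization ordering and the single depthwise kernel per channel follow the description.

```python
import torch.nn as nn

class FeatureBlock(nn.Module):
    """One LiExNet-style feature-extraction block (sketch):
    conv -> ReLU -> BN, depthwise conv -> ReLU -> BN, ECA, conv -> ReLU -> max pool."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)   # Eq. (1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.dwconv = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1,
                                groups=out_ch)                            # Eq. (2), one kernel per channel
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.eca = ECA(out_ch)                                            # attention module (previous sketch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)                           # assumed 2x2 window
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.bn1(self.relu(self.conv1(x)))    # BN applied after ReLU, as stated in the text
        x = self.bn2(self.relu(self.dwconv(x)))
        x = self.eca(x)
        return self.pool(self.relu(self.conv2(x)))
```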
The convolution and depthwise layers follow the initial design principle of the MobileNet architecture, whereas the hierarchical stacked-block structure is inspired by VGG16. Depthwise convolutions significantly reduce FLOPs and the number of parameters, while the ECA module introduces channel-wise attention at minimal computational cost. The number of filters increases progressively across the four convolutional blocks to expand the feature space, whereas each depthwise layer maintains a single kernel per channel. The network depth was determined empirically, based on the hypothesis that increasing depth across the four convolutional blocks allows the model to capture progressively more abstract and discriminative facial representations, thereby enhancing overall recognition performance.
The first feature extraction block uses 24 convolutional kernels to separate the face from the background, reducing the amount of irrelevant information captured by the sensor. The second block uses 48 kernels and emphasizes higher-level facial components that are essential for robust recognition under varying acquisition conditions. The third block uses 72 kernels and focuses on expression-related regions (Action Units), including the cheeks, chin, mouth, and the T-shaped area between the eyes and nose. The fourth block uses 96 kernels and captures subtle microexpression features, which contribute to the interpretation of authentic emotional states.
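Stacking four such blocks with the filter counts listed above yields the feature extractor described in this section; the helper name below is illustrative, and the blocks reuse the FeatureBlock sketch given earlier.

```python
import torch.nn as nn

def build_feature_extractor() -> nn.Sequential:
    """Four feature-extraction blocks with 24, 48, 72, and 96 filters (Section 2.2)."""
    widths = [24, 48, 72, 96]
    blocks, in_ch = [], 3                        # RGB input
    for w in widths:
        blocks.append(FeatureBlock(in_ch, w))    # FeatureBlock from the previous sketch
        in_ch = w
    return nn.Sequential(*blocks)
```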
2.3. Flattened Layer
The flatten layer transforms the output of the last feature-extraction block into a one-dimensional feature vector $v$, whose length equals $M \times N \times R$, with $M$ denoting the number of rows, $N$ the number of columns, and $R$ the depth of the final feature maps.
2.4. Classification Block
The classification block consists of two fully connected (FCN) layers. The first layer in this block is defined as:

$$z^{(1)} = \phi\!\left( W^{(1)} v + b^{(1)} \right),$$

The activation function $\phi$ for this layer is the ReLU function. Subsequently, a dropout layer is applied to remove 30% of the neurons during each training epoch [42]. The output of this layer is denoted as $\hat{z}^{(1)}$. The next layer is an FCN, defined as follows:

$$z^{(2)} = W^{(2)} \hat{z}^{(1)} + b^{(2)}, \qquad z^{(2)} \in \mathbb{R}^{D},$$

where $D$ is the number of classes. The index of the maximum value of the output layer, $y = \arg\max_{d} z^{(2)}_{d}$, represents the predicted emotion label $y$ obtained by LiExNet. The value of $y$ is compared with the desired label value $l$ during training.
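A sketch of the classification block and of the end-to-end composition is shown below. The hidden-layer width (128) and the 48 x 48 input resolution are assumptions, while the 30% dropout and the D-way output follow the text; the exact layer sizes of LiExNet differ, since the full model contains only about 42,000 parameters.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Two fully connected layers with ReLU and 30% dropout; the output has D = num_classes logits."""

    def __init__(self, in_features: int, num_classes: int = 7, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)   # hidden width is an assumption
        self.drop = nn.Dropout(p=0.3)               # 30% dropout, as stated in the text
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.drop(torch.relu(self.fc1(v))))


class LiExNetSketch(nn.Module):
    """Feature extractor + flatten + classification block; argmax of the logits gives the label y."""

    def __init__(self, num_classes: int = 7, in_size: int = 48):
        super().__init__()
        self.features = build_feature_extractor()    # previous sketch
        spatial = in_size // 2 ** 4                   # each block halves H and W (assumed 2x2 pooling)
        self.flatten = nn.Flatten()
        self.classifier = Classifier(96 * spatial * spatial, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.flatten(self.features(x)))
        return logits                                 # y = logits.argmax(dim=1) at inference time
```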
2.5. Output Layer
The output layer encodes y as data that represents the detected emotion. This information can be transmitted to other embedded systems or servers for further processing.
4. Results
This section reports the results of six experiments conducted to evaluate the performance, consistency, and generalization of LiExNet: cross-validation and statistical significance, final performance assessment, Grad-CAM visualization, ablation study, comparison with state-of-the-art models, and fine-tuning on the EMOTION-ITCH dataset.
4.1. Cross-Validation and Statistical Significance
Cross-validation was used to assess generalization consistency, and statistical significance testing was used to confirm the reliability of LiExNet’s accuracy across the CK+, KDEF, and FER2013 datasets.
The cross-validation analysis was performed using a 5-fold scheme. This configuration was selected because, as reported in [
48], it is more suitable than 10-fold cross-validation for datasets of the size of CK+, KDEF, and FER2013.
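For reference, a 5-fold evaluation of this kind can be scripted as below. The stratified splitting and the helper-function names are assumptions, not the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(images, labels, build_model, train_fn, eval_fn, k=5, seed=42):
    """Run k-fold cross-validation and return the mean and standard deviation of fold accuracies."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    fold_acc = []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_model()                                    # fresh model per fold
        train_fn(model, images[train_idx], labels[train_idx])
        fold_acc.append(eval_fn(model, images[test_idx], labels[test_idx]))
    return float(np.mean(fold_acc)), float(np.std(fold_acc))
```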
Table 1 presents the cross-validation results, where LiExNet achieves a mean accuracy of 99.3 ± 1.2 on the CK+ dataset, corresponding to a generalization range of 98.3% to 100%. For the KDEF dataset, the mean and standard deviation are 88.3 ± 0.75, indicating consistent generalization within an accuracy range of 87.55% to 89.05%. For the FER2013 dataset, the mean and standard deviation are 79.2 ± 0.9, indicating good generalization within an accuracy range of 78.3% to 80.1%.
The statistical significance analysis consisted of a linear variability study and a hypothesis test to compare the set of true test labels $L = \{l_1, \ldots, l_T\}$ from all datasets with the corresponding model outputs $Y = \{y_1, \ldots, y_T\}$, where T denotes the number of samples in the test set.
The linear variability study was performed using the Pearson correlation coefficient $\rho$
, which quantifies the strength and direction of the linear relationship between
L and
Y in each cross-validation partition [
49]. The hypothesis test was conducted to determine whether the differences between
L and
Y are statistically significant, that is, whether they exhibit random variation or reflect a consistent relationship. The statistical significance was evaluated using the Wilcoxon signed-rank test, a nonparametric test for non-normal paired samples that assesses whether the difference between the two distributions is statistically significant [
50].
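Both statistics are available in SciPy; the sketch below shows how they could be computed per fold (the alpha = 0.05 threshold is an assumption). When L and Y are identical within a fold, as effectively happens on CK+, the Wilcoxon test is undefined, which the sketch handles explicitly.

```python
from scipy.stats import pearsonr, wilcoxon

def fold_significance(true_labels, predicted_labels, alpha=0.05):
    """Pearson correlation between L and Y plus the Wilcoxon signed-rank test on the paired samples."""
    rho, _ = pearsonr(true_labels, predicted_labels)
    try:
        w_stat, p_value = wilcoxon(true_labels, predicted_labels)
    except ValueError:                 # all paired differences are zero (e.g., a perfect CK+ fold)
        w_stat, p_value = 0.0, 1.0
    return rho, w_stat, p_value, p_value < alpha   # True -> statistically significant difference
```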
Table 2 presents the results of the Pearson correlation coefficient $\rho$ and the Wilcoxon test statistics $W$ and $p$, while
Figure 6 illustrates the histograms generated by grouping the true labels
L and the predicted outputs
Y from each fold.
In the case of CK+, the values of $\rho$ and the Wilcoxon test indicate that
L and
Y are almost identical in all the folds. This behavior is illustrated in
Figure 6, where the only observed difference is due to a single misclassified sample.
For the KDEF dataset, the $\rho$
values are approximately 0.8, indicating a strong positive linear relationship between
L and
Y, implying that
Y can be reliably estimated from
L using a linear regression model. The Wilcoxon test indicates that some folds do not exhibit statistical significance, as L and Y show similar distributions. In contrast, the remaining folds exhibit statistical significance, although their corresponding
p-values suggest that the difference is minimal. This behavior can be attributed to specific cases in which LiExNet failed to correctly recognize certain facial expressions belonging to the NE and SA categories. For example, in
Figure 6, noticeable differences can be observed between the NE and SA distributions.
In the case of FER2013, the $\rho$
values are approximately 0.8, indicating a strong positive linear relationship between
L and
Y, showing that
Y can be reliably approximated from
L using a linear regression model. However, the Wilcoxon test reveals statistical significance across all folds. The corresponding
p-values suggest that some folds include samples affected by uncontrolled imaging conditions, whereas the remaining folds
contain cases in which LiExNet failed to recognize certain facial expressions correctly. As shown in
Figure 6, although the most significant errors again occur in the NE and SA categories, the misclassifications are more pronounced than on the KDEF dataset.
It is also noteworthy that although the average accuracies in cross-validation on FER2013 (79.2%) and KDEF (88.3%) differ by nearly 10%, the values of the Pearson correlation coefficient remain similar across both datasets. This occurs because $\rho$ quantifies the strength of the linear relationship between the predicted outputs and the true labels, rather than the exact categorical match between them. In other words, LiExNet achieves lower accuracy on FER2013 due to the uncontrolled image acquisition conditions, yet its predictions maintain a strong linear correspondence between L and Y, even though the Wilcoxon test detects statistically significant differences in all folds of FER2013.
4.2. Final Performance Assessment
4.2.1. Performance and Computational Cost
Table 3 presents the performance results of LiExNet, VGG16, MobileNet, ShuffleNet, and EfficientNet architectures on the CK+, KDEF, and FER2013 datasets, trained on a desktop workstation and tested on the NVIDIA Jetson device. The values in bold in
Table 3 correspond to the best-performing results. VGG16, MobileNet, ShuffleNet, and EfficientNet were included in the experiments because they are widely adopted baselines that span the range from large backbone networks (VGG16) to lightweight architectures designed for mobile and embedded deployment (MobileNet, ShuffleNet, EfficientNet-Lite):
Table 3.
Accuracy and computational cost metrics obtained during testing (all computations executed in FP16 precision). Bold values denote the highest performance, and the arrows (↑, ↓) indicate the direction in which the metric shows better results.
| Metric | LiExNet | VGG16 | LVGG16 | MbNetV1 | MbNetV2 | SfNet | ENetL |
|---|---|---|---|---|---|---|---|
| Acc (CK+) ↑ | 99.5% | 98.2% | 97.1% | 95.4% | 86.8% | 73.1% | 77.15% |
| Acc (FER2013) ↑ | 79.2% | 77.4% | 69.84% | 69.72% | 65.4% | 37.7% | 59.3% |
| Acc (KDEF) ↑ | 88.2% | 83.3% | 49.29% | 56.25% | 40.6% | 65.18% | 36.22% |
| FPS ↑ | 530.7 | 11.2 | 12.2 | 104.3 | 195.2 | 193.5 | 102.7 |
| RAM (MB) ↓ | 0.17 | 553.4 | 68.38 | 5.3 | 9.361 | 3.72 | 15.76 |
| PRAM (MB) ↓ | 2 | 12.97 | 12.86 | 2.8 | 5.6 | 2.4 | 2.3 |
| Giga FLOPs ↓ | 0.086 | 30.94 | 0.307 | 10.9 | 0.392 | 0.037 | 0.258 |
| nParams ↓ | 42,000 | 138,357,544 | 17,926,983 | 1,325,829 | 2,340,423 | 976,503 | 4,132,010 |
The training of LiExNet, VGG16, LVGG16, MobileNet V1 and V2, ShuffleNet, and EfficientNet-Lite was performed using the Adam optimizer, a dynamic learning rate schedule, and categorical cross-entropy loss. The number of training epochs was adjusted according to each dataset. CK+, KDEF, and FER2013 were trained independently (without cross-dataset initialization), and no data augmentation was applied, as these settings consistently yielded the best performance in our experiments. All models were trained on a desktop workstation equipped with an AMD Ryzen 7 7700 processor, 32 GB of RAM, and an NVIDIA RTX 4060 GPU. The benchmarking and evaluation of these networks on an embedded edge device were performed using a Jetson TX2 development board (NVIDIA Corporation, Santa Clara, CA, USA). This board features a Tegra X2 System-on-Chip (SoC), an NVIDIA Pascal GPU with 256 CUDA cores, and a CPU with two Denver cores and four ARM Cortex-A57 cores. Additionally, the Jetson includes 8 GB of optimized LPDDR4 RAM with a bandwidth of 58.4 GB/s and 32 GB of eMMC 5.1 storage for data handling.
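A minimal training loop consistent with this setup is sketched below. The initial learning rate, the ReduceLROnPlateau schedule, and the batch handling are assumptions, since only the optimizer, the use of a dynamic learning rate, and the loss function are specified above.

```python
import torch
import torch.nn as nn

def train_model(model, train_loader, epochs, device="cuda"):
    """Adam + dynamic learning-rate schedule + categorical cross-entropy (sketch)."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                           factor=0.5, patience=5)
    for _ in range(epochs):
        epoch_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        scheduler.step(epoch_loss / len(train_loader))   # adapt the learning rate to the epoch loss
```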
The experimental metrics evaluate both recognition performance and computational efficiency. These metrics include the average test accuracy (Acc) [
53], inference speed (FPS) [
54], floating-point operations per inference (FLOPs) [
54], model memory footprint (RAM) [
55], peak inference memory (PRAM) [
55], and the number of parameters (nParams) [
56].
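FPS and the parameter count can be measured directly on the target device with a simple timing loop such as the one below; FLOPs and peak memory would require an external profiler. The 48 x 48 FP16 input and the warm-up/run counts are assumptions for illustration.

```python
import time
import torch

def benchmark(model, input_shape=(1, 3, 48, 48), runs=1000, device="cuda"):
    """Return the parameter count and the measured frames per second in FP16."""
    model = model.to(device).half().eval()
    x = torch.randn(*input_shape, device=device).half()
    n_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        for _ in range(50):                      # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    fps = runs / (time.perf_counter() - start)
    return n_params, fps
```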
According to the values reported in
Table 3, LiExNet outperforms VGG16, the lightweight VGG16 variant (LVGG16), the MobileNet variants, ShuffleNet, and EfficientNet-Lite across all metrics except FLOPs, where LiExNet ranks second, slightly below ShuffleNet; ShuffleNet (SfNet), however, exhibits lower accuracy across all datasets. LiExNet achieves the best FPS with a latency of 1.88 ms, the lowest RAM usage, PRAM usage, and number of parameters, while maintaining competitive computational efficiency. Moreover, since LiExNet reaches 530.7 FPS on an NVIDIA Jetson device, these results demonstrate its ability to operate in real time on embedded systems with hybrid CPU–GPU architectures.
Based on the results presented in
Table 3 and the architectural description provided in
Section 2, the novelty of LiExNet lies in its four-block hierarchical design, where each block integrates multi-scale feature extraction, depthwise-efficient convolutional operations, channel-wise attention, and progressive spatial reduction. This combination results in a lightweight model explicitly optimized for the extraction of expression and microexpression facial features.
The Jetson TX2 provides up to 1.3 TFLOPs in FP16 and 8 GB of shared LPDDR4 RAM, but to run LiExNet, it must be configured with Ubuntu 18.04 LTS, Jetson Linux R32.7.4 (JetPack 4.6.4), CUDA 10.2, cuDNN 8.2.1, and TensorRT 8.2.1. In practical terms, LiExNet requires only 86 MFLOPs per inference and a peak inference memory of approximately 2 MB, which means that the model and its activations use far less than 1% of the available RAM and only a small fraction (approximately 3.5%) of the theoretical compute capacity, even at 530 FPS. These numbers indicate that LiExNet leaves ample headroom on the Jetson TX2 to execute several auxiliary services in parallel, such as image acquisition, pre-processing, logging, communication, or a graphical user interface, without saturating the device. The image
is acquired from a real-time camera connected to the Jetson TX2 via a GStreamer pipeline integrated into Jetson Linux. For example,
Figure 7 illustrates a real-time processing example of LiExNet running on a Jetson TX2 device, executed under Jetson Linux.
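For illustration, frames from an onboard CSI camera can be pulled into Python through OpenCV with a GStreamer pipeline of the kind below. The nvarguscamerasrc pipeline string, resolution, and frame rate are a common Jetson example and not necessarily the exact pipeline used in this work.

```python
import cv2

# Hypothetical GStreamer pipeline for a Jetson CSI camera (resolution and frame rate are assumptions).
PIPELINE = (
    "nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink"
)

cap = cv2.VideoCapture(PIPELINE, cv2.CAP_GSTREAMER)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # face detection/cropping, preprocessing, and LiExNet inference would run on `frame` here
cap.release()
```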
LiExNet integrates into a typical edge AI pipeline and can be deployed on any operating system or AI framework. According to our experiments, LiExNet can run in real time concurrently with other emotion-recognition processes, such as EEG-based analysis, sensor-fusion modules, facial analysis components, or user-interaction programs. Therefore, it is fully compatible with commercial mobile phone processors and 32-bit microcontrollers with AI accelerators, enabling a wide range of applications in affective computing and occupational dissonance assessment.
4.2.2. Emotion Classification Performance
Figure 8 illustrates the accuracy trends over the training epochs for LiExNet using the CK+, KDEF, and FER2013 datasets. The following observations can be made:
In the CK+ dataset, LiExNet reaches an accuracy of nearly 100% starting from epoch 150.
In the FER2013 dataset, the model achieves an accuracy of approximately 80% by epoch 80.
In the KDEF dataset, LiExNet obtains an accuracy of almost 90% by epoch 200.
Figure 8.
Training process generated by LiExNet with CK+, KDEF, and FER2013.
In addition, the epoch-wise accuracy achieved by LiExNet on the CK+, KDEF, and FER2013 datasets is consistent between the training and test sets, indicating that LiExNet generalizes well across all three datasets.
Figure 9 shows the confusion matrices obtained during the testing experiments. Regarding the CK+ dataset, nearly all emotion classes are classified correctly. Only one sample of the Sadness class is misclassified as Fear.
However, the confusion matrices of KDEF and FER2013 are more complex to interpret. Therefore, Table 4 and Table 5 were created to present the confusion matrix values normalized with respect to the total number of images in each class $l$. These normalized metrics are defined as the class-based recall ($R_l$) and the false-negative rate per class ($FNR_l$). $R_l$ is reported along the main diagonal of the confusion matrix and is defined as $R_l = TP_l / (TP_l + FN_l)$, where $TP_l$ denotes the true positives for class $l$, and $FN_l$ represents the samples of class $l$ that were misclassified into other classes. $FNR_l$ corresponds to the off-diagonal elements, which represent misclassifications expressed as percentages, and is defined as $FNR_l = 1 - R_l$. Based on the results from the KDEF dataset presented in Table 4, several conclusions can be drawn:
HA achieves the highest performance, where $R_l = 98.57\%$ and only 1.43% of the samples are misclassified as DI.
DI reaches $R_l = 97.86\%$, and only 2.14% of the samples are misclassified as HA and AN.
SU achieves $R_l = 90\%$ and $FNR_l = 10\%$, which are SU samples confused with AF.
AF achieves $R_l = 87.86\%$ and $FNR_l = 12.14\%$. Among these errors, 8.58% are AF samples confused with DI and SU.
AN achieves $R_l = 84.29\%$ and $FNR_l = 15.71\%$. Of these, 10.71% are AN samples confused with DI and 3.57% with SA.
SA reaches $R_l = 83.57\%$ and $FNR_l = 16.43\%$. Among these, 8.57% are SA samples confused with DI and 3.57% with AF.
NE exhibits the lowest performance, with $R_l = 75.71\%$ and $FNR_l = 24.29\%$. Notably, 20% of these errors are NE samples confused with SA.
The average $R_l$ across all classes is 88.27%, with a standard deviation of approximately 8%. Therefore, HA, DI, and SU perform above the mean, and AF remains close to it, whereas AN, SA, and NE fall below.
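Using the recall and false-negative-rate definitions above, the row-normalized values in Tables 4 and 5 can be reproduced from a raw confusion matrix as follows (the function name is illustrative).

```python
import numpy as np

def per_class_recall_fnr(confusion_matrix):
    """Row-normalize a confusion matrix: the diagonal gives the class-based recall R_l,
    and the remaining row mass gives the false-negative rate FNR_l = 1 - R_l."""
    cm = np.asarray(confusion_matrix, dtype=float)
    normalized = cm / cm.sum(axis=1, keepdims=True)   # each row (true class) sums to 1
    recall = np.diag(normalized)
    fnr = 1.0 - recall
    return recall, fnr, normalized
```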
Table 4.
$R_l$ and $FNR_l$ of LiExNet on the KDEF dataset. The values in bold correspond to $R_l$.
| | AF | AN | DI | HA | NE | SA | SU |
|---|---|---|---|---|---|---|---|
| AF | 87.86% | 1.43% | 4.29% | 0.71% | 0.00% | 1.43% | 4.29% |
| AN | 1.43% | 84.29% | 10.71% | 0.00% | 0.00% | 3.57% | 0.00% |
| DI | 0.00% | 1.43% | 97.86% | 0.71% | 0.00% | 0.00% | 0.00% |
| HA | 0.00% | 0.00% | 1.43% | 98.57% | 0.00% | 0.00% | 0.00% |
| NE | 0.00% | 2.14% | 0.00% | 2.14% | 75.71% | 20.00% | 0.00% |
| SA | 3.57% | 2.14% | 8.57% | 1.43% | 0.71% | 83.57% | 0.00% |
| SU | 10.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 90.00% |
Table 5.
$R_l$ and $FNR_l$ of LiExNet on the FER2013 dataset. The values in bold correspond to $R_l$.
| | AN | DI | FE | HA | NE | SA | SU |
|---|---|---|---|---|---|---|---|
| AN | 76.15% | 1.45% | 3.03% | 3.27% | 7.87% | 5.21% | 3.03% |
| DI | 4.88% | 90.96% | 1.66% | 0.48% | 0.36% | 1.43% | 0.24% |
| FE | 3.95% | 2.15% | 79.90% | 0.72% | 2.39% | 2.03% | 8.85% |
| HA | 1.80% | 0.12% | 0.48% | 87.76% | 4.44% | 2.28% | 3.12% |
| NE | 2.23% | 0.25% | 0.87% | 3.60% | 77.79% | 11.79% | 3.47% |
| SA | 5.14% | 1.38% | 1.25% | 7.27% | 26.07% | 56.64% | 2.26% |
| SU | 3.49% | 0.00% | 2.53% | 2.77% | 6.27% | 0.96% | 83.98% |
Table 5 shows the $R_l$
values and error percentages of LiExNet evaluated on the FER2013 dataset. Based on these results, several observations can be made:
DI achieves the highest performance, with $R_l = 90.96\%$ and $FNR_l = 9.04\%$; the misclassified DI samples are mainly confused with AN (4.88%).
HA reaches $R_l = 87.76\%$ and $FNR_l = 12.24\%$; the misclassified HA samples are mostly confused with NE (4.44%).
SU achieves $R_l = 83.98\%$ and $FNR_l = 16.02\%$; the misclassified SU samples are mainly confused with NE (6.27%) and AN (3.49%).
FE reaches $R_l = 79.90\%$ and $FNR_l = 20.10\%$; the misclassified FE samples are mainly confused with SU (8.85%) and AN (3.95%).
NE obtains $R_l = 77.79\%$ and $FNR_l = 22.21\%$; the misclassified NE samples are mostly confused with SA (11.79%), HA (3.60%), and SU (3.47%).
AN achieves $R_l = 76.15\%$ and $FNR_l = 23.85\%$; the misclassified AN samples are mainly confused with NE (7.87%), SA (5.21%), and HA (3.27%).
SA exhibits the lowest performance, with 56.64% of the samples correctly classified and 43.36% misclassified. Among these errors, 26.07% correspond to confusions with NE, and notable confusion also occurs with HA (7.27%) and AN (5.14%).
The average $R_l$ across all classes is approximately 79%, with a standard deviation of approximately 10%. Therefore, DI, HA, and SU perform above the mean, and FE is close to it, whereas AN, SA, and NE fall below.
Based on the results obtained across the three datasets, the following conclusions can be drawn: the emotions HA, DI, and SU consistently achieve the highest recognition performance. AF and FE (which correspond to the same emotion category) exhibit performance levels close to the average $R_l$ value in each dataset. In contrast, NE, SA, and AN show comparatively lower recognition rates.
Another factor that influences LiExNet’s performance across CK+, KDEF, and FER2013 is the manner in which the images were acquired. The CK+ dataset achieves results above 99% because all images were captured in frontal view under strictly controlled acquisition conditions. The performance decreases to 88% in KDEF, since—although captured in controlled environments—the faces exhibit variations in head orientation with respect to the camera. FER2013 yields the lowest performance, as its images were collected from the Internet and therefore present unconstrained conditions, including variations in pose, illumination, occlusions, and imaging quality.
4.3. Grad-CAM Analysis
A Grad-CAM analysis [
57] was conducted to identify the facial regions used by LiExNet for emotion discrimination. For this purpose, Grad-CAM activation maps were projected onto each facial image using the feature maps extracted from the max-pooling operation of the final feature-extraction block, as illustrated in
Figure 10. A total of 63 test images were evaluated (21 from each dataset), including 54 images corresponding to Ekman’s basic emotions (9 per emotion), six from the NE class, and three from the CO class.
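A minimal Grad-CAM routine of the kind used here is sketched below: gradients of the predicted class score with respect to the chosen layer's feature maps provide channel weights for a localization map. The hook-based implementation and the layer handle are assumptions; dedicated Grad-CAM libraries offer equivalent functionality.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heat map for one image (C, H, W) using `target_layer`'s feature maps."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.eval()
    logits = model(image.unsqueeze(0))
    idx = int(logits.argmax(dim=1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()                                   # gradient of the predicted class score
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)           # channel importance (pooled gradients)
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()                 # normalized heat map (H, W)
```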
The visual evidence provided by Grad-CAM reveals two types of discriminative facial patterns utilized by LiExNet: expressions and microexpressions. Expressions correspond to visible emotional manifestations produced by voluntary or involuntary muscle activations, commonly referred to as Action Units (AUs) within the Facial Action Coding System (FACS). In contrast, microexpressions are brief, involuntary, and universal emotional responses that leave transient traces. In static images, these typically appear as subtle or residual Action Unit activations (rAUs), which are often associated with atypical high-spatial-frequency patterns or localized facial tension [
58,
59,
60,
61,
62].
Table 6 summarizes the common AU-based expressions, rAUs, and spatial-frequency microexpressions identified from the 63 analyzed Grad-CAM images, while
Figure 10 exemplifies the typical activation patterns found across the different emotion categories. Emotions with the highest $R_l$
values, such as HA, SU, and DI, exhibit strong Grad-CAM activations across multiple facial expressions and microexpressions, particularly in the central facial region. The emotions FE and AF show limited activation in the eyebrow area, but strong activation in microexpressions near the eyes and in the nasojugal region. SA and AN display Grad-CAM activations in the eyebrows, the elevated infraorbital region, and the nasojugal region. However, their expressions and microexpressions are highly similar. Finally, NE shows no consistent expression or microexpression patterns.
These findings are consistent with the theory proposed by Ekman and Friesen, which holds that positive emotions such as happiness and surprise show distinctive, highly detectable facial features with strong activation of specific expressions and microexpressions. In contrast, negative emotions—particularly sadness, fear, and anger—tend to produce more subtle and overlapping muscle activations, increasing the likelihood of misclassification [
58,
61,
62]. Moreover, according to [
59,
60], the NE class has the lowest $R_l$
because neutral facial expressions vary widely among individuals and are strongly influenced by personality traits, thereby increasing intra-class variability.
From the 63 Grad-CAM images analyzed, the most prominent confusion occurs between the SA and NE classes. This confusion arises because SA samples typically elicit strong activations across the T-zone of the face, driven by characteristic eyebrow expressions, the infraorbital region, and nasojugal microexpressions. In contrast, NE samples occasionally produce T-zone activations because the third feature-extraction block detects this region. However, the fourth block does not capture distinctive expressions or microexpressions. As a result, the T-zone activations remain unchanged across layers l = 12–16. This leads SA and NE to exhibit highly similar activation patterns, explaining the confusion between these two categories.
4.4. LiExNet Ablation Study
An ablation study was performed to assess the impact of each architectural component in LiExNet. Four network variants, shown in
Figure 11, were evaluated by selectively removing layers from the feature extraction blocks:
LiExNet-a1. This variant eliminates the standard convolutional layers (as described in Equation (1)) and their batch normalization.
LiExNet-a2. This variant removes the depthwise convolutional layers (as outlined in Equation (2)) and their batch normalization.
LiExNet-a3. In this variant, the Efficient Channel Attention (ECA) layers are removed, effectively turning off the attention mechanism.
LiExNet-a4. This variant excludes the final convolutional layers (as defined in Equation (1)) and their max-pooling operations.
The ablation study was conducted using the KDEF dataset because LiExNet achieved intermediate performance on it. Results are presented in
Table 7. Removing the convolutional or depthwise convolutional layers (LiExNet-a1 and LiExNet-a2) reduces the model’s feature-extraction capacity, leading to a notable drop in accuracy. Removing the ECA module (LiExNet-a3) results in decreased performance due to the absence of channel-wise attention. Eliminating the final convolutional and grouping stage (LiExNet-a4) forces the network to deliver as output the selection of channels containing relevant features, rather than the features themselves. These findings confirm that all components are essential and jointly contribute to maintaining LiExNet’s balance between performance and efficiency.
4.5. Comparison of LiExNet with Other Methods
Table 8 presents a comparison of the accuracy of LiExNet with other state-of-the-art FER methods that classify emotions into six or seven categories. The table reports the dataset used for evaluation (CK+, KDEF, or FER2013), the total number of emotions measured and evaluated by each method (Emotions), and whether the method is suitable for real-time applications (Real-time). More precisely, the total number of emotions evaluated by each method is not determined by the number of classes that the model classifies, but rather by the combined set of native emotion categories that the model measures across the datasets (CK+, KDEF, and FER2013).
Based on the results presented in
Table 8, the following observations can be made:
In summary, LiExNet demonstrates superior performance among real-time FER methods and ranks among the top-performing models out of 26 state-of-the-art approaches. Its effectiveness is further evidenced by consistent generalization across datasets, as demonstrated through cross-validation, statistical significance testing, and class-level behavioral analyses across three benchmark databases. These findings highlight LiExNet as a robust and comprehensive solution for real-time emotion recognition, offering an optimal balance between accuracy and computational efficiency.
4.6. Experiments with EMOTION-ITCH Dataset
The confusion matrices in Figure 9 and the detailed $R_l$ and $FNR_l$ distributions shown in Table 4 and Table 5 indicate that negative emotions exhibit lower performance compared to positive emotions, as they tend to be misclassified more frequently among themselves. To address this limitation, the EMOTION-ITCH dataset groups emotions into three categories: negative, neutral, and positive. Although these categories are more general, they reduce misclassification errors and provide more consistent emotion recognition. Furthermore, the structure of the dataset enables the analysis of emotional responses in relation to habitual work-related stress. We therefore designed an experiment in which LiExNet, pre-trained on FER2013, was fine-tuned using facial expression images from the EMOTION-ITCH dataset. The fine-tuning process was performed by dividing the EMOTION-ITCH dataset into 60% (352 images) for training and 40% (234 images) for testing and cross-validation. The objectives of this experiment were to reduce misclassification among negative emotion classes and to analyze differences in emotional responses between worker and non-worker subjects. The model was then initialized with the weights obtained from the FER2013 training and fine-tuned using a weighted cross-entropy loss function, defined as:
$$\mathcal{L} = -\frac{1}{I} \sum_{i=1}^{I} \omega_{i} \sum_{d=1}^{D} t_{i,d} \, \log\!\big(\hat{y}_{i,d}\big),$$

where $x_i$ denotes the $i$-th input image sample, $d$ represents the class, $t_{i,d}$ is the one-hot encoded ground truth label, and $\hat{y}_{i,d}$ is the output of the $i$-th sample from the final layer.
The images corresponding to working subjects were assigned the following weights:

$$\omega_{i} = \begin{cases} \omega, & \text{if the } i\text{-th sample belongs to a working subject},\\ 1, & \text{otherwise}, \end{cases}$$

where $\omega$ is a weighting factor used to emphasize the samples from working subjects.
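One way to realize this per-sample weighting during fine-tuning is shown below. The symbol names, the example value of the weighting factor, and the unreduced cross-entropy formulation are assumptions consistent with the description above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, images, labels, is_worker, optimizer, omega=2.0):
    """One fine-tuning step with weighted cross-entropy: working-subject samples get weight omega."""
    model.train()
    optimizer.zero_grad()
    per_sample = F.cross_entropy(model(images), labels, reduction="none")
    weights = torch.where(is_worker,
                          torch.full_like(per_sample, omega),   # emphasized working-subject samples
                          torch.ones_like(per_sample))          # weight 1 for the remaining samples
    loss = (weights * per_sample).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```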
The metrics used in this experiment were accuracy (Acc), recall for the negative class ($R_{neg}$), and the Area Under the Curve (AUC). $R_{neg}$ is defined as follows:

$$R_{neg} = \frac{TP_{neg}}{TP_{neg} + FN_{neg}},$$

where $TP_{neg}$ refers to the number of negative emotion samples correctly classified as negative, and $FN_{neg}$ refers to the number of negative emotion samples incorrectly classified as positive. The AUC is defined as follows:

$$AUC = \sum_{i} \frac{\big(FPR_{i+1} - FPR_{i}\big)\big(TPR_{i+1} + TPR_{i}\big)}{2},$$

where $TPR_i$ and $FPR_i$ represent the true positive rate and false positive rate, respectively, for the $i$-th sample.
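These three metrics map directly onto standard scikit-learn calls, as sketched below; the class indexing (negative = 0) and the one-vs-rest multiclass AUC are assumptions.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

def emotion_itch_metrics(y_true, y_pred, y_prob, negative_class=0):
    """Accuracy, recall of the negative class, and multiclass AUC (one-vs-rest)."""
    acc = accuracy_score(y_true, y_pred)
    recall_neg = recall_score(y_true, y_pred, labels=[negative_class], average=None)[0]
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr")   # y_prob: per-class probabilities
    return acc, recall_neg, auc
```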
Table 9 reports the results of the fine-tuning experiment:
The AUC, presented in
Figure 12, can be interpreted as follows:
A weight of yields the best AUC.
A weight of generates a slight improvement in AUC compared to , and outperforms in terms of Acc and .
A weight of achieves the best values for Acc and .
A weight of yields the same AUC as and . However, both Acc and decrease significantly.
Figure 12.
AUC with various weight adjustments. The dashed line represents the random performance.
Based on these points, it can be observed that information about whether subjects work or not contributes to model performance when . Specifically, with , the model exhibits improved discrimination capability between classes, particularly for the negative class, across all possible thresholds. In contrast, yields the best classification accuracy. The weight of suggests a potential overemphasis on the working subjects. These findings indicate that including the weight factor improves inter-class separability, as prior knowledge of the affective state (habitual work-related stress or relaxed) helps extract more discriminative features from facial expressions and microexpressions, thereby enhancing emotion recognition.
A Grad-CAM experiment was conducted to visualize the facial regions that LiExNet focuses on during emotion recognition, thereby improving interpretability. According to the Grad-CAM analysis, LiExNet effectively suppresses background information and concentrates on the salient facial regions described in
Table 6. The model primarily attends to the eyes, nose, cheeks, and lips, as well as subtle microexpressions located around the nose and below the eyes, as illustrated in
Figure 13.
To verify the consistency of the results on EMOTION-ITCH, we performed 5-fold cross-validation and a statistical significance analysis using the Pearson correlation coefficient $\rho$ and the Wilcoxon test, with LiExNet trained via fine-tuning with the selected weighting factor.
Table 10 shows that the folds yield very similar results, all of which are close to 100%. The values of $\rho$ and the Wilcoxon test indicate that the label set L and the prediction set Y are nearly identical across all folds.
5. Conclusions
This paper presents LiExNet, a deep neural network with strong generalization and top accuracy across multiple datasets, designed for facial expression-based emotion recognition with low computational complexity, making it suitable for real-time deployment in embedded and sensor-driven ubiquitous systems. The model was evaluated using the CK+, KDEF, FER2013, and EMOTION-ITCH datasets. CK+, KDEF, and FER2013 are well-established and widely used in the literature. EMOTION-ITCH is a custom multimodal dataset specifically developed to investigate the relationship between emotion recognition and habitual work-related stress under realistic acquisition conditions.
Cross-validation experiments indicate that LiExNet achieves accuracies of 99.3 ± 1.2% on CK+, 88.3 ± 0.75% on KDEF, and 79.2 ± 0.9% on FER2013. The final test results report accuracies of 99.5% on CK+, 79.2% on FER2013, and 88.2% on KDEF. Statistical significance analyses further demonstrate that LiExNet learns and generalizes consistently across all datasets. Overall, these differences are expected, not because of the LiExNet architecture, but because CK+ contains frontal faces acquired under controlled conditions, KDEF includes faces captured from multiple angles under controlled settings, and FER2013 contains facial images collected in unconstrained, highly variable real-world environments.
According to the emotion classification analysis conducted on the KDEF and FER2013 datasets, LiExNet achieves class-based recall ($R_l$) values that are near-perfect on CK+, approximately 79% on average for FER2013, and approximately 88% on average for KDEF. The highest-performing emotions are HA, SU, and DI, whereas AF and FE exhibit intermediate behavior. The lowest-performing categories are SA, AN, and NE. The Grad-CAM analysis suggests that this behavior arises because HA, SU, and DI exhibit a wide range of expressions and microexpressions, whereas AF, FE, SA, and AN share similar expression and microexpression patterns. In contrast, the NE category lacks consistent facial features that can be reliably used for classification.
Sadness and neutrality yield slightly lower performance, as these emotions are occasionally confused with one another. Similarly, anger and fear show reduced accuracy because they frequently overlap with other categories. To address these challenges, a fine-tuning experiment was conducted using the EMOTION-ITCH dataset, which groups emotions into positive, negative, and neutral classes to improve robustness under real-world acquisition conditions. Results show that the fine-tuned LiExNet further improves its accuracy when prior knowledge of the subject's affective state is incorporated through the weighting factor. These findings demonstrate that LiExNet is effective for affective computing applications in sensor-based and embedded environments, particularly in occupational scenarios where emotional dissonance and stress monitoring are relevant.
LiExNet has a computational cost of 86 MFLOPs and an architecture comprising approximately 42,000 parameters, a 0.17 MB model memory footprint, and a peak inference memory of 2 MB. This low architectural complexity enables real-time inference on CPU–GPU embedded platforms commonly used in sensor-driven systems. Both a desktop workstation and an NVIDIA Jetson TX2 achieved low-latency, real-time processing. Moreover, LiExNet requires only a minimal fraction of the computational and memory resources available in embedded devices, allowing it to operate concurrently with other sensing, processing, or decision-making modules within edge-AI architectures.
Given its balance between performance and computational efficiency, LiExNet stands among the most suitable real-time FER models for deployment in embedded and ubiquitous environments, including mobile devices, smartphones, and affective computing systems. Future work will focus on two main directions. First, we aim to explore strategies to train LiExNet to achieve strong performance on facial images captured under uncontrolled conditions and with faces in varying poses. Second, it is necessary to integrate LiExNet into an application programming interface (API) for edge-computing platforms, expanding its applicability to sensor-driven affective computing, emotional dissonance monitoring in occupational environments, automotive driver-assistance systems, and human–machine interaction scenarios.