TriViT-Lite: A Compact Vision Transformer–MobileNet Model with Texture-Aware Attention for Real-Time Facial Emotion Recognition in Healthcare
Abstract
1. Introduction
2. Related Work
3. Materials and Methods
3.1. Dataset
3.1.1. Data Collection and Preprocessing
3.1.2. FER2013
3.1.3. AffectNet
3.1.4. Custom Dataset
- Feature Alignment: During training, a shared embedding space was learned so that samples from both the custom dataset and the public datasets are mapped into a single, unified representation.
- Loss Function Regularization: A weighted cross-entropy loss was applied to encourage the model to balance its performance across domains. The modified loss function is given by the following:
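The equation itself is not reproduced in this extract. As a minimal sketch, assuming the standard form of a class-weighted cross-entropy with per-domain weights (the symbols $w_{d(i)}$, $y_{i,k}$, and $\hat{p}_{i,k}$ below are illustrative rather than the paper's own), the loss can be written as:

```latex
% Weighted cross-entropy over N training samples and K emotion classes.
% w_{d(i)} is a weight for the domain (custom vs. public dataset) of sample i;
% y_{i,k} is the one-hot label and \hat{p}_{i,k} the predicted probability.
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_{d(i)} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}
```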
3.1.5. Preprocessing Pipeline
- Face Detection: Multi-task Cascaded Convolutional Networks (MTCNN) were used to detect and crop facial regions from video frames. MTCNN was chosen for preprocessing because it handles variation in head pose and lighting changes effectively. Only the facial area was retained from each frame, eliminating background interference.
- Normalization: All facial frames were scaled to 224 × 224 pixels for model input uniformity. Pixel values were standardized to [0, 1] for faster and more reliable training.
- Data Augmentation: To increase the variability of the dataset, augmentation techniques such as random cropping, rotation, flipping, and brightness/contrast adjustments were applied. These augmentations account for natural variations in facial pose and lighting, which is particularly relevant in healthcare settings where illumination can vary widely, and they are critical to the model’s robustness in real-world conditions. A minimal sketch of the detection, normalization, and augmentation steps is given after this list.
- Emotion Labelling: A semi-supervised approach was employed for annotating the real-time dataset. The two custom emotion classes (painful and unconscious) were first labelled carefully based on experts’ observations, and these annotations were then refined using active learning techniques. The remaining common emotions, which are already pre-labelled in the public datasets, were integrated with the real-time data through consistent label alignment.
- Temporal Alignment: To capture the evolution of emotions over time, frames were segmented into continuous sections in which transitions between emotional states (across all seven considered emotions) were explicitly labelled. This temporal annotation ensures that the model can recognize changes in facial expression as patients’ emotional states shift during video monitoring.
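The following is a minimal sketch of the face-detection, normalization, and augmentation steps referenced above. The choice of the `facenet-pytorch` MTCNN implementation, the torchvision transforms, and all parameter values are assumptions made for illustration; the paper does not specify its exact implementation.

```python
# Minimal preprocessing sketch (illustrative; libraries and parameter values are
# assumptions, not the authors' exact implementation).
from facenet_pytorch import MTCNN          # assumed MTCNN implementation
from torchvision import transforms
from PIL import Image

detector = MTCNN(image_size=224, margin=0, post_process=False)

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random cropping
    transforms.RandomRotation(15),                          # rotation
    transforms.RandomHorizontalFlip(),                      # flipping
    transforms.ColorJitter(brightness=0.3, contrast=0.3),   # brightness/contrast
    transforms.ToTensor(),                                  # scales pixels to [0, 1]
])

def preprocess_frame(frame: Image.Image):
    """Detect the face with MTCNN, crop it, and apply training-time augmentation."""
    face = detector(frame)          # 3x224x224 tensor cropped to the facial region
    if face is None:                # skip frames where no face is detected
        return None
    face_img = transforms.ToPILImage()(face.byte())
    return augment(face_img)
```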
3.2. Proposed TriViT-Lite Model
3.2.1. Proposed Architecture Overview
3.2.2. Local Feature Extraction with MobileNet
- Depthwise Convolution: In the depthwise convolution operation, a single convolutional filter is applied independently to each input channel, allowing the model to capture spatial features in each channel separately. Let the input feature map be denoted by $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of input channels, respectively. The depthwise convolution applies a unique filter $K^{(c)}$ of kernel size $k \times k$ to each channel $c$. The output of the depthwise convolution at position $(i, j)$ in the $c$-th channel of the resulting feature map is given in the equation block following this list.
- Pointwise Convolution: Following the depthwise convolution, a pointwise ($1 \times 1$) convolution is used to combine information across channels. For each output channel $m$, the pointwise convolution mixes the depthwise outputs across channels, as written in the same equation block below.
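The two operations are written out below using standard depthwise-separable convolution notation ($K^{(c)}$ for the per-channel depthwise filters and $W^{\mathrm{pw}}$ for the pointwise weights); this is a reconstruction of the standard formulation rather than a verbatim copy of the paper's equations.

```latex
% Depthwise convolution: a k x k filter K^{(c)} is applied to channel c of X alone.
Y^{\mathrm{dw}}_{i,j,c} = \sum_{u=1}^{k} \sum_{v=1}^{k} K^{(c)}_{u,v} \, X_{i+u-1,\; j+v-1,\; c}

% Pointwise (1 x 1) convolution: mixes the depthwise outputs across channels
% to produce output channel m.
Y^{\mathrm{pw}}_{i,j,m} = \sum_{c=1}^{C} W^{\mathrm{pw}}_{c,m} \, Y^{\mathrm{dw}}_{i,j,c}
```

Per output position, this factorization costs roughly $k^{2}C + CM$ multiplications instead of $k^{2}CM$ for a standard convolution with $M$ output channels, which is the source of MobileNet's efficiency.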
3.2.3. Handcrafted Features with LBP and HOG
- Histogram of Oriented Gradients (HOG): The Histogram of Oriented Gradients (HOG) descriptor characterizes features by analyzing the distribution of gradient orientations within specific regions of an image. It achieves this by dividing the image into a grid of cells and computing the gradients within each cell. Originally proposed by Dalal and Triggs [25], HOG operates on the intensity (grayscale) function, denoted as $I(x, y)$, which represents the image under analysis. In Figure 7a, the image is divided into a grid of fixed-size pixel cells. Horizontal and vertical gradients are computed along the x- and y-axes, and the gradient magnitude is obtained from them, as written in the equation block after this list.
- Local Binary Patterns (LBP): LBP is a texture descriptor that is useful for analyzing image texture, as described in [27]. It encodes the relationship of each pixel to its surrounding neighbors by comparing each neighbor’s value with that of the central pixel. Each comparison is expressed as a binary digit, and the digits are concatenated in a clockwise direction, starting from the top-left neighbor, forming a binary string. The resulting binary sequence is converted into its corresponding decimal value, which is used to label the pixel [28]. In decimal form, the resulting LBP code is given in the same equation block below.
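The gradient and LBP expressions referred to above are reconstructed below in their standard form [25,27,28]; the cell size and neighborhood parameters ($P$, $R$) are left symbolic rather than taken from the paper.

```latex
% HOG: finite-difference gradients of the intensity I(x, y), with magnitude
% and orientation accumulated into per-cell orientation histograms.
G_x = I(x+1, y) - I(x-1, y), \qquad G_y = I(x, y+1) - I(x, y-1)
|G(x, y)| = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta(x, y) = \arctan\!\left(\frac{G_y}{G_x}\right)

% LBP: threshold the P neighbors g_p on a circle of radius R against the
% center pixel g_c and read the resulting bits as a decimal code.
\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p},
\qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases}
```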
3.2.4. Global Feature Extraction with Vision Transformer
- Patch Embedding: In the ViT module, the input, such as an image $X \in \mathbb{R}^{H \times W \times C}$, is divided into a grid of $N$ non-overlapping patches $x_p$ of size $P \times P$. Each patch is then flattened and projected into a higher-dimensional space using a linear embedding layer, transforming the input into a sequence of vectors, as written in the equation block after this list.
- Self-Attention Calculation for Feature Representation: Within each transformer layer, the self-attention mechanism calculates relationships between patches by transforming the input sequence at time $t$ into queries $Q$, keys $K$, and values $V$ through learned projections, as written in the same equation block below.
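Both steps are written out below using the standard ViT formulation [10]; the embedding matrix $E$, positional embedding $E_{\mathrm{pos}}$, and projection matrices $W_Q$, $W_K$, $W_V$ are the usual symbols rather than the paper's own.

```latex
% Patch embedding: the N flattened P x P patches x_p^i are linearly projected by E
% and combined with positional embeddings (the class token is omitted for brevity).
z_0 = \left[ x_p^{1} E;\; x_p^{2} E;\; \dots;\; x_p^{N} E \right] + E_{\mathrm{pos}}

% Scaled dot-product self-attention over the patch sequence z_t.
Q = z_t W_Q, \qquad K = z_t W_K, \qquad V = z_t W_V
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```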
3.2.5. Feature Fusion with Cross-Attention
4. Results
- Resizing: All input frames were resized to 224 × 224 pixels to provide consistency across the collection. This input size was chosen to match the requirements of MobileNet and the Vision Transformer, which are designed for images of this resolution. Because larger images would require significantly more memory and processing power, resizing also allowed more efficient GPU utilization.
- Normalization: Following common deep learning practice, each image was normalized using the channel-wise mean and standard deviation of the ImageNet dataset [31]. This normalization brings the pixel values into a uniform range and helps prevent excessively large or small gradients during training. The mean and standard deviation are computed across all three color channels (RGB); the normalization formula is given after this list.
- Video Frame Extraction: A representative sample of facial expressions over time was obtained by extracting frames from the video sequences at a rate of 10 frames per second (FPS). Lowering the frame rate removed redundant frames that would otherwise interfere with training, while still capturing the most notable changes in facial expression.
- Preprocessing Pipeline: The painful and unconscious classes were preprocessed with an emphasis on facial alignment to improve classification. As discussed earlier, MTCNN (Multi-Task Cascaded Convolutional Networks) was used for precise face detection, ensuring consistency in aspects such as jaw alignment and eye positioning. This stage is critical because asymmetric expressions (such as uneven eye movement or drooping on one side of the face) are important indicators of unconsciousness.
- Temporal Alignment for Overlapping Emotions: Recognizing transitions between sad and painful states requires monitoring temporal dependencies across frames. We used a sliding-window approach to smooth transitions across multiple time steps while distinguishing precisely between painful and sad expressions by analyzing adjacent frames. The normalization formula and a sketch of this sliding-window smoothing are given after this list.
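The channel-wise normalization referenced in the list above follows the usual ImageNet recipe; the statistics shown are the commonly used ImageNet values and are assumed rather than quoted from the paper.

```latex
% Per-channel normalization with ImageNet statistics (standard values, assumed).
\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c}, \qquad
\mu = (0.485,\ 0.456,\ 0.406), \qquad \sigma = (0.229,\ 0.224,\ 0.225)
```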
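For the sliding-window temporal alignment, the window length and aggregation rule are not given in this extract; the sketch below assumes a simple majority vote over a fixed-length window of per-frame predictions, purely to illustrate the idea.

```python
from collections import Counter, deque

def smooth_predictions(frame_labels, window_size=5):
    """Majority-vote smoothing of per-frame emotion labels over a sliding window.

    `window_size` is an assumed value; the paper does not state the window length.
    """
    window = deque(maxlen=window_size)
    smoothed = []
    for label in frame_labels:
        window.append(label)
        # Assign the most frequent label in the current window to this frame,
        # which suppresses isolated flips between adjacent classes (e.g., sad vs. painful).
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed

# Example: a single "painful" blip between "sad" frames is smoothed away.
print(smooth_predictions(["sad", "sad", "painful", "sad", "sad", "painful", "painful", "painful"]))
```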
4.1. Training and Optimization
4.2. Evaluation Metrics
- Accuracy: The most straightforward metric is accuracy, which provides an overall measure of the model’s predictive reliability. (The definitions of all four metrics in this list are collected in the block after the list.)
- Precision: Precision is the ratio of correctly predicted positives to all predicted positives. It is crucial in this study because false positives could lead to unnecessary interventions, such as incorrectly classifying a patient as unconscious.
- Recall: Recall, or sensitivity, measures the proportion of true positives out of all actual positive cases. For example, missing a case of painful or unconscious emotion can be critical for patient care, so recall provides insight into how well the model identifies all relevant instances.
- F1-Score: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both.
- Confusion Matrix: We generated confusion matrices for each dataset to highlight where the model struggled. For example, in some circumstances the model confuses sad with painful states, particularly in video sequences with minimal facial movement, as discussed in more depth in the following section. This insight is useful for understanding the model’s limitations and will be explored further in future work.
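For reference, the standard definitions of the four metrics above, written in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), are:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```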
4.3. Experimental Evaluation
4.3.1. Class-Level Performance Analysis
4.3.2. Confusion Matrix Evaluation
4.3.3. Overall Accuracy and Baseline Comparison Analysis
4.4. Ablation Study
4.4.1. Full Model (TriViT-Lite)
4.4.2. Without MobileNet
4.4.3. Without Vision Transformer
4.4.4. Without Handcrafted Features
4.4.5. Without Cross-Attention
5. Conclusion and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zaman, K.; Zengkang, G.; Zhaoyun, S.; Shah, S.M.; Riaz, W.; Ji, J.C.; Hussain, T.; Attar, R.W. A Novel Emotion Recognition System for Human–Robot Interaction (HRI) Using Deep Ensemble Classification. Int. J. Intell. Syst. 2025, 2025, 6611276.
- Lim, N. Cultural differences in emotion: Differences in emotional arousal level between the east and the west. Integr. Med. Res. 2016, 5, 105–109.
- Shan, C.; Gong, S.; McOwan, P.W. Facial Expression Recognition Based on Local Binary Patterns: A Comprehensive Study. Image Vis. Comput. 2009, 27, 803–816.
- Rasool, A.; Aslam, S.; Hussain, N.; Imtiaz, S.; Riaz, W. nBERT: Harnessing NLP for Emotion Recognition in Psychotherapy to Transform Mental Health Care. Information 2025, 16, 301.
- Avila, A.R.; Akhtar, Z.; Santos, J.F.; O’Shaughnessy, D.; Falk, T.H. Feature pooling of modulation spectrum features for improved speech emotion recognition in the wild. IEEE Trans. Affect. Comput. 2021, 12, 177–188.
- Soleymani, M.; Pantic, M.; Pun, T. Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 2012, 3, 211–223.
- Noroozi, F.; Marjanovic, M.; Njegus, A.; Escalera, S.; Anbarjafari, G. Audio-visual emotion recognition in video clips. IEEE Trans. Affect. Comput. 2019, 10, 60–75.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Riaz, W.; Ji, J.; Zaman, K.; Zengkang, G. Neural Network-Based Emotion Classification in Medical Robotics: Anticipating Enhanced Human–Robot Interaction in Healthcare. Electronics 2025, 14, 1320.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
- Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.-H.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the International Conference on Neural Information Processing, Berlin, Germany, 3–7 November 2013; pp. 117–124.
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Wang, H.; Wei, S.; Fang, B. Facial expression recognition using iterative fusion of MO-HOG and deep features. J. Supercomput. 2020, 76, 3211–3221.
- Hashiguchi, R.; Tamaki, T. Temporal cross-attention for action recognition. In Proceedings of the 16th Asian Conference on Computer Vision (ACCV) Workshops, Macao, China, 4–8 December 2022; pp. 283–294.
- Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86.
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2023, 54, 1–41.
- Zhao, Z.; Liu, Q. Former-DFER: Dynamic facial expression recognition transformer. In Proceedings of the 29th ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–2 November 2023.
- Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
- Gowda, S.N.; Gao, B.; Clifton, D.A. FE-Adapter: Adapting image-based emotion classifiers to videos. In Proceedings of the 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024.
- Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 87–102.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
- Chen, J.; Chen, Z.; Chi, Z.; Fu, H. Facial expression recognition based on facial components detection and HOG features. In Proceedings of the International Workshops on Electrical and Computer Engineering Subfields, Yogyakarta, Indonesia, 20–21 August 2014; pp. 884–888.
- Sedaghatjoo, Z.; Hosseinzadeh, H. The use of the symmetric finite difference in the local binary pattern (symmetric LBP). arXiv 2024, arXiv:2407.13178.
- Huang, D.; Shan, C.; Ardabilian, M.; Wang, Y.; Chen, L. Local binary patterns and its application to facial image analysis: A survey. IEEE Trans. Syst. Man Cybern. 2011, 41, 765–781.
- Ren, J.; Jiang, X.; Yuan, J. Face and facial expressions recognition and analysis. In Context Aware Human-Robot and Human-Agent Interaction; Springer International Publishing: Cham, Switzerland, 2015; pp. 3–29.
- Butt, M.H.F.; Li, J.P.; Ji, J.C.; Riaz, W.; Anwar, N.; Butt, F.F.; Ahmad, M.; Saboor, A.; Ali, A.; Uddin, M.Y. Intelligent tumor tissue classification for Hybrid Health Care Units. Front. Med. 2024, 11, 1385524.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 25–29 June 2009; pp. 248–255.
- Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P.H.S. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
- Roy, A.K.; Kathania, H.K.; Sharma, A.; Dey, A.; Ansari, M.S.A. ResEmoteNet: Bridging Accuracy and Loss Reduction in Facial Emotion Recognition. arXiv 2024, arXiv:2409.10545.
- Georgescu, M.-I.; Ionescu, R.T.; Popescu, M. Local Learning with Deep and Handcrafted Features for Facial Expression Recognition. arXiv 2018, arXiv:1804.10892.
- Hasani, B.; Negi, P.S.; Mahoor, M.H. BReG-NeXt: Facial Affect Computing Using Adaptive Residual Networks With Bounded Gradient. arXiv 2020, arXiv:2004.08495.
- Zhang, X.; Huang, Y.; Liu, H.; Gao, W.; Chen, W. HFE-Net: Hybrid Feature Extraction Network for Facial Expression Recognition. PLoS ONE 2023, 18, e0312359.
- Kolahdouzi, M.; Sepas-Moghaddam, A.; Etemad, A. FaceTopoNet: Facial Expression Recognition using Face Topology Learning. arXiv 2022, arXiv:2209.06322.
Emotion | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
---|---|---|---|---|
Happy | 95 | 93 | 93 | 93 |
Sad | 92 | 89 | 90 | 89 |
Neutral | 94 | 91 | 92 | 91 |
Fear | 90 | 87 | 86 | 86 |
Disgust | 89 | 84 | 85 | 84 |
Painful | 76 | 69 | 70 | 68 |
Unconscious | 77 | 71 | 72 | 71 |
Model | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Memory Usage (MB) | FLOPs (G) |
---|---|---|---|---|---|---|---|
VGG-16 | FER2013 | 85.30 | 84 | 83 | 84 | 528 | 15.47 |
ResNet-50 | FER2013 | 87.90 | 87 | 86 | 87 | 177 | 4.12 |
ViT | FER2013 | 88.00 | 88 | 87 | 87 | 330 | 7.23 |
Res2Net | FER2013 | 89.50 | 89 | 89 | 89 | 150 | 3.95 |
Swin Transformer | FER2013 | 91.00 | 90 | 89 | 90 | 290 | 4.51 |
TriViT-Lite | FER2013 | 91.80 | 91 | 90 | 91 | 128 | 2.89 |
VGG-16 | AffectNet | 64.00 | 63 | 62 | 63 | 528 | 15.47 |
ResNet-50 | AffectNet | 65.80 | 65 | 64 | 65 | 177 | 4.12 |
ViT | AffectNet | 70.00 | 69 | 68 | 68 | 330 | 7.23 |
Res2Net | AffectNet | 71.50 | 70 | 70 | 70 | 150 | 3.95 |
Swin Transformer | AffectNet | 73.00 | 72 | 72 | 72 | 290 | 4.51 |
TriViT-Lite | AffectNet | 74.00 | 73 | 72 | 73 | 128 | 2.89 |
VGG-16 | Custom Dataset | 78.40 | 78 | 77 | 78 | 528 | 15.47 |
ResNet-50 | Custom Dataset | 81.20 | 80 | 79 | 80 | 177 | 4.12 |
ViT | Custom Dataset | 82.00 | 81 | 80 | 81 | 330 | 7.23 |
Res2Net | Custom Dataset | 83.80 | 83 | 82 | 83 | 150 | 3.95 |
Swin Transformer | Custom Dataset | 85.20 | 84 | 84 | 84 | 290 | 4.51 |
TriViT-Lite | Custom Dataset | 87.50 | 87 | 85 | 87 | 128 | 2.89 |
Authors | Model | FER2013 Accuracy (%) | AffectNet Accuracy (%) |
---|---|---|---|
Roy et al. [33] | ResEmoteNet | 79.79 | 72.39 |
Georgescu et al. [34] | Local Learning | 75.42 | 63.31 |
Hasani et al. [35] | BReG-NeXt-50 | 71.53 | 68.50 |
Zhang et al. [36] | HFE-Net | 71.69 | 58.55 |
Kolahdouzi et al. [37] | FaceTopoNet | 72.66 | 70.02 |
Ours | TriViT-Lite | 91.80 | 74.00 |
Scenario | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
---|---|---|---|---|
Standard (normal conditions) | 87.5 | 87 | 85 | 87 |
Low Lighting | 81.6 | 83 | 82 | 82 |
High Lighting (overexposed) | 84.4 | 83 | 81 | 82 |
Mild Occlusion (e.g., glasses) | 86.8 | 85 | 84 | 84 |
Configuration | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
---|---|---|---|---|
TriViT-Lite (complete model) | 87.5 | 87 | 85 | 87
Without MobileNet (ViT only) | 81.7 | 79 | 77 | 79 |
Without ViT (MobileNet only) | 84.5 | 82 | 80 | 83 |
Without Handcrafted Features | 86 | 84 | 83 | 85 |
Model Configuration | Accuracy (%) | F1-Score (%) | Occlusion Accuracy (%) |
---|---|---|---|
TriViT-Lite (full) | 87.5 | 87 | 86.8 |
TriViT-Lite (without cross-attention) | 80.4 | 78 | 79 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).