Local Multi-Head Channel Self-Attention for Facial Expression Recognition
Abstract
1. Introduction
2. Related Work
2.1. Attention
- Global Attention: usually a module placed before another model, intended to enhance the important parts of an image.
- Spatial Attention: the attention modules focus on single pixels or areas of the feature maps.
- Channel Attention: the attention modules focus on entire feature maps.
- Self-Attention: the attention tries to find relationships between different aspects of the data.
- Stand-Alone Attention: the architecture is aimed at fully replacing the convolutional blocks and at defining a new processing block for computer vision based on some attention mechanism (mostly self-attention).
- In all of them, channel attention always has a secondary role and there is always a spatial attention sub-module with a primary role;
- In all of them, the crucial multi-head structure is lacking;
- All of them implement channel attention as a “passive” non-learning module;
- None of them integrates our local spatial behavior for channel attention;
- None of them integrates our dynamic scaling, which is specific to our architecture.
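The difference between the spatial and channel categories listed at the start of this section can be illustrated with a minimal NumPy sketch. It is not the architecture proposed in this paper: it assumes plain dot-product self-attention with no learned projections, and the function names and the 6x6x4 feature map are hypothetical. The only point is which axis supplies the tokens, and therefore the size of the attention matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(fmap):
    """Tokens are pixels: the attention matrix is (H*W, H*W)."""
    h, w, c = fmap.shape
    tokens = fmap.reshape(h * w, c)                # one token per pixel
    scores = tokens @ tokens.T / np.sqrt(c)        # pixel-to-pixel similarity
    weights = softmax(scores, axis=-1)
    return (weights @ tokens).reshape(h, w, c), weights

def channel_self_attention(fmap):
    """Tokens are whole feature maps: the attention matrix is (C, C)."""
    h, w, c = fmap.shape
    tokens = fmap.reshape(h * w, c).T              # one token per channel
    scores = tokens @ tokens.T / np.sqrt(h * w)    # map-to-map similarity
    weights = softmax(scores, axis=-1)
    return (weights @ tokens).T.reshape(h, w, c), weights

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6, 4))   # hypothetical 6x6 map with 4 channels
out_s, w_s = spatial_self_attention(fmap)
out_c, w_c = channel_self_attention(fmap)
print(w_s.shape, w_c.shape)   # (36, 36) (4, 4)
```

In this toy setting the spatial attention matrix grows with the square of the number of pixels, while the channel attention matrix depends only on the number of feature maps.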
2.2. FER2013
3. LHC-Net
3.1. Architecture
3.2. Motivation and Analysis
3.2.1. Channel Self-Attention
- Spatial attention in computer vision strongly relies on the main assumption that a relationship between the single pixels or areas of an image exists. This assumption is not self-evident, or at least, not as evident as the relationship between words in a phrase, which spatial attention is inspired by.
- All attempts to pursue spatial self-attention in computer vision (especially in stand-alone mode) have gained only minor improvements over the previous state-of-the-art architectures, and most of the time at the price of an unreasonably high computational cost and a prohibitive pre-training on enormous datasets.
- Much simpler and computationally cheaper approaches, like Squeeze and Excitation in EfficientNet, have already proven very effective without the need to replace convolution.
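For readers unfamiliar with Squeeze and Excitation, mentioned in the last point above, the following is a minimal NumPy sketch of the idea: a global average pool per channel, a small bottleneck, and a sigmoid gate that rescales each feature map. Biases are omitted, the weights are random rather than trained, and the sizes and the reduction ratio `r` are illustrative assumptions.

```python
import numpy as np

def squeeze_excite(fmap, w1, w2):
    """Rescale each channel of fmap (H, W, C) by a gate in (0, 1)."""
    squeezed = fmap.mean(axis=(0, 1))              # squeeze: global average pool -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)        # excitation: bottleneck + ReLU -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # one sigmoid gate per channel -> (C,)
    return fmap * gates, gates                     # gates broadcast over H and W

rng = np.random.default_rng(0)
c, r = 8, 4                                        # hypothetical channel count and reduction ratio
fmap = rng.standard_normal((5, 5, c))
w1 = rng.standard_normal((c, c // r)) * 0.1        # random stand-ins for learned weights
w2 = rng.standard_normal((c // r, c)) * 0.1
out, gates = squeeze_excite(fmap, w1, w2)
print(out.shape, gates.shape)   # (5, 5, 8) (8,)
```

The block adds only two small matrices per stage and leaves the convolutions untouched, which is why it is cheap compared with spatial self-attention.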
3.2.2. Dynamic Scaling
3.2.3. Shared Linear Embedding and Convolution
3.2.4. Local Multi-Head
- Their approach is not designed for self-attention.
- Their local processing units are used at a later stage. They directly calculate local attention weights from embeddings (scalar output with softmax activation). Our local processing units calculate the initial embeddings (high dimension output with linear activation).
- Local heads have the advantage of working at a much lower dimension. Detecting a pattern of a few pixels is harder if the input includes the entire feature map.
- Splitting the images into smaller parts gives local heads the ability to build new feature maps, considering only the important parts of the old maps. There is no reason to compose feature maps in their entirety when only a small part is detecting an interesting feature. Local heads are able to add a feature map to a new feature map only if the original map is activated by a pattern and only around that pattern, avoiding the addition of useless information.
- Local heads seem to be more efficient in terms of parameter allocation (see Appendix B).
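The locality argument in the points above can be sketched as follows. This is a simplified illustration, not the LHC module itself: it assumes horizontal stripes of equal height, uses raw dot-product channel attention without pooling, learned embeddings, or dynamic scaling, and the name `local_channel_attention` is hypothetical. Each head only sees its own stripe, so the way feature maps are mixed in one region does not depend on activations elsewhere.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_channel_attention(fmap, n_heads):
    """Apply channel self-attention independently to each horizontal stripe."""
    h, w, c = fmap.shape
    assert h % n_heads == 0, "height must be divisible by the number of heads"
    outputs = []
    for stripe in np.split(fmap, n_heads, axis=0):    # each stripe: (h // n_heads, w, c)
        sh = stripe.shape[0]
        tokens = stripe.reshape(sh * w, c).T          # (c, sh * w): one row per channel
        scores = tokens @ tokens.T / np.sqrt(sh * w)
        weights = softmax(scores, axis=-1)            # (c, c), local to this stripe
        mixed = weights @ tokens                      # new maps as mixtures of old maps
        outputs.append(mixed.T.reshape(sh, w, c))
    return np.concatenate(outputs, axis=0)            # reassemble full-height maps

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 6, 4))                 # hypothetical 8x6 map, 4 channels
out = local_channel_attention(fmap, n_heads=4)
print(out.shape)   # (8, 6, 4)
```

Perturbing the bottom stripe of `fmap` changes only the bottom stripe of the output, which is the behaviour a single global head cannot offer.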
4. Experiments
4.1. Experimental Setup
4.2. Results
- ResNet34 is confirmed to be the most effective architecture on FER2013, especially its v2 version. In our experiments, raw ResNet34 trained with the multi-stage protocol and evaluated with TTA reaches an accuracy close to that of the previous SOTA (ResMaskingNet).
- Heavy architectures seem unable to outperform simpler models on FER2013.
- LHC-Net has the top accuracy, both with and without TTA.
- Without TTA, LHC-NetC outperforms LHC-Net, but LHC-Net is ahead when TTA is used.
- More importantly, LHC-Net outperforms the previous SOTA with less than one-fourth of its free parameters, and the impact of the LHC modules on the base architecture is much lower (14.8% vs. 80.7%, per the Att column in Appendix C), closer to that of other attention modules such as CBAM/BAM/SE.
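Since several of the results above depend on TTA, a minimal sketch of the general technique may help: class probabilities are averaged over the original image and a set of augmented copies. The specific augmentations used in our experiments are not restated here, so the horizontal flip below is only an assumed example, and `toy_predict` is a hypothetical stand-in for a trained classifier.

```python
import numpy as np

def predict_with_tta(predict, image, augmentations):
    """Average class probabilities over the image and its augmented copies."""
    views = [image] + [aug(image) for aug in augmentations]
    probs = np.stack([predict(view) for view in views])   # (n_views, n_classes)
    return probs.mean(axis=0)

def toy_predict(image):
    """Hypothetical classifier: softmax over the means of 7 column bands."""
    logits = np.array([band.mean() for band in np.array_split(image, 7, axis=1)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def horizontal_flip(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

rng = np.random.default_rng(0)
image = rng.random((48, 48))        # FER2013 images are 48x48 grayscale, 7 classes
probs = predict_with_tta(toy_predict, image, [horizontal_flip])
print(probs.shape)   # (7,)
```

Because each view yields a valid probability vector, their mean is also a valid probability vector, and the final label is simply its argmax.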
5. Conclusions and Future Developments
- Testing LHC on other, more computationally intensive scenarios such as the ImageNet dataset.
- Testing LHC with other backbone architectures and with a larger range of starting performances (not only peak performances).
- We did not optimize the general topology of LHC-Net, and the model hyper-parameters of the attention blocks are hand-selected with only a few attempts. There is evidence that both the five-block topology and the hyper-parameters might be sub-optimal.
- Further research on the stand-alone training mode will be necessary.
- Normalization blocks before and after the LHC blocks should be better evaluated, in order to mitigate the divergence issue mentioned in the previous section.
- A second convolution before the residual connection should be considered, to mimic the general structure of the original Transformer.
- A better head splitting technique could be key in future research. The horizontal splitting we used was only the most obvious way to achieve it, but not necessarily the most effective. Other approaches should be evaluated, e.g., learning the optimal areas through spatial attention.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Acronyms
Acronym | Meaning |
---|---|
AI | Artificial Intelligence |
CBAM | Convolutional Block Attention Module |
CNN | Convolutional Neural Network |
CoAtNet | COnvolutional and self-ATtention Network |
ECA | Efficient Channel Attention Network |
FER | Facial Expression Recognition |
GAN | Generative Adversarial Network |
LHC | Local multi-Head Channel self-attention |
LSTM | Long Short-Term Memory |
ML | Machine Learning |
NLP | Natural Language Processing |
ResNet | Residual Network |
RM-Net | Residual Masking Network |
SAGAN | Self-Attention Generative Adversarial Network |
SCA | Spatial and Channel wise Attention |
SE | Squeeze and Excitation Network |
SGD | Stochastic Gradient Descent |
SOTA | State of the art |
SVM | Support Vector Machine |
TTA | Test Time Augmentation |
URCA | UpSample Residual Channel-wise Attention |
VGG | Visual Geometry Group |
VIT | Vision Transformer |
Appendix B. A Qualitative Analysis of Multi-Head Efficiency over Global Head
Appendix C. Benchmark Results
Model | Accuracy | TTA | Params | Att |
---|---|---|---|---|
BoW Repr. [37] | 67.48% | no | - | - |
Human [37] | 70.00% | no | - | - |
CNN [39] | 70.02% | no | - | - |
VGG19 [38] | 70.80% | yes | 143.7M | - |
EffNet [38] * | 70.80% | yes | 9.18M | - |
SVM [37] | 71.16% | no | - | - |
Inception [40] | 71.60% | yes | 23.85M | - |
Incep.v1 [38] * | 71.97% | yes | 5M | - |
ResNet34 [40] | 72.40% | yes | 27.6M | - |
ResNet34 [38] | 72.42% | yes | 27.6M | - |
VGG [40] | 72.70% | yes | 143.7M | - |
SE-Net50 [41] | 72.70% | yes | 27M | 5.18% |
Incep.v3 [38] * | 72.72% | yes | 23.85M | - |
ResNet34v2 | 72.81% | no | 27.6M | - |
BAMRN50 [38] * | 73.14% | yes | 24.07M | |
Dense121 [38] | 73.16% | yes | 8.06M | - |
ResNet50 [41] | 73.20% | yes | 25.6M | - |
ResNet152 [38] | 73.22% | yes | 60.38M | - |
VGG [42] | 73.28% | yes | 143.7M | - |
CBAMRN50 [38] | 73.39% | yes | 28.09M | |
LHC-Net | | no | 32.4M | 14.8% |
LHC-NetC | | no | 32.4M | 14.8% |
ResNet34v2 | 73.92% | yes | 27.6M | - |
RM-Net [38] | 74.14% | yes | 142.9M | 80.7% |
LHC-NetC | | yes | 32.4M | 14.8% |
LHC-Net | | yes | 32.4M | 14.8% |
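The parameter figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below uses only values from the table. Treating the Att column as the share of parameters added on top of a 27.6M ResNet34 base is an interpretation, not something stated in the table, and applying the same base to RM-Net is likewise an assumption.

```python
# Values copied from the benchmark table (parameters in millions).
lhc_net_params = 32.4
rm_net_params = 142.9
resnet34_params = 27.6      # assumed base for both attention models

ratio = lhc_net_params / rm_net_params
lhc_share = (lhc_net_params - resnet34_params) / lhc_net_params
rm_share = (rm_net_params - resnet34_params) / rm_net_params

print(f"LHC-Net / RM-Net parameters: {ratio:.3f}")    # 0.227
print(f"LHC-Net added share: {lhc_share:.1%}")        # 14.8%
print(f"RM-Net added share: {rm_share:.1%}")          # 80.7%
```

Under these assumptions the computed shares match the Att column, and the ratio confirms that LHC-Net uses less than one-fourth of the parameters of RM-Net.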
References
- Fasel, B.; Luettin, J. Automatic facial expression analysis: A survey. Pattern Recognit. 2003, 36, 259–275.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Shi, Y.; Jain, A.K. Probabilistic Face Embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
- Huang, Y.; Qiu, S.; Zhang, W.; Luo, X.; Wang, J. More Information Supervised Probabilistic Deep Face Embedding Learning. In Proceedings of the 37th International Conference on Machine Learning Research, Virtual, 13–18 July 2020; Daumé, H., III, Singh, A., Eds.; 2020; Volume 119, pp. 4484–4494.
- Wang, X.; Wang, S.; Chi, C.; Zhang, S.; Mei, T. Loss Function Search for Face Recognition. In Proceedings of the 37th International Conference on Machine Learning Research, Virtual, 13–18 July 2020; Daumé, H., III, Singh, A., Eds.; 2020; Volume 119, pp. 10029–10038.
- Yang, X.; Jia, X.; Gong, D.; Yan, D.M.; Li, Z.; Liu, W. LARNet: Lie Algebra Residual Network for Face Recognition. In Proceedings of the 38th International Conference on Machine Learning Research, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; 2021; Volume 139, pp. 11738–11750.
- Uppal, H.; Sepas-Moghaddam, A.; Greenspan, M.; Etemad, A. Depth as Attention for Face Representation Learning. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2461–2476.
- Zhong, S.h.; Liu, Y.; Zhang, Y.; Chung, F.l. Attention modeling for face recognition via deep learning. In Proceedings of the Annual Meeting of the Cognitive Science Society, Sapporo, Japan, 1–4 August 2012; Volume 34.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Xu, L.; Huang, J.; Nitanda, A.; Asaoka, R.; Yamanishi, K. A Novel Global Spatial Attention Mechanism in Convolutional Neural Network for Medical Image Classification. arXiv 2020, arXiv:2007.15897.
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. arXiv 2020, arXiv:1910.03151.
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667.
- Nie, X.; Ding, H.; Qi, M.; Wang, Y.; Wong, E.K. URCA-GAN: UpSample Residual Channel-wise Attention Generative Adversarial Network for image-to-image translation. Neurocomputing 2021, 443, 75–84.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
- Liu, X.; Xiao, G.; Dai, L.; Zeng, K.; Yang, C.; Chen, R. SCSA-Net: Presentation of two-view reliable correspondence learning via spatial-channel self-attention. Neurocomputing 2021, 431, 137–147.
- Tian, Y.; Zhang, Y.; Zhou, D.; Cheng, G.; Chen, W.G.; Wang, R. Triple attention network for video segmentation. Neurocomputing 2020, 417, 202–211.
- Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3286–3295.
- Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. arXiv 2019, arXiv:1906.05909.
- Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064.
- Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030.
- Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying Convolution and Attention for All Data Sizes. arXiv 2021, arXiv:2106.04803.
- Gens, R.; Domingos, P.M. Deep symmetry networks. Adv. Neural Inf. Process. Syst. 2014, 27, 2537–2545.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
- Pham, H.; Dai, Z.; Xie, Q.; Le, Q.V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11557–11568.
- Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the International Conference on Neural Information Processing, Daegu, Korea, 3–7 November 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124.
- Pham, L.; Vu, T.H.; Tran, T.A. Facial Expression Recognition Using Residual Masking Network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4513–4519.
- Minaee, S.; Minaei, M.; Abdolrashidi, A. Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors 2021, 21, 3046.
- Pramerdorfer, C.; Kampel, M. Facial expression recognition using convolutional neural networks: State of the art. arXiv 2016, arXiv:1612.02903.
- Khanzada, A.; Bai, C.; Celepcikay, F.T. Facial expression recognition with deep learning. arXiv 2020, arXiv:2004.11823.
- Khaireddin, Y.; Chen, Z. Facial Emotion Recognition: State of the Art Performance on FER2013. arXiv 2021, arXiv:2105.03588.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv 2019, arXiv:1905.09418.
- Cordonnier, J.B.; Loukas, A.; Jaggi, M. Multi-head attention: Collaborate instead of concatenate. arXiv 2020, arXiv:2006.16362.
- Sukhbaatar, S.; Grave, E.; Bojanowski, P.; Joulin, A. Adaptive attention span in transformers. arXiv 2019, arXiv:1905.07799.
- India, M.; Safari, P.; Hernando, J. Self multi-head attention for speaker recognition. arXiv 2019, arXiv:1906.09890.
Authors | Method | Accuracy |
---|---|---|
Minaee et al. [39] | CNN with global spatial attention | 70.02% |
Pramerdorfer et al. [40] | Inception | 71.6% |
Pramerdorfer et al. [40] | ResNet | 72.4% |
Pramerdorfer et al. [40] | Visual Geometry Group (VGG) | 72.7% |
Khanzada et al. [41] | SE-ResNet50 | 72.7% |
Khanzada et al. [41] | ResNet50 | 73.2% |
Khaireddin et al. [42] | VGG with hyper-parameters fine tuning | 73.28% |
Pham et al. [38] | ResMaskingNet (ResNet with spatial attention) | 74.14% |
Pham et al. [38] | ensemble of 6 convolutional neural networks | 76.82% |
Block | Heads | Dim | Pool | Scale | Ker |
---|---|---|---|---|---|
LHC1 | 8 | 196 | 3 | 1 | 3 |
LHC2 | 8 | 196 | 3 | 1 | 3 |
LHC3 | 7 | 56 | 3 | 1 | 3 |
LHC4 | 7 | 14 | 3 | 1 | 3 |
LHC5 | 1 | 25 | 3 | 1 | 3 |
Optimizer | Adam, learning rate = 0.0001 |
Batch Size | 48 |
Patience | 30 epochs |
Augmentation | 30 degree rotation |
Optimizer | Stochastic Gradient Descent, learning rate = 0.01 |
Batch Size | 64 |
Patience | 10 epochs |
Augmentation | 10 degree rotation, 0.1 horizontal/vertical shift, 0.1 zoom |
Optimizer | Stochastic Gradient Descent, learning rate = 0.01 |
Batch Size | 64 |
Patience | 5 epochs |
Augmentation | - |
Optimizer | Stochastic Gradient Descent, learning rate = |
Batch Size | 64 |
Patience | 3 epochs |
Augmentation | - |
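The Patience entries in the training settings above refer to early stopping. As a minimal sketch of that mechanism, assuming each of the four settings blocks is one training stage and that a stage ends once validation accuracy has not improved for the stated number of epochs, the logic can be written as below. The `STAGES` list only restates the tables (the last learning rate is missing in the source and is left as `None`); the function name and the accuracy sequence are hypothetical.

```python
# Settings restated from the tables above; one dict per assumed stage.
STAGES = [
    {"optimizer": "Adam", "lr": 1e-4, "batch_size": 48, "patience": 30},
    {"optimizer": "SGD", "lr": 1e-2, "batch_size": 64, "patience": 10},
    {"optimizer": "SGD", "lr": 1e-2, "batch_size": 64, "patience": 5},
    {"optimizer": "SGD", "lr": None, "batch_size": 64, "patience": 3},  # lr missing in source
]

def stop_with_patience(val_accuracies, patience):
    """Return (best_epoch, stop_epoch) for per-epoch validation accuracies."""
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch      # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # waited `patience` epochs in vain
    return best_epoch, len(val_accuracies) - 1     # ran out of epochs first

# Made-up accuracy curve, used only to exercise the function.
accs = [0.60, 0.65, 0.64, 0.66, 0.65, 0.65, 0.64, 0.63]
print(stop_with_patience(accs, STAGES[3]["patience"]))   # (3, 6)
```

In a multi-stage setup of this kind, the weights from the best epoch of one stage would typically be the starting point of the next.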
Model | Top 40% | Top 40% w/o Best | Top 25% | Top 25% w/o Best | Best |
---|---|---|---|---|---|
ResNet34v2 | 72.69% | 72.65% | 72.75% | 72.69% | 72.81% |
LHC-Net | 72.77% | 72.83% | |||
LHC-NetC | 72.79% | 72.89% |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pecoraro, R.; Basile, V.; Bono, V. Local Multi-Head Channel Self-Attention for Facial Expression Recognition. Information 2022, 13, 419. https://doi.org/10.3390/info13090419