Chinese Lip-Reading Research Based on ShuffleNet and CBAM
Abstract
1. Introduction
2. Lip-Reading Model Architecture
2.1. ShuffleNet
2.1.1. Group Convolution
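Group convolution underpins the ShuffleNet frontend used here. As a minimal PyTorch sketch (the channel counts are illustrative, not the paper's configuration), splitting 64 channels into 4 groups cuts the weights roughly fourfold while preserving the output shape:

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution: every output channel mixes all 64 input channels.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Group convolution with groups=4: channels are split into 4 groups of 16,
# each convolved independently, cutting weights roughly by the group count.
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)

x = torch.randn(1, 64, 56, 56)
assert standard(x).shape == grouped(x).shape   # same output shape

print(sum(p.numel() for p in standard.parameters()))  # 36,928
print(sum(p.numel() for p in grouped.parameters()))   # 9,280
```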
2.1.2. Channel Shuffle
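Channel shuffle restores cross-group information flow between stacked group convolutions. A minimal sketch of the reshape, transpose, flatten trick from the ShuffleNet papers (PyTorch assumed):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups via reshape-transpose-flatten."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # split channel dim into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel axes
    return x.view(n, c, h, w)                 # flatten back to [N, C, H, W]

# With 8 channels and 2 groups, channels 0..7 become 0,4,1,5,2,6,3,7.
x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
```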
2.1.3. ShuffleNet V2 Unit
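The stride-1 ShuffleNet V2 basic unit splits the channels, transforms one half through a 1×1 conv, 3×3 depthwise conv, 1×1 conv branch, concatenates, and shuffles. A simplified sketch under the same assumptions (the channel_shuffle helper repeats the previous sketch; the stride-2 downsampling unit, which transforms both halves, is omitted):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.size()
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class ShuffleV2Unit(nn.Module):
    """Stride-1 ShuffleNet V2 unit: split, transform one half, concat, shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 2  # channel split: half identity, half transformed
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # 3x3 DWConv
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                      # channel split
        out = torch.cat((x1, self.branch(x2)), dim=1)   # identity + transform
        return channel_shuffle(out, groups=2)
```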
2.2. Convolutional Block Attention Module
2.2.1. Channel Attention Module
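CBAM's channel attention pools each feature map globally by average and by max, pushes both descriptors through a shared bottleneck MLP, and gates the channels with a sigmoid. An illustrative PyTorch sketch (reduction ratio 16, the default in the CBAM paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over avg- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # [N, C, 1, 1]
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # [N, C, 1, 1]
        return x * torch.sigmoid(avg + mx)           # per-channel rescaling
```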
2.2.2. Spatial Attention Module
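The spatial module instead compresses the channel dimension with per-position average and max and learns a 2-D attention map with a 7×7 convolution. A sketch under the same assumptions; in CBAM the channel module is applied first, then the spatial module:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM spatial attention: avg/max over channels -> 7x7 conv -> sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # [N, 1, H, W]
        mx, _ = x.max(dim=1, keepdim=True)  # [N, 1, H, W]
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                     # per-position rescaling
```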
2.3. Temporal Convolutional Network
2.3.1. Causal Convolution
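A causal convolution lets the TCN backend predict from past frames only. One common implementation pads on the left by (k - 1) * dilation and trims the surplus on the right; a sketch, with the input length chosen purely for illustration:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D causal convolution: the output at time t sees only inputs <= t."""
    def __init__(self, channels: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.trim = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.trim, dilation=dilation)

    def forward(self, x):                   # x: [N, C, T]
        out = self.conv(x)
        return out[:, :, :-self.trim] if self.trim else out

y = CausalConv1d(16, kernel_size=3)(torch.randn(2, 16, 29))
print(y.shape)  # torch.Size([2, 16, 29]): sequence length preserved
```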
2.3.2. Dilated Convolution
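Dilations that double per layer grow the receptive field exponentially with depth, which is why a few TCN levels can span an entire word clip. A worked example of the arithmetic (plain Python):

```python
def receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of stacked causal convolutions with dilations 1, 2, 4, ..."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))

# With kernel size 3: 1 + 2*(1+2+4+8) = 31 time steps after only 4 layers.
for layers in range(1, 5):
    print(layers, receptive_field(3, layers))  # 3, 7, 15, 31
```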
2.3.3. Residual Connection
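Each TCN level wraps its two dilated causal convolutions in a residual block so that deeper stacks still train well. A sketch under the same PyTorch assumption (the weight normalization and dropout of the original TCN paper are omitted for brevity):

```python
import torch
import torch.nn as nn

class TemporalResidualBlock(nn.Module):
    """Two dilated causal convolutions plus a skip connection; a 1x1 conv
    matches channel counts when input and output widths differ."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        def layer(c_in, c_out):
            return nn.Sequential(
                nn.ConstantPad1d((pad, 0), 0.0),  # left-pad only: causal
                nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation),
                nn.ReLU(inplace=True),
            )
        self.net = nn.Sequential(layer(in_ch, out_ch), layer(out_ch, out_ch))
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                   # x: [N, C, T]
        return torch.relu(self.net(x) + self.skip(x))
```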
3. Experiment
3.1. Dataset
3.2. Experiment Settings
3.3. Recognition Results
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
TCN | Temporal Convolutional Network |
CBAM | Convolutional Block Attention Module |
CNN | Convolutional Neural Network |
LSTM | Long Short-Term Memory |
CTC | Connectionist Temporal Classification |
LRW | Lip Reading in the Wild |
GRU | Gated Recurrent Unit |
DWConv | Depthwise Convolution |
CAM | Channel Attention Module |
SAM | Spatial Attention Module |
References
- Palecek, K. Utilizing lipreading in large vocabulary continuous speech recognition. In Proceedings of the International Conference on Speech and Computer, Hatfield, UK, 12–16 September 2017; Springer: Cham, Switzerland, 2017; pp. 767–776. [Google Scholar]
- Mcgurk, H.; Macdonald, J. Hearing lips and seeing voices. Nature 1976, 264, 746–748. [Google Scholar] [CrossRef] [PubMed]
- Assael, Y.M.; Shillingford, B.; Whiteson, S.; de Freitas, N. LipNet: End-to-end sentence-level lipreading. arXiv 2016, arXiv:1611.01599. [Google Scholar]
- Burton, J.; Frank, D.; Saleh, M.; Navab, N.; Bear, H.L. The speaker-independent lipreading play-off; a survey of lipreading machines. In Proceedings of the 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), Sophia Antipolis, France, 12–14 December 2018. [Google Scholar]
- Lu, H.; Liu, X.; Yin, Y.; Chen, Z. A Patent Text Classification Model Based on Multivariate Neural Network Fusion. In Proceedings of the 2019 6th International Conference on Soft Computing & Machine Intelligence (ISCMI), Johannesburg, South Africa, 19–20 November 2019. [Google Scholar]
- Hussein, D.; Ibrahim, D.M.; Sarhan, A.M. HLR-Net: A Hybrid Lip-Reading Model Based on Deep Convolutional Neural Networks. Comput. Mater. Contin. 2021, 68, 1531–1549. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Saberi-Movahed, F.; Rostami, M.; Berahmand, K.; Karami, S.; Tiwari, P.; Oussalah, M.; Band, S.S. Dual Regularized Unsupervised Feature Selection Based on Matrix Factorization and Minimum Redundancy with application in gene selection. Knowl. Based Syst. 2022, 256, 109884. [Google Scholar] [CrossRef]
- Nazari, K.; Ebadi, M.J.; Berahmand, K. Diagnosis of alternaria disease and leafminer pest on tomato leaves using image processing techniques. J. Sci. Food Agric. 2022, 102, 6907–6920. [Google Scholar] [CrossRef] [PubMed]
- Rostami, M.; Berahmand, K.; Nasiri, E.; Forouzandeh, S. Review of swarm intelligence-based feature selection methods. Eng. Appl. Artif. Intell. 2021, 100, 104210. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
- Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Zhang, R.; Sun, F.; Song, Z.; Wang, X.; Du, Y.; Dong, S. Short-term traffic flow forecasting model based on GA-TCN. J. Adv. Transp. 2021, 2021, 1338607. [Google Scholar] [CrossRef]
- Hewage, P.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482. [Google Scholar] [CrossRef] [Green Version]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Chung, J.S.; Zisserman, A. Lip reading in profile. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017. [Google Scholar]
- Stafylakis, T.; Tzimiropoulos, G. Combining residual networks with LSTMs for lipreading. In Proceedings of the INTERSPEECH 2017: Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
- Wang, C.H. Multi-grained spatio-temporal modeling for lip-reading. In Proceedings of the 30th British Machine Vision Conference, Cardiff, UK, 9–12 September 2019. [Google Scholar]
- Weng, X.; Kitani, K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading. In Proceedings of the 30th British Machine Vision Conference, Cardiff, UK, 9–12 September 2019. [Google Scholar]
- Luo, M.S.; Yang, S.; Shan, S.G.; Chen, X.L. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020. [Google Scholar]
- Martinez, B.; Ma, P.; Petridis, S.; Pantic, M. Lipreading using temporal convolutional networks. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6319–6323. [Google Scholar]
mmAP (%) at equal FLOPs budgets:

Model | 40M | 140M | 300M | 500M |
---|---|---|---|---|
Xception | 21.9 | 29.0 | 31.3 | 32.9 |
ShuffleNet v1 | 20.9 | 27.0 | 29.9 | 32.9 |
MobileNet v2 | 20.7 | 24.4 | 30.0 | 30.6 |
ShuffleNet v2 | 23.7 | 29.6 | 32.2 | 34.2 |

GPU speed (images/s) at equal FLOPs budgets:

Model | 40M | 140M | 300M | 500M |
---|---|---|---|---|
Xception | 178 | 131 | 101 | 83 |
ShuffleNet v1 | 152 | 85 | 76 | 60 |
MobileNet v2 | 146 | 111 | 94 | 72 |
ShuffleNet v2 | 183 | 138 | 105 | 83 |
Year | Method | Frontend | Backend | Input Size | LRW Acc. |
---|---|---|---|---|---|
2016 | Chung et al. [26] | VGGM | - | 112 × 112 | 61.1% |
2017 | Stafylakis et al. [27] | ResNet-34 | BiLSTM | 112 × 112 | 83.5% |
2019 | Wang et al. [28] | Multi-Grained ResNet-18 | Conv BiLSTM | 88 × 88 | 83.3% |
2019 | Weng et al. [29] | Two-Stream ResNet-18 | BiLSTM | 112 × 112 | 84.1% |
2020 | Luo et al. [30] | ResNet-18 | BiGRU | 88 × 88 | 83.5% |
2020 | Martinez et al. [31] | ResNet-18 | TCN | 88 × 88 | 85.3% |
Method | Top-1 Acc. (%) | FLOPs (×10⁹) | Params (×10⁶) |
---|---|---|---|
ResNet-18 | 74.4 | 3.46 | 12.55 |
MobileNet v2 | 69.5 | 1.48 | 3.5 |
ShuffleNet v2 (1×) | 71.3 | 1.73 | 3.9 |
ShuffleNet v2 (0.5×) | 68.4 | 0.89 | 3.0 |
ShuffleNet v2 (0.5×) + CBAM | 71.2 | 1.01 | 3.1 |
Word / Model | ResNet-18 (%) | MobileNet v2 (%) | ShuffleNet v2 (1×) (%) | ShuffleNet v2 (0.5×) (%) | ShuffleNet v2 (0.5×) + CBAM (%) |
---|---|---|---|---|---|
Tai-Yang (sun) | 78.00 | 73.00 | 74.00 | 65.00 | 76.00 |
Gong-Zuo (work) | 73.00 | 72.00 | 72.00 | 69.00 | 76.00 |
Dui-Bu-Qi (sorry) | 79.00 | 71.00 | 69.00 | 69.00 | 79.00 |
Shui-Jiao (sleep) | 79.00 | 71.00 | 68.00 | 69.00 | 76.00 |
Chi-Fan (eat) | 70.00 | 68.00 | 62.00 | 60.00 | 73.00 |
Bai-Yun (cloud) | 79.00 | 79.00 | 68.00 | 66.00 | 76.00 |
Shun-Li (well) | 79.00 | 72.00 | 69.00 | 67.00 | 79.00 |
Zhong-Guo (China) | 75.00 | 70.00 | 65.00 | 63.00 | 73.00 |
Jian-Pan (keyboard) | 78.00 | 70.00 | 68.00 | 63.00 | 78.00 |
Xie-Xie (thanks) | 73.00 | 74.00 | 66.00 | 65.00 | 70.00 |
Zai-Jian (goodbye) | 77.00 | 70.00 | 60.00 | 66.00 | 77.00 |
Xue-Xiao (school) | 75.00 | 70.00 | 69.00 | 66.00 | 73.00 |
Bai-Zhi (paper) | 79.00 | 71.00 | 69.00 | 69.00 | 78.00 |
Shu-Ben (book) | 79.00 | 72.00 | 69.00 | 63.00 | 72.00 |
Gang-Bi (pen) | 79.00 | 70.00 | 73.00 | 56.00 | 71.00 |
Shou-Ji (phone) | 79.00 | 69.00 | 69.00 | 65.00 | 79.00 |
Dian-Nao (computer) | 79.00 | 69.00 | 67.00 | 60.00 | 73.00 |
Ping-Guo (apple) | 75.00 | 63.00 | 62.00 | 60.00 | 69.00 |
Xiang-Jiao (banana) | 79.00 | 69.00 | 69.00 | 64.00 | 78.00 |
Pu-Tao (grape) | 73.00 | 69.00 | 68.00 | 67.00 | 77.00 |
Top-1 accuracy when N frames are randomly dropped from each test clip:

Model | Training | N = 0 | N = 1 | N = 2 | N = 3 | N = 4 |
---|---|---|---|---|---|---|
ShuffleNet v2 (0.5×) + CBAM | Fixed-length | 71.2% | 62.1% | 51.7% | 43.8% | 37.1% |
ShuffleNet v2 (0.5×) + CBAM | Variable-length | 68.0% | 66.3% | 61.9% | 57.1% | 54.2% |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).