MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
Abstract
1. Introduction
The main contributions of this work are as follows:
- (1) We present a new supervised speech recognition framework for training across audio speech recognition (ASR), visual speech recognition (VSR), and audio–visual speech recognition (AVSR) tasks simultaneously, achieving a VSR result of 21.0% WER on the LRS3-TED dataset, which is state-of-the-art among models trained on under 3000 h of data.
- (2) We introduce a multi-task hybrid Connectionist Temporal Classification (CTC)/attention loss that enables direct multi-task training across the ASR, VSR, and AVSR tasks (a minimal formulation is sketched after this list). This loss significantly enhances VSR performance while mitigating the high compute demands of multi-task self-supervised learning, requiring only 18% of the training compute of the USR [18] self-supervised multi-task approach (47 vs. 253 exaFLOPS; see Section 5.5).
- (3) We demonstrate that supervised multi-task speech recognition models generalize strongly, achieving 44.7% WER on the WildVSR dataset [19], which is state-of-the-art among models trained on under 3000 h of data; methods that report lower WER rely on substantially larger amounts of data that are not publicly available. Furthermore, MultiAVSR is the first model to perform better on WildVSR without an external language model (44.7% WER) than with one (46.0% WER), indicating increased linguistic generalization, particularly on in-the-wild data.
- (4) We demonstrate that our multi-task training approach significantly reduces the reliance on external language models. Our model exhibits only a 2.8% relative improvement when an external language model is added during evaluation, whereas state-of-the-art single-task models see >7% improvement. This reduced reliance is a critical step toward faster and more compute-efficient real-time VSR, as removing the language model decreases inference time by 40% and reduces the total parameter count at evaluation by 18%.
- (5) We show that our supervised multi-task framework improves ASR and AVSR performance in noisy environments, achieving relative improvements of 16% and 30%, respectively, over state-of-the-art single-task approaches trained on more data [4].
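As referenced in contribution (2), the block below is a minimal sketch of a hybrid CTC/attention objective extended across the three tasks. The interpolation weight λ and the per-task weights α_t are assumed hyperparameters used purely for illustration; the loss actually used in this work is defined in Section 3.3.

```latex
% Minimal sketch, assuming a standard hybrid CTC/attention interpolation per task
% and a weighted sum across tasks; \lambda and \alpha_t are illustrative
% hyperparameters, not values taken from the paper.
\mathcal{L}^{(t)} = \lambda\,\mathcal{L}_{\mathrm{CTC}}^{(t)} + (1-\lambda)\,\mathcal{L}_{\mathrm{Att}}^{(t)},
\qquad t \in \{\mathrm{ASR},\,\mathrm{VSR},\,\mathrm{AVSR}\},
\qquad \mathcal{L}_{\mathrm{multi}} = \sum_{t}\alpha_{t}\,\mathcal{L}^{(t)}.
```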
2. Related Work
2.1. Self-Supervised Methods
2.2. Supervised Methods
2.3. Multi-Task Methods
3. Methods
3.1. Architecture
VSR Training | ASR Training | AVSR Training | Shared Encoder | VSR WER (%) | ASR WER (%) | AVSR WER (%) |
---|---|---|---|---|---|---|
Single-Task Models | | | | | | |
✓ | ✗ | ✗ | ✗ | 42.0 | - | - |
✗ | ✓ | ✗ | ✗ | - | 2.3 | - |
✗ | ✗ | ✓ | ✗ | - | - | 2.3 |
Multi-Task Models | | | | | | |
✓ | ✓ | ✗ | ✗ | 41.2 | 2.1 | - |
✓ | ✓ | ✗ | ✓ | 32.2 | 2.5 | - |
✓ | ✗ | ✓ | ✓ | 36.9 | - | 3.7 |
✓ | ✓ | ✓ | ✓ | 31.1 | 2.4 | 2.5 |
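To make the "Shared Encoder" column above concrete, the sketch below wires modality-specific front-ends into a single encoder that serves VSR, ASR, and AVSR. The layer choices (linear front-ends, a generic Transformer encoder, a single CTC head, naive additive fusion) are illustrative stand-ins and do not reproduce the paper's Conformer-based architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderAVSR(nn.Module):
    """Hedged sketch of a shared-encoder multi-task model: video and audio
    front-ends feed one encoder whose parameters are reused across tasks."""

    def __init__(self, dim=256, num_layers=4, vocab_size=1000):
        super().__init__()
        self.video_frontend = nn.Linear(96 * 96, dim)  # stand-in for a 3D-CNN front-end
        self.audio_frontend = nn.Linear(80, dim)       # stand-in for a filterbank front-end
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.ctc_head = nn.Linear(dim, vocab_size)

    def forward(self, video=None, audio=None):
        feats = []
        if video is not None:                          # (B, T, 96, 96) mouth crops
            feats.append(self.video_frontend(video.flatten(2)))
        if audio is not None:                          # (B, T, 80) filterbank frames
            feats.append(self.audio_frontend(audio))
        x = feats[0] if len(feats) == 1 else feats[0] + feats[1]  # naive fusion for AVSR
        return self.ctc_head(self.shared_encoder(x))

# Example: the same parameters handle VSR (video only), ASR (audio only), and AVSR.
model = SharedEncoderAVSR()
logits = model(video=torch.randn(2, 50, 96, 96), audio=torch.randn(2, 50, 80))
```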
Method | Total Hours | Multi-Task Training | LM | LRS3 WER (%) | WildVSR WER (%) |
---|---|---|---|---|---|
No Additional Data | |||||
AV-HuBERT [30] | 433 | ✗ | ✗ | 41.6 | 69.4 ‡ |
BRAVEn [20] | 433 | ✗ | ✗ | 36.0 | - |
RAVEn [21] | 433 | ✗ | ✗ | 39.1 | 69.9 ‡ |
Auto-AVSR [4] | 438 | ✗ | ✓ | 36.3 | - |
USR [18] | 438 | ✓ | ✗ | 34.3 | - |
SyncVSR [35] | 438 | ✗ | ✗ | 33.3 | - |
SyncVSR [35] | 438 | ✗ | ✓ | 31.2 | - |
MultiAVSR | 438 | ✓ | ✗ | 31.1 | 63.0 |
MultiAVSR | 438 | ✓ | ✓ | 29.9 | 63.7 |
Less than 1000 h | |||||
CM-Seq2Seq [37] | 595 | ✗ | ✓ | 43.3 | - |
VTP [39] | 698 | ✗ | ✗ | 40.6 | 75.6 ‡ |
Auto-AVSR [4] | 818 | ✗ | ✓ | 33.0 | - |
Auto-AVSR [4] | 661 | ✗ | ✗ | 32.7 ‡ | 62.3 ‡ |
SyncVSR [35] | 661 | ✗ | ✗ | 30.4 | - |
SyncVSR [35] | 661 | ✗ | ✓ | 28.1 | - |
MultiAVSR | 661 | ✓ | ✗ | 28.1 | 57.8 |
MultiAVSR | 661 | ✓ | ✓ | 27.3 | 58.2 |
Less than 3000 h | |||||
VTP [39] | 2676 | ✗ | ✗ | 30.7 | 68.7 ‡ |
BRAVEn [20] | 1759 | ✗ | ✗ | 26.6 | - |
u-HuBERT [23] | 2221 | ✓ | ✗ | 27.2 | - |
AV-HuBERT [30] | 1759 | ✗ | ✗ | 26.9 | 48.7 ‡ |
RAVEn [21] | 1759 | ✗ | ✓ | 23.1 | 46.7 ‡ |
Auto-AVSR [4] | 1759 | ✗ | ✗ | 24.6 | 49.3 ‡ |
Auto-AVSR [4] | 1902 | ✗ | ✓ | 23.5 | - |
SyncVSR [35] | 1992 | ✗ | ✗ | 23.4 | - |
SyncVSR [35] | 1992 | ✗ | ✓ | 21.5 | - |
USR [18] | 1759 | ✓ | ✗ | 22.3 | 46.8 † |
USR [18] | 1759 | ✓ | ✓ | 21.5 | 46.4 |
MultiAVSR | 1968 | ✓ | ✗ | 21.6 | 44.7 |
MultiAVSR | 1968 | ✓ | ✓ | 21.0 | 46.0 |
Greater than 3000 h and Extra Proprietary Data | |||||
RNN-T [31] | 30,000 | ✗ | ✗ | 33.6 | - |
BRAVEn [20] | 3082 | ✗ | ✓ | 20.1 | - |
SparseVSR [40] | 3068 | ✗ | ✗ | 19.5 | - |
Auto-AVSR [4] | 3448 | ✗ | ✓ | 19.1 | 38.6 ‡ |
SynthVSR [34] | 7100 | ✗ | ✗ | 18.2 | - |
SynthVSR [34] | 7100 | ✗ | ✓ | 16.9 | - |
ViT 3D [33] | 90,000 | ✗ | ✗ | 17.0 | - |
LP Conformer [38] | 100,000 | ✗ | ✗ | 12.8 | - |
3.2. Multi-Task Training
3.3. Loss
4. Experimental Setup
4.1. Datasets
4.2. Evaluation
4.3. Pre-Processing and Augmentation
4.4. Implementation Details
4.5. Language Model
5. Results
5.1. Comparison to the Latest Methods
5.2. Language Model
5.3. Generalization
5.4. Auditory Noise Experiments
5.5. Training Compute
6. Future Works
7. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
SR | Speech recognition |
VSR | Visual speech recognition |
ASR | Audio (or automatic) speech recognition |
AVSR | Audio–visual speech recognition |
WER | Word error rate |
LM | Language model |
References
- Dua, M.; Akanksha; Dua, S. Noise robust automatic speech recognition: Review and analysis. Int. J. Speech Technol. 2023, 26, 475–519. [Google Scholar] [CrossRef]
- Cui, X.; Iseli, M.; Zhu, Q.; Alwan, A. Evaluation of noise robust features on the Aurora databases. In Proceedings of the 7th International Conference on Spoken Language Processing, INTERSPEECH, Denver, CO, USA, 16–20 September 2002; pp. 481–484. [Google Scholar]
- Haapakangas, A.; Hongisto, V.; Hyönä, J.; Kokko, J.; Keränen, J. Effects of unattended speech on performance and subjective distraction: The role of acoustic design in open-plan offices. Appl. Acoust. 2014, 86, 1–16. [Google Scholar] [CrossRef]
- Ma, P.; Haliassos, A.; Fernandez-Lopez, A.; Chen, H.; Petridis, S.; Pantic, M. Auto-avsr: Audio–visual speech recognition with automatic labels. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Rouditchenko, A.; Thomas, S.; Kuehne, H.; Feris, R.; Glass, J. mWhisper-Flamingo for multilingual audio–visual noise-robust speech recognition. arXiv 2025, arXiv:2502.01547. [Google Scholar] [CrossRef]
- Shi, B.; Mohamed, A.; Hsu, W.N. Learning Lip-Based Audio–visual Speaker Embeddings with AV-HuBERT. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 4785–4789. [Google Scholar]
- Sumby, W.H.; Pollack, I. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 1954, 26, 212–215. [Google Scholar] [CrossRef]
- Cappellazzo, U.; Kim, M.; Chen, H.; Ma, P.; Petridis, S.; Falavigna, D.; Brutti, A.; Pantic, M. Large language models are strong audio–visual speech recognition learners. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Ryumin, D.; Ivanko, D.; Ryumina, E. Audio–visual speech and gesture recognition by sensors of mobile devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef]
- Sun, K.; Yu, C.; Shi, W.; Liu, L.; Shi, Y. Lip-interact: Improving mobile device interaction with silent speech commands. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, Berlin, Germany, 14–17 October 2018; pp. 581–593. [Google Scholar]
- Srivastava, T.; Winters, R.M.; Gable, T.; Wang, Y.T.; LaScala, T.; Tashev, I.J. Whispering wearables: Multimodal approach to silent speech recognition with head-worn devices. In Proceedings of the 26th International Conference on Multimodal Interaction, San Jose, Costa Rica, 4–8 November 2024; pp. 214–223. [Google Scholar]
- Jin, Y.; Gao, Y.; Xu, X.; Choi, S.; Li, J.; Liu, F.; Li, Z.; Jin, Z. EarCommand: “Hearing” your silent speech commands in ear. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 1–28. [Google Scholar] [CrossRef]
- Cha, H.S.; Chang, W.D.; Im, C.H. Deep-learning-based real-time silent speech recognition using facial electromyogram recorded around eyes for hands-free interfacing in a virtual reality environment. Virtual Real. 2022, 26, 1047–1057. [Google Scholar] [CrossRef]
- Acosta, L.H.; Reinhardt, D. A survey on privacy issues and solutions for Voice-controlled Digital Assistants. Pervasive Mob. Comput. 2022, 80, 101523. [Google Scholar] [CrossRef]
- Abdolrahmani, A.; Kuber, R.; Branham, S.M. “Siri Talks at You” An Empirical Investigation of Voice-Activated Personal Assistant (VAPA) Usage by Individuals Who Are Blind. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, Galway, Ireland, 22–24 October 2018; pp. 249–258. [Google Scholar]
- Cowan, B.R.; Pantidi, N.; Coyle, D.; Morrissey, K.; Clarke, P.; Al-Shehri, S.; Earley, D.; Bandeira, N. “What can I help you with?” infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services, Vancouver, BC, Canada, 4–7 September 2017; pp. 1–12. [Google Scholar]
- Pandey, L.; Hasan, K.; Arif, A.S. Acceptability of speech and silent speech input methods in private and public. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Online, 8–13 May 2021; pp. 1–13. [Google Scholar]
- Haliassos, A.; Mira, R.; Chen, H.; Landgraf, Z.; Petridis, S.; Pantic, M. Unified Speech Recognition: A single model for auditory, visual, and audiovisual inputs. arXiv 2024, arXiv:2411.02256. [Google Scholar]
- Djilali, Y.A.D.; Narayan, S.; LeBihan, E.; Boussaid, H.; Almazrouei, E.; Debbah, M. Do VSR models generalize beyond LRS3? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6635–6644. [Google Scholar]
- Haliassos, A.; Zinonos, A.; Mira, R.; Petridis, S.; Pantic, M. BRAVEn: Improving self-supervised pre-training for visual and auditory speech recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11431–11435. [Google Scholar]
- Haliassos, A.; Ma, P.; Mira, R.; Petridis, S.; Pantic, M. Jointly Learning Visual and Auditory Speech Representations from Raw Data. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Ma, P.; Mira, R.; Petridis, S.; Schuller, B.W.; Pantic, M. Lira: Learning visual speech representations from audio through self-supervision. arXiv 2021, arXiv:2106.09171. [Google Scholar]
- Hsu, W.N.; Shi, B. u-hubert: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. Adv. Neural Inf. Process. Syst. 2022, 35, 21157–21170. [Google Scholar]
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
- Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio–visual model for speech separation. ACM Trans. Graph. (TOG) 2018, 37, 1–11. [Google Scholar] [CrossRef]
- Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
- Pascual, S.; Ravanelli, M.; Serrà, J.; Bonafonte, A.; Bengio, Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 161–165. [Google Scholar]
- Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio–visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 44, 8717–8727. [Google Scholar] [CrossRef]
- Chung, J.S.; Zisserman, A. Lip reading in the wild. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part II 13. Springer: Berlin/Heidelberg, Germany, 2017; pp. 87–103. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
- Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent neural network transducer for audio–visual speech recognition. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 905–912. [Google Scholar]
- Zhu, Q.; Zhou, L.; Zhang, Z.; Liu, S.; Jiao, B.; Zhang, J.; Dai, L.; Jiang, D.; Li, J.; Wei, F. VATLM: Visual-audio-text pre-training with unified masked prediction for speech representation learning. IEEE Trans. Multimed. 2024, 26, 1055–1064. [Google Scholar] [CrossRef]
- Serdyuk, D.; Braga, O.; Siohan, O. Transformer-based video front-ends for audio–visual speech recognition for single and multi-person video. arXiv 2022, arXiv:2201.10439. [Google Scholar]
- Liu, X.; Lakomkin, E.; Vougioukas, K.; Ma, P.; Chen, H.; Xie, R.; Doulaty, M.; Moritz, N.; Kolar, J.; Petridis, S.; et al. Synthvsr: Scaling up visual speech recognition with synthetic supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18806–18815. [Google Scholar]
- Ahn, Y.J.; Park, J.; Park, S.; Choi, J.; Kim, K.E. SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization. In Proceedings of the Interspeech 2024, ISCA, Kos Island, Greece, 1–5 September 2024; pp. 867–871. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Ma, P.; Petridis, S.; Pantic, M. End-to-end audio–visual speech recognition with conformers. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7613–7617. [Google Scholar]
- Chang, O.; Liao, H.; Serdyuk, D.; Shahy, A.; Siohan, O. Conformer is all you need for visual speech recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10136–10140. [Google Scholar]
- Prajwal, K.; Afouras, T.; Zisserman, A. Sub-word level lip reading with visual attention. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5162–5172. [Google Scholar]
- Fernandez-Lopez, A.; Chen, H.; Ma, P.; Haliassos, A.; Petridis, S.; Pantic, M. SparseVSR: Lightweight and Noise Robust Visual Speech Recognition. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 1603–1607. [Google Scholar]
- Ma, P.; Petridis, S.; Pantic, M. Visual speech recognition for multiple languages in the wild. Nat. Mach. Intell. 2022, 4, 930–939. [Google Scholar] [CrossRef]
- Kim, S.; Jang, K.; Bae, S.; Cho, S.; Yun, S.Y. MoHAVE: Mixture of hierarchical audio–visual experts for robust speech recognition. arXiv 2025, arXiv:2502.10447. [Google Scholar]
- Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1021–1030. [Google Scholar]
- Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 1993, 12, 247–251. [Google Scholar] [CrossRef]
- Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6447–6456. [Google Scholar]
- Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv 2018, arXiv:1804.03209. [Google Scholar]
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
- Afouras, T.; Chung, J.S.; Zisserman, A. Asr is all you need: Cross-modal distillation for lip reading. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2143–2147. [Google Scholar]
Language Model Weight | VSR WER ↓ |
---|---|
0 | 21.6 |
0.1 | 21.1 |
0.2 | 21.0 |
0.3 | 21.6 |
0.4 | 22.4 |
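For context on the sweep above, the sketch below shows how an external language-model weight is commonly applied via shallow fusion when scoring beam-search candidates; the function and its pure-Python form are an assumed illustration, not the paper's decoder.

```python
def shallow_fusion_score(model_log_prob, lm_log_prob, lm_weight=0.2):
    """Combine the recognizer's log-probability for a candidate token with an
    external language model's log-probability scaled by lm_weight. A weight of
    0 recovers the recognizer-only score (first row of the table above)."""
    return model_log_prob + lm_weight * lm_log_prob

# Example: pick the better of two next-token hypotheses during beam search.
scores = [shallow_fusion_score(-0.4, -1.2), shallow_fusion_score(-0.6, -0.3)]
best = max(range(len(scores)), key=scores.__getitem__)
```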
Method | Training Hours | WER (%) ↓ | WER (%) with LM ↓ | Relative Improvement (%) |
---|---|---|---|---|
SyncVSR [35] | 438 | 33.3 | 31.2 | 6.3 |
661 | 30.4 | 28.1 | 7.6 | |
1992 | 23.1 | 21.4 | 7.4 | |
SynthVSR [34] | 7100 | 18.2 | 16.9 | 7.1 |
USR [18] | 1759 | 22.3 | 21.5 | 3.6 |
MultiAVSR | 438 | 31.1 | 29.9 | 3.9 |
661 | 28.1 | 27.3 | 2.8 | |
1968 | 21.6 | 21.0 | 2.8 |
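The final column reports the relative WER reduction obtained by adding the external language model. As a worked example for the MultiAVSR 1968 h row:

```latex
\text{Relative improvement}
= \frac{\mathrm{WER}_{\text{no LM}} - \mathrm{WER}_{\text{LM}}}{\mathrm{WER}_{\text{no LM}}} \times 100
= \frac{21.6 - 21.0}{21.6} \times 100 \approx 2.8\%.
```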
Noise | Model | Task | Clean | 12.5 dB | 7.5 dB | 2.5 dB | -2.5 dB | -7.5 dB | Average |
---|---|---|---|---|---|---|---|---|---|
Pink | Auto-AVSR [4] | Audio | 1.0 | 1.4 | 1.9 | 4.3 | 13.1 | 56.8 | 15.5 |
Pink | MultiAVSR | Audio | 1.2 | 1.4 | 1.9 | 3.7 | 12.0 | 43.0 | 12.4 |
Pink | Auto-AVSR [4] | Audio–visual | 0.9 | 1.2 | 1.4 | 2.3 | 6.0 | 16.2 | 5.4 |
Pink | MultiAVSR | Audio–visual | 1.2 | 1.2 | 1.6 | 2.0 | 3.9 | 9.8 | 3.7 |
White | Auto-AVSR [4] | Audio | 1.0 | 2.1 | 4.0 | 10.4 | 30.2 | 88.9 | 27.1 |
White | MultiAVSR | Audio | 1.2 | 2.2 | 4.0 | 9.7 | 27.2 | 76.0 | 23.8 |
White | Auto-AVSR [4] | Audio–visual | 0.9 | 1.4 | 2.3 | 4.3 | 9.5 | 24.2 | 8.3 |
White | MultiAVSR | Audio–visual | 1.2 | 1.6 | 2.2 | 3.4 | 7.0 | 14.7 | 5.8 |
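As a rough illustration of how the SNR conditions above are typically constructed, the sketch below scales a noise clip so that it is mixed with clean speech at a target SNR. The function and its normalization are assumptions about a standard additive-noise pipeline, not the paper's exact setup.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to reach the target SNR in dB."""
    noise = noise[: len(speech)]                       # match lengths
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12          # avoid division by zero
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + np.sqrt(target_noise_power / noise_power) * noise

# Example: corrupt a dummy waveform with white noise at 2.5 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)
noisy = mix_at_snr(clean, rng.standard_normal(16000).astype(np.float32), snr_db=2.5)
```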
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).