Article

MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning

Department of Electrical and Computer Engineering, Brigham Young University, Provo, UT 84602, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2310; https://doi.org/10.3390/electronics14122310
Submission received: 2 May 2025 / Revised: 30 May 2025 / Accepted: 2 June 2025 / Published: 6 June 2025
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)

Abstract

Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present MultiAVSR, a multi-task audio–visual speech recognition framework that trains a single model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of those required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it exhibits robust generalization, achieving a remarkable 44.7% word error rate on the WildVSR dataset. Our framework also demonstrates reduced dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, our framework improves robustness under various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These improvements across the three tasks illustrate the robust results our supervised multi-task speech recognition framework enables.

1. Introduction

Audio or automatic speech recognition (ASR) is highly integrated into modern personal electronics and has even been hailed as the future of human–computer interfaces due to the rise of digital assistants. However, these exciting advancements and aspirations have not been fully realized due to several key inherent limitations of ASR. Modern ASR models achieve near-human accuracy in quiet environments or synthetic clean speech, but their word error rates (WER) often grow by an order of magnitude—from <5% to 25–60%—when evaluated in realistic noisy environments such as public places, moving vehicles, and busy office environments [1,2,3]. Audio–visual speech recognition (AVSR) greatly improves recognition accuracy in noisy conditions by leveraging visual information [4,5,6], mimicking how humans make use of visual cues when background noise masks the acoustic signal [7]. These AVSR approaches effectively mitigate the impact of noise, often reducing the WER to below 10% [4,5,8], and have the potential to match the performance achieved with clean audio.
Despite substantial improvements in noisy conditions, AVSR is rarely integrated into personal electronic devices, likely due to the practical challenges posed by camera placement and head positioning requirements [9,10]. Fortunately, these challenges are not present in head-worn devices, such as earbuds and mixed reality (MR) headsets, due to the presence of fixed sensors [11,12,13]. This allows for the deployment of AVSR in head-worn devices to enhance speech recognition performance. This promising application has, to our knowledge, not been implemented in any commercially available devices.
Studies have found that the largest inhibitors to individual adoption of ASR-enabled digital assistants are privacy concerns regarding audio data use and fears of private conversations being overheard [14,15,16,17]. The issue of being overheard is a common complaint of ASR not just for privacy’s sake but also for disrupting nearby individuals [16,17]. Cowan et al. found that “social embarrassment” is one of the primary reasons individuals avoid digital assistants [16]. Visual speech recognition (VSR) or lip reading eliminates the need for vocalization by relying entirely on visual cues, offering the potential to significantly reduce concerns related to privacy and social embarrassment associated with current speech recognition technologies.
Privacy concerns regarding audio data use are another large inhibitor to individual adoption of ASR-enabled digital assistants. Although VSR requires facial data, which raises its own privacy concerns, our method is significantly more robust and efficient than prior methods and could eventually be improved to avoid sending data to the cloud, increasing the security of the data. Additionally, positioning the cameras on the device so that only consenting users are filmed could help mitigate this concern.
Despite its potential to address these speech recognition challenges, VSR remains uncommon as it is significantly less accurate than ASR and AVSR, with word error rates that are often an order of magnitude higher. This accuracy gap stems from the nature of the problem and the lack of VSR data. Differentiating speech based solely on lip movements is challenging as multiple phonemes can map to the same viseme, resulting in visually indistinguishable patterns. Additionally, due to the widespread adoption of ASR, ASR datasets are significantly more common than those for AVSR and VSR, resulting in limited availability of high-quality training data for AVSR and VSR methods.
Given the substantial potential to advance the adoption of speech recognition and digital assistant systems, this work primarily focuses on improving VSR accuracy and robustness. With one of the largest difficulties with the VSR task being the accessibility of high-quality data, we explore methods to maximize the use of data that is publicly available. Our proposed supervised multi-task framework takes advantage of the rich audio features of common VSR datasets, which generally go unused in the literature. We show that the use of this audio data improves VSR performance and generalization to a real-world or wild test set by using only publicly available data. Additionally, our methods enhance noise robustness for both AVSR and ASR tasks, an important aspect of real-world speech recognition systems.
Our contributions are as follows:
(1)
We present a new supervised speech recognition framework for training across audio speech recognition (ASR), visual speech recognition (VSR), and audio–visual speech recognition (AVSR) tasks simultaneously, achieving a remarkable VSR result of 21.0% WER on the LRS3-TED dataset, which is state-of-the-art among models trained on under 3000 h of data.
(2)
We introduce a multi-task hybrid Connectionist Temporal Classification (CTC)/Attention loss that enables direct multi-task training across ASR, VSR, and AVSR tasks. This loss significantly enhances VSR performance while mitigating the high compute demands of multi-task self-supervised learning, requiring only 18% of the training compute of the semi-supervised multi-task USR approach [18] (47 vs. 253 exaFLOPs; see Section 5.5).
(3)
We demonstrate that supervised multi-task speech recognition models exhibit strong generalization, achieving an impressive 44.7% WER on the WildVSR dataset [19], which is state-of-the-art among models trained on under 3000 h of data. There are methods that have achieved better results, but they have had to use a much larger amount of data that is not publicly available. Furthermore, MultiAVSR is the first model to perform better without an external language model (44.7% WER) compared to with a language model (46.0% WER) on the WildVSR dataset. This indicates that our model has increased linguistic generalization, particularly on in-the-wild data.
(4)
We demonstrate that our multi-task training approach significantly reduces the reliance on external language models. Our model exhibits only a 2.8% relative improvement when adding an external language model during evaluation, while state-of-the-art single-task models see >7% improvement. This reduced reliance on external language models is a critical advancement for enabling faster and more compute-efficient real-time VSR as removing the language model decreases inference time by 40% and reduces the total evaluation parameter count by 18%.
(5)
We show that our supervised multi-task framework improves ASR and AVSR performance in noisy environments, achieving relative improvements of 16% and 30%, respectively, compared to state-of-the-art single-task approaches trained on 1.75× more data [4].

2. Related Work

2.1. Self-Supervised Methods

To overcome data scarcity, researchers have increasingly adopted self-supervised methods that allow for the use of large amounts of unlabeled corpora. In VSR, using these methods enables the use of the audio modality of labeled corpora [6,18,20,21,22,23] as well as large unlabeled audio–visual corpora [24,25].
Ma et al. proposed LiRA, one of the first self-supervised VSR methods for sentences [22]. The model is pre-trained on the LRS3-TED dataset [26] to predict PASE+ auditory features [27] with visual features as input. This model is then fine-tuned in a supervised manner for two separate tasks, word-level VSR and sentence-level VSR, to predict text. This self-supervision brings an absolute improvement of 1.7% WER on the LRS2 dataset [28] compared to a fully supervised baseline. Fine-tuning the model for word-level VSR brings an absolute accuracy improvement of 0.7% on the LRW dataset [29]. These results are intriguing: although the initial pre-training was on a sentence-level dataset, it still improved word-level accuracy, showing that self-supervised training can be powerful when transferring between domains.
Building off masked language model training such as BERT [30], Shi et al. proposed an audio–visual hidden unit BERT (AV-HuBERT) network that learns to predict features that are masked out of the input sequence [6]. They found that pre-training in a cross-modal objective fashion (masking out audio and visual features) improved results compared to single-modal objectives. This cross-modal pre-trained network was then fine-tuned for the ASR and VSR tasks separately. AV-HuBERT brought an absolute improvement of 6.7% for VSR WER performance compared to a supervised model that was trained on thousands of times more data [31]. This work was foundational for self/semi-supervised methods for the VSR task. The use of cross-modal self-supervised training was used in many subsequent self-supervised and semi-supervised methods [18,20,21,32].
Building off the success of AV-HuBERT, Haliassos et al. proposed RAVEn [21]. RAVEn is a self-supervised multi-modal method that pre-trains encoders using masked features with two sets of student and teacher models for audio and visual features. Interestingly, they found that during pre-training it was most beneficial for their masked audio encoder to predict audio and visual features, while their masked visual encoder only predicted audio features. They then fine-tuned the child encoders for audio and visual speech recognition separately. This unique student–teacher masked self-supervised learning brought an absolute improvement of 3.8% in WER compared to AV-HuBERT [6].
Haliassos et al. later proposed BRAVEn [20], which applies four key improvements to the RAVEn system. First, they use the mean of the teacher’s transformer blocks rather than just the final block’s output to construct the target features. Second, they adjust the shallow transformer predictor for the visual student to use one block rather than two. Third, BRAVEn uses stronger masking on the audio input features, with a 40% probability of masking audio features compared to the 20% used for RAVEn. Finally, they apply different weights to the audio-to-audio and audio-to-video losses, two and one, respectively. These adjustments and additional data brought about an absolute improvement of 4.3% WER.
While these and other self-supervised methods deliver strong VSR performance by taking advantage of audio and unlabeled data, there are drawbacks. Djilali et al. found that self-supervised methods required 3.6 times more training compute (exaFLOPs) than supervised methods [19]. While self-supervised methods have proven effective in addressing data scarcity by leveraging audio data to enhance VSR performance, their reliance on complex pre-training and fine-tuning pipelines often results in significant computational overhead, highlighting the need for an efficient supervised multi-task approach such as ours. They additionally highlighted that self-supervised methods struggle to generalize to the WildVSR dataset compared to fully supervised methods.

2.2. Supervised Methods

Serdyuk et al. [33] and Makino et al. [31] performed supervised learning to obtain impressive results; however, they used over 90 thousand hours of proprietary VSR data. Without these large proprietary datasets, recent works have reduced their dependency on large pre-labeled datasets by using synthetic data generation [34], architectural improvements [35], and pseudo-labeling of publicly available data [4] to improve performance.
To combat scarce public data availability, Ma et al. used a pre-trained ASR model [36] to pseudo-label audio–visual datasets (VoxCeleb2 [24] and AVSpeech [25]) that were previously only usable by self-supervised methods [4]. This automatically labeled data increased the training dataset size by four times to 3.4 thousand hours, resulting in a 13.9% reduction in WER and largely outperforming self-supervised methods. Concurrent with our work, Ahn et al. added an audio reconstruction loss to the traditional CTC and attention losses, comparatively reducing the WER by 2% [35].
These methods show that VSR performance can be improved using audio data in a supervised approach without requiring compute-expensive self-supervised training. Our method similarly takes advantage of audio data but in a multi-task framework enabling a single model to complete all three SR tasks, yielding superior VSR accuracy and greater robustness across all three tasks in real-world scenarios.

2.3. Multi-Task Methods

Hsu et al. proposed the first self-supervised multi-task model [23]. They pre-trained a single model on AVSR, ASR, and VSR, with VSR data pseudo-labeled by the ASR data. After fine-tuning solely on the ASR task, the model also achieves impressive zero-shot AVSR and VSR performance, despite the fact that no labeled visual data was provided during training.
Concurrent to our work, Haliassos et al. proposed a single model to perform audio, visual, and audio–visual speech recognition in a semi-supervised manner [18]. This multi-task framework uses unsupervised pre-training and then fine-tunes with semi-supervised learning. Adding to the complexity, in both pre-training and fine-tuning, a student–teacher model is used, increasing compute cost greatly. This complex training achieves an impressive WER of 21.6% on the LRS3-TED dataset [26] but requires 5.4× more training compute compared to our supervised multi-task method (see Section 5.5).
Although multi-task training itself is not new, we introduce the first supervised multi-task framework that surpasses all prior approaches on VSR while demanding a substantially smaller compute budget. Furthermore, our method generalizes substantially better on the WildVSR benchmark [19], a domain where supervised models typically outperform self-supervised models.

3. Methods

To take advantage of feature-rich audio data, we propose a supervised multi-task framework. This framework adapts the state-of-the-art Auto-AVSR [4] method to simultaneously support VSR, AVSR, and ASR. As depicted in Figure 1, our approach introduces a shared multi-task conformer encoder, transformer decoder, and CTC projection layer. These shared components enable knowledge transfer from the audio and audio–visual tasks to the more challenging VSR task, while also allowing the VSR task to strengthen AVSR and ASR models by encouraging more robust use of audio features. These more robust features ultimately improve performance under noisy conditions. This unified framework not only streamlines the training process compared to self-supervised multi-task training [18,23] but also taps into the data-rich audio signals available, ultimately driving more robust and accurate VSR predictions without introducing unnecessary complexity.

3.1. Architecture

Many recent state-of-the-art supervised AVSR methods [4,34,35] have adopted the off-the-shelf conformer sequence-to-sequence (CM-seq2seq) architecture [37]. Due to the success of this architecture and its subsequent adaptations, we adopt this basic structure. See Figure 1 for an overview of our architecture. Our audio backbone consists of a 1D ResNet18, and our visual backbone consists of a 3D CNN followed by a 2D ResNet18 model. The typical AVSR CM-seq2seq architecture employs two separate Conformer encoders [38] to process visual and audio features independently. In contrast, our multi-task framework utilizes a shared Conformer encoder for both modalities, which substantially improves VSR performance while only slightly reducing ASR and AVSR performance (see Table 1). For the decoding strategy, we adopt the transformer decoder and CTC projection layer used to compute the hybrid CTC/Attention loss. For all multi-task trainings, the encoder, decoder, and CTC layers are shared across all SR tasks.
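To make the parameter-sharing layout concrete, the following PyTorch sketch mirrors the description above. It is a simplified illustration, not the exact implementation: the backbones and encoder are lightweight stand-ins (the real model uses ResNet18 front-ends and a 12-layer Conformer, and the shared Transformer decoder is omitted for brevity), and all module names are ours.

```python
import torch
import torch.nn as nn

class MultiAVSRSketch(nn.Module):
    """Per-modality backbones feed one shared encoder and one shared CTC head."""

    def __init__(self, feat_dim=768, vocab_size=5000):
        super().__init__()
        # Stand-ins for the 1D ResNet18 (audio) and 3D CNN + 2D ResNet18 (visual)
        # front-ends; both map frame-level inputs to feat_dim features.
        self.audio_backbone = nn.Linear(80, feat_dim)
        self.visual_backbone = nn.Linear(88 * 88, feat_dim)
        # Audio-visual fusion MLP producing the AVSR feature stream.
        self.av_fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        # Shared across all three tasks (a TransformerEncoder stands in for the
        # 12-layer Conformer encoder; the shared decoder is omitted here).
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=12, dim_feedforward=3072,
                                       batch_first=True),
            num_layers=2)
        self.ctc_head = nn.Linear(feat_dim, vocab_size)  # shared CTC projection

    def forward(self, audio_feats, video_feats):
        a = self.audio_backbone(audio_feats)              # (B, T, D)
        v = self.visual_backbone(video_feats)             # (B, T, D)
        av = self.av_fusion(torch.cat([a, v], dim=-1))    # fused AVSR features
        # The same encoder produces the task-specific representations x^(i).
        return {task: self.shared_encoder(x)
                for task, x in zip(("a", "v", "av"), (a, v, av))}
```

In the full model, these encoder outputs feed the shared Transformer decoder and CTC projection layer to compute the loss described in Section 3.3.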
Table 1. Ablation study on task training objective and parameter sharing. Word error rate (WER) and relative improvement compared to the single-task configuration for visual (VSR), audio (ASR), and audio–visual (AVSR) speech recognition under multi-task ablations. Each row toggles (✗/✓) the presence of the three task training objectives (columns 1–3) and the use of a common conformer encoder (column 4). In tests where a task’s training objective is not included, the WER is given as (-) for not applicable. All models share the Transformer decoder and CTC projection layer and are trained on 438 h of data from LRS3-TED. Lower values indicate better performance. The best performers are in boldface and the second best are underlined. Experiments using larger training datasets and comparison with other state-of-the-art models are included in Table 2.
| Training Tasks | Shared Encoder | VSR WER (%) | ASR WER (%) | AVSR WER (%) |
|---|---|---|---|---|
| Single-Task Models | | | | |
| VSR | ✗ | 42.0 | - | - |
| ASR | ✗ | - | 2.3 | - |
| AVSR | ✗ | - | - | 2.3 |
| Multi-Task Models | | | | |
| VSR + ASR | ✗ | 41.2 (+2%) | 2.1 (+9%) | - |
| VSR + ASR | ✓ | 32.2 (+23%) | 2.5 (−9%) | - |
| VSR + AVSR | ✓ | 36.9 (+12%) | - | 3.7 (−61%) |
| VSR + ASR + AVSR | ✓ | 31.1 (+26%) | 2.4 (−4%) | 2.5 (−9%) |

Relative improvement over the corresponding single-task model is shown in parentheses (positive = lower WER).
Table 2. VSR Results. VSR WER (%) comparison of the state-of-the-art models on the LRS3-TED [26] and WildVSR [19] test sets. The best performers are in boldface and the second best are underlined. Our model trained on 661 h of data includes LRS3-TED and LRS2; the model trained on 1968 h includes LRS2, LRS3-TED, and VoxCeleb2. ‡ Djilali et al. report these results [19]. † The evaluation was conducted as part of this work using the original code and model.
| Method | Total Hours | LRS3 WER (%) | WildVSR WER (%) |
|---|---|---|---|
| No Additional Data | | | |
| AV-HuBERT [6] | 433 | 41.6 | 69.4 ‡ |
| BRAVEn [20] | 433 | 36.0 | - |
| RAVEn [21] | 433 | 39.1 | 69.9 ‡ |
| Auto-AVSR [4] | 438 | 36.3 | - |
| USR [18] | 438 | 34.3 | - |
| SyncVSR [35] | 438 | 33.3 | - |
| SyncVSR [35] | 438 | 31.2 | - |
| MultiAVSR | 438 | 31.1 | 63.0 |
| MultiAVSR | 438 | 29.9 | 63.7 |
| Less than 1000 h | | | |
| CM-Seq2Seq [37] | 595 | 43.3 | - |
| VTP [39] | 698 | 40.6 | 75.6 ‡ |
| Auto-AVSR [4] | 818 | 33.0 | - |
| Auto-AVSR [4] | 661 | 32.7 ‡ | 62.3 ‡ |
| SyncVSR [35] | 661 | 30.4 | - |
| SyncVSR [35] | 661 | 28.1 | - |
| MultiAVSR | 661 | 28.1 | 57.8 |
| MultiAVSR | 661 | 27.3 | 58.2 |
| Less than 3000 h | | | |
| VTP [39] | 2676 | 30.7 | 68.7 ‡ |
| BRAVEn [20] | 1759 | 26.6 | - |
| u-HuBERT [23] | 2221 | 27.2 | - |
| AV-HuBERT [6] | 1759 | 26.9 | 48.7 ‡ |
| RAVEn [21] | 1759 | 23.1 | 46.7 ‡ |
| Auto-AVSR [4] | 1759 | 24.6 | 49.3 ‡ |
| Auto-AVSR [4] | 1902 | 23.5 | - |
| SyncVSR [35] | 1992 | 23.4 | - |
| SyncVSR [35] | 1992 | 21.5 | - |
| USR [18] | 1759 | 22.3 | 46.8 † |
| USR [18] | 1759 | 21.5 | 46.4 |
| MultiAVSR | 1968 | 21.6 | 44.7 |
| MultiAVSR | 1968 | 21.0 | 46.0 |
| Greater than 3000 h and Extra Proprietary Data | | | |
| RNN-T [31] | 30,000 | 33.6 | - |
| BRAVEn [20] | 3082 | 20.1 | - |
| SparseVSR [40] | 3068 | 19.5 | - |
| Auto-AVSR [4] | 3448 | 19.1 | 38.6 ‡ |
| SynthVSR [34] | 7100 | 18.2 | - |
| SynthVSR [34] | 7100 | 16.9 | - |
| ViT 3D [33] | 90,000 | 17.0 | - |
| LP Conformer [38] | 100,000 | 12.8 | - |

Where a method appears twice with the same number of training hours, the first row is evaluated without and the second with an external language model (cf. Table 4).

3.2. Multi-Task Training

Our multi-task framework enables the simultaneous training of ASR, VSR, and AVSR, enhancing robustness across all three tasks. While multi-task speech recognition has been explored previously [18,23], our framework introduces a novel, simpler, and more compute-efficient approach that achieves superior VSR performance. These improvements are achieved by sharing parameters across the VSR, ASR, and AVSR tasks. Given that the primary objective is to enhance VSR performance, we investigate which combinations of training modalities yield the largest VSR improvement. Table 1 shows our ablation study on training tasks and shared parameters. We see that adding both the ASR and AVSR tasks to the training tasks leads to the largest improvement, with a drop in WER from 42.0% to 31.1%. This 26% relative improvement in WER is substantial compared to other configurations, giving us confidence that this is the best-performing configuration. While this multi-task training improves VSR performance due to knowledge transfer, it slightly reduces ASR and AVSR performance, with relative increases in WER of 4% and 9%, respectively. This small increase in ASR and AVSR WER is acceptable given the more significant 26% improvement in VSR results. ASR and AVSR are inherently easier tasks with high baseline accuracies, and our multi-task framework ultimately improves their robustness to noise, as discussed in Section 5.4.

3.3. Loss

The final component of the model architecture is the multi-task hybrid CTC/Attention loss. Most of the literature applies a complex loss function to learn from audio data [18,20,21,23,35,41]; however, we propose a simpler yet effective approach. We simply aggregate the commonly employed hybrid CTC/Attention loss [37] across all tasks. This multi-task hybrid CTC/Attention loss and the use of task-shared network parameters enable the incorporation of multiple tasks in a single, simple training procedure.
Let $x^{(i)}$ with length $T$ represent the input sequence, i.e., the output from the shared Conformer encoder, for task $i \in \{v, a, av\}$, representing VSR, ASR, and AVSR, respectively. The target sequence $y = [y_1, y_2, \ldots, y_L]$ is shared across all tasks, representing the ground-truth transcription of length $L$. For each task, we define the probabilities used in the CTC and Cross-Entropy Attention (CE) losses as follows. The CTC loss $\mathcal{L}_{\mathrm{CTC}}^{(i)}$ is computed based on the probability $p_{\mathrm{CTC}}(y \mid x^{(i)}) = \prod_{t=1}^{T} p(y_t \mid x^{(i)})$, where $p(y_t \mid x^{(i)})$ is the probability of the target label at time $t$ given the input sequence $x^{(i)}$. The CE loss $\mathcal{L}_{\mathrm{CE}}^{(i)}$ utilizes the probability $p_{\mathrm{CE}}(y \mid x^{(i)}) = \prod_{l=1}^{L} p(y_l \mid y_{<l}, x^{(i)})$, where $y_{<l} = [y_1, y_2, \ldots, y_{l-1}]$ represents all previous tokens before position $l$. We compute the total CTC and CE losses by summing over all tasks:

$$\mathcal{L}_{\mathrm{CTC}} = -\sum_{i} \log p_{\mathrm{CTC}}(y \mid x^{(i)}), \qquad \mathcal{L}_{\mathrm{CE}} = -\sum_{i} \log p_{\mathrm{CE}}(y \mid x^{(i)})$$

Our final loss function is a weighted combination of these total losses:

$$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{CTC}} + (1 - \alpha) \mathcal{L}_{\mathrm{CE}}$$
where α is a hyperparameter that balances the contributions of the CTC and CE losses, which is set to 0.1 for all experiments following the literature [4,18,20,21,34,35,37].
This adjustment to the hybrid CTC/Attention loss [37] facilitates multi-task learning straightforwardly and effectively, significantly improving VSR results. Preliminary experiments showed that our simpler multi-task hybrid CTC/Attention loss obtains better VSR performance than more complex loss functions similar to those proposed in prior work [18,35,41]. Additionally, we experimented with weighting the ASR and AVSR task losses lower than the VSR task loss, with negligible effect on the ASR, AVSR, or VSR results. Because the ASR and AVSR objectives converge more quickly, their gradients diminish early in training, whereas the more challenging VSR objective continues to generate larger gradients. Consequently, the VSR task dominates the joint optimization regardless of the explicit loss-weighting scheme, and we attribute the negligible effect of loss weighting to this gradient disparity between tasks.
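As a concrete illustration, the aggregation above maps directly onto standard PyTorch losses. The sketch below is a simplified rendering under our own naming; the tensor shapes, the task-dictionary layout, and the omission of padding handling are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def multitask_hybrid_loss(enc_out, dec_logits, targets, target_lens, input_lens,
                          ctc_head, alpha=0.1, blank=0):
    """Sum the hybrid CTC/Attention loss over the VSR (v), ASR (a), and AVSR (av) tasks.

    enc_out:    dict task -> (B, T, D) shared-encoder outputs x^(i)
    dec_logits: dict task -> (B, L, V) shared-decoder logits for the common target y
    targets:    (B, L) token ids of the ground-truth transcription
    """
    l_ctc, l_ce = 0.0, 0.0
    for task in ("v", "a", "av"):
        # CTC branch: shared projection layer over the encoder outputs.
        log_probs = F.log_softmax(ctc_head(enc_out[task]), dim=-1)   # (B, T, V)
        l_ctc = l_ctc + F.ctc_loss(log_probs.transpose(0, 1), targets,
                                   input_lens[task], target_lens, blank=blank)
        # Attention (CE) branch: decoder predicts each token given y_<l.
        l_ce = l_ce + F.cross_entropy(dec_logits[task].transpose(1, 2), targets)
    # Weighted combination; alpha = 0.1 in all our experiments.
    return alpha * l_ctc + (1 - alpha) * l_ce
```

A single backward pass on this summed loss updates the shared encoder, decoder, and CTC projection layer with gradients from all three tasks.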

4. Experimental Setup

4.1. Datasets

For training MultiAVSR, we use the LRS3-TED [26], LRS2 [28], and VoxCeleb2 [24] datasets. LRS3-TED was extracted from TED talks and contains 408, 30, and 0.9 h of lip reading data in the pre-training, training-validation, and test sets, respectively. While it is commonly used in recent research for training, it does not represent real-world conversations and speech well, as the source material consists of formal lectures. Despite this drawback, we retain this dataset for comparison against recent works. LRS2 is an AVSR dataset consisting of 223 h of BBC television broadcast data. VoxCeleb2 is a dataset commonly used for audio speaker recognition and thus does not contain ground-truth transcripts. To obtain transcripts for the VoxCeleb2 dataset, we follow the Auto-AVSR method [4], using the large-v3 Whisper model [36] for language detection and audio transcription, yielding 1307 h of AVSR data. Using Whisper to create transcripts results in noisy ground-truth data for VoxCeleb2; however, Ma et al. [4] showed that adding noisy pseudo-labels can substantially increase the model’s performance. Additionally, including this data prevents the model from overfitting to the LRS3-TED and LRS2 datasets. We conduct three experimental trainings: the first uses only the LRS3-TED pre-training and training-validation sets (438 h), the second adds LRS2 (661 h), and the third additionally adds VoxCeleb2 (1968 h).
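A minimal sketch of this pseudo-labeling step, assuming the openai-whisper package and hypothetical VoxCeleb2 clip paths; it is illustrative rather than the exact Auto-AVSR labeling pipeline.

```python
import whisper  # openai-whisper package

# Whisper large-v3 is used for both language detection and transcription.
model = whisper.load_model("large-v3")

def pseudo_label(audio_path: str):
    """Return (detected language, transcript) for one VoxCeleb2 clip."""
    result = model.transcribe(audio_path)     # language detection happens internally
    return result["language"], result["text"].strip()

# Hypothetical usage: keep English clips as pseudo-labeled AVSR training targets.
# lang, text = pseudo_label("voxceleb2/id00017/clip_001.wav")
# if lang == "en":
#     save_transcript("voxceleb2/id00017/clip_001.txt", text)  # save_transcript is a placeholder
```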

4.2. Evaluation

We conduct evaluation experiments on the test set of the LRS3-TED dataset (0.9 h) [26] and the newly released WildVSR dataset (4.8 h) [19]. The WildVSR test set is used to evaluate whether VSR networks generalize beyond the LRS3-TED test set. It contains 4.8 h of individuals speaking in YouTube videos with larger variation in vocabulary, recording conditions, and speaker race compared to LRS3-TED. The WildVSR dataset is therefore a particularly good evaluation of real-world speech, an area in which, as stated previously, LRS3-TED falls short. This more difficult, diverse dataset helps evaluate whether VSR methods overfit to the LRS3-TED test set.
The word error rate (WER) is the primary metric used in this work for evaluating VSR, AVSR, and ASR performance. While WER is not a perfect metric to evaluate the intricacies that visual speech brings with visually ambiguous patterns, it is the primary metric used in the literature for SR methods [18,20,34,35,36,37,42], and thus we use it to evaluate our method.
We acknowledge that WER cannot capture viseme-level ambiguities unique to the visual-only speech recognition task. Its use is retained because it satisfies three practical requirements that alternative measures only partially address. First, comparability: virtually every recent benchmark study in visual speech recognition reports WER, enabling direct comparisons with the state-of-the-art [18,20,34,35,37,42]. Second, interpretability: WER’s error-components (substitutions, insertions, deletions) map cleanly onto editing operations in text, giving a concrete sense of how often a predicted transcript must be corrected. Third, deployment relevance: downstream user-facing applications rely on token-level accuracy thresholds that are conventionally expressed in WER. While WER is not perfect at representing speech recognition results, our experiments on the WildVSR [19] for the VSR task and auditory noise robustness experiments on the AVSR and ASR tasks demonstrate the robust nature of our methods.
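Concretely, WER is the word-level Levenshtein distance between the hypothesis and the reference divided by the number of reference words; the following self-contained sketch computes it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: two insertions against a three-word reference gives WER = 2/3.
assert abs(word_error_rate("speech recognition works",
                           "speech recognition sort of works") - 2 / 3) < 1e-9
```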

4.3. Pre-Processing and Augmentation

For data pre-processing and augmentation, we follow the methods of Ma et al. [4]. For visual data, following the previous literature [4,18,34,35,37], the face is localized using RetinaFace [43] and cropped to the lower portion of the face with the lips in the center at a resolution of 96 × 96; the image is then grayscaled (see Figure 1 for example images). During training, the images are augmented with random cropping to an 88 × 88 section and adaptive time masking. During inference, the images are cropped to the 88 × 88 center pixels. For auditory data, the raw waveform is used without pre-processing. During training, adaptive time masking is applied and babble noise from the NOISEX dataset [44] is added at one of the following SNR levels: [−5 dB, 0 dB, 10 dB, 15 dB, 20 dB, ∞ dB].
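A small sketch of the training-time visual augmentation (random 88 × 88 crop plus a simple time mask); the masking span and other parameter choices here are illustrative rather than the exact adaptive time-masking schedule of [4].

```python
import torch

def augment_video(frames: torch.Tensor, crop: int = 88, max_mask_frames: int = 15):
    """frames: (T, 96, 96) grayscale lip crops; returns an augmented (T, 88, 88) clip."""
    T, H, W = frames.shape
    # Random 88 x 88 spatial crop (a center crop is used at inference instead).
    y = torch.randint(0, H - crop + 1, (1,)).item()
    x = torch.randint(0, W - crop + 1, (1,)).item()
    out = frames[:, y:y + crop, x:x + crop].clone()
    # Simple time masking: zero out one random contiguous span of frames.
    span = torch.randint(0, max_mask_frames + 1, (1,)).item()
    if 0 < span < T:
        start = torch.randint(0, T - span + 1, (1,)).item()
        out[start:start + span] = 0.0
    return out
```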

4.4. Implementation Details

We generally use the same model setup as Auto-AVSR [4]. Our model includes a 1D Resnet18 audio backbone (4 M parameters), 3D CNN + 2D Resnet18 visual backbone (11 M), audio–visual fusion MLP (19 M), shared conformer encoder (170 M), shared transformer decoder (64 M), and shared CTC projection layer (4 M), totaling 274 million parameters. The conformer encoder and transformer decoder have 12 and 6 layers, respectively, with 768 input dimensions, 3072 feed-forward dimensions, and 12 attention heads.
Typically, VSR networks have a pre-training phase in which the model is trained on shorter video clips to prepare it for longer, more difficult sequences. This process is called curriculum learning and is widely used in VSR and AVSR methods [4,28,37,41,45]. Following this practice, we pre-train all our models with a learning rate of $7 \times 10^{-5}$ on samples less than 4 s long in the LRS3-TED dataset [26] for 75 epochs on 8 A100 GPUs. Following the curriculum learning, we fine-tune all models on full-length videos in the given datasets for 75 epochs on 8 A100 GPUs with a learning rate of $1 \times 10^{-3}$. All trainings use the AdamW optimizer with a cosine learning rate schedule and a warm-up of 5 epochs.
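This optimization setup maps onto standard PyTorch components; below is a sketch under the stated hyperparameters, with the warm-up implemented via a per-step LambdaLR (one of several equivalent ways to do it) and the steps-per-epoch value purely illustrative.

```python
import math
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-3, epochs: int = 75,
                    warmup_epochs: int = 5, steps_per_epoch: int = 1000):
    """AdamW with linear warm-up followed by cosine decay (per optimizer step)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                      # linear warm-up
            return step / max(warmup_steps, 1)
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```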

4.5. Language Model

Many VSR methods employ a transformer-based language model (LM) trained on large corpora of text to improve outputs at evaluation time [4,18,34,35,41]. To measure the impact of using an LM on our model, we use the pre-trained transformer-based LM from [41], consisting of 54 million parameters trained on 166 million characters of text. Following the literature [4,34,35,37], the language model is included in the prediction scoring as follows: $\hat{y} = \arg\max_{y \in \hat{\mathcal{Y}}} \{ \alpha \log p_{\mathrm{CTC}}(y \mid x) + (1 - \alpha) \log p_{\mathrm{CE}}(y \mid x) + \beta \log p_{\mathrm{LM}}(y) \}$, where $\hat{y}$ is the predicted output token sequence, $\alpha$ is the CTC weight, which is set to the same value as during training (0.1), and $\beta$ is the relative weight of the language model. $\beta$ is set to 0.2 for our LM experiments. For the language model weighting ablation, see Section 5.2.
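The scoring rule simply adds a weighted LM log-probability to the hybrid score of each decoding hypothesis. A minimal sketch of rescoring an n-best list follows; beam search itself is omitted and the per-hypothesis log-probabilities are assumed to be precomputed.

```python
from typing import Dict, List

def rescore_with_lm(hypotheses: List[Dict], alpha: float = 0.1, beta: float = 0.2) -> str:
    """Pick the hypothesis maximizing the weighted CTC/CE/LM score of Section 4.5.

    Each hypothesis is a dict like:
    {"text": "...", "log_p_ctc": -12.3, "log_p_ce": -10.1, "log_p_lm": -35.2}
    """
    def score(h: Dict) -> float:
        return (alpha * h["log_p_ctc"]
                + (1 - alpha) * h["log_p_ce"]
                + beta * h["log_p_lm"])
    return max(hypotheses, key=score)["text"]

# Setting beta = 0 recovers LM-free decoding, which Section 5.3 shows can
# generalize better on WildVSR.
```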

5. Results

5.1. Comparison to the Latest Methods

Our VSR results compared to recent methods on the LRS3-TED and WildVSR test sets are presented in Table 2. When comparing the LRS3-TED results, MultiAVSR outperforms all other methods with less than 3000 h of training data on LRS3-TED with a WER of 21.0%, improving upon the concurrent methods SyncVSR [35] and USR [18], which both achieve 21.5%. Using Auto-AVSR as the principal baseline, our model lowers the WER by 2.4 percentage points [4]. In the 438 and 661 h tests, MultiAVSR outperforms all other methods by a larger margin as can be seen in Table 2. Our model’s high performance in every training data size category gives us confidence that our model outperforms past and concurrent works.
Comparison with the semi-supervised multi-task USR method [18] demonstrates the efficiency and effectiveness of our framework. Our supervised multi-task model achieves a 0.5% lower WER than USR despite being 54% of its model size (274 M vs. 503 M parameters) and requiring only 18% of its training compute (47 vs. 253 exaFLOPs), underscoring the benefits of our simpler supervised multi-task framework.
MultiAVSR does not outperform VSR methods trained on massive (often proprietary) amounts of data, likely due to the discrepancy in data access [31,33,34,38]. Because MultiAVSR outperforms all methods trained on similar amounts of data, we postulate that access to these larger datasets would enable further improvements in our supervised multi-task training.

5.2. Language Model

Many recent works use an external language model (LM) to improve linguistic consistency for VSR results [4,18,34,35,37]. See Table 3 for our ablation study of the use of the language model weight factor β and the effects it has on the VSR results on the LRS3-TED dataset. Based on this ablation, 0.2 is the chosen value for β in all experiments that use an LM.
Ma et al. found that an external language model (LM) is often not beneficial for AVSR and ASR, as the accompanying models are fully capable without an LM [4]; however, as the VSR task is the most difficult, external LMs have frequently been shown to improve VSR prediction accuracy. MultiAVSR exhibits a much lower reliance on external LMs, as can be seen in Table 4. Most notably, SyncVSR has a 7.4% relative improvement [35], SynthVSR has a 7.1% relative improvement [34], USR has a relative improvement of 3.6% [18], and MultiAVSR has a relative improvement of only 2.8%. Furthermore, all three MultiAVSR models exhibit much lower relative improvement when an LM is used compared to the SyncVSR models trained with similar amounts of data. This indicates that the lower 2.8% is not merely within normal variation but reflects a genuine linguistic robustness that multi-task training brings. These results suggest that MultiAVSR captures richer, more robust speech representations through supervised multi-task learning, minimizing the benefit gained from external LM assistance.
Reducing the reliance on an external LM is a critical step towards real-time VSR, as the 54 M parameter LM accounts for 18% of the total parameters (304 M) used during VSR evaluation and 44% of the total decoding parameters (122 M), which take more compute due to the auto-regressive nature of transformer decoders. On average, removing the external LM yields a 40% reduction in inference time, significantly enhancing the practicality of real-time VSR. This reduced LM reliance is further illustrated by the results on the WildVSR dataset.

5.3. Generalization

The WildVSR test set [19] is designed to evaluate how well a VSR method generalizes beyond the LRS3-TED test set. WildVSR offers more varied, unconstrained real-world data, with higher variation in recording conditions, race, native vs. non-native speakers, and vocabulary. As seen in Table 2, MultiAVSR outperforms all other methods trained with similar amounts of data, achieving a WER of 44.7%, improving upon the next closest by 1.7% WER and upon the baseline model by 4.6% WER. Interestingly, our model evaluated with an additional LM performs 1.3% WER worse than without the LM on the WildVSR dataset. Our method is the first to show this reversal; in contrast, the USR method performed better with an LM (46.4% WER) than without (46.8% WER). This claim is further validated by the fact that our models trained on fewer hours of data also performed better without the LM on the WildVSR dataset (see Table 2). These results underscore the effectiveness of our supervised multi-task learning framework in enabling robust, generalized VSR performance without reliance on external language models.

5.4. Auditory Noise Experiments

While the primary objective of this work is to enhance VSR performance, our auditory noise robustness experiments reveal that the multi-task training framework also improves ASR and AVSR performance under noisy auditory conditions. For these experiments, we use white and pink noise taken from the Speech Commands dataset [46]. We compare our best model (1968 h) to the Auto-AVSR method, which comprises SOTA single-task models for the AVSR and ASR tasks trained on 3448 h of data (1.75× more hours than MultiAVSR). To maintain comparability with Auto-AVSR, we apply noise only to the auditory signal. Although the base WER for ASR and AVSR increases slightly with our multi-task training and shared encoders (see Table 1), our noise experiments show that both tasks become significantly more robust under noisy auditory conditions. Quantitatively, our multi-task framework exhibits a relative average improvement of 16% and 31% for the ASR and AVSR tasks, respectively, as seen in Table 5. Notably, at an SNR of −7.5 dB—where noise exceeds signal—the AVSR model outperforms the VSR task with no auditory signal, achieving a WER of 14.7% under white noise conditions compared to 21.6% VSR performance. This robustness does not extend to the single-task Auto-AVSR model, where at −7.5 dB, AVSR performance degrades to a WER of 24.2%—worse than the corresponding VSR model, which achieves 19.1%. These findings underscore the effectiveness of our supervised multi-task framework in enhancing noise robustness across modalities, particularly demonstrating its superiority over single-task approaches in severely degraded acoustic environments.
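For reference, mixing additive noise at a target SNR amounts to scaling the noise so that 10·log10 of the signal-to-noise power ratio equals the desired level; a small sketch follows, independent of the exact noise pipeline used here.

```python
import torch

def add_noise_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix `noise` into `speech` (1-D waveforms of equal length) at the given SNR in dB."""
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-12)
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = torch.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# e.g. snr_db = -7.5 makes the noise power roughly 5.6x the speech power.
```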

5.5. Training Compute

Compared to semi-supervised multi-task methods, our supervised multi-task framework is much more efficient in terms of training compute requirements. Djilali et al. found that supervised methods tend to require 28% of the training compute that self-supervised methods require [19]. They further found that self-supervised methods perform only moderately well compared to supervised methods on the WildVSR dataset [19]. We further validate this claim for multi-task speech recognition methods. To compute the exaFLOPs required for training, we follow Djilali et al. [19] by using the methodology proposed by Kaplan et al. [47]. We found that USR’s multi-task semi-supervised method requires 253 exaFLOPs due to its self-supervised pre-training and semi-supervised fine-tuning [18]. Our method requires only 47 exaFLOPs, just 18% of those required by USR. This, combined with our method’s superior performance, particularly on the WildVSR test set (44.7% vs. 46.4%), demonstrates the strong, compute-efficient capability of MultiAVSR.
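Following [47], training compute can be approximated with the rule of thumb C ≈ 6·N·D FLOPs, where N is the number of trainable parameters and D the number of tokens (here, frames) processed during training. The helper below illustrates the calculation with purely illustrative numbers; it does not reproduce the exact accounting behind the 47 and 253 exaFLOPs figures above.

```python
def approx_training_exaflops(num_params: float, frames_per_epoch: float, epochs: int) -> float:
    """Kaplan et al. rule of thumb: total training compute C ~= 6 * N * D FLOPs."""
    total_flops = 6.0 * num_params * frames_per_epoch * epochs
    return total_flops / 1e18   # convert FLOPs to exaFLOPs

# Illustrative only: a 274 M-parameter model seeing ~4e7 frames per epoch for 150 epochs.
print(approx_training_exaflops(274e6, 4e7, 150))   # ~9.9 exaFLOPs with these toy numbers
```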

6. Future Works

Our supervised multi-task framework has shown impressive improvements in VSR results and opens many future directions. Liu et al. also used audio-only data to improve VSR results; however, their method required a talking-head generation model to synthesize visual speech data from those audio-only datasets [34]. Thanks to our task-independent multi-task hybrid CTC/Attention loss, future work can take advantage of audio-only data without needing to synthesize the visual information. This removes the need to train a talking-head generation model and synthesize visual data while still exploiting the large amounts of ASR data available to improve performance. We postulate that additional ASR data would enable our multi-task framework to outperform other single-task methods on the ASR and AVSR tasks, thus enabling a robust multi-task framework with no compromises on ASR or AVSR performance.
Similarly, our results indicate that shared parameters with the ASR task greatly improve VSR results. This points to another future direction regarding fine-tuning a large high-accuracy ASR model using our multi-task framework. While fine-tuning an ASR model for VSR has been achieved in the past [48], it has not been done in a multi-task manner, which we show in this work improves robustness and generalization. We anticipate that this strategy would significantly strengthen VSR accuracy and further enhance both generalization and robustness, given that large-scale ASR models are trained on highly diverse speech corpora.
MultiAVSR exhibits robust improvements in the VSR, AVSR, and ASR tasks. While this work does not dive into the interpretability of the effect multi-task training has on the network parameters and the eventual speech recognition results, future works could examine task-specific variations in attention maps or other network visualization methods to guide targeted network architecture adjustments to improve multi-task speech recognition.
While MultiAVSR exhibits reduced reliance on external language models and thus improves inference efficiency, this work does not focus on real-time speech recognition. Recent works have found that increasing network sparsity, through methods such as network pruning [40] and mixture-of-experts [42], improves inference time for speech recognition tasks. We hypothesize that merging these sparsity techniques with our supervised multi-task framework would narrow the accuracy gap with large dense models while preserving their robustness to real-world variability and delivering inference speeds suitable for real-time use.
MultiAVSR brings about impressive improvements in performance, generalization, and robustness in the VSR, ASR, and AVSR tasks. While these improvements are substantial and MultiAVSR reduces reliance on external LMs, which enables faster real-time systems, future works can apply these methods to real-world applications such as AVSR systems in noisy auditory conditions and head-mounted VSR systems.

7. Discussion

In this work, we propose MultiAVSR, a supervised multi-task training framework for robust speech recognition. Our simple multi-task hybrid CTC/Attention loss enables large improvements in the visual speech recognition (VSR) task while requiring only 18% of the training compute of other multi-task SR approaches. This framework greatly improves VSR results, reaching 21.0% WER on LRS3-TED [26]. MultiAVSR exhibits strong generalization on the WildVSR dataset [19] with an improved WER of 44.7%. Despite slightly lower base ASR and AVSR performance, MultiAVSR shows relative improvements of 16% and 31% for the ASR and AVSR tasks, respectively, under diverse noise conditions. While single-task approaches see >7% relative improvement from adding an external language model, MultiAVSR has greater linguistic generalization with only a 2.8% improvement, demonstrating a decreased reliance on external language models. This decreased reliance is a key step towards real-time VSR, as the external language model accounts for up to 40% of the computation required during inference. We also find that MultiAVSR is the first framework to perform better on the WildVSR dataset without an LM, further validating the claim that it has greater linguistic generalization. While some approaches with access to more data perform better than MultiAVSR, our results show that supervised multi-task training exhibits strong generalization and robustness for all three speech recognition tasks and could potentially perform as well as, if not better than, other methods if larger proprietary datasets were made available.

Author Contributions

Conceptualization, S.T. and D.-J.L.; methodology, S.T.; validation, S.T.; formal analysis, S.T. and K.W.; investigation, S.T.; resources, D.-J.L.; writing—original draft preparation, S.T.; writing—review and editing, K.W. and D.-J.L.; visualization, S.T.; supervision, D.-J.L.; project administration, D.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available (accessed on 1 June 2025): LRS3-TED: https://mmai.io/datasets/lip_reading/; VoxCeleb2: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html; LRS2: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html; WildVSR: https://github.com/YasserdahouML/VSR_test_set.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SR: Speech recognition
VSR: Visual speech recognition
ASR: Audio (or automatic) speech recognition
AVSR: Audio–visual speech recognition
WER: Word error rate
LM: Language model

References

  1. Dua, M.; Akanksha; Dua, S. Noise robust automatic speech recognition: Review and analysis. Int. J. Speech Technol. 2023, 26, 475–519. [Google Scholar] [CrossRef]
  2. Cui, X.; Iseli, M.; Zhu, Q.; Alwan, A. Evaluation of noise robust features on the Aurora databases. In Proceedings of the 7th International Conference on Spoken Language Processing, INTERSPEECH, Denver, CO, USA, 16–20 September 2002; pp. 481–484. [Google Scholar]
  3. Haapakangas, A.; Hongisto, V.; Hyönä, J.; Kokko, J.; Keränen, J. Effects of unattended speech on performance and subjective distraction: The role of acoustic design in open-plan offices. Appl. Acoust. 2014, 86, 1–16. [Google Scholar] [CrossRef]
  4. Ma, P.; Haliassos, A.; Fernandez-Lopez, A.; Chen, H.; Petridis, S.; Pantic, M. Auto-avsr: Audio–visual speech recognition with automatic labels. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  5. Rouditchenko, A.; Thomas, S.; Kuehne, H.; Feris, R.; Glass, J. mWhisper-Flamingo for multilingual audio–visual noise-robust speech recognition. arXiv 2025, arXiv:2502.01547. [Google Scholar] [CrossRef]
  6. Shi, B.; Mohamed, A.; Hsu, W.N. Learning Lip-Based Audio–visual Speaker Embeddings with AV-HuBERT. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 4785–4789. [Google Scholar]
  7. Sumby, W.H.; Pollack, I. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 1954, 26, 212–215. [Google Scholar] [CrossRef]
  8. Cappellazzo, U.; Kim, M.; Chen, H.; Ma, P.; Petridis, S.; Falavigna, D.; Brutti, A.; Pantic, M. Large language models are strong audio–visual speech recognition learners. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  9. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio–visual speech and gesture recognition by sensors of mobile devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef]
  10. Sun, K.; Yu, C.; Shi, W.; Liu, L.; Shi, Y. Lip-interact: Improving mobile device interaction with silent speech commands. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, Berlin, Germany, 14–17 October 2018; pp. 581–593. [Google Scholar]
  11. Srivastava, T.; Winters, R.M.; Gable, T.; Wang, Y.T.; LaScala, T.; Tashev, I.J. Whispering wearables: Multimodal approach to silent speech recognition with head-worn devices. In Proceedings of the 26th International Conference on Multimodal Interaction, San Jose, Costa Rica, 4–8 November 2024; pp. 214–223. [Google Scholar]
  12. Jin, Y.; Gao, Y.; Xu, X.; Choi, S.; Li, J.; Liu, F.; Li, Z.; Jin, Z. EarCommand: “Hearing” your silent speech commands in ear. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 1–28. [Google Scholar] [CrossRef]
  13. Cha, H.S.; Chang, W.D.; Im, C.H. Deep-learning-based real-time silent speech recognition using facial electromyogram recorded around eyes for hands-free interfacing in a virtual reality environment. Virtual Real. 2022, 26, 1047–1057. [Google Scholar] [CrossRef]
  14. Acosta, L.H.; Reinhardt, D. A survey on privacy issues and solutions for Voice-controlled Digital Assistants. Pervasive Mob. Comput. 2022, 80, 101523. [Google Scholar] [CrossRef]
  15. Abdolrahmani, A.; Kuber, R.; Branham, S.M. “Siri Talks at You” An Empirical Investigation of Voice-Activated Personal Assistant (VAPA) Usage by Individuals Who Are Blind. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, Galway, Ireland, 22–24 October 2018; pp. 249–258. [Google Scholar]
  16. Cowan, B.R.; Pantidi, N.; Coyle, D.; Morrissey, K.; Clarke, P.; Al-Shehri, S.; Earley, D.; Bandeira, N. “What can I help you with?” infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services, Vancouver, BC, Canada, 4–7 September 2017; pp. 1–12. [Google Scholar]
  17. Pandey, L.; Hasan, K.; Arif, A.S. Acceptability of speech and silent speech input methods in private and public. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Online, 8–13 May 2021; pp. 1–13. [Google Scholar]
  18. Haliassos, A.; Mira, R.; Chen, H.; Landgraf, Z.; Petridis, S.; Pantic, M. Unified Speech Recognition: A single model for auditory, visual, and audiovisual inputs. arXiv 2024, arXiv:2411.02256. [Google Scholar]
  19. Djilali, Y.A.D.; Narayan, S.; LeBihan, E.; Boussaid, H.; Almazrouei, E.; Debbah, M. Do VSR models generalize beyond LRS3? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6635–6644. [Google Scholar]
  20. Haliassos, A.; Zinonos, A.; Mira, R.; Petridis, S.; Pantic, M. BRAVEn: Improving self-supervised pre-training for visual and auditory speech recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11431–11435. [Google Scholar]
  21. Haliassos, A.; Ma, P.; Mira, R.; Petridis, S.; Pantic, M. Jointly Learning Visual and Auditory Speech Representations from Raw Data. In Proceedings of the Eleventh International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  22. Ma, P.; Mira, R.; Petridis, S.; Schuller, B.W.; Pantic, M. Lira: Learning visual speech representations from audio through self-supervision. arXiv 2021, arXiv:2106.09171. [Google Scholar]
  23. Hsu, W.N.; Shi, B. u-hubert: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. Adv. Neural Inf. Process. Syst. 2022, 35, 21157–21170. [Google Scholar]
  24. Chung, J.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition; Interspeech: Sydney, Australia, 2018. [Google Scholar]
  25. Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio–visual model for speech separation. ACM Trans. Graph. (TOG) 2018, 37, 1–11. [Google Scholar] [CrossRef]
  26. Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
  27. Pascual, S.; Ravanelli, M.; Serrà, J.; Bonafonte, A.; Bengio, Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 161–165. [Google Scholar]
  28. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio–visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 44, 8717–8727. [Google Scholar] [CrossRef]
  29. Chung, J.S.; Zisserman, A. Lip reading in the wild. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part II 13. Springer: Berlin/Heidelberg, Germany, 2017; pp. 87–103. [Google Scholar]
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MI, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). pp. 4171–4186. [Google Scholar]
  31. Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent neural network transducer for audio–visual speech recognition. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 905–912. [Google Scholar]
  32. Zhu, Q.; Zhou, L.; Zhang, Z.; Liu, S.; Jiao, B.; Zhang, J.; Dai, L.; Jiang, D.; Li, J.; Wei, F. Vatlm: Visual-audio-text pre-training with unified masked prediction for speech representation learning. IEEE Trans. Multimed. 2024, 6, 1055–1064. [Google Scholar] [CrossRef]
  33. Serdyuk, D.; Braga, O.; Siohan, O. Transformer-based video front-ends for audio–visual speech recognition for single and multi-person video. arXiv 2022, arXiv:2201.10439. [Google Scholar]
  34. Liu, X.; Lakomkin, E.; Vougioukas, K.; Ma, P.; Chen, H.; Xie, R.; Doulaty, M.; Moritz, N.; Kolar, J.; Petridis, S.; et al. Synthvsr: Scaling up visual speech recognition with synthetic supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18806–18815. [Google Scholar]
  35. Ahn, Y.J.; Park, J.; Park, S.; Choi, J.; Kim, K.E. SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization. In Proceedings of the Interspeech 2024, ISCA, Kos Island, Greece, 1–5 September 2024; pp. 867–871. [Google Scholar]
  36. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  37. Ma, P.; Petridis, S.; Pantic, M. End-to-end audio–visual speech recognition with conformers. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7613–7617. [Google Scholar]
  38. Chang, O.; Liao, H.; Serdyuk, D.; Shahy, A.; Siohan, O. Conformer is all you need for visual speech recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10136–10140. [Google Scholar]
  39. Prajwal, K.; Afouras, T.; Zisserman, A. Sub-word level lip reading with visual attention. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5162–5172. [Google Scholar]
  40. Fernandez-Lopez, A.; Chen, H.; Ma, P.; Haliassos, A.; Petridis, S.; Pantic, M. SparseVSR: Lightweight and Noise Robust Visual Speech Recognition. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 1603–1607. [Google Scholar]
  41. Ma, P.; Petridis, S.; Pantic, M. Visual speech recognition for multiple languages in the wild. Nat. Mach. Intell. 2022, 4, 930–939. [Google Scholar] [CrossRef]
  42. Kim, S.; Jang, K.; Bae, S.; Cho, S.; Yun, S.Y. MoHAVE: Mixture of hierarchical audio–visual experts for robust speech recognition. arXiv 2025, arXiv:2502.10447. [Google Scholar]
  43. Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1021–1030. [Google Scholar]
  44. Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 1993, 12, 247–251. [Google Scholar] [CrossRef]
  45. Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6447–6456. [Google Scholar]
  46. Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv 2018, arXiv:1804.03209. [Google Scholar]
  47. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
  48. Afouras, T.; Chung, J.S.; Zisserman, A. Asr is all you need: Cross-modal distillation for lip reading. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2143–2147. [Google Scholar]
Figure 1. MultiAVSR framework overview. The raw audio and visual inputs are passed through their respective backbones and through a shared conformer encoder to create task-specific representations. For the AVSR task, the audio and visual feature representations are fused into an AVSR feature representation. These task-specific feature representations are passed into the CTC layer and transformer decoder for loss calculation during training and text prediction during evaluation. $\mathcal{L}_{\mathrm{CTC}}^{a}$, $\mathcal{L}_{\mathrm{CTC}}^{av}$, and $\mathcal{L}_{\mathrm{CTC}}^{v}$ represent the CTC loss for audio, audio–visual, and visual, respectively, and $\mathcal{L}_{\mathrm{CE}}^{a}$, $\mathcal{L}_{\mathrm{CE}}^{av}$, and $\mathcal{L}_{\mathrm{CE}}^{v}$ represent the Cross-Entropy Attention loss for audio, audio–visual, and visual, respectively.
Table 3. Language model weight ablation study. Visual speech recognition WER results on the LRS3-TED dataset when the language model weight factor (β) is ablated. All experiments were run using our model trained on 1968 h including LRS2, LRS3-TED, and VoxCeleb2. The best performer is in boldface and the second best is underlined.
| Language Model Weight β | VSR WER ↓ | Relative Change |
|---|---|---|
| 0 | 21.6 | - |
| 0.1 | 21.1 | −2.3% |
| 0.2 | 21.0 | −2.8% |
| 0.3 | 21.6 | 0% |
| 0.4 | 22.4 | +3.7% |
Table 4. Language model impact. The impact on the LRS3-TED test WER when including a language model. Lower relative change to WER is ideal as using an LM increases inference time and the compute time required. Only models from Table 2 with and without LM results are included. The best performers are in boldface and the second best are underlined.
| Method | Training Hours | WER ↓ | WER with LM ↓ | % Δ |
|---|---|---|---|---|
| SyncVSR [35] | 438 | 33.3 | 31.2 | 6.3 |
| SyncVSR [35] | 661 | 30.4 | 28.1 | 7.6 |
| SyncVSR [35] | 1992 | 23.1 | 21.4 | 7.4 |
| SynthVSR [34] | 7100 | 18.2 | 16.9 | 7.1 |
| USR [18] | 1759 | 22.3 | 21.5 | 3.6 |
| MultiAVSR | 438 | 31.1 | 29.9 | 3.9 |
| MultiAVSR | 661 | 28.1 | 27.3 | 2.8 |
| MultiAVSR | 1968 | 21.6 | 21.0 | 2.8 |
Table 5. Auditory noise experiments. ASR and AVSR auditory noise experiments compared to the SOTA single-task model. White and pink noise data is sourced from the Speech Commands dataset [46]. The best performers are in boldface.
| Noise | Model | Task | Clean | 12.5 dB | 7.5 dB | 2.5 dB | −2.5 dB | −7.5 dB | Average |
|---|---|---|---|---|---|---|---|---|---|
| Pink | Auto-AVSR [4] | Audio | 1.0 | 1.4 | 1.9 | 4.3 | 13.1 | 56.8 | 15.5 |
| Pink | MultiAVSR | Audio | 1.2 | 1.4 (0%) | 1.9 (0%) | 3.7 (+14%) | 12.0 (+8%) | 43.0 (+24%) | 12.4 (+20%) |
| Pink | Auto-AVSR [4] | Audio–visual | 0.9 | 1.2 | 1.4 | 2.3 | 6.0 | 16.2 | 5.4 |
| Pink | MultiAVSR | Audio–visual | 1.2 | 1.2 (0%) | 1.6 (−14%) | 2.0 (+13%) | 3.9 (+35%) | 9.8 (+40%) | 3.7 (+31%) |
| White | Auto-AVSR [4] | Audio | 1.0 | 2.1 | 4.0 | 10.4 | 30.2 | 88.9 | 27.1 |
| White | MultiAVSR | Audio | 1.2 | 2.2 (−5%) | 4.0 (0%) | 9.7 (+7%) | 27.2 (+10%) | 76.0 (+15%) | 23.8 (+12%) |
| White | Auto-AVSR [4] | Audio–visual | 0.9 | 1.4 | 2.3 | 4.3 | 9.5 | 24.2 | 8.3 |
| White | MultiAVSR | Audio–visual | 1.2 | 1.6 (−14%) | 2.2 (+4%) | 3.4 (+21%) | 7.0 (+26%) | 14.7 (+39%) | 5.8 (+30%) |

Relative improvement over Auto-AVSR is shown in parentheses (positive = lower WER).