Article

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Department of Electronic Engineering, Sogang University, Seoul 04107, Korea
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2020, 10(20), 7263; https://doi.org/10.3390/app10207263
Submission received: 20 September 2020 / Revised: 13 October 2020 / Accepted: 14 October 2020 / Published: 17 October 2020

Abstract

Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or has replaced the LSTM in the transformer model to overcome the sequence-to-sequence (seq2seq) problems of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. Because the audio carries richer information than the video of the lips, it is hard to train AVSR attentions with balanced modalities. In order to raise the role of the visual modality to the level of the audio modality by fully exploiting the input information in learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector computed with a video query and a video context vector computed with an audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a relative improvement of 7.3% on average in the word error rate (WER) over competing methods based on the transformer model.

1. Introduction

Automatic speech recognition (ASR) has attracted much interest because speech is the most convenient, natural, and user-friendly interface to various kinds of devices. Unfortunately, a speech signal acquired in real-world noisy environments is significantly contaminated, and the performance of ASR systems with the contaminated speech signal is seriously degraded due to the mismatch between the training and testing environments. Although many approaches have been developed to accomplish robustness by compensating for the mismatch under specific conditions, most of them fail to attain robustness in real-world environments with various types of noise (e.g., [1,2,3,4,5]). Therefore, robust recognition remains a challenging but important issue in the field of ASR.
Because visual information is not distorted by acoustic noise, visual speech recognition (known as lip reading) may play an important role in ASR in acoustically adverse environments [6]. Visual speech recognition generally provides consistent recognition accuracy regardless of the signal-to-noise ratio (SNR) of the acquired acoustic speech, whereas audio speech recognition yields worse accuracy as the SNR of the speech decreases. However, it is well known that audio speech recognition with clean speech typically achieves higher recognition accuracy than visual speech recognition, because speech without distortion provides more sufficient and clearer cues for classifying phonemes than the visual movement of the lips and face. Therefore, audio–visual speech recognition (AVSR) fuses the audio and visual information acquired from a talking face to achieve recognition performance that is comparable to or possibly higher than that of audio speech recognition with clean speech and of visual speech recognition in acoustically adverse environments [7].
Because features for visual speech recognition are not as well established as acoustic speech features, such as the logarithmic mel-frequency power spectral coefficients or cepstral coefficients, conventional visual features, including histograms of oriented gradients [8], local binary patterns [9], and features based on scale-invariant transforms [10], have commonly been used. Since these are rather general-purpose image features, the performance of visual speech recognition may be improved by devising more effective features [11]. Influenced by the impressive success of deep learning in not only diverse object detection and recognition tasks but also action recognition tasks, deep architectures such as convolutional neural networks (CNNs) and long short-term memory (LSTM) have been applied to AVSR [12]. In addition, deep learning provides ways to learn end-to-end recognition models without developing and training the separate acoustic and language models indispensable for conventional speech recognition (e.g., [7,13,14,15]).
However, a mechanism for fusing audio and visual information in AVSR still needs to be developed to achieve successful recognition performance in both acoustically clean and noisy environments. Intuitively, it is better to rely more on audio features than on visual features in clean environments, so it is hard to train AVSR attentions with balanced modalities. In [16], modality attention computes scores over the modality space at a certain time, whereas conventional attention [17] computes scores over the time space using a specific modality as a query. However, the modality attention is applied under the assumption that all modalities have the same time length. Because audio and visual features are usually generated at different time steps, they have to be resampled before the modality attention can be applied. Sterpu et al. proposed a cross-modality attention that computes the video context using an audio query (AV align), whereas conventional attention computes the video context using a video query [13].
Whereas Sterpu et al. used a single cross-modality attention computing the AV align, we propose dual cross-modality (DCM) attention that combines two cross-modality attentions, calculating the AV align and also the audio context using a video query (VA align), in order to raise the role of the visual modality to the level of the audio modality by fully exploiting the input audio and visual information in training attentions. Recently, the transformer model has provided better performance than conventional LSTM-based models because it calculates a global context vector over the entire time span of the input data [18]. Therefore, we apply our proposed DCM scheme to the transformer model. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force monotonic alignments, which results in a hybrid CTC/attention architecture that improves the performance of AVSR [19]. Figure 1 shows an overview of our AVSR architecture.
The remainder of this paper is organized as follows: Section 2 summarizes related work on the AVSR task, attention mechanisms, modality fusion, and the hybrid CTC/attention architecture for speech recognition. In Section 3, we propose an AVSR model with the DCM attention scheme and the hybrid CTC/attention architecture. Our proposed methods are compared with other attention mechanisms implemented on the transformer model through experiments on the LRS2-BBC and LRS3-TED datasets [20] in Section 4. Finally, some concluding remarks are presented in Section 5.

2. Related Work

2.1. AVSR

The AVSR problem is closely related to lip reading. Mroueh et al. [21] performed phoneme classification based on feed-forward deep neural networks (DNNs). In addition, several prior studies have conducted AVSR to recognize digits or isolated words by using various features, such as deep bottleneck features [22], discrete-cosine-transform (DCT)-based features [23], and pre-trained CNN features with mel-frequency cepstral coefficients (MFCCs) [12]. Chung et al. presented a continuous speech recognition model that fused pre-trained CNN features and audio features with a dual attention mechanism [7]. Petridis et al. studied a model that fused raw-pixel images and waveforms by using a pre-trained CNN and stacked bidirectional recurrent networks [14]. Afouras et al. compared and analyzed AVSR models by applying either a cross-entropy loss or a CTC loss to a transformer-based AVSR model [20].
Since the transformer model was presented for machine translation [24], there has been much research introducing it not only to ASR but also to many audio–visual tasks. Whereas the LSTM- and bidirectional-LSTM-based models compress the input data into a fixed-size vector, the transformer model calculates the global context over the entire input through its attention mechanism, which may result in improved performance and faster, more stable training (e.g., [18,25]). In [26], the transformer model was also combined with an LSTM-based model. For a typical AVSR transformer model with two encoders for audio and video and one common decoder, Afouras et al. [20] analyzed the advantages and disadvantages of models based on the CTC loss and the sequence-to-sequence (seq2seq) loss of the transformer model. Recently, efficient methods for fusing audio and visual information in a transformer-based AVSR model have also been studied (e.g., [27,28]). In addition, the transformer model is being investigated for various audio–visual tasks (e.g., [29,30,31,32]).

2.2. Attention Mechanism

Additive attention and dot-product attention are typical attention mechanisms. In the case of additive attention, Chan et al. [33] applied Bahdanau attention [17] to an LSTM-based audio speech recognition model. First, an encoder generates a high-level representation $\mathbf{h} = (h_1, \ldots, h_U)$ for an input $\mathbf{x} = (x_1, \ldots, x_T)$ with the number of frames $T \geq U$, which is expressed as [33]
$\mathbf{h} = \mathrm{Encoder}(\mathbf{x})$. (1)
At decoder time step $i$, the attention module computes the context $c_i$ using the scalar energy $e_{i,u}$ at every encoder-output time step $u$, which is in turn calculated from the decoder state $s_i$ and the encoder output $h_u \in \mathbf{h}$, $1 \leq u \leq U$, as [33]
$e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle$, (2)
$\alpha_{i,u} = \dfrac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})}$, (3)
$c_i = \sum_{u} \alpha_{i,u} h_u$, (4)
where $\phi$ and $\psi$ denote processing functions implemented by multi-layer perceptrons (MLPs). Finally, the decoder state $s_i$ and the output character $y_i$ at decoder time step $i$ are computed by [33]
$s_i = \mathrm{Decoder}(s_{i-1}, y_{i-1}, c_{i-1})$, (5)
$P(y_i \mid \mathbf{x}, y_{<i}) = \mathrm{CharacterDistribution}(s_i, c_i)$, (6)
where $\mathrm{Decoder}$ is composed of a two-layer LSTM and $\mathrm{CharacterDistribution}$ is an MLP with softmax outputs over characters.
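As an illustration, the additive attention of (2)-(4) can be sketched in a few lines of PyTorch. This is a minimal sketch for exposition, not the implementation of [33]; the layer sizes and tensor shapes below are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Minimal sketch of the additive attention in (2)-(4)."""
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.phi = nn.Linear(dec_dim, att_dim)   # phi(s_i)
        self.psi = nn.Linear(enc_dim, att_dim)   # psi(h_u)

    def forward(self, s_i, h):
        # s_i: (batch, dec_dim), h: (batch, U, enc_dim)
        # (2): e_{i,u} = <phi(s_i), psi(h_u)>  ->  (batch, U)
        e = torch.einsum('bd,bud->bu', self.phi(s_i), self.psi(h))
        alpha = torch.softmax(e, dim=-1)                  # (3): weights over u
        c_i = torch.einsum('bu,bud->bd', alpha, h)        # (4): context vector
        return c_i, alpha

# Example with hypothetical sizes: 512-D states, 100 encoder steps.
att = AdditiveAttention(dec_dim=512, enc_dim=512, att_dim=512)
context, weights = att(torch.randn(2, 512), torch.randn(2, 100, 512))
```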
On the other hand, Luong et al. [34] used an attentional vector $\tilde{s}_i$ at decoder time step $i$ to produce the predictive character distribution; it is computed from the decoder state $s_i$ and the source-side context vector $c_i$ as
$\tilde{s}_i = \tanh(W_c [c_i, s_i])$, (7)
$P(y_i \mid \mathbf{x}, y_{<i}) = \mathrm{softmax}(W_s \tilde{s}_i)$, (8)
where $[\cdot, \cdot]$ indicates concatenation of the two components and $W_c$ and $W_s$ are trainable parameters.
The dot-product attention applied in the transformer model [24] calculates the dot products of the query with all keys, divides each by the square root of the key dimension $d_k$, and applies a softmax function to obtain the weights on the values. The matrix of attention outputs for a set of queries $Q$ with the keys and values packed into matrices $K$ and $V$ is computed by
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \dfrac{Q K^{T}}{\sqrt{d_k}} \right) V$. (9)
To efficiently combine information from different representation subspaces at different positions, they proposed a multi-head attention mechanism using $Q$, $K$, and $V$ as inputs. The output of the $n$-th attention head is expressed as [24]
$\mathrm{Attention}_n(Q, K, V) = \mathrm{softmax}\left( \dfrac{(W_n^{q} Q^{T})^{T} (W_n^{k} K^{T})}{\sqrt{d_k}} \right) (W_n^{v} V^{T})^{T}$, (10)
where $W_n^{q}$, $W_n^{k}$, and $W_n^{v}$ denote the linear projection parameters for $Q$, $K$, and $V$, respectively.
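The scaled dot-product attention of (9) is equally compact. The snippet below is a minimal sketch, and the line at the end only notes that torch.nn.MultiheadAttention packages the per-head projections of (10); the chosen dimensions are assumptions.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in (9).
    Q: (batch, T_q, d_k), K: (batch, T_k, d_k), V: (batch, T_k, d_v)."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)       # attention weights over keys
    return torch.matmul(weights, V), weights

# A multi-head layer applies (10) with separate projections per head;
# torch.nn.MultiheadAttention bundles the projections and head splitting.
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
```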

2.3. Modality Fusion with Attention Mechanism

To focus on the relationship between the audio and visual modalities, Sterpu et al. proposed an AV align model in which one of the two decoder-side attentions of the AVSR model in [7] was moved to an encoder-side cross-modal alignment [13]. On the encoder side, the cross-modal alignment fused the modalities by computing attention using the audio and video encoder outputs as queries and values, respectively. The output of the attention mechanism is similar to (7) and is written as
$\tilde{h}_u^{av} = \tanh(W_c^{a} [c_u^{v}, h_u^{a}])$, (11)
where $c_u^{v}$ and $h_u^{a}$ denote a video context vector using an audio encoder state as a query and the audio encoder state at encoder time step $u$, respectively. Therefore, the video context vector and the audio encoder state are fused by $\tanh$ and the trainable parameters $W_c^{a}$, which results in an attentional vector obtained from the merged modalities, $\tilde{h}_u^{av}$, at the top layer of the audio encoder.
Unlike the AV align, Zhou et al. [16] proposed modality attention to obtain attention weights between modalities at each decoding step. The modality attention is similar to the conventional attention methods except that it is calculated and combined over the modality axis. The modality attention fusion process can be summarized as [16]
$z_i^{m} = Z(f_{1:i}^{m}) = \sigma\left( W \cdot \mathrm{LSTM}(f_{1:i}^{m}) + b \right)$, (12)
where $f_{1:i}^{m}$, $z_i^{m}$, and $Z$ denote the feature vectors corresponding to the encoder outputs of modality $m$ up to decoder time step $i$, the score for modality $m$ at decoder time step $i$, and the scoring function composed of the LSTM, a feed-forward network with parameters $W$ and $b$, and the sigmoid function $\sigma(\cdot)$, respectively. Then, the attention weight for the $m$-th modality, $\alpha_i^{m}$, and the fused output feature vector $v_i$ at decoder time step $i$ can be computed by
$\alpha_i^{m} = \dfrac{\exp(z_i^{m})}{\sum_{j=1}^{M} \exp(z_i^{j})}$, (13)
$v_i = \sum_{m=1}^{M} \alpha_i^{m} f_i^{m}$, (14)
where $M$ denotes the number of modalities.
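A minimal sketch of this modality-wise scoring and fusion, under the assumption of a single shared scoring LSTM and hypothetical tensor shapes, could look as follows; it illustrates (12)-(14) rather than the implementation of [16].

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Sketch of the modality attention fusion in (12)-(14): one scalar score
    per modality at each decoder step, softmaxed over the modality axis."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)     # W, b in (12)

    def forward(self, feats):
        # feats: list of M tensors, each (batch, i, feat_dim) holding the
        # per-modality features up to the current decoder step i.
        z = []
        for f in feats:
            out, _ = self.lstm(f)                                 # LSTM over time
            z.append(torch.sigmoid(self.score(out[:, -1])))       # z_i^m: (batch, 1)
        z = torch.cat(z, dim=-1)                                  # (batch, M)
        alpha = torch.softmax(z, dim=-1)                          # (13)
        current = torch.stack([f[:, -1] for f in feats], dim=1)   # (batch, M, feat_dim)
        v = (alpha.unsqueeze(-1) * current).sum(dim=1)            # (14)
        return v, alpha
```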

2.4. Speech Recognition with a Hybrid CTC/Attention Architecture

An alignment between an encoder and a decoder is one of the main issues in speech recognition. Although the attention algorithm is widely used to solve the alignment, this approach inherently has a problem of allowing non-sequential alignments. However, a CTC loss addresses this problem because it forces a monotonic alignment. A hybrid CTC/attention architecture using a CTC loss in an attention-based encoder–decoder model has been successfully applied to AVSR as well as audio speech recognition or lip reading (e.g., [19,35,36]).

3. Proposed AVSR Method Based on DCM Attention

In this section, we describe our proposed model architectures based on the recently proposed transformer model for ASR [37], AVSR [20], and the hybrid CTC/attention architecture [19].

3.1. Input Features

3.1.1. Audio Features

We use 90-D log-mel filterbank features. Each feature vector is obtained from 25-ms-long Hamming-windowed input speech at every 10 ms.
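A feature extractor matching this description can be sketched with torchaudio; the window and hop lengths and the number of mel bins follow the text above, while the mono down-mix and the log floor are assumptions for the example.

```python
import torch
import torchaudio

def logmel_features(wav_path, n_mels=90):
    """90-D log-mel filterbank features from 25-ms Hamming windows every 10 ms
    (a sketch; the down-mixing and log floor are assumptions)."""
    wav, sr = torchaudio.load(wav_path)                  # (channels, samples)
    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=int(0.025 * sr),                           # 25-ms window
        hop_length=int(0.010 * sr),                      # 10-ms shift
        window_fn=torch.hamming_window,
        n_mels=n_mels,
    )(wav.mean(dim=0))                                   # mono: (n_mels, frames)
    return torch.log(melspec + 1e-6).transpose(0, 1)     # (frames, n_mels)
```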

3.1.2. Video Features

To prepare visual features that represent sequential lip movements, we crop a 120 × 120-pixel patch covering the mouth region and convert it into a grayscale image. The cropped mouth images are then fed into the pre-trained model in [38]. Using the network based on VGG-M [39], we can get a 512-D feature vector that describes about 200-ms-long lip movement.
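The cropping step can be sketched with OpenCV as below. The mouth-box coordinates would in practice come from a face or landmark detector, which is omitted here, so the helper and its inputs are hypothetical.

```python
import cv2
import numpy as np

def mouth_roi_sequence(frames, box):
    """Crop a 120x120 mouth patch from every frame and convert it to grayscale.
    `frames` is a list of BGR images and `box = (x, y)` is the top-left corner
    of the mouth region (in practice obtained from facial landmarks)."""
    x, y = box
    rois = []
    for frame in frames:
        patch = frame[y:y + 120, x:x + 120]
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        rois.append(gray.astype(np.float32) / 255.0)
    return np.stack(rois)   # (num_frames, 120, 120), fed to the pretrained net
```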

3.2. Seq2seq Transformer

3.2.1. Positional Encoding

Instead of an explicit positional encoding, Mohamed et al. proposed a transformer model that uses convolutional layers to learn both the global and local context [37]. Similarly, we use a 2-D convolutional block for each modality in the encoder, each consisting of two 2-D convolutional layers. On the other hand, the decoder uses four 1-D convolutional layers over the previously generated outputs. Figure 2, Figure 3, and Figure 4 show the encoder, DCM attention, and decoder structures, respectively.

3.2.2. Self-Attention Encoder

The two encoders and the decoder consist of stacks of multi-head attention layers. As shown in Figure 2, we use six encoder blocks for each modality, and each block consists of a multi-head self-attention layer and two feed-forward linear layers that generate 2048 and 512 outputs, respectively. Using the input data as queries $Q$, keys $K$, and values $V$, the multi-head self-attention has 512-D features with eight heads. Like (1), each modality encoder generates a high-level representation, $h^{a}$ or $h^{v}$, for the input $x^{a}$ or $x^{v}$ after applying the VGG-M-based network [39], which is expressed as
$h^{a} = \mathrm{AudioEncoder}(\text{VGG-M}(x^{a})), \quad h^{v} = \mathrm{VideoEncoder}(\text{VGG-M}(x^{v}))$, (15)
where AudioEncoder and VideoEncoder denote the stacked self-attention encoders for the audio and video modalities, respectively.

3.2.3. DCM Attention

The AV align model provided improved performance by fusing the two modalities with attention on the encoder side instead of modality fusion on the decoder side [13]. However, when audio is used as the query in the AV align model, the attention weights may not be obtained properly in noisy environments. On the other hand, since the video data are independent of acoustic noise, it is important to raise the role of the video modality to the level of the audio modality by fully exploiting the input audio and visual information in learning attentions; using video as a query may therefore help to achieve noise robustness. To consider a video query for the audio context in addition to the audio query for the video context of the AV align model [13], and to apply them to the transformer model, our DCM attention model has two multi-head attention layers between the two modality encoders, as shown in Figure 3. The configuration of each multi-head attention layer used in the DCM attention model is the same as that of the multi-head attention used in the encoder and decoder. Using $h^{a}$ and $h^{v}$, the DCM attention outputs, AV and VA, can be expressed as
$\mathrm{AV} = \mathrm{Attention}(h^{a}, h^{v}, h^{v}), \quad \mathrm{VA} = \mathrm{Attention}(h^{v}, h^{a}, h^{a})$. (16)
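Equation (16) maps directly onto two standard multi-head attention layers. The following sketch uses torch.nn.MultiheadAttention and omits padding masks and layer normalization for brevity; the model width and head count follow the encoder configuration above.

```python
import torch
import torch.nn as nn

class DCMAttention(nn.Module):
    """Dual cross-modality attention, Equation (16): AV uses the audio encoder
    output as the query with video as key/value, and VA swaps the roles."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.av_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.va_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_a, h_v):
        # h_a: (batch, T_a, d_model), h_v: (batch, T_v, d_model)
        av, _ = self.av_att(query=h_a, key=h_v, value=h_v)  # audio-query context
        va, _ = self.va_att(query=h_v, key=h_a, value=h_a)  # video-query context
        return av, va   # av: (batch, T_a, d_model), va: (batch, T_v, d_model)
```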

3.2.4. Bi-Modal Self-Attention Decoder

We use an architecture similar to the seq2seq transformer (TM-seq2seq) model by Afouras et al. [20]. We use six decoder blocks, each of which has one multi-head self-attention and two multi-head encoder–decoder attentions. Each encoder–decoder attention uses the previous decoder outputs after the self-attention as queries and the DCM attention outputs as keys and values. Then, the two encoder–decoder attention outputs ($\mathrm{AV}^{c}$, $\mathrm{VA}^{c}$), as shown in Figure 4, are concatenated channel-wise and fed to fusion layers to calculate the attentional vector $\tilde{s}_i$ at decoder time step $i$, which is expressed as
$\tilde{s}_i = \mathrm{LayerNorm}\left( \mathrm{FusionLayer}([\mathrm{AV}^{c}, \mathrm{VA}^{c}]) + [\mathrm{AV}^{c}, \mathrm{VA}^{c}] \right)$, (17)
where FusionLayer denotes modality fusion layers composed of two fully connected layers, rectified linear units, and dropout, as shown in Figure 4. Finally, like (8), the predictive character distribution is obtained from the attentional vector $\tilde{s}_i$. The detailed hyper-parameters of the multi-head attentions and the feed-forward layers are identical to those in the encoder.
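A sketch of this fusion step is given below. The hidden sizes of the fusion layers are assumptions, and the dropout rate of 0.15 is the value reported in Section 4.3.

```python
import torch
import torch.nn as nn

class BiModalFusion(nn.Module):
    """Sketch of Equation (17): channel-wise concatenation of the two
    encoder-decoder attention outputs, a two-layer fusion block with ReLU and
    dropout, and a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, dropout=0.15):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(2 * d_model, 2 * d_model),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * d_model, 2 * d_model),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.norm = nn.LayerNorm(2 * d_model)

    def forward(self, av_c, va_c):
        # av_c, va_c: (batch, T_dec, d_model) encoder-decoder attention contexts
        x = torch.cat([av_c, va_c], dim=-1)        # channel-wise concatenation
        return self.norm(self.fusion(x) + x)       # residual + layer norm
```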

3.3. Training and Decoding with a Hybrid CTC/Attention Architecture

Similar to [19], a CTC loss is combined with the objective of our attention-based model to force the monotonic alignment required for speech recognition during training. The resulting loss function is a weighted sum of the CTC and attention objectives, computed as
$\mathcal{L} = \alpha \log p_{\mathrm{ctc}}(\mathbf{y} \mid \mathbf{x}) + (1 - \alpha) \log p_{\mathrm{att}}(\mathbf{y} \mid \mathbf{x})$, (18)
where $\mathbf{y} = (y_1, \ldots, y_I)$, $\mathbf{x}$, and $\alpha$ denote a decoded output character sequence, the input feature sequences from both modalities, and a relative weight for the loss function, respectively. A decoded output character sequence may include an extra end-of-sentence (eos) label. In the training phase, a ground-truth character sequence is used as the target label, and the detailed procedure is shown in Algorithm 1.
To calculate the CTC loss, we need to fuse the audio and video encoder results as shown in Figure 5. Therefore, the video encoder output is upsampled to have a sequence with the same length as the audio encoder output. After that, the two encoder outputs are concatenated channel-wise and fed to a feed-forward layer.
Algorithm 1: Hybrid CTC/attention training
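Algorithm 1 appears as an image in the published version. As a rough illustration only, the following sketch shows one plausible training step for the hybrid objective (18) following Figure 5, where the video encoder output is upsampled to the audio length, concatenated channel-wise, and projected for the CTC branch. The upsampling mode, tensor shapes, and helper names are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hybrid_ctc_attention_loss(h_a, h_v, dec_logits, targets, target_lens,
                              ctc_proj, alpha=0.2, blank=0):
    """One training step of the hybrid objective in (18) (a sketch).
    h_a: (batch, T_a, d), h_v: (batch, T_v, d), dec_logits: (batch, L, vocab),
    targets: (batch, L) character labels, ctc_proj: nn.Linear(2*d, vocab)."""
    # Upsample the video encoder output to the audio length and fuse (Figure 5).
    h_v_up = F.interpolate(h_v.transpose(1, 2), size=h_a.size(1),
                           mode='nearest').transpose(1, 2)
    fused = ctc_proj(torch.cat([h_a, h_v_up], dim=-1))        # (batch, T_a, vocab)

    # CTC branch: log-probs over time, assuming full-length inputs and targets.
    log_probs = fused.log_softmax(-1).transpose(0, 1)          # (T_a, batch, vocab)
    input_lens = torch.full((h_a.size(0),), h_a.size(1), dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=blank)

    # Attention branch: cross-entropy over the decoder outputs.
    att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                          targets.reshape(-1))
    return alpha * ctc + (1 - alpha) * att
```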
In the decoding phase, we use a joint CTC/attention approach: we calculate a joint score based on the CTC and attention decoder probabilities for decoded output character sequences. A hypothesis character output sequence $\hat{\mathbf{y}}$ is computed as
$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \left\{ \lambda \log p_{\mathrm{ctc}}(\mathbf{y} \mid \mathbf{x}) + (1 - \lambda) \log p_{\mathrm{att}}(\mathbf{y} \mid \mathbf{x}) \right\}$, (19)
where $\lambda$ is a relative weight in the decoding phase. $\alpha$ and $\lambda$ are set to 0.2 and 0.1, respectively, the same values as in [19].
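Inside a full system, the joint score of (19) is applied during beam search. As a minimal illustration, the sketch below simply rescores a hypothetical n-best list with both log-probabilities; the inputs and their origin are assumptions.

```python
def joint_rescore(hypotheses, lam=0.1):
    """Pick the hypothesis maximizing (19). `hypotheses` is a list of
    (text, log_p_ctc, log_p_att) tuples, e.g. the n-best list of the
    attention decoder rescored by the CTC branch (a sketch)."""
    def score(h):
        _, log_p_ctc, log_p_att = h
        return lam * log_p_ctc + (1.0 - lam) * log_p_att
    return max(hypotheses, key=score)[0]

# Example: joint_rescore([("hello world", -12.3, -10.1), ("hello word", -15.0, -9.8)])
```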

4. Experimental Results and Discussions

In this section, we describe our experimental setup and training strategies. Our proposed model was evaluated and compared with others.

4.1. Datasets

We used the LRS datasets, the largest existing public AVSR datasets [20]. Unfortunately, due to the license restriction of MV-LRS [40], we used only the LRS2-BBC and LRS3-TED datasets for training, validation, and testing. To improve robustness in noisy environments, we simulated noisy reverberant signals with signal-to-noise ratios (SNRs) of −5 to 20 dB and a reverberation time (RT60) of 0.4 s by adding babble noise samples acquired at cafeterias and restaurants and imposing reverberation generated by the image method [41] on the clean speech signals from the datasets. The added noise samples were different for training and testing.
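A sketch of mixing noise at a target SNR is shown below; the room impulse response generated by the image method [41] is assumed to be provided by a separate helper that is not shown, and the noise-looping strategy is an assumption.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rir=None):
    """Add babble noise to `clean` at the target SNR (dB), optionally after
    convolving the clean signal with a room impulse response `rir` generated
    by the image method (a sketch; RIR generation is not shown)."""
    if rir is not None:
        clean = np.convolve(clean, rir)[:len(clean)]     # impose reverberation
    noise = np.resize(noise, clean.shape)                # loop/trim the noise
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise
```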

4.2. Evaluation Measure

The performance of the models was evaluated by the word error rate (WER), defined as
$\mathrm{WER} = \dfrac{S + D + I}{N}$, (20)
where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions between the reference and hypothesis word sequences, respectively, and $N$ denotes the number of words in the reference.
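For reference, the WER of (20) can be computed with a standard word-level edit distance; the sketch below is an illustration, not the scoring tool used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level edit distance, as in (20)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("the food chain", "a food chain") == 1/3
```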

4.3. Training Strategies

We trained the model in the order of clean short sentences, clean sentences, and clean/noisy reverberant sentences. The data used at each stage are as follows:
  • Clean short sentences with three or four words in the pre-train set.
  • Clean sentences in the pre-train and train-val sets.
  • Clean and noisy reverberant sentences (as described in Section 4.1) in the train-val set.
  • Clean and noisy reverberant sentences in the train-val set of either LRS2-BBC or LRS3-TED dataset for fine tuning on either dataset.
Our implementation was based on the PyTorch library [42] and the fairseq toolkit [43]. We used the Adadelta optimizer [44] with default parameters. The learning rate started at $10^{-1}$ and was halved down to $10^{-5}$ whenever the validation error plateaued. Dropout was applied with $p = 0.15$. The implementation code of our proposed model is available at https://github.com/LeeYongHyeok/DCM_vgg_transformer.

4.4. Attention Visualization

Figure 6 shows the attention weight maps between the audio and video encoders using audio or video features as queries in our proposed model (TM-DCM) for clean and noisy reverberant data. The weights were computed by averaging over all the cross-modality attention heads. The weight map of the audio query cross-modality attention for noisy reverberant data was noisier than that for clean data because noisy reverberant data could not provide cues as clear as those of clean data, which might result in performance degradation from the speech contamination. However, the video query cross-modality attention produced very clean weight maps even with noisy reverberant audio data as keys and values.
In Figure 7, we display the encoder–decoder attention weight maps using the previous decoder outputs after the self-attention as queries and the DCM attention outputs as keys and values for the cross-entropy and hybrid CTC/attention losses. The hybrid architecture made the maps, especially those between the audio encoders and the decoders, cleaner by forcing monotonic alignments even with noisy reverberant data.

4.5. WER Results

Table 1 summarizes the word error rates (WERs) for our proposed model (TM-DCM), the TM-seq2seq, and the AV align implemented on the transformer model (TM-av_align). The TM-seq2seq model was implemented with modality-independent encoder–decoder attention as described in [20]. For a fair comparison, the TM-av_align model was implemented by applying the cross-modal alignment structure in [13] to the TM-seq2seq model, which served as the common baseline, and it performed modality fusion with attention using audio as a query on the encoder side, whereas our proposed model added the DCM attention to the baseline model. In the modality column, 'A', 'V', and 'AV' denote that the audio-only, video-only, and audio–visual modalities were used, respectively. In the objective column, 'CE' and 'H' denote the objective functions based on the cross-entropy only and the hybrid CTC/attention loss, respectively. In addition, the numbers of parameters of the models are listed; our model, which requires two additional attentions for the DCM, has 2.5% more parameters than the TM-seq2seq. Table 2 describes the architectural differences in the cross-modality attentions of the three models, since the models mainly differ in their cross-modality attention architectures when the audio–visual modalities are used.
Regardless of the method used, the WERs increased as the SNR decreased. The relatively large differences in WERs between clean and 20-dB noisy reverberant data were probably caused by the reverberation added to simulate realistic situations. When only the cross-entropy was used as the loss function, our model consistently provided better recognition performance than the TM-seq2seq and the TM-av_align regardless of the input SNR. In particular, the WER of our model averaged over the two datasets achieved a relative improvement of about 16.9% for clean data compared to that of the TM-seq2seq (much larger than the parameter growth rate), whereas the TM-av_align showed performance comparable to or slightly worse than that of the TM-seq2seq. These results arose because our model could effectively fuse the modality information through the video query cross-modality attention in addition to the audio query cross-modality attention similar to that of the AV align model. As shown in Figure 6, the clean weight maps of the video query cross-modality attention in our model were helpful for the recognition of noisy reverberant audio data as well as clean data.
Adding the CTC loss in the hybrid CTC/attention architecture further improved the recognition performance, which demonstrated that monotonic alignments are very useful for speech recognition. These results were consistent with the cleaner encoder–decoder attention weight maps. Above all, our model using the DCM attention and the hybrid CTC/attention loss achieved a WER of 8.7% averaged over the two datasets for clean data, with consistently better performance than the others under all the experimented conditions.

4.6. Decoding Examples

Table 3 shows some decoding results for 0-dB-SNR noisy reverberant data. Using the video-only modality provided inferior performance to the others due to the inherent ambiguity of visual speech recognition. Using the audio-only modality with contaminated speech, it was difficult to distinguish similar pronunciations such as "to"-"two", "that"-"bad", and "of the"-"off a". The methods using both modalities mitigated the disadvantages of the uni-modalities by fusing audio and video information. In particular, our model, which achieved performance superior to the others, predicted the correct words successfully except for the very unusual word "antiquarans".

4.7. Decoding on Sentences of Various Lengths

In Figure 8, we summarize the WERs according to the number of words in a sentence for clean and noisy reverberant data. For both types of data, our models achieved better recognition performance than the others in most cases. All the experimented models performed poorly on short sentences of three or four words because appropriate contexts could not be extracted from these sentences.

4.8. Decoding on Out-of-Sync Data

Figure 9 displays the WERs on out-of-sync data for clean and noisy reverberant data. Since the audio and video were synchronized in the datasets, we synthetically shifted the video frames to obtain out-of-sync data, as in [20]. Although the transformer model with the CTC loss only showed worse performance than that with the cross-entropy loss in [20], our model with the hybrid CTC/attention loss provided WERs comparable to those with the cross-entropy loss. Even with the CTC loss, our model remained robust against out-of-sync data because it was based on independent encoder–decoder and cross-modality attention mechanisms. These results demonstrated that our model can use the hybrid CTC/attention loss to force the monotonic alignments required for AVSR without concern about relative performance degradation on out-of-sync data.

4.9. Comparison with Simple Concatenation of Audio and Video Encoder Outputs

In order to show the effectiveness of DCM attention, Table 4 compares the WERs of our model with those for decoding on simple concatenation of audio and video encoder outputs. For all the experimented SNRs, our model outperformed the model using simple concatenation of audio and video information, which demonstrated the effectiveness of DCM attention.

4.10. Model Parameter Sensitivity and Run-Time Complexity

Figure 10 shows the WERs averaged over the clean/noisy reverberant test data with all the experimented SNRs for various model sizes. For each model, we stacked various numbers of encoder and decoder layers. Since the numbers of trainable model parameters and multiply-accumulate operations (MACs) differ for each model, Figure 10a,b display the WERs against the number of model parameters and the number of MACs on the horizontal axis, respectively. The experimental results showed that our model yielded better recognition performance than the other compared models, especially for smaller models, which indicates that our DCM structure and hybrid CTC/attention loss are efficient for fusing audio and video.

5. Conclusions

In this paper, we proposed an AVSR model based on the transformer with the DCM attention and a hybrid CTC/attention architecture. We constructed the DCM attention to obtain proper alignment information between the audio and visual modalities even with noisy reverberant audio data, and applied the hybrid CTC/attention structure to enforce monotonic alignments. In general, our model provided better recognition performance than the compared transformer-based models, even on out-of-sync data, and the hybrid CTC/attention loss further improved the performance. In the future, we will focus on more efficient strategies for fusing audio and video information and will extend our model to audio–visual speech recognition that incorporates a speech enhancement model.

Author Contributions

Conceptualization, Y.-H.L., D.-W.J., R.-H.P., and H.-M.P.; methodology, Y.-H.L., D.-W.J., and J.-B.K.; software, Y.-H.L., D.-W.J., and J.-B.K.; validation, Y.-H.L. and J.-B.K.; formal analysis, Y.-H.L. and D.-W.J.; investigation, Y.-H.L., D.-W.J., R.-H.P., and H.-M.P.; resources, R.-H.P. and H.-M.P.; data curation, Y.L. and H.-M.P.; writing—original draft preparation, Y.-H.L., R.-H.P., and H.-M.P.; writing—review and editing, Y.-H.L., D.-W.J., J.-B.K., R.-H.P., and H.-M.P.; visualization, Y.-H.L.; supervision, R.-H.P. and H.-M.P.; project administration, H.-M.P.; funding acquisition, H.-M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2017R1A2B4009964 and NRF-2020R1A2B5B01002398).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LSTM: long short-term memory
seq2seq: sequence-to-sequence
AVSR: audio–visual speech recognition
DCM: dual cross-modality
CTC: connectionist-temporal-classification
ASR: automatic speech recognition
SNR: signal-to-noise ratio
CNN: convolutional neural network
AV align: cross-modality attention that computes the video context using an audio query
VA align: cross-modality attention that computes the audio context using a video query
DNN: deep neural network
DCT: discrete-cosine-transform
MFCC: mel-frequency cepstral coefficient
MLP: multi-layer perceptron
sos: start of a sentence
eos: end of a sentence
RT60: reverberation time
WER: word error rate
TM: transformer model

References

  1. Virtanen, T.; Singh, R.; Raj, B. (Eds.) Techniques for Noise Robustness in Automatic Speech Recognition; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2012.
  2. Wölfel, M.; McDonough, J. Distant Speech Recognition; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2009.
  3. Droppo, J.; Acero, A. Environmental Robustness. In Handbook of Speech Processing; Springer: Berlin, Germany, 2008.
  4. Raj, B.; Parikh, V.; Stern, R.M. The effects of background music on speech recognition accuracy. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany, 21–24 April 1997; Volume 2, pp. 851–854.
  5. Cho, J.W.; Park, J.H.; Chang, J.H.; Park, H.M. Bayesian feature enhancement using independent vector analysis and reverberation parameter re-estimation for noisy reverberant speech recognition. Comput. Speech Lang. 2017, 46, 496–516.
  6. Zhou, Z.; Zhao, G.; Hong, X.; Pietikäinen, M. A review of recent advances in visual speech decoding. Image Vis. Comput. 2014, 32, 590–605.
  7. Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3444–3453.
  8. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
  9. Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2006, 28, 2037–2041.
  10. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  11. Jang, D.W.; Kim, H.I.; Je, C.; Park, R.H.; Park, H.M. Lip Reading Using Committee Networks With Two Different Types of Concatenated Frame Images. IEEE Access 2019, 7, 90125–90131.
  12. Noda, K.; Yamaguchi, Y.; Nakadai, K.; Okuno, H.G.; Ogata, T. Audio-visual speech recognition using deep learning. Appl. Intell. 2015, 42, 722–737.
  13. Sterpu, G.; Saam, C.; Harte, N. Attention-based audio-visual fusion for robust automatic speech recognition. In Proceedings of the International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; Volume 5, pp. 111–115.
  14. Petridis, S.; Stafylakis, T.; Ma, P.; Cai, F.; Tzimiropoulos, G.; Pantic, M. End-to-end audiovisual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6548–6552.
  15. Shillingford, B.; Assael, Y.; Hoffman, M.W.; Paine, T.; Hughes, C.; Prabhu, U.; Liao, H.; Sak, H.; Rao, K.; Bennett, L.; et al. Large-Scale Visual Speech Recognition. In Interspeech; 2019; pp. 4135–4139. Available online: https://arxiv.org/pdf/1807.05162.pdf (accessed on 20 September 2020).
  16. Zhou, P.; Yang, W.; Chen, W.; Wang, Y.; Jia, J. Modality attention for end-to-end audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6565–6569.
  17. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  18. Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A comparison of transformer and LSTM encoder decoder models for ASR. In Proceedings of the IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, Sentosa, Singapore, 14–18 December 2019; pp. 8–15.
  19. Petridis, S.; Stafylakis, T.; Ma, P.; Tzimiropoulos, G.; Pantic, M. Audio-visual speech recognition with a hybrid CTC/attention architecture. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 513–520.
  20. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2018.
  21. Mroueh, Y.; Marcheret, E.; Goel, V. Deep multimodal learning for Audio-Visual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 2130–2134.
  22. Tamura, S.; Ninomiya, H.; Kitaoka, N.; Osuga, S.; Iribe, Y.; Takeda, K.; Hayamizu, S. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 16–19 December 2015; pp. 575–582.
  23. Galatas, G.; Potamianos, G.; Makedon, F. Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 2714–2717.
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  25. Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-based acoustic modeling for hybrid speech recognition. arXiv 2019, arXiv:1910.09799.
  26. Yeh, C.F.; Mahadeokar, J.; Kalgaonkar, K.; Wang, Y.; Le, D.; Jain, M.; Schubert, K.; Fuegen, C.; Seltzer, M.L. Transformer-Transducer: End-to-End Speech Recognition with Self-Attention. arXiv 2019, arXiv:1910.12977.
  27. Paraskevopoulos, G.; Parthasarathy, S.; Khare, A.; Sundaram, S. Multimodal and Multiresolution Speech Recognition with Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, DC, USA, 6–8 July 2020; pp. 2381–2387.
  28. Sterpu, G.; Saam, C.; Harte, N. Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition. arXiv 2020, arXiv:2005.09297.
  29. Boes, W.; Van hamme, H. Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1961–1969.
  30. Li, Z.; Li, Z.; Zhang, J.; Feng, Y.; Niu, C.; Zhou, J. Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog. arXiv 2020, arXiv:2002.00163.
  31. Le, H.; Sahoo, D.; Chen, N.; Hoi, S. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Volume 2, pp. 5612–5623.
  32. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Volume 2019, pp. 6558–6569.
  33. Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964.
  34. Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
  35. Watanabe, S.; Hori, T.; Kim, S.; Hershey, J.R.; Hayashi, T. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1240–1253.
  36. Xu, K.; Li, D.; Cassimatis, N.; Wang, X. LCANet: End-to-end lipreading with cascaded attention-CTC. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi'an, China, 15–19 May 2018; pp. 548–555.
  37. Mohamed, A.; Okhonko, D.; Zettlemoyer, L. Transformers with convolutional context for ASR. arXiv 2019, arXiv:1904.11660.
  38. Chung, J.S.; Zisserman, A. Out of time: Automated lip sync in the wild. In Asian Conference on Computer Vision (ACCV); Springer: Berlin, Germany, 2016; pp. 251–263.
  39. Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 1–5 September 2014.
  40. Chung, J.S.; Zisserman, A. Lip Reading in Profile. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017.
  41. Allen, J.B.; Berkley, D.A. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
  42. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the Neural Information Processing Systems (NeurIPS) Workshop, Long Beach, CA, USA, 4–9 December 2017.
  43. Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 2–7 June 2019; pp. 48–53.
  44. Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701.
Figure 1. An overview of our proposed audio–visual speech recognition (AVSR) architecture. It consists of four modules: encoders, dual cross-modality (DCM) attentions, an attention decoder, and a connectionist-temporal-classification (CTC) block. The encoders receive each modality and compress their information. DCM attentions are calculated by using different modalities for the queries and the keys. Finally, character probabilities are calculated by using both the attention decoder and the CTC block.
Figure 2. Encoder structures. Each structure generates audio or video features using a multi-head self attention followed by two feed-forward layers (FC 1 and FC 2). The number of encoder blocks, N, is six in these structures. (a) Video encoder; (b) audio encoder.
Figure 3. DCM attention architecture. Either audio or video encoder output as a query Q is fused with the other modality as a key K and a value V in a multi-head attention. AV and VA denote the audio query encoder and video query encoder outputs, respectively.
Figure 4. Decoder structure. Similar to the transformer sequence-to-sequence (TM-seq2seq) model, the transformer with DCM attention (TM-DCM) model receives the audio query encoder output (AV) and video query encoder output (VA) and returns character probabilities. Context vectors (AVc and VAc) computed by multi-head encoder–decoder attentions are concatenated channel-wise and fed to feed-forward layers.
Figure 5. Procedure to compute the loss function for the TM-DCM model. The TM-DCM model is trained by using the hybrid CTC/attention architecture to force monotonic alignments required for speech recognition.
Figure 6. DCM attention weight maps between audio and video encoders using audio or video features as queries in the TM-DCM for clean and noisy reverberant data. The used clean utterance was "PSlBlZ3hqKc/00011.mp4" in the test set of the LRS3-TED dataset, whose character label was "that is the real world and unless we find a way to globalize democracy or". The input SNR for noisy reverberant data was 0 dB. The weights were computed by averaging over all the cross-modality attention heads. The weight maps for (a) clean audio query and video key/value; (b) noisy reverberant audio query and video key/value; (c) video query and clean audio key/value; and (d) video query and noisy reverberant audio key/value.
Figure 7. Encoder–decoder attention weight maps using the DCM attention outputs as keys and values in the TM-DCM for the cross-entropy and hybrid CTC/attention losses. The weight maps were displayed for both clean and noisy reverberant data, and the used clean and noisy reverberant data were the same as in Figure 6. The weights were computed by averaging over all the encoder–decoder attention heads at all decoder layers. The weight maps for (ad) the cross-entropy loss and (eh) the hybrid CTC/attention loss; the weight maps between (a,e) audio encoders and decoders using clean audio input; (b,f) audio encoders and decoders using noisy reverberant audio input; (c,g) video encoders and decoders using clean audio input; and (d,h) video encoders and decoders using noisy reverberant audio input.
Figure 8. WERs according to the numbers of words in sentences for the models using audio–visual modality on the test sets of LRS2-BBC and LRS3-TED datasets with (a) clean and (b) noisy reverberant audio data.
Figure 9. WERs on out-of-sync data for the models using audio–visual modality on the test sets of LRS2-BBC and LRS3-TED datasets with (a) clean and (b) noisy reverberant audio data. The video frames were shifted by the numbers of frames on the horizontal axes with audio data fixed. The positive offset meant that the audio preceded the video, and the negative offset meant vice versa.
Figure 10. WERs for various model sizes with the numbers of (a) trainable model parameters and (b) multiply-accumulate operations (MACs). The data used to measure the MACs were "PSlBlZ3hqKc/00011.mp4" in the test set of the LRS3-TED dataset.
Table 1. Word error rates (WERs) (%) for the TM-seq2seq, TM-av_align, and TM-DCM on the LRS2-BBC and LRS3-TED datasets. The boldface WERs denote the best performance in each condition. Abbreviations used in the table: A, audio-only modality; V, video-only modality; AV, audio–visual modalities; CE, cross-entropy loss only; H, hybrid CTC/attention loss. Because the video-only modality case (V) does not use the audio modality, the TM-seq2seq using the video-only modality has a constant WER for each dataset in the first two rows.
Model | Modality | Objective | #Params | Dataset | Clean | Noisy reverberant 20 dB | 15 dB | 10 dB | 5 dB | 0 dB | −5 dB | Avg.
TM-seq2seq | V | CE | 54.2 M | LRS2-BBC | 59.7 (constant over all conditions)
TM-seq2seq | V | CE | 54.2 M | LRS3-TED | 67.3 (constant over all conditions)
TM-seq2seq | A | CE | 47.3 M | LRS2-BBC | 9.8 | 21.7 | 23.3 | 25.7 | 33.7 | 47.6 | 68.9 | 33.0
TM-seq2seq | A | CE | 47.3 M | LRS3-TED | 10.1 | 21.4 | 23.5 | 26.1 | 33.8 | 48.1 | 69.6 | 33.2
TM-seq2seq | AV | CE | 84.6 M | LRS2-BBC | 10.5 | 19.7 | 19.8 | 23.0 | 25.1 | 34.0 | 43.7 | 25.1
TM-seq2seq | AV | CE | 84.6 M | LRS3-TED | 10.8 | 20.0 | 20.2 | 23.5 | 27.6 | 36.4 | 51.3 | 27.1
TM-av_align | AV | CE | 76.2 M | LRS2-BBC | 11.5 | 18.8 | 19.3 | 22.6 | 25.0 | 31.2 | 43.4 | 22.6
TM-av_align | AV | CE | 76.2 M | LRS3-TED | 11.7 | 18.1 | 18.9 | 21.8 | 25.8 | 34.1 | 47.1 | 25.4
TM-DCM | AV | CE | 86.7 M | LRS2-BBC | 8.7 | 17.3 | 17.5 | 19.2 | 22.0 | 29.2 | 41.2 | 22.2
TM-DCM | AV | CE | 86.7 M | LRS3-TED | 9.0 | 17.8 | 18.0 | 19.8 | 22.9 | 31.5 | 45.8 | 23.5
TM-DCM | AV | H | 86.7 M | LRS2-BBC | 8.6 | 16.8 | 16.9 | 18.8 | 22.0 | 28.9 | 40.7 | 21.8
TM-DCM | AV | H | 86.7 M | LRS3-TED | 8.8 | 17.1 | 17.3 | 19.2 | 22.2 | 30.9 | 43.6 | 22.7
Table 2. Architectural differences in cross-modality attentions between the TM-seq2seq, TM-av_align, and TM-DCM.
Model | TM-seq2seq | TM-av_align | TM-DCM
Modality attention (Query-Key/Value) | None | Audio–Video | Audio–Video and Video–Audio
Table 3. Some decoding results for 0-dB-SNR noisy reverberant data. The boldface words denote wrong predictions. Abbreviations used in the table: A, audio-only modality; V, video-only modality; AV, audio–visual modalities; CE, cross-entropy loss only; H, hybrid CTC/attention loss.
Model | Modality | Objective | Transcription
Ground truth | | | and it's even rarer to find one that hasn't been dug into by antiquarans
TM-seq2seq | V | CE | and it's even rarer to find one that hasn't bin diagnosed by asking quarries
TM-seq2seq | A | CE | and it's equal rare two find one that hasn't been dug into by anti crayons
TM-seq2seq | AV | CE | and it's even rarer to find one that hasn't been dug into by antique areas
TM-av_align | AV | CE | and it's even rarer to find one that hasn't been dug into by antiquarists
TM-DCM | AV | CE | and it's even rarer to find one that hasn't been dug into by antiquate risks
TM-DCM | AV | H | and it's even rarer to find one that hasn't been dug into by antiquarans
Ground truth | | | home to an animal that is right at the top of the food chain
TM-seq2seq | V | CE | home to an animal has raised in some of the future in
TM-seq2seq | A | CE | home to an animal bad is rights into top off a food chain
TM-seq2seq | AV | CE | home to an animal that is right at the top over food chain
TM-av_align | AV | CE | home to an animal that is right at the top of a food chain
TM-DCM | AV | CE | home to an animal that is right at the top of the food chain
TM-DCM | AV | H | home to an animal that is right at the top of the food chain
Ground truth | | | and would eventually marry her after his wife
TM-seq2seq | V | CE | and would eventually the most american hundreds of
TM-seq2seq | A | CE | and would eventually marry him got the his wife
TM-seq2seq | AV | CE | and would emit actually marry her after his wife
TM-av_align | AV | CE | and would emitting her after his wife
TM-DCM | AV | CE | and would eventually marry her after his wife
TM-DCM | AV | H | and would eventually marry her after his wife
Table 4. WERs (%) for our model and for decoding on simple concatenation of audio and video encoder outputs on the test sets of LRS2-BBC and LRS3-TED datasets. The boldface WERs denote the best performance in each condition. Abbreviations used in the table: CE, cross-entropy loss only; H, hybrid CTC/attention loss.
Condition | Concatenation (CE) | TM-DCM (CE) | TM-DCM (H)
Clean | 9.6 | 8.8 | 8.7
Noisy reverberant, SNR 20 dB | 18.2 | 17.5 | 16.9
Noisy reverberant, SNR 15 dB | 18.9 | 17.7 | 17.0
Noisy reverberant, SNR 10 dB | 21.2 | 19.4 | 18.9
Noisy reverberant, SNR 5 dB | 25.2 | 22.3 | 22.1
Noisy reverberant, SNR 0 dB | 33.5 | 30.0 | 29.5
Noisy reverberant, SNR −5 dB | 47.3 | 42.7 | 41.7
Avg. | 24.8 | 22.6 | 22.1
