Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Abstract: Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or has replaced the LSTM in a transformer model to overcome the sequence-to-sequence (seq2seq) problems of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. Because the audio modality carries richer information than the video of the lips, it is hard to train AVSR attentions with balanced modalities. In order to raise the role of the visual modality to the level of the audio modality by fully exploiting input information in learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector using a video query and a video context vector using an audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a relative improvement of 7.3% on average in the word error rate (WER) compared to competing methods based on the transformer model.


Introduction
Automatic speech recognition (ASR) has attracted much interest because speech is the most convenient, natural, and user-friendly interface to various kinds of devices. Unfortunately, a speech signal acquired in real-world noisy environments is significantly contaminated, and the performance of ASR systems with the contaminated speech signal is seriously degraded due to the mismatch between the training and testing environments. Although many approaches have been developed to accomplish robustness by compensating for the mismatch under specific conditions, most of them fail to attain robustness in real-world environments with various types of noise (e.g., [1][2][3][4][5]). Therefore, robust recognition remains a challenging but important issue in the field of ASR.
Because visual information is not distorted by acoustic noise, visual speech recognition (known as lip reading) may play an important role in ASR in acoustically adverse environments [6]. Visual speech recognition generally provides consistent recognition accuracy regardless of the signal-to-noise ratio (SNR) of the acquired acoustic speech, whereas audio speech recognition degrades at lower SNRs. However, it is well known that audio speech recognition with clean speech typically achieves higher recognition accuracy than visual speech recognition, because undistorted speech provides more sufficient and clearer cues for classifying phonemes than the visual movement of the lips and face. Therefore, audio-visual speech recognition (AVSR) fuses the audio and visual information acquired from a talking face to achieve comparable or possibly higher recognition performance than audio speech recognition with clean speech and than visual speech recognition in acoustically adverse environments [7].
Because features for visual speech recognition are not as well established as acoustic speech features, such as logarithmic mel-frequency power spectral coefficients or cepstral coefficients, conventional visual features including histograms of oriented gradients [8], local binary patterns [9], and features based on scale-invariant transforms [10] have been commonly used. Since these are rather general-purpose image features, the performance of visual speech recognition may be improved by devising more effective features [11]. Influenced by the impressive success of deep learning in diverse object detection and recognition tasks as well as action recognition tasks, deep architectures such as convolutional neural networks (CNNs) and long short-term memory (LSTM) have been applied to AVSR [12]. In addition, deep learning provides ways to learn end-to-end recognition models without developing and training the separate acoustic and language models indispensable for conventional speech recognition (e.g., [7,13–15]).
However, a mechanism for fusing audio and visual information in AVSR still needs to be developed to achieve successful recognition performance in both acoustically clean and noisy environments. Intuitively, it is better to rely more on audio features than on visual features in clean environments, so it is hard to train AVSR attentions with balanced modalities. In [16], modality attention computes scores over the modality space at a given time, whereas conventional attention [17] computes scores over the time space using a specific modality (query). However, the modality attention is applied under the assumption that all modalities have the same time length. Because audio and visual features are usually generated at different time steps, they have to be resampled to apply the modality attention. Sterpu et al. proposed cross-modality attention that computes the video context using an audio query (AV align), whereas conventional attention computes the video context using a video query [13].
Although Sterpu et al. used cross-modality attention computing only the AV align, we propose dual cross-modality (DCM) attention that combines two cross-modality attentions calculating both the AV align and the audio context using a video query (VA align), in order to raise the role of the visual modality to the level of the audio modality by fully exploiting input audio and visual information in training attentions. Recently, the transformer model has provided better performance than the conventional LSTM-based model because it calculates the global context vector over the entire time span of the input data [18]. Therefore, we apply our proposed DCM scheme to the transformer model. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force monotonic alignments, which results in a hybrid CTC/attention architecture that improves the performance of AVSR [19]. Figure 1 shows an overview of our AVSR architecture.
The remainder of this paper is organized as follows: Section 2 summarizes related works on the AVSR task, attention mechanism, modality fusion, and hybrid CTC/attention architecture for speech recognition. In Section 3, we propose an AVSR model with DCM attention scheme and the hybrid CTC/attention architecture. Our proposed methods are compared with other attention mechanisms implemented on the transformer model through experiments on LRS2-BBC and LRS3-TED datasets [20] in Section 4. Finally, some concluding remarks are presented in Section 5.  Figure 1. An overview of our proposed audio-visual speech recognition (AVSR) architecture. It consists of four modules: encoders, dual cross-modality (DCM) attentions, an attention decoder, and a connectionist-temporal-classification (CTC) block. The encoders receive each modality and compress their information. DCM attentions are calculated by using different modalities for the queries and the keys. Finally, character probabilities are calculated by using both the attention decoder and the CTC block.

AVSR
The AVSR problem is highly related to lip reading. Mroueh et al. [21] performed phoneme classification based on feed-forward deep neural networks (DNNs). In addition, several prior studies have conducted AVSR to recognize digits or isolated words by using various features such as deep bottleneck features [22], discrete-cosine-transform (DCT)-based features [23], and pre-trained CNN features with mel-frequency cepstral coefficients (MFCCs) [12]. Chung et al. published a continuous speech recognition model that fused pre-trained CNN features and audio features with a dual attention mechanism [7]. Petridis et al. studied a model that fused raw-pixel images and waveforms by using pre-trained CNN and stacked bidirectional recurrent network [14]. Afouras et al. compared and analyzed AVSR models by applying either cross-entropy loss or CTC loss to a transformer-based AVSR model [20].
Since the transformer model was presented for machine translation [24], much research has introduced the transformer model not only to ASR but also to many audio-visual tasks. Whereas the LSTM- and bidirectional-LSTM-based models compress input data into a fixed-size vector, the transformer model calculates the global context over the entire input data through its attention mechanism, which may result in improved performance and faster, more stable training (e.g., [18,25]). In [26], the transformer model was also combined with an LSTM-based model. For a typical AVSR transformer model with two encoders for audio and video and one common decoder, Afouras et al. [20] analyzed the advantages and disadvantages of models based on the CTC loss and the sequence-to-sequence (seq2seq) loss. Recently, efficient methods for fusing audio and visual information in a transformer-based AVSR model have also been studied (e.g., [27,28]). In addition, the transformer model is under study for various audio-visual tasks (e.g., [29–32]).

Attention Mechanism
Additive attention and dot-product attention are typical attention mechanisms. For additive attention, Chan et al. [33] applied Bahdanau attention [17] to an LSTM-based audio speech recognition model. First, an encoder generates a high-level representation h = (h_1, . . . , h_U) for input x = (x_1, . . . , x_T) with the number of frames T ≥ U, which is expressed as [33]

h = Encoder(x). (1)

At decoder time step i, the attention module computes the context c_i using the scalar energy e_{i,u} at every encoder-output time step u, which is calculated from the decoder state s_i and the encoder output h_u ∈ h, 1 ≤ u ≤ U, as [33]

e_{i,u} = ⟨φ(s_i), ψ(h_u)⟩, (2)
α_{i,u} = exp(e_{i,u}) / Σ_{u′} exp(e_{i,u′}), (3)
c_i = Σ_u α_{i,u} h_u, (4)

where φ and ψ denote processing functions implemented by multi-layer perceptrons (MLPs). Finally, the decoder state s_i and the output character y_i at decoder time step i are computed by [33]

s_i = Decoder(s_{i−1}, y_{i−1}, c_{i−1}), (5)
y_i ∼ CharacterDistribution(s_i, c_i), (6)

where Decoder is composed of a two-layer LSTM and CharacterDistribution is an MLP with softmax outputs over characters.
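The additive-attention step above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the authors' implementation: the matrices W_s and W_h and the vector w play the roles of the MLPs φ and ψ in a common additive (Bahdanau-style) parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(s_i, h, W_s, W_h, w):
    """One decoder step of additive attention.

    s_i : (d_s,)   decoder state at step i
    h   : (U, d_h) encoder outputs h_1..h_U
    W_s, W_h, w : illustrative projection parameters standing in for phi/psi
    Returns the context vector c_i and the attention weights alpha_{i,u}.
    """
    # scalar energy e_{i,u} for every encoder time step u
    e = np.tanh(h @ W_h.T + s_i @ W_s.T) @ w   # shape (U,)
    alpha = softmax(e)                         # weights over encoder steps
    c_i = alpha @ h                            # context vector, shape (d_h,)
    return c_i, alpha
```

The weights sum to one over the encoder time axis, so the context is a convex combination of the encoder outputs.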
On the other hand, Luong et al. [34] used an attentional vector s̃_i at decoder time step i to produce the predictive character distribution, computed from the decoder state s_i and the source-side context vector c_i as

s̃_i = tanh(W_c [c_i; s_i]), (7)
p(y_i | y_{<i}, x) = softmax(W_s s̃_i), (8)

where [·, ·] indicates concatenation of the two components and W_c and W_s are trainable parameters. The dot-product attention applied to the transformer model [24] calculates the dot products of the query with all keys, divides each by the square root of the key dimension √d_k, and applies a softmax function to obtain the weights on the values. The matrix of attention outputs for a set of queries Q with the keys and values packed into matrices K and V is computed by

Attention(Q, K, V) = softmax(Q K^T / √d_k) V. (9)

To efficiently combine information from different representation subspaces at different positions, they proposed the multi-head attention mechanism using Q, K, and V as inputs. The output of the n-th attention head is expressed as [24]

Attention_n(Q, K, V) = Attention(Q W_n^q, K W_n^k, V W_n^v), (10)

where W_n^q, W_n^k, and W_n^v denote the linear projection parameters for Q, K, and V, respectively.
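The scaled dot-product and multi-head computations can be sketched as follows. This is a minimal NumPy illustration: the per-head projection triples are stand-ins for the learned parameters W_n^q, W_n^k, and W_n^v.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head(Q, K, V, heads):
    """heads: list of (Wq, Wk, Wv) projection triples, one per head.
    Each head attends in its own subspace; outputs are concatenated."""
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1)
```

Note that the query and key/value sequences may have different lengths; the output always has the query's length, which is what the cross-modality attentions later in the paper rely on.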

Modality Fusion with Attention Mechanism
To focus on the relationship between audio and visual modalities, Sterpu et al. proposed an AV align model in which one of the two decoder-side attentions of the AVSR model in [7] was moved to an encoder-side cross-modal alignment [13]. On the encoder side, the cross-modal alignment fuses the modalities by computing attention using audio and video encoder outputs as queries and values, respectively. The output of the attention mechanism is similar to (7), which is written as

c_u^v = Attention(h_u^a, h^v, h^v),

where c_u^v and h_u^a denote the video context vector using the audio encoder state as a query and the audio encoder state at encoder time step u, respectively. The video context vector and the audio encoder state are then fused by a tanh with trainable parameters W_c^a,

h̃_u^av = tanh(W_c^a [c_u^v; h_u^a]),

which results in an attentional vector obtained from the merged modalities, h̃_u^av, at the top layer of the audio encoder.
Unlike the AV align, Zhou et al. [16] proposed modality attention to obtain attention weights between modalities in a decoding step. The modality attention is similar to conventional attention methods except that it is calculated and combined over the modality axis. The modality attention fusion process can be summarized as [16]

z_i^m = Z(f_{1...i}^m) = σ(W · LSTM(f_{1...i}^m) + b),

where f_{1...i}^m, z_i^m, and Z denote the feature vectors corresponding to the encoder outputs of modality m up to decoder time step i, the score for the feature vectors of modality m at decoder time step i, and the scoring function composed of the LSTM, the feed-forward network with parameters W and b, and the sigmoid function σ(·), respectively. Then, the attention weight for the m-th modality, α_i^m, and the fused output feature vector v_i at decoder time step i can be computed by

α_i^m = exp(z_i^m) / Σ_{m′=1}^{M} exp(z_i^{m′}),
v_i = Σ_{m=1}^{M} α_i^m f_i^m,

where M denotes the number of modalities.
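A minimal sketch of the modality-axis fusion described above, assuming the per-modality scores z_i^m have already been produced by the scoring function Z; the softmax runs over the modality axis rather than the time axis.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_fusion(features, scores):
    """features: (M, d) one feature vector per modality at a decoder step.
    scores:   (M,)  scalar scores z_i^m from the scoring function Z.
    Returns the fused vector v_i and the per-modality weights alpha_i^m."""
    alpha = softmax(scores)       # one weight per modality, summing to 1
    return alpha @ features, alpha
```

With equal scores the fusion degenerates to a plain average of the modality features, which makes the role of the learned scoring function easy to see.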

Speech Recognition with a Hybrid CTC/Attention Architecture
An alignment between an encoder and a decoder is one of the main issues in speech recognition. Although the attention algorithm is widely used to solve the alignment, this approach inherently has a problem of allowing non-sequential alignments. However, a CTC loss addresses this problem because it forces a monotonic alignment. A hybrid CTC/attention architecture using a CTC loss in an attention-based encoder-decoder model has been successfully applied to AVSR as well as audio speech recognition or lip reading (e.g., [19,35,36]).

Proposed AVSR Method Based on DCM Attention
In this section, we describe our proposed model architectures based on the recently proposed transformer model for ASR [37], AVSR [20], and the hybrid CTC/attention architecture [19].

Audio Features
We use 90-D log-mel filterbank features. Each feature vector is obtained from 25-ms-long Hamming-windowed input speech at every 10 ms.
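The framing step behind these features can be sketched as follows. This is illustrative only: the subsequent FFT, 90-band mel filterbank, and log compression are omitted, and a 16 kHz sampling rate is assumed.

```python
import numpy as np

def frames_for_logmel(x, fs=16000, win_ms=25, hop_ms=10):
    """Slice a waveform into 25-ms Hamming-windowed frames every 10 ms.
    Each frame would then pass through an FFT, a 90-band mel filterbank,
    and a log to yield one 90-D feature vector."""
    win = fs * win_ms // 1000    # 400 samples at 16 kHz
    hop = fs * hop_ms // 1000    # 160 samples
    n = 1 + (len(x) - win) // hop
    idx = hop * np.arange(n)[:, None] + np.arange(win)
    return x[idx] * np.hamming(win)
```

One second of 16 kHz audio yields 98 frames, i.e., roughly one feature vector every 10 ms as stated above.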

Video Features
To prepare visual features that represent sequential lip movements, we crop a 120 × 120-pixel patch covering the mouth region and convert it into a grayscale image. The cropped mouth images are then fed into the pre-trained model in [38]. Using the network based on VGG-M [39], we can get a 512-D feature vector that describes about 200-ms-long lip movement.

Positional Encoding
In order to learn both the global context and the local context in the transformer model, Mohamed et al. proposed a transformer model using convolutional layers [37]. Similarly, we use a 2-D convolutional block for each modality in the encoder, each of which consists of two 2-D convolutional layers. On the other hand, the decoder uses four 1-D convolutional layers over previously generated outputs. Figures 2–4 illustrate the detailed architectures.

Self-Attention Encoder
Two encoders and one decoder consist of stacks of multi-head attention layers. As shown in Figure 2, we use six encoder blocks for each modality, and each block consists of a multi-head self-attention layer and two feed-forward linear layers that generate 2048 and 512 outputs. Using the input data as queries Q, keys K, and values V, the multi-head self-attention has 512-D features with eight heads. Like (1), each modality encoder generates a high-level representation, h^a or h^v, for input x^a or x^v after applying the VGG-M-based network [39], which is expressed as

h^a = AudioEncoder(x^a),
h^v = VideoEncoder(x^v),

where AudioEncoder and VideoEncoder denote the stacked self-attention encoders for the audio and video modalities, respectively.

DCM Attention
The AV align model provided improved performance by fusing the two modalities with attention on the encoder side instead of modality fusion on the decoder side [13]. However, when using audio as a query in the AV align model, attention weights may not be properly obtained in noisy environments. On the other hand, since video data are independent of acoustic noise, it may be important to raise the role of the video modality to the level of the audio modality by fully exploiting input audio and visual information in learning attentions; using video as a query may therefore help achieve noise robustness. To consider a video query for the audio context in addition to the audio query for the video context of the AV align model [13], and to apply them to the transformer model, our DCM attention model has two multi-head attention layers between the two modality encoders, as shown in Figure 3. The configuration of each multi-head attention layer in the DCM attention model is the same as that of the multi-head attention used in the encoder and decoder. Using h^a and h^v, the DCM attention outputs, AV and VA, can be expressed as

AV = MultiHeadAttention(h^a, h^v, h^v),
VA = MultiHeadAttention(h^v, h^a, h^a),

where the arguments are the queries, keys, and values in order. Figure 3. DCM attention architecture. Either the audio or the video encoder output as a query Q is fused with the other modality as a key K and a value V in a multi-head attention. AV and VA denote the audio-query and video-query encoder outputs, respectively.
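The two cross-modality attentions can be sketched with single-head attention for clarity (the actual model uses eight heads with learned projections); the point of the sketch is the query/key-value asymmetry and the resulting output lengths.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # single-head scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def dcm_attention(h_a, h_v):
    """Dual cross-modality attention (sketch).
    AV: audio queries over video keys/values -> audio-length output.
    VA: video queries over audio keys/values -> video-length output."""
    AV = cross_attention(h_a, h_v, h_v)
    VA = cross_attention(h_v, h_a, h_a)
    return AV, VA
```

Because the output takes the query's time length, AV lives on the audio time axis and VA on the video time axis, so no resampling of the encoder outputs is needed at this stage.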

Bi-Modal Self-Attention Decoder
We use an architecture similar to the seq2seq transformer (TM-seq2seq) model of Afouras et al. [20], with six decoder blocks. Each block has one multi-head self-attention and two multi-head encoder-decoder attentions. Each encoder-decoder attention uses the previous decoder outputs after the self-attention as queries and the DCM attention outputs as keys and values. Then, the two encoder-decoder attention outputs (AV_c, VA_c), as shown in Figure 4, are concatenated channel-wise and fed to fusion layers to calculate the attentional vector s̃_i at decoder time step i, which is expressed as

s̃_i = FusionLayer([AV_c; VA_c]_i),

where FusionLayer denotes modality fusion layers composed of two fully connected layers, rectified linear units, and dropout, as shown in Figure 4. Finally, like (8), the predictive character distribution is obtained from the attentional vector s̃_i. Detailed hyper-parameters of the multi-head attentions and the feed-forward layers are identical to those in the encoder.

Training and Decoding with a Hybrid CTC/Attention Architecture
Similar to [19], a CTC loss is combined with the objective for our attention-based model to force the monotonic alignment required for speech recognition during training. The resulting loss function is a weighted sum of the CTC and attention objectives, computed as

L = −α log p_CTC(y | x) − (1 − α) log p_att(y | x),

where y = (y_1, · · · , y_I), x, and α denote a decoded output character sequence, the input feature sequences from both modalities, and a relative weight for the loss function, respectively. A decoded output character sequence may include an extra end-of-sentence label. In the training phase, a ground-truth character sequence is used as the target label, and the detailed method is shown in Algorithm 1.
To calculate the CTC loss, we need to fuse the audio and video encoder outputs as shown in Figure 5. The video encoder output is therefore upsampled to a sequence with the same length as the audio encoder output. After that, the two encoder outputs are concatenated channel-wise and fed to a feed-forward layer. In the decoding phase, we use a joint CTC/attention approach: we calculate a joint score based on the CTC and attention-decoder probabilities for decoded output character sequences. A hypothesis character output sequence ŷ is computed as

ŷ = argmax_y {λ log p_CTC(y | x) + (1 − λ) log p_att(y | x)},

where λ is a relative weight in the decoding phase. α and λ are set to 0.2 and 0.1, respectively, the same values as in [19].
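The joint rescoring rule can be sketched as follows. The hypothesis tuples here are hypothetical placeholders; in practice the two log-probabilities come from the CTC block and the attention decoder during beam search.

```python
def joint_score(log_p_ctc, log_p_att, lam=0.1):
    """Joint decoding score: lambda*log p_CTC + (1-lambda)*log p_att,
    with lambda = 0.1 as in the paper."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

def pick_hypothesis(hyps, lam=0.1):
    """hyps: list of (text, log_p_ctc, log_p_att) tuples.
    Returns the text of the hypothesis with the highest joint score."""
    return max(hyps, key=lambda h: joint_score(h[1], h[2], lam))[0]
```

With λ = 0.1, the attention decoder dominates the score while the CTC term penalizes hypotheses whose monotonic alignment is implausible.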

Experimental Results and Discussions
In this section, we describe our experimental setup and training strategies. Our proposed model was evaluated and compared with others.

Datasets
We used LRS datasets, the largest existing public AVSR datasets [20]. Unfortunately, due to the license restriction of MV-LRS [40], we used only LRS2-BBC and LRS3-TED datasets for training, validation, and testing. To improve robustness in noisy environments, we simulated noisy reverberant signals with the signal-to-noise ratios (SNRs) of −5 to 20 dB and the reverberation time (RT 60 ) of 0.4 s by adding babble noise samples acquired at cafeterias and restaurants and imposing reverberation generated by the image method [41] to clean speech signals from the datasets. The added noise samples were different for training and testing.
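The noise-mixing step at a target SNR can be sketched as follows; this is a simple power-based scaling, and the reverberation generated by the image method is omitted.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db,
    then add it to the clean signal."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Sweeping snr_db from −5 to 20 dB with different noise recordings for training and testing reproduces the kind of mismatch-free/matched contamination described above.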

Evaluation Measure
The performance of the models was evaluated by the word error rate (WER), defined as

WER = (S + D + I) / N × 100 (%),

where S, D, and I are the counts of substitutions, deletions, and insertions between the reference and hypothesis word sequences, respectively, and N denotes the number of words in the reference.
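The WER can be computed with a standard word-level Levenshtein distance; a compact sketch:

```python
def wer(ref, hyp):
    """WER = (S + D + I) / N * 100, via edit distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For example, one substitution in a three-word reference gives a WER of 33.3%.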

Training Strategies
We trained in an order of clean short sentences, clean sentences, and clean/noisy reverberant sentences. The detailed data in this order are as follows:

1. Clean short sentences with three or four words in the pre-train set.
2. Clean sentences in the pre-train and train-val sets.
3. Clean and noisy reverberant sentences (as described in Section 4.1) in the train-val set.
4. Clean and noisy reverberant sentences in the train-val set of either the LRS2-BBC or LRS3-TED dataset for fine-tuning on that dataset.
Our implementation was based on the PyTorch library [42] and the fairseq toolkit [43]. We used the Adadelta optimizer [44] with default parameters. The learning rate started at 10^−1 and was halved down to 10^−5 depending on validation-error plateaus. Dropout was performed with p = 0.15. The implementation code of our proposed model is available at https://github.com/LeeYongHyeok/DCM_vgg_transformer. Figure 6 shows the attention weight maps between the audio and video encoders using audio or video features as queries in our proposed model (TM-DCM) for clean and noisy reverberant data. The weights were computed by averaging over all the cross-modality attention heads. The weight map of the audio-query cross-modality attention for noisy reverberant data was noisier than that for clean data, because noisy reverberant data could not provide cues as clear as those of clean data, which might result in performance degradation by speech contamination. However, the video-query cross-modality attention produced very clean weight maps even with noisy reverberant audio data as keys and values.

Attention Visualization
In Figure 7, we display the encoder-decoder attention weight maps, using the previous decoder outputs after the self-attention as queries and the DCM attention outputs as keys and values, for the cross-entropy and hybrid CTC/attention losses. The hybrid architecture made the maps, especially those between the audio encoders and decoders, cleaner by forcing monotonic alignments even with noisy reverberant data. Figure 6. DCM attention weight maps between the audio and video encoders using audio or video features as queries in the TM-DCM for clean and noisy reverberant data. The clean utterance used was "PSlBlZ3hqKc/00011.mp4" in the test set of the LRS3-TED dataset, whose character label was "that is the real world and unless we find a way to globalize democracy or". The input SNR for the noisy reverberant data was 0 dB. The weights were computed by averaging over all the cross-modality attention heads. Figure 7. Encoder-decoder attention weight maps using the DCM attention outputs as keys and values in the TM-DCM for the cross-entropy and hybrid CTC/attention losses. The weight maps are displayed for both clean and noisy reverberant data; the clean and noisy reverberant data used were the same as in Figure 6. The weights were computed by averaging over all the encoder-decoder attention heads at all decoder layers. The weight maps for (a-d) the cross-entropy loss and (e-h) the hybrid CTC/attention loss; the weight maps between (a,e) audio encoders and decoders using clean audio input; (b,f) audio encoders and decoders using noisy reverberant audio input; (c,g) video encoders and decoders using clean audio input; and (d,h) video encoders and decoders using noisy reverberant audio input. Table 1 summarizes the word error rates (WERs) for our proposed model (TM-DCM), the TM-seq2seq, and the AV align implemented on the transformer model (TM-av_align). The TM-seq2seq model was implemented using modality-independent encoder-decoder attention as described in [20].
For a fair comparison, the TM-av_align model was implemented by applying the cross-modal alignment structure in [13] to the TM-seq2seq model used as the common baseline, performing modality fusion with attention using audio as a query on the encoder side, while our proposed model added the DCM to the baseline model. In the modality column, 'A', 'V', and 'AV' denote that audio-only, video-only, and audio-visual modalities were used, respectively. In the objective column, 'CE' and 'H' denote the objective functions based on the cross-entropy only and the hybrid CTC/attention loss, respectively. In addition, the numbers of parameters of the models are presented; our model, which requires two additional attentions for DCM, has 2.5% more parameters than the TM-seq2seq. Table 2 describes the architectural differences in cross-modality attentions between the three models, since the cross-modality attention architectures are the main difference under audio-visual modalities. Table 1. Word error rates (WERs) (%) for the TM-seq2seq, TM-av_align, and TM-DCM on the LRS2-BBC and LRS3-TED datasets. The boldface WERs denote the best performance in each condition. Abbreviations: A, audio-only modality; V, video-only modality; AV, audio-visual modalities; CE, cross-entropy loss only; H, hybrid CTC/attention loss. Because the video-only modality case (V) does not use the audio modality, the WER is constant for each dataset on the TM-seq2seq using the video-only modality in the first two rows. Table 2. Architectural differences in cross-modality attentions between the TM-seq2seq, TM-av_align, and TM-DCM.

Cross-modality attention (query-key/value): none for the TM-seq2seq; audio-video for the TM-av_align; audio-video and video-audio for the TM-DCM.

Regardless of the method used, the WERs increased as the SNR decreased. The relatively large differences in WERs between clean and 20-dB noisy reverberant data were probably caused by the reverberation added to simulate realistic situations. When using the cross-entropy only as a loss function, our model consistently provided better recognition performance than the TM-seq2seq and the TM-av_align regardless of the input SNR. In particular, the WER of our model averaged over the two datasets achieved a relative improvement of about 16.9% for clean data compared to that of the TM-seq2seq (much larger than the parameter growth rate), whereas the TM-av_align showed comparable or slightly worse performance than the TM-seq2seq. These results arose because our model could effectively fuse the modality information by the video-query cross-modality attention in addition to the audio-query cross-modality attention similar to the AV align model. As shown in Figure 6, the clean weight maps of the video-query cross-modality attention in our model were helpful for recognition of noisy reverberant audio data as well as clean data.
Adding the CTC loss in the hybrid CTC/attention architecture further improved the recognition performance, which demonstrated that monotonic alignments are very useful for speech recognition. These results were consistent with the cleaner encoder-decoder attention weight maps. Above all, our model using the DCM attention and the hybrid CTC/attention loss achieved a WER of 8.7% averaged over the two datasets for clean data, with consistently better performance than the others under all the experimented conditions. Table 3 shows some decoding results for 0-dB-SNR noisy reverberant data. Using the video-only modality provided inferior performance to the others due to the inherent ambiguity of visual speech recognition. Using the audio-only modality with speech contamination, it was difficult to recognize similar pronunciations such as "to"-"two", "that"-"bad", and "of the"-"off a". The methods using both modalities mitigated the disadvantages of the uni-modalities by fusing audio and video information.

Decoding Examples
In particular, our model that achieved superior performance to the others predicted correct words successfully except a very unusual word "antiquarans". Table 3. Some decoding results for 0-dB-SNR noisy reverberant data. The boldface words denote wrong prediction. Abbreviation to understand the table: A, audio-only modality; V, video-only modality; AV, audio-visual modalities; CE, cross-entropy loss only; H, hybrid CTC/attention loss.

Ground truth: and it's even rarer to find one that hasn't been dug into by antiquarans
TM-seq2seq (V, CE): and it's even rarer to find one that hasn't bin diagnosed by asking quarries
TM-seq2seq (A, CE): and it's equal rare two find one that hasn't been dug into by anti crayons
TM-seq2seq (AV, CE): and it's even rarer to find one that hasn't been dug into by antique areas
TM-av_align (AV, CE): and it's even rarer to find one that hasn't been dug into by antiquarists
TM-DCM (AV, CE): and it's even rarer to find one that hasn't been dug into by antiquate risks
TM-DCM (AV, H): and it's even rarer to find one that hasn't been dug into by antiquarans

Decoding on Sentences of Various Lengths
In Figure 8, we summarize the WERs according to the number of words in a sentence for clean and noisy reverberant data. For both kinds of data, our models achieved better recognition performance than the others in most cases. All the experimented models performed poorly on short sentences of three or four words because appropriate contexts could not be extracted from these sentences. Figure 9 displays the WERs on out-of-sync data for clean and noisy reverberant audio. Since audio and video were synchronized in the datasets, we synthetically shifted the video frames to obtain out-of-sync data as in [20]. Although the transformer model with the CTC loss only showed worse performance than that with the cross-entropy loss in [20], our model with the hybrid CTC/attention loss provided WERs comparable to those with the cross-entropy loss. Even with the CTC loss, our model retained robustness against out-of-sync data because it is based on independent encoder-decoder and cross-modality attention mechanisms. These results demonstrate that our model can use the hybrid CTC/attention loss to force the monotonic alignments required for AVSR without concern about relative performance degradation on out-of-sync data. Figure 9. WERs on out-of-sync data for the models using audio-visual modalities on the test sets of the LRS2-BBC and LRS3-TED datasets with (a) clean and (b) noisy reverberant audio data. The video frames were shifted by the numbers of frames on the horizontal axes with the audio data fixed. A positive offset means that the audio preceded the video, and a negative offset means the opposite.

Comparison with Simple Concatenation of Audio and Video Encoder Outputs
In order to show the effectiveness of DCM attention, Table 4 compares the WERs of our model with those for decoding on simple concatenation of audio and video encoder outputs. For all the experimented SNRs, our model outperformed the model using simple concatenation of audio and video information, which demonstrated the effectiveness of DCM attention.  Figure 10 shows the WERs averaged over clean/noisy reverberant test data with all the experimented SNRs for various model sizes. For each model, we stacked encoders and decoders in various layers. Since the numbers of trainable model parameters and multiply-accumulate operations (MACs) are different for each model, Figure 10a,b display the WERs with the numbers of model parameters and MACs on the horizontal axis, respectively. The experimental results showed that our model yielded better recognition performance than the other compared models especially with smaller models, which indicated that our DCM structure and hybrid CTC/attention loss were efficient for fusion of audio and video.  Figure 10. WERs for various model sizes with the numbers of (a) trainable model parameters and (b) multiply-accumulate operations (MACs). The data used to measure the MACs were "PSlBlZ3hqKc/00011.mp4" in the test set of the LRS3-TED dataset.

Conclusions
In this paper, we proposed an AVSR model based on the transformer with DCM attention and a hybrid CTC/attention architecture. We constructed the DCM attention to obtain proper alignment information between the audio and visual modalities even with noisy reverberant audio data, and applied the hybrid CTC/attention structure to enforce monotonic alignments. In general, our model provided better recognition performance than the compared transformer-based models, even for out-of-sync data, and the hybrid CTC/attention loss further improved the performance. In the future, we will focus on more efficient strategies for fusing audio and video information and extend our work to AVSR combined with a speech enhancement model.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

LSTM: long short-term memory
seq2seq: sequence-to-sequence
AVSR: audio-visual speech recognition
DCM: dual cross-modality
CTC: connectionist temporal classification
ASR: automatic speech recognition
SNR: signal-to-noise ratio
CNN: convolutional neural network
AV align: cross-modality attention that computes the video context using an audio query
VA align: cross-modality attention that computes the audio context using a video query
DNN: deep neural network
DCT: discrete cosine transform
MFCC: mel-frequency cepstral coefficient
MLP: multi-layer perceptron
sos: start of a sentence
eos: end of a sentence
RT60: reverberation time
WER: word error rate
TM: transformer model