Efﬁcient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks

Abstract: Lipreading aims to recognize sentences being spoken by a talking face. In recent years, lipreading methods have achieved a high level of accuracy on large datasets and made breakthrough progress. However, lipreading is still far from being solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing training gradients and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model, using an encoder based on a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. It can partly eliminate the defects of RNNs (LSTM, GRU), namely gradient vanishing and insufficient performance, and this yields a notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than the state-of-the-art method, and accuracy is improved by 2.4% on the GRID dataset.


Introduction
Lipreading, also known as visual speech recognition, refers to decoding the content of spoken text based on the visual information of the speaker's lip movement. It has a wide range of application value in speech recognition [1], public safety [2], intelligent human-computer interaction [3], visual synthesis, etc.
Traditionally, lipreading methods followed two stages. First, features were extracted from the mouth region; the Discrete Cosine Transform [4,5] was considered the most popular feature extractor, and the features were then fed to a Hidden Markov Model (HMM) [1,6,7]. Several similar methods were proposed around the same time: the difference is that the feature extractor was replaced with a deep autoencoder, and the HMMs were replaced with Long Short-Term Memory (LSTM) networks [8,9].
Deep learning methods have achieved great success [10,11] in many complex tasks built on traditional machine learning [12][13][14]. Convolutional Neural Networks (CNNs) show superior performance in image and video feature extraction compared to traditional methods. For example, Stafylakis et al. [15] present a deep learning architecture for lipreading and audiovisual word recognition. Petridis et al. [16] proposed an end-to-end visual speech recognition system based on fully connected layers and LSTM networks. At this stage, there are two families of deep-learning-based lipreading architectures: speech recognition technology [18,19] based on Connectionist Temporal Classification (CTC) [17], and the attention-based sequence-to-sequence (seq2seq) [20] neural translation model.
Speech recognition technology [18,21] based on the CTC approach has made a great breakthrough. For the lipreading problem, Refs. [22][23][24] used a Convolutional Neural Network (CNN) [25,26] as the feature extractor, a Recurrent Neural Network (RNN) [27,28] as the feature learner, and CTC [17] as the objective function, training an end-to-end sentence-level lipreading architecture. This architecture outperforms experienced human lipreaders on the GRID dataset [29]. However, these architectures have two main problems. On the one hand, a simple feature extractor is not adequate for extracting features from video data. On the other hand, the use of RNNs brings the defects of vanishing or exploding gradients.
The attention-based sequence-to-sequence model was first used in neural machine translation [20] to solve the problem that the input sequence and output sequence are not aligned in time. For the lipreading problem, Refs. [30,31] use the attention-based seq2seq model to build the WAS (Watch, Attend and Spell) architecture. Its outstanding performance on the LRW [32] and GRID [29] datasets includes a Word Error Rate (WER) of 23.8% on the LRW [32] dataset. However, recent work [33,34] shows that the attention-based sequence-to-sequence model cannot correctly align the output sequence for longer input sequences, so it is hard to converge during training.
However, recent results [35] show that convolutional architectures perform better than recurrent networks on audio synthesis and machine translation tasks. Ref. [36] proposed a general TCN model and conducted a series of evaluation experiments across sequence tasks. The results show that the TCN performs better than canonical recurrent neural networks (e.g., BLSTM, BGRU) on a broad range of sequence modeling tasks.
In this work, we propose a state-of-the-art model that improves both the accuracy of lipreading and the speed of convergence. First, we discard the basic BGRU or BLSTM layers and replace them with a TCN [36]. Secondly, we propose a more efficient feature extractor based on a 3D convolutional network and ResNet50 [37]; its efficiency is greatly improved compared with standard feature extractors. Finally, we use the CTC [17] objective function as the decoder to implement an end-to-end sentence-level lipreading architecture. It should be emphasized that the TCN architecture has the core effect on the improvement of lipreading performance. Experiments show that training and convergence are 50% faster than the state-of-the-art method, and accuracy is improved by 2.4% on the GRID dataset. Figure 1 shows the general architecture of lipreading.

Related Works
Traditionally, for lipreading, Luettin et al. [38] first applied the ASM model to lipreading, using a set of feature points to describe the inner or outer lip contour. This model has the disadvantage of manually labeling the training data. The quality of its feature extraction depends on the accuracy of the labeling, which requires more effort. The choice of the traditional lipreading system classifier depends on the task requirements. For large-scale continuous sentence recognition tasks, traditional methods generally use a decoding model based on GMM-HMM [8,39].
Deep learning practitioners commonly regard recurrent architectures as the default starting point for sequence modeling tasks. A well-regarded recent online course on "Sequence Models" focuses exclusively on recurrent architectures [41]. Recent studies [42][43][44] have shown that the convolutional architecture can achieve state-of-the-art performance accuracy in audio synthesis, word-level language modeling, and machine translation. This raises the question of whether these successes of convolutional sequence modeling are confined to specific application domains or whether a broader reconsideration of the association between sequence processing and recurrent networks is in order.
Recent studies [36] have shown that TCNs exhibit superior performance compared to RNNs in most sequence modeling tasks and overcome the shortcomings of RNNs while demonstrating longer effective memory. The TCN avoids the defects of slow convergence, gradient explosion or vanishing, and local overfitting in RNNs. For lipreading, we should therefore reconsider the common association between lipreading architectures and recurrent networks, and regard the TCN as the natural starting point for lipreading tasks.

Proposed Architecture
In this section, we introduce the proposed lipreading architecture. We should emphasize that the proposed architecture is an end-to-end sentence-level lipreading model. Figure 2 shows the implementation diagram of the architecture. Table 1 shows the implementation details of the architecture.

Figure 2. The proposed architecture. A lipreading video containing N frames as input is followed by one layer of 3D-CNN and a 3D average pooling layer. The resulting 3D feature maps are passed through a residual network (ResNet50, [37]). The classification and fusion of the feature maps are handled by a 2-layer TCN [36] network, and the TCN output at each time step is processed by a linear layer and the softmax activation function. This end-to-end sentence-level lipreading architecture is trained using the CTC objective function.

Table 1. Layers of the proposed architecture, with output size and kernel/stride/padding per layer. N is the number of frames in the video; L is the number of labels.

3D Convolutional Network
Convolutional Neural Networks (CNNs) are commonly used to perform convolution operations on images, receiving image data as input [25], to improve the performance of computer vision tasks. A basic 2D convolutional layer mainly maps the image channels from C to C'. Figure 3a illustrates the process of 2D convolution. The kernel of a 3D convolution can be understood as a three-dimensional cube, as shown in Figure 3b. Our 3D convolutional network is composed of 32 convolution kernels of size 3 × 3 × 3, followed by Batch Normalization (BN, [45]) and a rectified linear activation function (ReLU, [46]). Finally, the extracted feature map passes through a 3D average pooling layer, which reduces the sampling rate and improves robustness. The parameter weight of the 3D convolutional neural network is ∼16 K.
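To make the difference from 2D convolution concrete, the following minimal sketch (not the paper's TensorFlow code) implements a naive single-channel 3D convolution, where the kernel also slides along the time axis; `conv3d_valid` is an illustrative name:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (valid padding, stride 1).

    volume: (T, H, W) input block; kernel: (kt, kh, kw) filter.
    Returns a (T-kt+1, H-kh+1, W-kw+1) feature map.
    """
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

frames = np.random.rand(8, 6, 6)                        # toy clip: 8 frames of 6x6 pixels
feat = conv3d_valid(frames, np.ones((3, 3, 3)) / 27.0)  # 3x3x3 averaging kernel
print(feat.shape)  # (6, 4, 4): the kernel mixes adjacent frames in time as well as space
```

Because the kernel spans three frames, each output value fuses short-term motion information, which is exactly why 3D convolution is preferred over 2D convolution for video input.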

ResNet
At each time step, the 3D feature map is passed through a residual network (ResNet, [37]). Based on the design needs of the lipreading architecture, we use the 50-layer ResNet version, which was proposed for ImageNet [47]. Its main innovation is residual learning, which allows deeper convolutional networks to be trained; the key mechanism enabling residual learning is the shortcut connection. We emphasize that we did not use weights pretrained on ImageNet [47], as they were optimized for a completely different task and evaluated under different protocols. We adopt standard random initialization, with the weights drawn from a Gaussian distribution.
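The shortcut connection can be sketched in a few lines; this is an illustrative toy (function names `gaussian_init` and `residual_block` are ours, not from the paper), showing that with Gaussian-initialized weights the block initially behaves close to the identity, which is what makes very deep networks trainable:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_init(shape, std=0.01):
    """Standard random initialization: weights drawn from a Gaussian."""
    return rng.normal(0.0, std, size=shape)

def residual_block(x, w1, w2):
    """y = F(x) + x: the identity shortcut lets gradients bypass the weight layers."""
    h = np.maximum(0.0, x @ w1)   # linear layer + ReLU
    return x + h @ w2             # add the shortcut connection

x = np.ones((1, 8))
w1, w2 = gaussian_init((8, 8)), gaussian_init((8, 8))
y = residual_block(x, w1, w2)
print(y.shape)  # (1, 8); with small initial weights, y stays close to x
```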

Temporal Convolutional Network (TCN)
A recent study [36] has shown that a simple convolutional architecture is superior to canonical recurrent networks, such as LSTM and GRU, on various tasks and datasets while exhibiting longer effective memory. Compared to the LSTM-based language modeling architecture [28], TCNs have longer memory and can efficiently handle longer inputs.
We should emphasize that lipreading is a sequence modeling task. Our goal is to replace BLSTM or BGRU with a TCN to address vanishing training gradients and slow convergence. Figure 4 shows the basic architecture of the TCN. For the sequence modeling task, given an input image sequence x_0, x_1, . . . , x_T, we wish to predict the corresponding outputs y_0, y_1, . . . , y_T at each time step. A simple causal convolution can only cover a history whose size grows linearly with the depth of the network, so it is very challenging to build a sequence model from plain causal convolutions alone. Our solution is a dilated convolutional network that enlarges the receptive field. We apply dilated convolutions to the lipreading task, and Formula (2) briefly describes the operation:

F(s) = (x ∗_d f)(s) = ∑_{i=0}^{k−1} f(i) · x_{s − d·i},    (2)

where x ∈ R^n is a 1D sequence input, f : {0, . . . , k − 1} → R is a filter, d is the dilation factor, k is the filter size, and s − d · i accounts for the direction of the past. More radically, we replace ordinary convolutional layers with a residual block [37]. It consists of a convolutional layer with 512 kernels of size 5 and stride 3, followed by Batch Normalization (BN, [45]), Rectified Linear Units (ReLU, [46]), and Dropout [48].
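Formula (2) can be verified with a short sketch (illustrative only; `dilated_causal_conv` is our name, not the paper's code). Increasing the dilation factor d widens the receptive field without adding parameters, which is how TCNs obtain long effective memory:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """1D dilated causal convolution: F(s) = sum_i f(i) * x[s - d*i].

    Positions before the start of the sequence are treated as zero,
    so the output at time s depends only on inputs at or before s.
    """
    k = len(f)
    y = np.zeros(len(x), dtype=float)
    for s in range(len(x)):
        for i in range(k):
            idx = s - d * i
            if idx >= 0:
                y[s] += f[i] * x[idx]
    return y

x = np.arange(8, dtype=float)             # toy sequence 0..7
y1 = dilated_causal_conv(x, [1, 1], d=1)  # receptive field of 2 steps
y4 = dilated_causal_conv(x, [1, 1], d=4)  # same kernel size, receptive field of 5 steps
print(y1)  # each step sums x[s] + x[s-1]
print(y4)  # each step sums x[s] + x[s-4]
```

Stacking such layers with dilations 1, 2, 4, ... makes the receptive field grow exponentially with depth, while every output remains causal.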

Connectionist Temporal Classification (CTC)
The CTC objective function [17] was originally widely used in speech recognition. In view of the similarity between speech recognition and lipreading, CTC was introduced in lipreading.
The core step of CTC is to convert the output of each time step of the sequence model into a probability distribution over the label sequences. The softmax activation function of the CTC network converts the output into probabilities. The softmax layer has one more unit than the number of labels L; therefore, the output of the softmax layer at each time step denotes a probability distribution over the extended label set.
Suppose a given input sequence x of length T is processed by a Bi-LSTM recurrent layer with m inputs, n outputs, and weights w. This defines a continuous mapping N_w : (R^m)^T → (R^n)^T. Then y = N_w(x) is the output of the sequence model (e.g., Bi-LSTM), and y_k^t denotes the probability of output k at time step t. We define an extended alphabet L' = L ∪ {blank}. For each path, we obtain its probability by Formula (3):

p(π | x) = ∏_{t=1}^{T} y_{π_t}^t, ∀π ∈ L'^T.    (3)

Formula (3) is the product of the per-time-step probabilities along a path π. Many such paths map from the output to the same label sequence. We therefore define a mapping from paths to labels, β : L'^T → L^{≤T}, where L^{≤T} is the set of possible label sequences of length at most T. The mapping deletes blanks and duplicate labels in a path (e.g., β(c − cd−) = β(−cc − −ccdd−) = ccd). For any given label l ∈ L^{≤T}, the inverse image β^{−1}(l) contains all of its paths π, and we sum their probabilities by Formula (4):

p(l | x) = ∑_{π ∈ β^{−1}(l)} p(π | x).    (4)

Given Formula (4), the output of CTC is simply the most probable labeling of the input sequence, as shown in Formula (5):

h(x) = arg max_{l ∈ L^{≤T}} p(l | x).    (5)

Finally, we use the CTC network to minimize Formula (6) as the training goal, constantly updating the weight parameters of the entire model:

L_CTC = − ln p(l | x).    (6)

In terms of lipreading, our dataset consists of video data and the corresponding text. Unfortunately, it is difficult to align video and text at the unit level. If we trained the model without alignment, it would be difficult for it to converge because of differences in speaking speed and in the spacing between characters. From the description above, CTC avoids manual alignment of input and output and is thus very suitable for lipreading and speech recognition applications. Therefore, CTC is a sensible choice for the lipreading task.
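The mapping β and the path sum of Formula (4) can be brute-forced for a toy alphabet; this is a minimal didactic sketch (the names `ctc_collapse` and `label_probability` are ours, and real CTC uses the efficient forward-backward algorithm instead of path enumeration):

```python
from itertools import product

def ctc_collapse(path, blank='-'):
    """The CTC mapping beta: merge repeated labels, then drop blanks.

    e.g. beta('-cc--ccdd-') = 'ccd', matching the example in the text.
    """
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

def label_probability(y, target, alphabet, blank='-'):
    """Brute-force Formula (4): sum p(pi|x) over all paths with beta(pi) = target.

    y: list of per-time-step probability dicts over alphabet + blank.
    Only tractable for tiny T; CTC's forward-backward does this efficiently.
    """
    total = 0.0
    for path in product(alphabet + [blank], repeat=len(y)):
        if ctc_collapse(''.join(path), blank) == target:
            p = 1.0
            for t, ch in enumerate(path):
                p *= y[t][ch]
            total += p
    return total

print(ctc_collapse('c-cd-'))       # 'ccd'
print(ctc_collapse('-cc--ccdd-'))  # 'ccd'

y = [{'a': 0.6, '-': 0.4}, {'a': 0.6, '-': 0.4}]  # toy per-step softmax outputs
print(label_probability(y, 'a', ['a']))           # paths 'aa', 'a-', '-a' -> 0.84
```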

Database
This section describes the relevant dataset and evaluation protocol and performs the evaluation on the dataset according to that protocol.

GRID Dataset
For this study, we use the GRID dataset [29]. There are a total of 33,000 sentence sample videos, covering 33 speakers. Each sentence consists of a six-word sequence of the form indicated in Table 2. Of the six components, three (color, letter, and digit) are designated as "keywords". Each sample video is fixed at 75 frames. The videos are recorded in a controlled lab environment, as shown in Figure 5.


Evaluation Protocol
We refer to the standard protocols in [22,49] to define an evaluation protocol. The Word Error Rate (WER) is one way to measure lipreading performance. It compares a reference to a hypothesis and is defined as

WER = (S + D + I) / N,

where S, D, I, and N represent the number of substitutions, deletions, insertions, and words in the reference, respectively. The Character Error Rate (CER) is another measure of lipreading performance. It is defined like the WER, except that words are replaced with characters.
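The WER formula above can be computed via the standard word-level Levenshtein distance, since the minimum-cost alignment yields S + D + I directly; this is a self-contained sketch (the function name `wer` is ours):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)         # substitution / match
    return d[-1][-1] / len(ref)

print(wer('put blue at a one now', 'put blue at b one now'))  # one substitution in six words
```

Replacing `split()` with `list(...)` over the characters of each string turns the same routine into the CER.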

Experiment
This section conducts experiments on the proposed architecture on a public benchmark dataset, summarizes the corresponding performance data, and compares it with other state-of-the-art methods.

Data Alignment
The videos were processed with the DLib face detector, and the iBug face landmark predictor [50] with 68 landmarks coupled with an online Kalman Filter. Using these landmarks, we apply an affine transformation to extract a mouth-centered crop of size 112 × 112 × 3 pixels per frame. Therefore, each sample takes 75 × 112 × 112 × 3 data as the model input, where 75 is the number of frames of the video sample.
We should emphasize that the original data cannot be used directly as model input. Beforehand, the data samples should be normalized to make the model more robust. In this experiment, we use Z-score normalization, implemented as follows:

x* = (x − μ) / σ,

where μ and σ are the mean and standard deviation of the sample, and x* is the final normalized result. Normalization is very critical for the model and accelerates its convergence.
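The Z-score step above amounts to one line of array code; a minimal sketch on a randomly generated sample of the paper's input shape (75 × 112 × 112 × 3), with `z_score` as an illustrative name:

```python
import numpy as np

def z_score(x):
    """Z-score normalization: x* = (x - mean) / std."""
    return (x - x.mean()) / x.std()

frames = np.random.rand(75, 112, 112, 3) * 255.0  # one raw video sample, pixel values 0..255
norm = z_score(frames)
print(round(norm.mean(), 6), round(norm.std(), 6))  # ~0.0 and ~1.0 after normalization
```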

Implementation Details
We use the proposed architecture for Tensorflow [51] training and testing. Table 1 summarizes the detailed parameters of the proposed architecture at each layer. The adopted back-propagation optimization algorithm is the ADAM optimizer [52], the initial learning rate is 0.0001, and the batch size is 8. Connectionist Temporal Classification (CTC) is used as the objective function. We trained the proposed architecture for 100 epochs on the public GRID dataset and reached a stable convergence point.
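For reference, a single ADAM update step with the paper's initial learning rate of 0.0001 can be written out in NumPy; this is a didactic sketch of the optimizer's update rule (the function `adam_step` is ours, not the TensorFlow implementation used in the experiments):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: adaptive step from bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_step(np.array([1.0]), np.array([2.0]), np.zeros(1), np.zeros(1), t=1)
print(w)  # the first step moves w by about lr, regardless of the gradient's scale
```

The per-parameter scaling by √v̂ is what lets a single learning rate of 1e-4 work across layers with very different gradient magnitudes.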
There are 33,000 sentence sample videos from 33 speakers in the GRID dataset [29]. This experiment randomly uses 31 speakers with 31,000 samples as the training set and two speakers with 2000 samples as the evaluation set. We calculate the training loss, evaluation loss, and the training and evaluation Word Error Rate (WER) and Character Error Rate (CER) for every epoch.
For the GRID dataset, the proposed approach is compared with [22][23][24]30], which are referred to as 'LipNet', 'WLAS', 'LCANet', and '3D-2D-CNN-BLSTM', respectively. In addition, in order to reflect the impact of key modules on the architecture, we separate the various modules of the architecture for comparison experiments.

Convergence Speed
We should emphasize that replacing the recurrent neural network (e.g., LSTM, GRU) with a Temporal Convolutional Network (TCN) [36] serves adequate training, speeds up convergence, and prevents vanishing gradients. We trained each architecture for 45 epochs and calculated its training and evaluation loss for every epoch. To describe the differences more vividly, we plot these data as curves, as shown in Figures 7 and 8. From the figures, it can be seen that the losses of all four architectures decrease steadily until they approach a fixed value. Compared with the 3D-2D-CNN-BGRU-CTC and 3D-ResNet50-BGRU-CTC architectures, the 3D-2D-CNN-TCN-CTC and 3D-ResNet50-TCN-CTC architectures converge faster, exhibit no vanishing gradient, and achieve smaller losses. Compared with the 3D-2D-CNN-BGRU-CTC and 3D-2D-CNN-TCN-CTC architectures, the 3D-ResNet50-BGRU-CTC and 3D-ResNet50-TCN-CTC architectures use the more efficient feature extractor, so their final loss is relatively small and their accuracy is correspondingly improved.
The above experiments prove that the proposed architecture has back propagation paths in different sequence time directions, thereby avoiding gradient explosion/ disappearance in RNNs (such as LSTM, GRU). In addition, the use of an efficient feature extractor combining 3D and ResNet50 [37] convolutional networks has improved the performance of our architecture.

Results
Compared with the architectures of [22][23][24]30], the proposed architecture achieves state-of-the-art accuracy (Acc = 1 − WER). We present the detailed data in Table 3. In the table, 'NA' indicates that the method was not evaluated under that protocol, 'unseen' indicates that the training data are separated from the evaluation data by speaker, and 'seen' is the opposite. We also present performance statistics for the four architectures 3D-2D-CNN-BGRU-CTC, 3D-ResNet50-BGRU-CTC, 3D-2D-CNN-TCN-CTC, and 3D-ResNet50-TCN-CTC on the GRID dataset [29]. The experimental results are shown in Table 4. 3D-ResNet50-TCN-CTC is our proposed architecture, which achieves state-of-the-art accuracy under each evaluation protocol. The experimental results show that an efficient feature extractor combined with a high-performance TCN [36] as the feature learner has a clear practical effect in accelerating model convergence, improving accuracy, and reducing training memory requirements.

Conclusions
This paper proposed an efficient end-to-end sentence-level lipreading architecture, using an efficient feature extractor that combines 3D convolution and ResNet50 [37] and replacing the traditional recurrent neural network with a Temporal Convolutional Network (TCN) [36]. The end-to-end sentence-level lipreading architecture was trained using the CTC objective function [17]. The proposed architecture overcomes the difficulties of slow convergence, vanishing gradients, and poor performance. Experiments on the GRID dataset show that, compared with the state-of-the-art method, accuracy increases by 2.4% and convergence speed increases by 50%.
We divide our future work into three directions. First, the CTC objective function [17] used by the proposed architecture assumes conditionally independent outputs; our research will focus on a solution to this defect. Secondly, we can fully integrate audio and visual features to seek further breakthroughs in performance. Finally, given the shortcomings of lipreading on long text samples, designing an efficient decoder with long-term dependencies is a future research direction.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.