Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition

Abstract: Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to developments in deep learning. Most VSR research focuses only on frontal face images. However, in real scenes, a VSR system should correctly recognize spoken content not only from frontal faces but also from diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces captured at any angle. Firstly, view classification is carried out to estimate face angles. Based on the results, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the features. We also developed audio-visual speech recognition (AVSR) using the VSR in addition to conventional ASR. Audio results were obtained from ASR, and the audio and visual results were then combined in a decision fusion manner. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We confirmed that our approach achieved the best performance among conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than the ASR and VSR results.


Introduction
Recently, automatic speech recognition (ASR) has achieved high recognition performance by using deep learning (DL), and is used in various scenarios, such as voice input for mobile phones and car navigation systems. However, speech waveforms are degraded by audio noise in real environments, which reduces the accuracy of speech recognition. To overcome this issue, we need to develop ASR systems that are robust against audio noise. One approach applicable in noisy environments is audio-visual speech recognition (AVSR, also known as multi-modal speech recognition), which combines ASR frameworks with visual speech recognition (VSR, also known as lipreading). VSR uses lip images, which are not affected by audio noise, and estimates what a subject uttered only from a temporal sequence of lip images. VSR and AVSR have the potential to be applied in various practical applications, such as automatic conference minute generation and human interfaces on smartphones. Owing to state-of-the-art DL technology, VSR has recently achieved high performance. However, VSR still has several problems when applied in real-world scenes; for example, most VSR studies have only considered frontal faces, but VSR technology for non-frontal views is also essential for real applications. In other words, in real scenes, a speaker does not always face the camera of a VSR or AVSR system, such as one on a smartphone or tablet. We have thus been developing a multi-angle VSR architecture that enables VSR when not only frontal but also non-frontal lip images are observed.
There are two main approaches for multi-angle VSR. The first method is to build a VSR model using training lip images captured at several angles. The second approach is to convert non-frontal lip images to frontal ones and apply the conventional frontal VSR technique. In this paper, we focus on the first approach, and propose a feature integration-based multi-angle VSR system using DL, particularly 3D convolutional neural networks (CNNs), which are one kind of deep neural network (DNN). In most conventional multi-angle VSR studies, it is necessary to estimate the angle at which lip images are captured, in order to choose a suitable angle-specific VSR model. However, if the system fails to estimate the right angle, the recognition performance drastically decreases. We need to build a VSR technique that can be applied to real scenes where it is difficult to estimate the lip angle accurately.
Therefore, we employ a new multi-angle VSR method, in which all angle-specific VSR models are trained using images at different angles. Our multi-angle VSR method consists of three parts: a view classification part, a feature extraction part and a recognition part. Assume that we have a sequence of lip images to be recognized. Firstly, in the view classification part, we prepare a common 2D CNN that estimates the angle of the input image (see Section 3.1.1). The model is then applied to each image in the sequence, and the angle estimated for the majority of the images is chosen. Secondly, in the feature extraction part, we build 3D CNN models for possible combinations of angle-specific training data sets (see Section 3.1.2). Based on the angle obtained in the first part, we choose the best models and extract features from the models. Lastly, in the recognition part, we concatenate these features, followed by recognition by means of a fully connected (FC) neural network (see Section 3.1.3). In addition, we perform a decision fusion-based AVSR employing our proposed multi-angle VSR.
We conducted evaluation experiments using the open data set OuluVS2, in which subjects were captured simultaneously at five angles in addition to speech data. The experimental results show that our proposed method can improve VSR accuracy much more than conventional schemes on average, and achieve significant AVSR accuracy in noisy environments. In addition, we confirm that our proposed method is sufficiently robust against view classification errors, because, in the second part, we simultaneously employ several models built using multi-angle training data.
The rest of this paper is organized as follows. In Section 2, we briefly review related works on multi-angle VSR. Section 3 introduces our method. The experimental setup, results and discussion are described in Section 4. Finally, Section 5 concludes this paper.

Related Work
Recently, many researchers have proposed deep learning-based AVSR and VSR schemes. As mentioned, most conventional VSR research has focused on frontal face images, assuming that VSR systems are in front of speakers, since there are only a few data sets available with multi-angle faces. Here, we introduce several lipreading works focusing not only on frontal but also diagonal and profile images. To develop these schemes, we need a research corpus. One of the public multi-angle VSR data sets is OuluVS2 [22].
An early work on multi-angle lipreading is [1], where a system was trained using either frontal (0°) or profile (90°) faces. According to the experimental results, the frontal view showed a lower word error rate (WER) than the profile view. In [2], the authors built a multi-angle system investigating a frontal (0°) view, a left profile (90°) view and a right profile (−90°) view. They reported significantly better performance when using the frontal view than the others. Saitoh et al. proposed a novel sequence image representation method called concatenated frame image (CFI) [3]. Two types of data augmentation methods for CFI, and a framework of a CFI-based CNN, were tested. Bauman et al. indicated that human lipreaders tend to have higher performance when slightly angled faces are available, presumably because of the visibility of lip protrusion and rounding [4]. In [5], the active appearance model (AAM) was utilized for feature extraction at five angles, and lipreading was examined on a view-dependent system, as well as on a view-independent system using a regression method in a feature space. As a result, the view-dependent system achieved the best performance at 30° in all tests. Zimmermann et al. used principal component analysis (PCA)-based convolutional networks together with long short-term memory networks (LSTMs), one of the DL models, in addition to a conventional speech recognition model, hidden Markov models (HMMs) with Gaussian mixture models (GMMs) [6]. They aimed at combining multiple views by employing these techniques. They finally confirmed that the highest performance was obtained at 30°. Anina et al. stated that the highest accuracy was achieved at 60° in their experiments [22]. Kumar et al. showed that profile-view lipreading provides significantly lower WERs than frontal-view lipreading [7].
Another strategy is to transform images or to incorporate several views with DL technology. One work [8] converted faces viewed from various directions to frontal faces using AAMs. The experimental results showed that recognition accuracy was improved even when the face direction changed by about 30° relative to a frontal view. In [9], the authors proposed a scheme called "View2View" using an encoder-decoder model based on CNNs. The method transformed non-frontal mouth region images into frontal ones. Their results showed that the view-mapping system worked well for VSR and AVSR. Estellers et al. introduced a pose normalization technique and performed speech recognition from multiple views by generating virtual frontal views from non-frontal images [10]. In [11], Petridis et al. proposed an end-to-end multi-view lipreading system based on bidirectional LSTM networks. This model simultaneously extracted features directly from the pixels and performed visual speech classification from multi-angle views. The experimental results demonstrated that the combination of frontal and profile views improved accuracy over the frontal view. Zimmermann et al. also proposed another decision fusion-based lipreading model [12]; they extracted features through a PCA-based convolutional neural network, LSTM network and GMM-HMM scheme. The decision fusion succeeded by combining Viterbi paths. In [13], Sahrawat et al. extended a hybrid attention-based connectionist temporal classification system with view-temporal attention to perform multi-angle lipreading. Lee et al. trained an end-to-end CNN-LSTM model [14].
Many studies have been conducted focusing on AVSR. Here, we introduce a few state-of-the-art works. An AVSR system based on a recurrent neural network transducer architecture was built in [15]. The authors evaluated the system using the LRS3-TED data set, achieving high performance. In [16], the authors proposed a multimodal attention-based method for AVSR, which could automatically learn fused representations from both modalities based on their importance. They employed sequence-to-sequence architectures, and confirmed high recognition performance under both acoustically clean and noisy conditions. Another AVSR system using a transformer-based architecture was proposed in [17]. The experimental results show that, on the How2 data set, the system achieved a relative improvement in word error rate over sub-word prediction models. In [18], we proposed an AVSR method based on deep canonical correlation analysis (DCCA). DCCA generates projections from two modalities into one common space, so that the correlation of the projected vectors is maximized. We thus employed DCCA techniques with audio and visual modalities to enhance the robustness of ASR. As a result, we confirmed that DCCA features of each modality can be improved compared to the original features, and better ASR results in various noisy environments can be obtained.
Although we can find a lot of VSR and AVSR methods, there are only a few works combining ASR and multi-angle VSR to accomplish angle-invariant AVSR. One of them is [19], where the authors proposed an early fusion-based AVSR method using bidirectional LSTMs. Similar to their past work [11], the authors put lip images at various angles and corresponding audio signals into the bidirectional LSTM models.

Methodology
Our proposed multi-angle VSR method consists of three parts: a view classification part, a feature extraction part and a recognition part. Figure 1 depicts the architecture of our AVSR approach, including ASR and the VSR model. In this section, we describe each part of our multi-angle VSR scheme followed by ASR and AVSR frameworks.

Multi-Angle VSR
VSR accepts a temporal sequence of lip images to recognize what a subject utters according to the given images. Assuming real scenes, it is not guaranteed that a speaker is strictly facing a VSR system. One way to deal with this problem is to prepare several models, each of which corresponds to a certain angle, estimate at which angle face images are captured and apply a corresponding angle-specific model.

View Classification
In the view classification part, we first estimate at which angle face images were recorded among the following five candidates in this work: 0°, 30°, 45°, 60° and 90°. The estimation was carried out for each lip image in one sequence, using the 2D CNN model illustrated in Figure 2. The 2D CNN model employs a simple and common architecture; convolutional and pooling layers are repeatedly applied followed by FC layers, to obtain a classification result. After processing the above step for all the input images, we determine the angle which is the most often chosen.
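The sequence-level decision described above can be sketched as a simple majority vote over per-frame predictions. This is only an illustration: the function name `majority_angle`, the example predictions and the tie-break rule are assumptions, not details given in this work.

```python
from collections import Counter

# Candidate camera angles in degrees, as listed in the text.
ANGLES = (0, 30, 45, 60, 90)


def majority_angle(frame_predictions):
    """Return the angle predicted most often across the frames of one sequence.

    Ties are resolved in favour of the angle that appears first in the
    sequence (an assumption; the text does not specify a tie-break rule).
    """
    if not frame_predictions:
        raise ValueError("at least one frame-level prediction is required")
    counts = Counter(frame_predictions)
    # most_common() lists equal counts in order of first occurrence.
    return counts.most_common(1)[0][0]


# Hypothetical per-frame outputs of the 2D CNN for one 8-frame sequence.
example = [45, 30, 45, 45, 60, 45, 30, 45]
sequence_angle = majority_angle(example)
```

A vote of this kind means that a few misclassified frames do not change the angle assigned to the whole sequence.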

Feature Extraction
Before conducting feature extraction, we prepare 3D CNN pre-recognition models for all possible combinations of the above five angles, i.e., models each trained only using images obtained from a single angle, such as a model from frontal images and a model from 30° images, as well as models each built using data of several angles, such as a model trained using both 0° and 30° data and a model using all face images. The architecture of our 3D CNN-based VSR models is shown in Figure 3. The last layer has 20 outputs, each of which corresponds to one class in our recognition task. As a result, we build 31 models in this case ($\sum_{i=1}^{5} {}_{5}C_{i} = 5 + 10 + 10 + 5 + 1 = 31$), as shown in Table 1. Table 1 also indicates preliminary VSR results: recognition accuracy for validation data at a certain angle, using a certain model chosen among those 31 models. For example, if we adopt a 30° model for 60° data, the accuracy is 87.55%.
According to the angle obtained in the view classification part, we select the three most reliable models for the estimated angle, which are shown in bold in Table 1. For instance, we adopt (1) "0° + 30° + 45°", (2) "0° + 30° + 45° + 60°" and (3) "0° + 30° + 45° + 90°" models for 45° data. In other words, we determine suitable angle combination patterns of training data for the estimated angle. We then utilize those models as feature extractors; we remove the last layer, resulting in a new output layer generating a 48-dimensional feature vector, as indicated in Figure 3. Finally, we obtain three 48-dimensional vectors from this part.
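As a quick check of the model count, the non-empty subsets of the five angles can be enumerated directly. The sketch below only illustrates the counting; the helper names `angle_subsets` and `subsets_covering` are hypothetical and are not part of the authors' implementation.

```python
from itertools import combinations

# The five camera angles (degrees) used in this work.
ANGLES = (0, 30, 45, 60, 90)


def angle_subsets(angles):
    """List every non-empty subset of the given angles, smallest subsets first."""
    subsets = []
    for size in range(1, len(angles) + 1):
        subsets.extend(combinations(angles, size))
    return subsets


def subsets_covering(angle, subsets):
    """Keep only the training-angle subsets that include the given angle."""
    return [s for s in subsets if angle in s]


all_models = angle_subsets(ANGLES)          # one entry per pre-recognition model
models_with_45 = subsets_covering(45, all_models)
```

Here `len(all_models)` is 31, matching the count in the text, and 16 of those training sets include 45° data.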

This strategy has two advantages. First, as shown in Table 1, models trained using data of several angles have relatively higher performance than those trained using single-angle data. This result motivates us to choose such models for multi-angle data. Second, even if the view classification fails, our scheme is still expected to achieve high performance; for instance, in the case where a 30° sequence is misclassified as 45°, the above models (1)-(3) are used for feature extraction, and 30° data are also used in training all of them. There is another reason that encourages us to choose this framework. The model trained using all data, indicated in the bottom row in Table 1, achieved good performance. On the other hand, there exists a better model in all the angle cases. This suggests that using only the model trained with all data is not the best solution. Hence, for each angle, we prepare several models trained using multi-angle data and utilize them as feature extractors.

Recognition
In the recognition part, we first integrate the 48-dimensional features extracted from the three selected models by simply concatenating them. Thereafter, we conduct recognition using two FC layers (48 × 3 → 48 → 20). Here, we apply a 50% dropout between the FC layers.
In our ASR framework, we extract 13 mel-frequency cepstral coefficients (MFCCs) in addition to 13 ∆MFCCs and 13 ∆∆MFCCs from audio waveforms with a frame length of 25 msec and a frame shift of 10 msec [23][24][25][26]. MFCCs, together with ∆MFCCs and ∆∆MFCCs, which are their first and second derivatives, respectively, are the most commonly used features in the speech recognition field. As a result, we obtain a 39-dimensional acoustic vector.
In the acoustic modality, there are many frameworks and a lot of features, e.g., [27,28]. We should carefully choose an audio processing scheme based on performance and theoretical perspectives. For instance, mel-frequency spectrograms are commonly used for CNN-based speech recognition. In this study, we first conduct preliminary experiments to measure the accuracy when using mel-frequency spectrograms or MFCCs. The size of the spectrograms is 96 × 128. Because using MFCCs with CNNs achieves better performance, we choose this framework. Note that we need to investigate which acoustic processing methods and features are the most suitable for the other tasks.

Recognition
After computing MFCCs from consecutive frames, we apply a 2D CNN-based model for recognition, which is illustrated in Figure 4. Similar to the VSR model, we finally obtain an audio result including a probability for each class.

AVSR
Firstly, a sequence of lip images is fed to the VSR model, while the corresponding speech data are given to the ASR model. As mentioned in detail later, we adopt the corpus OuluVS2, in which the task is to estimate which sentence is spoken. Therefore, for each class, we obtain a probability from the ASR results and another one from VSR. These probabilities are integrated in a decision fusion manner. Let us denote the conditional probabilities of class $c$ from the ASR and VSR models by $P_A(x_A|c)$ and $P_V(x_V|c)$, respectively. Here, $x_A$ indicates an audio input representation, and $x_V$ means the corresponding image vector. We then obtain an audio-visual probability $P_{AV}(x_A, x_V|c)$ by combining these two probabilities with a fusion weight $\alpha$. In this work, we simply fix $\alpha = 0.5$.
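To make the decision fusion step concrete, the sketch below combines per-class audio and visual probabilities with a weight α. It assumes a weighted sum, α·P_A + (1 − α)·P_V, which is one common form of decision fusion; the exact rule used in this work may differ, and the function names and example probabilities are hypothetical.

```python
def fuse(p_audio, p_visual, alpha=0.5):
    """Combine per-class ASR and VSR probabilities with weight alpha.

    Assumption: a weighted sum alpha * P_A + (1 - alpha) * P_V is used here
    purely for illustration.
    """
    if len(p_audio) != len(p_visual):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * v for a, v in zip(p_audio, p_visual)]


def predict(scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical three-class example in which ASR and VSR disagree.
p_audio = [0.5, 0.4, 0.1]   # ASR alone would choose class 0
p_visual = [0.2, 0.7, 0.1]  # VSR alone would choose class 1
fused = fuse(p_audio, p_visual, alpha=0.5)
decision = predict(fused)
```

With α = 0.5 both modalities contribute equally, so in this example the confident visual result outweighs the weaker audio preference.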

Experiments
In order to examine the effectiveness of our VSR scheme as well as AVSR framework, we carry out recognition experiments.

OuluVS2
We choose the OuluVS2 corpus to evaluate our scheme. The database contains 10 short phrases, 10 digit sequences and 10 TIMIT sentences uttered by 52 speakers. The corpus includes face images captured by five cameras simultaneously at 0° (frontal), 30°, 45°, 60° and 90° (profile) angles. In this study, we adopt the phrase data and digit data, uttered three times by each speaker. In our experiment, the data spoken by 52 speakers are divided into training data by 35 speakers (speaker IDs 1-36), validation data by 5 speakers (speaker IDs 37-41) and testing data by 12 speakers (speaker IDs 42-53). Note that speaker ID 29 is missing. We use the same data split as previous works, such as [3,6,14], for a fair comparison. We also check whether the data split is appropriate by trying different split settings, and confirm that using these data sets gives us fair results. The phrases are as follows: "Excuse me", "Goodbye", "Hello", "How are you", "Nice to meet you", "See you", "I am sorry", "Thank you", "Have a good time", "You are welcome". Each digit utterance consists of 10 randomly chosen digits. Note that, since we use the digit part of this corpus only to enhance model training data, the task in this work is a 10-class classification for phrase utterances.

DEMAND
We select another database, DEMAND [29], as a noise corpus. This corpus consists of six primary categories, each of which has three environments. Four of those primary categories are for closed spaces: Domestic, Office, Public and Transportation. The remaining two categories are recorded outdoors: Nature and Street. In this study, we add some of those noises to build audio training data.
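Adding a noise recording to clean speech at a target SNR can be sketched as follows. This is a generic illustration of SNR-controlled mixing, not the authors' procedure; the function name `mix_at_snr` and the toy signals are assumptions.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it.

    Both inputs are equal-length sequences of samples. Power is the mean
    squared amplitude over the whole signal.
    """
    if not speech or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equally long")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # After scaling, p_speech / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: at 0 dB the scaled noise has the same power as the speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed_0db = mix_at_snr(speech, noise, 0)
```

Lower SNR values give a larger noise gain, so 0 dB is the hardest of the conditions in which noise and speech power are matched.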

CENSREC-1-AV
CENSREC-1-AV [30] is a Japanese audio-visual corpus for noisy multi-modal speech recognition. CENSREC-1-AV provides audio utterances, lip images and audio noise. In this study, we utilize the audio noise, i.e., interior car noises recorded on city roads and expressways, to obtain acoustically noisy testing data.

Experimental Setup
We evaluate a model by utterance-level accuracy:
$$\mathrm{Accuracy} = \frac{H}{N} \times 100\ [\%] \tag{2}$$
where $H$ and $N$ are the number of correctly recognized utterances and the total number of utterances, respectively. In addition, we also evaluate our model performance by the F1 score. An F1 score can be computed as:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$
where
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
In Equation (4), $TP$ is the number of correctly classified utterances. $FP$ and $FN$ indicate false positives and false negatives, respectively. We calculate the score in each class.
Since DNN-based model performance slightly varies depending on the stochastic gradient descent algorithm, which is a common model training approach, we repeat the same experiment three times and calculate the mean accuracy. In terms of DNN hyperparameters, we choose a cross-entropy function as a loss function and Adam as an optimizer. Batch size, epochs and learning rate are set to 32, 50 and 0.001, respectively. We carry out our experiments using an NVIDIA GeForce RTX 2080 Ti GPU.
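The two metrics can be computed from reference and predicted labels as in the following sketch. The function names and the four-utterance example are hypothetical; the F1 score uses the equivalent form 2TP / (2TP + FP + FN), and returning 0 for a class with no true or predicted instances is an assumption.

```python
def accuracy(y_true, y_pred):
    """Utterance-level accuracy in percent: correct utterances over all utterances."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equally long")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * hits / len(y_true)


def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 score, treating each class in turn as the positive class."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # 2TP / (2TP + FP + FN) equals 2PR / (P + R); 0.0 if the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical results for four test utterances.
y_true = ["Hello", "Hello", "Goodbye", "See you"]
y_pred = ["Hello", "Goodbye", "Goodbye", "See you"]
acc = accuracy(y_true, y_pred)
f1 = f1_per_class(y_true, y_pred, ["Hello", "Goodbye", "See you"])
```

In this example one "Hello" utterance is mistaken for "Goodbye", which lowers the F1 score of both classes while leaving "See you" unaffected.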

Preprocessing
The OuluVS2 data set includes extracted lip images; however, the image size is not consistent. In order to apply DNNs, we resize all images to 64 × 64, a size chosen based on preliminary experiments with different image sizes, considering classification accuracy and computational cost. Furthermore, we normalize the number of frames to 64; if a sequence has fewer than 64 frames we conduct upsampling, otherwise we drop some frames. In addition, we convert all color images to gray-scale ones. Similar to the visual frames, we normalize the audio frame length to 115; if the length is less than 115 we repeat the last frame, otherwise only the first 115 frames are used.
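The two length-normalization rules can be sketched as below. The visual rule is shown as nearest-index resampling, which is an assumption, since the text does not state which upsampling or frame-dropping scheme is used; the function names are likewise hypothetical.

```python
def resample_frames(frames, target=64):
    """Bring a visual sequence to `target` frames by nearest-index resampling.

    Short sequences repeat some frames (upsampling); long sequences skip some.
    The exact resampling scheme is an assumption made for this sketch.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("sequence must contain at least one frame")
    return [frames[i * n // target] for i in range(target)]


def pad_or_truncate(frames, target=115):
    """Bring an audio feature sequence to `target` frames.

    Short sequences are padded by repeating the last frame; long sequences
    keep only their first `target` frames.
    """
    if not frames:
        raise ValueError("sequence must contain at least one frame")
    if len(frames) >= target:
        return list(frames[:target])
    return list(frames) + [frames[-1]] * (target - len(frames))


# Frame indices stand in for real images or MFCC vectors.
short_video = resample_frames(list(range(40)))    # 40 -> 64 frames
long_video = resample_frames(list(range(128)))    # 128 -> 64 frames
short_audio = pad_or_truncate([1, 2, 3], target=5)
```

Fixed lengths of this kind let every sequence be stacked into tensors of a single shape for CNN training.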
In the OuluVS2 corpus, there are 1050 (35 speakers × 10 utterances × 3 times) sentences available. However, the data size is not enough for DNN model training. To compensate for the lack of training data, we apply data augmentation in the audio and visual modalities. In the audio modality, we add acoustic noises in DEMAND to the original utterance data. The details, including noise type and signal-to-noise ratio (SNR) conditions, are shown in Table 2. In the visual modality, we train our VSR models using not only phrase data but also digit sequence data based on our previous work [31].
First of all, we investigated view classification performance. View classification results for the test data are shown in Table 3. The overall accuracy of view classification was 91.39%. Focusing on the results for each angle, classification for frontal and profile views was fully successful. On the other hand, misclassification was found in the diagonal views, particularly at 45°. In conclusion, the performance of our view classification was acceptable. However, the last fact also indicates that the following VSR models are required to carry out recognition successfully even for the miscategorized sequences.
The VSR results are shown in Table 4. We firstly tested our models with and without view classification. Our method with the view classification part achieved almost the same or better performance compared to our method without it, in which the correct angle was given. This indicates that our feature extraction and recognition strategy can perform well. Next, we compared our approach with conventional methods. Focusing on the average recognition accuracy, our proposed method achieved the highest accuracy regardless of the presence or absence of the view classification part. It is interesting that at 45° we found much more improvement than in the other conditions, even though the view classification performance was insufficient. Since 45° data were used as training data in the neighboring 30° and 60° conditions, we might obtain such an improvement even if the view classification fails. We also found that our method was particularly effective in the medium-angle (30°, 45° and 60°) conditions, while the end-to-end system had higher accuracy for frontal and profile images. Figure 5 indicates F1 scores for each angle. At all the angles, shorter utterances were found to be relatively hard to classify, because there were fewer cues for recognition.
Table 5 shows the recognition accuracy of our ASR, VSR and AVSR methods in various noise environments. Note that, because the task was a 10-class classification, the accuracy in noisy environments tended to be higher compared to large-vocabulary speech recognition. The VSR accuracy was stable and unrelated to SNR since visual information is not affected by noise. As is already known, the results of VSR were lower than those of ASR at all the SNRs, because audio features are more effective and informative than visual ones. Among the models, AVSR achieved the best accuracy in all the conditions. In particular, at 0 dB, where the effect of noise was the largest, the performance was improved by 3% for city road noise and by 2.3% for expressway noise compared to ASR results. Even in the case of 20 dB, where the effect of noise was quite small, the accuracy was slightly improved. As mentioned, we employed the decision fusion strategy, which is the simplest integration method. Similar to the ensemble approach, we believe our decision fusion method could successfully integrate ASR and VSR results, which had different recognition errors.

Conclusions
In this paper, we proposed a multi-angle VSR system in which feature extraction was conducted using angle-specific models based on view classification results, followed by feature integration and VSR. We also proposed a decision fusion-based AVSR. We employed DNNs in our system, to perform view classification, feature extraction and recognition. The advantages of our method are choosing appropriate feature extraction models based on angle classification results, reducing the negative impact of misclassification, and incorporating ASR and VSR results efficiently. Evaluation experiments were conducted using the multi-view corpus OuluVS2. Then, we found our scheme could work well compared to past works, and we clarified the effectiveness of view classification and feature extraction from pre-trained angle-specific models. Moreover, we found that our AVSR method is superior to ASR and VSR because our decision fusion method could successfully integrate ASR and VSR results.
As our future work, we are planning to conduct experiments using different angle settings and other tasks. The implementation of this framework for real applications is also expected. In addition, because there are some research works investigating spectrograms instead of MFCCs, we will try to employ spectrograms as acoustic input. Finally, we will explore a suitable model architecture and its physical meaning for feature extraction.

Funding: This research received no external funding.

Data Availability Statement: The databases used in this article are OuluVS2, DEMAND and CENSREC-1-AV. For details, please refer to [22], [29] and [30], respectively.

Conflicts of Interest: The authors declare no conflict of interest.