Augmented Latent Features of Deep Neural Network-Based Automatic Speech Recognition for Motor-Driven Robots

Abstract: Speech recognition for intelligent robots suffers from performance degradation due to ego-noise. Ego-noise is caused by the motors, fans, and mechanical parts inside the robot, especially when the robot moves or shakes its body. To overcome the problems caused by ego-noise, we propose a robust speech recognition algorithm that uses the motor-state information of the robot as an auxiliary feature. For this, we use two deep neural networks (DNNs). First, we design latent features using a bottleneck layer, an internal layer with a smaller number of hidden units than the other layers, to represent whether the motor is operating or not. The latent features that maximize the representation of the motor-state information are generated by taking the motor data and acoustic features as the input of the first DNN. Second, once the motor-state-dependent latent features are designed by the first DNN, the second DNN, which accounts for acoustic modeling, receives the latent features as input along with the acoustic features. We evaluated the proposed system on the LibriSpeech database. The proposed network enables efficient compression of the acoustic and motor-state information, and the resulting word error rate (WER) is superior to that of a conventional speech recognition system.


Introduction
Voice can be acquired virtually everywhere and is the most natural and intuitive means of communication between humans, or between humans and machines. Recently, humanoid robots such as Softbank's Pepper [1], MIT's home robot JIBO [2], and Intel's Jimmy [3] have been developed thanks to considerable advances in automatic speech recognition (ASR) technology. Many recent studies on ASR, an indispensable component of humanoid robots, have been actively carried out, but the challenging problem of ambient noise still remains. In addition, humanoid robots generate strong ego-noise [4], which is a significant factor deteriorating recognition performance. Indeed, since the distance between the microphone and the internal parts of the robot that generate the ego-noise is much shorter than the distance between the microphone and the speaker, the ego-noise is recorded loudly into the microphone, causing performance degradation. Robustness to ambient noise has been one of the main issues in ASR technology to this day, but robustness to ego-noise has not been fully addressed.
In general, DNN-based speech enhancement systems for dealing with ambient noise are trained using two types of targets: masking-based targets and mapping-based targets. In the first case, the model attempts to describe the time-frequency (T-F) relationship between clean speech and background interference and makes a binary classification decision on T-F units, as in ideal binary mask estimation [5]. Furthermore, in [6], DNNs were used to estimate a smoothed ideal ratio mask (IRM) in the Mel frequency domain for robust ASR. In the second case, the DNN learns a complex mapping function from noisy to clean speech using multi-condition training data [7][8][9]. However, such conventional DNN-based speech enhancement systems generally require very deep models or large databases to cope with as much ambient noise as possible, which is not suitable for embedded systems such as humanoid robots. In addition, these systems have limitations in using motor-state information as an auxiliary feature.
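As a concrete illustration of the masking-based target, the following is a minimal numpy sketch of an ideal ratio mask computed from parallel clean and noise power spectra; the array shapes, names, and the flooring constant are our own illustrative assumptions and do not reproduce the smoothed Mel-domain IRM of [6].

```python
import numpy as np

def ideal_ratio_mask(clean_power, noise_power, eps=1e-10):
    """Ideal ratio mask on a time-frequency grid.

    clean_power, noise_power: (frames, bins) power spectrograms of the
    parallel clean speech and interference.  Returns values in [0, 1]
    that a DNN would be trained to predict from the noisy input.
    """
    return clean_power / (clean_power + noise_power + eps)

# Toy usage with random spectra standing in for real parallel data.
clean = np.abs(np.random.randn(100, 257)) ** 2
noise = np.abs(np.random.randn(100, 257)) ** 2
mask = ideal_ratio_mask(clean, noise)
enhanced = mask * (clean + noise)   # mask applied to the (toy) mixture power
```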
As for ego-noise suppression, spectral subtraction [10] is one possible approach. A DNN-based framewise prediction approach was developed to estimate the noise spectra using the angular velocity when a robot moves a joint [11]. The estimated noise spectra are then subtracted from the target signal spectra to clean up the noisy speech signal. One problem with this approach is that the ASR performance degrades considerably when the noise power estimates deviate from the true values, especially when the ego-noise is non-stationary. Several researchers have tackled this problem by predicting and subtracting the ego-noise using templates. For instance, [12] proposed a way to predict the ego-noise using gesture and walking-pattern templates obtained from pre-recorded motor noise corresponding to each motion pattern. Given the labeled motion command, the ego-noise template matched to the latest motion is selected from the template database and used for subtraction. A small noise template database was extended in [4] to a larger ego-noise space by incorporating more information related to the joints, such as angular positions, velocities, and accelerations. Next, [13] employed the motor data to exploit the intrinsic harmonic structure of the ego-noise and incorporated the ego-noise harmonics into a multichannel dictionary-based approach for ego-noise reduction. These studies on ego-noise reduction have shown that the instantaneous motor data of humanoid robots can serve as a secondary information source for dealing with the ego-noise. Recently, we devised an idea in [14] to use the motor on/off state as auxiliary information when designing the acoustic model of a DNN-based speech recognition system for humanoid robots. However, the auxiliary information about the motor state was simply designed as a one-hot vector, so the performance gain turned out to be limited.
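To make the template-based subtraction idea concrete, here is a minimal sketch, under our own assumptions, of subtracting a motor-state-dependent ego-noise template from a noisy magnitude spectrogram; the template lookup, dimensions, and flooring constant are illustrative and not taken from the cited systems.

```python
import numpy as np

def subtract_ego_noise(noisy_mag, noise_template, floor=0.01):
    """Spectral subtraction with a motor-state-dependent noise template.

    noisy_mag:       (frames, bins) magnitude spectrogram of the mixture.
    noise_template:  (bins,) average ego-noise magnitude for the current
                     motor state (e.g., looked up from a template database).
    The result is floored to avoid negative magnitudes; residual musical
    noise is still expected when the ego-noise is non-stationary.
    """
    cleaned = noisy_mag - noise_template[np.newaxis, :]
    return np.maximum(cleaned, floor * noisy_mag)

# Toy usage: one template per motor state, selected by the motion command.
templates = {"motor_on": np.full(257, 0.5), "motor_off": np.full(257, 0.1)}
noisy = np.abs(np.random.randn(100, 257))
enhanced = subtract_ego_noise(noisy, templates["motor_on"])
```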
In this paper, we propose to employ latent features that improve speech recognition performance by utilizing internal knowledge about whether the motor is running or not. The first DNN is designed to create the motor-state-dependent latent features, and its input is formed by concatenating the motor-state data with the acoustic features contaminated by the ego-noise. After the preliminary training of the first DNN yields the latent features, these are fed into the second DNN designed for acoustic modeling under ego-noise environments. Subsequently, to fully represent the complex relationship between the audio signal and senones, the second DNN is trained with inputs containing both the latent features and the acoustic features. To verify the performance of our approach against existing methods, experiments were extensively conducted on the LibriSpeech corpus. The experimental results showed better performance in terms of WER reduction than the baseline models, including mapping-based speech enhancement, the method using one-hot encoded input [14], and the method using only acoustic features.
The rest of the paper is organized as follows: In Section 2, we describe the related work. Details of the model architecture are explained in Section 3. The experimental results are shown in Section 4. We conclude with a discussion in Section 5.

Bottleneck Network
In the past several years, latent features extracted from a bottleneck layer have been widely used in many tasks such as speech recognition [15][16][17], audio classification [18,19], speech synthesis [20], and speaker recognition [21]. The latent features are generated from a multi-layer perceptron (MLP) or DNN with a middle bottleneck layer that has a small number of hidden units compared to the other hidden layers. This special hidden layer creates a constriction in the network that compresses the task-related information into a low-dimensional representation. Therefore, the latent features can be considered a nonlinear transformation and dimensionality reduction of the input features.
The latent features can be derived in either an unsupervised or a supervised manner. In the unsupervised manner, an autoencoder with several hidden layers is trained to reconstruct its inputs as shown in Figure 1, and does not need explicit labels to train on. In this figure, the autoencoder has three layers: an input, an output, and a hidden layer. The input vector $x$ is encoded to the hidden vector $h = \sigma(W^{(1)} x + b^{(1)})$ by a nonlinear activation function $\sigma$ with the learned weight matrix $W^{(1)}$ and bias vector $b^{(1)}$. Then, the hidden vector is decoded to produce a reconstructed vector $\hat{x} = \sigma(W^{(2)} h + b^{(2)})$ using the learned weight matrix $W^{(2)}$ and bias vector $b^{(2)}$. The autoencoder parameters $\theta = \{(W^{(1)}, b^{(1)}), (W^{(2)}, b^{(2)})\}$ are learned with the back-propagation algorithm by minimizing the mean square error (MSE) loss. Further, a stacked autoencoder can be used to extract latent features, which are progressively encoded through successive hidden layers. Fine-tuning of the entire stack of hidden layers is then performed using back-propagation. This allows each hidden layer to exhibit a different level of representation of the input features.
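A minimal PyTorch sketch of this unsupervised case is given below: an autoencoder whose narrow hidden layer yields the latent features. The layer sizes and activation choices are illustrative assumptions, not the configuration used later in the paper.

```python
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    """Autoencoder with a narrow hidden layer: h = sigma(W1 x + b1),
    x_hat = sigma(W2 h + b2), trained with an MSE reconstruction loss."""
    def __init__(self, in_dim=165, bottleneck_dim=40):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, bottleneck_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)          # latent (bottleneck) features
        return self.decoder(h), h

model = BottleneckAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 165)              # a toy batch of input features
x_hat, h = model(x)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
optimizer.step()
```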
In the supervised approach, the latent features are created by an MLP trained to predict class labels (e.g., phoneme states) as shown in Figure 2, where the MLP is a feed-forward neural network made of an input layer, an output layer, and at least one hidden layer. Usually, for the classification task, the softmax function is employed to convert values of arbitrary range into a probabilistic representation. The learning process minimizes the prediction error with respect to the parameters $\theta = \{(W^{(1)}, b^{(1)}), (W^{(2)}, b^{(2)}), \cdots, (W^{(L)}, b^{(L)})\}$, and a cross-entropy error function [22] is generally selected as the loss function for the MLP. As the supervised method carries out the subsequent classification task, the classification information is compressed into the activations of the latent units as much as possible during training. Hence, the latent features have the expressive power to be more invariant and discriminative.
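For completeness, a hedged sketch of the supervised case follows: an MLP trained with cross-entropy against class labels, whose bottleneck activations are read off as latent features. The class count, layer widths, and bottleneck position are placeholders chosen for illustration only.

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Feed-forward classifier whose narrow third layer provides the latent
    features; trained to predict class labels (e.g., phoneme states)."""
    def __init__(self, in_dim=165, hidden=512, bottleneck=40, n_classes=2000):
        super().__init__()
        self.to_bottleneck = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bottleneck), nn.Sigmoid())   # bottleneck layer
        self.to_output = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_classes))                  # softmax applied inside the loss

    def forward(self, x):
        h = self.to_bottleneck(x)     # latent features used downstream
        return self.to_output(h), h

model = BottleneckMLP()
x, labels = torch.rand(32, 165), torch.randint(0, 2000, (32,))
logits, latent = model(x)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```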

Proposed Method
Since the instantaneous motor-state information of the robot is intuitively effective for dealing with the ego-noise, in this work we attempt to fuse the acoustic features obtained from spoken utterances and the motor-state data into a single framework. To this end, we propose to use the two distinct motor states, one for the operating motor and the other for the idle motor, as auxiliary features as shown in Figure 3. The motor data are acquired directly from the robot: depending on whether the robot is moving, e.g., shaking its body, or not, the motor state alternates between "on" and "off". This auxiliary feature provides very useful, instantaneously changing information about the internal state of the intelligent robot. Then, the input $[x; o]$, formed by concatenating the conventional spectral features $x$ with the auxiliary feature $o$, is used for latent feature training as depicted in Figure 4.

Here, the elementary question arises of how to fuse the motor data with acoustic features such as MFCCs. For this, we introduce a way to exploit latent features as proposed in [12,16]. The key idea is realized by training the first DNN, designed for preliminary acoustic modeling, with the fused features as displayed in Figure 4. For training, three different targets are placed at the output: senones, the motor state, and MFCCs. The cross-entropy between the network output and the target is minimized using the senones and motor states as labels, and the MSE loss is minimized to restore a clean version of the MFCCs. The created latent features can be thought of as embedding features that carry information about the motor operation in combination with the acoustic features.

Next, these latent features are fed into the second DNN, which is responsible for the primary acoustic modeling, as shown in Figure 4, which illustrates the overall architecture of the proposed system. The latent features are appended to the acoustic features and together form the input to the second DNN, as illustrated in the sketch below. The acoustic modeling capability can thus be improved by the additional input, which more effectively represents whether the motor is operating under the given acoustic circumstances; this will be verified in detail in the experimental section.
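The following is a minimal sketch of this pipeline under our own dimensional assumptions (13-dim MFCC, 2-dim motor indicator, 11-frame splice, 40-dim bottleneck, 2000 senones): the motor state is concatenated with the spliced MFCCs, the multi-target preliminary DNN produces bottleneck activations, and these are appended to the MFCCs as the acoustic-model input. Layer sizes, class counts, and loss weighting are placeholders, not the exact Kaldi recipe.

```python
import torch
import torch.nn as nn

class PreliminaryDNN(nn.Module):
    """First DNN: maps [MFCC ; motor-state] to three targets (senones,
    motor state, clean MFCC); the narrow layer supplies the latent features."""
    def __init__(self, in_dim=165, hidden=512, bottleneck=40,
                 n_senones=2000, n_motor=2, mfcc_dim=13):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bottleneck), nn.Sigmoid())   # bottleneck layer
        self.senone_head = nn.Linear(bottleneck, n_senones)
        self.motor_head = nn.Linear(bottleneck, n_motor)
        self.mfcc_head = nn.Sequential(nn.Linear(bottleneck, mfcc_dim), nn.Tanh())

    def forward(self, x):
        z = self.trunk(x)
        return self.senone_head(z), self.motor_head(z), self.mfcc_head(z), z

# Fused input: spliced MFCCs plus a 2-dim on/off indicator per frame.
mfcc_spliced = torch.rand(32, 143)                 # 13-dim MFCC x 11 frames
motor_state = torch.tensor([[1.0, 0.0]] * 32)      # "on" encoded as a one-hot pair
motor_spliced = motor_state.repeat(1, 11)          # 2 x 11 = 22 dims
fused = torch.cat([mfcc_spliced, motor_spliced], dim=1)   # 165-dim input

prelim = PreliminaryDNN()
_, _, _, latent = prelim(fused)

# Second DNN (acoustic model) consumes MFCCs concatenated with the latent features.
am_input = torch.cat([mfcc_spliced, latent.detach()], dim=1)
acoustic_model = nn.Sequential(
    nn.Linear(am_input.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 2000))                           # senone logits
senone_logits = acoustic_model(am_input)
```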

Corpus Description
In order to evaluate the proposed approach, we conducted experiments with the JIBO humanoid robot, which has three full-revolute axes [2]; a brief specification of the robot is given in Table 1. We mainly consider the scenario in which humans interact with the robot while the robot shakes its head and body. To simulate noisy environments for the humanoid robot, we recorded the ego-noise using a single microphone located at the front of the head in an anechoic environment. These noise signals comprise two types of ego-noise: fan noise and movement noise. The first type, labeled "Motor Off", contains only the fan noise, while the second type, labeled "Motor On", contains both the fan and movement noise. The mixing was conducted at signal-to-noise ratio (SNR) levels of 5 dB and 15 dB, assuming a 1 m distance between the speaker and the robot; a short sketch of such mixing is given after this paragraph. Walking robots typically use motors with large torque, so low SNR levels between -5 dB and 0 dB are assumed in [4]; a walking robot like ASIMO [23] requires motors with large torque because it runs, climbs stairs, or avoids obstacles like a human. However, since the JIBO robot is a social robot for the home, its motion is limited and it uses motors with lower torque. Furthermore, social robots generally operate in indoor environments, where there is little external noise. Therefore, extreme conditions such as SNR levels below 0 dB were not considered. These mixtures were then used to train and evaluate the ego-noise robust ASR algorithms described above. Our experiments were performed on the LibriSpeech database [24], which is composed of three distinct training subsets of 100, 360, and 500 h of speech, representing a total training set of 960 h of read English speech. In our experiments, we used the 100 h subset and augmented it by mixing in the ego-noise that we recorded. We evaluated the models on the test sets provided by the LibriSpeech corpus: test-clean and test-other. The difference between "clean" and "other" lies in the quality of the audio and its corresponding transcription; the quality of "clean" is higher than that of "other" [24]. The waveform sampling rate of the corpus and the recorded noise was 16 kHz. We then evaluated the proposed algorithm in terms of WER under the aforementioned environments. Since we focus on the ego-noise in this work, ambient noise was excluded; however, ambient acoustic noise could ultimately be addressed within the same unified DNN framework.
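As a minimal sketch of the data augmentation, the helper below mixes a recorded ego-noise signal into a clean utterance at a target SNR. File I/O, resampling, and scaling by active-speech power only are omitted, and the function itself is our own illustration rather than the exact script used.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech`.  Both are 1-D float arrays at the same sampling
    rate (16 kHz in our setup); the noise is looped or truncated to length."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Toy usage at the two SNR conditions used in the experiments.
speech = np.random.randn(16000 * 3)      # stands in for a LibriSpeech utterance
ego_noise = np.random.randn(16000)       # stands in for the recorded "motor on" noise
mix_5db = mix_at_snr(speech, ego_noise, 5)
mix_15db = mix_at_snr(speech, ego_noise, 15)
```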

Experimental Setup
In our experiments, the Kaldi toolkit [25] was utilized to train both the preliminary DNN extracting the latent features and the primary DNN for acoustic modeling. The two networks were equally designed with 4 hidden layers, each having 512 hidden units except for the bottleneck layer. For the primary DNN, rectified linear unit (ReLU) activation functions were used in the lower layers and the softmax function was used at the output layer. For the acoustic features, 13-dimensional MFCC features were extracted using a 25 ms analysis window with a 10 ms frame shift. The baseline models used for comparison with the proposed model are described in Table 2. Three baseline models were adopted. In the first baseline system, the MFCC features spliced with 11 adjacent frames were used as the DNN input for acoustic modeling. In the second baseline system, the MFCC features and a one-hot representation of the motor state were used as the DNN input with 165 dimensions (15 × 11). The third baseline system is a DNN-based spectral mapping system designed with a total of four hidden layers in order to have computational complexity similar to our proposed model. The spectral mapping model is trained on the 100 h subset contaminated by the ego-noise (5 dB and 15 dB SNR). The enhanced MFCC features are evaluated with the subsequent acoustic model, which is identical to the first baseline model; therefore, the performance of the DNN-based spectral mapping model was verified through evaluation of the reconstructed MFCC features. We measured the performance of the first baseline model to check the performance of the acoustic model used in all experiments. In addition, to improve speech recognition performance in an environment where the ego-noise exists, the second and third baseline models use auxiliary features or a feature enhancement model, respectively. Therefore, evaluation in a clean environment without the ego-noise was performed only on the first baseline model. As for the proposed systems, the preliminary DNN was separately constructed with 4 hidden layers in an early stage. To investigate the effect of the various latent features, experiments were performed by varying the output features and the bottleneck layer position. First, the senones, the motor-state label, and the original MFCC features were independently placed at the output, yielding three kinds of preliminary DNNs for latent feature extraction (bDNN): bDNN_senone, bDNN_MS, and bDNN_MFCC. The sigmoid and tanh activation functions were used for the classification and regression tasks, respectively. In addition, we varied the placement of the bottleneck layer from the bottom hidden layer (position 1) to the top hidden layer (position 4). For all the preliminary networks, the stacked MFCC and motor-state features ((13 + 2) × 11 = 165 dimensions) were used as input; the sketch below makes these dimensions explicit. Then, the extracted motor-state-dependent latent features were combined with the MFCCs again and used to train the acoustic model as shown in Figure 4. The complexity of all models is described in Table A1.
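For clarity, the frame-splicing arithmetic is illustrated below: 13-dim MFCCs over 11 frames give 143 dims (baseline 1), while appending a 2-dim motor-state indicator before splicing gives (13 + 2) × 11 = 165 dims. The splicing helper is our own; the actual features were produced with Kaldi.

```python
import numpy as np

def splice(frames, context=5):
    """Stack each frame with `context` frames on either side
    (11 frames total for context=5); edges are padded by repetition."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + len(frames)] for i in range(2 * context + 1)], axis=1)

n_frames = 300
mfcc = np.random.randn(n_frames, 13)                 # 13-dim MFCC per frame
motor = np.tile([1.0, 0.0], (n_frames, 1))           # per-frame on/off indicator

baseline_input = splice(mfcc)                        # 13 x 11 = 143 dims
fused_input = splice(np.hstack([mfcc, motor]))       # (13 + 2) x 11 = 165 dims
assert fused_input.shape == (n_frames, 165)
```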

Experimental Results and Analysis
Tables 3 and 4 present the WER results on the LibriSpeech database for the "motor off" and "motor on" states, respectively. The values in the first column of these two tables represent the recognition performance in the absence of the ego-noise and indicate the performance of the acoustic model used in all models. This acoustic model was trained using a Kaldi [25] script, and its performance is similar to that of a p-norm DNN trained on the 100 h subset published by Kaldi (6% to 9% and 20% to 28% WER on the "clean" and "other" test sets, respectively). Furthermore, the values in the last column show the average performance over SNRs of 5 dB and 15 dB. In both cases, the motor-state data yield better recognition performance, as reported in [14]. As we expected, auxiliary features such as the latent features generated using the bottleneck layer yielded superior performance compared to the DNN [14] using one-hot encoded vectors. This indicates that the bottleneck layer creates a more valuable representation of the motor-state data by fusing it with the spectral features and compressing the result. In addition, the proposed model was superior to spectral feature mapping, a DNN-based speech enhancement method. Figure 5 shows the results of the DNN-based spectral feature mapping. According to the evaluation with the subsequent acoustic model, the reconstructed features did not bring a significant performance improvement. This is because the DNN for the spectral feature mapping was designed to be lighter than the usual DNN-based mapping model in order to have an amount of computation similar to the proposed preliminary network. For the comparison of the proposed algorithms, the WER was reported while changing the output features among senones, MS, and MFCC when constructing the preliminary DNN. The results show that the senones are the most appropriate target features for the preliminary DNN; hence, predicting the senones turns out to be most relevant for the task of the primary DNN. To be specific, the bDNN_senone model achieved relative WER reductions of 2.24% and 5.83% over the baseline 1 model for the "motor off" and "motor on" states, respectively. Furthermore, as for the position of the bottleneck layer, the third layer yielded the best performance for bDNN_senone, as displayed in Figure 6.
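For reference, the relative WER reductions quoted above are assumed to follow the standard definition:

\[
\mathrm{WERR}_{\mathrm{rel}} = \frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{proposed}}}{\mathrm{WER}_{\mathrm{baseline}}} \times 100\%.
\]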

Conclusions
In this paper, we proposed a novel way to incorporate the instantaneous motor on/off state information into an ego-noise robust ASR system. For this, we introduced a bottleneck network that creates motor-state-dependent latent features to effectively represent the ego-noise along with the acoustic features. These ego-noise adaptive latent features offer a significant improvement over the one-hot encoded motor-state features. Regarding the proper target of the bottleneck network, the senones turn out to be the most effective for extracting the ego-noise adaptive latent features. Additionally, we compared the effect of the bottleneck layer position and concluded that the third layer was the best. To summarize, we demonstrated that the bottleneck layer offers a more valuable representation of the motor data than the previous method. In future work, more diverse states of the robot could be considered as the motor data, e.g., walking, right/left arm rotation, head shaking, or combinations of multiple states.

Conflicts of Interest:
The authors declare no conflict of interest.