Symmetry
  • Article
  • Open Access

7 May 2019

End-to-End Mandarin Speech Recognition Combining CNN and BLSTM

1 College of Computer, National University of Defense Technology, Changsha 410073, China
2 Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
* Authors to whom correspondence should be addressed.

Abstract

Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and use a variety of expertise, it is hard to build and train such models. Recent research shows that end-to-end ASRs can significantly simplify speech recognition pipelines and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases which are not freely available. This is especially common for Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR. This CNN+BLSTM+CTC ASR uses a Convolutional Neural Net (CNN) to learn local speech features, uses a Bidirectional Long-Short Time Memory (BLSTM) to learn history and future contextual information, and uses Connectionist Temporal Classification (CTC) for decoding. Our model is trained completely on the by-far-largest open-source Mandarin speech corpus AISHELL-1, using neither any in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the existing best work. Because all the data corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.

1. Introduction

With the rapid development of smart devices such as mobile phones and robots, users increasingly interact with man–machine interfaces via speech recognition. Google Now, Apple Siri, and Microsoft Cortana are all widely used systems that rely on Automatic Speech Recognition (ASR). Besides, Baidu IME and iFLY IME can map Mandarin and English utterances to the corresponding texts. Furthermore, in January 2019 Alibaba Cloud Computing published its Distributed Speech Solution; by combining ASR techniques with devices such as switch panels or air conditioners, it makes it easy to deploy speech recognition systems indoors. Beyond that, speech recognition can also offer a lot of help in other domains such as autonomous driving, health care, etc.
Decades of hand-engineered domain knowledge have gone into current state-of-the-art ASR pipelines. Conventionally, Large Vocabulary Continuous Speech Recognition (LVCSR) systems contain several separate modules, including acoustic, phonetic, and language models, as well as special lexicons. All these modules in an ASR system are trained separately. As a result, the errors of each module can accumulate during the recognition process. Moreover, building an ASR with so many modules requires a variety of hand-engineered domain knowledge such as pronunciation lexicons, linguistic expertise, etc. All this makes it hard to design and train a well-performing ASR system.
Since conventional ASR has many disadvantages, a powerful alternative has recently been proposed: training ASR models end-to-end, replacing most modules with a single deep learning model [1,2]. This ‘end-to-end’ vision of training can substantially simplify the training process by removing the engineering required for the bootstrapping/alignment/clustering/HMM machinery often used to build state-of-the-art ASR models. On a system built on end-to-end deep learning, we can employ a range of the newest deep learning techniques to train a novel deep neural model with high performance.
Enough data is the key for end-to-end ASRs to achieve better performance than conventional ASRs. For English ASR, large-scale databases such as TED-LIUM [3] and Librispeech [4] offer open platforms for both researchers and industrial developers to experiment and compare system performance. Thanks to the popularization of smart devices, it has become much easier than before for industry to access and collect large amounts of speech data for Mandarin ASR. However, as most of these data are not shared with the public, researchers still have very limited access to large-scale real-world Mandarin speech data. As a result, Mandarin ASR research does not scale well to industrial scenarios. Besides, since the existing state-of-the-art Mandarin ASR works all use in-house large-scale datasets, comparing with them is of little value and sheds little light on future research.
Fortunately, a freely available Mandarin speech corpus, AISHELL-1 [5], was released recently. It is by far the largest freely accessible Mandarin corpus, containing 400 speakers and over 170 h of Mandarin speech data. It is suitable for conducting and comparing speech recognition research for Mandarin.
In this paper, we propose an end-to-end Mandarin ASR model that combines a Convolutional Neural Net (CNN) and a Bidirectional Long-Short Time Memory (BLSTM) neural net, named CNN+BLSTM+CTC. This model uses CNN to learn local features in the frequency and time domains, and uses BLSTM to learn history and future contextual features. Our model is reproducible and comparable for other researchers because it is trained on the openly accessible Mandarin speech dataset AISHELL-1, using neither other in-house data nor an external language model. We benchmark our CNN+BLSTM+CTC on the AISHELL-1 test set and compare it with some existing works. Experimental results show that our model achieves a WER of 19.2%, outperforming the existing methods in [6,7].
The contribution of this paper is three-fold:
  • We propose CNN+BLSTM+CTC, an end-to-end ASR model using both CNN and BLSTM. It combines the CNN layers’ ability to learn local features with the BLSTM layer’s ability to learn history and future contextual features, enabling CNN+BLSTM+CTC to model audio signals effectively and recognize speech precisely.
  • We use neither in-house training data nor an external language model in this paper. All the training, development, and testing data we use come from the AISHELL-1 dataset, which can be freely acquired. This makes our results comparable for other researchers.
  • We carry out comprehensive experiments to verify our design ideas. Experimental results show that our CNN+BLSTM+CTC performs effective speech recognition.
The remainder of the paper is organized as follows. We begin with a review and analysis of related work in conventional ASR, end-to-end ASR, and especially Mandarin end-to-end ASR in Section 2. Section 3 describes the architecture and detailed design of our CNN+BLSTM+CTC model. We describe the experimental settings, analyze the results of our model and compare it with existing works in Section 4. We conclude our work in Section 5 and list some future work in Section 6.

3. End-to-End Model for Mandarin ASR

Figure 1 illustrates the architecture of our deep neural network. The audio input $x$ is first batch normalized, then passed through 3 CNN blocks, each of which involves 4 operations: convolution, Batch Normalization, Rectified Linear Unit (ReLU) activation and max pooling. The CNN blocks are followed by a bidirectional LSTM layer and a fully connected layer. Finally, a CTC layer performs the decoding and outputs the label sequence $y$.
Figure 1. Architecture of the deep network for speech recognition on AISHELL-1.
In the following part of this section we will describe our design ideas in detail.

3.1. Convolution Layer

Given an input sequence $X = \{x_1, \ldots, x_T\}$ with $x_i \in \mathbb{R}^{b \times c}$, and a single-filter convolution kernel $K \in \mathbb{R}^{w \times h \times c}$ with convolution strides $S_C = (s_w^c, s_h^c)$, the convolution result is a two-dimensional feature map, calculated as in Equation (1):
o_{i,j} = \sum_{w_i,\, h_j,\, q} x_{s_w^c \cdot i + w_i,\; s_h^c \cdot j + h_j,\; q} \cdot k_{w_i, h_j, q}        (1)
where $T$, $c$, $b$ are the time steps, channels and bandwidth of the input sequence, respectively; $w$, $h$ are the kernel's width and height, respectively; and $s_w^c$, $s_h^c$ are the width and height strides of the convolution, respectively.
If the kernel has more than one filter, the convolution produces more than one feature map.
Different works use different features as the input sequence $\{x_1, \ldots, x_T\}$. While most works use cepstral coefficients, some use the raw waveform. In this paper, the inputs $\{x_1, \ldots, x_T\}$ are Mel-Frequency Cepstral Coefficients. The convolution kernel $K$ is a matrix that operates on local patches of the input and slides along the $T$-dimension and the $b$-dimension.
Figure 2 illustrates a simple convolution procedure with only one filter, where $T = 3$, $b = 3$, $c = 1$, $w = 2$, $h = 2$, $s_w^c = 1$, $s_h^c = 1$.
Figure 2. A simple example of CNN procedure.
From Equation (1) we can see that every result element $o_{i,j}$ of the convolution is derived from $w \times h$ local elements in every input feature map. Thus, for convolution on an input sequence with $c$ feature maps, every result element is correlated with $w \times h \times c$ local input elements. This means that the convolution can capture the input data's local features at the corresponding position.
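To make Equation (1) concrete, the following minimal NumPy sketch (our illustration, not the authors' code) implements the single-filter convolution with explicit loops; the toy input reproduces the setting of Figure 2 ($T = 3$, $b = 3$, $c = 1$, $w = 2$, $h = 2$, strides 1).

    import numpy as np

    def conv2d_single_filter(x, k, s_w=1, s_h=1):
        """Direct implementation of Equation (1): x has shape (T, b, c), k has shape (w, h, c)."""
        T, b, c = x.shape
        w, h, _ = k.shape
        out_t = (T - w) // s_w + 1
        out_b = (b - h) // s_h + 1
        o = np.zeros((out_t, out_b))
        for i in range(out_t):
            for j in range(out_b):
                patch = x[s_w * i : s_w * i + w, s_h * j : s_h * j + h, :]
                o[i, j] = np.sum(patch * k)   # sum over kernel width, height and channels
        return o

    x = np.arange(9, dtype=float).reshape(3, 3, 1)   # T = 3 frames, b = 3 bands, c = 1 channel
    k = np.ones((2, 2, 1))                           # one 2 x 2 filter
    print(conv2d_single_filter(x, k))                # a 2 x 2 feature map, as in Figure 2

With more than one filter, several such feature maps would simply be stacked along a new channel axis.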
Convolution's ability to learn local features suits the speech recognition task very well. ASR never produces an output depending only on a single momentary input signal. In fact, whether uttered or recognized, a piece of speech is always treated as a sequence of short audio segments that last for tens to hundreds of milliseconds. Therefore, learning local features on short acoustic segments is a significant step for speech recognition.
One CNN layer can only cover a small input scope, but if we stack many CNN layers together, they can learn the local features of a much larger scope.
To simplify the analysis, let us only consider the time axis. Assume the CNN kernel's width is $w_c$ and its width stride is $s_w^c$. We refer to the time span covered by a result element as $t_c$, and to the time shift between two adjacent result elements as $window_c$. Then, $t_c$ and $window_c$ can be calculated according to Equation (2):
t_c = t_i + (w_c - 1) \cdot window_i, \qquad window_c = s_w^c \cdot window_i        (2)
where $t_i$ and $window_i$ are the time span and shift window of the input.
In this paper, we use a Mel-Frequency Cepstrum Coefficient (MFCC) sequence as the input. Every MFCC frame's time span is 25 ms and the shift window is 10 ms. The three CNN layers' kernel widths on the time dimension are 3, 2, and 2, respectively, and their convolution strides are all 1. Therefore, without pooling layers, each result element of the last CNN layer covers a time span of 65 ms, and the shift window between two adjacent elements is 10 ms, which means that the 3-layer CNN can learn local features over every 65 ms, much larger than the original MFCC frame's time span.
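As a quick check of the 65 ms figure, the short Python sketch below (ours, for illustration) chains Equation (2) over the three convolution layers, starting from a 25 ms frame with a 10 ms shift.

    def conv_time_span(t_in, window_in, kernel_w, stride_w=1):
        """Equation (2): time span and shift of a convolution layer's output along the time axis."""
        return t_in + (kernel_w - 1) * window_in, stride_w * window_in

    t, window = 25, 10              # one MFCC frame: 25 ms span, 10 ms shift
    for kernel_w in (3, 2, 2):      # time-axis kernel widths of the three CNN layers
        t, window = conv_time_span(t, window, kernel_w)
    print(t, window)                # 65 ms span, 10 ms shift, as stated above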

3.2. Batch Normalization

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. Since the inputs to each layer are affected by the parameters of all preceding layers, small changes to the network parameters amplify as the network becomes deeper. The change in the distribution of network activations due to the change in network parameters during training is defined as Internal Covariate Shift. Batch normalization is designed to alleviate this Internal Covariate Shift by introducing a normalization step that fixes the means and variances of layer inputs.
Batch Normalization (BN) [25] is widely used in deep learning and brings remarkable improvements to many tasks. It allows researchers to use much higher learning rates and to be less careful about initialization. It helps to accelerate training and improve performance substantially. In this work, we use BN between convolution and activation.
Formally, for a batch $X = \{x_1, \ldots, x_m\}$ of size $m$, where every $x_i$ is a $d$-dimensional vector $x_i = \{x_i^1, \ldots, x_i^d\}$, the BN of every $x_i$ is
bn(x_i) = \{bn(x_i^1), \ldots, bn(x_i^d)\}        (3)
where $bn(x_i^k)$ is
bn(x_i^k) = \frac{x_i^k - E[x^k]}{\sqrt{Var[x^k]}}        (4)
and
E[x^k] = \frac{1}{m} \sum_{i=1}^{m} x_i^k        (5)
Var[x^k] = \frac{1}{m} \sum_{i=1}^{m} \left(x_i^k - E[x^k]\right)^2        (6)
Please note that simply normalizing each input of a layer may change what the layer can represent, so the normalized values must be allowed to be scaled and shifted back. To accomplish this, a pair of parameters $\gamma^k$ and $\beta^k$ is introduced for each dimension $k$, and the final BN result is $bn(x_i) = \{y_i^1, \ldots, y_i^d\}$ where $y_i^k = \gamma^k \, bn(x_i^k) + \beta^k$.
During training, the batch size $m$ is larger than 1. During inference, however, $m = 1$, so the batch means and variances of the layer inputs cannot be computed; instead, the means and variances calculated during training are used for inference.
Moreover, normalization that depends on the mini-batch is neither necessary nor desirable during inference. Thus, at inference the BN transform $y_i^k = \gamma^k \, bn(x_i^k) + \beta^k$ is replaced by the fixed transform in Equation (7):
y_i^k = \frac{\gamma^k}{\sqrt{Var[x^k]}} x_i^k + \left(\beta^k - \frac{\gamma^k E[x^k]}{\sqrt{Var[x^k]}}\right)        (7)
where $\gamma^k$, $\beta^k$, $E[x^k]$ and $Var[x^k]$ are all computed on the training set.
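The following NumPy sketch (an illustration of Equations (3)-(7), not the authors' implementation) contrasts the training-time and inference-time forms of BN; a small epsilon is added inside the square root for numerical stability, which the equations above omit.

    import numpy as np

    def batch_norm_train(x, gamma, beta, eps=1e-5):
        """Equations (3)-(6): normalize each dimension over the batch, then scale and shift."""
        mean = x.mean(axis=0)                       # E[x^k]
        var = x.var(axis=0)                         # Var[x^k]
        x_hat = (x - mean) / np.sqrt(var + eps)     # bn(x_i^k)
        return gamma * x_hat + beta, mean, var      # y_i^k = gamma^k * bn(x_i^k) + beta^k

    def batch_norm_infer(x, gamma, beta, mean, var, eps=1e-5):
        """Equation (7): fixed affine transform using statistics obtained from training."""
        scale = gamma / np.sqrt(var + eps)
        return scale * x + (beta - scale * mean)

    gamma, beta = np.ones(4), np.zeros(4)
    batch = np.random.default_rng(0).normal(size=(8, 4))             # m = 8 examples, d = 4
    y, mean, var = batch_norm_train(batch, gamma, beta)
    y_single = batch_norm_infer(batch[:1], gamma, beta, mean, var)   # m = 1 at inference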

3.3. Activations

The pre-activation feature maps learned by convolution and BN are then passed through nonlinear activation functions. We introduce two activation functions in the following and compare their effects. Notice that all the operations below are element-wise.

3.3.1. ReLU

ReLU [26] is widely used in deep learning. For elements greater than 0 it outputs the element itself; for all other elements it outputs 0. Formally, given an input matrix $X$, the output matrix of ReLU is defined as in Equation (8):
ReLU(X) = \max\{0, X\}        (8)
The left of Figure 3 depicts ReLU activation.
Figure 3. ReLU and Clipped ReLU.

3.3.2. Clipped ReLU

Clipped ReLU is a revision of ReLU. It introduces a parameter $\alpha > 0$ and outputs $\alpha$ for every element greater than $\alpha$. Thus, Clipped ReLU limits the output to the range $[0, \alpha]$. Given an input matrix $X$, Clipped ReLU is defined as in Equation (9):
ClippedReLU(X, \alpha) = \min\{\max\{0, X\}, \alpha\}        (9)
The right of Figure 3 depicts Clipped ReLU activation.
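Both activations are one-liners in NumPy; the sketch below (ours) implements Equations (8) and (9), with $\alpha = 20$ chosen only because several of the works cited in Section 4.5 use that value.

    import numpy as np

    def relu(x):
        """Equation (8): element-wise max(0, x)."""
        return np.maximum(0.0, x)

    def clipped_relu(x, alpha=20.0):
        """Equation (9): element-wise min(max(0, x), alpha)."""
        return np.minimum(np.maximum(0.0, x), alpha)

    x = np.array([-3.0, 0.5, 25.0])
    print(relu(x))            # [ 0.   0.5 25. ]
    print(clipped_relu(x))    # [ 0.   0.5 20. ]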

3.4. Max Pooling

Above we introduced how to calculate $t_c$ and $window_c$ for a CNN layer without a pooling operation. Now we describe their calculation when a max pooling layer follows the CNN. Formally, we refer to the time span covered by a result element after CNN and max pooling as $t_p$, and to the time shift between two neighboring elements as $window_p$. For a $w_p \times h_p$ max pooling with pooling strides $s_w^p \times s_h^p$, $t_p$ and $window_p$ can be calculated from $t_c$ and $window_c$ as in Equation (10):
t_p = t_c + window_c \cdot (w_p - 1), \qquad window_p = s_w^p \cdot window_c        (10)
Substituting Equation (2) into Equation (10), we obtain Equations (11) and (12):
t_p = t_i + (w_c - 1) \cdot window_i + s_w^c \cdot window_i \cdot (w_p - 1)        (11)
window_p = s_w^p \cdot s_w^c \cdot window_i        (12)
Equations (11) and (12) show that max pooling also enlarges the feature's corresponding time span and reduces the number of computation steps. Besides, since max pooling outputs the maximum value, it helps to pick out the most important features from the less useful ones.
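Extending the earlier sketch, the helper below (ours) applies Equation (10) after a convolution layer; for one convolution layer (kernel width 3, stride 1) followed by a 2-wide, stride-2 max pooling, it reproduces the result of Equations (11) and (12).

    def conv_time_span(t_in, window_in, kernel_w, stride_w=1):
        """Equation (2): effect of a convolution layer along the time axis."""
        return t_in + (kernel_w - 1) * window_in, stride_w * window_in

    def pool_time_span(t_c, window_c, pool_w, pool_stride_w):
        """Equation (10): effect of a max-pooling layer along the time axis."""
        return t_c + window_c * (pool_w - 1), pool_stride_w * window_c

    t, window = 25, 10                                    # one MFCC frame: 25 ms span, 10 ms shift
    t, window = conv_time_span(t, window, kernel_w=3)     # convolution, width 3, stride 1
    t, window = pool_time_span(t, window, pool_w=2, pool_stride_w=2)
    print(t, window)   # 55 ms span, 20 ms shift, matching Equations (11) and (12)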

3.5. Bidirectional LSTM

There are many temporal dependencies in speech and its transcriptions. However, some of them may be so long-term that neither CNN nor max pooling can capture them. Therefore, we use an LSTM RNN layer in our model to enable better modeling of these temporal dependencies.

3.5.1. LSTM

The structure of the Long-Short Time Memory (LSTM) cell is shown in Figure 4.
Figure 4. Calculating Cell in LSTM.
At time step $t$, LSTM uses the following information for its computation:
  • $x_t$: input data at the current step $t$.
  • $h_{t-1}$: hidden state at the previous step $t-1$.
  • $c_{t-1}$: cell state at the previous step $t-1$.
Given $x_t$, $h_{t-1}$ and $c_{t-1}$, LSTM first calculates the forget gate $f_t$ (Equation (13)), the input gate $i_t$ (Equation (14)), the output gate $o_t$ (Equation (15)) and the candidate context $\tilde{c}_t$ (Equation (16)).
f_t = \sigma([x_t; h_{t-1}] W_f + b_f)        (13)
i_t = \sigma([x_t; h_{t-1}] W_i + b_i)        (14)
o_t = \sigma([x_t; h_{t-1}] W_o + b_o)        (15)
\tilde{c}_t = F_{\tilde{c}}([x_t; h_{t-1}] W_{\tilde{c}} + b_{\tilde{c}})        (16)
Then, according to $f_t$, $c_{t-1}$, $i_t$ and $\tilde{c}_t$, LSTM calculates the cell state $c_t$ at the current step as depicted in Equation (17), where $\odot$ denotes element-wise multiplication.
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t        (17)
After that, LSTM uses o t and c t to calculate the hidden state h t at current step, which is shown in Equation (18).
h_t = o_t \odot F_h(c_t)        (18)
Commonly, $F_{\tilde{c}}(\cdot)$ and $F_h(\cdot)$ are the hyperbolic tangent function.
Finally, LSTM gives its output $y_t$ at time step $t$, which is the same as the hidden state $h_t$.
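A single LSTM step can be written compactly by fusing the four weight matrices $W_f$, $W_i$, $W_o$, $W_{\tilde{c}}$ into one; the NumPy sketch below (ours, equivalent to Equations (13)-(18) under that fusion) processes a few 39-dimensional frames.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM step, Equations (13)-(18); W maps [x_t; h_{t-1}] to the fused gate pre-activations."""
        z = np.concatenate([x_t, h_prev]) @ W + b          # shape (4 * hidden,)
        f, i, o, c_tilde = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)       # Equations (13)-(15)
        c_tilde = np.tanh(c_tilde)                         # Equation (16), F_c = tanh
        c_t = f * c_prev + i * c_tilde                     # Equation (17), element-wise
        h_t = o * np.tanh(c_t)                             # Equation (18), F_h = tanh
        return h_t, c_t                                    # the output y_t equals h_t

    d_in, d_h = 39, 128
    rng = np.random.default_rng(0)
    W, b = rng.normal(scale=0.1, size=(d_in + d_h, 4 * d_h)), np.zeros(4 * d_h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x_t in rng.normal(size=(5, d_in)):                 # five 39-dimensional input frames
        h, c = lstm_step(x_t, h, c, W, b)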

3.5.2. Stacking Up LSTMs of Opposite Directions

Although the forward recurrent connection reflects the temporal nature of the audio input, it has typically been shown to be beneficial for acoustic models to make full use of future contextual information as well [23]. To take advantage of both history and future information over the entire temporal extent of the input features, we build a bidirectional LSTM by stacking two LSTM layers of opposite directions, which maintains states both time-forward and time-backward. The structure of the BLSTM is demonstrated in Figure 5.
Figure 5. Structure of bidirectional LSTM.
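A bidirectional layer simply runs one LSTM forward and one backward over the same input and joins their per-frame outputs; the sketch below (ours) assumes step_fwd and step_bwd are per-frame step functions such as the lstm_step sketch in Section 3.5.1, each with its own weights.

    import numpy as np

    def run_lstm(x_seq, step, d_h):
        """Unroll a single-direction LSTM over a (T, d_in) sequence; returns (T, d_h) hidden states."""
        h, c, outputs = np.zeros(d_h), np.zeros(d_h), []
        for x_t in x_seq:
            h, c = step(x_t, h, c)
            outputs.append(h)
        return np.stack(outputs)

    def bidirectional_lstm(x_seq, step_fwd, step_bwd, d_h):
        """Combine a time-forward and a time-backward LSTM, as in Figure 5."""
        # e.g. step_fwd = lambda x, h, c: lstm_step(x, h, c, W_fwd, b_fwd)
        h_fwd = run_lstm(x_seq, step_fwd, d_h)
        h_bwd = run_lstm(x_seq[::-1], step_bwd, d_h)[::-1]   # realign to forward time order
        return np.concatenate([h_fwd, h_bwd], axis=-1)       # per-frame concatenation

Section 4.3 also evaluates replacing the concatenation in the last line with an element-wise sum of the two directions.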

3.6. CTC

Before the proposal of CTC, several difficulties stood in the way of end-to-end speech recognition. Firstly, the database had to be aligned, which is very exhausting and time-consuming work; this made it hard to build large databases. Secondly, building a well-performing ASR was a tough process, because it took a variety of expertise to design modules such as HMMs, CRFs, pronunciation lexicons, etc.
By interpreting the network outputs as probability distributions over the space of possible labels conditioned on the inputs, CTC addresses these problems properly.
Roughly, CTC can be separated into two procedures: path probability calculation and path merging. The key to both procedures is the introduction of a new blank label ‘-’, which means ‘no output’, and of an intermediate structure, the path.
For an input sequence $\{x_1, \ldots, x_T\}$ of length $T$, CTC first computes an $(N+1)$-dimensional vector at every time step, where $N$ is the number of elements in the vocabulary $V$. Then, at each time step $i$, CTC maps this output vector to the output distribution $p_i = \{p_{i,1}, \ldots, p_{i,N+1}\}$ by a SoftMax operation. Here $p_{i,j}$ ($j < N+1$) is the probability of outputting the $j$-th element of the vocabulary at time $i$, and $p_{i,N+1}$ is the probability of outputting the blank label ‘-’.
After this computation, CTC has mapped its input sequence $\{x_1, \ldots, x_T\}$ to a probability sequence $P = \{p_1, \ldots, p_T\}$ of the same length $T$.
If we pick the $w_i$-th element from the set $V \cup \{-\}$ at each time step $i$ and put the picked elements together in chronological order, we get an output sequence $\mathcal{P} = \{w_1, \ldots, w_T\}$ of length $T$. Such a sequence $\mathcal{P}$ is called a path. Since $p_{i, w_i}$ is the probability of outputting the $w_i$-th element of $V \cup \{-\}$ at time $i$, the probability of the path $\mathcal{P}$ can be calculated as in Equation (19).
P(\mathcal{P}) = \prod_{i=1}^{T} p_{i, w_i}        (19)
The above is the procedure we call path probability calculation. In this procedure the path has the same length $T$ as the input sequence, which does not conform to actual situations: the transcription is usually much shorter than the input sequence. Therefore, we merge related paths into a shorter label sequence. This is the path merging procedure, which mainly consists of two operations:
  • Remove repeated labels. If the same output occurs at several successive time steps, the repetitions are removed and only one of them is kept. E.g., the two different 7-time-step paths ‘c-aa-t-’ and ‘c-a-tt-’ both give the same result sequence ‘c-a-t-’ after removing repeated labels.
  • Remove the blank label ‘-’ from the path. Since ‘-’ stands for ‘no output at this step’, it is removed to get the final label sequence. E.g., the sequence ‘c-a-t-’ becomes ‘cat’ after removing all the blank labels.
In the merging procedure shown above, ‘c-aa-t-’ and ‘c-a-tt-’ are two paths of length 7, while ‘cat’ is a label sequence of length 3. We can see that a short label sequence may be merged from several long paths. For example, assume the label sequence ‘cat’ comes from paths of length 4; then there are 7 different paths included, as shown in Figure 6.
Figure 6. Paths of length 4 for label sequence ‘cat’.
The decoding lattice of these paths is demonstrated in Figure 7. In this figure, 1, 2, 3 and 4 stand for the time steps, and ‘-’, ‘c’, ‘a’ and ‘t’ stand for the possible outputs at each time step. Moving along the arrows, every path that starts at time step 1 and stops at time step 4 is a legal path for the label sequence ‘cat’.
Figure 7. Illustration of decoding the label ‘cat’ from paths of length 4.
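The count of seven paths in Figures 6 and 7 can be verified by brute force; the sketch below (ours) enumerates all length-4 paths over {'-', 'c', 'a', 't'} and applies the two merging operations.

    from itertools import product

    def merge_path(path):
        """Path merging: drop repeated successive labels, then drop the blank '-'."""
        collapsed = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
        return ''.join(s for s in collapsed if s != '-')

    symbols = ['-', 'c', 'a', 't']
    paths = [''.join(p) for p in product(symbols, repeat=4) if merge_path(p) == 'cat']
    print(len(paths), paths)    # 7 paths of length 4 merge to the label sequence 'cat'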
In addition to producing the final label sequence from paths, the path merging procedure also calculates the final label sequence's probability. For a label sequence $L$ merged from $k$ paths $\{\mathcal{P}_1, \ldots, \mathcal{P}_k\}$, its probability $P(L)$ is calculated as in Equation (20):
P(L) = \sum_{i=1}^{k} P(\mathcal{P}_i)        (20)
From the calculation described above we can see that the label sequence's probability is differentiable with respect to the network outputs. This enables us to train the model with the back-propagation algorithm to maximize the true label sequence's probability, and to use the trained model to recognize speech by taking the label sequence with the maximum probability as the final result.
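Equations (19) and (20) can likewise be illustrated with a brute-force computation over a toy softmax output (our sketch, for illustration only); practical CTC implementations compute the same sum efficiently with a forward-backward dynamic program rather than by enumeration.

    import numpy as np
    from itertools import product

    def merge_path(path):
        """Drop repeated successive labels, then drop the blank '-'."""
        collapsed = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
        return ''.join(s for s in collapsed if s != '-')

    symbols = ['c', 'a', 't', '-']                      # toy vocabulary V plus the blank label
    logits = np.random.default_rng(0).normal(size=(4, len(symbols)))       # network outputs, T = 4
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)     # per-step softmax p_i

    def path_prob(path):
        """Equation (19): product over time of the picked symbols' probabilities."""
        return float(np.prod([probs[i, symbols.index(s)] for i, s in enumerate(path)]))

    # Equation (20): the probability of the label 'cat' sums over every path that merges to it.
    p_cat = sum(path_prob(p) for p in product(symbols, repeat=4) if merge_path(p) == 'cat')
    print(p_cat)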

4. Experiments and Discussion

4.1. Datasets and Input Features

We train our model completely on the AISHELL-1 corpus, using neither any in-house databases nor an external language model. The corpus is divided into training, development, and test sets. There are 120,098 utterances from 340 speakers in the training set, 14,326 utterances from 40 speakers in the development set, and 7176 utterances from 20 speakers in the test set. For each speaker, around 360 utterances (about 26 min of speech) are released. Table 1 provides a summary of all subsets in the corpus.
Table 1. Data structure.
We use MFCCs as the model's input features: 13-dimensional MFCCs with delta and acceleration coefficients (39-dimensional features in total). The MFCC features are derived from the raw audio files with a frame window of 25 ms and a shift of 10 ms between successive frames.
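One way to obtain these features is sketched below; this is our illustration under the assumption that the librosa library is used (the paper does not specify the front-end toolkit), with "utt.wav" as a placeholder path and AISHELL-1's 16 kHz sampling rate.

    import numpy as np
    import librosa

    y, sr = librosa.load("utt.wav", sr=16000)                  # placeholder path, 16 kHz audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),         # 25 ms analysis window
                                hop_length=int(0.010 * sr))    # 10 ms shift between frames
    delta = librosa.feature.delta(mfcc)                        # first-order (delta) coefficients
    accel = librosa.feature.delta(mfcc, order=2)               # second-order (acceleration) coefficients
    features = np.concatenate([mfcc, delta, accel], axis=0).T  # shape: (frames, 39)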
The decoding target vocabulary includes all 4334 characters (4328 Chinese characters and 6 special tokens ‘ ’, ‘a’, ‘b’, ‘c’, ‘k’, ‘t’) that occur in the AISHELL-1 transcriptions. Please note that the input to CTC is 4335-dimensional (4334 + 1), because the extra blank label ‘-’ has to be added to the vocabulary when decoding with CTC.

4.2. Convolution Neural Network

Table 2 shows the performance of models with different CNN depths. In these models every CNN layer has 64 feature maps. In the table, the numbers in bold give the best WER of each group of models.
Table 2. WERs of models with various CNN depth.
For models without BLSTM (containing only CNN and fully connected layers), performance improves as we deepen the CNN from 2 to 5 layers. However, a deeper CNN does not necessarily lead to better performance, as the model with 6 CNN layers has a higher WER than the model with 5 CNN layers.
When we deepen the CNN, model performance first increases. It shows that local contextual features play a significant role for speech recognition. Deeper CNN can learn local features of a longer time scope and a larger frequency scope. It enables the model to use more local information to determine the output at current step.
However, this positive effect does not always hold. When the local context covers too large a scope (for the model without BLSTM, when the CNN has more than 5 layers), it introduces so much unnecessary information that it dulls the really distinctive features and results in worse performance.
As shown in the third column of Table 2, this phenomenon also exists in the models using BLSTM. However, since the BLSTM can model contextual information itself, the turning point from improving to degrading performance comes earlier (at only 3 CNN layers, rather than 5).
We find that this phenomenon of performance first improving and then degrading has something to do with the speaking speed. In these experiments, the kernels of the CNN layers from the first to the sixth are $3 \times 2$, $2 \times 2$, $2 \times 2$, $2 \times 2$, $2 \times 1$, $2 \times 1$, and the convolution strides are all $1 \times 1$. For the max pooling layers from the first to the fourth (there is no higher max pooling layer), the pooling sizes are $2 \times 2$, $2 \times 2$, $2 \times 2$, $2 \times 1$, and the pooling strides are $2 \times 2$, $2 \times 1$, $2 \times 1$, $2 \times 1$. Given that every MFCC frame's time span is 25 ms and its shift window is 10 ms, and calculating as described in Section 3.4, we find that the fifth CNN layer covers a time span of 335 ms, while the fourth layer covers a much smaller span (175 ms) and the sixth layer a much larger one (495 ms). We then analyze the datasets and find that the speaking speeds of the training, development, and test sets and of the total AISHELL-1 dataset are 3.2, 3.1, 2.9 and 3.2 characters per second, respectively. This means that in the dataset every character corresponds to an audio piece lasting about 300 ms, which is consistent with the fifth layer's time span.

4.3. Bidirectional LSTM

Table 2 also compares models with and without BLSTM. Every line in Table 2 reveals that the model using BLSTM significantly outperforms the one without. This may be because BLSTM's ability to model contextual information is much stronger than CNN's. At each time step, BLSTM uses the forget gate to determine how much history information should be kept, and uses the input gate to determine how much new information should be added. Since all the gates are derived from the previous hidden state and the current input, BLSTM can set different weights for different locations when computing contextual information at different times. Besides, by setting far-away locations' weights to 0 (or nearly 0), BLSTM can dynamically determine the context span. In addition, and more importantly, BLSTM can model contextual information from both the forward and the backward direction. All these together enable BLSTM to surpass CNN.
In the previous experiments, the number of hidden units in the BLSTM is 128. For BLSTM, the number of hidden units is very influential for performance. A different number of hidden units means that the BLSTM uses features of a different dimension to model the contextual information and the current input. It cannot work properly with too low a dimension, while too high a dimension may introduce unnecessary feature patterns that confuse the recognition model. So it is important to set the number of hidden units properly.
The performance of models with different hidden dimensions is given in Table 3; the number in bold is the best WER. Model performance improves as we enlarge the hidden dimension from 128 to 768, reaching the best WER of 19.2% with a hidden dimension of 768. However, when we enlarge the hidden dimension from 768 to 896, performance begins to degrade. We think this is because 896 is too high a dimension for the model: it introduces unnecessary feature patterns which confuse the recognition model, and as a result the WER increases.
Table 3. Comparison of Models with different number of hidden units in BLSTM.
Many works use more than one RNN layer in ASR; for example, DeepSpeech2 uses 5 RNN layers. We conduct experiments to compare the performance of models with different BLSTM depths; the results are given in Table 4. The bold number in this table is the best WER.
Table 4. Comparison of models with different BLSTM depth.
The model with 2 BLSTM layers performs even worse than the one with only 1 BLSTM layer. BLSTM can model contextual information from two directions, and the results in Table 2 show that a one-BLSTM model can already learn contextual information sufficiently (it allows the CNN to be reduced from 5 to 3 layers while achieving the best WER). Therefore, using more BLSTM layers only unnecessarily enlarges the context scope and introduces more useless features, which confuses the model and degrades performance.
Commonly, the outputs of the two opposite-direction LSTMs are concatenated along the feature dimension at each time step as the input for subsequent neural layers. Nevertheless, some works add them up instead of concatenating them. Since the addition may cancel out opposite-direction features and make them difficult to distinguish, we expect concatenation to achieve better results than addition. The experimental results in Table 5 verify this analysis. The bold number in this table is the best WER.
Table 5. Comparison of models using concatenation and addition.

4.4. Batch Normalization

BN uses the means and variances of the training set for recognition on the test set. Thus, it implicitly requires the training and testing sets to have similar means and variances; otherwise it may fail to improve the performance.
Since AISHELL-1 is not a very large corpus (it contains 170 h of speech, while some English corpora contain tens of thousands of hours of speech data), the training and testing sets may have different distributions. We therefore compare models with and without BN to verify BN's effect. The results are given in Table 6, where the bold number gives the best WER. They show that BN improves the CNN+BLSTM+CTC model on the AISHELL-1 corpus remarkably.
Table 6. Comparison of Batch Normalization’s effect.

4.5. Activations

Activation functions also impact the model's performance significantly. Many recent outstanding works use the Clipped ReLU activation $F(x, \alpha) = \min\{\max\{x, 0\}, \alpha\}$ [1,9,10,14]; some of them set $\alpha = 20$ [1,9,14].
At the same time, there are still many works using the ReLU activation $F(x) = \max\{x, 0\}$ [7,19,22,23,24,27,28].
We conduct experiments to compare these two activation functions; the results are shown in Table 7. For our model on the AISHELL-1 dataset, the ReLU activation gives better performance.
Table 7. Effect of different Activations.

4.6. Comparison with Existing Works

We compare our work with two existing works: CNN-input [6] and ACS [7]. CNN-input [6] achieves a WER of 20.68% on the AISHELL-1 data without a language model. In [7], the ACS method gets a CER of 21.6% on the AISHELL-1 test set without a language model, while with bidirectional contexts and an RNN language model added (referred to as ACS+Bidirectional Contexts+RNN-LM), the CER drops to 18.7%. The comparison is shown in Table 8, where the best WER is given in bold. Our CNN+BLSTM+CTC model achieves the best performance.
Table 8. Comparison with existing works.

5. Conclusions

It is difficult to build and train conventional ASR systems, since such systems contain many sub-modules and need lots of domain knowledge. As for end-to-end Mandarin ASR systems, most of them are neither reproducible nor comparable because they use specific language models and in-house training databases which are not freely available.
In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR. The CNN+BLSTM+CTC ASR uses CNN to learn local speech features and uses BLSTM to learn history and future contextual information. It is trained completely on the by-far-largest open-source Mandarin speech corpus AISHELL-1, using neither any in-house database nor an external language model. It achieves a WER of 19.2%, outperforming the existing best work. Since all the data corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.

6. Future Works

Although our work achieves a good performance, there are still some future works to do.
  • We use MFCC as the input features. However, for English ASR, there are works using original wav signals, spectrogram, and other acoustic features as input. For Mandarin speech recognition, modeling units of acoustic model also affect the performance significantly [29]. We will compare their differences and find the best input acoustic features for Mandarin ASR.
  • Bidirectional LSTM suffers from long latency, so it does not suit the online ASR scenario. We will explore unidirectional LSTM or other techniques to shorten the latency.
  • Language model is crucial for ASR, and [12] shows that with enough speech transcriptions, end-to-end ASR can learn language model implicitly. Therefore, another future work is to explore language model and develop end-to-end Mandarin ASR on some larger databases.

Author Contributions

Conceptualization, D.W., X.W. and S.L.; Funding acquisition, X.W. and S.L.; Investigation, D.W.; Methodology, D.W.; Project administration, X.W.; Supervision, X.W. and S.L.; Writing—original draft, D.W.; Writing—review & editing, X.W. and S.L.

Funding

This research was funded by Fund of Science and Technology on Parallel and Distributed Processing Laboratory (grant number 6142110180405).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACS: Adaptive Computation Steps
ASR: Automatic Speech Recognition
BLSTM: Bidirectional Long-Short Time Memory
BN: Batch Normalization
CD-phone: Context-Dependent phone
CNN: Convolutional Neural Network
CRF: Conditional Random Field
CTC: Connectionist Temporal Classification
DNN: Deep Neural Network
GMM: Gaussian Mixed Model
HMM: Hidden Markov Model
IME: Input Method Editor
LM: Language Model
LSTM: Long-Short Time Memory
LVCSR: Large Vocabulary Continuous Speech Recognition
MFCC: Mel-Frequency Cepstrum Coefficient
ReLU: Rectified Linear Unit
RNN: Recurrent Neural Network
WER: Word Error Rate

References

  1. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. DeepSpeech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
  2. Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
  3. Rousseau, A.; Deléglise, P.; Estève, Y. TED-LIUM: An Automatic Speech Recognition dedicated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, 23–25 May 2012. [Google Scholar]
  4. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
  5. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AIShell-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
  6. Wang, Y.; Zhang, L.; Zhang, B.; Li, Z. End-to-End Mandarin Recognition based on Convolution Input. In Proceedings of the 2018 2nd International Conference on Information Processing and Control Engineering (ICIPCE 2018), Shanghai, China, 27–29 July 2018; Volume 214, p. 01004. [Google Scholar]
  7. Li, M.; Liu, M. End-to-end speech recognition with adaptive computation steps. arXiv 2018, arXiv:1808.10088. [Google Scholar]
  8. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  9. Hannun, A.Y.; Maas, A.L.; Jurafsky, D.; Ng, A.Y. First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv 2014, arXiv:1408.2873. [Google Scholar]
  10. Maas, A.; Xie, Z.; Jurafsky, D.; Ng, A. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 345–354. [Google Scholar]
  11. Sak, H.; Senior, A.; Rao, K.; Beaufays, F. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition. arXiv 2015, arXiv:1507.06947. [Google Scholar]
  12. Soltau, H.; Liao, H.; Sak, H. Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 3707–3711. [Google Scholar]
  13. Audhkhasi, K.; Ramabhadran, B.; Saon, G.; Picheny, M.; Nahamoo, D. Direct Acoustics-to-Word Models for English Conversational Speech Recognition. arXiv 2017, arXiv:1703.07754. [Google Scholar]
  14. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
  15. Li, A.; Yin, Z.; Wang, T.; Fang, Q.; Hu, F. RASC863-A Chinese speech corpus with four regional accents. In Proceedings of the ICSLT-o-COCOSDA, New Delhi, India, 17–19 November 2004. [Google Scholar]
  16. Wang, D.; Zhang, X. THCHS-30: A free Chinese speech corpus. arXiv 2015, arXiv:1512.01882. [Google Scholar]
  17. Wang, D.; Tang, Z.; Tang, D.; Chen, Q. OC16-CE80: A Chinese-English mixlingual database and a speech recognition baseline. In Proceedings of the 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), Bali, Indonesia, 26–28 October 2016; pp. 84–88. [Google Scholar]
  18. Chen, N.F.; Lim, B.P.; Hasegawa-Johnson, M.A. Multitask Learning for Phone Recognition of Underresourced Languages Using Mismatched Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 501–514. [Google Scholar]
  19. Zhou, J.; Jiang, T.; Li, L.; Hong, Q.; Wang, Z.; Xia, B. Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding. arXiv 2018, arXiv:1811.09355. [Google Scholar]
  20. Tu, M.; Grabek, A.; Liss, J.; Berisha, V. Investigating the Role of L1 in Automatic Pronunciation Evaluation of L2 Speech. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1636–1640. [Google Scholar]
  21. Zhang, Y.; Zhang, P.; Yan, Y. Improving Language Modeling with an Adversarial Critic for Automatic Speech Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3348–3352. [Google Scholar]
  22. Lugosch, L.; Tomar, V.S. Tone Recognition Using Lifters and CTC. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2305–2309. [Google Scholar] [CrossRef]
  23. Li, J.; Wang, X.; Zhao, Y.; Li, Y. Gated Recurrent Unit Based Acoustic Modeling with Future Context. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1788–1792. [Google Scholar]
  24. Li, J.; Shan, Y.; Wang, X.; Li, Y. Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context. arXiv 2018, arXiv:1811.10169. [Google Scholar]
  25. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  26. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  27. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; Courville, A. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 410–414. [Google Scholar]
  28. Battenberg, E.; Chen, J.; Child, R.; Coates, A.; Li, Y.G.Y.; Liu, H.; Satheesh, S.; Sriram, A.; Zhu, Z. Exploring neural transducers for end-to-end speech recognition. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 206–213. [Google Scholar]
  29. Chang, E.; Zhou, J.; Di, S.; Huang, C.; Lee, K.F. Large vocabulary Mandarin speech recognition with different approaches in modeling tones. In Proceedings of the Sixth International Conference on Spoken Language Processing, Beijing, China, 16–20 October 2000. [Google Scholar]
