A 2D Convolutional Gating Mechanism for Mandarin Streaming Speech Recognition

Recent research shows that the recurrent neural network-Transducer (RNN-T) architecture has become a mainstream approach for streaming speech recognition. In this work, we investigate the VGG2 network as the input layer to the RNN-T in streaming speech recognition. Specifically, before the input feature is passed to the RNN-T, we introduce a gated-VGG2 block, which uses the first two layers of the VGG16 to extract contextual information in the time domain, and then uses a SEnet-style gating mechanism to control what information in the channel domain is propagated to the RNN-T. The results show that the RNN-T model with the proposed gated-VGG2 block brings significant performance improvement when compared to the existing RNN-T model, and it has a lower latency and character error rate than the Transformer-based model.


Introduction
With the development of artificial intelligence, automatic speech recognition (ASR) has contributed to improving the efficiency of human productive activities (e.g., recording meetings, automatically captioning videos, and interacting with modern smart devices by sending voice commands directly). The ASR process generally converts speech signals into symbol sequences; for example, the speech signal can be an utterance from a speaker, and the symbol sequence is its corresponding text. Whereas offline ASR systems take the entire utterance as input to produce output symbols, streaming ASR systems process the speech signal as a streaming input, which means that hypotheses are generated as soon as possible after the first frame has arrived. We are interested in such low-latency systems not only for ASR itself, but also for downstream tasks, such as spoken dialogue systems [1] and real-time translation systems [2].
Over the past few years, some end-to-end models for offline applications [3][4][5][6][7] have achieved performance comparable to that of humans. However, these models cannot be directly applied to real-time scenarios because of their high latency. In contrast, recurrent neural networks (RNNs) are a natural architecture for building a streaming model, since their output relies only on the current input and the previous state history. Several models employing RNNs with LSTM [8] cells for streaming purposes have been proposed previously, including the recurrent neural aligner (RNA) [9], neural transducer [10], and RNN-Transducer (RNN-T) [11][12][13]. RNN-T is very well suited for on-device applications because it supports streaming with high accuracy and low latency [14].
However, the sequential nature of RNNs also restricts RNN-T to its input at the current time step, so future information is missed. Therefore, several models based on attention mechanisms have been proposed that allow Transducer models to exploit contextual information. With the Transformer [17] becoming the state-of-the-art approach in language modeling and machine translation [18][19][20], the Transformer-Transducer (T-T) [15,16] has been proposed for speech recognition. It replaces the LSTM with the encoder part of the Transformer, which mainly comprises multi-head attention mechanisms, feedforward networks, and layer normalization. Experiments based on T-T show that the accuracy of a streaming model that considers contextual information is comparable to that of offline models. Truncated self-attention adopted in [15] and masked self-attention adopted in [16] both reduce the error rate of the streaming model.
In general, the T-T model requires a deep Transformer. If each layer of the Transformer computes attention scores over the same context range (e.g., in [16], each layer masks the same number of future speech frames), the lookahead accumulates across layers and the deep Transformer incurs a high latency. One solution is to make the context-extracting mechanism independent of the deep network, so that it acts only as an input layer. A depth-wise convolutional network is well suited to serve as such an input layer.
The output of the depth-wise convolutional network incorporates spatial and channel-domain features through its receptive field. The multi-channel feature is a distinctive property of the convolutional network, also known as the width of the network, which is determined by the number of convolutional kernels. That is, the more convolutional kernels there are, the more channel-wise features can be extracted. Several studies and experiments [21,22] have shown that shallow, wide convolutional networks outperform deep, narrow convolutional networks for these tasks.
In this paper, we propose an Encoder architecture for the Transducer that combines the ability of convolutional networks to extract contextual information with the ability of LSTM to learn historical information. Our model is trained on the AISHELL-1 [23] dataset, which contains over 170 h of speech recorded by 400 speakers in a quiet office, and obtains a character error rate (CER) of 12.9%, outperforming the previously proposed LSTM-based Transducer models [11][12][13]. When compared with the Transformer-based Transducer model with the same convolutional networks, our model has only the latency of the model that employs a unidirectional Transformer, yet it achieves a CER comparable to that of the model that looks ahead three frames.
The main contributions of this paper are as follows:
1. We combine convolutional networks with LSTM as the Encoder of Transducer to build a low-latency streaming speech recognition model. These convolutional networks are built in the form of VGG2 [24] networks, which are the first two layers of VGG16, a deep convolutional network architecture. Additionally, the maximum pooling layer is retained to reduce the frame rate, which improves the training efficiency.
2. We introduce a two-dimensional (2D) convolutional gating mechanism inside VGG2, called gated-VGG2, which controls what information will flow into LSTM. The gating mechanism employs half of the channel features that are generated by the convolutional network to form gate states acting on the other half of the channel features, so that twice the channel information can be learned, which improves the performance of the model.
3. There are no temporal dependencies in the gating mechanism, so that our model is easy to train in parallel.
This paper is organized as follows. Section 2 presents the work related to this paper. Section 3 describes the structure of the proposed model. Section 4 presents the experimental results on the AISHELL-1 dataset, and Section 6 provides concluding remarks.

Related Works
Developing streaming speech recognition systems has been an active topic in speech recognition in recent years. As introduced in the previous section, RNN-T and T-T are two commonly used streaming models. Several studies are devoted to integrating convolutional networks into the T-T model. Some of the convolutional enhancement methods proposed in these studies originate from the fields of natural language processing and computer vision.
T-T was already combined with the VGG network when it was first proposed in [15]. As an input layer, the VGG network plays two roles in T-T: (1) adding relative position information to the Transformer; and (2) downsampling the input features through the pooling layer. Another study [25] also confirms that the convolutional approach is more helpful for extracting the position information of the input sequence.
Whereas the convolutional network is used as the input layer in VGG T-T, the convolutional network alternates with the deep Transformer layers in the Conv. Transformer-Transducer (ConvT-T) [26]. The convolutional network and Transformer form three blocks in ConvT-T, with the convolutional network in the latter two blocks incorporating more implicit features. Moreover, there are only unidirectional Transformer layers in ConvT-T, which yields a low-latency model.
Conformer [6] is designed as an architecture with a multi-head attention module, a convolutional module, and a pair of feedforward network modules. The multi-head attention module and feedforward network modules follow the form of the Transformer, and the convolutional module is the key part of the Conformer, which is mainly a depthwise convolutional layer sandwiched between two point-wise convolutional layers. Specifically, the Conformer chooses gated linear units (GLU) [27] as the activation function for the first point-wise convolutional layer. The output channels of the first point-wise convolutional layer are twice the number of input channels, and the GLU resizes the output features to exactly the size of the input features, while they contain twice the channel information. In fact, the number of input channels is counted as the feature dimension at the current time step, so GLU acts as a feature-wise (frequency-wise) gating mechanism in Conformer.
Research on convolutional channels has focused on making convolutional networks more expressive by enhancing or suppressing specific channel features. Some of the approaches derived from these studies have become essential modules for convolutional networks, e.g., NetInNet [28] and SEnet [29]. NetInNet proposed 1×1 convolutions, whose weights are learned per channel for a specific task. In contrast, SEnet added a gating mechanism to the channels, which learns gating states starting from the global average of each channel's features.
In our work, we apply the convolution method to RNN-T. Similar to Conformer, GLU is chosen as the channel gating mechanism, but our GLU acts directly on the multi-channel features that are generated by a set of convolutional kernels, which makes it a channel-wise gating mechanism. Additionally, we experimentally tested another gating mechanism: gated tanh units (GTU). The results are shown in Section 4, where GTU outperforms GLU in our proposed model.

Transducer
Consider a model that consists of an Encoder, a Prediction network, and a Joint network, as illustrated in Figure 1. Given an input speech feature sequence $x = (x_1, x_2, \ldots, x_T)$ of length $T$ and the output symbol sequence $y = (y_1, y_2, \ldots, y_U)$ of length $U$, the Encoder encodes the inputs $x_{1:t}$ to obtain the acoustic feature representation $f_t$, and the Prediction network encodes the symbols $y_{0:u-1}$ to produce the symbol representation $g_{u-1}$. We denote by $Y$ the output symbol space that consists of $K$ symbols, by $\phi$ the blank symbol, indicating that nothing is output at the current time-step, and by $\bar{Y} = (\phi \cup Y)$ the extended output space. For streaming, the Joint network produces a probability distribution $P(k|t, u-1)$ over $\bar{Y}$ for each combination of $f_t$ at input time-step $t$ and $g_{u-1}$ at output time-step $u-1$; this distribution is obtained through a softmax layer and yields a symbol $y_u$ that becomes the next input of the Prediction network if it is a non-blank symbol. Otherwise, $g_{u-1}$ is combined with the next frame $f_{t+1}$ to keep on predicting $y_u$. The Transducer can be expressed as Equations (1)-(5), where $y_0$ is $\phi$ and the superscript $k$ denotes the $k$-th element of the vectors in Equation (3).
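The emission rule described above (feed a non-blank symbol back to the Prediction network, otherwise advance to the next frame) can be illustrated with a minimal greedy decoding sketch. It is not the decoder used in this paper: the callables `predict` and `joint`, the blank index `BLANK = 0`, and the cap `max_symbols_per_frame` are hypothetical names introduced only for illustration.

```python
import numpy as np

BLANK = 0  # assumed index of the blank symbol (phi) in the extended symbol space


def greedy_transducer_decode(enc_out, predict, joint, max_symbols_per_frame=5):
    """Greedy Transducer decoding over encoder frames f_1..f_T.

    enc_out: array of shape (T, H), one acoustic representation f_t per frame.
    predict: hypothetical callable mapping the symbol history (list of ints,
             starting with blank as y_0) to the representation g_{u-1}.
    joint:   hypothetical callable mapping (f_t, g_{u-1}) to scores over the
             extended symbol space; argmax of the scores equals argmax of
             their softmax, so the softmax is skipped here.
    """
    hyp = [BLANK]                      # y_0 is the blank symbol
    g = predict(hyp)
    for f_t in enc_out:
        for _ in range(max_symbols_per_frame):
            k = int(np.argmax(joint(f_t, g)))
            if k == BLANK:
                break                  # emit nothing; keep g and move to f_{t+1}
            hyp.append(k)              # non-blank: becomes the next Prediction input
            g = predict(hyp)
    return hyp[1:]                     # drop the initial blank
```

With toy stand-ins such as `predict = lambda hyp: np.eye(3)[hyp[-1]]` and `joint = lambda f, g: f - 2.0 * g + np.array([0.5, 0.0, 0.0])`, each one-hot frame emits its symbol once and then falls back to blank.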

RNN-Transducer
RNN-Transducer (RNN-T) employs LSTM for both the Encoder and the Prediction network. In the version of LSTM used in this paper, $h_t$ and $h_{t-1}$ are the hidden states at time-steps $t$ and $t-1$, $c_t$ and $x_t$ are the cell state and input at time-step $t$, and $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates, respectively. $W_{ij}$ and $b_{ij}$ refer to the learnable weights and biases between units with the gate-name indices $i$ and $j$. $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product. The previous cell states are not involved in the formation of the gates, which gives the model fewer parameters and differs from the LSTM in [11,12]. The Joint network is parameterized by the learnable weights $W_f$, $W_g$, $W_o$ and biases $b_i$, $b_o$.
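For reference, a peephole-free LSTM and a joint network that are consistent with the symbols defined above can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the paper's equations: the split of each bias into an input part and a hidden part, the $W_{h\cdot}$ naming of the recurrent weights, and the tanh nonlinearity in the joint network are assumptions, and $z_{t,u}$ is a name introduced only here. Note that $f_t$ denotes the forget gate in the first block and the Encoder output in the second, following the notation of the text.

```latex
\begin{align*}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}

\begin{align*}
z_{t,u} &= \tanh(W_f f_t + W_g g_{u-1} + b_i) \\
P(k \mid t, u-1) &= \mathrm{softmax}(W_o z_{t,u} + b_o)^k
\end{align*}
```

None of the gate equations contains $c_{t-1}$, which is the property the text refers to when it says that previous cell states are not involved in forming the gates.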

Training
The Transducer introduces $\phi$ into the hypothesis to align speech sequences with symbol sequences; such a hypothesis is called an alignment $a$. By removing $\phi$, the alignment $a$ can be folded into its corresponding symbol sequence $y$ (e.g., $a = (\phi, y_1, \phi, y_2, y_3, \phi)$ is equivalent to $y = (y_1, y_2, y_3)$). Given an input speech sequence $x$ and a target sequence $y$, the probability of generating $y$ is defined as the sum over the conditional probabilities of all alignments, $P(y|x) = \sum_{a \in A} P(a|x)$, where $A$ is the set of all alignments equivalent to $y$.
Thus, $P(y|x)$ can be rewritten as the sum of the products of the forward and backward probabilities of all points $(t, u)$ that satisfy $n = t + u$, for any $n$ with $1 \le n \le T + U$.
The model is trained to minimize the loss $-\ln P(y|x)$ of the target sequence $y$.
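As an illustration of how the sum over alignments can be evaluated without enumerating them, the following sketch computes $-\ln P(y|x)$ with forward variables only, which yields the same $P(y|x)$ as the forward-backward product. The input layout is an assumption of this sketch: `log_blank[t, u]` and `log_emit[t, u]` hold the log-probabilities of emitting blank and of emitting the next target symbol at lattice node $(t, u)$, and both arrays are hypothetical inputs that a joint network would provide.

```python
import numpy as np


def transducer_neg_log_likelihood(log_blank, log_emit):
    """Return -ln P(y|x) for one utterance via the forward recursion.

    log_blank: shape (T, U + 1), log-prob of blank at node (t, u).
    log_emit:  shape (T, U), log-prob of emitting y_{u+1} at node (t, u).
    Indices are 0-based; every path ends with a blank at node (T-1, U).
    """
    T, U1 = log_blank.shape
    log_alpha = np.full((T, U1), -np.inf)
    log_alpha[0, 0] = 0.0                     # alpha(0, 0) = 1
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            # Reach (t, u) by a blank from (t-1, u) or by a symbol from (t, u-1).
            via_blank = log_alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
            via_emit = log_alpha[t, u - 1] + log_emit[t, u - 1] if u > 0 else -np.inf
            log_alpha[t, u] = np.logaddexp(via_blank, via_emit)
    return -(log_alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1])
```

For $T = 2$, $U = 1$ and all probabilities equal to 0.5, there are two alignments of probability 0.125 each, so the function returns $\ln 4$.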

Extension to Gated-VGG2 RNN-T
So far, we have considered the RNN-T model, in which each output is conditioned only on its corresponding historical information. This model is too restrictive to consider the contextual information of the current frame. In this section, we extend RNN-T with an input layer composed of convolutional networks, pooling layers, and a gating mechanism, which fuses the spatial and channel information of the input features, as illustrated in Figure 2. We call this model a gated-VGG2 RNN-T, since the Encoder consists of a gated-VGG2 block and an N-layer LSTM, and the Prediction network is a multilayer LSTM. The gated-VGG2 block is designed based on VGG2, the first two layers of VGG16, and a 2D convolutional gating mechanism. The VGG2 network is organized as follows: there are four convolutional layers, each of which applies the rectified linear unit (ReLU) [30] as the nonlinear activation function, and each pair of convolutional layers is followed by a maximum pooling layer. In our work, the gating mechanism is inserted between the last convolutional layer and its activation function. When the input speech features are regarded as single-channel features, a set of convolutional kernels outputs multi-channel features. The gating mechanism performs element-wise multiplication of channel-specific spatial-domain features, so we call it a 2D convolutional gating mechanism. After the series of convolutional layers, pooling layers, and the gating mechanism, we denote the output of the gated-VGG2 block as $\hat{x}$.

Convolutional Network
Each convolutional layer maps the input features $x$ with $C_{in}$ channels to output features $h$ with $C$ channels as in Equation (19), where $s$ is the index of the $s$-th input channel, $k_c$ denotes the parameters of the $c$-th kernel, and $h_c$ is the $c$-th channel output. For simplicity of expression, the bias is neglected in the formula.
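The channel mapping described above can be sketched in NumPy as follows: each output channel $h_c$ sums, over the input channels $s$, the response of the $c$-th kernel on channel $x_s$. The sketch assumes zero "same" padding, stride 1, odd kernel sizes, and cross-correlation (the convention of most deep learning toolkits); the function name and array layout are chosen for illustration only, and the bias is omitted as in the text.

```python
import numpy as np


def conv2d_multichannel(x, kernels):
    """Bias-free 2D convolution layer with zero 'same' padding and stride 1.

    x:       input features, shape (C_in, T, D).
    kernels: kernel parameters, shape (C, C_in, kh, kw) with odd kh and kw.
    Returns h of shape (C, T, D), where h[c] sums over all input channels s.
    """
    c_out, c_in, kh, kw = kernels.shape
    _, T, D = x.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))  # zero-pad time and frequency
    h = np.zeros((c_out, T, D))
    for c in range(c_out):
        for i in range(T):
            for j in range(D):
                patch = xp[:, i:i + kh, j:j + kw]      # (C_in, kh, kw) window
                h[c, i, j] = np.sum(patch * kernels[c])  # sum over s and window
    return h
```

The explicit loops favor readability over speed; the point is only that the number of output channels equals the number of kernels, independent of $C_{in}$.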

Activation Function
We only consider the ReLU function as the nonlinear activation function for the convolutional network, because it has a constant gradient of 1 in the region where its input is larger than 0, which avoids down-scaling the gradient and thus mitigates gradient vanishing. The ReLU function is calculated as $\mathrm{ReLU}(x) = \max(0, x)$, where $x$ is the output of the previous convolutional layer.

Max Pooling Layer
The max pooling layer outputs the maximum value within its receptive field. We denote by $i$ the time-domain index and by $j$ the frequency-domain index; $m$ and $s$ are the kernel size and stride in the time domain, and $n$ and $d$ are the kernel size and stride in the frequency domain. Given input $x$, the max pooling layer outputs, at each position $(i, j)$, the maximum of $x$ over the corresponding $m \times n$ window. If the size of the input feature $x$ is $T \times D$, where $T$ and $D$ represent the dimensions in the time domain and frequency domain, respectively, the size $T' \times D'$ of the output feature is obtained with the round-up function $\lceil \cdot \rceil$.
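A minimal sketch of such a pooling layer is given below, using the symbols $m$, $s$, $n$, $d$ from the text. The output-size rule $T' = \lceil (T - m)/s \rceil + 1$ (and likewise for $D'$), with windows truncated at the edges, is an assumption of this sketch (the "ceil mode" convention of common toolkits) and is not taken from the paper; it also assumes $m \ge s$ and $n \ge d$ so that every window contains at least one element.

```python
import math

import numpy as np


def max_pool_2d(x, m, s, n, d):
    """Max pooling over a (T, D) feature map.

    m, s: kernel size and stride in the time domain.
    n, d: kernel size and stride in the frequency domain.
    Windows that run past the edge are truncated (assumed ceil-mode rule).
    """
    T, D = x.shape
    t_out = math.ceil((T - m) / s) + 1
    d_out = math.ceil((D - n) / d) + 1
    y = np.empty((t_out, d_out), dtype=x.dtype)
    for i in range(t_out):
        for j in range(d_out):
            # NumPy slicing clips at the array boundary, truncating edge windows.
            y[i, j] = x[i * s:i * s + m, j * d:j * d + n].max()
    return y
```

Under this assumed rule, an 80-dimensional frequency axis pooled twice with $2 \times 2$ kernels and stride 2 shrinks to 40 and then to 20 bins, while the time axis is reduced by a factor of 4.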

2D Convolutional Gating Mechanism
Employing a gating mechanism to control the information flow in a network has proved successful for RNNs. The input gate and forget gate of LSTM have to learn when to release and write scaled information to the memory unit, and the output gate has to learn what information needs to be trapped in the memory unit. Without these gates, LSTM would easily harm learnable short-term memories by storing long-term memories. In this paper, we introduce a two-dimensional (2D) convolutional gating mechanism to control what features can flow to the LSTMs. See Figure 3 (where $C_{in} = C = 4$ for simplicity), in which the output of the convolutional operators is split in half along the channel dimension to form $u_1$ and $u_2$. Only half of the multi-channel features are output, while the other half are transformed into gate states by a sigmoid function. Subsequently, these gate states act on the first half of the channel features to generate gated units, which are the output of the gating mechanism. Inspired by the work of [27], we consider both the gated linear unit (GLU) and the gated tanh unit (GTU) forms of the gating mechanism to produce the output $o$ in Equations (24) and (25), where $\sigma$ is the sigmoid activation function and $\odot$ is the Hadamard product between two matrices.
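The channel split and the two gate forms can be sketched as follows. The sketch follows the GLU and GTU definitions of [27] and assumes that the first half of the channels ($u_1$) carries the content and the second half ($u_2$) forms the gate; this assignment, the function name `channel_gate`, and the (C, T, D) layout are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_gate(u, mode="gtu"):
    """Channel-wise gating of a convolution output.

    u:    pre-activation conv output, shape (C, T, D) with even C.
    mode: "glu" -> u1 * sigmoid(u2); "gtu" -> tanh(u1) * sigmoid(u2).
    Returns an array of shape (C // 2, T, D).
    """
    c = u.shape[0]
    if c % 2 != 0:
        raise ValueError("number of channels must be even")
    u1, u2 = u[: c // 2], u[c // 2:]   # assumed roles: u1 content, u2 gate
    gate = sigmoid(u2)                 # gate states in (0, 1), one per element
    if mode == "glu":
        return u1 * gate
    if mode == "gtu":
        return np.tanh(u1) * gate
    raise ValueError("mode must be 'glu' or 'gtu'")
```

Because the gate is computed element by element from the same time step, the operation has no temporal dependency and can be applied to all frames in parallel.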
In the local gradients of GLU and GTU, the values of both $\tanh'(\cdot)$ and $\sigma'(\cdot)$ lie between 0 and 1. As a corollary, the gradient in backpropagation will be downscaled when using the GTU, which may lead to gradient vanishing, whereas the GLU does not have these downscaling factors and can therefore avoid this problem better.
However, the tanh function scales the input features to the interval [−1, 1] with mean 0, which can be seen as a data normalization operation that makes it easier for the model to converge to an optimal value in training; indeed, we observed better performance with GTU in our experiments.
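The gradient argument above can be made explicit with the element-wise local derivatives of the two gate forms. The block below is a sketch that follows the GLU and GTU definitions of [27] and assumes that $u_1$ is the content half and $u_2$ the gate half; it is not a reproduction of the paper's own equations.

```latex
\begin{align*}
\text{GLU:}\quad o &= u_1 \odot \sigma(u_2), &
  \frac{\partial o}{\partial u_1} &= \sigma(u_2), &
  \frac{\partial o}{\partial u_2} &= u_1 \odot \sigma'(u_2) \\
\text{GTU:}\quad o &= \tanh(u_1) \odot \sigma(u_2), &
  \frac{\partial o}{\partial u_1} &= \tanh'(u_1) \odot \sigma(u_2), &
  \frac{\partial o}{\partial u_2} &= \tanh(u_1) \odot \sigma'(u_2)
\end{align*}

\begin{align*}
\tanh'(z) &= 1 - \tanh^2(z) \in (0, 1], &
\sigma'(z) &= \sigma(z)\,(1 - \sigma(z)) \in (0, \tfrac{1}{4}]
\end{align*}
```

Element-wise, the GTU derivative with respect to $u_1$ equals the GLU derivative multiplied by $\tanh'(u_1) \le 1$, which is the downscaling factor that the GLU path does not have.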
Given the input speech features $x$, the gated-VGG2 block outputs $\hat{x}$ by composing the operations described in this section, where $\hat{x}$ is obtained by flattening the multi-channel features of $u_2$ into one-dimensional features at every time step.

Latency
In our proposed model, all of the latency comes from the convolutional layers and the pooling layers. We set the kernel size to 3 × 3 with stride 1 in the convolutional layers and 2 × 2 with stride 2 in the pooling layers. Padding is added to both ends of the input sequence before each convolutional operation. Generating $\hat{x}_0$ requires waiting for $x_6$ to arrive, as shown in Figure 4. A latency of 60 ms is therefore introduced if the frame shift of the input features is 10 ms.

Corpus
We use the public AISHELL-1 Mandarin speech corpus for experiments. Table 1 shows the details of the corpus.

Hyperparameter Setting
In all of the experiments, we extract 80-dimensional log Mel-filter bank features from 25 ms windows with a 10 ms shift, and normalize them to zero mean and unit variance. Table 2 shows the Encoders used in all experiments. For (Bi)LSTM, hs represents the hidden layer size of the LSTM. For Transformer, d represents the input size, h represents the number of heads of the multi-head attention, and u represents the size of the feedforward network. The two maximum pooling layers in our (gated-)VGG networks are both of size 2 × 2 with a stride of 2, which lowers the frame rate to one frame per 40 ms. The outputs of the first and second layers of the (Bi)LSTM are each downsampled by a factor of 2 to bring the model to the same frame rate.
The corpus consists of 4233 Chinese characters (including the "blank" and "unk" tags), each of which is represented using a 320-dimensional embedding. We choose a two-layer LSTM with 1024 hidden units as the Prediction network, and the size of the Joint network is 320. For training, we use the Adadelta [31] optimizer with an initial learning rate of 1.0. All of the models are trained for 20 epochs on the training set and cross-validated on the validation set after each epoch. Training stops early if there is no performance improvement for three consecutive epochs. All the models in our experiments were built using the ESPnet [32] toolkit.
We evaluated the effect of GLU and GTU, respectively. Table 3 shows the results; a lower CER is obtained using GTU than GLU on the model with our proposed gated-VGG2 block. Section 3.1.5 discusses the scaling of local gradients by the tanh function and the possible impact of the normalization performed by the tanh function on the performance. The results suggest that the effect of local gradient scaling on the model is cancelled out.

Beam Search
In the decoding stage, we use the beam search algorithm in [11], where the size of the beam affects the decoding speed and performance of the model. Figure 5 shows the decoding results of our proposed model (GTU) with an increasing beam width from 0 to 10, where a beam width of 0 represents greedy decoding. The results show that a beam width of 5 achieves a balance between accuracy and cost.
Table 4 shows the character error rate (CER) of our model (GTU) and other LSTM-based models. All of the models use the same structure and hyperparameters for the Prediction and Joint networks. In the decoding stage, we use a beam search with a beam size of 5 and without a language model. The streaming model that is based on our proposed model achieves the lowest CER, and it outperforms the offline model with a BiLSTM Encoder.
Secondly, we compare the performance of our model with that of the Transformer-based model. In particular, to keep the same frame rate, we compare with a model that has a VGG2-Transformer Encoder, which is a combination of VGG2 and the 12-layer Transformer in Table 2. As shown in Table 5, we test this model with full attention, with unidirectional attention, and with different lookahead sizes. Increasing the lookahead size in the Transformer layers is effective in reducing the CER, but it introduces a significant amount of latency. The CER of our proposed model is comparable to that of the model with a three-frame lookahead, while its latency is only that of the unidirectional model.
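For readers unfamiliar with the metric reported in these tables, the CER can be sketched as the character-level edit distance between the hypothesis and the reference, divided by the reference length. The function below is a generic illustration, not the scoring tool used in the experiments, and it assumes a non-empty reference.

```python
def character_error_rate(ref, hyp):
    """Levenshtein distance between character sequences divided by len(ref).

    ref, hyp: strings (or sequences of characters); ref must be non-empty.
    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))          # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]                             # deleting the first i ref characters
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For example, the reference "今天天气好" and the hypothesis "今天气很好" differ by two edits over five reference characters, which gives a CER of 0.4.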

Conclusions
In this paper, we proposed the gated-VGG2 block, a feature fusion module that is designed to capture contextual information in streaming speech recognition through convolutional networks and to emphasize informative features through a gating mechanism. The experiments show that the streaming model based on the gated-VGG2 block achieves a lower CER than other LSTM models, and has lower latency than a Transformer-based model with similar accuracy. In addition, the gated-VGG2 block shows that previous VGG2 networks do not adequately consider the importance of individual features. We hope that this insight can be useful for other tasks that rely on convolutional networks to represent features.