Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification

Abstract: Recurrent Neural Network (RNN) based attention models have been used in code-switching speech recognition (CSSR). However, due to the sequential computation constraint of RNNs, there are stronger short-range dependencies and weaker long-range dependencies, which makes it hard to switch languages immediately in CSSR. Firstly, to deal with this problem, we introduce the CTC-Transformer, which relies entirely on a self-attention mechanism to draw global dependencies and adopts connectionist temporal classification (CTC) as an auxiliary task for better convergence. Secondly, we propose two multi-task learning recipes, where a language identification (LID) auxiliary task is learned in addition to the CTC-Transformer automatic speech recognition (ASR) task. Thirdly, we study a decoding strategy to combine the LID into the ASR task. Experiments on the SEAME corpus demonstrate the effects of the proposed methods, achieving a mixed error rate (MER) of 30.95%. This obtains up to a 19.35% relative MER reduction compared to the baseline RNN-based CTC-Attention system, and an 8.86% relative MER reduction compared to the baseline CTC-Transformer system.


Introduction
Code-switching (CS) speech is defined as speech that contains more than one language within an utterance [1]. With the development of globalization, this multilingual phenomenon has become increasingly common in real life, so research on it has attracted growing attention. Traditionally, research on Gaussian mixture model based hidden Markov model (GMM-HMM) and deep neural network based hidden Markov model (DNN-HMM) frameworks for code-switching speech recognition (CSSR) [2,3] focuses on two challenges: the lack of language model training data at CS points, and co-articulation effects between phonemes at CS points. It is therefore difficult to reliably estimate the probability of word sequences where CS appears and to model the phonemes at CS points. To address the former challenge, statistical machine translation has been utilized to manufacture artificial CS training text [4]. Several methods have been proposed to improve language modeling for CS speech: recurrent neural network language models and factored language models with the integration of part-of-speech tags, language information, or syntactic and semantic features [5][6][7]. To address the latter challenge, speaker adaptation, phone sharing and phone merging have been applied [4]. Recently, End-to-End (E2E) approaches to the CSSR task have attracted increasing interest [8][9][10][11]. By predicting graphemes or characters directly from acoustic information without a predefined alignment, an E2E system can considerably reduce the effort of building automatic speech recognition (ASR) systems. At the same time, the need for expert linguistic knowledge is eliminated, which makes the E2E approach an attractive choice for CSSR. Previous works mainly adopted two types of E2E methods for the CSSR task: connectionist temporal classification (CTC) [12] and the RNN-based attention method [13][14][15]. The CTC objective function simplifies acoustic modeling into learning an RNN over pairs of speech and context-independent (CI)
label sequences, without requiring a frame-level alignment of the target labels for a training utterance [16]. The RNN-based attention method consists of an RNN encoder and an attention-based RNN decoder, which map acoustic speech into a high-level representation and recognize symbols conditioned on previous predictions, respectively [13][14][15]. A joint CTC-Attention multi-task learning model has been presented to combine the benefits of both types of systems [17,18]. However, the RNN still imposes a sequential computation constraint. Therefore, stronger short-range dependencies and weaker long-range dependencies exist in the encoder and decoder outputs, which makes it hard to switch languages immediately in CSSR. Recently, the Transformer [19,20] has achieved state-of-the-art performance in many monolingual ASR tasks [21]. It transduces sequential data with its self-attention mechanism, which replaces the RNN of previous works [15,18]. Since self-attention utilizes global context (all frames learn time dependencies within the input sequence, achieving sequence transduction in parallel), information transmission is the same for every location when drawing global dependencies, which makes it possible to switch more freely at CS points. Therefore, in this paper, we apply a joint CTC-Transformer framework to CSSR. Then, we study different multi-task learning recipes, where a language identification (LID) auxiliary task is learned in addition to the ASR task. Lastly, we study a decoding strategy to combine the LID information into ASR. All of our experiments are conducted on the SEAME corpus.
The paper is organized as follows. Related works are presented in Section 2. The multi-task learning recipes and LID joint decoding are studied in Section 3. Experimental setups and results analysis are described in Section 4. Conclusions are drawn in Section 5.

Transformer Based E2E Architecture
The Transformer contains an Encoder network and a Decoder network [21][22][23]. Both the Encoder and the Decoder consist of several stacked layers, as shown in Figure 1a,b, respectively. The Encoder transforms the input features X = [x_1, ..., x_T] into a sequence of encoded features H_e = [h_{e,1}, ..., h_{e,T}], as Equation (1):

X^0 = CNN(X) + PE,
X̂^i = X^i + MHA(N(X^i)),
X^{i+1} = X̂^i + FF_i(N(X̂^i)),
H_e = N(X^e),   (1)

where i = 0, ..., e − 1, e is the number of encoder layers, CNN(·) is a convolution network, PE is positional encoding, MHA(·) is a multi-head self-attention mechanism, FF_i is a position-wise fully connected feed-forward network, and layer normalization N(·) is employed before each sub-layer. The Decoder receives the encoded features H_e and the label sequence Y[0 : l − 1] to emit the probabilities of the Decoder output unit set Y[l] = [y^1_l, ..., y^{d_units}_l] of the l-th step, and then determine the subsequent [ŷ_1, ..., ŷ_l], as Equations (2)-(4):

E_dec = Embed(Y[0 : l − 1]) + PE,   (2)
D = Decoder(E_dec, H_e),   (3)
p_att(Y[l] | X, Y[0 : l − 1]) = Softmax(D[l − 1] W_o),   (4)

where Embed(·) is an embedding layer that transforms a sequence of labels Y[0 : l − 1] into a sequence of learnable vectors E_dec ∈ R^{l×d_att}, and d_att is the dimension of attentions.
The learnable weight matrix W_o ∈ R^{d_att×d_units} belongs to the output linear layer, and d_units is the number of output units.
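The positional encoding PE added before the encoder and decoder stacks can be sketched as the standard sinusoidal scheme of [19]; this is a minimal NumPy illustration, not the authors' implementation, and assumes an even d_att:

```python
import numpy as np

def positional_encoding(T, d_att):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd ones,
    with wavelengths forming a geometric progression up to 10000 * 2*pi."""
    pe = np.zeros((T, d_att))
    pos = np.arange(T)[:, None]                                  # (T, 1) positions
    div = np.exp(-np.log(10000.0) * np.arange(0, d_att, 2) / d_att)
    pe[:, 0::2] = np.sin(pos * div)                              # even dims
    pe[:, 1::2] = np.cos(pos * div)                              # odd dims
    return pe
```

Because the encoding depends only on position, it injects order information without adding any learnable parameters.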

Self-Attention
Scaled Dot-Product Attention is commonly used as the attention function for the self-attention mechanism [21]. The input consists of queries (Q) and keys (K) of dimension d_k, and values (V) of dimension d_v. Scaled Dot-Product Attention is computed as Equation (5):

Attention(Q, K, V) = Softmax(QK^T / √d_k) V.   (5)

To allow the model to jointly attend to information from different representation subspaces at different positions, [21] extends Equation (5) to multi-head attention, Equation (6):

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O,
head_j = Attention(Q W_j^Q, K W_j^K, V W_j^V),   (6)

where h is the number of attention heads, and W_j^Q ∈ R^{d_att×d_k}, W_j^K ∈ R^{d_att×d_k}, W_j^V ∈ R^{d_att×d_v}, and W^O ∈ R^{hd_v×d_att} are learnable weight matrices.
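Equations (5) and (6) can be sketched directly in NumPy. This is an illustrative single-batch sketch with per-head weight matrices passed in as lists, not the model's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (5): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Equation (6): h heads attend in parallel, concatenated and projected."""
    heads = [scaled_dot_product_attention(Q @ W_q[j], K @ W_k[j], V @ W_v[j])
             for j in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o
```

The 1/√d_k scaling keeps the dot products from growing with d_k, which would otherwise push the softmax into regions with vanishing gradients.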

CTC-Transformer Based CSSR Baseline System
We adopted the Transformer framework to build the CSSR system. However, the Transformer takes many more epochs to converge for the monolingual ASR task, let alone the CSSR task, which has many more model units. Inspired by [20], we added a CTC objective function to train the encoder of the Transformer. CTC helps the Transformer to converge with the forward-backward algorithm, enforcing a monotonic alignment between input features and output labels. The architecture of the CTC-Transformer baseline system is indicated in Figure 2. Specifically, let X be the input acoustic sequence, let Y be the output label sequence comprising Mandarin or English modeling units, let L_ctc(Y|X) be the CTC objective loss [16] and L_att(Y|X) be the attention-based objective loss. The L_att(Y|X) loss is the cross entropy of the prediction Ŷ and the ground truth Y. The combination of L_ctc(Y|X) and L_att(Y|X) is adopted for the ASR task:

L_ASR = α L_ctc(Y|X) + (1 − α) L_att(Y|X),

where α is a hyperparameter. We chose the Chinese character as the Mandarin acoustic modeling unit, as it is the most common choice for E2E Mandarin ASR and has shown state-of-the-art performance in Mandarin ASR [24,25]. For English, we chose the subword as the unit, adopting Byte Pair Encoding (BPE) [26] to generate subword units.
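The combined objective can be sketched as below. The label-smoothed cross entropy for L_att follows the penalty of 0.1 mentioned in the experimental setup; the smoothing variant (mass eps spread uniformly over the non-target classes) is an assumption, as is treating L_ctc as a precomputed scalar:

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def label_smoothed_ce(logits, targets, eps=0.1):
    """Attention loss L_att: cross entropy against a smoothed target
    distribution (1 - eps on the target, eps spread over the other units)."""
    logp = log_softmax(logits)
    n = logits.shape[-1]
    smooth = np.full_like(logp, eps / (n - 1))
    smooth[np.arange(len(targets)), targets] = 1.0 - eps
    return -(smooth * logp).sum(axis=-1).mean()

def asr_loss(l_ctc, l_att, alpha=0.3):
    """Weighted combination of the CTC and attention losses (alpha = 0.3
    matches the CTC joint-training weight used in the experiments)."""
    return alpha * l_ctc + (1.0 - alpha) * l_att
```

In practice both losses are computed on the same minibatch and backpropagated jointly, so the CTC branch regularizes the shared encoder toward monotonic alignments.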

CSSR Multi-Task Learning with LID
In CSSR, modeling units that belong to different languages but have similar pronunciations are easy to confuse. Meanwhile, the language information is not used explicitly during training. LID is a process by which a computer analyzes and processes speech to determine which language it belongs to. We therefore believe that adopting LID prediction as an auxiliary task can improve CSSR performance. The feature sequence output by the encoder is sent to the decoder, and the decoder outputs the LID sequence corresponding to the feature sequence. The LID task and the ASR task share the same encoder, so we call this multi-task learning. The objective loss L_LID(Z|X), with weight β, is added to extend the multi-task learning (MTL) objective loss:

L_MTL = α L_ctc(Y|X) + (1 − α) L_att(Y|X) + β L_LID(Z|X).

The L_LID(Z|X) loss is the cross entropy of the predicted LID label sequence Ẑ and the ground truth LID label sequence Z. In this work, each LID label z_l corresponds to an ASR label y_l, so the length of the LID label sequence Z is the same as that of the ASR label sequence Y. An example is shown in Figure 3. We used the label 'E' for English and the label 'M' for Mandarin. We used the label 'N' for nonverbal events such as noise and laughter, but we do not discuss 'N' below, as it is not important in this study. Because the length of the Z sequence is the same as that of the Y sequence, and inspired by the decoding method of the Y sequence, we used a similar structure to predict the Z sequence. In the training stage, the Decoder of the ASR task uses H_e and the information of the label sequence (y_0, ..., y_{l−1}) to predict (y_1, ..., y_l). We studied which label sequence should participate in predicting the next LID label. Specifically, we propose two strategies to implement the Z label prediction task.
• LID label sequence (LLS): the LID predictor receives the previous LID label sequence (z_0, ..., z_{l−1}) to predict the next LID label. It does not share the embedding layer with the ASR predictor; it has its own embedding layer, Embed_LID(·), which transforms a sequence of labels Z[0 : l − 1] into a sequence of learnable vectors E_Z ∈ R^{l×d_att}.
• ASR label sequence (ALS): just like LLS, except that the LID predictor receives not the LID label sequence but the ASR label sequence (y_0, ..., y_{l−1}); in fact, the LID predictor then shares the embedding layer with the ASR predictor. ALS can be implemented in two structures: one in which the LID task shares the decoder with the ASR task (ALS-share), and one in which it does not (ALS-indep), as indicated in Figure 4b,c, respectively.
On this basis, inspired by [20], we also conduct another set of experiments, adding joint CTC training for the LID task.
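Deriving the token-level LID sequence Z from a mixed transcript (as in the Figure 3 example) can be sketched as follows. The CJK Unicode range test and the `<...>` convention for nonverbal markers are assumptions for illustration, not the corpus's actual tagging scheme:

```python
def lid_labels(asr_tokens):
    """Map each ASR label y_l to an LID label z_l: 'M' for Mandarin
    characters, 'E' for English subwords, 'N' for nonverbal tokens.
    Z has the same length as Y by construction."""
    z = []
    for tok in asr_tokens:
        if tok.startswith('<'):                                   # e.g. <noise>, an assumed marker
            z.append('N')
        elif any('\u4e00' <= ch <= '\u9fff' for ch in tok):       # CJK Unified Ideographs
            z.append('M')
        else:
            z.append('E')
    return z
```

Because Mandarin units are characters and English units are subwords, the mapping is deterministic, so the LID targets come for free from the ASR transcripts.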

CSSR Joint Decoding with LID
There are units with similar pronunciations across the two languages; therefore, units of one language may be incorrectly identified as units with similar pronunciations in the other language. Figure 5 shows an example from our experiments. This problem may be related to the fact that LID is not used in decoding. As indicated in Figure 6, we integrate LID into the decoding process by conditionally modifying the ASR output probabilities p_att(y^k_l) with the LID output probabilities p_att(z^{m_k}_l):
• Firstly, ASR branch decoding and LID branch decoding are carried out simultaneously to obtain the ASR label ŷ_l and the LID label ẑ_l of the l-th step;
• ŷ_l can be uniquely mapped to an LID label z̄_l, since there is no intersection between the Chinese modeling unit set and the English modeling unit set;
• If z̄_l is not in {'E','M'} or ẑ_l is not in {'E','M'}, p_att(y^k_l) will not be modified;
• If both z̄_l and ẑ_l are in {'E','M'}, and z̄_l is different from ẑ_l, then modification and normalization are applied to p_att(y^k_l),
where k = 1, ..., d_units, d_units is the number of ASR decoder output units, y^k_l is the k-th output unit of the l-th step, m_k maps the k-th ASR output unit to its LID output unit, z^{m_k}_l is the m_k-th output unit of the l-th step of the LID task, p_att(y^k_l) is the probability of the k-th ASR output unit of the l-th step, and p_att(z^{m_k}_l) is the probability of the m_k-th LID output unit of the l-th step. If z^{m_k}_l differs from the language corresponding to y^k_l, then, after normalization, the value of p_att(y^k_l) decreases, which reduces the probability of selecting the current error unit.
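The joint decoding rule can be sketched as below. The exact modification formula did not survive extraction, so this is one plausible realization, assuming each ASR unit's probability is scaled by the LID branch probability of that unit's language before renormalizing; the function and argument names are illustrative:

```python
import numpy as np

def joint_lid_rescore(p_asr, unit_langs, p_lid, lid_from_asr, lid_from_branch):
    """Conditionally modify ASR output probabilities with LID probabilities.

    p_asr:           probabilities over the d_units ASR outputs at step l
    unit_langs:      language of each ASR unit ('E'/'M'/other), i.e. the mapping m_k
    p_lid:           LID branch probabilities, e.g. {'E': 0.9, 'M': 0.1}
    lid_from_asr:    LID label mapped from the top ASR hypothesis
    lid_from_branch: LID label predicted by the LID branch
    """
    # Only rescore when both LID estimates are in {'E','M'} and disagree.
    if lid_from_asr not in ('E', 'M') or lid_from_branch not in ('E', 'M'):
        return p_asr
    if lid_from_asr == lid_from_branch:
        return p_asr
    scaled = np.array([p * p_lid[lang] if lang in p_lid else p
                       for p, lang in zip(p_asr, unit_langs)])
    return scaled / scaled.sum()   # normalization: wrong-language units shrink
```

After normalization, units whose language disagrees with the LID branch lose probability mass, which is the corrective effect described above.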

Data
We conduct experiments on the SEAME (South East Asia Mandarin-English) corpus [27], which was developed for spontaneous Mandarin-English code-switching research. We divide the SEAME corpus into three sets (train, development and test) by proportionally sampling speakers. The detailed statistics of the corpus division are presented in Table 1.

Baseline Setup
Firstly, we replicate two other baseline systems based on different frameworks: GMM-HMM [4] and RNN-based CTC-Attention [17]. Since the partition of training/development/test sets is not identical, there is a slight gap between our results and the originals, but within the allowable range. A "big model" is commonly suggested for the Transformer [19,20] in monolingual ASR, but we find it unsuitable for CSSR due to the insufficient CS data available and the large CSSR output unit set. In this work, we chose a "smaller model" for the Transformer (d_att = 256, e = 6, d = 3). The input speech is represented as a sequence of 40-dimensional filterbank features. The filterbank features are first subsampled by a two-layer time-axis convolutional neural network with ReLU activation (stride size 2, kernel size 3, 256 channels). The loss weight α for CTC joint training is set to 0.3. To prevent overfitting to the training set, label smoothing [28] with a penalty of 0.1 is applied. For Mandarin modeling units, a set of 2639 Chinese characters is used, covering the Chinese characters that appear in the training text. For English, a set of 200 subwords is used, trained on the English segments of the training set using the BPE method [26].
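The frame-rate reduction of the two-layer time-axis CNN front-end can be worked out as below; the no-padding assumption is ours, since the padding scheme is not stated:

```python
def conv_out_len(t, kernel=3, stride=2, layers=2):
    """Output length of the two-layer time-axis CNN front-end
    (kernel size 3, stride 2, no padding assumed): each layer maps
    t -> (t - kernel) // stride + 1, giving roughly a 4x reduction."""
    for _ in range(layers):
        t = (t - kernel) // stride + 1
    return t
```

For example, a 100-frame filterbank sequence is reduced to 24 frames before entering the self-attention layers, which shortens the sequences the quadratic-cost attention must process.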
As shown in Table 2, the CTC-Transformer baseline had a mixed error rate (MER) of 33.96%, better than that of the RNN-based CTC-Attention baseline (38.38%) and the GMM-HMM baseline (39.6%). We conducted experiments with different choices of the MTL weight β, and chose the best weight β = 0.1 for the following experiments. As shown in Table 3, LLS and ALS-share achieve an equivalent effect, both better than the CTC-Transformer baseline, consistent with our expectations. However, the effect of ALS-indep is worse, so we did not use ALS-indep for subsequent experiments. When adding joint CTC training for LID, the LID CTC weight is set to 0.3, the same as the ASR CTC weight (α = 0.3). As shown in Table 3, the LID auxiliary task with joint CTC training better assists the ASR main task, because CTC can learn to align the speech features and the LID label sequence explicitly.

Effects of Joint Decoding with LID
As shown in Table 4, joint LID decoding improves LLS, but has no effect on ALS-share. For ALS-share, the LID task and the ASR task use the same decoder, which makes the result of the LID task closely tied to that of the ASR task. In contrast, for LLS, the LID task is relatively independent of the ASR task. Therefore, the LID decoder trained in LLS is more capable of correcting language errors in the ASR task. Under the configuration of the LLS method, CTC joint training for the LID task and LID joint decoding, the final system achieves an MER of 30.95%, obtaining up to 19.35% and 8.86% relative MER reductions compared to the RNN-based CTC-Attention baseline system (38.38%) and the CTC-Transformer baseline system (33.96%), respectively.

Conclusions
In this work, we introduce a CTC-Transformer based E2E model for Mandarin-English CSSR, which outperforms most of the traditional systems on the SEAME corpus. For the inclusion of LID, we propose two LID multi-task learning strategies: LLS and ALS (ALS-share and ALS-indep). LLS and ALS-share bring comparable improvements. Furthermore, we study a decoding strategy to combine the LID information into the ASR task, which slightly improves performance in the case of LLS. The final system with the proposed methods achieves an MER of 30.95%, obtaining up to 19.35% and 8.86% relative MER reductions compared to the RNN-based CTC-Attention baseline system (38.38%) and the CTC-Transformer baseline system (33.96%), respectively.

Figure 3. An example: Y label sequence corresponding to Z label sequence.

Figure 5. An example of units in one language being incorrectly identified as units in another language.


Figure 6. Frameworks for joint LID and ASR decoding.

Table 1. Statistics of the SEAME corpus.

Table 4. MER (%) of the ASR task (using the CTC-Transformer architecture), with and without joint decoding with LID.