A Self-Supervised Model for Language Identification Integrating Phonological Knowledge

Abstract: In this paper, a self-supervised pre-trained model is proposed and successfully applied to the language identification (LID) task. A Transformer encoder is employed, and a multi-task strategy is used to train the self-supervised model: the first task is to reconstruct the masked spans of the input frames, and the second is a supervision task in which phoneme and phonological labels are used with the Connectionist Temporal Classification (CTC) loss. With this multi-task learning loss, the model is expected to capture high-level speech representations in the phonological space. Meanwhile, an adaptive loss is applied to balance the weights between the different tasks. After the pre-training stage, the self-supervised model is used in xvector systems. Our LID experiments are carried out on the oriental language recognition (OLR) challenge corpus, using the 1 s, 3 s, and full-length test sets. Experimental results show that on the 1 s test set the feature extraction approach achieves the best performance, while on the 3 s and full-length test sets the fine-tuning approach performs best. Furthermore, our results show that the multi-task training strategy is effective and that the proposed model achieves the best overall performance.


Introduction
Recently, self-supervised training has been shown to be effective for improving downstream systems [1][2][3][4]. The speech signal contains a rich set of acoustic and linguistic information, including phonemes, words, articulatory and even sentiment information. Through self-supervised pre-training, high-level speech representations can be captured from raw speech [1,5]. The learned models can then be applied to downstream speech and language processing tasks through feature-based speech representation extraction.
In this work, we propose a self-supervised pre-trained model in which phonological labels are used as an auxiliary objective. Two objectives are used in this model. First, as in most self-supervised models, masking strategies are applied to the input frames, and the L1 loss is used to minimize the reconstruction error between the predicted and ground-truth frames. Second, to make the model learn speech representations in the phonological space, we apply the CTC loss with phoneme and phonological labels. After the pre-training stage, the xvector system is combined with the pre-trained self-supervised model for LID. The framework is shown in Figure 1.
During self-supervised model training, the input acoustic frames are randomly masked along the time and channel axes, and the model learns to reconstruct and predict the original frames. In neural network models, a contrastive loss can induce high-level latent knowledge [5], so a sequence-level CTC loss with phoneme and phonological labels is used here for phonological representation learning. Language identification is important in real-life communication, both for text LID [6,7] and speech LID [8,9]. In speech LID, phonetic knowledge is often used to improve system performance. The most traditional approach is to incorporate deep bottleneck features (DBF) into the LID model, where the DBF are extracted from a well-trained hybrid automatic speech recognition (ASR) system. In our proposed model, we use not only phonetic information: phonological knowledge is also introduced to make the model learn phonological representations. Much research has shown that phonological knowledge can be shared across different languages using statistical models, but most previous works use "acoustic-to-articulatory(-attribute)" modeling [10,11]. When the self-supervised model reconstructs the masked frames, the CTC loss simultaneously acts as a regularizer with phonological knowledge, so the model can capture high-level representations in the phonological space. To balance the different losses, we apply a principled approach to multi-task deep learning that weighs multiple loss functions by considering the homoscedastic uncertainty of each task. Because of the joint training, the model can learn both acoustic and phonological representations. By incorporating the pre-trained model, the LID system can integrate the phonological representation of the source language through model transfer to improve LID performance. For model transfer, two approaches are considered: feature extraction and fine-tuning, which are described in Section 4.
The rest of the paper is organized as follows: Section 2 presents related work. Section 3 describes the phonological definitions. Section 4 gives the model architecture. Sections 5 and 6 present the experimental setup and results. Section 7 concludes the paper.

Related Work
Inspired by the Masked Language Model (MLM) task from BERT [12], researchers have explored using BERT-style tasks to pre-train speech encoders. In [13], the authors proposed a Transformer encoder based pre-trained model named Mockingjay, in which the input frames are masked to zero and the model learns to reconstruct and predict the original frames. In Audio ALBERT [14], Mockingjay is modified to share parameters across Transformer layers. In [13], a pre-trained model with time and frequency alteration objectives is proposed, and the results show that it can improve several downstream tasks. In PASE [15], a single neural encoder is trained to solve multiple self-supervised tasks at the same time, including reconstruction of the waveform, log power spectrum, MFCC, prosody, and other binary discrimination tasks.
Phonological knowledge has been used in many speech tasks. Many studies design neural network models to map acoustic features to articulatory features (AFs) using phonological knowledge. In [16], the authors combined acoustic and articulatory features, which improves speaker identification performance. [17] applied articulatory features to deep bottleneck (DBN) feature based ivector and xvector systems, obtaining better performance than the baseline. Because AFs are language-independent features, much research has focused on multilingual speech recognition [18][19][20]. Previous studies generally take a bottom-up approach and train phonological feature detectors; here, we jointly train on the phonological labels and the acoustic frame reconstruction.
Most traditional LID solutions are based on ivectors extracted from a Gaussian mixture model (GMM) [21]. Recently, with the development of deep neural networks (DNNs), it has been demonstrated that DNNs can bring significant improvements for LID. In [22], the authors developed deep bottleneck feature (DBF) based ivectors, which were extracted from a well-trained hybrid automatic speech recognition (ASR) system. In [23], the authors proposed a LID model named the Phonetic Temporal Neural Model (PTN), an LSTM-RNN LID system that accepts phonetic features produced by a phone-discriminative DNN as input. In this work, the self-supervised model is used to learn phonological knowledge and transfer that knowledge to the LID model.

Phonological Definition
In this experiment, Mandarin is used to train the self-supervised model. Following previous work [24] and the International Phonetic Alphabet (IPA) [25], we define six speech attributes for Mandarin, which can be found in Table 1. These speech attributes can be derived from information contained in the fundamental speech sounds. Each phoneme has six classes: Place, Manner, Front, Height, Roundness and Voiced. In this table, nil means "not specified". For example, the articulatory class Manner does not exist for vowels; thus, for vowel phones, this class is defined as nil.

Model Architecture

In recent years, the Transformer model has been applied successfully in masked speech tasks [26,27], so we use a standard multi-layer Transformer encoder with multi-head self-attention for left-and-right bidirectional encoding to train the self-supervised model. Each encoder layer has two sub-layers: the first is a multi-head self-attention network, and the second is a feed-forward layer; each sub-layer has a residual connection followed by layer normalization [6]. To make the model aware of the input sequence order, positional encoding is used. Sinusoidal positional encoding is used instead of learnable positional embeddings because acoustic features can be arbitrarily long with high variance [13]. After the Transformer encoder, a fully connected layer is used to reduce the dimension of the output vectors.
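The sinusoidal positional encoding mentioned above can be sketched as follows; this is a minimal NumPy version, and the sequence length and model dimension used below are illustrative, not the paper's configuration:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encoding as in the original Transformer.

    pe[t, 2i]   = sin(t / 10000^(2i / d_model))
    pe[t, 2i+1] = cos(t / 10000^(2i / d_model))
    Assumes an even d_model.
    """
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even channels: sine
    pe[:, 1::2] = np.cos(angles)             # odd channels: cosine
    return pe
```

Because the encoding is deterministic, it extends to arbitrarily long inputs without adding learnable parameters, which is the motivation given in [13] for variable-length acoustic features.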

Multi-Task Learning
As described above, there are two tasks in our proposed model: a reconstruction task and a supervision task using phonological knowledge. By training the model jointly on these tasks, the model can learn more in the phonological space, since LID relies on phonological information. The training stage is described in the following parts:
• Reconstruction task: For the reconstruction task, we apply two kinds of masking to the input frames: time mask and channel mask.
Channel mask: Inspired by SpecAugment [28] and TERA, we also introduce a channel mask on top of the time mask. For the channel mask, a block of consecutive channels is masked to zero at all time steps. In our experiments, the percentage of masked channels is 20%. In [29], the results showed that channel masking can make the model learn more about speaker representation. So by using channel masking, we want to find out whether it can also help learn phonological representations, since the CTC loss with phonological knowledge is used.
To better illustrate the time mask and channel mask, we visualize the different masking strategies in Figure 2. For both masking strategies, we follow RoBERTa [30] and generate new masking patterns for each batch. Finally, we reconstruct all the frames to induce acoustic information at all positions and train the model more explicitly. The reconstruction loss is

$$\mathcal{L}_{rec} = \sum_{t=1}^{T} \lVert z_t - x_t \rVert_1,$$

where $z_t$ are the outputs of the Transformer encoder and $x_t$ is the input.
• Supervision task: To make the model learn phonological representations from speech, the CTC loss function is applied with the phoneme and phonological labels. The CTC approach is an objective function for sequence labeling problems [31] that does not rely on a forced alignment between the input and the output labels. For the phoneme output, the phoneme-based CTC loss is

$$\mathcal{L}_{phone} = -\ln P(y \mid x),$$

where $x$ is the input acoustic features and $y$ is the phoneme sequence.
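The two masking strategies can be sketched as follows (NumPy, assuming Fbank input of shape (T, C); only the 20% channel ratio comes from the text, while the time-mask ratio and the single contiguous span per axis are illustrative assumptions):

```python
import numpy as np

def mask_frames(feats, time_mask_ratio=0.15, channel_mask_ratio=0.20, rng=None):
    """Apply one contiguous time mask and one contiguous channel mask.

    feats: (T, C) array of Fbank frames. Returns a masked copy; the
    original array is left untouched so it can serve as the L1 target.
    """
    if rng is None:
        rng = np.random.default_rng()
    T, C = feats.shape
    masked = feats.copy()
    # Time mask: zero a contiguous span of frames across all channels
    t_width = max(1, int(T * time_mask_ratio))
    t0 = rng.integers(0, T - t_width + 1)
    masked[t0:t0 + t_width, :] = 0.0
    # Channel mask: zero a contiguous block of channels at all time steps
    c_width = max(1, int(C * channel_mask_ratio))
    c0 = rng.integers(0, C - c_width + 1)
    masked[:, c0:c0 + c_width] = 0.0
    return masked
```

A fresh mask is drawn per call, matching the per-batch re-masking borrowed from RoBERTa [30].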
There are 6 classes of phonological labels, so for the $i$th class the loss is

$$\mathcal{L}_{i} = -\ln P(y_i \mid x), \quad i = 1, \dots, 6,$$

where $x$ is the input acoustic features and $y_i$ is the phonological label sequence of class $i$. The multi-task learning loss is then

$$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{phone}\,\mathcal{L}_{phone} + \sum_{i=1}^{6}\lambda_i\,\mathcal{L}_i,$$

where each $\lambda$ represents the weight of the corresponding task. When using multi-task learning, the performance of the model can be sensitive to the weights between different tasks, and finding optimal values can be expensive. To better train the model, we use the adaptive loss function derived in [32] to automatically weight the task-specific loss functions, i.e.,

$$\mathcal{L} = \sum_{k} e^{-s_k}\,\mathcal{L}_k + s_k,$$

where $s_k = \log \sigma_k^2$ captures the homoscedastic uncertainty of task $k$. This adaptive loss is used for model training.
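The uncertainty-weighted combination can be sketched as follows, using the common exp(−s) parameterization of [32]; the per-task losses would come from the reconstruction and CTC objectives described in this section:

```python
import numpy as np

def adaptive_multitask_loss(task_losses, log_vars):
    """Uncertainty-weighted multi-task loss, exp(-s) parameterization.

    task_losses: per-task scalar losses (here: L1 reconstruction, phoneme
    CTC and the six phonological CTC losses).
    log_vars: learnable s_k = log(sigma_k^2). exp(-s_k) down-weights
    high-uncertainty tasks; the + s_k term regularizes sigma_k so it
    cannot grow without bound and silence a task entirely.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += float(np.exp(-s)) * loss + s
    return total
```

In practice, the `log_vars` would be trainable parameters updated alongside the network weights, so the balance between tasks adapts during training instead of being hand-tuned.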

Xvector System
For the xvector system, the Time Delay Neural Network (TDNN) based xvector system is chosen because it achieves state-of-the-art results and is often considered the baseline for LID [33,34].
The first five layers are extended-context layers, after which a statistics pooling layer accumulates all frame-level outputs: the mean and standard deviation of the outputs are computed to obtain a fixed-dimension segment-level representation. The segment-level statistics are then passed to the fully connected hidden layers. There are two main ways to incorporate the pre-trained model into language identification tasks.
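The statistics pooling step can be sketched as follows (a NumPy sketch; the frame count and feature dimension are illustrative):

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Map (T, D) frame-level TDNN outputs to a 2*D segment-level vector
    by concatenating the per-dimension mean and standard deviation."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])
```

This is what turns a variable-length frame sequence into the fixed-dimension input required by the fully connected layers that follow.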

Incorporating with Language Identification Tasks
• Feature Extraction: The first approach is to extract the features from the last layer of the Transformer encoder. The extracted features are fed to the LID system as input.
The parameters of the self-supervised model are frozen when training the LID system in this approach. In later experiments, we denote this approach as FE. • Fine-tuning: The second approach is to fine-tune the self-supervised model together with the LID model. Here, the output of the self-supervised model is connected to the xvector model, and we update the pre-trained model jointly with the randomly initialized xvector model.
In later experiments, we denote this approach as FT.
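The difference between FE and FT amounts to which parameter sets the optimizer updates; a schematic sketch with a hypothetical helper (not the paper's actual code):

```python
def build_optimizer_params(encoder_params, xvector_params, mode):
    """Select which parameters the optimizer updates.

    FE: the self-supervised encoder is frozen and acts as a feature
    extractor, so only the xvector parameters are trained.
    FT: the encoder and the xvector model are updated jointly.
    """
    assert mode in ("FE", "FT")
    if mode == "FE":
        return list(xvector_params)
    return list(encoder_params) + list(xvector_params)
```

In a real PyTorch setup the same effect is typically achieved by toggling `requires_grad` on the encoder and handing the resulting parameter list to the optimizer.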

Datasets
For the self-supervised model, the THCHS30 [35] dataset is used; the details of the dataset are listed in Table 2. The LID experiments are conducted on the second oriental language recognition (OLR) challenge, AP17-OLR [34]. The training set contains 10 different languages: Mandarin, Cantonese, Indonesian, Japanese, Russian, Korean, Vietnamese, Kazakh, Tibetan and Uyghur. For these languages, male and female speakers are balanced. The training set was recorded with mobile phones, with a sampling rate of 16 kHz and a sample size of 16 bits. Our systems are evaluated on the AP17 challenge's development set, which is disjoint from the training set. The development set contains three test sets of different conditions: 1 s, 3 s and full-length utterance conditions, denoted as 1 s, 3 s, and full-length. The test utterances of the 1 s and 3 s conditions are randomly excerpted from the full-length utterances. If a test utterance is not sufficiently long for the excerption, it is simply discarded. The details of the dataset are described in Table 3.

Self-Supervised Model Setup
For self-supervised model training, the input to our network is cepstral mean-normalized Fbank features of the speech utterances. All features are extracted using the open-source toolkit Kaldi [36], with windows of 25 ms and a shift of 10 ms. We stack every 3 frames to reduce the memory cost of long sequences [37]. The self-supervised Transformer encoder architecture has 12 self-attention layers, and the number of attention heads is 12. Gradient descent training with mini-batches of size 16 is used to find the model parameters. The Adam optimizer [38] is employed for updating the model parameters, where the learning rate is warmed up over training.
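The 3-frame stacking can be sketched as follows (NumPy, assuming (T, C) Fbank input; dropping the trailing frames that do not fill a full group is one common convention, assumed here):

```python
import numpy as np

def stack_frames(feats, stack=3):
    """Concatenate every `stack` consecutive frames, shortening the
    sequence by the same factor: (T, C) -> (T // stack, C * stack).
    Trailing frames that do not fill a full group are dropped."""
    T, C = feats.shape
    T = T - (T % stack)          # keep only complete groups of frames
    return feats[:T].reshape(T // stack, C * stack)
```

Stacking trades a 3x wider feature vector for a 3x shorter sequence, which cuts the quadratic self-attention memory cost of the Transformer encoder.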
All the experiments are conducted on Pytorch [39].

LID Systems
To compare with our proposed model, different systems are introduced: • Xvectors: For the xvector system, the open-source toolkit asv-subtools is used [40]. The acoustic features for the xvector system are 23-dim MFCCs, and before feeding them to the xvector system, a frame-level energy-based voice activity detection (VAD) is used to select voiced speech frames. The xvector system contains 6 TDNN layers; the details of the TDNN configuration are shown in Table 4. To get the xvectors, 512-dimensional embedding features are extracted at layer segment6 of the network, before the nonlinearity. We apply Linear Discriminant Analysis (LDA) to reduce the dimension of the output vectors. Two back-end classifiers are used: Logistic Regression (LR) and Probabilistic Linear Discriminant Analysis (PLDA). • PTN: In [23], an auxiliary phonetic model produces phonetic features, and an RNN LID model is used to identify the language. The PTN is also the baseline for the AP17 OLR challenge [34]. Meanwhile, the results reported for this model use the same source data, THCHS30. • IM-LSTM-PTN: The structure of the IM-LSTM-PTN is described in [41]. This model was submitted to the AP17 OLR challenge and ranked 4th among all participating teams (http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/OLR_Challenge_2017). Based on the PTN, the IM-LSTM-PTN uses a modified LSTM with a top-down connection from time t to time t + 1.

Language Identification Results
First, we list all the results of the proposed methods in Table 5. In this table, it can be seen that for the back-end classifiers, LR always performs better than PLDA in all the LID systems. For the model transfer approaches, in the short-duration 1 s test condition, FE outperforms FT. This is because a 1 s utterance is too short to fine-tune the pre-trained model effectively. In the 3 s and full-length test conditions, FT performs better. When applying channel masking on top of time masking, the LID performance is always worse than when using time masking only. The reason is that LID often requires linguistic knowledge, which lies along the time axis, and the CTC loss is a sequence-level loss. Previous work also shows that channel masking mainly helps to encode speaker information [29]. We then compare our best results with results reported in previous works, listed in Table 6. Among all the reported results, our proposed model achieves the best performance. More specifically, when comparing our best results with the PTN, which is also a phonetic-based LID model, our results still show a significant improvement. Table 5. LID results (EER %) on different test sets. In this table, FT means fine-tuning and FE means feature extraction. "+" means the corresponding approach is used and "−" means it is not. "+ channel" means channel masking is applied in the corresponding model.

Analysis of Different Tasks
To analyze the influence of the different tasks in the training stage, we train our model with a single task and apply it to LID. To simplify the experiments, we use SSL xv+LR (FT). The results are listed in Table 7.
It shows that each single task can improve the LID results. Meanwhile, the model using only the reconstruction loss performs worse than the one using only the CTC loss, because phonetic knowledge always benefits the LID system. By combining all the losses and using the adaptive weighted loss, the LID system achieves the best performance. Considering all the results above, it can be concluded that through the reconstruction task the model captures contextual representations, and when the supervision task is applied, phonological representations are learned. By jointly training these two tasks, our proposed method improves LID performance.

Conclusions
In this paper, a self-supervised model integrating phonological knowledge is proposed for language identification. The proposed model jointly trains speech perception and speech production objectives in a self-supervised fashion, and achieves significant improvement on the downstream language identification task. In the self-supervised model, the reconstruction loss and the CTC loss with phonological labels are jointly used to train the model. For the reconstruction loss, we apply masking along the time and channel axes and use the L1 loss to reconstruct the output frames. For the CTC loss, phoneme and phonological labels are used to train the model. Through this joint training, the model learns high-level speech representations in the phonological space. The final results show that our proposed model achieves the best performance with the feature extraction (FE) model transfer approach and LR as the back-end classifier. Our future work will explore applying our proposed model to other speech tasks.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: