Domain-Adversarial Based Model with Phonological Knowledge for Cross-Lingual Speech Recognition

Phonological features (articulatory features, AFs) describe the movements of the vocal organs, which are shared across languages. This paper investigates a domain-adversarial neural network (DANN) for extracting reliable AFs, and evaluates different multi-stream techniques for cross-lingual speech recognition. First, a novel universal definition of phonological attributes is proposed for Mandarin, English, German and French. Then a DANN-based AFs detector is trained on the source languages (English, German and French). For cross-lingual speech recognition, the AFs detectors transfer phonological knowledge from the source languages to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, a monolingual AFs system (i.e., the AFs are extracted directly from the target language) is also investigated. Experiments show that the performance of the AFs detector can be improved by using convolutional neural networks (CNNs) with a domain-adversarial learning method. The multi-head attention (MHA) based multi-stream approach achieves the best performance compared to the baseline, the cross-lingual adaptation approach, and the other approaches. More specifically, when the amount of training data is restricted, the MHA-mode with cross-lingual AFs yields significant improvements over monolingual AFs, and the method can be easily extended to other low-resource languages.


Introduction
Automatic speech recognition (ASR) systems have improved greatly in recent years thanks to deep neural networks (DNNs). However, there are more than 7000 living languages in the world, of which only about 125 have access to ASR technologies [1], so developing a reliable ASR system for low-resourced languages remains a big challenge. Phonological attribute modeling, also known as "acoustic-to-articulatory (attribute) modeling", is widely used to describe the movements of the vocal organs during speech production, and these attributes can be shared among all languages. Articulatory information has proven useful in many related areas, such as pathological speech recognition [2], pronunciation prediction [3] and multilingual speech recognition [4]. There are three main methods to derive phonological features: (i) using an X-ray radiometer to measure the movements of the vocal organs [5], (ii) acoustic-articulatory mapping using filtering techniques [6], and (iii) statistical-model-based speech attribute detectors [7]. The first approach has a high initial setup cost, which makes it impractical, while the second detects only some of the phonological attributes rather than all of them [8]. This research explores the third approach because of its feasibility and reliability. The main advantage of cross-lingual ASR based on phonological attributes is that phonological knowledge can be shared across different languages.
For low-resource languages, limited linguistic knowledge leads to a scarcity of transcribed speech data for training ASR systems. The International Phonetic Alphabet (IPA) [9] therefore provides a basis for defining cross-lingual phonological attributes. The languages tackled in this paper are English, German, French, and Mandarin. In practice, Mandarin is a well-resourced language; however, to verify our proposed framework on the cross-lingual speech recognition task, we treat Mandarin as a low-resourced language by using a limited dataset.
As shown in Figure 1, our system has two key modules: (i) a domain-adversarial multi-task learning model that extracts phonological knowledge-based features (abbreviated as AFs), and (ii) a framework that fuses the AFs with conventional acoustic features, e.g., Mel Frequency Cepstral Coefficients (MFCCs). First, the AFs detector is modeled as a domain-adversarial multi-task learning system (abbreviated as DANN). The DANN model contains a gradient reversal layer that prevents the model from learning domain information (languages and speakers in this paper). In the proposed method, the domain classifier of the DANN is modified into multi-task supervised learning with speaker and language classification, while the classification of the AFs is the main task. To combine the MFCCs and AFs, different fusion approaches are considered.
The paper is organized as follows: Section 2 reviews the latest related work on AFs and multi-stream frameworks. Section 3 gives detailed information on the AFs detectors and Section 4 describes the multi-stream frameworks. Section 5 reviews the configuration of the proposed experiments. Section 6 presents the results and analysis. Finally, Section 7 gives the conclusions and directions for future research.

Related Work
Phonological research has demonstrated that each sound unit of a language can be split into smaller phonological units based on the articulators used to produce the respective sound. To generate reliable AFs, it is critical to design stable AFs detectors. Several studies found that CNNs are better able to capture articulatory information [7,10]. The combination of articulatory features and conventional acoustic features has been shown to be useful in many speech tasks. In [11], the researchers show that AFs can improve ASR performance by combining MFCCs and AFs at the lattice level. Similarly, the results in [12] indicate that combining MFCCs and AFs at the lattice level can improve the performance of pronunciation error detection.
It has also been shown that phonological features can be shared across languages. Chin-Hui Lee et al. trained three attribute detectors on Mandarin speech for three "manner" (articulatory class) features, and then used these detectors to process an English utterance spoken by a non-native Mandarin speaker [13]. In those experiments, both the stop and nasal attributes were correctly detected, which suggests that speech attributes can be used in cross-lingual speech recognition between English and Mandarin. There are few studies on multilingual speech recognition that integrate AFs. Hari Krishna et al. trained a bank of AFs detectors on a source language to predict the articulatory features of the target languages, and showed that combining AFs with the AF-Tandem method performs better than the lattice-rescoring approach [14].
Because English, German, French, and Mandarin come from two different language families, it is not straightforward to implement cross-lingual speech recognition. In [15], the researchers used English-trained MLPs to extract AFs for Mandarin; after applying PCA (Principal Component Analysis), the AF tandem features were directly concatenated with MFCCs. The results indicated that the English-trained AFs detector slightly degraded the performance. Li et al. [16] used bottleneck features from a system trained with English and Mandarin, achieving only a 1.6% relative improvement over a Mandarin baseline system built using conventional acoustic Perceptual Linear Predictive (PLP) features.
The AFs and acoustic features have different numerical ranges, so a multi-stream framework that fuses them can relate the two kinds of features better than direct concatenation. The multi-stream framework has been shown to improve the performance of ASR systems. In [17], the researchers proposed a multi-stream setup to combine M-vector features (Sub-band Based Energy Modulation Features) and MFCCs, which improves ASR performance. In [18], the authors implement a multi-stream system with 5 sub-bands and a proposed fusion network for a noise-robust ASR task. These previous works suggest that multi-stream processing is an effective way to boost ASR systems, especially in challenging tasks (i.e., noisy environments and low-resource ASR).

Phonological Attributes
We define the phonological attributes and their corresponding phone sets, which are listed in Table 1. For Mandarin, English, German, and French, we define the symbol sil to represent silence.
Adapted from previous work [19,20] and the IPA [21], we define a universal phonological class definition. Previous works [22,23] defined the phonological attributes for only one language (only English or only Mandarin). In contrast, our definition can be shared by multiple languages and can be easily extended to other languages. As shown in Table 1, these attributes of speech can be understood as a collection of information from fundamental speech sounds. There are six attributes for each phone: place, manner, backness, height, roundness, and voice. Every phone has a one-hot encoding in each attribute, so combining all 6 attributes gives a 32-dimensional AF vector. Each phone has a unique AF definition. In Table 1, nil means "not specified". For example, the phonological class Place does not exist in consonants; thus, in consonant phones, this class is defined as nil.
Although these languages do not share the same phone set, their phones can all be described by these attributes. Thus, all phones can share phonological knowledge at the phonological level (AFs).
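The per-attribute one-hot encoding described above can be sketched as follows. This is a minimal illustration only: the value inventories and the toy phone entry are hypothetical placeholders, not the contents of Table 1, so the resulting dimension differs from the 32 dimensions reported in the text.

```python
# Hypothetical attribute inventories. These value lists are illustrative
# assumptions and do NOT reproduce the classes of Table 1.
ATTRIBUTES = {
    "place": ["nil", "bilabial", "alveolar", "velar"],
    "manner": ["nil", "stop", "nasal", "fricative", "vowel"],
    "backness": ["nil", "front", "central", "back"],
    "height": ["nil", "high", "mid", "low"],
    "roundness": ["nil", "rounded", "unrounded"],
    "voice": ["nil", "voiced", "unvoiced"],
}


def one_hot(value, inventory):
    """Return a one-hot list marking `value` within `inventory`."""
    if value not in inventory:
        raise ValueError(f"unknown attribute value: {value!r}")
    return [1 if v == value else 0 for v in inventory]


def af_vector(phone_attrs):
    """Concatenate the six per-attribute one-hot codes into one AF vector."""
    vec = []
    for name, inventory in ATTRIBUTES.items():
        vec.extend(one_hot(phone_attrs[name], inventory))
    return vec


# Toy phone entry (hypothetical assignment, for illustration only).
toy_phone = {
    "place": "nil",
    "manner": "nasal",
    "backness": "nil",
    "height": "nil",
    "roundness": "nil",
    "voice": "voiced",
}
vec = af_vector(toy_phone)
print(len(vec), vec)
```

Because every attribute contributes exactly one active entry, the vector always sums to six, and two phones share AF dimensions wherever their attribute values coincide, regardless of language.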
Table 1. Phonological attributes and their corresponding phone sets for Mandarin, English, German, and French.

Domain-Adversarial Modeling: Integrating Phonological Knowledge
To make the model learn phonological knowledge, we propose a novel representation extraction model, which combines a deep CNN-based model and domain adversarial learning.
Ganin et al. [24] proposed the domain adversarial neural network (DANN). The network learns two classifiers: the main task classifier and the domain classifier. The latter determines whether the input sample is from the source or the target domain. Both classifiers share the same hidden layers, which learn the hidden representations for each task. The DANN model has a gradient reversal layer (GRL) between the domain classifier and the hidden layers. This layer passes the data unchanged during forward propagation and inverts the sign of the gradient during backward propagation. The goal of the DANN is to reduce the distribution differences between the source and target domains. With the help of the GRL, the shared layers receive the reversed gradient, scaled by λ, from the domain classifier. Thus, the network finds a representation that maximizes the error of the domain classifier while minimizing the task classification error as usual. By pursuing these two goals together, the model can learn a representation that is discriminative for the main classification task while making the samples from either domain indistinguishable.
As shown in Figure 2, we apply three supervised classification tasks in the DANN model. We include articulatory and phoneme classifiers as main tasks and speaker and language identification as domain classifiers. The objective function of the DANN is defined in Equation (1), where L_phn, L_af, L_lid and L_spk are the loss functions of the phoneme, articulatory, language and speaker classification, respectively. λ is a trade-off weight parameter that controls each loss term (i.e., there is one weight for each loss). Because there are 6 articulatory classes, the index i denotes the i-th articulatory class. The CNN block is a U-Net-like [25] CNN structure used to obtain features at different time and frequency scales, as depicted in Figure 3.
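The behaviour of the gradient reversal layer can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the gradients are toy numbers, and a real system would implement the layer inside an automatic-differentiation framework.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam  # trade-off weight (lambda in the text)

    def forward(self, features):
        # Forward propagation passes the features through unchanged.
        return list(features)

    def backward(self, grad_from_domain_head):
        # Backward propagation flips the sign and applies the weight.
        return [-self.lam * g for g in grad_from_domain_head]


def shared_layer_gradient(task_grad, domain_grad, grl):
    """Gradient reaching the shared layers: task gradient plus reversed domain gradient."""
    reversed_grad = grl.backward(domain_grad)
    return [t + r for t, r in zip(task_grad, reversed_grad)]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 0.7]
out = grl.forward(features)

# Toy gradients of the main-task loss and of the domain loss w.r.t. the features.
task_grad = [0.1, 0.2, -0.3]
domain_grad = [0.4, -0.6, 0.2]
combined = shared_layer_gradient(task_grad, domain_grad, grl)
print(out, combined)
```

A gradient-descent step along `combined` lowers the main-task loss while raising the domain loss, which is the adversarial effect described above; the domain classifier itself, placed after the layer, still receives the unreversed gradient and keeps trying to identify the domain.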
In multi-task learning, the performance of the model can be sensitive to the relative weights of the different tasks, and finding optimal values can be expensive. To train the model better, we propose to use an adaptive loss function that automatically tunes the task-specific weights of the loss functions [26].
The resulting task-weighted loss is presented in Equation (2). In this equation, σ denotes the coefficient (weight) of each task.
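As a rough illustration of how task weights can be tuned automatically, the sketch below uses the widely used homoscedastic-uncertainty weighting, in which each task loss L_k is scaled by 1/(2σ_k²) and a log σ_k term is added as a regularizer. Whether Equation (2) has exactly this form is an assumption, since the equation is not reproduced here; the task names and loss values are toy numbers.

```python
import math


def adaptive_multitask_loss(losses, log_sigmas):
    """Sum of task losses weighted by learnable uncertainties.

    Assumed form (not confirmed by the source):
        total = sum_k [ L_k / (2 * sigma_k^2) + log(sigma_k) ]
    `log_sigmas` stores log(sigma_k), which keeps sigma_k positive.
    """
    total = 0.0
    for name, loss in losses.items():
        s = log_sigmas[name]  # s = log(sigma_k)
        total += loss / (2.0 * math.exp(2.0 * s)) + s
    return total


# Toy loss values for the four classifiers named in the text.
losses = {"phn": 1.2, "af": 0.8, "lid": 0.5, "spk": 0.9}

# With sigma_k = 1 for every task, each loss is simply halved.
equal = adaptive_multitask_loss(losses, {k: 0.0 for k in losses})

# A larger sigma for one task down-weights it, at the cost of the log term.
tuned = adaptive_multitask_loss(losses, {"phn": 0.0, "af": 0.0, "lid": 0.5, "spk": 0.0})
print(equal, tuned)
```

In training, the `log_sigmas` values would be optimized jointly with the network parameters, so no manual grid search over task weights is needed.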

Multi-Stream ASR Framework
MFCCs and AFs have different numerical ranges, so simply concatenating them and training the hybrid system biases the model towards one feature stream, and the feature combination can even harm performance [15]. We overcome this problem by implementing different modes of multi-stream training in which AFs and MFCCs are integrated, namely parallel-mode, joint-mode, and MHA-mode:
• Parallel-mode: the parallel-mode multi-stream ASR framework is shown in Figure 4. The bottleneck features (BNF1 and BNF2) are extracted from the last batch-norm layer following the approach from [27], and the bottleneck dimension is set to 100. All the layers in parallel-mode are standard TDNN layers.
• Joint-mode: the joint-mode approach is described in Figure 5. It involves two separate layers for the two individual feature streams, and one combination layer that integrates the intermediate representations of the individual streams. The configuration of the joint-mode is shown in Table 2.
• MHA-mode: a multi-head attention (MHA) based fusion method is also used in this paper, as shown in Figure 6. The attention mechanism allows a neural network to capture speech representations from different inputs. An attention score is calculated for each head, Head_i, from a query Q, a key K and a value V. In our experiments, Q is represented using MFCCs, K is represented using AFs, and the concatenated features are used as V. After fusing those features, a 6-layer TDNN-HMM model is trained as the ASR system.
Table 2. Layer-wise context configuration for joint-mode multi-stream framework.
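The MHA-mode fusion can be illustrated with a single attention head. The sketch below assumes the standard scaled dot-product form softmax(QK^T / sqrt(d_k))V, which is an assumption about the exact formula used; the learned per-head projection matrices are omitted, and the two-dimensional "MFCC" and "AF" frames are toy values standing in for projected features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of frames."""
    d_k = len(K[0])
    fused = []
    for q in Q:
        # Similarity of this query frame to every key frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value frames.
        fused.append([sum(w * v[j] for w, v in zip(weights, V))
                      for j in range(len(V[0]))])
    return fused


# Toy frames: queries from the MFCC stream, keys from the AF stream,
# values from the concatenation of both, as described in the text.
mfcc = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
afs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
concat = [m + a for m, a in zip(mfcc, afs)]

fused = attention(mfcc, afs, concat)
for row in fused:
    print([round(x, 3) for x in row])
```

A multi-head version would run several such heads on differently projected copies of Q, K and V and concatenate their outputs before the TDNN layers.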

Train and Test Data Sets
Because Mandarin, English, German and French are well-researched languages, the experiments are conducted on these languages. For English, we take a 100 h subset of the Librispeech [28] dataset for training. The French and German datasets are randomly selected from the MLS multilingual dataset [29]. For Mandarin, we use THCHS30 [30], which consists of 27 h of data for training and 5.4 h of data for testing. The language model for Mandarin is trained using a text collection randomly selected from the Chinese Gigaword corpus (https://catalog.ldc.upenn.edu/LDC2003T09 (accessed on 27 October 2021)). The detailed statistics for these languages are shown in Table 3. The speech corpora can be considered acoustically similar (i.e., relatively high signal-to-noise ratio, read speech under similar acoustic conditions, 16 kHz sampling frequency, etc.). Mandarin plays the role of a low-resourced language here by using a limited dataset. By studying our proposed method on these well-studied languages with a limited dataset, we can later apply it to other low-resourced languages.

AFs Detectors
In this paper, the extraction model for the phonological attribute-based features (AFs) is a DANN model, which is shown in Figure 3. The kernel sizes of all convolutional layers are set to 3, and the strides are chosen to preserve the sequence length. A multi-head self-attention layer with 8 heads is stacked on top of the convolutional layers. A learning rate of 0.08 is used and the dropout rate is set to 0.2. All the weights in these models are randomly initialized and trained using stochastic gradient descent with momentum. The input of the AFs detector consists of 40-dimensional log mel-filterbank coefficients together with their first- and second-order derivatives, derived from 25 ms frames with a 10 ms frame shift.
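The input dimensionality and frame count implied by this configuration can be checked with a short sketch. It assumes 16 kHz audio, frames that must fit entirely inside the signal, and a simple two-point central difference for the derivatives; speech toolkits typically use a regression window instead, so the derivative values here are illustrative only.

```python
def num_frames(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Number of full frames, assuming no padding at the signal edges."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift


def deltas(frames):
    """First-order differences via a central difference, replicating edge frames."""
    T = len(frames)
    out = []
    for t in range(T):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, T - 1)]
        out.append([(n - p) / 2.0 for n, p in zip(nxt, prev)])
    return out


def add_derivatives(frames):
    """Append first- and second-order derivatives to every static frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy 40-dimensional "log mel-filterbank" frames with a linear ramp over time.
static = [[float(t)] * 40 for t in range(5)]
feats = add_derivatives(static)
print(num_frames(16000), len(feats), len(feats[0]))
```

With 40 static coefficients, the stacked static, first-order and second-order streams give a 120-dimensional input vector per frame.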

Comparison Approaches
To evaluate the proposed multi-stream systems, we compare them with different approaches, which are described in the following parts. All the TDNN-HMM models have the same architecture (unless otherwise stated). We use the lattice-free MMI (LF-MMI) loss function [31] with one-third frame sub-sampling.

Experiments Configuration
All experiments are conducted with the Pkwrap toolkit [33]. The ASR models are trained for 7 epochs. The batch size is 32, and the learning rate is reduced gradually from 0.01 to 0.001 (e.g., epoch 1: 0.01, epoch 7: 0.001).
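One possible schedule that matches the stated endpoints is a geometric decay, sketched below. The text only gives the values at epoch 1 and epoch 7, so the shape of the curve in between is an assumption, and the function name is hypothetical rather than part of any toolkit.

```python
def lr_schedule(epoch, num_epochs=7, lr_start=0.01, lr_end=0.001):
    """Geometric interpolation between lr_start (epoch 1) and lr_end (last epoch)."""
    if not 1 <= epoch <= num_epochs:
        raise ValueError("epoch must lie in [1, num_epochs]")
    if num_epochs == 1:
        return lr_start
    frac = (epoch - 1) / (num_epochs - 1)
    return lr_start * (lr_end / lr_start) ** frac


rates = [lr_schedule(e) for e in range(1, 8)]
for epoch, lr in enumerate(rates, start=1):
    print(epoch, round(lr, 5))
```

A geometric decay multiplies the rate by the same factor at every epoch; a linear decay between the same endpoints would be an equally valid reading of the description.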

Performance on AFs Detectors
The AFs detector is the key part of our framework, so the first step is to train a reliable one. Its frame-level performance is listed in Table 4: the DANN AFs detector achieves over 82.9% average frame-level accuracy on Mandarin.
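A frame-level average accuracy of this kind can be computed as sketched below. The sketch assumes that accuracy is measured per attribute and then averaged with equal weight across attributes; whether Table 4 averages in exactly this way is an assumption, and the labels are toy data for two attributes only.

```python
def frame_accuracy(pred, ref):
    """Fraction of frames whose predicted label equals the reference label."""
    if len(pred) != len(ref):
        raise ValueError("prediction and reference lengths differ")
    if not ref:
        raise ValueError("empty reference")
    correct = sum(p == r for p, r in zip(pred, ref))
    return correct / len(ref)


def average_af_accuracy(pred_by_attr, ref_by_attr):
    """Per-attribute accuracies and their unweighted mean."""
    accs = {name: frame_accuracy(pred_by_attr[name], ref)
            for name, ref in ref_by_attr.items()}
    return accs, sum(accs.values()) / len(accs)


# Toy frame labels for two of the six attributes.
ref = {
    "manner": ["stop", "vowel", "vowel", "nasal"],
    "voice": ["unvoiced", "voiced", "voiced", "voiced"],
}
pred = {
    "manner": ["stop", "vowel", "nasal", "nasal"],
    "voice": ["unvoiced", "voiced", "voiced", "voiced"],
}
accs, mean_acc = average_af_accuracy(pred, ref)
print(accs, mean_acc)
```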

Effectiveness of the Cross-Lingual AFs on ASR
We experimented with different systems to find the best configuration for the joint-mode approach. Table 5 indicates that the configuration {−1_s1, 0_s1, 1_s1, −1_s2, 0_s2, 1_s2} has the best performance.
Table 5. Joint-mode multi-stream framework using different configurations of the combination layer, evaluated on the Mandarin test set. The subscript s1 means the frame is from the MFCCs stream and s2 means the frame is from the AFs stream, corresponding to Figure 5 (i.e., −1_s1 is frame t − 1 of the MFCCs stream and 0_s1 is the current frame t of the MFCCs stream).
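The context notation used for the combination layer can be made concrete with a small splicing sketch. It only shows how the input of the combination layer would be assembled from frames t − 1, t and t + 1 of both streams; clamping at the utterance edges is an assumption, and the stream contents are toy values rather than real hidden activations.

```python
def splice(stream, t, offsets):
    """Concatenate the frames of `stream` at positions t + offset, clamped to the edges."""
    T = len(stream)
    out = []
    for o in offsets:
        idx = min(max(t + o, 0), T - 1)
        out.extend(stream[idx])
    return out


def combine(stream1, stream2, t, offsets=(-1, 0, 1)):
    """Input of the combination layer: spliced stream 1 followed by spliced stream 2."""
    return splice(stream1, t, offsets) + splice(stream2, t, offsets)


# Toy intermediate representations: stream 1 (MFCCs branch) has 2 dimensions
# per frame, stream 2 (AFs branch) has 3 dimensions per frame.
s1 = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
s2 = [[10.0, 11.0, 12.0], [20.0, 21.0, 22.0], [30.0, 31.0, 32.0]]

middle = combine(s1, s2, t=1)
first = combine(s1, s2, t=0)
print(len(middle), middle)
```

With offsets (−1, 0, 1) on both streams, the combined input has three times the dimension of each stream, here 3 × 2 + 3 × 3 = 15 values per frame.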

Considering all the results above, all three feature fusion approaches perform better than the baseline (see Table 6). The MHA-based approach generalizes better when exploiting cross-lingual AFs and achieves the best performance (22.5% WER). We also compare AFs and bottleneck features (BNF) in MHA-mode; the results indicate that the AFs still outperform BNF. This is reasonable because, as Figure 3 shows, the DANN already contains phoneme information.

Performance on Extremely Low-Resource Training Data
To further study the performance of MHA, we also train the AFs detectors directly on Mandarin data (so-called monolingual AFs). The size of the training data varies from 1 h to 27 h, with subsets randomly selected from the THCHS30 dataset. The system trained with the 1 h subset represents the extremely low-resourced case.
The results are presented in Table 7 and Figure 8. In Table 7, the baseline is the same as that in Table 6. Figure 8 shows that both monolingual AFs and cross-lingual AFs improve the performance in all cases. Again, this indicates that the AFs from the DANN AFs detector can boost ASR performance.
As expected, the more training data is used, the better the system performs. In the low-resource case (i.e., very limited data, less than 5 h), the system with cross-lingual AFs outperforms the system trained with monolingual AFs. More specifically, with extremely low-resourced training data (i.e., the 1 h training subset), the cross-lingual AFs (45.0% WER) give a significant improvement over the MFCCs baseline (55.4% WER) and monolingual AFs (47.8% WER). Therefore, the less training data is available, the larger the improvement that cross-lingual AFs can bring.
To better illustrate our proposed method, we also introduce Cantonese because of its similarity to Mandarin. The Cantonese dataset is taken from the OLR 2021 challenge [35] and has only 13 h for training and 0.4 h for testing. Thus, Cantonese is treated as a low-resourced language as well. We conduct the experiments in MHA-mode with the same DANN AFs model. The results are shown in Table 8. The same conclusion holds: the cross-lingual AFs from the DANN can boost Cantonese ASR through the MHA-mode multi-stream framework.
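The size of the gains in the 1 h Mandarin condition can be expressed as relative WER reductions using the three WER values quoted above. The helper name is hypothetical; the arithmetic is the usual (reference − new) / reference.

```python
def relative_wer_reduction(wer_ref, wer_new):
    """Relative WER reduction in percent, with wer_ref as the reference system."""
    return 100.0 * (wer_ref - wer_new) / wer_ref


# WERs reported for the 1 h training subset.
WER_BASELINE = 55.4      # MFCCs baseline
WER_MONOLINGUAL = 47.8   # MHA-mode with monolingual AFs
WER_CROSSLINGUAL = 45.0  # MHA-mode with cross-lingual AFs

vs_baseline = relative_wer_reduction(WER_BASELINE, WER_CROSSLINGUAL)
vs_monolingual = relative_wer_reduction(WER_MONOLINGUAL, WER_CROSSLINGUAL)
print(f"vs. baseline: {vs_baseline:.1f}% relative")
print(f"vs. monolingual AFs: {vs_monolingual:.1f}% relative")
```

In absolute terms this corresponds to 10.4 and 2.8 WER points, or roughly 18.8% and 5.9% relative, respectively.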

Conclusions
This research demonstrates that the AFs detectors can act as phone decomposers, which map phones into the AF space. Although different languages do not have the same phone set, they still share phonological knowledge at the AF level through the AFs detectors.
The languages chosen in our experiments are English, German, French, and Mandarin, where Mandarin is treated as a low-resourced language by using limited training data. Because these languages come from different language families, transferring knowledge between them is more challenging, which suggests that the approach can be easily extended to other language pairs. We first propose a universal phone-to-articulatory mapping, in which the phones of different languages share the same articulatory space; this mapping can also be extended to other languages easily. Experimental results indicate that the DANN-based AFs improve the ASR results under different feature fusion approaches, and that the multi-head attention method achieves the best performance. Under extremely low-resourced conditions, our proposed approach yields significant improvements over standard ASR systems. Our future work will extend the proposed approach to other languages and to different downstream tasks.

Conflicts of Interest:
The authors declare no conflict of interest.
