Speech Recognition for Task Domains with Sparse Matched Training Data

Abstract: We propose two approaches to handle speech recognition for task domains with sparse matched training data. One is an active learning method that selects training data for the target domain from another general domain that already has a significant amount of labeled speech data. This method uses attribute-disentangled latent variables. For the active learning process, we designed an integrated system consisting of a variational autoencoder with an encoder that infers latent variables with disentangled attributes from the input speech, and a classifier that selects training data with attributes matching the target domain. The other method combines data augmentation methods for generating matched target domain speech data and transfer learning methods based on teacher/student learning. To evaluate the proposed methods, we experimented with various task domains with sparse matched training data. The experimental results show that the proposed method has qualitative characteristics suitable for the desired purpose, outperforms random selection, and is comparable to using an equal amount of additional target domain data.


Introduction
Deep neural networks (DNN) have been widely adopted and applied to traditional pattern recognition applications, such as speech and image recognition. DNN-based acoustic models have significantly improved speech recognition performance [1,2]. Initially, studies focused on acoustic models based on the deep neural network-hidden Markov model (DNN-HMM). More recently, end-to-end speech recognition, which completely replaces HMM with DNN, has become the focus. It has been adopted for many commercialized speech recognition systems [3,4]. DNN-based acoustic models, especially end-to-end models, use more parameters than conventional HMM-based models and require massive amounts of training data for high performance.
However, for some tasks, collecting large amounts of speech data is difficult. Non-native speech recognition is an example. It is difficult to collect extensive speech databases from non-native speakers, compared to native speakers. The number of non-native speakers of a language is usually much smaller than that of native speakers. Furthermore, due to inaccurate pronunciation and lack of language fluency, it often costs more to obtain transcription data for non-native speakers than for native speakers. In addition, some applications such as call center recording speech recognition limit the collection of large amounts of speech data due to policies such as personal information security. The shortage of training data matched to applied task domains causes degradation in speech recognition accuracy. This problem tends to be more serious in end-to-end automatic speech recognition (ASR).
To handle these difficulties, various approaches have been studied. The most representative research approaches, which are also widely used in commercial ASR services, are shared models, domain adaptation mechanisms, and semi-supervised learning approaches such as self-training and multi-task learning. Approaches that share model parameters or phone-sets have mostly been used for speech recognition tasks involving languages with low resources [5][6][7]. Domain adaptation uses a well-trained source-domain model to adapt to the target domain with matched target data [8][9][10][11]. Semi-supervised learning approaches focus on joint learning with labeled and unlabeled speech data. To learn with unlabeled speech data, self-training approaches mainly focus on generating transcriptions for unlabeled speech data using a pre-trained ASR system. Research has been conducted to obtain reliable confidence measures among the generated transcriptions [12,13]. The semi-supervised learning approaches based on multi-task learning focus on linearly combining the supervised cost function of a deep classifier with the unsupervised cost function of a deep auto-encoder, and then minimizing the combination of costs [14,15].
These methods can be classified into methods that do, or do not require transcribed data from the target task domain. Shared models and domain adaptation methods are in the former category; semi-supervised learning approaches based on self-training and multi-task learning belong to the latter. Domain adaptation mechanisms are widely applied due to their ability to improve stability and performance. However, to achieve satisfactory performance improvement, most domain adaptation approaches require a considerable amount of domain speech data with transcription. The semi-supervised learning approaches based on self-training and multi-task learning do not have additional costs for the transcription of target domain speech data, but they have the drawback of requiring a considerable amount of un-transcribed speech data from the target domain.
In this work, we focus on the problem of constructing a speech recognition system with a stable performance for domains where it is difficult to collect large amounts of matched speech data. To handle this problem, we propose two approaches. The first method actively selects training data for a target domain from the training data of another domain that already has a significant amount of labeled speech data by using attribute-disentangled latent variables. The second is a method that combines data augmentation methods for the target domain of sparse matched speech data and a transfer learning method based on teacher/student learning.
The remainder of this paper is organized as follows. In Section 2, we briefly review the research areas related to our proposed methods. Section 3 describes our proposed approach in detail. Section 4 explains the experimental setting, and Section 5 presents the experimental results. Finally, Section 6 concludes this paper and discusses future work.

End-to-End Speech Recognition
End-to-end (E2E) speech recognition systems, consisting of a single integrated neural network model trained directly on input speech and output transcriptions, have recently been proposed. Such systems have been applied to many commercialized speech recognition services. In traditional speech recognition, the acoustic model is trained through several steps. In addition, to combine the acoustic model, the pronunciation dictionary, and the separately trained language model, weighted finite-state transducers (WFSTs) are used to find the most probable path and recognize speech. This process is cumbersome and requires prior knowledge of speech recognition to understand the role of each component. By contrast, the end-to-end model does not need several training steps, and its structure is easy to understand.
A typical end-to-end speech recognition model uses the connectionist temporal classification (CTC) method [16][17][18]. This method uses a recurrent neural network to infer text strings directly from input speech features. As in a Gaussian mixture model-hidden Markov model (GMM-HMM) system, character posterior probabilities are estimated for every frame, and the optimal path through the estimated character sequence is then determined. A different approach uses a speech recognition model based on the sequence-to-sequence (seq2seq) model, which achieved significant performance improvements in the field of machine translation [19][20][21][22]. This model consists of a recurrent encoder and decoder. The encoder computes an output for every frame of the input speech features. The decoder computes attention weights indicating which encoder output frames it should attend to, and estimates the final character string using the attention-weighted encoder outputs as input. This seq2seq-based speech recognition model shows performance comparable to other end-to-end models, and research on it is ongoing.
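The attention computation described above can be sketched as follows. This is a minimal dot-product (content-based) attention toy, not the location-aware variant used later in the experiments, and all vectors are hypothetical.

```python
import math

def attention_context(dec_state, enc_outputs):
    """Score each encoder output frame against the decoder state
    (dot product), normalize with a softmax, and return the attention
    weights plus the weighted context vector fed back to the decoder."""
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_outputs]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc_outputs[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_outputs))
               for k in range(dim)]
    return weights, context

# Two encoder frames; the decoder state is most similar to the first.
w, ctx = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in w])  # → [0.731, 0.269]
```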
To verify the proposed method in this study, we used ESPnet, a Python-based open source platform for public end-to-end speech recognition [23].

Active Learning
Active learning is a field of machine learning that allows a model to select its own training data. It aims to achieve the desired level of performance on the target task with less newly labeled training data [24]. For this purpose, it is important to select training data that carries a significant amount of information. Studies have been conducted on various confidence scores to measure the informativeness of new training samples. The most common method in automatic speech recognition is the least confidence (LC) sampling technique, in which training samples with the least certainty are considered the most informative from the model's perspective [25]. The LC for a sequence model can be obtained as follows:

φ^LC(x) = 1 − P(y* | x; θ),

where y* is the most likely label sequence, x is an observation sequence, and θ represents the model parameters. This is a method based on uncertainty sampling, which was previously proposed in the field of machine learning. Uncertainty sampling selects the new training samples that the underlying model predicts with the lowest confidence [26]. The drawback of LC-based sampling is that it can suffer from sampling bias, and in particular can be biased towards a specific group of speakers [27]. Conventional studies of speech recognition using active learning mainly focus on the problem of selecting data to be transcribed from unlabeled target domain speech data, and mainly target ASR for languages with low resources. The method proposed in this paper differs from conventional active learning techniques: it selects training data for a target domain with low resources from the training data of another domain that already has an immense amount of labeled speech data.
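A minimal sketch of LC-based data selection under this formula; the utterance IDs and decoder log-probabilities below are hypothetical.

```python
import math

def least_confidence(log_prob_best):
    """LC score 1 - P(y* | x; theta), given the log-probability of the
    most likely label sequence y* from the decoder."""
    return 1.0 - math.exp(log_prob_best)

def select_most_informative(utterances, k):
    """Rank (utterance_id, log P(y*|x)) pairs by LC and return the k
    least-confident, i.e., most informative, utterance ids."""
    scored = [(uid, least_confidence(lp)) for uid, lp in utterances]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [uid for uid, _ in scored[:k]]

# Hypothetical decoder outputs for three utterances.
utts = [("u1", math.log(0.9)), ("u2", math.log(0.4)), ("u3", math.log(0.7))]
print(select_most_informative(utts, 2))  # → ['u2', 'u3']
```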

Disentangled Representation Learning in Speech Processing
Representation learning is a method that learns representations of input data, with the main aim of yielding abstract and useful representations for tasks such as classification. In the past, research focused on feature engineering to create representations that support machine learning algorithms. However, representations based on deep learning are now being widely studied [28]. Among the many applications of deep representation learning is disentangled representation. This method separates each feature into narrowly defined variables and encodes them into separate dimensions [29]. Assuming that the data is generated from independent factors of variation, disentanglement enables these factors to be captured by different independent variables in the representation, which yields a concise abstract representation of the data [30].
Several recent works have leveraged variational autoencoders (VAEs) to learn disentangled representations of sequence data such as speech, video, and text. Hsu et al. [31] proposed a novel factorized hierarchical VAE, which learns disentangled and interpretable latent representations from speech data by explicitly modeling the multi-scale information with a factorized hierarchical graphical model. Speech data inherently contains information at multiple scales, such as noise, channel, speaker, prosody, and phonetic content. These are independent factors operating at different time scales. For instance, noise, channel, and speaker identity affect the sequence level and tend to have a smaller amount of variation within an utterance compared to the variation between utterances. The sequence level is a time series of speech data that is the object of model training and speech recognition; it can be an utterance or a phrase. However, the phonetic content affects the segmental level and tends to have a similar amount of variation within and between utterances. The segment level is a small unit that composes a sequence and consists of frames of a certain length. Factorized hierarchical VAEs consist of an inference model and a generative model that learn a disentangled representation of a latent sequence variable and a latent segment variable, which have properties that change at the utterance and segment level, respectively. Figure 1 is a graphical illustration of factorized hierarchical VAEs [31].
In Figure 1, X^(i) = {x^(i,n)}, n = 1, ..., N^(i), is a sequence of N^(i) observed variables, where x^(i,n) is the n-th segment of the i-th sequence and N^(i) is the number of segments of the i-th sequence. As Figure 1 illustrates, the following generation process is considered. First, an s-vector μ2^(i) is drawn from a prior distribution p_θ(μ2^(i)) for each i-th sequence. Second, N^(i) independent and identically distributed latent sequence variables z2^(i,n) are drawn from a sequence-dependent prior distribution p_θ(z2^(i,n) | μ2^(i)), and latent segment variables z1^(i,n) are drawn from a segment-independent prior p_θ(z1^(i,n)); each observed segment x^(i,n) is then generated from p_θ(x^(i,n) | z1^(i,n), z2^(i,n)).
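The generation process above can be sketched as follows; Gaussian priors with illustrative dimensions and variances are assumed, and the decoder p_θ(x | z1, z2) is omitted.

```python
import random

def generate_sequence(n_segments, seq_dim=2, seg_dim=2):
    """Sketch of the factorized hierarchical generation process: draw a
    per-sequence s-vector mu2, then for each segment draw a
    sequence-dependent latent z2 around mu2 and an independent
    segment-level latent z1. Dimensions and variances are illustrative."""
    mu2 = [random.gauss(0.0, 1.0) for _ in range(seq_dim)]
    segments = []
    for _ in range(n_segments):
        z2 = [random.gauss(m, 0.5) for m in mu2]               # shares the sequence attribute
        z1 = [random.gauss(0.0, 1.0) for _ in range(seg_dim)]  # per-segment content
        segments.append((z1, z2))
    return mu2, segments

mu2, segs = generate_sequence(3)
print(len(segs))  # → 3
```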


Teacher-Student Learning for Domain Adaptation
The performance of ASR degrades significantly when there is a mismatch between the training and real test environments. The most intuitive and commonly used solution is to adaptively train a well-trained source domain model to target domain data [8,32]. However, the disadvantage of domain adaptation is that it requires a considerable amount of labeled target domain data. Thus, it is costly and not suitable for domains where it is difficult to collect large amounts of data.
To handle these issues, teacher-student (T/S) learning was proposed for the domain adaptation of DNN-HMM-based acoustic models [33]. In teacher-student learning, the Kullback-Leibler (KL) divergence between the posterior distributions of the teacher and student networks, given parallel source and target domain data as input, is minimized by updating the model parameters of the student network. The KL divergence between the source and target distributions is as follows:

KL = Σ_f Σ_i P_T(s_i | x_src,f) log( P_T(s_i | x_src,f) / P_S(s_i | x_tgt,f) ),

where s_i indicates a state with index i and f is the frame index. P_T(s_i | x_src,f) and P_S(s_i | x_tgt,f) indicate the posterior distributions of the teacher and student networks, respectively, while x_src,f and x_tgt,f are the source and target inputs to the teacher and student networks. T/S-based training using the soft labels of the teacher network output showed improved results compared to conventional cross-entropy training, which directly uses the hard labels in the target domain [33][34][35]. However, to obtain high performance, T/S learning for domain adaptation requires parallel sequences of source and target domain data, which consist of real data and a simulated pair generated from one domain to the other, synchronized frame-by-frame. Examples of parallel source/target data in previous T/S learning studies include real clean speech and simulated noisy speech [33], real adult speech and simulated child speech [33], enhanced clean speech and real noisy speech [34], and real close-talk speech and simulated far-field speech [35]. The target domains in these studies are noisy speech, child speech, and far-field speech recognition, respectively. The results of previous studies suggest that the performance of T/S learning tends to be influenced by the quality of the simulated data.
For example, the experimental results of T/S learning for the noisy speech domain were shown to outperform conventional training; however, in the child speech domain, only limited performance improvements were obtained.
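The T/S objective can be sketched numerically as follows; the posteriors are toy values. Since the teacher term is constant with respect to the student parameters, minimizing this KL is equivalent to cross-entropy training on the teacher's soft labels.

```python
import math

def ts_kl_loss(teacher_post, student_post):
    """Frame-wise KL divergence between the teacher posteriors (computed
    on source-domain input) and the student posteriors (computed on the
    parallel target-domain input), summed over frames f and states s_i."""
    loss = 0.0
    for pt_frame, ps_frame in zip(teacher_post, student_post):  # frames f
        for pt, ps in zip(pt_frame, ps_frame):                  # states s_i
            if pt > 0.0:
                loss += pt * math.log(pt / ps)
    return loss

# Toy posteriors: two frames, three senone states.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(round(ts_kl_loss(teacher, student), 4))  # → 0.0643
```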

Active Learning Using Latent Variables with Disentangled Attributes
In this section, we describe the proposed method of actively selecting training data for a target domain using attribute-disentangled latent variables. As described in the Introduction, there are tasks that require large amounts of speech data that are difficult to collect. These include non-native speech recognition and call center recording speech recognition. The proposed active learning method is meant to effectively obtain training data close to the acoustic characteristics of these target domains from other domains, such as native speech recognition and broadcast speech recognition. In these other domains, we can obtain relatively large amounts of labeled speech data, although the overall acoustic characteristics, such as speaker and channel/noise environments, are somewhat different.
For this purpose, we designed an integrated system consisting of factorized hierarchical VAEs [31], with an encoder that infers latent variables with disentangled attributes from the input speech, and a DNN-based classifier that selects training data with attributes matching the target domain, using the preceding encoder output as input [36]. Each input speech signal subject to ASR contains various acoustic attributes. Among these attributes, phonetic content information changes at segment-level time scales, while environmental information such as channel/noise, speaker information such as individual speaker identity and gender/age, and prosodic information such as accent, tone, and speaking rate change at sequence-level time scales. The VAEs that infer latent variables with disentangled attributes and the DNN-based classifier are trained using a small amount of speech data obtained from the target domain and other general domains. Here, the VAEs are trained by an unsupervised learning method using speech as input/output. The classifier is trained via supervised learning, taking as input the attribute-disentangled latent variables output by the preceding VAEs and outputting a label indicating whether the input is from the target domain or other general domains. Figure 2 illustrates the learning stage of the integrated system consisting of the VAEs and classifier using the example training data. In the VAEs' learning step (a), the VAEs, which generate latent variables with disentangled attributes, are trained through the process of inferring the attribute-disentangled latent variables in the encoder and generating the speech again in the decoder. For example, the VAEs can use paired non-native and native speech data as input/output training data.
In the classifier learning step (b), the classifier is trained by supervised learning, using the sequence-level latent variables of the training data together with label information identifying whether each utterance is from the target domain or the general domain. Finally, the data selection step, which selects training speech data with the desired attributes matched to the target domain, is shown in Figure 3.
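The data selection step can be sketched as follows, assuming a trained classifier that maps a sequence-level latent variable to a target-domain posterior; `classify`, the utterance IDs, and the latent values are hypothetical stand-ins.

```python
def select_matched_data(general_pool, classify, threshold=0.5):
    """Data selection step: run each general-domain utterance's
    sequence-level latent variable through the domain classifier and
    keep utterances whose target-domain posterior exceeds the threshold."""
    return [uid for uid, z in general_pool if classify(z) > threshold]

# Toy stand-in for the trained DNN classifier: the first latent
# dimension plays the role of the target-domain posterior.
classify = lambda z: z[0]
pool = [("bcast_001", [0.9, 0.1]),
        ("bcast_002", [0.2, 0.5]),
        ("bcast_003", [0.7, 0.3])]
print(select_matched_data(pool, classify))  # → ['bcast_001', 'bcast_003']
```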

Teacher/Student-Based Transfer Learning Using Augmented Training Data
In this section, we describe the proposed method, which consists of data augmentation of the target domain with sparse matched speech data and transfer learning based on teacher/student learning. For the domain with sparse matched training data, where it is difficult to collect large amounts of speech data, we propose a transfer learning method based on T/S learning. This method uses a considerable amount of labeled speech data from a general domain and augmented speech data converted from the general domain to provide the acoustic characteristics of the speaker, channel, and noise environment similar to the target domain.

The transfer learning method proposed in this paper is as follows. First, as in the VAE learning step of Figure 2a, VAEs that generate latent variables with disentangled attributes are trained using paired target domain and general domain speech data as input/output training data. Thereafter, data augmentation is performed through the process illustrated in Figure 4. The speech data for training in the target domain can be augmented significantly using pre-trained encoders that generate latent variables with disentangled attributes. The encoders maintain the phonetic content attributes from the large-scale speech database (DB) and substitute other attributes, such as environmental channel/noise factors and speaking style factors, from the target domain speech DB.
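The attribute-swapping augmentation can be sketched as follows; `decode`, the latent dimensions, and the toy values are hypothetical stand-ins for the trained FHVAE decoder and its latent variables.

```python
def augment_by_attribute_swap(content_latents, target_style_latent, decode):
    """Attribute-swapping augmentation: keep the segment-level phonetic
    content latents (z1) inferred from a large-scale-DB utterance,
    substitute the sequence-level speaker/channel/noise latent (z2)
    inferred from a target-domain utterance, and decode each segment."""
    return [decode(z1, target_style_latent) for z1 in content_latents]

# Toy decoder stand-in: concatenates the two latents per segment.
decode = lambda z1, z2: z1 + z2
content = [[0.1, 0.2], [0.3, 0.4]]   # z1 for two segments
target_style = [9.0]                  # z2 from the target domain
print(augment_by_attribute_swap(content, target_style, decode))
# → [[0.1, 0.2, 9.0], [0.3, 0.4, 9.0]]
```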
Figure 5 illustrates the proposed teacher/student learning method, which uses the training data augmented in the previous step. In Figure 5, T/S-based transfer learning is performed by inputting the augmented target domain speech data and the large-scale speech DB, and sharing transcription information between them. The proposed method is a T/S-based transfer learning system with an elaborate speech recognition system trained on the large-scale speech DB as the teacher system, and a speech recognition system for the target domain as the student system. Learning is performed using the large-scale speech DB and the augmented target domain speech DB as inputs, with shared labels for each teacher/student system. Simultaneously, multi-task learning, with a loss function learned in a direction insensitive to domain variations, is performed through a domain adversarial multi-task module that performs domain classification by inputting deep features obtained from the middle layer of the final system.
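A minimal sketch of how the three objectives could be combined, assuming a weighting factor `lam` for the adversarial term; the negative sign reflects the gradient-reversal trick, under which the shared layers maximize the domain-classification loss.

```python
def multitask_loss(student_asr_loss, ts_loss, domain_loss, lam=0.1):
    """Combined objective sketch: the student minimizes its ASR loss and
    the T/S distillation loss, while the gradient-reversal trick makes
    the shared layers maximize the domain-classification loss, so the
    adversarial term enters with a negative weight lam (an assumption)."""
    return student_asr_loss + ts_loss - lam * domain_loss

print(round(multitask_loss(1.0, 0.5, 2.0), 2))  # → 1.3
```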

Experimental Settings
In this section, we describe the corpora and the detailed architecture of the end-to-end ASR system used for the experiments.

Corpus Descriptions
In order to verify the proposed methods, ASR for non-native Korean speech and Korean call center recordings were used as the main task domains with sparse matched training data. The Korean broadcast speech corpus was used for the large-scale speech DB. The AMI meeting corpus [37] and certain other sources were used for the functional verification of the integrated systems. Table 1 shows the summarized characteristics of the corpora used for the experiments. The AMI meeting corpus consists of 100 h of annotated recordings of planned meetings. For each meeting, four participants have a free conversation in English and simulate a project meeting on product design. The meetings last approximately 30 min each, and multiple microphones are used to simultaneously record conversations in different environments. Of the available microphone channels, we used the individual headset microphone (IHM) channel for clean close-talking speech and the single distant microphone (SDM) channel for far-field noisy speech.

Non-native Korean speech corpus
We used 520 h of an in-house non-native Korean speech corpus gathered from Korean language education providers. This corpus contains speech from 830 non-native Korean speakers, recorded via PC microphone and smartphone channels. The corpus was collected over approximately five years from Korean language education providers that use speech technology for non-native speakers residing in Korea [38]. The Korean proficiency of the non-native speakers varies from beginner to advanced, and all speech data was transcribed by humans.

Korean call center recording corpus
The Korean call center recording corpus used in this study contains approximately 1000 h of conversations recorded at operating call centers. The data was provided with personal information removed [39]. It is mainly composed of conversations between agents and customers. Speech recognition performance on this database is degraded by its acoustic characteristics, including an overall fast speaking rate and inaccurate pronunciation. Therefore, a considerable amount of data is required to improve performance, but this represents a domain in which training data collection is difficult due to problems such as privacy protection.

Korean broadcast speech corpus
The Korean broadcast speech corpus was used as the large-scale speech DB for the general domain, the counterpart to the domains with sparse matched data. Broadcast audio data is easily accessible in large amounts. It contains speech uttered by various speakers in diverse noise environments; therefore, it can be used to build a large-scale speech DB for improving ASR performance in the general domain. We used 14,000 h of the Korean broadcast speech corpus, with reliable transcriptions extracted from multi-genre broadcast raw audio with inaccurate subtitle timestamps through the method proposed in [40]. In this study, all speech data was down-sampled to 8 kHz, the sampling rate of the Korean call center recording corpus.

Detailed Architecture of the End-to-End ASR
Each utterance of the training speech data was converted into 80-dimensional mel filter bank (MFB) features using the Kaldi toolkit [41]. Each frame of the feature vectors was computed with a 25 ms window size and a 10 ms shift. One segment in the proposed method consists of five consecutive frames. For the end-to-end ASR system, we used ESPnet, an open source platform [23], to train the end-to-end model parameters in all of the experiments. The encoder network consisted of 2 VGG blocks [42], followed by 5 bidirectional long short-term memory (BLSTM) layers with 1024 units each. The decoder network consisted of 2 long short-term memory (LSTM) layers with 1024 units each. A location-aware attention mechanism was used. Training jointly optimized the cross-entropy and connectionist temporal classification (CTC) losses.
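Under the stated analysis parameters, the frame and segment counts work out as follows; `num_frames` assumes simple truncation without padding, which is one common convention.

```python
def num_frames(num_samples, sample_rate=8000, win_ms=25, shift_ms=10):
    """Number of feature frames for an utterance with a 25 ms window
    and 10 ms shift (simple truncation, no padding assumed)."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift

def num_segments(frames, seg_len=5):
    """Segments of five consecutive frames, as used in this work."""
    return frames // seg_len

frames = num_frames(3 * 8000)          # a 3-second utterance at 8 kHz
print(frames, num_segments(frames))    # → 298 59
```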

Experiments for Active Learning Using Latent Variables with Disentangled Attributes
In this experiment, we evaluated the performance of the proposed active learning method using latent variables with disentangled attributes. We implemented the designed integrated system, consisting of an encoder that infers the latent variables and a DNN-based classifier that selects training data with attributes matching the target domain, using the preceding encoder output as input.
First, for the functional verification of the integrated system, we measured the performance of the classifier using the AMI meeting corpus. The factorized hierarchical VAEs were trained using 265k utterances from the SDM and IHM channels of the AMI meeting corpus, and the classifier was trained using 10k utterances from the same channels. Classification performance was evaluated on 300 utterances from the SDM and IHM corpus; the results are shown in Table 2. In general, the longer the utterance, the better the classification performance. As an input vector, however, the sequence-level latent variable outperforms mel-frequency cepstral coefficients (MFCC) [43] and log-mel filter banks [44], and its classification results remain robust even for short utterances.
We then evaluated the proposed active learning method on the non-native Korean speech recognition task, a task domain with sparse matched training data. The proposed integrated system consisted of VAEs with an encoder that infers latent variables with disentangled attributes and a classifier that actively selects training data with attributes matching the target domain. The system was trained using 520 h of the non-native Korean speech corpus and the Korean broadcast speech corpus combined. Figure 6 visualizes, through t-distributed stochastic neighbor embedding (t-SNE), the spatial distribution of the sequence-level latent variables in the acoustic space, i.e., the output of the trained VAE encoder given non-native Korean speech and native Korean speech as input.
Table 3 shows the speech recognition performance of the models trained on the speech DB selected by the proposed active learning method. All models in Table 3 are end-to-end ASR systems with the architecture described in Section 4.2. We trained the baseline model using 200 h of the non-native Korean speech DB, and then compared two extended versions of the model.
One model was trained with 500 h of additional native Korean speech DB selected by the proposed active learning method, and the other with an equal amount of randomly selected speech DB. The evaluation set comprised 515 utterances recorded for assessing a Korean tutoring service for foreigners. As shown, the proposed active learning method yields a significant improvement.
Table 3. Comparison of syllable error rate (%) on the non-native Korean speech recognition task for the models trained using the speech database (DB) selected by the proposed active learning method and by a random selection method.
We also applied the proposed active learning method to the call center recording speech recognition task, another task domain with sparse matched training data. Figure 7 visualizes, through t-SNE, the spatial distribution of the call center recordings, broadcast speech, selected DB, and unselected DB. In Figure 7, the selected and unselected DBs are both sampled from broadcast speech: the selected DB was chosen by the proposed active learning method, while the unselected DB is the remaining part that was not selected. As Figure 7 shows, both DBs lie in the same space as the broadcast speech, but the selected DB is distributed closer to the call center recording DB, while the unselected DB is distributed farther away.
Table 4 shows the speech recognition performance of the models trained on the speech DB selected by the proposed active learning method. We trained the baseline model using 500 h of Korean call center recordings, and then compared two extended versions. One model was trained with 1000 h of additional Korean broadcast speech selected by the proposed active learning method, and the other with an equal amount of randomly selected speech DB.
The evaluation set comprised 329 utterances from one of the ASR tasks for Korean call center recording.
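The data-selection step described above can be sketched as a simple score-and-rank loop. This is a minimal illustration under the assumption of an already-trained encoder/classifier pair; the utterance pool and the scoring function below are hypothetical stand-ins, not the actual system.

```python
# Sketch of active selection: rank general-domain utterances by a
# (stand-in) target-domain classifier score, then fill a duration budget.

def select_matched_data(utterances, score_fn, hours_budget):
    """Greedily pick the highest-scoring utterances until the
    duration budget (in hours) is filled."""
    ranked = sorted(utterances, key=score_fn, reverse=True)
    selected, total = [], 0.0
    for utt in ranked:
        if total + utt["hours"] > hours_budget:
            break
        selected.append(utt)
        total += utt["hours"]
    return selected

# Toy pool of broadcast-speech utterances with stand-in classifier scores.
pool = [
    {"id": "bc_001", "hours": 0.002, "score": 0.91},
    {"id": "bc_002", "hours": 0.002, "score": 0.15},
    {"id": "bc_003", "hours": 0.002, "score": 0.77},
]
picked = select_matched_data(pool, lambda u: u["score"], hours_budget=0.004)
# The two high-scoring utterances fill the budget before the low-scoring one.
```

In the paper's setting, the score would come from the DNN classifier operating on the sequence-level latent variables, and the budget would be 500 h or 1000 h depending on the task.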


Experiments on Teacher/Student-Based Transfer Learning Using Augmented Training Data
In this experiment, we evaluated the performance of transfer learning based on teacher/student learning and the use of augmented training data. First, we implemented the designed integrated system consisting of VAEs, which generate latent variables with disentangled attributes, and the following module, which generates the augmented target domain data using the paired target domain and general domain speech data as input data, as shown in Figure 4.
We evaluated the proposed data augmentation method (described in Figure 4) on the ASR task with the AMI meeting corpus. Table 5 shows the speech recognition performance of the proposed method. We trained three models using different training DBs: IHM-only, SDM-only, and SDM extended with an augmented DB (AugSDM) generated by the proposed method. For the IHM-only and SDM-only DBs, we used the standard train/dev/eval data partitions of the AMI meeting corpus. The AugSDM was generated using the proposed method by maintaining the phonetic contents of IHM and substituting channel attributes from random samples of SDM. As shown in Table 5, the model trained using only IHM suffers a sharp degradation in performance due to the large mismatch in channel conditions between IHM and SDM. The matched SDM training and evaluation condition shows improved results, and a further performance improvement was obtained by adding the AugSDM generated by the proposed method. Related to the experimental results in Table 5, Manohar et al. [45] proposed a teacher-student learning approach for unsupervised domain adaptation and reported results for domain adaptation from AMI-IHM speech to AMI-SDM speech. They used an architecture of time-delayed neural network layers interleaved with LSTM for their experiment. As we used an end-to-end ASR architecture, directly comparing the performance of our study with theirs is difficult. However, when the relative improvements over the respective baseline models are compared, our proposed method provides better results than the previous study.
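The attribute-substitution idea behind AugSDM can be sketched as follows: keep the segment-level (phonetic-content) latents of a source utterance and pair them with the sequence-level (channel) latent of a randomly drawn target-domain utterance before decoding. The latent vectors and the pairing function here are illustrative stand-ins, not the actual VAE encoder/decoder.

```python
# Sketch of attribute substitution for augmentation: keep the content
# latents, swap in a randomly sampled target-domain channel latent.
import random

def augment(content_latents, target_seq_latents, rng):
    """Pair an utterance's segment-level content latents with a randomly
    drawn target-domain sequence-level (channel) latent."""
    channel = rng.choice(target_seq_latents)
    return {"content": content_latents, "channel": channel}

rng = random.Random(0)                    # fixed seed for reproducibility
ihm_content = [[0.1, 0.2], [0.3, 0.4]]    # per-segment phonetic-content latents (toy)
sdm_channels = [[9.0, 9.1], [8.0, 8.1]]   # sequence-level channel latents (toy)
aug = augment(ihm_content, sdm_channels, rng)
# The phonetic content is untouched; only the channel attribute changes.
```

Decoding the combined latents through the VAE decoder would then yield speech with IHM content under SDM-like channel conditions.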
We then applied the proposed data augmentation method described in Figure 4 to the call center recording speech recognition task, a task domain with sparse matched training data. Figure 8 shows the spatial distribution of sequence-level attributes in the acoustic space for call center recordings, broadcast speech, and the augmented DB, which are visualized through t-SNE. Figure 8a shows that the call center recording and broadcast speech DBs have somewhat distinct spatial distributions. This is due to the difference in the channel/noise levels and other environmental factors. Figure 8b adds the augmented DB distribution to the space visualized in Figure 8a. The augmented DB was generated using the proposed method by maintaining the phonetic contents of broadcast speech and substituting sequence-level attributes from a random sample of the call center recording DB. As Figure 8b shows, the augmented DB distribution mainly overlaps that of the call center recording DB.
Figure 8. Spatial distribution of sequence-level attributes in the acoustic space of the call center recordings, broadcast speech, and augmented DB: (a) original distribution; (b) distribution with the augmented DB added.
Table 6 shows the performance results of the ASR task on the Korean call center recording. AugCall refers to the augmented DB generated using the proposed method. The model trained using an additional 500 h of AugCall augmented by the proposed method showed improved results compared to the model trained using an equal amount of additional Korean broadcast speech, and is comparable to the model trained using an equal amount of additional target domain data.
Table 6. Comparison of syllable error rate (%) for the call center recording speech recognition task, examining the models trained using AugCall, using additional Korean broadcast speech, and using an equal amount of call center recordings.

Training DB                                        Error Rate (%)
500 h Korean call center recording                 21.1
+500 h Korean broadcast speech                     19.6
+500 h AugCall augmented by the proposed method    18.5
1000 h Korean call center recording                18.5
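The relative gains in Table 6 can be checked with simple arithmetic: moving from the 500 h baseline (21.1%) to the AugCall model (18.5%) is roughly a 12.3% relative reduction, versus about 7.1% for adding broadcast speech.

```python
# Worked arithmetic from Table 6: relative syllable-error-rate reduction.

def rel_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction over the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline

aug_gain = rel_reduction(21.1, 18.5)    # AugCall vs. baseline, ~12.3% relative
bcast_gain = rel_reduction(21.1, 19.6)  # broadcast speech vs. baseline, ~7.1% relative
```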


Conclusions
In this paper, we addressed speech recognition tasks where it is difficult to collect large amounts of labeled speech data. Domain adaptation and semi-supervised learning methods are representative approaches used in both academic research and application services. Domain adaptation approaches require a considerable amount of target domain speech data with transcription. Semi-supervised learning approaches, on the other hand, reduce transcription costs, but require a significant amount of unlabeled data from the target domain and are not effective if the pre-trained model does not match the target domain.
We focused on handling the speech recognition problem for task domains with sparse matched training data, and proposed an active learning method that selects training data for the target domain from another domain that already has a significant amount of labeled speech data. We also proposed a transfer learning method based on teacher/student learning combined with data augmentation. The experimental results show that the proposed method outperforms random selection and is comparable to using equal amounts of additional target domain data.
In the future, we will implement and verify the proposed transfer learning system shown in Figure 5, and integrate the augmentation method that we have verified in the experiment described in this paper.

Conflicts of Interest:
The authors declare no conflict of interest.