Phonetic Variation Modeling and a Language Model Adaptation for Korean English Code-Switching Speech Recognition

In this paper, we propose a new method for code-switching (CS) automatic speech recognition (ASR) in Korean. First, the phonetic variations in English pronunciation spoken by Korean speakers should be considered. Thus, we tried to find a unified pronunciation model based on phonetic knowledge and deep learning. Second, we extracted the CS sentences semantically similar to the target domain and then applied the language model (LM) adaptation to solve the biased modeling toward Korean due to the imbalanced training data. In this experiment, training data were AI Hub (1033 h) in Korean and Librispeech (960 h) in English. As a result, when compared to the baseline, the proposed method improved the error reduction rate (ERR) by up to 11.6% with phonetic variant modeling and by 17.3% when semantically similar sentences were applied to the LM adaptation. If we considered only English words, the word correction rate improved up to 24.2% compared to that of the baseline. The proposed method seems to be very effective in CS speech recognition.


Introduction
Recently, automatic speech recognition (ASR) and speech translation (ST) based on end-to-end (E2E) frameworks have shown significant improvements. These systems have been widely adopted in real-life situations, such as lectures, business meetings, and human-machine conversations. Figure 1 shows an application for English-Korean ASR.
However, foreign words are increasingly common in everyday speech, which frequently degrades recognition accuracy. Many researchers have studied this issue, which is called the code-switching (CS) problem in ASR. In the case of Korean, English words pronounced by Korean speakers, known as Korean-style English (i.e., Konglish), have many phonetic variations from native-like English pronunciation. Therefore, reflecting proper phonetic variations with conventional methods is very complex. Moreover, mixed-language spoken data are very rare, so any model becomes biased toward Korean even when a large amount of data is available; a sophisticated approach is therefore needed.
To figure out the effect of CS, we investigated how often Korean sentences have English words. In the broadcasting news domain, 26 million (2.1%) out of 1277 million words are from English. The IT domain is more severe, as can be expected. There are 127 million (11.5%) English words out of 1011 million words in that domain. Figure 2 shows typical CS sentences in Korean.
Generally, CS problems can be categorized into two types: inter-sentential CS, where language transitions occur at phrase, sentence, or discourse boundaries, and intra-sentential CS, where the transition occurs within a sentence. In this paper, we focus on intra-sentential CS problems. In Section 2, we introduce the CS research results obtained in other studies, and Section 3 explains the difficulties in modeling phonetic variations in English spoken by Koreans. Section 4 handles how to extract sentences with similar meanings for the language model (LM) domain adaptation. Section 5 summarizes the experimental results for the proposed method and concludes the paper.

Related Work
For a long time, language-specific speech recognition with language identification tags [2] has been studied as an intuitive approach. To detect language boundaries, the bi-phonemic probability can be calculated [3], which measures the confidence score of the phoneme using a foreign language in a CS sentence. Watanabe et al. [4] and Seki et al. [5] adopted language tags in sentence units in ASR to train the model with a language tag to distinguish language-specific characteristics.
As another approach, the context-switching database (DB) was directly built by mixing several languages to solve the data imbalance [6]. Some studies approached overcoming low-resource language pairs, which relate to the data imbalance for ASR [7][8][9]. Similarly, a study was conducted to create a more robust model using an asymmetric corpus by language [10].
Among the well-known studies, one used transliteration with Latin characters. Vu et al. [11] proposed a knowledge-based phoneme merging or data-driven merging in Chinese English, and as a similar approach, training with code-mixed resources in Hindi English was attempted [12,13]. Recently, data augmentation was used for generating a mixed corpus with a generative adversarial network (GAN) using a monolingual corpus and a few CS sentences [14]. According to Long et al. [15], acoustic data augmentation was accomplished in an English-Chinese CS speech recognition task. The unsupervised learning technique was also utilized with a monolingual CS corpus [16].

Phonetic Variant Modeling
Gathering a CS corpus for training is ineffective because its model inclines toward Korean as the data size grows. For this reason, SEAME [6], a Mandarin-English code-switching speech corpus in Southeast Asia, built a corpus using mixed sentences specially designed for Chinese English. Tjandra et al. [17] proposed the speech chain algorithm, which synthesizes speech from recognized text and then feeds it back again. Nakayama et al. [18,19] improved the performance by generating Japanese English intra-sentential sentences. In that study, the corpus was generated by substituting katakana words or phrases with English words.
To simultaneously avoid the data imbalance and low resources of CS, in this paper, we propose a hybrid method based on phonetic knowledge and deep learning, which integrates Korean and English data. For this, defining a unified alphabet was necessary to solve the intra-sentential CS problem. As a first step, linguistic or phonetic knowledge was introduced to map English phonemes to Korean phonemes based on the phonetic similarity between the two languages. Second, after applying mapping rules, end-to-end ASR was trained with reference to natural Konglish pronunciations.

Phoneme Mapping Using Phonetic Knowledge
Table 1 [20] shows the consonants in Korean and English that are related to the places of articulation, which describe the movements of the mouth, teeth, tongue, or vocal tract. English phonemes include the labial fricative /v/, dental fricative /ð/, and alveolar approximant /r/, which do not exist in Korean. If these phonemes are approximated to Korean phonemes /p/, /t/, and /l/, any English words can be transliterated into Hangul. However, phoneme differences (i.e., acoustic differences) still exist between Korean-style English (i.e., Konglish) and native English. These consonants are different from the pronunciation-based notation (e.g., /c/, /p/, and /l/ in Korean).
Unlike other languages, Konglish seems to be more severe in its phonetic variations. We should consider the phonetic variations between English pronounced by a Korean who has difficulty speaking English and English pronounced by a Korean who speaks English at a native-like level. In the former case, English phonemes are transformed into Korean-style English phonemes [21,22] (e.g., Valentine's /vaeləntaɪnz/ is changed into 발렌타인스 /ballɛntainsɯ/); in the latter case, the phonemes are closer to the native English phonemes (e.g., Valentine's /vaeləntaɪnz/ is changed into 밸런타인즈 /bɛlləntaincɯ/). Thus, due to the abundant phonetic variance between Konglish and native English, it is difficult to make acoustic models for CS speech recognition.
Table 2 shows the phoneme relationship between Korean (KR) and English (EN) based on phonetic knowledge. Some English phonemes can be directly mapped into Korean phonemes, but others should be approximately mapped to similar phonemes in Korean where possible. In this study, we defined English phonemes based on CMUdict [23] and applied the Korean English pronunciation conversion rule [24]. For example, to map the English phoneme /k/ onto the Korean phoneme /k/ or /G/, the Korean phoneme /k/ is used when it is placed at the beginning of a word, and /G/ is used when it is placed in the middle of a word. Several English monophthong or diphthong vowels have to be approximated as the Korean phoneme /v/. The following examples show the results based on the rules:

• Examples: Access rights, scratch language, and Taylor Swift.
• After phoneme mapping: 액세스 라이츠, 스크래치 랭귀지, 앤드 테일러 스위프트. (Transliteration: /aykset lwaitchu, sukhulaychi layngkwicyu, ayn thayllu suwipthu/.)

Indeed, the Korean language is composed of syllable structures as pronunciation units, so vowel insertion occurs between consonants. Consecutive consonants in an English word should be used to form a syllable structure. For this, a Konglish dictionary with regular conversion rules was made. The results show that the rules work well. To produce a Konglish dictionary, Phonetisaurus [25], an English grapheme-to-phoneme (G2P) toolkit, was adopted. For instance, "school" becomes /skul/ in English G2P and is then converted to the Konglish phoneme sequence /sUkul/ via the conversion rules in Table 2. Then, /sUkul/ is simply transformed to "스쿨" (/sukhul/) through Hangul conversion.
1 English phoneme based on CMUdict. 2 Transliterated Korean phoneme with Hangul. 3 Hangul character equivalent to each Korean phoneme.
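The conversion from an English phoneme sequence to Hangul, including vowel insertion between consecutive consonants, can be illustrated with a minimal sketch. The mapping tables below are a small hypothetical subset chosen for illustration and are not the rules of Table 2; the coda and vowel-insertion logic is likewise a simplifying assumption. Only the Hangul syllable composition formula (0xAC00 + (onset × 21 + vowel) × 28 + coda) is standard Unicode arithmetic.

```python
# Jamo inventories in Unicode composition order.
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical toy mapping from CMUdict-style phonemes to jamo (NOT Table 2).
CONS_MAP = {"S": "ㅅ", "K": "ㅋ", "L": "ㄹ", "T": "ㅌ", "P": "ㅍ",
            "V": "ㅂ", "R": "ㄹ", "N": "ㄴ", "M": "ㅁ"}
VOWEL_MAP = {"UW": "ㅜ", "IY": "ㅣ", "AA": "ㅏ", "EH": "ㅔ", "OW": "ㅗ"}
CODA_OK = {"L", "N", "M"}  # assumption: only sonorants may close a syllable


def compose(onset, vowel, coda=""):
    """Build one precomposed Hangul syllable from its jamo."""
    code = 0xAC00 + (ONSETS.index(onset) * 21 + VOWELS.index(vowel)) * 28
    return chr(code + CODAS.index(coda))


def to_hangul(phonemes):
    """Convert a phoneme list to Hangul, inserting the vowel 'ㅡ' where needed."""
    syllables = []  # each entry is [onset, vowel, coda]
    i = 0
    while i < len(phonemes):
        p = phonemes[i]
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if p in VOWEL_MAP:
            # A vowel with no preceding consonant takes the silent onset.
            syllables.append(["ㅇ", VOWEL_MAP[p], ""])
            i += 1
        elif nxt in VOWEL_MAP:
            # Consonant followed by a vowel forms an open syllable.
            syllables.append([CONS_MAP[p], VOWEL_MAP[nxt], ""])
            i += 2
        elif p in CODA_OK and syllables and syllables[-1][2] == "":
            # A permitted consonant closes the previous syllable.
            syllables[-1][2] = CONS_MAP[p]
            i += 1
        else:
            # Otherwise insert the epenthetic vowel after the consonant.
            syllables.append([CONS_MAP[p], "ㅡ", ""])
            i += 1
    return "".join(compose(*s) for s in syllables)


print(to_hangul(["S", "K", "UW", "L"]))
```

With the phoneme sequence for "school", the initial /s/ receives an inserted vowel and the final /l/ closes the second syllable, mirroring the /skul/ to /sUkul/ step described in the text.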

Considering Phonetic Variations Using End-to-End ASR
Until now, we have dealt with the phonetic modeling of English pronounced by a Korean who has difficulty speaking English. In addition, we should take the phonetic modeling of native-like English spoken by Korean people into account. To solve these problems, we introduced end-to-end ASR using an English database as an input and rule-based Konglish as an output. We expected the output of end-to-end ASR to make up for the shortcomings of the rules. Figure 3 shows the creation process of an enhanced Konglish DB. First, 1000 h of an English DB was converted into Konglish according to the rules. Second, we integrated these data with 1000 h of a Korean DB and generated a mixed model through end-to-end ASR training. By an inference process based on the model, the rule-driven Konglish phoneme sequence can be enhanced with regard to CS pronunciation, as shown in the following examples:

1. English sentence (partial): Look, if there's anything I can do to make . . .

LM Domain Adaptation Using Semantically Similar Sentences
We tried to cover the pronunciation variations and to balance the corpus in terms of the acoustic model (AM). Still, due to the lack of CS data occurring in real life, we should take the infrequent occurrence of English words into account in terms of linguistic modeling. For that reason, we considered LM domain adaptation using semantically similar sentences as the best way to approximate a target domain in real life. When similar sentences were searched for in a large text corpus, they had to include the English words in which we were interested. Figure 4 shows the overall structure of the LM domain adaptation. As shown on the right side of the dotted line, the shallow fusion method incorporating a domain LM was used.
The remaining problem was how to extract CS sentences that were semantically similar to the target domain. For this, it is reasonable to utilize a development set (dev. set) as a clue for the target domain. In this experiment, AI and economics lecture domains were chosen as the targets. The semantically similar sentences were extracted using the following three steps:

• Step 1: Sentences containing very rare English words (domain adaptation 1). The domain adaptation DB consists of CS sentences containing very rare English words. In general, the lower the frequency, the less the ambiguity. For example, deep learning is almost definitely AI-related. Accordingly, sentences containing low-frequency English words should be included in a domain DB. Very rare English words can be found by counting occurrences in the general domain. The English words in the dev. set should then be compared to the very rare English words. If there is a match, the sentence is included in the domain DB. Figure 5 shows the steps in detail.

• Step 2: Sentences containing two or more English words (domain adaptation 2). In general, if there are English words in a sentence, the topic of the sentence is likely close to the target domain. Hence, these sentences should be given preference for inclusion in a domain DB. For this reason, we extracted CS sentences that had many English words from the general domain text corpus. Duplicates of words in a sentence were not allowed. For example, one of the general domain sentences, "다이나믹 옵티마이저는 화질을 유지하면서 용량을 줄였다" ("the dynamic optimizer reduced the capacity while maintaining the image quality"), contains two foreign words: /tainamik/ (다이나믹; dynamic) and /opthimaice/ (옵티마이저; optimizer). This sentence is used for generating a domain LM if these words are included in the dev. set. A total of 183,000 sentences containing two or more English words from the dev. set were collected from the general domain text corpus. Figure 6 shows the steps in detail.

Figure 6. Block diagram of the sentences containing two or more English words. Dev. set, development set; DB, database.

• Step 3: Sentences semantically similar to the target domain (domain adaptation 3). Recently, bidirectional encoder representations from transformers (BERT) [26] have represented the semantic relationships of words in an embedding space well. In this study, we utilized KorBERT [27], which is specialized in Korean, and extracted CS sentences that were semantically similar to the dev. set from the general domain text corpus. Cosine similarity was adopted to measure the degree of similarity. The cosine similarity for arbitrary sentence vectors $a$ and $b$ is defined as follows:

$$\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|} \qquad (1)$$

Figure 7 describes the steps with the cosine similarity.

The extracted sentences for the domain DB were converted into Konglish using the rules. Finally, the domain DB was combined with the base LM using shallow fusion [28], as shown in Equation (2), where $x = (x_1, x_2, \cdots, x_n)$ is a sequence vector consisting of $n$ elements, $y$ is an existing output sequence vector, $y^*$ is an expected output sequence vector, and $\lambda_{Base}$ and $\lambda_{Domain}$ are the weights of the base LM and the domain LM, respectively. This idea can also be applied to the English-Korean automatic speech translator in Figure 1. Using the proposed method, the recognition rate between English and Korean can improve via the model with LM domain adaptation.
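The three extraction steps of this section can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the actual pipeline: sentences are assumed to be pre-tokenized lists of words, `dev_words` is the set of English words found in the dev. set, and `embed` stands in for a sentence encoder such as KorBERT (here any callable returning a vector). The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, as in Equation (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_domain_sentences(corpus, dev_words, embed, dev_vectors,
                            max_count=2000, sim_threshold=0.6):
    """Return the sentence lists for DA 1, DA 2, and DA 3.

    corpus      : list of tokenized general-domain CS sentences (assumed)
    dev_words   : set of English words observed in the dev. set
    embed       : hypothetical sentence encoder, sentence -> vector
    dev_vectors : precomputed vectors of the dev. set sentences
    """
    # Step 1: dev-set English words that are very rare in the general corpus.
    counts = Counter(tok for sent in corpus for tok in sent)
    rare = {w for w in dev_words if 0 < counts[w] <= max_count}
    da1 = [s for s in corpus if rare & set(s)]

    # Step 2: sentences with two or more distinct dev-set English words.
    da2 = [s for s in corpus if len(dev_words & set(s)) >= 2]

    # Step 3: sentences whose similarity to any dev. sentence exceeds the threshold.
    da3 = [s for s in corpus
           if any(cosine(embed(s), v) > sim_threshold for v in dev_vectors)]
    return da1, da2, da3
```

Using `set(s)` in Step 2 means a word repeated within one sentence is counted once, which matches the statement that duplicates of words in a sentence were not allowed.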

Baseline System
The experiment was conducted on ESPnet [29], an end-to-end ASR toolkit. Our model is long short-term memory (LSTM)-based with the listen, attend, and spell (LAS) [30] architecture, and it also uses a connectionist temporal classification (CTC) hybrid model [31]. The input length was 600, and the output length was 150. The training data consisted of 1.06 million sentences (1052 h) of AI Hub Korean and 1.07 million sentences (960 h) of Librispeech [32] English. During training, the data of the two languages were randomly mixed. KR baseline represents the experiment trained using Korean alone, and KR-EN used both Korean and English. The wordpiece unit was adopted as a unigram subword model [33]. This unit has also been used in an end-to-end English-Chinese CS ASR study [34]. The number of output nodes was set to 3969 (i.e., 1950 nodes in Korean and 1946 nodes in English).
For evaluation, three test sets were prepared. The first, the Economy set, consisted of 213 sentences from economics lectures in the business domain spoken by a Korean who can speak English like a native speaker. The other two, AILec1 (with 337 sentences) and AILec2 (with 623 sentences), comprised AI domain lectures spoken by a Korean speaker who could not speak English at all. AILec2 is a more difficult evaluation set than AILec1, since it contains many explanations of mathematical expressions and formulas. There are 582 English words (19.1%) out of the 3047 words in the Economy set. In AILec1, there are 957 English words (20.8%) out of 4606 words, and in AILec2, there are 2211 English words (21.6%) out of 10,223 words.

Applying the Korean-Konglish Mixed Model
The proposed model, Konglish native-like EN, was trained using a Konglish DB (made from Librispeech) and the AI Hub Korean DB. The experimental results of the KR baseline, KR-EN, and Konglish native-like EN are shown in Table 3. The character accuracy (CA) of KR-EN is a little better than the baseline for Economy, but for AILec1 and AILec2, the CA was lower than that of the baseline. As expected, the native English pronunciation learned by KR-EN helped for Economy, which was spoken by a Korean speaker with fluent English pronunciation. On the contrary, adding English data had a negative effect on recognition in both AILec1 and AILec2. Furthermore, the KR-EN in the latter two caused frequent confusion between English and Korean words at a rate of 2.8% of the total 213 sentences. Konglish native-like EN adopted syllable units, like the KR baseline, and the training DB was the same as KR-EN, except for the Konglish DB. The structure of the end-to-end ASR was almost the same as for KR-EN, but the number of output nodes was set to 1998 because transliterating English words into Hangul introduced characters unseen in the Korean data. As seen in Table 3, the performances for all test sets were improved meaningfully by integrating Konglish, which originated from native English.
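The character accuracy and error reduction rate reported in this paper are not defined explicitly in the text, so the following sketch uses the conventional definitions as an assumption: CA is one minus the character-level edit distance divided by the reference length (spaces ignored), and ERR is the relative reduction of the error rate with respect to the baseline. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def character_accuracy(ref, hyp):
    """CA = 1 - edit_distance / reference length, ignoring spaces (assumed definition)."""
    ref_chars = ref.replace(" ", "")
    hyp_chars = hyp.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 1.0 - edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def error_reduction_rate(baseline_acc, proposed_acc):
    """Relative reduction of the error rate with respect to the baseline (assumed definition)."""
    base_err = 1.0 - baseline_acc
    return (base_err - (1.0 - proposed_acc)) / base_err
```

Under these definitions, a system that raises accuracy from 0.80 to 0.84 removes one fifth of the baseline errors, that is, an ERR of about 20%.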

Applying Domain Adaptation Using Shallow Fusion
LM domain adaptation is a process of shallowly fusing the base LM with a domain corpus. The base LM was trained based on recurrent neural network (RNN) LM using Korean and English text corpora. The dev. Set consisted of 168 Economy sentences and 1895 lecture sentences. Other parameters and hardware settings are described in Table 4. These are based on the hyperparameters of Librispeech, with some values adjusted. The CS sentences for the domain adaption were extracted from 9.6 million sentences in the general domain text corpus. Among the English words sorted by frequency in the dev. Set, the CS sentences of each domain were extracted from the general domain corpus, including words with frequencies of 2000 or less. Figure 8 shows the frequency and the cumulative rate per English word, which consists of 380 ranks in the dev. set. This implies that the cumulative rate of the words remained almost above 95% in English word rank. The proportion of very rare English words in the word list was determined to be 95%, according to empirical experiments. Since very rare English words appear only in certain domains, they have the advantage of contributing to domain adaptation. Hence, it is regarded that a word frequency of 2000 or less is very rare for English words in Korean. Finally, both 13,708 and 23,395 CS sentences from each domain were extracted as the domain DBs.
1895 lecture sentences. Other parameters and hardware settings are described in Table 4. These are based on the hyperparameters of Librispeech, with some values adjusted.
The CS sentences for the domain adaptation were extracted from 9.6 million sentences in the general domain text corpus. The English words in the dev. set were sorted by frequency, and the CS sentences of each domain were extracted from the general domain corpus by selecting sentences that include words with frequencies of 2000 or less. Figure 8 shows the frequency and the cumulative rate per English word, covering the 380 ranks in the dev. set. This implies that the cumulative rate of the words remained almost above 95% in English word rank. The proportion of very rare English words in the word list was set to 95% based on empirical experiments. Since very rare English words appear only in certain domains, they have the advantage of contributing to domain adaptation. Hence, a word frequency of 2000 or less is regarded as very rare for English words in Korean. Finally, 13,708 and 23,395 CS sentences were extracted for the respective domains as the domain DBs.
The threshold of cosine similarity was set to above 0.6. In this case, 21,218 and 86,786 sentences were found for each domain after removing duplications. For instance, Figure 9 describes the cosine similarity in ascending order for the dev. set of the lecture domain. For the lecture domain, which contains 108,000 words, the 107,995th word rank was chosen to select the threshold value.
Through steps 1-3 in Section 4, the domain DBs for adaptation can be summarized as shown in Table 5, where domain adaptation 1 is DA 1, domain adaptation 2 is DA 2, and domain adaptation 3 is DA 3. The base LM was trained as an RNN LM using 2.1 million sentences in Korean and English. DA 1 + 2 + 3 means that all of the domain DBs were used for adaptation.
1 Sum of domain adaptations 1 (DA 1), 2 (DA 2), and 3 (DA 3). 2 The closed test was excluded from other comparison sets. 3 Sum of Economy, AILec1, and AILec2 (Eco + AI1 + AI2; 1173).
Table 6 shows the results of the LM adaptation. With the base LM only, the CA improved with Economy and AILec2, but not with AILec2. After adapting the LM at each step (i.e., steps 1-3 in Section 4), the performance improved steadily when compared with the K-Base LM. When all of the domain DBs were combined (i.e., K-Base LM + DA 1 + 2 + 3), we obtained the highest performance among all of the combinations. The LM adaptation worked as expected, which suggests that the proposed method for selecting CS sentences is effective for ASR.
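The two selection criteria described above (sentences containing rare English words with a frequency of 2000 or less, and sentences whose cosine similarity to the target domain exceeds 0.6) can be illustrated with the minimal Python sketch below. It is not the implementation used in the experiments: the function names are hypothetical, and `bow_embed` is a toy bag-of-words stand-in for whatever sentence embedding is actually used.

```python
import math
import re

ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def rare_english_words(word_freq, max_freq=2000):
    """Return the English words whose frequency is at most max_freq."""
    return {
        word
        for word, freq in word_freq.items()
        if ENGLISH_WORD.fullmatch(word) and freq <= max_freq
    }


def contains_any(sentence, words):
    """True if any whitespace-separated token of the sentence is in words."""
    return any(token in words for token in sentence.split())


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_similar(candidates, domain_vectors, embed, threshold=0.6):
    """Keep unique candidates whose best similarity to a domain vector exceeds threshold."""
    selected, seen = [], set()
    for sentence in candidates:
        if sentence in seen:  # remove duplications
            continue
        seen.add(sentence)
        vector = embed(sentence)
        best = max((cosine(vector, d) for d in domain_vectors), default=0.0)
        if best > threshold:
            selected.append(sentence)
    return selected


def bow_embed(vocab):
    """Toy embedding (assumption): bag-of-words counts over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}

    def embed(sentence):
        vector = [0.0] * len(vocab)
        for token in sentence.split():
            if token in index:
                vector[index[token]] += 1.0
        return vector

    return embed
```

In practice, `rare_english_words` and `contains_any` would be applied to the general domain corpus first, and `select_similar` would then be run against embeddings of the dev. set sentences of each domain. The frequency limit and the similarity threshold are the only values taken from the text.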

Analysis of English Words
In this study, our intention was to improve the recognition of English words in Korean speech. We analyzed the recognition results and computed the accuracy of English words only. Table 7 shows the results.
The Konglish native-like EN with the K-Base LM + DA 1 + 2 + 3 method showed the best performance, improving by approximately 20% over the KR baseline in all tasks.
2 Eco stands for Economy, AI1 for AILec1, and AI2 for AILec2. 3 The closed test used the shallow fusion method and was excluded from the other sets.
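As an illustration of scoring English words only, the sketch below counts how many English words in a reference transcript are aligned to an identical word in the hypothesis. It is an assumption-laden approximation rather than the scoring procedure behind Table 7: the function name is hypothetical, English words are detected with a simple Latin-letter pattern, and the alignment comes from Python's `difflib.SequenceMatcher` instead of the minimum-edit-distance alignment that standard ASR scoring tools use.

```python
import re
from difflib import SequenceMatcher

ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def english_word_accuracy(ref_tokens, hyp_tokens):
    """Fraction of English reference words aligned to an identical hypothesis word."""
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    correct = 0
    for block in matcher.get_matching_blocks():
        # Each block marks a run of tokens that is identical in both sequences.
        for token in ref_tokens[block.a:block.a + block.size]:
            if ENGLISH_WORD.fullmatch(token):
                correct += 1
    total = sum(1 for token in ref_tokens if ENGLISH_WORD.fullmatch(token))
    return correct / total if total else 0.0
```

For example, with the made-up reference `오늘 meeting 은 deep learning 주제` and the hypothesis `오늘 미팅 은 deep learning 주제`, two of the three English reference words are matched, because `meeting` was transcribed in Hangul.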

Conclusions
In the case of Korean, English words pronounced by Korean speakers, known as Korean-style English (i.e., Konglish), have many phonetic variations from native-like English pronunciation. Moreover, mixed-language spoken data are very rare, which biases any model toward Korean even when a large amount of data is available. Pronunciation variations and imbalanced data are therefore major problems that degrade the recognition of CS speech.
In this paper, we proposed pronunciation variation modeling that reflects English words spoken by Koreans, together with an LM adaptation based on similarity of meaning. First, we tried to find a unified pronunciation model based on phonetic knowledge and deep learning by applying the language identification (LID) of Watanabe et al. [4]. That approach had problems with intrusions occurring between languages, which our proposed method can avoid.
Second, we extracted the CS sentences that were semantically similar to the target domain and then applied the language model (LM) adaptation to address the modeling bias toward Korean caused by the imbalanced training data. Nakayama et al. [18] utilized a speech chain framework based on deep learning to enable ASR and TTS to learn code-switching. Although this closed-loop architecture improves the performance even without any parallel code-switching data, the improvement is limited when only synthetic speech is used, because of its quality. The performance could likely be improved further if this method were combined with ours.
Compared with the KR baseline, the proposed hybrid method (i.e., combining phonetic knowledge and deep learning) showed up to 11.6% improvement in the error reduction rate (ERR). Through the semantically similar sentence extraction process, we obtained 16.5%, 15.0%, and 16.9% improvements in ERR in the LM adaptation experiments. When all domain DBs were combined, the ERR improved by up to 17.3%. LM adaptation using the proposed method might be one way to solve the critical problem of biased data.
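For readers who want to reproduce this kind of figure, the sketch below computes an error reduction rate under the assumption that ERR is the usual relative error reduction, that is, the drop in error rate divided by the baseline error rate. This section does not spell out the formula, so both the definition and the function name are assumptions, and the numbers in the usage note are made up rather than taken from the experiments.

```python
def error_reduction_rate(baseline_error, new_error):
    """Relative error reduction in percent (assumed definition of ERR)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error
```

With a hypothetical baseline error rate of 20.0% and an adapted-model error rate of 17.0%, `error_reduction_rate(20.0, 17.0)` gives 15.0, meaning 15% of the baseline errors were removed.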
However, although we addressed some critical issues, there is still room for improvement when compared with the closed test, which represents the upper bound of the performance.
Recently, Tacotron [35], a text-to-speech (TTS) system, has produced very high-quality synthesized speech. Thus, incorporating Tacotron into our model may help cope with the CS problem. Additionally, cross-lingual speech and text embedding methods should help improve the performance of our model.

Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the use of a speech database for human voice recognition.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data are available in a publicly accessible repository (AI Hub and Librispeech).

Conflicts of Interest: The authors declare no conflict of interest.