Code-Switching in Automatic Speech Recognition: The Issues and Future Directions

: Code-switching (CS) in spoken language is where the speech has two or more languages within an utterance. It is an unsolved issue in automatic speech recognition (ASR) research as ASR needs to recognise speech in bilingual and multilingual settings, where the accuracy of ASR systems declines with CS due to pronunciation variation. There are very few reviews carried out on CS, with none conducted on bilingual and multilingual CS ASR systems. This study investigates the importance of CS in bilingual and multilingual speech recognition systems. To meet the objective of this study, two research questions were formulated, which cover both the current issues and the direction of the research. Our review focuses on databases, acoustic and language modelling, and evaluation metrics. Using selected keywords, this research has identiﬁed 274 papers and selected 42 experimental papers for review, of which 24 (representing 57%) have discussed CS, while the rest look at multilingual ASR research. The selected papers cover many well-resourced and under-resourced languages, and novel techniques to manage CS in ASR systems, which are mapping, combining and merging the phone sets of the languages experimented with in the research. Our review also examines the performance of those methods. This review found a signiﬁcant variation in the performance of CS in terms of word error rates, indicating an inconsistency in the ability of ASRs to handle CS. In the conclusion, we suggest several future directions that address the issues identiﬁed in this review.


Introduction
Automatic speech recognition (ASR) is one of speech technology's most common research areas. ASR has many potentials in human-machine interaction that allow us to communicate more effectively with our devices, such as computers, robots and assistive tools for the disabled [1]. Monolingual ASR systems such as Alexa and Siri have shown the usefulness of ASR systems in our daily lives. The next apparent evolution in ASR development is in the ability to handle more than one language, as most of us can be fluent in several languages. The ability of an ASR system to handle more than one language makes our interactions with the machine "more natural".
Bilingual and multilingual speech recognition systems have many benefits, as they can be useful to recognise different languages around the globe; there are currently estimated to be around 6000 languages. They are also suitable for users who speak more than one

Existing Reviews of Automatic Speech Recognition Systems
These papers were identified based on the titles, including the terms "review" or "systematic review". The existing reviews focus on monolingual ASR systems, emphasising the machine learning approach in developing the monolingual model [12,13], noise-robust techniques for ASR systems [14], development of the end-to-end model for monolingual ASR systems [15], the error detection and corrections in monolingual ASR [11], ASR system development for various languages [16,17] and the architecture, methodology, process, databases, tools and applications of monolingual ASR [18][19][20][21][22][23][24]. There has been increasing interest in CS in ASR, which has resulted in many different areas of interest, solutions and outcomes. In ref. [25], the review focuses on the computational approaches for CS speech and natural language processing. However, no work summarises the progress in the development of bilingual and multilingual ASR in handling CS, allowing the researchers to better understand the progress and possible areas for future directions.

CS in Bilingual and Multilingual Speech Recognition Systems
This section examines CS in spoken language and its role in bilingual and multilingual speech recognition systems. Many current ASR systems recognise only one language (monolingual). However, a practical ASR system must be able to handle CS, which is not an easy task as the system needs to deal with multilingual input with unpredictable switching positions [26].
Individuals that can speak more than one language (e.g., bilingual or multilingual) CS or mix their languages when communicating with others [3,5,[27][28][29][30][31][32][33][34][35][36][37][38][39]. CS is common in bilingual and multilingual communities (different cultures and language backgrounds) [40]. CS also occurs in minority languages influenced by the majority or majority languages influenced by lingua francas, such as English and French [4]. In CS, the base language refers to the language to which the syntax of a CS sentence belongs, and the foreign words are referred to as an embedded language [41].
Code-switching can occur between two languages that differ in terms of linguistic, pronunciation and speech features (for example Mandarin and English), or have high similarity in terms of linguistic, pronunciation and speech features (for example Frisian and English). CS can be classified into two primary categories: inter-sentential and intra-sentential [2,26,42]. The classification is important as it shows where the CS will occur in a sentence as well as the speech unit involved during the CS [6].
When CS occurs between sentences, it is referred to as inter-sentential CS, where speakers use words, phrases or sentences from one language (the embedded language) with words or sentences in their primary or base language [2]. This type of CS is common among fluent bilingual speakers. For example, "If you are late for the job interview, işe alınmazsın/If you are late for the job interview, you will not be hired" (CS of English and Turkish languages), and "If I am late to the appointment, ignore sahaja/If I am late to the appointment, just ignore" (CS of Malay and English languages).
Intra-sentential CS is where the shift occurs in the middle of a sentence. For example, "Wǒ zhīdào you cóng Wǒ de péng yǒu/I know you from my friend" (CS of Mandarin and English), ''Nó còngdang celebrate cái sinh nhật/He's also celebrating his birthday" (CS of Vietnamese and English), "I tak nak you campur tangan dalam life I/I don't want you to interfere in my life" (CS of Malay and English languages). For intra-sentential CS, the units, and the locations of the switches may vary widely from single-word switches to whole phrases (beyond the length of standard loanword units). As such, a vital issue to be solved is that the speech recognition system needs to consider many context-dependent phone combinations. However, the data sparseness problem for phone modelling hinders this issue from being solved.
The accuracy of ASR systems is adversely affected when dealing with CS speech. The reduction in accuracy is caused by unexpected pronunciation when languages are mixed. The embedded phonemes usually have more significant variation than the base language phonemes; either resembling the embedded language pronunciation or realised as a set of phonemic equivalents from the base language [2].
Early works on CS speech recognition employ the hybrid framework with three submodels: a pronunciation model, an acoustic model and a language model [41,42]. The pronunciation model takes the pronunciation variations of words in a dictionary. On the other hand, the acoustic model is a statistical model of acoustic features for sub-word units (for example, phonemes). The language model enables the ASR system to reduce the search space by using the conditional probabilities of subsequent words with the observed word sequences. Currently, these sub-models are trained and optimised separately, leading to sub-optimal results.
The review indicates that publications on CS ASR systems are a rising trend, indicating great interest in such developments in CS, with researchers focusing on a specific issue and offering a suitable solution. Reviewing these papers can summarise issues, progress, and possible future directions in CS ASR. Thus, there is a need for a review paper that provides progress and the directions for future undertakings in this domain.

Research Aim and Approach
This study aims to review the current papers on CS in bilingual and multilingual ASR systems. The review covers the overview, issues, and solutions in existing works by concentrating on the research focus, the databases, language and acoustic modelling, and evaluation metrics. This review also discusses future directions in this domain.
The review process consists of three stages: formulation of research questions, search methodology, and search outcome and analysis.

Formulation of Research Questions
The research questions for this study are: Research Question 1: What issues affect the recognition performance of CS ASR systems in bilingual and multilingual settings?
This question is answered by analysing the issues presented in the existing papers. Research Question 2: What is the direction in the development of CS ASR systems in terms of focus, databases, acoustics and language modelling, and evaluation metrics? This question is answered by reviewing the methods and solutions proposed in the selected papers.

Search Methodology
Here, we applied an integrated search strategy that includes searches in various online databases and a manual analysis of the selected papers. Our search covered popular databases such as: In addition, we performed a manual review where we read the title and the abstract of the papers. We then selected the papers and discarded any irrelevant papers.
We have applied specific keywords as follows: − multilingual speech recognition; − bilingual speech recognition; − code-switching.
We have applied the following inclusion criteria to screen our initial search: − Search domain: Science, technology or computer science; − Types of publication: Journals, proceedings, and transactions; − Article type: Full text; − Language: English.
We filtered irrelevant papers by referring to the following exclusion criteria: − Papers that do not focus explicitly on bilingual and multilingual speech recognition. − Papers that discuss bilingual and multilingual speech recognition as a side topic. − Papers with no details of experiments or experimental design. − The full text of the paper is not available (physical and electronic forms). − Opinions, viewpoints, keynotes, discussions, editorials, tutorials, comments, prefaces, anecdotal papers and presentations in slide format without any associated papers.
We manually checked every identified paper to ensure that the title and abstract were relevant, and we excluded non-essential papers from indexed lists. On top of that, we have used the backward snowballing method to discover any unidentified papers from the primary strategy [43].

Search Outcome and Analysis
In this section, we present the outcome of the search and selection, and analysis of the selected experimental papers in terms of publication database sources and the languages investigated.
From the initial keyword search, we identified 274 papers. Manual analysis was performed by reading the titles and abstracts of the shortlisted papers to remove the irrelevant ones. This initial screening process eliminated 151 papers, and 123 papers remained. After reading the full papers, another 71 papers were rejected based on the exclusion criteria above, leaving 52 relevant papers. The backward snowballing method allowed us to add another 14 papers. The final number of papers for the review was 66.
In terms of sources, 27 (that is, 64%) of the selected experimental papers were from the IEEE database, 13 papers were from Google Scholar, and two papers were from the Science Direct database. In terms of the languages investigated, the selected experimental papers cover many languages, both well-resourced and under-resourced. Some of the common languages investigated in multilingual and bilingual speech recognition and CS are English (25 papers), Mandarin/Chinese (10 papers), and African languages (8 papers), among others. Figure 1 shows the languages investigated by the selected papers. Of the 42 experimental papers, half of them focused on bilingual ASR, while the other half looked at multilingual ASR. More than half of the selected papers focused on CS, with English topping the list for CS, followed by the Chinese language. This finding is not surprising, as English is an international language. Several works focused on lesser-known languages such as isiZulu, Setswana, Frisian, and Malay, to name a few.
Science Direct database. In terms of the languages investigated, the selected experimental papers cover many languages, both well-resourced and under-resourced. Some of the common languages investigated in multilingual and bilingual speech recognition and CS are English (25 papers), Mandarin/Chinese (10 papers), and African languages (8 papers), among others. Figure 1 shows the languages investigated by the selected papers. Of the 42 experimental papers, half of them focused on bilingual ASR, while the other half looked at multilingual ASR. More than half of the selected papers focused on CS, with English topping the list for CS, followed by the Chinese language. This finding is not surprising, as English is an international language. Several works focused on lesser-known languages such as isiZulu, Setswana, Frisian, and Malay, to name a few.

Research Question 1: What Issues Affect the Recognition Performance of CS ASR Systems in Bilingual and Multilingual Settings?
Bilingual and multilingual speech recognition and CS differ from monolingual speech recognition as they have issues and challenges. Bilingual and multilingual without CS is where a speech recognition system is not expected to recognise two or more languages in a sentence. This kind of system usually uses the existing databases for wellresourced languages [27,28,35,37,38,[44][45][46] and is self-developed for under-resourced languages [33,34,47].

• Database Sparsity
For bilingual and multilingual speech recognition with CS, one of the significant issues is data sparsity in terms of the availability of the CS corpus and scarcity of CS occurrences in the utterances [5]. While there are established databases for monolingual speech recognition that can be used in bilingual and multilingual speech recognition, the availability of the CS corpus is limited due to the various combinations of languages in CS. A speech database covering speech variations is essential for bilingual and multilingual speech recognition [3]. Much of the research in CS develops its databases [5,29,39,41,[48][49][50]. However, the continuing interest in CS has resulted in the development of standard databases in recent research on CS [2,4,6,30,40,47,[51][52][53][54][55].

Research Question 1: What Issues Affect the Recognition Performance of CS ASR Systems in Bilingual and Multilingual Settings?
Bilingual and multilingual speech recognition and CS differ from monolingual speech recognition as they have issues and challenges. Bilingual and multilingual without CS is where a speech recognition system is not expected to recognise two or more languages in a sentence. This kind of system usually uses the existing databases for well-resourced languages [27,28,35,37,38,[44][45][46] and is self-developed for under-resourced languages [33,34,47].

• Database Sparsity
For bilingual and multilingual speech recognition with CS, one of the significant issues is data sparsity in terms of the availability of the CS corpus and scarcity of CS occurrences in the utterances [5]. While there are established databases for monolingual speech recognition that can be used in bilingual and multilingual speech recognition, the availability of the CS corpus is limited due to the various combinations of languages in CS. A speech database covering speech variations is essential for bilingual and multilingual speech recognition [3]. Much of the research in CS develops its databases [5,29,39,41,[48][49][50]. However, the continuing interest in CS has resulted in the development of standard databases in recent research on CS [2,4,6,30,40,47,[51][52][53][54][55].
The existing research is focused mainly on single-language pairs of CS [26], which means bilingual speech recognition. It is because CS research aims to allow speech recognition to recognise the change in the language within a sentence uttered by multilingual users [39]. According to Nakayama et al. [26], it is prevalent that users usually switch between two languages in a sentence. However, this does not prevent researchers from investigating alternatives to modelling the acoustics for the CS of multiple language pairs, such as the universal phoneme posterior probabilities for mixed-language speech recognition, or combined language identification and speech recognition [26].

•
Recognising CS While it is challenging for CS speech recognition to recognise large output units from base and embedded languages, it is easy to recognise the specific language from a particular speech segment. However, researchers may ignore the underlying task imbalance issues when sampling tasks from CS languages, which can lead to unsatisfactory performance by the ASR systems. First, different languages have different training data scales, so the basic task quantity for each language domain will be grossly different, leading to the task-quantity imbalance [38,53]. On top of that, as different languages have different phonological systems, the tasks taken from these languages have different recognition difficulties, causing the task-difficulty imbalance [38].
One primary concern in bilingual and multilingual speech recognition is balancing language coverage and model pollution. Using multiple source languages' acoustic data allows for excellent context coverage. However, the differences between the base and embedded languages cause impurity in the training data, reducing the accuracy of the base language's acoustic model. Moreover, different languages may produce mixed acoustic dynamics and context mismatch, adversely affecting the context-dependent models trained using different speech data from several languages [35].
Unlike monolingual speech recognition, CS speech is highly unpredictable and difficult to model. Despite recent advances in ASR using various machine learning algorithms, the success of CS speech was dampened due to the challenges of not having robust acoustic and language modelling that can process bilingual and multilingual speech that includes CS utterances [29,40,56].  Table 1 shows the direction of the selected papers in terms of focus, databases, acoustics and language modelling, and evaluation metrics.

Databases
Speech recognition research has spanned several decades, and most research has focused only on a handful of languages (usually referred to as well-resourced languages) [44,45]. Many speech databases are available for these languages. The availability of databases is not challenging in bilingual and multilingual ASR systems for well-resourced languages [28,44,45]. Bilingual and multilingual speech recognition faces challenges in the availability of databases for under-resourced languages, with many having to first develop their databases [5,34,36,39,64,65]. However, the continuing interest in CS has resulted in the development of standard databases in recent research on CS [2,4,6,30,40,47,[51][52][53][54][55].
The existing monolingual standard databases have speech utterances in a single language. On the other hand, the CS database is a compilation of speech that contains words from two languages in the same utterance. Yılmaz et al. [39] recently compiled a CS speech corpus containing 14.3 h of language-balanced speech from soap opera broadcasts, and a Dutch broadcast database, which contained 17.5 and 89.5 h of Dutch data, respectively, while the English Broadcast News Database (HUB4) is the primary source of English broadcast data [4]. Separately, Yilmaz et al. [47] developed a bilingual lexicon containing more than 100,000 Frisian and Dutch words and about 160,000 lexicon entries due to the words with multiple phonetic transcriptions.
Of the 22 papers that investigated CS, 10 of the papers (45%) developed their CS databases, while the balance (55%) made use of the existing speech databases such as FAME (Frisian Audio Mining Enterprise) [4,5,47] and SEAME (South East Asia Mandarin-English) corpus [32,40,55,58]. Table 2 mentions details of the CS databases used in the selected research. Most CS databases cover two languages, though some databases have CS for more than two languages. Most of the databases for CS have recorded utterances of more than 10 h and involve many speakers uttering two languages. Some researchers developed the CS database artificially using the text-to-speech (TTS) system [6,26,30,50]. It can be concluded that the speech database development for CS is on par with the monolingual ASR system. Many researchers also use the existing TTS system to generate CS speech. CS involves the underresourced languages lacking adequate databases [5,39,41]. Many of the under-resourced languages have no existing TTS system that can support the development of CS speech utterances. However, the review shows that under-resourced languages such as Sepedi have CS databases for multilingual speech recognition research [41].

Acoustic and Language Modelling
Most papers that investigate CS in speech are for bilingual ASR [5,6,29,32,40,41,[47][48][49][50][51]53,55,56,58], while the rest look at multilingual ASR [4,26,30,39,52,59]. The acoustic modelling for speech recognition is a critical process in any ASR system development. Over the years, many techniques have been employed to develop the acoustic model for monolingual, bilingual, and multilingual speech recognition systems. The development of an acoustic model for the bilingual and multilingual speech recognition systems that contain CS required the model to switch language during the recognition of speech, which can happen within a sentence. The CS phone sets in bilingual and multilingual speech recognition can be made in three distinctive ways; (1) by combining the phone sets of the two languages, (2) by mapping the phone sets of the two languages, and (3) by merging similar phone sets of the two languages [2,42].

• Combining
For both the bilingual and multilingual ASR systems, researchers prefer to combine the acoustic models of each monolingual model [4,6,26,29,[39][40][41]49,51,53,56,58,59,62]. Combining the monolingual models for CS is common for well-resourced languages such as English, Mandarin, and German, among others. Adel et al. [40] proposed the recurrent neural network language modelling toolkit for CS for bilingual ASR. Textual features such as words and part-of-speech (POS) tags were added to the input layer, while a set of all possible languages was added to the output layer. The probability for the succeeding language to be computed is based on the current word, the current features, and the history of words and features. According to Adel et al. [40], recurrent neural network language models improve the perplexity and error rates compared to the traditional n-gram approaches, as the former can handle more extended contexts. On top of that, linguistic analyses of the CS help better understand the task and challenges and thus allow researchers to develop an appropriate language model [40].
Yılmaz et al. [39] developed acoustic and language models for five languages with CS for multilingual ASR. Unlike past works that only consider bilingual CS, the developed ASR system can hypothesise CS word sequences for five languages. They use the transcriptions of the CS speech data for training the language model [39]. Alternatively, Yılmaz et al. [4] trained several multilingual deep neural network (DNN) models in Frisian, English, and Dutch for developing the CS acoustic model for low-resourced languages.

• Merging
Merging of acoustic models was applied in CS involving low-resourced languages [4,5,52]. The merging of similar phones of the two languages for CS reduces the size and complexity of the CS. For example, ref. [52] used the union character set for each language and eliminated the duplication of output symbols in multiple languages, allowing them to reduce the computational cost during model development. When recognising CS utterances, the model can switch the language of the output sequence. The bi-directional encoder network computes the hidden representations as input acoustic features, allowing the model to predict the language identification for variable-length segments [52].
In ref. [26], a spectrogram extracted by the short-time Fourier transform (STFT) was used together with the Librosa library for generating the speech features. They first applied wave-normalisation per utterance, followed by pre-emphasis, and extracted the spectrogram with an STFT. The final set included 40 dims log mel-spectrogram features, and 1025 dims log magnitude spectrograms. On the other hand, ref. [5] applied the recurrent neural network trained with cross-lingual embedding data to maximise the use of the available textual resources.

• Mapping
Phone mapping was applied in [48], where they mapped Indonesian with Arabic phone sets. The proposed method has the advantage that the existing acoustic model with only Indonesian phones is used to build the speech recognition system. Lin et al. [57] used the universal phone set (UPS), a machine-readable phone set based on the IPA, to represent the language's universal speech units. Generally, there is a one-to-one mapping between UPS and IPA symbols, while UPS is a superset of IPA in a few other cases.
Similarly, ref. [31] proposed a method that does not require the training of languagespecific acoustic models. The method created a new phoneme set to obtain multilingual acoustic modelling by sharing part of language-specific phonemes with other languages. However, sharing of phonemes increases the amount of training data. The acoustic model with the shared phoneme set can perform speech recognition for a minor (low-resource language) utterance.

• Other techniques
Recently, end-to-end (E2E) systems have been preferred by researchers for their simplicity and success in multilingual [5,29,41,46,[51][52][53]60] settings. An E2E system directly maps an input sequence of acoustic features to an output sequence of characters, phonemes, or words. Two common variants of the E2E framework are (i) the connectionist temporal classification (CTC) [31,47], and (ii) the sequence-to-sequence modelling with an attention mechanism [41]. In E2E, the model is trained with characters as the output targets and does not include any explicit pronunciation model or language model, which means that the E2E does not need phonetically labelled training data during development. On top of that, the E2E system predicts phones or characters directly from acoustic information without any form of manually prepared alignment, making it a suitable method for multiple languages speech recognition and CS speech recognition [41].
Some common architectures for E2E are connectionist temporal classification (CTC), attention-based encoder-decoder networks, and recurrent neural network transducers [5,51,60,66]. The performance of E2E systems depends on the training data's size (it needs massive data), which can be a problem for CS for poorly resourced languages.
Seki et al. [52] introduced a hybrid attention/CTC model that used language identification explicitly and joint language identification to predict the CS between languages. Similarly, ref. [53] adopted the attention-based E2E model for the Mandarin-English CS, with three improvements, which are: (1) multi-task learning for language identity information, (2) word pieces as English modelling units to reduce the modelling unit gap between Mandarin and English, and (3) transfer learning to further improve the performance by using the larger size of Mandarin and English monolingual data. In ref. [55], the E2E and CTC were used for CS Korean-English languages.
Yue et al. [5] proposed the unsupervised two-step approach to language modelling. Unlike the traditional hybrid HMM-DNN system, an E2E CTC acoustic model is not trained using frame-level labels concerning the Cross-Entropy (CE) criterion. The model automatically learns the alignments between speech frames and label sequences using the CTC objective. This approach predicts the conditional probability of the label sequence by adding the joint probabilities of the corresponding set of CTC symbol sequences [5].
Emond et al. [30] propose the transliteration optimised word error rate to normalise the data for three types of language models, (1) a conventional n-gram language model, (2) a maximum entropy-based language model, and (3) a long short-term memory (LSTM) language model, in the CTC environment [30].

Evaluation Metrics
In much of the research on bilingual and multilingualism, the most common form of evaluation metrics is the word error rate (WER). This evaluation metric is common for bilingual and multilingual research for languages that use the Roman alphabet. The WER is a standard metric that allows researchers to compare their current work with previous works and provides the degree of improvement in their system's ability to recognise speech with CS. It also allows the comparison of performance among the different languages. For instance, English [28,45] performs better than other languages, and bilingual and multilingual models have a lower WER than the monolingual model [5,30,32,34,36,39,44,46,56,57,61].
However, according to Emond et al. [30], conventional WER is insufficient to measure CS languages' performance due to ambiguities in transcription, misspellings, and the use of words from two different writing systems. Such errors cause the WER of an ASR system to be higher than it should be and complicate its evaluation. On top of that, these errors make it difficult to determine the modelling errors resulting from CS language and acoustic models [30].
While WER is the primary measurement metric in ASR, some research uses character error rates (CER) [26,27,52,53], especially the research involving English-Chinese CS, as the writing symbols differ between the two languages. Mixed error rate (MER) was applied in ref. [40], which applies WER to English and CER to Mandarin.
We have analysed the performance of CS ASR systems in terms of their phone set arrangements as combining, mapping, or merging. Figure 2 depicts the overall results of the evaluation of ASR performance in recognising speech with CS conducted in the selected papers.
Appl. Sci. 2022, 12, x FOR PEER REVIEW 11 of 18 In much of the research on bilingual and multilingualism, the most common form of evaluation metrics is the word error rate (WER). This evaluation metric is common for bilingual and multilingual research for languages that use the Roman alphabet. The WER is a standard metric that allows researchers to compare their current work with previous works and provides the degree of improvement in their system's ability to recognise speech with CS. It also allows the comparison of performance among the different languages. For instance, English [28,45] performs better than other languages, and bilingual and multilingual models have a lower WER than the monolingual model [5,30,32,34,36,39,44,46,56,57,61].
However, according to Emond et al. [30], conventional WER is insufficient to measure CS languages' performance due to ambiguities in transcription, misspellings, and the use of words from two different writing systems. Such errors cause the WER of an ASR system to be higher than it should be and complicate its evaluation. On top of that, these errors make it difficult to determine the modelling errors resulting from CS language and acoustic models [30].
While WER is the primary measurement metric in ASR, some research uses character error rates (CER) [26,27,52,53], especially the research involving English-Chinese CS, as the writing symbols differ between the two languages. Mixed error rate (MER) was applied in [40], which applies WER to English and CER to Mandarin.
We have analysed the performance of CS ASR systems in terms of their phone set arrangements as combining, mapping, or merging. Figure 2 depicts the overall results of the evaluation of ASR performance in recognising speech with CS conducted in the selected papers. As shown in Figure 2, the research that applies the combining of the phone set for CS [4,6,26,29,32,39,47,49,51,52,59] used WER, CER, and MER for measuring the performance of the ASR system. The existing work reported an error of 10% to 65% for WER, 5% to 37% for CER, and 37% to 38% for MER. The research that applied the mapping of phone sets As shown in Figure 2, the research that applies the combining of the phone set for CS [4,6,26,29,32,39,47,49,51,52,59] used WER, CER, and MER for measuring the performance of the ASR system. The existing work reported an error of 10% to 65% for WER, 5% to 37% for CER, and 37% to 38% for MER. The research that applied the mapping of phone sets [5,40,48,56] has reported an error of 28% to 57% for WER, 5% to 37% for PER, and 37% to 38% for MER. Meanwhile the research on merging the phone set for CS [4,6,26,29,32,39,41,47,51,52,59], has reported an error of 23 to 43% for WER, 44% to 48% for CER, and 30% to 71% for MER.

Discussion
Bilingual and multilingual speech recognition with CS differs from monolingual speech recognition as it has its issues and challenges. Bilingual and multilingual recognition without CS is where the speech recognition system is not expected to switch from one language to another during the recognition process. These bilingual and multilingual systems usually use the existing databases for well-resourced languages [27,28,35,37,38,[44][45][46]57] and are self-developed for under-resourced languages [31,32,40].
For bilingual and multilingual speech recognition with CS, 24 out of 42 (representing 57%) of experimental papers consider CS in bilingual and multilingual speech recognition development. This finding shows that CS is integral in developing bilingual and multilingual speech recognition systems. The existing research focuses mainly on single-language pairs of CS, which means bilingual speech recognition. CS research aims to allow speech recognition to recognise a change in the language within a sentence uttered by multilingual users. However, this does not prevent researchers from investigating alternatives to modelling the acoustics for the CS of multiple language pairs. The issue of data sparsity was not new, as researchers faced the same issues in the early days of ASR research. However, today, monolingual ASR research has impressive resources in terms of databases, speakers, and recording size, enabling the development of more accurate acoustic models. Despite having many monolingual databases that support the development of bilingual and multilingual speech recognition systems, the uniqueness of the languages makes it impossible to have a common database for developing bilingual and multilingual speech recognition systems in a CS environment.
For CS, a relatively new domain in ASR research, the issue of data sparsity affects the ability of the system to recognise CS. The data sparsity includes both the availability of CS databases and the scarcity of CS occurrences in the speech corpus due to the various combinations of languages in CS. Researchers need to develop their databases in these early days, but our review shows that many now prefer to use the existing available databases instead of developing their own ones (refer to Table 2). It may indicate that developing CS databases is resource-intensive and time-consuming, where researchers prefer to use the existing CS databases for their research.
We found that the variability in the occurrence of CS is one of the reasons why CS databases are limited in terms of coverage of speech. Most CS is for bilingual ASR systems, which can have a finite mix of words from the two languages. The databases developed by researchers may not include all possible mixes, leading to sub-optimal results. The databases are also limited in terms of size and variation in speech, resulting in an imbalance of resources and tasks for developing an effective recognition model. The uneven proportion of languages in the CS database can lead to task-quantity imbalance, particularly for under-resourced languages.
Another issue in CS is the need for manual annotation of the CS database, causing the development of the CS database to be expensive and time-consuming. A possible solution to manual annotation is to use the E2E method that directly converts input speech feature sequences to output label sequences without any explicit and intermediate representation of phonetic/linguistic constructs such as phonemes or words. Some researchers developed the CS database using recorded speech from sources such as broadcasts and podcasts. This form of database offers more natural CS as compared to those recorded in the studio. The main issue will be the resources needed to annotate these speeches. However, a method such as E2E allows the development of acoustic models without the need for manual annotation. Enriching the CS database from existing sources and finding an effective way to annotate speech can be one of the future directions in CS research.
A critical process in any speech recognition system development is the acoustic modelling of speech. Over the years, many techniques have been employed to develop the acoustic model for monolingual, bilingual, and multilingual speech recognition systems. The development of the acoustic model for the bilingual and multilingual speech recognition systems that contain CS requires the model to switch languages during the recognition of speech, which can happen at any point within a sentence. The ability of the ASR system to identify the CS occurrence determines the recognition performance of the system. This can be challenging due to the variation in ways a speaker can CS during a conversation. As such, the ASR system requires sufficient coverage of CS speech in real life. Using naturally recorded speech such as broadcasts and podcasts is one possible solution. In addition, having the ASR system predict the occurrence of CS in speech from the linguistic point of view will help the ASR system to improve its recognition performance.
The CS for languages that differ in terms of linguistic, pronunciation and speech features can be better predicted for CS occurrence than for languages that have high similarity in terms of linguistic, pronunciation and speech features. The use of semantic analysis may help ASR systems to improve their CS recognition performance for languages that share similar linguistic, pronunciation and speech features.
Most papers use supervised and semi-supervised training to develop acoustic and language models. Deep learning algorithms such as the neural network and recurrent neural network language modelling are common for CS. Existing techniques such as the neural network reduce the sample size and the acoustic model's complexity for the monolingual ASR system. However, this can be a problem in CS recognition due to the unification of similar vocabulary for both languages, resulting in lower recognition accuracy. The development of CS ASR systems needs to balance language coverage and model complexity.
The CS phone sets in bilingual and multilingual speech recognition can be made in three distinctive ways: combining, mapping, and merging similar phone sets of the languages. Researchers prefer combining the acoustic models of each monolingual model for both the bilingual and multilingual ASR systems. Combining the monolingual models for CS is typical for well-resourced languages. Merging of acoustic models was applied in CS involving low-resourced languages [4,5,52]. The merging of similar phones of the two languages for CS reduces the size and complexity of the CS. Mapping was applied in some research where the languages share common graphemes, such as Indonesian and Arabic.
In the earlier section, we presented the overall results of the evaluation of ASR performance in recognising speech with CS, conducted in the selected papers. Unlike monolingual ASR, where the standard evaluation metric is the WER, the CS evaluation uses several different measurements, such as the PER, CER, and MER, depending on the type of languages being tested. For example, the WER is typical for languages that use the Roman alphabet in many European countries, while the CER is ubiquitous in CS for English and Mandarin, with different written formats.
In ASR research, it is difficult to make an "apples to apples" comparison between the methods due to variations in the inputs such as database size and the number of speakers, among others. The performance of ASR in recognising CS may not be entirely due to the techniques used for training the models, but also to the size and the quality of the database. It is important for researchers to differentiate the improvement to the WER that is caused by the proposed technique as well as the quality of the database used.
We also found that depending entirely on the WER as the standard measure of performance of ASR systems may not give readers the complete picture of the merits and demerits of a particular technique applied in developing an ASR system for handling CS. The researchers should consider other measures such as computational time, computational cost, line of code, human cost and time, and resource costs when evaluating the performance of bilingual and multilingual ASR systems.
Recently, E2E systems have been preferred by researchers for their simplicity and success in bilingual and multilingual settings. An E2E system predicts phones or characters directly from acoustic information without predefined alignment. Unlike the traditional hybrid HMM-DNN system, an E2E CTC acoustic model is not trained using frame-level labels, reducing human involvement during training. However, the performance of E2E systems depends on the training data size (it needs massive data), which can be a problem for CS or poorly resourced languages. The performance of E2E, though promising for monolingual, bilingual, and multilingual ASR systems, may not be a suitable solution in CS due to the limited size of the CS database.

Conclusions
CS is an integral part of developing bilingual and multilingual speech recognition systems. Much of the speech recognition system development now focused on the bilingual and multilingual as monolingual speech recognition system has been a commercial success. As humans are commonly fluent in two or more languages, so should the ASR system be. However, unlike humans, who can naturally change from one language to another quickly, most of the existing ASR systems do not have similar abilities, particularly in handling unexpected changes in pronunciation when languages are mixed. CS may involve two languages that differ in terms of linguistic, pronunciation, and speech features or have a high degree of similarity between them. In both situations, the ASR systems need to differentiate the two languages from the uttered speech.
The aim of this research is to perform a review on CS in bilingual and multilingual ASR systems. This is because there is very little research that summarises the progress in the development of bilingual and multilingual ASR systems that can handle CS with acceptable performance.
To meet the objective of this study, two research questions were formulated. The first research question identifies the issues affecting the recognition performance of CS ASR systems in bilingual and multilingual settings. For this, we reviewed 66 selected papers (both review and experimental papers), where the issues in CS recognition can be grossly categorized into two, which are database sparsity and recognising CS. The database sparsity is mainly caused by the variability in the occurrence of CS, which is not fully covered by the existing CS databases. Recognising CS is difficult as CS speech is highly unpredictable and difficult to model.
The second research question looks at the direction in the development of CS ASR systems in terms of focus, databases, acoustics and language modelling, and evaluation metrics. For this, we reviewed 42 selected experimental papers (reflected in Table 1). These papers cover many well-resourced and under-resourced languages and techniques to recognise CS in ASR systems, such as mapping, combining, and merging the phone sets of the languages experimented with. Our review also examined the performance of those techniques. This review found a significant variation in the performance of CS experimental papers in terms of WER, indicating the inconsistency in the ability of the existing ASR systems to handle CS.

Future Directions
The future direction of CS will include more languages as globalisation continues. Our review shows that researchers prefer to use the existing CS databases in their research, which means that more CS database enrichment and unification will be needed in the future. Recorded speech from sources such as broadcasts and podcasts offer more natural CS as compared to those recorded in the studio. As such, the CS database of the future will use the existing recordings using automatic annotating methods. On top of that, methods such as E2E allow the development of an acoustic model without the need for manual annotation, and facilitate the development of CS databases from natural sources such as recorded broadcasts or podcasts. The development of a common CS database will help researchers in the improvement of the performance of ASR for bilingual and multilingual speech. Furthermore, there will be more development of multilingual CS in multi-ethnic communities worldwide, including for under-resourced languages.
The second issue identified in the review is recognizing CS, which can lead to unsatisfactory performance by the ASR systems. Different languages have different training data scales and phonological systems, making it difficult for the ASR system to recognise CS speech. One possible future direction for recognizing CS is to develop an ASR system that can predict the occurrence of CS in speech using linguistic information, as well as semantic analysis, to improve CS recognition performance for languages that share similar linguistic, pronunciation, and speech features.
While the standard evaluation metric for the ASR system is WER, researchers may need to differentiate or identify the errors that arose from the CS with the speech recognition error by the acoustic model. The performance of the ASR in recognising CS may not be entirely due to the techniques used for training the models, but also to the database used. In the future, researchers need to differentiate the improvement to the WER that is caused by the proposed technique as well as the quality of the database used. Such identification can help researchers to determine which factor(s) contribute more to the performance improvement.