Article

A Self-Evaluated Bilingual Automatic Speech Recognition System for Mandarin–English Mixed Conversations

1 Multidisciplinary Computational Laboratory, Department of Electrical and Biomedical Engineering, Hanyang University, Seoul 04763, Republic of Korea
2 Lifetree Telemed Inc., Taichung 402256, Taiwan
3 Department of Photonics, National Cheng Kung University, Tainan 70101, Taiwan
4 Department of Mechanical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(14), 7691; https://doi.org/10.3390/app15147691
Submission received: 4 June 2025 / Revised: 1 July 2025 / Accepted: 4 July 2025 / Published: 9 July 2025

Abstract

Bilingual communication is increasingly prevalent in this globally connected world, where cultural exchanges and international interactions are unavoidable. Existing automatic speech recognition (ASR) systems are often limited to a single language, yet bilingual ASR has become indispensable for human–computer interaction, particularly in medical services. This article addresses this need by creating an application programming interface (API)-based platform using VOSK, a popular open-source single-language ASR toolkit, to efficiently deploy a self-evaluated bilingual ASR system that seamlessly handles both primary and secondary languages in tasks such as Mandarin–English mixed-speech recognition. The mixed error rate (MER) is used as a performance metric, and a workflow is outlined for its calculation using the edit distance algorithm. Results show a remarkable reduction in the Mandarin–English MER, dropping from ∼65% to under 13%, after implementing the self-evaluation framework and mixed-language algorithms. These findings highlight the importance of a well-designed system for managing the complexities of mixed-language speech recognition, offering a promising method for building a bilingual ASR system from existing monolingual models. The framework might be further extended to a trilingual or multilingual ASR system by preparing mixed-language datasets and additional development, without involving complex training.

1. Introduction

Automatic speech recognition (ASR), commonly known as speech recognition, is a groundbreaking field within artificial intelligence (AI) that empowers computers to comprehend and respond to spoken language. Its vital role in our daily lives is revolutionizing human–computer interaction, making it more natural and efficient. ASR eliminates the need for traditional input methods like keyboards, enabling hands-free operation on mobile devices and improving accessibility and safety in applications like autonomous driving. This technology fosters inclusivity by providing an alternative means of interaction for individuals with physical disabilities and those facing challenges with conventional input methods. Moreover, it facilitates bilingual communication, boosts productivity through quick transcription, powers voice-activated assistants like Siri, Google Assistant, Bixby, Clova, Alexa, and more, and plays a crucial role in automated customer service and healthcare innovations [1]. As we witness ongoing advancements, ASR continues to reshape our relationship with technology, offering a seamless and intuitive interface that enriches our daily activities [2,3]. The history of ASR spans decades, beginning with Bell Labs’ “Audrey” system in 1952, which recognized spoken numbers. Major advancements followed, such as the development of hidden Markov models (HMMs) in the 1970s and the more recent impact of deep learning and neural networks, transforming ASR systems into sophisticated tools today [4,5,6].
This article focuses on mixed-language speech, which involves the use of more than one language within an utterance. Mixed-language communication is quite common all over the world, for example, between Mandarin and English [7,8], French and Algerian Arabic [9], and Spanish and English [10], and among Ethio-Semitic languages [11]. Mandarin–English mixed speech is particularly common in East Asia. Despite linguists studying bilingual speech for more than half a century, it is only in the past two decades, owing to advancements in speech technology, that mixed-language speech recognition has received significant attention [8,12]. In particular, research on Mandarin–English mixed speech has gained momentum over the past decade [13,14,15]. The current trajectory of research in this domain gravitates towards themes like data assembly, modeling of articulatory units, addressing data sparsity, and holistic end-to-end solutions [16]. While advancements in speech recognition technology have made significant strides toward accurate mixed-language ASR, considerable challenges remain, particularly in creating comprehensive datasets and robust models capable of handling the complexity of mixed-language speech, which is discussed in detail in Section 2.
Our study presents a new method for mixed-language ASR, focusing on bilingual speech recognition results. The algorithms examined in this article are designed for Mandarin–English mixed-language speech recognition to capture the essence of a speaker toggling between two languages within the same utterance. Our methodology utilizes the VOSK monolingual ASR toolkit without acoustic modeling for the initial recognition step. We have employed language-specific models for different target languages and curated a comprehensive dataset consisting of over 10,000 speech samples to evaluate the effectiveness of our proposed system. Throughout our testing, we have continuously refined the components of the system, comparing the baseline performance of the VOSK monolingual speech recognition toolkit with the outcomes from our enhanced system. Our investigation indicates that our method simplifies and improves speech recognition, thereby taking full advantage of the VOSK toolkit’s [17] exceptional capabilities in monolingual recognition. This scalable framework offers an innovative solution for bilingual speech recognition, applicable to a diverse range of language pairs beyond Mandarin–English. It is flexible enough to accommodate any language combination, even those involving three or more languages. In this article, we present the design of an advanced mixed-language ASR system optimized for Mandarin–English speech. Our system achieves an impressive average error rate of less than 13% across all tested datasets, where the lowest recorded average error rate achieved in one of the datasets was 7.56%. This is primarily due to the integration of threshold adjustments and recognition result compensation mechanisms. This approach provides a robust, adaptable solution for addressing the challenges of mixed-language speech recognition.
The remainder of the article is structured as follows: Section 2 reviews the literature on mixed-language speech recognition, with a particular emphasis on bilingual ASR systems and existing approaches to addressing the challenges of recognizing speech that alternates between languages. The proposed self-evaluated bilingual ASR (SEB-ASR) system is described in Section 3, outlining the evaluation metrics employed, including word error rate (WER), character error rate (CER), and mixed error rate (MER), along with the workflow for MER calculation. Section 4 gives the results and discussion. Section 4.1 presents the performance evaluation of the VOSK monolingual ASR system and explains its integration into our mixed-language ASR framework, while Section 4.2 provides an in-depth description of the implementation of the system, operational workflow, and the enhancements it brings to the baseline VOSK system in handling mixed-language speech. Finally, Section 5 concludes the article by summarizing the key findings and discussing future directions for advancing multilingual ASR systems.

2. Literature Review

Mixed-language communication is a frequent occurrence in multilingual regions where English often serves as a global lingua franca. In areas such as Singapore, Taiwan, Hong Kong, and China, Mandarin–English mixed speech is common [18]. This phenomenon poses unique challenges for ASR systems, where speakers shift between languages within a single conversation. Mixed-language research is relevant not only to speech recognition but also to fields such as text mining and information retrieval, where it has been applied to clustering bilingual documents, building multilingual web directories, and designing frameworks for the cost-effective deployment of bilingual systems [19,20,21,22,23,24]. This article focuses on addressing the challenges in the context of mixed-language ASR.

2.1. Early Efforts in Mixed-Language ASR

Early studies by Lyu et al. addressed the issue of data imbalance during model training for Mandarin, Taiwanese, and Hakka languages, developing acoustic models suited for mixed-language environments [25]. Wu et al. further explored mixed-language speech segmentation and multilingual ASR, utilizing acoustic features [26], while Qian et al. extended this work into text-to-speech synthesis with a focus on Mandarin–English mixed speech [27]. A key challenge in this area is language identification in mixed-language contexts. Lyu et al. proposed a model that combined acoustic, phonological, and lexical data to improve language identification accuracy and achieved a relative improvement of 10% compared to models that did not integrate this combination [28]. However, despite these advancements, there are still significant challenges related to the limited availability of mixed-language training data.

2.2. Development of Mixed-Language Datasets

In response to the need for mixed-language data, several speech recognition datasets have been developed, including SEAME [29], CECOS [30], and OC16-CE80 [31]. However, these datasets remain relatively small compared to the abundance of monolingual data available, which limits the effectiveness of training models for mixed-language speech. Vu et al. [32,33] pioneered the first large-vocabulary continuous speech recognition (LVCSR) system for Mandarin–English mixed-language speech. By merging phoneme sets with the International Phonetic Alphabet (IPA), the Bhattacharyya distance, and discriminative training, their two-pass system achieved an MER of 36.6% on the SEAME development set. Nevertheless, the scarcity of mixed-language datasets continues to hinder progress, as there is a lack of sufficient training data where speakers naturally switch between languages.

2.3. Traditional ASR Approaches and Their Limitations

Traditional ASR approaches for mixed-language recognition often require the manual development of language-specific components, such as bilingual pronunciation dictionaries and phone sets. Bhuvanagiri et al. [34,35] used phoneme-mapped pronunciation dictionaries and modified language models to enhance recognition accuracy. However, these methods are labor-intensive, and the structural differences between languages add another layer of complexity, making it difficult for models to handle the phonetic and syntactic variations between Mandarin and English. Zellou et al. [36] highlighted the difficulty of modeling cross-linguistically uncommon word forms in mixed-language speech. To tackle these challenges, techniques like text augmentation have been employed, including statistical machine translation [32] and word embedding methods [37], which achieved a best error rate of 44.96%. Additionally, incorporating language-specific features such as parts of speech (POS) into language models led to a 32.81% relative improvement in perplexity on the SEAME development set [38]. These challenges create substantial hurdles in achieving high recognition accuracy in mixed-language ASR due to the wide disparities in linguistic units between languages like Mandarin and English.

2.4. Advancements in End-to-End (E2E), Hybrid, and Semi-Supervised Models for ASR

Recent advancements in attention-based end-to-end (E2E) models, as well as hybrid and semi-supervised approaches, offer a promising solution for mixed-language ASR. These models have demonstrated success across various speech tasks, including ASR [12,39], keyword recognition [40], speech emotion classification [41], and speaker verification [42,43]. They eliminate the need for manually produced language-specific resources, as demonstrated by Winata et al. [15], who introduced an E2E system for Mandarin–English mixed-language speech using the connectionist temporal classification (CTC) loss function. By creating a simulated dataset and incorporating language identification techniques, their system reduced the error rate by 5%, achieving an overall error rate of 24.61%. Other CTC-based approaches have also been proposed to improve code-switching ASR. Luo et al. [44] developed a hybrid CTC-Attention end-to-end system, achieving an MER of 34.24% on the SEAME dataset. Huang et al. [45,46] further enhanced CTC-based performance by integrating a Transformer architecture with a language identification module, reducing the MER to 30.95%. Similarly, Guo et al. [8] applied semi-supervised learning techniques to enhance the pronunciation lexicon, acoustic model, and language model within their Mandarin–English mixed-language speech recognition system. Their research reported error rates of 20.5% and 29.56% on two different datasets. Building on this line of work, Nga et al. [47] introduced a mutual learning-based semi-supervised speech (MLSS) recognition approach and achieved an MER of 17.6% on the SEAME dataset. Chen et al. [48] explored multi-task learning using deep neural networks, reporting an MER of 32.2% on a combination of LDC and in-house data. A summary of related work on Mandarin–English ASR systems is presented in Table 1.

2.5. Remaining Challenges and Research Gaps

Despite these advances, several challenges remain. First, there is still a lack of sufficiently large mixed-language datasets, limiting the ability of ASR models to generalize effectively to real-world applications. Second, traditional ASR systems require significant manual effort, including the creation of language-specific components, which reduces scalability and efficiency. Third, current models are not optimized to handle the dynamic mixed-language patterns found in spontaneous speech, which is common in multilingual communities. Furthermore, while E2E models have shown potential, there is still a need for more comprehensive solutions that fully automate mixed-language ASR without relying on manual intervention.
Our study aims to address these gaps in mixed-language ASR, particularly in bilingual systems, and further optimize them for real-world applications. Compared to the SEB-ASR system, existing bilingual speech recognition systems exhibit several differences and similarities. SEB-ASR is specifically designed for Mandarin–English mixed-language scenarios with a self-evaluation mechanism that enhances its adaptability and robustness. Traditional systems often lack such self-assessment features, relying heavily on fixed evaluation pipelines. While most systems use end-to-end models or hybrid hidden Markov model–deep neural network (HMM-DNN) frameworks, SEB-ASR incorporates tailored optimization for bilingual and mixed-language environments.

3. Research Methodology

The following section introduces the SEB-ASR system, tailored for Mandarin–English mixed dialogues and specifically designed for instances where speakers alternate between these two languages within a conversation. Utilizing the monolingual ASR capabilities of VOSK, which facilitate the development of ASR systems for bilingual and/or multilingual cases, we have established an application programming interface (API) platform. Users can effortlessly upload audio files and promptly obtain transcriptions through this platform.

3.1. Evaluation Metrics for Monolingual and Bilingual Speech Recognition

3.1.1. Word Error Rate and Character Error Rate

English speech recognition systems predominantly employ the WER as a yardstick to evaluate performance [49,50]. WER quantifies the discrepancies between the system-generated transcription and the reference transcript by measuring word substitutions, deletions, and insertions. Expressed as a percentage of the total words in the reference transcript, a lower WER signifies higher system accuracy. This metric is widely recognized for its simplicity, intuitiveness, and comprehensive evaluation of ASR performance, encompassing both individual word accuracy and overall transcript quality.
WER is instrumental in benchmarking different speech recognition systems and monitoring their performance improvements over time. On the other hand, the CER is the preferred performance evaluation metric for Mandarin speech recognition. Mandarin ASR systems typically utilize modeling units such as phonemes, characters, and words, whereas English ASR systems often rely on phonemes, international phonetic symbols, and subwords. When comparing a Mandarin ASR transcript to a reference, CER measures character-level discrepancies, including substitutions, deletions, and insertions. This metric is particularly suited to Mandarin due to its unique character-based writing system and distinct linguistic structure. Consequently, while WER is ideal for evaluating English ASR systems, CER is more appropriate for assessing the performance of Mandarin ASR systems. The WER/CER is calculated as follows:
$$\mathrm{WER/CER} = \frac{S_{W/C} + D_{W/C} + I_{W/C}}{N_{W/C}}. \tag{1}$$
Here, $S_{W/C}$, $D_{W/C}$, and $I_{W/C}$ represent the numbers of word/character substitutions, deletions, and insertions, respectively, while $N_{W/C}$ is the total number of words/characters in the reference. Both WER and CER can be calculated using the Levenshtein distance (LD) method, also known as the edit distance algorithm, introduced by Vladimir Levenshtein in 1965 [51]. This approach defines the minimum number of edits needed to transform one string into another, providing a robust framework for evaluating ASR performance across different languages and systems. The edit distance of two strings $a$ and $b$ can be mathematically expressed as follows:
$$\mathrm{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min\!\begin{cases}
\mathrm{lev}_{a,b}(i-1,j) + 1\\
\mathrm{lev}_{a,b}(i,j-1) + 1\\
\mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases} \tag{2}$$
Here, $\mathrm{lev}_{a,b}(i,j)$ represents the edit distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$. In speech recognition, string $a$ is the reference sequence, while string $b$ is the hypothesis sequence. The calculation starts with a base case: if either $i$ or $j$ is 0, the edit distance is the length of the remaining non-empty string, which is $\max(i,j)$. For other cases, the edit distance is calculated using three possible operations: (i) deleting a character from string $a$, which has a cost of $\mathrm{lev}_{a,b}(i-1,j)+1$, (ii) inserting a character into string $b$, which has a cost of $\mathrm{lev}_{a,b}(i,j-1)+1$, and (iii) substituting a character in string $a$ with a character in $b$, which has a cost of $\mathrm{lev}_{a,b}(i-1,j-1)+1$ if the characters are different, or $\mathrm{lev}_{a,b}(i-1,j-1)$ if they are the same. The term $1_{(a_i \neq b_j)}$ equals 1 if the characters are different and 0 if they are the same. The minimum of these three operations gives the edit distance for the current pair of characters. This process continues recursively until the base case is reached. The final result, $\mathrm{lev}_{a,b}$, is the total edit distance between the entire strings $a$ and $b$. This distance measures how many edits (insertions, deletions, or substitutions) are needed to change the hypothesis sequence into the reference sequence.
For each transcribed word, we compute the LD, which measures the differences between two strings in terms of substitutions, deletions, and insertions [51]. This distance metric is widely used in linguistic research, such as for assessing phonetic distances between languages and language varieties [52,53]. In the context of speech recognition, LD serves as a metric for evaluating the phonetic distance between the ground truth and the generated transcriptions, thereby providing a comprehensive measure of system accuracy.
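To make the metric computation concrete, the following minimal Python sketch (an illustration only, not the authors' implementation; the function and variable names are our own) fills the standard dynamic-programming table of Equation (2) and derives a WER- or CER-style error rate as in Equation (1), depending on whether the token sequences are words or characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (Equation (2))."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between the first i tokens of ref and the first j tokens of hyp
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # base case: only deletions remain
    for j in range(n + 1):
        dp[0][j] = j                      # base case: only insertions remain
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # 1_(a_i != b_j)
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # substitution or match
    return dp[m][n]


def error_rate(ref_tokens, hyp_tokens):
    """WER when the tokens are words, CER when they are characters (Equation (1))."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


# WER for an English pair (one deleted word) and CER for a Mandarin pair (one deleted character)
print(error_rate("the cat sat on the mat".split(), "the cat sat on mat".split()))  # 0.1667
print(error_rate(list("我們需要提高效率"), list("我們需要提效率")))                  # 0.125
```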

3.1.2. Mixed Error Rate

Given the dual-language nature of the SEB-ASR system, which handles both Mandarin and English, relying solely on WER or CER for performance evaluation is inadequate. Therefore, we use the concept of MER [54] to assess the performance of the SEB-ASR system. MER has become a common evaluation metric in numerous studies focusing on mixed-language ASR. It has been widely adopted to assess system performance in scenarios involving multiple languages or language switching, providing a more tailored measure of accuracy compared to traditional metrics like WER or CER. This metric serves as a standardized measure to gauge the accuracy and efficacy of ASR systems in managing bilingual linguistic inputs. In calculating MER, the recognition outcomes of the system are denoted as the hypothesis sequence, while the accurate results used for validation are referred to as the reference sequence. Utilizing the LD algorithm, as in Equation (2), we calculate the substitutions, deletions, and insertions for English words and Mandarin characters within the hypothesis sequence, corresponding to their counterparts in the reference sequence. These outcomes are then consolidated individually to derive the cumulative counts of substitutions, deletions, and insertions, which are necessary to compute the MER as follows:
$$\mathrm{MER} = \frac{(S_W + S_C) + (D_W + D_C) + (I_W + I_C)}{N_W + N_C} = \frac{S_M + D_M + I_M}{N_M}, \tag{3}$$
where $S_W$, $D_W$, and $I_W$ are the numbers of English word substitutions, deletions, and insertions, and $S_C$, $D_C$, and $I_C$ are the numbers of Mandarin character substitutions, deletions, and insertions. $N_W$ and $N_C$ are the total numbers of English words and Mandarin characters in the reference sequence, respectively. Here, $S_M$ represents the total number of substitutions, $D_M$ the total number of deletions, and $I_M$ the total number of insertions, while $N_M$ equals the sum of the number of English words and the number of Mandarin characters in the reference.
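To illustrate the workflow (a minimal sketch with a hypothetical tokenization rule, not the authors' code), a mixed transcript can be split into Mandarin characters and English words, after which the same edit-distance computation is applied to the combined token sequence to obtain the MER of Equation (3).

```python
import re


def mixed_tokens(text):
    """Split a mixed transcript into Mandarin characters and English words."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)


def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences, as in Equation (2)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]


def mer(reference, hypothesis):
    """MER = (S_M + D_M + I_M) / N_M, with N_M counted over the reference tokens."""
    ref, hyp = mixed_tokens(reference), mixed_tokens(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


print(mer("這個idea非常perfect", "這個idea非常perfect"))  # 0.0, perfect recognition
print(mer("需要提高efficiency", "需要提高if是誰"))        # 3 edits over 5 reference tokens
```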

3.2. Datasets

Creating a mixed-language speech training dataset poses significant challenges, particularly for Mandarin–English speech recognition, for which resources remain scarce. Although ample monolingual data exist for each language, mixed data are scarce, so constructing our own mixed-language dataset became a viable solution [32,34,35]. To evaluate the efficacy of our system, we have created a mixed Mandarin–English dataset. The architecture of the dataset is visually depicted in Figure 1. The limitation of recording only 10 sentences per person was due to practical constraints such as time, resources, and participant availability. To address the insufficient size of the dataset, we have developed a method to combine and augment the original audio recordings, resulting in 10,640 files generated using a MATLAB (version 2023a) script. Initially, we gathered 20 WAV files, equally prepared by two individuals, “Person A” and “Person B”, each contributing 10 different sentences of Mandarin mixed with some English words, recorded at a sampling frequency of 16 kHz and a bit rate of 128 kbps. Both Person A and Person B are bilingual, with Mandarin as their first language and English as their second. We have ensured that the ratio of Mandarin to English in each sentence was approximately 1:1.
For Person A, the 10 recordings formed the base subset, stored in a folder named “File 1A”. By merging any two of these base recordings, we created 100 distinct combinations, forming the second subset, “File 2A”. Further merging any three recordings at a time produced the third subset, “File 3A”, encompassing 1,000 audio combinations. These three subsets were combined to constitute “Dataset A”, containing 1,110 data samples. Following a similar process, Person B’s recordings were synthesized into “Dataset B”. Subsequently, we created “Dataset All” by applying the same method of permutation and combination to all 20 recordings from both individuals, producing 8,420 data samples. Therefore, the comprehensive dataset, tailored for performance evaluation, houses 10,640 mixed Mandarin and English speech samples.
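The combinations were generated with a MATLAB (version 2023a) script; purely as an illustration of the same augmentation idea, the sketch below (folder layout and file names are hypothetical) concatenates the 10 base WAV recordings of one speaker into all 10^2 = 100 ordered two-file and 10^3 = 1,000 ordered three-file combinations using Python's standard wave module.

```python
import itertools
import wave
from pathlib import Path

BASE_DIR = Path("File_1A")                        # hypothetical folder with the 10 base recordings
OUT_DIRS = {2: Path("File_2A"), 3: Path("File_3A")}


def concatenate(wav_paths, out_path):
    """Append the PCM frames of WAV files that share the same format (16 kHz mono here)."""
    with wave.open(str(wav_paths[0]), "rb") as first:
        params = first.getparams()
    with wave.open(str(out_path), "wb") as out:
        out.setparams(params)
        for path in wav_paths:
            with wave.open(str(path), "rb") as w:
                out.writeframes(w.readframes(w.getnframes()))


base_files = sorted(BASE_DIR.glob("*.wav"))       # the 10 base sentences of one speaker
for r, out_dir in OUT_DIRS.items():
    out_dir.mkdir(exist_ok=True)
    # 10^2 = 100 and 10^3 = 1000 ordered combinations, matching File 2A and File 3A
    for combo in itertools.product(base_files, repeat=r):
        name = "_".join(p.stem for p in combo) + ".wav"
        concatenate(combo, out_dir / name)
```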
Additionally, we have collected 1,019 audio files from YouTube as a second dataset for system performance testing. These files are taken from instructional videos for Mandarin speakers learning English and are ideal for testing Mandarin–English bilingual ASR. This secondary dataset provides real-world examples of mixed-language usage, ensuring an unbiased evaluation of our system’s performance in handling natural and spontaneous speech.

4. Results and Discussion

4.1. VOSK Monolingual ASR Toolkit

VOSK is an open-source Python toolkit for both offline and online speech recognition [17], built on top of the Kaldi framework [55]. It currently supports models for more than 20 languages, including English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish, Uzbek, Korean, Breton, Gujarati, Tajik, and Telugu. These models offer a variety of functionalities such as speaker identification, streaming APIs, and flexible vocabulary configuration. Although the accuracy of these models is not yet fully optimized, the toolkit delivers competitive performance and continues to improve with ongoing development. The lightweight and modular design of VOSK makes it suitable for integration into a range of applications, from mobile devices to embedded systems. This combination of accessibility, support for many languages, and extensibility positions VOSK as a valuable tool for research and practical applications in ASR, particularly in settings where offline recognition is critical.
As depicted in Figure 2, we have developed an API that facilitates real-time speech recognition of stored audio data in multiple languages, including English and Mandarin. APIs enable communication between different software programs by defining rules and protocols, allowing developers to access and utilize the functionalities and data of diverse applications or services. They play a pivotal role in modern software development. VOSK employs machine learning models to transcribe spoken words into written text, rooted in the Kaldi speech recognition infrastructure. We have formulated an API interface capable of dynamically invoking language models tailored to specific user needs. This design transcends the limitations of supporting only Chinese and English, encompassing all languages integrated into the VOSK monolingual ASR toolkit. This approach maximizes the efficacy of VOSK’s advanced single-language recognition capabilities, enhancing its versatility across a broad linguistic landscape. For developers, VOSK offers a user-friendly interface, simplifying the integration of speech recognition into software applications. Furthermore, its open-source framework empowers developers to customize and extend its functionality to meet diverse requirements.
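For reference, a minimal example of invoking a VOSK model from Python is given below (the model path is a placeholder); enabling SetWords makes the recognizer return per-word start times, end times, and confidence scores, which the SEB-ASR system relies on later.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("models/vosk-model-cn-0.22")        # placeholder path to a Mandarin model
wf = wave.open("sample.wav", "rb")                # 16 kHz, 16-bit mono PCM audio
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)                                # include per-word timing and confidence

results = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        results.append(json.loads(rec.Result()))
results.append(json.loads(rec.FinalResult()))

for res in results:
    for word in res.get("result", []):            # each entry: word, start, end, conf
        print(word["word"], word["start"], word["end"], word["conf"])
```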

4.1.1. Operational Workflow

The core task in speech recognition involves converting speech into text. Upon submitting speech data for conversion, a request is transmitted to the API, which processes it. The VOSK server initiates speech transcription to derive text results, which are then relayed back to the API. The API formats the results appropriately before returning them to the user; this sequence constitutes the operational workflow of the API.
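The article does not name the web framework behind the API; as a minimal sketch of this workflow (the Flask framework, route name, and model paths are assumptions), an endpoint could accept an uploaded WAV file, run the selected VOSK model, and return the formatted text.

```python
import json
import wave

from flask import Flask, jsonify, request
from vosk import KaldiRecognizer, Model

app = Flask(__name__)
MODELS = {                                        # placeholder model paths
    "zh": Model("models/vosk-model-cn-0.22"),
    "en": Model("models/vosk-model-en-us-0.22"),
}


def transcribe(file_obj, lang):
    """Run the selected VOSK model over an uploaded 16-bit mono PCM WAV file."""
    wf = wave.open(file_obj, "rb")
    rec = KaldiRecognizer(MODELS[lang], wf.getframerate())
    pieces = []
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)


@app.route("/recognize", methods=["POST"])
def recognize():
    audio_file = request.files["audio"]           # WAV file uploaded by the client
    lang = request.form.get("lang", "zh")         # selects which monolingual model to invoke
    return jsonify({"lang": lang, "text": transcribe(audio_file, lang)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```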

4.1.2. Performance Evaluation

The performance evaluation of the algorithms was carried out on Ubuntu 20.04, and all the code was written in Python 3.8. The efficacy of the VOSK ASR toolkit was assessed using the publicly available open-source Common Voice dataset [56], which provides free speech data in multiple languages. Specifically, we independently evaluated VOSK’s performance in recognizing English and Mandarin, and the findings are presented in Figure 3 and Table 2. The results reveal that the VOSK speech recognition toolkit exhibits an average WER of 8.43% for English ASR and an average CER of 9.04% for Mandarin ASR. These findings underscore the high level of proficiency of the VOSK toolkit in executing monolingual ASR tasks.
As the Mandarin model is unable to recognize the English components, calculating the average error rate for the English parts using the Mandarin model is not feasible. Hence, in Equation (3) the values of $S_W$, $D_W$, and $I_W$ are neglected. In this context, the CER used for evaluating Mandarin is equivalent to the MER for assessing mixed speech. Therefore, to maintain consistency, the evaluation metric will be CER. As listed in Table 3, when analyzing the bilingual datasets using the VOSK Mandarin model, the results reveal considerable performance fluctuations. For Dataset A and Dataset B, the VOSK model exhibits maximum CERs of 79.81% and 88.00%, minimum CERs of 31.89% and 28.00%, and average CERs of 54.93% and 52.08%, respectively. The combined Dataset All resulted in a maximum CER of 89.93%, a minimum CER of 30.03%, and an average CER of 56.49%, the highest maximum CER among the three constructed datasets, reflecting the challenges of mixed data recognition. Figure 4a illustrates the distribution of CER results for each dataset using the VOSK monolingual ASR toolkit. The YouTube dataset yielded the highest CERs overall, with a maximum of 99.57%, a minimum of 43.22%, and an average of 65.22%, indicating significant recognition difficulties for bilingual datasets. These findings demonstrate that, while VOSK excels in recognizing a single language, it shows limited effectiveness in handling mixed-speech recognition tasks, especially when some English words are included in a Mandarin conversation.

4.2. The SEB-ASR System

The SEB-ASR system is designed to handle mixed-language audio recognition, specifically focusing on Mandarin and English, leveraging the VOSK ASR toolkit via an API interface to analyze imported audio datasets.

4.2.1. Operational Workflow

The SEB-ASR system workflow, as depicted in Figure 5, begins with taking an audio input and using the VOSK Mandarin ASR model to recognize and analyze the audio, assigning metadata to each word, including start and end times, confidence levels, and recognition results. Each recognized segment is evaluated based on a predefined confidence threshold of 0.9. The segments with a confidence level above the threshold are marked with a “+” symbol, indicating they are part of the main Mandarin content. In contrast, segments with a confidence level below the threshold are marked with a “-” symbol, considered potential English segments misrecognized by the Mandarin model. The Mandarin list is then updated with the assigned symbols for each segment. As the VOSK Mandarin ASR is monolingual, English words recognized by the Mandarin model are often recognized as multiple Mandarin parts, some of which have high confidence and are mistakenly treated as Mandarin. To address this, the system checks if the end time of one word matches the start time of the next word, indicating consecutive pronunciation, which may signify a single English word. If the start and end times of words align, the audio segments are evaluated and marked with “-”. These segments are then processed using an English ASR model to generate supplementary English transcriptions. The English transcriptions are integrated with the main Mandarin content based on their start and end times, resulting in a combined Mandarin–English speech recognition output. The system performs an MER analysis to evaluate its performance by comparing the recognized results with a reference text, ultimately producing a bilingual recognized result.
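The core self-evaluation and time-continuity check can be sketched as follows (a simplified illustration, not the released implementation; the function names, the timing tolerance, and the example confidence values are our own). It takes the word list produced by the Mandarin model, flags each word against the 0.9 confidence threshold, propagates the flag of the first word across time-contiguous runs, and groups the flagged spans for re-recognition by the English model.

```python
CONF_THRESHOLD = 0.9   # predefined confidence threshold of the SEB-ASR system


def self_evaluate(words, threshold=CONF_THRESHOLD):
    """Attach '+' (trusted Mandarin) or '-' (suspected English) to each word entry."""
    marked = [dict(w, mark="+" if w["conf"] > threshold else "-") for w in words]
    # Continuity check: if a word starts exactly where the previous one ends, the run is
    # treated as one continuously pronounced unit and inherits the mark of its first word,
    # catching English words that the Mandarin model split into high-confidence pieces.
    for i in range(1, len(marked)):
        if abs(marked[i]["start"] - marked[i - 1]["end"]) < 1e-3:
            marked[i]["mark"] = marked[i - 1]["mark"]
    return marked


def english_spans(marked):
    """Group adjacent '-' words into (start, end) spans to be cut out and re-recognized."""
    spans, current = [], None
    for w in marked:
        if w["mark"] == "-":
            current = [w["start"], w["end"]] if current is None else [current[0], w["end"]]
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


# Word list in the format returned by VOSK with SetWords(True); values are illustrative
words = [
    {"word": "逸", "start": 42.60, "end": 42.75, "conf": 0.41},
    {"word": "飛", "start": 42.75, "end": 42.90, "conf": 0.55},
    {"word": "是", "start": 42.90, "end": 43.05, "conf": 0.99},
]
marked = self_evaluate(words)
print([w["mark"] for w in marked])   # ['-', '-', '-']: the run inherits the first word's mark
print(english_spans(marked))         # [(42.6, 43.05)] -> segment sent to the English model
```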
The Mandarin characters in the article are also expressed as pinyin, the official Romanization system for Standard Mandarin Chinese using the Latin alphabet, along with their meanings in parentheses. To explain the system in more detail, consider an audio sample from our dataset with the phrase “這個idea 非常perfect 我們的work 需要提高efficiency”, where the Mandarin words are “這個 (Zhège, this), 非常 (Fēicháng, very), 我們的 (Wǒmen de, our), and 需要提高 (Xūyào tígāo, need to improve)”. When the VOSK Mandarin ASR model analyzes the audio, we obtain a recognized Mandarin list as shown in Table 4, with the start and end times, confidence values, and recognized Mandarin characters. Next, each segment is evaluated against the confidence threshold of 0.9, as listed in Table 5. If the confidence is > 0.9, a “+” symbol is assigned, and if the confidence is < 0.9, a “-” symbol is assigned. This results in a self-evaluated Mandarin list with True (+) or False (-) marks. Then, we check the inherent continuity of word pronunciation to identify false True values, as explained separately in Table 6. Here, one can see that the English word “efficiency” is recognized in Mandarin as “逸飛是誰 (Yì fēi shì shéi)”, which has no literal meaning, yet the confidence of “是 (shì)” is 0.9928 and that of “誰 (shéi)” is 0.9035, leading to false recognitions. Hence, checking the segments by comparing their start and end times helps identify the false True values, as highlighted in blue in Table 6. Since the end time of the first segment, 42.75 s, matches the start time of the next segment, and similarly, the end time of the second segment, 42.90 s, matches the start time of the third segment, we consider that these three segments correspond to one English word. Hence, we reassign the symbols of the second and third segments to be the same as that of the first segment. The updated Mandarin list, with corrected True and False values marked in blue and red text, respectively, is shown in Table 7. Based on this final self-evaluated Mandarin list, our system can perform more accurate speech segmentation and proceed to English speech recognition. These segments are saved to a designated directory using an audio segmentation tool, and a visual depiction of the segments is given in Table 8. The VOSK English ASR model then analyzes each isolated segment to generate supplementary English transcriptions. These English words are integrated with the main Mandarin content based on their start and end times, resulting in a mixed Mandarin–English speech recognition output followed by an MER analysis to evaluate its performance.
As shown in Table 9, if the symbols were not reassigned based on the start and end times, leading to a faulty segmentation based solely on the confidence values, the final recognized text would be “這個(zhè ge) idea 非常(fēi cháng) perfect 我們的(wǒ mēn de) work 需要提高(xū yào tí gāo) if 是誰(shì shéi)”. Comparing this with the original reference text, “這個(zhè ge) idea 非常(fēi cháng) perfect 我們的(wǒ mēn de) work 需要提高(xū yào tí gāo) efficiency”, where $N_M$ is 15, gives $S_C$, $D_C$, and $I_C$ values of 0, 3, and 0, and $S_W$, $D_W$, and $I_W$ values of 1, 0, and 0, respectively. Hence, $S_M$, $D_M$, and $I_M$ are 1, 3, and 0, respectively. Substituting these into Equation (3), we obtain an MER of 26.66%. Since the system is optimized with the continuous word pronunciation identification by reassigning the correct symbols based on segment time, the final recognized text perfectly matches the reference text, resulting in an MER of 0.00%. The SEB-ASR system uses a structured self-evaluation method to accurately recognize and integrate mixed-language content, enabling efficient bilingual speech recognition.

4.2.2. Performance Evaluation

Figure 4 shows the distribution of the MER results for each dataset using the VOSK monolingual ASR toolkit and the SEB-ASR system. The VOSK recognition results exhibit a significant concentration of higher CER values, particularly in the 40–60% range, with 737 entries in Dataset B and 5,639 entries in Dataset All. Dataset A displays a wider range, with 1,080 entries falling within the 40–70% CER range. The result of the YouTube dataset also shows higher CER values in the 50–70% range, with 693 entries. In contrast, the SEB-ASR distribution shifts towards lower MER ranges, with Dataset All peaking in the 0–10% MER range, accounting for 4,139 entries. The SEB-ASR results for Dataset A, Dataset B, and the YouTube dataset similarly follow this trend, peaking in the 0–10% MER range, with 628, 730, and 631 entries, respectively. The histograms of MER distributions across Dataset A, Dataset B, Dataset All, and the YouTube dataset, as shown in Figure 6, clearly indicate a substantial improvement in recognition accuracy when using the SEB-ASR system compared to the monolingual VOSK ASR.
The results consolidated in Table 3 demonstrate that the SEB-ASR system significantly outperforms the monolingual VOSK. Specifically, for Dataset A, the SEB-ASR system reduced the maximum MER by a factor of approximately 3.87, from 79.81% with VOSK to 20.64%. For Dataset B, the SEB-ASR system lowered the maximum MER by a factor of approximately 4.21, decreasing from 88.00% with VOSK to 20.93%. For Dataset All, the reduction was by a factor of approximately 3.75, with the maximum MER decreasing from 89.93% with VOSK to 24.00%. The shift from the higher MER results of VOSK to the lower MER results of SEB-ASR indicates a substantial improvement in bilingual recognition, as shown in Figure 6a–c, which illustrate the distributions of MER results for Dataset A, Dataset B, and Dataset All using both the VOSK and SEB-ASR systems. Regarding the average MER, the SEB-ASR system consistently outperformed the monolingual VOSK system across all datasets. For Dataset A, there was a 5.87-fold improvement in recognition, with the average MER dropping from 54.93% with VOSK to 9.36% with SEB-ASR. The test of Dataset B shows a 6.89-fold improvement, with the average MER falling from 52.08% to 7.56%. For Dataset All, there was a 4.91-fold improvement, with the average MER decreasing from 56.49% with VOSK to 11.50%. These reductions highlight the enhanced performance and accuracy of the SEB-ASR system across diverse datasets. The YouTube audio files have also been tested with the SEB-ASR system, and the results are comprehensively presented in Figure 6d. The maximum MER value was 28.91%, the minimum was 0.00%, and the average was 12.78%. The average MER of the SEB-ASR system was less than 13% across all tested datasets, whereas that of the monolingual VOSK ASR approached ∼65%. This demonstrates that the SEB-ASR system performs well not only on our datasets but also on those obtained from the web (YouTube), proving its excellent performance in bilingual speech recognition.

5. Conclusions

In this study, we have developed and evaluated the SEB-ASR system, designed to handle mixed-language audio recognition, specifically for Mandarin and English. By leveraging the VOSK monolingual ASR toolkit via an API interface, we have created an innovative solution for accurately recognizing and integrating mixed-language content. The SEB-ASR system has demonstrated a remarkable reduction in the MER, achieving an average of less than 12% for bilingual recognition on our mixed dataset, significantly outperforming the VOSK baseline. The YouTube dataset, which included real-world examples of Mandarin–English mixed conversations, further validated the robustness of our system, which achieved an average MER of less than 13% across all tested datasets. This indicates that the accuracy of the SEB-ASR system is high not only on our created datasets but also on naturally occurring mixed-language speech. These results emphasize the significant advancements our SEB-ASR system brings to bilingual speech recognition, showcasing its potential for broader applications. The structured self-evaluation framework for mixed-language speech recognition can be adapted to other language pairs, paving the way for developing multilingual ASR systems. By preparing mixed-language datasets and applying our methodology, similar improvements can be achieved in trilingual or even more complex multilingual speech recognition tasks without requiring intricate training processes. In conclusion, the SEB-ASR system marks a significant step forward in bilingual and mixed-language ASR. Its robust performance and adaptability demonstrate the potential for implementing this technology across various languages, facilitating efficient and accurate multilingual speech recognition in an increasingly globalized and interconnected world.

Author Contributions

Conceptualization, X.H., C.-C.Y. and M.-C.L.; methodology, X.H., K.A. and M.-C.L.; software, X.H., K.A., Z.H. and M.-C.L.; validation, X.H., K.A., C.-C.Y., Z.H., C.-Y.H., H.-Y.H. and M.-C.L.; formal analysis, X.H., C.-C.Y. and M.-C.L.; investigation, X.H., C.-C.Y., Z.H., C.-Y.H. and M.-C.L.; resources, C.-C.Y., C.-Y.H., H.-Y.H. and M.-C.L.; data curation, X.H., K.A. and M.-C.L.; writing—original draft preparation, X.H. and K.A.; writing—review and editing, X.H., K.A., C.-C.Y., H.-Y.H. and M.-C.L.; visualization, X.H., C.-C.Y. and M.-C.L.; supervision, M.-C.L.; project administration, M.-C.L.; funding acquisition, C.-C.Y. and M.-C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors C.-C.Y. and C.-Y.H. were employed by the company Lifetree Telemed Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hou, S.Y.; Wu, Y.L.; Chen, K.C.; Chang, T.A.; Hsu, Y.M.; Chuang, S.J.; Chang, Y.; Hsu, K.C. Code-switching automatic speech recognition for nursing record documentation: System development and evaluation. JMIR Nurs. 2022, 5, 37562. [Google Scholar] [CrossRef]
  2. Saksamudre, S.K.; Shrishrimal, P.P.; Deshmukh, R.R. A review on different approaches for speech recognition system. Int. J. Comput. Appl. 2015, 115, 23–28. [Google Scholar] [CrossRef]
  3. Gao, J.; Wan, G.; Wu, K.; Fu, Z. Review of the application of intelligent speech technology in education. J. China Comput. Assist. Lang. Learn. 2022, 2, 165–178. [Google Scholar] [CrossRef]
  4. Davis, K.H.; Biddulph, R.; Balashek, S. Automatic recognition of spoken digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [Google Scholar] [CrossRef]
  5. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  6. Jurafsky, D.; Martin, J.H. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Prentice Hall: Upper Saddle River, NJ, USA, 2009. [Google Scholar]
  7. Yeh, C.F.; Huang, C.Y.; Sun, L.C.; Lee, L.S. An integrated framework for transcribing Mandarin-English code-mixed lectures with improved acoustic and language modeling. In Proceedings of the 7th International Symposium on Chinese Spoken Language Processing (ISCSLP 2010), Tainan, Taiwan, 29 November–3 December 2010; pp. 214–219. [Google Scholar] [CrossRef]
  8. Guo, P.; Xu, H.; Xie, L.; Chng, E.S. Study of semi-supervised approaches to improving English-Mandarin code-switching speech recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1928–1932. [Google Scholar] [CrossRef]
  9. Amazouz, D.; Adda-Decker, M.; Lamel, L. Addressing French/Algerian code-switching Arabic speech. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 62–66. [Google Scholar] [CrossRef]
  10. Guzmán, G.A.; Ricard, J.; Serigos, J.; Bullock, B.E.; Toribio, A.J. Metrics for modeling code-switching across corpora. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 67–71. [Google Scholar] [CrossRef]
  11. Alemu, A.A.; Melese, M.D.; Salau, A.O. Towards audio-based identification of Ethio-Semitic languages using recurrent neural network. Sci. Rep. 2023, 13, 19346. [Google Scholar] [CrossRef]
  12. Waibel, A.; Soltau, H.; Schultz, T.; Schaaf, T.; Metze, F. Multilingual speech recognition. In Verbmobil: Foundations of Speech-to-Speech Translation; Springer: Berlin, Germany, 2000; pp. 33–45. [Google Scholar]
  13. Long, Y.; Li, Y.; Zhang, Q.; Wei, S.; Ye, H.; Yang, J. Acoustic data augmentation for Mandarin-English code-switching speech recognition. Appl. Acoust. 2020, 161, 107175. [Google Scholar] [CrossRef]
  14. Chan, J.Y.C.; Cao, H.; Ching, P.C.; Lee, T. Automatic recognition of Cantonese-English code-mixing speech. Int. J. Comput. Linguist. Chin. Lang. Process. 2009, 14, 281–304. [Google Scholar]
  15. Winata, G.I.; Madotto, A.; Wu, C.S.; Fung, P. Towards end-to-end automatic code-switching speech recognition. arXiv 2018, arXiv:1810.12620. [Google Scholar] [CrossRef]
  16. Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11, e8. [Google Scholar] [CrossRef]
  17. Shmyrev, N.V. Vosk Speech Recognition Toolkit: Offline Speech Recognition API for Android, iOS, Raspberry Pi and Servers with Python, Java, C# and Node. 2020. Available online: https://github.com/alphacep/vosk-api (accessed on 4 June 2025).
  18. Li, D.C.S. Cantonese-English code-switching research in Hong Kong: A Y2K review. World Englishes 2000, 19, 305–322. [Google Scholar] [CrossRef]
  19. Yang, H.C.; Hsiao, H.W.; Lee, C.H. Multilingual document mining and navigation using self-organizing maps. Inf. Process. Manag. 2011, 47, 647–666. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Tsai, F.S.; Kwee, A.T. Multilingual sentence categorization and novelty mining. Inf. Process. Manag. 2011, 47, 667–675. [Google Scholar] [CrossRef]
  21. Segev, A.; Gal, A. Enhancing portability with multilingual ontology-based knowledge management. Decis. Support Syst. 2008, 45, 567–584. [Google Scholar] [CrossRef]
  22. Gey, F.C.; Kando, N.; Peters, C. Cross-language information retrieval: The way ahead. Inf. Process. Manag. 2005, 41, 415–431. [Google Scholar] [CrossRef]
  23. Jung, J.J. Cross-lingual query expansion in multilingual folksonomies: A case study on Flickr. Knowl. Based Syst. 2013, 42, 60–67. [Google Scholar] [CrossRef]
  24. Lee, C.W.; Wu, Y.L.; Yu, L.C. Combining mutual information and entropy for unknown word extraction from multilingual code-switching sentences. J. Inf. Sci. Eng. 2019, 35, 597–610. [Google Scholar] [CrossRef]
  25. Lyu, D.C.; Hsu, C.N.; Chiang, Y.C.; Lyu, R.Y. Acoustic model optimization for multilingual speech recognition. Int. J. Comput. Linguist. Chin. Lang. Process. 2008, 13, 363–385. [Google Scholar] [CrossRef]
  26. Wu, C.H.; Chiu, Y.H.; Shia, C.J.; Lin, C.Y. Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 266–276. [Google Scholar] [CrossRef]
  27. Qian, Y.; Liang, H.; Soong, F.K. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1231–1239. [Google Scholar] [CrossRef]
  28. Lyu, D.C.; Lyu, R.Y.; Zhu, C.L.; Ko, M.T. Language identification in code-switching speech using word-based lexical model. In Proceedings of the 7th International Symposium on Chinese Spoken Language Processing (ISCSLP 2010), Beijing, China, 29 November–3 December 2010; pp. 460–464. [Google Scholar] [CrossRef]
  29. Lyu, D.C.; Tan, T.P.; Chng, E.; Li, H. SEAME: A Mandarin-English code-switching speech corpus in South-East Asia. In Proceedings of the Interspeech 2010, Makuhari, Chiba, Japan, 26–30 September 2010. [Google Scholar] [CrossRef]
  30. Shen, H.P.; Wu, C.H.; Yang, Y.T.; Hsu, C.S. CECOS: A Mandarin-English code-switching speech database. In Proceedings of the 2011 International Conference on Speech Database and Assessments (Oriental COCOSDA), Hsinchu, Taiwan, 26–28 October 2011; pp. 120–123. [Google Scholar] [CrossRef]
  31. Wang, D.; Tang, Z.; Tang, D.; Chen, Q. OC16-CE80: A Mandarin-English mixlingual database and a speech recognition baseline. In Proceedings of the 2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), Bali, Indonesia, 26–28 October 2016; pp. 84–88. [Google Scholar] [CrossRef]
  32. Vu, N.T.; Lyu, D.C.; Weiner, J.; Telaar, D.; Schlippe, T.; Blaicher, F.; Chng, E.S.; Schultz, T.; Li, H. A first speech recognition system for Mandarin-English code-switch conversational speech. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4889–4892. [Google Scholar] [CrossRef]
  33. Vu, N.T. Automatic Speech Recognition for Low-Resource Languages and Accents Using Multilingual and Crosslingual Information. Ph.D. Thesis, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, 2014. [Google Scholar] [CrossRef]
  34. Bhuvanagiri, K.; Kopparapu, S.K. An approach to mixed language automatic speech recognition. In Proceedings of the Oriental COCOSDA 2010, Kathmandu, Nepal, 24–25 November 2010. [Google Scholar]
  35. Bhuvanagiri, K.; Kopparapu, S.K. Mixed language speech recognition without explicit identification of language. Am. J. Signal Process. 2012, 2, 92–97. [Google Scholar] [CrossRef]
  36. Zellou, G.; Lahrouchi, M. Linguistic disparities in cross-language automatic speech recognition transfer from Arabic to Tashlhiyt. Sci. Rep. 2024, 14, 313. [Google Scholar] [CrossRef]
  37. Van Der Westhuizen, E.; Niesler, T. Synthesising isiZulu-English code-switch bigrams using word embeddings. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 72–76. [Google Scholar] [CrossRef]
  38. Adel, H.; Vu, N.T.; Schultz, T. Combination of recurrent neural networks and factored language models for code-switching language modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 206–211. [Google Scholar]
  39. Weng, C.; Cui, J.; Wang, G.; Wang, J.; Yu, C.; Su, D.; Yu, D. Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 761–765. [Google Scholar] [CrossRef]
  40. Shan, C.; Zhang, J.; Wang, Y.; Xie, L. Attention-based end-to-end models for small-footprint keyword spotting. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2037–2041. [Google Scholar] [CrossRef]
  41. Akinpelu, S.; Viriri, S. Speech emotion classification using attention based network and regularized feature selection. Sci. Rep. 2023, 13, 11990. [Google Scholar] [CrossRef]
  42. Rahman Chowdhury, F.R.; Wang, Q.; Moreno, I.L.; Wan, L. Attention-based models for text-dependent speaker verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5359–5363. [Google Scholar] [CrossRef]
  43. Fang, X.; Gao, T.; Zou, L.; Ling, Z. Bidirectional attention for text-dependent speaker verification. Sensors 2020, 20, 6784. [Google Scholar] [CrossRef]
  44. Luo, N.; Jiang, D.; Zhao, S.; Gong, C.; Zou, W.; Li, X. Towards end-to-end code-switching speech recognition. arXiv 2018, arXiv:1810.13091. [Google Scholar] [CrossRef]
  45. Huang, Z.; Wang, P.; Wang, J.; Miao, H.; Xu, J.; Zhang, P. Improving transformer-based end-to-end code-switching speech recognition using language identification. Appl. Sci. 2021, 11, 9106. [Google Scholar] [CrossRef]
  46. Huang, Z.; Xu, J.; Zhao, Q.; Zhang, P. A two-fold cross-validation training framework combined with meta-learning for code-switching speech recognition. IEICE Trans. Inf. Syst. 2022, 105, 1639–1642. [Google Scholar] [CrossRef]
  47. Nga, C.H.; Vu, D.Q.; Le, P.T.; Luong, H.H.; Wang, J.C. MLSS: Mandarin English code-switching speech recognition via mutual learning-based semi-supervised method. IEEE Signal Process. Lett. 2025, 32, 1510–1514. [Google Scholar] [CrossRef]
  48. Chen, M.; Pan, J.; Zhao, Q.; Yan, Y. Multi-task learning in deep neural networks for Mandarin-English code-mixing speech recognition. IEICE Trans. Inf. Syst. 2016, 99, 2554–2557. [Google Scholar] [CrossRef]
  49. Ali, A.; Renals, S. Word error rate estimation for speech recognition: e-WER. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, 15–20 July 2018; pp. 20–24. [Google Scholar] [CrossRef]
  50. Klakow, D.; Peters, J. Testing the correlation of word error rate and perplexity. Speech Commun. 2002, 38, 19–28. [Google Scholar] [CrossRef]
  51. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady 1966, 10, 707–710. [Google Scholar]
  52. Kessler, B. Computational dialectology in Irish Gaelic. In Proceedings of the Seventh Conference on European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 27–31 March 1995; pp. 60–66. [Google Scholar] [CrossRef]
  53. Wieling, M.; Bloem, J.; Mignella, K.; Timmermeister, M.; Nerbonne, J. Measuring foreign accent strength in English: Validating Levenshtein distance as a measure. Lang. Dyn. Change 2014, 4, 253–269. [Google Scholar] [CrossRef]
  54. Mustafa, M.B.; Yusoof, M.A.; Khalaf, H.K.; Rahman Mahmoud Abushariah, A.A.; Kiah, M.L.M.; Ting, H.N.; Muthaiyah, S. Code-switching in automatic speech recognition: The issues and future directions. Appl. Sci. 2022, 12, 9541. [Google Scholar] [CrossRef]
  55. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU), Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
  56. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, 11–16 May 2020; pp. 4218–4222. [Google Scholar]
Figure 1. Dataset composition structure diagram: The diagram shows the name of each folder, the number of audio files it contains, and the size of the folder. Dataset A contains a total of 1,110 audio files (2.0 GB) and is divided into three folders: File 1A contains 10 audio files (6.2 MB), File 2A contains 100 audio files (122.4 MB), and File 3A contains 1,000 audio files (1.8 GB). Dataset B contains a total of 1,110 audio files (2.2 GB) and is divided into three folders: File 1B contains 10 audio files (6.9 MB), File 2B contains 100 audio files (138.0 MB), and File 3B contains 1,000 audio files (2.1 GB). Dataset All includes a total of 8,420 audio files (16.2 GB) and is divided into three folders: File 1 All contains 20 audio files (13.0 MB), File 2 All contains 400 audio files (520.8 MB), and File 3 All contains 8,000 audio files (15.6 GB). The combined set consists of 20 audio files from Person A and Person B. The total dataset comprises 10,640 audio files (21.8 GB).
Figure 2. Workflow of the API for speech-to-text processing. The diagram illustrates the interaction between the client, the API, and the server. The client sends a speech request to the API, which determines the type of request and forwards it to the VOSK server for processing. The server returns the text result to the API, which then converts it into an HTTP response for the client.
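To make the client–API–server loop in Figure 2 concrete, the following is a minimal sketch, not the authors' exact implementation, of how such an endpoint could be written: a Flask route accepts an uploaded WAV file, streams it through a locally loaded VOSK model, and returns the transcript as the HTTP response. The route name /recognize, the model directory "model", and the mono 16-bit PCM WAV assumption are illustrative placeholders.

```python
# Minimal sketch of an HTTP speech-to-text endpoint backed by VOSK.
# Assumptions (not from the paper): Flask as the API layer, mono 16-bit PCM
# WAV uploads, and a VOSK model unpacked in a local directory named "model".
import io
import json
import wave

from flask import Flask, jsonify, request
from vosk import KaldiRecognizer, Model

app = Flask(__name__)
model = Model("model")  # placeholder path to a monolingual VOSK model


@app.route("/recognize", methods=["POST"])
def recognize():
    # The client uploads a WAV file; the API forwards it to the recognizer.
    audio = io.BytesIO(request.files["audio"].read())
    wav = wave.open(audio, "rb")
    rec = KaldiRecognizer(model, wav.getframerate())
    rec.SetWords(True)  # ask VOSK for per-word timing and confidence

    while True:
        data = wav.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    # The recognizer's JSON output is converted into the HTTP response.
    return jsonify(json.loads(rec.FinalResult()))


if __name__ == "__main__":
    app.run(port=5000)
```

A client would then POST an audio file to /recognize and receive the recognized text in the response body, mirroring the request flow shown in the figure.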
Figure 3. Distributions of error rates for single-language recognition by the VOSK monolingual ASR toolkit: (a) English WER results and (b) Mandarin CER results. The horizontal axis represents the range of error rates, and the vertical axis represents the number of data points in each interval.
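The WER and CER values summarized in Figure 3 are both derived from the Levenshtein edit distance [51] between the recognized and reference texts, with English tokenized into words and Mandarin into characters. The snippet below is a minimal illustration of that calculation under those tokenization assumptions; it is not the authors' evaluation script.

```python
# Sketch of WER/CER computation via the Levenshtein edit distance [51].
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]


def wer(reference, hypothesis):
    """Word error rate: English text tokenized on whitespace."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character error rate: Mandarin text tokenized per character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```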
Figure 4. Histograms of the distribution of MER results for all data in Dataset A, Dataset B, and Dataset All when tested using different methods. (a) shows the CER results for each dataset when using the VOSK speech recognition toolkit. (b) shows the MER results for each dataset when using the SEB-ASR system. The numbers in the graph show how many data samples are in the corresponding MER range using the VOSK and SEB-ASR systems.
Figure 5. Flowchart of our proposed self-evaluated bilingual automatic speech recognition (SEB-ASR) system for Mandarin–English mixed conversations.
Figure 6. Comparison of MER/CER results. The distributions of MER/CER results for each dataset are shown when using the VOSK monolingual ASR toolkit and the SEB-ASR system, respectively. (a) represents the distribution of “Dataset A”, (b) represents the distribution of “Dataset B”, (c) represents the distribution of “Dataset All”, and (d) represents the distribution of data from “YouTube” audios.
Table 1. Comparison of Mandarin–English ASR systems.
Related Work | Model/Method | Dataset | MER (%)
Guo et al. [8] | Lattice-Free Maximum Mutual Information (LF-MMI)-based semi-supervised training under code-switching conditions | SEAME | 20.54–29.56
Winata et al. [15] | End-to-end system using connectionist temporal classification (CTC) | SEAME | 24.61
Vu et al. [32,33] | Two-pass system with IPA and Bhattacharyya distance, discriminative training, statistical machine translation (SMT)-based text generation, and language identification (LID) system integration | SEAME | 36.60
Adel et al. [38] | Recurrent neural network language models (RNNLMs) with factored language models (FLMs) and parts of speech (POS) | SEAME | 32.81
Luo et al. [44] | Hybrid CTC/attention-based end-to-end system for code-switching speech | SEAME | 34.24
Huang et al. [45] | CTC-transformer-based end-to-end model with integrated LID module | SEAME | 30.95
Huang et al. [46] | Meta-learning framework with two-fold cross-validation training | SEAME | 31.02
Nga et al. [47] | Mutual learning-based semi-supervised speech (MLSS) recognition approach | SEAME | 17.60
Chen et al. [48] | Multi-task deep neural network | LDC and in-house data | 32.20
This work | Self-evaluated bilingual automatic speech recognition (SEB-ASR) system | In-house dataset and YouTube videos | 12.78
Table 2. Performance of the VOSK monolingual ASR: the maximum, minimum, and average word error rate (WER) and character error rate (CER) in recognizing English and Mandarin Common Voice datasets.
            | English | Mandarin
Maximum (%) | 22.80   | 25.32
Minimum (%) | 0.00    | 0.00
Average (%) | 8.43    | 9.04
Table 3. Performance comparison of the VOSK monolingual ASR and SEB-ASR systems for bilingual speech recognition: the maximum, minimum, and average MER results across various Mandarin–English mixed datasets.
            | Dataset A        | Dataset B        | Dataset All      | YouTube
            | VOSK  | SEB-ASR  | VOSK  | SEB-ASR  | VOSK  | SEB-ASR  | VOSK  | SEB-ASR
Maximum (%) | 79.81 | 20.64    | 88.00 | 20.93    | 89.93 | 24.00    | 99.57 | 28.91
Minimum (%) | 31.89 | 0.00     | 28.00 | 0.00     | 30.03 | 0.00     | 43.22 | 0.00
Average (%) | 54.93 | 9.36     | 52.08 | 7.56     | 56.49 | 11.50    | 65.22 | 12.78
Table 4. Detailed speech recognition results from the VOSK Mandarin ASR toolkit, which gives the start and end times for each word, the confidence levels of the recognitions, and the words recognized. The last column provides the original audio text for accurate reference.
Start Time | End Time | Confidence | Recognized Word | Original Audio Text
5.22 | 5.97 | 1.0000 | 這個 (zhè ge) | 這個 (zhè ge)
10.17 | 10.41 | 0.3929 | 哎 (āi) |
10.41 | 10.95 | 0.5900 | 爹 (diē) | idea
15.24 | 16.23 | 1.0000 | 非常 (fēi cháng) | 非常 (fēi cháng)
21.32 | 21.45 | 0.2517 | 而 (ér) | perfect
27.51 | 28.08 | 1.0000 | 我們 (wǒ mēn) | 我們 (wǒ mēn)
28.08 | 28.47 | 1.0000 | 的 (de) | 的 (de)
32.40 | 32.81 | 0.5419 | 我 (wǒ) | work
37.14 | 37.74 | 1.0000 | 需要 (xū yào) | 需要 (xū yào)
37.74 | 38.43 | 1.0000 | 提高 (tí gāo) | 提高 (tí gāo)
42.42 | 42.75 | 0.2720 | 逸飛 (yì fēi) |
42.75 | 42.90 | 0.9928 | 是 (shì) | efficiency
42.90 | 43.32 | 0.9035 | 誰 (shéi) |
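The start times, end times, and confidence values in Table 4 correspond to the word-level fields that VOSK reports when word output is enabled. The snippet below is a minimal sketch of how such a list can be collected; the model directory and audio file name are placeholders rather than the paths used in this work.

```python
# Sketch: collecting per-word start/end times and confidences from VOSK,
# as listed in Table 4. The model directory and WAV file name are placeholders.
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-cn")             # placeholder Mandarin model directory
wav = wave.open("mixed_speech.wav", "rb")  # placeholder Mandarin-English audio
rec = KaldiRecognizer(model, wav.getframerate())
rec.SetWords(True)                         # enable word-level timing and confidence

words = []
while True:
    data = wav.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        words.extend(json.loads(rec.Result()).get("result", []))
words.extend(json.loads(rec.FinalResult()).get("result", []))

for w in words:
    # Each entry carries "start", "end", "conf", and "word" fields.
    print(w["start"], w["end"], w["conf"], w["word"])
```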
Table 5. Evaluated speech recognition results from the SEB-ASR system using the VOSK Mandarin ASR toolkit after assigning a “+” (True) or “-” (False) symbol, represented in blue and red text, respectively, according to whether each word’s confidence is above or below the 0.9 threshold for Mandarin ASR.
Symbol | Start Time | End Time | Confidence | Recognized Word | Original Audio Text
+ | 5.22 | 5.97 | 1.0000 | 這個 (zhè ge) | 這個 (zhè ge)
- | 10.17 | 10.41 | 0.3929 | 哎 (āi) |
- | 10.41 | 10.95 | 0.5900 | 爹 (diē) | idea
+ | 15.24 | 16.23 | 1.0000 | 非常 (fēi cháng) | 非常 (fēi cháng)
- | 21.32 | 21.45 | 0.2517 | 而 (ér) | perfect
+ | 27.51 | 28.08 | 1.0000 | 我們 (wǒ mēn) | 我們 (wǒ mēn)
+ | 28.08 | 28.47 | 1.0000 | 的 (de) | 的 (de)
- | 32.40 | 32.81 | 0.5419 | 我 (wǒ) | work
+ | 37.14 | 37.74 | 1.0000 | 需要 (xū yào) | 需要 (xū yào)
+ | 37.74 | 38.43 | 1.0000 | 提高 (tí gāo) | 提高 (tí gāo)
- | 42.42 | 42.75 | 0.2720 | 逸飛 (yì fēi) |
+ | 42.75 | 42.90 | 0.9928 | 是 (shì) | efficiency
+ | 42.90 | 43.32 | 0.9035 | 誰 (shéi) |
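The True/False assignment in Table 5 reduces to a single comparison of each word's confidence against the 0.9 threshold. A minimal sketch of this step is shown below, assuming the per-word dictionaries collected in the previous snippet.

```python
# Sketch: mark each recognized word "+" (True) or "-" (False) by comparing
# its confidence against the 0.9 threshold used for Mandarin ASR (Table 5).
CONFIDENCE_THRESHOLD = 0.9


def label_words(words, threshold=CONFIDENCE_THRESHOLD):
    """Attach a '+' or '-' symbol to each VOSK word entry."""
    return [{"symbol": "+" if w["conf"] >= threshold else "-", **w} for w in words]
```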
Table 6. Reassignment (refined evaluation) of symbols (True or False) in speech recognition results from the SEB-ASR system based on segment continuity. The segments “yì fēi shì shéi” are taken from Table 5. When the end time of the previous segment equals the start time of the next segment, as highlighted in blue, the symbols are reassigned to match the first segment, as indicated in red, treating the consecutive segments as one English word.
Assigned Symbol | Start Time | End Time | Reassigned Symbol | Confidence | Recognized Word | Original Audio Text
- | 42.42 | 42.75 | - | 0.2720 | 逸飛 (yì fēi) |
+ | 42.75 | 42.90 | - | 0.9928 | 是 (shì) | efficiency
+ | 42.90 | 43.32 | - | 0.9035 | 誰 (shéi) |
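The refinement in Table 6 follows a simple continuity rule: when a word starts exactly where the previous word ends, the two belong to one continuous pronunciation, so the later word inherits the symbol of the first word in that run. The sketch below illustrates this rule under the assumption of exact time equality, as in the table.

```python
# Sketch: reassign symbols across continuous pronunciations (Table 6).
# A word that starts exactly at the previous word's end time is treated as
# part of the same utterance and inherits the symbol of the run's first word.
def reassign_symbols(labeled_words):
    updated = [dict(w) for w in labeled_words]
    for i in range(1, len(updated)):
        prev, curr = updated[i - 1], updated[i]
        if curr["start"] == prev["end"]:
            curr["symbol"] = prev["symbol"]  # propagate the first segment's symbol
    return updated
```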
Table 7. Updated Mandarin recognition list with symbols (True or False values in blue or red text, respectively) reassigned to account for continuous pronunciation in the SEB-ASR system.
Symbol | Start Time | End Time | Confidence | Recognized Word | Original Audio Text
+ | 5.22 | 5.97 | 1.0000 | 這個 (zhè ge) | 這個 (zhè ge)
- | 10.17 | 10.41 | 0.3929 | 哎 (āi) |
- | 10.41 | 10.95 | 0.5900 | 爹 (diē) | idea
+ | 15.24 | 16.23 | 1.0000 | 非常 (fēi cháng) | 非常 (fēi cháng)
- | 21.32 | 21.45 | 0.2517 | 而 (ér) | perfect
+ | 27.51 | 28.08 | 1.0000 | 我們 (wǒ mēn) | 我們 (wǒ mēn)
+ | 28.08 | 28.47 | 1.0000 | 的 (de) | 的 (de)
- | 32.40 | 32.81 | 0.5419 | 我 (wǒ) | work
+ | 37.14 | 37.74 | 1.0000 | 需要 (xū yào) | 需要 (xū yào)
+ | 37.74 | 38.43 | 1.0000 | 提高 (tí gāo) | 提高 (tí gāo)
- | 42.42 | 42.75 | 0.2720 | 逸飛 (yì fēi) |
- | 42.75 | 42.90 | 0.9928 | 是 (shì) | efficiency
- | 42.90 | 43.32 | 0.9035 | 誰 (shéi) |
Table 8. Segmented audio recognition results for the English ASR via the API in the SEB-ASR system: Audio segmentations were performed based on the assigned symbols (False values) and sent to the API for the English speech recognition. The table shows the start time, end time, and recognition result for each segmental pronunciation, with the last column indicating the original word for that segment, i.e., the reference.
Segment Start Time | Segment End Time | Recognized Word | Original Audio Text
10.17 | 10.95 | idea | idea
21.32 | 21.45 | perfect | perfect
32.40 | 32.81 | work | work
42.42 | 43.32 | efficiency | efficiency
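Table 8 reflects the step in which consecutive words marked False are merged into one time span, the corresponding portion of the audio is cut out, and the clip is sent through the API to the English VOSK model. The snippet below sketches the merging and cutting under the assumption of mono 16-bit WAV input; recognize_english() stands in for the English ASR call and is a hypothetical helper, not a function from the paper.

```python
# Sketch: merge consecutive "-" words into segments (Table 8) and cut the
# matching audio spans for English recognition. recognize_english() is a
# hypothetical wrapper around the English VOSK model or the HTTP API.
import wave


def false_segments(labeled_words):
    """Merge consecutive '-' entries into (start, end) spans in seconds."""
    segments, current = [], None
    for w in labeled_words:
        if w["symbol"] == "-":
            if current is None:
                current = [w["start"], w["end"]]
            else:
                current[1] = w["end"]
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments


def cut_segment(wav_path, start, end):
    """Return the raw PCM frames between start and end (in seconds)."""
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        wav.setpos(int(start * rate))
        return wav.readframes(int((end - start) * rate)), rate


# Example usage (placeholder file name and hypothetical helper):
# for start, end in false_segments(labeled):
#     frames, rate = cut_segment("mixed_speech.wav", start, end)
#     english_text = recognize_english(frames, rate)
```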
Table 9. Recognized text with and without reassigned symbols in the SEB-ASR system, showing the improvement in MER once continuous audio segments are detected and a falsely assigned True symbol is corrected. The table lists the corresponding error counts in each case and the final MER.
Case | Text | SC | IC | DC | SW | IW | DW | NM | MER
Recognized text without reassigned symbols | 這個(zhè ge) idea 非常(fēi cháng) perfect 我們的(wǒ mēn de) work 需要提高(xū yào tí gāo) if 是誰(shì shéi) | 0 | 3 | 0 | 1 | 0 | 0 | 15 | 26.66%
Recognized text with reassigned symbols | 這個(zhè ge) idea 非常(fēi cháng) perfect 我們的(wǒ mēn de) work 需要提高(xū yào tí gāo) efficiency | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0.00%
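For completeness, the MER values in Table 9 can be reproduced by dividing the total number of errors by the total number of reference units, assuming the column headings abbreviate substituted, inserted, and deleted characters (SC, IC, DC), substituted, inserted, and deleted words (SW, IW, DW), and the total unit count NM (here 11 Mandarin characters plus 4 English words, i.e., NM = 15); this reading is inferred from the tabulated counts rather than quoted from an explicit equation:

$$\mathrm{MER} = \frac{S_C + I_C + D_C + S_W + I_W + D_W}{N_M} = \frac{0 + 3 + 0 + 1 + 0 + 0}{15} \approx 26.66\%$$

for the transcript without reassigned symbols, while the corrected transcript has no errors, giving 0/15 = 0.00%.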
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

