Article

Enhancing Korean-Accented English ASR with Transliteration-Based Data Synthesis

1 Department of Future Convergence Technology Engineering, Sungshin Women’s University, Seoul 02844, Republic of Korea
2 Department of Applied IT Engineering, Pusan National University, Busan 46241, Republic of Korea
3 College of AI/SW Convergence, Kyungnam University, Changwon 51767, Republic of Korea
4 School of AI Convergence, Sungshin Women’s University, Seoul 02844, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(7), 1380; https://doi.org/10.3390/electronics15071380
Submission received: 2 March 2026 / Revised: 19 March 2026 / Accepted: 24 March 2026 / Published: 26 March 2026

Abstract

Despite recent advances in automatic speech recognition (ASR), performance remains limited for Korean-accented English due to the scarcity of accent-specific speech data that capture its pronunciation and prosodic variations. To address this limitation, we propose a synthetic data generation framework for improving Whisper-based ASR performance. Synthetic speech is generated by converting English text into Hangul-based phonetic transcriptions using an intermediate IPA representation to reflect the phonological characteristics of Korean-accented English. The ASR model is fine-tuned using Low-Rank Adaptation with a mixture of synthetic and authentic speech data. Experimental results demonstrate relative reductions of up to 16.40% in the character error rate, 14.93% in the word error rate, and 14.81% in the phoneme error rate compared to the pretrained baseline.

1. Introduction

English is one of the most widely used languages as a lingua franca. Recent advances in automatic speech recognition (ASR) have led to substantial performance improvements, and modern systems increasingly incorporate non-native speech data during training. Nevertheless, subtle performance limitations remain when handling diverse English accents and pronunciation variations. These variations are often influenced by a speaker’s first language (L1), which significantly affects the intelligibility of their second language (L2) through systematic phonological and prosodic differences [1]. Although ASR technology continues to evolve, performance gaps for certain accent groups may persist, potentially impacting technological accessibility and fairness in global communication environments. Therefore, improving the robustness of ASR systems for non-native speech remains important.
Numerous studies have been conducted to improve ASR performance for non-native speech. Some prior work has adopted a fine-tuning strategy by constructing datasets of human-recorded non-native speech and adapting ASR models accordingly [2,3]. Although such approaches can improve recognition accuracy, they require substantial time and cost for data collection and annotation. To reduce the time and cost required for dataset construction, we propose an IPA-based speech synthesis framework that generates Korean-accented English synthetic speech, enabling efficient ASR adaptation without large-scale human speech data collection. The proposed synthesis framework represents English pronunciation through IPA and converts it into Hangul sequences, which are then used as input to a Korean TTS model to generate Korean-accented English speech.
Additionally, the synthesized data are used for Low-Rank Adaptation (LoRA)-based ASR fine-tuning to improve recognition performance for Korean-accented English. In this paper, we limit both the amount of reference speaker data used for speech synthesis and the size of the dataset used for ASR fine-tuning to simulate low-resource conditions, defined in terms of the availability of target speech data rather than limitations in computational resources or the generative environment itself. Furthermore, to the best of our knowledge, few studies have explored TTS-based data augmentation to improve ASR performance for Korean-accented English. Therefore, the proposed approach provides a potential TTS-based augmentation strategy for improving ASR performance on Korean-accented English speech.
The ASR model fine-tuned using the proposed method achieves a relative character error rate (CER) reduction of 16.40% on human-spoken Korean-accented English compared to the pretrained baseline. These results show the effectiveness of the proposed approach under simulated low-resource conditions.
The three main contributions of our study are summarized as follows:
  • We propose a data synthesis framework that employs IPA as an intermediate phonological representation and utilizes a Korean TTS model.
  • We formulate accent modeling by decoupling pronunciation representation from acoustic generation, eliminating the need for accent-specific TTS retraining.
  • Experimental results show that the proposed synthetic data leads to improvements in ASR performance for Korean-accented English across multiple evaluation settings.

2. Related Work

In this section, we review prior work related to automatic speech recognition, text-to-speech-based data augmentation, phonetic representation, parameter-efficient adaptation, and evaluation metrics. Section 2.1 discusses ASR models and accented speech adaptation. Section 2.2 reviews TTS-based approaches for synthetic data generation. Section 2.3 covers phoneme-based representation and conversion. Section 2.4 introduces Low-Rank Adaptation for ASR fine-tuning. Finally, we summarize and compare studies that focus on improving ASR performance for non-native English speech.

2.1. Automatic Speech Recognition (ASR)

2.1.1. Whisper Model

Automatic Speech Recognition (ASR) is defined as a technology that converts human speech signals into character sequences [4]. Numerous studies have been conducted over a long period with the objective of improving the overall effectiveness of speech recognition. Recently, Wav2Vec 2.0 and Whisper have emerged as representative speech recognition models trained on high-volume data that include diverse languages and accents, demonstrating significantly improved recognition performance across a wide range of conditions [5,6,7].
The Whisper model has been reported to achieve consistently strong recognition performance even without task-specific fine-tuning and to exhibit robustness across various speaking environments. Whisper is trained on a large-scale dataset consisting of approximately 680,000 h of paired audio and text data covering a wide variety of languages and accents [6]. Consequently, it has attracted considerable attention for its strong recognition performance across diverse languages and accent conditions. Accordingly, we employ the pretrained Whisper model as the base ASR model for fine-tuning.
As described in [6], Whisper employs an encoder–decoder Transformer architecture for automatic speech recognition. Figure 1 illustrates the Whisper architecture. Input audio signals are resampled to 16 kHz and converted into 80-channel log-Mel spectrogram representations to capture time-frequency characteristics. The encoder first applies two convolutional layers to log-Mel spectrogram features, producing higher-level representations. These representations are then processed by Transformer encoder blocks with sinusoidal positional embeddings to yield a sequence of encoder hidden states. The encoder outputs are subsequently provided to a Transformer decoder with learned positional embeddings and shared input–output token embeddings, which generates the corresponding text transcription [6].
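For reference, the following minimal sketch shows how a pretrained Whisper checkpoint can be loaded and used to transcribe 16 kHz audio through the Hugging Face transformers interface. The checkpoint name, file path, and generation arguments are illustrative assumptions rather than the exact configuration used in this work.

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load a pretrained Whisper checkpoint (illustrative choice of model size).
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Whisper expects 16 kHz input; the processor converts it to 80-channel log-Mel features.
audio, _ = librosa.load("sample.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(inputs.input_features, language="en", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```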
Although Whisper demonstrates strong overall performance, performance gaps relative to native speech are observed for utterances with non-native accent patterns [1]. Accordingly, we propose a method to improve recognition performance for Korean-accented English speech.

2.1.2. Fine-Tuning ASR Models for Accented Speech

Recently, several studies have focused on improving ASR performance for accented English, such as Indian-accented English. In [2], recognition performance for Indian-accented English was improved by collecting real speech data from speakers with diverse Indian linguistic backgrounds. The participants were asked to read both healthcare-related scripts and general-content materials, such as social media and online articles. The speech data constructed in this manner were used to fine-tune an ASR model, and the results verified that the recognition performance for Indian-accented English was significantly improved. However, this approach has limited applicability in low-resource linguistic environments, as it requires a large number of speakers during dataset construction. To address this challenge, we propose a method that reduces both data collection cost and resource dependency through synthetic speech generation using a text-to-speech (TTS) model.

2.2. Text-to-Speech (TTS)

2.2.1. TTS for Accented Speech Generation

TTS is defined as a technology that converts a given text into an intelligible speech signal [8]. It aims to synthesize natural-sounding audio for sentences in various languages and is widely utilized in applications such as voice assistants, navigation guidance systems, and customer services [8]. Human speech reflects a complex combination of factors, including speaker identity, emotional expression, accent, and speaking style [9]. Due to this complexity, generating natural-sounding speech with a TTS system requires modeling expressive and phonetic characteristics beyond basic text-to-speech conversion. To capture these pronunciation-level characteristics, we use Hangul text obtained through phoneme-based transliteration of English as input to the TTS model.
Prior studies have explored various approaches to improving automatic speech recognition (ASR) performance by leveraging TTS-generated speech. In [10], a self-refining framework was introduced in which an ASR model first generates pseudo-label transcriptions from unlabeled speech. These pseudo-labels are used to train a TTS model, which subsequently synthesizes speech from text. The synthesized speech–text pairs are then employed to retrain the ASR model, leading to improvements in ASR performance. In [11], an unsupervised TTS training strategy was introduced that leverages a small amount of accented speech to generate large-scale accented synthetic data. By jointly training the ASR model with both non-accented speech and accented synthetic data, the robustness of ASR systems to accented speech was significantly enhanced. In contrast to these approaches, our method enables the generation of non-native accented speech without requiring additional TTS training by adopting a TTS-based synthesis strategy with a preprocessing stage applied prior to speech synthesis. As a result, it reduces computational overhead and data requirements, improving practical scalability.
In [12], the authors generated synthetic speech to analyze Spanish-accented English pronunciation and extracted phoneme confusion patterns using ASR and phoneme recognition models. Based on these confusion patterns, phoneme sequences were modified and resynthesized via a phoneme-to-speech system to model non-native pronunciation variations. Similar to [12], our approach addresses non-native English pronunciation at the phoneme level. While [12] derives phoneme modifications from ASR-based confusion analysis and applies them in a subsequent synthesis stage, our study explicitly defines phoneme transformation rules prior to speech synthesis and directly incorporates them into a Korean TTS system. By introducing these transformations at the initial stage of the synthesis pipeline, our method enables consistent and interpretable modeling of non-native pronunciation characteristics without relying on ASR outputs.
Similar to the initial synthesis stage in [12], ref. [13] employs a commercial TTS system (Azure TTS) to generate synthetic English speech reflecting the accent characteristics of speakers with a given first language. However, the primary objective of [13] is the efficient generation of synthetic data for ASR fine-tuning, rather than the explicit modeling or analysis of non-native pronunciation variations at the phoneme level. Ref. [13] is similar to our approach in that it leverages TTS models trained on speech from native speakers of a given first language (e.g., Korean) to generate accented English and uses the synthesized speech alongside real data for ASR fine-tuning. While this method enables efficient large-scale data generation, it does not provide explicit control over phonemic variations in the synthesized speech. To mitigate this limitation, we implement established pronunciation guidelines (e.g., those provided by the National Institute of Korean Language) and introduce additional preprocessing and normalization steps necessary to ensure consistent application of these rules to real-world IPA inputs. This enables interpretable and reproducible modeling of accent and pronunciation variations. We further demonstrate that the resulting synthetic speech effectively improves ASR performance.

2.2.2. Breath Group Control

Human speech consists of articulation segments that produce sound and pause segments that do not produce sound [14]. According to [14], when speakers produce long utterances, they tend to segment their speech into specific units by inserting pauses between articulation segments. These utterance units are categorized into pause groups and breath groups. For experimental implementation, the authors define a pause group as a temporal segment of continuous speech in which no other pauses intervene, and a breath group as a temporal segment of speech produced following a single inhalation [14]. Following this definition, we consider a breath group to be a speech segment that is continuously produced after a single inhalation. A word segment refers to the minimal syntactic unit of a sentence and corresponds to a whitespace-delimited unit [15].
Based on these definitions, we incorporate breath group-based segmentation into the speech synthesis process in order to generate temporally and prosodically natural utterances. We further evaluate whether controlling the number of word segments within each breath group during speech synthesis can adjust pause distributions in synthetic speech to better reflect the characteristics of native Korean speech, thereby mitigating unnatural pause patterns and their adverse impact on ASR training performance.

2.3. Phoneme-Based Representation and Conversion

2.3.1. IPA-Based Phoneme Conversion for Korean-Accented English

Non-native speakers tend to substitute L2 phonemes with the most similar, though not identical, phonemes from their L1 [16]. We simulate this phenomenon by replacing L2 (English) phonemes with the most similar L1 (Korean) phonemes to emulate the English pronunciation of Korean speakers. To facilitate phoneme correspondence between the two languages, we employ the International Phonetic Alphabet (IPA) as an intermediate representation.
The IPA is a phonetic notation system proposed in the 19th century to represent speech sounds precisely and consistently [17]. As described in [17], the IPA standardizes pronunciation by assigning one symbol to each phoneme and expresses more detailed phonetic distinctions through the combination of multiple symbols and the use of supplementary notations such as tie bars and diacritics. This notation system was reported to help reduce pronunciation ambiguity caused by orthographic variations across languages.
Because English orthography does not always directly reflect actual pronunciation, we first convert English text into IPA, a standardized phonetic transcription system. The resulting IPA representation is then converted into a Korean phonetic representation by considering the phonological context. In this process, context-dependent phoneme correspondences are applied based on the IPA–Hangul correspondence guidelines provided by the National Institute of the Korean Language [15]. Following this procedure, IPA is used as an intermediate phonetic representation in a two-stage phoneme-based pronunciation conversion process consisting of English-to-IPA and IPA-to-Korean. However, the IPA-to-Korean conversion does not account for pronunciation variations arising from different levels of English proficiency. Nevertheless, phoneme-level representations remain useful for modeling pronunciation characteristics and have been widely used in ASR and speech representation learning.

2.3.2. Phoneme-Based Approaches for ASR and Representation Learning

Various attempts have been made to improve ASR performance by leveraging phoneme information. For example, ref. [18] proposes structure-aware augmentation strategies that incorporate phoneme boundary information. Specifically, Phoneme Dropout, a dynamic dropout regularization strategy based on phoneme boundary recognition, and Phoneme-Aware SpecAugment, a selective masking technique guided by phoneme boundaries, are employed to simulate phenomena such as phonetic reduction and elision observed in spontaneous speech. While this approach enables data augmentation without requiring additional manual annotation, it modifies existing speech signals. Consequently, it differs from generation-based approaches that introduce additional acoustic variability through synthetic speech generation.
Research leveraging phoneme- or IPA-based representations has been explored not only in ASR but also in cross-lingual processing and representation learning. For example, ref. [19] fine-tuned a phoneme-based model (XPhoneBERT) on English IPA data, demonstrating zero-shot transfer capability for named entity recognition without target-language training. In contrast, our method employs IPA not for cross-lingual representation learning or transfer, but as an intermediate phonological representation for modeling non-native pronunciation and generating synthetic speech. The resulting synthetic speech is incorporated into the ASR training data to mitigate the scarcity of real non-native speech data.
In [20], IPA pseudo-labels were generated from the Connectionist Temporal Classification posteriorgrams of an IPA-based ASR model and used to replace conventional K-means clustering labels during the HuBERT pretraining stage. The frame-level IPA pseudo-labels contain more explicit phonological information than conventional K-means clustering labels, thereby facilitating cross-lingual generalization. In contrast, our approach does not modify the training targets of the ASR model. Instead, IPA is employed solely as an intermediate phonological representation within a speech synthesis-based data augmentation pipeline. This design improves ASR performance without requiring the construction or training of an additional IPA recognition model.
Our approach leverages IPA as an intermediate phonetic representation for generating synthetic speech data to improve ASR performance. Specifically, the pronunciation of English text is first converted into IPA and then transcribed into a Korean phonetic representation while preserving the original English phonetic content. This process enables the generation of synthetic speech that reflects non-native pronunciation characteristics of Korean speakers. Such an approach facilitates effective training data construction under low-resource conditions, mitigates the scarcity of non-native speech data, and ultimately improves ASR performance.

2.4. Low-Rank Adaptation (LoRA)

2.4.1. Definition of LoRA

Figure 2 illustrates the LoRA architecture. Low-Rank Adaptation (LoRA) [21] is a parameter-efficient fine-tuning method that freezes the pretrained weight matrix $W \in \mathbb{R}^{d \times d}$ and models its update as a low-rank decomposition:
$$W' = W + \frac{\alpha}{r} A B,$$
where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ are trainable low-rank matrices, $r$ denotes the rank, and $\alpha$ denotes a scaling factor.
By restricting weight updates to a low-rank structure, LoRA reduces the number of trainable parameters while preserving the pretrained backbone weights. Because the low-rank update matrices can be merged with the original weights after training, the model can be used for inference without significant additional computational overhead. Consequently, multiple task-specific LoRA modules can share a single pretrained model, and task switching is achieved simply by replacing the corresponding adaptation matrices.
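As a concrete illustration of this update rule, the following minimal PyTorch sketch wraps a frozen linear layer with trainable low-rank factors A and B scaled by α/r. The initialization scheme and layer shapes are simplifying assumptions, not the exact configuration of [21].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable rank-r additive update scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)               # freeze pretrained weights W
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # d_in x r
        self.B = nn.Parameter(torch.zeros(r, d_out))          # r x d_out; zero init => no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The low-rank path adds a rank-r update to the layer's effective weight;
        # only A and B receive gradients during fine-tuning.
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))   # forward pass through the adapted layer
```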

2.4.2. LoRA-Based Whisper Fine-Tuning on Low-Resource Languages

In [3], LoRA-based fine-tuning was applied to the Whisper large model for Turkish to evaluate its effectiveness under low-resource conditions. The study investigated whether parameter-efficient adaptation using LoRA could improve ASR performance with limited training data for Turkish, a language characterized by agglutinative morphology, dialectal diversity, and phonological complexity. The results showed that LoRA-based fine-tuning improves ASR performance in such settings. However, this approach does not directly address the limited diversity of available utterances. While LoRA enhances parameter efficiency, it does not fundamentally resolve the constraints imposed by data scarcity. We therefore view this limitation not only as a model adaptation issue, but also as a problem of limited data diversity. To address this, we propose a data-centric approach based on synthetic speech generation. Specifically, we generate synthetic speech that simulates non-native pronunciation characteristics and use it for LoRA-based fine-tuning of the Whisper ASR model. Finally, we compare our approach with previous studies on improving ASR performance for non-native English speakers, as summarized in Table 1.

3. Proposed Method

In this section, we provide a detailed description of our method, which aims to improve the recognition performance of ASR models for accented English speech produced by Korean native speakers. The overall framework of the proposed method is presented in Section 3.1, and the subsequent sections describe each core component that constitutes this framework in detail.

3.1. Overview of the Proposed Approach

Figure 3 illustrates the overall workflow of our proposal. Given an input English sentence, the pronunciation of each word is first converted using IPA symbols, and the IPA words are then represented as Hangul strings. Based on predefined IPA-to-Hangul mapping rules, the IPA Converter outputs the Korean transliteration of the English pronunciation. The Hangul-transcribed English pronunciations are subsequently used as inputs to the Korean-accented English Generator to synthesize speech that models Korean-accented English. The Korean-accented English Generator synthesizes speech via TTS using Hangul-transcribed English pronunciation as input, thereby modeling both the prosodic and phonetic characteristics of Korean speakers. The synthetic dataset generated by the Korean-accented English Generator is combined with a human speech dataset to construct the training dataset. Using this dataset, we fine-tune the ASR model with LoRA to improve its performance on English speech produced by Korean speakers.

3.2. IPA Converter Architecture

Figure 4 visualizes the procedure of the IPA Converter. It consists of four components: the input English text, a dictionary-priority-based IPA conversion module, an IPA-to-Hangul conversion module, and the final Korean transliteration output.
Given an English sentence, each word is first converted into its IPA representation. To perform the conversion, the IPA Converter first checks whether each word is listed in a pronunciation dictionary. If a word exists in the dictionary, the dictionary-based IPA converter is applied with priority to ensure accurate pronunciation and improve phonetic consistency. For out-of-vocabulary (OOV) words, the rule-based IPA converter is applied. Since the rule-based IPA converter does not rely on a dictionary, it can robustly generate IPA symbols for neologisms and other OOV words.
Using the IPA-to-Hangul correspondence rules established by the National Institute of Korean Language (NIKL), the converted IPA symbols are mapped to their corresponding Hangul representations [15], which serve as the primary framework for the conversion process. However, certain IPA symbols are not explicitly specified in the NIKL documentation. To ensure consistent and systematic mapping, we introduce limited rule extensions while preserving the original guideline structure. These extensions include the normalization of non-standard IPA symbols prior to conversion and the handling of IPA symbols not directly defined in the NIKL correspondence rules. When no direct Hangul equivalent exists for a given IPA symbol, it is first normalized into a predefined set of IPA representations, after which the standard IPA-to-Hangul mapping rules are applied consistently. Through this structured framework, the conversion process maintains systematic correspondence with English pronunciation while providing stable phonetic representations suitable for speech synthesis.
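The following toy sketch illustrates the dictionary-priority logic and the IPA-to-Hangul mapping stage described above. The helper names, the miniature dictionary, and the mapping entries are hypothetical placeholders rather than the actual NIKL correspondence tables.

```python
from typing import Optional

# Toy pronunciation dictionary and Hangul mapping (illustrative entries only).
PRONOUNCING_DICT = {"speech": "spitʃ"}
IPA_TO_HANGUL = {"spitʃ": "스피치"}

def dictionary_ipa(word: str) -> Optional[str]:
    """Return the dictionary IPA transcription if the word is listed, else None."""
    return PRONOUNCING_DICT.get(word.lower())

def rule_based_ipa(word: str) -> str:
    """Fallback grapheme-to-phoneme conversion for OOV words (placeholder for the rule-based converter)."""
    return word.lower()

def ipa_to_hangul(ipa: str) -> str:
    """Map an IPA string to a Hangul transliteration (placeholder for the NIKL-based rules)."""
    return IPA_TO_HANGUL.get(ipa, ipa)

def convert_sentence(sentence: str) -> str:
    hanguls = []
    for word in sentence.split():
        ipa = dictionary_ipa(word) or rule_based_ipa(word)   # dictionary takes priority over rules
        hanguls.append(ipa_to_hangul(ipa))
    return " ".join(hanguls)

print(convert_sentence("speech"))   # -> 스피치
```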

3.3. Korean-Accented English Generator (KAEG)

Figure 5 shows the pipeline in which the Korean phonetic transcription of English generated by the IPA Converter is used as input to the TTS model to produce English speech exhibiting Korean accent characteristics.
After converting English text into IPA, the text is reconstructed as Korean transliterations of English and then fed into a Korean TTS model to include the prosodic characteristics of Korean speakers. The input Korean text is synthesized into a sound signal through the inference stage of the Korean TTS model. This approach incorporates Korean phonological characteristics by representing English pronunciation within the Hangul writing system and synthesizing speech using a Korean TTS model. To incorporate breath-group-level temporal structuring, a breath-group–based synthesis strategy is adopted.
Based on the statistical properties of the human speech dataset, the word range per breath group is defined to reflect the average breathing patterns of Korean speakers [14]. To mitigate sampling bias, the breath-group word range reported in [14] is used as the reference interval, and values are drawn from a uniform distribution over this interval. To realize this breath-group structure, the Hangul-transcribed English pronunciation is grouped into segments of controlled length based on whitespace boundaries, and each segment is synthesized individually by the Korean TTS model before the resulting audio is concatenated in sequence. This process introduces silent intervals between segments, inducing natural pauses that reflect the intended breath-group segmentation while preserving the phonetic characteristics of Korean speakers.
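A minimal sketch of the breath-group segmentation and concatenation step is shown below. The `tts_fn` argument stands in for the Korean TTS call, the 5 to 7 word range follows the setting described in Section 4.2.2, and the function names are assumptions for illustration.

```python
import random
import numpy as np

def group_into_breath_groups(text: str, low: int = 5, high: int = 7) -> list:
    """Split a whitespace-delimited Hangul transliteration into breath groups of low..high words."""
    words, groups, i = text.split(), [], 0
    while i < len(words):
        n = random.randint(low, high)            # words per breath group, drawn uniformly
        groups.append(" ".join(words[i:i + n]))
        i += n
    return groups

def synthesize_with_breath_groups(text: str, tts_fn) -> np.ndarray:
    """Synthesize each breath group separately and concatenate the waveforms in order,
    so that segment boundaries yield natural pause-like breaks."""
    segments = [tts_fn(group) for group in group_into_breath_groups(text)]
    return np.concatenate(segments) if segments else np.zeros(0, dtype=np.float32)
```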

3.4. ASR Fine-Tuning

Figure 6 illustrates the process of constructing the training dataset by integrating Korean-accented English speech (Non-synthetic Dataset) with synthesized Korean-accented English speech (Synthetic Dataset), followed by fine-tuning the ASR model using the combined dataset. This integration aims to achieve efficient performance gains.
As reported in [22], fine-tuning solely on synthetic data tends to degrade performance, whereas incorporating human spoken data has been shown to improve performance. To prevent bias toward synthetic data characteristics, we incorporate human speech alongside synthetic data during ASR fine-tuning. Specifically, we fix the amount of human spoken data and vary only the proportion of synthetic data to analyze the effect of different synthetic-to-human data ratios on speech recognition performance. Detailed ratio configurations are provided in Section 5.2.

4. Experimental Setup

This section describes the experimental setup and its components. Section 4.1 presents the composition and characteristics of the datasets used in the experiment. Section 4.2 describes the configuration and synthesis process of the TTS model. Section 4.3 details the fine-tuning settings and training procedure of the ASR model. Finally, Section 4.4 presents the ASR evaluation metrics.

4.1. Dataset

The Whisper model was fine-tuned using a combination of the AI-Hub Educational Korean English Speech Dataset (AHK) [23] and the synthetic dataset generated by the proposed method. Since the AHK dataset provides only training and validation sets without a separate test set, we partitioned the original training data into non-overlapping training and validation subsets to prevent data leakage. The official validation set was then employed as the test set for our experiments. To further evaluate the generalization performance of the fine-tuned model, the L2-ARCTIC dataset was held out during training and used exclusively as an unseen test set.
Table 2 summarizes the data composition of the AHK dataset [23]. According to [23], the dataset was developed to support AI-based foreign language pronunciation and speaking education and assessment. It consists of 1052 h of English speech recordings produced by Korean speakers and includes various metadata. Access to the dataset is restricted to users within Korea. The dataset includes speakers of diverse age groups and genders and covers a broad range of topics; the topic distribution is presented in Figure 7. This composition facilitates the analysis of English speech characteristics of Korean speakers across diverse communicative contexts, making the dataset suitable for ASR model training and performance evaluation. Pronunciation data were collected using expert-designed read-aloud scripts incorporating phonemes frequently misarticulated by Korean learners. In addition, speaking data were collected through CEFR-aligned, scenario-based tasks reflecting the Korean national assessment framework. All recorded responses were compiled into a structured speech corpus.
The L2-ARCTIC dataset is a non-native English speech corpus designed to support tasks such as voice conversion, accent conversion, and mispronunciation detection. It includes speech data from speakers of five different first-language backgrounds: Hindi, Korean, Mandarin, Spanish, and Arabic. The Korean subset consists of recordings from four speakers (two male and two female), each providing 150 utterances with corresponding manual transcriptions [24]. The entire Korean subset of the L2-ARCTIC dataset is used solely for evaluation and is not included in any training or fine-tuning stage, enabling an assessment of the model’s generalization capability.
For speech synthesis, the English sentences to be converted into Korean speech were obtained from the LibriSpeech dataset. In [25], it is explained that LibriSpeech is a large-scale speech corpus constructed based on audio recordings from the LibriVox project and provides approximately 1000 h of speech data sampled at 16 kHz. The dataset is one of the most widely used publicly available datasets for ASR model training and evaluation, and includes corresponding text transcriptions for each audio file [25]. In this paper, we utilized the transcription sentences of the audio files to construct the input data required for the speech synthesis stage.
The TTS synthesis process requires reference speaker audio data. We used the Zeroth-Korean dataset, a publicly available Korean speech corpus consisting of 51.6 h of training data and 1.2 h of test data, with 105 speakers in the training set and 10 speakers in the test set [26]. To control speaker distribution, a subset of training speakers was selected as reference speakers. In the main experiments, 10 speakers were randomly sampled from the training set, and an equal number of samples was assigned to each speaker for TTS-based synthetic data generation. This design mitigates potential bias caused by uneven speaker representation in the synthesized dataset.

4.2. TTS

In this section, we describe the experimental setup for data synthesis using TTS. Section 4.2.1 presents the configuration for converting English text into a Korean phonetic representation based on IPA, while Section 4.2.2 describes the experimental setup for speech synthesis using the text generated through IPA conversion.

4.2.1. IPA Conversion

The synthetic dataset used for ASR fine-tuning is constructed by sequentially applying an IPA converter and KAEG to the input English text. For the IPA conversion stage, we utilize the Phonemizer package for rule-based phoneme conversion and the eng_to_ipa package for dictionary-based conversion [27,28]. The resulting IPA sequences are mapped to a Korean phonetic representation in accordance with NIKL standards, with minor preprocessing adjustments. When no direct Hangul equivalent exists for a given IPA symbol, it is normalized into a predefined set of IPA representations prior to applying the standard mapping rules. For instance, selected IPA symbols are normalized such that ɝ is represented as ɛr, ᵻ as ɪ, ɐ as a, and ɫ and ɬ as l.
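A minimal sketch of this conversion stage is given below, assuming the documented interfaces of the eng_to_ipa and Phonemizer packages (dictionary lookup via the CMU dictionary and an espeak-backed rule-based fallback). The normalization table reproduces the substitutions listed above; the function names are otherwise illustrative.

```python
import eng_to_ipa as ipa                 # dictionary-based conversion (CMU dictionary)
from phonemizer import phonemize         # rule-based conversion (espeak backend)

# Normalization of IPA symbols without a direct Hangul equivalent (subset described above).
IPA_NORMALIZATION = {"ɝ": "ɛr", "ᵻ": "ɪ", "ɐ": "a", "ɫ": "l", "ɬ": "l"}

def word_to_ipa(word: str) -> str:
    """Dictionary-priority conversion: dictionary lookup first, rule-based G2P for OOV words."""
    if ipa.isin_cmu(word):
        return ipa.convert(word)
    return phonemize(word, language="en-us", backend="espeak", strip=True)

def normalize_ipa(symbols: str) -> str:
    """Rewrite non-standard symbols before applying the IPA-to-Hangul mapping rules."""
    for src, dst in IPA_NORMALIZATION.items():
        symbols = symbols.replace(src, dst)
    return symbols

print(normalize_ipa(word_to_ipa("butter")))
```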

4.2.2. KAEG

The Korean phonetic representation generated by the IPA converter is subsequently synthesized into speech using a TTS system. The synthesis was performed on a system equipped with an Apple M4 Max processor. According to [29], the Coqui XTTS v2 model supports 17 languages, including Korean, English, and Spanish, and offers multilingual speech synthesis as well as voice cloning. Unlike conventional TTS models that rely on default speakers, this model enables speech synthesis by leveraging user-specified reference speakers, thereby generating highly varied and expressive voices. However, since the generated audio is sampled at a rate of 24 kHz, a resampling process to 16 kHz is required prior to fine-tuning the Whisper model. In addition, the Coqui XTTS v2 model can synthesize speech that expresses the speaking style of a given reference speaker when reference speaker audio is provided as input [29]. The Coqui XTTS v2 Korean model was used for speech synthesis. To simulate a low-resource language setting, speech synthesis was performed using 10 reference speakers.
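The synthesis and resampling steps can be sketched as follows with the Coqui TTS Python API. The model identifier, reference audio path, and example Hangul input are assumptions for illustration, and the 24 kHz to 16 kHz resampling mirrors the preprocessing described above.

```python
import librosa
import soundfile as sf
from TTS.api import TTS

# Load the multilingual XTTS v2 model; Korean is selected via the `language` argument.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="스피치 리코그니션",             # Hangul-transliterated English pronunciation (illustrative)
    speaker_wav="reference_speaker.wav",   # one Zeroth-Korean reference speaker (placeholder path)
    language="ko",
    file_path="synth_24k.wav",
)

# XTTS v2 outputs 24 kHz audio; resample to 16 kHz before Whisper fine-tuning.
audio, _ = librosa.load("synth_24k.wav", sr=24000)
audio_16k = librosa.resample(audio, orig_sr=24000, target_sr=16000)
sf.write("synth_16k.wav", audio_16k, 16000)
```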
Additionally, to encourage more natural speech synthesis, we considered the number of words per breath group. According to [14], the average number of words per breath group was analyzed based on speaking task types as well as gender and age groups. By task type, spontaneous speech exhibited an average of 7.5 words, whereas read speech showed an average of 5.82 words, which was interpreted as reflecting differences in cognitive load and speech planning requirements. Regarding gender and age groups, young adult males produced the highest average of 9.33 words, followed by middle-aged males (6.77), middle-aged females (6.64), and young adult females (6.58). Integrating these findings, most speakers, excluding young adult males, tend to produce approximately 6 words per breath group, and the overall averages for both spontaneous and read speech generally fall within the range of five to seven words. Given these findings, we selected the number of words per breath group randomly within the range of 5 to 7 in order to reflect general Korean speech characteristics while avoiding bias toward any specific group.

4.3. ASR

Fine-tuning and performance evaluation of the ASR models were conducted on a system equipped with an NVIDIA RTX 5070 GPU. We conducted fine-tuning on the Whisper Tiny and Whisper Base models. These models were selected considering that non-native accented English often needs to be recognized in real time on resource-constrained devices, including mobile platforms used in educational English conversation applications.
ASR fine-tuning was performed using 16 kHz audio. Training was conducted for 20 epochs. The learning rate was set to $1 \times 10^{-5}$ for the Tiny model and $5 \times 10^{-6}$ for the Base model, with a warm-up ratio of 5% of the total training steps. The batch size was set to 16, and the gradient accumulation was set to 2. To evaluate the recognition performance of the models, CER was adopted as the primary evaluation metric, and WER was also computed to maintain comparability with other studies. Additionally, we report PER to assess phoneme-level accuracy with respect to the human reference transcriptions. Fine-tuning was conducted using LoRA, with the rank set to 16 and the scaling factor (alpha) set to 32. Model selection was based on the validation CER. Approximately 5 h of audio data were held out from the AHK dataset to form the validation set, ensuring complete separation from the training data.
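A minimal sketch of the LoRA setup for Whisper fine-tuning is shown below using the Hugging Face peft library. The rank, alpha, and learning rates follow the values above, while the choice of target attention projection modules is an assumption not specified in the text.

```python
import torch
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Rank 16 and scaling factor 32, as in the fine-tuning setup above;
# targeting the attention projections is an assumed, commonly used choice.
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], bias="none")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the low-rank adapter weights remain trainable

# Learning rate: 1e-5 for the Tiny model (5e-6 for the Base model).
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)
```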

4.4. Evaluation Metrics

We evaluate ASR performance using Character Error Rate (CER), Word Error Rate (WER), and Phoneme Error Rate (PER), which measure discrepancies between reference and predicted sequences [30].
$$\text{Error Rate} = \frac{S + D + I}{N} \qquad (1)$$
In Equation (1), $S$, $D$, and $I$ denote substitutions, deletions, and insertions, respectively, and $N$ is the total number of units in the reference transcription.
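For clarity, the sketch below computes this error rate directly from a Levenshtein alignment over token sequences; the tokenization (characters, words, or phonemes) determines whether the result is CER, WER, or PER.

```python
def error_rate(reference: list, hypothesis: list) -> float:
    """Compute (S + D + I) / N via Levenshtein alignment over token sequences."""
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                     # deletions
    for j in range(m + 1):
        dp[0][j] = j                     # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m] / max(n, 1)

ref, hyp = "speech model".split(), "speech models".split()
print(f"WER = {error_rate(ref, hyp):.2%}")   # 1 substitution over 2 reference words -> 50.00%
```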

5. Experimental Results

This section presents the results of experiments conducted to evaluate the effectiveness of the proposed method. Section 5.1 examines the similarity between synthetic speech and human speech. Section 5.2 investigates the impact of different fine-tuning data configurations on model performance. Section 5.3 analyzes the effect of varying the amount of synthetic data on model performance. In Section 5.4, we evaluate the model fine-tuned on the AHK dataset using the L2-ARCTIC dataset to determine whether it demonstrates meaningful performance improvements on other datasets as well. Finally, Section 5.5 examines the effectiveness of the strategy for controlling the number of words per breath unit.
The relative improvement rates in the tables are calculated with respect to the pretrained baseline, and the relative improvement is calculated as:
$$\text{Relative Improvement } (\%) = \frac{E_{\text{baseline}} - E_{\text{target}}}{E_{\text{baseline}}} \times 100,$$
where $E_{\text{baseline}}$ denotes the error rate (i.e., CER, WER, or PER) of the pretrained model, and $E_{\text{target}}$ denotes the error rate of the fine-tuned model.
In each table, the lowest error rate for each model size is highlighted in bold, and the largest relative improvement is also highlighted in bold. Baseline refers to the pretrained ASR model without fine-tuning. All experiments involving synthetic data use a mixture of 20 h of real (human-spoken) data and synthetic data. The synthetic dataset is generated using 10 reference speakers.

5.1. Similarity Analysis Between Synthetic and Real Speech

In this section, a one-hour subset is randomly sampled from the training portion of the AHK dataset, and the corresponding ground-truth transcriptions are used. Using these texts, we generate three types of synthetic speech, including speech produced by the proposed method, Korean-accented English speech generated using the Azure Korean TTS model with English text input [13], and English speech generated using the Azure English TTS model.
To analyze the similarity between phoneme-level pronunciation confusion patterns, we employ a pretrained Whisper-base model without fine-tuning. Figure 8 visualizes the top 20 phonemes with the highest error frequencies as a heatmap, where the values represent the logarithm of the error counts. Specifically, Figure 8a presents the training set of the AHK dataset, while Figure 8b–d present speech synthesized using Azure Korean TTS, Azure English TTS, and our proposed method, respectively.
The prominent phoneme confusion patterns observed in (a), such as /i/–/ɪ/, /z/–/s/ and /r/–/l/, are also observed in (d). The overall error distribution in (d) appears more similar to that in (a) than to that in (b). In particular, several phoneme regions that exhibit relatively low error frequencies in (a) are also reflected in (d), whereas (b) shows comparatively higher error frequencies in those regions. Furthermore, (d) demonstrates closer alignment with (a) than (b) in specific confusion patterns. For example, similar confusion rates are observed between /ɪ/ and /ɛ/. For the phoneme /ə/, the relative confusion pattern among /ər/, /ɪ/, and /ɛ/ is consistent, with /ər/ occurring most frequently, followed by /ɪ/ and /ɛ/.
Compared with (c), which exhibits a substantially lower overall error rate, the proposed method more closely matches the error patterns observed in (a). Although (c) includes some phonemes that show high error frequencies in (a), the overall error rate remains considerably lower, suggesting that the pronunciation confusions typically observed in Korean-accented English speech are less represented. These observations suggest that the proposed synthesis method better reflects certain characteristics of phoneme error patterns in Korean-accented English.

5.2. Comparison of Model Performance Across Data Configuration

In this section, we present the model performance results obtained by varying the dataset configuration for ASR fine-tuning. In particular, under the synthetic data condition that empirically yielded the best performance, we compare our results with those obtained using Azure TTS–based synthetic data [13] to evaluate the effectiveness of the proposed approach. The optimal amount of synthetic data is described in detail in Section 5.3.
Table 3 shows that fine-tuning with synthetic data overall improves recognition performance for Korean-accented English speech. For the Tiny model, the proposed method achieves the best performance across all evaluation metrics. Specifically, compared to the baseline, it achieves relative reductions of 16.40% in CER, 14.93% in WER, and 14.81% in PER. For the Base model, the proposed method also demonstrates consistent performance improvements, achieving relative reductions of 7.54% in CER, 4.88% in WER, and 4.99% in PER compared to the baseline. The consistent reductions across all evaluation metrics suggest that the proposed synthetic approach improves both phoneme-level discriminability and word-level recognition accuracy. Notably, larger performance gains are observed for the Tiny model. This trend suggests that synthetic data augmentation is more effective for smaller models, potentially due to their relatively limited representational capacity.
Under the same synthetic data amount condition, the proposed synthesis method consistently outperforms the Azure-based synthesis approach. This performance difference implies that the proposed method better reflects the acoustic and articulatory characteristics of Korean-accented English.
In addition, compared to models fine-tuned using only real speech data, further performance improvements are observed when synthetic data are incorporated. This indicates that the proposed synthetic data are compatible with the distribution of real speech without causing noticeable degradation. Overall, these findings imply that the proposed synthesis approach more effectively reflects the phonetic and prosodic characteristics of Korean-accented English, thereby improving the model’s robustness across diverse speaking conditions.

5.3. Effect of Synthetic Data Scale on ASR Performance

In this section, we evaluate ASR performance under varying amounts of synthetic data mixed with real speech data. In this experiment, the amount of human speech data was fixed at 20 h, while the amount of synthetic data was varied across 5, 10, and 20 h. Preliminary results indicated that the Tiny model exhibited performance variations under synthetic data conditions below 5 h, so additional experiments were conducted using 2 h and 4 h synthetic data settings. This setup enabled a more detailed analysis of the optimal amount of synthetic data within the training dataset.
According to the results in Table 4, the best performance is achieved under the 4 h synthetic data condition for the Tiny model and under the 5 h condition for the Base model. However, the Base model exhibits the lowest PER and WER under the 10 h condition.
For the Tiny model, performance degradation is observed under the 5 h condition compared to the 4 h setting, and the difference between the 5 h and 10 h conditions is marginal. In contrast, under the 20 h condition, performance drops substantially relative to the model fine-tuned using only real speech data. This result suggests that, in in-domain evaluation, the benefit of synthetic data diminishes when the proportion of real data falls below a certain threshold [31].
For the Base model, increasing the synthetic data to 20 h does not result in substantial degradation relative to the model trained only on real data. However, error rates increase compared to the optimal condition. This pattern is interpreted as an indication that the larger Base model maintains more stable performance than the Tiny model as the proportion of synthetic data increases.

5.4. Out-of-Domain ASR Performance on L2-ARCTIC (Korean Subset)

In this section, we evaluate the generalization performance of the model fine-tuned using the proposed method on the Korean subset of the out-of-domain L2-ARCTIC dataset. The models compared include the proposed synthesis-based model, the Azure TTS-based synthetic data model, the real-speech-only fine-tuned model, and the pretrained baseline model. Through this comparison, we examine whether the proposed synthetic data-based training strategy maintains consistent performance improvements under domain-mismatched conditions. The amount of synthetic data used in this experiment was set to the optimal configuration identified in Section 5.3.
Table 5 presents the model performance results on the out-of-domain dataset. The results show that the model trained using the proposed synthesis strategy achieves the best performance for both the Tiny and Base models. For the Tiny model, relative improvements of 15.24% in CER, 13.64% in WER, and 14.0% in PER are observed compared to the baseline. For the Base model, relative improvements of 3.48% in CER, 0.95% in WER, and 1.62% in PER are achieved. In particular, the relative gains are larger for the Tiny model, suggesting that the proposed synthesis strategy may be especially beneficial for smaller models. These results further indicate that the proposed synthetic data generation approach supports improved generalization performance under domain-mismatched conditions.
However, for the Base model, fine-tuning with 20 h of real speech alone does not improve performance compared to the pretrained baseline. This suggests that, for models that already have strong generalization ability from large-scale pretraining, additional fine-tuning with a limited amount of data may bring only small gains. In contrast, when the proposed synthesis strategy is applied, consistent improvements are observed across all evaluation metrics. This implies that synthetic data help to increase the diversity of the training data while maintaining the pretrained knowledge, leading to better generalization performance. Overall, the results show that the proposed method provides stable performance improvements across different model sizes.
When compared with the Azure-based synthetic data approach [13], the proposed strategy also achieves consistent additional improvements across all metrics for both the Tiny and Base models. For the Tiny model, relative improvements of 5.41% in CER, 3.82% in WER, and 4.27% in PER are observed compared to Azure-based synthesis. For the Base model, lower error rates are maintained across all evaluation metrics relative to Azure. These results suggest that the proposed synthesis strategy offers a more effective expansion of the training distribution than commercially available TTS-based synthetic data under the current experimental conditions.

5.5. Effect of Word-Count Control Within Breath Units

In this section, we conduct experiments to evaluate the effectiveness of the word-count control strategy within breath units, which is a component of the proposed synthesis approach. For comparison, synthetic speech generated without controlling the number of words per breath unit was produced by synthesizing the entire sentence at once, without segment-level splitting. In Table 6, BR-O denotes the condition in which the number of words per breath unit is controlled, whereas BR-X denotes the condition without word-count control.
Table 6 shows that models fine-tuned with the proposed synthesis method, regardless of whether breath-unit information is included, outperform both the pretrained baseline and the model fine-tuned only with real speech data across all test sets. These results indicate that the proposed data augmentation improves robustness on Korean-accented English speech.
On the AHK test set, the Tiny model with breath-unit control achieves additional relative reductions of 1.21% in CER, 0.71% in WER, and 0.64% in PER compared to the model without breath-unit control. For the Base model, additional improvements of 0.63% in CER and 0.49% in WER are observed. On the L2-ARCTIC Korean subset, the Tiny model with breath-unit control further reduces CER, WER, and PER by 2.62%, 1.98%, and 2.44%, respectively. The Base model also shows additional improvements of 2.39% in CER, 1.42% in WER, and 0.43% in PER.
Overall, incorporating breath-unit information results in consistent improvements across most metrics and test sets. These results suggest that modeling breath units contributes to more effective data augmentation for Korean-accented English speech. A minor exception is observed for PER on the AHK test set with the Base model, where the variant without breath-unit control achieves marginally better performance.

6. Conclusions

In this paper, we propose a synthetic data generation framework to improve the speech recognition performance of Whisper models on English speech spoken by Korean speakers. Specifically, English text was first converted into IPA phonetic representations, which were then rewritten using a Korean writing system to reflect Korean-accented pronunciation patterns. Korean TTS was subsequently employed to synthesize speech, thereby generating English utterances that aim to reflect phonological and prosodic features associated with Korean-accented English.
Analysis of the similarity between phoneme-level pronunciation confusion patterns provides additional evidence that the proposed method captures certain phonetic characteristics of Korean-accented English. Additionally, the proposed method consistently reduced CER, WER, and PER compared to the pretrained baseline across all experimental settings. It generally outperformed both the model fine-tuned solely on human speech data and the model trained with synthetic data generated by a commercial TTS system. Notably, performance gains were also observed on out-of-domain datasets, suggesting improved generalization capability. Furthermore, the PER results indicate that the proposed synthesis approach yields better phoneme-level consistency, which may contribute to enhanced ASR performance even under resource-constrained conditions.

7. Ablation Study

Section 7.1 compares the conventional augmentation strategy with the proposed synthesis method. In Section 7.2, the number of reference speakers used in the proposed method is further reduced to investigate whether the proposed approach can still improve ASR performance under more limited data conditions.

7.1. Comparison with Conventional Augmentation Methods

In this section, we compare the proposed synthesis strategy with conventional augmentation methods, including noise injection, speed perturbation, and SpecAugment. Noise injection and speed perturbation generate additional training data by transforming original speech recordings, whereas SpecAugment modifies real speech during fine-tuning. The Whisper-base model is fine-tuned using human speech data together with additional data generated either through conventional augmentation applied to real speech or through the proposed synthesis method.
Each method utilizes 5 h of additional data, in addition to the 20 h of real speech data used for fine-tuning. Noise injection is randomly applied using a uniform distribution in the range of 0.005 to 0.015, while speed perturbation is randomly applied using a uniform distribution in the range of 0.5 to 1.25 [32,33]. For SpecAugment, time warping is excluded due to its high computational cost and limited effectiveness. Instead, only frequency and time masking are employed, with hyperparameters set to two masks each, a frequency masking parameter of 10, and a time masking parameter of 40 [34]. Table 7 presents a comparison with conventional augmentation methods.
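A minimal sketch of these baseline augmentations is given below, assuming standard librosa and torchaudio interfaces. The parameter ranges match the settings above, while the exact noise model (Gaussian) is an assumption.

```python
import random
import numpy as np
import librosa
import torch
import torchaudio

def noise_injection(audio: np.ndarray) -> np.ndarray:
    """Add noise with a scale drawn uniformly from [0.005, 0.015] (Gaussian noise assumed)."""
    scale = random.uniform(0.005, 0.015)
    return audio + scale * np.random.randn(len(audio)).astype(np.float32)

def speed_perturbation(audio: np.ndarray) -> np.ndarray:
    """Time-stretch the waveform by a factor drawn uniformly from [0.5, 1.25]."""
    rate = random.uniform(0.5, 1.25)
    return librosa.effects.time_stretch(audio, rate=rate)

# SpecAugment on log-Mel features: two frequency masks (F = 10) and two time masks (T = 40);
# time warping is omitted, as described above.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

def spec_augment(log_mel: torch.Tensor) -> torch.Tensor:
    for _ in range(2):
        log_mel = freq_mask(log_mel)
        log_mel = time_mask(log_mel)
    return log_mel
```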
As shown in Table 7, our method achieves greater performance improvement than noise injection. Compared to SpecAugment, the proposed method shows better improvements on the AHK dataset, while on the L2-ARCTIC dataset, SpecAugment improves performance in most metrics except PER. However, the performance differences on the L2-ARCTIC dataset are not significant, suggesting that the proposed synthesis strategy is generally more effective than these conventional augmentation methods.
Speed perturbation yields greater performance improvement than the proposed synthesis method. However, it augments data by modifying existing samples, whereas the proposed method generates additional training samples without altering the original data. In other words, since the two approaches serve different objectives in data augmentation, the proposed method can still be considered a useful strategy.

7.2. Effectiveness with a Reduced Number of Reference Speakers

In this section, we analyze the effect of the number of reference speakers used for speech synthesis. The Whisper-base model is fine-tuned using 20 h of real speech data together with 5 h of synthetic speech generated using 1, 3, 5, and 10 reference speakers. The resulting ASR performance is presented in Table 8. In Table 8, ‘spk’ denotes the number of reference speakers.
As shown in Table 8, the proposed method improves performance even when only a single reference speaker is used for speech synthesis. The best performance is achieved with five reference speakers, while using ten does not provide further improvement. This indicates that performance does not consistently improve with the number of reference speakers. Nevertheless, our method remains effective even with a limited number of reference speakers.

Author Contributions

Conceptualization, H.J. and Y.J.; methodology, H.J.; software, H.J.; validation, H.J.; resources, H.C.; writing—original draft preparation, H.J.; writing—review and editing, T.K., H.C. and Y.J.; visualization, H.J.; supervision, Y.J.; funding acquisition, T.K. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2026-RS-2024-00436773).

Data Availability Statement

Publicly available datasets were used in this study, including LibriSpeech, Zeroth-Korean Dataset and L2-ARCTIC Dataset. In addition, the AI-Hub Dataset was used, which is available with restricted access upon registration and approval through the AI-Hub platform in accordance with its data usage policy. The sources of all datasets are cited in the reference list.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR: Automatic Speech Recognition
L1: First Language
L2: Second Language
TTS: Text-to-Speech
IPA: International Phonetic Alphabet
LoRA: Low-Rank Adaptation
CER: Character Error Rate
WER: Word Error Rate
PER: Phoneme Error Rate
OOV: Out-of-Vocabulary
NIKL: National Institute of Korean Language
KAEG: Korean-accented English Generator
AHK: AI-Hub Educational Korean English Speech Dataset

References

  1. Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance Analysis across Diverse Accents and Speaker Traits. JASA Express Lett. 2024, 4, 020401.
  2. Kunisetty, J.; Ramachandrula, P.; Vekkot, S.; Gupta, D. Advancing ASR for Indian-Accented English: Dataset Creation and Whisper Fine-Tuning. Procedia Comput. Sci. 2025, 258, 2510–2519.
  3. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227.
  4. Alharbi, S.; Alrazgan, M.; Alrashed, A.; Alnomasi, T.; Almojel, R.; Alharbi, R.; Alharbi, S.; Alturki, S.; Alshehri, F.; Almojil, M. Automatic Speech Recognition: Systematic Literature Review. IEEE Access 2021, 9, 131858–131876.
  5. Ahlawat, H.; Aggarwal, N.; Gupta, D. Automatic Speech Recognition: A Survey of Deep Learning Techniques and Approaches. Int. J. Cogn. Comput. Eng. 2025, 6, 100096.
  6. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518.
  7. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
  8. Ahmad, H.A.; Rashid, T.A. Planning the Development of Text-to-Speech Synthesis Models and Datasets with Dynamic Deep Learning. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102131.
  9. Barakat, H.; Turk, O.; Demiroglu, C. Deep Learning-Based Expressive Speech Synthesis: A Systematic Review of Approaches, Challenges, and Resources. J. Audio Speech Music Process. 2024, 2024, 11.
  10. Chou, C.-K.; Hsu, C.-J.; Chung, H.-L.; Tseng, L.-H.; Cheng, H.-C.; Fu, Y.-K.; Huang, K.-P.; Lee, H.-Y. A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025), Honolulu, HI, USA, 8 December 2025.
  11. Do, C.T.; Imai, S.; Doddipatla, R.; Hain, T. Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis. In Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 26–30 August 2024.
  12. Masson, M.; Carson-Berndsen, J. Investigating the Use of Synthetic Speech Data for the Analysis of Spanish-Accented English Pronunciation Patterns in ASR. In Proceedings of the Synthetic Data’s Transformative Role in Foundational Speech Models (SynData4GenAI), Kos, Greece, 31 August 2024; pp. 81–85.
  13. Karakasidis, G.; Robinson, N.; Getman, Y.; Ogayo, A.; Al-Ghezi, R.; Ayasi, A.; Watanabe, S.; Mortensen, D.R.; Kurimo, M. Multilingual TTS Accent Impressions for Accented ASR. In Proceedings of the Text, Speech, and Dialogue (TSD 2023); Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14102.
  14. Yoo, D.; Shin, J. Study on the Realization of Pause Groups and Breath Groups. Phon. Speech Sci. 2020, 12, 19–31.
  15. National Institute of Korean Language. Pronunciation and Spacing FAQ. Available online: https://korean.go.kr/front/mcfaq/mcfaqView.do?mn_id=62&mcfaq_seq=6806&pageIndex=5 (accessed on 15 February 2026).
  16. Park, J.; Kim, M.; Hong, D.; Lee, J. Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Mumbai, India, 20–24 December 2025. [Google Scholar]
  17. International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; Cambridge University Press: Cambridge, UK, 1999. [Google Scholar]
  18. Zhang, L.; Wu, S.; Wang, Z. Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions. Symmetry 2025, 17, 1478. [Google Scholar] [CrossRef]
  19. Sohn, J.; Jung, H.; Cheng, A.; Kang, J.; Du, Y.; Mortensen, D.R. Zero-shot cross-lingual NER using phonemic representations for low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 13595–13602. [Google Scholar]
  20. Feng, S.; Tu, M.; Xia, R.; Huang, C.; Wang, Y. Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 1384–1388. [Google Scholar] [CrossRef]
  21. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022. arXiv:2106.09685. [Google Scholar]
  22. Mizumoto, T.; Kojima, A.; Fujita, Y.; Liu, L.; Sudo, Y. Is Synthetic Data Truly Effective for Training Speech Language Models? Proc. Interspeech 2025, 2025, 1808–1812. [Google Scholar]
  23. AI-Hub. Korean Speech Dataset. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71463 (accessed on 8 February 2026).
  24. Zhao, G.; Chukharev-Hudilainen, E.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Gutierrez-Osuna, R.; Levis, J. L2-ARCTIC: A Non-Native English Speech Corpus. Proc. Interspeech 2018, 2018, 2783–2787. [Google Scholar]
  25. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  26. Kresnik. Zeroth Korean Text-to-Speech Dataset. Available online: https://huggingface.co/datasets/kresnik/zeroth_korean (accessed on 8 February 2026).
  27. Bernard, M.; Titeux, H. Phonemizer: Text to Phones Transcription for Multiple Languages in Python. J. Open Source Softw. 2021, 6, 3958. [Google Scholar] [CrossRef]
  28. eng-to-ipa. eng-to-ipa: Convert English Text to IPA. Python Package Index (PyPI). 2020. Available online: https://pypi.org/project/eng-to-ipa/ (accessed on 8 February 2026).
  29. Coqui. XTTS: Cross-Lingual Text-to-Speech Model Documentation. Available online: https://docs.coqui.ai/en/latest/models/xtts.html (accessed on 8 February 2026).
  30. Zhang, L.; Wu, S.; Wang, Z. Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition. Sensors 2025, 25, 4288. [Google Scholar] [CrossRef] [PubMed]
  31. Mengke, D.; Mihajlik, P. Impact of Text Origin and Real-Synthetic Data Ratio in TTS-Augmented Low-Resource ASR. In Proceedings of the 2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, 19–22 October 2025; IEEE: New York, NY, USA, 2025; pp. 97–101. [Google Scholar]
  32. Pandey, L.; Arif, A.S. Effects of speaking rate on speech and silent speech recognition. In Proceedings of the CHI Conference on Human Factors in Computing Systems Extended Abstracts; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  33. Moëll, B.; O’Regan, J.; Mehta, S.; Kirkland, A.; Lameris, H.; Gustafson, J.; Beskow, J. Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using wav2vec 2.0 for the PSST Challenge. In Proceedings of the RaPID Workshop (Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive, Psychiatric, and Developmental Impairments), Within the 13th Language Resources and Evaluation Conference, Marseille, France, 25 June 2022. [Google Scholar]
  34. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proc. Interspeech 2019, 2019, 2613–2617. [Google Scholar]
Figure 1. Whisper Architecture.
Figure 2. Architecture of LoRA.
Figure 3. Overview of Our Framework.
Figure 4. IPA Converter.
Figure 5. Korean-accented English Generator.
Figure 6. The process of ASR fine-tuning.
Figure 7. Topic Distribution of the AI-Hub Educational Korean English Speech Dataset.
Figure 8. Similarity analysis of phoneme confusion patterns. (a) AHK speech, (b) speech generated by Azure Korean TTS, (c) speech generated by Azure English TTS, and (d) speech generated by the proposed method.
Table 1. Comparison of studies for improving ASR performance on non-native English speech. O denotes a used strategy, and X denotes an unused strategy.

Method | Target Accent | Data Type | Synthesis/Data Generation Strategy | TTS Retraining | Phonological Control | ASR Adaptation
[2] | Indian | Real | Human-spoken data collection | X | X | Fine-tuning (details not specified)
[3] | Turkish | Real | Existing human speech dataset | O | X | LoRA-based fine-tuning
[13] | Korean, German, Malaysian, Filipino, Arabic, Chinese, Hindi, Spanish, Vietnamese | Real (optional) + Synth | Pretrained commercial TTS-based synthetic speech generation | O | X | Fine-tuning (details not specified)
Ours | Korean | Real + Synth | IPA-based pronunciation conversion with Korean TTS synthesis | X | O | LoRA-based fine-tuning
Table 2. Composition of the AI-Hub Educational Korean English Speech Dataset.

Category | Subcategory | Train | Validation
Sampling rate (kHz) | - | 16 | 16
Speech length (s) | Mean | 21.84 | 21.81
Speech length (s) | Median | 8.89 | 8.88
Number of files | Male | 61,129 |
Number of files | Female | 108,803 |
Number of speakers | Male | 649 |
Number of speakers | Female | 1112 |
Recording device (files) | PC | 148,992 | 18,882
Recording place (files) | Home | 148,617 | 18,834
Recording place (files) | Office | 375 | 48
Table 3. Comparison of Model Performance for Each Fine-Tuning Data Configuration.

Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%)
Tiny | Baseline | 11.71 | 22.97 | 23.76 | - | - | -
Tiny | Human (20 h) | 10.00 | 20.01 | 20.74 | 14.60 | 12.89 | 12.71
Tiny | Human (20 h) + Azure (4 h) | 10.07 | 20.05 | 20.69 | 14.01 | 12.71 | 12.92
Tiny | Human (20 h) + Proposed (4 h) | 9.79 | 19.54 | 20.24 | 16.40 | 14.93 | 14.81
Base | Baseline | 8.49 | 17.01 | 18.22 | - | - | -
Base | Human (20 h) | 8.37 | 16.80 | 18.00 | 1.41 | 1.23 | 1.21
Base | Human (20 h) + Azure (5 h) | 7.88 | 16.18 | 17.38 | 7.18 | 4.88 | 4.61
Base | Human (20 h) + Proposed (5 h) | 7.85 | 16.18 | 17.31 | 7.54 | 4.88 | 4.99
Table 4. Effect of Synthetic Data Scale combined with Non-Synthetic Data.

Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%)
Tiny | Baseline | 11.71 | 22.97 | 23.76 | - | - | -
Tiny | Human (20 h) | 10.00 | 20.01 | 20.74 | 14.60 | 12.89 | 12.71
Tiny | Human (20 h) + Proposed (2 h) | 9.90 | 19.77 | 20.45 | 15.46 | 13.93 | 13.93
Tiny | Human (20 h) + Proposed (4 h) | 9.79 | 19.54 | 20.24 | 16.40 | 14.93 | 14.81
Tiny | Human (20 h) + Proposed (5 h) | 10.00 | 19.86 | 20.56 | 14.60 | 13.54 | 13.47
Tiny | Human (20 h) + Proposed (10 h) | 9.99 | 19.89 | 20.56 | 14.69 | 13.41 | 13.47
Tiny | Human (20 h) + Proposed (20 h) | 10.49 | 20.89 | 21.52 | 10.42 | 9.06 | 9.43
Base | Baseline | 8.49 | 17.01 | 18.22 | - | - | -
Base | Human (20 h) | 8.37 | 16.80 | 18.00 | 1.41 | 1.23 | 1.21
Base | Human (20 h) + Proposed (5 h) | 7.85 | 16.18 | 17.31 | 7.54 | 4.88 | 4.99
Base | Human (20 h) + Proposed (10 h) | 7.90 | 16.15 | 17.28 | 6.95 | 5.06 | 5.16
Base | Human (20 h) + Proposed (20 h) | 8.11 | 16.61 | 17.70 | 4.48 | 2.35 | 2.85
Table 5. ASR Performance on the L2-ARCTIC Korean Subset.

Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%)
Tiny | Baseline | 7.02 | 16.05 | 17.21 | - | - | -
Tiny | Human (20 h) | 6.30 | 14.30 | 15.34 | 10.26 | 10.90 | 10.87
Tiny | Human (20 h) + Azure (4 h) | 6.29 | 14.41 | 15.46 | 10.40 | 10.22 | 10.17
Tiny | Human (20 h) + Proposed (4 h) | 5.95 | 13.86 | 14.80 | 15.24 | 13.64 | 14.00
Base | Baseline | 7.18 | 14.72 | 16.66 | - | - | -
Base | Human (20 h) | 7.19 | 14.74 | 16.84 | −0.14 | −0.14 | −1.08
Base | Human (20 h) + Azure (5 h) | 7.10 | 14.72 | 16.42 | 1.11 | 0.00 | 1.44
Base | Human (20 h) + Proposed (5 h) | 6.93 | 14.58 | 16.39 | 3.48 | 0.95 | 1.62
Table 6. ASR Performance with and without Word-Count Control.

Model | Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%)
Tiny | Baseline | 11.71 | 22.97 | 23.76 | 7.02 | 16.05 | 17.21
Tiny | Human (20 h) | 10.00 | 20.01 | 20.74 | 6.30 | 14.30 | 15.34
Tiny | Human (20 h) + Proposed (4 h, BR-O) | 9.79 | 19.54 | 20.24 | 5.95 | 13.86 | 14.80
Tiny | Human (20 h) + Proposed (4 h, BR-X) | 9.91 | 19.68 | 20.37 | 6.11 | 14.14 | 15.17
Base | Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66
Base | Human (20 h) | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84
Base | Human (20 h) + Proposed (5 h, BR-O) | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39
Base | Human (20 h) + Proposed (5 h, BR-X) | 7.90 | 16.26 | 17.29 | 7.10 | 14.79 | 16.46
Table 7. Comparison with Conventional Augmentation Methods and Their Combination with the Proposed Method.

Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%)
Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66
Human | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84
Human + Noise | 8.35 | 16.79 | 18.00 | 7.21 | 14.73 | 16.80
Human + Speed | 7.50 | 15.35 | 16.35 | 6.82 | 14.34 | 15.97
Human + SpecAugment | 7.96 | 16.33 | 17.43 | 6.92 | 14.56 | 16.68
Human + Proposed | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39
Table 8. ASR Performance with Different Numbers of Reference Speakers.

Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%)
Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66
Human | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84
Human + Proposed (spk1) | 7.89 | 16.29 | 17.30 | 7.12 | 14.70 | 16.49
Human + Proposed (spk3) | 7.85 | 16.18 | 17.22 | 7.03 | 14.66 | 16.41
Human + Proposed (spk5) | 7.79 | 16.04 | 17.10 | 6.92 | 14.43 | 16.12
Human + Proposed (spk10) | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
