Enhancing Korean-Accented English ASR with Transliteration-Based Data Synthesis
Abstract
1. Introduction
- We propose a data synthesis framework that employs IPA as an intermediate phonological representation and utilizes a Korean TTS model (a minimal illustrative sketch follows this list).
- We formulate accent modeling by decoupling pronunciation representation from acoustic generation, eliminating the need for accent-specific TTS retraining.
- Experimental results show that the proposed synthetic data leads to improvements in ASR performance for Korean-accented English across multiple evaluation settings.
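To make the first contribution concrete, the sketch below illustrates the intended two-stage pipeline under stated assumptions, not the authors' implementation: English text is converted to IPA with the eng-to-ipa package (cited in the references), a hypothetical `ipa_to_hangul` stand-in for the paper's IPA converter maps the IPA onto Korean-pronounceable Hangul, and Coqui XTTS (also cited) renders the result with a Korean voice, so Korean-accented English emerges without retraining the TTS.

```python
# Minimal sketch of the two-stage pipeline (not the authors' implementation).
# Assumes `pip install eng-to-ipa TTS` and a Korean reference clip speaker.wav.
import eng_to_ipa as ipa   # English grapheme-to-IPA conversion (cited in refs)
from TTS.api import TTS    # Coqui XTTS cross-lingual synthesizer (cited in refs)

def ipa_to_hangul(ipa_text: str) -> str:
    """Hypothetical stand-in for the paper's IPA converter: approximate each
    English IPA token by the nearest Korean-pronounceable Hangul form.
    A real converter needs a full symbol table and syllabification rules;
    this toy lexicon only covers the demo sentence."""
    toy_lexicon = {"θɪŋk": "씽크", "əˈbaʊt": "어바웃", "ðə": "더", "ˈrɪvər": "리버"}
    return " ".join(toy_lexicon.get(tok, tok) for tok in ipa_text.split())

english_text = "think about the river"
ipa_text = ipa.convert(english_text)    # -> "θɪŋk əˈbaʊt ðə ˈrɪvər" (approx.)
hangul_text = ipa_to_hangul(ipa_text)   # -> "씽크 어바웃 더 리버"

# Render the transliteration with a Korean voice: the acoustics stay Korean
# while the content is the English sentence, so no TTS retraining is needed.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text=hangul_text, speaker_wav="speaker.wav",
                language="ko", file_path="korean_accented_english.wav")
```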
2. Related Work
2.1. Automatic Speech Recognition (ASR)
2.1.1. Whisper Model
2.1.2. Fine-Tuning ASR Models for Accented Speech
2.2. Text-to-Speech (TTS)
2.2.1. TTS for Accented Speech Generation
2.2.2. Breath Group Control
2.3. Phoneme-Based Representation and Conversion
2.3.1. IPA-Based Phoneme Conversion for Korean-Accented English
2.3.2. Phoneme-Based Approaches for ASR and Representation Learning
2.4. Low-Rank Adaptation (LoRA)
2.4.1. Definition of LoRA
2.4.2. LoRA-Based Whisper Fine-Tuning on Low-Resource Languages
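As background for Sections 2.4.1 and 2.4.2: LoRA freezes a pretrained weight matrix W₀ ∈ ℝ^(d×k) and learns only a low-rank update ΔW = BA, with B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k), so the adapted layer computes W₀x + BAx. The sketch below attaches LoRA adapters to a Whisper checkpoint with Hugging Face PEFT; it is a minimal illustration, and the rank, scaling, and target modules are assumptions rather than the paper's configuration.

```python
# Minimal sketch: wrap Whisper with LoRA adapters via Hugging Face PEFT.
# Hyperparameters (r, alpha, dropout, target modules) are illustrative only.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

lora_cfg = LoraConfig(
    r=8,                     # rank of the update factors B (d x r), A (r x k)
    lora_alpha=32,           # scaling: the layer adds (alpha / r) * B @ A @ x
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common pick
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank factors are trainable
```

Training then proceeds with an ordinary sequence-to-sequence fine-tuning loop; the frozen base weights keep the pretrained knowledge while the adapters absorb the accent-specific shift.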
3. Proposed Method
3.1. Overview of the Proposed Approach
3.2. IPA Converter Architecture
3.3. Korean-Accented English Generator (KAEG)
3.4. ASR Fine-Tuning
4. Experimental Setup
4.1. Dataset
4.2. TTS
4.2.1. IPA Conversion
4.2.2. KAEG
4.3. ASR
4.4. Evaluation Metrics
5. Experimental Results
5.1. Similarity Analysis Between Synthetic and Real Speech
5.2. Comparison of Model Performance Across Data Configurations
5.3. Effect of Synthetic Data Scale on ASR Performance
5.4. Out-of-Domain ASR Performance on L2-ARCTIC (Korean Subset)
5.5. Effect of Word-Count Control Within Breath Units
6. Conclusions
7. Ablation Study
7.1. Comparison with Conventional Augmentation Methods
7.2. Effectiveness with a Reduced Number of Reference Speakers
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ASR | Automatic Speech Recognition |
| L1 | First Language |
| L2 | Second Language |
| TTS | Text-to-Speech |
| IPA | International Phonetic Alphabet |
| LoRA | Low-Rank Adaptation |
| CER | Character Error Rate |
| WER | Word Error Rate |
| PER | Phoneme Error Rate |
| OOV | Out-of-Vocabulary |
| NIKL | National Institute of Korean Language |
| KAEG | Korean-Accented English Generator |
| AHK | AI-Hub Educational Korean English Speech Dataset |
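For reference, the CER, WER, and PER values in the result tables below are edit-distance rates at the character, word, and phoneme level, normalized by the reference length. The sketch below shows one way to compute them; the jiwer and eng-to-ipa packages are assumptions of convenience (only the latter is cited in the references), and the phoneme segmentation is a simplification of whatever the authors actually used.

```python
# Minimal sketch: computing the three reported metrics with jiwer.
import jiwer
import eng_to_ipa as ipa

reference = "think about the river"
hypothesis = "sink about the liver"

wer = jiwer.wer(reference, hypothesis)  # word-level edits / reference words
cer = jiwer.cer(reference, hypothesis)  # character-level edits / reference chars

def per(ref: str, hyp: str) -> float:
    """Phoneme error rate as WER over individual IPA symbols (a simplification;
    the paper's exact phoneme segmentation may differ)."""
    to_symbols = lambda s: " ".join(ipa.convert(s).replace(" ", ""))
    return jiwer.wer(to_symbols(ref), to_symbols(hyp))

print(f"WER={wer:.3f}  CER={cer:.3f}  PER={per(reference, hypothesis):.3f}")
```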
References
- Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance Analysis across Diverse Accents and Speaker Traits. JASA Express Lett. 2024, 4, 020401.
- Kunisetty, J.; Ramachandrula, P.; Vekkot, S.; Gupta, D. Advancing ASR for Indian-Accented English: Dataset Creation and Whisper Fine-Tuning. Procedia Comput. Sci. 2025, 258, 2510–2519.
- Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227.
- Alharbi, S.; Alrazgan, M.; Alrashed, A.; Alnomasi, T.; Almojel, R.; Alharbi, R.; Alharbi, S.; Alturki, S.; Alshehri, F.; Almojil, M. Automatic Speech Recognition: Systematic Literature Review. IEEE Access 2021, 9, 131858–131876.
- Ahlawat, H.; Aggarwal, N.; Gupta, D. Automatic Speech Recognition: A Survey of Deep Learning Techniques and Approaches. Int. J. Cogn. Comput. Eng. 2025, 6, 100096.
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Ahmad, H.A.; Rashid, T.A. Planning the Development of Text-to-Speech Synthesis Models and Datasets with Dynamic Deep Learning. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102131.
- Barakat, H.; Turk, O.; Demiroglu, C. Deep Learning-Based Expressive Speech Synthesis: A Systematic Review of Approaches, Challenges, and Resources. J. Audio Speech Music Process. 2024, 2024, 11.
- Chou, C.-K.; Hsu, C.-J.; Chung, H.-L.; Tseng, L.-H.; Cheng, H.-C.; Fu, Y.-K.; Huang, K.-P.; Lee, H.-Y. A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025), Honolulu, HI, USA, 8 December 2025.
- Do, C.T.; Imai, S.; Doddipatla, R.; Hain, T. Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis. In Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 26–30 August 2024.
- Masson, M.; Carson-Berndsen, J. Investigating the Use of Synthetic Speech Data for the Analysis of Spanish-Accented English Pronunciation Patterns in ASR. In Proceedings of Synthetic Data’s Transformative Role in Foundational Speech Models (SynData4GenAI), Kos, Greece, 31 August 2024; pp. 81–85.
- Karakasidis, G.; Robinson, N.; Getman, Y.; Ogayo, A.; Al-Ghezi, R.; Ayasi, A.; Watanabe, S.; Mortensen, D.R.; Kurimo, M. Multilingual TTS Accent Impressions for Accented ASR. In Proceedings of Text, Speech, and Dialogue (TSD 2023); Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14102.
- Yoo, D.; Shin, J. Study on the Realization of Pause Groups and Breath Groups. Phon. Speech Sci. 2020, 12, 19–31.
- National Institute of Korean Language. Pronunciation and Spacing FAQ. Available online: https://korean.go.kr/front/mcfaq/mcfaqView.do?mn_id=62&mcfaq_seq=6806&pageIndex=5 (accessed on 15 February 2026).
- Park, J.; Kim, M.; Hong, D.; Lee, J. Compositional Phoneme Approximation for L1-Grounded L2 Pronunciation Training. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Mumbai, India, 20–24 December 2025.
- International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; Cambridge University Press: Cambridge, UK, 1999.
- Zhang, L.; Wu, S.; Wang, Z. Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions. Symmetry 2025, 17, 1478.
- Sohn, J.; Jung, H.; Cheng, A.; Kang, J.; Du, Y.; Mortensen, D.R. Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 13595–13602.
- Feng, S.; Tu, M.; Xia, R.; Huang, C.; Wang, Y. Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition. In Proceedings of Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 1384–1388.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual Event, 25–29 April 2022; arXiv:2106.09685.
- Mizumoto, T.; Kojima, A.; Fujita, Y.; Liu, L.; Sudo, Y. Is Synthetic Data Truly Effective for Training Speech Language Models? Proc. Interspeech 2025, 2025, 1808–1812.
- AI-Hub. Korean Speech Dataset. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71463 (accessed on 8 February 2026).
- Zhao, G.; Chukharev-Hudilainen, E.; Sonsaat, S.; Silpachai, A.; Lucic, I.; Gutierrez-Osuna, R.; Levis, J. L2-ARCTIC: A Non-Native English Speech Corpus. Proc. Interspeech 2018, 2018, 2783–2787.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Kresnik. Zeroth Korean Text-to-Speech Dataset. Available online: https://huggingface.co/datasets/kresnik/zeroth_korean (accessed on 8 February 2026).
- Bernard, M.; Titeux, H. Phonemizer: Text to Phones Transcription for Multiple Languages in Python. J. Open Source Softw. 2021, 6, 3958.
- eng-to-ipa. eng-to-ipa: Convert English Text to IPA. Python Package Index (PyPI), 2020. Available online: https://pypi.org/project/eng-to-ipa/ (accessed on 8 February 2026).
- Coqui. XTTS: Cross-Lingual Text-to-Speech Model Documentation. Available online: https://docs.coqui.ai/en/latest/models/xtts.html (accessed on 8 February 2026).
- Zhang, L.; Wu, S.; Wang, Z. Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition. Sensors 2025, 25, 4288.
- Mengke, D.; Mihajlik, P. Impact of Text Origin and Real-Synthetic Data Ratio in TTS-Augmented Low-Resource ASR. In Proceedings of the 2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, 19–22 October 2025; IEEE: New York, NY, USA, 2025; pp. 97–101.
- Pandey, L.; Arif, A.S. Effects of Speaking Rate on Speech and Silent Speech Recognition. In Proceedings of the CHI Conference on Human Factors in Computing Systems Extended Abstracts; Association for Computing Machinery: New York, NY, USA, 2022.
- Moëll, B.; O’Regan, J.; Mehta, S.; Kirkland, A.; Lameris, H.; Gustafson, J.; Beskow, J. Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using wav2vec 2.0 for the PSST Challenge. In Proceedings of the RaPID Workshop (Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive, Psychiatric, and Developmental Impairments) at the 13th Language Resources and Evaluation Conference, Marseille, France, 25 June 2022.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proc. Interspeech 2019, 2019, 2613–2617.
| Method | Target Accent | Data Type | Synthesis/Data Generation Strategy | TTS Retraining | Phonological Control | ASR Adaptation |
|---|---|---|---|---|---|---|
| [2] | Indian | Real | Human-spoken data collection | X | X | Fine-tuning (details not specified) |
| [3] | Turkish | Real | Existing human speech dataset | O | X | LoRA-based fine-tuning |
| [13] | Korean, German, Malaysian, Filipino, Arabic, Chinese, Hindi, Spanish, Vietnamese | Real (optional) + Synth | Pretrained commercial TTS-based synthetic speech generation | O | X | Fine-tuning (details not specified) |
| Ours | Korean | Real + Synth | IPA-based pronunciation conversion with Korean TTS synthesis | X | O | LoRA-based fine-tuning |

O: applied; X: not applied.
| Category | Subcategory | Train | Validation |
|---|---|---|---|
| Sampling rate (kHz) | - | 16 | 16 |
| Speech length (s) | Mean | 21.84 | 21.81 |
| | Median | 8.89 | 8.88 |
| Number of files | Male | 61,129 | |
| | Female | 108,803 | |
| Number of speakers | Male | 649 | |
| | Female | 1112 | |
| Recording device (files) | PC | 148,992 | 18,882 |
| Recording place (files) | Home | 148,617 | 18,834 |
| | Office | 375 | 48 |
| Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%) |
|---|---|---|---|---|---|---|---|
| Tiny | Baseline | 11.71 | 22.97 | 23.76 | - | - | - |
| | Human (20 h) | 10.00 | 20.01 | 20.74 | 14.60 | 12.89 | 12.71 |
| | Human (20 h) + Azure (4 h) | 10.07 | 20.05 | 20.69 | 14.01 | 12.71 | 12.92 |
| | Human (20 h) + Proposed (4 h) | 9.79 | 19.54 | 20.24 | 16.40 | 14.93 | 14.81 |
| Base | Baseline | 8.49 | 17.01 | 18.22 | - | - | - |
| | Human (20 h) | 8.37 | 16.80 | 18.00 | 1.41 | 1.23 | 1.21 |
| | Human (20 h) + Azure (5 h) | 7.88 | 16.18 | 17.38 | 7.18 | 4.88 | 4.61 |
| | Human (20 h) + Proposed (5 h) | 7.85 | 16.18 | 17.31 | 7.54 | 4.88 | 4.99 |
| Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%) |
|---|---|---|---|---|---|---|---|
| Tiny | Baseline | 11.71 | 22.97 | 23.76 | - | - | - |
| | Human (20 h) | 10.00 | 20.01 | 20.74 | 14.60 | 12.89 | 12.71 |
| | Human (20 h) + Proposed (2 h) | 9.90 | 19.77 | 20.45 | 15.46 | 13.93 | 13.93 |
| | Human (20 h) + Proposed (4 h) | 9.79 | 19.54 | 20.24 | 16.40 | 14.93 | 14.81 |
| | Human (20 h) + Proposed (5 h) | 10.00 | 19.86 | 20.56 | 14.60 | 13.54 | 13.47 |
| | Human (20 h) + Proposed (10 h) | 9.99 | 19.89 | 20.56 | 14.69 | 13.41 | 13.47 |
| | Human (20 h) + Proposed (20 h) | 10.49 | 20.89 | 21.52 | 10.42 | 9.06 | 9.43 |
| Base | Baseline | 8.49 | 17.01 | 18.22 | - | - | - |
| | Human (20 h) | 8.37 | 16.80 | 18.00 | 1.41 | 1.23 | 1.21 |
| | Human (20 h) + Proposed (5 h) | 7.85 | 16.18 | 17.31 | 7.54 | 4.88 | 4.99 |
| | Human (20 h) + Proposed (10 h) | 7.90 | 16.15 | 17.28 | 6.95 | 5.06 | 5.16 |
| | Human (20 h) + Proposed (20 h) | 8.11 | 16.61 | 17.70 | 4.48 | 2.35 | 2.85 |
| Model | Training Configuration | CER (%) | WER (%) | PER (%) | Rel. Impr. CER (%) | Rel. Impr. WER (%) | Rel. Impr. PER (%) |
|---|---|---|---|---|---|---|---|
| Tiny | Baseline | 7.02 | 16.05 | 17.21 | - | - | - |
| | Human (20 h) | 6.30 | 14.30 | 15.34 | 10.26 | 10.90 | 10.87 |
| | Human (20 h) + Azure (4 h) | 6.29 | 14.41 | 15.46 | 10.40 | 10.22 | 10.17 |
| | Human (20 h) + Proposed (4 h) | 5.95 | 13.86 | 14.80 | 15.24 | 13.64 | 14.00 |
| Base | Baseline | 7.18 | 14.72 | 16.66 | - | - | - |
| | Human (20 h) | 7.19 | 14.74 | 16.84 | −0.14 | −0.14 | −1.08 |
| | Human (20 h) + Azure (5 h) | 7.10 | 14.72 | 16.42 | 1.11 | 0.00 | 1.44 |
| | Human (20 h) + Proposed (5 h) | 6.93 | 14.58 | 16.39 | 3.48 | 0.95 | 1.62 |
| Model | Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%) |
|---|---|---|---|---|---|---|---|
| Tiny | Baseline | 11.71 | 22.97 | 23.76 | 7.02 | 16.05 | 17.21 |
| | Human (20 h) | 10.00 | 20.01 | 20.74 | 6.30 | 14.30 | 15.34 |
| | Human (20 h) + Proposed (4 h, BR-O) | 9.79 | 19.54 | 20.24 | 5.95 | 13.86 | 14.80 |
| | Human (20 h) + Proposed (4 h, BR-X) | 9.91 | 19.68 | 20.37 | 6.11 | 14.14 | 15.17 |
| Base | Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66 |
| | Human (20 h) | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84 |
| | Human (20 h) + Proposed (5 h, BR-O) | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39 |
| | Human (20 h) + Proposed (5 h, BR-X) | 7.90 | 16.26 | 17.29 | 7.10 | 14.79 | 16.46 |

BR-O: with word-count control within breath units; BR-X: without.
| Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%) |
|---|---|---|---|---|---|---|
| Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66 |
| Human | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84 |
| Human + Noise | 8.35 | 16.79 | 18.00 | 7.21 | 14.73 | 16.80 |
| Human + Speed | 7.50 | 15.35 | 16.35 | 6.82 | 14.34 | 15.97 |
| Human + SpecAugment | 7.96 | 16.33 | 17.43 | 6.92 | 14.56 | 16.68 |
| Human + Proposed | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39 |
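For context on the comparison above (cf. Section 7.1): Noise, Speed, and SpecAugment are standard signal- and feature-level augmentations. The sketch below shows one common realization with torchaudio; the file path, noise scale, speed factor, and mask parameters are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the three comparator augmentations (illustrative values).
import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

waveform, sr = torchaudio.load("utterance.wav")  # placeholder path

# Noise: additive Gaussian noise at a fixed scale.
noisy = waveform + 0.005 * torch.randn_like(waveform)

# Speed: relabel the audio as if sampled at sr * factor, then resample back
# to sr, which shortens it and raises pitch (the classic speed perturbation).
factor = 1.1
faster = F.resample(waveform, orig_freq=int(sr * factor), new_freq=sr)

# SpecAugment-style masking (Park et al., 2019; see references):
# frequency and time masks applied to a log-mel spectrogram.
mel = T.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)
masked = T.FrequencyMasking(freq_mask_param=27)(log_mel)
masked = T.TimeMasking(time_mask_param=100)(masked)
```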
| Training Configuration | AHK CER (%) | AHK WER (%) | AHK PER (%) | L2-ARCTIC CER (%) | L2-ARCTIC WER (%) | L2-ARCTIC PER (%) |
|---|---|---|---|---|---|---|
| Baseline | 8.49 | 17.01 | 18.22 | 7.18 | 14.72 | 16.66 |
| Human | 8.37 | 16.80 | 18.00 | 7.19 | 14.74 | 16.84 |
| Human + Proposed (spk1) | 7.89 | 16.29 | 17.30 | 7.12 | 14.70 | 16.49 |
| Human + Proposed (spk3) | 7.85 | 16.18 | 17.22 | 7.03 | 14.66 | 16.41 |
| Human + Proposed (spk5) | 7.79 | 16.04 | 17.10 | 6.92 | 14.43 | 16.12 |
| Human + Proposed (spk0) | 7.85 | 16.18 | 17.31 | 6.93 | 14.58 | 16.39 |