Article

Transliteration-Aided Transfer Learning for Low-Resource ASR: A Case Study on Khalkha Mongolian

1 Department of Telecommunications and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary
2 Hungarian Research Centre for Linguistics, 1068 Budapest, Hungary
* Author to whom correspondence should be addressed.
Electronics 2025, 14(6), 1137; https://doi.org/10.3390/electronics14061137
Submission received: 24 January 2025 / Revised: 12 March 2025 / Accepted: 12 March 2025 / Published: 14 March 2025
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)

Abstract

Automatic Speech Recognition (ASR) systems have advanced consistently, achieving notable improvements in state-of-the-art performance across many languages. However, their effectiveness often declines significantly in low-resource settings, where data and linguistic resources are limited. This paper addresses the challenges of ASR for a low-resource language, Khalkha Mongolian, by leveraging a transliteration-aided transfer learning approach. Specifically, it improves Khalkha Mongolian ASR by transliterating the text of a well-resourced Chakhar Mongolian (Uighur script) dataset into the Cyrillic script, fine-tuning a pre-trained model on the transliterated data, and then further fine-tuning the resulting model with Khalkha Mongolian data. The method effectively enhances ASR performance for Khalkha Mongolian. The effectiveness of the proposed method was validated on three popular ASR models: Wav2Vec2-BERT, Conformer-Large, and Whisper-large-v3. Among these models, the best relative improvement in word error rate (WER) reaches 32.50%, while the absolute improvement reaches 19.26%.

1. Introduction

The Mongolian language is agglutinative, belongs to the Mongolian group of the Altaic language family, and has a developmental history of thousands of years. The earliest records of the Mongolian language date back to the 13th century and are written in an ancient script called the Uighur Mongolian script [1]. Speakers of Mongolian are primarily distributed across Mongolia, China (especially the Inner Mongolia Autonomous Region), and the Russian Republics of Buryatia and Kalmykia [2]. In Mongolia, the Khalkha dialect dominates; in China, the standard Mongolian language is the Chakhar dialect; and in Russia, Buryat and Kalmyk are also considered branches of Mongolian. Linguistically, Khalkha and Chakhar are the two closest Mongolian dialects and also the two most widely spoken, with many similarities in the spoken language. However, their writing systems differ significantly: Mongolia has adopted the Cyrillic alphabet as its official script, while the Inner Mongolia Autonomous Region of China continues to use the Uighur Mongolian script, also called traditional Mongolian [3]. Consequently, a word can be written in either script while its pronunciation remains the same. Figure 1 shows how the phrase “What’s the weather like today? Today is cloudy” is written in the two writing systems (Cyrillic Mongolian on the top; Uighur Mongolian on the bottom).
Mongolian, as a low-resource language, has a highly limited corpus for ASR tasks, and speech recognition systems for Khalkha Mongolian in Cyrillic script, in particular, are in dire need of enhancement [3]. The datasets currently available for Khalkha Mongolian in Cyrillic script are very limited. In contrast, the corpus of Chakhar Mongolian in Uighur script is relatively well established. Several relatively well-established Mongolian speech datasets are publicly available, including M2ASR-MONGO [4], containing 170 h of speech from 259 native speakers; MNASR [5], currently one of the largest Mongolian speech corpora with 345 h of speech data; and IMUT-MC [6], a corpus consisting of conversational (IMUT-MC2 and IMUT-MC3) and news-domain (IMUT-MC4) data.
Automatic Speech Recognition (ASR) for low-resource languages plays a vital role in advancing language diversity by enabling speech-based technologies, improving accessibility, and aiding linguistic preservation [7,8]. However, ASR models for low-resource languages often yield unsatisfactory results due to the scarcity of available corpora. In such end-to-end machine learning tasks, transfer learning is considered a highly effective solution when training resources for the target task are insufficient [9,10,11]. In ASR, this is often called cross-lingual transfer learning [12,13,14,15]: performance on low-resource languages is improved by pre-training models on one or more high-resource languages and then transferring these pre-trained models to the low-resource languages. Fine-tuning multilingual pre-trained models is a common recent approach [16,17,18]. For example, Li et al. used wav2vec 2.0 to study speech recognition for the Khinalug language [19] and showed that multilingual pre-training is very helpful for fine-tuning on less-resourced languages. There have also been attempts with monolingual pre-trained models [20,21]. For example, Mihajlik et al. demonstrated significant improvements by applying cross-lingual transfer learning with a monolingual Conformer model for Hungarian [22], clearly outperforming other approaches such as QuartzNet [23].
This paper is inspired by a transfer learning strategy that pre-trains on speech data from a high-resource language (e.g., English) whose transcripts have been mapped into the script of the target low-resource language [24]. This simple character mapping significantly enhances the sharing of the output space between the two languages, even if they come from unrelated language families.
This paper uses a similar transliteration procedure to generate a variant dataset by transliterating the textual part (Uighur script) of a Chakhar Mongolian dataset directly into the target written form (Cyrillic). Unlike the studies above, which paired unrelated languages, the two dialects considered here differ mainly in their written forms while their pronunciations are very similar. The transliterated dataset was then used, together with a small natural Khalkha Mongolian dataset, in a two-stage fine-tuning procedure to achieve much better Automatic Speech Recognition of the target dialect (Khalkha Mongolian). This study contributes to the enhancement of ASR for Khalkha Mongolian in Cyrillic script and may also inspire improvements to ASR systems for other less-resourced languages.

2. Methodology

In this paper, we propose a method that utilizes a Mongolian dataset from China (Chakhar dialect with Uighur script) to help construct an Automatic Speech Recognition model for the Mongolian language used in Mongolia (Khalkha dialect with Cyrillic script). The specific implementation process is shown in Figure 2.

2.1. Transliteration

Transliterating traditional Mongolian into Cyrillic Mongolian is an important task, especially in the fields of language processing, Automatic Speech Recognition, and text processing. However, although both represent the Mongolian language, the transliteration of traditional Mongolian into Cyrillic Mongolian is a complex process [25]. Because of differences between the two scripts in phonetics, writing rules, and grammar, transliteration is not a simple letter-to-letter substitution; it needs to take into account pronunciation rules, phonological phenomena, lexical changes, and other factors. Figure 3 shows the alphabet of traditional Mongolian, the G2G (Grapheme-to-Grapheme) correspondences, and the G2P (Grapheme-to-Phoneme) correspondences when converting to Cyrillic Mongolian. Traditional Mongolian has 8 vowels and 27 consonants, while Cyrillic Mongolian has 13 vowels and 20 consonants plus a hard sign Electronics 14 01137 i001 and a soft sign Electronics 14 01137 i002 [26]. Vowels are marked in red. Given these differences, the transliteration process faces the following key difficulties.
There are many instances of incomplete correspondence between letters and phonemes when converting traditional Mongolian into Cyrillic Mongolian [26]. Traditional Mongolian is a phonetic script based on syllables, while Cyrillic Mongolian is a purely phonetic script. Some traditional Mongolian letters are pronounced differently in different contexts, so their transcriptions into Cyrillic Mongolian do not always correspond one to one. This makes transliteration from traditional Mongolian to Cyrillic Mongolian difficult. In many cases, the number of letters used to write the same word also differs between traditional Mongolian and Cyrillic Mongolian [27]. For example, although the word “bad” is pronounced the same in both traditional Mongolian and Cyrillic Mongolian, it is spelled as “Electronics 14 01137 i003” (4 letters) in traditional Mongolian, whereas in Cyrillic Mongolian, it appears as “Electronics 14 01137 i004” (3 letters). Figure 4 shows this one-to-many phenomenon in the G2P correspondences when converting traditional Mongolian into Cyrillic Mongolian; the letters that do not correspond are marked in red.
During the transliteration of traditional Mongolian into Cyrillic Mongolian, there are many instances where letters are dropped or appended. Moreover, vowel harmony must be considered, not only within the words themselves but also in the suffixes. Mongolian exhibits vowel harmony, where words and affixes adjust to specific vowel categories. Additionally, certain consonants may gain or lose vowels in both traditional Mongolian and Cyrillic Mongolian to adhere to phonetic rules. Figure 5 illustrates an example of letter dropping, an example of letter appending, and three examples of vowel harmonization; the variations are marked in red.
Additionally, because certain words differ between traditional Mongolian and Cyrillic Mongolian, the translation process sometimes relies on phonetic representation rather than direct semantic equivalence. For instance, the traditional Mongolian word for plate is spelled and pronounced as Electronics 14 01137 i005 (pil). Although its direct semantic translation into Cyrillic Mongolian would be Electronics 14 01137 i006 (tavag), we have chosen to render it as Electronics 14 01137 i007 (pil) in Cyrillic in order to preserve the alignment between acoustic and textual features in the dataset and thus facilitate subsequent fine-tuning. It should be noted, however, that Electronics 14 01137 i007 is not recognized as a standard term in Cyrillic Mongolian usage.
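To illustrate the grapheme-level part of this process, the following is a minimal Python sketch of a rule-based G2G transliterator. The mapping table covers only a few letters and ignores the context-dependent rules, letter dropping/appending, and vowel harmony discussed above; the letter choices and the sample word are illustrative assumptions rather than the rule set actually used in this work.

```python
# Minimal rule-based G2G (grapheme-to-grapheme) transliteration sketch.
# Only a handful of traditional Mongolian letters are mapped; the context-dependent
# pronunciations, letter dropping/appending, and vowel harmony (Figures 4 and 5)
# would require additional, context-sensitive rules.

G2G_MAP = {
    "\u1820": "а",  # MONGOLIAN LETTER A  -> Cyrillic а
    "\u1821": "э",  # MONGOLIAN LETTER E  -> Cyrillic э
    "\u1822": "и",  # MONGOLIAN LETTER I  -> Cyrillic и
    "\u1823": "о",  # MONGOLIAN LETTER O  -> Cyrillic о
    "\u1828": "н",  # MONGOLIAN LETTER NA -> Cyrillic н
    "\u182A": "б",  # MONGOLIAN LETTER BA -> Cyrillic б
}

def transliterate(word: str) -> str:
    """Letter-by-letter substitution; unmapped characters pass through unchanged."""
    return "".join(G2G_MAP.get(ch, ch) for ch in word)

if __name__ == "__main__":
    sample = "\u182A\u1820\u1828"      # illustrative traditional-script letter sequence
    print(transliterate(sample))       # -> "бан"
```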

2.2. Fine-Tuning of End-to-End Transfer Learning

First, the text part of the relatively well-resourced Chakhar-dialect dataset, written in the Uighur Mongolian script, is transliterated into the Cyrillic script, yielding a new dataset in which Chakhar-dialect speech is paired with Cyrillic text. The pre-trained model is then fine-tuned on this dataset to obtain an extended ASR model capable of outputting Cyrillic characters for Chakhar Mongolian speech. However, there are still some obvious differences between Chakhar Mongolian and Khalkha Mongolian, such as pronunciation styles and special vocabularies. For example, some vocabulary in the standardized Chakhar language of China uses words borrowed from Chinese, while in Mongolia, the corresponding words are borrowed from Russian. Therefore, to make the model effective at recognizing Khalkha Mongolian in Cyrillic script, the extended model is further fine-tuned using a very limited Mongolian dataset from Mongolia, which largely reflects the Khalkha dialect and is transcribed in Cyrillic script. After these steps, the proposed model is obtained.

3. Experiment

To verify our methodology, we transliterated a Chakhar-accented Mongolian dataset from the Uighur script into the Cyrillic script and performed an initial fine-tuning of three popular pre-trained models; we then further fine-tuned the resulting models on a small native Khalkha-dialect Mongolian dataset in Cyrillic script.

3.1. Datasets

For initial fine-tuning, we used a Mongolian dataset named M2ASR-MONGO (Cyrillic) with the Chakhar dialect in Cyrillic script. This dataset was created by transliterating the Uighur script part of the M2ASR-MONGO (Uighur) dataset [4] into Cyrillic script and filtering it to obtain a new dataset comprising 147 h of training, 9.5 h of validation, and 6.1 h of testing. The M2ASR-MONGO dataset was built from a 57 MB text corpus collected from the Internet, with news being the primary genre. For further fine-tuning, we used a Mongolian dataset (Khalkha dialect with Cyrillic script) from Common Voice v17 [28], which includes a 3 h training set, a 2.7 h validation set, and a 3 h test set. The content of the Common Voice dataset covers a wide range of topics, including daily conversations, news interviews, and social events, as well as narratives, literary works, stories, and legends.
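For reference, the Khalkha Mongolian portion of Common Voice v17 can be loaded with the Hugging Face datasets library. The following sketch assumes the public Hub identifier mozilla-foundation/common_voice_17_0 and the mn language code, which are our assumptions about the exact distribution used; access may require accepting the dataset terms on the Hub.

```python
# Sketch: loading the Khalkha Mongolian (Cyrillic) split of Common Voice v17.
# The Hub identifier and language code are assumptions; the transliterated
# M2ASR-MONGO (Cyrillic) data are not publicly distributed in this form.
from datasets import Audio, load_dataset

common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "mn")

# Resample every clip to 16 kHz, the input rate expected by the ASR models used here.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

print({split: len(ds) for split, ds in common_voice.items()})
print(common_voice["train"][0]["sentence"])  # a Cyrillic-script transcript
```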

3.2. Experimental Setup

All experiments were conducted on an NVIDIA A6000 GPU (48 GB memory) using the Mongolian dataset from Common Voice v17, evaluating three models: Wav2Vec2-BERT [29], Whisper-large-v3 [30], and Conformer-Large [31].
Both Wav2Vec2-BERT and Whisper-large-v3 were implemented via Hugging Face, using Facebook’s w2v-bert-2.0 (https://huggingface.co/facebook/w2v-bert-2.0, accessed on 14 August 2024) (multilingual) and OpenAI’s whisper-large-v3 (https://huggingface.co/openai/whisper-large-v3, accessed on 25 August 2024) (multilingual) pre-trained models, respectively. Both models were trained with a batch size of 16, a gradient accumulation of 2, and mixed-precision training (FP16). The learning rate was set to 1 × 10−5, with a 500-step warm-up. For the Conformer-Large model, we used the NVIDIA NeMo toolkit and experimented with an English pre-trained model (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large, accessed on 8 July 2024) from NVIDIA. To enable Cyrillic Mongolian processing, we employed a BPE tokenizer trained on Common Voice v17. The training parameters for this model were a learning rate of 0.002 and a batch size of 32, while the remaining hyperparameters followed NeMo’s default settings. Conformer-Large and Whisper-large-v3 used CTC loss with the AdamW optimizer. Wav2Vec2-BERT applied CTC loss in the linear projection layer and CMLM loss in the feed-forward layer, with the Adam optimizer. Beyond the settings above, no additional language models or lexicon constraints were applied to any of the three models.
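As a concrete illustration of these hyperparameters, the sketch below shows how the Wav2Vec2-BERT CTC fine-tuning could be configured with the transformers Trainer API. The class names follow the public transformers API for w2v-bert-2.0; the vocabulary size, output directory, and epoch count are placeholders, so this is an approximation of the setup rather than the authors' exact code.

```python
# Sketch of the Wav2Vec2-BERT fine-tuning configuration described above:
# batch size 16, gradient accumulation 2, FP16, learning rate 1e-5, 500 warm-up steps.
# The tokenizer/processor construction and the CTC data collator are omitted.
from transformers import TrainingArguments, Wav2Vec2BertForCTC

VOCAB_SIZE = 40  # placeholder: size of the Cyrillic Mongolian character vocabulary

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=VOCAB_SIZE,        # new CTC head sized for the Cyrillic character set
    ctc_loss_reduction="mean",
)

training_args = TrainingArguments(
    output_dir="w2v-bert-2.0-mn",     # placeholder output directory
    per_device_train_batch_size=16,   # batch size 16
    gradient_accumulation_steps=2,    # gradient accumulation of 2
    fp16=True,                        # mixed-precision training
    learning_rate=1e-5,               # learning rate 1e-5
    warmup_steps=500,                 # 500-step warm-up
    num_train_epochs=20,              # e.g., DFT trained to convergence (Section 4.2)
)
```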

4. Methods

We used three baselines in this paper: zero-shot evaluation (Section 4.1), direct fine-tuning (DFT) (Section 4.2), and merged dataset fine-tuning (Section 4.3). The proposed core method consists of two sequential fine-tuning stages: extended fine-tuning on the transliterated dataset (Section 4.4) and targeted fine-tuning on the low-resource dataset (Section 4.5).

4.1. Zero-Shot Evaluation

The first baseline is zero-shot evaluation, which is performed solely with Whisper-large-v3 on the Mongolian dataset (Khalkha dialect; Cyrillic script) from Common Voice v17. Although Wav2Vec2-BERT is a multilingual model, it essentially only has an encoder without an inherent decoder or classification head. In a zero-shot scenario, decoding would require randomly initializing a linear layer, which does not offer a fair comparison with models like Whisper that come with a fully trained encoder–decoder. Similarly, the Conformer-Large model is based on English pre-training, making zero-shot testing unsuitable for it. In contrast, Whisper-large-v3 was trained with a small amount of Mongolian data, making it appropriate for zero-shot evaluation.

4.2. DFT (Direct Fine-Tuning)

The second baseline is to directly fine-tune the three pre-trained models mentioned above on the Mongolian dataset (Khalkha dialect with Cyrillic script) of Common Voice v17. Here, all hyperparameter settings were kept consistent with the preparatory experimental procedure described above. For direct fine-tuning (DFT), both Wav2Vec2-BERT and Whisper-large-v3 were trained until full convergence (20 epochs), while Conformer-Large needed more training iterations for saturated accuracy (80 epochs).

4.3. MDFT (Merged Dataset Fine-Tuning)

In the third baseline, the transliterated M2ASR-MONGO (Cyrillic) dataset and the Common Voice v17 (Cyrillic) dataset were merged, and the three pre-trained models were directly fine-tuned on the merged data. Because of the change in dataset size, the number of training epochs was adjusted to keep the experiments comparable: 4.5 epochs for Wav2Vec2-BERT and Whisper-large-v3 and 49 epochs for Conformer-Large.
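A possible way to build the merged training set, assuming both corpora have been prepared as Hugging Face Dataset objects with matching columns (the variable names are placeholders, not the authors' actual objects), is shown below.

```python
# Sketch: building the MDFT training set by concatenating the transliterated
# M2ASR-MONGO (Cyrillic) data with the Common Voice v17 Khalkha training set.
# `m2asr_cyrillic` and `common_voice` are placeholder Dataset objects assumed
# to share the same column layout (audio + Cyrillic transcript).
from datasets import concatenate_datasets

merged_train = concatenate_datasets(
    [m2asr_cyrillic["train"], common_voice["train"]]
).shuffle(seed=42)  # interleave the two dialects rather than training on blocks
```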

4.4. EFT (Extended Fine-Tuning)

We applied an extended fine-tuning of the three models using the transliterated M2ASR-MONGO (Cyrillic) dataset. For the Wav2Vec2-BERT model, Facebook’s pre-trained w2v-bert-2.0 model was used, while the Whisper-large-v3 model used OpenAI’s pre-trained model. Both models were trained for only 4 epochs, with the other hyperparameters set to the same values as in the baseline experiments. The Conformer-Large model, on the other hand, was based on the English pre-trained model provided by NVIDIA and was fine-tuned for 50 epochs on the transliterated M2ASR-MONGO (Cyrillic) dataset. Since the English pre-trained model had never been exposed to Mongolian or similar languages and scripts, it needed more epochs to adapt.

4.5. TFT (Targeted Fine-Tuning)

The targeted fine-tuning (TFT) approach further fine-tunes the extended model, which was itself obtained by fine-tuning the official pre-trained model on the transliterated M2ASR-MONGO (Cyrillic) dataset. TFT aims to better adapt the extended model to Khalkha Mongolian, distinguishing it from the method in Section 4.2, which fine-tunes the official pre-trained models directly. The hyperparameters, such as learning rate, number of epochs, and batch size, remain consistent with those used in DFT for all three models to ensure a fair comparison. It should be explicitly noted that the primary difference between TFT and DFT is that TFT involves an additional intermediate fine-tuning step (EFT) using transliterated Mongolian data. This intermediate step allows the pre-trained model to adapt effectively to Mongolian phonetic and graphemic characteristics before further fine-tuning on the final target dataset.
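The overall EFT-then-TFT pipeline can be summarized as follows; `fine_tune` is a hypothetical helper standing in for the Trainer/NeMo training loops of Section 3.2, and the checkpoint and dataset names are placeholders.

```python
# Sketch of the two-stage transfer: EFT on the transliterated Chakhar data,
# then TFT on the small Khalkha Common Voice data. `fine_tune` abstracts the
# actual training loops; epoch counts follow Sections 4.2 and 4.4.

def fine_tune(init_checkpoint: str, dataset: str, epochs: int, out_dir: str) -> str:
    """Fine-tune `init_checkpoint` on `dataset` and return the new checkpoint path."""
    ...  # training loop omitted (see Section 3.2 for the concrete setups)
    return out_dir

# Stage 1 (EFT): adapt the official pre-trained model to Cyrillic output using
# the ~147 h transliterated M2ASR-MONGO (Cyrillic) corpus.
eft_ckpt = fine_tune("facebook/w2v-bert-2.0", "m2asr_mongo_cyrillic",
                     epochs=4, out_dir="ckpt_eft")

# Stage 2 (TFT): specialize the extended model to Khalkha Mongolian using the
# 3 h Common Voice v17 training set, with the same hyperparameters as DFT.
tft_ckpt = fine_tune(eft_ckpt, "common_voice_17_mn",
                     epochs=20, out_dir="ckpt_tft")
```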

5. Result Analysis

In this paper, all test results are obtained from evaluations on the Mongolian test set (Khalkha dialect with Cyrillic script) from Common Voice v17 across all three models and all the experimental results are shown in Table 1.
In Section 4.1, Whisper-large-v3 did not perform ideally, achieving a Character Error Rate (CER) of 45.73% and a word error rate (WER) of 92.60%. In Section 4.2, Whisper-large-v3 delivered the best performance, with a CER of 13.96% and a WER of 43.02%. Wav2Vec2-BERT achieved a CER of 15.56% and a WER of 55.35% under DFT, lagging behind Whisper-large-v3. Conformer-Large had a CER of 21.16% and a WER of 59.25%, making it the least effective of the three models at this stage. Surprisingly, in the Section 4.3 experiments, the Conformer-Large model outperformed the others, achieving the best results with a CER of 16.12% and a WER of 43.01%. Whisper-large-v3 ranked second with a CER of 16.27% and a WER of 49.95%, while Wav2Vec2-BERT’s CER and WER increased to 17.42% and 57.77%. In Section 4.4, Conformer-Large again achieved the lowest WER (59.37%, with a CER of 24.55%), Whisper-large-v3 reached a CER of 21.96% and a WER of 61.35%, and Wav2Vec2-BERT reached a CER of 20.97% and a WER of 63.29%. However, in the final targeted fine-tuning (TFT) stage, Whisper-large-v3 outperformed all models, delivering the overall best CER of 10.75% and WER of 31.45%.
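The CER and WER figures above are standard edit-distance-based metrics; a minimal sketch of how they can be computed with the jiwer library follows (an assumption about tooling, not necessarily what was used for the paper, and the strings are illustrative Cyrillic Mongolian text rather than test-set material).

```python
# Sketch: computing WER and CER for a small batch of hypotheses with jiwer.
import jiwer

references = ["өнөөдөр цаг агаар ямар байна", "өнөөдөр үүлэрхэг байна"]  # gold transcripts
hypotheses = ["өнөөдөр цаг агаар ямар байна", "өнөөдөр үүлэрхэг байн"]   # ASR outputs

print(f"WER = {jiwer.wer(references, hypotheses):.2%}")  # word error rate
print(f"CER = {jiwer.cer(references, hypotheses):.2%}")  # character error rate
```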

5.1. Method Analysis

In the experiment, understandably, the approach in Section 4.1 gave the highest error rates. While Whisper-large-v3 showed some ability to recognize Khalkha Mongolian, its actual performance was very limited. This may be due to the fact that although the model was trained with a small amount of Mongolian data, the number of Mongolian training samples remains insufficient, limiting its recognition performance in this language. Additionally, the model’s multilingual training primarily focused on other languages, which may also affect its ability to accurately recognize Mongolian.
In Section 4.2, the Mongolian dataset from Common Voice v17 (Khalkha dialect, Cyrillic script) was used to directly fine-tune the three aforementioned pre-trained models: Wav2Vec2-BERT, Whisper-Large v3, and Conformer-Large. Due to the limited amount of data available, the performance of DFT was not satisfactory. Here, the Whisper-Large-v3 model yielded the best performance. This outcome was as expected, given that the pre-trained model had previously been trained on Mongolian language data.
For the experiment in Section 4.3, the transliterated M2ASR-MONGO (Cyrillic) dataset and the Common Voice v17 (Cyrillic) dataset were merged; this merged dataset was then used to fine-tune the three pre-trained models (Wav2Vec2-BERT, Whisper-large-v3, and Conformer-Large). Compared to DFT, although the training data expanded approximately 50-fold, some models showed limited performance improvement or even a decline. This is possibly due to significant differences in pronunciation and vocabulary between the two Mongolian dialects present in the merged dataset, which make adaptation more challenging. This also indirectly demonstrates that, although Khalkha Mongolian and Chakhar Mongolian are both forms of Mongolian, the differences between them are substantial. For MDFT, Conformer-Large unexpectedly delivered the best results. Notably, when compared to DFT, Conformer-Large’s performance improvement in MDFT was significantly greater than that of the other two models, suggesting that it may be better suited to handling the variability or diversity present in the merged dataset.
The method in Section 4.4 performed the worst, not even matching the performance of the official pre-trained models that were fine-tuned directly on Common Voice v17 (with three hours of training data). Although the amount of data used to fine-tune the extended model was 49 times larger than that used for DFT, testing the model trained on M2ASR-MONGO (Cyrillic) using the Common Voice test set did not yield comparable results. This may be due to intrinsic differences between the two forms of Mongolian, issues with transliteration accuracy, accent variations between the datasets, terminology differences, and discrepancies in the recording environment.
The method in Section 4.5 achieved the best overall performance. It further fine-tuned the extended model using Common Voice v17 data, resulting in a significant improvement in recognizing Khalkha Mongolian in Cyrillic script compared to the method in Section 4.2. The best relative improvement reached 32.50%, with an absolute improvement of 19.26%. Additionally, compared to the MDFT model, TFT showed a relative WER improvement of 37.04% and an absolute improvement of 18.50%. This success can be largely attributed to the further refinement that TFT provides on top of the model generated in Section 4.4. Since EFT initially trained the model on a large amount of data from closely related languages, the model accumulated rich linguistic features and a wide variety of speech patterns, providing a solid foundation and strong generalization capabilities. Subsequently, through further specialization for Khalkha Mongolian via TFT, the model was able to focus on adapting to the target language’s specific pronunciation, vocabulary, and grammatical characteristics, thereby significantly improving its recognition performance.
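For reference, the headline figures correspond to the Conformer-Large WER under DFT versus TFT, and the MDFT comparison to Whisper-large-v3 (Table 1):

```latex
\frac{59.25 - 39.99}{59.25} \approx 32.50\%\ \text{(relative)}, \qquad 59.25 - 39.99 = 19.26\ \text{(absolute WER points)}
\frac{49.95 - 31.45}{49.95} \approx 37.04\%\ \text{(relative)}, \qquad 49.95 - 31.45 = 18.50\ \text{(absolute WER points)}
```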
In addition to just assessing the CER and WER, we also conducted a quantitative analysis of the results, as shown in Figure 6. Our analysis of ASR errors across Wav2Vec2-BERT, Conformer-Large, and Whisper-Large-v3 demonstrates that the approach described in Section 4.5 consistently reduces insertion, deletion, and substitution errors compared to DFT. Specifically, insertion errors decreased across all models, with Conformer-Large showing the most improvement (7.3% reduction) and Wav2Vec2-BERT showing the least (0.73% reduction). Similarly, deletion errors saw a significant reduction, with Conformer-Large improving by 7.21%, while Wav2Vec2-BERT had a smaller gain of 0.5%. Substitution errors followed the same trend, where TFT yielded lower errors across all models, with Conformer-Large and Whisper-Large-v3 showing the largest reductions of around 4.41% and 3.85%, respectively. These results indicate that TFT enhances phonetic consistency and reduces error propagation in ASR, with Conformer-Large benefiting the most. Whisper-Large-v3, which already had lower baseline error rates, continued to perform robustly with TFT.
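The per-type counts in Figure 6 can be reproduced from a word-level alignment of references and hypotheses; below is a sketch using jiwer.process_words (again an assumption about tooling), with rates expressed relative to the number of reference words and illustrative strings rather than test-set material.

```python
# Sketch: insertion/deletion/substitution rates from a word-level alignment,
# as reported in Figure 6.
import jiwer

references = ["өнөөдөр цаг агаар ямар байна", "өнөөдөр үүлэрхэг байна"]
hypotheses = ["өнөөдөр цаг агаар байна", "өнөөдөр их үүлэрхэг байна"]

out = jiwer.process_words(references, hypotheses)
n_ref = out.hits + out.substitutions + out.deletions  # total reference words
print(f"insertions:    {out.insertions / n_ref:.2%}")
print(f"deletions:     {out.deletions / n_ref:.2%}")
print(f"substitutions: {out.substitutions / n_ref:.2%}")
print(f"WER:           {out.wer:.2%}")
```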

5.2. Analysis from Model Perspective

The Wav2Vec2-BERT model exhibited relatively mediocre performance throughout the experiment. Although fine-tuning based on the extended fine-tuning (EFT) model provided incremental improvements, the gains remained limited. Possible explanations include its encoder-only architecture lacking a dedicated decoder, insufficient coverage of the target language in its pre-training data, and limited adaptability during transfer learning.
In contrast, Conformer-Large, despite being pre-trained solely on English, showed significant improvements when using our transliteration-aided fine-tuning strategy. We attribute this performance improvement primarily to two factors: First, the introduction of a Cyrillic Mongolian-specific BPE tokenizer allowed for a high degree of correspondence between graphemes and phonemes. Second, the Conformer architecture’s combination of convolution and self-attention modules may have enhanced its capability to rapidly adapt acoustic representations to new phonetic contexts introduced by transliteration.
Moreover, Whisper-large-v3 achieved the overall best performance, benefiting from its encoder–decoder structure and extensive multilingual pre-training. Nonetheless, the relative gain from transliteration-based fine-tuning (DFT vs. TFT) was most pronounced in Conformer-Large. This further confirms that our transliteration-aided transfer learning approach offers substantial improvements, especially for monolingual pre-trained models with limited cross-lingual capabilities.

5.3. Analysis of Training Process

Figure 7 illustrates the comparison of WER on the validation set between the methods outlined in Section 4.2 and Section 4.5. For the same model, both methods share identical hyperparameters and use the same dataset. The only difference is that DFT uses the officially provided pre-trained model, while TFT utilizes the extended model, which was fine-tuned using the transliterated M2ASR-MONGO (Cyrillic) dataset based on the officially provided pre-trained model. As shown in the figure, the validation error rate for TFT decreases significantly more than that for DFT for the same model. It highlights the remarkable impact of model extension based on transliteration for fine-tuning low-resource ASR systems. It should be noted that the training epochs of the Conformer-Large model are not fully displayed due to layout issues, and only the changes in the first 20 epochs are shown in the figure. The validation error stabilized around 80 epochs, with no further decrease observed, and the best WER reached 58.29%.

6. Conclusions

In this study, we transliterated the text portion of a relatively well-resourced Chakhar Mongolian dataset (written in Uighur script) into the corresponding script in Khalkha Mongolian (written in Cyrillic script). We first fine-tuned an officially provided pre-trained model, using the transliterated dataset to create an extended model, and subsequently performed further fine-tuning on this extended model using a very small Khalkha Mongolian dataset to generate the targeted model. The experimental results demonstrate that the transliteration-aided two-stage transfer learning approach significantly improves ASR performance for Khalkha Mongolian. In the future, we plan to transliterate a larger Chakhar Mongolian dataset for use in training our extended model. We also aim to apply this approach to other languages, such as Uyghur and Kazakh, because these languages face similar low-resource ASR challenges due to limited annotated datasets and linguistic complexities. Both Uyghur and Kazakh belong to the Turkic language family, sharing phonetic and structural similarities with Turkish, which is a high-resource language. By leveraging transliteration-based transfer learning, we can bridge the resource gap and improve ASR performance by utilizing cross-lingual phonetic and lexical correspondences.

Author Contributions

Conceptualization, D.M.; methodology, D.M.; investigation, D.M. and Y.M.; formal analysis, Y.M.; data curation, Y.M.; writing original draft preparation, D.M.; writing, review, and editing, D.M., Y.M. and P.M.; supervision, P.M.; project administration, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Janhunen, J.A. Mongolian; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2012. [Google Scholar]
  2. Burjgin, J.; Bilik, N. Contemporary Mongolian population distribution, migration, cultural change, and identity. In China’s Minorities on the Move; Routledge: London, UK, 2015; pp. 53–68. [Google Scholar]
  3. Shi, L.; Bao, F.; Wang, Y.; Gao, G. Research on Khalkha Dialect Mongolian Speech Recognition Acoustic Model Based on Weight Transfer. In Proceedings of the Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, 9–14 October 2019; pp. 519–528. [Google Scholar]
  4. Zhi, T.; Shi, Y.; Du, W.; Li, G.; Wang, D. M2ASR-MONGO: A free mongolian speech database and accompanied baselines. In Proceedings of the 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Singapore, 18–20 November 2021; pp. 140–145. [Google Scholar]
  5. Wu, Y.; Wang, Y.; Zhang, H.; Bao, F.; Gao, G. MNASR: A Free Speech Corpus For Mongolian Speech Recognition And Accompanied Baselines. In Proceedings of the 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Hanoi, Vietnam, 24–26 November 2022; pp. 1–6. [Google Scholar]
  6. Liu, Z.; Ma, Z.; Zhang, X.; Bao, C.; Xie, X.; Zhu, F. Mongolian Speech Corpus IMUT-MC; Science Data Bank: Beijing, China, 2022. [Google Scholar]
  7. Reitmaier, T.; Wallington, E.; Kalarikalayil Raju, D.; Klejch, O.; Pearson, J.; Jones, M.; Bell, P.; Robinson, S. Opportunities and challenges of automatic speech recognition systems for low-resource language speakers. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 30 April–6 May 2022; pp. 1–17. [Google Scholar]
  8. Fendji, J.L.K.E.; Tala, D.C.; Yenke, B.O.; Atemkeng, M. Automatic speech recognition using limited vocabulary: A survey. Appl. Artif. Intell. 2022, 36, 2095039. [Google Scholar] [CrossRef]
  9. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 1–40. [Google Scholar] [CrossRef]
  10. Torrey, L.; Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 242–264. [Google Scholar]
  11. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
  12. Qin, C.X.; Qu, D.; Zhang, L.H. Towards end-to-end speech recognition with transfer learning. EURASIP J. Audio Speech Music. Process. 2018, 2018, 18. [Google Scholar] [CrossRef]
  13. Shivakumar, P.G.; Georgiou, P. Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations. Comput. Speech Lang. 2020, 63, 101077. [Google Scholar] [CrossRef] [PubMed]
  14. Joshi, V.; Zhao, R.; Mehta, R.R.; Kumar, K.; Li, J. Transfer learning approaches for streaming end-to-end speech recognition system. arXiv 2020, arXiv:2008.05086. [Google Scholar]
  15. Sullivan, P.; Shibano, T.; Abdul-Mageed, M. Improving automatic speech recognition for non-native English with transfer learning and language model decoding. In Analysis and Application of Natural Language and Speech Processing; Springer: Berlin/Heidelberg, Germany, 2022; pp. 21–44. [Google Scholar]
  16. Qin, S.; Wang, L.; Li, S.; Dang, J.; Pan, L. Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling. EURASIP J. Audio Speech Music. Process. 2022, 2022, 2. [Google Scholar] [CrossRef]
  17. Zhao, J.; Zhang, W.Q. Improving automatic speech recognition performance for low-resource languages with self-supervised models. IEEE J. Sel. Top. Signal Process. 2022, 16, 1227–1241. [Google Scholar] [CrossRef]
  18. Jimerson, R.; Liu, Z.; Prud’Hommeaux, E. An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 2, pp. 1008–1016. [Google Scholar]
  19. Li, Z.; Rind-Pawlowski, M.; Niehues, J. Speech Recognition Corpus of the Khinalug Language for Documenting Endangered Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May 2024; pp. 15171–15180. [Google Scholar]
  20. Mengke, D.; Meng, Y.; Mihajlik, P. Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages@ LREC-COLING 2024, Torino, Italia, 20–21 May 2024; pp. 331–336. [Google Scholar]
  21. Sukhadia, V.N.; Umesh, S. Domain adaptation of low-resource target-domain models using well-trained asr conformer models. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 295–301. [Google Scholar]
  22. Mihajlik, P.; Kádár, M.S.; Dobsinszki, G.; Meng, Y.; Kedalai, M.; Linke, J.; Fegyó, T.; Mády, K. What kind of multi- or cross-lingual pre-training is the most effective for a spontaneous, less-resourced ASR task? In Proceedings of the 2nd Annual Meeting of the Special Interest Group on Under-resourced Languages: SIGUL 2023, Dublin, Ireland, 18–20 August 2023. [Google Scholar]
  23. Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6124–6128. [Google Scholar]
  24. Khare, S.; Mittal, A.R.; Diwan, A.; Sarawagi, S.; Jyothi, P.; Bharadwaj, S. Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 1529–1533. [Google Scholar]
  25. Li, H.; Sarina, B. The study of comparison and conversion about traditional Mongolian and Cyrillic Mongolian. In Proceedings of the 2011 4th International Conference on Intelligent Networks and Intelligent Systems, Kunming, China, 1–3 November 2011; pp. 199–202. [Google Scholar]
  26. Bao, F.; Gao, G.; Yan, X.; Wang, H. Language model for cyrillic mongolian to traditional mongolian conversion. In Proceedings of the Natural Language Processing and Chinese Computing: Second CCF Conference, NLPCC 2013, Chongqing, China, 15–19 November 2013; pp. 13–18. [Google Scholar]
  27. Na, M.; Bao, F.; Wang, W.; Gao, G.; Dulamragchaa, U. Traditional Mongolian-to-Cyrillic Mongolian Conversion Method Based on the Combination of Rules and Transformer. In Proceedings of the 2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS), Dali, China, 12–13 August 2023; pp. 440–445. [Google Scholar]
  28. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
  29. Zheng, G.; Xiao, Y.; Gong, K.; Zhou, P.; Liang, X.; Lin, L. Wav-BERT: Cooperative acoustic and linguistic representation learning for low-resource speech recognition. arXiv 2021, arXiv:2109.09161. [Google Scholar]
  30. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 28–29 July 2023; pp. 28492–28518. [Google Scholar]
  31. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
Figure 1. Cyrillic Mongolian scripts vs. Uighur Mongolian scripts.
Figure 2. Proposed methodology map.
Figure 3. Comparison of traditional Mongolian and Cyrillic Mongolian alphabet.
Figure 4. Incomplete correspondence between letters and phonemes.
Figure 5. Drop, attachment, and vowel harmonization.
Figure 6. Quantitative error analysis for DFT and TFT.
Figure 7. Comparison of validation WER for DFT and TFT.
Table 1. Baseline and proposed fine-tuning results (CER%/WER%).

Model              | Zero-Shot    | DFT          | MDFT         | EFT          | TFT
Wav2Vec2-BERT      | -            | 15.56/55.35  | 17.42/57.77  | 20.97/63.29  | 15.33/52.82
Conformer-Large    | -            | 21.16/59.25  | 16.12/43.01  | 24.55/59.37  | 14.50/39.99
Whisper-Large-v3   | 45.73/92.60  | 13.96/43.02  | 16.27/49.95  | 21.96/61.35  | 10.75/31.45