1. Introduction
Machine translation (MT) has become a key component of modern natural language processing (NLP) systems, enabling automatic text translation between languages and opening up opportunities for global communication. In recent years, the development of neural machine translation (NMT) models based on the Transformer architecture and deep neural networks has achieved significant success for languages with large parallel corpora, such as English, French, and Chinese. However, many Turkic languages, including Kazakh, Kyrgyz, Tatar, Uzbek, and Karakalpak, remain low-resource, which limits the quality and applicability of existing NMT models.
Low-resource Turkic languages face several specific challenges. First, available parallel corpora are minimal, and digital content in these languages is limited. Second, Turkic languages are characterized by agglutinative morphology, resulting in numerous unique word forms that complicate model training without morphological segmentation. Third, many Turkic languages use multiple scripts (Cyrillic, Latin, and Arabic), requiring additional data normalization and increasing the complexity of text preprocessing.
Open-source MT systems provide researchers and developers with a flexible platform for experimentation and practical implementation. Projects such as OpenNMT, Marian NMT, mBART, NLLB-200, and models from the LLaMA and Gemma families enable re-training existing models, integrating new languages, and creating specialized MT systems for low-resource languages. These tools open up opportunities for adapting models to the specificities of resource-constrained Turkic languages, including the use of synthetic corpora, multilingual learning, and transfer learning methods.
Recent advances in artificial intelligence (AI) have brought significant changes to the field of NLP, including MT. The transition to neural architectures has been an enormous step forward, significantly increasing the fluency and accuracy of translation for resource-rich language pairs [
1]. However, despite technological progress, the use of open-source AI systems for low-resource languages, such as many Turkic languages, remains an urgent problem. Many Turkic languages lack large, high-quality parallel corpora, which are critical for training accurate NMT systems. As shown in the large-scale study of Turkic languages, even with NMT, data scarcity remains a significant bottleneck [
2]. A study by Jumashukurov demonstrates that standard AI translation tools perform poorly on Turkic languages because of their agglutinative morphology and limited training data [
3]. For example, the Open Language Data Initiative (OLDI) recently released parallel corpora and fine-tuned NMT models for Karakalpak, a low-resource Turkic language, highlighting both the need and the difficulty of building such systems [
4]. Building effective AI for underrepresented Turkic languages often means adapting multilingual models or creating smaller, language-specific ones rather than using out-of-the-box general models. For instance, recent work describes a ~1.94B-parameter LLaMA-based model for Kazakh, demonstrating that strong performance can be achieved without massive infrastructure when models are specialized for a particular language [
5]. Beyond translation, automatic speech recognition (ASR) systems for Turkic languages are also underdeveloped. A recent multilingual ASR study combined five low-resource Turkic languages and showed that multilingual training on open-source data significantly improved recognition accuracy [
6]. As noted by Veitsman and Hartmann (2025), despite progress, many Central Asian Turkic languages (e.g., Kazakh, Kyrgyz, and Uzbek) still lack sufficient NLP resources, corpora, and open-source tools [
7].
The Turkic languages form an agglutinative language family characterized by high morphological complexity and structural flexibility, and include branches such as Oghuz, Kipchak, and Karluk. Most of these languages are considered low-resource for machine translation due to the lack of parallel corpora, bilingual dictionaries, and open training data [
8,
9,
10].
This pilot study investigates the potential and limitations of free open-source AI-based MT systems for five closely related Turkic language pairs directed toward Kazakh: Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh. This research presents a comprehensive methodology that includes selecting machine translation systems, automatically creating and refining parallel corpora, fine-tuning multilingual artificial intelligence models, and conducting a thorough assessment of translation quality. This machine translation task, focused on Turkic languages, is part of a larger project to develop meeting minutes from transcriptions of Turkic-language speech.
The outcomes of this study pave the way for expanding this methodology across other underrepresented language pairs. By demonstrating a reproducible, scalable approach, this work contributes to improving the use of free, open-source MT for low-resource languages and reducing digital linguistic inequality.
An analysis at this scale has not previously been conducted for the Turkic language family, making this work one of the first comparative studies to use large synthetic corpora across several translation pairs into Kazakh. The relevance of this research stems from the urgent need to overcome digital linguistic inequality across Turkic languages. Many languages, including low-resource Turkic languages, remain severely underrepresented in digital ecosystems, which substantially limits the performance and applicability of existing NMT models.
Turkic languages face several domain-specific challenges:
Data Scarcity: Available parallel corpora are extremely limited, and digital content remains insufficient. For example, the volume of parallel sentences in the OPUS open repository for the Turkmen–Kazakh language pair is only 22,119 sentences, which is inadequate for training modern NMT architectures.
Morphological Complexity: Turkic languages are characterized by agglutinative morphology, which results in a vast number of unique word forms. This significantly complicates model training without morphological segmentation or subword modeling.
Script Diversity: Many Turkic languages use multiple writing systems (Cyrillic, Latin, and Arabic script), which requires additional normalization and harmonization of textual data.
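To make the script-diversity problem concrete, the dominant script of a sentence can be detected with a simple Unicode-based heuristic before normalization. The following Python sketch is illustrative only (the 0.8 majority threshold is an assumption, and a production pipeline would also handle transliteration and mixed-script tokens):

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Classify a sentence as cyrillic, latin, arabic, mixed, or unknown
    by counting the Unicode script of its alphabetic characters."""
    counts = {"CYRILLIC": 0, "LATIN": 0, "ARABIC": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    script, top = max(counts.items(), key=lambda kv: kv[1])
    # Require a clear majority; otherwise flag the sentence for review.
    return script.lower() if top / total > 0.8 else "mixed"
```

Sentences flagged as `mixed` or `unknown` would then be routed to normalization or excluded from corpus construction.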
The use of open-source AI systems for low-resource languages remains a pressing challenge. Our research provides a practical, scalable solution based on the creation, cleaning, and refinement of synthetic corpora, enabling substantial improvements in translation quality and, consequently, helping reduce digital linguistic inequality. The goal of this study is to enhance machine translation quality for six Turkic languages (Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek) by leveraging publicly available, open-source AI models. Specifically, this research focuses on developing parallel corpora for Turkic languages and on selecting and fine-tuning models using these data.
The scientific contribution of this work lies in the design and validation of a scalable, reproducible pipeline for neural machine translation in low-resource Turkic languages, based entirely on free, open-access AI tools:
The construction of multilingual synthetic parallel corpora for multiple Turkic–Kazakh language pairs using established back-translation techniques;
An automated data cleaning and filtering strategy to mitigate noise, duplication, and hallucinations inherent in synthetic data;
The fine-tuning of multilingual neural machine translation models on the resulting corpora;
A comprehensive evaluation protocol combining standard surface-form metrics (BLEU, chrF) with semantic metrics (BERTScore, COMET), external evaluation on FLORES 200, and human evaluation.
This paper is structured as follows: The Introduction situates this study within its broader research context, explains why improving machine translation for low-resource Turkic languages is an urgent task, and outlines this work’s key goals.
Section 2: Related Work reviews the existing literature, highlighting the main challenges of neural machine translation for low-resource and morphologically complex languages, recent developments in Turkic-language MT research, and the growing role of synthetic data generation.
Section 3: Methodology details the methodological approach, including the selection of AI models, the creation of synthetic parallel corpora, and the automated procedures used for data cleaning and fine-tuning.
Section 4: Experimental Results and Analysis presents the outcomes of corpus construction, provides an error analysis, and evaluates how fine-tuning the chosen models (NLLB-200 1.3B and mT5-base) influences translation quality on both cleaned and raw datasets, using metrics such as BLEU, WER, TER, chrF, BERTScore, and COMET, as well as external and human evaluation.
Section 5: Discussion interprets these findings, assesses the strengths and limitations of the proposed approach, and outlines opportunities for expanding the developed resources. Finally,
Section 6: Conclusion summarizes the main contributions and demonstrates the effectiveness of open-source AI models for corpus creation and model adaptation in low-resource Turkic-language settings.
2. Related Work
Machine translation (MT) has shifted from rule-based and statistical approaches to neural models, which now dominate both practice and research. Modern methods are based on neural machine translation (NMT) models that account for sentence context and capture complex word relationships, enabling high-quality translation. An important milestone in the development of industrial NMT was the introduction of Google’s GNMT system [
11]. This model significantly improved translation quality through subword tokenization, deep LSTM layers, and attention mechanisms. The GNMT architecture laid the foundation for the development of modern models, such as T5 and mT5, which build on later transformer architectures.
The development of NMT has accelerated significantly in recent years, driven by advances in artificial intelligence, deep learning, and transformer-based architectures [
12]. NMT models have vastly outperformed traditional statistical machine translation (SMT) systems in both high-resource and multilingual settings.
However, translating low-resource languages, including many Turkic languages, remains a significant challenge due to limited parallel corpora, morphological complexity, and orthographic variation. A comprehensive analysis of neural machine translation outlines the progression from statistical approaches to neural architectures, emphasizing the importance of flexible model design and careful data selection. The discussion highlights persistent limitations in NMT, including insufficient generalization, difficulties with rare words, and challenges in domain adaptation, providing a structured overview of the field’s methodological evolution [
13]. This study emphasizes several critical aspects:
Architectural flexibility: Different architectures (RNN-based, convolutional, and transformer) provide varying capabilities for sequence modeling.
Data selection: NMT models are highly dependent on high-quality parallel corpora; poor-quality data leads to overfitting and poor generalization.
Challenges with rare words: Neural models struggle to translate infrequent words and morphologically rich forms without subword tokenization (e.g., BPE, SentencePiece).
Domain adaptation limitations: NMT systems trained on one domain often fail to generalize to other domains without fine-tuning.
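The subword tokenization mentioned above (e.g., BPE) can be illustrated with a minimal sketch of the merge-learning step: the most frequent adjacent symbol pair is repeatedly merged into a new symbol, so frequent suffixes of agglutinative word forms become single units. This is a simplified illustration, not the full BPE algorithm used by production tokenizers (no end-of-word markers or frequency-weighted vocabularies):

```python
from collections import Counter

def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a word list: repeatedly merge the
    most frequent adjacent symbol pair into a single symbol."""
    corpus = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges
```

On a toy list of related word forms, the first merges learned tend to be shared stems and suffix fragments, which is precisely why subword models mitigate the rare-word problem for morphologically rich languages.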
These works establish a foundation for understanding the specific difficulties that arise when working with low-resource languages, where parallel data may be scarce or non-existent.
2.1. Synthetic Data Generation and Multilingual Transfer for Low-Resource NMT
Research on low-resource neural machine translation (NMT) has primarily focused on transfer learning, multilingual pretraining, and synthetic data generation to mitigate the scarcity of parallel corpora. Numerous studies demonstrate that combining supervised and semi-supervised learning strategies with synthetic data can substantially enhance translation quality, particularly when transformer-based architectures with advanced attention mechanisms are employed [
14]. These models offer improved contextual modeling and robustness, even when trained on small, resource-constrained datasets.
A central technique in this area is back-translation, initially introduced by [
15] as an effective method for exploiting monolingual data. In back-translation, monolingual target-language text is translated into the source language using a reverse MT system, producing synthetic parallel sentence pairs that are then used for NMT training. Large-scale empirical studies confirm that back-translation remains effective across diverse language pairs and training scales, consistently improving translation quality in low-resource and domain-mismatched scenarios [
16]. These findings highlight the importance of synthetic data augmentation when genuine parallel resources are limited.
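The back-translation procedure just described can be sketched in a few lines of Python. This is a schematic illustration only: `reverse_translate` stands in for any trained target-to-source MT system, not a specific model from the studies cited above.

```python
def back_translate(monolingual_target, reverse_translate):
    """Create synthetic (source, target) pairs from monolingual
    target-language sentences using a reverse MT system."""
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        source_sentence = reverse_translate(target_sentence)
        # The human-written side is kept as the training target,
        # so translation noise stays on the source side.
        synthetic_pairs.append((source_sentence, target_sentence))
    return synthetic_pairs
```

The key design point is that the machine-generated text appears only on the source side, so the model still learns to produce fluent, human-authored target-language output.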
Complementary to back-translation, pivot translation and multilingual transfer have become widely adopted strategies for low-resource MT. Pivot-based approaches generate synthetic parallel data via an intermediate language, while multilingual NMT models leverage shared representations across multiple language pairs to facilitate cross-lingual transfer. Surveys of neural machine translation emphasize that multilingual learning plays a key role in alleviating data sparsity, while also identifying challenges such as error propagation, domain mismatch, and sensitivity to noisy training data [
17].
For Turkic languages, multilingual and family-based transfer is particularly effective due to shared morphological and syntactic properties. Prior studies show that multilingual NMT systems trained on related Turkic languages significantly outperform bilingual baselines in low-resource settings [
2,
18,
19]. The availability of multilingual resources such as KazParC and MuST-C further supports effective cross-lingual transfer within this language family [
20,
21]. In addition, large-scale multilingual models such as NLLB-200 provide many-to-many and zero-shot translation capabilities for more than 200 languages, including under-resourced Turkic–Kazakh language pairs [
22,
23]. Recent work demonstrates that fine-tuning NLLB models on task-specific or synthetic data consistently improves BLEU, chrF, and TER scores, even for closely related low-resource languages [
20,
24].
Despite these advances, synthetic-data-based approaches—including back-translation, pivot translation, and multilingual transfer—remain vulnerable to noise and error propagation. Errors introduced during reverse or intermediate translation stages, such as hallucinations, truncation, repetition, language mixing, and incorrect handling of named entities, are often directly transferred to the synthetic corpora. Most existing studies address these issues only indirectly, relying on coarse filtering heuristics or corpus-level data selection strategies, without explicitly identifying or correcting individual error types. This limitation is especially pronounced for morphologically rich languages, such as those in the Turkic family. It motivates the need for more structured, transparent, and reproducible synthetic data cleaning and correction pipelines.
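The error types listed above can be screened for with simple heuristics. The following sketch shows two illustrative detectors for repetition (a common symptom of NMT degeneration) and for length-ratio anomalies that often indicate truncation or hallucination; the thresholds are arbitrary examples, not values used in any particular study:

```python
def looks_repetitive(sentence: str, max_repeat: int = 3) -> bool:
    """Flag sentences where one token repeats many times in a row."""
    tokens = sentence.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True
    return False

def suspicious_length_ratio(src: str, tgt: str,
                            low: float = 0.5, high: float = 2.0) -> bool:
    """Flag pairs whose source/target length ratio is implausible."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return True
    ratio = ls / lt
    return ratio < low or ratio > high
```

Heuristics of this kind catch only surface symptoms; detecting semantic hallucinations or named-entity errors requires the more structured correction pipelines argued for here.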
2.2. NMT for Turkic Languages
The morphological and structural features of Turkic languages complicate low-resource machine translation. These challenges are amplified by:
Agglutinative morphology, creating a large number of possible word forms;
Scarcity of parallel corpora, especially for less widely spoken languages (e.g., Tuvan, Karakalpak);
Multiple orthographies (Latin, Cyrillic, and Arabic scripts), requiring additional preprocessing.
Despite the linguistic proximity of Azerbaijani and Kazakh, translation from Azerbaijani into Kazakh has not been sufficiently studied. Reasons include the scarcity of parallel corpora, limited availability of well-established oral and written data, and weak digital infrastructure. Recent research has explored Azerbaijani-to-English translation using morphological segmentation, highlighting its potential to improve handling of rich morphological structures in low-resource settings [
25]. Given that both Azerbaijani and Kazakh are agglutinative, these conclusions are likely transferable to the Azerbaijani–Kazakh direction as well. The OPUS project offers parallel corpora for many language pairs, including Turkic ones [
26]. Recent studies have analyzed speech translation for Turkic languages using transformer-based architectures, noting persistent challenges in processing spontaneous speech and the scarcity of available corpora [
27]. In response, the multilingual KazParC corpus has been introduced, covering Kazakh, English, Turkish, and Russian, providing a resource designed to ensure cross-linguistic consistency [
20].
Despite linguistic similarities, developing the Kyrgyz–Kazakh neural machine translation (NMT) system faces several challenges due to limited resources. Studies [
7,
18] show that, despite growing scientific interest in the Turkic languages, parallel data remain insufficient. The KazParC [
20] and MuST-C [
21] projects effectively support multilingual learning by providing a balanced text and speech corpus. Although the NLLB-200 model [
23] has significantly improved translation quality between Turkic languages, fine-tuning remains important for closely related language pairs, such as Kyrgyz–Kazakh [
24]. Recent studies have shown that using the NLLB-200 3.3B model to generate synthetic data for the Kyrgyz language based on Kazakh–English corpora, with subsequent fine-tuning of the NLLB-200 1.3B model, significantly improved BLEU, chrF, and TER scores [
20]. These results demonstrate the importance of targeted adaptation and consideration of linguistic features to achieve high-quality translation, even between related languages.
The first machine translation system from Turkmen to Turkish was proposed using a rule-based approach and structural similarity [
28]. In a subsequent study, refs. [
2,
19] presented a robust multilingual corpus and NMT benchmark results for 22 Turkic languages. Their results show that family-based training improves cross-lingual transfer, suggesting that the approach can also be applied effectively to closely related pairs such as Turkmen-to-Kazakh.
Translation from Kazakh to Turkish presents several difficulties due to morphological complexity and flexible word order. Although multilingual models such as mT5 and mBERT have shown promising results, they require specially adapted data for high-quality translation. In this context, the KazParC corpus and the Tilmash system were presented [
20], demonstrating the effectiveness of a neural machine translation approach that outperforms commercial services on BLEU and chrF scores. To address the need for more accurate syntactic representation, a hybrid CSE architecture was proposed [
29] to enhance translation quality for complex sentences. Recent research on multitask models for Turkic languages demonstrated the effectiveness of multitasking systems for Kazakh, Turkish, and Uzbek, achieving high translation quality even with limited supervision [
30]. Furthermore, studies have shown that incorporating POS tagging and transfer learning can significantly improve translation quality for low-resource Turkic language pairs [
31].
Overall, these results indicate that morpho-segmentation, synthetic data generation, and multilingual transformer-based learning are the most effective approaches for improving machine translation for Turkic languages.
Uzbek–Kazakh NMT benefits considerably from explicit morphological processing. The use of a Complete Set of Endings (CSE) in analysis has been shown to enhance grammatical coherence and improve translation accuracy, particularly when handling agglutinative suffix sequences [
32].
These results show that morphologically aware systems significantly outperform translation systems that lack linguistic segmentation. This encourages the further development of cascading translation pipelines, especially for speech-based systems.
Overall, these studies show that the effectiveness of neural machine translation depends not only on reliable model architecture, but also on the availability of high-quality datasets and careful linguistic adaptation. These results provide a basis for developing a strategy for low-resource MT systems that accounts for the morphological and structural features of Turkic languages. These characteristics make Turkic–Kazakh language pairs particularly sensitive to noise in synthetic data, emphasizing the need for structured cleaning and correction pipelines.
2.3. Available Parallel Corpora for Turkic–Kazakh Pairs
Effective multilingual transfer (
Section 2.1) relies heavily on the availability of parallel corpora, making the expansion and cleaning of Turkic–Kazakh datasets critical.
The availability of parallel corpora plays an important role in improving translation quality for languages with limited resources. The open repository OPUS aggregates parallel data from various sources (CCMatrix, WikiMatrix, ParaCrawl, OpenSubtitles, etc.) and provides access to data for low-resource languages [
33]. As of 20 November 2025, the OPUS repository offers the following amounts of parallel data for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh language pairs, as shown in
Table 1.
In addition to OPUS, specialized corpora are emerging. For example, the KazParC corpus (371,902 parallel sentences across the kk–tr, kk–en, and kk–ru directions) was created for the kk–tr pair as part of the initiative to expand Kazakh parallel resources [
20]. For uz–kk, a separate corpus was published in 2024 (113,877 parallel sentences) [
34], and for kk–ky and az–kk, synthetic data generated by the back-translation method is actively used [
35]. In addition, there are consolidated Turkic sets (e.g., the TIL Corpus), which enable the formation of multilingual models.
Research on most pairs involving the Kazakh language is mainly limited to the kk–en and kk–ru directions, where sufficient corpora have been collected for systematic experiments. For the Turkic–Kazakh pairs (az–kk, ky–kk, tk–kk, tr–kk, and uz–kk), there are still few directly comparable publications reporting BLEU and chrF results. Several papers provide the results of preliminary fine-tuning experiments with synthetic corpora, but they are limited by small test samples and heterogeneous methodologies.
At the same time, large multilingual systems, such as No Language Left Behind (NLLB-200) [
22], demonstrate efficiency under limited resources. The results on FLORES-200 show that fine-tuning in low-resource languages can significantly increase quality (spBLEU, chrF++). This confirms the promise of large-scale AI models for the areas under consideration.
The need to expand corpora. The volume of parallel data as of 4 September 2025 remains limited for several pairs (especially tk–kk: 22,119 sentences), which constrains the accuracy of NMT. Even for az–kk and uz–kk, corpora of several hundred thousand sentences are insufficient for modern architectures.
Relevance of AI application. The use of large language models (e.g., NLLB-200) and synthetic augmentation methods (e.g., back-translation and data augmentation) can compensate for data scarcity and improve translation accuracy [
36]. Of additional importance is the filtering of synthetic data (eliminating hallucinations and noise), which directly enhances BLEU/chrF scores.
Thus, the results of the review confirm:
The expansion and purification of corpora for Turkic–Kazakh pairs remains a critically important area.
The use of AI as a tool for generation and additional training is the most promising strategy for improving the quality of translation of low-resource agglutinative languages.
This review indicates that the quality of machine translation for Turkic languages largely depends on the availability of clean parallel corpora and the use of multilingual transformer-based models. However, many Turkic–Kazakh language pairs remain severely under-resourced, and there is still no unified methodology for generating and filtering parallel data. These limitations highlight the need for the approach proposed in the following section.
In summary, the existing work demonstrates the effectiveness of back-translation, pivot translation, and multilingual pretraining for low-resource NMT. However, there is still no unified and reproducible methodology for generating, correcting, and filtering synthetic parallel data, particularly for morphologically rich Turkic languages. Most approaches rely on heuristic filtering and discard large portions of synthetic data without attempting targeted correction. This gap directly motivates the data-centric pipeline proposed in this work.
3. Methodology
3.1. General Methodology Framework Schemes
This study aims to systematically improve the quality of machine translation for six state-level Turkic languages—Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek—by adapting and fine-tuning freely available open-source artificial intelligence models. The proposed methodology was explicitly designed for morphologically rich languages under low-resource conditions.
The selection of these six Turkic languages was motivated by a combination of methodological and practical considerations.
First, all the selected languages belong to the Turkic language family and share an agglutinative morphological structure. This typological similarity enables cross-lingual knowledge transfer and supports the use of a unified methodological framework.
Second, each language has an official status in its respective country, underscoring this study’s practical relevance for governmental, educational, and digital applications.
Third, the language set was intentionally designed to include both a relatively high-resource language (Turkish) and low-resource languages (Turkmen and Kyrgyz), allowing the proposed methodology to be evaluated under heterogeneous resource availability and to assess its generalizability.
The proposed methodology establishes a comprehensive, reproducible pipeline for the construction of synthetic parallel corpora, quality refinement, and the subsequent fine-tuning of machine translation models within the Turkic language family. The framework consists of two interdependent stages:
(1) AI-driven generation and multi-level refinement of synthetic parallel corpora.
(2) Adaptation and evaluation of machine translation models trained on the resulting corpora.
As illustrated in
Figure 1, Kazakh was employed as a pivot language. This choice was driven by the availability of large-scale monolingual Kazakh corpora within the research group. It enables the systematic generation of synthetic parallel data for the following language pairs: Kazakh–Azerbaijani, Kazakh–Kyrgyz, Kazakh–Turkish, Kazakh–Turkmen, and Kazakh–Uzbek, using preselected AI-based translation models.
The proposed methodology implements a multi-stage framework for constructing high-quality parallel corpora and adapting machine translation models for low-resource Turkic languages. Each stage is functionally interdependent and contributes directly to the reliability of downstream model training and evaluation.
Stage 1: Synthetic Parallel Corpus Generation.
The process began with a large-scale monolingual Kazakh corpus, which was used as a pivot resource due to its availability and linguistic centrality within the Turkic language family. Synthetic parallel corpora were generated via back-translation for five language pairs (Kazakh–Azerbaijani, Kazakh–Kyrgyz, Kazakh–Turkish, Kazakh–Turkmen, and Kazakh–Uzbek) using preselected AI-based translation models. This step establishes the initial bilingual data required for subsequent refinement.
Stage 2: Multi-Level Quality Refinement.
The generated corpora underwent a structured, three-level quality control process. First, manual expert validation was applied to identify systematic errors such as incorrect abbreviations, repetitions, and misaligned segments. Second, automatic filtering enforced linguistic and structural constraints, eliminating duplicates and inconsistent translations. Third, targeted regeneration was performed, where low-quality segments were re-translated using alternative models. This multi-level refinement ensures both scalability and linguistic precision.
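The multi-level refinement just described can be sketched as a filter-and-regenerate loop. This is a schematic illustration only: `is_acceptable` and `retranslate` are placeholders for the project's actual filters and alternative translation models, and the round limit is an assumption.

```python
def refine_corpus(pairs, is_acceptable, retranslate, max_rounds=2):
    """Keep acceptable pairs, drop exact duplicates, and re-translate
    rejected segments with an alternative model for a limited number
    of rounds."""
    seen, clean, pending = set(), [], list(pairs)
    for _ in range(max_rounds):
        rejected = []
        for src, tgt in pending:
            if (src, tgt) in seen:
                continue  # drop exact duplicates
            seen.add((src, tgt))
            if is_acceptable(src, tgt):
                clean.append((src, tgt))
            else:
                rejected.append((src, tgt))
        # Targeted regeneration: re-translate only the rejected segments.
        pending = [(retranslate(tgt), tgt) for _, tgt in rejected]
        if not pending:
            break
    return clean
```

Limiting the number of regeneration rounds keeps the pipeline scalable: segments that remain unacceptable after re-translation are simply discarded rather than looped over indefinitely.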
Stage 3: Error Analysis and Corpus Validation.
For each language pair, two corpora containing 300,000 and 500,000 sentence pairs were analyzed to identify recurring structural and semantic errors. The analysis revealed common challenges in morphological agreement, word order, and semantic fidelity across all language pairs, confirming the need for iterative filtering. The outcome of this stage was a validated bilingual corpus for each pair, consisting of 500,000 sentence pairs refined through three quality-control layers.
Stage 4: Model Fine-Tuning and Evaluation.
In the second primary phase, selected AI-based machine translation models were fine-tuned using the validated synthetic corpora. Model performance was evaluated at three levels: individual assessment for each language pair, comparative analysis across models, and final testing on an independent test corpus. Translation quality was measured using complementary metrics—WER [
37], TER [
38], BLEU [
39], and chrF [
40]—to capture both lexical accuracy and structural adequacy.
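As a point of reference for these metrics, WER can be computed as a word-level edit distance normalized by the reference length. The sketch below is a minimal pure-Python implementation for illustration; the scores reported in this work are assumed to come from standard evaluation toolkits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # drop a reference word
                      d[j - 1] + 1,      # drop a hypothesis word
                      prev + (r != h))   # match or substitute
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

Because WER counts whole-word mismatches, a single wrong suffix on an agglutinative word form counts as a full error, which is why character-level metrics such as chrF are reported alongside it.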
Overall, this methodology provides a structured, scalable, and empirically grounded approach to improving machine translation for morphologically rich, low-resource Turkic languages. The use of Kazakh as a pivot language enables systematic synthetic data generation, while multi-level refinement and iterative evaluation ensure corpus reliability and model robustness. Together, these stages form a reproducible foundation for building reliable machine translation systems under conditions of limited language representation.
In the following sections, we provide a detailed justification of the adopted design choices, a structured description of each methodological stage, and a comprehensive analysis of the practical results obtained through the proposed approach.
Figure 2 provides a high-level overview of the proposed pipeline for generating synthetic parallel data and fine-tuning a neural machine translation model. The process starts with monolingual data, which is used to generate synthetic parallel data. The generated data then undergoes cleaning and re-generation to improve alignment and reduce noise, resulting in curated parallel corpora. These corpora are subsequently used to fine-tune neural machine translation models, followed by automatic and expert-based evaluation of translation quality. For clarity, we provide a high-level pipeline overview in
Figure 2, while detailed processing steps are shown in
Figure 1.
To provide a comprehensive evaluation of the translation quality, we employed a combination of surface-level, character-level, and semantic metrics. BLEU and TER were used to measure n-gram overlap and edit distance with respect to reference translations, which remain standard indicators in machine translation evaluation. Given the morphologically rich nature of Turkic languages, we additionally report chrF and WER, as these metrics are more sensitive to character-level variation and inflectional morphology. Furthermore, we included BERTScore and COMET to assess semantic adequacy and meaning preservation beyond surface-form similarity.
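To make the surface-level metrics concrete, word error rate (WER) can be computed as the word-level Levenshtein distance between a hypothesis and its reference, normalized by reference length. The sketch below is purely illustrative; in our experiments, standard evaluation tooling was used rather than this implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

TER follows the same edit-distance idea but additionally allows block shifts, while BLEU and chrF measure n-gram overlap at the word and character level, respectively.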
3.2. Model Selection for Parallel Corpus Generation
The quality of synthetic parallel corpora directly determines the effectiveness of subsequent fine-tuning. It is therefore essential to select an appropriate AI model, and the selection should be based on transparent, reproducible criteria.
The model selection was based on the following criteria:
Open access and availability. Models must be openly accessible and usable in research without licensing restrictions, ensuring reproducibility of experiments and scalability of the methodology.
Translation quality. Models should accurately preserve suffixes, case endings, and word order, since the target languages are morphologically rich.
Language Coverage. Selected models should support the target Turkic languages either directly or indirectly. This criterion ensures that synthetic corpora can be generated consistently across multiple language pairs.
Computational Efficiency. Given the enormous volume of text required for corpus generation, the models must be able to process data within reasonable timeframes and within hardware constraints. Efficiency is evaluated based on throughput, memory consumption, and scalability.
Consistency and robustness. Models must produce stable results across runs and remain robust to changes in subject domain, with performance variability taken into account.
Community Validation and Benchmarking. Preference is given to models that have been evaluated in international benchmarks (e.g., WMT, FLORES-200) or have documented performance metrics. This provides an external reference point for assessing translation quality.
A central task in creating parallel corpora for Turkic language pairs is choosing appropriate tools. Given recent advances in machine translation, the most suitable approach is to use modern artificial intelligence (AI) systems capable of producing high-quality translations.
Agglutinative languages such as Kazakh, Kyrgyz, and Turkmen present unique linguistic challenges due to their complex morphemic structures, extensive post-fixation, and flexible word order [
41]. These features complicate both segmentation and the generation of semantically and morphologically accurate translations.
Based on these criteria, several state-of-the-art AI systems were evaluated:
Commercial/API-based:
Google Translate (Google), GPT-4o (OpenAI), and Copilot (Microsoft) [
44].
Open source:
NLLB-200 600M, NLLB-200 1.3B, NLLB-200 3.3B—Meta AI [
45];
Gemma 2 27B—Google DeepMind [
46];
Phi-4 (14B)—Microsoft [
47];
Qwen2.5 (Alibaba);
LLaMA 3.2 [
49], LLaMA 3.1 (Meta/Facebook) [
50].
The expert evaluation process yielded the following results:
Rejected due to availability or efficiency issues: Google Translate, GPT-4o, and Copilot;
Rejected due to insufficient translation quality: NLLB-200 600M, NLLB-200 1.3B, Phi-4, Qwen2.5, LLaMA 3.2, LLaMA 3.1;
Selected for detailed testing: Gemma 2 27B (Google) and NLLB-200 3.3B (Meta/Facebook), which demonstrated availability (both are freely available), good translation quality, and the ability to process large volumes of text.
Based on these criteria, the NLLB-200 3.3B model was chosen as the primary tool for generating parallel corpora for Turkic language pairs. This choice was driven by a combination of factors that are particularly important for morphologically complex, low-resource languages.
First, the NLLB-200 3.3B (No Language Left Behind) model, developed by Meta AI [
22], is specifically designed to improve translation quality for low-resource languages. The flagship NLLB-200 model employs a Sparsely Gated Mixture of Experts (MoE) architecture that scales efficiently by activating only a subset of its parameters for each translation; the publicly released 3.3B checkpoint used here is a dense model from the same family.
Second, the translation quality of NLLB-200 3.3B demonstrates consistent results for languages with a rich morphology, including the accurate rendering of suffixes, case endings, and word order. The model was initially designed to support resource-poor languages, so its architecture and training data were optimized for tasks related to Turkic language pairs. Unlike smaller versions (600M and 1.3B), version 3.3B provides sufficient depth of context modeling, reflected in the stability of morphological constructions and syntactic relations.
Third, the model has broad language coverage, including support for most Turkic languages, either directly or through closely related pairs. This enables the consistent generation of parallel corpora for different translation directions, minimizing quality imbalances between language pairs.
A fourth important factor is the computational efficiency of NLLB-200 3.3B. Despite the model’s relatively large size, it can be deployed and used on modern GPU clusters and local servers, ensuring high throughput when processing large datasets. The balance between translation quality and computational costs proved to be optimal compared to other candidates.
The next aspect is the robustness and consistency of the results. NLLB-200 shows low performance fluctuations across different subject domains, which is particularly important for building a universal corpus that covers a variety of topics. The model demonstrates robustness when processing texts of varying styles and complexity.
Taking these factors together, NLLB-200 3.3B emerged as the most appropriate choice for the project: it combines openness, high translation quality, broad language coverage, and sufficient computational efficiency. These characteristics make it particularly suitable for the automated creation of synthetic parallel corpora for Turkic languages and for the subsequent re-training of a machine translation system.
Thus, the NLLB-200 3.3B [
24] was used to translate large monolingual Kazakh corpora into five related Turkic languages. As a result, more than 500,000 sentences were generated, demonstrating the high efficiency and stability of this model.
In addition to the NLLB-200 3.3B, the Gemma 2 27B model was chosen for testing as a second parallel corpus generation model. This decision was motivated by several advantages that make the model promising for high-quality synthetic translation and subsequent evaluation.
The Gemma 2 27B model, developed by Google DeepMind in 2024, is a large-scale transformer architecture for generating and translating multilingual texts. This study uses a version of the model that incorporates multilingual embeddings, advanced attention mechanisms, and multi-task learning components. The main advantage of the Gemma architecture lies in its ability to adapt to translation tasks between closely related languages and to perform them efficiently, even with limited training data.
In this project, Gemma 2 27B was used to translate more than 200,000 sentences, enabling a comparative analysis with the NLLB-200 3.3B model. Several hallucinated fragments were found in the Gemma 2 27B translations, so our final choice was NLLB-200 3.3B.
The selected model, NLLB-200 3.3B, showed good translation quality and time performance across five pairs of Turkic languages in a preliminary translation of 100,000 Kazakh sentences.
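Concretely, the corpus generation step can be sketched with the Hugging Face transformers API. The checkpoint identifier and the FLORES-200 language codes below are the standard public ones; the batching and decoding details are illustrative assumptions rather than our exact production script.

```python
# FLORES-200 language codes used by NLLB for the six Turkic languages in this study.
FLORES_CODES = {
    "kk": "kaz_Cyrl",  # Kazakh (pivot/source)
    "az": "azj_Latn",  # Azerbaijani
    "ky": "kir_Cyrl",  # Kyrgyz
    "tk": "tuk_Latn",  # Turkmen
    "tr": "tur_Latn",  # Turkish
    "uz": "uzn_Latn",  # Uzbek
}

def translate_batch(sentences, tgt: str, model_name: str = "facebook/nllb-200-3.3B"):
    """Translate a batch of Kazakh sentences into a target Turkic language (sketch)."""
    # Imported lazily: the 3.3B checkpoint is several GB and needs a GPU in practice.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_name, src_lang=FLORES_CODES["kk"])
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tok(sentences, return_tensors="pt",
                 padding=True, truncation=True, max_length=256)
    out = model.generate(
        **inputs,
        # Force the first decoded token to be the target-language code
        forced_bos_token_id=tok.convert_tokens_to_ids(FLORES_CODES[tgt]),
        max_length=256,
    )
    return tok.batch_decode(out, skip_special_tokens=True)
```

Iterating this function over a monolingual Kazakh corpus for each target code yields the raw synthetic parallel data that is subsequently cleaned.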
Additionally, a practical constraint was introduced: compatibility with available hardware, specifically the ability to train on GPUs with 24 GB of video memory.
A range of AI models was evaluated according to these criteria, including:
Meta AI models: NLLB-200 600M, NLLB-200 1.3B, NLLB-200 3.3B;
Google models: mT5 [
51], mT5-small [
52], mT5-base [
53], mT5-large [
54], mT5-3B [
55], mT5-11B [
56];
Other LLMs: GPT-4o (OpenAI), Copilot (Microsoft), Phi-4 (Microsoft), Qwen2.5 (Alibaba), Gemma 2 27B (Google), LLaMA 3.1 and LLaMA 3.2 (Meta).
Based on this evaluation, NLLB-200 1.3B and mT5 were selected as the primary candidates for fine-tuning, offering a suitable compromise between model size, translation performance, and hardware resource constraints.
3.3. Error Analysis and Cleaning of Parallel Corpora
For the generation of synthetic parallel corpora, two monolingual Kazakh corpora of 300,000 and 500,000 sentences were used. The Kazakh–English corpus consists of 300,000 parallel sentences collected from various news and government sites and primarily covers news and official government domains (grant project AP05131415: Development and research of neural machine translation of Kazakh, 2018–2020, Ministry of Education and Science of the Republic of Kazakhstan); for the current project, we took the Kazakh part of this corpus. A 500,000-sentence Kazakh corpus was compiled as part of a joint grant project with the Xinjiang Technical Institute of Physics and Chemistry of the Chinese Academy of Sciences on the creation of a Kazakh–Chinese multimodal corpus and the research and application of intelligent text-processing technologies (2024–2025). The corpus topics included medical, technical, and colloquial domains.
Error analysis of the synthetic parallel corpora generated by NLLB-200 3.3B was performed manually. Several recurring error types were identified across all five language pairs, affecting both structural consistency and semantic accuracy. General categories of errors observed in the corpora of all pairs were:
Lexical duplicates (repeated fragments within one sentence or between sentences);
Erroneous rendering of geographical names and proper names;
Use of foreign abbreviations without decoding or adaptation;
Partially translated phrases, truncation of semantic units, and violation of the integrity of the syntactic structure.
Effective cleaning of parallel corpora is a critical prerequisite for high-quality neural machine translation [
57,
58]. This is particularly important for Turkic languages, which are highly agglutinative, morphologically rich, and often exhibit flexible word order—properties that complicate tokenization, segmentation, and MT model learning [
2,
7,
59,
60]. Poor-quality data directly compromises the model’s ability to learn accurate translation mappings and significantly degrades evaluation scores (e.g., BLEU, chrF) [
61,
62].
These issues confirm that errors in synthetic corpora are systematic, often resulting from the interaction between morphological complexity and inadequate context modeling by large language models. Therefore, the cleaning process—involving manual inspection, automated filtering, and rule-based corrections—is essential before model fine-tuning.
3.3.1. Correction Module: Purpose and Principles
A specialized Correction Module, implemented in Python version 3.11.14, was developed for automated cleaning and the normalization of synthetic parallel data. This component is a rule-based tool designed to ensure reproducible corpus cleaning and to eliminate systematic defects introduced during synthetic translation.
Importantly, the module:
Does not use neural models;
Does not employ embedding-based similarity;
Does not rely on external named entity recognition (NER) systems;
Does not incorporate fuzzy matching or edit distance.
These design choices ensure full transparency of the applied rules and high reproducibility of the processing results.
The module workflow consists of four sequential stages:
Text normalization (whitespace correction and character/punctuation standardization);
Filtering of duplicates and hallucinations (repetition loops);
Rule-based correction of named entities and abbreviations;
Saving the cleaned corpus along with detailed change logs (deleted and corrected lines).
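The first of these stages can be illustrated with a small normalization function. The exact rule set of the Correction Module is richer; the regular expressions below are illustrative assumptions rather than the module's actual code.

```python
import re

def normalize_text(line: str) -> str:
    """Stage 1 sketch: whitespace correction and punctuation standardization."""
    line = line.strip()
    line = re.sub(r"\s+", " ", line)              # collapse runs of whitespace
    line = re.sub(r"\s+([,.;:!?])", r"\1", line)  # remove space before punctuation
    line = line.replace("«", '"').replace("»", '"')  # unify quotation marks
    return line
```

Each subsequent stage records its changes in a log, so deleted and corrected lines remain traceable.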
3.3.2. Named Entity Correction: Rules and Restrictions
Named entity correction was performed using explicitly defined rules and dictionaries of typical erroneous substitutions identified during manual inspection of synthetic translations. Corrections were applied only when a clear contextual cue was present in the original Kazakh sentence, thereby reducing the risk of false substitutions.
The module processes the following entity types:
Geographical names and country names;
State and institutional abbreviations;
Personal names;
Linguistic markers (e.g., “қазақша”, “түрікше”).
A correction was applied only when all three of the following conditions were satisfied:
A trigger keyword appears in the Kazakh source sentence;
A predefined erroneous translation is detected in the target segment;
The target segment contains no more than one geographical entity.
The third constraint was introduced to avoid incorrect substitutions in sentences with complex geographic or political contexts (e.g., those mentioning multiple countries).
Examples of Correction Rules
Correction rules for named entities and abbreviations were defined as templates of the following form: (trigger in source KZ) + (typical erroneous substitution in target text) → correct form
Representative examples include (detailed statistics are provided in
Table 2):
Kazakhstan/ҚР → Azərbaycan → Qazaxıstan.
Қазақша → Azərbaycanca → Qazaxça.
ӨР → Azərbaycan/Ermənistan → Özbəkistan.
ҚР → Qırğızıstan/Qırğızıstanın → Qazaxıstan.
Technically, replacements were implemented using whole word matching with regular expressions to prevent substitutions inside word fragments and to minimize false positives.
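A minimal sketch of such a rule applier is shown below. The rule table and entity list are illustrative fragments (the full dictionaries were compiled manually during corpus inspection), while the three application conditions mirror those listed above.

```python
import re

# Illustrative fragments of the rule table: (trigger in Kazakh source,
# erroneous target form, corrected form). The full dictionaries are larger.
RULES = [
    ("Қазақша", "Azərbaycanca", "Qazaxça"),
    ("Қазақстан", "Azərbaycan", "Qazaxıstan"),
]
GEO_ENTITIES = ["Azərbaycan", "Qazaxıstan", "Qırğızıstan", "Özbəkistan"]

def correct_entities(src_kz: str, tgt: str) -> str:
    """Apply a correction only if (1) the trigger occurs in the source,
    (2) the erroneous form occurs in the target, and (3) the target
    mentions at most one geographical entity."""
    geo_count = sum(bool(re.search(rf"\b{re.escape(g)}\b", tgt))
                    for g in GEO_ENTITIES)
    if geo_count > 1:
        return tgt  # complex geopolitical context: leave untouched
    for trigger, wrong, right in RULES:
        if trigger in src_kz and re.search(rf"\b{re.escape(wrong)}\b", tgt):
            # Whole-word matching prevents substitutions inside word fragments
            tgt = re.sub(rf"\b{re.escape(wrong)}\b", right, tgt)
    return tgt
```

Because `\b` anchors on word boundaries, an erroneous form such as "Azərbaycan" is never replaced inside a longer word like "Azərbaycanca", which minimizes false positives.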
3.3.3. Detection of Duplicates and Hallucinations: Matching Methods
Lexical duplicates and hallucinated fragments were detected exclusively using exact matching. The module does not apply fuzzy matching, edit distance, or embedding-based semantic similarity. Two complementary methods were implemented: detection of consecutive repetitions of a single word, and detection of consecutively repeated n-grams within a sentence.
Such errors are characteristic of synthetic corpora and are particularly harmful: repetition loops introduce undesirable training patterns, increase the likelihood of repetitive output during generation, degrade fluency, and negatively affect translation quality metrics.
3.3.4. Threshold Values and Filtering Parameters
All threshold values were determined empirically based on preliminary manual analysis of typical defects in synthetic corpora. The thresholds were intentionally conservative, prioritizing high precision and minimizing the removal of valid sentences.
The applied thresholds are as follows:
Single-word repetition: ≥3 consecutive occurrences;
n-gram repetition: ≥3 repetitions;
n-gram length: 3–10 words;
Maximum number of geographical entities for NE correction: 1;
Matching type: exact match.
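The two exact-match detectors, under the thresholds listed above, can be sketched as follows; this is a simplified reimplementation for illustration, not the module's actual code.

```python
def has_word_loop(text: str, min_repeats: int = 3) -> bool:
    """Flag >= min_repeats identical consecutive words (exact match only)."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

def has_ngram_loop(text: str, n_min: int = 3, n_max: int = 10,
                   min_repeats: int = 3) -> bool:
    """Flag an n-gram of 3-10 words repeated >= 3 times consecutively."""
    words = text.split()
    for n in range(n_min, n_max + 1):
        for start in range(len(words) - n * min_repeats + 1):
            gram = words[start:start + n]
            if all(words[start + k * n: start + (k + 1) * n] == gram
                   for k in range(1, min_repeats)):
                return True
    return False
```

Sentences flagged by either detector were removed rather than corrected, and the deletions were written to the change logs.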
Separate sentence-length filtering was not applied in this version of the module, as preliminary analysis indicated that the primary sources of systematic noise in the synthetic corpora were repetition loops and named entity errors.
3.3.5. Qualitative Examples of Errors Before and After Cleaning and Their Impact
To strengthen the evidence for the effectiveness of synthetic parallel data cleaning, we present several representative examples of systematic errors observed before and after applying the correction module. Manual analysis shows that the most frequent and most harmful errors for model training are (i) repetition loops that disrupt sentence structure, and (ii) incorrect rendering of named entities and linguistic markers, leading to semantic and factual errors.
Example 1. Repetition hallucination → line removal
KZ: Бірақ суға толы қолайсыз геологиялық денелерді барлауда…
AZ (before): … geoloji cisimlərin geoloji cisimlərin geoloji cisimlərin …
After cleaning, the line is removed automatically.
Impact: such segments bias the target-language distribution and increase repetition during generation, degrading fluency and chrF/BLEU scores.
Example 2. Short repetition loop → line removal
KZ: Аналық безді жұмыртқа тәрізді…
AZ (before): yumurtalı yumurtalı yumurtalı
After cleaning, the line was removed.
Impact: repeated tokens create a distorted training signal, reducing generation stability.
Example 3. Language marker correction
KZ: Қазақша Мен бөліп төлеуді жоспарлап отырмын.
AZ (before): Azərbaycanca Mən pay-pay ödəməyi planlaşdırıram.
AZ (after): Qazaxça Mən pay-pay ödəməyi planlaşdırıram.
Impact: Correcting language markers improves semantic accuracy and prevents systematic language mislabeling.
Example 4. Incorrect language substitution
KZ: Қазақша: Батареяның қызмет ету мерзімін ұзартуға…
AZ (before): İngilis: Batareyanın ömrünü uzatmaq…
AZ (after): Qazaxça: Batareyanın ömrünü uzatmaq…
Impact: Correcting such distortions reduces training noise and improves inference quality in technical domains.
Example 5. Geopolitical entity substitution
KZ: … екі ел арасындағы дәстүрлі қытай-Әзірбайжан достығы…
AZ (before): … Çin-Özbəkistan dostluğu…
AZ (after): … Çin-Azərbaycan dostluğu…
Impact: Correcting named entities prevents factual errors and significantly improves perceived translation quality.
These qualitative examples demonstrate that corpus cleaning eliminates systematic generation errors rather than random noise. Despite the relatively small proportion of corrected sentences, such errors are disproportionately harmful to training and can substantially degrade translation stability and quality. These findings are further supported by the analysis of NMT outputs: models trained on uncleaned data exhibit a higher tendency toward repetition and entity mistranslation, whereas models trained on cleaned corpora produce more coherent, fluent, and semantically accurate translations.
3.3.6. Quantitative Cleaning Results
Table 2 presents the results of linguistic verification and error correction conducted across five synthetic parallel corpora of the Turkic–Kazakh language pair, totaling approximately 500,000 sentences. The analysis revealed two main types of systematic errors:
Lexical duplicates—words or phrases repeated within a sentence.
Incorrect rendering of proper names and abbreviations, including place names, country names, and legal terms.
The table shows the following data: the number of duplicate segments deleted, the number of corrected names and abbreviations, typical examples of systematic substitutions, and the percentage of corrected data in the total corpus. This preprocessing step of generated corpora substantially improves the lexical, semantic, and stylistic quality of the training data, directly enhancing the accuracy, stability, and generalization capacity of neural machine translation models for low-resource and morphologically complex Turkic languages.
Thus,
Table 2 effectively implements a component-wise ablation at the data level, allowing us to separately assess the contributions of repetition-loop filtering and rule-based named-entity correction to overall corpus quality without re-training the models. The results show that the bulk of the corrections involve repetition removal, while entity correction affects fewer lines but improves the semantic correctness and factual accuracy of translations.
By addressing these errors at the preprocessing stage, we enhanced the reliability and performance of downstream NMT models and ensured they could generate accurate, fluent, and morphologically appropriate translations across Turkic languages. The resulting cleaned corpora can also be found at this link:
https://github.com/NLP-KazNU/Parallel-Turkic-text-corpora, accessed on 15 January 2026.
The generated corpora underwent manual peer review of abbreviation translations and translation repetitions, the findings of which were then used to automatically filter the entire text to ensure linguistic and structural quality. Where necessary, low-quality segments were regenerated using alternative models that demonstrated superior performance. The resulting validated and cleaned bilingual corpora form the foundation for downstream model fine-tuning.
3.4. Fine-Tuning AI Models for Turkic Language Translation
The goal of fine-tuning is to improve translation accuracy for specific language pairs, adapt models to the morphological and syntactic features of Turkic languages, reduce the base model’s error rate, and improve processing of rare vocabulary, idiomatic expressions, and culture-specific units.
To improve translation accuracy and adapt the models to the morphological and syntactic structures of Turkic languages, a targeted fine-tuning process was implemented using pre-generated parallel corpora.
Two models were used during fine-tuning:
NLLB-200 1.3B.
mT5-base.
Fine-Tuning of the NLLB-200 1.3B Model.
The NLLB-200 1.3B model, developed by Meta AI, was chosen for its balance between parameter count and computational efficiency, which enables effective training on consumer-grade graphics processing units (GPUs).
Fine-tuning was carried out on thoroughly cleaned synthetic corpora generated using the NLLB-200 3.3B and Gemma 2 27B models. This ensured the linguistic diversity of the training data and enabled more thorough coverage of the structural differences among the Turkic languages.
The following strategies were used during fine-tuning:
Early stopping—to prevent overfitting of the model.
Gradient accumulation—to simulate larger batch sizes.
Adaptive learning rate schedules—to stabilize training dynamics.
The combination of these methods enabled effective adaptation of the NLLB-200 1.3B model for low-resource Turkic languages.
Fine-Tuning of the mT5-base Model.
The mT5-base model is a multilingual transformer architecture developed by Google Research [
51] that extends the T5 framework to support over 100 languages, including several Turkic languages. While the original T5 model focused primarily on English, mT5 is pre-trained on the large-scale multilingual mC4 corpus, which substantially expands language coverage.
Key advantages of mT5-base include:
A consistent text-to-text paradigm, simplifying task formulation across translation, summarization, and question answering.
A language-neutral initialization, avoiding English-centric biases.
A balanced model size (~580 million parameters), allowing effective fine-tuning with limited resources.
Fine-tuning was performed on a GPU with 24 GB of memory capacity.
Training parameters are presented in
Table 3.
Training Hyperparameters:
Base model: facebook/nllb-200-1.3B.
Optimizer: AdamW.
Learning rate: 5 × 10⁻⁵.
Warmup steps: 500.
Epochs: 6.
Batch size (per device): 2.
Gradient accumulation: 2.
Effective batch size: 4.
Max sequence length: 256.
Padding strategy: longest (dynamic).
Precision: FP16.
Random seed: 42.
Hyperparameters were optimized through grid search, with batch sizes adjusted via gradient accumulation.
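For reference, the hyperparameters listed above map onto the Hugging Face Trainer configuration roughly as follows. The keyword names follow the `transformers.Seq2SeqTrainingArguments` convention; the dictionary is a sketch rather than the full training script.

```python
# Hyperparameters from Table 3, expressed in the style of
# transformers.Seq2SeqTrainingArguments (a sketch; early stopping is added
# separately via an EarlyStoppingCallback in the Trainer API).
TRAIN_CONFIG = {
    "learning_rate": 5e-5,
    "warmup_steps": 500,
    "num_train_epochs": 6,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "fp16": True,   # mixed-precision training
    "seed": 42,
}
MAX_SEQ_LEN = 256  # applied at tokenization time with dynamic ("longest") padding

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch = (TRAIN_CONFIG["per_device_train_batch_size"]
                   * TRAIN_CONFIG["gradient_accumulation_steps"])
```

Gradient accumulation lets the 24 GB GPU emulate the effective batch size of 4 while holding only 2 samples per device in memory.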
Before cleaning, the training datasets consisted of synthetic parallel corpora of 300,000 and 500,000 sentence pairs per language pair; after cleaning, approximately 288,000 and 497,000 sentence pairs per language pair remained.
Data splits (98%/1%/1%). For the 497,000-sentence corpus: Train 487,060; Dev (internal) 4970; Test 4970. For the 288,000-sentence corpus: Train 282,240; Dev (internal) 2880; Test 2880. These are common split proportions; the dataset volume for each language pair varies slightly depending on cleaning.
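The split sizes follow directly from the 98/1/1 proportions; a small illustrative helper, with the test part taking the remainder, reproduces the figures above.

```python
def split_sizes(total: int, train_frac: float = 0.98, dev_frac: float = 0.01):
    """Return (train, dev, test) sizes under a 98/1/1 split; the test
    portion takes the remainder so the three parts sum to total."""
    train = round(total * train_frac)
    dev = round(total * dev_frac)
    return train, dev, total - train - dev

# split_sizes(497_000) -> (487060, 4970, 4970)
# split_sizes(288_000) -> (282240, 2880, 2880)
```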
This methodology provides a precise, repeatable approach to improving machine translation for low-resource Turkic languages. The fine-tuned models showed higher accuracy, greater robustness, and better handling of morphological complexity. The framework not only supports the development of reliable translation systems under limited language representation but can also be applied to other agglutinative language families.
A comprehensive evaluation of the obtained results, including comparative analysis, identification of methodological limitations, and discussion of future application prospects, will be presented in detail in
Section 4 and
Section 5.
4. Experimental Results and Analysis
4.1. Preliminary Experiments
In our 2024 experiments with the OPUS dataset, we were limited to Google Colab (Google LLC, Mountain View, CA, USA), which allowed us to use the mT5-small model. This resulted in relatively low BLEU scores. The mT5-small model is a general-purpose text-to-text architecture with substantially fewer parameters and without translation-specific pretraining.
Table 4 shows that augmenting OPUS data with 140,000 synthetic sentences generated by Google Translate led to a consistent decrease in BLEU scores across all language pairs. Training exclusively on synthetic data degraded performance further. These results indicate that synthetic data quality, rather than quantity, is critical; this observation motivated the transition to NLLB-based synthetic corpus generation and structured data cleaning.
Preliminary experiments were crucial for further work on the project. The results of the preliminary experiment allowed us to refine subsequent tasks, namely, selecting a specialized AI tool for generating synthetic parallel corpora for five pairs of Turkic languages, manually validating and automatically cleaning the generated corpora, selecting an AI tool for fine-tuning on the cleaned parallel corpora, and evaluating the fine-tuning results.
4.2. NLLB-200 1.3B Experiments on 300,000 and 500,000 Parallel-Sentence Corpora
Fine-tuning the NLLB-200 1.3B model on a relatively modest (300,000) parallel dataset yields significant improvements across all metrics (
Table 5).
TER decreases by an average of 20–40%, while BLEU nearly doubles (for Azerbaijani, Kyrgyz, and Turkish). chrF steadily increases by 10–15 points. The jump is particularly noticeable for the Turkmen–Kazakh pair, where BLEU increased more than fourfold (from 6.44 to 28.36), demonstrating NLLB’s sensitivity to domain-specific data fitting. Re-training NLLB-200 1.3B, even on limited clean data, effectively stabilizes the model, eliminating high error rates for all language pairs.
Table 6 summarizes the quantitative evaluation results for the five Turkic–Kazakh language pairs, reporting WER, TER, BLEU, and chrF scores for the baseline models, the fine-tuned models, and the models fine-tuned on cleaned synthetic data. Baseline refers to the NLLB-200 1.3B model fine-tuned only on the original (uncleaned) parallel data. The “Fine-tuned” column indicates the performance after additional training, and the “Fine-tuned on cleaned data” column refers to performance after training on the cleaned synthetic data. Δ columns show improvements relative to the baseline scores (Fine-tuned → Baseline and Fine-tuned on cleaned data → Baseline).
Average metric values (across 5 language pairs) for the baseline (zero-shot): WER = 0.77, TER = 74.84, BLEU = 17.62, chrF = 50.66.
Average metric values (across 5 language pairs) for the fine-tuned version (6 epochs): WER = 0.58, TER = 54.72, BLEU = 30.95, chrF = 62.28.
For the cleaned 500,000-sentence corpus, the fine-tuned NLLB-200 1.3B average metric values are WER = 0.42, TER = 40.88, BLEU = 43.54, and chrF = 76.71.
Data cleansing yields the greatest improvement in quality among all the experiments conducted. Switching from the uncleaned corpus to the cleaned corpus across all language pairs yields an average WER decrease of 0.12, a TER decrease of 13.84, a BLEU increase of 12.59, and a chrF increase of 14.43. The strongest effect is observed for the Kyrgyz–Kazakh and Azerbaijani–Kazakh pairs, where BLEU reaches ~48.
4.3. mT5-Base Experiments on a 500,000 Parallel-Sentence Corpus of Azerbaijani–Kazakh
The results of the fine-tuning are presented in
Table 7, which lists the values of four main performance indicators: WER, TER, BLEU, and chrF2.
Even a superficial casing cleanup improves all quality metrics: WER −1.3%, TER −1.28%, BLEU +1.57%, chrF +0.99%. Although the effect is moderate, it is stable: all metrics improve simultaneously, indicating improved data consistency. The mT5-base model also benefits from casing cleanup, but mT5’s sensitivity to data quality is lower than that of NLLB-200 1.3B.
Thus, the conducted experiments show that cleaning synthetic corpora and their subsequent use for additional model training significantly improves the quality of translation for Turkic–Kazakh language pairs. The results confirm the feasibility of this approach under limited resource conditions.
4.4. NLLB-200 1.3B Experiments on the Six Turkic-Language Parallel Dataset
Table 8 presents the results of experiments fine-tuning the free, open-source AI model for machine translation between Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh on a common Turkic-language parallel dataset (3,885,542 sentences). The baseline scores represent the average performance fine-tuned separately on each language pair (as reported in
Table 6), while the fine-tuned results show performance after training on the combined cleaned dataset for 6 epochs.
Joint fine-tuning on a single purified corpus of six Turkic languages leads to a further increase in BLEU (from 43.54 to 47.84) and chrF (from 76.71 to 78.41), while simultaneously decreasing WER and TER. Interestingly, the model demonstrates an average improvement across all language pairs, confirming the presence of interlingual transfer within the Turkic language group. Joint multilingual fine-tuning strengthens the models by leveraging the structural similarities among Turkic languages and is an optimal approach given limited resources for individual languages.
4.5. External Evaluation of Fine-Tuned NLLB-200 1.3B Models on the Human-Translated Benchmark FLORES 200 Dataset
We additionally evaluated the fine-tuned NLLB-200 1.3B models on FLORES-200, an independent human-translated evaluation benchmark for low-resource and multilingual machine translation; no FLORES data were used for training.
Table 9 presents the results of the external test experiments of the NLLB-200 1.3B model fine-tuned on 500,000 parallel sentences for five Turkic languages.
The results show that BLEU scores slightly decrease after fine-tuning across all language pairs, while chrF remains stable or improves for most directions. This pattern is expected for morphologically rich Turkic languages, as BLEU is sensitive to surface n-gram overlap, whereas chrF better captures character-level morphological similarity. The consistent chrF values indicate that fine-tuning preserves or improves morphological adequacy, even when lexical overlap with the reference translations decreases.
In addition to BLEU and chrF, we report a semantic evaluation using COMET and BERTScore [
63,
64]. These metrics better capture meaning preservation and paraphrastic variation, which is particularly important for morphologically rich Turkic language pairs. The results indicate that, despite the relatively low BLEU scores, the proposed model preserves the semantic content to a large extent.
For morphologically rich and closely related Turkic languages, semantic metrics (COMET, BERTScore) provide a more reliable estimate of translation quality than surface-form metrics such as BLEU [64].
Table 10 presents the results of the external test experiments for the NLLB-200 1.3B model fine-tuned on 500,000 parallel sentences for five Turkic-language pairs using the semantic metrics of COMET and BERTScore.
The results demonstrate a clear and consistent improvement in BERTScore-F1 after fine-tuning across all Turkic–Kazakh language pairs. On average, the BERTScore-F1 increases from 0.8168 in the zero-shot setting to 0.9258 after fine-tuning, indicating a substantially improved semantic alignment between model outputs and reference translations.
In contrast, COMET scores exhibit a slight decrease on average (from 0.8593 to 0.8274), with the largest drops observed for Turkish–Kazakh and Uzbek–Kazakh. This behavior is expected as COMET is sensitive to domain and stylistic shifts; fine-tuning on synthetic parallel data introduces mild specialization that may reduce the alignment with the FLORES-200 reference distribution.
COMET should be interpreted in terms of relative change rather than absolute value; the observed decrease from 0.859 to 0.827 (Δ ≈ −0.03) reflects a moderate domain adaptation effect.
Small decreases in COMET are commonly observed under domain adaptation and do not necessarily indicate loss of generalization [65,66].
Table 11 presents the results of the external test experiments using the NLLB-200 1.3B model fine-tuned on a six-Turkic-language parallel dataset (3,885,542 sentences), evaluated with the COMET and BERTScore metrics.
The slightly worse performance observed for the multilingual fine-tuned model compared to the models fine-tuned on individual language pairs can be attributed to parameter sharing across multiple translation directions. In the multilingual setting, the fixed parameter budget of the NLLB-200 1.3B model is distributed across several Turkic–Kazakh language pairs, which limits the degree of specialization achievable for each individual direction. In contrast, pair-specific fine-tuning allows the model to allocate its full representational capacity to a single language pair, resulting in stronger adaptation and higher metric scores.
This balance between cross-lingual generalization and pair-specific specialization explains why multilingual models yield slightly worse performances than their bilingual counterparts.
4.6. Fine-Tuning and Evaluation of NLLB-200 1.3B on the OPUS Dataset
As an example, we used the Kyrgyz–Kazakh OPUS dataset for this task. The corpus was preprocessed and deduplicated: it initially contained 99,590 lines, reduced to 91,142 lines after duplicate removal. In Table 12, the evaluation results of the zero-shot and fine-tuned NLLB-200 1.3B model are reported using the BLEU, chrF, BERTScore, and COMET metrics.
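Duplicate removal of the kind applied to the OPUS corpus can be sketched as follows. The normalization step (NFC form, collapsed whitespace, case folding) is an illustrative choice, not necessarily the paper's exact procedure, and the sample sentence pairs are invented.

```python
import unicodedata

def normalize(s: str) -> str:
    """Canonical form used as the duplicate-detection key:
    NFC Unicode normalization, collapsed whitespace, case folding."""
    return " ".join(unicodedata.normalize("NFC", s).split()).casefold()

def deduplicate(pairs):
    """Keep only the first occurrence of each normalized (source, target) pair."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept

corpus = [
    ("Саламатсызбы?", "Сәлеметсіз бе?"),
    ("саламатсызбы?", "Сәлеметсіз бе?"),  # duplicate up to casing
    ("Китеп кайда?", "Кітап қайда?"),
]
print(len(deduplicate(corpus)))  # duplicates collapse to unique pairs
```

Applied to the 99,590-line Kyrgyz–Kazakh corpus, a pass of this form accounts for the roughly 8% of lines removed.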
For comparison with the cleaned synthetic datasets: (1) on the synthetic datasets, BLEU increases from 17.72 to 48.27; chrF increases from 50.58 to 80.12; BERTScore F1 increases from 0.809 to 0.9295; and COMET decreases from 0.8765 to 0.8547;
(2) on the OPUS dataset, BLEU increases from 23.00 to 44.00; chrF increases from 43.96 to 60.33; BERTScore F1 increases from 0.80 to 0.86; and COMET increases from 0.75 to 0.81.
Table 13 compares the evaluation results for the cleaned synthetic datasets with those for the OPUS dataset for the Kyrgyz–Kazakh pair.
Cleaned synthetic data yield substantially larger gains in BLEU, chrF, and BERTScore-F1, while OPUS-only fine-tuning shows a higher COMET increase.
4.7. Human Evaluation
We conducted a human evaluation focusing on in-domain translation performance. Given the limited resources of the considered Turkic language pairs, validating the quality gains achieved within the curated parallel corpora used by the proposed pipeline is of primary importance.
For human evaluation, we selected 100 sentence pairs per language pair from a subset of 5000 manually verified and corrected parallel sentences, curated by native-speaker consultants for Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. These 5000 sentences originate from the larger cleaned synthetic corpus but were independently reviewed and corrected to ensure high reference quality. None of the evaluated sentences were used during model training.
Each evaluation item contained: (1) the source sentence, (2) two anonymized system outputs (Translation A and Translation B), and (3) a reference Kazakh translation.
To avoid positional bias, the assignment of baseline and fine-tuned outputs to Translations A and B was systematically alternated across the dataset. Human evaluators were asked to:
- (i) Indicate which translation is better (or select Tie if they are equivalent);
- (ii) Assign quality scores on a 1–5 Likert scale to each translation independently.
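The alternation protocol can be sketched as a small helper that builds anonymized evaluation items. The function name, item layout, and system labels are hypothetical illustrations of the procedure described above, not the paper's actual tooling.

```python
def build_eval_items(sources, baseline_outs, finetuned_outs, refs):
    """Alternate which system appears as Translation A across items,
    so neither system systematically occupies the first position."""
    items = []
    for i, (src, base, ft, ref) in enumerate(
        zip(sources, baseline_outs, finetuned_outs, refs)
    ):
        if i % 2 == 0:
            a, b = base, ft
            key = {"A": "baseline", "B": "finetuned"}
        else:
            a, b = ft, base
            key = {"A": "finetuned", "B": "baseline"}
        # Evaluators see only source, A, B, and reference; 'key' stays hidden.
        items.append({"source": src, "A": a, "B": b, "reference": ref, "key": key})
    return items

items = build_eval_items(["s1", "s2"], ["b1", "b2"], ["f1", "f2"], ["r1", "r2"])
print([it["key"]["A"] for it in items])  # systems alternate in position A
```

Keeping the hidden key separate from what evaluators see allows preferences to be de-anonymized only at analysis time.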
Table 14 presents the results of the human evaluation of translations by NLLB-200 1.3B: (1) the baseline zero-shot model and (2) the fine-tuned model, evaluated on 100 sentence pairs per language pair.
The human evaluation across five Turkic–Kazakh language pairs shows that baseline and fine-tuned models are close. Average quality scores on a 1–5 Likert scale reveal modest but consistent differences: the fine-tuned model outperforms the baseline for Azerbaijani–Kazakh, Kyrgyz–Kazakh, and Uzbek–Kazakh while remaining competitive for Turkish–Kazakh and Turkmen–Kazakh. Overall, no systematic degradation in human-perceived translation quality is observed.
In this study, we prioritize in-domain human evaluation, as the primary goal is to validate improvements within the curated and manually verified parallel corpora used by the proposed pipeline. For low-resource Turkic languages, ensuring quality gains in the target application domain is particularly critical. External generalization is assessed separately using the FLORES-200 benchmark with established automatic metrics.
5. Discussion
5.1. Preliminary Experiments and Methodological Transition
The experiments conducted with OPUS datasets and the mT5 model represent a preliminary stage of this study. At the time of experimentation (2024), the available computational resources were limited to Google Colab, which constrained training to the mT5-small configuration. Given the general-purpose nature and limited parameter capacity of mT5-small, low BLEU scores for low-resource Turkic language pairs are expected and should not be interpreted as training deficiencies.
In addition, fine-tuning mT5-small on a synthetic parallel corpus of 140k sentence pairs resulted in only modest performance gains. This outcome is largely attributable to the quality of the synthetic data, which was generated by translating a monolingual Kazakh corpus into five Turkic languages using Google Translate. While such synthetic augmentation is common in low-resource settings, translation noise and domain mismatch disproportionately affect smaller-capacity models.
Importantly, BLEU scores consistently increase across training epochs for all language pairs, indicating stable learning dynamics and correct training configuration. Based on insights from this preliminary stage and the acquisition of dedicated GPU resources (RTX 4090, 24 GB), we revised our experimental strategy. Specifically, we selected NLLB-200 3.3B for generating higher-quality synthetic parallel corpora and NLLB-200 1.3B for fine-tuning. This transition resulted in substantially improved translation quality, demonstrating the importance of model specialization and data quality in low-resource neural machine translation.
The synthetic parallel corpora used in this study are generated following the standard back-translation paradigm widely adopted in low-resource neural machine translation. Specifically, large monolingual Kazakh corpora serve as the starting point, and the corresponding target-side sentences (Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek) are obtained via back-translation using a pretrained multilingual model. This approach is well established as an effective strategy for alleviating data scarcity in low-resource settings.
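The corpus-construction direction described above can be sketched as follows. The `translate_fn` argument is a stub standing in for the pretrained multilingual model (NLLB-200 3.3B in this study), and the language codes are illustrative; the key point is that the authentic Kazakh sentence always ends up on the target side of each training pair.

```python
def back_translate(monolingual_kk, translate_fn, targets):
    """Build synthetic parallel corpora from monolingual Kazakh text.

    translate_fn(sentence, tgt_lang) stands in for the pretrained
    multilingual translation model; here it is a stub for illustration.
    """
    corpora = {lang: [] for lang in targets}
    for kk_sentence in monolingual_kk:
        for lang in targets:
            synthetic_src = translate_fn(kk_sentence, lang)
            # Training pairs: synthetic source -> authentic Kazakh target,
            # so the fine-tuned model learns to produce natural Kazakh output.
            corpora[lang].append((synthetic_src, kk_sentence))
    return corpora

# Stub translator used purely for demonstration.
demo = back_translate(
    ["Сәлем әлем"],
    lambda s, lang: f"[{lang}] {s}",
    ["kir_Cyrl", "uzn_Latn"],
)
print(demo["kir_Cyrl"])
```

This orientation is what makes back-translation effective: translation noise lands on the source side, while the target side remains clean human-authored text.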
Fine-tuning the NLLB-200 1.3B model on the resulting synthetic corpora leads to substantial improvements in translation quality. These gains should not be interpreted as trivial output imitation of the data-generating model, as effective generalization in low-resource neural machine translation remains a non-trivial learning problem, particularly for morphologically rich Turkic languages.
To reduce potential biases inherent to synthetic data, the generated corpora undergo multi-level cleaning, including deduplication, correction of named entities and abbreviations, and removal of hallucinated segments, thereby altering the data distribution and going beyond direct output imitation.
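Parts of this multi-level cleaning can be expressed as simple automatic filters. The sketch below combines deduplication with two common heuristics (a length-ratio filter for truncated or hallucinated outputs, and a source-copy filter); the thresholds and example pairs are illustrative assumptions, and the expert validation and targeted regeneration stages of the actual pipeline are not automatable in this form.

```python
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs with an implausible character-length ratio,
    a common symptom of truncation or hallucination."""
    a, b = max(len(src), 1), max(len(tgt), 1)
    return max(a, b) / min(a, b) <= max_ratio

def not_a_copy(src: str, tgt: str) -> bool:
    """Reject pairs where the 'translation' merely copies the source."""
    return src.strip().casefold() != tgt.strip().casefold()

def clean(pairs):
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # level 1: deduplication
        seen.add(key)
        if not length_ratio_ok(src, tgt):
            continue  # level 2: length-ratio filter
        if not not_a_copy(src, tgt):
            continue  # level 3: source-copy filter
        kept.append((src, tgt))
    return kept

pairs = [
    ("Menga yoqdi", "Маған ұнады"),
    ("Menga yoqdi", "Маған ұнады"),  # exact duplicate
    ("Salom", "Salom"),              # source copied verbatim
    ("Ha", "Бұл өте ұзақ және мүлдем сәйкес емес аударма"),  # length outlier
]
print(len(clean(pairs)))  # only the first pair survives
```

Each filter changes the data distribution, which is precisely what moves fine-tuning beyond direct imitation of the generating model.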
In addition, model evaluation is performed using both held-out synthetic test sets and independent human-translated reference data, providing empirical evidence that the proposed pipeline improves translation quality for low-resource Turkic languages.
5.2. Impact of Data Quality on NMT Performance for Low-Resource Turkic Languages
The experimental results consistently demonstrate that data quality is a critical factor in improving neural machine translation (NMT) systems for low-resource Turkic languages. Across all evaluated models and language pairs, cleaning synthetic parallel corpora led to substantial improvements in translation accuracy. Even basic preprocessing steps resulted in measurable gains, confirming that noise and inconsistencies in synthetic data significantly hinder model generalization.
Quantitatively, corpus cleaning yielded an average improvement of 12.59 BLEU points and 14.43 chrF points, while error-based metrics decreased by 0.16 (WER) and 13.84 (TER). These results highlight that morphological and lexical consistency is particularly important for agglutinative languages, where minor distortions can propagate into systematic translation errors.
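For reference, the error-based metrics reported here follow the standard definition of word error rate: word-level Levenshtein distance normalized by the reference length. A minimal implementation (with an invented Kazakh example; TER additionally allows shift operations and is not reproduced here) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between r[:i-1] and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (rw != hw)  # substitution / match
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)

print(wer("бұл кітап өте қызық", "бұл кітап қызық"))  # one deletion -> 0.25
```

Because the distance is normalized by reference length, a single dropped word in a four-word reference already costs 0.25, which is why the reported WER reductions reflect substantial structural improvement.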
5.3. Role of Synthetic Data Generation and Thematic Diversity
The use of synthetic parallel corpora generated via back-translation proved to be an effective strategy for mitigating data scarcity. The extended bilingual corpus of 500,000 sentence pairs spanned a wide range of themes, including journalistic, conversational, scientific, medical, administrative, and technical domains. This diversity enabled the NMT models to learn generalized representations of real-world language use rather than overfitting narrow domains.
The results indicate that thematic coverage directly improves robustness, enabling models to perform consistently across heterogeneous text types. This is particularly important for practical deployment scenarios involving mixed-domain inputs.
5.4. Error Characteristics and the Importance of Corpus Cleaning
Error analysis across all five language pairs revealed recurring issues related to hallucinations, mistranslation of named entities, inconsistent terminology, duplicated segments, and morphological distortions. Such errors are inherent to synthetic data generated by large-scale models and, if left unaddressed, substantially degrade translation quality.
The three-level refinement strategy—expert validation, automatic filtering, and targeted regeneration—proved essential in mitigating these issues. The marked reduction in WER and TER confirms that cleaning improves not only surface-level accuracy but also deeper structural and semantic consistency.
5.5. Comparative Performance Across Turkic Language Pairs
The effect of data cleaning was particularly pronounced for initially low-performing language pairs. For example, in the Kyrgyz–Kazakh pair, BLEU increased from 29.73 to 48.27, while WER decreased from 0.59 to 0.38. Similar trends were observed for Azerbaijani–Kazakh and Uzbek–Kazakh, where BLEU scores nearly tripled and error rates were reduced by approximately half.
The Turkmen–Kazakh pair, initially the weakest (BLEU = 9.18), showed substantial improvement after cleaning, reaching 33.22 BLEU. Although absolute performance remained lower than for other pairs, the relative gains demonstrate the effectiveness of corpus refinement, even for severely under-resourced languages.
5.6. Model Sensitivity to Noise and Cross-Model Consistency
The experiments revealed that the NLLB-200 1.3B model is highly sensitive to the quality of the training data. Fine-tuning on uncleaned synthetic corpora yielded only moderate gains, whereas training on cleaned datasets led to stable, consistent improvements across all metrics. This suggests that large multilingual models can fully exploit their capacity only when trained on structurally coherent and lexically consistent data.
These observations were independently confirmed using the mT5-base model. Although its overall improvements were more modest due to smaller model capacity, the direction of changes remained consistent: WER and TER decreased, while BLEU and chrF increased after cleaning. This confirms that the benefits of corpus refinement are model-independent.
5.7. Benefits of Multilingual Fine-Tuning
Multilingual fine-tuning on a combined Turkic dataset further enhanced translation quality. Training the NLLB-200 1.3B model on a unified multilingual corpus yielded a better performance across all metrics than pairwise fine-tuning. The improvement in BLEU from 43.54 to 47.84 and the reduction in WER from 0.42 to 0.31 indicate that shared representations across related languages improve generalization.
This effect is particularly beneficial for low-resource languages such as Turkmen, which benefit from shared morphology, cognates, and syntactic patterns present across the Turkic language family.
Despite the strong results, limitations remain. Although synthetic corpora can reach a substantial size, they inevitably contain residual hallucinations and inconsistencies. Without systematic cleaning, such artifacts negatively affect translation performance. Future work will focus on expanding the corpora with additional open-domain and spoken-language sources, as well as on more advanced filtering and confidence-based selection methods.
Overall, the discussion confirms that high-quality synthetic corpora can effectively compensate for the lack of human-annotated data in low-resource Turkic languages. The findings demonstrate that advanced corpus cleaning is essential, multilingual fine-tuning significantly enhances generalization, and data quality remains the dominant factor in achieving high-performance NMT systems for morphologically rich languages.
5.8. External and Human Evaluation
To obtain a balanced assessment of translation quality, we combine external benchmark evaluation with in-domain human evaluation, capturing both generalization and practical performance in low-resource Turkic–Kazakh translation.
External evaluation is conducted on the FLORES-200 benchmark, which provides a strictly held-out, human-translated test set with a shared Kazakh reference across all source languages. This setup enables reliable comparison across language pairs and assessment of out-of-domain robustness. The results indicate that fine-tuned models remain competitive with the zero-shot baseline in semantic metrics. Small decreases in COMET are observed for some language pairs, consistent with domain adaptation rather than systematic degradation, while the BERTScore remains stable.
Human evaluation focuses on in-domain performance using manually verified and corrected parallel data. Pairwise preference judgments are largely dominated by Tie responses, suggesting comparable overall quality between baseline and fine-tuned systems. However, graded quality scores show modest yet consistent improvements for the fine-tuned model across several language pairs, with no evidence of quality degradation.
Taken together, these results show that fine-tuning on cleaned synthetic data leads to meaningful in-domain quality improvements while maintaining robust performance on an external benchmark. The observed divergence between COMET and human judgments further underscores the importance of combining automatic metrics with human evaluation, particularly in low-resource, morphologically rich languages.
5.9. About Domain Shift
The synthetic parallel data were generated from large-scale Kazakh monolingual sources, including news and official government domains, providing broad topical and stylistic coverage. Although this data composition may theoretically introduce mild domain or stylistic biases, an empirical evaluation using automatic semantic metrics (COMET, BERTScore) and human judgment shows that their practical impact on model generalization is limited. In particular, the observed decrease in COMET scores after fine-tuning remains within the range commonly associated with domain adaptation and does not suggest a loss of generalization. These findings imply that the proposed synthetic data generation and cleaning pipeline improves translation quality while largely preserving robustness across domains. Nevertheless, future work may explore explicit domain balancing and multi-domain evaluation to further mitigate potential domain bias.
6. Conclusions
In conclusion, this study successfully applied modern artificial intelligence methods to address the pressing problem of machine translation for low-resource Turkic languages (Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek). The key scientific contribution of this work lies in the development and implementation of a scalable methodology that includes the generation, automated cleaning, and further training of open-source AI models.
Synthetic Corpus Generation: To address the acute data shortage, a new multilingual parallel corpus was developed, focusing on the Kazakh language. Using the large-scale open-source NLLB-200 3.3B model, we generated synthetic parallel corpora of 500,000 and 300,000 sentences for each of the five Turkic language pairs. This work created a significant resource for low-resource languages.
Detailed Data Cleaning (Improvement Mechanism): The generated corpora were subjected to automated cleaning and filtering to remove systematic errors. This stage was critical for the agglutinative and morphologically complex Turkic languages. Using a special software module (Correction Module), the following issues were eliminated:
- Lexical duplicates (repeated segments);
- Erroneous rendering of geographical and proper names, as well as abbreviations (for example, the incorrect translation of “KR”, Republic of Kazakhstan);
- “Hallucinations”: distorted or lexically incoherent phrases that violate grammatical norms and reduce the generalization ability of the model.
Fine-Tuning: To adapt the models to the specific characteristics of Turkic languages, the NLLB-200 1.3B model was selected (as well as mT5-base for comparative analysis). Fine-tuning was conducted on the cleaned synthetic corpora.
We fine-tuned the NLLB-200 1.3B model on cleaned data, demonstrating significant performance improvements across all five language pairs. This improvement was achieved primarily through the use of the cleaned corpus, which provided a robust foundation for training the NMT model.
For the cleaned corpora of 500,000 sentences, fine-tuning yielded the following average improvements over the baseline (zero-shot) translation quality:
The average BLEU metric increased by 25.92 points.
The chrF metric (character-level F-score) increased by 26.05 points.
The error metrics (WER and TER) were nearly halved: WER decreased by 0.35, and TER by 33.96.
The largest quality improvement was recorded for the pair with the smallest number of source resources, Turkmen–Kazakh, where the BLEU metric increased from 9.18 to 33.22, a 3.6-fold increase. For the Kyrgyz–Kazakh pair, the BLEU increased by almost 2.7 times. Furthermore, continued training on a large multilingual dataset (3,885,542 sentences) of Turkic languages further improved the translation quality, enhancing the model’s generalizability and performance. For example, on this combined dataset, BLEU increased from 43.54 to 47.84, and WER decreased from 0.42 to 0.31. These results confirm the high efficiency of the developed methodology for generating and cleaning synthetic corpora using open-source AI models. The demonstrated, scalable, reproducible approach enables the improved use of open-source NMT solutions for low-resource languages and helps bridge the digital linguistic divide.
External and Human Evaluation.
The external evaluation of the fine-tuned NLLB-200 1.3B model on the independent, human-translated FLORES-200 dataset shows that, compared with the zero-shot baseline, the pair-wise fine-tuned models' BLEU scores slightly decrease on average (from 12.71 to 11.32), while chrF slightly increases (from 46.62 to 47.12) and the semantic metric BERTScore-F1 rises from 0.8168 to 0.9258. In contrast, COMET exhibits a slight average decrease (from 0.8593 to 0.8274). The multilingual fine-tuned NLLB-200 1.3B performs slightly worse (BERTScore-F1 drops from 0.8168 to 0.8115 and COMET from 0.8593 to 0.8446). These external evaluation scores indicate a moderate domain adaptation effect and do not suggest a loss of generalization.
Human evaluation conducted by native speakers across multiple Turkic–Kazakh language pairs is consistent with the automatic metrics: although pairwise preference judgments were often ties, graded quality scores show modest but consistent gains in adequacy and fluency for the fine-tuned models, supporting the conclusion that the proposed training pipeline yields perceptible in-domain quality improvements beyond metric-based evaluation.
The proposed scalable, replicable approach not only advances machine translation research but also lays a solid foundation for the multifaceted application of its findings. Academically, the proposed methodology establishes a replicable research paradigm that can be applied to the study of other language families and contributes to expanding scientific knowledge of morphologically complex agglutinative languages. The proposed approaches strengthen linguistic inclusivity in the Turkic-speaking world, ensuring equal opportunities for digital participation for the more than 200 million Turkic language speakers.