1. Introduction
Machine translation (MT) has become a key component of modern natural language processing (NLP) systems, enabling automatic text translation between languages and opening up opportunities for global communication. In recent years, the development of neural machine translation (NMT) models based on the Transformer architecture and deep neural networks has achieved significant success for languages with large parallel corpora, such as English, French, and Chinese. However, many Turkic languages, including Kazakh, Kyrgyz, Tatar, Uzbek, and Karakalpak, remain low-resource, which limits the quality and applicability of existing NMT models.
Low-resource Turkic languages face several specific challenges. First, available parallel corpora are minimal, and digital content in these languages is limited. Second, Turkic languages are characterized by agglutinative morphology, resulting in numerous unique word forms that complicate model training without morphological segmentation. Third, many Turkic languages use multiple scripts (Cyrillic, Latin, and Arabic), requiring additional data normalization and increasing the complexity of text preprocessing.
Open-source MT systems provide researchers and developers with a flexible platform for experimentation and practical implementation. Projects such as OpenNMT, Marian NMT, mBART, NLLB-200, and models from the LLaMA and Gemma families enable re-training existing models, integrating new languages, and creating specialized MT systems for low-resource languages. These tools open up opportunities for adapting models to the specificities of resource-constrained Turkic languages, including the use of synthetic corpora, multilingual learning, and transfer learning methods.
Recent advances in artificial intelligence (AI) have brought significant changes to the field of NLP, including MT. The transition to neural architectures has been an enormous step forward, significantly increasing the fluency and accuracy of translation for resource-rich language pairs [
1]. However, despite technological progress, the use of open-source AI systems for low-resource languages, such as many Turkic languages, remains an urgent problem. Many Turkic languages lack large, high-quality parallel corpora, which are critical for training accurate NMT systems. As shown in the large-scale study of Turkic languages, even with NMT, data scarcity remains a significant bottleneck [
2]. A study by Jumashukurov demonstrates that standard AI translation tools perform poorly on Turkic languages because of their agglutinative morphology and limited training data [
3]. For example, the Open Language Data Initiative (OLDI) recently released parallel corpora and fine-tuned NMT models for Karakalpak, a low-resource Turkic language, highlighting both the need and the difficulty of building such systems [
4]. Building effective AI for underrepresented Turkic languages often means adapting multilingual models or creating smaller, language-specific ones rather than using out-of-the-box general models. For instance, recent work describes a ~1.94B-parameter LLaMA-based model for Kazakh, demonstrating that strong performance can be achieved without massive infrastructure when models are specialized for a particular language [
5]. Beyond translation, automatic speech recognition (ASR) systems for Turkic languages are also underdeveloped. A recent multilingual ASR study combined five low-resource Turkic languages and showed that multilingual training on open-source data significantly improved recognition accuracy [
6]. As noted by Veitsman and Hartmann (2025), despite progress, many Central Asian Turkic languages (e.g., Kazakh, Kyrgyz, and Uzbek) still lack sufficient NLP resources, corpora, and open-source tools [
7].
The Turkic languages form an agglutinative language family characterized by high morphological complexity and structural flexibility, and include branches such as Oghuz, Kipchak, and Karluk. Most of these languages are considered low-resource for machine translation due to the lack of parallel corpora, bilingual dictionaries, and open training data [
8,
9,
10].
This pilot study investigates the potential and limitations of free open-source AI-based MT systems for five closely related Turkic language pairs directed toward Kazakh: Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh. This research presents a comprehensive methodology that includes selecting machine translation systems, automatically creating and refining parallel corpora, fine-tuning multilingual artificial intelligence models, and conducting a thorough assessment of translation quality. This machine translation task, focused on Turkic languages, is part of a larger project to develop meeting minutes from transcriptions of Turkic-language speech.
The outcomes of this study pave the way for expanding this methodology across other underrepresented language pairs. By demonstrating a reproducible, scalable approach, this work contributes to improving the use of free, open-source MT for low-resource languages and reducing digital linguistic inequality.
An analysis at this scale has not previously been conducted for the Turkic language family, making this work one of the first comparative studies to use large synthetic corpora across several translation pairs into Kazakh. The relevance of this research stems from the urgent need to overcome digital linguistic inequality across Turkic languages. Many languages, including low-resource Turkic languages, remain severely underrepresented in digital ecosystems, which substantially limits the performance and applicability of existing NMT models.
Turkic languages face several domain-specific challenges:
Data Scarcity: Available parallel corpora are extremely limited, and digital content remains insufficient. For example, the volume of parallel sentences in the OPUS open repository for the Turkmen–Kazakh language pair is only 22,119 sentences, which is inadequate for training modern NMT architectures.
Morphological Complexity: Turkic languages are characterized by agglutinative morphology, which results in a vast number of unique word forms. This significantly complicates model training without morphological segmentation or subword modeling.
Script Diversity: Many Turkic languages use multiple writing systems (Cyrillic, Latin, and Arabic script), which requires additional normalization and harmonization of textual data.
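To make the script-diversity problem concrete, the dominant script of a sentence can be detected with a simple Unicode-based heuristic before normalization. The following Python sketch is illustrative only (the 0.8 majority threshold is an assumption, and a production pipeline would also handle transliteration and mixed-script tokens):

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Classify a sentence as cyrillic, latin, arabic, mixed, or unknown
    by counting the Unicode script of its alphabetic characters."""
    counts = {"CYRILLIC": 0, "LATIN": 0, "ARABIC": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    script, top = max(counts.items(), key=lambda kv: kv[1])
    # Require a clear majority; otherwise flag the sentence for review.
    return script.lower() if top / total > 0.8 else "mixed"
```

Sentences flagged as `mixed` or `unknown` would then be routed to normalization or excluded from corpus construction.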
The use of open-source AI systems for low-resource languages remains a pressing challenge. Our research provides a practical, scalable solution based on the creation, cleaning, and refinement of synthetic corpora, enabling substantial improvements in translation quality and, consequently, helping reduce digital linguistic inequality. The goal of this study is to enhance machine translation quality for six Turkic languages (Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek) by leveraging publicly available, open-source AI models. Specifically, this research focuses on developing parallel corpora for Turkic languages and on selecting and fine-tuning models using these data.
The scientific contribution of this work lies in the design and validation of a scalable, reproducible pipeline for neural machine translation in low-resource Turkic languages, based entirely on free, open-access AI tools:
The construction of multilingual synthetic parallel corpora for multiple Turkic–Kazakh language pairs using established back-translation techniques;
An automated data cleaning and filtering strategy to mitigate noise, duplication, and hallucinations inherent in synthetic data;
The fine-tuning of multilingual neural machine translation models on the resulting corpora;
A comprehensive evaluation protocol combining standard surface-form metrics (BLEU, chrF) with semantic metrics (BERTScore, COMET), external evaluation on FLORES 200, and human evaluation.
This paper is structured as follows: The Introduction situates this study within its broader research context, explains why improving machine translation for low-resource Turkic languages is an urgent task, and outlines this work’s key goals.
Section 2: Related Work reviews the existing literature, highlighting the main challenges of neural machine translation for low-resource and morphologically complex languages, recent developments in Turkic-language MT research, and the growing role of synthetic data generation.
Section 3: Methodology details the methodological approach, including the selection of AI models, the creation of synthetic parallel corpora, and the automated procedures used for data cleaning and fine-tuning.
Section 4: Experimental Results and Analysis presents the outcomes of corpus construction, provides an error analysis, and evaluates how fine-tuning the chosen models (NLLB-200 1.3B and mT5-base) influences translation quality on both cleaned and raw datasets, using metrics such as BLEU, WER, TER, chrF, BERTScore, and COMET, as well as external and human evaluation.
Section 5: Discussion interprets these findings, assesses the strengths and limitations of the proposed approach, and outlines opportunities for expanding the developed resources. Finally,
Section 6: Conclusion summarizes the main contributions and demonstrates the effectiveness of open-source AI models for corpus creation and model adaptation in low-resource Turkic-language settings.
2. Related Work
Machine translation (MT) has shifted from rule-based and statistical approaches to neural models, which now dominate both practice and research. Modern methods are based on neural machine translation (NMT) models that account for sentence context and capture complex word relationships, enabling high-quality translation. An important milestone in the development of industrial NMT was the introduction of Google’s GNMT system [
11]. This model significantly improved translation quality through subword tokenization, deep LSTM layers, and attention mechanisms. The GNMT architecture laid the foundation for the development of modern models, such as T5 and mT5, which build on later transformer architectures.
The development of NMT has accelerated significantly in recent years, driven by advances in artificial intelligence, deep learning, and transformer-based architectures [
12]. NMT models have vastly outperformed traditional statistical machine translation (SMT) systems in both high-resource and multilingual settings.
However, translating low-resource languages, including many Turkic languages, remains a significant challenge due to limited parallel corpora, morphological complexity, and orthographic variation. A comprehensive analysis of neural machine translation outlines the progression from statistical approaches to neural architectures, emphasizing the importance of flexible model design and careful data selection. The discussion highlights persistent limitations in NMT, including insufficient generalization, difficulties with rare words, and challenges in domain adaptation, providing a structured overview of the field’s methodological evolution [
13]. This study emphasizes several critical aspects:
Architectural flexibility: Different architectures (RNN-based, convolutional, and transformer) provide varying capabilities for sequence modeling.
Data selection: NMT models are highly dependent on high-quality parallel corpora; poor-quality data leads to overfitting and poor generalization.
Challenges with rare words: Neural models struggle to translate infrequent words and morphologically rich forms without subword tokenization (e.g., BPE, SentencePiece).
Domain adaptation limitations: NMT systems trained on one domain often fail to generalize to other domains without fine-tuning.
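The subword tokenization mentioned above (e.g., BPE) can be illustrated with a minimal sketch of the merge-learning step: the most frequent adjacent symbol pair is repeatedly merged into a new symbol, so frequent suffixes of agglutinative word forms become single units. This is a simplified illustration, not the full BPE algorithm used by production tokenizers (no end-of-word markers or frequency-weighted vocabularies):

```python
from collections import Counter

def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a word list: repeatedly merge the
    most frequent adjacent symbol pair into a single symbol."""
    corpus = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges
```

On a toy list of related word forms, the first merges learned tend to be shared stems and suffix fragments, which is precisely why subword models mitigate the rare-word problem for morphologically rich languages.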
These works establish a foundation for understanding the specific difficulties that arise when working with low-resource languages, where parallel data may be scarce or non-existent.
2.1. Synthetic Data Generation and Multilingual Transfer for Low-Resource NMT
Research on low-resource neural machine translation (NMT) has primarily focused on transfer learning, multilingual pretraining, and synthetic data generation to mitigate the scarcity of parallel corpora. Numerous studies demonstrate that combining supervised and semi-supervised learning strategies with synthetic data can substantially enhance translation quality, particularly when transformer-based architectures with advanced attention mechanisms are employed [
14]. These models offer improved contextual modeling and robustness, even when trained on small, resource-constrained datasets.
A central technique in this area is back-translation, initially introduced by [
15] as an effective method for exploiting monolingual data. In back-translation, monolingual target-language text is translated into the source language using a reverse MT system, producing synthetic parallel sentence pairs that are then used for NMT training. Large-scale empirical studies confirm that back-translation remains effective across diverse language pairs and training scales, consistently improving translation quality in low-resource and domain-mismatched scenarios [
16]. These findings highlight the importance of synthetic data augmentation when genuine parallel resources are limited.
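The back-translation procedure just described can be sketched in a few lines of Python. This is a schematic illustration only: `reverse_translate` stands in for any trained target-to-source MT system, not a specific model from the studies cited above.

```python
def back_translate(monolingual_target, reverse_translate):
    """Create synthetic (source, target) pairs from monolingual
    target-language sentences using a reverse MT system."""
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        source_sentence = reverse_translate(target_sentence)
        # The human-written side is kept as the training target,
        # so translation noise stays on the source side.
        synthetic_pairs.append((source_sentence, target_sentence))
    return synthetic_pairs
```

The key design point is that the machine-generated text appears only on the source side, so the model still learns to produce fluent, human-authored target-language output.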
Complementary to back-translation, pivot translation and multilingual transfer have become widely adopted strategies for low-resource MT. Pivot-based approaches generate synthetic parallel data via an intermediate language, while multilingual NMT models leverage shared representations across multiple language pairs to facilitate cross-lingual transfer. Surveys of neural machine translation emphasize that multilingual learning plays a key role in alleviating data sparsity, while also identifying challenges such as error propagation, domain mismatch, and sensitivity to noisy training data [
17].
For Turkic languages, multilingual and family-based transfer is particularly effective due to shared morphological and syntactic properties. Prior studies show that multilingual NMT systems trained on related Turkic languages significantly outperform bilingual baselines in low-resource settings [
2,
18,
19]. The availability of multilingual resources such as KazParC and MuST-C further supports effective cross-lingual transfer within this language family [
20,
21]. In addition, large-scale multilingual models such as NLLB-200 provide many-to-many and zero-shot translation capabilities for more than 200 languages, including under-resourced Turkic–Kazakh language pairs [
22,
23]. Recent work demonstrates that fine-tuning NLLB models on task-specific or synthetic data consistently improves BLEU, chrF, and TER scores, even for closely related low-resource languages [
20,
24].
Despite these advances, synthetic-data-based approaches—including back-translation, pivot translation, and multilingual transfer—remain vulnerable to noise and error propagation. Errors introduced during reverse or intermediate translation stages, such as hallucinations, truncation, repetition, language mixing, and incorrect handling of named entities, are often directly transferred to the synthetic corpora. Most existing studies address these issues only indirectly, relying on coarse filtering heuristics or corpus-level data selection strategies, without explicitly identifying or correcting individual error types. This limitation is especially pronounced for morphologically rich languages, such as those in the Turkic family. It motivates the need for more structured, transparent, and reproducible synthetic data cleaning and correction pipelines.
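The error types listed above can be screened for with simple heuristics. The following sketch shows two illustrative detectors for repetition (a common symptom of NMT degeneration) and for length-ratio anomalies that often indicate truncation or hallucination; the thresholds are arbitrary examples, not values used in any particular study:

```python
def looks_repetitive(sentence: str, max_repeat: int = 3) -> bool:
    """Flag sentences where one token repeats many times in a row."""
    tokens = sentence.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True
    return False

def suspicious_length_ratio(src: str, tgt: str,
                            low: float = 0.5, high: float = 2.0) -> bool:
    """Flag pairs whose source/target length ratio is implausible."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return True
    ratio = ls / lt
    return ratio < low or ratio > high
```

Heuristics of this kind catch only surface symptoms; detecting semantic hallucinations or named-entity errors requires the more structured correction pipelines argued for here.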
2.2. NMT for Turkic Languages
The morphological and structural features of Turkic languages complicate low-resource machine translation. These challenges are amplified by:
Agglutinative morphology, creating a large number of possible word forms;
Scarcity of parallel corpora, especially for less widely spoken languages (e.g., Tuvan, Karakalpak);
Multiple orthographies (Latin, Cyrillic, and Arabic scripts), requiring additional preprocessing.
Despite the linguistic proximity of Azerbaijani and Kazakh, translation from Azerbaijani into Kazakh has not been sufficiently studied. Reasons include the scarcity of parallel corpora, limited availability of well-established oral and written data, and weak digital infrastructure. Recent research has explored Azerbaijani-to-English translation using morphological segmentation, highlighting its potential to improve handling of rich morphological structures in low-resource settings [
25]. Given that both Azerbaijani and Kazakh are agglutinative, these conclusions are likely transferable to the Azerbaijani–Kazakh direction as well. The OPUS project offers parallel corpora for many language pairs, including Turkic ones [
26]. Recent studies have analyzed speech translation for Turkic languages using transformer-based architectures, noting persistent challenges in processing spontaneous speech and the scarcity of available corpora [
27]. In response, the multilingual KazParC corpus has been introduced, covering Kazakh, English, Turkish, and Russian, providing a resource designed to ensure cross-linguistic consistency [
20].
Despite linguistic similarities, developing the Kyrgyz–Kazakh neural machine translation (NMT) system faces several challenges due to limited resources. Studies [
7,
18] show that, despite growing scientific interest in the Turkic languages, parallel data remain insufficient. The KazParC [
20] and MuST-C [
21] projects effectively support multilingual learning by providing a balanced text and speech corpus. Although the NLLB-200 model [
23] has significantly improved translation quality between Turkic languages, fine-tuning remains important for closely related language pairs, such as Kyrgyz–Kazakh [
24]. Recent studies have shown that using the NLLB-200 3.3B model to generate synthetic data for the Kyrgyz language based on Kazakh–English corpora, with subsequent fine-tuning of the NLLB-200 1.3B model, significantly improved BLEU, chrF, and TER scores [
20]. These results demonstrate the importance of targeted adaptation and consideration of linguistic features to achieve high-quality translation, even between related languages.
The first machine translation system from Turkmen to Turkish was proposed using a rule-based approach and structural similarity [
28]. In a subsequent study, refs. [
2,
19] presented a robust multilingual corpus and NMT benchmark results for 22 Turkic languages. Their results show that family-based training improves cross-lingual transfer, suggesting that the approach can also be applied effectively to closely related pairs such as Turkmen-to-Kazakh.
Translation from Kazakh to Turkish presents several difficulties due to morphological complexity and flexible word order. Although multilingual models such as mT5 and mBERT have shown promising results, they require specially adapted data for high-quality translation. In this context, the KazParC corpus and the Tilmash system were presented [
20], demonstrating the effectiveness of a neural machine translation approach that outperforms commercial services on BLEU and chrF scores. To address the need for more accurate syntactic representation, a hybrid CSE architecture was proposed [
29] to enhance translation quality for complex sentences. Recent research on multitask models for Turkic languages demonstrated the effectiveness of multitasking systems for Kazakh, Turkish, and Uzbek, achieving high translation quality even with limited supervision [
30]. Furthermore, studies have shown that incorporating POS tagging and transfer learning can significantly improve translation quality for low-resource Turkic language pairs [
31].
Overall, these results indicate that morpho-segmentation, synthetic data generation, and multilingual transformer-based learning are the most effective approaches for improving machine translation for Turkic languages.
Uzbek–Kazakh NMT benefits considerably from explicit morphological processing. The use of a Complete Set of Endings (CSE) in analysis has been shown to enhance grammatical coherence and improve translation accuracy, particularly when handling agglutinative suffix sequences [
32].
These results show that morphologically aware systems significantly outperform translation systems that lack linguistic segmentation. This encourages the further development of cascading translation pipelines, especially for speech-based systems.
Overall, these studies show that the effectiveness of neural machine translation depends not only on reliable model architecture, but also on the availability of high-quality datasets and careful linguistic adaptation. These results provide a basis for developing a strategy for low-resource MT systems that accounts for the morphological and structural features of Turkic languages. These characteristics make Turkic–Kazakh language pairs particularly sensitive to noise in synthetic data, emphasizing the need for structured cleaning and correction pipelines.
2.3. Available Parallel Corpora for Turkic–Kazakh Pairs
Effective multilingual transfer (
Section 2.1) relies heavily on the availability of parallel corpora, making the expansion and cleaning of Turkic–Kazakh datasets critical.
The availability of parallel corpora plays an important role in improving translation quality for languages with limited resources. The open repository OPUS aggregates parallel data from various sources (CCMatrix, WikiMatrix, ParaCrawl, OpenSubtitles, etc.) and provides access to data for low-resource languages [
33]. As of 20 November 2025, the OPUS repository offers the following amounts of parallel data for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh language pairs, as shown in
Table 1.
In addition to OPUS, specialized corpora are emerging. For example, the KazParC corpus (371,902 parallel sentences across the kk–tr, kk–en, and kk–ru directions) was created for the kk–tr pair as part of the initiative to expand Kazakh parallel resources [
20]. For uz–kk, a separate corpus was published in 2024 (113,877 parallel sentences) [
34], and for kk–ky and az–kk, synthetic data generated by the back-translation method is actively used [
35]. In addition, there are consolidated Turkic sets (e.g., the TIL Corpus), which enable the formation of multilingual models.
Research on most pairs involving the Kazakh language is mainly limited to the kk–en and kk–ru directions, where sufficient corpora have been collected for systematic experiments. For the Turkic–Kazakh pairs (az–kk, ky–kk, tk–kk, tr–kk, and uz–kk), there are still few directly comparable publications reporting BLEU and chrF results. Several papers provide the results of preliminary fine-tuning experiments with synthetic corpora, but they are limited by small test samples and heterogeneous methodologies.
At the same time, large multilingual systems, such as No Language Left Behind (NLLB-200) [
22], demonstrate efficiency under limited resources. The results on FLORES-200 show that fine-tuning in low-resource languages can significantly increase quality (spBLEU, chrF++). This confirms the promise of large-scale AI models for the areas under consideration.
The need to expand corpora. The volume of parallel data as of 4 September 2025 remains limited for several pairs (especially tk–kk: 22,119 sentences), which constrains the accuracy of NMT. Even for az–kk and uz–kk, corpora of several hundred thousand sentences are insufficient for modern architectures.
Relevance of AI application. The use of large language models (e.g., NLLB-200) and synthetic augmentation methods (e.g., back-translation and data augmentation) can compensate for data scarcity and improve translation accuracy [
36]. Of additional importance is the filtering of synthetic data (eliminating hallucinations and noise), which directly enhances BLEU/chrF scores.
Thus, the results of the review confirm:
The expansion and purification of corpora for Turkic–Kazakh pairs remains a critically important area.
The use of AI as a tool for generation and additional training is the most promising strategy for improving the quality of translation of low-resource agglutinative languages.
This review indicates that the quality of machine translation for Turkic languages largely depends on the availability of clean parallel corpora and the use of multilingual transformer-based models. However, many Turkic–Kazakh language pairs remain severely under-resourced, and there is still no unified methodology for generating and filtering parallel data. These limitations highlight the need for the approach proposed in the following section.
In summary, the existing work demonstrates the effectiveness of back-translation, pivot translation, and multilingual pretraining for low-resource NMT. However, there is still no unified and reproducible methodology for generating, correcting, and filtering synthetic parallel data, particularly for morphologically rich Turkic languages. Most approaches rely on heuristic filtering and discard large portions of synthetic data without attempting targeted correction. This gap directly motivates the data-centric pipeline proposed in this work.
3. Methodology
3.1. General Methodology Framework Schemes
This study aims to systematically improve the quality of machine translation for six state-level Turkic languages—Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek—by adapting and fine-tuning freely available open-source artificial intelligence models. The proposed methodology was explicitly designed for morphologically rich languages under low-resource conditions.
The selection of these six Turkic languages was motivated by a combination of methodological and practical considerations.
First, all the selected languages belong to the Turkic language family and share an agglutinative morphological structure. This typological similarity enables cross-lingual knowledge transfer and supports the use of a unified methodological framework.
Second, each language has an official status in its respective country, underscoring this study’s practical relevance for governmental, educational, and digital applications.
Third, the language set was intentionally designed to include both a relatively high-resource language (Turkish) and low-resource languages (Turkmen and Kyrgyz), allowing the proposed methodology to be evaluated under heterogeneous resource availability and to assess its generalizability.
The proposed methodology establishes a comprehensive, reproducible pipeline for the construction of synthetic parallel corpora, quality refinement, and the subsequent fine-tuning of machine translation models within the Turkic language family. The framework consists of two interdependent stages:
(1) AI-driven generation and multi-level refinement of synthetic parallel corpora.
(2) Adaptation and evaluation of machine translation models trained on the resulting corpora.
As illustrated in
Figure 1, Kazakh was employed as a pivot language. This choice was driven by the availability of large-scale monolingual Kazakh corpora within the research group. It enables the systematic generation of synthetic parallel data for the following language pairs: Kazakh–Azerbaijani, Kazakh–Kyrgyz, Kazakh–Turkish, Kazakh–Turkmen, and Kazakh–Uzbek, using preselected AI-based translation models.
The proposed methodology implements a multi-stage framework for constructing high-quality parallel corpora and adapting machine translation models for low-resource Turkic languages. Each stage is functionally interdependent and contributes directly to the reliability of downstream model training and evaluation.
Stage 1: Synthetic Parallel Corpus Generation.
The process began with a large-scale monolingual Kazakh corpus, which was used as a pivot resource due to its availability and linguistic centrality within the Turkic language family. Synthetic parallel corpora were generated via back-translation for five language pairs (Kazakh–Azerbaijani, Kazakh–Kyrgyz, Kazakh–Turkish, Kazakh–Turkmen, and Kazakh–Uzbek) using preselected AI-based translation models. This step establishes the initial bilingual data required for subsequent refinement.
Stage 2: Multi-Level Quality Refinement.
The generated corpora underwent a structured, three-level quality control process. First, manual expert validation was applied to identify systematic errors such as incorrect abbreviations, repetitions, and misaligned segments. Second, automatic filtering enforced linguistic and structural constraints, eliminating duplicates and inconsistent translations. Third, targeted regeneration was performed, where low-quality segments were re-translated using alternative models. This multi-level refinement ensures both scalability and linguistic precision.
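The multi-level refinement just described can be sketched as a filter-and-regenerate loop. This is a schematic illustration only: `is_acceptable` and `retranslate` are placeholders for the project's actual filters and alternative translation models, and the round limit is an assumption.

```python
def refine_corpus(pairs, is_acceptable, retranslate, max_rounds=2):
    """Keep acceptable pairs, drop exact duplicates, and re-translate
    rejected segments with an alternative model for a limited number
    of rounds."""
    seen, clean, pending = set(), [], list(pairs)
    for _ in range(max_rounds):
        rejected = []
        for src, tgt in pending:
            if (src, tgt) in seen:
                continue  # drop exact duplicates
            seen.add((src, tgt))
            if is_acceptable(src, tgt):
                clean.append((src, tgt))
            else:
                rejected.append((src, tgt))
        # Targeted regeneration: re-translate only the rejected segments.
        pending = [(retranslate(tgt), tgt) for _, tgt in rejected]
        if not pending:
            break
    return clean
```

Limiting the number of regeneration rounds keeps the pipeline scalable: segments that remain unacceptable after re-translation are simply discarded rather than looped over indefinitely.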
Stage 3: Error Analysis and Corpus Validation.
For each language pair, two corpora containing 300,000 and 500,000 sentence pairs were analyzed to identify recurring structural and semantic errors. The analysis revealed common challenges in morphological agreement, word order, and semantic fidelity across all language pairs, confirming the need for iterative filtering. The outcome of this stage was a validated bilingual corpus for each pair, consisting of 500,000 sentence pairs refined through three quality-control layers.
Stage 4: Model Fine-Tuning and Evaluation.
In the second primary phase, selected AI-based machine translation models were fine-tuned using the validated synthetic corpora. Model performance was evaluated at three levels: individual assessment for each language pair, comparative analysis across models, and final testing on an independent test corpus. Translation quality was measured using complementary metrics—WER [
37], TER [
38], BLEU [
39], and chrF [
40]—to capture both lexical accuracy and structural adequacy.
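As a point of reference for these metrics, WER can be computed as a word-level edit distance normalized by the reference length. The sketch below is a minimal pure-Python implementation for illustration; the scores reported in this work are assumed to come from standard evaluation toolkits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # drop a reference word
                      d[j - 1] + 1,      # drop a hypothesis word
                      prev + (r != h))   # match or substitute
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

Because WER counts whole-word mismatches, a single wrong suffix on an agglutinative word form counts as a full error, which is why character-level metrics such as chrF are reported alongside it.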
Overall, this methodology provides a structured, scalable, and empirically grounded approach to improving machine translation for morphologically rich, low-resource Turkic languages. The use of Kazakh as a pivot language enables systematic synthetic data generation, while multi-level refinement and iterative evaluation ensure corpus reliability and model robustness. Together, these stages form a reproducible foundation for building reliable machine translation systems under conditions of limited language representation.
In the following sections, we provide a detailed justification of the adopted design choices, a structured description of each methodological stage, and a comprehensive analysis of the practical results obtained through the proposed approach.
Figure 2 provides a high-level overview of the proposed pipeline for generating synthetic parallel data and fine-tuning a neural machine translation model. The process starts with monolingual data, which is used to generate synthetic parallel data. The generated data then undergoes cleaning and re-generation to improve alignment and reduce noise, resulting in curated parallel corpora. These corpora are subsequently used to fine-tune neural machine translation models, followed by automatic and expert-based evaluation of translation quality. For clarity, we provide a high-level pipeline overview in
Figure 2, while detailed processing steps are shown in
Figure 1.
To provide a comprehensive evaluation of the translation quality, we employed a combination of surface-level, character-level, and semantic metrics. BLEU and TER were used to measure n-gram overlap and edit distance with respect to reference translations, which remain standard indicators in machine translation evaluation. Given the morphologically rich nature of Turkic languages, we additionally report chrF and WER, as these metrics are more sensitive to character-level variation and inflectional morphology. Furthermore, we included BERTScore and COMET to assess semantic adequacy and meaning preservation beyond surface-form similarity.
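To make the surface-level metrics concrete, word error rate (WER) can be computed as the word-level Levenshtein distance between a hypothesis and its reference, normalized by reference length. The sketch below is purely illustrative; in our experiments, standard evaluation tooling was used rather than this implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

TER follows the same edit-distance idea but additionally allows block shifts, while BLEU and chrF measure n-gram overlap at the word and character level, respectively.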
3.2. Model Selection for Parallel Corpus Generation
The quality of synthetic parallel corpora directly determines the effectiveness of subsequent fine-tuning. It is therefore essential to select an appropriate AI model, and the selection should be based on transparent, reproducible criteria.
The model selection was based on the following criteria:
Open access and availability. Models must be openly accessible and usable in research without licensing restrictions, ensuring reproducibility of experiments and scalability of the methodology.
Translation quality. Models should accurately preserve suffixes, case endings, and word order, since the target languages are morphologically rich.
Language Coverage. Selected models should support the target Turkic languages either directly or indirectly. This criterion ensures that synthetic corpora can be generated consistently across multiple language pairs.
Computational Efficiency. Given the enormous volume of text required for corpus generation, the models must be able to process data within reasonable timeframes and within hardware constraints. Efficiency is evaluated based on throughput, memory consumption, and scalability.
Consistency and robustness. Models must produce stable results across runs and remain robust to changes in subject domain, with performance variability taken into account.
Community Validation and Benchmarking. Preference is given to models that have been evaluated in international benchmarks (e.g., WMT, FLORES-200) or have documented performance metrics. This provides an external reference point for assessing translation quality.
A central task in creating parallel corpora for Turkic language pairs is choosing appropriate tools. Given recent advances in machine translation, the most suitable approach is to use modern artificial intelligence (AI) systems capable of producing high-quality translations.
Agglutinative languages such as Kazakh, Kyrgyz, and Turkmen present unique linguistic challenges due to their complex morphemic structures, extensive post-fixation, and flexible word order [
41]. These features complicate both segmentation and the generation of semantically and morphologically accurate translations.
Based on these criteria, several state-of-the-art AI systems were evaluated:
Commercial/API-based:
Google Translate (Google), GPT-4o (OpenAI), and Copilot (Microsoft) [
44].
Open source:
NLLB-200 600M, NLLB-200 1.3B, NLLB-200 3.3B—Meta AI [
45];
Gemma 2 27B—Google DeepMind [
46];
Phi-4 (14B)—Microsoft [
47];
Qwen2.5 (Alibaba);
LLaMA 3.2 [
49], LLaMA 3.1 (Meta/Facebook) [
50].
The expert evaluation process yielded the following results:
Rejected due to availability or efficiency issues: Google Translate, GPT-4o, and Copilot;
Rejected due to insufficient translation quality: NLLB-200 600M, NLLB-200 1.3B, Phi-4, Qwen2.5, LLaMA 3.2, LLaMA 3.1;
Selected for detailed testing: Gemma 2 27B (Google) and NLLB-200 3.3B (Meta/Facebook), which demonstrated availability (both are freely available), good translation quality, and the ability to process large volumes of text.
Based on these criteria, the NLLB-200 3.3B model was chosen as the primary tool for generating parallel corpora for Turkic language pairs. This choice was driven by a combination of factors that are particularly important for morphologically complex, low-resource languages.
First, the NLLB-200 3.3B (No Language Left Behind) model, developed by Meta AI [
22], is specifically designed to improve translation quality for low-resource languages. The flagship NLLB-200 model employs a Sparsely Gated Mixture of Experts (MoE) architecture that scales efficiently by activating only a subset of its parameters for each translation; the publicly released 3.3B checkpoint used here is a dense model from the same family.
Second, the translation quality of NLLB-200 3.3B demonstrates consistent results for languages with a rich morphology, including the accurate rendering of suffixes, case endings, and word order. The model was initially designed to support resource-poor languages, so its architecture and training data were optimized for tasks related to Turkic language pairs. Unlike smaller versions (600M and 1.3B), version 3.3B provides sufficient depth of context modeling, reflected in the stability of morphological constructions and syntactic relations.
Third, the model has broad language coverage, including support for most Turkic languages, either directly or through closely related pairs. This enables the consistent generation of parallel corpora for different translation directions, minimizing quality imbalances between language pairs.
A fourth important factor is the computational efficiency of NLLB-200 3.3B. Despite the model’s relatively large size, it can be deployed and used on modern GPU clusters and local servers, ensuring high throughput when processing large datasets. The balance between translation quality and computational costs proved to be optimal compared to other candidates.
The next aspect is the robustness and consistency of the results. NLLB-200 shows low performance fluctuations across different subject domains, which is particularly important for building a universal corpus that covers a variety of topics. The model demonstrates robustness when processing texts of varying styles and complexity.
Taking these factors together, NLLB-200 3.3B emerged as the most appropriate choice for the project: it combines openness, high translation quality, broad language coverage, and sufficient computational efficiency. These characteristics make it particularly suitable for the automated creation of synthetic parallel corpora for Turkic languages and for the subsequent re-training of a machine translation system.
Thus, the NLLB-200 3.3B [
24] was used to translate large monolingual Kazakh corpora into five related Turkic languages. As a result, more than 500,000 sentences were generated, demonstrating the high efficiency and stability of this model.
In addition to the NLLB-200 3.3B, the Gemma 2 27B model was chosen for testing as a second parallel corpus generation model. This decision was motivated by several advantages that make the model promising for high-quality synthetic translation and subsequent evaluation.
The Gemma 2 27B model, developed by Google DeepMind in 2024, is a large-scale transformer architecture for generating and translating multilingual texts. This study uses a version of the model that incorporates multilingual embeddings, advanced attention mechanisms, and multi-task learning components. The main advantage of the Gemma architecture lies in its ability to adapt to translation tasks between closely related languages and to perform them efficiently, even with limited training data.
In this project, Gemma 2 27B was used to translate more than 200,000 sentences, enabling a comparative analysis with the NLLB-200 3.3B model. Several hallucinated fragments were found in the Gemma 2 27B translations, so our final choice was NLLB-200 3.3B.
The selected model, NLLB-200 3.3B, showed good translation quality and time performance across five pairs of Turkic languages in a preliminary translation of 100,000 Kazakh sentences.
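Concretely, the corpus generation step can be sketched with the Hugging Face transformers API. The checkpoint identifier and the FLORES-200 language codes below are the standard public ones; the batching and decoding details are illustrative assumptions rather than our exact production script.

```python
# FLORES-200 language codes used by NLLB for the six Turkic languages in this study.
FLORES_CODES = {
    "kk": "kaz_Cyrl",  # Kazakh (pivot/source)
    "az": "azj_Latn",  # Azerbaijani
    "ky": "kir_Cyrl",  # Kyrgyz
    "tk": "tuk_Latn",  # Turkmen
    "tr": "tur_Latn",  # Turkish
    "uz": "uzn_Latn",  # Uzbek
}

def translate_batch(sentences, tgt: str, model_name: str = "facebook/nllb-200-3.3B"):
    """Translate a batch of Kazakh sentences into a target Turkic language (sketch)."""
    # Imported lazily: the 3.3B checkpoint is several GB and needs a GPU in practice.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_name, src_lang=FLORES_CODES["kk"])
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tok(sentences, return_tensors="pt",
                 padding=True, truncation=True, max_length=256)
    out = model.generate(
        **inputs,
        # Force the first decoded token to be the target-language code
        forced_bos_token_id=tok.convert_tokens_to_ids(FLORES_CODES[tgt]),
        max_length=256,
    )
    return tok.batch_decode(out, skip_special_tokens=True)
```

Iterating this function over a monolingual Kazakh corpus for each target code yields the raw synthetic parallel data that is subsequently cleaned.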
Additionally, a practical constraint was introduced: compatibility with available hardware, specifically the ability to train on GPUs with 24 GB of video memory.
A range of AI models was evaluated according to these criteria, including:
Meta AI models: NLLB-200 600M, NLLB-200 1.3B, NLLB-200 3.3B;
Google models: mT5 [
51], mT5-small [
52], mT5-base [
53], mT5-large [
54], mT5-3B [
55], mT5-11B [
56];
Other LLMs: GPT-4o (OpenAI), Copilot (Microsoft), Phi-4 (Microsoft), Qwen2.5 (Alibaba), Gemma 2 27B (Google), LLaMA 3.1 and LLaMA 3.2 (Meta).
Based on this evaluation, NLLB-200 1.3B and mT5 were selected as the primary candidates for fine-tuning, offering a suitable compromise between model size, translation performance, and hardware resource constraints.
3.3. Error Analysis and Cleaning of Parallel Corpora
For the generation of synthetic parallel corpora, two monolingual Kazakh corpora of 300,000 and 500,000 sentences were used. The Kazakh–English corpus consists of 300,000 parallel sentences collected from various news and government sites and primarily covers news and official government domains (grant project AP05131415: Development and research of neural machine translation of Kazakh, 2018–2020, Ministry of Education and Science of the Republic of Kazakhstan); for the current project, we took the Kazakh part of this corpus. A 500,000-sentence Kazakh corpus was compiled as part of a joint grant project with the Xinjiang Technical Institute of Physics and Chemistry of the Chinese Academy of Sciences on the creation of a Kazakh–Chinese multimodal corpus and the research and application of intelligent text-processing technologies (2024–2025). The corpus topics included medical, technical, and colloquial domains.
Error analysis of the synthetic parallel corpora generated by NLLB-200 3.3B was performed manually. Several recurring error types were identified across all five language pairs, affecting both structural consistency and semantic accuracy. General categories of errors observed in the corpora of all pairs were:
Lexical duplicates (repeated fragments within one sentence or between sentences);
Erroneous rendering of geographical names and proper names;
Use of foreign abbreviations without decoding or adaptation;
Partially translated phrases, truncation of semantic units, and violation of the integrity of the syntactic structure.
Effective cleaning of parallel corpora is a critical prerequisite for high-quality neural machine translation [
57,
58]. This is particularly important for Turkic languages, which are highly agglutinative, morphologically rich, and often exhibit flexible word order—properties that complicate tokenization, segmentation, and MT model learning [
2,
7,
59,
60]. Poor-quality data directly compromises the model’s ability to learn accurate translation mappings and significantly degrades evaluation scores (e.g., BLEU, chrF) [
61,
62].
These issues confirm that errors in synthetic corpora are systematic, often resulting from the interaction between morphological complexity and inadequate context modeling by large language models. Therefore, the cleaning process—involving manual inspection, automated filtering, and rule-based corrections—is essential before model fine-tuning.
3.3.1. Correction Module: Purpose and Principles
A specialized Correction Module, implemented in Python version 3.11.14, was developed for automated cleaning and the normalization of synthetic parallel data. This component is a rule-based tool designed to ensure reproducible corpus cleaning and to eliminate systematic defects introduced during synthetic translation.
Importantly, the module:
Does not use neural models;
Does not employ embedding-based similarity;
Does not rely on external named entity recognition (NER) systems;
Does not incorporate fuzzy matching or edit distance.
These design choices ensure full transparency of the applied rules and high reproducibility of the processing results.
The module workflow consists of four sequential stages:
Text normalization (whitespace correction and character/punctuation standardization);
Filtering of duplicates and hallucinations (repetition loops);
Rule-based correction of named entities and abbreviations;
Saving the cleaned corpus along with detailed change logs (deleted and corrected lines).
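The first of these stages can be illustrated with a small normalization function. The exact rule set of the Correction Module is richer; the regular expressions below are illustrative assumptions rather than the module's actual code.

```python
import re

def normalize_text(line: str) -> str:
    """Stage 1 sketch: whitespace correction and punctuation standardization."""
    line = line.strip()
    line = re.sub(r"\s+", " ", line)              # collapse runs of whitespace
    line = re.sub(r"\s+([,.;:!?])", r"\1", line)  # remove space before punctuation
    line = line.replace("«", '"').replace("»", '"')  # unify quotation marks
    return line
```

Each subsequent stage records its changes in a log, so deleted and corrected lines remain traceable.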
3.3.2. Named Entity Correction: Rules and Restrictions
Named entity correction was performed using explicitly defined rules and dictionaries of typical erroneous substitutions identified during manual inspection of synthetic translations. Corrections were applied only when a clear contextual cue was present in the original Kazakh sentence, thereby reducing the risk of false substitutions.
The module processes the following entity types:
Geographical names and country names;
State and institutional abbreviations;
Personal names;
Linguistic markers (e.g., “қазақша”, “түрікше”).
A correction was applied only when all three of the following conditions were satisfied:
A trigger keyword appears in the Kazakh source sentence;
A predefined erroneous translation is detected in the target segment;
The target segment contains no more than one geographical entity.
The third constraint was introduced to avoid incorrect substitutions in sentences with complex geographic or political contexts (e.g., those mentioning multiple countries).
Examples of Correction Rules
Correction rules for named entities and abbreviations were defined as templates of the following form: (trigger in source KZ) + (typical erroneous substitution in target text) → correct form
Representative examples include (detailed statistics are provided in
Table 2):
Kazakhstan/ҚР → Azərbaycan → Qazaxıstan.
Қазақша → Azərbaycanca → Qazaxça.
ӨР → Azərbaycan/Ermənistan → Özbəkistan.
ҚР → Qırğızıstan/Qırğızıstanın → Qazaxıstan.
Technically, replacements were implemented using whole word matching with regular expressions to prevent substitutions inside word fragments and to minimize false positives.
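A minimal sketch of such a rule applier is shown below. The rule table and entity list are illustrative fragments (the full dictionaries were compiled manually during corpus inspection), while the three application conditions mirror those listed above.

```python
import re

# Illustrative fragments of the rule table: (trigger in Kazakh source,
# erroneous target form, corrected form). The full dictionaries are larger.
RULES = [
    ("Қазақша", "Azərbaycanca", "Qazaxça"),
    ("Қазақстан", "Azərbaycan", "Qazaxıstan"),
]
GEO_ENTITIES = ["Azərbaycan", "Qazaxıstan", "Qırğızıstan", "Özbəkistan"]

def correct_entities(src_kz: str, tgt: str) -> str:
    """Apply a correction only if (1) the trigger occurs in the source,
    (2) the erroneous form occurs in the target, and (3) the target
    mentions at most one geographical entity."""
    geo_count = sum(bool(re.search(rf"\b{re.escape(g)}\b", tgt))
                    for g in GEO_ENTITIES)
    if geo_count > 1:
        return tgt  # complex geopolitical context: leave untouched
    for trigger, wrong, right in RULES:
        if trigger in src_kz and re.search(rf"\b{re.escape(wrong)}\b", tgt):
            # Whole-word matching prevents substitutions inside word fragments
            tgt = re.sub(rf"\b{re.escape(wrong)}\b", right, tgt)
    return tgt
```

Because `\b` anchors on word boundaries, an erroneous form such as "Azərbaycan" is never replaced inside a longer word like "Azərbaycanca", which minimizes false positives.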
3.3.3. Detection of Duplicates and Hallucinations: Matching Methods
Lexical duplicates and hallucinated fragments were detected exclusively using exact matching. The module does not apply fuzzy matching, edit distance, or embedding-based semantic similarity. Two complementary methods were implemented: detection of consecutive repetitions of a single word, and detection of consecutively repeated n-grams within a sentence.
Such errors are characteristic of synthetic corpora and are particularly harmful: repetition loops introduce undesirable training patterns, increase the likelihood of repetitive output during generation, degrade fluency, and negatively affect translation quality metrics.
3.3.4. Threshold Values and Filtering Parameters
All threshold values were determined empirically based on preliminary manual analysis of typical defects in synthetic corpora. The thresholds were intentionally conservative, prioritizing high precision and minimizing the removal of valid sentences.
The applied thresholds are as follows:
Single-word repetition: ≥3 consecutive occurrences;
n-gram repetition: ≥3 repetitions;
n-gram length: 3–10 words;
Maximum number of geographical entities for NE correction: 1;
Matching type: exact match.
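The two exact-match detectors, under the thresholds listed above, can be sketched as follows; this is a simplified reimplementation for illustration, not the module's actual code.

```python
def has_word_loop(text: str, min_repeats: int = 3) -> bool:
    """Flag >= min_repeats identical consecutive words (exact match only)."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

def has_ngram_loop(text: str, n_min: int = 3, n_max: int = 10,
                   min_repeats: int = 3) -> bool:
    """Flag an n-gram of 3-10 words repeated >= 3 times consecutively."""
    words = text.split()
    for n in range(n_min, n_max + 1):
        for start in range(len(words) - n * min_repeats + 1):
            gram = words[start:start + n]
            if all(words[start + k * n: start + (k + 1) * n] == gram
                   for k in range(1, min_repeats)):
                return True
    return False
```

Sentences flagged by either detector were removed rather than corrected, and the deletions were written to the change logs.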
Separate sentence-length filtering was not applied in this version of the module, as preliminary analysis indicated that the primary sources of systematic noise in the synthetic corpora were repetition loops and named entity errors.
3.3.5. Qualitative Examples of Errors Before and After Cleaning and Their Impact
To strengthen the evidence for the effectiveness of synthetic parallel data cleaning, we present several representative examples of systematic errors observed before and after applying the correction module. Manual analysis shows that the most frequent and most harmful errors for model training are (i) repetition loops that disrupt sentence structure, and (ii) incorrect rendering of named entities and linguistic markers, leading to semantic and factual errors.
Example 1. Repetition hallucination → line removal
KZ: Бірақ суға толы қолайсыз геологиялық денелерді барлауда…
AZ (before): … geoloji cisimlərin geoloji cisimlərin geoloji cisimlərin …
After cleaning, the line is removed automatically.
Impact: such segments bias the target-language distribution and increase repetition during generation, degrading fluency and chrF/BLEU scores.
Example 2. Short repetition loop → line removal
KZ: Аналық безді жұмыртқа тәрізді…
AZ (before): yumurtalı yumurtalı yumurtalı
After cleaning, the line was removed.
Impact: repeated tokens create a distorted training signal, reducing generation stability.
Example 3. Language marker correction
KZ: Қазақша Мен бөліп төлеуді жоспарлап отырмын.
AZ (before): Azərbaycanca Mən pay-pay ödəməyi planlaşdırıram.
AZ (after): Qazaxça Mən pay-pay ödəməyi planlaşdırıram.
Impact: Correcting language markers improves semantic accuracy and prevents systematic language mislabeling.
Example 4. Incorrect language substitution
KZ: Қазақша: Батареяның қызмет ету мерзімін ұзартуға…
AZ (before): İngilis: Batareyanın ömrünü uzatmaq…
AZ (after): Qazaxça: Batareyanın ömrünü uzatmaq…
Impact: Correcting such distortions reduces training noise and improves inference quality in technical domains.
Example 5. Geopolitical entity substitution
KZ: … екі ел арасындағы дәстүрлі қытай-Әзірбайжан достығы…
AZ (before): … Çin-Özbəkistan dostluğu…
AZ (after): … Çin-Azərbaycan dostluğu…
Impact: Correcting named entities prevents factual errors and significantly improves perceived translation quality.
These qualitative examples demonstrate that corpus cleaning eliminates systematic generation errors rather than random noise. Despite the relatively small proportion of corrected sentences, such errors are disproportionately harmful to training and can substantially degrade translation stability and quality. These findings are further supported by the analysis of NMT outputs: models trained on uncleaned data exhibit a higher tendency toward repetition and entity mistranslation, whereas models trained on cleaned corpora produce more coherent, fluent, and semantically accurate translations.
3.3.6. Quantitative Cleaning Results
Table 2 presents the results of linguistic verification and error correction conducted across five synthetic parallel corpora of the Turkic–Kazakh language pair, totaling approximately 500,000 sentences. The analysis revealed two main types of systematic errors:
Lexical duplicates—words or phrases repeated within a sentence.
Incorrect rendering of proper names and abbreviations, including place names, country names, and legal terms.
The table shows the following data: the number of duplicate segments deleted, the number of corrected names and abbreviations, typical examples of systematic substitutions, and the percentage of corrected data in the total corpus. This preprocessing step of generated corpora substantially improves the lexical, semantic, and stylistic quality of the training data, directly enhancing the accuracy, stability, and generalization capacity of neural machine translation models for low-resource and morphologically complex Turkic languages.
Thus,
Table 2 effectively implements a component-wise ablation at the data level, allowing us to separately assess the contributions of repetition-loop filtering and rule-based named-entity correction to overall corpus quality without re-training the models. The results show that the bulk of the corrections involve repetition removal, while entity correction affects fewer lines but improves the semantic correctness and factual accuracy of translations.
By addressing these errors at the preprocessing stage, we enhanced the reliability and performance of downstream NMT models and ensured they could generate accurate, fluent, and morphologically appropriate translations across Turkic languages. The resulting cleaned corpora can also be found at this link:
https://github.com/NLP-KazNU/Parallel-Turkic-text-corpora, accessed on 15 January 2026.
The generated corpora underwent manual peer review of abbreviation translations and translation repetitions, the findings of which were then used to automatically filter the entire text to ensure linguistic and structural quality. Where necessary, low-quality segments were regenerated using alternative models that demonstrated superior performance. The resulting validated and cleaned bilingual corpora form the foundation for downstream model fine-tuning.
3.4. Fine-Tuning AI Models for Turkic Language Translation
The goal of fine-tuning is to improve translation accuracy for specific language pairs, adapt models to the morphological and syntactic features of Turkic languages, reduce the base model’s error rate, and improve processing of rare vocabulary, idiomatic expressions, and culture-specific units.
To improve translation accuracy and adapt the models to the morphological and syntactic structures of Turkic languages, a targeted fine-tuning process was implemented using pre-generated parallel corpora.
Two models were used during fine-tuning:
NLLB-200 1.3B.
mT5-base.
Fine-Tuning of the NLLB-200 1.3B Model.
The NLLB-200 1.3B model, developed by Meta AI, was chosen for its balance between parameter count and computational efficiency, which enables effective training on consumer-grade graphics processing units (GPUs).
Fine-tuning was carried out on thoroughly cleaned synthetic corpora generated using the NLLB-200 3.3B and Gemma 2 27B models. This ensured the linguistic diversity of the training data and enabled more thorough coverage of the structural differences among the Turkic languages.
The following strategies were used during fine-tuning:
Early stopping—to prevent overfitting of the model.
Gradient accumulation—to simulate larger batch sizes.
Adaptive learning rate schedules—to stabilize training dynamics.
The combination of these methods enabled effective adaptation of the NLLB-200 1.3B model for low-resource Turkic languages.
Fine-Tuning of the mT5-base Model.
The mT5-base model is a multilingual transformer architecture developed by Google Research [
51] that extends the T5 framework to support over 100 languages, including several Turkic languages. While the original T5 model focused primarily on English, mT5 is pre-trained on the large-scale multilingual mC4 corpus, which substantially expands language coverage.
Key advantages of mT5-base include:
A consistent text-to-text paradigm, simplifying task formulation across translation, summarization, and question answering.
A language-neutral initialization, avoiding English-centric biases.
A balanced model size (~580 million parameters), allowing effective fine-tuning with limited resources.
Fine-tuning was performed on a GPU with 24 GB of memory capacity.
Training parameters are presented in
Table 3.
Training Hyperparameters:
Base model: facebook/nllb-200-1.3B.
Optimizer: AdamW.
Learning rate: 5 × 10⁻⁵.
Warmup steps: 500.
Epochs: 6.
Batch size (per device): 2.
Gradient accumulation: 2.
Effective batch size: 4.
Max sequence length: 256.
Padding strategy: longest (dynamic).
Precision: FP16.
Random seed: 42.
Hyperparameters were optimized through grid search, with batch sizes adjusted via gradient accumulation.
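For reference, the hyperparameters listed above map onto the Hugging Face Trainer configuration roughly as follows. The keyword names follow the `transformers.Seq2SeqTrainingArguments` convention; the dictionary is a sketch rather than the full training script.

```python
# Hyperparameters from Table 3, expressed in the style of
# transformers.Seq2SeqTrainingArguments (a sketch; early stopping is added
# separately via an EarlyStoppingCallback in the Trainer API).
TRAIN_CONFIG = {
    "learning_rate": 5e-5,
    "warmup_steps": 500,
    "num_train_epochs": 6,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "fp16": True,   # mixed-precision training
    "seed": 42,
}
MAX_SEQ_LEN = 256  # applied at tokenization time with dynamic ("longest") padding

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch = (TRAIN_CONFIG["per_device_train_batch_size"]
                   * TRAIN_CONFIG["gradient_accumulation_steps"])
```

Gradient accumulation lets the 24 GB GPU emulate the effective batch size of 4 while holding only 2 samples per device in memory.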
Before cleaning, the training datasets consisted of synthetic parallel corpora of 300,000 and 500,000 sentence pairs per language pair; after cleaning, approximately 288,000 and 497,000 sentence pairs per language pair remained.
Data splits (98%/1%/1%). For the 497,000-sentence corpus: Train 487,060; Dev (internal) 4970; Test 4970. For the 288,000-sentence corpus: Train 282,240; Dev (internal) 2880; Test 2880. These are common split proportions; the dataset volume for each language pair varies slightly depending on cleaning.
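The split sizes follow directly from the 98/1/1 proportions; a small illustrative helper, with the test part taking the remainder, reproduces the figures above.

```python
def split_sizes(total: int, train_frac: float = 0.98, dev_frac: float = 0.01):
    """Return (train, dev, test) sizes under a 98/1/1 split; the test
    portion takes the remainder so the three parts sum to total."""
    train = round(total * train_frac)
    dev = round(total * dev_frac)
    return train, dev, total - train - dev

# split_sizes(497_000) -> (487060, 4970, 4970)
# split_sizes(288_000) -> (282240, 2880, 2880)
```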
This methodology provides a precise, repeatable approach to improving machine translation for low-resource Turkic languages. The fine-tuned models showed higher accuracy, greater robustness, and better handling of morphological complexity. The framework not only supports the development of reliable translation systems under limited language representation but can also be applied to other agglutinative language families.
A comprehensive evaluation of the obtained results, including comparative analysis, identification of methodological limitations, and discussion of future application prospects, will be presented in detail in
Section 4 and
Section 5.
4. Experimental Results and Analysis
4.1. Preliminary Experiments
In our 2024 experiments with the OPUS dataset, we were limited to Google Colab (Google LLC, Mountain View, CA, USA), which allowed us to use the mT5-small model. This resulted in relatively low BLEU scores. The mT5-small model is a general-purpose text-to-text architecture with substantially fewer parameters and without translation-specific pretraining.
Table 4 shows that augmenting OPUS data with 140,000 synthetic sentences generated by Google Translate led to a consistent decrease in BLEU scores across all language pairs. Training exclusively on synthetic data degraded performance further. These results indicate that synthetic data quality, rather than quantity, is critical; this observation motivated the transition to NLLB-based synthetic corpus generation and structured data cleaning.
Preliminary experiments were crucial for further work on the project. The results of the preliminary experiment allowed us to refine subsequent tasks, namely, selecting a specialized AI tool for generating synthetic parallel corpora for five pairs of Turkic languages, manually validating and automatically cleaning the generated corpora, selecting an AI tool for fine-tuning on the cleaned parallel corpora, and evaluating the fine-tuning results.
4.2. NLLB-200 1.3B Experiments on 300,000 and 500,000 Parallel-Sentence Corpora
Fine-tuning the NLLB-200 1.3B model on a relatively modest (300,000) parallel dataset yields significant improvements across all metrics (
Table 5).
TER decreases by an average of 20–40%, while BLEU nearly doubles (for Azerbaijani, Kyrgyz, and Turkish). chrF steadily increases by 10–15 points. The jump is particularly noticeable for the Turkmen–Kazakh pair, where BLEU increased more than fourfold (from 6.44 to 28.36), demonstrating NLLB’s sensitivity to domain-specific data fitting. Re-training NLLB-200 1.3B, even on limited clean data, effectively stabilizes the model, eliminating high error rates for all language pairs.
Table 6 summarizes the quantitative evaluation results for the five Turkic–Kazakh language pairs, reporting WER, TER, BLEU, and chrF scores for the baseline models, the fine-tuned models, and the models fine-tuned on cleaned synthetic data. Baseline refers to the NLLB-200 1.3B model fine-tuned only on the original (uncleaned) parallel data. The “Fine-tuned” column indicates the performance after additional training, and the “Fine-tuned on cleaned data” column refers to performance after training on the cleaned synthetic data. Δ columns show improvements relative to the baseline scores (Fine-tuned → Baseline and Fine-tuned on cleaned data → Baseline).
Average metric values (across 5 language pairs) for the baseline (zero-shot): WER = 0.77, TER = 74.84, BLEU = 17.62, chrF = 50.66.
Average metric values (across 5 language pairs) for the fine-tuned version (6 epochs): WER = 0.58, TER = 54.72, BLEU = 30.95, chrF = 62.28.
For the cleaned 500,000-sentence corpus, the fine-tuned NLLB-200 1.3B average metric values are WER = 0.42, TER = 40.88, BLEU = 43.54, and chrF = 76.71.
Data cleansing yields the greatest improvement in quality among all the experiments conducted. Switching from the uncleaned corpus to the cleaned corpus across all language pairs yields an average WER decrease of 0.12, a TER decrease of 13.84, a BLEU increase of 12.59, and a chrF increase of 14.43. The strongest effect is observed for the Kyrgyz–Kazakh and Azerbaijani–Kazakh pairs, where BLEU reaches ~48.
4.3. mT5-Base Experiments on a 500,000 Parallel-Sentence Corpus of Azerbaijani–Kazakh
The results of the fine-tuning are presented in
Table 7, which lists the values of four main performance indicators: WER, TER, BLEU, and chrF2.
Even a superficial casing cleanup improves all quality metrics: WER −1.3%, TER −1.28%, BLEU +1.57%, chrF +0.99%. Although the effect is moderate, it is stable: all metrics improve simultaneously, indicating improved data consistency. The mT5-base model also benefits from casing cleanup, but mT5’s sensitivity to data quality is lower than that of NLLB-200 1.3B.
Thus, the conducted experiments show that cleaning synthetic corpora and their subsequent use for additional model training significantly improves the quality of translation for Turkic–Kazakh language pairs. The results confirm the feasibility of this approach under limited resource conditions.
4.4. NLLB-200 1.3B Experiments on the Six Turkic-Language Parallel Dataset
Table 8 presents the results of experiments fine-tuning the free, open-source AI model for machine translation between Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkmen–Kazakh, Turkish–Kazakh, and Uzbek–Kazakh on a common Turkic-language parallel dataset (3,885,542 sentences). The baseline scores represent the average performance fine-tuned separately on each language pair (as reported in
Table 6), while the fine-tuned results show performance after training on the combined cleaned dataset for 6 epochs.
Joint fine-tuning on a single purified corpus of six Turkic languages leads to a further increase in BLEU (from 43.54 to 47.84) and chrF (from 76.71 to 78.41), while simultaneously decreasing WER and TER. Interestingly, the model demonstrates an average improvement across all language pairs, confirming the presence of interlingual transfer within the Turkic language group. Joint multilingual fine-tuning strengthens the models by leveraging the structural similarities among Turkic languages and is an optimal approach given limited resources for individual languages.
4.5. External Evaluation of Fine-Tuned NLLB-200 1.3B Models on the Human-Translated Benchmark FLORES 200 Dataset
We additionally evaluated the fine-tuned NLLB-200 1.3B models on FLORES-200, an independent human-translated evaluation benchmark for low-resource and multilingual machine translation; no FLORES data were used for training.
Table 9 presents the results of the external test experiments of the NLLB-200 1.3B model fine-tuned on 500,000 parallel sentences for five Turkic languages.
The results show that BLEU scores slightly decrease after fine-tuning across all language pairs, while chrF remains stable or improves for most directions. This pattern is expected for morphologically rich Turkic languages, as BLEU is sensitive to surface n-gram overlap, whereas chrF better captures character-level morphological similarity. The consistent chrF values indicate that fine-tuning preserves or improves morphological adequacy, even when lexical overlap with the reference translations decreases.
In addition to BLEU and chrF, we report a semantic evaluation using COMET and BERTScore [
63,
64]. These metrics better capture meaning preservation and paraphrastic variation, which is particularly important for morphologically rich Turkic language pairs. The results indicate that, despite the relatively low BLEU scores, the proposed model preserves the semantic content to a large extent.
For morphologically rich and closely related Turkic languages, semantic metrics (COMET, BERTScore) provide a more reliable estimate of translation quality than surface-form metrics such as BLEU [64].
Table 10 presents the results of the external test experiments for the NLLB-200 1.3B model fine-tuned on 500,000 parallel sentences for five Turkic-language pairs using the semantic metrics of COMET and BERTScore.
The results demonstrate a clear and consistent improvement in BERTScore-F1 after fine-tuning across all Turkic–Kazakh language pairs. On average, the BERTScore-F1 increases from 0.8168 in the zero-shot setting to 0.9258 after fine-tuning, indicating a substantially improved semantic alignment between model outputs and reference translations.
In contrast, COMET scores exhibit a slight decrease on average (from 0.8593 to 0.8274), with the largest drops observed for Turkish–Kazakh and Uzbek–Kazakh. This behavior is expected as COMET is sensitive to domain and stylistic shifts; fine-tuning on synthetic parallel data introduces mild specialization that may reduce the alignment with the FLORES-200 reference distribution.
COMET should be interpreted in terms of relative change rather than absolute value; the observed decrease from 0.859 to 0.827 (Δ ≈ −0.03) reflects a moderate domain adaptation effect.
Small decreases in COMET are commonly observed under domain adaptation and do not necessarily indicate loss of generalization [65,66].
Table 11 presents the results of the external test experiments using the NLLB-200 1.3B model fine-tuned on a six-Turkic-language parallel dataset (3,885,542 sentences), evaluated with the COMET and BERTScore metrics.
The slightly worse performance observed for the multilingual fine-tuned model compared to the models fine-tuned on individual language pairs can be attributed to parameter sharing across multiple translation directions. In the multilingual setting, the fixed parameter budget of the NLLB-200 1.3B model is distributed across several Turkic–Kazakh language pairs, which limits the degree of specialization achievable for each individual direction. In contrast, pair-specific fine-tuning allows the model to allocate its full representational capacity to a single language pair, resulting in stronger adaptation and higher metric scores.
This balance between cross-lingual generalization and pair-specific specialization explains why multilingual models yield slightly worse performances than their bilingual counterparts.
4.6. Fine-Tuning and Evaluation of NLLB-200 1.3B on the OPUS Dataset
As an example, we used the Kyrgyz–Kazakh OPUS dataset for this task. The corpus was preprocessed and deduplicated: it initially contained 99,590 lines, reduced to 91,142 lines after duplicate removal. In Table 12, the evaluation results of the zero-shot and fine-tuned NLLB-200 1.3B model are reported using the BLEU, chrF, BERTScore, and COMET metrics.
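Duplicate removal of the kind applied to the OPUS corpus can be sketched as follows. The normalization step (NFC form, collapsed whitespace, case folding) is an illustrative choice, not necessarily the paper's exact procedure, and the sample sentence pairs are invented.

```python
import unicodedata

def normalize(s: str) -> str:
    """Canonical form used as the duplicate-detection key:
    NFC Unicode normalization, collapsed whitespace, case folding."""
    return " ".join(unicodedata.normalize("NFC", s).split()).casefold()

def deduplicate(pairs):
    """Keep only the first occurrence of each normalized (source, target) pair."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept

corpus = [
    ("Саламатсызбы?", "Сәлеметсіз бе?"),
    ("саламатсызбы?", "Сәлеметсіз бе?"),  # duplicate up to casing
    ("Китеп кайда?", "Кітап қайда?"),
]
print(len(deduplicate(corpus)))  # duplicates collapse to unique pairs
```

Applied to the 99,590-line Kyrgyz–Kazakh corpus, a pass of this form accounts for the roughly 8% of lines removed.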
For comparison with the cleaned synthetic datasets: (1) on the synthetic datasets, BLEU increases from 17.72 to 48.27; chrF increases from 50.58 to 80.12; BERTScore F1 increases from 0.809 to 0.9295; and COMET decreases from 0.8765 to 0.8547;
(2) on the OPUS dataset, BLEU increases from 23.00 to 44.00; chrF increases from 43.96 to 60.33; BERTScore F1 increases from 0.80 to 0.86; and COMET increases from 0.75 to 0.81.
Table 13 compares the evaluation results for the cleaned synthetic datasets with those for the OPUS dataset for the Kyrgyz–Kazakh pair.
Cleaned synthetic data yield substantially larger gains in BLEU, chrF, and BERTScore-F1, while OPUS-only fine-tuning shows a higher COMET increase.
4.7. Human Evaluation
We conducted a human evaluation focusing on in-domain translation performance. Given the limited resources of the considered Turkic language pairs, validating the quality gains achieved within the curated parallel corpora used by the proposed pipeline is of primary importance.
For human evaluation, we selected 100 sentence pairs per language pair from a subset of 5000 manually verified and corrected parallel sentences, curated by native-speaker consultants for Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. These 5000 sentences originate from the larger cleaned synthetic corpus but were independently reviewed and corrected to ensure high reference quality. None of the evaluated sentences were used during model training.
Each evaluation item contained: (1) the source sentence, (2) two anonymized system outputs (Translation A and Translation B), and (3) a reference Kazakh translation.
To avoid positional bias, the assignment of baseline and fine-tuned outputs to Translations A and B was systematically alternated across the dataset. Human evaluators were asked to:
- (i) Indicate which translation is better (or select Tie if they are equivalent);
- (ii) Assign quality scores on a 1–5 Likert scale to each translation independently.
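The alternation protocol can be sketched as a small helper that builds anonymized evaluation items. The function name, item layout, and system labels are hypothetical illustrations of the procedure described above, not the paper's actual tooling.

```python
def build_eval_items(sources, baseline_outs, finetuned_outs, refs):
    """Alternate which system appears as Translation A across items,
    so neither system systematically occupies the first position."""
    items = []
    for i, (src, base, ft, ref) in enumerate(
        zip(sources, baseline_outs, finetuned_outs, refs)
    ):
        if i % 2 == 0:
            a, b = base, ft
            key = {"A": "baseline", "B": "finetuned"}
        else:
            a, b = ft, base
            key = {"A": "finetuned", "B": "baseline"}
        # Evaluators see only source, A, B, and reference; 'key' stays hidden.
        items.append({"source": src, "A": a, "B": b, "reference": ref, "key": key})
    return items

items = build_eval_items(["s1", "s2"], ["b1", "b2"], ["f1", "f2"], ["r1", "r2"])
print([it["key"]["A"] for it in items])  # systems alternate in position A
```

Keeping the hidden key separate from what evaluators see allows preferences to be de-anonymized only at analysis time.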
Table 14 presents the results of the human evaluation of translations by NLLB-200 1.3B: (1) the baseline zero-shot model and (2) the fine-tuned model, evaluated on 100 sentence pairs per language pair.
The human evaluation across five Turkic–Kazakh language pairs shows that baseline and fine-tuned models are close. Average quality scores on a 1–5 Likert scale reveal modest but consistent differences: the fine-tuned model outperforms the baseline for Azerbaijani–Kazakh, Kyrgyz–Kazakh, and Uzbek–Kazakh while remaining competitive for Turkish–Kazakh and Turkmen–Kazakh. Overall, no systematic degradation in human-perceived translation quality is observed.
In this study, we prioritize in-domain human evaluation, as the primary goal is to validate improvements within the curated and manually verified parallel corpora used by the proposed pipeline. For low-resource Turkic languages, ensuring quality gains in the target application domain is particularly critical. External generalization is assessed separately using the FLORES-200 benchmark with established automatic metrics.
5. Discussion
5.1. Preliminary Experiments and Methodological Transition
The experiments conducted with OPUS datasets and the mT5 model represent a preliminary stage of this study. At the time of experimentation (2024), the available computational resources were limited to Google Colab, which constrained training to the mT5-small configuration. Given the general-purpose nature and limited parameter capacity of mT5-small, low BLEU scores for low-resource Turkic language pairs are expected and should not be interpreted as training deficiencies.
In addition, fine-tuning mT5-small on a synthetic parallel corpus of 140k sentence pairs resulted in only modest performance gains. This outcome is largely attributable to the quality of the synthetic data, which was generated by translating a monolingual Kazakh corpus into five Turkic languages using Google Translate. While such synthetic augmentation is common in low-resource settings, translation noise and domain mismatch disproportionately affect smaller-capacity models.
Importantly, BLEU scores consistently increase across training epochs for all language pairs, indicating stable learning dynamics and correct training configuration. Based on insights from this preliminary stage and the acquisition of dedicated GPU resources (RTX 4090, 24 GB), we revised our experimental strategy. Specifically, we selected NLLB-200 3.3B for generating higher-quality synthetic parallel corpora and NLLB-200 1.3B for fine-tuning. This transition resulted in substantially improved translation quality, demonstrating the importance of model specialization and data quality in low-resource neural machine translation.
The synthetic parallel corpora used in this study are generated following the standard back-translation paradigm widely adopted in low-resource neural machine translation. Specifically, large monolingual Kazakh corpora serve as the starting point, and the corresponding target-side sentences (Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek) are obtained via back-translation using a pretrained multilingual model. This approach is well established as an effective strategy for alleviating data scarcity in low-resource settings.
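The corpus-construction direction described above can be sketched as follows. The `translate_fn` argument is a stub standing in for the pretrained multilingual model (NLLB-200 3.3B in this study), and the language codes are illustrative; the key point is that the authentic Kazakh sentence always ends up on the target side of each training pair.

```python
def back_translate(monolingual_kk, translate_fn, targets):
    """Build synthetic parallel corpora from monolingual Kazakh text.

    translate_fn(sentence, tgt_lang) stands in for the pretrained
    multilingual translation model; here it is a stub for illustration.
    """
    corpora = {lang: [] for lang in targets}
    for kk_sentence in monolingual_kk:
        for lang in targets:
            synthetic_src = translate_fn(kk_sentence, lang)
            # Training pairs: synthetic source -> authentic Kazakh target,
            # so the fine-tuned model learns to produce natural Kazakh output.
            corpora[lang].append((synthetic_src, kk_sentence))
    return corpora

# Stub translator used purely for demonstration.
demo = back_translate(
    ["Сәлем әлем"],
    lambda s, lang: f"[{lang}] {s}",
    ["kir_Cyrl", "uzn_Latn"],
)
print(demo["kir_Cyrl"])
```

This orientation is what makes back-translation effective: translation noise lands on the source side, while the target side remains clean human-authored text.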
Fine-tuning the NLLB-200 1.3B model on the resulting synthetic corpora leads to substantial improvements in translation quality. These gains should not be interpreted as trivial output imitation of the data-generating model, as effective generalization in low-resource neural machine translation remains a non-trivial learning problem, particularly for morphologically rich Turkic languages.
To reduce potential biases inherent to synthetic data, the generated corpora undergo multi-level cleaning, including deduplication, correction of named entities and abbreviations, and removal of hallucinated segments, thereby altering the data distribution and going beyond direct output imitation.
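Parts of this multi-level cleaning can be expressed as simple automatic filters. The sketch below combines deduplication with two common heuristics (a length-ratio filter for truncated or hallucinated outputs, and a source-copy filter); the thresholds and example pairs are illustrative assumptions, and the expert validation and targeted regeneration stages of the actual pipeline are not automatable in this form.

```python
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs with an implausible character-length ratio,
    a common symptom of truncation or hallucination."""
    a, b = max(len(src), 1), max(len(tgt), 1)
    return max(a, b) / min(a, b) <= max_ratio

def not_a_copy(src: str, tgt: str) -> bool:
    """Reject pairs where the 'translation' merely copies the source."""
    return src.strip().casefold() != tgt.strip().casefold()

def clean(pairs):
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # level 1: deduplication
        seen.add(key)
        if not length_ratio_ok(src, tgt):
            continue  # level 2: length-ratio filter
        if not not_a_copy(src, tgt):
            continue  # level 3: source-copy filter
        kept.append((src, tgt))
    return kept

pairs = [
    ("Menga yoqdi", "Маған ұнады"),
    ("Menga yoqdi", "Маған ұнады"),  # exact duplicate
    ("Salom", "Salom"),              # source copied verbatim
    ("Ha", "Бұл өте ұзақ және мүлдем сәйкес емес аударма"),  # length outlier
]
print(len(clean(pairs)))  # only the first pair survives
```

Each filter changes the data distribution, which is precisely what moves fine-tuning beyond direct imitation of the generating model.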
In addition, model evaluation is performed using both held-out synthetic test sets and independent human-translated reference data, providing empirical evidence that the proposed pipeline improves translation quality for low-resource Turkic languages.
5.2. Impact of Data Quality on NMT Performance for Low-Resource Turkic Languages
The experimental results consistently demonstrate that data quality is a critical factor in improving neural machine translation (NMT) systems for low-resource Turkic languages. Across all evaluated models and language pairs, cleaning synthetic parallel corpora led to substantial improvements in translation accuracy. Even basic preprocessing steps resulted in measurable gains, confirming that noise and inconsistencies in synthetic data significantly hinder model generalization.
Quantitatively, corpus cleaning yielded an average improvement of 12.59 BLEU points and 14.43 chrF points, while error-based metrics decreased by 0.16 (WER) and 13.84 (TER). These results highlight that morphological and lexical consistency is particularly important for agglutinative languages, where minor distortions can propagate into systematic translation errors.
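For reference, the error-based metrics reported here follow the standard definition of word error rate: word-level Levenshtein distance normalized by the reference length. A minimal implementation (with an invented Kazakh example; TER additionally allows shift operations and is not reproduced here) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between r[:i-1] and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (rw != hw)  # substitution / match
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)

print(wer("бұл кітап өте қызық", "бұл кітап қызық"))  # one deletion -> 0.25
```

Because the distance is normalized by reference length, a single dropped word in a four-word reference already costs 0.25, which is why the reported WER reductions reflect substantial structural improvement.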
5.3. Role of Synthetic Data Generation and Thematic Diversity
The use of synthetic parallel corpora generated via back-translation proved to be an effective strategy for mitigating data scarcity. The extended bilingual corpus of 500,000 sentence pairs spanned a wide range of themes, including journalistic, conversational, scientific, medical, administrative, and technical domains. This diversity enabled the NMT models to learn generalized representations of real-world language use rather than overfitting narrow domains.
The results indicate that thematic coverage directly improves robustness, enabling models to perform consistently across heterogeneous text types. This is particularly important for practical deployment scenarios involving mixed-domain inputs.
5.4. Error Characteristics and the Importance of Corpus Cleaning
Error analysis across all five language pairs revealed recurring issues related to hallucinations, mistranslation of named entities, inconsistent terminology, duplicated segments, and morphological distortions. Such errors are inherent to synthetic data generated by large-scale models and, if left unaddressed, substantially degrade translation quality.
The three-level refinement strategy—expert validation, automatic filtering, and targeted regeneration—proved essential in mitigating these issues. The marked reduction in WER and TER confirms that cleaning improves not only surface-level accuracy but also deeper structural and semantic consistency.
5.5. Comparative Performance Across Turkic Language Pairs
The effect of data cleaning was particularly pronounced for initially low-performing language pairs. For example, in the Kyrgyz–Kazakh pair, BLEU increased from 29.73 to 48.27, while WER decreased from 0.59 to 0.38. Similar trends were observed for Azerbaijani–Kazakh and Uzbek–Kazakh, where BLEU scores nearly tripled and error rates were reduced by approximately half.
The Turkmen–Kazakh pair, initially the weakest (BLEU = 9.18), showed substantial improvement after cleaning, reaching 33.22 BLEU. Although absolute performance remained lower than for other pairs, the relative gains demonstrate the effectiveness of corpus refinement, even for severely under-resourced languages.
5.6. Model Sensitivity to Noise and Cross-Model Consistency
The experiments revealed that the NLLB-200 1.3B model is highly sensitive to the quality of the training data. Fine-tuning on uncleaned synthetic corpora yielded only moderate gains, whereas training on cleaned datasets led to stable, consistent improvements across all metrics. This suggests that large multilingual models can fully exploit their capacity only when trained on structurally coherent and lexically consistent data.
These observations were independently confirmed using the mT5-base model. Although its overall improvements were more modest due to smaller model capacity, the direction of changes remained consistent: WER and TER decreased, while BLEU and chrF increased after cleaning. This confirms that the benefits of corpus refinement are model-independent.
5.7. Benefits of Multilingual Fine-Tuning
Multilingual fine-tuning on a combined Turkic dataset further enhanced translation quality. Training the NLLB-200 1.3B model on a unified multilingual corpus yielded a better performance across all metrics than pairwise fine-tuning. The improvement in BLEU from 43.54 to 47.84 and the reduction in WER from 0.42 to 0.31 indicate that shared representations across related languages improve generalization.
This effect is particularly beneficial for low-resource languages such as Turkmen, which benefit from shared morphology, cognates, and syntactic patterns present across the Turkic language family.
Despite the strong results, limitations remain. Although synthetic corpora can reach a substantial size, they inevitably contain residual hallucinations and inconsistencies. Without systematic cleaning, such artifacts negatively affect translation performance. Future work will focus on expanding the corpora with additional open-domain and spoken-language sources, as well as on more advanced filtering and confidence-based selection methods.
Overall, the discussion confirms that high-quality synthetic corpora can effectively compensate for the lack of human-annotated data in low-resource Turkic languages. The findings demonstrate that advanced corpus cleaning is essential, multilingual fine-tuning significantly enhances generalization, and data quality remains the dominant factor in achieving high-performance NMT systems for morphologically rich languages.
5.8. External and Human Evaluation
To obtain a balanced assessment of translation quality, we combine external benchmark evaluation with in-domain human evaluation, capturing both generalization and practical performance in low-resource Turkic–Kazakh translation.
External evaluation is conducted on the FLORES-200 benchmark, which provides a strictly held-out, human-translated test set with a shared Kazakh reference across all source languages. This setup enables reliable comparison across language pairs and assessment of out-of-domain robustness. The results indicate that fine-tuned models remain competitive with the zero-shot baseline in semantic metrics. Small decreases in COMET are observed for some language pairs, consistent with domain adaptation rather than systematic degradation, while the BERTScore remains stable.
Human evaluation focuses on in-domain performance using manually verified and corrected parallel data. Pairwise preference judgments are largely dominated by Tie responses, suggesting comparable overall quality between baseline and fine-tuned systems. However, graded quality scores show modest yet consistent improvements for the fine-tuned model across several language pairs, with no evidence of quality degradation.
Taken together, these results show that fine-tuning on cleaned synthetic data leads to meaningful in-domain quality improvements while maintaining robust performance on an external benchmark. The observed divergence between COMET and human judgments further underscores the importance of combining automatic metrics with human evaluation, particularly in low-resource, morphologically rich languages.
5.9. About Domain Shift
The synthetic parallel data were generated from large-scale Kazakh monolingual sources, including news and official government domains, providing broad topical and stylistic coverage. Although this data composition may theoretically introduce mild domain or stylistic biases, an empirical evaluation using automatic semantic metrics (COMET, BERTScore) and human judgment shows that their practical impact on model generalization is limited. In particular, the observed decrease in COMET scores after fine-tuning remains within the range commonly associated with domain adaptation and does not suggest a loss of generalization. These findings imply that the proposed synthetic data generation and cleaning pipeline improves translation quality while largely preserving robustness across domains. Nevertheless, future work may explore explicit domain balancing and multi-domain evaluation to further mitigate potential domain bias.
6. Conclusions
In conclusion, this study successfully applied modern artificial intelligence methods to address the pressing problem of machine translation for low-resource Turkic languages (Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek). The key scientific contribution of this work lies in the development and implementation of a scalable methodology that includes the generation, automated cleaning, and further training of open-source AI models.
Synthetic Corpus Generation: To address the acute data shortage, a new multilingual parallel corpus was developed, focusing on the Kazakh language. Using the large-scale open-source NLLB-200 3.3B model, we generated synthetic parallel corpora of 500,000 and 300,000 sentences for each of the five Turkic language pairs. This work created a significant resource for low-resource languages.
Detailed Data Cleaning (Improvement Mechanism): The generated corpora were subjected to automated cleaning and filtering to remove systematic errors. This stage was critical for the agglutinative and morphologically complex Turkic languages. Using a special software module (Correction Module), the following issues were eliminated:
- Lexical duplicates (repeated segments);
- Erroneous rendering of geographical and proper names, as well as abbreviations (for example, the incorrect translation of “KR”, Republic of Kazakhstan);
- “Hallucinations”: distorted or lexically incoherent phrases that violate grammatical norms and reduce the generalization ability of the model.
Fine-Tuning: To adapt the models to the specific characteristics of Turkic languages, the NLLB-200 1.3B model was selected (as well as mT5-base for comparative analysis). Fine-tuning was conducted on the cleaned synthetic corpora.
We fine-tuned the NLLB-200 1.3B model on cleaned data, demonstrating significant performance improvements across all five language pairs. This improvement was achieved primarily through the use of the cleaned corpus, which provided a robust foundation for training the NMT model.
For the cleaned corpora of 500,000 sentences, fine-tuning yielded the following average improvements over the baseline (zero-shot) translation quality:
The average BLEU metric increased by 25.92 points.
The chrF metric (character-level F-score) increased by 26.05 points.
The error metrics (WER and TER) were nearly halved: WER decreased by 0.35, and TER by 33.96.
The largest quality improvement was recorded for the pair with the smallest number of source resources, Turkmen–Kazakh, where the BLEU metric increased from 9.18 to 33.22, a 3.6-fold increase. For the Kyrgyz–Kazakh pair, the BLEU increased by almost 2.7 times. Furthermore, continued training on a large multilingual dataset (3,885,542 sentences) of Turkic languages further improved the translation quality, enhancing the model’s generalizability and performance. For example, on this combined dataset, BLEU increased from 43.54 to 47.84, and WER decreased from 0.42 to 0.31. These results confirm the high efficiency of the developed methodology for generating and cleaning synthetic corpora using open-source AI models. The demonstrated, scalable, reproducible approach enables the improved use of open-source NMT solutions for low-resource languages and helps bridge the digital linguistic divide.
External and Human Evaluation.
The external evaluation of the fine-tuned NLLB-200 1.3B model on the independent, human-translated FLORES-200 dataset shows that, compared with the zero-shot baseline, the pair-wise fine-tuned models' BLEU scores slightly decrease on average (from 12.71 to 11.32), while chrF slightly increases (from 46.62 to 47.12) and the semantic metric BERTScore-F1 rises from 0.8168 to 0.9258. In contrast, COMET exhibits a slight average decrease (from 0.8593 to 0.8274). The multilingual fine-tuned NLLB-200 1.3B performs slightly worse (BERTScore-F1 drops from 0.8168 to 0.8115 and COMET from 0.8593 to 0.8446). These external evaluation scores indicate a moderate domain adaptation effect and do not suggest a loss of generalization.
Human evaluation conducted by native speakers across multiple Turkic–Kazakh language pairs is consistent with the automatic metrics: although pairwise preference judgments were often ties, graded quality scores show modest but consistent gains in adequacy and fluency for the fine-tuned models, supporting the conclusion that the proposed training pipeline yields perceptible in-domain quality improvements beyond metric-based evaluation.
The proposed scalable, replicable approach not only advances machine translation research but also lays a solid foundation for the multifaceted application of its findings. Academically, the proposed methodology establishes a replicable research paradigm that can be applied to the study of other language families and contributes to expanding scientific knowledge of morphologically complex agglutinative languages. The proposed approaches strengthen linguistic inclusivity in the Turkic-speaking world, ensuring equal opportunities for digital participation for the more than 200 million Turkic language speakers.