Search Results (4)

Search Parameters:
Keywords = NLLB

34 pages, 746 KB  
Article
An Integrated Approach to Adapting Open-Source AI Models for Machine Translation of Low-Resource Turkic Languages
by Ualsher Tukeyev, Assem Shormakova, Aidana Karibayeva, Diana Rakhimova, Balzhan Abduali, Dina Amirova, Nazym Rakhmanberdi and Rashid Aliyev
Computers 2026, 15(2), 73; https://doi.org/10.3390/computers15020073 - 28 Jan 2026
Cited by 2 | Viewed by 2084
Abstract
This study presents the application of free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages such as Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. This machine translation problem for Turkic languages is part of a project to generate meeting minutes from speech transcripts. Due to limited parallel corpora and underdeveloped linguistic tools for these languages, traditional machine translation approaches often underperform. The goal is to reduce digital inequality for these languages and to support scalability. We investigate the effectiveness of free, open-source, pre-trained specialized and general-purpose AI models for morphologically rich state Turkic languages. This research includes developing parallel corpora for six Turkic languages, fine-tuning, and performance evaluation using the BLEU, WER, TER, and chrF metrics. Parallel corpora of 300,000 and 500,000 sentences were generated and cleaned for each of five language pairs. After cleaning and fine-tuning, the results for the 500,000-sentence corpora show significant average improvements over the baseline NLLB-200 1.3B: BLEU increased by 23.81 points, chrF increased by 26.05 points, and WER and TER decreased by 0.36 and 33.95, respectively. Multilingual parallel corpora totaling 3,885,542 sentences were then developed for the six Turkic languages; compared with the results for the cleaned 500,000-sentence corpus, fine-tuning NLLB-200 1.3B on them increased BLEU by 4.3 points and chrF by 1.7 points, and decreased WER and TER by 0.1 and 4.75, respectively. These results demonstrate the high efficiency of corpus cleaning and synthetic data generation in improving the quality of machine translation for low-resource Turkic languages using AI models. They were confirmed by external evaluation on the FLORES 200 dataset and by human evaluation.
The scientific contribution of this article is threefold: a methodology for generating parallel corpora with a specialized AI machine translation model and fine-tuning that model on the created corpora; new multilingual parallel corpora for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkish–Kazakh, Turkmen–Kazakh, and Uzbek–Kazakh pairs, created and cleaned with the proposed methodology; and fine-tuning experiments on these corpora.
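The abstract above reports WER alongside BLEU, TER, and chrF without defining it. As a minimal sketch of the commonly used definition (word-level edit distance divided by the number of reference words), the following may help when reading the reported deltas. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the article's own tooling and tokenization are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption, not the article's stated setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this definition WER is a fraction of the reference length and can exceed 1.0 when the hypothesis is much longer than the reference.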

18 pages, 640 KB  
Article
Fine-Tuning Methods and Dataset Structures for Multilingual Neural Machine Translation: A Kazakh–English–Russian Case Study in the IT Domain
by Zhanibek Kozhirbayev and Zhandos Yessenbayev
Electronics 2025, 14(15), 3126; https://doi.org/10.3390/electronics14153126 - 6 Aug 2025
Cited by 2 | Viewed by 2574
Abstract
This study explores fine-tuning methods and dataset structures for multilingual neural machine translation using the No Language Left Behind model, with a case study on Kazakh, English, and Russian. We compare single-stage and two-stage fine-tuning approaches, as well as triplet versus non-triplet dataset configurations, to improve translation quality. A high-quality dataset of 50,000 triplets in the information technology domain, manually translated and expert-validated, serves as the in-domain benchmark, complemented by out-of-domain corpora such as KazParC. Evaluations using BLEU, chrF, METEOR, and TER metrics reveal that single-stage fine-tuning excels for low-resource pairs (e.g., 0.48 BLEU and 0.77 chrF for Kazakh → Russian), while two-stage fine-tuning benefits high-resource pairs (Russian → English). Triplet datasets improve cross-linguistic consistency compared with non-triplet structures. Our reproducible framework offers practical guidance for adapting neural machine translation to technical domains and low-resource languages.

17 pages, 1467 KB  
Article
Confidence-Based Knowledge Distillation to Reduce Training Costs and Carbon Footprint for Low-Resource Neural Machine Translation
by Maria Zafar, Patrick J. Wall, Souhail Bakkali and Rejwanul Haque
Appl. Sci. 2025, 15(14), 8091; https://doi.org/10.3390/app15148091 - 21 Jul 2025
Viewed by 3042
Abstract
The transformer-based deep learning approach represents the current state of the art in machine translation (MT) research. Large-scale pretrained transformer models produce state-of-the-art performance across a wide range of MT tasks for many languages. However, such deep neural network (NN) models are often data-, compute-, space-, power-, and energy-hungry, typically requiring powerful GPUs or large-scale clusters to train and deploy. As a result, they are often regarded as “non-green” and “unsustainable” technologies. Distilling knowledge from large deep NN models (teachers) to smaller NN models (students) is a widely adopted sustainable development approach in MT as well as in broader areas of natural language processing (NLP), including speech and image processing. However, distilling large pretrained models presents several challenges. First, training time and cost scale with the volume of data used to train a student model. This could pose a challenge for translation service providers (TSPs), as they may have limited budgets for training. Moreover, CO2 emissions generated during model training are typically proportional to the amount of data used, contributing to environmental harm. Second, when querying teacher models, including encoder–decoder models such as NLLB, the translations they produce for low-resource languages may be noisy or of low quality. This can undermine sequence-level knowledge distillation (SKD), as student models may inherit and reinforce errors from inaccurate labels. In this study, the teacher model’s confidence estimation is employed to filter out of the distilled training data those instances for which the teacher exhibits low confidence. We tested our methods on a low-resource Urdu-to-English translation task operating within a constrained training budget in an industrial translation setting.
Our findings show that confidence estimation-based filtering can significantly reduce the cost and CO2 emissions associated with training a student model without a drop in translation quality, making it a practical and environmentally sustainable solution for TSPs.
(This article belongs to the Special Issue Deep Learning and Its Applications in Natural Language Processing)
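The filtering step described in the abstract above can be sketched roughly as follows. The abstract does not say how teacher confidence is estimated, so this example assumes one simple stand-in: the exponential of the mean per-token log-probability of the teacher's output. The names `DistilledExample`, `sequence_confidence`, and `filter_by_confidence`, the sample sentences, and the threshold value are all hypothetical and are not taken from the article.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DistilledExample:
    """One teacher-labelled training pair (hypothetical structure)."""
    source: str
    translation: str
    token_logprobs: List[float]  # teacher log-probability of each output token


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Length-normalized confidence in [0, 1]: exp(mean token log-prob).

    This is an assumed confidence measure chosen for illustration.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def filter_by_confidence(
    examples: List[DistilledExample], threshold: float
) -> List[DistilledExample]:
    """Keep only examples whose teacher confidence meets the threshold."""
    return [
        ex for ex in examples
        if sequence_confidence(ex.token_logprobs) >= threshold
    ]


if __name__ == "__main__":
    data = [
        DistilledExample("src-1", "a confident output", [-0.1, -0.2]),
        DistilledExample("src-2", "a doubtful output", [-2.0, -3.0]),
    ]
    kept = filter_by_confidence(data, threshold=0.5)
    print([ex.source for ex in kept])
```

Because the student is trained only on the retained pairs, raising the threshold trades training-set size (and therefore training cost) against coverage; the appropriate value would need to be tuned per task.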

14 pages, 259 KB  
Article
Development of an Automated Moderator for Deliberative Events
by Simone Bonechi
Electronics 2024, 13(3), 544; https://doi.org/10.3390/electronics13030544 - 29 Jan 2024
Cited by 2 | Viewed by 2219
Abstract
Online communication platforms have revolutionized interpersonal interactions by transcending geographical barriers. While facilitating connectivity, these platforms have introduced challenges such as overcoming linguistic differences and preventing the diffusion of spam and offensive content. This is particularly pertinent in the context of deliberative events, where online platforms could be used to extend the inclusion of citizens in democratic decision-making. In traditional deliberative events, human moderators and translators were used to facilitate conversation; however, the need for these figures imposed a limit on both the number of deliberative events that could be organized and the number of participants. In response, this paper proposes an automated moderator for deliberative events. The moderator is developed in Python for the online communication platform Discord and, thanks to its integrated artificial intelligence (AI) tools, can automatically manage conversation agendas, prevent spam and inappropriate language, analyze the sentiment of the conversation, and translate messages into multiple languages. In particular, three classifiers based on a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model were fine-tuned for spam detection, toxic comment classification, and sentiment analysis. These allow the moderator to automatically detect and remove spam and offensive messages in different languages, send warnings to users, alert administrators, and, after repeated warnings, impose bans. Additionally, a built-in translator based on Meta’s No Language Left Behind (NLLB) model translates messages into five languages (Italian, English, French, German, and Polish). The developed bot was tested in a simulated deliberative event on a Discord server, demonstrating its ability to manage conversations and prevent linguistic abuse.