Review

Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems

1 Department of Computer Science, New York University, New York, NY 10011, USA
2 School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
3 Department of Computer Science, New York University Abu Dhabi, P.O. Box 129188, Abu Dhabi, United Arab Emirates
4 Amazon, 28046 Madrid, Spain
5 Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 723; https://doi.org/10.3390/info16090723
Submission received: 17 June 2025 / Revised: 20 July 2025 / Accepted: 21 July 2025 / Published: 25 August 2025
(This article belongs to the Special Issue Human and Machine Translation: Recent Trends and Foundations)

Abstract

Historically regarded as one of the most challenging tasks on the path to complete artificial intelligence (AI), machine translation (MT) has seen continuous research devotion over the past decade, resulting in cutting-edge architectures for modeling sequential information. While the majority of statistical models traditionally relied on the idea of learning from parallel translation examples, recent research exploring self-supervised and multi-task learning methods extended the capabilities of MT models, eventually allowing the creation of general-purpose large language models (LLMs). In addition to versatility in providing translations useful across languages and domains, LLMs can in principle perform any natural language processing (NLP) task given a sufficient amount of task-specific examples. While LLMs have now reached a point where they can both replace and augment traditional MT models, the extent of their advantages and the ways in which they leverage translation capabilities across multilingual NLP tasks remain a wide area for exploration. In this literature survey, we present an introduction to the current position of MT research with a historical look at different modeling approaches to MT, how these might be advantageous for solving particular problems, and which problems are solved or remain open in light of recent developments. We also discuss how MT models led to the development of prominent LLM architectures, how they continue to support LLM performance across different tasks by providing a means for cross-lingual knowledge transfer, and how the possibilities that LLM technology brings are redefining the task itself.

1. Introduction

Machine translation (MT) is the task of automatically translating text or speech from one language into another, and it has an extensive range of applications in business localization, diplomatic communication, and content creation for media and educational resources. Having established itself as one of the most challenging tasks in artificial intelligence (AI) [1], MT has been the primary application driving research on architecture and model development [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], ultimately enabling the emergence of large-scale, generally applicable generative language models, also known as large language models (LLMs). These models, with their ability to understand and generate human-like text, have revolutionized various domains, from automated customer service and content creation to complex problem-solving and interactive applications, reshaping the landscape of AI and its applications across industries.
Despite enabling a wide range of new applications, the self-attention model [5] underlying prominent LLMs was originally designed for the task of translation. The novel setting, in which MT between a pair of languages is accomplished as only one of the many tasks the LLM is trained to solve with cutting-edge architectures, raises interesting new prospects on how knowledge might be represented and shared across languages, domains, and even modalities in solving various problems, and on how LLMs can help redefine the task of MT.
In this paper, we aim to reposition the study of MT in relation to state-of-the-art advances in language modeling and present various approaches for building MT systems in historical context, focusing on how different hypotheses about the translation task helped develop different modeling approaches, which in turn led to both solutions and new problems. We discuss how early work in MT grappled with the foundational challenge of modeling language as a deeply structured, hierarchical system, seeking models that could capture latent syntactic and semantic abstractions for robust generalization. We then trace the rise of statistical methods, which shifted focus toward empirical performance and often dismissed linguistic theory, leading to an era that questioned the necessity of explicit linguistic knowledge and favored data-driven correlations over structured representations. With neural models, and especially LLMs, new research questions have emerged: how does a model's capacity to interpret prompts influence translation performance, is genuine language understanding even necessary for effective translation, and can task competence emerge from in-context or self-supervised learning alone? These transitions mark a shift from hand-crafted rules, to statistical patterns, to implicit learning, with each phase reframing the core hypotheses about what successful translation demands.
After three decades of research, recent advances have brought MT to an important milestone: it can finally serve as a practical human-aided translation tool. We discuss the remaining challenges of using ad hoc methodologies in building and evaluating translation systems for real-life scenarios. Our survey presents a comprehensive collection of problems and limitations across different historical approaches to MT. We identify key issues that remain to be solved and future research directions, including applicability across languages, domain adaptation, biases, and system evaluation. We also explore theoretical and practical considerations for integrating LLMs into MT systems, focusing on new applications such as fine-grained evaluation and stylistic personalization of translations.

2. Machine Translation

2.1. Historical Approaches

Traditional attempts to solve the MT task have generally adopted a two-fold process: first analyzing the source language to extract its meaning, and then synthesizing its semantic equivalent in the target language. The methods used in modeling the transfer of source meaning into the target language follow one of three major approaches that are quite distinct [21] and, in terms of linguistic and structural comprehensiveness, could in many ways be considered complementary to each other.
The prominently adopted approach, which also enabled the success of statistical models, is the direct transfer approach, which assumes that translation can be solved through lexical mapping, or alignment, of sets of words between a pair of sentences in two different languages. The direct model regards sentences as sequences of words, treating words as atomic units, and implements translation as joint analysis of the source sentence and synthesis of the target sentence, relying on a set of lexical translation and reordering rules. In early rule-based systems [22], such predictions would be derived from expert-annotated dictionaries and grammars, which may broadly represent syntactic, semantic, or idiomatic reordering rules for translating a sentence between a pair of languages. However, initial attempts at implementing MT with the direct transfer approach were only successful in revealing the inherent complexity of language. The main limitation of rule-based direct translation methods was the intractability of the annotations required to build refined dictionaries and reordering rules, especially considering the infinite theoretical capacity of humans in phrasally constructing an utterance in any given language [23], in addition to the complexity arising from neglecting morphology, i.e., subword-level transformations and their effect on syntax. Nevertheless, as preliminary approaches, they yielded initial results and drew further interest and attention to MT research.
As a more computationally efficient alternative, subsequent studies proposed extracting syntactic structures during source and target language analysis and hierarchically aligning the two languages during translation. The syntactic transfer model treats a sentence as a structure rather than the linear string of words assumed in the direct model, and finds a structure in the target language that can represent the same meaning. Accordingly, the cost of annotating and searching for sentences in the source language and their translations in the target language reduces to a finite number of equivalent structures in the two languages. The syntactic model offers a high level of flexibility by not restricting the structures in the two sentences to be exactly symmetric; instead, linguistic specialty and variations in expressivity across languages are also taken into account, making this approach in principle capable of achieving syntactic generalization or rephrasing during translation. On the other hand, developing and maintaining a framework for accurately and coherently representing structural dependencies for many languages proved to be quite challenging.
The third approach, interlingua, or translation by transfer in the universal meaning space, maps source sentences to an intermediate layer of language-agnostic semantic-syntactic representations. These representations comprise predefined symbols designed to express any pragmatic, cultural, and syntactic concepts that enable translation across different languages. The interlingual model was developed to further offer an economical way for developing translation systems that support more than a single translation direction [24]. While the difficulty of designing a comprehensive interlingua is evident, the concept would be revisited in the development of multilingual MT models [25].
As the first substantial attempt at the non-numerical use of computers, MT received great attention and aroused high expectations of technology. As can be seen in Figure 1, the three main methods for modeling the MT task take quite distinct approaches and have been found to have different advantages and disadvantages depending on the types of languages involved and the availability of resources. While many under-resourced languages continue to rely on rule-based systems for MT due to limited training resources [26,27,28,29,30,31,32,33,34,35,36], the large amount of human labor required to prepare expert-annotated dictionaries and grammatical rules for building rule-based systems motivated the paradigm shift toward statistical methods for learning translation: the ideal direction for potentially building MT models that can extract the rules of language from examples and generalize them to translate previously unseen lexical or grammatical forms. On the other hand, achieving true generalization may also require a more comprehensive framework for modeling language and its often irregular grammatical structures, which may not be truly possible relying solely on statistical methods.
Language is far more than a collection of lexical entries or syntactic rules; it is a deeply structured, hierarchical system shaped by layers of meaning, context, and transformation. Despite significant advances in machine translation, current approaches continue to reveal the complexity of language as a cognitive and communicative phenomenon. Many of its fundamental properties remain elusive, particularly when it comes to modeling the abstractions and structures that underlie linguistic generalization. This underscores a central challenge at the core of early MT research: how can we design models that not only process language data, but also learn its latent, hierarchical organization in a way that enables robust and generalizable translation?

2.2. Word-Based Statistical Machine Translation

As with statistically modeling any learning task, the most important constraint for models is the availability of task-specific examples for guiding the learning process. In the case of MT, the most useful and predominantly used resource has always been parallel data: collections of translations between a pair of source and target languages, aligned at the sentence level. Modeling translation directly at the sentence level intuitively encouraged statistical models to adopt the direct transfer approach.
While the direct transfer method may not have been ideal for rule-based MT, it was an inspiring basis for a first statistical MT method, since it carried no inherent linguistic assumptions and thus could, in principle, be applied to any language.
The IBM statistical word-based machine translation approach [37] was a groundbreaking development in machine translation that viewed translation as a probabilistic process. By using parallel text corpora (texts in two languages), the system learns to map words from a source language to a target language based on statistical patterns and probabilities. At its core, the approach relies on word alignment, which establishes connections between words in the source and target languages by calculating the likelihood that they are translations of each other. These alignments are learned through an iterative process using the expectation–maximization (EM) algorithm [38], which progressively refines the alignment probabilities to find the most likely word pairs across languages. The system combines three key components: a translation model that determines word correspondences between languages, a language model that ensures grammatical correctness in the output, and an alignment model that handles word reordering to account for different word order patterns between languages. While relatively simple by today’s standards, as it focuses on word-to-word translations rather than handling more complex linguistic structures, this approach laid the mathematical and conceptual foundation for modern machine translation systems [39].
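To make the iterative estimation concrete, the following is a minimal Python sketch of IBM Model 1 training with EM, simplified by omitting the special NULL source token; the toy corpus is invented for illustration.

```python
from collections import defaultdict

def ibm_model1(parallel_corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f|e)."""
    # Initialize t(f|e) uniformly over the target vocabulary.
    tgt_vocab = {f for _, tgt in parallel_corpus for f in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # normalizers c(e)
        # E-step: distribute each target word's alignment mass over source words.
        for src, tgt in parallel_corpus:
            for f in tgt:
                norm = sum(t[(e, f)] for e in src)
                for e in src:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[e] += delta
        # M-step: re-estimate translation probabilities from expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

corpus = [("the house".split(), "das Haus".split()),
          ("the book".split(), "das Buch".split()),
          ("a book".split(), "ein Buch".split())]
t = ibm_model1(corpus)
print(round(t[("the", "das")], 3))  # approaches 1.0 as alignments sharpen
```

Even on this three-sentence toy corpus, the expected counts concentrate probability on the consistently co-occurring pairs across iterations, which is the behavior the richer alignment models build upon.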

2.3. Phrase-Based Machine Translation

Phrase-based machine translation (PBMT) [40,41,42,43] extends word-based MT by leveraging word alignments to learn larger translation units. While it starts with basic word-level alignments between source and target sentences, PBMT extracts consistent phrases (word sequences) from these alignments to build a phrase translation table that captures many-to-many word relationships and common expressions. This approach is enhanced by a distortion model that handles phrase reordering in the target language, typically using distance-based scoring to penalize large positional jumps [44,45,46,47,48,49,50,51]. This evolution from word-based MT offers several advantages: better handling of local context, improved translation of idiomatic expressions, and more natural phrase ordering in the output, while still maintaining the fundamental concepts of alignment and probabilistic translation modeling.
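As a small illustration of the distance-based distortion scoring mentioned above, the sketch below penalizes each jump between consecutively translated source spans; the penalty weight alpha and the span indices are illustrative values, not settings from the cited systems.

```python
import math

def distortion_score(phrase_spans, alpha=0.9):
    """Distance-based distortion: each jump between consecutively translated
    source spans (start, end) is penalized by |start_i - end_{i-1} - 1|
    weighted by log(alpha), so monotone orderings score highest."""
    score, prev_end = 0.0, -1
    for start, end in phrase_spans:
        jump = abs(start - prev_end - 1)
        score += jump * math.log(alpha)
        prev_end = end
    return score

print(distortion_score([(0, 1), (2, 3), (4, 4)]))  # monotone order: 0.0
print(distortion_score([(2, 3), (0, 1), (4, 4)]))  # reordering: ~ -0.84
```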
Proposing a gentle step towards a more linguistically coherent processing of language, PBMT still lacked a general grammatical conceptualization since it disregarded any information on the source sentence context and therefore could not capture long-range dependencies. This would lead to some of the well-known problems with PBMT such as mistranslation of infrequent idiomatic expressions, problems with word order between distant languages, and difficulty in handling syntactic and semantic nuances [52].

2.4. Neural Machine Translation

With the advent of sequence-to-sequence learning models, modeling the translation probability of entire sentences became tractable [2,53]. The principle of translation in the neural MT (NMT) model is to learn representations of words and phrases in a vector space, using neural networks (NNs) to compute the probability of generating specific translation sequences based on these learned representations. In NMT, each word of the source sentence is first converted into a one-hot vector based on its unique vocabulary ID, which is then transformed into a dense vector representation (embedding) in a continuous space. The encoder, implemented as a recurrent neural network (RNN), processes these embeddings sequentially to create a comprehensive representation of the entire source sentence. Subsequently, another RNN, known as the decoder, generates the target language translation by considering both the encoded source sentence representation and the previously generated target words, effectively learning the relationships between the source and target languages. The overall network is trained to maximize the log-likelihood of a parallel training corpus via stochastic gradient descent [54] and the back-propagation through time algorithm [55]. Initially, NMT relied on a fixed context vector: a single compressed representation of the entire source sentence learned by the encoder RNN, used to initialize the decoder for generating the complete translation. This approach proved limiting, as it forced the model to compress all source sentence information into a single fixed-length vector. A significant advancement came with the introduction of the attention mechanism, which dynamically computes a different context vector at each decoding step. This attention-based approach allows the decoder to selectively focus on the parts of the source sentence that are most relevant for generating each target word, effectively creating a more flexible and context-aware translation process [3,6]. Despite the overwhelming success of attention-based NMT, training recurrent models remained quite costly, creating a major obstacle to scaling toward larger and thus better models. Vaswani et al. [5] revolutionized NMT with the Transformer architecture, replacing the sequential processing of recurrent neural networks with parallel computation through self-attention mechanisms. Unlike RNNs, which process words sequentially, the Transformer can model relationships between all words simultaneously, while maintaining word order information through positional embeddings. This design enables both more efficient parallel processing and better handling of long-distance dependencies.
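To illustrate the per-step context vector computation, here is a minimal NumPy sketch of attention; note that [3] uses a learned additive scoring function, whereas this sketch uses simpler dot-product scoring to expose the core idea.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute one context vector for the current decoding step.

    decoder_state:  (d,) current decoder hidden state
    encoder_states: (n, d) encoder hidden states for n source positions
    """
    scores = encoder_states @ decoder_state        # (n,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over source positions
    context = weights @ encoder_states             # (d,) weighted sum
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # toy encoder states for a 5-token source
dec = rng.normal(size=8)        # toy decoder state at the current step
ctx, w = attention_context(dec, enc)
print(w.round(3), ctx.shape)    # weights sum to 1; context has dimension 8
```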
While early statistical MT approaches debated whether to model translation at the level of words or phrases, NMT pointed in the opposite direction: subword units. This was mainly due to the fundamental design of the model, which required learning embeddings for a fixed-size vocabulary of words. Beyond controlling model complexity, this limitation is also related to the difficulty of learning accurate word representations, which requires the model to observe words being used in various contexts. Under conditions of high data sparsity, this creates an important bottleneck for translating rare or unseen words excluded from the model vocabulary. Some studies proposed overcoming this limitation using optimization methods such as segmenting the model memory devoted to computing embeddings [56,57]. Among these, the particular approach of subword segmentation [58,59,60,61] achieved competitive results in the translation of rare words such as loan words and numbers. However, lacking any linguistic notion, purely statistical segmentation would often lead to morphological errors [62,63,64] and cause information loss. To provide more applicable and generic solutions, self-supervised learning of morphological structure was proposed through hierarchical character/word-based architectures [57,65,66,67,68]. Some large-scale studies also showed that neural models can learn morphological information to some degree with fully character-level models; however, matching the performance of subword-based models was eventually found to require much larger network capacity [69,70,71,72,73,74,75].
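As a concrete picture of statistical subword segmentation, the following is a minimal sketch of byte-pair-encoding (BPE) vocabulary learning in the spirit of [58]; the toy word frequencies are invented for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal BPE vocabulary learning: repeatedly merge the most
    frequent adjacent symbol pair across the (word, frequency) lexicon."""
    # Represent each word as a tuple of symbols with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(toy, 5))  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```

Because merges are chosen purely by frequency, the learned units need not align with morpheme boundaries, which is exactly the source of the morphological errors noted above.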

2.5. Multilingual Machine Translation and Generative Language Modeling

Deep learning-based methods accelerated machine learning research at a rate that had not been anticipated or observed in decades, and they seemed to promise endless improvement in return for only one thing: scaling [76]. This would, of course, come at the cost of more and more training data. Unfortunately, parallel corpora are not an easy resource to prepare and collect, and this was a major limitation in scaling NMT models. Previously designed PBMT models alleviated this problem by incorporating monolingual resources into the target language model, but applying this approach did not prove as successful in NMT [77,78]. A more promising approach, which yielded useful results, was back-translation [79,80,81,82], a data augmentation technique where the baseline translation model trained on parallel corpora is used to translate new sentences collected in the target language. These synthetic pseudo-translations could then be used to further train the translation model and help larger models reach better performance. Improvements gained by back-translation, however, were found to be proportional to the quality of the synthetic translations obtained with the base model [83,84,85,86], which, ironically, also depends on the availability of resources, making this method mainly applicable to high-resourced language pairs [87,88].
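The augmentation loop itself is simple; below is a hedged sketch in which reverse_model stands in for any trained target-to-source system (here a trivial placeholder, not a real model).

```python
def back_translate(reverse_model, forward_data, target_monolingual):
    """Back-translation data augmentation.

    reverse_model: a target->source translation function (assumed given)
    forward_data: existing (source, target) parallel pairs
    target_monolingual: extra sentences in the target language
    Returns an augmented parallel corpus for training the forward model.
    """
    synthetic = [(reverse_model(t), t) for t in target_monolingual]
    # Synthetic source sides may be noisy, but the target sides are genuine,
    # which is what makes the augmented data useful for the forward direction.
    return forward_data + synthetic

# Usage sketch with a stand-in "model":
reverse_model = lambda sentence: f"<pseudo-source of: {sentence}>"
augmented = back_translate(reverse_model, [("ein Haus", "a house")],
                           ["a book", "the garden"])
print(len(augmented))  # 3 pairs: 1 genuine + 2 synthetic
```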
Inspired by the hypothesis of distributional semantics and the universal properties of languages, studies building statistical language models over multiple languages found strong evidence that statistical models can naturally align distributed representations of multiple languages in a joint semantic space [89,90,91,92,93,94]. Almost naturally revisiting the idea of interlingua, subsequent studies built on these findings to develop unsupervised MT models, which learn translation relying solely on monolingual resources in two languages [7,95,96,97,98]. These models explicitly aim to align representations across languages using a reconstruction objective, such as a denoising autoencoder, and an adversarial refinement procedure inspired by online back-translation, allowing the learned representations to gradually move closer to each other in vector space. While having the potential to overcome the scarcity of parallel data for many language pairs, unsupervised MT methods still face challenges in achieving high translation quality, especially for languages with significant linguistic differences or limited resources [99]. In such cases, using a pivot language to align languages in a common or pivot space has been a well-established alternative direction [100,101,102,103,104,105,106]. In pivot-based MT systems, the choice of the pivot language and the system set-up are crucial in determining translation quality, which is bounded by the quality of the best translation from the selected pivot in the system. Any translation errors inevitably propagate to the other translation directions, degrading overall translation quality.
Successful research on the practicality of learning representations in a shared distributed space also inspired further studies on how network parameters could be shared across languages under different settings. These led to the development of multilingual MT approaches, where a single shared encoder and decoder are used to translate between multiple translation directions. Multilingual MT models are trained on collections of parallel corpora across languages, where the translation examples do not necessarily have to cover all possible many-to-many translation directions [25,107,108,109,110]. While providing a novel application for zero-shot translation, multilingual language models were found to ultimately suffer from the curse of multilinguality [111]: the performance of a multilingual model with a fixed model capacity tends to decline after a point as the number of languages in the training data increases. Thus, adding more languages may hinder further improvement in low-resource languages, most likely due to incompatibility between the linguistic structures of different languages [112], and in most cases bilingual systems were generally able to produce higher-quality translations, for both the high- and low-resourced languages supported by the model [113]. Therefore, very specialized and finely refined procedures for building multilingual MT systems were developed, covering choices such as how to share vocabulary units and how much data to sample from each language [107], which many studies suggested affect the quality of transfer to low-resourced or zero-shot language directions. In order to build more efficient multilingual MT models, one major line of approaches proposed taking a step back and refining which model components should be shared and which should be specialized to store information individually for each language [114,115,116,117,118]. A general conclusion was that, to build efficient multilingual MT systems that actually benefit low-resource language directions, languages with distinct scripts or morphosyntactic properties should be allocated language-specific vocabularies and network parameters. With scalability challenges arising particularly for low-resource languages, subsequent large-scale initiatives, such as the No Language Left Behind (NLLB) project [119], extended NMT to over 200 languages by employing strategies like data mining, back-translation, and sparsely-gated mixture-of-experts (MoE) models, exemplified by their 54.5B-parameter model that maintained a FLOP footprint comparable to a dense 3.3B model. Similarly, JDExplore's [120] 4.7B-parameter Transformer model demonstrated competitive performance at WMT22. Despite these advances, the field continues to face extreme data imbalance, as most of the 1220 evaluated language pairs still fall under the low-resource category. These developments established the foundation for exploring whether further scaling or architectural shifts could address translation challenges beyond the traditional parallel-data paradigm.
Although Transformer-based NMT systems made substantial progress by leveraging parallel corpora, their dependence on task-specific supervision and susceptibility to data scarcity remained limiting factors. To address these challenges, the Pretrain–Finetune (PF) paradigm emerged, wherein models are first pretrained on general multilingual objectives before fine-tuning for specific translation tasks. A prominent example is mBART [121], which applies multilingual denoising pretraining—combining word-span masking and sentence permutation within a Transformer encoder-decoder architecture augmented with language ID tokens. Pretraining on data from 25 languages enabled mBART to generalize across diverse translation directions, with fine-tuning specializing the model to particular language pairs using supervised parallel data. Large-scale pretraining was realized with 256 V100 GPUs, Adam optimization, and scheduled learning rate decay. While the PF paradigm substantially advanced translation capabilities, its continued reliance on supervised adaptation highlighted the need for more universally adaptable approaches.
Although the PF paradigm continued to depend on supervised adaptation, it significantly advanced multilingual generalization and established methodological foundations that deepened awareness of language diversity. These developments not only improved translation across a wider range of languages but also laid the groundwork for building language models capable of broader cross-lingual and cross-domain applicability, ultimately motivating the shift toward pretraining strategies that require minimal task-specific supervision.

2.6. Few-Shot Learning with Language Models

The outstanding performance of neural methods in natural language processing (NLP) tasks encouraged investments in exploring the limits of scaling, transforming the field into a continuous search for the optimal deep learning architectures for modeling language. This resulted in models whose language representations contain linguistic features found to be useful in various downstream NLP tasks [8,122,123], making language models a preliminary component of most NLP systems. Evidently, the requirement of finding task-specific data was limiting the development of generally applicable large-scale models; therefore, many studies investigated how data from different languages, domains, and eventually tasks could be combined to build large stand-alone models applicable across different settings. Eventually, the work of [124] proposed building efficient multi-task language models that could be easily fine-tuned to perform various NLP tasks.
Preliminary language models were encoder-only architectures, designed primarily for understanding and processing input text [8]. These models focused on contextual prediction objectives, such as masked language modeling and next-token prediction, which made them highly effective for classification tasks. To develop more general-purpose generative models, auto-regressive [125] and decoder-only [126] architectures employing causal language modeling were introduced. These approaches proved to be highly efficient and representative, particularly when trained on large, multilingual, and multi-task datasets [127,128,129,130].
An important innovation in scaling Transformer-based models was the adoption of byte-level vocabulary units, which allowed the handling of arbitrary symbols, including those not covered by standard subword vocabularies, in the vast amounts of training data [111]. A surprising outcome of training these models on unfiltered, multilingual datasets, without explicitly defining language identities, was their ability to process and generate text in multiple languages. While this approach was not entirely optimal, it demonstrated the potential for models to learn cross-lingual representations and inspired further research into building systems that generalize across languages and tasks without explicit input specifications. An important and highly relevant application was indeed MT: especially for low-resourced languages, using pretrained representations obtained from encoder-only language models could significantly boost performance when sufficient parallel data is not available, providing a new state-of-the-art baseline for MT that now also included combinations of multi-task and multilingual models [131,132]. The quality of representations, in the sense of how multilingual information is learned and shared across languages, primarily determines the success of pretrained language models for MT. In particular, models deploying translation as an auxiliary objective, either by being trained on both masked language modeling and parallel data [111] or with unsupervised reconstruction objectives [133], have proven to be more applicable.
With the possibility of training models with longer context windows and larger amounts of diverse training data collected across various tasks, language models have naturally evolved to support prompting, a technique that enables users to guide a model’s output by providing specific input instructions. Early progress in this area was marked by the development of decoder-only models like GPT [126], which utilized auto-regressive decoding to predict the next token in a sequence, making them naturally suited for tasks requiring dynamic generation based on user input. The introduction of in-context learning [134], made possible through training on large and diverse datasets, empowered models to interpret prompts as implicit instructions, effectively eliminating the need for task-specific fine-tuning. This feature showed particular benefit for zero-shot and few-shot learning scenarios, where the model could adapt to new tasks using only examples provided within the prompt.
Following the limitations of supervised fine-tuning, the pretrain–prompt paradigm emerged as a scalable alternative, leveraging large decoder-only language models pretrained on massive corpora to perform translation via prompting without explicit task-specific updates. In this framework, exemplified by models such as GPT-3 [134] and GLM-130B [135], translation is reformulated as a language modeling problem, where carefully crafted prompts guide the model to generate target translations. Architecturally, these models discard the encoder-decoder structure in favor of autoregressive Transformer decoders trained with next-token prediction or denoising objectives. GPT-3, with 175 billion parameters and primarily English-centric data (93%), demonstrated strong zero-shot and few-shot translation capabilities but exhibited notable gaps for non-English languages. To enhance translation performance, prompting strategies have been developed, including template engineering [9], in-context learning with selected demonstrations, and cross-lingual prompting, where examples from different language pairs guide translation. The quality of LLM-based translation, however, has been found to depend heavily on the quality of prompting, which should be carefully designed with attention to the clarity and phrasing of the instructions and the amount of preliminary information provided, such as translation examples in the given language pair and, where possible, task constraints such as domain-specific terminology or stylistic preferences [9]. The study of [9] showed that simple prompts explicitly specifying source and target languages yield superior zero-shot results, while the number and quality of prompt examples, measured through metrics such as semantic similarity (SemScore) [136] and model likelihood (LMScore) [137], substantially influence few-shot performance. Nevertheless, challenges persist: LLM translations tend to prioritize fluency over faithfulness, often leading to phenomena such as dropped content, prompt copying, hallucinations, and translation errors in non-English-centric directions [9,138].
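For concreteness, here is a minimal sketch of assembling a few-shot translation prompt; the template wording is illustrative and not a prescription from [9].

```python
def build_translation_prompt(src_lang, tgt_lang, examples, source_sentence):
    """Assemble a few-shot translation prompt of the kind discussed above.

    examples: list of (source, target) demonstration pairs; an empty list
    yields a zero-shot prompt that only specifies the two languages.
    """
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source_sentence}\n{tgt_lang}:")
    return "\n\n".join(lines)

prompt = build_translation_prompt(
    "English", "German",
    [("a house", "ein Haus")],   # one in-context demonstration
    "the garden")
print(prompt)
```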
While specialized MT systems continue to provide the highest translation quality for specific language pairs [139], the use of LLMs for translation represents a promising direction for optimizing traditional NLP pipelines, particularly in industrial applications where a more generalized solution is often required to address a wide range of tasks. Within the LLM-based translation framework, fine-tuning can be performed on selected parallel datasets; alternatively, fine-tuning on monolingual data in the target language followed by task-specific fine-tuning on translation data offers another way to enhance performance [140]. In the context of LLM-based MT, a critical research question emerges: how does language understanding, and more specifically the model's capacity to interpret prompts, influence success in translation tasks? For instance, LLAMA-2 [130], a multilingual model explicitly fine-tuned on parallel datasets, achieves performance comparable to GPT-2 [127], a model originally designed for English but trained with inadvertent inclusion of multilingual data. Despite these advancements, the relationship between prompt clarity and output quality remains elusive, with no well-defined functional framework to measure how various linguistic and stylistic dimensions of translation impact performance. Furthermore, as LLMs scale in size, progress in language understanding and generation does not inherently follow, underscoring the persistent challenge of fostering efficient learning. Addressing these complexities calls for extensive research, including systematic investigations into how variations in prompting influence translation quality and broader model behavior. An extensive evaluation of LLMs on the MT task [141] shows that while GPT-4 surpasses the strong supervised baseline NLLB in 40% of directions, LLMs still lag behind commercial systems like Google Translate, especially for low-resource languages. Many findings suggest that LLMs can learn translation efficiently even for unseen languages, that instruction semantics can be ignored when in-context examples are given, and that cross-lingual exemplars can outperform same-language examples for low-resource translation; at the same time, in-context learning templates, exemplar quality, and prompt ordering all significantly affect translation outcomes. Empirical analyses reveal that pivoting through English improves non-English to non-English translation quality, reflecting the English-centric training bias inherent to many LLMs [9].
Considering the remaining 60% of directions still better translated with specialized models, one can raise a pivotal inquiry, not necessarily limited to LLMs but applying to neural language modeling methods generally: does a model need to genuinely “understand” the meaning of the strings it generates to perform tasks effectively? And if so, can such understanding arise through in-context or other learning methods? Resolving these research questions is crucial for developing universally applicable models and lies at the heart of ongoing debates about the fundamental nature and limitations of these models in acquiring the grammar and structural patterns required to generalize to unseen word forms and sentences. Before we discuss these and related problems in the section on Current and Emerging Problems (Section 5), we introduce historical methods for addressing irregular or infrequent linguistic paradigms and low-resourced languages through specialized hybrid MT methods.

2.7. Inductive Learning and Hybrid Models

The question of whether large language models truly “understand” language remains a contentious topic in the research community, especially as these models continue to achieve unprecedented performance across numerous sophisticated benchmarks and linguistic tasks. However, skepticism toward the necessity of linguistic knowledge in MT is far from a novel phenomenon. This sentiment is epitomized by Jelinek's oft-quoted remark, “Every time I fire a linguist, my scores improve” [142], reflecting a long-standing belief among early proponents of statistical machine translation (SMT) that linguistic insights did little to enhance performance. Similarly, the current era of continuing progress in connectionist models [143,144,145,146,147], computational frameworks in which cognitive processes are represented as emergent patterns of activation across networks of interconnected units that learn from experience, invites renewed scrutiny of whether these models truly acquire and internalize the structure and semantics of language, or whether they are merely sophisticated pattern-matching systems. Additionally, a perhaps less popular but more fundamental question follows: what do improved scores in the context of generative models truly represent? These questions underscore the enduring tension between linguistic theory and empirical performance in advancing MT and language modeling.
Humans acquire systematic generalization through the compositionality inherent in natural languages, where structural mechanisms operate over sets of primitive finite concepts [148]. In contrast, transductive learning methods, particularly ones relying on neural networks, have been found suboptimal for achieving systematic generalization [149,150,151,152,153,154,155], and they have yet to demonstrate human-like language understanding [156]. This limitation arises primarily from their reliance on extensive data to observe linguistic paradigms repeatedly, enabling accurate modeling of their meaning and usage. However, the inherent complexity of natural languages often results in paradigms that are both too intricate and too sparse to be adequately represented in any collected dataset [157,158,159,160,161,162,163].
To address these challenges, early methods in MT proposed hybrid models that combined statistical approaches with rule-based systems, leveraging the strengths of both methodologies [164,165]. In the realm of statistical MT, translation between typologically distant languages proved particularly problematic due to the sparse and complex nature of linguistic data. To mitigate this issue, structural alignment mechanisms adopting the syntactic transfer approach, also referred to as syntactic MT, were introduced [166,167]. Similarly, extensions to channel operations, such as those proposed in [168], enriched reordering patterns to accommodate more complex linguistic phenomena, including those that cross syntactic brackets [169]. Syntax-based models provided a more linguistically coherent approach that still incorporates statistical learning, moreover allowing a more comprehensive evaluation of the applicability of statistical methods in capturing the different types of hierarchical structures observed across languages of varying typology. The three syntax-based models, Synchronous Context-Free Grammar (SCFG), Inversion Transduction Grammar (ITG), and Synchronous Tree Substitution Grammar (STSG), differ in their use of syntactic information and structural representation. SCFG extends traditional context-free grammars by pairing two related right-hand sides, one for the source language and one for the target, making it suitable for capturing phrase-level correspondences without relying on explicit linguistic syntax; the hierarchical phrase-based model proposed by [167] is a notable SCFG-based approach. ITG, introduced by [166], constructs synchronous parse trees for source and target sentences, explicitly modeling structural correspondences and reordering patterns through permutations, yet it similarly avoids real linguistic parse trees. STSG, on the other hand, leverages actual linguistic parse trees, with productions represented as pairs of elementary trees linked at non-terminal nodes, providing a linguistically grounded framework for translation; while SCFG and ITG typically operate as string-to-string models with minimal syntactic reliance, STSG emphasizes detailed syntactic representation for more robust linguistic alignment. Syntax-based models can further be categorized by where linguistic syntax is applied. The first category, string-to-string, does not utilize linguistic syntax, as seen in SCFG-based and ITG-based models. The second, string-to-tree, applies linguistic syntax only to the target side, allowing the source to remain a linear string [169,170]. The third, tree-to-string, uses syntactic structures exclusively on the source side to guide translation into a linear target representation [171,172,173]. Finally, tree-to-tree models apply linguistic syntax to both source and target languages, offering the most detailed syntactic alignment [174,175,176,177]. These categories reflect increasing levels of syntactic integration, with tree-to-tree models providing the most linguistically rich framework for translation.
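To make the synchronous-rule mechanism concrete, here is a minimal sketch of how a single SCFG rule rewrites linked non-terminals on both sides at once; the French-English rule and the lexical fillers are invented for illustration.

```python
def apply_synchronous_rule(src_rhs, tgt_rhs, fillers):
    """Expand one SCFG rule: linked non-terminals (X1, X2, ...) appearing on
    both right-hand sides are replaced by the same sub-translation pair.

    fillers: dict mapping a non-terminal to its (source, target) strings.
    """
    src = " ".join(fillers[t][0] if t in fillers else t for t in src_rhs)
    tgt = " ".join(fillers[t][1] if t in fillers else t for t in tgt_rhs)
    return src, tgt

# Rule X -> <X1 de X2 | X2 's X1> captures French-English possessive reordering.
src, tgt = apply_synchronous_rule(
    ("X1", "de", "X2"), ("X2", "'s", "X1"),
    {"X1": ("la maison", "house"), "X2": ("mon ami", "my friend")})
print(src, "->", tgt)   # la maison de mon ami -> my friend 's house
```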
An important limitation of CFG-based models is the neglect of morphology and its relationship to syntax. In fact, morphology and syntax are part of a much more complex mechanism that jointly determines sentence meaning. A word's syntactic role affects not only the construction of the entire sentence but also the morphological inflections of the word itself and of the many other words in the sentence that depend on it. On the other hand, such relationships are quite difficult to integrate into modern computational models, making applicability to languages with rich morphology an open question. Another important consequence of statistical modeling in morphologically-rich languages is the high level of lexical sparsity observed in any amount of collected data, due to the exponentially growing number of potential surface forms a single lemma can have. This problem makes MT specifically challenging in terms of being able to understand as well as generate these rare or unseen word forms. In the context of statistical MT, a commonly used approach was to decrease sparsity by finding a more frequent and meaningful shared set of common intermediate units, such as morphemes or morphological features shared across words [178,179,180], ultimately helping better estimate the probability distribution of alignments of lexical units and aiding systematic generalization [181]. This approach also aids in more homogeneous word alignment between distant languages. For instance, a Turkish word can often translate to a phrase of many words in English or another analytic, morphologically less sparse language, and in such cases segmenting words into morphemes before training machine translation systems can help improve the quality of MT.
Tackling translation into morphologically-rich languages was another line of work, addressed by hybrid word–character alignment models [182]. In NMT, Sánchez-Cartagena and Toral [183] suggested using the Finnish morphological segmentation tool Omorfi [184] to separate words in the training corpus into their bases and inflectional suffixes to perform vocabulary reduction in English-to-Finnish neural machine translation. Ataman proposed an unsupervised morphologically-motivated vocabulary reduction method [63] as an extension of the Morfessor algorithm [185]. Similarly, Huck et al. [186] and Tamchyna et al. [187] applied morphological analysis to split words into sequences of lemmas and syntactic feature sets in English-to-German and English-to-Czech neural machine translation. However, linguistic tools did not provide generally applicable solutions to open-vocabulary NMT; statistical subword segmentation methods therefore proved more useful [58,59]. Of course, morphological analysis tools are typically not available across languages, and statistical techniques can only be useful for certain typologies [188]. Since words in this model are generated in the course of predicting multiple subword units, generalizing to unseen word forms becomes more difficult, as some of the subword units needed to reconstruct a word may be unlikely in the given context. To alleviate the sub-optimal effects of explicit segmentation and generalize better to new morphological forms, later studies explored extending the same approach to model translation directly at the level of characters [74], which, in turn, was shown to require considerably deeper networks, since the network must then learn longer-distance grammatical dependencies [189], raising computational complexity and the demand on training resources to an unrealistic level. Hybrid word–character models could allow for better efficiency [57,190].
The question of whether the improvements gained with inductive learning methods are significant has been widely discussed in the history of MT. For instance, Koehn et al. [40] argued that syntax-based approaches not only fail to improve performance but can actively degrade translation accuracy. Their experiments demonstrated that the removal of non-constituent yet frequently used phrases, such as “there is,” negatively impacted translation quality, leading them to the conclusion that syntax does not play a critical role in SMT. This claim challenges the assumption that incorporating linguistic structure inherently benefits translation systems, suggesting instead that vast amounts of data can compensate for the lack of explicit syntactic modeling. In high-resource scenarios, the effects of inductive learning appear to diminish, raising the question of whether evaluation metrics truly capture the role of linguistic generalization. Since automatic evaluation methods, such as BLEU [191], primarily measure lexical and phrasal similarity rather than structural generalization, they may not adequately reflect whether a system has internalized deeper syntactic or semantic relationships. This highlights a potential misalignment between how models are trained and how they are assessed, further complicating the debate on whether structured linguistic knowledge is necessary for effective MT. The details of methods used in the evaluation of MT systems are discussed in the next section.

3. Evaluation

While a thorough and accurate evaluation of any translation system should ultimately involve human assessment, due to time and cost considerations, a prominent approach, especially during system development, relies on automatic heuristics, which can provide low-cost feedback on the sufficiency or efficacy of the model settings or resources used in system development. Such evaluations employ specific metrics to determine the quality of translations, ensuring that they not only retain the meaning of the original content but also adhere to the linguistic and grammatical norms of the target language.
Automatic evaluation metrics are generally designed with two main approaches. The first and traditional approach to automatically evaluating translation quality was more suitable to earlier rule-based systems, as MT systems were then designed to be integrated into Computer-Assisted Translation (CAT) tools. These metrics perform a simple comparison that identifies lexical mistakes in the output with respect to the reference. A straightforward method is measuring the number of edits needed to transform a system-generated translation into a reference translation, defined as the Translation Edit Rate (TER) [192]. TER calculates the percentage of words that need to be inserted, deleted, substituted, or reordered to match the reference. In error-based metrics, a lower score indicates a higher-quality translation. This way, a translator using a translation memory can assess the MT output for post-editing in the most efficient and productive way.
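For illustration, the following is a simplified, TER-like computation using word-level edit distance; full TER additionally counts block shifts for reordered spans, which this sketch omits.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions);
    dividing by the reference length gives a simplified, TER-like score."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(edit_distance(hyp, ref) / len(ref))  # 1 insertion / 6 words ≈ 0.167
```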
The development of SMT models led to the integration of statistical analysis into the automatic evaluation of MT systems. The first notable evaluation metric, BLEU [191], was introduced by the IBM research group, followed by NIST, developed by the National Institute of Standards and Technology [193]. Both metrics have been employed by (D)ARPA in assessing MT projects funded by U.S. research programs. As the usage of MT shifted toward new applications, the metrics also accounted for maximizing the similarity of the system output to a gold-standard utterance exemplifying an accurate output. In this case, quality is evaluated based on the intended purpose, categorized into different levels such as dissemination (publishable-quality translations), assimilation (acceptable at a lower quality threshold), interchange (communication or unscripted presentations), or information access (non-grammatical multilingual information retrieval and extraction purposes) [194]. While for earlier PBMT systems metrics like BLEU provided fast and relatively reliable feedback on system performance, as systems became more competitive, especially with NMT technology, matching-based metrics started to fall significantly behind when applied to the translation of real data. For instance, when the output contains a rephrased version of the reference due to stylistic or syntactic variations in the generative process, or in many morphologically-rich languages, where words can change in form not only at the subword level through inflectional or derivational transformations but also at the sentence level due to free word order, word-level metrics are known to fail to produce accurate evaluations [195,196,197,198]. Alternatively, ref. [199] proposed n-gram matching at the character level, which has been more appropriate for evaluation in morphologically-rich languages. However, matching-based approaches still might miss semantic nuances in the generated language. Recent studies proposed the alternative approach of using vector similarity in distributed representations [200]. This method provides a better semantic notion than simple word matching heuristics, yet there are still issues regarding the robustness of pretrained language representations and how well they can capture the non-linear functions needed to represent different types of semantic or lexical accuracy, grammatical mistakes, and many other complex features for assessing translation quality.
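As a reference point for the discussion of matching-based metrics, here is a minimal sentence-level BLEU sketch (single reference, no smoothing); production implementations add smoothing and standardized tokenization.

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each hypothesis n-gram count by its count in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(log_precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the red mat".split()
print(round(bleu(hyp, ref), 3))  # ≈ 0.673
```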
By the nature of their design, some metrics may capture certain typological forms and patterns better than others, and thus correlate better with languages having those features. While language models have been found to learn some generalizable patterns of grammar [201,202,203], these findings have not proven applicable across diverse languages [204]. To overcome the limited applicability of pretrained language models to translation evaluation, a more recent trend has been fine-tuning models on translations annotated with human evaluation scores. This is accomplished using competition data sets from recent years spanning news or specific-domain translations in a set of languages. In addition to having very limited applicability across languages, this solution is ill-defined in the sense that the example annotations are too ambiguous as learning signals to inform a system of whether a translation is accurate, a task that even humans often struggle to accomplish.
While quality assessment appraises the goodness of a translation, quantifying how applicable or useful the translation is for deployment, error analysis judges a translation from the opposite perspective, i.e., measuring its badness: an ultimate measure that estimates the amount of work required to correct raw MT output to a standard considered acceptable as a translation [194]. In either case, metrics were ultimately designed for measuring the feasibility, potential, and limitations of a system by examining the contexts in which it is most likely to be effective or prone to failure, such that its results are usually more meaningful and interpretable to interested parties like system developers and users. With the advent of multi-task models, MT is only one of many applications, and the way users deploy MT in new contexts raises the question of whether it is time to reconsider evaluation: not merely as a process of maximizing specific metrics, but as a return to its original definition, a multidimensional approach that integrates multiple factors to ensure more comprehensive and reliable assessments, given the possibility of producing translations to be used across various contexts and domains.

4. Data Curation Methods Used for Building MT Systems

In traditional statistical and early NMT systems, models were trained separately for each language pair using curated bilingual corpora. These corpora were typically sourced from high-quality institutional data such as Europarl [205] or the UN Parallel Corpus [206], and often required extensive pre-processing steps, including correction of sentence misalignment, tokenization, truecasing, and domain-specific filtering. Domain mismatch and data sparsity, especially for specialized or under-resourced language pairs, were common challenges in building MT systems, addressed through data selection techniques like cross-entropy difference scoring [207], perplexity-based ranking [208], and filtering based on alignment confidence [209]. These practices ensured domain consistency and minimized noise in statistical MT pipelines, but they required high effort and bespoke resources for each pair, limiting scalability. As research shifted toward multilingual MT, these foundational methods informed the development of more generalized and scalable data preparation workflows.
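A sketch of cross-entropy difference scoring in the spirit of [207] follows; the two language-model arguments are placeholders for any scorers returning per-word cross-entropies, and the toy scorers in the usage example are invented for illustration.

```python
def cross_entropy_difference(sentence, in_domain_xent, general_xent):
    """Score a candidate sentence: lower scores indicate sentences that look
    more like the in-domain corpus than the general corpus.

    in_domain_xent / general_xent: callables returning per-word cross-entropy
    of the sentence under the respective language model (placeholders here).
    """
    return in_domain_xent(sentence) - general_xent(sentence)

def select_for_domain(candidates, in_domain_xent, general_xent, keep=0.2):
    """Keep the fraction of candidates ranked closest to the target domain."""
    ranked = sorted(candidates,
                    key=lambda s: cross_entropy_difference(
                        s, in_domain_xent, general_xent))
    return ranked[:max(1, int(len(ranked) * keep))]

# Toy scorers: pretend shorter sentences are "more in-domain".
in_dom = lambda s: len(s.split()) * 0.5
general = lambda s: 3.0
print(select_for_domain(["a b", "a b c d e f", "a b c"], in_dom, general))
```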
Modern multilingual MT systems rely on large, heterogeneous datasets spanning many languages, domains, and data qualities. Curation begins with the acquisition of both parallel and monolingual data from multilingual web sources (e.g., Common Crawl [210], ParaCrawl [211], and CCMatrix [212]), institutional repositories, or crowd-sourced translation efforts. Parallel sentence alignment is typically performed using multilingual encoders such as LASER [213], LaBSE [214], or newer models like LASER3 [215]. Monolingual corpora are utilized through back-translation [82], self-training [216], and forward-translation [217], especially in low-resource contexts. However, these raw corpora often contain noise such as duplicate or misaligned sentences, hallucinated content, and domain inconsistencies. Data cleaning techniques include deduplication as well as length- and language-based filtering. In addition, data selection techniques continue to play a role, in particular for separating the training data into generic and domain-specific training sets with methods like cross-entropy filtering [218] and classifier-based relevance ranking [219], which can be used to subsample corpora for adapting multilingual models to a specific language or application domain.
A major challenge in multilingual MT is the highly skewed distribution of data across languages. High-resource languages like English or French may be over-represented by orders of magnitude compared to low-resource languages. Naïve uniform sampling disproportionately benefits these dominant languages and can hinder generalization elsewhere. To address this, several multilingual sampling strategies have been proposed. Temperature-based sampling [220] reweights data distributions to upsample low-resource languages, while Target Conditioned Sampling [221] learns language-specific weights to minimize downstream task loss. More recent approaches incorporate adaptive sampling based on learning dynamics [222] or hierarchical clustering to group languages by typological or semantic proximity [223]. These strategies are often combined with curriculum learning and dynamic data selection to better balance training signals across languages.
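To make the reweighting concrete, the following sketch implements temperature-based sampling in the spirit of [220]: raw language proportions are raised to the power 1/T and renormalized, so higher temperatures flatten the distribution toward low-resource languages (the corpus sizes and T = 5 are illustrative).

```python
# A minimal sketch of temperature-based sampling over language corpora;
# T = 1 recovers proportional sampling, larger T approaches uniform.
def temperature_sampling_probs(corpus_sizes, T=5.0):
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

sizes = {"en": 1_000_000, "fr": 200_000, "sw": 5_000}  # illustrative corpus sizes
print(temperature_sampling_probs(sizes))  # "sw" is upsampled relative to its raw share
```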
Recent large-scale multilingual MT systems, such as M2M-100 [222], mT5 [224], and NLLB [215], highlight the central role of curated and filtered data pipelines. These models rely on massive web-scale corpora filtered with multilingual encoders and quality classifiers, often using multiple passes of noisy alignment removal, language verification, and domain scoring. Data quality is further enhanced through knowledge distillation [225], where teacher models generate translations used to train smaller or multilingual student models. Online back-translation [84] and uncertainty-based sampling [226] are increasingly adopted to generate dynamic and informative training data. Benchmark datasets such as FLORES-101 [227] and OPUS-100 [228] emphasize the need for balanced multilingual evaluation, reinforcing the importance of principled data selection and curation strategies. As the field advances, the intersection of data curation, sampling, and filtering continues to define the scalability, fairness, and robustness of multilingual MT systems.
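The distillation step mentioned above can be summarized in a few lines; `teacher_translate` and `train_student` below are hypothetical placeholders for a real MT toolkit's decoding and training APIs.

```python
# A minimal sketch of sequence-level knowledge distillation for MT:
# the student is trained on the teacher's decoded outputs rather than
# (or alongside) the original references. Both callables are hypothetical.
def build_distillation_corpus(monolingual_src, teacher_translate):
    # The teacher decodes every source sentence, typically with beam search.
    return [(src, teacher_translate(src)) for src in monolingual_src]

def distill(monolingual_src, teacher_translate, train_student):
    synthetic_pairs = build_distillation_corpus(monolingual_src, teacher_translate)
    return train_student(synthetic_pairs)
```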

5. Current and Emerging Problems

5.1. Applicability Across Languages

While LLMs are now widely used for MT, research has demonstrated a significant performance disparity between English and other languages [229]. Although GPT-4 [127] approaches the performance of state-of-the-art fine-tuned models, it often fails to surpass them, particularly for languages that use non-Latin scripts and for low-resource languages. Empirical studies indicate that high-resource languages exhibit robustness in few-shot translation settings, whereas in-context learning benefits low-resource languages more consistently [230].
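As a concrete example of the few-shot setting discussed above, a translation prompt might be assembled as follows; the template wording and the demonstration pairs are illustrative, and performance is known to be sensitive to both.

```python
# A minimal sketch of few-shot MT prompting; the template and the
# demonstration pairs are illustrative, not a prescribed format.
def few_shot_mt_prompt(src_lang, tgt_lang, demos, source):
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for s, t in demos:
        lines.append(f"{src_lang}: {s}\n{tgt_lang}: {t}")
    lines.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(lines)

prompt = few_shot_mt_prompt(
    "English", "Swahili",
    demos=[("Good morning.", "Habari za asubuhi."),
           ("Where is the market?", "Soko liko wapi?")],
    source="The children are reading.")
```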
The extension of NMT models to multilingual training offered significant efficiency gains across languages, yet a single model covering hundreds of languages is unlikely to provide optimal performance across all language families. Instead, fine-tuning models on language-specific or family-specific subsets has been proposed as a more effective strategy [231]. In multilingual models, a key challenge is parameter allocation across languages: in a model trained on English and Chinese, for instance, parameter updates for English may not necessarily transfer to Chinese, highlighting the difficulty of cross-lingual knowledge transfer [232], with some linguistic features requiring substantial model capacity for effective representation.
A key challenge in multilingual evaluation is the scarcity of comprehensive benchmarks, which hampers thorough assessment and increases the risk of test data contamination. Although some initiatives provide translation benchmarks for low-resource language families and dialects [233,234,235], the overall availability of low-resource evaluation datasets remains limited. Large-scale evaluation efforts [236,237] have invested significantly in developing multilingual test sets encompassing many underrepresented languages, but these remain constrained by the high cost of human translation. Evaluation campaigns such as the International Workshop on Spoken Language Translation (IWSLT) and the Conference on Machine Translation (WMT) have invested significant effort over the years in building high-quality evaluation data for assessing the translation quality of prominent models, whether in large-scale multilingual evaluation [238,239], for typologically diverse language families [236,240,241,242,243,244,245,246,247,248], or for spoken dialects [249].
While efforts to build more comprehensive evaluation benchmarks are ongoing, a recent exploration of LLM translation capability in very low-resource languages used the AmericasNLP [250] shared task to evaluate LLMs on indigenous American languages. The findings showed that continued pretraining and instruction tuning strategies, including mixing MT and synthetic tasks [140,251], significantly improved translation for indigenous American languages such as Hñähñu and Wixarika. Base models like Mistral 7B and MALA-500 showed that lighter LLMs, when appropriately adapted, can achieve substantial gains in low-resource translation without extreme computational power. As such results reveal the applicability of LLMs in more efficient settings, it is important to continue investing in more inclusive evaluation benchmarks in order to truly assess how well models generalize to the translation of real spoken language.
An earlier evaluation of LLMs in zero-shot MT is another important contribution to understanding the nature and limitations of LLMs. A main limitation in the applicability of MT models based on Transformer architectures is their reliance on statistical learning, which has historically minimized the role of linguistic insights, raising fundamental questions about how to determine whether models truly understand language. Studies evaluating the capability of LLMs to perform zero-shot translation, such as for Kalamang [252] (as in the setting depicted in Figure 2), highlight that LLMs primarily generate text with basic to mid-level grammatical complexity. This limitation prompts a broader inquiry into whether these models can achieve structural generalization, urging a reconsideration of the distinction between feed-forward and recurrent architectures [253], the role of inductive learning methods, and the inherent challenges of acquiring grammar through data-driven approaches, all of which underscore the complexity of modeling linguistic competence [254]. Recent results using linguistic explanations in in-context learning to aid generalization from examples encourage researchers to argue that a return to hybrid rule-based methods for LLM-based MT, potentially through expert-driven interventions, may be necessary to improve linguistic accuracy [255].
Data scarcity remains a fundamental constraint for NMT and LLMs, particularly as models scale. While LLMs can leverage vast training corpora, they ultimately face saturation points where additional data no longer yields significant improvements. This raises critical questions: What determines the upper bound of translation accuracy for NMT? Can these models generate linguistically valid outputs beyond their training data? How should evaluation frameworks be designed to incentivize true generalization rather than memorization? Additionally, it remains an open question whether translation quality can be improved without simply increasing model size. Addressing these challenges will be crucial in determining the future trajectory of multilingual machine translation in the era of LLMs.

5.2. Evaluation

Automatic evaluation metrics have played a pivotal role in the development of MT systems, particularly in the era of NMT. Among these, BLEU has been instrumental as a fast and computationally efficient verification tool, significantly influencing large-scale NMT development and shaping the trajectory of MT research. However, the widespread focus on maximizing BLEU scores—often at the expense of qualitative progress—has led to systemic neglect of languages and linguistic phenomena where existing metrics fail. This emphasis on score improvement, rather than meaningful evaluation, has particularly disadvantaged low-resource and morphologically complex languages, where the applicability of standard evaluation metrics remains limited.
On the other hand, many existing evaluation metrics, particularly those designed for statistical models, struggle to capture structural generalization. Statistical models often fail to generate new words or to reorder phrasal structures in a linguistically coherent way, particularly for languages with rich morphology or free word order. Given that most widely used evaluation metrics, including BLEU, favor memorization over true generalization, they inadvertently discourage models from learning meaningful linguistic structures. Furthermore, evaluation scores tend to penalize lexical choices and alternative phrasal reorderings [197] that do not exactly match the reference, even when such variations are equally valid or more contextually appropriate.
The limitations of traditional evaluation metrics have prompted longstanding debates regarding their validity. Notably, the claim that “every time I fire a linguist, my scores improve”—attributed to Jelinek—reflects early SMT researchers’ skepticism regarding the role of linguistic theory in MT performance [142]. Yet, the unresolved discrepancies between purely statistical and linguistically informed approaches raise an important question: Are we truly measuring what we should be measuring? Current single-score evaluation methods often provide misleading insights, as even human evaluators frequently disagree on translation quality assessments. Multi-level scoring frameworks, such as MQM (Multidimensional Quality Metrics) [256], have revealed that translations deemed “high-quality” by automatic metrics often contain significant grammatical errors.
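As an illustration of such multi-level scoring, a simplified MQM-style computation aggregates annotated errors into severity-weighted penalty points; the weights below are illustrative, since published MQM scoring profiles differ (WMT-style annotation, for instance, weights major errors at 5 and minor errors at 1).

```python
# A minimal sketch of an MQM-style analytic score: severity-weighted
# error penalties normalized per 100 source words. The weights are
# illustrative; real MQM deployments define their own scoring profiles.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_penalty(errors, num_words):
    """errors: list of (category, severity) pairs from human annotation."""
    total = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return 100.0 * total / max(num_words, 1)  # penalty points per 100 words

errors = [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
print(mqm_penalty(errors, num_words=24))  # 25.0 penalty points per 100 words
```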
In response to these shortcomings, recent years have seen a shift away from heuristic-based metrics like BLEU [257], with increased investment in pretrained evaluation models such as COMET and BLEURT. However, these newer metrics remain in early stages of development and exhibit critical flaws. For instance, pretrained metrics may be prone to anomalies or overfitting to certain language pairs, raising concerns about their ability to provide reliable and generally applicable cross-lingual evaluation [258,259]. Additionally, different evaluation methodologies lead to inconsistencies in optimization: Maximum Likelihood Estimation (MLE) favors high recall, Reinforcement Learning for MT (RL + MT) prioritizes high precision, and BLEURT-based RL methods fail to penalize repetitive translations [260,261]. The challenge remains in integrating multiple evaluation criteria into a single, interpretable metric.
A promising direction for evaluation involves leveraging LLMs to enhance translation assessment. Recent studies have proposed using LLMs to label translation errors [262], yet they currently lack the capability to consistently rank good vs. bad translations or sentences. Evaluating long-tail errors, which occur infrequently but may be critical in specific domains (e.g., named entity errors), presents another major challenge. Future research should explore metrics that are more sensitive to different types of errors, ensuring that high-impact mistakes receive appropriate weight.
LLMs also hold potential for improving analytic scoring by identifying systematic translation errors and assessing contextual appropriateness. One open question is whether evaluation metrics should penalize lexical variation. For instance, while BLEU assigns a perfect score (1.0) only to an exact lexical match, COMET evaluates semantic similarity continuously and may assign no penalty at all to a lexically divergent but semantically equivalent output. This raises the need for an intermediate approach that balances lexical precision with semantic flexibility.
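The contrast can be made concrete with the sacrebleu library: a semantically faithful paraphrase scores far below the exact match, illustrating the penalty that surface-based metrics place on lexical variation (the sentences are illustrative; note that sacreBLEU reports scores on a 0–100 scale).

```python
# A minimal sketch contrasting exact lexical overlap with a valid
# paraphrase under sentence-level BLEU (sacreBLEU, 0-100 scale).
import sacrebleu

ref = ["The committee approved the proposal yesterday."]
exact = "The committee approved the proposal yesterday."
paraphrase = "Yesterday, the panel endorsed the plan."

print(sacrebleu.sentence_bleu(exact, ref).score)       # 100.0: perfect overlap
print(sacrebleu.sentence_bleu(paraphrase, ref).score)  # low, despite equivalent meaning
```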
Ultimately, the goal of MT evaluation should not be to maximize arbitrary scores but to develop robust methodologies that accurately reflect translation quality across diverse languages and domains. Future research should prioritize evaluation frameworks that go beyond single-score metrics, incorporate human-like linguistic reasoning, and address the nuanced challenges posed by multilingual and domain-specific translation tasks.

5.3. Biases and Hallucinations

Hallucinations in MT systems refer to outputs that may be fluent but semantically unfaithful to the source text. They manifest in various forms, including content fabrication (adding information to the output that is not in the source), omission of key elements, semantic drift where meaning subtly changes, and improper substitutions. Other common cases include domain-mismatch hallucinations, where the model inserts domain-specific content irrelevant to the input; memorization-based outputs that arise from noisy or low-resource inputs; and, in multilingual MT systems, translations that are in the wrong language or diverge significantly from the source meaning. Additionally, small perturbations in the input can trigger disproportionately erroneous or hallucinated translations, reflecting the model’s sensitivity and limitations in generalization. One contributor is internal imbalance in the distribution of data collected from resources of varying quality, which amplifies noise and irregularities in the underlying distribution and thereby makes generalizable patterns harder to learn [263]. The supervised learning setting, in particular exposure bias, is another key contributor to hallucinations, especially under domain shift, where a model is tested on data from a domain different from the one it was trained on [264]. Exposure bias arises from a discrepancy between the training and inference phases: models trained using Maximum Likelihood Estimation (MLE) over-rely on gold-standard references and struggle when making predictions in real-world scenarios. Another study [265] categorizes hallucinations into two main types: those induced by perturbations, where small changes to the input drastically alter the translation, and natural hallucinations, which occur even when the source text remains unchanged. Their research connects hallucinations to deep learning’s long-tail theory, showing that memorized samples—data points learned verbatim by the model—are more susceptible to hallucination under perturbation. They further demonstrate that specific patterns of corpus-level noise, such as repeated erroneous source–target pairs, lead to different types of hallucinations: detached hallucinations result in fluent but semantically incorrect translations, while oscillatory hallucinations produce repetitive or nonsensical phrases. A critical finding of their work is that hallucinations can be amplified through widely used data augmentation or training techniques such as back-translation or knowledge distillation, which introduce noise that reinforces hallucination patterns in downstream models. This amplification effect suggests that careful data curation is necessary to prevent learned hallucinations from propagating across translation systems.
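A simple way to probe the perturbation-induced hallucinations described above is to compare a system's output before and after a minimal input edit; in the sketch below, `translate` is a hypothetical stand-in for any MT system under test, and the token-overlap heuristic is deliberately crude, flagging candidates for manual inspection rather than detecting hallucinations outright.

```python
# A minimal sketch of perturbation-based hallucination probing; `translate`
# is a hypothetical stand-in for the MT system under test.
def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)  # Jaccard similarity

def probe_perturbation(translate, source, perturbed, threshold=0.5):
    base, pert = translate(source), translate(perturbed)
    # A small input edit that collapses output overlap flags a candidate
    # hallucination; flagged pairs are then inspected manually.
    flagged = token_overlap(base, pert) < threshold
    return flagged, base, pert
```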
While LLMs certainly show more generalization and creative generation tendencies toward higher fluency, increased translation non-monotonicity (greater paraphrasing and non-literalness), and insertion or deletion of content, they also suffer from occasional hallucination, particularly when prompt templates are poorly handled (prompt traps) [9]. LLMs also sometimes compromise faithfulness to the source sentence, suggesting limitations for high-stakes translation tasks requiring strict accuracy. Studies with both massively multilingual NMT models and LLMs [266] show that hallucinations occur more frequently in low-resource language pairs, particularly when translating from English into other languages. In these cases, models sometimes generate content that is entirely fabricated, misleading, or even toxic, reflecting biases present in the training data. They also find that the nature of hallucinations differs between model architectures, indicating that NMT models and LLMs exhibit distinct error patterns due to their differing underlying mechanisms. The study further finds that simply scaling up model size does not effectively reduce hallucinations when training data remains unchanged. However, diversifying training data and model training methodologies can improve translation accuracy and reduce the likelihood of such errors.
Unlike phrase-based machine translation (PBMT), which allowed for direct analysis of model predictions, NMT models lack such transparency and offer little access to their subcomponents for interpretability, making it difficult to diagnose and address hallucinations effectively. This lack of explainability raises fundamental questions about how hallucinations occur and what interpretability methods can be developed to analyze and predict these failures. Addressing these issues requires refining training objectives beyond MLE, improving data quality and diversity, and developing more sophisticated evaluation frameworks that go beyond surface-level fluency. Future research should focus on enhancing model explainability, improving robustness to domain shifts, and designing better detection and mitigation strategies to ensure that multilingual translation systems remain accurate, trustworthy, and applicable across diverse linguistic contexts.

6. LLMs and MT Together

With their impressive performance across numerous languages, NMT models have transformed the landscape of MT from a back-end tool supporting human translators in computer-assisted translation (CAT) tools [267] into fully automated stand-alone services capable of handling translation tasks across diverse domains [4,268]. However, the emergence of LLMs introduces new possibilities, challenges, and shifts in how MT is approached.
One of the central questions in the era of LLMs is whether they can fully replace traditional NMT systems or whether they serve a complementary role. While LLMs can perform translation tasks with little to no task-specific training, dedicated bilingual or multilingual NMT systems still offer advantages in terms of efficiency, domain adaptation, and terminology control [269]. The distinction lies in their design: whereas NMT models are specifically optimized for translation and can be fine-tuned for domain-specific accuracy, LLMs approach translation as part of a broader multilingual understanding, excelling in generalization but often lacking consistency in terminology and alignment.
Fluency, once a significant challenge in machine translation, is largely solved by LLMs [270], which produce highly natural translations. However, their deceptive fluency poses a new problem: while translations may sound impeccable, they are not necessarily accurate. Evaluating translation quality in this context requires shifting from traditional metrics that measure improvement to methods that systematically identify and analyze failure cases. Research suggests that LLM-generated translations often require human intervention, and assessing the effort needed for post-editing could serve as a more reliable measure of quality. Some evaluation frameworks, such as assessing the ability of native speakers to distinguish between human and machine translations, could also provide insights into the authenticity of generated text. Further investigations could include measuring the amount of revision required to post-edit a translation or rating the perceived authenticity of generated text [271]. Investigating methods to learn from these corrections or other types of human feedback also remains a promising research direction [272].
Unlike earlier MT models, which offered limited user control and focused primarily on sentence-level translation, LLMs introduce a new paradigm in which translation becomes an interactive and customizable process. Through prompt engineering, users can influence aspects such as tone, verbosity, register, and even stylistic coherence, allowing translations to be adapted to specific communicative contexts or target audiences. This flexibility is particularly relevant for discourse-level translation, where maintaining cohesion, coreference, and pragmatic appropriateness across sentences is crucial [273,274]. Unlike conventional MT systems, which typically operate in isolation at the sentence level, LLMs can leverage extended context windows to model inter-sentential relationships, although their performance on such phenomena remains variable and highly prompt-dependent [275]. Rather than replacing traditional translation workflows, this shift redefines the role of translation professionals: their expertise is increasingly critical not only for post-editing and quality control, but also for crafting effective prompts and ensuring that discourse-level features—such as anaphora, ellipsis, and stylistic continuity—are preserved in model outputs.
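For example, a register-controlled, context-aware prompt might look like the sketch below; the field names and instruction wording are illustrative, and, as noted above, output quality remains highly prompt-dependent.

```python
# A minimal sketch of a style- and context-conditioned translation prompt;
# the instructions and fields are illustrative, not a prescribed format.
def styled_mt_prompt(source, tgt_lang, register, prev_context):
    return (
        f"You are translating into {tgt_lang} for a {register} audience.\n"
        f"Keep coreference and terminology consistent with the preceding context.\n\n"
        f"Preceding context (already translated):\n{prev_context}\n\n"
        f"Translate the next sentence:\n{source}"
    )

prompt = styled_mt_prompt(
    source="She signed it without reading the clause.",
    tgt_lang="German", register="formal legal",
    prev_context="Die Anwältin prüfte den Vertrag.")
```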
However, LLMs also introduce new challenges. One is the supervision problem: unlike traditional supervised MT systems trained on explicitly paired source–target examples, LLMs often rely on in-context learning or few-shot prompting, which provides weaker, transient forms of supervision. As a result, they can be prone to inconsistent or unpredictable behavior, especially when prompts are ambiguous or under-specified. While LLMs excel at adapting outputs to user-specified preferences, maintaining strict control over output fidelity and consistency remains difficult without robust prompt engineering or external constraints.
Moreover, the scalability and cost of LLMs raise concerns. Dedicated neural MT models, though smaller in size, often outperform LLMs on narrow translation tasks, particularly when trained on high-quality parallel corpora for specific language pairs or domains [275,276]. These systems offer strong performance with lower inference costs and more stable output distributions. By contrast, the general-purpose nature of LLMs entails a trade-off: while they offer greater flexibility and adaptability—useful for user-guided or context-sensitive translation—they often do so at the expense of deterministic control and efficiency. This tension underscores the need to evaluate the role of LLMs in MT not as replacements for traditional systems, but as complementary tools with distinct strengths and limitations.
The nature of translation itself poses fundamental questions about model training. While LLMs are powerful generalists, translation is a task with varying levels of constraint. Some texts, such as technical documents, require precise, controlled translations, whereas literary works, poetry, and creative texts demand greater flexibility. Dedicated MT models may be better suited for constrained translations, whereas LLMs offer exciting possibilities for integrating translation with content generation, allowing creative adaptation to target audiences. This ability to generate translations that go beyond simple equivalence opens opportunities for producing contextually enriched, stylistically appropriate translations, such as formal or polite variants, or even adapting content to specific cultural contexts.
One of the most significant advancements enabled by LLMs is their ability to handle long-context translations at a document level. Earlier document-level MT systems relied on modifying sentence-level models, often struggling with coherence and reference consistency. In contrast, modern LLMs can maintain context across longer passages, improving coherence in pronoun usage, terminology consistency, and structural alignment. However, document-level translation still presents challenges, as splitting content into segments for processing may disrupt alignment and introduce inconsistencies. This raises new concerns for evaluation methodologies, as assessing translations at the document level is more complex than at the sentence level, delaying widespread adoption.
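One common workaround for the segmentation issue is a rolling context window, sketched below; `llm_translate(segment, context)` is a hypothetical call to an LLM that conditions the translation of each segment on previously translated material.

```python
# A minimal sketch of document-level translation with a rolling context
# window; `llm_translate` is a hypothetical LLM translation call.
def translate_document(segments, llm_translate, context_size=3):
    translated = []
    for seg in segments:
        # Condition each segment on the last few translated segments to
        # keep pronouns and terminology consistent across boundaries.
        context = " ".join(translated[-context_size:])
        translated.append(llm_translate(seg, context))
    return translated
```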
Traditional NMT systems continue to offer advantages in efficiency, domain adaptation, and precision, particularly in high-accuracy applications requiring controlled terminology. One major limitation of continuing MT services with dedicated models, however, is that they generally perform well only for high-resource language pairs and struggle with low-resource languages due to data scarcity. In contrast, LLMs are general-purpose models trained on vast amounts of multilingual text data. Instead of being fine-tuned exclusively for translation, LLMs develop a broad language understanding, allowing them to perform zero-shot or few-shot translation without explicit parallel-corpus training. LLMs thus bring greater flexibility, broader multilingual coverage, and human-like fluency, making them well suited to general-purpose translation tasks. Both approaches still face challenges in generalization, particularly when dealing with underrepresented languages and domains [250,277]. As LLMs can leverage contextual reasoning and have access to broader linguistic patterns, they have been observed to produce more natural and fluent translations, especially for complex or ambiguous sentences [278]. However, LLM-generated translations can sometimes introduce hallucinations, factual inaccuracies, or inconsistencies due to their probabilistic nature.
On the other hand, since both methods rely on similar architectures and learning objectives, both still fail to generalize across domains and languages; even for a specific translation direction and domain, it remains difficult to ensure consistency in terminology and style. Limited vocabularies and the failure to generate unseen word or phrasal forms are well-known problems in Transformer-based architectures [63,279], making it non-trivial for models to adapt their output to a new domain or to translate an unobserved word or phrase in a given language. Future MT research should therefore address flexibility in altering translation style without increasing the computational cost of adapting models. An interesting direction that remains to be explored is a hybrid approach, where LLMs augment dedicated MT systems by enhancing their adaptability while maintaining precision and efficiency.
Finally, it is worth mentioning agent-based pipelines that can potentially reshape the landscape of MT by introducing modular, interactive workflows that more closely emulate the practices of professional human translators. Unlike static end-to-end systems, LLM agents can engage with their environments, making autonomous decisions about when and how to seek external resources to resolve specific translation challenges. For instance, an LLM agent may consult online dictionaries, access web search results, or query domain-specific databases when it encounters ambiguity or gaps in contextual understanding. This modular architecture not only enhances translation quality in complex or low-resource scenarios but also opens up the possibility of hybrid systems in which LLM agents support, augment, or even replace certain functions of human translators in professional settings.
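In sketch form, such an agent interleaves translation with tool calls; `lookup_term` and `draft_translation` below are hypothetical stand-ins for a terminology-database query and an LLM translation call, respectively.

```python
# A minimal sketch of an agent-style MT loop; both callables are
# hypothetical placeholders for external-resource and LLM interfaces.
def agent_translate(source, unknown_terms, lookup_term, draft_translation):
    glossary = {}
    for term in unknown_terms:
        entry = lookup_term(term)      # consult a dictionary or database
        if entry is not None:
            glossary[term] = entry     # pin the resolved translation
    # The final draft is conditioned on the resolved glossary, mirroring a
    # professional translator consulting references before drafting.
    return draft_translation(source, glossary)
```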
The integration of LLMs into MT marks the beginning of an exciting new phase, but challenges remain, particularly regarding cost, efficiency, and the role of human expertise. Currently, most research focuses on addressing the long tail of problems rather than redefining the core translation paradigm. The collaboration between academia and industry will be crucial in navigating these shifts, ensuring that advancements in LLMs contribute meaningfully to the field of translation rather than merely adding computational complexity.

7. Conclusions

The landscape of MT has undergone a profound transformation with the advent of LLMs, shifting from traditional statistical and neural machine translation systems to more generalized, versatile models capable of handling a wide range of linguistic tasks. This evolution has been marked by both unprecedented progress and emerging challenges that redefine the future of MT research.
Historically, MT development followed a trajectory from rule-based approaches to statistical methods and, eventually, to deep learning-driven architectures. While these advancements significantly improved translation quality and fluency, they also introduced complexities such as the need for vast amounts of parallel data, challenges in handling low-resource languages, and issues of model interpretability. LLMs have further expanded the scope of MT by enabling zero-shot and few-shot translation capabilities, demonstrating impressive fluency and adaptability across languages. However, their integration into MT systems has raised new concerns regarding accuracy, hallucination tendencies, biases, and computational efficiency. With the saturation of gains obtained through statistical learning methods, we observe an interesting fallback to rule-based approaches in the adaptation and application of LLMs for MT, using methods like in-context learning to mitigate errors and generalization failures on edge cases or on languages that lack sufficient data for properly learning the vocabulary or grammar. Further research is necessary to determine how far these hybrid approaches can benefit low-resource MT.
At least for high-resource languages, a major advantage of LLMs is their ability to provide highly fluent and contextually rich translations. Unlike earlier MT models that struggled with rigid phrase-based translations, LLMs offer greater flexibility in style, register, and adaptability to different domains. This shift presents exciting opportunities for interactive and customizable translation, where users can guide outputs using prompts to achieve desired levels of formality, verbosity, or domain-specific terminology. The ability to generate stylistically controlled and audience-aware translations introduces new possibilities for cross-lingual content creation, making LLMs not just translation tools but also powerful instruments for cultural and linguistic adaptation. Future research should also explore stylistic and register adaptation in the target language, as well as the application of LLMs to more sophisticated literary content.
Despite these advantages, fundamental challenges remain. Hallucinations—instances where LLMs generate translations that are fluent but semantically incorrect—continue to be a significant concern, especially for low-resource languages. Unlike traditional MT models, where errors could often be traced back to alignment or training data issues, LLM-generated hallucinations are harder to diagnose and mitigate due to the model’s opaque decision-making process. Furthermore, evaluation metrics, which have long relied on heuristic-based approaches such as BLEU, remain inadequate in assessing the full spectrum of translation quality, particularly in capturing linguistic nuances, structural integrity, and domain consistency.
The issue of scalability and efficiency also presents critical trade-offs. While dedicated MT systems are optimized for translation and can be fine-tuned for specific domains with relatively low computational overhead, LLMs require substantial resources, making real-time, low-latency applications less feasible. Moreover, in multilingual scenarios, training a single model to handle hundreds of languages has shown diminishing returns, as performance gains tend to plateau beyond a certain scale. This highlights the necessity of exploring hybrid approaches that integrate LLMs with specialized MT models, balancing efficiency with adaptability.
Looking forward, the role of human expertise in MT is set to evolve. With LLMs facilitating more interactive and controllable translation workflows, the role of linguists and domain experts is likely to become more central in refining outputs, mitigating biases, and ensuring consistency in specialized translations. The future of MT research must address key open questions, including how to refine instruction tuning to improve LLM-driven translation, how to develop better evaluation frameworks that go beyond surface-level fluency, and how to ensure that models generalize effectively across languages and domains.
Ultimately, the integration of LLMs into MT represents both a continuation and a redefinition of the field. While they offer exciting new capabilities, they also introduce novel challenges that require a fundamental rethinking of how translation systems are built, evaluated, and deployed. Future advancements will depend on a collaborative effort between academia and industry to strike a balance between fluency, accuracy, efficiency, and interpretability, ensuring that MT systems remain not only technically sophisticated but also practically useful across a diverse range of linguistic and cultural contexts.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

Author Marcello Federico was employed by the company Amazon. Author Alexandra Birch was employed by the company Aveni. Author Kyunghyun Cho was employed by the company Genentech/Roche. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson: London, UK, 2016. [Google Scholar]
  2. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  3. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  4. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  6. Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-Based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar]
  7. Artetxe, M.; Labaka, G.; Agirre, E.; Cho, K. Unsupervised neural machine translation. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  9. Zhang, B.; Haddow, B.; Birch, A. Prompting large language model for machine translation: A case study. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 41092–41110. [Google Scholar]
  10. Luong, M.-T.; Kayser, M.; Manning, C.D. Deep neural language models for machine translation. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, Beijing, China, 30–31 July 2015; pp. 305–309. [Google Scholar]
  11. Ha, T.-L.; Niehues, J.; Waibel, A. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Conference on Spoken Language Translation, Seattle, WA, USA, 8–9 December 2016. [Google Scholar]
  12. Tu, Z.; Liu, Y.; Shang, L.; Liu, X.; Li, H. Neural machine translation with reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. No. 1. [Google Scholar]
  13. Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 123–135. [Google Scholar]
  14. Gu, J.; Bradbury, J.; Xiong, C.; Li, V.O.K.; Socher, R. Non-autoregressive neural machine translation. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  15. Ott, M.; Edunov, S.; Grangier, D.; Auli, M. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 1–9. [Google Scholar]
  16. Irie, K.; Zeyer, A.; Schlüter, R.; Ney, H. Language modeling with deep transformers. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 3905–3909. [Google Scholar]
  17. Han, J.M.; Babuschkin, I.; Edwards, H.; Neelakantan, A.; Xu, T.; Polu, S.; Ray, A.; Shyam, P.; Ramesh, A.; Radford, A.; et al. Unsupervised neural machine translation with generative language models only. arXiv 2021, arXiv:2110.05448. [Google Scholar] [CrossRef]
  18. Zhang, B.; Ghorbani, B.; Bapna, A.; Cheng, Y.; Garcia, X.; Shen, J.; Firat, O. Examining scaling and transfer of language model architectures for machine translation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 26176–26192. [Google Scholar]
  19. Guo, S.; Zhang, S.; Feng, Y. Decoder-only streaming transformer for simultaneous translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 8851–8864. [Google Scholar]
  20. Sun, Y.; Dong, L.; Zhu, Y.; Huang, S.; Wang, W.; Ma, S.; Zhang, Q.; Wang, J.; Wei, F. You only cache once: Decoder-decoder architectures for language models. Adv. Neural Inf. Process. Syst. 2024, 37, 7339–7361. [Google Scholar]
  21. Shiwen, Y.; Xiaojing, B. Rule-based machine translation. In Routledge Encyclopedia of Translation Technology; Routledge: Abingdon, UK, 2014; pp. 186–200. [Google Scholar]
  22. ALPAC. Language and Machines: Computers in Translation and Linguistics: A Report; Number 1416; National Academy of Sciences, National Research Council: Washington, DC, USA, 1966.
  23. Chomsky, N. Syntactic Structures; Mouton de Gruyter: Berlin, Germany, 2002. [Google Scholar]
  24. Tanaka, H. Multilingual Machine Translation Systems in the Future. In Progress in Machine Translation; IOS Press: Washington, DC, USA, 1993. [Google Scholar]
  25. Dong, D.; Wu, H.; He, W.; Yu, D.; Wang, H. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Volume 1: Long Papers, pp. 1723–1732. [Google Scholar]
  26. Gasser, M. Minimal dependency translation: A framework for computer-assisted translation for under-resourced languages. In Proceedings of the Information and Communication Technology for Development for Africa: First International Conference, ICT4DA 2017, Bahir Dar, Ethiopia, 25–27 September 2017; Springer: Cham, Switzerland, 2018; pp. 209–218. [Google Scholar]
  27. Khanna, T.; Washington, J.N.; Tyers, F.M.; Bayatlı, S.; Swanson, D.G.; Pirinen, T.A.; Tang, I.; Alos i Font, H. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Mach. Transl. 2021, 35, 475–502. [Google Scholar] [CrossRef]
  28. Haddow, B.; Bawden, R.; Barone, A.V.M.; Helcl, J.; Birch, A. Survey of low-resource machine translation. Comput. Linguist. 2022, 48, 673–732. [Google Scholar] [CrossRef]
  29. Nørstebø Moshagen, S.; Pirinen, F.; Antonsen, L.; Gaup, B.; Mikkelsen, I.L.S.; Trosterud, T.; Wiechetek, L.; Hiovain-Asikainen, K. The GiellaLT infrastructure: A multilingual infrastructure for rule-based NLP. In Rule-Based Language Technology; University of Tartu: Tartu, Estonia, 2023. [Google Scholar]
  30. Trieu, H.-L.; Tran, D.-V.; Nguyen, L.-M. Investigating phrase-based and neural-based machine translation on low-resource settings. In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, Cebu City, Philippines, 16–18 November 2017; pp. 384–391. [Google Scholar]
  31. Sennrich, R.; Zhang, B. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 211–221. [Google Scholar]
  32. Haque, R.; Liu, C.-H.; Way, A. Recent advances of low-resource neural machine translation. Mach. Transl. 2021, 35, 451–474. [Google Scholar] [CrossRef]
  33. Wang, R.; Tan, X.; Luo, R.; Qin, T.; Liu, T.-Y. A survey on low-resource neural machine translation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 4636–4643. [Google Scholar]
  34. Kumar, S.; Anastasopoulos, A.; Wintner, S.; Tsvetkov, Y. Machine translation into low-resource language varieties. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Virtual, 1–6 August 2021; pp. 110–121. [Google Scholar]
  35. Shi, S.; Wu, X.; Su, R.; Huang, H. Low-resource neural machine translation: Methods and trends. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–22. [Google Scholar] [CrossRef]
  36. Ranathunga, S.; Lee, E.-S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural machine translation for low-resource languages: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  37. Brown, P.F.; Cocke, J.; Della Pietra, S.A.; Della Pietra, V.J.; Jelinek, F.; Lafferty, J.; Mercer, R.L.; Roossin, P.S. A statistical approach to machine translation. Comput. Linguist. 1990, 16, 79–85. [Google Scholar]
  38. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60. [Google Scholar] [CrossRef]
  39. Koehn, P. Statistical Machine Translation, 1st ed.; Cambridge University Press: New York, NY, USA, 2010. [Google Scholar]
  40. Koehn, P.; Och, F.J.; Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003), Edmonton, AB, Canada, 27 May–1 June 2003; pp. 48–54. [Google Scholar]
  41. Zens, R.; Och, F.J.; Ney, H. Phrase-based statistical machine translation. In Proceedings of the Annual Conference on Artificial Intelligence, Canberra, Australia, 2–6 December 2002; pp. 18–32. [Google Scholar]
  42. Zens, R.; Ney, H. Improvements in phrase-based statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, MA, USA, 6 May 2004; pp. 257–264. [Google Scholar]
  43. Costa-jussà, M.R. An Overview of the Phrase-based Statistical Machine Translation Techniques. Knowl. Eng. Rev. 2012, 27, 413–431. [Google Scholar] [CrossRef]
  44. Bisazza, A.; Federico, M. A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena. Comput. Linguist. 2016, 42, 163–205. [Google Scholar] [CrossRef]
  45. Zens, R.; Ney, H. A Comparative Study on Reordering Constraints in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 144–151. [Google Scholar]
  46. Kumar, S.; Byrne, B. Local Phrase Reordering Models for Statistical Machine Translation. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, Canada, 6–8 October 2005; pp. 161–168. [Google Scholar]
  47. Kanthak, S.; Vilar, D.; Matusov, E.; Zens, R.; Ney, H. Novel Reordering Approaches in Phrase-Based Statistical Machine Translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, Ann Arbor, MI, USA, 29 June 2005; pp. 167–174. [Google Scholar]
  48. Xiong, D.; Liu, Q.; Lin, S. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 521–528. [Google Scholar]
  49. Zens, R.; Ney, H. Discriminative Reordering Models for Statistical Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation, New York, NY, USA, 8 June 2006; pp. 55–63. [Google Scholar]
  50. Li, C.; Li, M.; Zhang, D.; Li, M.; Zhou, M.; Guan, Y. A Probabilistic Approach to Syntax-Based Reordering for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 23–30 June 2007; pp. 720–727. [Google Scholar]
  51. Zhao, C.; Walker, M.; Chaturvedi, S. Bridging the Structural Gap Between Encoding and Decoding for Data-to-Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2481–2491. [Google Scholar]
  52. Bentivogli, L.; Bisazza, A.; Cettolo, M.; Federico, M. Neural Versus Phrase-Based Machine Translation Quality: A Case Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 257–267. [Google Scholar]
  53. Kalchbrenner, N.; Blunsom, P. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1700–1709. [Google Scholar]
  54. Bottou, L. Large-scale machine learning with stochastic gradient descent. In COMPSTAT’2010, Proceedings of the 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 177–186. [Google Scholar]
  55. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533. [Google Scholar] [CrossRef]
  56. Jean, S.; Cho, K.; Memisevic, R.; Bengio, Y. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015, Beijing, China, 26–31 July 2015; Association for Computational Linguistics (ACL): Kerrville, TX, USA, 2015; pp. 1–10. [Google Scholar]
  57. Luong, M.T.; Manning, C.D. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1: Long Papers, pp. 1054–1063. [Google Scholar]
  58. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1: Long Papers, pp. 1715–1725. [Google Scholar]
  59. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 66–71. [Google Scholar]
  60. Song, X.; Salcianu, A.; Song, Y.; Dopson, D.; Zhou, D. Fast WordPiece Tokenization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2089–2103. [Google Scholar]
  61. Xu, J.; Zhou, H.; Gan, C.; Zheng, Z.; Li, L. Vocabulary Learning via Optimal Transport for Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 7361–7373. [Google Scholar]
  62. Wang, X.; Ruder, S.; Neubig, G. Multi-View Subword Regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 473–482. [Google Scholar]
  63. Ataman, D.; Negri, M.; Turchi, M.; Federico, M. Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English. Prague Bull. Math. Linguist. 2017, 331–342. [Google Scholar] [CrossRef]
  64. Araabi, A.; Monz, C.; Niculae, V. How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation? In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Orlando, FL, USA, 12–16 September 2022; pp. 117–130. [Google Scholar]
  65. Costa-jussà, M.R.; Fonollosa, J.A.R. Character-Based Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 357–361. [Google Scholar]
  66. Ling, W.; Trancoso, I.; Dyer, C.; Black, A.W. Character-based neural machine translation. arXiv 2015, arXiv:1511.04586. [Google Scholar] [CrossRef]
  67. Lee, J.; Cho, K.; Hofmann, T. Fully character-level neural machine translation without explicit segmentation. Trans. Assoc. Comput. Linguist. 2017, 5, 365–378. [Google Scholar] [CrossRef]
  68. Ataman, D.; Aziz, W.; Birch, A. A Latent Morphology Model for Open-Vocabulary Neural Machine Translation. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  69. Gallé, M. Investigating the effectiveness of BPE: The power of shorter sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1375–1381. [Google Scholar]
  70. Wang, C.; Cho, K.; Gu, J. Neural Machine Translation with Byte-Level Subwords. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, Number 05. pp. 9154–9160. [Google Scholar]
  71. Chakravarthi, B.R.; Rani, P.; Arcan, M.; McCrae, J.P. A Survey of Orthographic Information in Machine Translation. SN Comput. Sci. 2021, 2, 330. [Google Scholar] [CrossRef] [PubMed]
  72. Libovický, J.; Schmid, H.; Fraser, A. Why Don’t People Use Character-Level Machine Translation? In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022. [Google Scholar]
  73. Edman, L.; Sarti, G.; Toral, A.; van Noord, G.; Bisazza, A. Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation. Trans. Assoc. Comput. Linguist. 2024, 12, 392–410. [Google Scholar] [CrossRef]
  74. Cherry, C.; Foster, G.; Bapna, A.; Firat, O.; Macherey, W. Revisiting Character-Based Neural Machine Translation with Capacity and Compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4295–4305. [Google Scholar]
  75. Xue, L.; Barua, A.; Constant, N.; Al-rfou, R.; Narang, S.; Kale, M.; Roberts, A.; Raffel, C. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Trans. Assoc. Comput. Linguist. 2022, 10, 291–306. [Google Scholar] [CrossRef]
  76. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
  77. Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Bengio, Y. On integrating a language model into neural machine translation. Comput. Speech Lang. 2017, 45, 137–148. [Google Scholar] [CrossRef]
  78. Kannan, A.; Wu, Y.; Nguyen, P.; Sainath, T.N.; Chen, Z.; Prabhavalkar, R. An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5824–5828. [Google Scholar]
  79. Ueffing, N.; Haffari, G.; Sarkar, A. Transductive learning for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 23–30 June 2007; pp. 25–32. [Google Scholar]
  80. Bertoldi, N.; Federico, M. Domain Adaptation for Statistical Machine Translation with Monolingual Resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, 30–31 March 2009; pp. 182–189. [Google Scholar]
  81. Wu, H.; Wang, H. Revisiting pivot language approach for machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 154–162. [Google Scholar]
  82. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1: Long Papers, pp. 86–96. [Google Scholar]
  83. Burlot, F.; Yvon, F. Using Monolingual Data in Neural Machine Translation: A Systematic Study. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 144–155. [Google Scholar]
  84. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 489–500. [Google Scholar]
  85. Wu, J.; Wang, X.; Wang, W.Y. Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1: Long and Short Papers, pp. 1173–1183. [Google Scholar]
  86. Graça, M.; Kim, Y.; Schamper, J.; Khadivi, S.; Ney, H. Generalizing Back-Translation in Neural Machine Translation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy, 1–2 August 2019; Volume 1: Research Papers, pp. 45–52. [Google Scholar]
  87. Przystupa, M.; Abdul-Mageed, M. Neural machine translation of low-resource and similar languages with backtranslation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy, 1–2 August 2019; pp. 224–235. [Google Scholar]
  88. Feldman, I.; Coto-Solano, R. Neural machine translation models with back-translation for the extremely low-resource indigenous language Bribri. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 3965–3976. [Google Scholar]
  89. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting similarities among languages for machine translation. arXiv 2013, arXiv:1309.4168. [Google Scholar] [CrossRef]
  90. Faruqui, M.; Dyer, C. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, 26–30 April 2014; pp. 462–471. [Google Scholar]
  91. Xing, C.; Wang, D.; Liu, C.; Lin, Y. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1006–1011. [Google Scholar]
  92. Artetxe, M.; Labaka, G.; Agirre, E. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2289–2294. [Google Scholar]
  93. Artetxe, M.; Labaka, G.; Agirre, E. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1: Long Papers, pp. 451–462. [Google Scholar]
  94. Smith, S.L.; Turban, D.H.; Hamblin, S.; Hammerla, N.Y. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv 2017, arXiv:1702.03859. [Google Scholar] [CrossRef]
  95. Ravi, S.; Knight, K. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 12–21. [Google Scholar]
96. Hermann, K.M.; Blunsom, P. Multilingual Distributed Representations without Word Alignment. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  97. Lample, G.; Conneau, A.; Denoyer, L.; Ranzato, M. Unsupervised Machine Translation Using Monolingual Corpora Only. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  98. Yang, Z.; Chen, W.; Wang, F.; Xu, B. Unsupervised Neural Machine Translation with Weight Sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 46–55. [Google Scholar]
99. Kim, Y.; Graça, M.; Ney, H. When and why is unsupervised neural machine translation useless? In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisbon, Portugal, 3–5 November 2020; pp. 35–44. [Google Scholar]
100. Kauers, M.; Vogel, S.; Fügen, C.; Waibel, A. Interlingua based statistical machine translation. In Proceedings of the INTERSPEECH, Denver, CO, USA, 16–20 September 2002; pp. 1909–1912. [Google Scholar]
101. De Gispert, A.; Mariño, J.B. Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, 22–28 May 2006; pp. 65–68. [Google Scholar]
  102. Utiyama, M.; Isahara, H. A comparison of pivot methods for phrase-based statistical machine translation. In Proceedings of the Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference, Rochester, NY, USA, 22–27 April 2007; pp. 484–491. [Google Scholar]
  103. Wu, H.; Wang, H. Pivot language approach for phrase-based statistical machine translation. Mach. Transl. 2007, 21, 165–181. [Google Scholar] [CrossRef]
  104. Bertoldi, N.; Barbaiani, M.; Federico, M.; Cattoni, R. Phrase-based statistical machine translation with pivot languages. In Proceedings of the 5th International Workshop on Spoken Language Translation: Papers, Waikiki, HI, USA, 20–21 October 2008; pp. 143–149. [Google Scholar]
  105. Chen, Y.; Liu, Y.; Cheng, Y.; Li, V.O. A Teacher-Student Framework for Zero-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1: Long Papers, pp. 1925–1935. [Google Scholar]
  106. Leng, Y.; Tan, X.; Qin, T.; Li, X.Y.; Liu, T.Y. Unsupervised Pivot Translation for Distant Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 175–183. [Google Scholar]
  107. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef]
  108. Lample, G.; Ott, M.; Conneau, A.; Denoyer, L.; Ranzato, M. Phrase-Based & Neural Unsupervised Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 5039–5049. [Google Scholar]
  109. Lu, Y.; Keung, P.; Ladhak, F.; Bhardwaj, V.; Zhang, S.; Sun, J. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 84–92. [Google Scholar]
  110. Lakew, S.M.; Cettolo, M.; Federico, M. A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 641–652. [Google Scholar]
111. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, É.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar]
112. Karthikeyan, K.; Wang, Z.; Mayhew, S.; Roth, D. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  113. Kudugunta, S.; Bapna, A.; Caswell, I.; Firat, O. Investigating Multilingual NMT Representations at Scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1565–1575. [Google Scholar]
114. Wang, X.; Pham, H.; Arthur, P.; Neubig, G. Multilingual Neural Machine Translation with Soft Decoupled Encoding. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  115. Kim, Y.; Gao, Y.; Ney, H. Effective Cross-Lingual Transfer of Neural Machine Translation Models Without Shared Vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1246–1257. [Google Scholar]
  116. Zhu, C.; Yu, H.; Cheng, S.; Luo, W. Language-aware interlingua for multilingual neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1650–1655. [Google Scholar]
  117. Baziotis, C.; Artetxe, M.; Cross, J.; Bhosale, S. Multilingual Machine Translation with Hyper-Adapters. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1170–1185. [Google Scholar]
118. Nguyen, X.P.; Joty, S.; Wu, K.; Aw, A.T. Refining low-resource unsupervised translation by language disentanglement of multilingual translation model. Adv. Neural Inf. Process. Syst. 2022, 35, 36230–36242. [Google Scholar]
  119. Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; Maillard, J.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672. [Google Scholar] [CrossRef]
120. Zan, C.; Peng, K.; Ding, L.; Qiu, B.; Liu, B.; He, S.; Lu, Q.; Zhang, Z.; Liu, C.; Liu, W.; et al. Vega-MT: The JD Explore Academy Translation System for WMT22. arXiv 2022, arXiv:2209.09444. [Google Scholar]
  121. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  122. Edunov, S.; Baevski, A.; Auli, M. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1: Long and Short Papers, pp. 4052–4059. [Google Scholar]
  123. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 2227–2237. [Google Scholar]
  124. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
125. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
  126. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners; OpenAI: San Francisco, CA, USA, 2019. [Google Scholar]
127. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
128. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
129. Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M.; et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv 2023, arXiv:2211.05100. [Google Scholar]
130. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  131. Bawden, R.; Birch, A.; Dobreva, R.; Oncevay, A.; Barone, A.V.M.; Williams, P. The University of Edinburgh’s English-Tamil and English-Inuktitut submissions to the WMT20 news translation task. In Proceedings of the 5th Conference on Machine Translation, Online, 19–20 November 2020. [Google Scholar]
  132. Baziotis, C.; Haddow, B.; Birch, A. Language Model Prior for Low-Resource Neural Machine Translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 7622–7634. [Google Scholar]
  133. Baziotis, C.; Titov, I.; Birch, A.; Haddow, B. Exploring Unsupervised Pretraining Objectives for Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 2956–2971. [Google Scholar]
  134. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  135. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-Trained Model. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  136. Aynetdinov, A.; Akbik, A. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity. arXiv 2024, arXiv:2401.17072. [Google Scholar]
  137. Zhang, Y.; Zhang, P.; Yan, Y. Language model score regularization for speech recognition. Chin. J. Electron. 2019, 28, 604–609. [Google Scholar] [CrossRef]
138. Hendy, A.; Abdelrehim, M.; Sharaf, A.; Raunak, V.; Gabr, M.; Matsushita, H.; Kim, Y.J.; Afify, M.; Awadalla, H.H. How good are GPT models at machine translation? A comprehensive evaluation. arXiv 2023, arXiv:2302.09210. [Google Scholar] [CrossRef]
  139. Kocmi, T.; Avramidis, E.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Freitag, M.; Gowda, T.; Grundkiewicz, R.; et al. Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet. In Proceedings of the WMT23-Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 198–216. [Google Scholar]
  140. Xu, H.; Kim, Y.J.; Sharaf, A.; Awadalla, H.H. A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  141. Zhu, W.; Liu, H.; Dong, Q.; Xu, J.; Huang, S.; Kong, L.; Chen, J.; Li, L. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 2765–2781. [Google Scholar]
142. Jelinek, F. Applying Information Theoretic Methods: Evaluation of Grammar Quality. In Proceedings of the Workshop on Evaluation of NLP Systems, Wayne, PA, USA, 7–9 December 1988. [Google Scholar]
  143. Thorndike, E.L. The Fundamentals of Learning; Teachers College, Columbia University: New York, NY, USA, 1932. [Google Scholar]
  144. McCulloch, W.S.; Pitts, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
  145. Hebb, D.O. The Organization of Behavior: A Neuropsychological Theory; Wiley: New York, NY, USA, 1949. [Google Scholar]
  146. Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain; Technical Report 85-460-1; Cornell Aeronautical Laboratory: Buffalo, NY, USA, 1958. [Google Scholar]
  147. Dreyfus, H.; Dreyfus, S.E. Mind over Machine; Simon and Schuster: New York, NY, USA, 1986. [Google Scholar]
  148. Chomsky, N. Aspects of the Theory of Syntax; Number 11; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  149. Fodor, J.A.; Pylyshyn, Z.W. Connectionism and cognitive architecture: A critical analysis. Cognition 1988, 28, 3–71. [Google Scholar] [CrossRef]
  150. Marcus, G.F. Rethinking eliminative connectionism. Cogn. Psychol. 1998, 37, 243–282. [Google Scholar] [CrossRef]
  151. Marcus, G.F. The Algebraic Mind: Integrating Connectionism and Cognitive Science; MIT Press: Cambridge, MA, USA, 2003. [Google Scholar]
  152. Fodor, J.A.; Lepore, E. The Compositionality Papers; Oxford University Press: Oxford, UK, 2002. [Google Scholar]
  153. Calvo, P.; Symons, J. The Architecture of Cognition: Rethinking Fodor and Pylyshyn’s Systematicity Challenge; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  154. Lake, B.M.; Ullman, T.D.; Tenenbaum, J.B.; Gershman, S.J. Building machines that learn and think like people. Behav. Brain Sci. 2017, 40, e253. [Google Scholar] [CrossRef] [PubMed]
155. Lake, B.; Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2873–2882. [Google Scholar]
156. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar] [CrossRef]
  157. Snyder, B.; Barzilay, R. Unsupervised Multilingual Learning. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2010. [Google Scholar]
  158. Agić, Ž.; Johannsen, A.; Plank, B.; Alonso, H.M.; Schluter, N.; Søgaard, A. Multilingual projection for parsing truly low-resource languages. Trans. Assoc. Comput. Linguist. 2016, 4, 301–312. [Google Scholar] [CrossRef]
  159. Şahin, G.G.; Steedman, M. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
160. Xia, M.; Kong, X.; Anastasopoulos, A.; Neubig, G. Generalized Data Augmentation for Low-Resource Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
161. Zhu, Y.; Heinzerling, B.; Vulić, I.; Strube, M.; Reichart, R.; Korhonen, A. On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, 3–4 November 2019; pp. 216–226. [Google Scholar]
  162. Ponti, E.M.; O’horan, H.; Berzak, Y.; Vulić, I.; Reichart, R.; Poibeau, T.; Shutova, E.; Korhonen, A. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Comput. Linguist. 2019, 45, 559–601. [Google Scholar] [CrossRef]
  163. Liu, Z.; Prud’Hommeaux, E. Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation. Trans. Assoc. Comput. Linguist. 2022, 10, 393–413. [Google Scholar] [CrossRef]
  164. Carl, M. Inducing translation templates for example-based machine translation. In Proceedings of the Machine Translation Summit VII, Singapore, 13–17 September 1999; pp. 250–258. [Google Scholar]
  165. Brown, R.D. Adding linguistic knowledge to a lexical example-based translation system. In Proceedings of the 8th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Chester, UK, 23–25 August 1999. [Google Scholar]
  166. Wu, D. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput. Linguist. 1997, 23, 377–403. [Google Scholar]
  167. Chiang, D. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, MI, USA, 25–30 June 2005; pp. 263–270. [Google Scholar]
  168. Gildea, D. Loosely tree-based alignment for machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 80–87. [Google Scholar]
  169. Yamada, K.; Knight, K. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, 9–11 July 2001; pp. 523–530. [Google Scholar]
  170. Galley, M.; Manning, C.D. Quadratic-time dependency parsing for machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 773–781. [Google Scholar]
  171. Liu, Y.; Liu, Q.; Lin, S. Tree-to-string alignment template for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 609–616. [Google Scholar]
  172. Huang, L.; Knight, K.; Joshi, A. Statistical syntax-directed translation with extended domain of locality. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 66–73. [Google Scholar]
  173. Mi, H.; Huang, L. Forest-based translation rule extraction. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; pp. 206–214. [Google Scholar]
  174. Eisner, J. Learning non-isomorphic tree mappings for machine translation. In Proceedings of the Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 205–208. [Google Scholar]
175. Menezes, A.; Quirk, C. Dependency Treelet Translation: The convergence of statistical and example-based machine translation? In Proceedings of the Workshop on Example-Based Machine Translation, Phuket, Thailand, 13–15 September 2005; pp. 99–108. [Google Scholar]
  176. Zhang, M.; Jiang, H.; Li, H.; Aw, A.; Li, S. Grammar comparison study for translational equivalence modeling and statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, 18–22 August 2008; pp. 1097–1104. [Google Scholar]
  177. Chiang, D. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 1443–1452. [Google Scholar]
  178. Toutanova, K.; Suzuki, H.; Ruopp, A. Applying morphology generation models to machine translation. In Proceedings of the ACL-08: HLT, Columbus, OH, USA, 15–20 June 2008; pp. 514–522. [Google Scholar]
  179. Bisazza, A.; Federico, M. Morphological pre-processing for Turkish to English statistical machine translation. In Proceedings of the 6th International Workshop on Spoken Language Translation: Papers, Tokyo, Japan, 1–2 December 2009; pp. 129–135. [Google Scholar]
  180. El Kholy, A.; Habash, N. Orthographic and morphological processing for English–Arabic statistical machine translation. Mach. Transl. 2012, 26, 25–45. [Google Scholar] [CrossRef]
  181. Herzig, J.; Shaw, P.; Chang, M.W.; Guu, K.; Pasupat, P.; Zhang, Y. Unlocking compositional generalization in pre-trained models using intermediate representations. arXiv 2021, arXiv:2104.07478. [Google Scholar] [CrossRef]
  182. Eyigöz, E.; Gildea, D.; Oflazer, K. Simultaneous word-morpheme alignment for statistical machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 32–40. [Google Scholar]
183. Sánchez-Cartagena, V.M.; Toral, A. Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences. In Proceedings of the First Conference on Machine Translation (WMT), Berlin, Germany, 11–12 August 2016. [Google Scholar]
184. Pirinen, T.A. Omorfi: Free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), Vilnius, Lithuania, 11–13 May 2015; pp. 313–315. [Google Scholar]
185. Smit, P.; Virpioja, S.; Grönroos, S.A.; Kurimo, M. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, 26–30 April 2014; Aalto University: Espoo, Finland, 2014. [Google Scholar]
  186. Huck, M.; Riess, S.; Fraser, A. Target-Side Word Segmentation Strategies for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation (WMT), Copenhagen, Denmark, 7–8 September 2017; pp. 56–67. [Google Scholar]
  187. Tamchyna, A.; Marco, M.W.D.; Fraser, A. Modeling Target-Side Inflection in Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation (WMT), Copenhagen, Denmark, 7–8 September 2017; pp. 32–42. [Google Scholar]
  188. Ataman, D.; Federico, M. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, Boston, MA, USA, 17–21 March 2018; Volume 1: Research Track, pp. 97–110. [Google Scholar]
  189. Sennrich, R. How Grammatical is Character-Level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 2: Short Papers, pp. 376–382. [Google Scholar]
  190. Ataman, D.; Firat, O.; Di Gangi, M.A.; Federico, M.; Birch, A. On the Importance of Word Boundaries in Character-Level Neural Machine Translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, China, 4 November 2019; pp. 187–193. [Google Scholar]
  191. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  192. Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 223–231. [Google Scholar]
  193. Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA, 24–27 March 2002; pp. 138–145. [Google Scholar]
  194. Hutchins, J. Machine translation: A concise history. Comput. Aided Transl. Theory Pract. 2007, 13, 11. [Google Scholar]
  195. Culy, C.; Riehemann, S.Z. The limits of N-gram translation evaluation metrics. In Proceedings of the Machine Translation Summit IX: Papers, New Orleans, LA, USA, 18–22 September 2003. [Google Scholar]
196. Callison-Burch, C.; Osborne, M.; Koehn, P. Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 3–7 April 2006; pp. 249–256. [Google Scholar]
  197. Birch, A.; Osborne, M.; Blunsom, P. Metrics for MT evaluation: Evaluating reordering. Mach. Transl. 2010, 24, 15–26. [Google Scholar] [CrossRef]
  198. Mathur, N.; Baldwin, T.; Cohn, T. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4984–4997. [Google Scholar]
  199. Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395. [Google Scholar]
  200. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
201. Jawahar, G.; Sagot, B.; Seddah, D. What does BERT learn about the structure of language? In Proceedings of the ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
202. Chi, E.A.; Hewitt, J.; Manning, C.D. Finding Universal Grammatical Relations in Multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5564–5577. [Google Scholar]
  203. Tian, Y.; Xia, F.; Song, Y. Large Language Models Are No Longer Shallow Parsers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1: Long Papers, pp. 7131–7142. [Google Scholar]
204. Urbizu, G.; Zulaika, M.; Saralegi, X.; Corral, A. How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 8334–8348. [Google Scholar]
  205. Koehn, P. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, Phuket, Thailand, 13–15 September 2005. [Google Scholar]
  206. Ziemski, M.; Junczys-Dowmunt, M.; Pouliquen, B. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23–28 May 2016; pp. 3530–3534. [Google Scholar]
  207. Moore, R.C.; Lewis, W.D. Intelligent selection of language model training data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 220–224. [Google Scholar]
  208. Cuong, H.; Sima’an, K. Latent domain translation models in mix-of-domains haystack. In Proceedings of the COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 23–29 August 2014; pp. 1928–1939. [Google Scholar]
  209. Huang, F. Confidence measure for word alignment. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Association for Computational Linguistics, Singapore, 2–7 August 2009; pp. 932–940. [Google Scholar]
  210. Smith, J.R.; Saint-Amand, H.; Plamada, M.; Koehn, P.; Callison-Burch, C.; Lopez, A. Dirt cheap web-scale parallel text from the common crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
211. Esplà-Gomis, M.; Forcada, M.L.; Sánchez-Martínez, F. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisbon, Portugal, 3–5 November 2020; pp. 35–42. [Google Scholar]
  212. Schwenk, H.; Wenzek, G.; Edunov, S.; Grave, E.; Joulin, A. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv 2019, arXiv:1911.04944. [Google Scholar]
  213. Artetxe, M.; Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610. [Google Scholar] [CrossRef]
  214. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT sentence embedding. arXiv 2020, arXiv:2007.01852. [Google Scholar]
  215. NLLB Team. Scaling neural machine translation to 200 languages. Nature 2024, 630, 841–846. [Google Scholar] [CrossRef] [PubMed]
216. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 3–7 May 2021. [Google Scholar]
  217. Zhang, J.; Zong, C. Forward Translation for Improvements in Neural Machine Translation. In Proceedings of the 2016 International Conference on Computational Linguistics and Natural Language Processing, Osaka, Japan, 11–16 December 2016. [Google Scholar]
  218. Axelrod, A.; He, X.; Gao, J. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Edinburgh, UK, 27–31 July 2011; pp. 355–362. [Google Scholar]
  219. Peris, Á.; Chinea-Ríos, M.; Casacuberta, F. Neural networks classifier for data selection in statistical machine translation. Prague Bull. Math. Linguist. 2016, 108, 283–294. [Google Scholar] [CrossRef]
  220. Arivazhagan, N.; Bapna, A.; Firat, O.; Aharoni, R.; Johnson, M.; Macherey, W. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv 2019, arXiv:1907.05019. [Google Scholar] [CrossRef]
  221. Wang, Y.; Neubig, G. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 582–592. [Google Scholar]
  222. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond English-centric multilingual machine translation. J. Mach. Learn. Res. 2021, 22, 107. [Google Scholar]
  223. Tan, X.; Ren, Y.; He, D.; Qin, T.; Xu, W.; Liu, T.Y. Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 963–973. [Google Scholar]
  224. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [Google Scholar]
  225. Kim, Y.; Rush, A.M. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, TX, USA, 1–5 November 2016; pp. 1317–1327. [Google Scholar]
  226. Wang, S.; Liu, Y.; Wang, C.; Luan, H.; Sun, M. Improving Back-Translation with Uncertainty-Based Confidence Estimation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 791–802. [Google Scholar]
  227. Goyal, N.; Chaudhary, V.; Gu, J.; Wenzek, G.; El-Kishky, A.; Gao, C.; Chen, P.-J.; Ju, D.; Krishnan, S.; Ranzato, M.; et al. The FLORES-101 evaluation benchmark for low-resource and multilingual translation. Trans. Assoc. Comput. Linguist. 2022, 10, 522–538. [Google Scholar] [CrossRef]
  228. Zhang, B.; Williams, P.; Titov, I.; Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. arXiv 2020, arXiv:2004.11867. [Google Scholar] [CrossRef]
229. Ahuja, S.; Aggarwal, D.; Gumma, V.; Watts, I.; Sathe, A.; Ochieng, M.; Hada, R.; Jain, P.; Ahmed, M.; Bali, K.; et al. MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; Volume 1: Long Papers, pp. 2598–2637. [Google Scholar]
  230. Choudhury, M. Generative AI has a language problem. Nat. Hum. Behav. 2023, 7, 1802–1803. [Google Scholar] [CrossRef] [PubMed]
  231. Gurgurov, D.; Bäumel, T.; Anikina, T. Multilingual Large Language Models and Curse of Multilinguality. arXiv 2024, arXiv:2406.10602. [Google Scholar] [CrossRef]
  232. Bafna, N.; Murray, K.; Yarowsky, D. Evaluating Large Language Models Along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-Lingual Generalization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 18742–18762. [Google Scholar]
  233. Mirzakhalov, J.; Babu, A.; Ataman, D.; Kariev, S.; Tyers, F.; Abduraufov, O.; Hajili, M.; Ivanova, S.; Khaytbaev, A.; Laverghetta, A., Jr.; et al. A Large-Scale Study of Machine Translation in Turkic Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5876–5890. [Google Scholar]
  234. Emezue, C.C.; Dossou, B.F. MMTAfrica: Multilingual Machine Translation for African Languages. In Proceedings of the Sixth Conference on Machine Translation, Online, 10–11 November 2021; pp. 398–411. [Google Scholar]
  235. Gala, J.P.; Chitale, P.A.; Raghavan, A.; Gumma, V.; Doddapaneni, S.; Aswanth, K.M.; Nawale, J.A.; Sujatha, A.; Puduppully, R.; Raghavan, V.; et al. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages. Trans. Mach. Learn. Res. 2023, 2023, 90. [Google Scholar]
  236. Weller-Di Marco, M.; Fraser, A. Findings of the WMT 2022 shared tasks in unsupervised MT and very low resource supervised MT. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 801–805. [Google Scholar]
  237. Ahmad, I.S.; Anastasopoulos, A.; Bojar, O.; Borg, C.; Carpuat, M.; Cattoni, R.; Cettolo, M.; Chen, W.; Dong, Q.; Federico, M.; et al. Findings of the IWSLT 2024 Evaluation Campaign. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Bangkok, Thailand, 15–16 August 2024; pp. 1–11. [Google Scholar]
  238. Wenzek, G.; Chaudhary, V.; Fan, A.; Gomez, S.; Goyal, N.; Jain, S.; Kiela, D.; Thrush, T.; Guzmán, F. Findings of the WMT 2021 shared task on large-scale multilingual machine translation. In Proceedings of the Sixth Conference on Machine Translation, Online, 10–11 November 2021; pp. 89–99. [Google Scholar]
  239. Cettolo, M.; Federico, M.; Bentivogli, L.; Niehues, J.; Stüker, S.; Sudoh, K.; Yoshino, K.; Federmann, C. Overview of the IWSLT 2017 evaluation campaign. In Proceedings of the 14th International Workshop on Spoken Language Translation, Tokyo, Japan, 14–15 December 2017; pp. 2–14. [Google Scholar]
  240. Niehues, J.; Cattoni, R.; Stüker, S.; Cettolo, M.; Turchi, M.; Federico, M. The IWSLT 2018 Evaluation Campaign. In Proceedings of the 15th International Conference on Spoken Language Translation, Bruges, Belgium, 29–30 October 2018; pp. 2–6. [Google Scholar]
241. Fraser, A. Findings of the WMT 2020 shared tasks in unsupervised MT and very low resource supervised MT. In Proceedings of the Fifth Conference on Machine Translation, Online, 19–20 November 2020; pp. 765–771. [Google Scholar]
  242. Libovickỳ, J.; Fraser, A. Findings of the WMT 2021 shared tasks in unsupervised MT and very low resource supervised MT. In Proceedings of the Sixth Conference on Machine Translation, Online, 10–11 November 2021; pp. 726–732. [Google Scholar]
  243. Sant, J. Multilingual Low-Resource Translation for Indo-European Languages. Bachelor’s Thesis, University of Malta, Msida, Malta, 2022. [Google Scholar]
  244. Adelani, D.I.; Alam, M.M.I.; Anastasopoulos, A.; Bhagia, A.; Costa-Jussà, M.R.; Dodge, J.; Faisal, F.; Federmann, C.; Fedorova, N.; Guzmán, F.; et al. Findings of the WMT’22 shared task on large-scale machine translation evaluation for African languages. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 773–800. [Google Scholar]
  245. Pal, S.; Pakray, P.; Laskar, S.R.; Laitonjam, L.; Khenglawt, V.; Warjri, S.; Dadure, P.K.; Dash, S.K. Findings of the WMT 2023 shared task on low-resource Indic language translation. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 682–694. [Google Scholar]
246. Dabre, R.; Kunchukuttan, A. Findings of WMT 2024's MultiIndic22MT shared task for machine translation of 22 Indian languages. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 669–676. [Google Scholar]
  247. Sánchez-Martínez, F.; Pérez-Ortiz, J.A.; Galiano-Jiménez, A.; Oliver, A. Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 684–698. [Google Scholar]
248. Pakray, P.; Pal, S.; Vetagiri, A.; Krishna, R.; Maji, A.K.; Dash, S.; Laitonjam, L.; Sarah, L.; Manna, R. Findings of WMT 2024 shared task on low-resource Indic languages translation. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 654–668. [Google Scholar]
  249. Anastasopoulos, A.; Barrault, L.; Bentivogli, L.; Bojar, O.; Cattoni, R.; Currey, A.; Dinu, G.; Duh, K.; Elbayad, M.; Emmanuel, C.; et al. Findings of the IWSLT 2022 Evaluation Campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Association for Computational Linguistics, Dublin, Ireland, 26–27 May 2022; pp. 98–157. [Google Scholar]
  250. Iyer, V.; Malik, B.; Zhu, W.; Stepachev, P.; Chen, P.; Haddow, B.; Birch-Mayne, A. Exploring very low-resource translation with LLMs: The University of Edinburgh’s submission to AmericasNLP 2024 translation task. In Proceedings of the 4th Workshop on NLP for Indigenous Languages of the Americas, Association for Computational Linguistics (ACL), Mexico City, Mexico, 21 June 2024; pp. 209–220. [Google Scholar]
  251. Singh, S.; Vargus, F.; D’souza, D.; Karlsson, B.; Mahendiran, A.; Ko, W.Y.; Shandilya, H.; Patel, J.; Mataciunas, D.; O’Mahony, L.; et al. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1: Long Papers, pp. 11521–11567. [Google Scholar]
252. Tanzer, G.; Suzgun, M.; Visser, E.; Jurafsky, D.; Melas-Kyriazi, L. A Benchmark for Learning to Translate a New Language from One Grammar Book. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  253. Tran, K.M.; Bisazza, A.; Monz, C. The Importance of Being Recurrent for Modeling Hierarchical Structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4731–4736. [Google Scholar]
254. Zhu, H.; Liang, Y.; Xu, W.; Xu, H. Evaluating Large Language Models for In-Context Learning of Linguistic Patterns in Unseen Low-Resource Languages. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, Abu Dhabi, United Arab Emirates, 20 January 2025; pp. 414–426. [Google Scholar]
  255. Zhang, K.; Choi, Y.M.; Song, Z.; He, T.; Wang, W.Y.; Li, L. Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions. arXiv 2024, arXiv:2402.18025. [Google Scholar] [CrossRef]
  256. Zerva, C.; Blain, F.; Rei, R.; Lertvittayakumjorn, P.; De Souza, J.G.; Eger, S.; Kanojia, D.; Alves, D.; Orǎsan, C.; Fomicheva, M.; et al. Findings of the WMT 2022 shared task on quality estimation. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 69–99. [Google Scholar]
257. Freitag, M.; Rei, R.; Mathur, N.; Lo, C.K.; Stewart, C.; Avramidis, E.; Kocmi, T.; Foster, G.; Lavie, A.; Martins, A.F. Results of WMT22 metrics shared task: Stop using BLEU–neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 46–68. [Google Scholar]
  258. Freitag, M.; Mathur, N.; Lo, C.K.; Avramidis, E.; Rei, R.; Thompson, B.; Kocmi, T.; Blain, F.; Deutsch, D.; Stewart, C.; et al. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 578–628. [Google Scholar]
259. Freitag, M.; Mathur, N.; Deutsch, D.; Lo, C.K.; Avramidis, E.; Rei, R.; Thompson, B.; Blain, F.; Kocmi, T.; Wang, J.; et al. Are LLMs breaking MT metrics? Results of the WMT24 metrics shared task. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 47–81. [Google Scholar]
  260. Liu, C.; Dahlmeier, D.; Ng, H.T. Better evaluation metrics lead to better machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–29 July 2011; pp. 375–384. [Google Scholar]
261. Sai, A.B.; Mohankumar, A.K.; Khapra, M.M. A survey of evaluation metrics used for NLG systems. ACM Comput. Surv. (CSUR) 2022, 55, 1–39. [Google Scholar] [CrossRef]
  262. Moghe, N.; Sherborne, T.; Steedman, M.; Birch, A. Extrinsic Evaluation of Machine Translation Metrics. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1: Long Papers, pp. 13060–13078. [Google Scholar]
  263. Kreutzer, J.; Caswell, I.; Wang, L.; Wahab, A.; van Esch, D.; Ulzii-Orshikh, N.; Tapo, A.; Subramani, N.; Sokolov, A.; Sikasote, C.; et al. Quality at a glance: An audit of web-crawled multilingual datasets. Trans. Assoc. Comput. Linguist. 2022, 10, 50–72. [Google Scholar] [CrossRef]
264. Wang, C.; Sennrich, R. On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3544–3552. [Google Scholar]
  265. Raunak, V.; Menezes, A.; Junczys-Dowmunt, M. The Curious Case of Hallucinations in Neural Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1172–1183. [Google Scholar]
  266. Guerreiro, N.M.; Alves, D.M.; Waldendorf, J.; Haddow, B.; Birch, A.; Colombo, P.; Martins, A.F. Hallucinations in large multilingual translation models. Trans. Assoc. Comput. Linguist. 2023, 11, 1500–1517. [Google Scholar] [CrossRef]
  267. Craciunescu, O.; Gerding-Salas, C.; Stringer-O’Keeffe, S. Machine translation and computer-assisted translation. Transl. J. 2004, 8. [Google Scholar]
268. Guha, J.; Heger, C. Machine translation for global e-commerce on eBay. In Proceedings of the AMTA, Vancouver, BC, Canada, 22–26 October 2014; Volume 2, pp. 31–37. [Google Scholar]
  269. Dinu, G.; Mathur, P.; Federico, M.; Al-Onaizan, Y. Training Neural Machine Translation to Apply Terminology Constraints. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3063–3068. [Google Scholar]
  270. Kocmi, T.; Avramidis, E.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Freitag, M.; Gowda, T.; Grundkiewicz, R.; et al. Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1–46. [Google Scholar]
  271. Reeder, F. In one hundred words or less. In Proceedings of the Workshop on MT Evaluation, Santiago de Compostela, Spain, 18–22 September 2001. [Google Scholar]
272. Ahmadian, A.; Cremer, C.; Gallé, M.; Fadaee, M.; Kreutzer, J.; Pietquin, O.; Üstün, A.; Hooker, S. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1: Long Papers, pp. 12248–12267. [Google Scholar]
  273. Müller, M.; Sennrich, R.; Volk, M. Evaluation of coherence in machine translation with discourse-aware metrics. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 958–967. [Google Scholar]
  274. Bawden, R.; Sennrich, R.; Birch, A.; Haddow, B. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 1304–1313. [Google Scholar]
  275. Wang, X.; Zhang, J.; Koehn, P. Document-level translation with large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1: Long Papers, pp. 7274–7290. [Google Scholar]
  276. Freitag, M.; Al-Onaizan, Y.; Bapna, A.; Johnson, M.; Niu, X.; Rios, A.; Tran, C.; Firat, O. High-quality machine translation with expert-based human evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 1183–1200. [Google Scholar]
  277. Iyer, V.; Malik, B.; Stepachev, P.; Chen, P.; Haddow, B.; Birch, A. Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1393–1409. [Google Scholar]
  278. Iyer, V.; Chen, P.; Birch, A. Towards Effective Disambiguation for Machine Translation with Large Language Models. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 482–495. [Google Scholar]
  279. Oncevay, A.; Ataman, D.; Van Berkel, N.; Haddow, B.; Birch, A.; Bjerva, J. Quantifying Synthesis and Fusion and Their Impact on Machine Translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 1308–1321. [Google Scholar]
Figure 1. Translation of the English sentence What a lovely day. into Turkish (Ne güzel bir gün.) using the three main approaches to machine translation.
Figure 2. Machine translation of a new language using only an LLM, a grammar book in the target language, and a bilingual dictionary, as proposed in [252].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
