Review

Translation in the Wild

by
Yuri Balashov
Department of Philosophy, Institute for Artificial Intelligence, University of Georgia, Athens, GA 30602, USA
Information 2025, 16(12), 1077; https://doi.org/10.3390/info16121077
Submission received: 10 October 2025 / Revised: 10 November 2025 / Accepted: 28 November 2025 / Published: 4 December 2025
(This article belongs to the Section Information Applications)

Abstract

Large Language Models (LLMs) excel in translation, among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in “incidental bilingualism” in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs’ translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways: Local and Global. “Local learning” makes use of bilingual signals present within a single training context window (e.g., an English sentence soon followed by its Chinese translation in the training data). “Global learning,” in contrast, capitalizes on mining semantically related monolingual contents that are spread out over the LLMs’ pre-training data. The key to explaining the origins of LLMs’ translation capabilities is a continuous iteration between Local and Global learning, which is a natural and helpful consequence of batch training. I discuss the prospects for testing the “duality hypothesis” empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

1. Introduction

The relation of Large Language Models (LLMs) to translation is curious, both historically and conceptually. Historically, all the landmark achievements leading up to generative AI were made in the context of machine translation (MT) in the space of three years. This includes the original sequence-to-sequence model [1], the classical attention mechanism [2,3], and the transformer [4], as well as the attendant tokenization algorithms [5,6]. Introduced initially to improve the state of the art in MT by two or three BLEU points [7], these achievements were immediately adopted in virtually every other area of deep learning, from text to image to video generation and more.
From a more theoretical perspective, it is not so clear why LLMs translate as well as they do, showing competitive results on standard benchmarks for many language pairs in zero- and few-shot settings. For a recent comprehensive overview, see [8]. Unlike dedicated neural MT models, LLMs are not trained on any translation-related objective. Indeed, they are pre-trained almost exclusively on English—basically the entire Anglocentric Internet—with only a small percentage of non-English content [9,10]. Translation is, therefore, an emergent ability of LLMs alongside reasoning, text summarization and generation, and so forth [11]. What explains its origin?
The sheer size of LLMs (trillions of parameters) and of their pre-training data (hundreds of billions of tokens), as well as the presence of some multilingual content in such data (on the order of 3–10% in most cases), are important factors. In comparison, a strong transformer-based neural machine translation (NMT) model requires anywhere from 100M to 10B parameters and a comparable number of aligned sentence pairs in two languages to train to perfection. A three-orders-of-magnitude difference in the model and data size, along with an admixture of multilingual content, suggests that LLMs could probably internalize everything anyone has ever translated, as long as it ended up on the internet in one form or another [12].
But this is just a starting point. While multilingual pre-training data are key to explaining the translation abilities of LLMs, those data are very heterogeneous in form and content. LLMs may have been exposed to generally available parallel corpora (such as the OPUS collection, WMT benchmarks, United Nations proceedings, and other open repositories of multilingual texts often used to train dedicated NMT engines) during pre-training. They may also have seen foreign language teaching texts, bilingual abstracts of scientific papers, and other trace amounts of “incidental bilingualism”—“needles in a haystack,” as Briakou et al. put it [13]. And in the last two years, they may have learned more from translation queries made to the free versions of the largest models.
But these possibilities alone can hardly explain the remarkable translation performance of LLMs for at least two reasons: (i) the corresponding sentences in two languages (e.g., English and German) in online corpora, textbooks, and so forth, are usually separated by intervening text. They do not follow one another, making them suboptimal for an autoregressive language modeling objective. (ii) Parallel sentences are present only in a very small proportion of the total multilingual pre-training data used by LLMs.
Most of the multilingual content in LLM training data comes in the form of monolingual documents found in different parts of the internet. One fully expects these documents (e.g., news articles or Wikipedia pages in different languages) to include semantically identical or very similar sentences across languages. But they do not arrive in neatly aligned pairs and are unlikely to co-occur within a single context window used in pre-training (4K–100K tokens at the time of writing). And even when such parallel or near-parallel sentences do co-occur in a common context window, they are typically separated and surrounded by a great deal of “noise”.
It is hard to resist the idea that contemporary LLMs must somehow be able to locate and leverage semantically related monolingual content from the noisy multilingual soup of their training data and that this ability contributes to their translation performance and other multilingual abilities such as responding to questions or instructions in different languages (see, in particular, [14,15]). If that is indeed the case, then the translation abilities of LLMs may have two distinct sources and may learn from these sources in two somewhat different ways. For the sake of discussion, I will refer to these as Local and Global. “Local learning” makes use of bilingual signals present within a single context window (e.g., an English sentence soon followed by its French translation in the training data). “Global learning,” in contrast, capitalizes on making the best use of semantically related monolingual contents that are spread out over the LLM’s pre-training data.
‘Local’ and ‘Global’ are rough-and-ready labels, and I offer them primarily as a heuristic framework for exploration of potentially fruitful connections. Due to differences in their architecture, training data, and training objectives, dedicated NMT systems and LLMs perform translation tasks in significantly different ways (see Section 3). If, in addition, LLMs learn to translate in two different ways, Local and Global, then neural-network-based translation in a broader sense (encompassing both traditional NMT models and LLMs prompted for translation tasks) cannot be conceptualized as a unified process. Perhaps this is not surprising, given the checkered history of MT and the variety of approaches to translation developed along the way (Section 2), with each stage introducing a new understanding of what it means, and takes, to translate. What is different nowadays is the stunning translation quality provided by neural networks and deep learning, which makes the need for understanding their inner workings more pressing.
Notably, human translation is far from being a univocal concept either. This is obvious from a cursory glance at the landscape of modern translation studies [16,17,18] highlighting the complexity and subjectivity of human translation and its dependence on various linguistic, cultural, cognitive, environmental, and ergonomic factors. Recent progress in computer-assisted human translation technologies has added even more complexity to the picture [19,20]. At the end of the day, we humans may translate in more than one way, especially when we are equipped with CAT (computer-assisted translation) tools which extend our cognitive abilities in non-trivial manners [21], and artificial neural networks may do it differently from us and even from each other. The overall lesson from the recent developments in translation technologies, both human and machine, is that there may be no such thing as translation with a capital ‘T’.
The primary goal of this paper, however, is to reflect on the origins of the remarkable translation abilities of LLMs in light of recent research and growing user experience. The plan is as follows. Section 2 provides a brief overview of the history of translation technologies with emphasis on the latest developments. Section 3 compares standard state-of-the-art NMT models with LLMs. (Readers familiar with the material of Section 2 and Section 3 may skip ahead). Section 4 reviews recent applications of LLMs to translation and translation-related tasks. In Section 5, I discuss important theoretical work exploring how LLMs perform these tasks. In Section 6, I turn to the question of why LLMs are so good at them and develop a “dualistic” proposal about the origin of LLMs’ translation abilities in two types of pre-training data and two learning processes. Section 7 considers prospects for operationalizing and testing this proposed duality empirically. Section 8 draws broader implications for our rapidly evolving concept of translation in the deep learning era. Section 9 offers concluding remarks.

2. Translation Technologies: Brief History and Current State

The development of translation technologies has a rich history, with multiple paradigm shifts over the past decades (Figure 1). Each stage in this development introduced new techniques as well as new conceptualizations of what translation entails. For useful historical accounts, the reader is referred to [8,22,23,24,25,26].
Today’s translation technology landscape is one of convergence and integration [19,20]. The boundary between human and machine translation is rapidly blurring. Professional translators use computer-assisted translation (CAT) tools that are often integrated with a custom NMT system or a generic service such as DeepL or Google. The translator’s role is often to post-edit MT output and curate translation memories, a process that loops human expertise back into improving machine suggestions. This symbiosis has led to significant productivity gains and reconceptualization of translation as a joint human–AI effort [21], where creativity and critical judgment come from humans, and speed and consistency come from machines.
Most recently, the emergence of LLMs has begun to influence translation workflows. Although NMT systems are still the go-to resource for high-volume, high-speed translation needs [27,28], LLMs have demonstrated impressive translation capabilities even without being specialized for the task. There are early signs of LLMs being integrated into translation pipelines; for instance, using an LLM to refine or evaluate MT outputs or to handle difficult cases that NMT systems mistranslate. Conversely, practitioners have explored using NMT engines to augment LLMs, for example, by translating non-English user queries into English before feeding them to an English-centric LLM and then translating the output back to the original language using a domain-specific MT system.
The rapidly growing literature on using LLMs in translation tasks reveals that, with strategic prompting, LLMs can perform progressively sophisticated operations that include, but are not limited to, evaluating the quality of translation output, including their own [29,30,31]; spotting and categorizing translation errors and suggesting corrections [32,33]; automatic post-editing of raw MT output [34,35,36]; adapting translation output to the specific terminology [37,38], to a given domain (e.g., pharmaceuticals, oil and gas, IT) [39,40], and to existing translation memories and other project-, client-, or domain-specific instructions and reference materials, often outperforming in these respects the more traditional approaches earlier implemented in NMT systems [41,42,43]; generating glossaries of special terms from pairs of source and target documents [44]; improving the quality of translation in low-resource directions (e.g., Swahili–Japanese) by following a “chain-of-thought” [45] prompt which explicitly requires them to pivot (“Translate this sentence from Swahili to English first; then translate the English output to Japanese,” see Section 4.3); and following, with benefit, a human translation workflow [46,47], often by engaging LLMs in a prolonged interaction involving pre-translation research, drafting, refining, and proofreading [48]. The opportunities in this area are virtually unlimited.
Let us turn to a closer comparison of dedicated NMT models and LLMs as agents of translation to clarify how they differ and what those differences imply.

3. NMT vs. LLMs: A Comparison

Dedicated neural machine translation (NMT) models and Large Language Models (LLMs) share a common foundation: both utilize transformer-based neural architectures optimized for autoregressive text generation. Yet, they differ profoundly in architecture details, training data, learning objectives, and modes of usage. These differences are critical for understanding LLMs’ translation capabilities, especially in contrast to standard NMT systems.

3.1. Model Architecture and Size

Typical state-of-the-art NMT models have 100M to 10B parameters and 6–12 transformer layers on the encoder and decoder sides. In contrast, cutting-edge LLMs are rumored to have trillions of parameters (two to three orders of magnitude larger). NMT models are usually encoder–decoder transformers: an encoder processes the source sentence, and a decoder generates the target sentence, attending to the encoder’s representation (Figure 2). Most LLMs, on the other hand, are decoder-only transformers, especially those designed for text generation. A decoder-only model produces output based on a single stream of input, which can include instructions or source text (Figure 3). Despite this difference, an LLM can effectively act like an encoder–decoder when the prompt includes the source text: the instruction implicitly tells it to “encode” the source meaning into its latent state before producing the output, for example: Translate the following sentence to Spanish: “My name is John”.

3.2. Training Data

This is perhaps the starkest difference. An NMT model is trained on a relatively narrow and specific dataset: a large collection of parallel corpora (source–target sentence pairs). For example, a high-quality English–German NMT model might be trained on tens of millions of aligned sentence pairs from sources like EU Parliament proceedings, translated news, subtitles, and so forth. The NMT model thus directly learns to map between two languages. In contrast, an LLM is typically trained on a broad swath of text scraped from the Internet, predominantly in one language (most often English) with a mix of others and not specifically on parallel data. For instance, GPT-style models are trained on web pages, books, and articles comprising many billions of words, of which only a small fraction may be parallel or even multilingual content [9,49]. As a result, NMT systems have in-domain strength for translation but limited “knowledge” outside language mappings, whereas LLMs boast vast “knowledge” and multilingual exposure but only implicit translation mappings.

3.3. Training Objective

The dedicated NMT’s training objective is explicitly to maximize translation likelihood, essentially directly training to translate. The loss function compares the model’s output with reference translations for each source sentence. The LLM’s objective, by contrast, is next-word prediction (language modeling) across its training texts. No direct signal tells the LLM “this is a translation of that”. Instead, at best, some training documents might include translations or multilingual sentence collections, and the LLM learns to predict those sequences. In short, translation is the end goal for NMT but a mere side effect for LLMs.
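To make the contrast concrete, here is a minimal Python/PyTorch sketch of the two objectives. The model call signatures are hypothetical placeholders rather than any particular system’s API; the point is only that both losses are token-level cross-entropy, differing in what the prediction is conditioned on: the encoded source plus the target prefix for NMT, versus the preceding tokens of a single undifferentiated text stream for the LLM.

import torch
import torch.nn.functional as F

# Minimal sketch of the two training objectives; the model call signatures
# below are hypothetical placeholders, not any particular system's API.

def nmt_loss(model, src_ids, tgt_ids):
    # Encoder-decoder NMT: predict each target token given the full source
    # sentence (via the encoder) and the previously generated target tokens.
    logits = model(src=src_ids, tgt_in=tgt_ids[:, :-1])  # hypothetical signature
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_ids[:, 1:].reshape(-1))

def lm_loss(model, token_ids):
    # Decoder-only LLM: predict every next token in one undifferentiated text
    # stream; any "parallel" material is just part of that stream.
    logits = model(token_ids[:, :-1])  # hypothetical signature
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))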

3.4. Inference and Use

Using an NMT model typically means feeding a source sentence into the model’s encoder and letting the decoder produce a translation. It is a straightforward, single-shot process. As already noted, using an LLM for translation requires crafting a suitable prompt. For example, to translate from English to Spanish, one might prompt the LLM with the following: Translate the following English sentence to Spanish: <source sentence>. The LLM then generates the Spanish text as a continuation. The quality of the LLM’s translation can be sensitive to prompt wording [50,51,52]. The presence of examples (few-shot prompting; more on this below) or specifying the desired style can change results. By contrast, an NMT model’s output is largely deterministic given the source. This means that LLMs offer more flexibility (one can prompt for a literal translation or a creative one, or ask an LLM to translate and explain) but also more variability. LLMs might ignore the instruction if not phrased clearly or might produce extra text (like apologies or analysis) if the prompt is not tight [52,53,54,55].
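As a rough illustration of the two modes of use, the following sketch relies on the Hugging Face transformers pipeline API; the model names are illustrative stand-ins (any Marian-style en–es checkpoint and any instruction-tuned causal LM would serve), not the setup of any study cited here.

from transformers import pipeline

# Dedicated NMT: the source goes into the encoder, the decoder emits the
# translation; no prompt engineering is involved.
nmt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(nmt("My name is John.")[0]["translation_text"])

# Decoder-only LLM: translation is requested via an instruction that the
# model simply continues; the wording of the prompt can change the output.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
prompt = "Translate the following English sentence to Spanish:\nMy name is John.\nSpanish:"
print(llm(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"])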

3.5. Quality and Evaluation

How do these approaches compare in translation quality? Earlier studies benchmarked LLMs against dedicated MT systems. For example, Zhang et al. [56] evaluated 15 open-source language models on translation tasks via prompting and compared their results to fine-tuned NMT models. They found that a bit of fine-tuning or domain adaptation (e.g., via QLoRA, Dettmers et al. [57]) on a small translation dataset boosts the translation performance of the model very significantly. A more extensive study showed that GPT-4 (which was a leading model at that time) outperformed Meta’s NLLB (No Language Left Behind, a specialized multilingual NMT model [58]) in about 40% of evaluated translation directions, but it still underperformed a strong commercial engine (Google Translate) on many low-resource language pairs [59]. The consensus emerging from the work reported in 2023–2024 was that LLMs are remarkably good general translators, especially for high-resource languages and with careful prompting, but a dedicated NMT that is tailored to a specific domain or style can still have an edge in its niche. For example, for technical texts with specialized terminology, a fine-tuned NMT might preserve terminology more consistently than an LLM unless the LLM is explicitly guided or instructed [60]. But the field is developing very rapidly. For a recent industry report, see [61]. For a recent linguistically focused comparison of NMT, LLM, and human translation outputs, see [62]. Ref. [63] is an example of a comparative user study of the translation performance of ChatGPT-4 and Google Translate in a very practical domain of hospital discharge instructions.

3.6. Multilingualism

While some recent NMT models are multilingual by design (i.e., trained on parallel sentences in multiple languages and capable of translating between a chosen pair, e.g., NLLB), most are bilingual (one language pair). LLMs, in contrast, tend to be trained on multiple languages as a necessity: even if 90–95% of the data are in English, the remaining 5–10% might cover dozens of languages [9,64]. As a result, LLMs often have a broader multilingual scope than a given NMT model [65]. A single LLM like GPT-5 or Llama can translate between many languages (English to French, French to German, Japanese to Russian, etc.), whereas an NMT model is usually limited to the directions it was trained on (unless it is a massively multilingual model like NLLB). This breadth is advantageous, but it comes with uncertainty: the LLM may not be equally strong in all languages. Some languages with very little representation in pre-training data see much lower translation quality. For example, Zhu et al. [59] note that GPT-4 still struggled on several African and Southeast Asian languages. NMT models can also struggle with low-resource languages of course, but one can explicitly gather parallel data for those languages to train or fine-tune the model, whereas an LLM’s fixed pre-training mix might under-represent them (unless augmented with further training).

3.7. Behavior and Errors

NMT and LLM systems also tend to exhibit different failure modes. NMT models, when they err, often produce mistranslations that are locally implausible (e.g., garbled or nonsensical phrases), or they might drop segments of the input (omissions) if attention-based alignment fails. LLMs, by contrast, rarely produce outright garble; thanks to their strong language modeling priors, their errors are usually fluent but can be misleading. Common LLM translation errors include hallucination (inserting content absent from the source) and over-smoothing (translating idiomatic expressions literally, thus losing nuance). But He et al. [47] report that prompting an LLM to mimic a human translator’s step-by-step process can reduce such deficiencies. LLMs might also be inconsistent in style or level of formality unless instructed, whereas an NMT system can be constrained by training data style or using explicit formality tags. These differences remind us that an LLM is trying to produce a plausible continuation given the prompt and its training, whereas an NMT is laser-focused on reproducing the source content in the target form.

3.8. Specialization Versus Generalization

In many cases, the contrast between NMT models and LLMs could be framed in terms of specialization versus generalization. NMT models tend to be specialists—trained explicitly for translation, making them limited in scope and language directions but efficient and typically reliable within those limits. LLMs tend to be generalists—not trained for translation in particular but endowed with a wide range of “knowledge” and linguistic background from which their translation abilities emerge. This generalist nature means LLMs can sometimes surprise with creative or context-aware translations (using their vast “knowledge” to resolve ambiguities that an NMT might be blind to), but it also means they might not always prioritize strict translation fidelity unless coaxed.
Understanding these differences is important as we examine how LLMs are being used for translation and other multilingual tasks in practice (Section 4), how they generate translation internally (Section 5), and why they perform so well despite the lack of direct translation training (Section 6). The next section surveys the rapidly growing array of applications and experiments using LLMs in translation-related tasks, setting the stage for deeper analysis of their abilities.

4. Recent Uses of LLMs in Translation Tasks

The application of LLMs to translation-related tasks began almost as soon as LLMs themselves burst onto the scene [49,66,67]. Researchers and practitioners have tested cutting-edge LLMs on various multilingual tasks, often with impressive results. In this section, I review recent uses of LLMs in translation and other cross-lingual contexts. Key themes include zero-shot vs. few-shot prompting, performance on low-resource languages, pivot translation strategies, chain-of-thought prompting for translation, and exploiting the inherently multilingual nature of latent representations in LLMs.

4.1. Zero- and Few-Shot Translation with LLMs

One of the headline capabilities of LLMs is zero-shot learning—performing a task without direct task-specific training [66]. Translation has become a primary example. Prompting an LLM with a simple instruction such as “Translate the following sentence to Mandarin:________” can yield a credible translation even if the model was never explicitly trained on parallel data for that language. Early tests with GPT-3 showed surprisingly competent zero-shot translation for several languages [68], sparking immediate interest in using such models as universal translators [69].
Subsequent research has systematically evaluated zero-shot vs. few-shot prompting for translation (e.g., [56]). Few-shot prompting means providing a few example translations in the prompt before asking the model to translate a new sentence. For instance, one might input
English: Hello, how are you? → Spanish: Hola, ¿cómo estás?;
English: My name is John. → Spanish: Me llamo John;
English: Where is the library? → Spanish:
And let the model continue. This technique often boosts translation accuracy, especially for models that are not specifically instruction-tuned. Garcia et al. [70] dubbed it the “unreasonable effectiveness of few-shot learning,” finding that even 1–2 examples can significantly improve the translation output for certain LLMs. The improvement is typically more pronounced for smaller or less instruction-savvy models. On the other hand, for GPT-4 and later models, the gap between zero-shot and few-shot regimes is smaller, since the former is already strong. Later work noted that few-shot translation prompting can, in effect, make up for the lack of translation-specific instruction tuning or of any instruction tuning.
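In code, few-shot prompting amounts to little more than string assembly. The sketch below builds such a prompt; the exemplars are illustrative, and the model call itself is omitted since any of the interfaces from Section 3.4 would do.

# Few-shot prompt assembly; exemplars are illustrative, and the model call is
# left out (any interface from Section 3.4 would do).
examples = [
    ("Hello, how are you?", "Hola, ¿cómo estás?"),
    ("My name is John.", "Me llamo John."),
]
new_source = "Where is the library?"

prompt = "".join(f"English: {en} -> Spanish: {es}\n" for en, es in examples)
prompt += f"English: {new_source} -> Spanish:"

# translation = some_model.generate(prompt)  # hypothetical call
print(prompt)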
In fact, curious phenomena have been observed. For example, Zhu et al. [59] found that, when given in-context exemplars, LLMs sometimes ignore “unreasonable instructions” asking the model to translate into a different language or do something else altogether, e.g., summarize a sentence instead of translating it. The authors also found that cross-lingual exemplars (in different language pairs than the target pair) can sometimes provide better guidance for low-resource translation than same-pair exemplars. Thus, inserting several RU-EN sentence pairs in the prompt followed by an unrelated Chinese sentence can produce a better English translation of that sentence than ZH-EN examples would. A hypothetical explanation may be that showing the model examples of, say, RU-EN translation may activate its “translation mode” generally, even if we actually want it to translate from Chinese to English, a direction for which it saw no direct examples. This aligns with the idea that the LLM has a general concept of “translating” that is somewhat language-agnostic. The general lesson appears to be that instruction semantics can be overridden by example context.
Both zero- and few-shot regimes raise intriguing questions about how and where translation happens in a fully trained LLM. I will discuss them in Section 5.

4.2. Low-Resource Languages

As noted, LLMs’ translation performance appears to mirror global data inequalities: it tends to be strong for high-resource languages and weaker for low-resource ones. Pivoting through English (Section 4.3) is sometimes used to deal with this challenge.
For extremely low-resource languages, with virtually no representation in the LLM’s pre-training data, neither pivoting nor prompt optimization may suffice. In such cases, researchers have explored fine-tuning LLMs on supplemental parallel data in a low-resource language pair. The hope is that the LLM’s general knowledge transfers and that only minimal data are needed to teach it a new language mapping. Early results show some promise, but the cost of fine-tuning an LLM can be prohibitive. Fortunately, techniques such as LoRA (Low-Rank Adaptation, [71]) can be used to inject translation capabilities for new languages into an existing LLM at relatively low compute cost. Thus, Zhang et al. [56] used an even more efficient method (QLoRA, [57]) to effectively turn a general LLM into a specialized translator with just a fraction of parameters updated, something that can be done on a single consumer-grade GPU.
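For concreteness, the following sketch shows what QLoRA-style adaptation of a general LLM looks like with the transformers and peft libraries. The base model, target modules, and hyperparameters are illustrative placeholders rather than the exact configuration of [56,57].

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative base model and hyperparameters; not the exact setup of [56,57].
base = "meta-llama/Llama-2-7b-hf"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit quantized base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()        # typically well under 1% of all weights

# The small parallel corpus is then formatted as prompts such as
# "Translate Swahili to Japanese:\n<source>\n<target>" and used for ordinary
# supervised fine-tuning (e.g., with the Trainer or TRL's SFTTrainer).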

4.3. Pivoting and Intermediate Languages

Pivoting (translating through an intermediate language) has been a longstanding strategy in MT for dealing with language pairs lacking direct parallel data. With LLMs, explicit pivoting can be attempted by prompting the model to go through two steps; one example is the following: “Translate this sentence from German to English first; then translate the English output to Hindi.” This prompt can sometimes yield a better Hindi translation than prompting the model to translate directly from German to Hindi (cf. [50]).
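A minimal sketch of explicit pivoting is given below; the model name is an illustrative placeholder for any instruction-following LLM, and the prompts are simplified relative to those used in the cited studies.

from transformers import pipeline

# Illustrative placeholder model; prompts are simplified relative to the cited studies.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def translate(text: str, src: str, tgt: str) -> str:
    prompt = (f"Translate the following {src} sentence to {tgt}. "
              f"Reply with the translation only.\n{src}: {text}\n{tgt}:")
    out = llm(prompt, max_new_tokens=60, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

def pivot_translate(text: str, src: str = "German", pivot: str = "English",
                    tgt: str = "Hindi") -> str:
    intermediate = translate(text, src, pivot)   # step 1: source -> pivot (English)
    return translate(intermediate, pivot, tgt)   # step 2: pivot -> target

print(pivot_translate("Der Zug nach Berlin ist verspätet."))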

4.4. Chain-of-Thought Prompting for Translation

Chain-of-thought (CoT) prompting is a technique where the model is asked to produce intermediate reasoning steps before the final answer [45]. Originally developed for math and logic problems, it has been applied to help LLMs improve their translation output by following, in some cases, a human translation workflow model. For example, Briakou et al. [48] engaged Gemini 1.5 Pro in a multi-turn interaction involving pre-translation research, drafting, refining, and proofreading, which resulted in notable quality improvements.
CoT prompting could be particularly useful when dealing with longer documents. He et al. [47] adopted this approach and asked LLMs to generate keywords, topics, and potential translations of tricky terms before attempting the full translation and then used those to guide the process. Since state-of-the-art LLMs have very large context windows (1M+ tokens in some cases), one can feed an entire document and prompt the model to work section by section. Researchers are actively exploring this: for instance, how to prompt an LLM to decide when to translate literally and when to adapt, essentially giving the model a strategy akin to a human translator’s decision making [72,73]. This ventures into the territory of interactive translation with LLMs: rather than using a single prompt, one can craft a sequence of prompts for a genuinely collaborative scenario. There have been experiments using LLMs as translator assistants, where the human can query the model for alternate phrasings, explanations of source text nuances, error annotation, and more [35,72,74].
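The following sketch illustrates such a staged workflow (pre-translation research, drafting, refinement) as a sequence of prompts. It is a schematic approximation of the procedures in [47,48], with an illustrative placeholder model; the prompts are not those used in the cited studies.

from transformers import pipeline

# Illustrative placeholder model; the prompts are schematic approximations of
# the staged workflows in [47,48], not the prompts used in those studies.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def ask(prompt: str) -> str:
    out = llm(prompt, max_new_tokens=300, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

def staged_translation(source: str, tgt_lang: str = "German") -> str:
    # Stage 1: pre-translation research (key terms, topics, tricky renderings).
    notes = ask(f"List the key terms and topics in the text below and propose "
                f"{tgt_lang} renderings for the tricky ones.\nText: {source}\nNotes:")
    # Stage 2: first draft, conditioned on the research notes.
    draft = ask(f"Using these notes:\n{notes}\nTranslate the text into {tgt_lang}.\n"
                f"Text: {source}\nDraft translation:")
    # Stage 3: self-refinement / proofreading of the draft.
    return ask(f"Proofread and improve this {tgt_lang} translation.\n"
               f"Source: {source}\nDraft: {draft}\nFinal translation:")

print(staged_translation("The patient should take the medication twice daily with food."))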
Another interesting recent study develops “compositional translation,” a new method that uses LLMs to improve translation quality for low-resource languages [75]. The approach works by breaking down complex sentences into simpler, semantically coherent phrases, translating each phrase using contextually similar examples and then merging these translations into a coherent final result. Experiments show that this method can outperform traditional few-shot translation techniques in LLMs, making it a promising solution for low-resource challenges.
Finally, explicit pivoting through English (Section 4.3) is another, less sophisticated case of CoT prompting.
Table 1 is a brief summary of the strategies described in Section 4.2, Section 4.3 and Section 4.4 above.

4.5. Multilingual Latent Representations and Cross-Lingual Tasks

Beyond direct translation, LLMs have shown capability in a variety of cross-lingual tasks, such as cross-lingual question answering (e.g., answering in English from a French context), multilingual summarization (summarizing a document in another language), code-switched dialogue understanding, and so on [14,15,76]. Presumably, all these tasks leverage the model’s multilingual latent space. But what, exactly, is the shape of this space [77]? Where in the model should we look for it? And what methods could be used to study it?
The next section reviews key recent work addressing these questions, offering additional background and a valuable framework for the investigation of the origins of LLMs’ translation abilities in their training data in Section 6.

5. How Do LLMs Translate? Where Does Translation Happen in Them?

5.1. In-Context Translation

Few-shot translation mentioned above (Section 4.1) is a remarkable example of in-context learning in LLMs [66,78]. The “unreasonable effectiveness” [70] of in-context translation and its ability to override semantically inadequate prompt instructions [59] raise questions about its underlying computational mechanisms.
Sia et al. [79] approached these questions by asking at what stage in the computation LLMs recognize and execute the translation task in a few-shot mode. By using a technique called “layer-from context masking,” the authors selectively removed the model’s ability to attend to prompt instructions and examples from specific layers onward. This method allowed them to locate a “task recognition point”—around layer 14 of 32 in Llama 7B for instance—where the model has internalized the task and no longer needs to consult the context to generate accurate translations. Before this point, masking the context significantly degraded performance; after it, performance remained stable even when the context was masked. This suggests a three-phase process: an initial stage with minimal context influence, a middle stage of active task location, and a final stage where execution proceeds independently of the prompt.
Further experiments identified these task recognition layers as “critical layers”. Masking them severely impacted performance, whereas masking later layers had minimal effect, indicating redundancy. Moreover, fine-tuning experiments using lightweight LoRA adapters showed the most gains when applied to these middle layers, underscoring their central role in learning the translation task.

5.2. Cross-Lingual Representation Alignment

An intriguing earlier finding by Wen-Yi and Mimno [80] was that some language models cluster semantically similar words from different languages together at the very first, often ignored, embedding layer. Specifically, the authors found that, without explicit bilingual training, the token embeddings of mT5-XL [81] often align across languages for concepts; for example, the vectors for når (“when,” Danish), cuando (“when,” Spanish), quando (“when,” Portuguese), and quand (“when,” French) are all close neighbors. A similar phenomenon was earlier observed in models like mBERT [82] but not in the input embedding layers.
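This kind of observation is easy to probe at a small scale. The sketch below compares input-embedding vectors of translation-equivalent words, using mt5-small as a lightweight stand-in for the mT5-XL model studied in [80] and mean-pooling over subword embeddings; it illustrates the method only, not the paper’s exact procedure.

import torch
from transformers import AutoTokenizer, AutoModel

# mt5-small is a lightweight stand-in for the mT5-XL model studied in [80];
# multi-token words are mean-pooled over their subword embeddings.
name = "google/mt5-small"
tok = AutoTokenizer.from_pretrained(name)
emb = AutoModel.from_pretrained(name).get_input_embeddings()

def word_vec(word: str) -> torch.Tensor:
    ids = tok(word, add_special_tokens=False)["input_ids"]
    return emb(torch.tensor(ids)).mean(dim=0)

words = {"da": "når", "es": "cuando", "pt": "quando", "fr": "quand"}
vecs = {lang: word_vec(w) for lang, w in words.items()}

for lang, v in vecs.items():
    sim = torch.nn.functional.cosine_similarity(vecs["fr"], v, dim=0)
    print(f"fr vs {lang}: cosine similarity = {sim.item():.3f}")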
It should be noted that mT5 is an encoder–decoder model trained on a balanced mix of 100+ languages with the “masked span prediction” training objective and designed for text-to-text tasks, including translation, summarization, and so on. mBERT and its descendants such as XLM-R, on the other hand, are encoder-only multilingual models similar to BERT but trained for multilingual understanding tasks such as classification or NER and not for generation.
What about pristine decoder-only models? Are they also capable of learning cross-lingual representation alignment and cross-lingual transfer during pre-training? Hua et al. [83] approached these questions with a controlled training of GPT-2 models on a multilingual extension of a toy synthetic language that was earlier used to model legal moves in the Othello board game [84]. While limited both in scope and scale, their findings suggest that relatively small decoder-only models do not learn a common representation across multiple artificial languages when the models are exposed to them separately. However, adding “anchor tokens”—lexical items that are shared across languages—to the training sequences improves cross-lingual representation alignment quite considerably. As the authors note, mOthello languages with their simplistic grammar and only 180 tokens are a far cry from natural languages. Yet, the catalyzing role of “anchor tokens” in “Global” (Section 1) multilingual representation learning is notable and consistent with the role of shared subword vocabularies and “seed lexicons” earlier studied in other settings, such as unsupervised MT [85,86,87] and multilingual word embeddings more generally ([25], Ch. 12).
Experiments with toy models, synthetic data, and narrowly scoped tasks offer a distinctive advantage: they give researchers fine-grained control over training dynamics. In a recent study, Blum et al. [76] trained tiny 2M-parameter transformers from scratch on synthetic multilingual corpora and uncovered an early learning phase during which models either unify or segregate the same facts across languages. Only the unification regime yielded robust cross-lingual transfer. The authors showed that the strength of unification closely tracks cross-lingual factual accuracy and can be causally modulated through dataset design. In particular, when language identity is easy to infer and predictive of simple labels, models drift toward language-siloed representations; by contrast, balancing the data and attenuating language-specific surface cues encourages cross-lingual unification.

5.3. Do LLMs Pivot Through English on Their Own?

As already mentioned in Section 4.3, pivoting through an intermediate language (most often English) has been widely exploited in the MT industry for a long time. Recent user studies have extended this approach to LLMs by explicitly asking the models to go through two stages, and they sometimes found improved accuracy.
Interestingly, evidence suggests LLMs themselves might internally be doing something like this. Wendler et al. [88] provided insight by asking, do multilingual LLMs use English as an internal pivot language? They prompted Llama-2 models to translate a single French word ‘fleur’ to Chinese with a few-shot prompt (where ‘中文’ means “Chinese”) as
Français: “vertu”—中文: “德”
Français: “siège”—中文: “座”
Français: “neige”—中文: “雪”
Français: “montagne”—中文: “山”
Français: “fleur”—中文: “
and applied a technique called the logit lens [89] to unembed the model’s intermediate layers early and see what they encode. They found that the model’s representation of the French input drifts closer to English semantic space before settling into the target language space (i.e., Chinese) (Figure 4).
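The mechanics of the logit lens are simple: take the hidden state at an intermediate layer, apply the model’s final layer norm and unembedding matrix, and read off the most likely token. The sketch below does this with GPT-2 as a small, openly available stand-in (the attribute names ln_f and lm_head are GPT-2-specific, and the prompt is merely illustrative; Wendler et al. worked with Llama-2 and carefully constructed few-shot prompts).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is a small, openly available stand-in; ln_f and lm_head are
# GPT-2-specific attribute names, and the prompt is merely illustrative.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = 'Français: "fleur" - English: "'
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# "Unembed" each intermediate layer: apply the final layer norm and the
# unembedding (output) matrix to the last token's hidden state.
for layer, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[:, -1, :])
    top_id = model.lm_head(last).argmax(dim=-1)
    print(f"layer {layer:2d}: top token = {tok.decode(top_id)!r}")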
In essence, an LLM might internally translate everything to a kind of “English-centric interlingua” as a stepping stone. If so, then explicitly pivoting via English might resonate with the model’s inner workings. Indeed, some prompting approaches encourage the model to think aloud in English before producing the final translation in the target language. And more recent “reasoning” models sometimes do it on their own [90,91], using English as a kind of “metalanguage”. Unfortunately, this interesting metalinguistic phenomenon lies outside the scope of the paper. Could it form the basis of another distinct mode of translation as “extended inference” analogous to similar operations sometimes performed by human translators when they encounter difficult cases? This question is definitely worth exploring. (I thank Raphaël Millière for drawing my attention to it).

5.4. Language-Agnostic Representation?

But do LLMs actually “think” in English when performing multilingual tasks [92]? Or do they “think” in a truly interlingual way, with English emerging in the intermediate layers simply because it happens to dominate the training data [93]? Wendler et al.’s early decoding study [88,94] did not allow them to address this question head-on. Dumas et al. [95] took up this challenge with a more nuanced mechanistic interpretability approach based on activation patching. They designed simple translation prompts in different languages to serve as two contexts: a “source prompt” (translating a word from, say, German to Italian) and a “target prompt” (translating a different word from, say, French to Chinese). The authors ran Llama-2 7B on both prompts and then swapped part of the internal state between them mid-process. More precisely, at a chosen transformer layer, they took the hidden latent representation of the last token from the source prompt’s forward pass and injected it into the target prompt’s forward pass at the same layer. By observing how this surgical insertion alters the model’s next-word prediction, they can tell what information that latent was carrying (Figure 5).
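A bare-bones version of this patching procedure can be written with PyTorch forward hooks. The sketch below uses GPT-2 as a small stand-in for Llama-2 7B; the layer index, module path, and prompts are illustrative, and a real experiment would control prompts, languages, and evaluation far more carefully than this [95].

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 as a small stand-in for Llama-2 7B; layer index and prompts are
# illustrative, and the module path (transformer.h) is GPT-2-specific.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
LAYER = 6
block = model.transformer.h[LAYER]

src = tok('German: "Zitrone" - Italian: "', return_tensors="pt")   # source prompt
tgt = tok('French: "livre" - Chinese: "', return_tensors="pt")     # target prompt

# 1) Run the source prompt and cache the last token's hidden state at LAYER.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0][:, -1, :].detach().clone()
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**src)
handle.remove()

# 2) Run the target prompt, overwriting the same position with the cached state.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["h"]
    return (hidden,) + output[1:]
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tgt).logits[:, -1, :]
handle.remove()

print("next-token prediction after patching:", tok.decode(logits.argmax(dim=-1)))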
The outcome was quite striking: different layers separated the handling of language and meaning. When Dumas et al. [95] patched at early layers (closer to the input), Llama-2 still produced the correct word in the target language as if nothing had changed. But patching at a middle layer caused a curious swap: the model generated the correct concept but in the wrong language. (Specifically, it used the source prompt’s language). And patching at a late layer flipped both aspects—the concept as well as the language, effectively following the source prompt’s content entirely. This layered behavior reveals that Llama-2 selects the output language earlier in the network and determines the actual concept to translate later in deeper layers. In other words, the model first decides “I need to respond in Chinese” (target language) before it figures out “what is a lemon?” in the context of translation.
To dig deeper, Dumas and co-authors formulated two competing hypotheses about Llama-2’s internal mechanism: H1 (“Disentangled Representation”)—the model keeps language and meaning separate, computing a language-agnostic concept first and then translating it into the required language; and H2 (“Entangled Representation”)—the model’s concept representation is always tied to a particular language, so it directly converts from a source-language word to a target-language word in one entwined step. The layer patching results hint at H1: one could swap languages while keeping the same concept and vice versa. To confirm this, the authors averaged the latent concept representations for a given word across multiple source languages and used this mean representation in a translation prompt. Remarkably, this language-agnostic average did not confuse the model or degrade its performance; in fact, the performance was improved! In other words, forcing the model to translate from a blended, language-neutral hidden state yielded even better results. This counterintuitive outcome adds force to the claim that the model’s internal representation of a concept may be genuinely “language-neutral”.
The authors note the limitations of their study, which operates with single-word prompts and very simple unambiguous concepts, explicitly acknowledging that richer lexical items, multi-word expressions, sentential context, discourse phenomena, and language-specific idioms remain outside the scope of this approach. It is unclear whether the demonstrated disentanglement of language and meaning carries over to full-sentence translation. All the reported experiments use open-source models (in addition to Llama-2 7B, the authors ran experiments with Llama-3 8B, Mistral 7B, Qwen 7B, Aya 8B, and Gemma 2B). While the authors replicated some of their findings on a 70B Llama, they obviously could not test frontier-scale, instruction-tuned, or RLHF-aligned systems. Differences in the model size, training mix, and alignment objectives may alter how larger “consumer-grade” LLMs partition language and meaning. Finally, while activation patching is a powerful tool, it may introduce artifacts and does not, by itself, guarantee that the model would naturally traverse the identified internal states during ordinary inference. But despite these limitations, Dumas et al.’s work contributes important insights to the ongoing debate about the nature of multilingual representation in LLMs.
A further step in this direction was recently undertaken by Anthropic’s transformer circuits team [96], who used a newly developed tool called attribution graphs to trace how certain features (interpretable patterns of activation in the network) interact to produce the model’s output.
An attribution graph is essentially a causal map of the model’s internal computation for a given prompt, where the nodes are interpretable features, and the directed edges show how activating one feature leads to effects on others and ultimately the output. Attribution graphs trace how the model’s internal components contribute to a specific answer, much like an information flowchart. A feature, in this context, means an interpretable direction in the model’s activation space—a pattern of neural activity that consistently represents a particular concept. Features can range from low-level patterns such as detecting a specific token or character to high-level semantic themes such as a concept of size, negation, or French language. By identifying these features and how they connect with each other in an attribution graph, Anthropic’s researchers formulated hypotheses about what sub-computations the model is performing to generate an output. Crucially, this approach yields an interpretable pattern of the model’s internals as a candidate explanation of which concepts the model used and how they interacted to produce a given response. The authors then validated these explanations with intervention experiments by actively editing features in the model’s computations to test whether the predicted causal roles hold up. This combination of attribution graphs (visualizing which features matter) and feature-level interventions (probing what happens if we change them) serves as a powerful tool for mechanistic interpretability.
In a bit more detail, the tool was put to work in a simple translation-like task: the prompt “The opposite of ‘small’ is …” posed in three different languages (English, French, and Chinese). Despite the surface language differences, all three prompts have the same meaning, and their correct completions are semantically equivalent: ‘big’, ‘grand’, and ‘大’. The question is, does the model (Claude 3.5 Haiku) respond to these prompts using one language-neutral reasoning process, or does it handle each language separately? The attribution graphs reveal a striking answer: the model uses very similar internal circuits for all three languages, combining shared cross-lingual features with language-specific ones. In each case, Claude first activates a set of language-independent features that recognize the task of finding the antonym of ‘small’. These include an abstract concept of “smallness” and an “opposite-of” operation feature. Once this conceptual step finds the idea of big (the opposite of small), the model then engages language-specific features to express that idea in the appropriate tongue. In other words, Claude appears to “think” in a language-neutral way about the meaning (identifying the “opposite of small” as the concept big) and only then to translate that concept into the word forms ‘big’, ‘grand’, or ‘大’ depending on whether the context is English, French, or Chinese. There are features that act as language markers: for example, an open-quote-in-French feature or a Chinese-context feature, which ensure the answer is framed in the correct language with the right quotation marks. But the core semantic features (for small and its opposite) are multilingual: the model has learned a single abstract notion of small that spans languages rather than distinct unrelated notions for each language. The authors describe the overall computation as having three factorable parts: Operation (finding an antonym), Operand (the concept of small), and Language (which language’s vocabulary to use for the answer). This decomposition suggests that the model might indeed be employing something like an internal “interlingua” for meaning combined with a separate layer for language-specific expression.
To confirm that these parts (Operation, Operand, Language) are truly separable in Claude’s internal representation, the authors performed three kinds of causal intervention experiments, with each targeting one part while leaving the others unchanged. Figure 6 illustrates these three interventions and their effects. In each case, the researchers located the relevant features in the attribution graph and surgically swapped or altered them mid-computation to see how the model’s output changes, thereby testing whether that facet of the circuit is indeed functioning as hypothesized.
The results confirmed the expectation that Claude’s internal processing has a “modular” structure: one module figures out the task (antonym vs. synonym), another handles the core meaning of the words involved, and a third handles the language context. Each can be perturbed on its own. This level of interpretability is quite remarkable: it is as if one opened the model’s “brain” and identified a sub-network for “opposite meaning,” a sub-network for “the concept of small,” and a sub-network for “speaking French” and demonstrated each one’s role by poking it. This offers a fascinating parallel to the idea that meanings are not intrinsically tied to words: the model generates the appropriate word when needed, implying it has an independent grasp of the meaning itself. Furthermore, the model’s ability to swap out Operation and Operand independently points to a form of compositional representation inside the neural network. The classical hallmark of compositionality in linguistic semantics is the requirement that “the meaning of an expression [be] a function of the meanings of its parts and of the way they are syntactically combined” ([97], p. 281). Here, we see it at work: the model’s internal process for “the opposite of X” can be decomposed into a part that understands the relation “opposite of” and a part that provides the content “X,” and these parts can be intervened on separately. The “antonym-finding” operation acts almost like a built-in function that can be applied to different inputs (‘small’ or ‘hot’ or others) to yield different outputs. Importantly, the concept of small is represented in a way that is not entangled with the fact that it was asked in French or English; it is a “portable” piece of information.
On a cautionary note, Anthropic’s team also observed that English seems to be the model’s “default” language in this internal process, which is hardly surprising. They describe English as “mechanistically privileged,” meaning that, in the model’s circuits, English has a built-in advantage or priority. How does this show up mechanistically? It turns out that those abstract “say big” features (the ones that correspond to expressing the concept of big) have a stronger direct influence on producing the English word ‘big’ than the equivalents in French or Chinese. In Claude’s neural wiring, the path from the concept of big to the English token ‘big’ is somewhat more straightforward and higher-weighted than the path to ‘grand’ or ‘大’. For non-English outputs, the model relies a bit more on the extra “language context” features to get to the right word. Another subtle observation was that some English-specific features engage in what the authors call a “double inhibitory effect”: essentially, certain English features suppress competing features that would otherwise produce the English word when the context is not English, thus allowing the non-English word to come out. This complicated dance of inhibition and activation paints a picture in which the model retains a universal semantic representation (of big) that is biased to map to English unless additional mechanisms redirect it. In more philosophical terms, Claude 3.5 has a kind of universal “mentalese” for concepts [98], but it is skewed toward outputting English by default. Only when the model clearly detects a non-English context (quotes, characters, etc.) do these language-specific signals intervene to translate that internal thought into, say, French. This finding echoes the tension noted in the recent literature: some prior work had found evidence of a shared multilingual representation [99], while others noted an English-centric bias [92]. Here, both appear to be true: the representations are genuinely multilingual, and English is a privileged default internally. One could say the model has a universal “mental dictionary” but with English entries written in bold.
All of that is still quite a distance away from real-life translation “in the wild”. And the indirect nature of the attribution graphs method, which operates on a separate “replacement model” rather than the original model, contributes to the limitations of this study. Nonetheless, cutting-edge mechanistic interpretability research casts considerable light on both questions stated in the title of this section: “How do LLMs translate?” and “Where does translation happen in them?”. The answers are nuanced, depending on multiple factors involved in various experiments: zero- vs. few-shot prompting, single concepts (lemon, book, etc.) vs. concepts and functions (“the opposite of___”, “a synonym of___”), as well as model size and other details. But the evidence strongly suggests that LLMs, large and small, develop some kind of shared multilingual or language-neutral representations during training and then leverage them at inference time when prompted for translation-related tasks.
How do LLMs develop these representations in the first place? The answer lies in the models’ training data and training dynamics.

6. What Explains the Origin of LLMs’ Translation Abilities?

6.1. The Minimal Role of Instruction Tuning

After their initial pre-training, LLMs normally undergo supervised instruction tuning, which improves their responsiveness to specific requests. This is done on a mixture of mostly proprietary data, but there is little doubt that translation tasks are included in the mix. In any case, they are present in open-source instruction tuning datasets such as Aya [100]. How important are they to the emergence of LLMs’ translation abilities?
Early experiments revealed rather erratic behavior of even the largest language models and their high sensitivity to the prompting details. For example, in response to the prompt
Translate French to English: Mon corps est un transformateur de soi, mais aussi un transformateur pour cette cire de langage.
GPT-3 continued with another French sentence instead of translating [101].
Instruction tuning is partly about teaching the model how to respond in a way that humans find coherent and helpful. When a user asks, “Translate from French to English,” the model leverages the multilingual knowledge it has already formed during pre-training, but now it has learned a clearer instruction on how to apply that knowledge (i.e., produce a direct translation rather than continuing in the original language).
Although there is still some dependence on the prompting details [52,60,62], in the latest LLMs, suboptimal prompts tend to yield overly generous rather than incorrect responses to translation tasks. Some models need to be restrained with longer prompts in order to limit their output to what is needed (e.g., [55]).
To be sure, decidedly vague or ambiguous prompts may still push a model in a wrong direction. And, as already noted, including translation examples in the prompt tends to override inadequate instructions. But there is every reason to believe that LLMs’ translation abilities are already there in a dormant state and are simply waiting to be unlocked with a good prompt.
So, while it may be tempting to attribute LLMs’ translation capabilities to explicit instruction tuning, it likely plays only a minor or supplementary role. Indeed, the translation capabilities of LLMs were already prominent in earlier, non-instruction-tuned models, such as GPT-2, GPT-3, and BLOOM. And they are stronger in non-instruction-tuned versions of more balanced multilingual models, such as Llama-3 and Llama-4, even if the latter are smaller in size. Due to their exposure to massive multilingual texts, LLMs inadvertently learn cross-lingual patterns and semantic mappings. Instruction tuning may enhance model responsiveness and specificity, but it does not fundamentally generate their cross-lingual capabilities. The end result is that, once prompted to translate, such a model can search its deeply learned space of multilingual representations and produce coherent translations, even for language pairs it has not been explicitly trained to translate in a supervised manner. Good prompts are more important than instruction tuning on translation examples.
At their core, LLMs develop their translation abilities during the pre-training phase. How does it happen?

6.2. Incidental Bilingualism in Training Data

In an earlier work, Eleftheria Briakou and colleagues [13] approached this question by searching for “bilingual needles” in a haystack of PaLM’s [49] training data. The authors hypothesized that incidental bilingualism—the unintentional inclusion of bilingual text within single training instances—is a key source of PaLM’s translation capabilities. In other words, the 540B-parameter model may have learned to translate from naturally occurring translations and mixed-language content in its massive training corpus (780 billion tokens of crawled text from web pages, social media, books, and so on), even though it was never given dedicated parallel data. The idea is that, while combing through its training corpus, PaLM may have stumbled upon numerous “needles in the haystack”: sentences and documents that appear side by side in different languages, essentially acting as implicit translation examples. These bilingual snippets were not explicitly labeled as “training examples” for translation; rather, they were incidental, arising naturally in the wild text. For example, a website might list a paragraph in English followed by the same paragraph in Spanish, or a forum post might include a quote in French with an English gloss. Briakou et al. suspected that such incidental bilingual data could be the primary source from which PaLM picked up its translation skills.
To test this hypothesis, the authors adopted a multi-step mixed-method approach, combining large-scale quantitative analysis with targeted qualitative inspection. The first step in the process was distinguishing bilingual from monolingual text in PaLM’s 2048-token-long training instances. It turns out that only about 1.4% of them are bi- or multilingual, i.e., contain more than one language, and only 0.34% contain at least one translated sentence pair: a sentence in one language directly paired with its translation in another. However, given the enormous scale of the corpus, this still amounts to a very large absolute volume of parallel data. In total, PaLM was unintentionally exposed to over 30 million translated sentence pairs spanning 44 languages (all paired with English). In other words, none of those 44 languages were truly “zero-shot” for translation, since each had some parallel English examples in the training mix.
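As a rough illustration of this first step (not Briakou et al.’s actual detection pipeline, which relied on dedicated language-identification and bitext-mining tools at scale), one might flag multilingual training instances along the following lines, assuming the off-the-shelf langdetect package and a crude sentence splitter:

```python
# Illustration only: flag training instances that contain more than one language.
# Assumes the langdetect package; a real audit would use dedicated language-ID
# and bitext-mining tools over billions of 2048-token instances.
from langdetect import detect

def languages_in_instance(instance: str) -> set:
    """Return the set of languages detected across the sentences of one instance."""
    langs = set()
    for sent in instance.split("."):      # crude sentence splitting, for illustration
        sent = sent.strip()
        if len(sent) < 20:                # fragments are too short for reliable language ID
            continue
        try:
            langs.add(detect(sent))
        except Exception:                 # langdetect raises on undetectable input
            pass
    return langs

instance = ("The conference was postponed until next spring. "
            "La conférence a été reportée au printemps prochain.")
print(languages_in_instance(instance))   # e.g., {'en', 'fr'} -> a bilingual instance
```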
Moreover, the incidence of bilingual content strongly correlates (r ≈ 0.94) with the presence of that language in the corpus. Languages that appear more often in monolingual form also tend to have more bilingual instances and translation pairs. This means that high-resource languages like French or Spanish not only have lots of data in PaLM’s training set, but also many incidental translations (millions of translation pairs), whereas a low-resource language (e.g., Telugu, Gujarati) has little of both (mere thousands of translation pairs).
Identification and manual examination of these bilingual and translation-pair snippets was the next step. The majority (around 55%) of detected bilingual instances were not explicit translations of the same content. Instead, these cases included phenomena like code-switching within a conversation, embedded references or foreign words (book titles, proper names, or quoted phrases in another language inside an English document), or just unrelated multilingual content that happened to co-occur (e.g., on a single web page).
The remaining bilingual instances (roughly 40% of them) did involve translation relationships. About half of these were direct translations: the same sentence or paragraph repeated in English and another language (sometimes formatted as parallel text sections). The other half were semantically related cross-lingual texts that were not exact translations but still conveyed overlapping content, for example, an English passage followed by a summary or commentary in French or vice versa. These findings illustrate that PaLM’s training data contained a spectrum of bilingual signals, from casual mixtures of languages to clear parallel translations.
To understand how PaLM might leverage these, the study looked at how the parallel sentences were presented. Often, bilingual texts were formatted with explicit or implicit prompts indicating language directions. A common pattern was the use of language labels followed by a colon, such as an English sentence prefixed with ‘English:’ and its translation prefixed with ‘French:’. This colon-delimited language name format is in fact the default style used in MT research, and here it appears naturally in the data.
Briakou and colleagues also observed variations on this theme: some documents used ISO language codes (e.g., ‘EN:’ or ‘FR:’) instead of full names, while others used language names in the native tongue (e.g., ‘Français’ to signal a French sentence). In some cases, the word ‘Translation’ in the target language was used as an indicator; e.g., a French text might begin with ‘Traduction:’. These are effectively natural prompts that inform a reader (or a model) that the following text is a translation of a preceding sentence.
Exploiting these insights, Briakou et al. constructed prompts that mirror the patterns found in the training data and tested them on PaLM’s translation tasks. The results showed a marked improvement in PaLM’s zero-shot translation performance when using data-derived prompts. In particular, for translations out of English, using the newly discovered prompt formats (such as native language labels) boosted the average translation quality by about 14 chrF points [102] compared to a generic prompting approach—a very substantial gain. This indicates that PaLM had indeed learned to respond to those familiar bilingual cues: when the prompt matched patterns it saw during training, it produced significantly better translations. In practical terms, the model’s true translation ability was higher than initially thought. The “emergent” zero-shot performance was underestimated when users did not prompt the model in the right way. By prompting in a style consistent with the incidental training examples, PaLM could be coaxed to translate more accurately and consistently; for example, producing output in the correct target language more reliably. This highlights how context and presentation can govern an LLM’s behavior: the relevant “knowledge” was latent in PaLM yet only fully expressed when the trigger phrase matched its ingrained experience. It is a bit like a person knowing a skill but needing the right cue or reminder to demonstrate it.
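To make the contrast concrete, the following sketch juxtaposes a generic prompt with prompts mirroring the data-derived formats described above (the wording is illustrative and only approximates the templates evaluated by Briakou et al.):

```python
# Illustration of the prompt formats discussed above; the wording approximates,
# rather than reproduces, the templates evaluated by Briakou et al.
source = "La conférence a été reportée au printemps prochain."

generic_prompt = f"Translate French to English: {source}"

# Data-derived formats mirroring patterns found incidentally in the training data:
label_prompt  = f"French: {source}\nEnglish:"     # colon-delimited language names
iso_prompt    = f"FR: {source}\nEN:"              # ISO-style language codes
native_prompt = f"Français: {source}\nEnglish:"   # native-language label for the source

for prompt in (generic_prompt, label_prompt, iso_prompt, native_prompt):
    print(prompt, end="\n\n")
```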
There is, of course, no guarantee that the translation pairs mined from the internet are good. In fact, the general quality of the internet’s multilingual content is low [103] and continues to degrade due to the rapidly increasing contamination from MT and LLM output [104,105]. Briakou et al. performed an extrinsic evaluation of their makeshift corpus of 3.3M French–English sentence pairs extracted from PaLM’s translation instances by using them to train a standard NMT model from scratch and tested its performance. The resulting scores (37–38 BLEU) were not far from those (41 BLEU) for the same NMT trained on the full 40M-sentence WMT dataset. This means that PaLM’s training corpus implicitly contained a parallel corpus sufficient to train a decent translation system. The “needles” it swallowed were largely real and useful translation signals, not just noise. This reinforces the notion that PaLM had effectively learned from legitimate translation examples, so its translation ability is built on a foundation of authentic bilingual knowledge gleaned from the web.
To directly assess how much those incidental translation examples contributed to PaLM’s capabilities, the authors performed ablation studies with the model’s scaled-down versions (1B and 8B) and trained them on a filtered version of the original PaLM’s corpus. In the filtered data, all of the previously identified translation pairs were removed, simulating a world where the model sees no direct parallel sentences during training. They then compared the translation performance of these models to baseline models trained on the unmodified data.
The ablation results were remarkable: removing the translation pairs caused a significant drop in translation quality, especially for the 1B model. For a set of high-resource language pairs, the 1B model’s average zero-shot translation into English suffered about a 7.4 BLEU-point reduction, and even with few-shot examples, the performance remained 5.9 BLEU points lower than the baseline model. This sizable degradation confirms that the small model had been heavily relying on those incidental parallel examples to learn how to translate. In contrast, the larger 8B model also showed an impact but a smaller one: roughly a 2–3 BLEU point drop on the same translation tasks without the bilingual data. The translation ability of the 8B model was hurt by the removal, yet it retained more of its capability than the 1B model did. This trend—a larger relative impact on smaller models and a diminishing but still noticeable impact on a bigger model—suggests that as model capacity grows, it can compensate to some extent for the lack of explicit parallel data. In other words, a really large model like PaLM might infer cross-lingual mappings from other signals (such as named entities, similar context across languages, and other heuristics) and general language understanding, but the presence of parallel sentences gives a smaller model an irreplaceable head-start in learning to translate. Even at 8B, the fact that performance drops when translation pairs are removed underscores that those incidental examples are indeed a cause (not just a coincidence) of the model’s translation skill.
At the same time, the 8B model trained on English-only data (no foreign text at all) was not found to be completely clueless. It could still translate a bit for languages that use the Latin script, achieving BLEU scores in the teens and twenties for translating into English. How is that possible? This is likely because some traces of other languages slipped through the filtering, or the model picked up names and tokens that overlap with English as useful seed lexicon “anchors” (Section 5.2 above). For example, the model might know some Spanish words (like ‘universidad’) from English context sentences that mention them and thus can map a simple Spanish sentence to English by recognizing shared terms. It might also exploit the fact that languages sharing script have similar character patterns. The takeaway is that a larger model can infer translation mappings from very sparse data—an intriguing hint that at sufficient scale, cross-lingual abilities can emerge from even minimal exposure.
Despite the inevitable limitations (even Google researchers could not retrain a full-scale LLM from scratch multiple times!) and numerous advancements that have happened in the last two years, Briakou et al.’s findings invite us to revisit the notion of “emergence” in generative AI. PaLM’s (and other LLMs’) translation skills might initially look emergent as if the capacity to translate popped up out of sheer model complexity and general-purpose learning. However, Briakou et al.’s study reveals that the capability was latent in the training data all along. PaLM was not explicitly tasked with translation, but the training data contained the right examples and cues, allowing the model to implicitly learn the task. In other words, the model’s knowledge of translation was acquired through the structure of its experience, not through an explicit directive. This is analogous to a child who grows up in a bilingual household. They were never formally taught to translate, but by constantly hearing two languages side by side, they naturally learn to convert between them. Digesting the internet, LLMs effectively encounter a multilingual world where translations are sometimes provided, and they absorb that correspondence.
The notion of incidental bilingualism emphasizes how learning can occur as a byproduct. For AI researchers, this highlights the importance of carefully examining training data. Many capabilities of LLMs might trace back to hidden lessons in the data rather than purely novel reasoning. This raises the question: should we attribute an ability to the architecture and learning algorithm of the model or to the information embedded in its environment? In PaLM’s case, the environment (i.e., the training corpus) contained cross-lingual mappings; the model’s architecture allowed it to pick up on those mappings and store them. So the answer is a bit of both: the capacity to detect and use such patterns is an emergent property of a sufficiently powerful sequence predictor, but the specific skill (French ↔ English translation) is grounded in having seen actual examples.
Another key insight is that even a small fraction of data can exert a disproportionate influence on behavior. Less than 1% of PaLM’s training instances were explicitly bilingual, yet those bits had an outsized impact on the model’s performance on translation tasks. This asymmetry points to the efficiency of pattern-learning in LLMs: when the model encounters a rare but highly informative pattern (like a sentence and its translation), it can leverage it to handle many other instances at inference time. It also means that omitted data could lead to omitted capabilities. If, hypothetically, PaLM’s (and, by extrapolation, other full-scale LLMs’) crawl of the web had missed all bilingual content, it might have shown far weaker translation abilities, at least until reaching enormous scale where other heuristics kick in. Thus, the structure and composition of training data directly shape the emergent “translation knowledge”.
Finally, this work has practical implications for improving and understanding LLMs. By recognizing that LLMs may have “translation knowledge” locked behind suboptimal prompting, we may be able to unlock better performance simply by changing how we query the models. This suggests a broader principle: when an LLM seems to lack skill in something, it may be a matter of not accessing the skill properly. The model might be like a library full of facts with no index. If we find the right prompt “index,” suddenly the knowledge comes to the surface. For translation, the “index” could be a native prompt format. For other tasks, there might be analogues (perhaps certain tasks were also seen incidentally in training, and the key is to find the trigger that activates that capability).
In conclusion, “Searching for Needles in a Haystack” paints a narrative of discovery within LLMs’ training data: a substantial translation curriculum might be hidden in plain sight. It shows how an LLM can acquire what looks like new competence from unlabeled data simply by virtue of the data’s internal structure and diversity. This blurs the line between supervised and unsupervised learning: LLMs are not provided with labeled translation pairs in a traditional sense, but they encounter them in context and may learn from them in a self-supervised way.
But this cannot be the whole story about the origins of LLMs’ translation abilities.

6.3. Global and Local Learning

Despite the presence of trace amounts of translation pairs in non-overlapping token windows—chunks of text consumed by LLMs in single forward passes during pretraining—and their significance in generating LLMs’ translation abilities, most of the multilingual content in the training data comes in the form of monolingual documents found in different parts of the internet. Documents such as news articles or Wikipedia pages in different languages are generally expected to contain semantically identical or closely similar sentences. However, these sentences do not come neatly aligned in pairs, and they are unlikely to appear together within a single sliding context window used during pre-training (typically 4–100K tokens at the time of writing).
We have already noted (Section 5) that LLMs are capable of learning shared multilingual or language-neutral representations during training on semantically related monolingual documents, thanks to the awesome power of distributional semantics. In that respect, Global multilingual learning in LLMs is similar to cross-lingual representation alignment in mBERT [82], mBART [106], and its variants, but on steroids. Indeed, the same phenomenon was found to occur in earlier pre-LLM static-embedding settings; e.g., Mikolov et al. [107] exploited the potential of efficient learning of linear mappings between the monolingual embedding spaces in simple frameworks such as Skip-gram and CBOW. (I am grateful to Michael Carl for reminding me of this important historical precedent). For example, real-world facts (dates, names, places) tend to appear repeatedly across languages, so the model continuously refines how these concepts are expressed in their respective monolingual contexts. If an LLM sees an article in English about climate change and another in German about ‘Klimawandel’ with overlapping terminology and structure, it can infer Global alignment between them. In general, when it encounters a text in language A that is very similar to a text in language B it saw earlier, the only way to reduce perplexity (prediction error) is to internally link those two contexts. Over many such occurrences, the model might cluster representations of sentences by meaning rather than by language. Importantly, this kind of Global learning (alignment, clustering) does not happen as a result of scanning single translation instances and adjusting the attention and other model weights in response. Rather, it happens in the course of repeated readjustment of the weights throughout multiple training steps.
Of course, the model occasionally stumbles upon “needles in a haystack”—training instances containing exact translation pairs. When this happens, the model may exploit the multilingual embeddings already learned from the previous steps to better align bilingual context present in a current Local context window. And then it may, in turn, leverage this Local alignment to improve its global cross-lingual representation potential at the next step.
This kind of ongoing iteration between Local and Global learning is a natural and very helpful consequence of batch training, in which the model updates its weights after every step. In stochastic mini-batch training, each step’s gradient is computed over a shuffled mini-batch that typically mixes examples from different languages and domains. When a batch happens to contain both an occasional local bilingual cue and many globally related monolingual fragments, the single update aggregates these signals, simultaneously strengthening direct mappings and broader cross-lingual structure. Because mini-batches (often hundreds to thousands of sentences) may be reshuffled and resampled across successive steps and epochs, these heterogeneous mixtures can recur with different neighbors, so weights adjusted on one step are immediately put to work in the next one, closing the loop between Local and Global learning. Once updated, the weights are used in the next iteration, continuously improving the model’s translation abilities (among other abilities it may learn in the process). Notably, the same local processes are consistently employed across all stages of training.
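A minimal sketch of this dynamic, using a toy causal language model and synthetic data in place of a real corpus (the GRU stands in for a transformer stack; nothing here reproduces an actual LLM training setup), may help fix ideas:

```python
# A toy sketch of how stochastic mini-batch training interleaves Local and Global
# signals. The model, data, and scale are deliberately unrealistic: a GRU stands in
# for a transformer stack, and random tokens stand in for a web-scale corpus.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

vocab_size, seq_len, d_model = 1000, 32, 64

class ToyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h, _ = self.backbone(self.embed(x))
        return self.head(h)

# Synthetic "corpus" of fixed-length windows: in a real corpus, most windows are
# monolingual (Global signal) and a small fraction contain an embedded translation
# pair (Local signal).
windows = torch.randint(0, vocab_size, (512, seq_len))
loader = DataLoader(TensorDataset(windows), batch_size=32, shuffle=True)  # reshuffled each epoch

model = ToyCausalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for (batch,) in loader:
    # One gradient step aggregates whatever mixture of bilingual and monolingual
    # windows the shuffle happened to produce.
    logits = model(batch[:, :-1])
    loss = loss_fn(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()  # updated weights are immediately reused on the next shuffled batch
```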
In summary, then, we have the following:
  • Local learning: acquisition from bilingual signals that co-occur within a single training context window (e.g., an English sentence followed shortly by its translation).
  • Global learning: alignment of semantically related monolingual content distributed across the training corpus, not necessarily co-occurring in a single window.
  • The two interact iteratively during batch training.
The strategy of continuous iteration between “local” and “global” processes is deeply embedded in machine learning, optimization, and probabilistic modeling. This paradigm allows systems to gradually refine local parameters based on global structure or objectives and vice versa, converging on a stable final state. The original Expectation–Maximization (EM) algorithm is a foundational framework for this interplay [108]. Many other techniques—from Latent Dirichlet Allocation [109] to statistical MT [110] to SentencePiece [6]—borrow this pattern.
So, on the balance of evidence and common wisdom, LLMs’ translation skills seem to emerge from a combination of these Local and Global learning processes. Over the full course of training, the model appears to exploit direct local correspondences when available and to rely on broader global semantic alignment when they are not. This dual-sourced competence can explain why LLM translations sometimes resemble verbatim dictionary lookups and other times read like fluent paraphrases.
Crucially, this duality hypothesis is not just a post hoc narrative. It may yield concrete, testable predictions about LLMs’ translation behavior, which may be sensitive to the relative contributions from two learning mechanisms. I outline three types of such empirical predictions below and offer preliminary considerations on how they could be tested in Section 7.

6.4. Empirical Implications of Global and Local Learning

6.4.1. Style of Translation Outputs

If locally learned knowledge (memorized bilingual snippets) is driving a particular translation, the output should skew more literal and form-fixed, whereas globally learned cross-lingual representations should yield more adaptive output. For instance, we expect LLMs to produce standard, idiomatic translations for common phrases that likely occurred in parallel form during training (e.g., proverb-like expressions or technical terms drawn from bilingual glossaries) but to generate more explanatory or rephrased translations for novel or rare expressions.
Recent analyses of LLM outputs suggest that their translations tend to be literal for frequent source-target phrase pairs yet more interpretative for inputs that were unlikely to have direct parallel examples. This aligns with manual analysis: in earlier studies, GPT models often rendered common idioms in a straightforward way but would explicate obscure idioms in the target language when a ready equivalent is not known (a sign of on-the-fly alignment). For instance, Bureau Works’ analysis [111] found that GPT-3 achieved 90% accuracy in translating Chinese idioms, outperforming traditional machine translation engines, which often produced literal and awkward translations. Similarly, Tang et al. [112] demonstrated that GPT-4, when prompted appropriately, could generate high-quality, context-aware translations of East Asian idioms, surpassing commercial translation engines in both faithfulness and creativity.
These findings are further supported by comprehensive evaluations comparing GPT-4 to human translators. Yan et al. [113] observed that GPT-4 tends toward overly literal translations and exhibits lexical inconsistency in some situations, whereas human translators sometimes over-interpret context and introduce distinctly human “hallucinations” resulting from fatigue. This suggests that GPT-4’s translation behavior varies depending on the familiarity of the source phrases, being more literal with common expressions and more interpretative with less familiar ones.
Accordingly, if we construct a test set of sentences half of which contain well-known idioms (present in training data) and half containing novel metaphors or nonce phrases, an LLM should translate the first category more succinctly and idiomatically but the second with longer, more descriptive phrasing. Such a stylistic divergence would indicate the model is switching between retrieved local translations and generative global reasoning. If no difference is found—for instance, if the model always produces literal outputs or always paraphrases regardless of input novelty—that would challenge the dual-mechanism account.
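A crude version of this test could be set up as follows; the translate function is a hypothetical wrapper around whatever model is under study, and the target-to-source length ratio serves only as a rough proxy for descriptive, paraphrastic output:

```python
# A crude operationalization of the stylistic prediction. The translate function is
# a hypothetical wrapper around the LLM under study, and the target-to-source length
# ratio is only a rough proxy for descriptive, paraphrastic output.
from statistics import mean

def translate(sentence: str) -> str:
    raise NotImplementedError("query the LLM under study here")

# Hypothetical test items: expressions likely seen in parallel form during training
# vs. nonce metaphors unlikely to have direct parallel examples.
idiom_set = ["Il pleut des cordes.", "Les carottes sont cuites."]
novel_set = ["Le silence avait un goût de cuivre.", "Sa patience fondait comme un glacier pressé."]

def mean_length_ratio(sentences):
    ratios = [len(translate(s).split()) / max(len(s.split()), 1) for s in sentences]
    return mean(ratios)

# The duality hypothesis predicts a higher ratio (longer, more explanatory output)
# for novel_set than for idiom_set:
# print(mean_length_ratio(idiom_set), mean_length_ratio(novel_set))
```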

6.4.2. Dependence on Model Scale and Data

Local vs. global learning should result in different scaling behaviors. A smaller LLM (with fewer parameters or less training data) is expected to rely more on memorized translation examples, while a larger model can compensate via broader semantic abstraction. Indeed, the ablation studies by Briakou et al. [13] (Section 6.2 above) find that removing incidental parallel data causes a dramatic performance drop in a 1B-parameter model but a smaller drop in an 8B model. We predict this trend to extend to even bigger models: as we scale to tens or hundreds of billions of parameters, translation quality will increasingly survive the removal or absence of direct parallel pairs, because larger models should be able to infer better mappings through global context. Conversely, very small models might fail to translate at all without seeing explicit examples. In other words, when a model lacks capacity for deep semantic abstraction, it needs local bilingual signals; when it has sufficient capacity, it can achieve a form of translation by aligning semantics across languages (analogous to how multilingual BERT finds a shared embedding space). A concrete empirical test here is scaling ablation: train or fine-tune a series of models of increasing size on a fixed corpus with and without parallel examples, and then measure translation fidelity. The dual-learning theory predicts an interaction: larger models should show relatively smaller gaps between the “with pairs” and “without pairs” conditions (relying on global learning to fill the gap), whereas smaller models will be crippled without explicit pairs. If instead all model sizes suffer proportionally or a large model is equally impaired by removal of parallel data, that would indicate a single-mode (or at least scale-invariant) learning process, contradicting the hypothesis.
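Schematically, the scaling-ablation design might look like this; train_model and translate_corpus are placeholders for the (expensive) training and inference machinery, and only the sacreBLEU call reflects an actual library API:

```python
# Schematic scaling-ablation harness. train_model and translate_corpus are
# placeholders for the actual training and inference machinery; only the
# sacreBLEU call reflects a real library API.
import sacrebleu

SIZES = ["1B", "5B", "10B"]                   # illustrative model scales
CONDITIONS = ["with_pairs", "without_pairs"]  # corpus with vs. without parallel data

def train_model(size: str, condition: str):
    raise NotImplementedError("train or fine-tune a model of this size on the chosen corpus")

def translate_corpus(model, sources):
    raise NotImplementedError("generate translations for the test set")

def bleu_gaps(sources, references):
    gaps = {}
    for size in SIZES:
        scores = {}
        for cond in CONDITIONS:
            model = train_model(size, cond)
            hypotheses = translate_corpus(model, sources)
            scores[cond] = sacrebleu.corpus_bleu(hypotheses, [references]).score
        gaps[size] = scores["with_pairs"] - scores["without_pairs"]
    # The duality hypothesis predicts that this gap shrinks as model size grows.
    return gaps
```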

6.4.3. Generalization to Unseen Language Pairs and Domains

Global semantic learning implies that an LLM can perform translation even between language pairs or domains it never saw aligned in training by bridging via meaning. By contrast, purely local learning would struggle in truly novel transfers. We therefore predict that zero-shot translation—especially for unusual language combinations or highly specialized topics—is powered mostly by the global mechanism. Empirically, this can be tested by looking at how well an LLM translates between languages that had no overlapping parallel data in its training set. If global learning is real, the model should still handle such pairs (perhaps by mapping each language to an English-centric semantic space, effectively pivoting internally; see Section 4.3 and Section 5.3 above). In fact, previous research on multilingual NMT showed that a single encoder–decoder model can learn to bridge languages through an implicit “interlingua” when direct pairs are absent from its training data [114]. For LLMs, we have contemporary support: the latest GPT and Claude models reportedly perform decently on language pairs like Gujarati–Swahili or Bengali–Turkish [115], even though it is very unlikely they had parallel data for those pairs; they rely on internal representations and possibly English as a latent pivot.
Recent experiments with toy transformer models trained on highly structured, knowledge-graph-based synthetic multilingual factual datasets reported in Blum et al. [76] further suggest that larger models should better resist ablations of parallel data if training mixtures suppress language-as-shortcut signals (e.g., balanced attribute frequencies, tokenization that blurs script cues).
Another area to examine is the models’ capacity for domain and/or stylistic adaptation. An LLM trained on general web text might never see, say, legal contract sentences in both French and English in the same context. Yet, globally, it learns the legal terminology and style in each language. The dual origin view predicts the model can translate legal text reasonably well (by aligning the semantics of legal expressions learned separately in French and English), but if a specific provision was present verbatim in a training bilingual document, the model might output the known translation verbatim. In evaluations, this would appear as high accuracy in common legal phrases (perhaps even outperforming human translators on consistency) but occasional weird literalness or errors on clauses that require more creative lexical adaptation (since global learning does not provide a direct template). Testing on domain-specific parallel sets that were not in the training mix (to our best knowledge) could reveal this pattern. If observed, it would further support the idea that LLMs have both a “memory” for seen translations and a more dynamic ability to translate via understanding.
In summary, treating LLMs’ translation abilities as having dual origins—part “memory-based,” part emergent alignment—provides a plausible explanation for the mixture of fluent generalization and oddly specific translations these models produce. It also aligns with what we know about scaling laws and training data: bigger models behave more “multilingually general,” and small ones cling to their few examples. The three predictions above offer avenues to verify this account. In the next section, I add some details on how one might design experiments to operationalize these predictions and more rigorously test the dual origin hypothesis.

7. Prospects for Empirical Testing

If LLMs indeed acquire translation ability via two different mechanisms, how can we verify this in more practical terms? Establishing the relative contributions of Local vs. Global learning is not straightforward because both processes are intertwined in a single pre-training outcome. Nevertheless, a combination of carefully crafted experiments, ranging from corpus ablations to model interventions and output analyses, could illuminate the translation behavior of LLMs, which, in turn, could cast more light on the relative significance of Local and Global learning. In this section, I sketch several experimental paradigms for testing the “duality” hypothesis and discuss practical considerations: data requirements, model scale, and evaluation metrics. The unifying goal is to observe differential signatures of local memorization versus global generalization in translation.

7.1. Selective Ablation of Fine-Tuning Data

One direct way to probe the origins of LLMs’ translation behavior is to selectively remove or add certain data from an LLM’s training set and examine the impact. As outlined above (Section 6.2), this “retrospective surgery” approach was pioneered by Briakou et al. [13] on PaLM’s smaller replicas in a full-scale training regime. Could it be done in a less expensive fine-tuning mode and applied to other increasingly available open-source models?
For example, one could take an existing pre-trained model, choose a new language pair, and continue training or fine-tuning the model on an additional dataset from which all parallel sentences for that pair have been ablated, and then do the same with a version of that dataset where those pairs are exclusively retained, but all the monolingual portions in the two chosen languages are removed. By comparing these two extremes—“no parallel data” vs. “only parallel data”—one could, in principle, observe differences in resulting translation performance. The dualism hypothesis predicts that a model fine-tuned without any parallel examples would still develop some translation ability in the new language pair (via global semantic alignment), but it might do so less reliably or less idiomatically for complex sentences. Conversely, a model fine-tuned only on parallel sentences would certainly learn direct mappings but might struggle on inputs that require world knowledge or cross-sentence context (since the training signal was confined to one-to-one sentence translations). If adding parallel data for, say, Russian–Japanese suddenly causes the model to produce very literal, consistent Japanese translations for Russian inputs (while translation outputs for other relevant language pairs remain unchanged), that would be a strong indication of a Local learning effect.
Such controlled fine-tuning experiments could be done on smaller proxy models first, and data could be added incrementally, in both regimes. This might be similar to classic studies in cross-lingual word embeddings where parallel data were incrementally added to chiefly monolingual training to observe alignment effects (e.g., refining mBERT’s semantic space; cf. [116]). Key data sources for real-world ablation studies include multilingual web crawls like mC4 or Common Crawl derivatives, where one can identify and remove bilingual segments (using language detectors and alignment algorithms). However, smaller corpus resources—from the latest MT benchmarks or recent real-life translation projects (e.g., the Reeve Corpus, [55])—could also be used in the fine-tuning mode. Model sizes should vary: experiments on 1B, 5B, and 10B parameter models can indicate scaling trends.
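One possible (and admittedly coarse) heuristic for locating such bilingual segments before ablation is embedding-based bitext mining, sketched below with the sentence-transformers LaBSE encoder; the similarity threshold is arbitrary, and this is not the detection pipeline used by Briakou et al.:

```python
# A coarse heuristic for locating bilingual "needles" before ablation: flag adjacent
# sentence pairs whose LaBSE embeddings are highly similar as likely translations.
# The 0.8 threshold is arbitrary; this is not Briakou et al.'s detection pipeline.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def likely_parallel(sent_a: str, sent_b: str, threshold: float = 0.8) -> bool:
    emb = encoder.encode([sent_a, sent_b], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

document = [
    "The contract shall terminate on 31 December.",
    "Le contrat prendra fin le 31 décembre.",
    "An unrelated sentence about the weather.",
]
# Keep a sentence only if it is not a likely translation of the one preceding it.
filtered = [document[0]] + [s for prev, s in zip(document, document[1:])
                            if not likely_parallel(prev, s)]
print(filtered)
```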
As for evaluation, standard automatic metrics (BLEU, chrF, COMET) would quantify overall translation quality changes, but it is equally important to perform targeted manual evaluations. For example, after an ablation, test the model on sentences it previously could translate to see if those specific mappings vanished (indicating memory erasure), and test on new, compositionally challenging sentences to gauge if general ability remains. Significant divergences between the differently ablated models—especially where one does well and the other fails (a case of “double dissociation”)—would provide evidence for dual learning pathways.
A note of caution is warranted here. The preceding considerations regarding controlled ablation strictly apply to full-scale fine-tuning, in which all model parameters are updated under the same regime as in the initial training. This process is computationally expensive, even for relatively small (1–10B parameter) open-source LLMs. In practice, highly efficient Low-Rank Adaptation (LoRA) methods have become standard for fine-tuning across many tasks, including translation (see Section 3.5, Section 4.2 and Section 5.1, and references therein). While the practical value of LoRA and related techniques is indisputable, their use in testing the duality hypothesis outlined above may be problematic. Our objective is to assess the differential impact of two types of fine-tuning data—those with and without “bilingual needles”—on the emergence of highly specific translation abilities in fine-tuned LLMs. With LoRA, however, the fine-tuning signal is diverted away from the main model weights (which remain frozen) into a set of additional low-rank matrices. This architectural change precludes informed judgments about the comparative roles of Local and Global learning, as all new learning is now channeled into the adaptation weights. This redirection may produce a “third pathway” for developing specialized translation abilities, one that merits investigation in its own right; but its relevance to the Local–Global learning dilemma remains uncertain. In light of these considerations, full-scale fine-tuning, though more resource-intensive, appears to be the more appropriate approach.

7.2. Synthetic Tasks and Controlled Probes

Another strategy is to design diagnostic tasks that disentangle local from global translation capabilities without the need for re-training or even fine-tuning. An intriguing direction is to investigate whether the distinction between local and global mechanisms re-emerges within the framework of in-context learning.
One way to approach this question is to provide the model with synthetic bilingual contexts at inference time and examine how it exploits these cues. For instance, we could feed the model a prompt that includes a few embedded parallel examples for a new language pair (e.g., a mini-glossary or several translated sentences) and then ask it to translate a novel sentence. This is akin to few-shot prompting. If local learning is a distinct operative mechanism, the model should improve or change style given those explicit examples, essentially performing cross-lingual “translation memory” look-up. On the other hand, if it relies mostly on global knowledge, a few extra examples might not change its performance much. By varying the content of these examples (literal translations vs. paraphrastic ones) and observing the output, one can infer the model’s default bias. A model heavily skewed to local learning might even mistranslate by over-following the format of a misleading example (e.g., if given a bad literal translation as a cue, it might mimic that literality). In contrast, a globally oriented model would stick closer to conveying meaning and be less perturbed by small prompt changes. Recent work on prompt-based analysis of LLM translation indicates that certain prompt formats act as “soft switches” for translation style [117]; this can be leveraged as a probe.
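The following sketch illustrates the probe; llm is a hypothetical completion function for the model under study, and the German–English glossary is invented for illustration:

```python
# Sketch of the in-context probe: compare outputs with and without a small embedded
# "translation memory". The llm function is a hypothetical completion call, and the
# German-English glossary is invented for illustration.
def llm(prompt: str) -> str:
    raise NotImplementedError("query the model under study")

glossary = [
    ("Der Vertrag ist nichtig.", "The contract is void."),
    ("Die Frist ist abgelaufen.", "The deadline has expired."),
]
test_sentence = "Die Kündigung muss schriftlich erfolgen."

zero_shot = f"Translate German to English: {test_sentence}"

few_shot = "".join(f"German: {de}\nEnglish: {en}\n\n" for de, en in glossary)
few_shot += f"German: {test_sentence}\nEnglish:"

# If Local-style lookup is operative, the few-shot output should shift toward the
# glossary's register (and even mimic a deliberately bad literal example); a
# Globally oriented model should be comparatively insensitive to these cues.
# print(llm(zero_shot)); print(llm(few_shot))
```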
Additionally, one can create controlled translation challenges: for example, craft a paragraph in language A, then present the model with a shuffled version of the same paragraph in language B (so that all sentences from A have a counterpart in B, but not in order, and maybe with distractors). Ask the model to translate language A sentences to English. A model using global alignment might pick up on the fact that the content in B includes the same ideas in another language and use it as context (like a human translator consulting a reference translation), whereas a purely local approach might ignore the jumbled B text or not realize its relevance. By measuring improvements in translation when the “reference” text in another language is present (even unordered), we can see if the model performs a kind of on-the-fly bitext mining and alignment (a hallmark of global learning). This is a bit like giving the model an open-book exam and asking afterwards: did they use the book or not? If an English LLM translating Catalan text benefits from having the same content available in Spanish concurrently, this implies that it can align meanings across languages on its own (since we did not explicitly tell it the Spanish text is a translation). Such behavior was unthinkable in older NMT systems, but with attention mechanisms and huge training, LLMs might do it implicitly.
From a model internals perspective, one can utilize techniques from interpretability research to find traces of local vs. global processing. For example, activation probing could be used to see if certain neurons or attention heads activate strongly when translating content that the model has seen before. If we had an LLM with known training data, we could take a sentence that appeared in training with its translation, present the source now, and compare the network activations to a case where we present a semantically similar sentence that never appeared in the “bilingual needles” during training. Do different heads light up? Does the model attend to different parts of its context or use different layers more intensively [118]? Tools like integrated gradients or activation patching might let us “swap in” or suppress certain computations to test their effect on translation output. (Activation patching was discussed in Section 5.4 above; see also [119]. For a recent use of integrated gradients to analyze human and machine translation outputs, see [62]). For instance, one could attempt to ablate the model’s high-level semantic representation (by intervening in middle layers) and see if it can still translate via literal mappings at lower layers or vice versa. If disabling a semantic circuit causes the model to output word-by-word translations instead of fluent sentences, that would align with the duality hypothesis (pointing to a distinct global semantic circuit). Conversely, knocking out a “memorization circuit” might degrade names and idiom translation while leaving general capability intact. These interpretability probes are speculative but increasingly feasible as researchers map circuits for specific abilities. For example, Wang et al. [120] earlier identified a circuit responsible for indirect object identification in GPT-2. A parallel effort could perhaps target a “bilingual lexicon circuit” versus a “semantic alignment circuit”.
The approaches described in Section 7.1 and Section 7.2 are briefly summarized in Table 2 below.

7.3. Human Evaluation and Error Analysis

While automatic metrics will remain central, human evaluation can be indispensable in diagnosing how a translation was produced. Professional translators or bilingual speakers could examine model outputs for signs of literalness, creativity, and consistency with the source. One experimental setup might involve a blind evaluation where annotators see pairs of translations from two versions of a model for the same input: a baseline model and one fine-tuned with selectively ablated data, as described in Section 7.1 above, or after priming the baseline model with parallel examples (Section 7.2). If annotators consistently notice differences—say, one translation is more “word-for-word” and the other more “natural”—that would reveal the dual modes. One could also ask annotators to label errors in LLM translations by type: omission, addition, literal translation, mistranslation of idiom, terminology error, and other error categories. Alternatively, one could tailor the MQM framework to these purposes [121] and experiment with different context spans—the number of preceding and/or succeeding sentences shown to the annotators [122]. If the dual learning view is correct, we might find a bimodal error distribution: some errors look like over-literalness (the model sticking too closely to source form, potentially when it recalls a parallel), whereas others look like semantic drift or hallucination (the model trying to paraphrase meaning but going astray—an over-extension of global reasoning). Such patterns have been anecdotally reported by users and in case studies: a model may, for example, insert a plausible sentence that was not in the source (a global inference gone wrong) or translate a proverb too directly (a “local memory” applied without adaptation). A systematic annotation of these phenomena across many outputs and languages would provide empirical grounding. Moreover, human experts could probe the model through interactive evaluations: for instance, start translating a sentence and see if the model can continue (testing if it aligns with a human translator’s partial output), or ask the model to explain its translation choices. If the model cites a source phrase mapping (e.g., “I translated X as Y because I have seen ‘X means Y’ before”) versus a chain of reasoning (“X is similar to an English concept Z, so I rendered it as...”), it might indirectly reveal the influence of a memorized pair vs. real-time reasoning. Of course, models can fabricate explanations, but with careful prompt engineering (or using decoder trace techniques), one might obtain useful signals.

7.4. Feasible Settings and Metrics

It is worth considering what resources and scales are required for these investigations. Full retraining of a GPT-scale model for ablation is of course out of the question, but fine-tuning experiments on smaller-scale models (up to 7B or 13B parameters) are within reach for academic researchers today. Using open-source LLMs like Llama, Mistral, OLMo [123], or GPT-OSS [124] as bases would allow extraction of their training data sources to identify inherent parallel content, which is important for designing ablation studies.
Relatedly, this approach could cast additional light on data contamination. For instance, one could compile all known parallel sentences in the RedPajama or Pile datasets and determine what portion of FLORES-200 test sets [58] might have leaked; this is similar in spirit to the controlled study of contamination by Kocyigit et al. [105], who examined how LLM evaluation on MT tasks is impacted when test examples overlap with training data. Leveraging such findings, we can intentionally create “contaminated” vs. “clean” training splits and see how translation performance differs, thereby quantifying memorization.
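Even a naive, exact-match overlap check along the following lines can give a first estimate of contamination (real studies, including Kocyigit et al. [105], use fuzzier matching):

```python
# Naive contamination check: what fraction of test-set source sentences occur
# verbatim (after whitespace/case normalization) in a training shard? Real
# contamination studies use fuzzier matching, but exact-match overlap is a start.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower().strip())

def contamination_rate(test_sentences, training_lines) -> float:
    train_set = {normalize(line) for line in training_lines}
    hits = sum(normalize(s) in train_set for s in test_sentences)
    return hits / max(len(test_sentences), 1)
```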
For automatic evaluation, string-based metrics (e.g., BLEU or chrF) can flag potential “memorized translations,” whereas neural metrics such as COMET may be more informative because they better capture adequacy as distinct from surface fluency. For measuring alignment resulting from Global learning, one might use cross-lingual retrieval tasks: e.g., compute the similarity of the model’s representations (from a certain layer) for a sentence and its translation. A model strong in global alignment should yield closer embeddings for translation pairs than for unrelated pairs. This could be evaluated by retrieval accuracy or clustering purity—an approach used, e.g., by Wang et al. [125]. In fact, earlier work on reference-free neural network-based evaluation of monolingual inter-sentential “cohesiveness” showed the effectiveness of even very simple static embeddings such as Word2Vec [126]. If an ablated model (with no parallel fine-tuning data) still clusters translations together in embedding space, that is evidence of global semantic alignment surviving without local signals. If a control model with added parallel data shows significantly tighter clustering, that quantifies the contribution of local learning. One might also compute a BLEU-difference-by-shard metric: divide a test set into segments where we believe the model had or lacked direct training exposure, and measure the quality on each. A large gap would confirm the two modes.
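As a sketch of the retrieval probe mentioned above, one could mean-pool hidden states from a chosen layer of an open model and check whether each source sentence retrieves its own translation; the model name and layer index below are examples, not recommendations:

```python
# Sketch of a cross-lingual retrieval probe: mean-pool hidden states from a chosen
# layer of an open model and check whether each source sentence retrieves its own
# translation. The model name and layer index are examples, not recommendations.
import torch
from transformers import AutoModel, AutoTokenizer

name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

def embed(sentences, layer: int = -4) -> torch.Tensor:
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer]        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)           # mean over non-padding tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

def retrieval_accuracy(sources, translations) -> float:
    sims = embed(sources) @ embed(translations).T           # cosine similarity matrix
    return (sims.argmax(dim=1) == torch.arange(len(sources))).float().mean().item()
```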
While dissecting an LLM’s training brain is complex, a suite of complementary experiments—corpus surgery, synthetic prompting, interpretability analysis, and human judgment—can collectively either reinforce the dual-origin theory or reveal a more unitary explanation. The coming years are likely to see such multi-faceted evaluation setups become standard as researchers strive to understand why these models perform as they do. The next section turns to the broader implications of the dualism hypothesis, if it holds true (or even if it merely prompts us to think differently about translation).

8. Reconceptualizing Translation in the Age of Deep Learning

The possibility that high-performing translation by LLMs arises from dual origins prompts us to rethink what “translation” even means in this context. Historically, translation has never been a monolithic, one-size-fits-all process, for either humans or machines; and the rise of deep learning amplifies this pluralism. In this section, I consider broader theoretical implications of a dual-origin translation competence, touching on issues of pluralism in translation theory, emergent linguistic abilities, and the opacity of AI systems. I also draw connections to major shifts in translation studies, to the underlying ideas in classical philosophy of translation, and to the framework of distributional semantics that underlies modern NLP.

8.1. Pluralism and Duality in Translation

If LLMs indeed translate via two pathways—local recall vs. global inference, which are differentially informed by the corresponding learning processes—that underscores a broader lesson: there may be no singular translation with a capital ‘T’. This resonates with the ethos of contemporary translation studies, which emphasizes pluralism: the idea that what counts as translation can vary with context, purpose, and agent [16,18]. Human translators adopt different strategies on the fly: sometimes translating word-for-word (e.g., for legal texts or patent materials where fidelity to terminology is key) and at other times paraphrasing or explicating (e.g., when localizing a joke, an idiom, or culturally specific content for a new audience). The dualism proposal for LLMs suggests a similar internal diversity. Rather than a single unified algorithm, the model might be juggling something akin to the age-old dichotomy of “literal vs. free translation,” but doing so at a sub-symbolic level through patterns in data. From a theoretical perspective, this aligns with pluralist views that there is no one correct way to translate; success can be achieved through multiple different approaches, even within one mind (or one model).
Interestingly, machine translation history itself has swung between extremes: from early rule-based systems (and bilingual dictionaries) versus later statistical approaches with no rules to neural systems that seek deeper semantic transfer. LLMs may be blending these paradigms: one part of the model’s learning process and the resulting behavior at inference time mimics a bilingual lexicon lookup (as rule-based systems did), while another part may leverage semantic recombination. Embracing this duality could encourage translation scholars and practitioners to see advanced MT not as a singular “mind” but as an ensemble of different competences working in concert. It also invites a comparison to human bilinguals: does a professional translator ever rely on direct recall (memory of a known translation) versus on-the-spot rephrasing? Anecdotally, yes: experienced translators often remember how certain phrases were translated in the past or consult translation memory tools (Section 2), which is essentially externalizing the local memory approach, and mix that with fresh composition. Thus, the LLM’s dual approach might be less alien to human practice than it seems, reinforcing a pluralistic epistemology of translation.
This pluralism also resonates with classic philosophical treatments of translation. On Quine’s view, multiple “translation manuals” can fit all the behavioral (and textual) evidence equally well; indeterminacy of translation is not failure but a structural feature of the process [127]. Davidson likewise frames translation as radical interpretation: recovering meaning by imposing rational coherence (his Principle of Charity) on a speaker’s utterances in context [128]. Read through this lens, the local/global duality in LLMs is a computational echo of those frameworks: “local recall” resembles a manual-like mapping, while “global alignment” resembles the Davidsonian drive toward holistic coherence across a broad web of usage. Both modes can yield adequate translations, yet neither is uniquely mandatory. One can think of it as a modern, sub-symbolic instance of pluralism anticipated in classical philosophy of translation. See [129] for a recent collection of articles exploring important questions that arise at the interface of classical translation studies and philosophy of language.

8.2. Emergent Translation Competence?

One striking aspect of LLM translation is its emergent nature. The model’s ability to translate was not pre-programmed; it arose as a side effect of scale and data. This raises a question: to what extent is translation an emergent property of any sufficiently sophisticated information-processing system? If cross-lingual alignment can be achieved by exposure to multilingual data, then perhaps translation competence in general exists implicitly in the fabric of linguistic meaning. Philosophically, this touches on the concept of a universal meaning space, an idea with a long lineage (from Leibniz’s characteristica universalis to the interlingua in MT research; see, e.g., [22], Ch. 2). Modern distributional semantics gives this idea computational form: LLMs encode knowledge in high-dimensional vectors, and when trained on many languages, those vectors naturally form clusters by meaning rather than by language (at least for high-level abstractions). Empirical studies back this up: as mentioned earlier, mBERT, for example, was found to have aligned semantic representations across languages without any parallel data, a phenomenon described as “emerging cross-lingual structure” [116]. Similarly, recent analyses show that certain LLM families have embeddings where tokens from different languages with the same meaning end up nearby, whereas other families’ embeddings cluster by language [96,130]. The fact that such structure can materialize in an artificial neural network suggests that translation is, at least partly, an emergent property of recognizing sameness of meaning across linguistic form.
Quine [127] cautions us against reifying this emergent common space as a uniquely correct “interlingua”. If translation manuals are underdetermined by evidence, then a high-dimensional distributional space may encode many near-equivalent alignments compatible with the same corpus. LLMs select among them based on their training mix and objective. A Davidsonian moral [128] might be somewhat different: successful interpretation is secured by maximizing truth and coherence across a speaker’s total (or model-internal) corpus. In practice, instruction-following and decoding objectives act as “charity-like” pressures, steering outputs toward globally coherent continuations. Thus, along one dimension of this classical philosophical framework (Quine), we should expect principled underdetermination within the model’s latent alignments; along another (Davidson), we could expect convergence in use, because the system is constantly optimized to produce coherent, truth-conducive continuations under task constraints.
However, emergence has degrees. The dualism hypothesis suggests that what has emerged in LLMs is twofold: a capacity to detect translation equivalents when they are explicitly present (a kind of emergent memory of translation pairs) and a capacity to merge knowledge from different languages into one representation (an emergent “interlingua”). This dual emergence complicates our conceptualization. It means an LLM does not always deploy a single clean language-neutral representation; sometimes, it relies on more language-specific correlations. In theoretical terms, we might say LLMs challenge the binary of literal translation vs. “transcreation” by doing both, depending on context. And they do so without specific instructions, which is unlike human translators who consciously choose strategies. This raises the following question: are we witnessing a rudimentary form of meta-cognition in the model (implicitly “deciding” between translation strategies based on input)? Or is it simply the probabilistic application of patterns? Either way, the outcome is an agent that can flexibly translate in different modes. In human terms, that would be considered a highly competent translator. So, should we credit the model with a form of translation competence? And if so, what does that competence consist of? Perhaps we need to take a cue from translation studies and broaden the definition of competence beyond “knows how to replace words in one language with words in another” (“translation as retrieval”) to “has internalized a cross-linguistic web of meaning that can be traversed in multiple ways” (“translation as inference”). This perspective connects with work in mainstream semantics, suggesting that meaning is not tied to any single language, which is an idea that translation studies pioneers like Nida [131] and linguists such as Jakobson [132] would agree with in principle, though they never imagined it instantiated in billions of weighted connections.

8.3. Opacity and Interpretability: Translation as a Black Box

Despite these considerations, one must acknowledge the overwhelming opacity of how LLMs actually translate. Even if we posit two mechanisms, they are not cleanly separable modules inside the network; they are interwoven in gazillions of parameters. Even if we agree that LLMs’ translation abilities originate from an iterative interplay between two processes, Local and Global, the task of saying “how” the model translates a given sentence remains extremely difficult, posing a familiar problem in explainable AI. For translators and linguists used to working with rule-based or even phrase-based systems, this opacity may be unsettling. It harkens back to the late 20th-century debates on whether translation equivalence can be formally defined or whether it is a fundamentally intuitive, context-bound judgment. Quine’s indeterminacy of translation again comes to mind. In a way, the computational opacity of LLMs embodies a version of his underdetermination in mechanistic form. Many distinct internal parameterizations can implement the same input–output behavior (functional equivalence up to reparameterization). With LLMs, the translation process is distributed across layers of matrix multiplications. The duality hypothesis gives us a conceptual handle, but it remains a theoretical abstraction until we can open the black box. Some efforts in interpretability, as we have seen (Section 5.4), are trying to do exactly this—identifying circuits or neurons responsible for aspects of translation. But even if we map some components, the holistic behavior might evade full detailed understanding. Does this matter? In practical terms, maybe not: if the model works, users and developers might not care whether it leverages those aspects of its abilities that were learned via route A rather than B internally. Yet, for trustworthiness and further improvement, it does matter. For instance, if we know a model is relying too much on local memory, we might worry about data bias or inappropriate translations being regurgitated (e.g., if it memorized a biased translation of a term from a specific source). Conversely, if it relies on global semantic alignment, we might be concerned about hallucinations or meaning shifts, especially for critical texts (like medical equipment manuals where a “creative” translation could be dangerous). Thus, understanding the balance between local and global learning is important for governance of MT systems.
It also connects to a deeper point: LLM translation challenges the notion of a translator’s intentionality. Human translators can explain why they chose a phrase (even if post facto, as in the older think-aloud protocol studies; see e.g., [133]). What about an LLM? We might program it to output a rationale, but that rationale is just another generated text, not an introspective report. This gap has methodological weight as it forces us to ask: can we ascribe concepts like “strategy” or “preference” to an algorithm? The dualism hypothesis suggests that we use such terms by analogy (we can say the model “leans on” one source of knowledge or another), but ultimately these are metaphors. The reality might be that the model has one huge strategy—next-token prediction—and everything else (including translation behaviors) is subsumed under that. We must be careful, then, in attributing human-like decision making to LLMs. Instead, what we are really observing is a kind of emergent effect that may only resemble having two strategies. This recognition should remind us that any reconceptualization of translation via LLMs has to grapple with the pervasive inscrutability of deep learning. Our theories (duality included) are attempts to impose understandable structure on something intrinsically complex.

8.4. Continuities with Translation Studies and Distributional Semantics

Retrospectively, it may still be enlightening to place these developments in the context of major turns in translation studies and the rise of distributional semantics in NLP. Over the decades, translation studies has seen the linguistic turn (focus on equivalence and structure), the cultural turn (focus on context, power, and ideology, as per Venuti [17] and other scholars), and, more recently, a technological turn (examining how tools and automation shape translation, cf. [19,20]). The dual-origin story intersects with all three in surprising ways. The local learning aspect of LLMs is akin to a linguistic approach: it deals with correspondences between words and phrases, almost like a probabilistic dictionary. The global learning aspect echoes the cultural/functional approach: it is about conveying meaning and intent across languages, not tied to form. Meanwhile, the very existence of LLM translators is itself a product of the technological turn, which encourages scholars to revisit longstanding assumptions. For instance, translation quality used to be discussed in terms of equivalence or acceptability, but now we see LLM outputs that are fluent yet may conceal odd source–target mapping behavior (like untranslated bits that “slip through” because the model’s training data contained code-switching). This challenges evaluators to differentiate between surface fluency and genuine fidelity. It also rekindles an old debate: Is there a universal translation method? Early MT researchers in the 1950s (following Weaver’s memorandum) searched for a universal code or interlingua; statistical MT later said “no, it is all about data: the more data, the better”; now LLMs suggest “actually, both approaches co-exist.” This is oddly reminiscent of the compromise proposed by some 20th-century translation theorists who argued for a mixed approach: that translators should sometimes be literal and sometimes free, as needed. It is as if the pendulum swung and finally settled in the middle within AI itself.
As for distributional semantics, the success of LLMs is another triumph of Firth’s famous principle that “you shall know a word by the company it keeps.” In a multilingual setting, this principle naturally leads to cross-lingual representation: words that appear in similar contexts across languages end up with similar vectors. LLMs take this to an extreme, with billions of parameters adjusting to ensure that texts in different languages that talk about the same facts yield similar internal activations. This is reminiscent of the dream of a multilingual “semantic web.” What is different in the deep learning era is that the cross-lingual web is not human-crafted but emergent and opaque. It may be imperfect; perhaps the model has a partial interlingual representation intertwined with language-specific channels. But even that possibility pushes us to reconceptualize translation: not as a discrete mapping from Language A to Language B but as a kind of lightning flash that illuminates a shared semantic ground from two sides. In practice, when we use an LLM for translation, we are invoking this lightning flash: the model’s vast interlingual knowledge igniting to connect our input and output. The dual nature of the learning process (local/global) that generated this behavior means the flash might follow one of two paths through the cloud of representations: one more direct and one more circuitous. For translation studies, this offers a new metaphor and perhaps a real model of what translation could mean when not performed by conscious agents: it is enacted equivalence within a complex system, involving an alignment of probabilities that produces a text we recognize as a translation.
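To make the distributional picture concrete, here is a minimal sketch of the kind of cross-lingual proximity at issue. It assumes the open-source sentence-transformers library and one of its multilingual checkpoints, chosen purely for illustration; neither is among the models discussed above, and any multilingual encoder would serve the same purpose.

```python
# Illustrative sketch only: semantically equivalent sentences in different
# languages tend to receive nearby vectors in a shared multilingual space.
# The sentence-transformers library and this particular checkpoint are
# assumptions made for the example, not models discussed in this paper.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The cat is sleeping on the sofa.",  # English
    "Le chat dort sur le canapé.",       # French, same meaning
    "Die Börse fiel heute stark.",       # German, unrelated meaning
]
embeddings = model.encode(sentences)
sims = cosine_similarity(embeddings)

print(f"EN-FR (same meaning):      {sims[0, 1]:.2f}")
print(f"EN-DE (different meaning): {sims[0, 2]:.2f}")
```

On encoders of this kind, the cross-lingual paraphrase pair typically scores far higher than the unrelated pair, which is the behavioral trace of the shared semantic ground described above.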
To conclude this section: the age of deep learning invites us, translators, researchers, and scholars alike, to reconceive translation beyond the traditional human-centric frameworks. We must account for multiple co-existing methods (even within one model), accept emergence and opacity as part of the game, and borrow insights from both the humanities and the computational sciences. This pluralistic, interdisciplinary understanding is not just academic; it will shape how we build and trust future translation systems and how we train human translators alongside them (e.g., teaching when to trust the “memory” of an LLM vs. when to rely on its “reasoning,” which is analogous to using a bilingual dictionary vs. paraphrasing an idea).

9. Conclusions

LLMs have astonished us with their translation abilities, performing at levels often comparable to specialized systems and human translators. In this paper, I offered a dual-origin account of how these models achieve such feats, positing that LLM translation prowess stems from two sources in pre-training: (1) Local learning from incidental parallel data (“bilingual needles in the haystack”), leading to effective memorization of specific translations and the ability to mimic translation format, and (2) Global learning from distributed semantic alignment across languages, which is an emergent capability allowing the model to infer translations from shared meaning rather than memorized pairs. We saw that neither alone is likely to explain the full picture; instead, it is their interplay that produces the remarkable breadth and quality of LLM translations. This duality is supported by initial evidence (e.g., ablation studies, output analysis) and is consistent with the scaling behavior and training conditions of LLMs. It also offers an explanation for why LLMs can translate between languages with no direct training pairs: the global semantic matrix provides the bridge.
This exploration leads to several open problems. First and foremost is the need for more direct empirical verification of the local/global split: can we find definitive signatures or neural correlates of each? I outlined experimental approaches ranging from data ablations to interpretability probes (Section 7), but these require significant effort and careful design. As LLM research progresses, a priority will be to carry out such experiments, perhaps starting with smaller models as proxies. Data transparency will be key here: researchers need better access to what LLMs have seen in training to truly link training data patterns with output behavior. A related open question is how instruction tuning and reinforcement learning post-training interact with these translation capabilities. I argued in Section 6.1 that instruction tuning likely plays a minimal role in creating translation ability, but it surely affects how the model deploys that ability (e.g., being more user-friendly or preferring one style of translation over another). Understanding this interaction—essentially, how a small dose of supervised data can modulate an emergent ability—could inform better tuning strategies for specific domains and language directions.
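To illustrate what one such experiment might look like in practice, here is a minimal sketch of the synthetic-probe idea (P2 in Table 2): compare a model’s zero-shot translations with translations produced after a few parallel exemplars are injected into the prompt. The `generate` wrapper is a hypothetical stand-in for whatever model is being probed, the chrF scorer from the sacrebleu library is one reasonable but not the only choice of metric, and English–German is an arbitrary illustrative direction.

```python
# Sketch of a P2-style probe (cf. Table 2): compare zero-shot translation with
# the same request preceded by a few parallel exemplars. `generate` is a
# hypothetical wrapper around the model under study; chrF (sacrebleu) is one
# possible metric; English-German is an arbitrary illustrative direction.
from sacrebleu.metrics import CHRF

chrf = CHRF()

def generate(prompt: str) -> str:
    """Placeholder for a call to the LLM being probed (assumption)."""
    raise NotImplementedError

def build_prompt(source: str, exemplars: list[tuple[str, str]]) -> str:
    shots = "".join(f"English: {en}\nGerman: {de}\n\n" for en, de in exemplars)
    return f"{shots}English: {source}\nGerman:"

def probe(sources: list[str], references: list[str],
          exemplars: list[tuple[str, str]]) -> tuple[float, float]:
    zero_shot = [generate(build_prompt(s, [])) for s in sources]
    few_shot = [generate(build_prompt(s, exemplars)) for s in sources]
    return (chrf.corpus_score(zero_shot, [references]).score,
            chrf.corpus_score(few_shot, [references]).score)
```

A consistent, sizable gain from the exemplars would hint that the model is being pushed into the Local, translation-memory-like mode; little or no gain would be consistent with the Global route already doing most of the work.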
Another open problem lies in extending high-quality translation to low-resource languages and domains. If the dual-origin hypothesis is correct, it suggests two routes to improvement: feed the model more parallel data for those low-resource cases (to boost local learning) and/or find ways to enhance its semantic alignment (perhaps via continued multilingual training or smarter tokenization that links languages). As we have seen (Section 4 and Section 5; see also [134]), there is ongoing work in both directions. An interesting research avenue is to see if synthetic parallel data generated by an LLM (using its global knowledge) can then be used to iteratively improve the same model, essentially bootstrapping global into local knowledge. This could help address the data scarcity issue in a virtuous cycle, but caution is needed to avoid reinforcing errors (a risk if the model’s initial translations are imperfect).
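In outline, such a bootstrapping loop might look as follows. This is a schematic sketch only: `translate`, `estimate_quality`, and `finetune` are hypothetical placeholders for a translation call, a quality-estimation filter, and a (possibly parameter-efficient) fine-tuning step, not the API of any particular system.

```python
# Schematic outline of the bootstrapping idea. `translate`, `estimate_quality`,
# and `finetune` are hypothetical placeholders (a translation call, a quality-
# estimation filter, a fine-tuning step), not a real API.

def bootstrap(model, monolingual_texts, src_lang, tgt_lang,
              rounds: int = 3, quality_threshold: float = 0.8):
    for _ in range(rounds):
        # 1. Generate candidate translations with the current model
        #    (drawing on its Global, meaning-based abilities).
        candidates = [(text, model.translate(text, src_lang, tgt_lang))
                      for text in monolingual_texts]

        # 2. Keep only pairs that pass a quality estimate, to avoid
        #    reinforcing the model's own errors.
        synthetic_parallel = [(src, hyp) for src, hyp in candidates
                              if model.estimate_quality(src, hyp) >= quality_threshold]

        # 3. Fine-tune on the filtered synthetic pairs, turning Global
        #    knowledge into Local, directly memorizable signal.
        model = model.finetune(synthetic_parallel)
    return model
```

The filtering step carries most of the weight here: without it, the loop risks amplifying exactly the imperfect initial translations mentioned above.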
From an interdisciplinary angle, there are opportunities to bridge translation studies and AI research. Translation scholars could contribute nuanced evaluation methodologies to better assess not just whether a translation is “good,” but how it is good or bad. Conversely, AI findings about LLM translation might revive old linguistic debates in a new light. Consider, for instance, the notion of equivalence: LLMs provide a testbed for theories such as dynamic or functional equivalence vs. formal equivalence in a completely different kind of “organism” (a machine). Collaboration across fields could yield richer frameworks for thinking about MT evaluation, perhaps incorporating functional adequacy, audience design, and other human-centric criteria into the loop. It may also lead to improved human-AI translation interfaces. If we know an AI has dual translation modes, a translator working with it might query it differently (“give me a literal translation first, then a rephrased one”) or debug an output by asking which words it was unsure about, effectively treating the AI as a junior translator with a vast memory and decent intuition.
Finally, we should acknowledge the evolving nature of “translation” itself. As models blur the line between translation, paraphrasing, and summarization (since all are forms of transforming text), future research might generalize these as parts of a broader capability of “language transformation.” Perhaps what we call translation in LLMs is just one facet of a more general text-to-text transformation ability that also includes rewriting in the same language, explaining in simpler terms, converting code to pseudocode, and so forth. Viewing LLMs through this lens might unify our understanding of these tasks and lead to improvements across them. However, it also raises ethical and quality considerations: a model adept at paraphrasing might inadvertently paraphrase when we wanted a strict translation, or vice versa. Thus, control becomes a crucial open issue: how do we reliably prompt or condition LLMs to follow the desired mode of translation? Research into prompt engineering and decoder constraints is ongoing but not foolproof.
In closing, the advent of LLMs has propelled machine translation into a new era—one where translation emerges spontaneously from general intelligence-like training and where the distinction between learning by example and learning by abstraction becomes intriguingly blurred. I have argued that embracing a dual-origin perspective can clarify some of the mystery: it provides a narrative that fits the data and suggests concrete ways to verify and apply this understanding. Whether this dualism stands the test of time or becomes refined into a continuum view (perhaps “mostly global with a sprinkle of local” or vice versa), it is clear that the traditional explanations for MT need updating. By bringing together insights from computational experiments, linguistic theory, and translation practice, we can better grasp the phenomenon of translation in the wild—not as a solved problem but as a continuing story of humans and machines learning to bridge languages in tandem.

Funding

This research was funded by the National Science Foundation through Grant No. SES-2336713.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

There are no data associated with this research.

Acknowledgments

The author thanks Chris Wendler, Ryan Nefdt, Michael Carl, members of the Philosophy and Cognitive Science of Deep Learning group, the audience at the AMTA 2025 Virtual: Emerging AI Breakthroughs and Challenges in Translation Automation, the students in the graduate seminar on Language Translation Technologies: Past, Present, and Future taught at the University of Georgia in Fall 2025, and the reviewers for Information for the very helpful feedback on earlier drafts.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems NIPS’14—Volume 2, Cambridge, MA, USA, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
  2. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar] [CrossRef]
  3. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
  5. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725. [Google Scholar] [CrossRef]
  6. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 66–71. [Google Scholar] [CrossRef]
  7. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ’02, Philadelphia, PA, USA, 7–12 July 2002; p. 311. [Google Scholar] [CrossRef]
  8. Ataman, D.; Birch, A.; Habash, N.; Federico, M.; Koehn, P.; Cho, K. Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems. Information 2025, 16, 723. [Google Scholar] [CrossRef]
  9. Blevins, T.; Zettlemoyer, L. Language Contamination Helps Explains the Cross-lingual Capabilities of English Pretrained Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 3563–3574. [Google Scholar] [CrossRef]
  10. Lin, X.V.; Mihaylov, T.; Artetxe, M.; Wang, T.; Chen, S.; Simig, D.; Ott, M.; Goyal, N.; Bhosale, S.; Du, J.; et al. Few-shot Learning with Multilingual Generative Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9019–9052. [Google Scholar] [CrossRef]
  11. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent Abilities of Large Language Models. arXiv 2022, arXiv:2206.07682. [Google Scholar] [CrossRef]
  12. Morris, J.X.; Sitawarin, C.; Guo, C.; Kokhlikyan, N.; Suh, G.E.; Rush, A.M.; Chaudhuri, K.; Mahloujifar, S. How much do language models memorize? arXiv 2025, arXiv:2505.24832. [Google Scholar] [CrossRef]
  13. Briakou, E.; Cherry, C.; Foster, G. Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 9432–9452. [Google Scholar] [CrossRef]
  14. Chua, L.; Ghazi, B.; Huang, Y.; Kamath, P.; Kumar, R.; Manurangsi, P.; Sinha, A.; Xie, C.; Zhang, C. Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models. arXiv 2024, arXiv:2406.16135. [Google Scholar]
  15. Gao, C.; Lin, H.; Huang, X.; Han, X.; Feng, J.; Deng, C.; Chen, J.; Huang, S. Understanding LLMs’ Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; pp. 22808–22837. [Google Scholar]
  16. Snell-Hornby, M. The Turns of Translation Studies; Benjamins Translation Library, John Benjamins Publishing Company: Amsterdam, The Netherlands; Philadelphia, PA, USA, 2006. [Google Scholar] [CrossRef]
  17. Venuti, L. Translation Changes Everything: Theory and Practice; Routledge: London, UK; New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  18. Munday, J.; Pinto, S.R.; Blakesley, J. Introducing Translation Studies: Theories and Applications, 5th ed.; Routledge: Abingdon, UK; New York, NY, USA, 2022. [Google Scholar]
  19. Rothwell, A.; Moorkens, J.; Fernández-Parra, M.; Drugan, J.; Austermühl, F. Translation Tools and Technologies, 1st ed.; Routledge Introductions to Translation and Interpreting, Routledge: Abingdon, UK; New York, NY, USA, 2023. [Google Scholar] [CrossRef]
  20. Moorkens, J.; Way, A.; Lankford, S. Automating Translation; Routledge Introductions to Translation and Interpreting, Routledge: Abingdon, UK; New York, NY, USA, 2025. [Google Scholar] [CrossRef]
  21. Balashov, Y. The Translator’s Extended Mind. Minds Mach. 2020, 30, 349–383. [Google Scholar] [CrossRef]
  22. Hutchins, W.J. Machine Translation: Past, Present, Future; Ellis Horwood Series in Computers and Their Applications; Ellis Horwood: Chichester, UK; Halsted Press: New York, NY, USA, 1986. [Google Scholar]
  23. Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed.; Prentice Hall Series in Artificial Intelligence; Pearson Prentice Hall: Upper Saddle River, NJ, USA, 2009. [Google Scholar]
  24. Sin-wai, C. The Routledge Encyclopedia of Translation Technology; Routledge: London, UK; New York, NY, USA, 2015. [Google Scholar] [CrossRef]
  25. Koehn, P. Neural Machine Translation, 1st ed.; Cambridge University Press: New York, NY, USA, 2020. [Google Scholar]
  26. Mercan, H.; Akgün, Y.; Odacıoğlu, M. The Evolution of Machine Translation: A Review Study. Int. J. Lang. Transl. Stud. 2024, 4, 104–116. [Google Scholar]
  27. Makarenko, Y. 5 Best Machine Translation Software. Crowdin Blog, 8 August 2025. Available online: https://crowdin.com/blog/best-machine-translation-software (accessed on 27 November 2025).
  28. Big Language Solutions. Why NMT Still Leads: Smarter, Faster, and Safer Translation in the Age of AI. BIG Language Solutions, 16 June 2025. Available online: https://biglanguage.com/insights/blog/why-nmt-still-leads-smarter-faster-and-safer-translation-in-the-age-of-ai/ (accessed on 27 November 2025).
  29. Kocmi, T.; Federmann, C. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; pp. 193–203. [Google Scholar]
  30. Fernandes, P.; Deutsch, D.; Finkelstein, M.; Riley, P.; Martins, A.; Neubig, G.; Garg, A.; Clark, J.; Freitag, M.; Firat, O. The Devil Is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 1066–1083. [Google Scholar] [CrossRef]
  31. Lu, Q.; Qiu, B.; Ding, L.; Zhang, K.; Kocmi, T.; Tao, D. Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 8801–8816. [Google Scholar] [CrossRef]
  32. Berger, N.; Riezler, S.; Exel, M.; Huck, M. Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), Sheffield, UK, 24–27 June 2024; pp. 636–646. [Google Scholar]
  33. Feng, Z.; Zhang, Y.; Li, H.; Wu, B.; Liao, J.; Liu, W.; Lang, J.; Feng, Y.; Wu, J.; Liu, Z. TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 3922–3938. [Google Scholar] [CrossRef]
  34. Raunak, V.; Sharaf, A.; Wang, Y.; Awadalla, H.; Menezes, A. Leveraging GPT-4 for Automatic Translation Post-Editing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 12009–12024. [Google Scholar] [CrossRef]
  35. Ki, D.; Carpuat, M. Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 4253–4273. [Google Scholar] [CrossRef]
  36. Alves, D.M.; Pombal, J.; Guerreiro, N.M.; Martins, P.H.; Alves, J.; Farajian, A.; Peters, B.; Rei, R.; Fernandes, P.; Agrawal, S.; et al. Tower: An Open Multilingual Large Language Model for Translation-Related Tasks. arXiv 2024, arXiv:2402.17733. [Google Scholar] [CrossRef]
  37. Ghazvininejad, M.; Gonen, H.; Zettlemoyer, L. Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation. arXiv 2023, arXiv:2302.07856. [Google Scholar] [CrossRef]
  38. Rios, M. Instruction-tuned Large Language Models for Machine Translation in the Medical Domain. In Proceedings of the Machine Translation Summit XX: Volume 1, Geneva, Switzerland, 23–27 June 2025; pp. 162–172. [Google Scholar]
  39. Sia, S.; Duh, K. In-context Learning as Maintaining Coherency: A Study of On-the-fly Machine Translation Using Large Language Models. In Proceedings of the Machine Translation Summit XIX, Volume 1: Research Track, Macau, China, 4–8 September 2023; pp. 173–185. [Google Scholar]
  40. Zheng, J.; Hong, H.; Liu, F.; Wang, X.; Su, J.; Liang, Y.; Wu, S. Fine-tuning Large Language Models for Domain-specific Machine Translation. arXiv 2024, arXiv:2402.15061. [Google Scholar] [CrossRef]
  41. Moslem, Y.; Haque, R.; Kelleher, J.D.; Way, A. Adaptive Machine Translation with Large Language Models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; pp. 227–237. [Google Scholar]
  42. Moslem, Y. Language Modelling Approaches to Adaptive Machine Translation. arXiv 2024, arXiv:2401.14559. [Google Scholar] [CrossRef]
  43. Vieira, I.; Allred, W.; Lankford, S.; Castilho, S.; Way, A. How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Chicago, IL, USA, 28 September–2 October 2024; pp. 236–249. [Google Scholar]
  44. Ding, Q.; Cao, H.; Feng, Z.; Yang, M.; Zhao, T. Enhancing bilingual lexicon induction via harnessing polysemous words. Neurocomputing 2025, 611, 128682. [Google Scholar] [CrossRef]
  45. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  46. Chen, P.; Guo, Z.; Haddow, B.; Heafield, K. Iterative Translation Refinement with Large Language Models. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), Sheffield, UK, 24–27 June 2024; pp. 181–190. [Google Scholar]
  47. He, Z.; Liang, T.; Jiao, W.; Zhang, Z.; Yang, Y.; Wang, R.; Tu, Z.; Shi, S.; Wang, X. Exploring Human-Like Translation Strategy with Large Language Models. Trans. Assoc. Comput. Linguist. 2024, 12, 229–246. [Google Scholar] [CrossRef]
  48. Briakou, E.; Luo, J.; Cherry, C.; Freitag, M. Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1301–1317. [Google Scholar] [CrossRef]
  49. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 11324–11436. [Google Scholar]
  50. Jiao, W.; Wang, W.; Huang, J.t.; Wang, X.; Tu, Z. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv 2023, arXiv:2301.08745. [Google Scholar] [CrossRef]
  51. Vilar, D.; Freitag, M.; Cherry, C.; Luo, J.; Ratnakar, V.; Foster, G. Prompting PaLM for Translation: Assessing Strategies and Performance. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 15406–15427. [Google Scholar] [CrossRef]
  52. Peng, K.; Ding, L.; Zhong, Q.; Shen, L.; Liu, X.; Zhang, M.; Ouyang, Y.; Tao, D. Towards Making the Most of ChatGPT for Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 5622–5633. [Google Scholar] [CrossRef]
  53. Zhang, B.; Haddow, B.; Birch, A. Prompting Large Language Model for Machine Translation: A Case Study. arXiv 2023, arXiv:2301.07069. [Google Scholar] [CrossRef]
  54. Lyu, C.; Xu, J.; Wang, L. New Trends in Machine Translation using Large Language Models: Case Examples with ChatGPT. arXiv 2023, arXiv:2305.01181. [Google Scholar] [CrossRef]
  55. Balashov, Y.; Balashov, A.; Koski, S.F. Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations. In Proceedings of the Machine Translation Summit XX, Geneva, Switzerland, 23–27 June 2025; Bouillon, P., Gerlach, J., Girletti, S., Volkart, L., Rubino, R., Sennrich, R., Farinha, A.C., Gaido, M., Daems, J., Kenny, D., et al., Eds.; European Association for Machine Translation: Geneva, Switzerland, 2025; Volume 1, pp. 538–565. [Google Scholar]
  56. Zhang, X.; Rajabi, N.; Duh, K.; Koehn, P. Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QLoRA. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 468–481. [Google Scholar] [CrossRef]
  57. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLORA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
  58. Team, N.; Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672. [Google Scholar] [CrossRef]
  59. Zhu, W.; Liu, H.; Dong, Q.; Xu, J.; Huang, S.; Kong, L.; Chen, J.; Li, L. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 2765–2781. [Google Scholar] [CrossRef]
  60. Hendy, A.; Abdelrehim, M.; Sharaf, A.; Raunak, V.; Gabr, M.; Matsushita, H.; Kim, Y.J.; Afify, M.; Awadalla, H.H. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv 2023, arXiv:2302.09210. [Google Scholar] [CrossRef]
  61. Sinitsyna, D. Generative AI for Translation in 2025. Intento, 30 March 2025. Available online: https://inten.to/blog/generative-ai-for-translation-in-2025/ (accessed on 27 November 2025).
  62. Sizov, F.; España-Bonet, C.; Van Genabith, J.; Xie, R.; Dutta Chowdhury, K. Analysing Translation Artifacts: A Comparative Study of LLMs, NMTs, and Human Translations. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1183–1199. [Google Scholar] [CrossRef]
  63. Kong, M.; Fernandez, A.; Bains, J.; Milisavljevic, A.; Brooks, K.C.; Shanmugam, A.; Avilez, L.; Li, J.; Honcharov, V.; Yang, A.; et al. Evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions: A comparative analysis. BMJ Qual. Saf. 2025. [Google Scholar] [CrossRef]
  64. Meta. The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation. 2025. Available online: https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (accessed on 27 November 2025).
  65. Cui, M.; Gao, P.; Liu, W.; Luan, J.; Wang, B. Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 5420–5443. [Google Scholar] [CrossRef]
  66. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  67. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  68. Castilho, S.; Mallon, C.Q.; Meister, R.; Yue, S. Do online Machine Translation Systems Care for Context? What About a GPT Model? In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; pp. 393–417. [Google Scholar]
  69. Savenkov, K. GPT-3 Translation Capabilities. 2023. Available online: https://inten.to/blog/gpt-3-translation-capabilities/ (accessed on 27 November 2025).
  70. Garcia, X.; Bansal, Y.; Cherry, C.; Foster, G.; Krikun, M.; Feng, F.; Johnson, M.; Firat, O. The unreasonable effectiveness of few-shot learning for machine translation. arXiv 2023, arXiv:2302.01398. [Google Scholar] [CrossRef]
  71. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  72. Jiao, W.; Huang, J.t.; Wang, W.; He, Z.; Liang, T.; Wang, X.; Shi, S.; Tu, Z. ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 15009–15020. [Google Scholar] [CrossRef]
  73. Lu, Q.; Ding, L.; Zhang, K.; Zhang, J.; Tao, D. MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 5570–5587. [Google Scholar]
  74. Treviso, M.V.; Guerreiro, N.M.; Agrawal, S.; Rei, R.; Pombal, J.; Vaz, T.; Wu, H.; Silva, B.; Stigt, D.V.; Martins, A. xTower: A Multilingual LLM for Explaining and Correcting Translation Errors. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 15222–15239. [Google Scholar] [CrossRef]
  75. Zebaze, A.R.; Sagot, B.; Bawden, R. Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 4–9 November 2025; pp. 22328–22357. [Google Scholar]
  76. Blum, C.; Filippova, K.; Yuan, A.; Ghandeharioun, A.; Zimmert, J.; Zhang, F.; Hoffmann, J.; Linzen, T.; Wattenberg, M.; Dixon, L.; et al. Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics. arXiv 2025, arXiv:2508.11017. [Google Scholar] [CrossRef]
  77. Ravisankar, K.; Han, H.; Carpuat, M. Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs. arXiv 2025, arXiv:2504.09378. [Google Scholar] [CrossRef]
  78. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Chang, B.; et al. A Survey on In-context Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 1107–1128. [Google Scholar] [CrossRef]
  79. Sia, S.; Mueller, D.; Duh, K. Where does In-context Translation Happen in Large Language Models. arXiv 2024, arXiv:2403.04510. [Google Scholar] [CrossRef]
  80. Wen-Yi, A.W.; Mimno, D. Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 1124–1131. [Google Scholar] [CrossRef]
  81. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [Google Scholar] [CrossRef]
  82. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  83. Hua, T.; Yun, T.; Pavlick, E. mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models? In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 1585–1598. [Google Scholar] [CrossRef]
  84. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. arXiv 2024, arXiv:2210.13382. [Google Scholar] [CrossRef]
  85. Artetxe, M.; Labaka, G.; Agirre, E. Unsupervised Statistical Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3632–3642. [Google Scholar] [CrossRef]
  86. Søgaard, A.; Ruder, S.; Vulić, I. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 June 2018; pp. 778–788. [Google Scholar] [CrossRef]
  87. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  88. Wendler, C.; Veselovsky, V.; Monea, G.; West, R. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 15366–15394. [Google Scholar] [CrossRef]
  89. Nostalgebraist. Interpreting GPT: The Logit Lens. 2020. Available online: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (accessed on 27 November 2025).
  90. Liu, S.; Lyu, C.; Wu, M.; Wang, L.; Luo, W.; Zhang, K.; Shang, Z. New Trends for Modern Machine Translation with Large Reasoning Models. arXiv 2025, arXiv:2503.10351. [Google Scholar] [CrossRef]
  91. Chen, A.; Song, Y.; Zhu, W.; Chen, K.; Yang, M.; Zhao, T.; Zhang, M. Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis. arXiv 2025, arXiv:2502.11544. [Google Scholar] [CrossRef]
  92. Schut, L.; Gal, Y.; Farquhar, S. Do Multilingual LLMs Think In English? arXiv 2025, arXiv:2502.15603. [Google Scholar] [CrossRef]
  93. Wang, M.; Adel, H.; Lange, L.; Liu, Y.; Nie, E.; Strötgen, J.; Schuetze, H. Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 5075–5094. [Google Scholar] [CrossRef]
  94. Lim, Z.W.; Aji, A.F.; Cohn, T. Language-Specific Latent Process Hinders Cross-Lingual Performance. arXiv 2025, arXiv:2505.13141. [Google Scholar] [CrossRef]
  95. Dumas, C.; Wendler, C.; Veselovsky, V.; Monea, G.; West, R. Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 31822–31841. [Google Scholar] [CrossRef]
  96. Lindsey, J.; Gurnee, W.; Ameisen, E.; Chen, B.; Pearce, A.; Turner, N.L.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; et al. On the Biology of a Large Language Model. Transform. Circuits Thread. 2025. Available online: https://transformer-circuits.pub/2025/attribution-graphs/biology.html (accessed on 27 November 2025).
  97. Partee, B. Compositionality. In Varieties of Formal Semantics: Proceedings of the Fourth Amsterdam Colloquium; Foris: Dordrecht, The Netherlands, 1984; pp. 281–311. [Google Scholar]
  98. Fodor, J. The Language of Thought; Harvard University Press: Cambridge, MA, USA, 1975. [Google Scholar]
  99. Zhang, R.; Yu, Q.; Zang, M.; Eickhoff, C.; Pavlick, E. The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling. arXiv 2024, arXiv:2410.09223. [Google Scholar] [CrossRef]
  100. Singh, S.; Vargus, F.; D’souza, D.; Karlsson, B.; Mahendiran, A.; Ko, W.Y.; Shandilya, H.; Patel, J.; Mataciunas, D.; O’Mahony, L.; et al. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 11521–11567. [Google Scholar] [CrossRef]
  101. Reynolds, L.; McDonell, K. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. arXiv 2021, arXiv:2102.07350. [Google Scholar] [CrossRef]
  102. Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395. [Google Scholar] [CrossRef]
  103. Kreutzer, J.; Caswell, I.; Wang, L.; Wahab, A.; van Esch, D.; Ulzii-Orshikh, N.; Tapo, A.; Subramani, N.; Sokolov, A.; Sikasote, C.; et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Trans. Assoc. Comput. Linguist. 2022, 10, 50–72. [Google Scholar] [CrossRef]
  104. Thompson, B.; Dhaliwal, M.; Frisch, P.; Domhan, T.; Federico, M. A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 1763–1775. [Google Scholar] [CrossRef]
  105. Kocyigit, M.Y.; Briakou, E.; Deutsch, D.; Luo, J.; Cherry, C.; Freitag, M. Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination’s Impact on Machine Translation. arXiv 2025, arXiv:2501.18771. [Google Scholar] [CrossRef]
  106. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  107. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting Similarities among Languages for Machine Translation. arXiv 2013, arXiv:1309.4168. [Google Scholar] [CrossRef]
  108. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 1977, 39, 1–22. [Google Scholar] [CrossRef]
  109. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  110. Brown, P.F.; Della Pietra, S.A.; Della Pietra, V.J.; Mercer, R.L. The Mathematics of Statistical Machine Translation: Parameter Estimation. Comput. Linguist. 1993, 19, 263–311. [Google Scholar]
  111. Bureau Works. We Tested Chat GPT for Translation—Here’s the Data. Available online: https://www.bureauworks.com/blog/chatgpt-for-translation (accessed on 27 November 2025).
  112. Tang, K.; Song, P.; Qin, Y.; Yan, X. Creative and Context-Aware Translation of East Asian Idioms with GPT-4. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 9285–9305. [Google Scholar] [CrossRef]
  113. Yan, J.; Yan, P.; Chen, Y.; Li, J.; Zhu, X.; Zhang, Y. Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels. arXiv 2024, arXiv:2411.13775. [Google Scholar] [CrossRef]
  114. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef]
  115. Enis, M.; Hopkins, M. From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. arXiv 2024, arXiv:2404.13813. [Google Scholar] [CrossRef]
  116. Conneau, A.; Wu, S.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging Cross-lingual Structure in Pretrained Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6022–6034. [Google Scholar] [CrossRef]
  117. Aldosari, L.A.; Altuwairesh, N. Assessing the effects of translation prompts on the translation quality of GPT-4 Turbo using automated and human evaluation metrics: A case study. Perspectives 2025, 1–25. [Google Scholar] [CrossRef]
  118. Zhang, H.; Chen, K.; Bai, X.; Li, X.; Xiang, Y.; Zhang, M. Exploring Translation Mechanism of Large Language Models. arXiv 2025, arXiv:2502.11806. [Google Scholar] [CrossRef]
  119. Heimersheim, S.; Nanda, N. How to use and interpret activation patching. arXiv 2024, arXiv:2404.15255. [Google Scholar] [CrossRef]
  120. Wang, K.R.; Variengien, A.; Conmy, A.; Shlegeris, B.; Steinhardt, J. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  121. Lommel, A.; Gladkoff, S.; Melby, A.; Wright, S.E.; Strandvik, I.; Gasova, K.; Vaasa, A.; Benzo, A.; Marazzato Sparano, R.; Foresi, M.; et al. The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations), Chicago, IL, USA, 30 September–2 October 2024; pp. 75–94. [Google Scholar]
  122. Castilho, S.; Knowles, R. A survey of context in neural machine translation and its evaluation. Nat. Lang. Process. 2025, 31, 986–1016. [Google Scholar] [CrossRef]
  123. OLMo, T.; Walsh, P.; Soldaini, L.; Groeneveld, D.; Lo, K.; Arora, S.; Bhagia, A.; Gu, Y.; Huang, S.; Jordan, M.; et al. 2 OLMo 2 Furious. arXiv 2025, arXiv:2501.00656. [Google Scholar] [CrossRef]
  124. OpenAI. Introducing GPT-OSS. 2025. Available online: https://openai.com/index/introducing-gpt-oss/ (accessed on 27 November 2025).
  125. Wang, Y.; Wu, A.; Neubig, G. English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9122–9133. [Google Scholar] [CrossRef]
  126. Elmakias, I.; Vilenchik, D. An Oblivious Approach to Machine Translation Quality Estimation. Mathematics 2021, 9, 2090. [Google Scholar] [CrossRef]
  127. Quine, W.V.O. Word and Object; MIT Press: Cambridge, MA, USA, 1960. [Google Scholar]
  128. Davidson, D. Radical Interpretation. Dialectica 1973, 27, 313–328. [Google Scholar] [CrossRef]
  129. Rawling, P.; Wilson, P. (Eds.) The Routledge Handbook of Translation and Philosophy; Routledge Handbooks in Translation and Interpreting Studies; Routledge: Abingdon, UK; New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  130. Brinkmann, J.; Wendler, C.; Bartelt, C.; Mueller, A. Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 6131–6150. [Google Scholar] [CrossRef]
  131. Nida, E. Toward a Science of Translating: With Special Reference to Principles and Procedures Involved in Bible Translating; E.J. Brill: Leiden, The Netherlands, 1964. [Google Scholar]
  132. Jakobson, R. On Linguistic Aspects of Translation. In On Translation; Brower, R., Ed.; MIT Press: Cambridge, MA, USA, 1959; pp. 232–239. [Google Scholar]
  133. Bernardini, S. Think-aloud protocols in translation research: Achievements, limits, future prospects. Target. Int. J. Transl. Stud. 2001, 13, 241–263. [Google Scholar] [CrossRef]
  134. Xu, Y.; Hu, L.; Zhao, J.; Qiu, Z.; Xu, K.; Ye, Y.; Gu, H. A survey on multilingual large language models: Corpora, alignment, and bias. Front. Comput. Sci. 2025, 19, 1911362. [Google Scholar] [CrossRef]
Figure 1. Brief history of translation technologies.
Figure 2. NMT architecture.
Figure 3. LLM architecture.
Figure 4. Llama-2 7B translates ‘fleur’ to ‘花’ by seemingly pivoting through ‘flower’ in intermediate layers. Figure 1 in Wendler et al. [88].
Figure 5. Figure 1 in Dumas et al. [95]. A hidden state from the source prompt (red, left) at a certain layer is inserted into the forward pass of the target prompt (blue, right) at the corresponding layer. The source prompt here provides an example translation (German ‘Dorf’ → Italian ‘villaggio’) and then asks to translate ‘Buch’ (book) to Italian, while the target prompt asks to translate French ‘citron’ (lemon) to Chinese. By patching the red latent into the blue context at different layers, the researchers observed changes in the predicted translation. Early-layer patching yields the correct target concept in the target language (e.g., ‘柠檬’ for lemon in Chinese), mid-layer patching yields the correct concept but in the wrong language (‘limone’ = lemon in Italian), and late-layer patching yields the wrong concept in the source language (‘libro’ = book in Italian).
Figure 6. Three targeted interventions on Claude’s multilingual circuit (Figure 18 in [96]). Left: Operation Swap—replacing the antonym operation features with synonym features (while the input still asks for “opposite”) causes the model to output ‘little’ (a synonym of ‘small’) instead of the antonym ‘big’. This shows that toggling the internal operation from “find opposite” to “find similar” yields a corresponding change in output. Center: Operand Swap—replacing the features encoding the concept small with those for hot causes the model to respond with ‘cold’ (the antonym of ‘hot’). The prompt remains “The opposite of ‘small’ is…”, but internally, the model now treats ‘small’ as if it were ‘hot’, demonstrating that the content of the adjective can be swapped independently of the prompt’s wording. Right: Language Swap—replacing English-language context features with Chinese ones makes the model output ‘大’ (Chinese for ‘big’) for the English prompt. The model still found the concept big as the opposite of small but expressed it in a different language. In each case, only one aspect of the computation is changed (function, content, or language), and the model’s behavior changes in the predicted way, confirming that these internal features indeed govern those distinct components of the translation task.
Table 1. Translation with LLMs: some strategies.

LoRA/QLoRA Fine-Tuning
Best: When a stable, repeated translation task exists (e.g., for a specific client or domain with terminology constraints), modest compute is available, and privacy allows holding small parallel sets.
Strengths: Durable gains, controllable style and terminology.
Costs/risks: Setup time, evaluation/monitoring, maintaining adapters per domain/language.
To be avoided: When the task is one-off, data extremely scarce, or latency and storage constraints preclude adapters.

Chain-of-Thought Prompting for Translation Tasks
Best: When inputs are long or ambiguous, require disambiguation, terminology planning, or document-level coherence, and when one can afford a few extra tokens.
Strengths: Improves factual grounding, consistency, and error visibility; can guide pivoting or term choices explicitly.
Costs/risks: Higher latency and longer outputs; diminishing returns on simple sentences.
To be avoided: For very short, formulaic content or when token budgets are strict.

Explicit Pivot Translation
Best: When the direct pair is low-resource but the source ↔ pivot and pivot ↔ target directions are strong (typically via English) and errors are mostly lexical rather than cultural or stylistic.
Strengths: Leverages high-resource links; simple to implement; useful as a fallback.
Costs/risks: Compounding errors, loss of nuance, exposure to pivot-language biases, and the need to mitigate these effects with targeted post-editing or terminology constraints.
To be avoided: When the direct pair suffices; when the match to the target cannot tolerate ambiguity.
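For the last of these strategies (Explicit Pivot Translation), the mechanics are simple enough to spell out. The sketch below routes a low-resource pair through an English pivot in two calls; `llm` is a hypothetical text-generation callable standing in for any chat or completion API, and the prompts are illustrative only.

```python
# Minimal sketch of explicit pivot translation: source -> English pivot ->
# target in two calls. `llm` is a hypothetical text-generation callable
# standing in for any chat/completions API; prompts are illustrative only.

def pivot_translate(text: str, src: str, tgt: str, llm, pivot: str = "English") -> str:
    pivot_text = llm(f"Translate the following {src} text into {pivot}:\n\n{text}")
    return llm(f"Translate the following {pivot} text into {tgt}:\n\n{pivot_text}")
```

Because errors compound across the two hops, output of this kind would normally be checked against terminology constraints or post-edited, as the “Costs/risks” entry indicates.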
Table 2. Operationalization of the dual learning hypotheses via targeted predictions (P1–P3).

P1: Data manipulation. Fine-tuning on data with no parallel segments versus data containing only parallel segments should differentially affect translation style and generalization. Systematic double dissociations in performance would support the existence of dual pathways. Evaluation: standard automatic metrics (BLEU/chrF/COMET) plus targeted manual checks.

P2: Synthetic probes. In in-context learning settings, injecting a few parallel exemplars into the prompt should push the model into a more “Local/TM-like” mode, while shuffled cross-lingual contexts should help only if the model performs Global alignment “on the fly.”

P3: Mechanistic signatures. Mid-layer “task/meaning” circuits should separate from language-context circuits. Intervening on each (via techniques such as activation patching or related tools) should selectively degrade either “Global semantics” or “Local lexicon/memorization.”
