1. Introduction
Machine translation is a transformative technology that bridges communication gaps between speakers of different languages. Its ability to automatically translate text or speech from one language to another has significant implications for global communication, commerce, and information exchange. Despite its advancements, machine translation remains a complex and challenging task, primarily due to the requirement for vast amounts of high-quality bilingual data. This necessity for extensive datasets is driven by the need to capture and model the nuances of different languages, including their syntax, semantics, and context.
Obtaining sufficient parallel data for every language pair is often both challenging and costly. High-resource languages, such as English, Spanish, and Chinese, benefit from large corpora of bilingual text due to extensive digital resources and widespread use. In contrast, low-resource languages often suffer from a lack of such data, limiting the effectiveness of traditional machine translation models. This disparity in data availability has prompted researchers to explore alternative approaches to improve translation performance for low-resource languages.
One promising approach is transfer learning, which leverages knowledge gained from high-resource languages to enhance translation models for low-resource languages. Transfer learning involves adapting models trained on well-resourced languages to new language pairs by utilizing pre-existing knowledge and data. This technique capitalizes on the idea that linguistic similarities and patterns learned from high-resource languages can be transferred to improve translations in less resourced ones.
In recent years, transfer learning has gained traction as a method to address data scarcity issues in machine translation. By utilizing large datasets from high-resource languages or data from related languages, researchers can improve the performance of translation models for languages with limited parallel data. The effectiveness of transfer learning is influenced by various factors, including the volume of data and the linguistic relatedness of the languages involved.
Data volume plays a crucial role in transfer learning. Large datasets provide a more comprehensive representation of the language, allowing models to learn more nuanced and diverse patterns. This abundance of data can significantly enhance a model’s ability to generalize and perform well on new, unseen data. For instance, Zoph et al. [1] demonstrated that using extensive data from high-resource languages leads to notable improvements in translation performance for low-resource languages. The increased data exposure helps the model capture a wider range of linguistic phenomena, ultimately resulting in more accurate translations.
Language relatedness also plays a vital role in transfer learning. Languages that are closely related or belong to the same language family often share common linguistic features such as grammatical structures, vocabulary, and syntax. These similarities can be exploited to enhance translation performance. Research by Cotterell and Heigold [2] highlighted that related languages, such as Spanish and Portuguese or Czech and Slovak, benefit more from transfer learning than unrelated language pairs. The shared linguistic background allows models to leverage common patterns and knowledge, improving translation quality.
Transfer learning is particularly valuable in scenarios where parallel data is limited, such as in zero-shot and few-shot translation tasks. Zero-shot translation refers to translating between language pairs for which no parallel data exists, while few-shot translation involves translating with minimal data. Related languages can provide a crucial advantage in these scenarios by facilitating knowledge transfer and enhancing model performance despite the lack of extensive parallel data.
This paper makes a focused empirical contribution to the study of transfer learning in low-resource neural machine translation. Specifically, we analyze how the effectiveness of transfer languages interacts with the amount of available target-language supervision in a controlled Polish–English mBART setting. While prior work has established that both data volume and language relatedness influence transfer performance, our results show that their effects are not static across training regimes. Instead, transfer-language effectiveness depends strongly on the level of target-language data available for adaptation.
Our main empirical finding is a regime-dependent reversal in transfer-language rankings. In strict zero-shot conditions, Czech provides the strongest transfer performance. However, as Polish supervision increases, this advantage diminishes, and at higher shot levels Russian and German become competitive with, and in some cases surpass, Czech. In addition, we evaluate a mixed-parent Czech + Russian configuration and show that multi-source transfer is consistently beneficial relative to the no-transfer baseline, but does not uniformly outperform the best single-parent model. Taken together, these results position the paper as a case study of transfer-language choice under different supervision regimes, rather than as a general theory of transfer learning across language pairs and models.
2. Previous Research
2.1. Few-Shot and Zero-Shot Machine Translation
Few-shot and zero-shot learning are crucial advancements in the field of machine translation, enabling models to translate between languages with minimal or no direct training data for certain language pairs. These approaches are particularly valuable for handling low-resource languages and building more efficient multilingual translation systems.
Zero-shot learning in machine translation allows a model to translate between language pairs it never encountered during training. This capability is typically achieved by training a multilingual neural machine translation model on multiple language pairs, creating a shared language space where languages are represented in a way that enables cross-lingual transfer. A landmark study by Johnson et al. [3] introduced the concept of zero-shot translation by employing a “language token” that guides the model towards the target language. While this approach demonstrated the feasibility of zero-shot translation, the initial translations were often of lower quality than those produced with direct training data, primarily due to insufficient alignment between languages within the shared space. Subsequent research has focused on enhancing this alignment to improve translation quality in zero-shot scenarios. Zhang et al. [4] showed that massively multilingual neural machine translation models often underperform bilingual models, particularly in zero-shot translation. They addressed the issue by increasing modeling capacity with language-specific components and deeper architectures, and by tackling off-target translations with random online backtranslation, which improved zero-shot translation quality as measured by BLEU.
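The language-token idea described above can be illustrated with a minimal sketch. It assumes the `<2xx>` token format associated with Johnson et al.; the helper name `add_target_token` and the example sentences are hypothetical and serve only as illustration.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token that tells the model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"


# Hypothetical examples: the same model receives different target tokens.
examples = [
    ("Dzień dobry", "en"),   # Polish source, English requested
    ("Good morning", "es"),  # English source, Spanish requested
]
tagged = [add_target_token(sentence, lang) for sentence, lang in examples]

for line in tagged:
    print(line)
```

Because the token is part of the input sequence, a single shared model can be steered towards a target language it has never been paired with for a given source language, which is what makes zero-shot requests possible in the first place.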
Chen et al. [5,6] introduced SixT and its improved version, SixT+. Unlike previous approaches that focus on natural language understanding or on improving supervised translation with BERT, SixT explores how a multilingual pretrained encoder (MPE) can enhance cross-lingual transfer in NMT. The model is trained on only one language pair alongside an off-the-shelf MPE and then tested directly on unseen languages. SixT employs a two-stage training strategy, a position-disentangled encoder, and an enhanced decoder, leading to significant improvements over mBART and other state-of-the-art models.
Recently, Mullov et al. [7] proposed a method that decouples vocabulary and syntax learning, using cross-lingual word embeddings and freezing word representations during translation training. They demonstrated that this approach enables successful zero-shot translation from languages not seen during training, achieving notable BLEU scores for Portuguese–English and Russian–English on the TED domain. Additionally, they showed that their decoupled learning strategy is effective for unsupervised machine translation, achieving results close to a supervised setting through iterative backtranslation.
Few-shot learning in MT involves fine-tuning a pretrained multilingual model on a new language pair with a small amount of parallel data. This approach allows the model to adapt quickly to the new task, often yielding better results than purely zero-shot methods. Studies such as those by Neubig and Hu [8] have shown that even with limited data, these models can achieve significant performance improvements. Garcia et al. [9] showed that with just a few high-quality translation examples at inference, a model trained through self-supervised learning matches or outperforms specialized state-of-the-art translation systems; their study also highlights the importance of the quality of the few-shot examples. In addition, Zhao et al. [10] experimented with up to 40 languages and six NLP tasks and showed that few-shot cross-lingual transfer is highly sensitive to the selection of the few shots.
Cross-lingual transfer learning techniques have been extensively researched to improve zero-shot and few-shot machine translation. These techniques involve training models on high-resource languages and fine-tuning them on related low-resource languages, allowing the models to transfer learned linguistic patterns across languages. Massively multilingual models like Google’s mT5 and Facebook’s M2M-100 and mBART have been developed to handle hundreds of languages, designed specifically to excel in zero-shot and few-shot scenarios by leveraging shared linguistic features across a wide range of languages.
More recently, large language models (LLMs) and retrieval-augmented generation (RAG) have opened an additional design space that is relevant to low-resource machine translation, even though this paper does not evaluate such systems directly. In a graph-based RAG setting, TrumorGPT combines an LLM with retrieved graph-structured knowledge for grounded generation and reasoning [11]. In a different paradigm, Conger et al. propose contextual RAG with prompt-based guidance, where retrieval is conditioned not only on the query but also on task context to improve answer grounding [12]. Although these studies are not machine translation papers, they illustrate two retrieval strategies, graph-based retrieval and contextual retrieval with prompting, that could inspire future extensions of low-resource translation systems. For example, an LLM-based translation framework could dynamically retrieve multilingual sentence pairs, translation memories, or aligned document fragments at inference time to improve zero-shot and few-shot translation quality in ways that are not limited to linguistic relatedness alone.
2.2. Data Volume
Data volume is a pivotal factor in transfer learning for machine translation, as it typically leads to improved model performance. When a model has access to a larger dataset, it can learn from a more diverse array of examples, which enhances its ability to generalize and perform well on unseen data. This increased exposure allows the model to better understand the nuances of the languages involved, resulting in more accurate and fluent translations.
Numerous studies support the notion that greater data quantity enhances transfer learning efficacy. For instance, Zoph et al. [1] explored the use of extensive data from high-resource languages to enhance machine translation models for low-resource languages. Their findings indicate that leveraging large datasets from high-resource languages can lead to substantial improvements in translation performance for languages with limited data. This suggests that more data provides a richer learning environment, enabling models to capture and generalize better from the underlying patterns in the data.
Koehn and Knowles [13] further investigated the impact of data volume across various language pairs and confirmed that increased data generally results in better performance. However, they also noted that the incremental benefits of additional data diminish as the dataset grows larger. This finding implies that while more data is advantageous, there is a point where the marginal gains in performance become less significant.
The size of the source corpus has been highlighted as a critical factor in transfer learning. Kocmi and Bojar [14] found that the size of the source corpus could be more influential than the relatedness of the languages involved. Their research showed that transfer from Czech to Estonian could sometimes outperform transfer from the more closely related Finnish to Estonian, emphasizing the importance of data quantity over linguistic similarity in some cases.
Moreover, employing multiple languages as source data for transfer learning has been shown to further enhance performance. Research by Maimaiti et al. [15] and Chen et al. [16] demonstrated that multi-source cross-lingual transfer is highly effective for machine translation tasks. By integrating data from several languages, models can benefit from a broader range of linguistic patterns and structures, which contributes to improved translation quality.
On the other hand, a paper by Mandal et al. [17] introduced two methods for efficiently selecting training data for human translation, aimed at improving statistical machine translation (SMT) systems while reducing the cost of generating large parallel corpora. The first method used disagreement among multiple SMT systems, while the second employed a perplexity criterion. Experiments on Chinese–English data across multiple domains showed that selecting only one-fifth of the additional training data achieved comparable or better translation performance than using the entire dataset.
Additionally, the research by Maillard et al. [18] explored the impact of small but high-quality training data on machine translation for low-resource languages. Since large-scale datasets are often unavailable or costly to create, models typically rely on pre-existing corpora and synthetic data. The study investigated whether a few thousand professionally translated sentence pairs can improve translation quality. Their analysis showed that even small, high-quality datasets significantly enhance model performance, even when combined with larger, lower-quality corpora or backtranslated data.
2.3. Language Relatedness
Common methods for analyzing language similarity include examining language families [19], lexicons [20], morphology [21], syntax [22], and typological features [23]. Language family classification organizes languages based on shared linguistic and historical traits, often involving lexical, phonological, and morphological comparisons. These relationships are visually represented through classification trees [23,24], forming a foundation for historical and comparative linguistics.
This genealogical approach typically uses the comparative method to identify cognates (words with shared historical origins), which helps determine the degree of relatedness among languages and aids in reconstructing proto-languages [23,24]. However, genealogical similarity is only one dimension of linguistic similarity. Other factors like language contact, borrowing, and independent structural development also play significant roles.
Importantly, languages can exhibit strong structural or typological similarities without sharing a common ancestor. Such cases have been noted between Turkic and Japanese [25,26], or even between English and Chinese [27], suggesting that similarity relevant to language transfer can exist beyond genealogical ties.
The relatedness between languages is another crucial aspect of transfer learning for machine translation. Languages that share a common linguistic background, such as similar grammatical structures, vocabulary, and syntax, offer significant advantages. This shared linguistic heritage can be leveraged to improve translation performance, especially in scenarios where parallel data is limited [28].
Cotterell and Heigold [2] investigated the impact of language relatedness on cross-lingual transfer learning for low-resource languages. Their study revealed that related languages, such as Spanish and Portuguese or Czech and Slovak, yielded better outcomes than unrelated language pairs. This suggests that the linguistic similarities between related languages facilitate the transfer of knowledge, leading to improved translation performance.
Nguyen and Chiang [29] also found that transfer learning was particularly effective when applied to related language pairs. They observed that the common linguistic features shared by related languages, such as similar grammatical structures and vocabulary, enhance the model’s ability to transfer knowledge and improve translation quality. Additionally, a study by Winata et al. [30] showed that, in few-shot learning, linguistically similar and geographically similar languages are more useful for cross-lingual adaptation.
Similarly, Dabre et al. [31] found that transfer learning from languages within the same or similar language families was most effective for low-resource languages. This is because the model can utilize the knowledge acquired from the source languages to enhance translation quality for the target language. The similarities between the source and target languages enable the model to make more accurate predictions and better handle the translation task.
Related languages have also shown their significance in zero-shot and few-shot machine translation scenarios. Nooralahzadeh et al. [28] demonstrated that related languages perform better in zero-shot translation, where the system is trained on a source language and tested on a target language with no parallel data available. This advantage arises because the model can leverage the linguistic similarities between the source and target languages to make more accurate translations.
When data is limited for a specific language pair, related languages can provide a valuable alternative by transferring knowledge across languages. This approach can help mitigate the challenges posed by data scarcity and improve the performance of machine translation models [32,33].
Additionally, studies have shown a correlation between language similarity and cross-lingual transfer efficiency in various natural language processing tasks. Research by Lauscher et al. [34] and Eronen et al. [35] highlights that the similarity of language pairs significantly impacts the efficiency of cross-lingual transfer, emphasizing the importance of leveraging related languages to enhance translation performance.
3. Methods
This study examines how data volume and language relatedness affect transfer learning for Polish-to-English neural machine translation. We fine-tune mBART [36], a multilingual sequence-to-sequence model pretrained by Facebook AI Research, and evaluate performance under zero-shot and few-shot conditions.
In this experiment, transfer-language relatedness is not operationalized with a single quantitative metric. Instead, it is treated qualitatively through the selected transfer languages: Czech as a closely related West Slavic language, Russian as a Slavic language with different script usage, German as a less related Indo-European language, and Slavic as a mixed Czech–Russian parent. The analysis should therefore be read as a comparison of selected transfer-language configurations rather than as a formal measurement study of linguistic similarity.
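The experiments were conducted in two rounds: Czech, Russian, and German were evaluated first, and the Slavic parent setup was added in the subsequent round. Across these rounds, the parent setups are Czech, Russian, German, and Slavic (Czech + Russian), and they are compared against a no-transfer baseline.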
The experiments follow a parent–child transfer configuration [1]. Parent models are first trained on translation into English with non-Polish source data, and then adapted to Polish–English using progressively larger Polish subsets.
More concretely, each transfer setup is trained in two sequential phases. In the parent phase, the model is exposed to source-language-to-English parallel data (100,000 samples per parent language). In the child phase, training is continued on Polish-to-English with the designated shot budget. This design allows us to isolate how transfer source choice and target-language data volume interact.
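The two-phase design can be summarized as a run plan. The sketch below is illustrative only: the function names, the language codes, and the choice to draw nested Polish subsets (each smaller subset contained in the larger ones) are assumptions, since the text does not state how the shot subsets were sampled.

```python
import random

SHOT_LEVELS = [0, 10, 100, 1_000, 10_000]   # Polish adaptation budgets
SAMPLES_PER_PARENT_LANG = 100_000            # parent-phase data per transfer language

# Parent setups; the empty tuple stands for the no-transfer baseline.
PARENT_SETUPS = {
    "Czech": ("cs",),
    "Russian": ("ru",),
    "German": ("de",),
    "Slavic": ("cs", "ru"),
    "Baseline": (),
}


def nested_subsets(pairs, sizes, seed=42):
    """Shuffle once and take prefixes, so smaller subsets sit inside larger ones (assumption)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}


def plan_runs(parent_setups, shot_levels):
    """Enumerate (parent phase, child phase) combinations as plain dictionaries."""
    runs = []
    for name, langs in parent_setups.items():
        for shots in shot_levels:
            runs.append({
                "setup": name,
                "parent_pairs": [f"{lang}-en" for lang in langs],
                "parent_samples": SAMPLES_PER_PARENT_LANG * len(langs),
                "child_pair": "pl-en",
                "child_shots": shots,
            })
    return runs


# Toy stand-in for Polish-English sentence pairs.
toy_pairs = [(f"pl_{i}", f"en_{i}") for i in range(10_000)]
subsets = nested_subsets(toy_pairs, SHOT_LEVELS)
runs = plan_runs(PARENT_SETUPS, SHOT_LEVELS)
print(len(runs), "runs planned")
```

Enumerating the runs this way makes the two controlled variables explicit: the parent language set and the Polish shot budget.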
For each setup, we evaluate five Polish data conditions: 0-shot, 10-shot, 100-shot, 1k-shot, and 10k-shot. The evaluation is performed in two stages: we run zero-shot experiments first, then continue with few-shot experiments. The initial zero-shot stage focused on Czech, Russian, and German.
The training data comes from OPUS-100 [4], a large multilingual sentence-aligned parallel corpus. Parent models are trained with 100,000 samples per transfer language before Polish adaptation.
To keep comparisons fair, all configurations share the same architecture, optimizer family, and evaluation protocol. The only controlled differences are: (i) transfer source language(s) in the parent phase, and (ii) number of Polish samples in adaptation. This ensures that score differences can be interpreted as transfer/data effects rather than implementation variance. We used the facebook/mbart-large-50-many-to-many-mmt checkpoint with PyTorch (2.0.1) and Hugging Face Transformers [37] (4.30.2). Training used AdamW with learning rate
, batch size 16 for both training and evaluation, and 3 epochs, with random seed 42. For evaluation, generation was performed with predict_with_generate=True. Tokenization was based on the tokenizer associated with the selected mBART checkpoint.
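The stated hyperparameters can be collected in a small configuration object. This is a sketch, not the authors' training script: the class and function names are hypothetical, the batch size is assumed to be per device, and the learning rate is left unset because its value is not given in the text. The keyword names returned by `to_trainer_kwargs` follow the Hugging Face `Seq2SeqTrainingArguments` fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    checkpoint: str = "facebook/mbart-large-50-many-to-many-mmt"
    optimizer: str = "AdamW"
    learning_rate: Optional[float] = None  # not given in the text; must be set explicitly
    train_batch_size: int = 16
    eval_batch_size: int = 16
    num_epochs: int = 3
    seed: int = 42
    predict_with_generate: bool = True


def to_trainer_kwargs(cfg: FineTuneConfig) -> dict:
    """Map the config onto Seq2SeqTrainingArguments-style keyword names."""
    if cfg.learning_rate is None:
        raise ValueError("learning_rate is not specified in the text; set it explicitly")
    return {
        "learning_rate": cfg.learning_rate,
        "per_device_train_batch_size": cfg.train_batch_size,  # assumes a single GPU
        "per_device_eval_batch_size": cfg.eval_batch_size,
        "num_train_epochs": cfg.num_epochs,
        "seed": cfg.seed,
        "predict_with_generate": cfg.predict_with_generate,
    }


config = FineTuneConfig()
print(config)
```

Keeping the same frozen configuration for every parent setup and shot level reflects the requirement that only the transfer language and the Polish data budget vary between runs.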
We report BLEU [38] and METEOR [39]. BLEU captures n-gram overlap with brevity control, while METEOR emphasizes precision, recall, and alignment quality. Experiments were run on an NVIDIA RTX 3090 GPU.
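To make "n-gram overlap with brevity control" concrete, the following is a minimal single-reference corpus BLEU in plain Python. It is a simplified sketch with whitespace tokenization and no smoothing, so it is not the scoring tool used for the reported numbers and should not be expected to reproduce them.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["the cat sat on the"], reference))
```

The second call shows the brevity penalty at work: every n-gram in the shortened hypothesis matches the reference, yet the score falls below the maximum because the output is too short.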
The preliminary zero-shot stage evaluates parent models before Polish adaptation, while the main stage evaluates the same models after few-shot Polish adaptation. This ordering directly reflects the intended low-resource workflow: initialize from cross-lingual transfer, then incrementally inject target-language supervision.
4. Results
4.1. Preliminary Zero-Shot Experiments
To establish transfer behavior before Polish adaptation, we first ran preliminary zero-shot experiments for Czech, Russian, and German by varying the source-language parent-data size (10, 100, 1000, 10,000, and 100,000 samples), while keeping target-language fine-tuning at zero-shot. These experiments are intended as an exploratory comparison of source-data scaling patterns rather than as a dedicated ablation study of all possible training or data-related factors.
These preliminary zero-shot results indicate that relatedness helps in low-resource transfer, but source-data scaling is not strictly monotonic. For Russian and German, there is almost no improvement from 100 to 10,000 samples in this setup, followed by a decline at 100,000.
A closer look at Table 1 shows distinct transfer dynamics across languages. Czech remains comparatively stable at low source-data sizes, with modest gains in the 1000–10,000 range, before diverging at 100,000 samples. In contrast, Russian and German remain nearly flat from 100 to 10,000 samples (Russian BLEU 0.74 → 0.78, METEOR 0.13 → 0.14; German BLEU 0.76 → 0.78, METEOR 0.14 → 0.14), indicating no mid-range benefit from additional parent data.
At 100,000 samples, Czech improves sharply, whereas Russian and German decline. This indicates that larger parent corpora do not automatically translate to better zero-shot Polish performance in this setup, and that scaling effects differ by transfer language. One plausible explanation is domain or distribution mismatch between the parent-language training signal and Polish target behavior at inference time, but this remains speculative.
4.2. Main Polish Fine-Tuning Results
Table 2 summarizes the Polish–English results obtained after the preliminary zero-shot stage.
The BLEU and METEOR curves in Figure 1 make the regime shift visually explicit: Czech is strongest in zero-shot, whereas Russian, Slavic, and German improve sharply once Polish adaptation reaches the 1k-shot level. In particular, the curves show that Czech begins with the strongest zero-shot initialization, while Russian and German achieve their largest gains between 100-shot and 1k-shot, where the effect of Polish supervision becomes much stronger.
4.3. Zero-Shot Analysis
For zero-shot transfer, the single-parent comparison among Czech, Russian, and German is clear. Czech outperforms the other two in zero-shot, which is consistent with its closer relatedness to Polish. Russian and German show weaker zero-shot scores, but both benefit strongly from few-shot fine-tuning.
4.4. Main Findings
Three patterns are clear from Table 2. First, increasing Polish data consistently improves translation quality, with the largest jump appearing between 100-shot and 1k-shot. Second, Czech is strongest in zero-shot (11.61 BLEU, 0.35 METEOR), while Russian and German begin substantially lower. Third, in the 1k-shot and 10k-shot settings, Russian and German match or surpass Czech in BLEU (19.42 and 19.35 vs. 17.17 at 10k-shot), and both reach 0.44 METEOR.
From the few-shot perspective, the transition from 100-shot to 1k-shot is the most influential regime change for Russian, Slavic, and German, where BLEU rises sharply into the 16–17 range and METEOR reaches 0.41–0.42. This indicates that once a minimum adaptation budget is reached, models can rapidly align to Polish-specific patterns.
One plausible interpretation is that these systems pass through two qualitatively different learning phases. In the first phase (0–100 shots), performance is dominated by the transfer prior: models rely mostly on what was encoded during parent training, so language relatedness strongly shapes outcomes. In the second phase (1k–10k shots), the adaptation signal from Polish becomes dominant, allowing initially weaker parents (Russian and German) to catch up and even overtake Czech. The strong higher-shot performance of German is especially noteworthy because German is less closely related to Polish than Czech or Russian. One possible explanation is that, once enough Polish supervision is available, performance depends less on linguistic similarity alone and more on the model’s capacity to learn transferable representations and reorganize them during adaptation. In that view, the multilingual mBART architecture may exploit broad cross-lingual regularities that are not reducible to simple genealogical relatedness, allowing a less related parent language to become highly effective after sufficient fine-tuning.
An additional factor may be subword tokenization. Because mBART uses a shared multilingual vocabulary, cross-lingual similarity may be encoded not only through syntax or genealogy, but also through the degree to which languages share reusable subword units and segmentation patterns. If Polish and a transfer language are represented with overlapping or compatible subword pieces, the model may benefit from more efficient parameter sharing even when the languages are not closely related in a traditional linguistic sense. Conversely, differences in script usage or segmentation behavior may weaken transfer even between genealogically related languages. This means that part of the observed transfer effect may arise from tokenization-level compatibility inside the shared vocabulary, rather than from language relatedness alone.
While this interpretation is consistent with the observed curves, it should be treated as a hypothesis rather than a causal claim.
Comparing against the no-transfer baseline further highlights the transfer benefit: at 10k-shot, transfer-based configurations (17.17–19.42 BLEU) outperform the no-transfer baseline (15.42 BLEU), and similarly improve METEOR (0.41–0.44 vs. 0.36). Thus, parent initialization remains useful even when target-language data is no longer extremely small.
4.5. Effect of the Slavic Parent Model
The Slavic parent model (Czech + Russian) performs better than the no-transfer baseline at all shot levels and remains competitive with single-language parents. However, using two related transfer languages does not guarantee better results than the best single parent in every setting. In lower-resource conditions, Czech alone is stronger than Slavic, while in higher-resource conditions Russian can be stronger. This indicates that simply increasing transfer-source volume is not always sufficient; transfer composition also matters.
At zero-shot, Slavic (8.33 BLEU) is clearly below Czech (11.61), but above Russian and German, suggesting that mixed-parent transfer may dilute the strongest language-specific signal. At 10k-shot, Slavic (18.18) remains strong but still below Russian (19.42) and German (19.35), reinforcing that multi-source transfer is not automatically optimal without careful balancing.
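The 10k-shot comparison can be made explicit with simple arithmetic on the BLEU figures quoted in the text. The snippet below only rearranges those reported numbers; the variable names are illustrative.

```python
# 10k-shot BLEU scores as quoted in the text.
bleu_10k = {"Czech": 17.17, "Russian": 19.42, "German": 19.35, "Slavic": 18.18}
baseline_10k = 15.42  # no-transfer baseline

# Absolute gain of each parent setup over the baseline, in BLEU points.
gains = {name: round(score - baseline_10k, 2) for name, score in bleu_10k.items()}

# Ranking of parent setups at 10k-shot, best first.
ranking = sorted(bleu_10k, key=bleu_10k.get, reverse=True)

for name in ranking:
    print(f"{name}: {bleu_10k[name]:.2f} BLEU (+{gains[name]:.2f} over baseline)")
```

Read this way, every parent setup stays ahead of the baseline at 10k-shot, while Czech, the strongest zero-shot parent, ends up last among the transfer configurations.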
5. Discussion
5.1. Data Volume in Translation Performance
The most stable trend across our main experiments (Table 2) is that adding more Polish target-language data improves quality. The jump from 100-shot to 1k-shot is particularly large for all transfer setups, and 10k-shot provides the best overall performance. This aligns with the general transfer-learning expectation that larger target-language samples improve adaptation by exposing the model to more patterns and reducing mismatch between parent and target tasks.
At the same time, the preliminary zero-shot experiments in Table 1 show that increasing source-language parent data alone is not always sufficient. For Russian and German, quality stays nearly flat at intermediate source-data sizes (100 to 10,000 samples) and then drops at 100,000 in this setup, suggesting that the relationship between transfer data quantity and transfer quality is not monotonic.
The non-monotonic trend indicates that adding more parent-language data does not guarantee better zero-shot transfer in this setup. The current experiments do not identify the cause of this pattern, so explanations such as domain mismatch or overfitting should be treated as open hypotheses for future work. More broadly, the results suggest that target-language data volume has a clearer and more consistent effect on translation quality than parent-language data volume in the present experiments.
5.2. Transfer-Language Relatedness
Qualitative language relatedness remains most visible in low-resource conditions. The model using Czech as the parent language, where Czech is linguistically closer to Polish, gives the strongest zero-shot result in the main table (11.61 BLEU, 0.35 METEOR). This is consistent with the idea that related parent languages provide reusable lexical and structural information that helps the model before any Polish-specific fine-tuning. In that sense, relatedness appears particularly important when little or no Polish supervision is available.
However, this effect weakens as Polish data increases. By 10k-shot, the models using Russian and German as parent languages become competitive with, or stronger than, the Czech parent model in BLEU. This indicates that enough target-language data can compensate for weaker initial relatedness. In more heavily supervised settings, model performance depends less on the initial closeness of the transfer language and more on the model’s ability to adapt from Polish examples.
The Slavic mixed parent adds nuance to this pattern. Because it is trained on both Czech and Russian, it uses roughly twice the parent-language data of the single-parent models. Even so, it does not become the strongest configuration: at 0-shot and in the lower-resource settings it remains below Czech, and at 10k-shot it remains below Russian. These results show that combining related parent languages is useful in this experiment, but does not automatically outperform the best single-parent model.
Taken together, these observations indicate that the qualitative notion of transfer-language relatedness used here cannot be reduced to a single factor. In this study, the main evidence is comparative rather than mechanistic: the Czech parent model performs best at zero-shot, the Russian and German parent models improve more strongly with additional Polish data, and the mixed Slavic parent model remains competitive.
One additional interpretation is that the observed transfer patterns may partly reflect how mBART encodes languages through a shared subword vocabulary. In that case, effective cross-lingual transfer may depend not only on genealogical or typological similarity, but also on whether Polish and the parent language are segmented into overlapping or compatible subword units. This could help explain why a less closely related language such as German becomes highly competitive once Polish supervision is available, while script differences or segmentation mismatches may weaken transfer for otherwise related languages. Because subword overlap was not measured directly in this study, this explanation should be treated as a plausible model-internal mechanism rather than a demonstrated causal account.
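As an illustration of how the subword-overlap hypothesis could be tested in follow-up work, the sketch below computes two simple overlap statistics between a target-language and a parent-language token stream. It is not part of the experiments reported here. The function names are hypothetical, the toy token lists are hand-written rather than real mBART tokenizer output, and the step of segmenting real corpora with the model's tokenizer is assumed to happen beforehand.

```python
from collections import Counter
from typing import Iterable


def subword_jaccard(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Type-level Jaccard overlap between two subword inventories."""
    types_a, types_b = set(tokens_a), set(tokens_b)
    union = types_a | types_b
    if not union:
        return 0.0
    return len(types_a & types_b) / len(union)


def subword_coverage(target_tokens: Iterable[str], parent_tokens: Iterable[str]) -> float:
    """Share of target token occurrences whose subword also appears in the parent data."""
    parent_types = set(parent_tokens)
    counts = Counter(target_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for tok, n in counts.items() if tok in parent_types)
    return covered / total


# Hypothetical hand-written segmentations, for illustration only.
polish_tokens = ["_dob", "ry", "_dzie", "n"]
czech_tokens = ["_dob", "ry_", "_den"]

jaccard = subword_jaccard(polish_tokens, czech_tokens)
coverage = subword_coverage(polish_tokens, czech_tokens)
print(f"Jaccard overlap: {jaccard:.3f}, token coverage: {coverage:.3f}")
```

Computed per parent language on comparable samples, statistics of this kind could be compared against the zero-shot ranking to see whether tokenization-level compatibility tracks transfer quality.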
5.3. Zero-Shot vs. Few-Shot Learning
The contrast between zero-shot and few-shot settings is clear. Zero-shot performance varies greatly by transfer-language configuration, with the Czech parent model much stronger than the Russian and German parent models in the main setup. Once even small amounts of Polish data are introduced (10-shot and 100-shot), all models improve, and by 1k-shot the gap narrows substantially.
This pattern suggests that few-shot adaptation is an effective bridge between zero-shot initialization and higher-resource performance. Practically, if no Polish data is available, transfer-language choice is critical; once few-shot data becomes available, model performance depends increasingly on adaptation data rather than only the initial transfer-language configuration.
From a deployment perspective, this implies two different optimization strategies: if the annotation budget is near zero, choose the transfer language that gives the strongest zero-shot initialization; if collecting around 1k+ target pairs is realistic, prioritize parent configurations that adapt fastest under supervision, even if their zero-shot scores are modest.
More broadly, the observed phase transition between 0–100 shots and 1k–10k shots suggests a practical staged deployment strategy for low-resource languages. In an initial emergency or cold-start setting, where no parallel corpus exists and a usable system is needed quickly, practitioners should optimize for zero-shot robustness by selecting the most favorable related parent language and accepting that performance will be strongly constrained by the transfer prior. In a second stage, once roughly 1k or more target-language pairs can be collected, the strategy should shift from choosing the best zero-shot parent to building an efficient adaptation pipeline: targeted data collection, rapid fine-tuning, and periodic model refreshes may produce larger gains than further parent-language pretraining alone. In other words, the results imply that early deployment decisions should be driven by parent-language selection, whereas later deployment quality is more effectively improved through investment in even moderate amounts of target-language supervision.
This interpretation is especially relevant for real-world low-resource deployment, where collecting 1k sentence pairs may be difficult but is often still more realistic than building a fully scaled parallel corpus. For many endangered, regional, or institutionally under-resourced languages, assembling 1k curated translation pairs may be feasible through focused annotation campaigns, collaboration with bilingual speakers, or translation of high-value domain material such as public-service information, school content, or health guidance. In such settings, our findings suggest a practical decision rule: if even 1k target-language pairs is unlikely to be obtainable in the near term, the safest strategy is to maximize zero-shot quality through careful parent-language choice; if collecting around 1k pairs is realistic, then the system should be designed from the beginning to exploit that adaptation stage, because moderate supervision may change which transfer configuration is actually best in deployment. This also means that transfer strategy should be matched not only to linguistic relatedness, but to a realistic estimate of the data that local stakeholders can afford to create and maintain.
A quantitative view further highlights this transition. Czech starts high and gains steadily (11.61 → 17.17 BLEU), while Russian and German show much larger relative jumps from weak zero-shot starting points (Russian 0.42 → 19.42; German 0.12 → 19.35). This suggests that zero-shot ranking and few-shot ranking can be very different, depending on how much target supervision is available. In this case, the languages perform more or less equally in few-shot settings, whereas Czech clearly dominates in zero-shot settings.
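The arithmetic behind this comparison can be made explicit with a short sketch. It uses only the BLEU endpoints quoted in the paragraph above; treating the second value of each pair as the highest-shot endpoint is an assumption of this illustration, and the variable names are hypothetical.

```python
# BLEU endpoints quoted above: (zero-shot score, later endpoint score).
bleu = {
    "Czech": (11.61, 17.17),
    "Russian": (0.42, 19.42),
    "German": (0.12, 19.35),
}


def ranking(scores):
    """Return configuration names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


# Absolute BLEU gain between the two endpoints for each parent language.
gains = {lang: round(end - start, 2) for lang, (start, end) in bleu.items()}
zero_shot_rank = ranking({lang: start for lang, (start, _) in bleu.items()})
endpoint_rank = ranking({lang: end for lang, (_, end) in bleu.items()})

print("absolute gains:", gains)
print("zero-shot ranking:", zero_shot_rank)
print("endpoint ranking:", endpoint_rank)
```

The absolute gains are 5.56 BLEU for Czech, 19.00 for Russian, and 19.23 for German, and the ordering changes from Czech, Russian, German at zero-shot to Russian, German, Czech at the later endpoint.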
These observations support a practical recommendation for future work: transfer strategies should be evaluated across at least one near-zero regime and one moderate-shot regime, because conclusions drawn from only zero-shot or only higher-shot settings can be misleading about relative parent-language performance.
5.4. Threshold Effects and Saturation Points
Our results suggest threshold-like behavior in two places. First, in the main experiments, several configurations show a marked quality increase around the 1k-shot stage, implying a minimum target-data level needed for stable adaptation. Second, in preliminary zero-shot experiments, Russian and German peak at intermediate source-data scales and then decline, indicating potential saturation or over-specialization effects for zero-shot transfer.
This preliminary source-scaling result was not what we initially expected. We expected that increasing the amount of parent-language training data would produce more consistent zero-shot gains. Instead, additional parent-language data beyond intermediate ranges does not appear to yield a reliable improvement in this setup, and in some cases the scores remain nearly unchanged or decline. Within the limits of this exploratory experiment, the result suggests that simply adding more transfer-source data is not sufficient by itself to guarantee better transfer performance.
One possible contributor to this non-monotonic behavior is domain mismatch within OPUS-100. Although OPUS-100 provides broad multilingual coverage, it aggregates sentence pairs from multiple source collections with potentially different topical, stylistic, and alignment characteristics. As parent-language data is scaled upward, the added material may therefore become less similar to the distribution implicitly required for Polish–English zero-shot transfer, even if the total volume increases. Under that interpretation, intermediate-scale parent training may provide enough general multilingual signal to support transfer, whereas larger-scale training may introduce noisier or less compatible domain patterns that weaken zero-shot generalization. The present experiments do not isolate this factor directly, but domain mismatch is a plausible explanation for why more parent-language data does not translate into uniformly better zero-shot performance.
The Slavic parent setup remains useful across data regimes and consistently outperforms the baseline, but it does not uniformly exceed the best single-parent model. This reinforces a practical guideline: parent-model selection should be tuned to the expected target-data budget rather than assuming multi-source transfer is always optimal.
The present results do not establish why the mixed Slavic parent behaves this way. They show that multi-source transfer can be competitive, but further ablations would be needed to determine whether the outcome is driven by data balance, parent-language interaction, tokenization effects, or other training factors. Investigating when multiple transfer languages help, and when they fail to improve over a strong single related parent, is therefore an important direction for future work.
6. Limitations
Several limitations of this study should be acknowledged. First, while the OPUS-100 corpus provides extensive multilingual data, it may not capture all linguistic features or domain variations comprehensively, and variations in data quality and alignment could also affect robustness. Second, and most importantly, generalizability is limited. Our experiments cover only Polish–English transfer with Czech, Russian, German, and one combined Slavic parent model within a single model family, mBART. Although this focused setup supports a clear empirical analysis, it remains uncertain whether the same conclusions would hold for other target languages, language families, datasets, or translation architectures. The paper should therefore be interpreted as a focused case study rather than a broadly validated statement about transfer learning in machine translation.
Third, although the paper discusses transfer-language relatedness, relatedness is not measured directly with a quantitative predictor such as lexical overlap, typological distance, or subword overlap under mBART tokenization. This is especially relevant because mBART’s shared vocabulary may make tokenization-level compatibility an additional driver of transfer quality. As a result, the study should be interpreted as a comparison of selected transfer languages rather than as a rigorous measurement study of linguistic similarity. Fourth, the preliminary source-scaling experiment in Table 1 is exploratory and is not supported by dedicated ablations on factors such as domain balance, domain mismatch within OPUS-100, stopping criteria, preprocessing differences, or possible overfitting at larger parent-data sizes. Its results are therefore useful for identifying patterns, but not for making strong causal claims about why those patterns occur. Fifth, the Slavic parent model is trained on both Czech and Russian, which increases the total amount of parent training data relative to the single-language parent models. This introduces a confound in interpreting the Slavic parent results: any observed gain may reflect multi-source transfer, greater parent-data volume, or both. A stronger controlled comparison would equalize the total number of parent-training samples across configurations so that the effect of multi-source transfer can be separated from the effect of simply having more data.
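One way such a volume-matched comparison could be set up is sketched below. This is a minimal illustration rather than the procedure used in this study: the function name, the even split across parent languages, and the toy corpora are all assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, English sentence)


def budget_matched_mix(corpora: Dict[str, Sequence[Pair]], total: int, seed: int = 0) -> List[Pair]:
    """Draw `total` pairs, split as evenly as possible across the parent corpora."""
    rng = random.Random(seed)
    names = sorted(corpora)
    base, extra = divmod(total, len(names))
    mixed: List[Pair] = []
    for i, name in enumerate(names):
        # The first `extra` corpora (in sorted order) receive one additional pair.
        quota = base + (1 if i < extra else 0)
        if quota > len(corpora[name]):
            raise ValueError(f"not enough pairs in {name!r} for quota {quota}")
        mixed.extend(rng.sample(list(corpora[name]), quota))
    rng.shuffle(mixed)
    return mixed


# Toy placeholder corpora; real use would load the Czech and Russian parent data.
toy_corpora = {
    "cs": [(f"cs-{i}", f"en-{i}") for i in range(10)],
    "ru": [(f"ru-{i}", f"en-{i}") for i in range(10)],
}
mixed = budget_matched_mix(toy_corpora, total=7)
print(len(mixed), "pairs in the volume-matched mixed parent set")
```

Setting `total` to the size of a single-parent training set would give the mixed Slavic parent the same number of parent-training pairs as the single-parent models, so that any remaining difference could be attributed more cleanly to multi-source transfer.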
Moreover, the performance of the models is evaluated using BLEU and METEOR scores. While these metrics are well-established, they have known limitations in fully capturing translation quality. BLEU is based on n-gram precision with a brevity penalty, whereas METEOR combines precision and recall over word alignments; neither fully captures translation fluency and adequacy. Stronger contemporary evaluation approaches, such as COMET or human evaluation, were not included in the present study. In addition, the paper reports single BLEU and METEOR values without an accompanying measure of variability. This is especially important in few-shot settings, where performance can fluctuate depending on data sampling and random initialization. Some reported score differences are very small (for example, Russian and German at 10k-shot differ only marginally in BLEU), so their practical significance cannot be established from single-run results alone. Future work should therefore report means and standard deviations over multiple runs and include statistical significance testing, such as paired bootstrap resampling or approximate randomization tests, to determine whether small score differences are robust. These limitations should be considered when interpreting the study’s results and applying its findings to broader contexts in machine translation.
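To make the suggested significance testing concrete, the following sketch shows paired bootstrap resampling in a simplified form. It assumes that per-sentence quality scores are available for two systems on the same test sentences and compares their sums on each resample; for a corpus-level metric such as BLEU, the metric would instead be recomputed on every resampled test set. The function name, parameters, and toy scores are hypothetical, and no such test was run in the present study.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(scores_a: Sequence[float], scores_b: Sequence[float],
                              n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which system A scores strictly higher than B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement; both systems share the same indices.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Toy per-sentence scores (made up for illustration, not results from this study).
system_a = [2.0, 3.0, 4.0, 5.0]
system_b = [1.0, 2.0, 3.0, 4.0]
win_rate = paired_bootstrap_win_rate(system_a, system_b)
print(f"A beats B in {win_rate:.1%} of resamples")
```

A win rate close to 1.0 (or close to 0.0) suggests that the difference is stable under resampling of the test set, whereas values near 0.5 suggest that the two systems cannot be reliably separated on this data.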
7. Conclusions
In conclusion, our study shows that, for Polish–English translation with mBART, both transfer-language choice and target-data volume strongly affect performance. We find that Czech is the most effective source language in zero-shot conditions, while Russian and German become highly competitive as more target-language data is added.
Our experiments also show that the Slavic parent model (Czech + Russian) is consistently useful, outperforming the baseline and remaining competitive across shot levels, although it does not always exceed the best single-parent model.
In this Polish–English mBART setting, few-shot learning provides an effective way to improve translation quality beyond zero-shot conditions. At the same time, zero-shot performance remains highly dependent on transfer-language choice, with Czech offering substantially better initial transfer than Russian and German.
Our preliminary source-scaling results also indicate that adding more parent-language data does not automatically improve zero-shot transfer. Contrary to our initial expectation, larger transfer-source corpora did not produce a clear monotonic benefit in this experiment, which suggests that transfer-data volume alone is not enough to explain performance differences.
Moving forward, it is essential to test whether these patterns hold for other datasets, language pairs, and model families before generalizing beyond the present setting. Future research should also examine additional parent-language configurations and multi-source transfer setups, and should ideally incorporate direct measures of language relatedness together with stronger evaluation protocols such as COMET and human assessment. Another important direction is to investigate whether LLM-based translation systems with retrieval augmentation can complement or outperform encoder–decoder transfer learning in low-resource regimes. In particular, graph-based retrieval and contextual RAG with prompting provide concrete architectural ideas for how external multilingual resources could be injected dynamically during translation, rather than relying only on parametric transfer from parent-language training [11, 12].
Taken together, our results provide a detailed empirical characterization of transfer behavior in a realistic low-resource setting and offer practical guidance for transfer-language selection. When no target-language data is available, the transfer language that gives the strongest zero-shot initialization is the most effective choice in this Polish–English mBART setting. In such low-resource conditions, related parent languages appear especially useful, although this paper treats relatedness descriptively rather than as a directly measured variable. However, once even moderate supervision becomes available, transfer configurations that adapt more efficiently may be preferable. We therefore recommend that researchers and practitioners consider both the expected target-data budget and the choice of related parent languages when designing transfer-learning setups for machine translation. More broadly, the study is best understood as a solid empirical contribution with practical value and as a focused reference point for future work, rather than as a final general claim across languages and architectures.