Article

AI-Driven Generation of Old English: A Framework for Low-Resource Languages

by Rodrigo Gabriel Salazar Alva 1, Matías Núñez 1,2,3, Cristian López Del Alamo 1 and Javier Martín Arista 4,*
1 Department of Computing, Universidad de Ingeniería y Tecnología (UTEC), Lima 15063, Peru
2 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires C1425FQB, Argentina
3 Instituto de Investigaciones en Biodiversidad y Medioambiente (INIBIOMA), Universidad Nacional del Comahue, Bariloche 8400, Argentina
4 Department of Modern Languages, Universidad de La Rioja, 26004 Logroño, Spain
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(5), 145; https://doi.org/10.3390/bdcc10050145
Submission received: 16 September 2025 / Revised: 25 April 2026 / Accepted: 29 April 2026 / Published: 6 May 2026

Abstract

Preserving ancient languages is essential for understanding the cultural and linguistic heritage of humanity. Old English, however, remains critically under-resourced, which limits its accessibility to modern natural language processing (NLP) techniques. To address this gap, we present a scalable framework that uses large language models (LLMs) to generate high-quality Old English texts. In this study, we employ state-of-the-art models, including Llama-3.1-8B and Mistral-7B, as foundation models, which are then adapted to the characteristics of Old English. Our approach combines parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA), data augmentation via back-translation, and a dual-agent pipeline that separates content generation (in English) from translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment confirms high grammatical accuracy and stylistic fidelity in the generated texts, with average scores of 9.0/10 for inflection and word order, 9.1/10 for lexical authenticity, and 7.8/10 for semantic coherence. These results demonstrate that the framework can reliably expand limited historical corpora while maintaining linguistic integrity, with immediate practical applications in digital humanities research, computational philology, and the development of educational resources for Old English study. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, thus linking AI innovation with the goals of cultural preservation.

1. Introduction

Language is among the most profound tools of human civilization: it embodies cultural heritage, historical knowledge, and intellectual evolution. Studying ancient languages reveals the origins of societies and their ties to the present. Among these, Old English holds a unique position as the earliest form of the English language and the foundation of contemporary English. Spoken between the 5th and 11th centuries of the Common Era, it is linguistically rich and complex, yet it remains under-resourced in the current paradigm of widespread generative artificial intelligence.
Recent advances in NLP for low-resource languages have explored various computational strategies. Work on Latin, for instance, has benefited from substantially larger corpora (the Latin Library contains over 10 million words) and from established toolkits, with transformer models applied successfully to tasks such as lemmatization and dependency parsing [1,2]. Gothic, despite having fewer surviving texts than Old English, has seen recent computational work through the use of transfer learning from modern Germanic languages, though these approaches typically achieve BLEU scores below 40 for translation tasks [3].
Existing NLP models for LRLs (low-resource languages) face common limitations: (1) data scarcity necessitates extensive transfer learning, which risks introducing linguistic features from donor languages; (2) morphological complexity in inflected languages challenges tokenization and requires careful handling of case systems; and (3) free word order complicates sequence modeling approaches designed for fixed-order languages like Modern English.
Parameter-efficient fine-tuning methods have emerged as crucial tools for LRL adaptation. LoRA [4] reduces trainable parameters by injecting low-rank matrices into pretrained models, which enables efficient adaptation with limited data and preserves the knowledge of the base model. Back-translation [5], originally developed for neural machine translation, has proven particularly effective for data augmentation in low-resource settings because it can leverage monolingual corpora to generate synthetic parallel data. Recent work by Nguyen et al. [6] demonstrates that linguistically diverse prompting can improve LLM performance on low-resource languages by up to 8 points in the chrF++ metric, although their approach requires substantial multilingual pre-training data unavailable for Old English.
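To make the parameter saving concrete, the following minimal sketch counts the trainable parameters for a single weight matrix under LoRA. It assumes the standard formulation of Hu et al. [4], in which a frozen weight W0 of size d × k is augmented by a low-rank product B·A, with B of size d × r and A of size r × k. The dimensions and rank used here are illustrative assumptions, not the configuration used in this study.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight matrix.

    W0 (d x k) stays frozen; only B (d x r) and A (r x k) are learned,
    so the adapted weight is W0 + B @ A.
    """
    return d * r + r * k


def full_params(d: int, k: int) -> int:
    """Parameters updated when the whole d x k matrix is fine-tuned."""
    return d * k


# Illustrative values only (assumptions, not the paper's settings).
d, k, r = 4096, 4096, 16
lora = lora_trainable_params(d, k, r)
full = full_params(d, k)
ratio = lora / full
print(f"LoRA: {lora} trainable vs. {full} full ({ratio:.2%})")
```

With these assumed values, the low-rank update trains under 1% of the parameters of the full matrix, which is the property that makes LoRA attractive when the training corpus is small.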
Beyond the original work by Hu et al. [4], subsequent studies have extended LoRA to specialized low-resource translation scenarios. Liang et al. [7] recently introduced Language-Specific Fine-Tuning with LoRA (LSFTL), optimizing multi-head attention and feed-forward layers of Transformer blocks through low-rank matrix adaptation to achieve substantial BLEU and chrF gains for low-resource language pairs. These results confirm that LoRA-based adaptation can match or exceed full fine-tuning with a fraction of the trainable parameters, a property especially valuable for historical languages where overfitting is a constant risk.
The evolution of back-translation techniques has similarly advanced beyond the foundational work of Edunov et al. [5]. Recent innovations include iterative back-translation, where multiple rounds of synthetic data generation progressively refine model quality, and tagged back-translation, which explicitly marks synthetic versus authentic data to help models distinguish between them.
Recent work on English–Luganda [8] demonstrated that incremental and iterative back-translation, combined with dataset selection across multiple small monolingual sources, can exceed previous benchmarks by more than 10 BLEU units across translation directions for an African low-resource language, which directly motivates the iterative back-translation strategy in our Phase 2 task specialization.
For morphologically rich languages, hybrid approaches combining back-translation with morphological analysis have shown particular promise. Studies on agglutinative languages like Turkish and Finnish demonstrate that incorporating morphological segmentation into the back-translation pipeline can improve BLEU scores by 5–8 points over naive back-translation. These advances informed our decision to combine back-translation with careful linguistic preprocessing tailored to Old English’s inflectional complexity.
The intersection of efficient fine-tuning and data augmentation for historical languages represents relatively unexplored territory. While Latin NLP has benefited from larger corpora and established computational infrastructure [1,2], truly low-resource ancient languages like Gothic and Old Church Slavonic present challenges more analogous to Old English. Recent work on Gothic achieves BLEU scores in the 30–40 range through transfer learning from modern Germanic languages but often introduces anachronistic constructions from donor languages.
Very recent work on classical Chinese confirms the value of domain-specialized models: Zhao et al. [9] applied LoRA instruction tuning to the Xunzi series of ancient-book LLMs using 1.2 million pairs of parallel ancient–modern Chinese text. These authors show that domain-specialized models substantially outperform general-purpose baselines on ancient-book translation metrics. The same research group has further extended this paradigm to multimodal settings, with Zhu et al. [10] introducing XunZi-MLLM, a multimodal large language model for joint recognition of ancient texts and historical document images. In a closely related line of work on a different low-resource setting, Joshi et al. [11] introduced Nemotron-Mini-Hindi 4B, a bilingual model trained through continued pre-training of a multilingual LLM on a mix of real and translation-based synthetic Hindi + English tokens. They achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks.
These works adopt a methodological philosophy analogous to ours, namely combining domain-specialized continued pre-training with synthetic data augmentation, and we adapt this paradigm to the markedly smaller Old English corpus (≈3 million words). Our approach differs by treating Old English generation as a structured task combining domain adaptation and task specialization, explicitly separating content creation from translation to maintain linguistic authenticity while leveraging modern language model capabilities.
Unlike previous work that treats ancient languages as simple translation targets, our framework treats corpus expansion as a structured generation task that preserves grammatical and stylistic authenticity. Our approach advances beyond these methods by combining domain-adaptive continual pre-training with task-specific back-translation in a dual-agent architecture specifically designed to address the unique challenges of Old English: its limited corpus size (approximately 3 million words), complex inflectional morphology (four-case nominal system, rich verbal conjugation), and relatively free word order. The scarcity of data hinders not only linguistic research but also the application of state-of-the-art NLP techniques, which often depend on large, high-quality datasets [12].
LLMs have revolutionized NLP but face critical limitations when generating Old English texts. Most models are English-centric, trained predominantly on contemporary corpora, while multilingual models lack sufficient representation of low-resource languages like Old English. The unique linguistic features of Old English—complex case systems, free word order, and Germanic vocabulary—compound these challenges, resulting in stylistically and grammatically inaccurate outputs that require targeted training and augmentation strategies.
The primary objective of this study is to adapt a pretrained large language model for data generation tasks in Old English through a systematic fine-tuning process that leverages the limited available Old English data and advanced training techniques. This approach allows the model to generate syntactically well-formed and semantically accurate Old English text. Furthermore, to guide the stylistic and contextual quality of the generated output, the proposed framework uses a dual-agent architecture: a generative agent constructs coherent Modern English prompts, and a translation agent renders these into high-quality Old English.
Based on the successful application of transfer learning and data augmentation in related low-resource scenarios, we hypothesize that a pretrained large language model, when systematically adapted through staged domain-specific training and enriched with synthetically augmented data via back-translation, can generate Old English texts that achieve both high automated metric scores (BLEU > 60) and expert-validated linguistic fidelity (average score > 8/10) across grammatical accuracy, lexical authenticity, and stylistic coherence. This hypothesis rests on two assumptions: that the model's existing knowledge of Modern English and related Germanic languages provides a sufficient structural foundation for effective transfer to Old English, and that synthetic data can overcome the limitations of the small historical corpus without introducing systematic errors or anachronisms that would compromise authenticity.
Evaluation was conducted using both automated metrics, such as BLEU, CHRF, and METEOR, and human evaluation by experts in Old English. Texts were rated on grammatical accuracy (inflection and word order), lexical selection (attestedness), and semantic coherence, with only high-quality outputs incorporated into the extended corpus. This refinement ensured the data met both linguistic and computational standards.
Our approach begins by adapting a state-of-the-art language model through a carefully staged training pipeline. The process starts with domain adaptation, in which the model is progressively exposed to authentic Old English data, however scarce, alongside carefully selected Modern English examples, using domain-adaptive pre-training and efficient fine-tuning. Synthetic data generation is then achieved via a dual-agent system: one agent crafts stylistically consistent Modern English prompts, and a specialized translation agent renders these into fluent Old English, enriched with context and guided by few-shot learning. By iteratively expanding and refining the training data, our methodology enables the model to produce Old English texts with high fidelity, which opens the door to scalable digital preservation and revitalization for other LRLs.
The contributions of this work are manifold. It expands the Old English corpus and provides a scalable framework for LRLs. This approach serves as a replicable template for other under-resourced languages, fostering their preservation and study. Beyond linguistics, this work unites computational techniques with cultural heritage preservation, thus contributing to a movement that democratizes access to linguistic resources and supports the survival of underrepresented languages. By demonstrating how LLMs can address challenges in LRLs, we provide valuable insights for researchers at the juncture of machine learning, humanities, and technology, and we show how the same tools that shape the future of artificial intelligence can be repurposed to illuminate the linguistic past of humankind.
All code, models, and implementation details are publicly available in our repository (https://github.com/tux550/OldEnglish-LLM (accessed on 17 January 2025)). The pipeline implementation in Google Colab notebooks makes our framework accessible with minimal setup requirements. Our implementation leverages the Hugging Face Transformers library for model management and fine-tuning, while the dual-agent pipeline utilizes OpenAI’s API for the content generation component. This architecture allows researchers to reproduce our results and adapt our methods to other low-resource languages with standard computational resources.
The remainder of this article is organized as follows. Section 2 details our three-stage methodology, beginning with data preparation using the Dictionary of Old English Corpus and other historical sources, followed by a progressive model training approach that combines domain adaptation and task specialization phases. Section 2 also describes our dual-agent synthetic data generation pipeline distinguishing content creation from translation tasks. Section 3 outlines our use of automated metrics (BLEU, METEOR, CHRF) alongside expert human assessment criteria. Section 4 presents our findings, which demonstrate substantial improvements in translation quality—with BLEU scores increasing from 26 to over 65—and highlights both the strengths and limitations of our approach through detailed linguistic analysis. Section 4 also examines the broader implications of this work for other LRLs. Section 5 summarizes the main conclusions and provides future directions for improving semantic coherence and historical authenticity in generated texts.

2. Methods

This study addresses the challenge of expanding the Old English corpus by adopting a structured, multi-stage methodology. We reconceptualize synthetic data generation as a machine translation task, which leverages the strengths of models trained on resource-rich languages to manage the linguistic complexities inherent to Old English. The workflow comprises three primary stages: data preparation, model training, and synthetic data generation. Throughout these stages, we utilize advanced machine learning techniques, including LoRA for efficient fine-tuning, back-translation for robust data augmentation, and a dual-agent architecture for targeted generation and translation. Code and resources are available on GitHub (GitHub, Inc., San Francisco, CA, USA; https://github.com/tux550/OldEnglish-LLM (accessed on 17 January 2025)). The repository includes complete implementations of all prompt templates, training scripts, and evaluation protocols discussed in this paper. Specifically, the prompt templates can be found in the AgentPipeline.ipynb notebook under the ‘Prompt Templates’ section, which provides researchers with ready access to our exact implementation details.

2.1. Data Preparation

To establish a strong foundation for training, a diverse dataset was curated. The primary source was the Dictionary of Old English Corpus (DOEC), which contains the complete written records of the language, around 3000 texts comprising 3 million words [13]. The DOEC gathers prose and poetry texts that belong to a wide range of styles: religious texts (homilies and sermons, biblical translations, hagiographies, and liturgical texts), legal texts, historical and chronicle texts, literary texts (poetry, riddles, and wisdom literature), scientific and medical texts, glosses and glossaries, letters and administrative texts (charters and wills), proverbs and maxims, and miscellaneous texts. Additionally, an annotated subcorpus of the DOEC was employed to provide the model with examples of translations [14]. The Bosworth–Toller Anglo-Saxon Dictionary [15,16] served as a complementary dataset that offered definitions and usage contexts for words in Old English.
The collected texts underwent rigorous standardization to ensure consistency across sources. Issues such as non-standard characters, diacritics, and linguistic ambiguities were addressed using a modified version of The Classical Language Toolkit [17] that was adapted to standardize punctuation and character representation. Additionally, low-quality samples were filtered out. The finalized dataset was split into training, validation, and testing subsets, with monolingual Old English texts separately prepared for data augmentation tasks such as back-translation.
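The standardization and splitting steps can be sketched as follows. This is a minimal illustration that uses only the Python standard library as a stand-in for the modified Classical Language Toolkit. The punctuation map, the 80/10/10 split ratios, and the function names are assumptions introduced for this example and are not taken from the study.

```python
import random
import re
import unicodedata

# Hypothetical punctuation map; the study's actual rules live in its
# modified version of the Classical Language Toolkit.
PUNCT_MAP = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def normalize_fragment(text: str) -> str:
    """Compose Unicode characters, unify quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle deterministically, then split into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy fragments with irregular spacing (illustrative input only).
fragments = [normalize_fragment(f"  Hw\u00e6t!   We Gardena {i} ") for i in range(10)]
train, val, test = split_dataset(fragments)
```

NFC normalization leaves Old English letters such as æ, þ, and ð intact while merging base letters with combining diacritics, so visually identical strings compare as equal across sources.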

2.2. Model Training

The training process adapts the language model to Old English through a progressive, multi-stage approach that combines Domain-Adaptive Pretraining (DAPT) and Task-Adaptive Pretraining (TAPT) to achieve greater linguistic and stylistic accuracy [18]. As illustrated in Figure 1, this process systematically transitions the model into the new language domain, which enhances both grammatical fidelity and alignment with the stylistic norms of Old English.
For clarity and consistency, we use the ISO 639-3 codes ANG (Old English) and ENG (Contemporary English) to refer to these languages in all prompt templates, tables, and diagrams throughout this section and the remainder of the paper.
The pipeline depicted in Figure 1 represents the implemented workflow in this study, not merely a proposal for future work. It illustrates the practical implementation of the three-phase approach described in Section 2: data preparation (mixed dataset creation), model training (domain adaptation and task specialization), and synthetic data generation. The two training circles in the diagram correspond to the two distinct training phases detailed in Section 2.2.1 and Section 2.2.2: domain adaptation (producing the OldEnglishBase model) and task specialization with enriched translations (producing the OldEnglishRefined model).
As can be seen in Figure 1, the process begins with a base model trained using Efficient Task-Similar Domain-Adaptive Continual-Pretraining on a mixed dataset of English and Old English segments (blue and red, respectively). Synthetic Old English translations (purple) are generated from an unseen English corpus via inference. These are then combined with human-annotated parallel data to create an enriched translation dataset. In a final step, the model is further specialized (fine-tuned) using this combined dataset, resulting in a refined model with improved fluency and linguistic fidelity. Color coding highlights English (blue), Old English (red), and synthetic Old English (purple) data at each stage.

2.2.1. Phase 1: Domain Adaptation

The initial phase of our training pipeline centers on domain adaptation, where the language model is progressively exposed to Old English as its new target domain. This step is crucial for equipping the model with a foundational grasp of the structures and vocabulary of Old English, which ensures it can both interpret and generate authentic language fragments.
Due to the scarcity of high-quality resources in Old English, traditional Domain-Adaptive Pretraining (DAPT) [12] is not feasible at scale. To overcome this limitation, we implement an Efficient Task-Similar Domain-Adaptive Continual-Pretraining strategy [19]. Continual pre-training has recently emerged as a leading strategy for adapting LLMs to low-resource languages: Nag et al. [20] applied continual pre-training to Llama-3 across nine Indian languages with diverse scripts and resource levels. These authors use specific algorithms for selecting both the most informative subset of texts and the most relevant tokens to add to the model vocabulary, and they evaluate on IndicGenBench. Their findings reinforce that targeted continual pre-training is an effective and computationally tractable route to language adaptation in low-resource settings, providing additional support for our choice of an efficient task-similar continual pre-training strategy for Old English. In our approach, the continual pre-training strategy leverages the existing proficiency of the model in Modern English to facilitate a smoother and more data-efficient transition to Old English. This maximizes the utility of the available resources.
Specifically, we define four related tasks that guide the adaptation of the model:
1. Text completion: The model is prompted to complete an Old English fragment, which reinforces its ability to generate fluent continuations.
2. Forward translation: The model translates Modern English fragments into Old English, which directly targets the core generation task.
3. Back translation: The model translates Old English fragments back into Modern English, which promotes bidirectional understanding and reduces overfitting.
4. Crossed definition: The model provides Modern English definitions for Old English words, which strengthens lexical alignment between the languages.
Examples of these prompt formats are provided in Table 1, which illustrates how each task contributes to the overall domain adaptation.
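As a rough illustration of how the four tasks above could be turned into training examples, the sketch below builds prompt/target pairs from simple templates that use the ANG and ENG codes. The template wording, the `build_example` helper, and the toy sentences are hypothetical; the formats actually used in the study are those given in Table 1.

```python
# Hypothetical templates; the formats actually used are those in Table 1.
TEMPLATES = {
    "completion": "[ANG] {ang_prefix}",
    "forward": "Translate ENG to ANG.\nENG: {eng}\nANG:",
    "back": "Translate ANG to ENG.\nANG: {ang}\nENG:",
    "definition": "Define the ANG word in ENG.\nANG word: {word}\nENG definition:",
}


def build_example(task: str, target: str, **fields: str) -> dict:
    """Return a prompt/target pair for one of the four adaptation tasks."""
    prompt = TEMPLATES[task].format(**fields)
    return {"task": task, "prompt": prompt, "target": target}


# Toy examples for illustration only.
examples = [
    build_example("completion", target="in geardagum", ang_prefix="Hwæt! We Gardena"),
    build_example("forward", target="Se cyning", eng="The king"),
    build_example("back", target="The king", ang="Se cyning"),
    build_example("definition", target="king", word="cyning"),
]

for ex in examples:
    print(ex["task"], "->", repr(ex["prompt"]))
```

Keeping all four tasks in one mixed dataset means the forward and back translation examples can share the same underlying sentence pairs, with only the prompt direction changed.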
For this phase, we initialize from the Llama-3.1-8B checkpoint [21] and employ LoRA [4] for computationally efficient training. The resulting checkpoint, adapted for Old English, is referred to as OldEnglishBase.

2.2.2. Phase 2: Task Specialization

The second stage of training builds on the OldEnglishBase model obtained from the domain adaptation phase and focuses on task specialization. In this step, we introduce the technique of back-translation [5], a widely adopted data augmentation method in low-resource neural machine translation. More recent evidence supports this choice: de Gibert et al. [22] demonstrated at EMNLP 2025 that LLM-generated synthetic parallel data, even when noisy, can substantially improve low-resource machine translation across seven target languages and 147 additional language pairs via pivoting. Automatic and human evaluation confirm the overall high quality of the synthesized corpora. Our approach builds on this paradigm, specializing it for the historical setting of Old English.
Back-translation works by taking monolingual Old English text from the corpus and translating it into Modern English using the strong reverse translation capability of the model. Because the base model is English-centric, it reliably produces high-quality Modern English translations from Old English input at this stage [6]. These newly generated Modern English sentences are then paired with their original Old English counterparts to form synthetic parallel corpora. This expanded dataset introduces greater lexical and stylistic variety, which is critical for further fine-tuning the model's ability to generate Old English from Modern English prompts.
During the synthetic data generation process, we feed the model previously unseen fragments from the monolingual Old English corpus and use greedy inference [5] to produce accurate Modern English translations. By combining these synthetic parallel pairs with the existing human-annotated examples, we assemble a richer and more diverse training set for forward translation.
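The construction of the enriched dataset can be sketched as below. The translator is passed in as a plain callable so that the example runs without a model; in the study this role is played by greedy inference with the adapted model. Skipping empty generations and removing exact duplicates are assumptions added for this illustration, as are all function names and the toy lookup table.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (ENG source, ANG target)


def back_translate(
    ang_fragments: Iterable[str],
    ang_to_eng: Callable[[str], str],
) -> List[Pair]:
    """Pair each authentic ANG fragment with a generated ENG translation."""
    pairs = []
    for ang in ang_fragments:
        eng = ang_to_eng(ang).strip()
        if eng:  # assumption: drop fragments with empty output
            pairs.append((eng, ang))
    return pairs


def build_enriched_dataset(
    human_pairs: List[Pair], synthetic_pairs: List[Pair]
) -> List[Pair]:
    """Combine human-annotated and synthetic pairs, dropping exact duplicates."""
    seen = set()
    merged = []
    for pair in human_pairs + synthetic_pairs:
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


def toy_translator(ang: str) -> str:
    """Stand-in for the model's ANG-to-ENG inference (hypothetical lookup)."""
    return {"Se cyning": "The king", "Seo cwen": "The queen"}.get(ang, "")


human = [("The king", "Se cyning")]
synthetic = back_translate(["Se cyning", "Seo cwen", "unknown"], toy_translator)
enriched = build_enriched_dataset(human, synthetic)
```

Note that the Old English side of every synthetic pair is authentic corpus text; only the Modern English side is machine-generated, so the forward translation model is always trained toward genuine targets.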
Fine-tuning the model on this enriched dataset yields the OldEnglishRefined model. This final checkpoint demonstrates improved capabilities, generating Old English texts that are both more coherent and more contextually faithful to the original Modern English input.

2.3. Toward High-Quality Synthetic Data

Following the domain adaptation and task specialization phases, the final stage addresses the persistent scarcity of Old English corpora by generating high-quality synthetic data. We implement a dual-agent architecture that divides generation into fragment creation and translation, maximizing both linguistic diversity and fidelity through coordinated processing of authentic and synthetic datasets.
In this pipeline, shown schematically in Figure 2, the agents interact with both authentic and synthetic datasets, with each agent performing a distinct but complementary role. Randomly sampled fragments from the reference corpus are provided as contextual anchors for the agents to ensure that the generated sentences remain stylistically consistent and lexically relevant to real usage in Old English.

2.3.1. Fragment Generator Agent

The first agent, FragmentGen, is responsible for producing new text fragments in Modern English. To guide this process, the agent receives a selection of example sentences randomly drawn from the DOEC, which serves as a stylistic template and vocabulary guide. This context-driven prompting strategy reduces the likelihood of out-of-vocabulary or domain-inappropriate content, thus facilitating more effective downstream translation.
Since this agent operates exclusively in Modern English, it can leverage advanced general-purpose language models without requiring any expertise in Old English. For this study, we employ GPT-4o-mini for its strong generative capabilities and controllability.

2.3.2. Translation Agent

The second agent, OldEnglishTranslator, utilizes the OldEnglishRefined model to translate the Modern English fragments into Old English. To preserve grammatical accuracy and stylistic coherence, the agent applies few-shot prompting. This draws on bilingual pairs originally sampled by the FragmentGen agent. Greedy decoding is used for inference, which ensures reproducibility and minimizes unnecessary variation in the outputs.
By integrating these two specialized agents into a unified pipeline, we can efficiently generate synthetic fragments in Old English that are not only linguistically diverse but also adhere closely to authentic stylistic and lexical norms. This process augments the available training data and further improves the model's performance on both translation and text generation tasks in low-resource settings.
As shown in Figure 2, datasets (red) provide randomly sampled fragments (blue process) to two collaborating agents (green). First, FragmentGen creates new Modern English text using few-shot prompts for stylistic alignment. Then, the OldEnglishTranslator converts these into Old English, leveraging bilingual context. The result is a high-quality output dataset of synthetic Old English text. This architecture maximizes linguistic diversity and coherence by combining guided mutation and translation in a controlled, multi-agent workflow.
The few-shot prompts used in our implementation follow a consistent template structure. For the FragmentGen agent, the prompt includes 3–5 authentic Old English text samples followed by their Modern English translations, and instructions to generate new content in a similar style. For example:
[Context provided: 3–5 Old English samples with translations]
Based on these examples, generate a new text fragment in Modern English that would be stylistically appropriate for translation to Old English. The text should maintain similar themes, complexity, and register as the examples.
For the OldEnglishTranslator agent, the prompt includes parallel examples and explicit instructions for translation, as in the example:
[Context provided: 2–3 Modern English-Old English pairs]
Translate the following Modern English text to Old English, maintaining appropriate grammar, vocabulary, and style:
[Modern English text to be translated]
These prompts are dynamically constructed during pipeline execution, with context samples randomly selected from our curated parallel corpus to ensure diversity in the generated outputs.
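A minimal sketch of one pass through the dual-agent pipeline is given below. Both agents are represented by plain callables so that the example runs offline: `fake_generate` stands in for the general-purpose model behind FragmentGen and `fake_translate` for the OldEnglishRefined model. The helper names, the `ANG:`/`ENG:` line layout, the choice of three context samples, and the toy corpus are assumptions made for illustration; the exact templates are in the repository notebook.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (ANG text, ENG translation)


def fragment_prompt(samples: List[Pair]) -> str:
    """FragmentGen prompt: ANG samples with ENG translations, then the instruction."""
    context = "\n".join(f"ANG: {ang}\nENG: {eng}" for ang, eng in samples)
    return (
        f"{context}\n"
        "Based on these examples, generate a new text fragment in Modern English "
        "that would be stylistically appropriate for translation to Old English."
    )


def translation_prompt(samples: List[Pair], eng_text: str) -> str:
    """OldEnglishTranslator prompt: ENG-ANG pairs, then the text to translate."""
    context = "\n".join(f"ENG: {eng}\nANG: {ang}" for ang, eng in samples)
    return (
        f"{context}\n"
        "Translate the following Modern English text to Old English, maintaining "
        f"appropriate grammar, vocabulary, and style:\nENG: {eng_text}\nANG:"
    )


def run_once(
    corpus: List[Pair],
    generate: Callable[[str], str],
    translate: Callable[[str], str],
    rng: random.Random,
    n_context: int = 3,
) -> Tuple[str, str]:
    """Sample context, generate an ENG fragment, then translate it into ANG."""
    samples = rng.sample(corpus, n_context)
    eng = generate(fragment_prompt(samples))
    ang = translate(translation_prompt(samples, eng))  # reuse the same samples
    return eng, ang


prompts_seen = []  # records the prompts each placeholder agent receives


def fake_generate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "The king is good."  # placeholder output


def fake_translate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "<ANG output>"  # placeholder output


corpus = [("Se cyning", "The king"), ("Seo cwen", "The queen"),
          ("Ic eom", "I am"), ("Þæt wæs god cyning", "That was a good king")]
eng, ang = run_once(corpus, fake_generate, fake_translate, random.Random(0))
```

Passing the same sampled pairs to both agents mirrors the description above, in which the translation agent draws on the bilingual pairs originally sampled for FragmentGen.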

3. Evaluation

The evaluation process combined automated metrics and human assessments to ensure reliability and quality.
For the automated metrics we employed BLEU [23], METEOR [24], and CHRF [25]. These metrics function as quantitative proxies for human judgment that compare the generated translation to reference human translations. BLEU evaluates precision at the n-gram level, and thus, assesses how much of the generated text overlaps with the reference. METEOR improves on this by incorporating recall and using techniques such as stemming and synonym matching for better alignment with human evaluation. CHRF, in contrast, evaluates translations based on character-level precision and recall.
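The core quantities behind BLEU and CHRF can be illustrated with a short sketch. It is deliberately simplified: full BLEU combines clipped precisions for several n-gram orders with a brevity penalty, and full CHRF averages over several character n-gram orders, neither of which is reproduced here. The function names and toy sentences are assumptions for this example, and scores are on a 0–1 scale rather than 0–100.

```python
from collections import Counter


def ngrams(seq, n):
    """Count all contiguous n-grams of a sequence (tokens or characters)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def clipped_precision(hyp_tokens, ref_tokens, n):
    """BLEU-style modified n-gram precision against a single reference."""
    hyp, ref = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / total


def char_fscore(hyp, ref, n=3, beta=2.0):
    """chrF-style F-beta over character n-grams of a single order."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


hyp = "se cyning com ham".split()
ref = "se cyning for ham".split()
p1 = clipped_precision(hyp, ref, 1)  # 3 of 4 unigrams match
p2 = clipped_precision(hyp, ref, 2)  # 1 of 3 bigrams matches
```

The character-level score is more forgiving of inflectional variation than the word-level precision, since two forms of the same lemma share most of their character n-grams; this is one reason character-based metrics are often reported for morphologically rich languages.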
Regarding human evaluation, expert reviewers rated the outputs on three criteria: grammatical accuracy (inflection and word order), stylistic fluency (lexical choice), and contextual coherence. The expert evaluation was conducted on a 10-point scale, with a threshold score of 7 set to determine whether a text is appropriate for incorporation into the extended corpus. This task was carried out on a stratified sample of 150 generated segments, randomly selected from the output of 500 generated texts to avoid selection bias. Each segment ranged from 15 to 45 words in length and was required to constitute a syntactically complete unit (containing at least one independent clause). The sample covered the major textual genres represented in the corpora of Old English, including religious texts (homilies and biblical translations, 35%), legal documents (charters and law codes, 20%), historical chronicles (15%), literary texts (poetry, 15%), and medical/scientific texts (15%).
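The acceptance step can be sketched as a simple filter. The text specifies the 10-point scale and the threshold of 7, but not how scores are aggregated across criteria and evaluators; the sketch below assumes an unweighted mean over the three criteria of the mean evaluator score. The criterion keys, segment identifiers, and ratings are hypothetical values used only to exercise the code.

```python
from statistics import mean

THRESHOLD = 7  # acceptance threshold on the 10-point scale (from the text)
CRITERIA = ("grammar", "style", "coherence")  # hypothetical key names


def segment_score(ratings):
    """Mean over criteria of the mean evaluator score (assumed aggregation)."""
    return mean(mean(r[c] for r in ratings) for c in CRITERIA)


def accepted(segments):
    """Return the identifiers of segments whose score meets the threshold."""
    return [
        seg_id
        for seg_id, ratings in segments.items()
        if segment_score(ratings) >= THRESHOLD
    ]


# Hypothetical ratings from three evaluators for two segments.
segments = {
    "s1": [
        {"grammar": 9, "style": 9, "coherence": 8},
        {"grammar": 8, "style": 9, "coherence": 7},
        {"grammar": 9, "style": 10, "coherence": 8},
    ],
    "s2": [
        {"grammar": 6, "style": 7, "coherence": 5},
        {"grammar": 7, "style": 6, "coherence": 5},
        {"grammar": 6, "style": 6, "coherence": 6},
    ],
}
kept = accepted(segments)
```

A stricter variant would require every criterion to meet the threshold individually, which would exclude segments that are grammatically sound but semantically weak.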
The expert evaluation was conducted by three specialists in Old English linguistics, each with doctoral-level qualifications and/or publications in the field of historical Germanic linguistics or Old English corpus linguistics. The evaluators included two university professors with over 15 years of experience in Old English philology and one predoctoral researcher specializing in computational approaches to historical linguistics. To ensure consistency between evaluators, we implemented a standardized scoring rubric with detailed criteria for each evaluation dimension. Before the main evaluation, we conducted a calibration session where all evaluators independently scored the same set of 10 text samples, followed by a discussion to align their interpretations of the scoring criteria. The names of these experts are acknowledged in the Acknowledgements Section of this paper.

4. Results and Discussion

This study provides compelling evidence that large language models (LLMs) can address the challenge of corpus expansion for LRLs, with Old English serving as a representative case study. Through a systematic combination of advanced model fine-tuning, targeted data augmentation, and a modular, task-specific pipeline, we achieved improvements in both translation quality and linguistic fidelity. Our results not only showcase gains in performance over baseline models but also illuminate the specific strengths and persistent limitations encountered when adapting LLMs to a historically and linguistically complex low-resource setting.

4.1. Model Fine-Tuning and Performance

In Table 2, all displayed values are reported on a 0–100 scale, with higher scores indicating closer correspondence with reference human translations.
The initial evaluation of baseline models—Llama-3.1-8B [21] and Mistral-7B [26]—revealed critical limitations in their ability to process Old English. Both models achieved moderate results in reverse translation tasks (Old English to Contemporary English), yet their performance in direct translation (Contemporary English to Old English) lagged (Table 2). This pronounced asymmetry highlights the inherent difficulties of generating accurate and fluent text in a low-resource language characterized by complex grammatical structures.
Among the baselines, Llama-3.1-8B consistently outperformed Mistral-7B across all evaluation metrics (BLEU, CHRF, and METEOR), with an average margin of 2 to 5 points (Table 2). Nonetheless, both models commonly exhibited translation errors, including repetitive or looped text, untranslated segments, and stylistic inconsistencies (Table 3). These recurring issues underscored the necessity of further fine-tuning and methodological innovation to adapt LLMs for high-fidelity text generation in Old English.
The refinement process demonstrated measurable improvements. OldEnglishBase served as a robust initial model capable of capturing the vocabulary and grammar of Old English. It improved substantially on the baseline Llama model in both direct and reverse translation tasks, with absolute score increases of approximately 35 points (from ~25 to ~60) for English to Old English translation and 20 points (from ~47 to ~67) for Old English to English translation, as can be seen by comparing Table 2 and Table 4. For direct translation, these gains more than doubled the baseline scores; for reverse translation, they correspond to a relative increase of roughly 40%. We focused our subsequent experiments on the Llama architecture due to its consistently superior performance over Mistral in our initial evaluations.
The refinement of this model through back-translation proved to be key in overcoming the stagnation observed after three epochs of fine-tuning (Table 4). OldEnglishRefined excelled in producing fluent and contextually appropriate Old English text, with improvements of approximately 6 points across all metrics for direct translation. It is important to note that this process specializes the model for direct translation: the gains in reverse translation are minimal.
The gains obtained by our model align with previous applications of back-translation in large language models, such as the Linguistically Diverse Prompting approach [6], which reported increases of up to 8 points in the chrF++ metric. Notably, our model starts from a baseline over twice as high owing to the domain adaptation step, which makes further improvements more difficult to achieve. Despite this, we attained gains comparable with those of prior work, which highlights the effectiveness of this approach.
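For readers unfamiliar with back-translation, the general technique can be sketched as follows: authentic target-language text is translated into the source language by a reverse model, and the resulting synthetic source is paired with the authentic target to train the forward direction. The sketch below is a generic illustration under assumptions, not the training code of this study: the function name, the `ang_to_en` callable, the length-ratio filter, and the stub translator are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def back_translate(
    old_english_texts: Iterable[str],
    ang_to_en: Callable[[str], str],
    min_len_ratio: float = 0.5,
    max_len_ratio: float = 2.0,
) -> List[Dict[str, str]]:
    """Build synthetic (English source, Old English target) training pairs.

    The Old English side is authentic; only the English side is machine-made,
    so the forward model (EN -> ANG) always learns to produce real Old English.
    """
    pairs = []
    for target in old_english_texts:
        if not target.strip():
            continue  # skip empty inputs
        source = ang_to_en(target).strip()
        if not source:
            continue  # skip failed translations
        # Assumed sanity filter: drop pairs with implausible length ratios,
        # which often signal looped or truncated generations.
        ratio = len(source.split()) / len(target.split())
        if min_len_ratio <= ratio <= max_len_ratio:
            pairs.append({"source_en": source, "target_ang": target})
    return pairs


def stub_ang_to_en(text: str) -> str:
    """Hypothetical stand-in for a reverse-translation model."""
    return "the king rode"


pairs = back_translate(["se cyning rad", "   "], stub_ang_to_en)
```

Because only the source side is synthetic, this design is consistent with the observation that the refinement benefits direct translation far more than reverse translation.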
In Table 3, the examples illustrate issues such as looped generation, non-translated segments, and hallucinated vocabulary.
In Table 4, higher values indicate greater similarity to reference human translations. Results are reported for both Modern English to Old English (EN → ANG) and Old English to Modern English (ANG → EN) directions. Performance improves up to three epochs, after which overfitting and catastrophic forgetting reduce scores, especially for reverse translation.
In Figure 3, box plots show the distributions of BLEU, CHRF, and METEOR scores for three model stages (Base, OE-Base, and OE-Refined) on both translation directions: (a) Modern English → Old English and (b) Old English → Modern English. Scores are normalized to [0, 1]. The dotted line within each box indicates the mean. Analysis: For both directions, there is a clear performance progression: OE-Base models (orange) show marked gains over the Base (blue), with further improvements from OE-Refined (green). This is reflected in higher medians, means, and upper quartiles across all metrics. Notably, the score distribution for OE-Refined is more compact, indicating more consistent high-quality outputs. Forward translation benefits most, with especially large gains in BLEU and METEOR. Outliers are reduced as training progresses, underscoring improved robustness. The observed plateau in improvement after three epochs aligns with expectations for translation tasks involving morphologically complex languages. BLEU scores exceed 60 for both translation directions, which shows that the model approaches performance levels typically associated with human-quality translations between more resource-rich language pairs. This suggests that our domain adaptation phase successfully captured the fundamental linguistic patterns of Old English despite the limited training data available.

4.2. Synthetic Data Generation

The dual-agent pipeline, comprising FragmentGen and OldEnglishTranslator, played a pivotal role in addressing the limited availability of annotated data in Old English. FragmentGen generated coherent and thematically diverse texts in Contemporary English, thus providing high-quality inputs for the translation agent. OldEnglishTranslator, built on the trained OldEnglishRefined model, translated these texts into Old English with a high degree of grammatical and stylistic fidelity.
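The separation of concerns in the dual-agent pipeline can be illustrated with the following minimal sketch. It shows only the control flow (generate in Contemporary English, then translate into Old English) and is not the implementation in our repository: the `SyntheticSegment` record, the `run_pipeline` function, and the two stub agents are hypothetical placeholders for calls to the actual models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SyntheticSegment:
    genre: str
    english: str
    old_english: str


def run_pipeline(
    genres: Iterable[str],
    fragment_gen: Callable[[str], str],
    translator: Callable[[str], str],
    per_genre: int = 1,
) -> List[SyntheticSegment]:
    """Chain the two agents: content generation first, translation second."""
    segments = []
    for genre in genres:
        for _ in range(per_genre):
            english = fragment_gen(genre)      # agent 1: content in English
            old_english = translator(english)  # agent 2: English -> Old English
            segments.append(SyntheticSegment(genre, english, old_english))
    return segments


# Hypothetical stubs standing in for FragmentGen and OldEnglishTranslator.
def stub_fragment_gen(genre: str) -> str:
    return f"A short {genre} passage."


def stub_translator(english: str) -> str:
    return f"<ANG translation of: {english}>"


segments = run_pipeline(["legal", "medical"], stub_fragment_gen, stub_translator, per_genre=2)
```

Keeping the two agents behind plain callables means that either one can be replaced (for example, by a larger generator or a retrieval-grounded translator) without changing the rest of the pipeline.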

4.3. Expert Linguistic Evaluation

Three expert linguists with published research in Old English corpus linguistics and computational linguistics independently evaluated the generated texts across four criteria: inflection (nominal and adjectival declension, verbal conjugation, and subject-verb agreement), word order, lexical choice (authenticity based on attestedness in surviving Old English texts), and semantic coherence. Each text segment, defined as a syntactically complete unit containing one or more clauses, was scored on a 1–10 scale for each criterion, and these four scores were averaged to produce a composite score per segment. To assess the consistency of expert evaluations, we calculated inter-rater agreement across the three independent linguist evaluators. For each evaluated segment, we computed the range (difference between highest and lowest scores) and standard deviation of composite scores. Across 50 randomly sampled segments, the mean score range was 0.83 points (SD = 0.34), with 94% of segments showing agreement within 1 point. This indicates substantial inter-rater reliability. Fleiss’ kappa computed over the four evaluation criteria was κ = 0.76, which corresponds to “substantial agreement” according to standard interpretation guidelines. The highest agreement was observed for inflection (κ = 0.81) and lexical choice (κ = 0.79), while semantic coherence showed slightly lower but still acceptable agreement (κ = 0.68), which reflects the more subjective nature of contextual interpretation.
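The agreement statistics reported above can be computed along the following lines. This is a minimal sketch of the standard Fleiss' kappa formula and of the per-segment score range, not the analysis script used in the study. Fleiss' kappa treats ratings as nominal categories; how the 1–10 scores were mapped to categories is not specified here, so the toy ratings below are purely illustrative, and all function names are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of categorical labels per item,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of raters")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed += agreeing / (n_raters * (n_raters - 1))
    observed /= n_items

    # Agreement expected by chance from the overall category proportions.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)


def score_ranges(composites):
    """Per-segment range (max minus min) of the raters' composite scores."""
    return [max(scores) - min(scores) for scores in composites]


# Toy data: 4 items, 3 raters, 2 categories (illustrative only).
toy_ratings = [[1, 1, 2], [1, 2, 2], [1, 1, 1], [2, 2, 2]]
kappa = fleiss_kappa(toy_ratings)

ranges = score_ranges([[8.5, 9.0, 8.0], [9.0, 9.0, 9.0]])
share_within_one = sum(r <= 1.0 for r in ranges) / len(ranges)
```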
For instance, the segment “spyr on ðam pocce on eagum; heo sceal niman wad and gearwan, forðam hi ealle andswaredon” (“Inquire about the pock in the eyes; she shall take woad and yarrow, for they all answered”) was rated highly for its accurate use of imperative and future verbal forms (spyr, sceal niman), thus demonstrating a strong command of the morphology of Old English. The passage also presents a clear instructional structure and employs medical and technical herb terminology with precision, which closely follows authentic conventions of the genre. These strengths are reflected in the scores for inflection (9/10), word order (8/10), and lexical choice (10/10). On the other hand, the segment received a lower score for semantic coherence (7/10), due to an inadequate continuation of the topic: the causal relationship is unclear, as the last clause (forðam hi ealle andswaredon) is incorrectly linked to the main predication. Despite this minor issue, the overall evaluation for this passage was positive, with an average score of 8.5/10.
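The composite score for this segment can be reproduced with a few lines of Python. The four scores are those given above; applying the acceptance threshold of 7 to the composite score (rather than to each criterion separately) is an assumption made for this illustration, as are the variable and function names.

```python
from statistics import mean

ACCEPTANCE_THRESHOLD = 7  # threshold for inclusion in the extended corpus


def composite_score(scores):
    """Average the four criterion scores into one composite score."""
    return mean(scores.values())


segment_scores = {
    "inflection": 9,
    "word_order": 8,
    "lexical_choice": 10,
    "semantic_coherence": 7,
}

composite = composite_score(segment_scores)  # (9 + 8 + 10 + 7) / 4 = 8.5
# Assumption: the threshold is applied to the composite score.
accepted = composite >= ACCEPTANCE_THRESHOLD
```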
In Table 5, metrics (BLEU, CHRF, and METEOR) are scaled 0–100; higher scores indicate greater similarity to human reference translations. Results show gains moving from the baseline Llama model to OldEnglishBase, with further improvements achieved by the OldEnglishRefined model, especially in direct translation to Old English.
From a qualitative perspective, manual assessment of the synthetically generated texts underscores the strengths of the model in grammatical structure, particularly in inflection and word order. The system reliably produces complex constructions—such as relative and conditional clauses, and other intricate syntactic patterns—with high fidelity. Furthermore, the decorum of the genre is consistently maintained, as the generated texts adapt stylistic conventions appropriate to different genres. Specialized terminology in religious, legal, and historical contexts is both accurate and contextually appropriate, which demonstrates the flexibility of the model across diverse textual traditions.
Despite these notable strengths, semantic coherence emerged as the most frequent issue in the generated texts. In some cases, anachronistic concepts, such as modern dates or technologies, are retrofitted into Old English, which leads to slight incongruences. For example, in the segment “gif ðu bist getogen on andwerdre tide and on tocyme, ðonne scealt ðu gymene habban to ðære lare on armenia” (“if you are trained on present and future tense data, you must pay attention to the teaching methods in Armenia”), references to data and training introduce modern elements that are historically out of place.
Complex narratives can occasionally be oversimplified, and unrelated domains may be inadvertently blended. For instance, in the passage “wið swile gate tyrdlu on scearpum ecede gesoden and on selfe wisan on gedon” (“to overcome swelling, apply sharp vinegar sodden with goats treadles and keep going on good advice”), the second coordinate clause diverges from the main narrative, so that the overall coherence is reduced.
Across genres, biblical quotations and religious instruction are typically more convincing and stylistically consistent than medical remedies, land charters, or detailed historical chronicles. The model tends to avoid structurally marked configurations such as non-nominative subjects, non-accusative objects, or pragmatic reorderings (like the V2 rule). Another recurring issue is the loss of co-reference tracking in complex syntactic structures. For example, in the segment “Ða axodon hi hie hu fela sealfe hæbbe ge. Hi cwædon seofon hræfnes & an rahdeores” (“Then they asked her how many salves have you? She said seven crabs and a roebuck”), the reference point shifts from the singular second person to the plural third person, which creates ambiguity.
Table 6 illustrates the progressive enhancement of translation quality in a zero-shot setting. It highlights the evolution of outputs as the model refines its understanding of the target language.
As summarized in Table 7, expert evaluation demonstrates that the model excels in producing Old English with high grammatical accuracy and appropriate lexical choices, as reflected by average scores of at least 9.0 for inflection, word order, and lexical selection. These results indicate an excellent command of the morphology, syntax, and vocabulary of Old English in the generated texts. However, a lower score for semantic coherence (7.8) underlines a recurring limitation: the output is structurally and lexically convincing, but the model occasionally struggles to maintain deep narrative consistency or capture subtle contextual relationships. This assessment is based on detailed, criterion-specific scoring by specialist linguists of Old English, so that both surface-level correctness and deeper semantic fidelity are measured. Overall, consistently high scores across linguistic categories confirm the effectiveness of our approach in generating plausible Old English and point to areas for future improvement in semantic integration.
In Table 7, each criterion was rated on a 1–10 scale, with higher scores indicating closer alignment with grammatical and stylistic expectations found in authentic Old English sources. The model achieved high marks for inflection, word order, and lexical choice; semantic coherence, though strong, remains the most challenging aspect and the main area for improvement. The overall average reflects robust linguistic performance.

4.4. Challenges and Limitations

Despite the notable achievements of this study, several challenges remain. The scarcity of high-quality resources in Old English continues to constrain the diversity and complexity of training data. Although data augmentation strategies partially alleviate this limitation, certain linguistic features of Old English, such as its intricate case system and free word order, pose difficulties for both model training and text generation.
The generation of synthetic Old English texts raises fundamental questions about authenticity, representation, and the epistemological status of computationally reconstructed language. While our model achieves high grammatical fidelity and draws exclusively from attested vocabulary and constructions, the generated texts remain fundamentally different from authentic manuscripts: they lack the cultural embeddedness, sociolinguistic context, and interpretive traditions that characterize surviving Old English literature. In this vein, we emphasize that synthetic texts should be understood as computational approximations rather than historical discoveries. They reflect patterns learned from the extant corpus but cannot capture unattested nuances, regional variations, or diachronic changes that existed in the living language. Their primary value lies in corpus augmentation for computational tasks (training NLP models, testing linguistic hypotheses) and educational applications (generating practice materials, illustrating grammatical patterns), rather than in serving as evidence for historical or philological claims about the language itself.
Users of generated Old English must maintain critical awareness of these limitations. Synthetic texts are best employed as supplementary resources that expand analytical possibilities while remaining grounded in authentic sources. For educational contexts, generated materials should be clearly labeled as computational constructs and used alongside genuine historical texts. For research applications, synthetic data can enable larger-scale computational analyses (such as training taggers or parsers) that would be impossible with the limited historical corpus alone, but findings should be validated against authentic sources whenever possible. This epistemological caution ensures that AI-generated language serves scholarship without displacing the irreplaceable value of historical linguistic evidence.
A promising avenue for addressing these questions is the integration of Retrieval-Augmented Generation (RAG) techniques [27,28]. By embedding curated fragments from authentic sources in Old English into a dedicated vector database, the model can dynamically retrieve contextually relevant passages during both training and synthetic data generation. This retrieval-based grounding not only helps mitigate issues such as hallucinations, repetitive text, and stylistic drift but also provides real historical references to guide the output of the model. More importantly, RAG can improve semantic coherence and context sensitivity, which makes it valuable for downstream applications such as question answering, where generating accurate and contextually anchored responses is critical.
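The retrieval step of such a RAG setup can be sketched as follows. This is a toy illustration of the general idea, not a component of the present system: the bag-of-character-trigram "embedding" stands in for a learned encoder and a vector database, the class and function names are hypothetical, and the query is assumed to be a draft Old English rendering. The two indexed fragments are the well-known opening lines of Beowulf and of the Old English Lord's Prayer.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: bag of character n-grams (stand-in for a learned encoder)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FragmentIndex:
    """In-memory stand-in for a vector database of authentic fragments."""

    def __init__(self, fragments):
        self.entries = [(fragment, embed(fragment)) for fragment in fragments]

    def retrieve(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [fragment for fragment, _ in ranked[:k]]


def build_prompt(english, retrieved):
    """Prepend retrieved authentic passages to the translation request."""
    context = "\n".join(f"- {fragment}" for fragment in retrieved)
    return (
        "Authentic Old English passages for reference:\n"
        f"{context}\n\n"
        f"Translate into Old English: {english}"
    )


index = FragmentIndex([
    "Hwæt! We Gardena in geardagum",
    "Fæder ure þu þe eart on heofonum",
])
hits = index.retrieve("fæder ure on heofonum", k=1)
prompt = build_prompt("Our father in heaven", hits)
```

In a realistic setting, the chunk size of the indexed fragments and the choice of encoder would need tuning, as the following paragraph discusses.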
Implementing RAG in a low-resource setting introduces additional technical complexities. Determining optimal chunk sizes, establishing effective metadata tagging, and managing the computational overhead of retrieval operations are all crucial to prevent performance bottlenecks and maintain scalability of the system. Nevertheless, recent work on the endangered Paiwan language of Taiwan [29] demonstrates the effectiveness of a hybrid architecture combining dictionary-based pre-translation, retrieval-augmented generation, and LLM post-editing for extremely low-resource pairs. This architecture is particularly relevant to Old English because the Bosworth–Toller lexicon [15,16] could serve the same role as the Paiwan handcrafted bilingual dictionary in future iterations of our pipeline. Similarly, Hosseinbeigi et al. [30] advanced RAG for Persian by developing language-specific embeddings (MatinaRoberta, MatinaSRoberta) and systematic benchmarks, showing that careful tuning of chunk size, temperature, and document summary indexing is decisive for RAG quality in morphologically rich low-resource languages.
Another pathway to improved performance of the model lies in the use of more powerful LLMs, such as GPT-4, larger variants of Llama, or the DeepSeek R1 model. With greater parameter counts and broader training corpora, these models could offer enhanced contextual understanding and stylistic versatility. Nevertheless, such advancements require more computational resources and risk overfitting to noise in limited data environments. Thus, balancing computational efficiency with accuracy of the model and historical fidelity remains a central concern for future research.

4.5. Broader Implications

This research demonstrates that advanced NLP and machine learning techniques, when carefully adapted, can bridge the gap between limited linguistic resources and modern computational methods. By generating and analyzing texts in Old English—a historically important but underrepresented language—our framework provides valuable tools for linguistic research, cultural preservation, and educational initiatives. The ability to produce high-quality synthetic data opens up new avenues for investigating the evolution of language, enriching digital archives, and creating interactive tools for scholars and the general public.
The dual-agent pipeline and data augmentation strategies presented here are broadly applicable to other endangered languages and LRLs. The trained models not only excel at unsupervised text generation but also show emergent capacity for high-quality translation and downstream NLP applications, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, especially when fine-tuned or adapted through task-specific prompts. Moreover, the flexible design allows for adaptation to domain-specific text generation or smart data augmentation tasks that may support a wide range of projects in digital humanities and computational linguistics.
To illustrate the transferability of the framework, we consider its potential application to Quechua, a low-resource language family spoken by approximately 8–10 million people across South America. Like Old English, Quechua varieties face significant resource scarcity [31]: the largest digital corpus (Quechua–Spanish parallel data) contains fewer than 200,000 sentence pairs, and most varieties lack systematic computational resources. Quechua shares several challenges with Old English: rich morphology (in the case of Quechua, agglutinative with extensive suffixation), relatively free word order, and dialectal variation that complicates standardization [32,33].
Recent work on Quechua machine translation directly complements our methodology. Chen et al. [34] introduced QueEn, a Quechua–English system that combines retrieval-augmented generation with parameter-efficient fine-tuning of a relatively lightly adapted base model. Whereas QueEn’s main strength lies in grounding outputs through retrieval over an external Quechua knowledge base, our framework contributes the complementary component largely missing from QueEn: a deep, staged domain adaptation followed by task-specific synthetic augmentation. A natural extension would therefore combine our staged continual pre-training and dual-agent back-translation pipeline as the underlying translation engine, with QueEn-style RAG layered on top, plausibly yielding finer-grained translation quality than either approach alone. Beyond Quechua-specific systems, the 2024 edition of the AmericasNLP shared task [35] reported ChrF++ improvements over strong baselines of up to 4.2 points for Chatino, Guarani, Quechua, and Rarámuri, which supports the view that data augmentation and transfer learning remain among the most effective strategies for these languages. Even more recently, Asillo Congora et al. [36] showed at AmericasNLP 2025 that augmenting the Quechua parallel corpus with LLM-generated synthetic sentences and fine-tuning mT5 yields BLEU/ChrF++ improvements over the state of the art, which parallels the synthetic-data pipeline that we propose for Old English.
Beyond the case of Old English, these methodological innovations—which combine efficient fine-tuning, iterative back-translation, and modular architectures—offer a scalable, cost-effective blueprint for the revitalization and study of other languages that face resource scarcity, such as Quechua. Thus, this research not only contributes to linguistic and cultural heritage but also to the broader democratization of artificial intelligence for the benefit of society.

5. Conclusions

This work introduces a novel and effective framework for the generation of texts in Old English using state-of-the-art transformer-based language models. Through a combination of low-rank adaptation, back-translation, and a dual-agent pipeline, we expand the available corpus of Old English with linguistically accurate and stylistically plausible texts. Our multi-stage approach—which encompasses data preparation, domain adaptation, task specialization, and synthetic data generation—proves capable of capturing the grammatical and stylistic intricacies of the language, achieving BLEU scores of 65.41 and expert validation scores averaging 8.7/10.
The evaluation results confirm the success of our methodology: the generated texts achieve high scores from both automated metrics (BLEU, METEOR, CHRF) and expert evaluators, especially in grammatical accuracy and lexical choice. However, semantic coherence and context remain areas for ongoing improvement, with challenges such as anachronisms and oversimplification of narratives present in some outputs. These limitations remind us that computational text generation, while powerful, operates within epistemic boundaries: synthetic Old English can augment but never replace the historical record, serving as a practical tool for corpus expansion and computational analysis rather than as authentic linguistic evidence.
The strategies developed here provide a scalable and replicable path for supporting and revitalizing other LRLs. Looking forward, integrating retrieval-augmented generation and leveraging more advanced language models offer promising directions to further improve both linguistic fidelity and contextual accuracy. Yet we must acknowledge what remains uncertain: whether machine-generated language, however technically accomplished, can fully capture the interpretive nuance, cultural resonance, and historical specificity of texts produced by human speakers embedded in their linguistic communities. Our contribution lies not in resolving this epistemological tension but in demonstrating that AI can meaningfully support linguistic preservation when deployed with critical awareness of its limitations. Such methods can remain valuable computational approaches to cultural heritage only insofar as they respect the irreplaceable value of authentic historical sources.

Author Contributions

Conceptualization, R.G.S.A., M.N. and J.M.A.; Methodology, R.G.S.A., M.N. and C.L.D.A.; Software, R.G.S.A. and M.N.; Validation, J.M.A. and C.L.D.A.; Formal analysis, R.G.S.A. and M.N.; Investigation, M.N.; Resources, J.M.A.; Data curation, J.M.A.; Writing—original draft, R.G.S.A. and M.N.; Writing—review & editing, R.G.S.A., M.N., J.M.A. and C.L.D.A.; Visualization, R.G.S.A.; Supervision, C.L.D.A.; Project administration, J.M.A.; Funding acquisition, J.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

We gratefully acknowledge the grant PID2023-149762NB-100, funded by MCIN/AEI/10.13039/501100011033.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code, models, and implementation details are publicly available in our repository (https://github.com/tux550/OldEnglish-LLM (accessed on 17 January 2025)).

Acknowledgments

We gratefully acknowledge Ana Elvira Ojanguren López and Silvia Saporta Tarazona for their participation in the expert evaluation tasks.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bamman, D.; Burns, P.J. Latin BERT: A Contextual Language Model for Classical Philology. arXiv 2020, arXiv:2009.10053.
  2. Ortmann, K.; Dipper, S. Variation Between Different Discourse Types: Literate vs. Oral. In Proceedings of the Fourth Workshop on Natural Language Processing for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, 3 April 2017; pp. 64–79.
  3. Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6282–6293.
  4. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022.
  5. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 489–500.
  6. Nguyen, X.-P.; Aljunied, S.M.; Joty, S.; Bing, L. Democratizing LLMs for Low-Resource Languages by Leveraging Their English Dominant Abilities with Linguistically Diverse Prompts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 8837–8850.
  7. Liang, X.; Khaw, Y.M.J.; Liew, S.Y.; Tan, T.P.; Qin, D. Toward Low-Resource Languages Machine Translation: A Language-Specific Fine-Tuning with LoRA for Specialized Large Language Models. IEEE Access 2025, 13, 46616–46626.
  8. Kimera, R.; Heo, D.; Rim, D.N.; Choi, H. Data Augmentation With Back translation for Low Resource Languages: A Case of English and Luganda. In Proceedings of the 2024 8th International Conference on Natural Language Processing and Information Retrieval, Okayama, Japan, 13–15 December 2024.
  9. Zhao, Z.; Sun, G.; Liu, C.; Wang, D. Research on machine translation of ancient books in the era of large language model. npj Herit. Sci. 2025, 13, 122.
  10. Zhu, D.; Liu, C.; Zhao, X.; Zhao, Z.; Shen, S.; Wang, D. XunZi-MLLM: A multimodal large language model for ancient text and image recognition. Digit. Scholarsh. Humanit. 2025, 40, 709–722.
  11. Joshi, R.; Singla, K.; Kamath, A.; Kalani, R.; Paul, R.; Vaidya, U.; Chauhan, S.S.; Wartikar, N.; Long, E. Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2025; pp. 50–57.
  12. Sommerschield, T.; Assael, Y.; Pavlopoulos, J.; Stefaniak, V.; Senior, A.; Dyer, C.; Bodel, J.; Prag, J.; Androutsopoulos, I.; de Freitas, N. Machine Learning for Ancient Languages: A Survey. Comput. Linguist. 2023, 49, 703–747.
  13. Healey, A.d.; Wilkin, J.P.; Xiang, X. The Dictionary of Old English Web Corpus; Dictionary of Old English Project; Centre for Medieval Studies, University of Toronto: Toronto, ON, Canada, 2004.
  14. Domínguez Barragán, S.; Fidalgo Allo, L.; García Fernández, L.; Hamdoun Bghiyel, Y.; Lacalle Palacios, M.; Mateo Mendaza, R.; Novo Urraca, C.; Ojanguren López, A.E.; Ruíz Narbona, E.; Martín Arista, J.; et al. ParCorOEv3. An Open Access Annotated Parallel Corpus Old English-English; Martín Arista, J., Ed.; Nerthus Project, Universidad de La Rioja: Logroño, Spain, 2023; Available online: www.nerthusproject.com (accessed on 17 January 2025).
  15. Bosworth, J.; Toller, T.N. An Anglo-Saxon Dictionary; (Main Volume); Supplement by T.N. Toller, 1921; Addenda and Corrigenda by A. Campbell, 1972; Clarendon Press: Oxford, UK, 1898.
  16. Tichy, O.; Rocek, M. Bosworth–Toller’s Anglo-Saxon Dictionary Online; Digital Edition of Bosworth & Toller (1898/1921), Hosted by the Faculty of Arts, Charles University in Prague. Available online: https://bosworthtoller.com (accessed on 7 January 2025).
  17. Johnson, K.P.; Burns, P.J.; Stewart, J.; Cook, T.; Besnier, C.; Mattingly, W.J.B. The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 1–6 August 2021; pp. 20–29.
  18. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8342–8360.
  19. Xie, Y.; Aggarwal, K.; Ahmad, A. Efficient Continual Pre-Training for Building Domain Specific Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10184–10201.
  20. Nag, A.; Chakrabarti, S.; Mukherjee, A.; Ganguly, N. Efficient Continual Pre-training of LLMs for Low-resource Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track); Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 304–317.
  21. Dubey, A.; Grattafiori, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
  22. de Gibert, O.; Attieh, J.; Vahtola, T.; Aulamo, M.; Li, Z.; Vázquez, R.; Hu, T.; Tiedemann, J. Scaling Low-Resource MT via Synthetic Data Generation with LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Suzhou, China, 2025; pp. 27674–27692.
  23. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
  24. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72.
  25. Popović, M. chrF: Character n-gram F-score for Automatic MT Evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395.
  26. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
  27. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 6–12 December 2020; pp. 9459–9474.
  28. Lyu, X.; Grafberger, S.; Biegel, S.; Wei, S.; Cao, M.; Schelter, S.; Zhang, C. Improving Retrieval-Augmented Large Language Models via Data Importance Learning. arXiv 2023, arXiv:2307.03027.
  29. Wang, R.-C.; Yang, C.-K.; Yang, T.-C.; Tseng, Y.-X. Hybrid Dictionary–Retrieval-Augmented Generation–Large Language Model for Low-Resource Translation. Eng. Proc. 2025, 120, 52.
  30. Hosseinbeigi, S.B.; Asghari, S.; Kashani, M.A.S.; Shalchian, M.H.; Abbasi, M.A. Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization. arXiv 2025, arXiv:2501.04858.
  31. Rios, A. A Basic Language Technology Toolkit for Quechua. Ph.D. Thesis, University of Zurich, Zurich, Switzerland, 2015.
  32. Ortega, J.E.; Oncevay, A.; Ríos, A.; Gasser, E. Overcoming Data Sparsity in Morphological Inflection with Subword Segmentation and Data Augmentation. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 10 July 2020; pp. 138–143. [Google Scholar]
  33. Mager, M.; Oncevay, A.; Rios, A.; Meza Ruiz, I.; Kann, K.; Chiruzzo, L. Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Online, 11 June 2021; pp. 202–217. [Google Scholar]
  34. Chen, J.; Shu, P.; Liu, Z.; Howe, C.; Liu, T. QueEn: A Large Language Model for Quechua-English Translation. arXiv 2024, arXiv:2412.05184. [Google Scholar]
  35. Ebrahimi, A.; de Gibert, O.; Vazquez, R.; Coto-Solano, R.; Denisov, P.; Pugh, R.; Mager, M.; Oncevay, A.; Chiruzzo, L.; von der Wense, K.; et al. Findings of the AmericasNLP 2024 Shared Task on Machine Translation into Indigenous Languages. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 236–246. [Google Scholar]
  36. Asillo Congora, J.; Santisteban, J.; Lazo Vasquez, R. UCSP Submission to the AmericasNLP 2025 Shared Task. In Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP); Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 84–91. [Google Scholar]
Figure 1. Schematic representation of the proposed training pipeline for Old English text generation.
Figure 2. Dual-agent synthetic data generation pipeline.
Figure 3. Translation performance distributions for base and trained models. In each box plot in Figure 3, the solid line within the box marks the median, while the dashed line indicates the mean of the score distribution. The box edges represent the first and third quartiles (interquartile range, IQR), and the whiskers extend to the most extreme values within 1.5 × IQR. Individual dots beyond the whiskers correspond to outliers, that is, scores that fall outside this range and reflect atypical translation outputs.
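The box-plot conventions described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration only: the function name `box_plot_summary` and the example scores are hypothetical, and the quartile interpolation method is an assumption, since the caption does not say which one was used.

```python
import statistics


def box_plot_summary(scores):
    """Return the statistics a box plot of the kind in Figure 3 displays."""
    ordered = sorted(scores)
    # Quartile cut points Q1, median, Q3 (interpolation method is assumed).
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [s for s in ordered if lower_fence <= s <= upper_fence]
    return {
        "mean": statistics.mean(ordered),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [s for s in ordered if s < lower_fence or s > upper_fence],
    }


# Hypothetical sentence-level scores, not data from the paper.
example_scores = [52.0, 55.0, 58.0, 60.0, 61.0, 63.0, 66.0, 70.0, 12.0]
summary = box_plot_summary(example_scores)
print(summary)
```

In this made-up sample the single low score is flagged as an outlier and pulls the mean below the median, which is why a box plot can show the dashed mean line and the solid median line at different heights.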
Table 1. Illustrative examples of tasks designed for Old English adaptation.

| Task | Input | Output |
|---|---|---|
| Text Completion | [ANG]and fæste mann þærto swa fela daga swa þærto fæstene arærde wæron and þenung togesett is.[/ANG] | [ANG]and fæste mann þærto swa fela daga swa þærto fæstene arærde wæron and þenung togesett is.[/ANG] |
| Forward Translation | [INST]Translate the following English fragment to Anglo-Saxon[/INST][EN]xxi. what kind of men the deans of the monastery must be.[/EN] | [ANG]xxi. hwilce mynstres teoðingealdras beon sceolon.[/ANG] |
| Back Translation | [INST]Translate the following Anglo-Saxon fragment to English[/INST][ANG]se oðer him andwirde and cwæð: [/ANG] | [EN]the second answered him and said: [/EN] |
| Crossed Definition | [INST]What is the English definition of the following word in Anglo-Saxon?[/INST][ANG]getoge[/ANG] | [EN]A tugging; contractio; contraction; convulsio; convulsion; cramp; spasm; spasmus[/EN] |
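The tagged input/output format in Table 1 can be assembled with simple string helpers. The sketch below is illustrative: the tags and instruction sentences are taken from the table, while the function names (`wrap`, `forward_translation_example`, and so on) are hypothetical and not part of the authors' code.

```python
def wrap(tag, text):
    """Wrap text in an opening and closing tag, e.g. [EN]...[/EN]."""
    return f"[{tag}]{text}[/{tag}]"


def forward_translation_example(english, old_english):
    """Build an (input, output) pair for English -> Old English."""
    instruction = wrap(
        "INST", "Translate the following English fragment to Anglo-Saxon"
    )
    return instruction + wrap("EN", english), wrap("ANG", old_english)


def back_translation_example(old_english, english):
    """Build an (input, output) pair for Old English -> English."""
    instruction = wrap(
        "INST", "Translate the following Anglo-Saxon fragment to English"
    )
    return instruction + wrap("ANG", old_english), wrap("EN", english)


def crossed_definition_example(word, definition):
    """Build an (input, output) pair for a dictionary-style definition."""
    instruction = wrap(
        "INST",
        "What is the English definition of the following word in Anglo-Saxon?",
    )
    return instruction + wrap("ANG", word), wrap("EN", definition)


# Reproduces the Forward Translation row of Table 1.
prompt, target = forward_translation_example(
    "xxi. what kind of men the deans of the monastery must be.",
    "xxi. hwilce mynstres teoðingealdras beon sceolon.",
)
print(prompt)
print(target)
```

Keeping the language tags in one helper makes it harder for the training and inference prompts to drift apart.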
Table 2. Evaluation of baseline models in translation tasks using BLEU, CHRF, and METEOR scores.

| Direction | Model | BLEU | CHRF | METEOR |
|---|---|---|---|---|
| EN → ANG | Llama3.1-8B | 29.13 | 25.51 | 18.95 |
| EN → ANG | Mistral-7B | 24.51 | 22.61 | 16.12 |
| ANG → EN | Llama3.1-8B | 46.87 | 38.92 | 44.20 |
| ANG → EN | Mistral-7B | 43.94 | 36.31 | 40.27 |
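To make the CHRF column more concrete, the following is a simplified sentence-level character n-gram F-score. It is a sketch under stated assumptions, not the scoring code used in the paper: the function names are hypothetical, the maximum n-gram order of 6 and β = 2 are assumed settings, whitespace is simply dropped, and an established metric implementation would normally be used for reported results.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale (assumed settings)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: recall is weighted beta^2 times as much as precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "xxi. hwilce mynstres teoðingealdras beon sceolon."
print(simple_chrf(reference, reference))
print(simple_chrf("xxi. hwilce mynstres ealdras beon sceolon.", reference))
```

Because the score is built from character n-grams rather than whole words, a hypothesis with a near-miss inflectional ending still earns partial credit, which is one reason character-level metrics are often preferred for morphologically rich languages.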
Table 3. Common translation errors identified in the base models during forward translation evaluation.

| Error Type | Input | Generation |
|---|---|---|
| Looped generation | That is, at stoke twelve hides; and they have also given me thirty mancuses of gold, and that i free the monastery from all the taxes which belong to the lord of the people (...) | Thæt is, at stoc twelve hides; and hi hafa meth meth hund mancusa golds, and thæt ic monæstære fræolice fræolice fræolice fræolice fræolice fræolice fræolice fræolice (...) |
| Non-translated segments | the roman people first called it by that name, because on the first day of the month he established the roman empire and overthrew those who formerly had destroyed it (...) | the roman people first called it by that name, because on the first day of the month he established the roman empire and overthrew those who formerly had destroyed it (...) |
| Vocabulary hallucination | first he killed amyntas, his aunt’s son, and afterwards his brother, and then pannenio, his general, and then philotas (...) | he gefeng amyntas, his modorbrothor, and eft pannenio, his herefore, and eft philotas (...) |
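Two of the error types in Table 3 lend themselves to simple automatic flags. The heuristics below are a hypothetical sketch, not the procedure the authors used to identify these errors; the function names and the `min_repeats` threshold are assumptions chosen for illustration. Vocabulary hallucination is not covered, since detecting it would require a lexicon of attested forms.

```python
def has_looped_generation(text, min_repeats=4):
    """True if any whitespace-separated token occurs min_repeats times in a row."""
    tokens = text.split()
    run = 1
    for previous, current in zip(tokens, tokens[1:]):
        run = run + 1 if current == previous else 1
        if run >= min_repeats:
            return True
    return False


def _normalise(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def is_untranslated(source, output):
    """True if the output merely copies the source text."""
    return _normalise(source) == _normalise(output)


# Fragments taken from the Generation column of Table 3.
looped = "thæt ic monæstære fræolice fræolice fræolice fræolice fræolice"
copied = "the roman people first called it by that name"
print(has_looped_generation(looped))
print(is_untranslated(copied, "The roman people first  called it by that name"))
```

A threshold of four consecutive repeats avoids flagging short legitimate repetitions while still catching the runaway loop shown in the table.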
Table 4. Translation performance of the OldEnglishBase model at different training epochs, evaluated using BLEU, CHRF, and METEOR metrics (all scaled 0–100).

| Direction | Epochs | BLEU | CHRF | METEOR |
|---|---|---|---|---|
| EN → ANG | 1 | 59.39 | 50.83 | 50.13 |
| EN → ANG | 3 | 60.73 | 55.68 | 54.19 |
| EN → ANG | 5 | 55.43 | 52.49 | 53.76 |
| ANG → EN | 1 | 66.54 | 60.24 | 65.91 |
| ANG → EN | 3 | 59.27 | 56.98 | 58.80 |
| ANG → EN | 5 | 52.30 | 52.30 | 52.48 |
Table 5. Translation performance comparison for different trained models on Modern English → Old English (EN → ANG) and Old English → Modern English (ANG → EN) tasks.

| Direction | Model | BLEU | CHRF | METEOR |
|---|---|---|---|---|
| EN → ANG | Llama | 25.94 | 22.97 | 17.43 |
| EN → ANG | OldEnglishBase | 59.99 | 51.65 | 51.01 |
| EN → ANG | OldEnglishRefined | 65.41 | 57.82 | 57.40 |
| ANG → EN | Llama | 47.68 | 39.55 | 44.03 |
| ANG → EN | OldEnglishBase | 69.41 | 63.20 | 67.45 |
| ANG → EN | OldEnglishRefined | 69.89 | 63.54 | 68.50 |
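The size of the improvements in Table 5 is easier to read as point gains over a baseline. The short sketch below copies the scores from the table and subtracts them; the dictionary layout and the `gain` helper are hypothetical conveniences, not part of the paper.

```python
# Scores copied from Table 5, as (BLEU, CHRF, METEOR).
TABLE_5 = {
    "EN->ANG": {
        "Llama": (25.94, 22.97, 17.43),
        "OldEnglishBase": (59.99, 51.65, 51.01),
        "OldEnglishRefined": (65.41, 57.82, 57.40),
    },
    "ANG->EN": {
        "Llama": (47.68, 39.55, 44.03),
        "OldEnglishBase": (69.41, 63.20, 67.45),
        "OldEnglishRefined": (69.89, 63.54, 68.50),
    },
}


def gain(direction, model, baseline="Llama"):
    """Absolute point gain of `model` over `baseline` for each metric."""
    scores = TABLE_5[direction]
    return tuple(
        round(m - b, 2) for m, b in zip(scores[model], scores[baseline])
    )


print(gain("EN->ANG", "OldEnglishRefined"))
# → (39.47, 34.85, 39.97)
print(gain("ANG->EN", "OldEnglishRefined"))
# → (22.21, 23.99, 24.47)
print(gain("EN->ANG", "OldEnglishRefined", baseline="OldEnglishBase"))
# → (5.42, 6.17, 6.39)
```

On these numbers the gain over the base Llama model is larger for EN → ANG than for ANG → EN, and the refinement stage adds roughly 5 to 6 points on top of OldEnglishBase in the EN → ANG direction.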
Table 6. Improvement in zero-shot translation quality.

| Source | BLEU | Text |
|---|---|---|
| English | - | that is, at stoke twelve hides; and they have also given me thirty mancuses of gold, and that i free the monastery from all the taxes which belong to the lord of the people, little or great, known or unknown, except.the ‘angild’ to others, aud fortress-work, and ‘fyrdsoon,’ and bridge-work... |
| Human | - | ðæt is æt stoce twelf hida, and ðritig mancusa goldes hio sealdan eac me, and ic and mynster fram æghwelcum gafolum gefreoge ðe to ðiode hlafarde belimpeð, littles oððe micles, cuðes ge uncuðes, butan angilde wiðoðrum, and fæstengewerce, and fyrdsocne, and brycggeweorce... |
| Llama 7b | 45 | beoð þa twelf hida landes, and hæfiað me þa þritig mancusas goldes, and þa þe ic freoðe þone minan monasterium of allum þam scotum þe to þam hlafordes þe þa folc beoð, litel oððe micel, þe þe ic wite oððe þe ic ne wite, butan þam ‘angild’ to oþrum, and ‘fyrdson’, and ‘bricg-weroc’. |
| OldEnglishBase | 62 | ðæt is æt stoce twelf hida, and heom hæbbe ic eac geunnen xxx mancus goldes, and ic freo ðæt mynstre of eallum ðam gafolum ðe to ðam folcgere belimpað, lytle oððe mycele, cuðe oððe uncuðe, butan angilde to oðrum, and fæstenweorce, and fyrdsocne, and brycggeworce. |
| OldEnglishRefined | 65 | ðæt is æt stoce twelf hida, and hæbbe ic eac geunnen ðæt man freoge ðæt mynstre of eallum ðam gafoles ðe to ðam folces hlaforde belimpað, lytles oððe micles, cuðes oððe uncuðes, butan angilde wið oðrum, and fæstenweorce, and fyrdsocne, and brycggeweorce. |
Table 7. Results of expert evaluation for model-generated Old English texts.

| Evaluation Criterion | Average Score |
|---|---|
| Inflection (morphology) | 9.0 |
| Word order (syntax) | 9.0 |
| Lexical choice (attestedness) | 9.1 |
| Semantic coherence | 7.8 |
| Overall average | 8.7 |
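The overall average in Table 7 can be checked directly from the four criterion scores. The snippet assumes the overall figure is the unweighted mean of the four criteria, which the table does not state explicitly; the variable names are illustrative.

```python
from statistics import mean

# Criterion scores copied from Table 7 (scale of 0 to 10).
criterion_scores = {
    "Inflection (morphology)": 9.0,
    "Word order (syntax)": 9.0,
    "Lexical choice (attestedness)": 9.1,
    "Semantic coherence": 7.8,
}

# Unweighted mean across criteria (an assumption about how 8.7 was obtained).
overall = mean(criterion_scores.values())
print(round(overall, 1))
# → 8.7
```

The unrounded mean is about 8.725, so the reported 8.7 is consistent with equal weighting of the four criteria.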

Share and Cite

MDPI and ACS Style

Salazar Alva, R.G.; Núñez, M.; López Del Alamo, C.; Martín Arista, J. AI-Driven Generation of Old English: A Framework for Low-Resource Languages. Big Data Cogn. Comput. 2026, 10, 145. https://doi.org/10.3390/bdcc10050145
