1. Introduction
Language is among the most profound tools of human civilization: it embodies cultural heritage, historical knowledge, and intellectual evolution. Studying ancient languages reveals the origins of societies and their ties to the present. Among these, Old English holds a unique position as the earliest form of the English language and the foundation of contemporary English. Spoken between the 5th and 11th centuries of the Common Era, it represents a rich tapestry of linguistic complexity, yet it remains under-resourced in the current paradigm of widespread generative artificial intelligence.
Recent advances in NLP for low-resource languages have explored various computational strategies. Work on Latin, for instance, has benefited from substantially larger corpora (the Latin Library contains over 10 million words) and from established toolkits, with transformer models applied successfully to tasks such as lemmatization and dependency parsing [1,2]. Gothic, despite having fewer surviving texts than Old English, has seen recent computational work based on transfer learning from modern Germanic languages, though these approaches typically achieve BLEU scores below 40 for translation tasks [3].
Existing NLP models for LRLs (low-resource languages) face common limitations: (1) data scarcity necessitates extensive transfer learning, which risks introducing linguistic features from donor languages; (2) morphological complexity in inflected languages challenges tokenization and requires careful handling of case systems; and (3) free word order complicates sequence modeling approaches designed for fixed-order languages like Modern English.
Parameter-efficient fine-tuning methods have emerged as crucial tools for LRL adaptation. LoRA [4] reduces trainable parameters by injecting low-rank matrices into pretrained models, which enables efficient adaptation with limited data while preserving the knowledge of the base model. Back-translation [5], originally developed for neural machine translation, has proven particularly effective for data augmentation in low-resource settings because it can leverage monolingual corpora to generate synthetic parallel data. Recent work by Nguyen et al. [6] demonstrates that linguistically diverse prompting can improve LLM performance on low-resource languages by up to 8 points in the chrF++ metric, although their approach requires substantial multilingual pre-training data unavailable for Old English.
Beyond the original work by Hu et al. [4], subsequent studies have extended LoRA to specialized low-resource translation scenarios. Liang et al. [7] recently introduced Language-Specific Fine-Tuning with LoRA (LSFTL), optimizing multi-head attention and feed-forward layers of Transformer blocks through low-rank matrix adaptation to achieve substantial BLEU and chrF gains for low-resource language pairs. These results confirm that LoRA-based adaptation can match or exceed full fine-tuning with a fraction of the trainable parameters, a property especially valuable for historical languages where overfitting is a constant risk.
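To make the phrase "a fraction of the trainable parameters" concrete, the following minimal sketch counts the trainable parameters for a single weight matrix under full fine-tuning and under a rank-r LoRA update of the form W + BA. The matrix dimensions (4096 × 4096) and the rank (8) are illustrative assumptions, not the configuration used in this study.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a LoRA update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative (assumed) sizes: one square projection matrix and a small rank.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r={RANK}): {lora:,} parameters ({lora / full:.2%} of full)")
```

Under these assumed sizes, the low-rank update trains 65,536 parameters instead of 16,777,216 for that matrix, roughly 0.4% of the full count.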
The evolution of back-translation techniques has similarly advanced beyond the foundational work of Edunov et al. [5]. Recent innovations include iterative back-translation, where multiple rounds of synthetic data generation progressively refine model quality, and tagged back-translation, which explicitly marks synthetic versus authentic data to help models distinguish between them.
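As an illustration of tagged back-translation, the sketch below builds a training set in which synthetic source sentences carry a marker token while authentic pairs are left untouched. The tag string `<BT>`, the function names, and the stub reverse translator are hypothetical; a real pipeline would call a trained target-to-source model at that point.

```python
from typing import Callable, List, Tuple

SYNTHETIC_TAG = "<BT>"  # hypothetical marker token; any reserved token would do


def build_tagged_corpus(
    authentic_pairs: List[Tuple[str, str]],
    monolingual_targets: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Combine authentic (source, target) pairs with tagged back-translated pairs."""
    corpus = list(authentic_pairs)
    for target in monolingual_targets:
        # The source side is synthetic; the target side stays authentic.
        synthetic_source = reverse_translate(target)
        corpus.append((f"{SYNTHETIC_TAG} {synthetic_source}", target))
    return corpus


def toy_reverse_translate(text: str) -> str:
    """Stand-in for a target-to-source model; it only labels its input."""
    return f"[gloss of: {text}]"


authentic = [("Our Father, who art in heaven", "Fæder ure þu þe eart on heofonum")]
monolingual = ["Hwæt we Gardena in geardagum"]
corpus = build_tagged_corpus(authentic, monolingual, toy_reverse_translate)
for source, target in corpus:
    print(source, "=>", target)
```

Iterative back-translation would repeat this construction over several rounds, retraining the forward and reverse models on the enlarged corpus between rounds.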
Recent work on English–Luganda [8] demonstrated that incremental and iterative back-translation, combined with dataset selection across multiple small monolingual sources, can exceed previous benchmarks by more than 10 BLEU points across translation directions for an African low-resource language. This result directly motivates the iterative back-translation strategy in our Phase 2 task specialization.
For morphologically rich languages, hybrid approaches combining back-translation with morphological analysis have shown particular promise. Studies on agglutinative languages like Turkish and Finnish demonstrate that incorporating morphological segmentation into the back-translation pipeline can improve BLEU scores by 5–8 points over naive back-translation. These advances informed our decision to combine back-translation with careful linguistic preprocessing tailored to Old English’s inflectional complexity.
The intersection of efficient fine-tuning and data augmentation for historical languages represents relatively unexplored territory. While Latin NLP has benefited from larger corpora and established computational infrastructure [1,2], truly low-resource ancient languages like Gothic and Old Church Slavonic present challenges more analogous to Old English. Recent work on Gothic achieves BLEU scores in the 30–40 range through transfer learning from modern Germanic languages but often introduces anachronistic constructions from donor languages.
Very recent work on classical Chinese confirms the value of domain-specialized models: Zhao et al. [9] applied LoRA instruction tuning to the Xunzi series of ancient-book LLMs using 1.2 million pairs of parallel ancient–modern Chinese text. These authors show that domain-specialized models substantially outperform general-purpose baselines on ancient-book translation metrics. The same research group has further extended this paradigm to multimodal settings, with Zhu et al. [10] introducing XunZi-MLLM, a multimodal large language model for joint recognition of ancient texts and historical document images. In a closely related line of work on a different low-resource setting, Joshi et al. [11] introduced Nemotron-Mini-Hindi 4B, a bilingual model trained through continued pre-training of a multilingual LLM on a mix of real and translation-based synthetic Hindi + English tokens. They achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks.
These works adopt a methodological philosophy analogous to ours, namely combining domain-specialized continued pre-training with synthetic data augmentation, and we adapt this paradigm to the markedly smaller Old English corpus (≈3 million words). Our approach differs by treating Old English generation as a structured task combining domain adaptation and task specialization, explicitly separating content creation from translation to maintain linguistic authenticity while leveraging modern language model capabilities.
Unlike previous work that treats ancient languages as simple translation targets, our framework treats corpus expansion as a structured generation task that preserves grammatical and stylistic authenticity. Our approach advances beyond these methods by combining domain-adaptive continual pre-training with task-specific back-translation in a dual-agent architecture specifically designed to address the unique challenges of Old English: its limited corpus size (3 million words), complex inflectional morphology (four-case nominal system, rich verbal conjugation), and relatively free word order. The limited size of the corpus hinders not only linguistic research but also the application of state-of-the-art NLP techniques, which often depend on large, high-quality datasets [12].
LLMs have revolutionized NLP but face critical limitations when generating Old English texts. Most models are English-centric, trained predominantly on contemporary corpora, while multilingual models lack sufficient representation of low-resource languages like Old English. The unique linguistic features of Old English—complex case systems, free word order, and Germanic vocabulary—compound these challenges, resulting in stylistically and grammatically inaccurate outputs that require targeted training and augmentation strategies.
The primary objective of this study is to adapt a pretrained large language model for data generation tasks in Old English through a systematic fine-tuning process that leverages the limited available Old English data and advanced training techniques. This approach allows the model to generate syntactically well-formed and semantically accurate Old English text. Furthermore, to guide the stylistic and contextual quality of the generated output, the proposed framework uses a dual-agent architecture: a generative agent constructs coherent Modern English prompts, and a translation agent renders these into high-quality Old English.
Based on the successful application of transfer learning and data augmentation in related low-resource scenarios, we hypothesize that a pretrained large language model, when systematically adapted through staged domain-specific training and enriched with synthetically augmented data via back-translation, can generate Old English texts that achieve both high automated metric scores (BLEU > 60) and expert-validated linguistic fidelity (average score > 8/10) across grammatical accuracy, lexical authenticity, and stylistic coherence. This hypothesis rests on two assumptions: that the model's existing knowledge of Modern English and related Germanic languages provides a sufficient structural foundation for effective transfer to Old English, and that synthetic data can overcome the limitations of the small historical corpus without introducing systematic errors or anachronisms that would compromise authenticity.
Evaluation was conducted using both automated metrics, such as BLEU, CHRF, and METEOR, and human evaluation by experts in Old English. Texts were rated on grammatical accuracy (inflection and word order), lexical selection (attestedness), and semantic coherence, with only high-quality outputs incorporated into the extended corpus. This refinement ensured the data met both linguistic and computational standards.
Our approach begins by adapting a state-of-the-art language model through a carefully staged training pipeline. The process starts with domain adaptation, where the model is progressively exposed to authentic Old English data, however scarce, alongside carefully selected Modern English examples, using techniques such as domain-adaptive pre-training and parameter-efficient fine-tuning. Synthetic data generation is then achieved via a dual-agent system: one agent crafts stylistically consistent Modern English prompts, and a specialized translation agent, enriched with context and guided by few-shot learning, renders these into fluent Old English. By iteratively expanding and refining the training data, our methodology enables the model to produce Old English texts with high fidelity, which opens the door to scalable digital preservation and revitalization for other LRLs.
The contributions of this work are manifold. It expands the Old English corpus and provides a scalable framework for LRLs. This approach serves as a replicable template for other under-resourced languages, which fosters their preservation and study. Beyond linguistics, this work unites computational techniques with cultural heritage preservation, thus contributing to a movement that democratizes access to linguistic resources and ensures the survival of underrepresented languages. By demonstrating how LLMs can address challenges in LRLs, we provide valuable insights for researchers at the juncture of machine learning, humanities, and technology—which shows how the same tools that shape the future of artificial intelligence can be repurposed to illuminate the linguistic past of humankind.
All code, models, and implementation details are publicly available in our repository (https://github.com/tux550/OldEnglish-LLM (accessed on 17 January 2025)). The pipeline implementation in Google Colab notebooks makes our framework accessible with minimal setup requirements. Our implementation leverages the Hugging Face Transformers library for model management and fine-tuning, while the dual-agent pipeline utilizes OpenAI’s API for the content generation component. This architecture allows researchers to reproduce our results and adapt our methods to other low-resource languages with standard computational resources.
The remainder of this article is organized as follows.
Section 2 details our three-stage methodology, beginning with data preparation using the Dictionary of Old English Corpus and other historical sources, followed by a progressive model training approach that combines domain adaptation and task specialization phases.
Section 2 also describes our dual-agent synthetic data generation pipeline distinguishing content creation from translation tasks.
Section 3 outlines our use of automated metrics (BLEU, METEOR, CHRF) alongside expert human assessment criteria.
Section 4 presents our findings, which demonstrate substantial improvements in translation quality—with BLEU scores increasing from 26 to over 65—and highlights both the strengths and limitations of our approach through detailed linguistic analysis.
Section 4 also examines the broader implications of this work for other LRLs.
Section 5 summarizes the main conclusions and provides future directions for improving semantic coherence and historical authenticity in generated texts.
3. Evaluation
The evaluation process combined automated metrics and human assessments to ensure reliability and quality.
For the automated metrics, we employed BLEU [23], METEOR [24], and CHRF [25]. These metrics function as quantitative proxies for human judgment by comparing the generated translation to reference human translations. BLEU evaluates precision at the n-gram level and thus assesses how much of the generated text overlaps with the reference. METEOR improves on this by incorporating recall and using techniques such as stemming and synonym matching for better alignment with human evaluation. CHRF, in contrast, evaluates translations based on character-level precision and recall.
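The core quantities behind two of these metrics can be sketched in a few lines. The functions below are simplified illustrations, not the reference implementations: `clipped_precision` computes BLEU-style modified n-gram precision for a single order without the brevity penalty, and `char_fscore` computes a chrF-style F-score for a single character n-gram order, whereas the published metrics average over several orders. The function names and example strings are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(items: Sequence, n: int) -> List[tuple]:
    """All contiguous n-grams of a sequence (tokens or characters)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """BLEU-style modified n-gram precision over whitespace tokens."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def char_fscore(hypothesis: str, reference: str, n: int = 3, beta: float = 2.0) -> float:
    """chrF-style F-beta over character n-grams of a single order n."""
    hyp = Counter(ngrams(hypothesis, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Clipping in action: repeating a correct word does not inflate precision.
print(clipped_precision("se se se", "se cyning"))
print(char_fscore("cyning", "cyning"))
```

With beta set to 2, recall is weighted more heavily than precision, which is the usual choice for chrF.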
Regarding human evaluation, expert reviewers rated the outputs on three criteria: grammatical accuracy (inflection and word order), stylistic fluency (lexical choice), and contextual coherence. The expert evaluation was conducted on a 10-point scale, with a threshold score of 7 set to determine whether a text is appropriate for incorporation into the extended corpus. This task was carried out on a stratified sample of 150 generated segments, randomly selected from a pool of 500 generated texts to avoid selection bias. Each segment ranged from 15 to 45 words in length and was required to constitute a syntactically complete unit (containing at least one independent clause). The sample mirrored the major textual genres represented in the Old English corpora: religious texts (homilies and biblical translations, 35%), legal documents (charters and law codes, 20%), historical chronicles (15%), literary texts (poetry, 15%), and medical/scientific texts (15%).
The expert evaluation was conducted by three specialists in Old English linguistics, each with doctoral-level qualifications and/or publications in the field of historical Germanic linguistics or Old English corpus linguistics. The evaluators included two university professors with over 15 years of experience in Old English philology and one predoctoral researcher specializing in computational approaches to historical linguistics. To ensure consistency between evaluators, we implemented a standardized scoring rubric with detailed criteria for each evaluation dimension. Before the main evaluation, we conducted a calibration session where all evaluators independently scored the same set of 10 text samples, followed by a discussion to align their interpretations of the scoring criteria. The names of these experts are acknowledged in the Acknowledgements Section of this paper.
4. Results and Discussion
This study provides compelling evidence that large language models (LLMs) can address the challenge of corpus expansion for LRLs, with Old English serving as a representative case study. Through a systematic combination of advanced model fine-tuning, targeted data augmentation, and a modular, task-specific pipeline, we achieved improvements in both translation quality and linguistic fidelity. Our results showcase not only gains in performance over baseline models but also illuminate the specific strengths and persistent limitations encountered when adapting LLMs to a historically and linguistically complex low-resource setting.
4.1. Model Fine-Tuning and Performance
In Table 2, all displayed values are reported on a 0–100 scale, with higher scores indicating closer correspondence with reference human translations.
The initial evaluation of the baseline models, Llama-3.1-8B [21] and Mistral-7B [26], revealed critical limitations in their ability to process Old English. Both models achieved moderate results in reverse translation tasks (Old English to Contemporary English), yet their performance in direct translation (Contemporary English to Old English) lagged behind (Table 2). This pronounced asymmetry highlights the inherent difficulties of generating accurate and fluent text in a low-resource language characterized by complex grammatical structures.
Among the baselines, Llama-3.1-8B consistently outperformed Mistral-7B across all evaluation metrics (BLEU, CHRF, and METEOR), with an average advantage of 2 to 5 points (Table 2). Nonetheless, both models commonly exhibited translation errors, including repetitive or looped text, untranslated segments, and stylistic inconsistencies (Table 3). These recurring issues underscored the necessity of further fine-tuning and methodological innovation to adapt LLMs for high-fidelity text generation in Old English.
The refinement process demonstrated measurable improvements. OldEnglishBase served as a robust initial model capable of capturing the vocabulary and grammar of Old English and demonstrated substantial improvement over the baseline Llama model in both direct and reverse translation tasks, with absolute score increases of approximately 35 points (from ~25 to ~60) for English to Old English translation and 20 points (from ~47 to ~67) for Old English to English translation, as can be seen by comparing Table 2 and Table 4. For direct translation, this gain more than doubled the baseline scores; for reverse translation, it corresponds to a relative increase of roughly 40%. We focused our subsequent experiments on the Llama architecture due to its consistently superior performance over Mistral in our initial evaluations.
Refining this model through back-translation proved key to overcoming the plateau observed after three epochs of domain adaptation. OldEnglishRefined excelled in producing fluent and contextually appropriate Old English text, with improvements of approximately 6 points across all metrics for direct translation. It is important to note that this process specializes the model for direct translation, with the gains in reverse translation being minimal.
The gains obtained by our model align with previous applications of back-translation in large language models, such as the Linguistically Diverse Prompting approach [6], which reports increases of up to 8 points in the chrF++ metric. Notably, our model begins with a baseline over twice as high due to the domain adaptation step, which makes further improvements more difficult to achieve. Despite this, we attained gains comparable with those of prior work, which highlights the effectiveness of this approach.
In Table 3, the examples illustrate issues such as looped generation, non-translated segments, and hallucinated vocabulary.
In Table 4, higher values indicate greater similarity to reference human translations. Results are reported for both Modern English to Old English (EN → ANG) and Old English to Modern English (ANG → EN) directions. Performance improves up to three epochs, after which overfitting and catastrophic forgetting reduce scores, especially for reverse translation.
In Figure 3, box plots show the distributions of BLEU, CHRF, and METEOR scores for three model stages (Base, OE-Base, and OE-Refined) on both translation directions: (a) Modern English → Old English and (b) Old English → Modern English. Scores are normalized to [0, 1]. The dotted line within each box indicates the mean.
For both directions, there is a clear performance progression: OE-Base models (orange) show marked gains over the Base (blue), with further improvements from OE-Refined (green). This is reflected in higher medians, means, and upper quartiles across all metrics. Notably, the score distribution for OE-Refined is more compact, indicating more consistent high-quality outputs. Forward translation benefits most, with especially large gains in BLEU and METEOR. Outliers are reduced as training progresses, underscoring improved robustness. The observed plateau in improvement after three epochs aligns with expectations for translation tasks involving morphologically complex languages. BLEU scores exceed 60 for both translation directions, which shows that the model approaches performance levels typically associated with human-quality translations between more resource-rich language pairs. This suggests that our domain adaptation phase successfully captured the fundamental linguistic patterns of Old English despite the limited training data available.
4.2. Synthetic Data Generation
The dual-agent pipeline, comprising FragmentGen and OldEnglishTranslator, played a pivotal role in addressing the limited availability of annotated data in Old English. FragmentGen generated coherent and thematically diverse texts in Contemporary English, thus providing high-quality inputs for the translation agent. OldEnglishTranslator, built on the trained OldEnglishRefined model, translated these texts into Old English with a high degree of grammatical and stylistic fidelity.
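The separation of content creation from translation can be sketched as two interchangeable callables chained in sequence. The code below is a structural illustration only: the names `run_dual_agent`, `stub_generator`, and `stub_translator` are hypothetical, and the stubs stand in for the LLM-backed generation agent and the fine-tuned translation model, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticPair:
    modern_english: str
    old_english: str


def run_dual_agent(
    topics: List[str],
    generate_fragment: Callable[[str], str],
    translate_to_old_english: Callable[[str], str],
) -> List[SyntheticPair]:
    """Generate a Modern English fragment per topic, then translate each one."""
    pairs = []
    for topic in topics:
        fragment = generate_fragment(topic)             # content-creation agent
        rendering = translate_to_old_english(fragment)  # translation agent
        pairs.append(SyntheticPair(fragment, rendering))
    return pairs


# Stub agents (hypothetical): real ones would call an LLM API and the fine-tuned model.
def stub_generator(topic: str) -> str:
    return f"A short passage about {topic}."


def stub_translator(text: str) -> str:
    return f"<ANG rendering of: {text}>"


pairs = run_dual_agent(["a land charter", "a herbal remedy"], stub_generator, stub_translator)
for pair in pairs:
    print(pair.modern_english, "->", pair.old_english)
```

Keeping the two roles behind plain function signatures means either agent can be swapped (for example, a different generator model) without touching the other.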
4.3. Expert Linguistic Evaluation
Three expert linguists with published research in Old English corpus linguistics and computational linguistics independently evaluated the generated texts across four criteria: inflection (nominal and adjectival declension, verbal conjugation, and subject-verb agreement), word order, lexical choice (authenticity based on attestedness in surviving Old English texts), and semantic coherence. Each text segment, defined as a syntactically complete unit containing one or more clauses, was scored on a 1–10 scale for each criterion, and these four scores were averaged to produce a composite score per segment. To assess the consistency of expert evaluations, we calculated inter-rater agreement across the three independent linguist evaluators. For each evaluated segment, we computed the range (difference between highest and lowest scores) and standard deviation of composite scores. Across 50 randomly sampled segments, the mean score range was 0.83 points (SD = 0.34), with 94% of segments showing agreement within 1 point. This indicates substantial inter-rater reliability. Fleiss’ kappa for the four evaluation criteria yielded κ = 0.76, which demonstrates “substantial agreement” according to standard interpretation guidelines. The highest agreement was observed for inflection (κ = 0.81) and lexical choice (κ = 0.79), whereas semantic coherence showed slightly lower but still acceptable agreement (κ = 0.68), which reflects the more subjective nature of contextual interpretation.
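For readers unfamiliar with the agreement statistics, the sketch below shows how a per-segment score range and Fleiss' kappa can be computed. It is a generic illustration of the standard formulas, not the authors' evaluation script. Fleiss' kappa operates on categorical ratings, so applying it to 1–10 scores requires treating the scores as categories or binning them; how that was done here is not stated, and the toy category counts used below are assumptions.

```python
from statistics import mean
from typing import List


def score_range(scores: List[float]) -> float:
    """Spread between the highest and lowest rater score for one segment."""
    return max(scores) - min(scores)


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters placing item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumes every item was rated by the same number of raters
    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = mean(per_item)
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)  # undefined if all ratings share one category


# Toy data (assumed): 4 segments, 3 raters, scores binned into low / mid / high.
toy_counts = [[3, 0, 0], [0, 3, 0], [0, 2, 1], [0, 0, 3]]
print(round(fleiss_kappa(toy_counts), 3))
print(score_range([8.5, 9.0, 8.0]))
```

The range statistic is easy to interpret on the original scale, while kappa corrects for the agreement that would be expected by chance given how often each category is used.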
For instance, the segment “spyr on ðam pocce on eagum; heo sceal niman wad and gearwan, forðam hi ealle andswaredon” (“Inquire about the pock in the eyes; she shall take woad and yarrow, for they all answered”) was rated for its accurate use of imperative and future verbal forms (spyr, sceal niman), thus demonstrating a strong command of the morphology of Old English. The passage also presents a clear instructional structure and employs medical and technical herb terminology with precision, which closely follows authentic conventions of the genre. These strengths are reflected in the scores for inflection (9/10), word order (8/10), and lexical choice (10/10). On the other hand, the segment received a lower score for semantic coherence (7/10), due to an inadequate continuation of the topic: the causal relationship is unclear, as the last clause (forðam hi ealle andswaredon) is incorrectly linked to the main predication. Despite this minor issue, the overall evaluation for this passage was positive, with an average score of 8.5/10.
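The composite score for this example follows directly from the four criterion scores. The short check below reproduces the arithmetic and compares the result with the acceptance threshold of 7 described in Section 3; treating the threshold as inclusive (`>=`) is an assumption, as are the variable names.

```python
from statistics import mean

ACCEPTANCE_THRESHOLD = 7  # threshold stated in Section 3; the inclusive comparison is an assumption

# Criterion scores reported for the example segment.
scores = {"inflection": 9, "word_order": 8, "lexical_choice": 10, "semantic_coherence": 7}

composite = mean(scores.values())  # (9 + 8 + 10 + 7) / 4
accepted = composite >= ACCEPTANCE_THRESHOLD
print(f"composite = {composite}, accepted = {accepted}")
```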
In Table 5, metrics (BLEU, CHRF, and METEOR) are scaled 0–100; higher scores indicate greater similarity to human reference translations. Results show gains moving from the baseline Llama model to OldEnglishBase, with further improvements achieved by the OldEnglishRefined model, especially in direct translation to Old English.
From a qualitative perspective, manual assessment of the synthetically generated texts underscores the strengths of the model in grammatical structure, particularly in inflection and word order. The system reliably produces complex constructions—such as relative and conditional clauses, and other intricate syntactic patterns—with high fidelity. Furthermore, the decorum of the genre is consistently maintained, as the generated texts adapt stylistic conventions appropriate to different genres. Specialized terminology in religious, legal, and historical contexts is both accurate and contextually appropriate, which demonstrates the flexibility of the model across diverse textual traditions.
Despite these notable strengths, semantic coherence emerged as the most frequent issue in the generated texts. In some cases, anachronistic concepts, such as modern dates or technologies, are retrofitted into Old English, which leads to slight incongruences. For example, in the segment “gif ðu bist getogen on andwerdre tide and on tocyme, ðonne scealt ðu gymene habban to ðære lare on armenia” (“if you are trained on present and future tense data, you must pay attention to the teaching methods in Armenia”), references to data and training introduce modern elements that are historically out of place.
Complex narratives can occasionally be oversimplified, and unrelated domains may be inadvertently blended. For instance, in the passage “wið swile gate tyrdlu on scearpum ecede gesoden and on selfe wisan on gedon” (“to overcome swelling, apply sharp vinegar sodden with goats treadles and keep going on good advise”), the second coordinate clause diverges from the main narrative, so that the overall coherence is reduced.
Across genres, biblical quotations and religious instruction are typically more convincing and stylistically consistent than medical remedies, land charters, or detailed historical chronicles. The model tends to avoid structurally marked configurations such as non-nominative subjects, non-accusative objects, or pragmatic reorderings (like the V2 rule). Another recurring issue is the loss of co-reference tracking in complex syntactic structures. For example, in the segment “Ða axodon hi hie hu fela sealfe hæbbe ge. Hi cwædon seofon hræfnes & an rahdeores” (“Then they asked her how many salves have you? She said seven crabs and a roebuck”), the reference point shifts from the singular second person to the plural third person, which creates ambiguity.
Table 6 illustrates the progressive enhancement of translation quality in a zero-shot setting. It highlights the evolution of outputs as the model refines its understanding of the target language.
As summarized in Table 7, expert evaluation demonstrates that the model excels in producing Old English with high grammatical accuracy and appropriate lexical choices, as reflected by average scores of at least 9.0 for inflection, word order, and lexical selection. These results indicate an excellent command of the morphology, syntax, and vocabulary of Old English in the generated texts. However, a lower score for semantic coherence (7.8) underlines a recurring limitation: the output is structurally and lexically convincing, but the model occasionally struggles to maintain deep narrative consistency or capture subtle contextual relationships. This assessment is based on detailed, criterion-specific scoring by specialist linguists of Old English, so that both surface-level correctness and deeper semantic fidelity are carefully measured. Overall, consistently high scores across linguistic categories confirm the effectiveness of our approach in generating plausible Old English and point to areas for future improvement in semantic integration.
In Table 7, each criterion was rated on a 0–10 scale, with higher scores indicating closer alignment with grammatical and stylistic expectations found in authentic Old English sources. The model achieved high marks for inflection, word order, and lexical choice, whereas semantic coherence, though strong, remains the most challenging aspect and the main area for improvement. The overall average reflects robust linguistic performance.
4.4. Challenges and Limitations
Despite the notable achievements of this study, several challenges remain. The scarcity of high-quality resources in Old English continues to constrain the diversity and complexity of training data. Although data augmentation strategies partially alleviate this limitation, certain linguistic features of Old English, such as its intricate case system and free word order, pose difficulties for both model training and text generation.
The generation of synthetic Old English texts raises fundamental questions about authenticity, representation, and the epistemological status of computationally reconstructed language. While our model achieves high grammatical fidelity and draws exclusively from attested vocabulary and constructions, the generated texts remain fundamentally different from authentic manuscripts: they lack the cultural embeddedness, sociolinguistic context, and interpretive traditions that characterize surviving Old English literature. In this regard, we emphasize that synthetic texts should be understood as computational approximations rather than historical discoveries. They reflect patterns learned from the extant corpus but cannot capture unattested nuances, regional variations, or diachronic changes that existed in the living language. Their primary value lies in corpus augmentation for computational tasks (training NLP models, testing linguistic hypotheses) and educational applications (generating practice materials, illustrating grammatical patterns), rather than in serving as evidence for historical or philological claims about the language itself.
Users of generated Old English must maintain critical awareness of these limitations. Synthetic texts are best employed as supplementary resources that expand analytical possibilities while remaining grounded in authentic sources. For educational contexts, generated materials should be clearly labeled as computational constructs and used alongside genuine historical texts. For research applications, synthetic data can enable larger-scale computational analyses (such as training taggers or parsers) that would be impossible with the limited historical corpus alone, but findings should be validated against authentic sources whenever possible. This epistemological caution ensures that AI-generated language serves scholarship without displacing the irreplaceable value of historical linguistic evidence.
A promising avenue for addressing these questions is the integration of Retrieval-Augmented Generation (RAG) techniques [27,28]. By embedding curated fragments from authentic sources in Old English into a dedicated vector database, the model can dynamically retrieve contextually relevant passages during both training and synthetic data generation. This retrieval-based grounding not only helps mitigate issues such as hallucinations, repetitive text, and stylistic drift but also provides real historical references to guide the output of the model. More importantly, RAG can improve semantic coherence and context sensitivity, which makes it valuable for downstream applications such as question answering, where generating accurate and contextually anchored responses is critical.
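The retrieval step of such a system can be illustrated with a deliberately simple stand-in. In the sketch below, each stored fragment carries a Modern English gloss as metadata, and retrieval ranks fragments by cosine similarity between bag-of-words vectors of the query and the gloss. A real system would use learned embeddings and a vector database instead; the store layout, function names, and prompt wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Fragment = Tuple[str, str]  # (Old English text, Modern English gloss used as metadata)


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: List[Fragment], k: int = 1) -> List[Fragment]:
    """Return the k fragments whose gloss is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(store, key=lambda frag: cosine(q, bag_of_words(frag[1])), reverse=True)
    return ranked[:k]


def build_prompt(source_text: str, store: List[Fragment], k: int = 1) -> str:
    """Prepend retrieved authentic passages to a translation request (hypothetical wording)."""
    context = "\n".join(old for old, _ in retrieve(source_text, store, k))
    return f"Reference passages:\n{context}\n\nTranslate into Old English:\n{source_text}"


store = [
    ("Hwæt we Gardena in geardagum", "Lo, we of the Spear-Danes in days of yore"),
    ("Fæder ure þu þe eart on heofonum", "Our Father, who art in heaven"),
]
print(build_prompt("our father in heaven", store))
```

Retrieving on the gloss rather than on the Old English text sidesteps the vocabulary mismatch between a Modern English query and Old English passages, which a multilingual embedding model would otherwise have to bridge.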
Implementing RAG in a low-resource setting introduces additional technical complexities. Determining optimal chunk sizes, establishing effective metadata tagging, and managing the computational overhead of retrieval operations are all crucial to prevent performance bottlenecks and keep the system scalable. Nevertheless, recent work on the endangered Paiwan language of Taiwan [29] demonstrates the effectiveness of a hybrid architecture combining dictionary-based pre-translation, retrieval-augmented generation, and LLM post-editing for extremely low-resource pairs. This architecture is particularly relevant to Old English because the Bosworth–Toller lexicon [15, 16] could serve the same role as the Paiwan handcrafted bilingual dictionary in future iterations of our pipeline. Similarly, Bayat et al. [30] advanced RAG for Persian by developing language-specific embeddings (MatinaRoberta, MatinaSRoberta) and systematic benchmarks, showing that careful tuning of chunk size, temperature, and document summary indexing is decisive for RAG quality in morphologically rich low-resource languages.
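To make the chunk-size and metadata considerations concrete, the following sketch splits a source text into overlapping word windows and tags each chunk with its origin. It is an illustrative assumption rather than a description of any cited system: the Chunk record, the chunk_tokens function, and the default size and overlap values are hypothetical, and whitespace splitting stands in for proper tokenization.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str    # the chunk content
    source: str  # metadata: identifier of the originating text
    start: int   # metadata: index of the chunk's first token in the source

def chunk_tokens(text, source, size=8, overlap=2):
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(" ".join(window), source, start))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

demo = chunk_tokens("a b c d e f g h i j", source="demo", size=4, overlap=1)
```

Larger windows give the retriever more context per chunk at the cost of coarser matches, while the overlap reduces the chance that a relevant phrase is cut at a chunk boundary; suitable values would need to be tuned empirically.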
Another route to better model performance is the use of more powerful LLMs, such as GPT-4, larger variants of Llama, or the DeepSeek R1 model. With greater parameter counts and broader training corpora, these models could offer enhanced contextual understanding and stylistic versatility. However, such models require more computational resources and risk overfitting to noise in limited-data settings. Balancing computational efficiency against model accuracy and historical fidelity therefore remains a central concern for future research.
4.5. Broader Implications
This research demonstrates that advanced NLP and machine learning techniques, when carefully adapted, can bridge the gap between limited linguistic resources and modern computational methods. By generating and analyzing texts in Old English—a historically important but underrepresented language—our framework provides valuable tools for linguistic research, cultural preservation, and educational initiatives. The ability to produce high-quality synthetic data opens up new avenues for investigating the evolution of language, enriching digital archives, and creating interactive tools for scholars and the general public.
The dual-agent pipeline and data augmentation strategies presented here are broadly applicable to other endangered or low-resource languages. The trained models not only excel at unsupervised text generation but also show emergent capacity for high-quality translation and downstream NLP applications, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, especially when fine-tuned or adapted through task-specific prompts. Moreover, the flexible design allows for adaptation to domain-specific text generation or data augmentation tasks that may support a wide range of projects in digital humanities and computational linguistics.
To illustrate the transferability of the framework, we consider its potential application to Quechua, a low-resource language family spoken by approximately 8–10 million people across South America. Like Old English, Quechua varieties face significant resource scarcity [31]: the largest digital corpus (Quechua–Spanish parallel data) contains fewer than 200,000 sentence pairs, and most varieties lack systematic computational resources. Quechua shares several challenges with Old English: rich morphology (agglutinative in the case of Quechua, with extensive suffixation), relatively free word order, and dialectal variation that complicates standardization [32, 33].
Recent work on Quechua machine translation directly complements our methodology. Chen et al. [34] introduced QueEn, a Quechua–English system that combines retrieval-augmented generation with parameter-efficient fine-tuning of a relatively lightly adapted base model. Whereas QueEn's main strength lies in grounding outputs through retrieval over an external Quechua knowledge base, our framework contributes the complementary component largely missing from QueEn: a deep, staged domain adaptation followed by task-specific synthetic augmentation. A natural extension would therefore use our staged continual pre-training and dual-agent back-translation pipeline as the underlying translation engine, with QueEn-style RAG layered on top, plausibly yielding higher translation quality than either approach alone. Beyond Quechua-specific systems, the 2024 edition of the AmericasNLP shared task [35] reported ChrF++ improvements over strong baselines of up to 4.2 points for Chatino, Guarani, Quechua, and Rarámuri, confirming that data augmentation and transfer learning remain the most effective strategies for these languages. Even more recently, Asillo Congora et al. [36] showed at AmericasNLP 2025 that augmenting the Quechua parallel corpus with LLM-generated synthetic sentences and fine-tuning mT5 yields BLEU/ChrF++ improvements over the state of the art, which parallels the synthetic-data pipeline that we propose for Old English.
Beyond the case of Old English, these methodological innovations—which combine efficient fine-tuning, iterative back-translation, and modular architectures—offer a scalable, cost-effective blueprint for the revitalization and study of other languages that face resource scarcity, such as Quechua. Thus, this research not only contributes to linguistic and cultural heritage but also to the broader democratization of artificial intelligence for the benefit of society.
5. Conclusions
This work introduces a novel and effective framework for the generation of texts in Old English using state-of-the-art transformer-based language models. Through a combination of low-rank adaptation, back-translation, and a dual-agent pipeline, we expand the available corpus of Old English with linguistically accurate and stylistically plausible texts. Our multi-stage approach, which encompasses data preparation, domain adaptation, task specialization, and synthetic data generation, proves capable of capturing the grammatical and stylistic intricacies of the language, achieving a BLEU score of 65.41 and an average expert validation score of 8.7/10.
The evaluation results confirm the success of our methodology: the generated texts achieve high scores from both automated metrics (BLEU, METEOR, CHRF) and expert evaluators, especially in grammatical accuracy and lexical choice. However, semantic coherence and context remain areas for ongoing improvement, with challenges such as anachronisms and oversimplification of narratives present in some outputs. These limitations remind us that computational text generation, while powerful, operates within epistemic boundaries: synthetic Old English can augment but never replace the historical record, serving as a practical tool for corpus expansion and computational analysis rather than as authentic linguistic evidence.
The strategies developed here provide a scalable and replicable path for supporting and revitalizing other LRLs. Looking forward, integrating retrieval-augmented generation and leveraging more advanced language models offer promising directions to further improve both linguistic fidelity and contextual accuracy. Yet we must acknowledge what remains uncertain: whether machine-generated language, however technically accomplished, can fully capture the interpretive nuance, cultural resonance, and historical specificity of texts produced by human speakers embedded in their linguistic communities. Our contribution lies not in resolving this epistemological tension but in demonstrating that AI can meaningfully support linguistic preservation when deployed with critical awareness of its limitations. Such methods can remain valuable to computational approaches to cultural heritage provided that they respect the irreplaceable value of authentic historical sources.