Detection and Typology of Psalmic Text Reuses in the New Testament

de la Selle, Théotime; Mellerin, Laurence

doi:10.3390/rel17010088

Open AccessArticle

Detection and Typology of Psalmic Text Reuses in the New Testament

by

Théotime de la Selle

^*

and

Laurence Mellerin

^*

Centre National de la Recherche Scientifique, HiSoMA-Sources Chrétiennes, 22 rue Sala, 69002 Lyon, France

^*

Authors to whom correspondence should be addressed.

Religions 2026, 17(1), 88; https://doi.org/10.3390/rel17010088

Submission received: 19 November 2025 / Revised: 28 December 2025 / Accepted: 30 December 2025 / Published: 12 January 2026

(This article belongs to the Special Issue Computational Approaches to Ancient Jewish and Christian Texts)

Download

Browse Figures

Versions Notes

Abstract

In the context of the BiblIndex project, which is an online index of biblical textual reuses by the Church Fathers, intrabiblical intertextuality must be considered to better understand the underlying basis of the Church Fathers’ thought. This paper examines the reuse of Psalmic texts in the New Testament as a test case for experimenting with a detection tool that uses traditional natural language processing (NLP) methods exclusively. Biblical verses are compared using similarity measures based on various NLP operations, such as tokenization, lemmatization, part-of-speech tagging, stop word filtering and synset assignment. Textometric measures provide a framework for the numerical assessment of grammatical, lexical and semantic similarities between textual units. We demonstrate the efficiency of this reproducible method, which does not involve a ‘black box’ effect, for detecting and characterizing literal quotations and a significant range of echoes.

Keywords:

text reuse; intertextuality; natural language processing; textometry; similarity measure; Bible; New Testament; Psalms; BiblIndex

1. Introduction

This research forms part of the BiblIndex project, an online index of biblical textual reuses by the Church Fathers. It was developed at the Institut des Sources Chrétiennes in Lyon. The website currently provides lists of canonical references, linking a particular verse with a specific passage in a patristic work. The main advantage of this system is its ability to process large volumes of data and provide an overview of the role played by specific Bible passages in the history of interpretation. However, this can be an overly simplistic way of delineating the boundaries between source and target texts. Under other biases, it ignores fluctuations in the canon during the production of the earliest Christian writings as well as the influence of patristic quotations on the subsequent canonical form of biblical texts. In this paper, however, we will focus on another issue: the rich intrabiblical intertextuality that often makes identifying a precise biblical reference difficult when determining which text the Church Father was referring to. Our global technical objective is to develop a tool that classifies verses based on their proximity to others. This will help annotators of the BiblIndex corpus to select one biblical reference over another and offer website users alternative verses if they wish to expand their research.

Given the Psalms’ status as Jesus’ prayer book and their shared cultural significance in Jewish and Christian traditions, we selected the reuses of the Psalms in the Greek New Testament as a preliminary case study. We used the Septuagint version of the Psalms because specialists today consider this Greek version to be the most important source of Psalmic quotations in New Testament writings (Dorival 2016)1. This corpus is large enough to allow us to make several observations. We identified 614 Psalms reuses in the New Testament.

Instead of exploring the potential of supervised learning with neural networks, we used conventional similarity detection methods to evaluate their effectiveness and establish their potential usefulness to philologists and theologians annotating the BiblIndex corpus. A neural network approach would require a corpus of validated ancient biblical text that machines could use directly, and such a corpus currently does not exist. Additionally, the Bible is insufficient in volume. Furthermore, our goal is to produce a configurable tool that is transparent enough to allow the settings to be adjusted according to scientific criteria for evaluating reproducible results—a capability that would be precluded by AI. Firstly, we aim to identify text reuses ranging from distant thematic echoes to explicit literal quotations, and second, to classify them into a precise typology based on measurable criteria such as various morphosyntactic characteristics and proximity to the source text or explicit mentions of the act of quoting. This approach could be extended to other biblical corpora.

After presenting our gold standard reference corpus, we will outline the methodology adopted to compile the data. Next, we will present the numerical methods applied to detect and classify instances of text reuse. Finally, we will present and analyze the results.

2. The Gold Standard Reference for Text Reuse Between the Psalms and the Greek New Testament

The prevalence of psalm-like language in the New Testament is well-documented. A significant number of records relating to quotations from the Psalms, or more tenuous textual reuses, have been meticulously documented in the notes of most contemporary Bibles. In addition, many partial lists are available online. Several specific studies have also been conducted on quotations from the Old Testament in the New Testament. Some authors, including Archer and Chirichigno ([1983] 2005), have already developed typologies based on the proximity of the New Testament text to the Septuagint or Hebrew text. However, a thorough examination of the provided lists of occurrences reveals that the research is neither exhaustive nor definitive. The differing objectives of these studies are one key factor in explaining this lack. Some studies are only concerned with identifying literal text reuse for the purpose of philological analysis. Discrepancies also arise because the text may refer to the Hebrew text and Greek translations that differ from the Septuagint2. However, the main reason for the discrepancies is the variety of textual reuse, which extends far beyond simple, explicit quotations, consisting most of the time of tenuous echoes or reminiscences.

To compile a comprehensive list of references that serves as the gold standard for intertextual research and can be used as a reliable point of comparison with the results obtained by statistical analysis, we took into account every single instance of textual reuse found by us or by other analysts in previous centuries. We used the lists provided at the end of the following editions of the New Testament: Westcott and Hort (1890); Aland et al. ([1966] 1993); the marginal notes of Nestle and Aland (1984) and Nestle and Aland (2012). We also referred to specific studies on quotations from the Old Testament in the New Testament: Swete (1902) which gives a list of explicit quotations, Kirkpatrick (1902) in its appendix, Smits (1952–1963); Brachter ([1961] 1987); France (1971); Archer and Chirichigno ([1983] 2005); Fernandez Marcos (2000); Rüsen-Weinhold (2004); Moyise and Menken (2004); Dorival (2014). This involved identifying the common words in both the Psalms and the New Testament for each correspondence detected. A list of 614 references was obtained and classified using a simple typology. This typology includes three categories.

Quotations, which are verbatim textual reuses of different lengths that may be either explicit or implicit, depending on whether introducing terms referring to the Scriptures (e.g., καθὼς γέγραπται) are present. Quotations may be literal, meaning the words in the quoted psalm and New Testament verse are identical, or modified, meaning the source is inserted into the target text with minor lexical or morphosyntactic changes, e.g.,
Echoes are non-verbatim textual reuses that share something with the source text. There are three types of echoes: thematic, semantic, and lexical. Thematic echoes share a common theme, e.g.,
Semantic echoes share a common lexical field, e.g.,
Lexical echoes share one or two lemmas. These three characteristics may occur simultaneously.
Coincidences are lexical proximity in different thematic and semantic contexts.

The following typologies will be used in the Section 5:

Typology 1: distinction between quotations, echoes, and coincidences.
Typology 2: distinction between explicit and implicit.
Typology 3: distinction between literal, modified, lexical, semantic, thematic, and mixed text reuses. Mixed text reuses combine several characteristics from the list above.

Figure 1 displays the respective distributions of types within each typology. As can be seen, types have unbalanced populations. Note that the “modified” and “literal” types only apply to quotations.

3. Corpus Preparation

The successive stages of corpus preparation will be described. These stages are extensive and intricate, and require meticulous verification.

3.1. Source Texts

We used as source texts the Greek New Testament (denoted as NT in this paper) edited by Tischendorf in 1869, taken from the PROIEL project (Eckhoff et al. 2018) and the Rahlfs’ Septuagint (LXX), edited in 1950, taken from the BiblIndex data (text and verse segmentation). Rather than applying the natural language processing (NLP) pipeline from scratch on raw texts, we preferred to reuse preprocessed and verified open-source data as much as possible, as current NLP software programs still require time-consuming manual corrections when applied to Ancient Greek texts. Despite the existence of numerous sources of already lemmatized biblical texts available online, acquiring high-quality data remains challenging. Several textual databases containing the Septuagint and the Greek New Testament provide lemmas and morphosyntactic information. This information can be accessed by clicking on a specific word (e.g., Thesaurus Linguae Graecae, BibleWorks, etc.). However, users are unable to access a file containing the entire lemmatized and annotated text. Consequently, our dataset is a patchwork of several resources coming from different projects in computing humanities and NLP applications performed by us. New Testament tokens, lemmas and Part-of-Speech (PoS) tags originated from a PROIEL file3. and the Psalms’ lemmas originated from the OpenScriptures repository (Resources for Biblical Texts in Greek, specifically the Septuagint)4. Note that 291 verses were not annotated in the New Testament file (see below in Section 3.4 for details on data filling).

We had to pay particular attention to diacritical marks in the Septuagint data because these are encoded differently from those in the New Testament. Additionally, we identified errors in the diacritical mark encoding when integrating Trench’s synonyms Trench (1880) (see Section 3.7) and the Louw and Nida (1988) semantic domains (see Section 3.8) data to our analysis. Therefore, we decided to remove all diacritic marks from our dataset.

3.2. Partitioning the Corpus

Determining the basic textual unit for a tool designed to analyze reuse phenomena involves balancing algorithmic performance (optimizing the size of the units) with the analytical scale, in line with end-user practices. In our case study of intrabiblical intertextuality, philologists describe reuses, particularly in the critical apparatus, on the basis of the long-established system of versification (dating back to the sixteenth century).

To make similarity and proximity measures more interpretable while remaining close to philological practices, we chose biblical verses as the basic textual unit for our similarity measures. Alternatively, a segmentation into sentences or clauses could also have been adopted for greater semantic coherence. However, such an approach would have made it difficult to relate the results of the text reuse analysis back to the verse division, thus reducing the usability of the tool.

Nevertheless, as sub-verse units are commonly referred to in biblical and patristic studies—especially in the Psalms, where verses often consist of distichs—we carried out a manual segmentation based on units of meaning. Psalmic verses may therefore contain one, two, three, or even four parts in our analysis. Finally, our two corpora comprise 7939 verses from the New Testament and 3093 verse parts from the Book of Psalms. The next steps are applied independently to each of these textual entities.

3.3. Tokenization

All of our similarity measures are based on sequence comparisons at the word level. In our study, therefore, the tokenization step simply involves individualizing words based on spaces, apostrophes, and punctuation marks. In the case of Psalms, tokenization is performed using the SpaCy Ancient Greek pipeline5 named GreCy6 (note that tokenization and all necessary NLP steps are based on the grc proiel trf model7 which are primarily established on the New Testament corpus). After tokenization, we also used this model to perform cleaning steps, such as suppressing punctuation, accents and spirits.

3.4. Lemmatization

We then lemmatized our corpus. As explained above, the lemmas data were obtained from external sources. Nevertheless, we compared it with lemmatization performed using different NLP tools8: the Classical Language Toolkit9 and GreCy. CLTK’s performance on the Septuagint is unsatisfactory. It often fails to analyze complex verb forms and participles, and it frequently makes gender errors. An evaluation of the results on the first two psalms provides further insight (see image: the red words required a correction). Although CLTK proved more effective in its analysis of the New Testament, it failed to identify elided prepositions such as ‘μετ´’ and ‘παρ´’. Additionally, there were inaccuracies in conjugations, and personal pronouns were not always in the nominative case. Finally, we chose to use the lemmatized New Greek Testament available in PROIEL. This decision proved superior thanks to its superior Part-of-speech (PoS) analysis and systematic verification by Hellenists. However, we reported a systematic error in cases of crase (‘κἀγώ’ instead of ‘καὶ’, ‘ἀγὼ’), which we solved by automatically replacing terms in a post-processing step.

Unfortunately, the PROIEL data were incomplete: Chapter 13 of the Letter to the Hebrews, 1 Peter 3–5, 2 Peter 1–3, the epistles of John and Jude, and a few other verses were missing. This totaled 291 verses. Therefore, we had to complete the lemmatization and PoS tagging of these verses manually.

3.5. Parts-of-Speech

PoS tags were already provided by the New Testament source used (from the PROIEL project), and manual verification confirmed the quality of the tags. Tagging of Psalmic verse parts was done by GreCy in the same NLP pipeline application as tokenization. As these differed from the PROIEL PoS tags used for the New Testament, we chose to harmonize the data using the (Prévost et al. 2009) referential. Initially focused on medieval French and Latin languages, this French project has a broader objective of incorporating multilingual grammatical analysis. The resulting tags were thus obtained through a correspondence table (see Appendix A).

3.6. Stop Words

To compare the efficiency of these two methodological approaches, we created two versions of our texts: one with stop words and one without. We used the GreCy stop word list10.

3.7. Lexicon

Since paraphrasing a source text using synonyms is a common feature of intertextuality, we attempted to incorporate the ability to compare synonyms into our method. From the filtered lemmas of each verse, i.e., ordered lists of lemmas without stop words, we created a lexicon containing all the lemmas of the verse (without duplicates) in alphabetical order. Based on a digitized database of the Synonyms of the New Testament (Trench 1880), we built a table containing the 124 lists of synonymous lemmas (see the project GitHub repository11). Using this table, we replaced each lemma having synonyms with an arbitrary lemma within the corresponding list. For a given list of synonyms, the arbitrary lemma is obviously the same for each replacement. After completing this process, we obtained a lexicon (one for each New Testament verse or Psalms verse part) that was less specific to a given context than the filtered lemmas.

3.8. Semantic Domains

Since most quotations from the Psalms in the New Testament are neither explicit nor literal, we used tools to identify paraphrases and lexical similarities. Once again, despite the existence of many print resources12, we had to build our own tool.

Since we needed a mass processing of all the lemmas, we decided to use the (United Bible Societies 2023), adapted from the Semantic Dictionary of Biblical Greek (SDGNT). This revised and reformatted edition of Louw and Nida (1988) is supplemented with an exhaustive list of biblical references for each lexical meaning. It was produced and kindly made available by the Summer Institute of Linguistics (SIL). This dictionary contains a lexical analysis for each entry, including definitions, glosses, all scriptural references, and lexical-semantic domains or subdomains.

A table was created from an XML file13 including the Louw–Nida lexicon that associates semantic domains and subdomains with each New Testament lemma. Since the Louw–Nida lexicon only covers the New Testament, the table was completed for the 614 Psalms-specific lemmas: semantic domains and subdomains (from Louw–Nida categories) were attributed to each of these.

3.9. Multi-Representations of Verses from NLP Operations

We stored and gathered the resulting sequences of textual entities (tokens, PoS tags, lemmas, domains, etc.) for each of these NLP operations applied sequentially. Thus, each New Testament verse or Psalmic verse part is described by eight linguistic representations; organized in two tables containing, respectively, 7939 (NT)/3093 (Psalms) rows and eight columns of sequences:

Tokens: An ordered sequence of unaccented Greek terms.
Lemmas: An ordered sequence of unaccented Greek lemmas.
PoS: An ordered sequence of part-of-speech tags.
Stop words: An ordered sequence of unaccented Greek stop words.
Filtered lemmas: An ordered sequence of unaccented Greek lemmas without stop words.
Lexicon: An unordered sequence of unaccented Greek lemmas for synonym substitution.
Domains: An unordered sequence of semantic domains.
Subdomains: An unordered sequence of semantic subdomains.

Each sequence of items provides a distinct representation of a given verse. Thus, these sequences can be grouped into literal representations (tokens and lemmas), grammatical representations (PoS and stop words), lexical representations (filtered lemmas and lexicon), and semantic representations (domains and subdomains). These two tables constitute our datasets. In other words, they are the basis for all subsequent experimentation. Figure 2 presents an overview of the elaboration process of the two datasets (the New Testament and the Psalms), summarizing the operations, integration of external resources and application of GreCy. Note that, at this stage of our analysis, the data are strictly textual (no numerical values) and can thus be interpreted and reviewed by individuals without digital tool expertise. Complete versions of our New Testament and Psalms datasets (including the lemmatized NTG text) are available on the project GitHub14.

4. Numerical Methods

In the previous Section 3, we created verse representation datasets for the New Testament and the Book of Psalms, using raw Greek texts and linguistic or philological resources. These representations comprise only textual entities (e.g., Greek terms, lemmas, PoS tags) and are strictly knowledge-based, with no statistics involved. Consequently, the objectives of the numerical methods section are twofold.

First, we introduce algorithmic operations that quantify the similarity between each representation of NT and Psalms. These operations attribute a numerical value corresponding to the proximity of the sequences within representations of a given NT verse and a given Psalms verse part. Thus, each pair of an NT verse and a Psalmic verse part has eight values. These values correspond, respectively, to the measured similarities between the token, lemma, PoS, stop words, filtered lemma, lexicon, domain and subdomain representations of the given two verses (or a verse and a verse part). Therefore, we speak of a numerical textual reuse model based on literal, grammatical, lexical, and semantic similarity measures.

Second, we apply statistical methods to analyze these numerical measures of similarity between verse representations to automatically perform several philological tasks:

Detection of new textual reuses of the Psalms in the New Testament.
Prediction of existing Psalmic reuses in the NT (methodological validation).
Classification (i.e., clustering methods in data science) of these reuses according to the balance of the different types of similarity measures (grammatical, lexical, semantic, etc.).

Note that all of these analyses are based solely on unsupervised methods (including machine learning algorithms), meaning that no training is involved in the sense of data science. Although we created a gold standard dataset of psalm reuses in the New Testament (see Section 2), the information in it is not integrated into the statistical methods of our model. Rather, it is used for comparison to validate (see Section 5) and support the interpretation of the results (see Section 6).

4.1. Measuring Similarity Between Representations

To characterize the proximity between a given Psalmic verse part and an NT verse, we compute eight independent similarity measures, one for each type of verse representation. The results of detection, prediction and classification of reuse occurrences depend on the chosen similarity measure method. To propose a transparent method that avoids the black-box effect and preserves interpretability for non-experts, we chose a more deterministic approach than a statistical one, which implies the numeration15 of verse representations. Therefore, we used string similarity metrics (Navarro 2001)16, a common family of methods called that compare sequences of textual items (characters, words, etc.) directly and return numerical values corresponding to their definition of similarity.

4.1.1. Sequence Similarity Metrics

Since we have both ordered and unordered sequences (see Section 3.9), two different metrics are required. For the ordered sequences, we selected the Levenshtein distance, which is a widely used string metric (Navarro 2001) for this type of task because it is easy to understand. The Levenshtein distance

d_{L} (s_{1}, s_{2})

between two strings

s_{1}

and

s_{2}

is equal to the minimum number of operations on the elements of

s_{1}

necessary to obtain

s_{2}

(or conversely, as

d_{L} (s_{1}, s_{2})

is symmetric). Admissible operations on elements include deletion (delete an element anywhere in the sequence), insertion (add an element anywhere in the sequence) or substitution (change an element anywhere in the sequence). For example, the Levenshtein distance between the two Greek term lists

d_{L} ([ε ἰ μ ί, ϑ ε ό, ἀ ρ χ ή], [ε ἰ μ ί, ϑ ε ό ς, λ ό γ ο ς, ἀ ρ χ ή])

is equal to two as one insertion (λόγος) and one substitution (

ϑ ε ό \to ϑ ε ό ς

) are necessary to convert the first list to the second. For unordered sequences, we defined a set similarity metric (see Equation (1)) based on the difference between the union of the cardinal sum of the two sequences and the minimum cardinality. We also determined the minimum number of operations needed.

d_{s} (s_{1}, s_{2}) = C a r d (s_{1} \cup s_{2}) - m i n (C a r d (s_{1}), C a r d (s_{2}))

(1)

4.1.2. Normalization Processes

The results of the introduced similarity metrics depend directly on the lengths of the sequences. Since these are rarely equal (e.g., NT and Psalms verses), normalization of the similarity measures is necessary to avoid implying sequence lengths in the results. Additionally, comparing values within statistical methods requires a uniform scale. Therefore, we apply a normalization process based on sequence length comparison. This process is applied to each metric, yielding a measure of similarity ranging from 0 (identical) to 1 (totally different). Here, we propose two normalization processes that address sequence lengths differently. The first normalization process (see Equation (2)) divides the similarity measure between sequences by the length of the longest sequence. The second normalization process (see Equation (3)) addresses the issue of very long sequences requiring a large number of deletions in the Levenshtein distance process. This method attempts to reduce this bias.

δ_{i j}^{1} = \frac{d (s_{i}, s_{j})}{m a x i m u m (l_{i}, l_{j})} \in [0, 1]

(2)

δ_{i j}^{2} = \frac{d (s_{i}, s_{j}) - | l_{i} - l_{j} |}{m i n i m u m (l_{i}, l_{j})} \in [0, 1]

(3)

where

d (s_{i}, s_{j})

is the similarity metric (

d_{L}

or

d_{s}

) between sequences

s_{i}

and

s_{j}

, while

l_{i}

and

l_{j}

are the respective lengths of these sequences.

These two normalization processes may encode different information about the similarity between the representations, and since we do not know which process is best suited to a given pair of NT verse and Psalms verse part, we apply both processes on each of the eight representation similarity measures. This results in 16 numerical values in the range of

[0, 1]

for each pair of verses.

Finally, we obtain a knowledge-based model composed of 24,555,327 samples (all possible pairs between NT verses and Psalms verse parts) described by 16 features. This model forms the basis for various analyses of Psalmic reuses in the New Testament at different scales. Five samples are presented as examples in Appendix B. As this methodological study focuses on our numerical method applied for the detection of Psalmic reuses in the NT, we propose three approaches linked to the established gold standard: reuse prediction, detection, and clustering. By comparing with this reference, we can measure the detection performance and clustering coherence with typologies.

4.2. Reuse Prediction

Note that the ultimate goal of this tool is to assist philologists, patrologists and biblical scholars. We simulate the search for Psalmic reuses in the NT by performing reuse prediction on our intertextuality model. The process takes an input NT verse (or a range of NT verses) from the gold standard data set as an entry. It returns the number of predicted Psalms verse parts equal to the number of references to the input verse. Figure 3 illustrates the schematic view of this prediction process.

Computational analysis sorts all the verse pairs in the textual reuse model containing the input NT verse from the most probable reuse occurrence to the least probable. Finally, it returns the N Psalms verse parts associated with the N first sorted pairs. N is the number of Psalmic reuses related to the NT verse in the gold standard data. Then, we compare the predicted Psalms verse parts with the Psalms verse reuses in the gold standard.

In this strictly defined process, the sorting rule is an open choice. As our model comprises various types of similarity measures (literal, lexical, grammatical and semantic), we propose performing predictions either independently on each similarity measure (e.g. predictions based only on lexical similarity) or through a combination of several measures (e.g., the mean of similarity measures from the same normalization). In this study, we evaluate multiple cases: 16 similarity measure values (one by one), mean values of eight similarity measure values (i.e. all verse representations) for both normalization, and the overall mean value.

Note that in a real usage case (i.e., without a gold standard), the user defines the input verse, selects a sorting rule and specifies the number N of predicted Psalms, because the number of Psalmic reuses in the input verse is unknown.

4.3. Reuse Detections

The detection process simply involves computing a score for each pair in the textual reuse model and returning the top-scoring pairs (without any input). As with the prediction method, detection is based on a sorting rule chosen by the user. Here, we attempt to detect new intertextuality occurrences by using the mean values of the eight representations’ similarity measures for both normalizations and the overall mean value.

In a real usage case, reuse detection is a complementary approach to reuse prediction. This is because it does not depend on an input verse. Thus, it allows for the discovery of unexpected reuses. However, the results are too lengthy to examine, so not all reuses can be found through detection alone (in a large corpus). Additionally, not all types of sorting rules can be applied due to the high number of false positives when using certain similarity measures alone (e.g., the similarity measure of PoS representations).

4.4. Reuse Clustering

The multi-similarity measures approach enables comparison of similarity types (literal, grammatical, lexical and semantic) in order to categorize reuses according to their balance of these measures. The corresponding numerical methods are based on the statistics of a reuse dataset, which is obtained by extracting samples from the textual reuse model (i.e., building a reduced dataset by selecting pairs of verses). For this study, the reduced dataset is defined by the gold standard of reuses to enable a comparison of the statistical categorization (a posteriori) with the typologies of reuse established by experts.

Clustering methods are used computationally to find a statistically meaningful partition of the dataset. These methods investigate the underlying structure of the dataset, grouping its samples according to their relative distances within the space defined by the features. In this case, the features are the eight similarity measures. Several clustering algorithms (K-means, Gaussian Mixture Model, hierarchical clustering, and DBSCAN in UMAP Atlas) were tested and compared with the gold standard. For clarity, this study only presents and uses one clustering from the k-means model which can be considered as the simplest and most common clustering algorithm. It tries to partition n samples into k groups—k defined by user—where each sample belongs to the cluster with the nearest mean (cluster centers), resulting in a minimization of within-cluster variances (i.e. minimizing squared Euclidean distances). See (MacQueen 1967) for more details.

5. Results

5.1. Detecting Psalms Reuses

5.1.1. Comparison with Standard Text-Reuse Tool

The same Psalms reuse detection experiment was conducted using Passim17, a standard text reuse detection tool based on n-gram alignment (Smith et al. 2015), to enable a performance comparison using the same corpus (preprocessed as explained in Section 3) and gold standard (see Section 2). For both Passim and our method, we performed Psalms reuse detection in the New Testament in a fully unsupervised manner. We counted the number of detections that effectively belonged to the gold standard, and derived the percentage of gold standard reuses found. For our multi-representation similarity measures, we computed the mean values of the measures for normalization 1 and 2 as the sorting rule. For Passim, given the large number of parameters and how sensitive Passim is to them, various sets of parameters and text representations were explored to maximize performances18. Note that, to generate comparable results, we configured each method to obtain exactly the same number of reuse guesses. Results for both methods and Passim’s test cases for three different text representations are displayed in Table 1.

The results of Table 1 seem to show that it is difficult to detect text reuse in this dataset, regardless of the methods and or text representations used. However, as these results are based on only 240 reuse guesses, while the gold standard contains 614, these statistics must be interpreted relatively, not absolutely. In contrast, we found that our method, which is based on multi-representation similarity measures of text, significantly outperforms Passim’s best results. Additionally, we observed that lemmatization and stop word filtering negatively affected n-gram alignment for Psalms reuse detection in the New Testament. This means that, compared to our method, Passim avoids the NLP steps required to build text representations. However, Passim’s sensitivity to its parameters requires optimization, which may be time-consuming and difficult, while our approach does not require tuning.

5.1.2. New Psalms Quotations and Echos Detected in the New Testament

Although the textual reuses of Psalms in the NT have been widely investigated and referenced in our gold standard dataset, applying the detection process defined in Section 4.3 led to the detection of new reuses. For the sorting rules, we computed the mean values of measures for normalization 1 and 2. Of the 240 detection first results that we reviewed, i.e., 240 pairs of New Testament verses and Psalms verse parts, 56 existing quotations and 10 echoes were referenced in the gold standard dataset. Of the unknown verse pairs, 68 were unpublished possible reuses (to our knowledge), and the rest were false positives. As with the gold standard occurrences, we annotated the detected reuses according to the same typology. The profile of the newly detected possible reuses differs greatly from the gold standard distribution (see Table 2 in comparison with Figure 1a,b).

Among the 68 reuses detected, 20 are distant lexical coincidences, which cannot be considered real text reuses, but rather testify to a common lexical substrate. For example, the word τέλος, meaning “end,” appears in the recurrent Psalmic formula “Εἰς τὸ τέλος” and retains its literal meaning in the New Testament verse. Similarly, the expression “ἐν γλώσσῃ” was found. There are 45 implicit echoes to Psalmic verses, among others. For example, Acts 4:24 echoes Psalm 134:6. The former reads Δέσποτα, σὺ ὁ ποιήσας τὸν οὐρανὸν καὶ τὴν γῆν καὶ τὴν ϑάλασσαν καὶ πάντα τὰ ἐν αὐτοῖς, “Lord, you who made heaven and earth and the sea and everything in them”, while the latter reads πάντα, ὅσα ἠϑέλησεν ὁ κύριος, ἐποίησεν ἐν τῷ οὐρανῷ καὶ ἐν τῇ γῇ, ἐν ταῖς ϑαλάσσαις καὶ ἐν πάσαις ταῖς ἀβύσσοις, “The Lord has done everything he wanted to do in heaven and on earth, in the seas and in all the depths.” Both verses just share the semantic domain of creation. Another example is John 10:3 and Psalm 99:3, which share the word πρόβατα. Of these echoes, one-third have no shared words with the proposed New Testament verse, and only thematic similarity. Only three detections are implicit non-literal quotations:

In the first two cases, the Bible’s annotations refer to another psalm because they are undoubtedly part of an explicit, longer quotation. 1 Peter 3:10–12 quotes Psalm 35:13–17, and Hebrews 1:6–8 quotes Psalm 8:5–7. In the last case, Hebrews 10:8 repeats Hebrews 10:5 and refers to the same psalm in the Bible apparatus. Since the first reference was so recent, noting the corresponding verse again was deemed unnecessary. In conclusion, none of the new detections are useful for exegetical or theological analysis.

5.2. Predictions of Psalms Reuses in the New Testament

As introduced in Method Section 4.2, the automatic prediction of reuses is an experiment conducted with our tool on a labeled dataset, i.e., a gold standard. After reviewing the reuse detections, we added the 68 positive reuses detected to the existing dataset presented in Section 2. This yielded a complete reuse dataset of 682 occurrences. The model performs a detection for each of these entries: it tries to predict a source Psalmic verse from a New Testament verse input, seeking the highest similarity value within all possible pairs of NT verses/Psalms verse parts. This prediction test was applied independently for each type of representation in order to compare the performance of each similarity measure and evaluate their respective contributions to a global prediction rate. Prediction success is measured by prediction rates, i.e., the ratio of true predictions to the total number of predictions for each representation. These results are presented in Figure 4, which shows a bar plot of the 16 prediction rates and contributions. Contribution is a measure of how relevant a representation is to the prediction, i.e., the increase in the true predictions due to a specific representation. As with the prediction rate, contribution is a ratio of true predictions for each representation, but it is computed among the subset of true predictions obtained only through a single representation.

5.2.1. Prediction Rates

First, we observe similar trends in the results of both normalizations: the eight bars on the left (normalization 1) are compared with the eight bars on the right (normalization 2). The normalization 1 results are always significantly higher than the normalization 2 results, except for the subdomains. It happens because normalization 2 is much more sensitive, generating more false positives. Second, comparing the prediction rates from the different representations yields the following results:

Filtered lemmas and lexicon are the best representations for automatically predicting reuses.
The lexicon has significantly better performance (in terms of prediction rates and contribution) than filtered lemmas. This means that using a synonym database (see Section 3.7) is relevant for reuse detection.
Tokens are slightly better than lemmas, especially for normalization 1. This is likely due to the presence of stop words in these two representations because lemmatized stop words may produce more false positives.
PoS, stop words, domains and subdomains—i.e., the most abstract representations—exhibit lower prediction rates than representations that include meaningful terms.

Finally, we note that the prediction rates are quite low. The highest value is approximately 0.3 and only eight of the 16 bars are greater than 0.1. However, the overall prediction rate–i.e., the ratio of correct predictions from all representations–reaches 0.5. This means that reuse predictions are not achieved by a single representation. The contribution score aims to measure the impact of each representation on the overall prediction rate.

5.2.2. Contributions

We observe that lexicon 1 accounts for about

30 %

of the reuses predicted by a single representation. However, the surprising result is that, except for PoS and domain 1, all representations increase the overall prediction rate. Even representations with very low prediction rates or that appear redundant provide information for the automatic reuse model. This is particularly noticeable for domain 1 and subdomain 1 which have relatively high contributions compared to the same representations in normalization 1. These results demonstrate the relevance of normalization 2 for reuse prediction, and the relevance of a multi-representation approach combining literal, grammatical, lexical and semantic similarity measures.

5.2.3. Predictions by Typologies

Reuse predictions are not uniformly distributed within the gold standard typologies. Among the

50 %

of correct predictions, quotations and explicit reuses are strongly overrepresented, compared to echoes, coincidences and implicit reuses. Figure 5 shows these results, with high scores reaching

0.9

for quotations and explicit reuses, and mitigated results around

0.4

for the rest. Due to the significant proportion of echoes and implicit reuses in our corpus, this average result of

0.4

is closer to the resulting prediction rates and the overall rate given in Figure 4. More precisely, the missed quotations are implicit or modified. The only exception is one literal, explicit quotation (John 10:34/Psalm 81:6). This leads to a

98 %

prediction rate of literal quotations. Given the nature of our corpus—composed of a majority of echos, coincidences and implicit reuses (cf. Figure 1)–achieving such a high prediction rate for literal and explicit text reuse is a significant achievement.

5.3. Clustering of the Psalms Reuses in the New Testament

We created a reuse dataset consisting of 682 rows, each described by 16 values ranging from 0 to 1, based on 16 similarity measures derived from our eight representations and two normalizations. As explained in Section 4.4, we used the K-means method to cluster these reuse occurrences into a small number of groups that share literal, grammatical, lexical, and semantic characteristics. We selected five clusters and compared them to the third typology described in the gold standard section: literal, modified, lexical, semantic, thematic, and mixed. After clustering all reuses into five groups, we computed the consistency of each group in relation to the reuse types. Finally, we plotted the proportion of each reuse type within the five clusters in Figure 6.

As we can see in Figure 6, there is significant coherence between our third typology and the clustering results through the type proportions (indicated by colors) in the bars representing reuse clusters. Each cluster can be characterized and attributed to a type as follows:

Thematic cluster: Thematic echoes with a small amount of lexical reuses, as well as some rare modified, literal and semantic reuses.
Mixed thematic, lexical and semantic echoes (mainly).
Semantic cluster: Almost all semantic reuses are found in this cluster.
Pseudo-literal cluster: A significant portion of literal and modified (near-literal) quotations, as well as thematic and semantic reuses.
Literal cluster: The vast majority of literal and modified quotations.

Note that lexical echoes or coincidences, and modified quotations, which are not quantitatively predominant, are not really assigned to a specific cluster, but rather spread across all clusters. This is probably because lexical similarity forms the basis of our intertextuality model, as demonstrated in the prediction results in Section 5.2 above. This means that the model has difficulty differentiating lexical reuses from other types. We also observe two opposite trends in the proportion of reuse types: As the number of literal quotations in a cluster increases, the quantity of thematic reuses decreases. Therefore, we can interpret the process of reuse attribution corresponding to our third typology as follows: The more subtle a reuse is (i.e., the farther it is from a literal quotation), the more likely the expert is to tag it as thematic. Finally, the consistency of the unsupervised reuse classification, shown in Figure 6, indicates that similarity measures based on verse representations encode information about reuse linguistic categories. Thus, they can help characterize intertextuality between two corpora.

In addition, to interpret the clustering, we computed eight mean similarity measures—one for each representation—for each cluster. We plotted these mean measures for a given normalization on a radar chart (see Figure 7) to compare the values and shapes of the five clusters for a given normalization. Regarding normalization comparisons, normalization 2 is clearly more sensitive to differences between reuses because the shapes of the clusters diverge much more than they do with normalization 1. Although normalization 1 produced significantly better prediction results, normalization 2 is nevertheless more suitable for characterization. For clarity, we chose to only present normalization 2 radar charts. As expected, cluster shapes decrease as the proportion of literal reuses increases (from cluster 1 to cluster 5). We also found that parts-of-speech similarity measures were lowest because parts-of-speech values were restricted to a few tags. As expected, measures based on the lexicon and filtered lemmas exhibit the greatest variation between clusters. Measures based on domains and subdomains are not significant for discriminating between clusters. However, semantic subdomains and domains representations help distinguish clusters 1 and 2 from cluster 3, which contains semantic reuses. Clusters 1 and 2 (thematic) differ from cluster 3 (thematic and lexical). This difference lies mainly in parts-of-speech and stop words. This can be explained by the nature of lexical similarity, which uses stop words. A textual analysis of the reuses in clusters 1 and 2 should confirm this. In conclusion, cluster content analysis is a relevant method for characterizing reuse clustering. In the case of an unlabeled dataset, it can be used to propose a typology of reuses based on clustering.

6. Discussion

6.1. A Reuse Characterization Tool

As an extension of the cluster content analysis in Section 5.3, we present examples of reuse that are analyzed using similarity measures and plotted on radar charts. Figure 8 shows two such charts. The left chart contains three quotations (one explicit, one implicit, and one implicitly detected), and the right chart contains two implicit echoes.

First, significant variations in similarity measures are observed between reuses, even when they belong to the same type (quotation or echo). The two selected echoes (Figure 8b) have very different profiles. The first echo, between 1 Corinthians 13:1 (κύμβαλον ἀλαλάζον) and Psalm 150:5 (ἐν κυμβάλοις ἀλαλαγμοῦ), has a high semantic similarity, as seen in the domains and subdomains values. However, the difference in the context and meaning of this common expression explains why it is characterized as an echo. The second echo, between John 4:36 and Psalm 125:5, exhibits better grammatical and lexical similarities, as seen in the Parts-of-speech and lexicon values. Both verses share the theme of harvest, but their contexts exclude the idea of quotation.

Quotations exhibit very low values compared to echoes. We observe a quantitative difference between our three quotations. The literal quotation of Psalm 15:10 found in Acts 2:27–28 has lower values than the two modified quotations, except for the stop words value, which is lower for the detected quotation because of its length and the high number of common lemmas.

The variations between the two implicit quotations (the detected one and the gold standard one) demonstrate qualitative differences. The reuse of Psalm 143:3 in Hebrews 2:6, which experts did not characterize as a quotation due to its proximity to Psalm 8:5, was detected because of the grammatical similarity between the two verses.

The gold standard quotation of Psalm 21:19 in Matthew 27:35 demonstrates grammatical similarities in Parts-of-speech and stop words, and lexical similarities in filtered lemmas and lexicon.

Note that the measures of lemmas and tokens are also sharply impacted because many stop words are taken into account. Finally, this brief analysis of reuse examples demonstrates the intertextuality model’s ability to capture reuse characteristics and provides methods for analyzing reuse at various scales, from single reuses to groups of tens or hundreds of reuses, as shown in the clustering results in Section 5.3.

6.2. Psalmic Reuses in the NT Map

We created a map that illustrates the locations and types (typology 1) of all the reported and categorized reuses of Psalms in the New Testament (NT), which originate from literature (see Section 2) or the detection results (see Section 5.1). The map provides an overview of trends in psalmic quotations, echoes, and coincidences in NT chapters (Figure 9), as well as of explicit and implicit reuses (Figure 10). With these maps, one can make qualitative observations, such as the comparative density of reuses in different NT books. The Book of Revelation most frequently reuses the Psalms (mainly echoes), while the Epistle to the Hebrews exhibits the highest density of quotations (echoes are rare). Additionally, the first chapter of the Gospel of Luke contains a high number of quotations, and several patterns are repeated in the Gospels. For example, all four Gospels contain the same quotations of Psalm 21 during Jesus’ passion, and three quotations of Psalms 109 and 117 (cited twice) appear in a similar pattern in the second half of each Synoptic Gospel. Note also the evolution of reuse types and density in the Pauline epistles, especially the low number of reuses (only a few echoes) in the second half of the Pauline corpus, as defined by the order retained here. Figure 10 shows that although the Catholic epistles and the Book of Revelation are the densest texts in the NT, there are no explicit reuses. The Gospel of John reuses the Psalms as many times as the three Synoptic Gospels combined. Conversely, the influence of particular psalms on the NT is evident. For example, the significance of Psalm 109 is emphasized, especially in the Epistle to the Hebrews, and Psalm 21 is referenced repeatedly in the Synoptic Passion narratives. These maps illustrate the significant role of Psalmic echoes in the NT, which serve as a woven backdrop. While quotations play an essential, albeit limited, role in specific narratives or demonstrations, especially when made by Jesus himself, echoes better demonstrate the presence of Psalms in the NT and the continuity between the two testaments.

6.3. A General Tool for Textual Reuse Detection

Our goal was to incorporate as many features as possible from traditional natural language processing (NLP) tools to determine whether a combined approach could produce useful results for biblical and patristic scholars working on BiblIndex, all the while avoiding the black box effect. The resulting tool detects quotations and certain thematic and semantic echoes in texts that have not yet been analyzed by humans. In the context of the BiblIndex project, our work demonstrates that merely reporting reuse occurrences between a biblical corpus and another corpus citing biblical texts is insufficient. Intra-biblical references must be analyzed in parallel because many cross-reuses may be missed by analysts, as demonstrated in the detection results in Section 5.1. Note that reuse of Psalms in the New Testament has been widely studied and mainly takes the form of echoes, as explained in the dataset preparation (see Section 2). This makes it a difficult test for a reuse detection tool. The clustering-based typology and the possibilities of reuse characterization are also convincing. Additionally, it generates fewer false positives than tools that use n-gram methods (see, for instance, Forstall et al. (2014) or different experiments conducted with TRACER19 in the 2010s). Therefore, this approach can be useful for analyzing large patristic corpora or other ancient Greek texts, saving a significant amount of time. However, applying this approach to the same text representations requires digitized linguistic resources, such as synonyms and semantic domains, or an extension of the biblical resources used here.

Furthermore, depending on the corpus, the proposed detection and prediction methods may miss a large proportion of echoes. This shows that, even though our approach leverages the ability to predict reuses through the integration of biblical synonyms and semantic domain databases, development efforts must still be made to achieve satisfactory results for a detection tool. This improvement will likely require large language models that are trained on specific corpora in ancient languages and fine-tuned to measure textual similarity.

6.4. Analyzing Textual Similarity from Intertext Embedding

More generally, the proposed intertext model is not limited to the three operations performed here: detection, prediction, and clustering. Composed of various linguistic field similarity measures (grammatical, lexical, and semantic) between all pairs of textual entities—in this case, verses—from two corpora, it allows for the analysis of intertextuality at various scales (from verses to entire books) and can be employed in several types of investigations (e.g., text classification and stylometry). Additionally, due to its data structure, which resembles text or sentence embeddings (Mikolov et al. 2013a, 2013b; McGovern et al. 2025), the model can be referred to as “intertext embedding” and described as a knowledge-based sentence pair similarity represented in a vector space. Through this embedding architecture, machine learning algorithms can perform the aforementioned types of analyses. The clustering of reuses of the gold standard proposed here is an example using a restricted set of samples in the embedding. However, a larger or more specific set of verse pairs could undergo clustering analysis.

7. Conclusions

In the context of the BiblIndex project, an online index of biblical textual reuses by the Church Fathers, this work experiments with a numerical approach to the unsupervised detection and characterization of intra-biblical reuses. It introduces a new method of measuring literal, grammatical, lexical and semantic similarities between pairs of textual entities. Unlike recent language models which have a black box effect that limits their usage in philology, the goal was to propose a transparent method.

The tool is applied to two ancient Greek corpora: the Book of Psalms from the Septuagint and the New Testament. In parallel, a gold standard of 614 Psalms reuses in the New Testament has been constituted from an extended literature review and manually tagged through three different typologies. Each pair of Psalms and New Testament verses is described by a set of numerical measures derived from eight representations obtained through natural language processing operations combined with Greek biblical synonyms and semantic domains databases. The resulting intertextuality model formally consists of a knowledge-based sentence pair embedding and is deployed in three reuse analysis tasks: detection of Psalms reuses within the gold standard (outperforming the standard n-gram alignment method) and within unpublished reuses (68 new reuses were found); prediction of Psalms reuses from each New Testament verse within the gold standard to evaluate the model’s efficiency; clustering of gold standard reuses to automatically assign reuse types in coherence with the gold standard typologies.

We concluded from this experiment that lexical representations are best suited for reuse prediction, though the combination of all representations significantly increases the prediction rates from 0.3 to 0.5. This final rate, which is not high enough for a reuse detection tool in biblical or patristic studies, is a weighted average of all reuse types, which have various individual prediction rates ranging from 0.38 for implicit echoes to 0.98 for literal quotations. We also compared two normalization processes and determined that the basic process was more relevant for reuse prediction, while the second one permits a clearer characterization of reuses and clusters.

Finally, this study demonstrates the relevance of our multi-representation approach, which combines common NLP methods and linguistic databases, for the detection and characterization of reuses. However, prediction results must improve for a fully automated reuse detection tool to cover implicit echoes in the BiblIndex project. A more abstract representation—especially one based on large language models—would likely improve performance, albeit at the expense of interpretability due to a lack of transparency in the method.

Author Contributions

Conceptualization, T.d.l.S. and L.M.; methodology, T.d.l.S. and L.M.; software, T.d.l.S.; validation, T.d.l.S. and L.M.; formal analysis, T.d.l.S. and L.M.; investigation, T.d.l.S. and L.M.; resources, T.d.l.S. and L.M.; data curation, L.M.; writing—original draft preparation, T.d.l.S. and L.M.; writing—review and editing, T.d.l.S. and L.M.; visualization, T.d.l.S.; project administration, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from Biblissima+, Observatoire des cultures écrites anciennes, de l’argile à l’imprimé—ANR-21-ESRE-0005, https://projet.biblissima.fr/en (accessed on 28 December 2025). A CC-BY public copyright license has been applied by the authors to the present document and will be applied to all subsequent versions up to the Author Accepted Manuscript arising from this submission, in accordance with the grant’s open access conditions.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and codes supporting results are available on the project GitHub repository: https://github.com/Tdelaselle/Psalms_in_NT (accessed on 28 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Lists of Part of Speech Tags

PoS PROIEL	PoS GreCy	PoS (Cattex Prévost et al. 2009)
C-	CCONJ	CON
Df	ADV	ADV
Dq	ADV	ADV
Du	ADV	ADV
F-	NUM	NUM
G-	SCONJ	CON
I-	INTJ	INTJ
Ma	NUM	NUM
Mo	NUM	NUM
N-	VERB	VERB
Nb	NOUN	NOUN
Ne	PROPN	PROPN
Pc	PRON	PRON
Pd	PRON	PRON
Pi	PRON	PRON
Pk	PRON	PRON
Pp	PRON	PRON
Pr	PRON	PRON
Ps	PRON	PRON
Pt	PRON	PRON
Px	PRON	PRON
Py	NUM	NUM
R-	ADP	PREP
S-	DET	DET
V-	VERB	VERB
X-	NaN	X
A-	ADJ	ADJ
	AUX	VERB

Appendix B. Intertextuality Model Samples

Table A1. Examples of 5 intertextuality model samples described by 16 features obtained through similarity measures between 8 representations of verses for 2 normalization processes. Columns full names: tokens, lemmas, parts-of-speech, stop words, filtered lemmas, lexicon, domains, subdomains.

NT	Ps	Part	tok.	lem.	PoS	Stops	filt. lem.	lex.	dom.	subdom.
Lk 14:23	Ps 102:1	2	0.90	0.86	0.67	0.70	1.00	0.91	0.65	0.91
			0.88	0.81	0.56	0.67	1.00	0.86	0.62	0.90
Ac 3:26	Ps 55:1	1	0.90	0.80	0.70	0.64	1.00	1.00	0.81	1.00
			0.82	0.64	0.45	0.43	1.00	1.00	0.64	1.00
Mt 8:31	Ps 73:21	1	1.00	0.94	0.62	0.89	1.00	1.00	0.82	0.90
			1.00	0.91	0.45	0.75	1.00	1.00	0.67	0.88
He 7:21	Ps 109:4	1	0.74	0.79	0.58	0.83	0.71	0.43	0.40	0.70
			0.67	0.73	0.47	0.75	0.71	0.43	0.25	0.67
Ro 11:10	Ps 68:24	1	0.29	0.14	0.21	0.12	0.17	0.17	0.50	0.59
			0.23	0.08	0.15	0.00	0.17	0.17	0.27	0.44

Notes

1	In this experimentation, we don’t take into account the textual variations of the Septuagint.
2	See, for example, Amphoux and Dorival (2006) and Jobes and Silva (2000). Some quotations align with a Septuagint text that has been realigned with the Hebrew text, or which is attributed to Theodotion. The same quotations may appear on multiple times in different forms (e.g., Ps 117:22). Additionally, the New Testament contains hitherto unknown textual forms (such as Ps 21:19 in Mt 27:43 and Ps 39:7 in He 10:5), etc.
3	https://github.com/proiel/proiel-treebank; all websites have been visited 28 December 2025
4	https://github.com/openscriptures/GreekResources and https://github.com/openscriptures/GreekResources/tree/master/LxxLemmas.
5	Python open-source software library for advanced natural language processing (3.8.5 version used here). See https://spacy.io/.
6	https://github.com/jmyerston/greCy?tab=readme-ov-file.
7	https://huggingface.co/Jacobo/grc_proiel_trf.
8	For an overview of lemmatisers for ancient languages, see (Clérice 2022).
9	https://cltk.org/.
10	See a stop words list on the project GitHub repository.
11	https://github.com/Tdelaselle/Psalms_in_NT.
12	For the New Testament: Trench (1880); Louw and Nida (1988); the BDAG Bauer et al. (2021) which serves as a reference. For the Septuagint, Muraoka (2009) and online Ancient Greek Word Net (https://greekwordnet.chs.harvard.edu/), Lueur.org (https://lueur.org/), StepBible (https://www.stepbible.org/), Blue Letter Bible (https://www.blueletterbible.org/).
13	https://github.com/ubsicap/ubs-open-license.
14	https://github.com/Tdelaselle/Psalms_in_NT.
15	E.g., through Bag-of-Word, TF-IDF, word or sentence embeddings.
16	In computer programming, a string is an ordered sequence of characters.
17	https://github.com/dasmiq/passim.
18	Retained parameters set: $n = 15$ , $M \in [1, 5]$ , $a \in [1, 3]$ , $g \in [1, 15]$ . See Passim’s GitHub for details about these parameters.
19	https://www.etrap.eu/research/tracer/.

References

Aland, Kurt, Matthew Black, Carlo Maria Martini, Bruce M. Metzger, and Allen Wikgren. 1993. The Greek New Testament. Stuttgart: Deutsche Bibelgesellschaft. First published 1966. [Google Scholar]
Amphoux, Christian, and Gilles Dorival. 2006. Des oreilles tu m’as creusées ou un corps tu m’as façonné? À propos de ps 39 (40 tm), 7. In Philologia— Mélanges Offerts à Michel Casevitz. Edited by Pascale Brillet-Dubois and Édith Parmentier. Lyon: Université Lyon 2, pp. 315–27. [Google Scholar]
Archer, Gleason L., and Gregory Chirichigno. 2005. Old Testament Quotations in the New Testament, 2nd ed. Eugene: Wipf & Stock Publishers. First published 1983. [Google Scholar]
Bauer, Walter, Frederick William Danker, William Frederick Arndt, and Felix Wilbur Gingrich. 2021. A Greek-English Lexicon of the New Testament and Other Early Christian Literature. Chicago: University of Chicago Press. [Google Scholar]
Brachter, Robert G. 1987. Old Testament Quotations in the New Testament, 2nd ed. New York: Amer Bible Society. First published 1961. [Google Scholar]
Clérice, Thibault. 2022. Détection D’isotopies par Apprentissage Profond: I’exemple de la Sexualité en latin Classique et Tardif. Ph.D. thesis, Université Jean Moulin, Lyon, France, March 28. Available online: https://theses.fr/2022LYSE3007 (accessed on 28 December 2025).
Dorival, Gilles. 2014. L’Ancien Testament du Nouveau Testament. In Manuel de Critique Textuelle du Nouveau Testament. Edited by Christian Amphoux. Brussels: Safran, pp. 194–210. [Google Scholar]
Dorival, Gilles. 2016. La réception de la Septante. Paris: Les Belles Lettres. [Google Scholar]
Eckhoff, Hanne, Kristin Bech, Gerlof Bouma, Kristine Eide, Dag Haug, Odd Einar Haugen, and Marius Jøhndal. 2018. The PROIEL treebank family: A standard for early attestations of Indo-European languages. Language Resources and Evaluation 52: 29–65. [Google Scholar] [CrossRef]
Fernandez Marcos, Natalio. 2000. The Septuagint in Context: Introduction to the Greek Version of the Bible. See chap. 17 on biblical quotations and chap. 21 on the LXX in the NT. Leiden, Boston and London: Brill. [Google Scholar]
Forstall, Chris, Neil Coffee, Thomas Buck, Katherine Roache, and Sarah Jacobson. 2014. Modeling the scholars: Detecting intertextuality through enhanced word-level n-gram matching. Digital Scholarship in the Humanities 30: 503–15. [Google Scholar] [CrossRef]
France, Richard Thomas. 1971. Jesus and the Old Testament: His Application of Old Testament Passages to Himself and His Mission. London: Regent College Publishing. [Google Scholar]
Jobes, Karen H., and Moisés Silva. 2000. The Septuagint and the quotations in the New Testament. In Introduction to the Septuagint. Waco: Baylor University Press, pp. 195–99. [Google Scholar]
Kirkpatrick, Alexander Francis. 1902. The Book of Psalms. The Cambridge Bible. Cambridge: Cambridge University Press. [Google Scholar]
Louw, Johannes P., and Eugene Albert Nida. 1988. Greek-English Lexicon of the New Testament Based on Semantic Domains. 2 vols, New York: United Bible Societies. [Google Scholar]
MacQueen, James B. 1967. Multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Los Angeles, CA, USA, June 18–21. Berkeley: University of California, Vol. 1, pp. 281–97. [Google Scholar]
McGovern, Hope, Hale Sirin, and Tom Lippincott. 2025. Characterizing the Effects of Translation on Intertextuality using Multilingual Embedding Spaces. arXiv arXiv:2501.10731. [Google Scholar] [CrossRef]
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv arXiv:1301.3781. [Google Scholar] [CrossRef]
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. Paper presented at 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, June 9–14. [Google Scholar]
Modular Aggregation of Resources on the Bible (MARBLE). 2023. Semantic Dictionary of Biblical Greek (SDBG). Adapted from the 2018–2023 version of the UBS Semantic Dictionary, itself based on Louw and Nida (1988). Swindon: United Bible Societies. [Google Scholar]
Moyise, Steve, and Maarten J. J. Menken, eds. 2004. The Psalms in the New Testament. The New Testament and the Scriptures of Israel. London and New York: T&T Clark. [Google Scholar]
Muraoka, Takamitsu. 2009. A Greek-English Lexicon of the Septuagint. Louvain and Walpole: Peeters. [Google Scholar]
Navarro, Gonzalo. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33: 31–88. [Google Scholar] [CrossRef]
Nestle, Eberhard, and Kurt Aland. 1984. Novum Testamentum Graece et Latine, 27th ed. Stuttgart: Deutsche Bibelgesellschaft. [Google Scholar]
Nestle, Eberhard, and Kurt Aland. 2012. Novum Testamentum Graece et Latine, 28th ed. Stuttgart: Deutsche Bibelgesellschaft. [Google Scholar]
Prévost, Sophie, Céline Guillot, Alexei Lavrentiev, and Serge Heiden. 2009. Jeu d’étiquettes Morphosyntaxiques CATTEX2009. Available online: http://bfm.ens-lyon.fr/IMG/pdf/Cattex2009_2.0.pdf (accessed on 28 December 2025).
Rüsen-Weinhold, Ulrike. 2004. Der Septuagintapsalter im Neuen Testament: Eine textgeschichtliche Untersuchung. Neukirchen-Vluyn: Neukirchener Verlag. [Google Scholar]
Smith, David A., Ryan Cordell, and Abby Mullen. 2015. Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers. American Literary History 27: E1–E15. [Google Scholar] [CrossRef]
Smits, Cornelis. 1952–1963. Oud-Testamentische Citaten in Het Nieuwe Testament. 4 vols, Nijmegen: Malmberg. [Google Scholar]
Swete, Henry Barclay. 1902. An Introduction to the Old Testament in Greek. Reprint: New York, 1968; Quotations from the LXX in the New Testament. Cambridge: Cambridge University Press, pp. 383–84. [Google Scholar]
Trench, Richard Chenevix. 1880. Synonyms of the New Testament. London: Trübner & Co., Ltd. [Google Scholar]
Westcott, Brooke Foss, and Fenton John Anthony Hort. 1890. The New Testament in the Original Greek. Cambridge: Cambridge University Press. [Google Scholar]

Figure 1. Distributions of types for the Psalms reuse typologies 1 (a), 2 (b) and 3 (c).

Figure 2. Schematic view of the New Testament and Psalms dataset constitution through NLP operations. Each NT verse or Psalms verse part is described by eight linguistic representations (blue boxes). Textual, linguistic or philological external resources are colored gray, while the application of GreCy is depicted in orange.

Figure 3. Schematic view of Psalmic reuses prediction process with validation. Orange color depicts computation steps while blue one depicts textual input or output and gray preprocessing or postprocessing (evaluation process in this case).

Figure 4. Statistics of the predictions for all the independent similarity measures, based on the eight representations and the two normalizations. The colored bars indicate the accuracies, and the black boxes indicate the respective contributions to reuses that were correctly predicted by a single representation. The numbers 1 and 2 on the abscissa axis refer to normalization.

Figure 5. Prediction rates for reuse typologies 1 (a) and 2 (b).

Figure 6. Consistency of clusters with typology 3. The five bars depict the reuse clusters, in which the proportions of each type are shown in different colors.

Figure 7. Mean similarity measures of the five clusters superposed on a radar chart (normalization 2).

Figure 8. Similarity measures of five reuse examples superposed on two radar charts (normalization 2) (a). 3 quotations: a literal and two modified, with one that has been detected by the model (b).

Figure 9. Map of Psalms reuses in the New Testament. The NT verses on the x-axis are organized according to the standard Bible order, while the psalms are sorted numerically. The coloring is based on a reuse typology that distinguishes between quotations, echoes and coincidences. Vertical black dotted lines indicate the boundaries of NT books.

Figure 10. Map of Psalms reuses in the New Testament. The NT verses on the x-axis are organized according to the standard Bible order, while the psalms are sorted numerically. The coloring is based on a reuse typology that distinguishes between explicit and implicit reuses. Vertical black dotted lines indicate the boundaries of NT books.

Table 1. Comparison of text-reuse detection methods through percentages of gold standard Psalms reuses detected. The 240 best mean values of all representations’ similarity measures are compared to the 240 reuse guesses proposed by Passim for 3 different text representations.

Method	Text Representation	Gold Standard Covering Rates
Passim	Token	$8.31 %$
Passim	Lemma	$7.65 %$
Passim	Filtered lemma	$6.35 %$
Multi-Representation Similarity	All (mean value)	$10.75 %$

Table 2. Detected reuses distributions within typologies 1 and 2.

Typology 1	$4 %$ quotations	$29 %$ coincidences	$77 %$ echos
Typology 2	$3 %$ explicit reuses	$97 %$ implicit reuses

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

de la Selle, T.; Mellerin, L. Detection and Typology of Psalmic Text Reuses in the New Testament. Religions 2026, 17, 88. https://doi.org/10.3390/rel17010088

AMA Style

de la Selle T, Mellerin L. Detection and Typology of Psalmic Text Reuses in the New Testament. Religions. 2026; 17(1):88. https://doi.org/10.3390/rel17010088

Chicago/Turabian Style

de la Selle, Théotime, and Laurence Mellerin. 2026. "Detection and Typology of Psalmic Text Reuses in the New Testament" Religions 17, no. 1: 88. https://doi.org/10.3390/rel17010088

APA Style

de la Selle, T., & Mellerin, L. (2026). Detection and Typology of Psalmic Text Reuses in the New Testament. Religions, 17(1), 88. https://doi.org/10.3390/rel17010088

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Detection and Typology of Psalmic Text Reuses in the New Testament

Abstract

1. Introduction

2. The Gold Standard Reference for Text Reuse Between the Psalms and the Greek New Testament

3. Corpus Preparation

3.1. Source Texts

3.2. Partitioning the Corpus

3.3. Tokenization

3.4. Lemmatization

3.5. Parts-of-Speech

3.6. Stop Words

3.7. Lexicon

3.8. Semantic Domains

3.9. Multi-Representations of Verses from NLP Operations

4. Numerical Methods

4.1. Measuring Similarity Between Representations

4.1.1. Sequence Similarity Metrics

4.1.2. Normalization Processes

4.2. Reuse Prediction

4.3. Reuse Detections

4.4. Reuse Clustering

5. Results

5.1. Detecting Psalms Reuses

5.1.1. Comparison with Standard Text-Reuse Tool

5.1.2. New Psalms Quotations and Echos Detected in the New Testament

5.2. Predictions of Psalms Reuses in the New Testament

5.2.1. Prediction Rates

5.2.2. Contributions

5.2.3. Predictions by Typologies

5.3. Clustering of the Psalms Reuses in the New Testament

6. Discussion

6.1. A Reuse Characterization Tool

6.2. Psalmic Reuses in the NT Map

6.3. A General Tool for Textual Reuse Detection

6.4. Analyzing Textual Similarity from Intertext Embedding

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Lists of Part of Speech Tags

Appendix B. Intertextuality Model Samples

Notes

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI