Article

Computational Stylometrics and the Pauline Corpus: Limits in Authorship Attribution

Cyber Operations Department, Y-12 National Security Complex, Oak Ridge, TN 37830, USA
Religions 2025, 16(10), 1264; https://doi.org/10.3390/rel16101264
Submission received: 5 August 2025 / Revised: 15 September 2025 / Accepted: 28 September 2025 / Published: 1 October 2025
(This article belongs to the Special Issue Computational Approaches to Ancient Jewish and Christian Texts)

Abstract

The computer age has transformed Pauline stylometric analyses by enabling quick, repeatable studies. However, studies continue to produce conflicting results. This paper highlights the core limitations of computational stylometric analyses in contributing to Pauline authorial-attribution questions. Without secularly verifiable, authentic texts, there is no referential Pauline stylome, and thus the proper mathematical model to evaluate authorship cannot be deciphered. The theoretical objections presented are supported by an NLP study of the New Testament texts and letters from the Roman emperor Julian that produces characteristically incoherent and malleable results. These findings suggest researchers should proceed with caution when applying quantitative methods to fundamentally epistemological problems.

1. Introduction

Stylometry, the statistical analysis of literary style, operates under the fundamental premise “that each author has a unique stylistic fingerprint that will always be left behind when writing a text.” (Pracht and McCauley 2025, p. 7). If this premise is valid, then stylometry can provide a less subjective contribution to assessing the authorship of disputed texts. A famous application of stylometrics is Mosteller and Wallace’s analysis of anonymous texts in The Federalist Papers, which concluded that the probability of Madison’s versus Hamilton’s authorship of certain disputed texts was at least 80 to 1 (Tanur et al. 1989, p. 121). These are compelling figures that lend credibility to stylometric applications. Indeed, stylometry can produce powerful insights in corpora with known authorial provenance, but in the context of the Pauline corpus, there is no secularly verifiable authorial corpus, rendering stylometrically informed authorial attributions theoretically insecure.
Stylistic analysis, in a rudimentary form, has long been employed in Pauline studies to judge the authenticity of writings. For example, Origen, as recorded by Eusebius, argued that the language of Hebrews “is not rude like the language of the apostle” and “the diction and phraseology” do not seem to be Paul’s (Eusebius n.d.). In modernity, stylometric analysis has evolved into a statistical discipline used to distinguish between letters of “undisputed” and “disputed” Pauline authenticity. Stylometric efforts have intensified during the rise of the computing age, as laborious tasks prone to human error can now be automated and made repeatable. The streamlining of stylometric analyses has been coupled with a philosophical shift in scholarship positing that “linguistic variation is best explained by author variation,” adding increasing weight to the field (Van Nes 2017, p. 1). Despite this, computational stylometric studies published in the last two decades of the 20th century varied widely in their conclusions about the number of authentic letters, ranging from an “authentic” corpus of four to twelve letters (White 2025, p. 138).
Over the past 10 years, computational stylometrics has become increasingly complex, especially with the rise of machine learning, but study results still differ. Libby concluded “it is in the realm of paradigmatics where the Pastoral Epistles, especially Titus and 1 Timothy, differ most from the rest of the epistles.” (Libby 2016). Van Nes found that grouping is malleable based on the employed method (Van Nes 2018), and that the pastorals do not significantly differ from the rest of the Pauline corpus in many syntactical features (Van Nes 2017, p. 203). Savoy’s analysis found four distinct clusters in the Pauline corpus, and he asserts that the results “indicate clearly” that Ephesians, Colossians, Titus, 1 Timothy, and 2 Timothy are non-Pauline (Savoy 2019). Van der Ventel and Newman found symmetric pairings between Romans and Galatians, Ephesians and Colossians, and 1 Timothy and Titus (Van der Ventel and Newman 2022). Pracht and McCauley found that generally, the pastoral epistles did not differ from the rest of the corpus at a statistically significant level (Pracht and McCauley 2025, p. 20). Finally, White concluded that “the Pauline Epistles divide into several groups of closely related texts—the Hauptbriefe, 1–2 Thessalonians, Philippians and Philemon, Ephesians and Colossians, 1 Timothy and Titus.” (White 2025, p. 200).
Why have discordant results emerged? Certainly, some of the causes are varied methodologies. For example, whereas most models measure lexical and syntactical features of the Pauline corpus, Pracht and McCauley measure 18 higher-level “stylistic” features (Pracht and McCauley 2025, p. 9). In effect, these methodologies are exploring different questions. While it would bolster the validity of Pauline stylometrics for various methodologies to converge on a similar conclusion, it is an established feature of statistical modeling that analytic choices affect a study’s outcome. This was convincingly demonstrated by Silberzahn et al. (2018), where researchers were given the same dataset and asked to analyze the same question. The researchers’ analytic approach (e.g., Zero-Inflated Poisson Regression vs. Poisson Regression) and covariate selections yielded significantly different results and, critically, “peer ratings of the quality of the analyses also did not account for the variability” (Silberzahn et al. 2018). Thus, there were legitimate discordant results considered valid by peers.
Another important methodological difference is how the manuscripts are processed. While some studies favor processing letters in their entirety, some research has broken letters down into subpopulations, such as Mealand’s use of 1000-word samples (Mealand 1996). Each of these approaches may have its merits, but relationships that appear in an aggregated regression can change once the data are stratified by a confounding variable—a phenomenon known as Simpson’s paradox.
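As a minimal illustration of the sampling approach (the 1000-word size follows Mealand; the text itself is a stand-in, not an actual manuscript):

```python
# A minimal sketch of fixed-size word sampling. The sample size follows
# Mealand's approach; the input text is an invented stand-in.
def word_samples(text, size=1000):
    """Split a text into consecutive samples of `size` words; a final
    partial sample is dropped so that all samples are the same length."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

letter = "word " * 2500  # stand-in text of 2500 tokens
samples = word_samples(letter, size=1000)
print(len(samples))  # 2 full samples; the 500-word remainder is dropped
```

Whether to drop, pad, or separately analyze the final partial sample is itself an analytic choice of the kind discussed above, and each option can shift downstream results.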
In addition to methodological variance, fundamental assumptions incorporated into studies drastically impact their outcomes. The most important distinction is determining whether to pre-judge any selection of writings as authentically Pauline. Pracht and McCauley’s analysis assumes a seven-letter authentic corpus which they use to define the “authentic” Pauline stylome to judge other writings against (Pracht and McCauley 2025, p. 6). On the other hand, White pre-judges no writings as authentic, and uses a cluster analysis to make stylometric claims about the corpus (White 2025, p. 157).
Beyond the varied methodologies, assumptions, and results, some common assumptions exist in all aforementioned studies, as follows:
  • The historical Paul authored one or more of these letters.
  • Stylometry can contribute to resolving Pauline authorship disputes.
Both of these assumptions are fundamentally flawed, as will be argued theoretically and demonstrated through designing and manipulating an unsupervised machine learning model. The resulting contention presented in this paper is that since there is no secularly verifiable reference corpus, there is no way to design an accurate computational stylometric model that can contribute to the Pauline authorship dispute, and this limitation produces discordant results, as various methodologies are equally valid in the epistemologically uncertain landscape.

2. Broad Objections to the Validity of Pauline Computational Stylometric Assumptions

2.1. Assumption: The Historical Paul Authored One or More of These Letters

Stylometric studies compare the style, defined in a litany of ways, of writing from one or more texts against another set of one or more texts. Here, stylometry is defined as the measurement of objective features within a text. This definition excludes analyses that use measurements dependent on disputed interpretation. For example, overarching Pauline concepts, such as the centrality of “saved by faith alone,” are excluded, but the measurement of the phrases and sentences in question is included.
The key factor in broadly successful stylometric forensic investigations is that a known authorial corpus provides an authoritative stylome to judge disputed texts against. Much of The Federalist Papers had known authors, as Hamilton and Madison produced attribution lists with strong concurrence (Tanur et al. 1989, p. 116), and Mosteller and Wallace used these known letters to derive predictive discriminators to judge the disputed letters against (Mosteller and Wallace 1963, pp. 275–78). In their analysis, Mosteller and Wallace report that “The single best discriminator we have ever discovered is upon, whose rate is about 3 per thousand for Hamilton and about ⅙ per thousand for Madison.” (Mosteller and Wallace 1963, p. 278).
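The discriminating power of such rates can be illustrated with a back-of-envelope calculation. The sketch below uses a simple Poisson occurrence model, a deliberate simplification of Mosteller and Wallace's fuller Bayesian treatment; the rates for upon are theirs, but the document length and observed count are invented for illustration:

```python
# Illustrative only: a Poisson stand-in for Mosteller and Wallace's richer
# Bayesian model. Rates for "upon" are from their study; the document
# length and observed count below are invented.
import math

def poisson_loglik(count, rate_per_thousand, n_words):
    """Log-likelihood of observing `count` occurrences in a document of
    `n_words` words, given an author's rate per thousand words."""
    lam = rate_per_thousand * n_words / 1000
    return count * math.log(lam) - lam - math.lgamma(count + 1)

n_words, upon_count = 2000, 0  # a hypothetical 2000-word paper with no "upon"
ll_hamilton = poisson_loglik(upon_count, 3.0, n_words)   # ~3 per thousand
ll_madison = poisson_loglik(upon_count, 1 / 6, n_words)  # ~1/6 per thousand
print(ll_madison > ll_hamilton)  # True: zero occurrences favors Madison
```

The calculation only works because both authors' rates are estimated from texts of known provenance; remove that knowledge and the likelihoods have no anchor.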
The word upon is not a common word used by Hamilton or Madison. Hamilton and Madison share 18 of their 20 most frequently used words (MFWs), and neither list contains upon (Figure 1):
Upon is Hamilton’s 39th MFW and is outside of the top 100 for Madison. That the word upon is the most predictive discriminating feature to measure is telling only when examining the manuscripts with prior knowledge of authorship. An author’s style is not defined by the use of a single word, but rather the common combinative features present across their productions. If the corpuses are approached from the perspective that there are three authors producing anonymous letters, and the analysis is extended to measuring the 100 MFWs, three of the known attributions become misgrouped (Figure 2).
If the feature selection shifts to the 100 MFW-pairs (bigrams), then 12 papers are misgrouped (Figure 3).
This demonstrates how feature selection is not obvious, and Mosteller and Wallace’s compelling results were driven by precise feature selection from known authorial corpuses, which are a necessity for valid stylometric analyses.
An additional consideration in unsupervised stylometric analyses is understanding the number of authors to measure. If we did not know there were three authors of The Federalist Papers, the k-hyperparameter would need to be chosen, and choosing a k-value different from the actual authorship structure would drastically reduce the quality of results. For example, if we tested the same 100 MFW unigrams against the k-value of five, instead of three, our misgroupings increase to 28 (Figure 4). However, there would be no definitive way to know that this was incorrect.
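A minimal k-means sketch (pure Python, with invented two-dimensional frequency vectors rather than actual Federalist data) shows how the choice of k alone reshapes the resulting groups:

```python
# Bare-bones k-means on invented 2-D "MFW rate" vectors, illustrating the
# sensitivity of unsupervised grouping to the k-hyperparameter.
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Return a cluster label for each point after `iters` Lloyd steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

# Two well-separated groups of three documents each (values invented).
docs = [(1.0, 0.1), (1.1, 0.2), (0.9, 0.1), (5.0, 3.0), (5.2, 2.9), (4.8, 3.1)]
print(kmeans(docs, k=2))  # the two underlying groups are recovered
print(kmeans(docs, k=3))  # an extra center splits one group spuriously
```

Nothing in the algorithm signals that k = 3 is "wrong"; both runs converge and return confident-looking partitions, which is precisely the problem when the true number of authors is unknown.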
Mosteller and Wallace used these curated features, derived from the known contributions of Madison and Hamilton, against disputed letters between two out of three candidates. Since the unknown letters matched Madison’s signals significantly better than Hamilton’s, there is a high likelihood of Madison authorship. But what is the comparison in Pauline stylometry? Here, there is no secularly verifiable authorial corpus (Morton 1965, p. 6) or certainty of the number of authors in the traditional Pauline corpus; there is no way, outside means of revelation, to know that any letters attributed to Paul were written by the historical figure, either alone or in a group effort. The central problem with employing stylometrics to contribute to Pauline authorship disputes is that without knowing who, or how many individuals, authored each text, there is no reference stylome(s), and thus no way to determine the correct mathematical model to build.
Our earliest extant reference to Paul’s letters is likely 1 Clement, a text estimated to have been written decades after Paul’s life (Nicklas et al. 2021, p. 51), and we have no clear extant eyewitness reports to attest to Paul’s authorship. As others have noted, the second century contains more references to “disputed” letters such as Ephesians and Colossians than the “undisputed” letters of 1 Thessalonians, Philippians, Philemon, Galatians, and 2 Corinthians (Strawbridge n.d.). Consequently, the appeal to history cannot soundly rebut this objection, and Pauline stylometry has an intrinsic limitation preventing stylometric comparisons to the “true” Paul. Despite discussions of a consensus surrounding the seven-epistle “undisputed” corpus, the alleged consensus is mere opinion that cannot be objectively proven through existing evidence. To make an additional leap and say that, through mathematical analyses, the similarities between letters can contribute to identifying an author without a reference corpus, and without certainty of the number of authors in the corpus, is logically incoherent.
A counterargument would be to say that certain Pauline letters are stylistically and theologically congruent enough as to self-evidently validate their authenticity. First, if this argument was valid, it would demonstrate that some letters possess authorship which was soon identified as Paul; it would not demonstrate genuine Pauline authorship. Second, this argument does not align with the current evidence from forgery identification. Studies have consistently demonstrated that forgeries are exceedingly difficult to identify using stylometrics when authors obfuscate their style or mimic others.
In 2009, Brennan and Greenstadt demonstrated that authors imitating another author’s style were largely successful in defeating authorial attribution techniques, “attributing authorship of the passages to the intended victim of the attack in most instances.” (Brennan and Greenstadt 2009, p. 60). Juola and Vescovi analyzed additional methods against Brennan and Greenstadt’s obfuscation corpus and reported that “no method was able to perform ‘significantly’ above chance.” (Juola and Vescovi 2011, p. 121). In the PAN 2016 Author Obfuscation challenge, the best obfuscator was able to fool verifiers in about 47% of cases (Hagen et al. 2016). In 2019, Bevendorff et al. were able to decrease the attribution accuracy of state-of-the-art compression-based techniques to 55% with minimal text manipulation (Bevendorff et al. 2019, 2020, p. 1104). In 2024, neural paraphrasing models demonstrated the ability to defeat stylometric attribution (Fisher et al. 2024). All of this research demonstrates that authorial imitation and obfuscation are demonstrably capable of confounding attribution, and thus the Pauline corpus’ apparent congruency is not sufficient to demonstrate genuine Pauline authorship. This issue is compounded by the quantity of forged productions, as “examples of authentic authorial representation are the exception rather than the rule among Christian works from this era.” (Goodacre 2025, p. 132).
Might the argument from history and stylistic and theological congruence merit some degree of probability of Pauline authorship? Perhaps, but it is a probability outside of scientific specification that casts a specter of doubt upon the field of Pauline stylometry. When the assumptions are weak, the conclusions are untrustworthy.

2.2. Assumption: Stylometry Can Contribute to Resolving Pauline Authorship Disputes

The consequences of acknowledging the lack of an authoritative reference corpus are great. If this objection is valid, it follows that stylometric analysis cannot provide significant contributions to Pauline authorship disputes, because without a large corpus of known authorial texts, the authentic Pauline stylome is unknowable, and thus the proper mathematical model to evaluate authorship cannot be deciphered.
The extent to which the circumstances and rhetorical bent impact the style of writing differs by author. Since we have no certain authorial reference points to study the extent of Paul’s stylistic elasticity across circumstantial factors, the acceptable variance level across subject, tone, genre, and mood cannot be taken into account during computational stylometric analyses. Studies from known authorial corpuses demonstrate that writing style can significantly change over time (Can and Patton 2004; Ríos-Toledo et al. 2022). Since we do not have known reference texts from Paul across various stages of his life, the acceptable variance level cannot be taken into account during computational stylometric analyses. A.Q. Morton, a pioneer in Pauline computational stylometry, acknowledged these limitations, writing, “to be useful in the determination of authorship,” authorial habits must “be shown to be unaffected by changes in subject matter, by the passing of periods of time, by reasonable differences in literary form and all other possible influences which might affect the habit.” (Morton 1978, p. 96). However, without a known authentic reference corpus, no habits can demonstrably pass these criteria. Morton unsoundly avoided this issue by simply assuming Galatians was authentic in his analyses (Morton 1965, p. 14).
It is also unknowable to what extent Paul’s amanuenses impacted the style of his letters. As the letters attest (Rom. 16:22; 1 Cor. 16:21; Col. 4:18; 2 Thess. 3:17) (Johnson 2001, p. 58), Paul used aides to write some of his letters. Since we do not have secularly verifiable authentic letters from each aide, in addition to secularly verifiable authentic letters from Paul’s hand, there is no way to analyze the actual stylistic effect aides had on Paul’s compositions. Similarly, Paul lists cosenders (Col. 1:1; 1 Cor. 1:1; 2 Cor. 1:1; Phm. 1:1; Php. 1:1; 1 Thess. 1:1; 2 Thess. 1:1), and given it was not conventional to list noncontributing cosenders, coauthors could obscure Paul’s individual stylome (Richards 2004, pp. 103–6). The inability to disentangle potential contributions of amanuenses and coauthors is significant for our study. Literary–historical analyses may validly analyze clusters of style within specific traditions without possessing incontrovertible knowledge of historical figures. In contrast, the Pauline authorship dispute is authorial-historical, as Paul’s specific authorship affects authority and reception. It is not enough to demonstrate that a letter is “Pauline,” and the alleged corpus is rife with conflict between the letter author and respective communities, despite their common tradition.
Furthermore, it is unknowable how much scribal corruption altered the textual details and style of the manuscripts in our possession. It is unsound to say that because the variation between our oldest and current manuscripts does not significantly affect stylometric analyses, then the variation between the lost originals and our oldest manuscripts would also not significantly affect stylometric analyses. Without the original manuscripts, it is unknowable how faithfully the extant manuscripts preserve the style of the originals.
Paul’s theological consistency across time also cannot be determined. Even within the “undisputed” corpus, there are signs of apparent elasticity on issues. 1 Thessalonians 4:15 has the resurrection event firmly within Paul’s lifetime, yet Philippians 1:23 and 2 Corinthians 5:8 seem to indicate that it may occur beyond his lifetime. The extent to which Paul may have changed, upgraded, or rephrased his thinking is unknowable, and allowing for certain “authentic developments” while disallowing others is conjecture. Legitimate developmental theology can confound context-aware computational analyses, as the varied theology will alter the location and significance of words, increasing the apparent difference between potentially genuine statements.
Discussions surrounding statistical significance also lose meaning, as statistical significance cannot be meaningfully derived without an authoritative reference corpus. There is a litany of potential metrics to utilize in stylometric analyses. Within a single author’s corpus, the writing style is dependent on genre, topic, length, time consideration, and other factors. In some analyses, Burrows’ Delta performs the best, while in others, Cosine Delta performs the best (Evert et al. 2017). There are arguments for n-grams and most frequent words being the best measures of authorial attribution, along with arguments for Support Vector Machines trained on these data (Mikros 2013). Ultimately, without secularly verifiable genuine Pauline texts, there is no reference point to derive the most accurate metric for identifying Paul’s specific style. Consequently, analyses may be corrupting results with less accurate indicators, with no definitive way to know.
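For concreteness, Burrows' Delta in its basic form is simply the mean absolute difference of z-scored most-frequent-word rates; the sketch below uses invented per-thousand-word rates for three toy documents:

```python
# A minimal sketch of Burrows' Delta; the MFW list and per-thousand-word
# rates below are invented for illustration.
import statistics

def burrows_delta(doc_a, doc_b, corpus_docs, mfw):
    """Mean absolute difference of z-scored MFW rates between two documents,
    with z-scores computed against the whole corpus."""
    delta = 0.0
    for w in mfw:
        rates = [d.get(w, 0.0) for d in corpus_docs]
        mu, sigma = statistics.mean(rates), statistics.stdev(rates)
        if sigma == 0:
            continue  # a word with no variance cannot discriminate
        delta += abs((doc_a.get(w, 0.0) - mu) / sigma
                     - (doc_b.get(w, 0.0) - mu) / sigma)
    return delta / len(mfw)

d1 = {"kai": 35.0, "de": 20.0, "en": 12.0}
d2 = {"kai": 34.0, "de": 21.0, "en": 11.0}
d3 = {"kai": 10.0, "de": 40.0, "en": 2.0}
corpus = [d1, d2, d3]
mfw = ["kai", "de", "en"]
print(burrows_delta(d1, d2, corpus, mfw) < burrows_delta(d1, d3, corpus, mfw))  # True
```

Even in this toy setting, the metric's verdict depends entirely on which words populate the MFW list and which distance variant is used, and without a reference corpus there is no principled way to validate either choice.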
Some scholars have attempted to overcome these issues by appealing to studies conducted on other corpuses, such as White’s appeal to J.N. Adams’ study of Claudius Terentianus’ letters that supposedly demonstrates that amanuenses had little effect on the authorial style (White 2025, p. 181). First, it is important to clarify that Adams’ analysis leads him to thrice state his belief that Terentianus “dictated” the letters to his scribes (Adams 1977, pp. 3, 4, 84). Strict dictation was not the only method of scribal contribution, as Adams’ comments imply, and other scholars view strict dictation as unlikely in Paul’s case (Richards 2004, p. 92). Second, Adams’ analysis only examines six letters, which is an exceedingly small dataset to extrapolate conclusions from. Third, other analyses have noted that “the letters vary widely in many spelling features and orthographic styles,” (Conley 2017, p. 29) and the letter P.Mich. VIII 469 is noted as a potential autograph of Terentianus precisely because its style is noticeably different from that of the scribal letters, indicating amanuenses can perceptibly influence texts (Halla-aho 2018, p. 221). Furthermore, within the Pauline corpus, there are observable stylistic differences between coauthored letters; for example, “Paul uses a plural Thanksgiving formula only in the letters that he coauthored with Timothy.” (Richards 2004, p. 35).
A more important point, applicable to all objections that appeal to external corpuses, is that generalizations drawn from one body of texts cannot be confidently projected onto an unrelated authorial corpus. Various studies have been performed to determine how much authorial styles change over time, which stylometric indicators best cluster known authorial writings, and how much genre impacts stylometric indicators; however, none of this research tells us anything specifically about Paul. Morton, a supporter of Pauline stylometrics, addresses this argument dismissively, stating, “it is assumed that Paul was a writer of Greek prose, unique only in the same sense as we are all individuals.” (Morton 1965, p. 5). While it is certain that all individuals differ, and individual variation can be meaningfully discussed without knowing absolute mean-distances among all authors, the unknowable extent to which Paul differs from the mean is a legitimate issue in quantitative analyses. Because we lack a secularly verifiable Pauline corpus, we cannot gauge how far his stylome deviates from the contemporaneous norm, so appeals to mean-based characteristics are conjectural, and extrapolating models on unknown data is known to generally degrade accuracy (Kneusel 2024, pp. 18–19). White’s reference to Adams’ study is flawed for the same reason; unless it is demonstrated that amanuenses systematically preserve an author’s stylome at scale, which is not shown by Adams, the concern remains unresolved.
All of these issues point to a common conclusion: Pauline computational stylometrics is too fundamentally flawed to significantly contribute to the Pauline authorship discussion. To demonstrate how these issues inhibit valid stylometric analysis, an unsupervised natural language processing (NLP) study was conducted with robust feature selection. Then, with the knowledge that the precise Pauline stylistic features are unknowable, feature adjustments were made that altered the study’s results. This elucidates that stylometric studies cannot contribute to Pauline authorial–attribution disputes since the feature and parameter selections are subjective and equally valid without a known reference corpus.

3. Basic Study Design

3.1. Core Files

This study was implemented as a comprehensive Python 3.12.3 analysis pipeline. It was composed of five core files: text normalization and tokenization, feature extraction, feature processing and similarity calculations, NLP using CLTK and transformers, and clustering analysis and validation. It used a main analysis script to coordinate the analytical pipeline, collect the manuscripts, process the texts, and generate the results.

3.2. Data

This study used the entirety of the New Testament from the open-source SBLGNT edition, with direct Old Testament quotes removed in accordance with best practices (White 2025, pp. 148, 169). According to White, “The differences between the manuscripts of the Pauline Epistles themselves and the modern editions based on them seem to fall within Eder’s margin of error, such that disciplinary standards for stylometry can still be applied with consistent results.” (White 2025, p. 186). As a result, the analysis conducted should not inherently be considered as limited to the versions employed.
Six of Julian the Apostate’s letters were selected as a distractor corpus: To Dionysius, Letter Fragment, To the Same Person, Untitled Letter about the Argives, To Sarapion the Most Illustrious, and To Libanius the Sophist. Julian’s letters were sourced from the Perseus Digital Library and range from 753 to 3951 words per letter. Typically, distractor corpuses should match the “time period, language, region, genre, and gender” of the target corpus (Juola 2015, p. 106). While Julian lived three centuries after Paul, the remaining criteria are broadly met: Julian was trained in Christian theology; the selected letters are in Koine Greek; Julian was born in Constantinople, while Paul is alleged to have sojourned extensively through modern-day Turkey; the letters attributed to each follow the typical Greco-Roman epistolary style; and both authors were male.
While the difference in time period is substantial, Greco-Roman letter conventions were largely stable throughout antiquity (Riehle 2020, p. 58; Richards 2004, p. 85). Additionally, while the conventional distractor corpus criteria are critical for supervised classification problems (did x or y write z), they are less pivotal for unsupervised models. Unsupervised learning algorithms discover structure in unlabeled data and group them based on similarity. In this model, the distractor corpus should be similar enough to co-cluster with Paul under weak indicators, yet distinct enough that variation is identifiable. This contextualizes our output to inform us whether the model has discovered meaningful differences within the broader context of other New Testament productions.
Julian’s corpus is disputed; the extent to which he actually authored the letters and the effect of his amanuenses is unknown. However, these are precisely the issues with the Pauline corpus. If a clean, uniform, single-voice corpus were used in a study alongside the potentially multi-voice Pauline corpus, the algorithm would be over-helped in clustering these groups separately. What logically follows is that there is no way to select an ideal distractor corpus, since the degree of multi-vocalness is unknowable. Given these considerations, Julian’s letters are a reasonable selection.

3.3. Design

This study took an unbiased approach to the manuscript analysis, pre-judging none of the 33 texts as authentic to ensure they are clustered purely on linguistic grounds. This methodological choice reflects the actual epistemic situation facing Pauline stylometrics: researchers cannot know a priori which texts are authentic, yet many approaches assume certain letters are authentic as a baseline for comparisons.
Supervised machine learning occurs by providing an algorithm with examples to learn from (input to output mappings: x → y). Since there is no secularly identifiable Pauline corpus, there is nothing to classify as authentically Pauline for the learning algorithm. As these data lack pre-defined output labels, an unsupervised learning study is warranted.
Clustering algorithms were employed which took these unlabeled data and automatically grouped them based on similarity. This highlights a key consideration of stylometric studies analyzing authors that lack secularly verifiable productions; namely, that similarity does not imply common authorship.
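The point that clustering asserts only similarity can be made concrete: nothing in the arithmetic below refers to authorship, only to the geometry of invented feature vectors:

```python
# Cosine similarity over invented feature vectors (e.g., normalized MFW
# rates) for three anonymous texts. The grouping is purely geometric;
# authorship is an interpretation imposed on the output, not a property
# the mathematics establishes.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

texts = {
    "text_A": [0.9, 0.1, 0.3],
    "text_B": [0.88, 0.12, 0.28],
    "text_C": [0.1, 0.9, 0.6],
}
sim_ab = cosine(texts["text_A"], texts["text_B"])
sim_ac = cosine(texts["text_A"], texts["text_C"])
print(sim_ab > sim_ac)  # True: A and B group together on similarity alone
```

Shared genre, topic, imitation, or tradition could each produce the same geometric closeness; the algorithm cannot distinguish among these explanations.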
Mathematical axioms are formally neutral and applied equally regardless of the application domain. For example, the laws of addition apply equally to space exploration, financial transactions, and linguistic analysis. However, higher-level mathematics, such as algorithmic outputs, only become meaningful within the interpretive frameworks of their application domains. This is to say that mathematical analyses do not provide self-contained interpretative value, and applying quantitative methods to the humanities does not offer objective footing to make claims.
This is germane to this study’s employment of clustering algorithms. Clustering does not posit any truth claim besides similarity. When a clustering algorithm groups various writings together, the algorithm says nothing explicit about authorial attribution. For example, Google Web Guide uses clustering algorithms to produce search results (Wu 2025). These results are usually not from the same author, and that inference is rarely considered when recommended articles are presented, demonstrating how the meaning derived from the clustering algorithm differs within the application’s context.

3.4. Note on Stylo Tool

Stylo is an open source, “flexible R package for the high-level stylistic analysis of text collections.” (Eder et al. 2016, p. 107). It is a useful tool for creating reproducible results across computing environments, and it was used extensively by White in his analysis. Stylo was not used in this analysis due to its limited functionality. For example, feature extraction is limited to word and character n-grams; the source code lists the features parameter as “w” (word) or “c” (character), not a vector of varied feature types, constraining the native analysis possible (Computational Stylistics Group 2015; Eder et al. 2016, p. 108).
Among the available features, there is an additional limitation that researchers must run separate analyses for each feature type if they wish to use Stylo in its native format. The result is a narrow and shallow perspective of an author’s stylome. An article listed on the official Stylo GitHub repository (v0.7.6) published in The R Journal acknowledges the project’s limitations and clarifies that “stylo does not aim to supplant existing, more targeted tools and packages from Natural Language Processing” (Eder et al. 2016, p. 108).
Analyzing feature types simultaneously within a unified space produces different results than analyzing them independently, as the unified analysis operates under different scaling, normalization, and distance calculation procedures that alter the relative contribution and interaction effects of each feature type in the final cluster formation. It would have been far more difficult for an ancient forger to replicate these interconnected effects than to mimic individual feature types, and it is unlikely any person can visualize high-dimensionality and take action on its inferences without the aid of technology, so the former is preferred by this study.
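A sketch of why unified analysis differs from independent analysis, using invented values: when heterogeneous features (here, a large-magnitude sentence-length measure and two small-magnitude lexical rates) share one space, per-feature standardization changes which texts appear closest:

```python
# Invented feature rows illustrating joint standardization of heterogeneous
# feature types. Without per-column z-scoring, the large-magnitude feature
# (mean sentence length) dominates the distance calculation and masks the
# smaller lexical features.
import math
import statistics

def zscore_columns(rows):
    """Z-score each column of a row-major feature matrix."""
    cols = list(zip(*rows))
    stats = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return [[(v - mu) / sd for v, (mu, sd) in zip(row, stats)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each row: [mean sentence length, hapax ratio, "kai" rate] for one text.
raw = [[22.0, 0.41, 0.035],
       [23.0, 0.40, 0.036],
       [22.5, 0.20, 0.010]]
scaled = zscore_columns(raw)

print(euclidean(raw[0], raw[2]) < euclidean(raw[0], raw[1]))        # True
print(euclidean(scaled[0], scaled[1]) < euclidean(scaled[0], scaled[2]))  # True
```

In raw units, text 0 sits closer to text 2 because sentence length swamps everything else; after joint standardization, the lexical features pull text 0 toward text 1. The choice of scaling procedure is thus itself a consequential modeling decision.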

4. Computational Methodology

Natural Language Processing Pipeline

This study employed several NLP tools to analyze the Greek texts. At initialization, the process activated four major systems: the CLTK pipeline for morphological analysis, the Greek spaCy model for syntactic parsing, the Greek BERT model for contextual embeddings, and a sentence transformer for semantic analysis. After initialization and processing, the system extracted linguistic features for each manuscript across six general categories. The code listings below are representative excerpts, not exhaustive reproductions.
1. Vocabulary Richness
This study implemented a Python method for statistical measurement of lexical diversity using the mathematical formulations of Yule’s K, Simpson’s D, Herdan’s C, and Guiraud’s R:
def calculate_vocabulary_richness(self, tokens: List[str]) -> Dict[str, float]:
    if not tokens:
        return self._empty_vocab_features()

    N = len(tokens)
    V = len(set(tokens))
    freq_dist = Counter(tokens)

    # Hapax and dis legomena counts
    V1 = sum(1 for freq in freq_dist.values() if freq == 1)
    V2 = sum(1 for freq in freq_dist.values() if freq == 2)

    M1 = N
    M2 = sum(freq ** 2 for freq in freq_dist.values())
    yules_k = 10000 * (M2 - M1) / (M1 ** 2) if M1 > 0 else 0

    simpsons_d = sum((freq * (freq - 1)) / (N * (N - 1))
                     for freq in freq_dist.values()) if N > 1 else 0

    # N > 1 avoids division by log(1) = 0 for single-token input
    herdan_c = math.log(V) / math.log(N) if N > 1 and V > 0 else 0

    guiraud_r = V / math.sqrt(N) if N > 0 else 0

    probs = [freq / N for freq in freq_dist.values()]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)

    return {
        'type_token_ratio': V / N if N > 0 else 0,
        'hapax_legomena_ratio': V1 / N if N > 0 else 0,
        'dis_legomena_ratio': V2 / N if N > 0 else 0,
        'yules_k': yules_k,
        'simpsons_d': simpsons_d,
        'herdan_c': herdan_c,
        'guiraud_r': guiraud_r,
        'vocab_size': V,
        'entropy': entropy
    }
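The formulas above can be checked by hand on a toy token list. The following standalone sketch (not the study’s class; the Greek-transliterated tokens are hypothetical) reproduces the core calculations:

```python
import math
from collections import Counter

tokens = ["logos", "theos", "logos", "kai", "kai", "kai", "agape"]
N = len(tokens)                      # 7 tokens
V = len(set(tokens))                 # 4 types
freq = Counter(tokens)               # logos:2, theos:1, kai:3, agape:1

M1 = N
M2 = sum(f ** 2 for f in freq.values())        # 4 + 1 + 9 + 1 = 15
yules_k = 10000 * (M2 - M1) / (M1 ** 2)        # 10000 * 8 / 49

simpsons_d = sum(f * (f - 1) for f in freq.values()) / (N * (N - 1))  # 8 / 42
herdan_c = math.log(V) / math.log(N)
guiraud_r = V / math.sqrt(N)
ttr = V / N                                                 # 4 / 7
hapax_ratio = sum(1 for f in freq.values() if f == 1) / N   # theos, agape -> 2 / 7
```

Working through one value: M2 − M1 = 15 − 7 = 8, so Yule’s K = 10,000 × 8/49 ≈ 1632.65, a small-sample figure that simply confirms the arithmetic.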
2. Sentence Complexity Analysis
Another Python method was employed to capture syntactic construction tendencies in each study text:
def calculate_sentence_complexity(self, tokenized_sentences: List[List[str]]) -> Dict[str, float]:
    if not tokenized_sentences:
        return self._empty_complexity_features()

    lengths = [len(sent) for sent in tokenized_sentences if sent]

    if not lengths:
        return self._empty_complexity_features()

    mean_length = np.mean(lengths)

    # Coefficient of variation (sigma / mu)
    complexity_score = np.std(lengths) / mean_length if mean_length > 0 else 0

    short_ratio = sum(1 for l in lengths if l < 10) / len(lengths)
    long_ratio = sum(1 for l in lengths if l > 20) / len(lengths)

    return {
        'mean_length': float(mean_length),
        'std_length': float(np.std(lengths)),
        'median_length': float(np.median(lengths)),
        'length_variance': float(np.var(lengths)),
        'num_sentences': len(lengths),
        'complexity_score': float(complexity_score),
        'short_sentence_ratio': short_ratio,
        'long_sentence_ratio': long_ratio
    }
3. Greek Function Word Analysis
The predominant view in current stylometric analysis is that the most frequent words are more useful for authorial attribution than the least frequent words (White 2025, p. 74). This has been empirically demonstrated through studies in which specific word usage rates indicate a high probability of authorship distinction (Tanur et al. 1989, p. 121). As a result, this study employed a method to analyze function words in an attempt to catalog unconscious stylistic habits:
def calculate_function_word_features(self, tokens: List[str]) -> Dict[str, float]:
    if not tokens:
        return {f'{category}_ratio': 0 for category in self.function_words.keys()}

    total_tokens = len(tokens)
    token_counts = Counter(tokens)
    features = {}

    for category, words in self.function_words.items():
        count = sum(token_counts.get(word, 0) for word in words)
        features[f'{category}_ratio'] = count / total_tokens

    total_function_words = sum(features.values())
    features['total_function_word_ratio'] = total_function_words

    return features

self.function_words = {
    'articles': ['ὁ', 'ἡ', 'τό', 'οἱ', 'αἱ', 'τά', 'τοῦ', 'τῆς', 'τῶν',
                 'τῷ', 'τῇ', 'τοῖς', 'ταῖς', 'τόν', 'τήν'],
    'particles': ['δέ', 'γάρ', 'οὖν', 'μέν', 'δή', 'τε', 'γε', 'μήν', 'τοι', 'ἄρα'],
    'conjunctions': ['καί', 'ἀλλά', 'ἤ', 'οὐδέ', 'μηδέ', 'εἰ', 'ἐάν', 'ὅτι', 'ἵνα', 'ὡς'],
    'prepositions': ['ἐν', 'εἰς', 'ἐκ', 'ἀπό', 'πρός', 'διά', 'ἐπί', 'κατά',
                     'μετά', 'περί', 'ὑπό', 'παρά'],
    'pronouns': ['αὐτός', 'οὗτος', 'ἐκεῖνος', 'τις', 'τί', 'ὅς', 'ἥ', 'ὅ', 'ἐγώ', 'σύ']
}
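The ratio calculation can be demonstrated on a short clause. The sketch below is standalone (no class wrapper) and uses only a small subset of the study’s function-word lexicon; the token list approximates the opening of John 1:1 and assumes exact string matching, so accentuation must agree between tokens and lexicon:

```python
from collections import Counter

# A deliberately small subset of the study's function-word categories.
function_words = {
    "articles": ["ὁ", "ἡ", "τό"],
    "conjunctions": ["καί", "ἀλλά"],
}

# Tokens approximating John 1:1a (illustrative only).
tokens = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος", "καί", "ὁ", "λόγος", "ἦν"]

counts = Counter(tokens)
features = {}
for category, words in function_words.items():
    hits = sum(counts.get(w, 0) for w in words)
    features[f"{category}_ratio"] = hits / len(tokens)
# "ὁ" occurs twice in nine tokens -> articles_ratio = 2/9;
# "καί" occurs once -> conjunctions_ratio = 1/9.
```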
4. Morphological Complexity
To analyze Greek morphological patterns, this study used CLTK’s morphological analyzer to provide direct morphological features, POS-based syntactic features, lemma-based vocabulary features, and dependency features:
def calculate_morphological_features(self, nlp_features: Dict) -> Dict[str, float]:
    features = {
        'morphological_diversity': 0,
        'tense_variation': 0,
        'lemma_token_ratio': 0
    }

    if 'morphological_features' in nlp_features and nlp_features['morphological_features']:
        morph_features = nlp_features['morphological_features']
        unique_patterns = len(set(str(f) for f in morph_features if f))
        total_words = len(morph_features)
        features['morphological_diversity'] = unique_patterns / total_words if total_words > 0 else 0

    if 'lemmas' in nlp_features and nlp_features['lemmas']:
        lemmas = nlp_features['lemmas']
        words = len(lemmas)
        unique_lemmas = len(set(lemmas))
        features['lemma_token_ratio'] = unique_lemmas / words if words > 0 else 0

    return features
Note that the zero values above are initialization defaults and do not represent analytical results. During program execution, these values are dynamically replaced with real calculations that are used by the clustering algorithm. For example, the actual study yielded the following results in the 33-manuscript analysis: morphological diversity: 0.156–0.342; tense variation: 0.089–0.456; and lemma–token ratios: 0.234–0.678.
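To make the arithmetic concrete, a toy stand-in for the CLTK-derived feature dictionary (the tags and lemmas below are hypothetical values, not actual CLTK output) can be run through the same calculations:

```python
# Hypothetical stand-in for the analyzer's output dictionary.
nlp_features = {
    "morphological_features": ["N-sg-nom", "V-3sg-pres", "N-sg-nom", "Conj"],
    "lemmas": ["λόγος", "εἰμί", "λόγος", "καί"],
}

# Diversity = unique morphological patterns over total words:
morphs = nlp_features["morphological_features"]
morphological_diversity = len(set(morphs)) / len(morphs)   # 3 patterns / 4 words

# Lemma-token ratio = unique lemmas over total lemmas:
lemmas = nlp_features["lemmas"]
lemma_token_ratio = len(set(lemmas)) / len(lemmas)         # 3 lemmas / 4 tokens
```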
5. N-gram Feature Analysis
N-grams are also considered among the best tools for authorship attribution, so a variety of n-grams were employed in this study (White 2025, p. 74). The character n-grams employed were 3–5-character sequences capturing prefixes, suffixes, root morphemes, extended morphological units (e.g., “θεον”, “λογο”, “μενο”), and complete morphological patterns (e.g., “θεους”, “λογον”, “μενος”), resulting in a maximum feature count of 339. Word bigrams and trigrams were also used for further variety, resulting in a maximum feature count of 830.
def extract_ngrams(self, tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def calculate_ngram_frequency(self, tokens: List[str], n: int = 2) -> Dict[Tuple[str, ...], float]:
    ngrams = self.extract_ngrams(tokens, n)
    if not ngrams:
        return {}

    ngram_counts = Counter(ngrams)
    total_ngrams = len(ngrams)

    return {ngram: count / total_ngrams
            for ngram, count in ngram_counts.most_common(50)}

def get_tfidf_features(self, text: str) -> Dict[str, float]:
    if not self.is_fitted:
        raise ValueError("TF-IDF vectorizers must be fitted first")

    features = {}

    char_tfidf = self.tfidf.transform([text])
    char_feature_names = self.tfidf.get_feature_names_out()
    for i, score in enumerate(char_tfidf.toarray()[0]):
        if score > 0:
            features[f'char_tfidf_{char_feature_names[i]}'] = score

    word_tfidf = self.word_tfidf.transform([text])
    word_feature_names = self.word_tfidf.get_feature_names_out()
    for i, score in enumerate(word_tfidf.toarray()[0]):
        if score > 0:
            features[f'word_tfidf_{word_feature_names[i]}'] = score

    return features
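The sliding-window extraction yields L − n + 1 n-grams for a text of L tokens. A standalone restatement of extract_ngrams (without the class wrapper; the sample tokens are illustrative) shows this:

```python
def extract_ngrams(tokens, n):
    # Slide a window of width n across the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]     # 5 tokens
bigrams = extract_ngrams(tokens, 2)              # 5 - 2 + 1 = 4 bigrams
trigrams = extract_ngrams(tokens, 3)             # 5 - 3 + 1 = 3 trigrams
```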
6. TF-IDF Analysis
As evidenced in other sections, Term Frequency–Inverse Document Frequency (TF-IDF) weighting was implemented to provide a weighted context for identifying the importance of a feature in a document within the study collection. The maximum number of features was set to 1000 with a 0.1 threshold.
def __init__(self):
    self.tfidf = TfidfVectorizer(
        analyzer='char',
        ngram_range=(3, 5),
        max_features=1000
    )

    self.word_tfidf = TfidfVectorizer(
        analyzer='word',
        max_features=500,
        min_df=2,
        max_df=0.8
    )

def fit(self, texts: List[str]):
    self.tfidf.fit(texts)
    self.word_tfidf.fit(texts)
    self.is_fitted = True
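The intuition behind TF-IDF can be shown with the classic unsmoothed formula, tf × ln(N/df). Note that scikit-learn’s TfidfVectorizer, used above, applies a smoothed idf (ln((1 + N)/(1 + df)) + 1) and L2 normalization, so its exact values differ; the principle of down-weighting terms common to all documents is the same. The toy documents below are hypothetical:

```python
import math

# Three hypothetical token lists standing in for documents.
docs = [
    ["θεός", "λόγος", "ἀγάπη"],
    ["θεός", "νόμος"],
    ["θεός", "λόγος"],
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)                # term frequency in the document
    df = sum(1 for d in docs if term in d)         # document frequency in the corpus
    return tf * math.log(N / df)                   # classic (unsmoothed) idf

# "θεός" appears in every document, so its idf — and its weight — is zero,
# while the rarer "ἀγάπη" is weighted highly in the one document containing it.
w_theos = tfidf("θεός", docs[0])
w_agape = tfidf("ἀγάπη", docs[0])
```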

5. Machine Learning and Clustering Methodology

5.1. Manuscript Comparison

This study employed a Manuscript Comparison class to handle text comparison across the preprocessing pipeline. The core system was the Similarity Calculator class, which implements feature standardization and dimensionality reduction. A global vocabulary was agglomerated to ensure consistent feature representation across all manuscripts, collecting the top TF-IDF features from each document to create a unified feature space. The pipeline then applied variance filtering to remove uninformative features with little cross-manuscript variation, retaining only features whose variance Var(X[:, j]) = (1/n) Σᵢ (X[i, j] − μⱼ)² exceeded the 0.01 threshold; scaled the data using median-centering and interquartile-range normalization, X_scaled[i, j] = (X[i, j] − median(X[:, j])) / IQR(X[:, j]), where IQR = Q3 − Q1, increasing outlier resistance; and then applied Principal Component Analysis (PCA), Y = X_centered × V, where V contains the eigenvectors of the covariance matrix, reducing dimensionality to the extent that it preserved >95% of the variance in the data. This approach transformed the original 1208-dimensional feature space to 10 dimensions, preserving 95.6% of the variance. The final feature vectors captured the most linguistically significant patterns while remaining computationally tractable for clustering and similarity analysis.
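The variance-filter and robust-scaling steps can be sketched in pure Python as stand-ins for scikit-learn’s VarianceThreshold and RobustScaler (the feature column is hypothetical, and statistics.quantiles’ default interpolation differs slightly from scikit-learn’s percentile method, so this is illustrative rather than numerically identical):

```python
import statistics

# One hypothetical feature column across five manuscripts, with an outlier.
col = [0.10, 0.12, 0.11, 0.13, 0.95]

# Variance filter: keep the column only if its population variance > 0.01.
keep = statistics.pvariance(col) > 0.01

# Robust scaling: center on the median and divide by the IQR (Q3 - Q1),
# so a single outlier shifts the center far less than z-scoring would.
q1, q2, q3 = statistics.quantiles(col, n=4)
scaled = [(x - q2) / (q3 - q1) for x in col]
# The median value maps to exactly 0 after scaling.
```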
The Similarity Calculator class applies multi-stage preprocessing as follows:
class SimilarityCalculator:
    def __init__(self, feature_selection_k: int = 100, pca_components: float = 0.95):
        self.scaler = RobustScaler()  # Median + IQR scaling
        self.variance_filter = VarianceThreshold(threshold=0.01)
        self.pca = PCA(n_components=pca_components, svd_solver='auto')

        self.global_tfidf_vocab = set()
        self.global_ngram_vocab = set()

    def build_global_vocabulary(self, all_features: List[Dict[str, Any]]):
        print("global vocabulary for consistent features")

        for features in all_features:
            if 'tfidf' in features:
                tfidf_features = features['tfidf']
                sorted_tfidf = sorted(tfidf_features.items(),
                                      key=lambda x: x[1], reverse=True)[:50]
                for key, _ in sorted_tfidf:
                    self.global_tfidf_vocab.add(key)

        self.global_tfidf_vocab = sorted(list(self.global_tfidf_vocab))
        print(f"Global TF-IDF vocabulary size: {len(self.global_tfidf_vocab)}")

    def fit_transform_features(self, feature_matrices: List[np.ndarray],
                               manuscript_names: List[str]) -> np.ndarray:
        X = np.vstack(feature_matrices)
        X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
        print(f"Original feature matrix shape: {X.shape}")

        X_filtered = self.variance_filter.fit_transform(X)
        print(f"After variance filtering: {X_filtered.shape}")

        X_scaled = self.scaler.fit_transform(X_filtered)

        X_pca = self.pca.fit_transform(X_scaled)
        print(f"After PCA: {X_pca.shape}")
        print(f"Explained variance ratio: {self.pca.explained_variance_ratio_.sum():.3f}")

        return X_pca

5.2. Clustering Algorithm Testing

The clustering algorithm testing phase implemented a method to compare multiple clustering approaches across different cluster numbers to identify optimal groupings. The system employed deterministic seeding to ensure reproducible results and tested both K-means and hierarchical clustering algorithms from 2 to 33 clusters. The random_state was set to 42 in deference to the default convention (Wang 2025, p. 2).
K-means clustering operates by partitioning manuscripts into k groups through iterative optimization, minimizing the within-cluster sum of squared distances by repeatedly assigning each manuscript to its nearest centroid and updating centroids as cluster means until convergence. Hierarchical clustering with Ward linkage builds a tree-like structure of nested clusters by successively merging the two clusters that result in the smallest increase in within-cluster variance, using the Ward distance formula that accounts for both cluster sizes and centroid separation. Different datasets can have different optimal clustering algorithms, so it is important to formulate an objective, repeatable way to select them. Consequently, this study implements a method to automatically select the algorithm based on the silhouette score:
def find_optimal_clustering(self) -> Dict:
    results_df = pd.DataFrame(all_results)
    best_idx = results_df['silhouette'].idxmax()
    best_result = results_df.iloc[best_idx]
The hierarchical clustering algorithm produced the highest silhouette score, 0.3052; however, when the Julian letters were removed, K-means was selected (silhouette score: 0.337).
def perform_clustering(self, n_clusters_range: Tuple[int, int] = (2, 33)) -> Dict:
    X = self.similarity_calculator.fit_transform_features(
        self.feature_matrices, self.manuscript_names
    )

    min_clusters, max_clusters = n_clusters_range
    clustering_results = {}
    np.random.seed(42)

    for n_clusters in range(min_clusters, max_clusters + 1):
        cluster_results = {}

        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        kmeans_labels = kmeans.fit_predict(X)
        cluster_results['kmeans'] = {
            'labels': kmeans_labels,
            'silhouette': silhouette_score(X, kmeans_labels),
        }

        hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
        hierarchical_labels = hierarchical.fit_predict(X)
        cluster_results['hierarchical'] = {
            'labels': hierarchical_labels,
            'silhouette': silhouette_score(X, hierarchical_labels),
        }

        clustering_results[n_clusters] = cluster_results

    return clustering_results
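The silhouette criterion that drives the algorithm selection can be illustrated by hand on one-dimensional points. The following is a standalone re-implementation for intuition only (the study used scikit-learn’s silhouette_score); each point’s score is s = (b − a)/max(a, b), where a is the mean distance to its own cluster and b the smallest mean distance to another cluster:

```python
def silhouette(points, labels):
    """Mean silhouette over all points (assumes every cluster has >= 2 points)."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of p's own cluster.
        same = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
                if m == l and j != i]
        a = sum(same) / len(same)
        # b: smallest mean distance from p to any other cluster.
        b = min(
            sum(abs(p - q) for q, m in zip(points, labels) if m == other)
            / labels.count(other)
            for other in set(labels) if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups score near 1; interleaved labels score below 0.
points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])
bad = silhouette(points, [0, 1, 0, 1])
```

Selecting the algorithm and cluster count that maximize this score is exactly the repeatable, objective criterion described above.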

5.3. Mathematical Foundations of Hierarchical Clustering

Hierarchical clustering with Ward linkage operates by systematically merging clusters in a way that minimizes the increase in within-cluster variance at each step, making it well-suited for text analysis. The Ward distance formula, d(A, B) = √(2nAnB/(nA + nB)) × ‖μA − μB‖, calculates the optimal merge criterion by considering both the sizes of the clusters being merged (nA and nB, the number of points in clusters A and B, respectively) and the distance between their centroids (μA and μB). This approach ensures that the tightest possible groupings are preserved while accounting for cluster size imbalances. Additional benefits include that it requires no random initialization and offers greater built-in robustness to outliers than K-means, which can be compromised by small numbers of outliers (Forero et al. 2012).
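A short worked example (a standalone sketch with hypothetical one-dimensional clusters, not the study’s code) confirms the link between the Ward merge criterion and within-cluster variance: the squared Ward distance equals twice the increase in the within-cluster sum of squared errors (SSE) caused by the merge.

```python
import math

def sse(points):
    """Within-cluster sum of squared errors about the centroid."""
    mu = sum(points) / len(points)
    return sum((x - mu) ** 2 for x in points)

A = [0.0, 2.0]          # centroid 1.0, n = 2
B = [5.0, 7.0, 9.0]     # centroid 7.0, n = 3
nA, nB = len(A), len(B)
muA = sum(A) / nA
muB = sum(B) / nB

# Closed-form Ward distance: sqrt(2 * nA * nB / (nA + nB)) * |muA - muB|
d_ward = math.sqrt(2 * nA * nB / (nA + nB)) * abs(muA - muB)

# SSE increase from merging: SSE(A u B) - SSE(A) - SSE(B) = 43.2 here,
# and d_ward ** 2 = 86.4 = 2 * 43.2.
delta_sse = sse(A + B) - sse(A) - sse(B)
```

Because the merge that minimizes d_ward also minimizes the SSE increase, Ward linkage greedily preserves the tightest groupings at every step.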

6. Results and Analysis

This study returned hierarchical clustering with three clusters as the optimal solution, achieving a silhouette score of 0.305. The clustering results reveal findings that are difficult to reconcile with authorial attribution. Cluster 0 comprises 26 manuscripts that contain all but two of Paul’s supposed letters, all non-Pauline epistles, Revelation, and all six of Julian’s letters. Cluster 1 comprises the gospels and Acts, consistent with expectations of genre influence (Libby 2016, p. 128). Cluster 2 comprises Ephesians and Colossians, interestingly representing only half of the prison literature (Figure 5).
The most significant finding is the clustering of Julian’s letters with Pauline literature. It is not altogether absurd that this clustering arose. Greco-Roman letter conventions largely remained consistent throughout antiquity (Riehle 2020, p. 58; Richards 2004, p. 85), and while Paul possesses distinctive elements from this tradition, his letters retain many Greco-Roman characteristics, such as the typical letter structure, use of stereotyped formulae, and Greco-Roman epistolary rhetoric (Fouard 2023, p. 136; Richards 2004, pp. 77, 130–37, 173; Riehle 2020, pp. 58–63; Stowers 1986, pp. 19–24; Weima 2016, pp. 4–9). However, these results should dispel the notion that stylometric clustering inherently captures common authorship—a result which presents major difficulties to the applicability of stylometrics to Pauline text disputes.
Incoherent findings are characteristic of computational stylometric analyses on the Pauline corpus. White found 2 Timothy and 2 Peter to have strong connections (White 2025, p. 193), and under certain conditions, 1 John was “as strongly related to the first sample from 1 Corinthians as the strongest intra-text samples are related to each other.” (White 2025, p. 191). Savoy’s “twenty-one epistles regrouped using all tokens” and “twenty-one epistles regrouped using the 200 MFT” grouped Philippians with 1 Peter (Savoy 2019, pp. 6–7). Libby groups Luke closer to Matthew and Mark than Acts (Libby 2016, pp. 157–59).
These findings lack interpretative value: they are mathematically valid, but no scholar surmises that 2 Timothy and 2 Peter, Philippians and 1 Peter, Luke and Matthew, or the Julian and Pauline corpuses share an author. It is not sensible to ignore these findings while citing more agreeable findings from the same models. Rather, the lack of coherent results in well-formed studies demonstrates that using computational stylometrics to determine authentic Pauline letters, without an authoritative, secularly verifiable corpus of genuine texts, is flawed.
Nonetheless, this study’s results share similarities with previous research: 1 Corinthians and Romans (Libby 2016, p. 166; Savoy 2019, p. 1; Van der Ventel and Newman 2022, p. 260), 1 Timothy and Titus (Libby 2016, p. 162; Savoy 2019, p. 1; Van der Ventel and Newman 2022, p. 251; White 2025, p. 198), 1–2 Thessalonians (Libby 2016, p. 166; Morton 1978, p. 182; Savoy 2019, p. 1; White 2025, p. 192), and Ephesians and Colossians (Libby 2016, p. 166; Savoy 2019, p. 1; Van der Ventel and Newman 2022, p. 251; White 2025, p. 198) are all related. The cosine similarity, which calculates the angle between vectors and is representative of feature similarity, of the documents within Cluster 0 is represented graphically below (Figure 6).
Of all the relations, Romans and 1 Corinthians achieve the highest similarity score of 0.986. While exceedingly high, this finding is broadly commensurate with other cosine similarity research, as Roy and Robertson find a similarity score between these letters of 0.941 (Roy and Robertson 2022, p. 99). However, this study departs from Roy and Robertson in a series of methodological choices, such as implementing variance filtering, robust scaling, PCA, and applying cluster mapping, and many results differ, such as this study’s significantly higher similarity score for 1–2 Thessalonians (Roy and Robertson 2022, pp. 113–14).
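Cosine similarity measures the angle between feature vectors, cos θ = (u · v)/(‖u‖ ‖v‖), so it reflects the proportional mix of features rather than their magnitude. A minimal sketch (hypothetical vectors, not the study’s 10-dimensional PCA output) makes the property concrete:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score 1 regardless of magnitude
# (a longer text with the same feature proportions is "identical"),
# while orthogonal vectors score 0.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scores like the 0.986 reported for Romans and 1 Corinthians therefore indicate nearly parallel feature profiles, not a claim about authorship as such.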
The study results were broadly similar when Julian’s letters were removed from the analysis. In this analysis, four clusters emerged with the K-means clustering algorithm selected. Dimensionality was reduced to eight after PCA, preserving 96.2% of the original variance. Cluster 0 comprises the gospels and Acts, Cluster 1 comprises the non-Pauline letters except for 1 John, Cluster 2 comprises Ephesians and Colossians, and Cluster 3 again comprises all but two of Paul’s supposed letters and 1 John (Figure 7).
To demonstrate the variability of quantitative study results from design choices, another test was run with a custom noun_chunks function using the same clustering range (2–33). In this trial, two clusters emerged between the letters and the gospels with a K-means clustering selection and silhouette score of 0.443 (Figure 8).
Finally, when the non-Pauline New Testament corpus was removed to isolate the Julian and Pauline texts, 11 clusters emerged with a K-means clustering selection and silhouette score of 0.321. Only one cluster, Cluster 3, is a mix between a Pauline and Julian text. However, half of Julian’s clusters are closest to a Pauline cluster: Julian Clusters 2 and 10 are closest to Pauline Cluster 5, which is composed of the Pastoral Epistles (Figure 9).
Across the 33-manuscript analyses, the Pauline and Julian letters generally clustered together, indicating that intra-epistolary variation is smaller than inter-genre variation. Then, when the epistolary texts were isolated, instead of a potential two-author cluster emerging, the algorithm produced 11 clusters with one mixed-author corpus and poor author distance separation. This suggests that if the two-author theory is correct, and distinct low-level stylomes exist at scale, the algorithm has failed to isolate and identify a meaningful authorial signal within the epistolary genre, and document-specific factors had stronger stylistic signals than authorship. However, it is possible that this clustering analysis is correct, and the corpuses are separable into 11 authorial groupings. Given the epistemological uncertainty inherent in this study, and the lack of a reference corpus, there is ultimately no way to know.
These cluster analyses elucidate the methodological issue of defining style and groupings, and selecting features and hyperparameters, when designing quantitative studies involving authorially uncertain corpuses. When the goal is to differentiate the probability of authorship between two individuals, such as in the case of Mosteller and Wallace’s study, it is simple to measure high-level feature frequencies and compare the authors to the productions in question. However, when no reference corpus exists, it is unclear how to define a stylome, and thus there is great uncertainty about which features to include and exclude.
Even in corpuses with known authors, selecting features for unsupervised analyses is challenging. Using the Stylo tool, at k = 3, an unsupervised analysis of the 100 MFW unigrams in The Federalist Papers incorrectly clusters papers 2, 37, and 38; when the analysis is extended to 100 MFW bigrams, the model incorrectly clusters papers 1, 35, 40, 44, 68, 71, 72, 78, 79, 81, 83, and 84 (Rosa 2025). This demonstrates how Mosteller and Wallace’s employment of discriminators was rooted in a concentrated study of the known corpuses, and not in a universal rule.
Model design is also a core issue in machine learning. Recent research indicates that applying entropy minimization, biasing the model towards its more confident outputs, to large language models (LLMs) produces more accurate results (Agarwal et al. 2025, p. 1). This means that if such a model were asked for the capital of France, even if the frequency of “Paris” in its training data were well below 100%, the model’s probability of outputting “Paris” would be exceedingly high. What a model is confident in does not necessarily equate to truthfulness, and confident assertions today often become uncertainties tomorrow. Philosophical uncertainties need not hinder the progress of LLMs, but they should give pause to researchers applying utility-based metrics gleaned from unrelated corpuses.
The theoretical objections provided in this paper, buttressed by a demonstrative analysis, establish grounds for questioning stylometrics as a contributive discipline for Pauline authorial-attribution studies. The clustering of Julian’s letters with the Pauline corpus in the 33-manuscript analysis and the proximity of Julian clusters to Pauline clusters in the isolated analysis, coupled with the incoherent results of previous research, provides evidence that Pauline stylometric methods produce dubious results due to the inability to decipher an appropriate model. Ultimately, technological and methodological sophistication cannot overcome fundamental historical and epistemological problems, and computational stylometrics must be grounded in verified historical evidence to produce reliable conclusions.

7. Conclusions

With the advent of advanced computing, the use of computational stylometrics to distinguish genuine from inauthentic Pauline letters has flourished. However, no letters are secularly verifiable as authentically Pauline, and thus there is no authoritative reference point from which to employ computational stylometric analyses effectively. Without such a reference point, we cannot identify the Pauline stylome, including its natural stylistic and theological elasticity and development, and thus cannot integrate it into an accurate analysis. Appeals to non-Pauline stylistic research are unsound; Paul’s individual style cannot be known with certainty without a genuine Pauline corpus, and research that seeks to connect these disparate concepts is unscientific. These theoretical objections were reiterated and demonstrated through a review of the contradictory literature on the surmised genuine Pauline corpus and through a new NLP study, whose characteristically incoherent results stem from the inability to mathematically model this authorship problem. More broadly, this study illustrates the limits of importing quantitative methods into disciplines with unresolvable historical dependencies.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The research repository is publicly available at archive.anthonyrosa.xyz (accessed on 15 September 2025).

Acknowledgments

Code review conducted by Ali Kishk, University of Luxembourg, ali.kishk@uni.lu.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Adams, James Noel. 1977. The Vulgar Latin of the Letters of Claudius Terentianus (P.Mich. VIII, 467–72). Manchester: Manchester University Press. [Google Scholar]
  2. Agarwal, Shivam, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. 2025. The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. Available online: https://arxiv.org/pdf/2505.15134 (accessed on 7 August 2025).
  3. Bevendorff, Janek, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Heuristic Authorship Obfuscation. Kerrville: Association for Computational Linguistics. Available online: https://aclanthology.org/P19-1104.pdf (accessed on 6 April 2025).
  4. Bevendorff, Janek, Tobias Wenzel, Martin Potthast, Matthias Hagen, and Benno Stein. 2020. On Divergence-Based Author Obfuscation: An Attack on the State of the Art in Statistical Authorship Verification. IT—Information Technology 62: 99–115. [Google Scholar] [CrossRef]
  5. Brennan, Michael, and Rachel Greenstadt. 2009. Practical Attacks against Authorship Recognition Techniques. Paper presented at Name of the Twenty-First Innovative Applications of Artificial Intelligence Conference, Pasadena, CA, USA, July 14–16; pp. 60–65. Available online: https://cdn.aaai.org/ocs/i/iaai0020/257-3903-1-PB.pdf (accessed on 19 April 2025).
  6. Can, Fazli, and Jon M. Patton. 2004. Change of Writing Style with Time. Computers and the Humanities 38: 61–82. [Google Scholar] [CrossRef]
  7. Computational Stylistics Group. 2015. GitHub. Available online: https://github.com/computationalstylistics/stylo/blob/98998cd6b10c0fc097bbc72ed9885a58cc1701d8/R/txt.to.features.R (accessed on 4 August 2025).
  8. Conley, Brandon. 2017. Minore(m) Pretium: Morphosyntactic Considerations for the Omission of Word-Final-m in Non-Elite Latin Texts; Kent: Kent State University. Available online: https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=kent149253496962922 (accessed on 17 May 2025).
  9. Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. Stylometry with R: A Package for Computational Text Analysis. The R Journal 8: 107–21. Available online: https://journal.r-project.org/archive/2016/RJ-2016-007/RJ-2016-007.pdf (accessed on 4 August 2025). [CrossRef]
  10. Eusebius. n.d. Church History, Book VI. New Advent, Chapter 25. Available online: https://www.newadvent.org/fathers/250106.htm (accessed on 1 June 2025).
  11. Evert, Stefan, Fotis Jannidis, Thomas Proisl, Isabella Reger, Steffen Pielström, Christof Schöch, and Thorsten Vitt. 2017. Understanding and Explaining Delta Measures for Authorship Attribution. Digital Scholarship in the Humanities 32: ii4–ii16. [Google Scholar] [CrossRef]
  12. Fisher, Jillian, Skyler Hallinan, Ximing Lu, Mitchell Gordon, Zaid Harchaoui, and Yejin Choi. 2024. StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements. Paper presented at the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, November 12–16; Available online: https://aclanthology.org/2024.emnlp-main.241.pdf (accessed on 17 April 2025).
  13. Forero, Pedro, Vassilis Kekatos, and Georgios Giannakis. 2012. Robust Clustering Using Outlier-Sparsity Regularization. IEEE Transactions on Signal Processing 60: 4163–77. [Google Scholar] [CrossRef]
  14. Fouard, Abbé Constant. 2023. St. Peter and the First Years of Christianity. Bedford: Sophia Institute Press. [Google Scholar]
  15. Goodacre, Mark. 2025. The Fourth Synoptic Gospel: John’s Knowledge of Matthew, Mark, and Luke. Grand Rapids: Wm. B. Eerdmans Publishing. [Google Scholar]
  16. Hagen, Matthias, Martin Potthast, and Benno Stein. 2016. Author Obfuscation: Attacking the State of the Art in Authorship Verification. Available online: https://downloads.webis.de/publications/papers/potthast_2016f.pdf (accessed on 18 April 2025).
  17. Halla-aho, Hilla. 2018. Scribes in Private Letter Writing: Linguistic Perspectives. In Scribal Repertoires in Egypt from the New Kingdom to the Early Islamic Period. Edited by Jennifer Cromwell and Eitan Grossmann. Oxford Studies in Ancient Documents. Oxford: Oxford University Press, pp. 239–77. [Google Scholar] [CrossRef]
  18. Johnson, Luke Timothy. 2001. The First and Second Letters to Timothy, 1st ed. New York: Doubleday. [Google Scholar]
  19. Juola, Patrick. 2015. The Rowling Case: A Proposed Standard Analytic Protocol for Authorship Questions. Digital Scholarship in the Humanities 30: fqv040. Available online: https://newtfire.org/courses/tutorials/juola-rowlingcase.pdf (accessed on 11 September 2025). [CrossRef]
  20. Juola, Patrick, and Darren Vescovi. 2011. Analyzing Stylometric Approaches to Author Obfuscation. In IFIP Advances in Information and Communication Technology. Berlin/Heidelberg: Springer, pp. 115–25. [Google Scholar] [CrossRef]
  21. Kneusel, Ronald T. 2024. How AI Works. San Francisco: No Starch Press. [Google Scholar]
  22. Libby, James A. 2016. The Pauline Canon Sung in a Linguistic Key: Visualizing New Testament Text Proximity by Linguistic Structure, System, and Strata. BAGL 5: 122–21. Available online: https://www.researchgate.net/publication/311923095_The_Pauline_Canon_Sung_in_a_Linguistic_Key_Visualizing_New_Testament_Text_Proximity_by_Linguistic_Structure_System_and_Strata (accessed on 2 April 2025).
  23. Mealand, David L. 1996. The Extent of the Pauline Corpus: A Multivariate Approach. Journal for the Study of the New Testament 18: 61–92. [Google Scholar] [CrossRef]
  24. Mikros, George K. 2013. Authorship Attribution and Gender Identification in Greek Blogs. London: Academic Mind. Available online: https://www.researchgate.net/publication/236583622_Authorship_Attribution_and_Gender_Identification_in_Greek_Blogs (accessed on 20 May 2025).
  25. Morton, Andrew Queen. 1965. The Authorship of the Pauline Epistles: A Scientific Solution. University Lectures 3. Saskatoon: University of Saskatchewan Press. [Google Scholar]
  26. Morton, Andrew Queen. 1978. Literary Detection: How to Prove Authorship and Fraud in Literature and Documents. New York: Scribner’s Sons. [Google Scholar]
  27. Mosteller, Frederick, and David L. Wallace. 1963. Inference in an Authorship Problem. Journal of the American Statistical Association 58: 275. [Google Scholar] [CrossRef] [PubMed]
  28. Nicklas, Tobias, Jozef Verheyden, and Jens Schröter. 2021. Texts in Context: Essays on Dating and Contextualising Christian Writings from the Second and Early Third Centuries. Leuven, Paris and Bristol: Peeters. [Google Scholar]
  29. Pracht, Erich Benjamin, and Thomas McCauley. 2025. Paul’s Style and the Problem of the Pastoral Letters: Assessing Statistical Models of Description and Inference. Open Theology 11: 20240034.
  30. Richards, E. Randolph. 2004. Paul and First-Century Letter Writing: Secretaries, Composition, and Collection. Downers Grove: Intervarsity Press.
  31. Riehle, Alexander, ed. 2020. A Companion to Byzantine Epistolography. Boston: Brill, Vol. 7.
  32. Ríos-Toledo, Germán, Juan Pablo Francisco Posadas-Durán, Grigori Sidorov, and Noé Alejandro Castro-Sánchez. 2022. Detection of Changes in Literary Writing Style Using N-Grams as Style Markers and Supervised Machine Learning. PLoS ONE 17: e0267590.
  33. Rosa, Anthony. 2025. Federalist Papers Stylo 100 MFW Analysis: 08112025 Archive.zip. Available online: https://archive.anthonyrosa.xyz/ (accessed on 11 August 2025).
  34. Roy, Ashley, and Paul Robertson. 2022. Applying Cosine Similarity to Paul’s Letters: Mathematically Modeling Formal and Stylistic Similarities. In New Approaches to Textual and Image Analysis in Early Jewish and Christian Studies. Boston: Brill, pp. 88–117.
  35. Savoy, Jacques. 2019. Authorship of Pauline Epistles Revisited. Journal of the Association for Information Science and Technology 70: 1089–97.
  36. Silberzahn, Raphael, Eric L. Uhlmann, Daniel P. Martin, Pasquale Anselmi, Frederik Aust, Eli Awtrey, Štěpán Bahník, Feng Bai, Colin Bannard, Evelina Bonnier, and et al. 2018. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science 1: 337–56.
  37. Stowers, Stanley Kent. 1986. Letter Writing in Greco-Roman Antiquity. Philadelphia: The Westminster Press.
  38. Strawbridge, Jennifer. n.d. Paul and Patristics Database. Available online: https://paulandpatristics.web.ox.ac.uk/database#collapse397896 (accessed on 25 May 2025).
  39. Tanur, Judith, Frederick Mosteller, William Kruskal, Erich Lehmann, Richard Fink, Richard Pieters, and Gerald Rising, eds. 1989. Statistics: A Guide to the Unknown, 3rd ed. Pacific Grove: Wadsworth & Brooks/Cole Advanced Books & Software.
  40. Van der Ventel, Brandon, and Richard Newman. 2022. Application of the Term Frequency-Inverse Document Frequency Weighting Scheme to the Pauline Corpus. Andrews University Seminary Studies (AUSS) 59: 251–72. Available online: https://digitalcommons.andrews.edu/auss/vol59/iss2/4/ (accessed on 2 April 2025).
  41. Van Nes, Jermo. 2017. Pauline Language and the Pastoral Epistles: A Study of Linguistic Variation in the Corpus Paulinum. Available online: https://theoluniv.ub.rug.nl/92/1/2017%20Nes%20Dissertation.pdf (accessed on 28 March 2025).
  42. Van Nes, Jermo. 2018. Missing ‘Particles’ in Disputed Pauline Letters? A Question of Method. Journal for the Study of the New Testament 40: 383–98. Available online: http://hdl.handle.net/2263/69051 (accessed on 28 March 2025).
  43. Wang, Wanning. 2025. Machine Learning Based Engagement Prediction for Online Courses. ITM Web of Conferences 70: 04014.
  44. Weima, Jeffrey. 2016. Paul the Ancient Letter Writer. Grand Rapids: Baker Academic.
  45. White, Benjamin L. 2025. Counting Paul. Oxford: Oxford University Press.
  46. Wu, Austin. 2025. Web Guide: An Experimental AI-Organized Search Results Page. The Keyword. Available online: https://blog.google/products/search/web-guide-labs/ (accessed on 3 August 2025).
Figure 1. Top 20 Words: Hamilton vs. Madison.
Figure 2. Federalist Papers Clustering Analysis (Unigrams, 100 MFW).
Figure 3. Federalist Papers Clustering Analysis (Bigrams, 100 MFW).
Figure 4. Federalist Papers Clustering Analysis (k = 5).
Figure 5. 33 Manuscript Analysis.
Figure 6. Cluster 0 Cosine Similarity.
Figure 7. Analysis Without Julian Corpus.
Figure 8. Analysis With Noun Chunking.
Figure 9. Isolated Julian and Pauline Corpus Analysis.