Article

Exploring Artificial Intelligence and Machine Learning Approaches to Legal Reasoning

by
Wullianallur Raghupathi
Gabelli School of Business, Fordham University, 140 W. 62nd Street, New York, NY 10023, USA
AppliedMath 2026, 6(2), 32; https://doi.org/10.3390/appliedmath6020032
Submission received: 6 January 2026 / Revised: 26 January 2026 / Accepted: 5 February 2026 / Published: 12 February 2026

Abstract

Modeling legal reasoning with artificial intelligence and machine learning presents formidable challenges. Legal decisions emerge from a complex interplay of factual circumstances, statutory interpretation, case precedent, jurisdictional variation, and human judgment—including the behavioral characteristics of judges and juries. This paper takes an exploratory approach to investigating how contemporary ML techniques might capture aspects of this complexity. Using pharmaceutical patent litigation as an illustrative domain, we develop a multi-layer analytical pipeline integrating text mining, clustering, topic modeling, and classification to analyze 698 U.S. federal district court decisions spanning January 2016 through December 2018, comprising substantive validity and infringement rulings under the Hatch-Waxman regulatory framework. Results demonstrate that the pipeline achieves 85–89% prediction accuracy—substantially exceeding the 42% baseline majority-class rate and comparing favorably with prior legal prediction studies—while producing interpretable intermediate outputs: clusters that correspond to recognized doctrinal categories (Abbreviated New Drug Application—ANDA litigation, obviousness, written description, claim construction) and topics that capture recurring legal themes. We discuss what these findings reveal about both the possibilities and limitations of computational approaches to legal reasoning, acknowledging the significant gap between statistical prediction and genuine legal understanding.

1. Introduction

The application of artificial intelligence (AI) and machine learning (ML) to legal reasoning represents one of the most ambitious and challenging frontiers in computational research. Law pervades every aspect of modern society—from commercial transactions and intellectual property protection to civil liberties and criminal justice—yet the reasoning processes that underlie legal decision-making have proven remarkably resistant to computational formalization. Unlike domains where clear rules yield deterministic outcomes, legal reasoning involves interpretation, judgment, and the weighing of competing considerations in ways that defy simple algorithmic capture.
The potential benefits of successfully applying AI and ML to legal reasoning are substantial. Computational approaches could enhance access to justice by making legal information more accessible, improve consistency in judicial decision-making, assist practitioners in legal research and case assessment, and provide empirical insights into how legal doctrine operates in practice. At the same time, significant challenges constrain progress: legal language is highly specialized and context-dependent; legal outcomes depend on factual nuances that may not appear in written opinions; and the normative dimensions of legal reasoning—questions of what the law should be, not merely what it is—resist purely descriptive computational treatment.
These considerations frame the exploratory investigation undertaken in this paper. We examine what contemporary machine learning methods can reveal about one well-defined legal domain—pharmaceutical patent litigation—while remaining attentive to the gap between computational pattern recognition and genuine legal understanding. Our approach integrates unsupervised learning techniques (clustering and topic modeling) with supervised classification, producing interpretable intermediate outputs alongside predictive accuracy. This multi-method design allows us to assess both whether computational methods can identify meaningful structure in legal text and whether textual features carry information about case outcomes.

1.1. The Challenge of Computational Legal Reasoning

The challenge deepens when we recognize that legal decisions emerge from an intricate interplay of multiple determinants: the specific facts of the case as developed through discovery and presented at trial; the applicable statutory framework and regulatory context; precedential authority from binding and persuasive jurisdictions; procedural posture and standard of review; and the human judgment of decision-makers operating within institutional constraints [1,2,3]. Each of these dimensions introduces complexity that computational approaches must somehow address or acknowledge they cannot capture.
This paper takes an exploratory approach to investigating how contemporary machine learning techniques might illuminate aspects of legal reasoning. We characterize the research as exploratory advisedly: rather than claiming to have solved the problem of computational legal reasoning, we seek to understand what current methods can and cannot reveal about judicial decision-making. Using pharmaceutical patent litigation as an illustrative domain, we develop and evaluate a multi-layer analytical pipeline, examining both its predictive performance and its interpretive outputs.

1.2. Factors Impinging on Legal Decisions

Understanding why modeling legal reasoning proves so difficult requires appreciating the multiplicity of factors that impinge on judicial decisions. We can organize these factors into several categories, each presenting distinct challenges for computational approaches.
Factual complexity. Legal outcomes depend heavily on case-specific facts that may not be fully captured in judicial opinions. A patent validity determination turns on the prior art landscape, the level of ordinary skill, and the specific claim limitations at issue. These facts are developed through extensive discovery, expert testimony, and adversarial presentation, much of which never appears in the written decision [4]. Computational models working from opinion text operate on a selective, post hoc account rather than the full evidentiary record.
Legal doctrine and precedent. Legal rules provide the framework within which facts are evaluated, but the rules themselves require interpretation. The doctrine of obviousness under 35 U.S.C. § 103 asks whether a claimed invention would have been obvious to a person having ordinary skill in the art—a standard that has evolved through decades of case law from Graham v. John Deere (1966) through KSR v. Teleflex (2007) and beyond [5,6]. Computational models must somehow account for this evolving doctrinal landscape, where the meaning of legal standards shifts through judicial interpretation.
Jurisdictional variation. Legal outcomes vary systematically across jurisdictions. In patent litigation, the Eastern District of Texas historically attracted plaintiffs seeking favorable treatment, while the District of Delaware developed specialized expertise in pharmaceutical cases [7]. The Federal Circuit provides nominally uniform patent law, but panel composition and circuit precedent create meaningful variation [8]. Models trained on data from one jurisdiction may not generalize to others.
Judicial behavior and decision-maker characteristics. Perhaps most challenging for computational approaches is the human element. Judges bring individual characteristics—ideological orientation, professional background, cognitive style—that influence decisions [9,10]. The attitudinal model in political science has documented systematic relationships between judicial ideology and case outcomes, particularly in ideologically salient cases. Beyond ideology, cognitive psychology research has identified heuristics and biases affecting judicial reasoning: anchoring effects in sentencing [11], coherence-based reasoning that shapes fact-finding [12], and even extraneous factors like time since the last meal break [13]. These behavioral dimensions operate alongside—and sometimes in tension with—formal legal doctrine.
Jury dynamics and lay decision-making. In jury trials, additional complexity arises from lay decision-makers applying legal instructions to complex technical facts. Patent cases present challenges: jurors must evaluate obviousness in light of prior art, construe claim terms with technical precision, and assess damages requiring economic expertise [14]. Jury composition, deliberation dynamics, and the effectiveness of attorney presentation all influence outcomes in ways that opinion text cannot fully reveal.
Strategic litigation behavior. Outcomes also reflect strategic choices by litigants: venue selection, claim framing, settlement behavior, and resource allocation [15]. The cases that reach judicial decision represent a selected subset—those that survived motion practice and resisted settlement. This selection effect means that analyzed cases may differ systematically from the broader universe of disputes, complicating inference from observed outcomes to underlying legal standards.

1.3. Two Traditions: Behavioral and Text-Analytic Approaches

Research on computational legal analysis has developed along two relatively distinct trajectories, each with characteristic strengths and limitations.
The behavioral tradition emphasizes decision-maker characteristics and institutional context. Building on political science research into judicial behavior, this approach models outcomes as functions of judge-level variables (ideology, background, experience), case-level variables (issue area, litigant characteristics, procedural posture), and contextual variables (court, time period, political environment). The attitudinal model developed by [9] demonstrated that Supreme Court justices’ votes could be predicted substantially from ideological scores derived from pre-appointment newspaper editorials. Subsequent work extended this approach to lower courts [16] and specialized tribunals [17]. The behavioral tradition offers theoretical grounding and interpretable models but typically treats case content as categorical variables rather than analyzing the full text of legal documents.
The text-analytic tradition, by contrast, works directly with the language of legal materials. Early work applied information retrieval techniques to legal documents [18], and researchers explored adapting expert system techniques for legal research applications [19]; more recent research has leveraged advances in deep learning, including transformer architectures adapted for legal text [20,21]. The text-analytic tradition can process large corpora and discover patterns invisible to manual analysis but often sacrifices interpretability and theoretical grounding.
This paper seeks to bridge these traditions. We employ text-analytic methods—TF-IDF vectorization, clustering, topic modeling, neural network classification—while remaining attentive to behavioral insights about the factors shaping legal outcomes. Our multi-layer pipeline produces not just predictions but interpretable intermediate representations that can be evaluated against doctrinal categories and practitioner intuitions. We acknowledge that neither tradition alone captures the full complexity of legal reasoning; our contribution lies in demonstrating what an integrated approach can reveal.

1.4. Why Pharmaceutical Patent Litigation?

We focus on pharmaceutical patent litigation as an illustrative domain rather than an end. This choice reflects several methodological and substantive considerations.
First, pharmaceutical patent cases involve high economic stakes and specialized subject matter, generating detailed judicial opinions that provide rich textual material for analysis. Drug patents protect products worth billions in annual revenue; when generic manufacturers challenge these patents under the Hatch-Waxman Act, both sides invest substantial resources in litigation, producing extensive briefing and carefully reasoned decisions [22].
Second, pharmaceutical patent litigation operates within a well-defined regulatory framework. The Drug Price Competition and Patent Term Restoration Act of 1984 (Hatch-Waxman Act) established procedures for generic drug approval and patent challenges, creating a structured legal environment with recurring doctrinal issues. This structure facilitates systematic analysis: we can identify cases involving Paragraph IV certifications, obviousness challenges to compound patents, enablement disputes regarding biological products, and other recurring patterns [23,24].
Third, pharmaceutical patent cases concentrate in a limited number of courts with specialized expertise, particularly the District of Delaware and the District of New Jersey, and all appeals flow to the Federal Circuit. This jurisdictional concentration provides relative doctrinal coherence compared to areas where cases scatter across ninety-four district courts and thirteen circuits applying potentially divergent standards.
Fourth, prior research provides baseline findings against which our results can be compared. Studies have examined pharmaceutical patent validity rates [25], the relationship between patent characteristics and litigation outcomes [26], and the effectiveness of various legal strategies [27]. Notably, Ref. [28] demonstrated the feasibility of applying big data analytics and machine learning to pharmaceutical patent validity modeling, achieving promising results using Hadoop MapReduce architectures. The present study extends that foundational work by incorporating unsupervised learning techniques, deep neural architectures, and a systematic focus on interpretability. We can thus situate our computational findings within an established and growing body of knowledge on pharma patent analytics.
Finally, pharmaceutical patent litigation carries significant policy implications. Generic drug entry affects healthcare costs; patent enforcement strategies influence innovation incentives; and the balance between originator and generic interests shapes pharmaceutical markets. Understanding the patterns in judicial decision-making—even descriptively—informs ongoing policy debates about pharmaceutical patent reform [22,29].

1.5. Research Objectives and Contributions

Against this background, our research pursues several interrelated objectives.
Our primary objective is exploratory: to investigate what contemporary machine learning methods can reveal about the textual structure of pharmaceutical patent opinions and their relationship to case outcomes. We do not claim to have solved the problem of computational legal reasoning; we seek to understand the capabilities and limitations of available techniques when applied to a well-defined legal domain.
A second objective is methodological: to develop and evaluate a multi-layer analytical pipeline that produces interpretable intermediate outputs alongside predictive classifications. Unlike black-box models that offer predictions without explanation, our approach generates clusters that can be examined for doctrinal coherence, topics that can be evaluated for semantic interpretability, and feature importances that can be compared against domain knowledge. This interpretability comes at some cost in predictive performance, but we argue it better suits exploratory research aimed at understanding rather than mere prediction.
A third objective is integrative: to bridge behavioral and text-analytic traditions by combining NLP techniques with attention to the factors that legal scholarship identifies as influencing judicial decisions. Our analysis considers not just textual features but the doctrinal categories, procedural contexts, and institutional settings that shape pharmaceutical patent litigation.
The paper makes several contributions to the growing literature on AI and law. First, we demonstrate that unsupervised methods recover meaningful doctrinal structure from pharmaceutical patent opinions—clusters correspond to recognized legal categories (ANDA litigation, obviousness, written description) without supervised training on these labels. Second, we show that textual features predict outcomes with substantial accuracy (85–89%), substantially exceeding baseline rates, while acknowledging the gap between prediction and understanding. Third, we provide a template for interpretable legal ML that balances predictive performance with transparency—a balance increasingly important as computational methods enter legal practice. Fourth, we offer candid assessment of limitations, resisting the temptation to overclaim what our methods achieve.

1.6. Paper Organization

The remainder of this paper proceeds as follows. Section 2 reviews relevant literature spanning AI and law, empirical legal studies, and legal NLP, situating our research within existing scholarship. Section 3 describes our data and methods, including corpus construction, preprocessing, and analytical techniques. Section 4 presents results from clustering, topic modeling, and classification. Section 5 discusses implications and the persistent gap between prediction and understanding. Section 6 explicitly addresses scope and limitations—an exercise we consider essential for scientific integrity. Section 7 concludes with reflections on future directions.
To guide this investigation, we pose four explicit research questions: (RQ1) Can unsupervised learning methods (clustering, topic modeling) recover doctrinally meaningful structure from pharmaceutical patent opinion text without supervision on legal categories? (RQ2) To what extent can supervised classifiers predict case outcomes from textual features, and which algorithmic approaches perform best? (RQ3) What do the features driving classification reveal about the textual signals associated with different outcomes? (RQ4) What are the limitations of text-based prediction, and what aspects of legal reasoning remain beyond computational capture?
In sum, this study advances beyond prior AI-and-law research in three principal ways. First, whereas most existing work either focuses on behavioral prediction using structured case attributes or applies text analytics without attention to legal doctrine, we integrate both traditions through a multi-layer pipeline that produces doctrinally interpretable intermediate representations alongside outcome predictions. Second, we provide systematic evaluation across multiple algorithmic approaches—from traditional classifiers through deep learning architectures—enabling assessment of which methods best suit legal text analysis tasks. Third, rather than treating prediction accuracy as an end in itself, we examine what computational patterns reveal about the structure of legal reasoning and where they fall short, contributing to the broader discourse on the capabilities and limitations of AI in law. To summarize our contributions concisely: (C1) Novel integration of behavioral and text-analytic traditions through interpretable intermediate representations; (C2) Comprehensive algorithmic comparison across six model architectures with rigorous cross-validation; (C3) Demonstration that unsupervised methods recover recognized doctrinal categories without labeled training data; (C4) Transparent assessment of limitations, including error analysis and discussion of the prediction-understanding gap.

2. Research Background

Research on artificial intelligence and legal reasoning spans multiple disciplines, methodological traditions, and technological paradigms. This section surveys the relevant literature across jurisprudential theory, computational approaches, empirical legal studies, and domain-specific patent research, culminating in identification of the gaps that motivate our investigation.

2.1. Theoretical Foundations of Legal Reasoning

Any computational approach to legal reasoning must grapple with foundational questions about what legal reasoning is. Jurisprudential scholarship offers competing accounts that carry implications for computational modeling.
The formalist tradition, associated with legal positivism, views legal reasoning as primarily deductive: judges identify applicable rules and apply them to facts to derive conclusions. Ref. [3]’s account of law as a system of primary and secondary rules suggests that much legal reasoning operates mechanically in the ‘core’ of settled meaning, with judicial discretion confined to penumbral cases where rules are indeterminate. This view implies that computational systems might successfully model routine legal determinations while struggling with hard cases.
Ref. [30] challenged this account, arguing that legal reasoning involves interpretation guided by principles of political morality. On this view, even seemingly straightforward cases require interpretive judgment about how legal materials fit together into a coherent whole. Judges do not merely apply rules but construct the best interpretation of legal practice. This interpretivist account poses deeper challenges for computational modeling: if legal reasoning is irreducibly interpretive, then systems that match patterns or apply rules may miss something essential.
Legal realists offered yet another perspective, emphasizing the behavioral and contextual factors that shape judicial decisions. Ref. [2] documented how appellate judges employ competing ‘steadying factors’ and ‘unsteadying factors’ that formal doctrine alone cannot explain. This tradition anticipated contemporary empirical legal studies and suggests that computational models might need to incorporate extra-legal variables—judicial background, ideological orientation, institutional context—alongside doctrinal analysis.
Cognitive science has added another dimension, examining legal reasoning as an instance of human cognition subject to characteristic heuristics and biases. Ref. [12] documented coherence-based reasoning in legal fact-finding, showing how conclusions and evidence assessments become mutually reinforcing. This cognitive turn suggests that computational models might capture patterns in judicial reasoning that reflect cognitive processes rather than (or in addition to) doctrinal logic.

2.2. Rule-Based Expert Systems in Law

The first wave of AI and law research, emerging in the 1970s and flourishing in the 1980s, applied rule-based expert systems to legal domains. These systems represented legal knowledge as formal rules and used inference engines to derive conclusions from facts [31].
The British Nationality Act system [32] demonstrated that statutory provisions could be represented as logic programs, enabling automated determination of citizenship status. By encoding the Act’s provisions in Prolog, researchers showed that well-defined statutory domains could be formalized with reasonable fidelity. Ref. [33] provided jurisprudential foundations for legal expert systems, examining the epistemological assumptions underlying knowledge representation in law and arguing that expert systems could handle the ‘clear cases’ that constitute the bulk of legal practice.
However, rule-based systems encountered fundamental limitations. Gardner’s foundational work, beginning with her dissertation and culminating in her landmark study [34], demonstrated that legal reasoning involves ‘hard questions’ where rules underdetermine outcomes and background common-sense knowledge becomes essential. The knowledge acquisition bottleneck (the difficulty of extracting and encoding expert knowledge) proved particularly severe in law, where expertise is distributed, tacit, and contested. Moreover, the brittleness of rule-based systems meant they failed ungracefully when encountering situations not anticipated by their designers.
Recognizing these limitations, researchers developed more specialized and hybrid approaches. Ref. [1] advocated for a systemic approach to designing AI applications in law. They argued that legal reasoning’s inherent complexity—involving multiple knowledge sources, reasoning modalities, uncertainty, and contextual factors—required architectural frameworks that could integrate diverse techniques rather than relying on any single paradigm. Their systemic view anticipated later hybrid approaches and remains relevant as contemporary systems combine multiple AI methods. Ref. [35] demonstrated these principles in the SKADE LITorSET system, an expert system designed to support corporate litigate-or-settle decisions—a domain requiring integration of legal doctrine, strategic considerations, and financial analysis that exemplified the multifaceted nature of legal decision-making. Ref. [36] extended this work with a blackboard architecture for product liability claims evaluation, showing how distributed reasoning across multiple knowledge sources could address the complexity that single-paradigm systems struggled to capture.

2.3. Case-Based Reasoning and Argumentation Systems

Case-based reasoning (CBR) systems offered an alternative paradigm that aligned more naturally with common law methodology. Rather than encoding abstract rules, CBR systems represented cases as structured objects and generated arguments by comparing the case at hand to precedents.
HYPO [37] pioneered this approach in trade secret law. The system represented cases as configurations of legally significant ‘dimensions’—factors that strengthen or weaken a party’s position—and generated arguments by citing favorable precedents, distinguishing unfavorable ones, and identifying hypothetical variations that would change the analysis. HYPO demonstrated that analogical legal reasoning could be computationally modeled, though it required extensive manual encoding of domain knowledge.
CATO [38] extended HYPO’s approach for legal education, teaching students to construct arguments from case comparisons. The system modeled not just case retrieval but also the argumentative moves (citing, distinguishing, reconciling) that characterize legal reasoning with precedent. CATO showed that CBR systems could capture the dialectical structure of legal argument, not merely predict outcomes.
Subsequent work developed formal argumentation frameworks that could model the structure of legal debate. Ref. [39], in their comprehensive survey of the field, characterized AI and law as a ‘fruitful synergy’ where legal domains provided challenging test cases for AI methods while computational approaches illuminated the structure of legal reasoning. Ref. [40] incorporated theories and values into case-based reasoning, showing how background value orderings could explain and predict case outcomes. Ref. [41] developed dialectical models of legal argument that represented the attack and defense relationships among competing arguments. The IBP system [42] combined case-based reasoning with issue-spotting to predict outcomes, achieving reasonable accuracy while maintaining interpretability. These systems captured important aspects of legal reasoning but remained limited to narrow domains where extensive knowledge engineering was feasible.

2.4. Connectionist and Neural Network Approaches

Neural network approaches to legal reasoning emerged alongside symbolic AI, offering a fundamentally different computational paradigm. Rather than encoding explicit rules or case representations, connectionist systems learn patterns from examples through distributed representations and weighted connections.
Ref. [43] explored connectionist approaches to legal decision-making, examining how neural architectures might capture the pattern-recognition and associative reasoning that characterizes judicial cognition. Their work identified both the promise of neural methods—learning from examples without explicit rule programming, tolerance for noisy or incomplete inputs, graceful degradation—and the challenges that would preoccupy the field for decades: limited interpretability, difficulty incorporating structured knowledge, and the ‘black box’ character of learned representations. They noted that legal reasoning involves both the pattern-matching capabilities where neural networks excel and the symbolic manipulation where they struggle, suggesting that hybrid approaches might ultimately prove necessary.
The subsequent decades saw limited application of neural methods to law, partly due to data limitations and partly due to the interpretability concerns that are especially acute in legal contexts where decisions must be justified. The deep learning revolution of the 2010s, however, brought renewed interest and vastly more powerful architectures.

2.5. Statistical and Machine Learning Approaches

The data-driven turn in AI brought machine learning techniques that could learn predictive models from large corpora without explicit knowledge engineering. These approaches treated legal prediction as a classification problem: given features extracted from cases, predict the outcome.
Ref. [44] achieved 70.2% accuracy in predicting Supreme Court decisions using an extremely randomized trees classifier with case-level features including court of origin, issue area, and ideological direction. Their model outperformed legal experts in forecasting tournaments, demonstrating that meaningful prediction was possible from relatively simple features. Importantly, their approach used structured case metadata rather than full-text analysis, leaving open the question of what textual features might add.
Ref. [45] applied natural language processing to European Court of Human Rights decisions, achieving 79% accuracy using textual features extracted from case facts and legal arguments. Their analysis revealed that the ‘circumstances’ section of opinions—describing case facts—carried more predictive signal than legal reasoning sections, raising intriguing questions about the relationship between facts and outcomes in judicial decisions.
Other work has applied machine learning to contract analysis [46], statutory interpretation [47], and legal document classification [48]. These studies demonstrated that supervised learning could achieve useful performance across diverse legal tasks, though typically without deep insight into what features drove predictions or whether learned patterns corresponded to legal reasoning.

2.6. Deep Learning and Transformer Architectures

Contemporary deep learning has transformed legal NLP with architectures capable of learning rich representations from massive corpora. Recurrent neural networks, particularly Long Short-Term Memory (LSTM) architectures, capture sequential dependencies in legal text that bag-of-words models miss—how arguments build, how facts relate to conclusions, how precedents are marshaled and distinguished.
Transformer architectures have achieved state-of-the-art results across legal NLP tasks. LEGAL-BERT [20] adapted the BERT architecture to legal text through domain-specific pretraining, demonstrating substantial improvements over general-purpose language models on legal classification tasks. CaseLaw-BERT extended this approach with pretraining on large-scale case law corpora, learning representations tuned to judicial language and reasoning patterns.
The LexGLUE benchmark [49] provided standardized evaluation across multiple legal language understanding tasks—case outcome classification, contract clause identification, statutory interpretation—enabling systematic comparison of architectures. Results showed that legal-domain pretraining consistently improved performance, though gains varied across tasks. Ref. [21] examined when pretraining helps for legal tasks, finding that domain-specific benefits depend on task characteristics and corpus similarity.
Large language models have demonstrated remarkable capabilities on legal tasks including bar exam questions [50], though their reliability, tendency toward hallucination, and opacity raise concerns for legal applications. While these models achieve impressive accuracy, they typically function as black boxes, limiting their utility for applications requiring explanation and justification—a limitation anticipated in early connectionist explorations of legal reasoning [43].

2.7. Behavioral and Empirical Legal Studies

Parallel to computational approaches, empirical legal studies have investigated the behavioral dimensions of judicial decision-making, producing findings that both inform and complicate computational modeling.
The attitudinal model [9] documented systematic relationships between judicial ideology and case outcomes at the Supreme Court. Justices’ votes could be predicted substantially from ideological scores derived from pre-appointment newspaper editorials, suggesting that legal doctrine alone does not determine outcomes. Subsequent work extended this finding to lower federal courts [16] and specialized tribunals, showing that political and ideological factors shape judicial behavior across institutional contexts.
Cognitive psychology research has identified specific heuristics and biases affecting judicial reasoning. Ref. [51] demonstrated that judges exhibit anchoring effects, framing effects, and hindsight bias—cognitive patterns that influence decisions independently of legal doctrine. Ref. [11] showed that sentencing recommendations are influenced by arbitrary numerical anchors. Most provocatively, Ref. [13] found that parole decisions correlated with time since judges’ last meal break, suggesting that physiological factors affect ostensibly legal judgments.
These behavioral findings carry implications for computational legal analysis. Models working from case text alone may miss important predictors of judicial behavior. Conversely, computational methods might detect patterns in judicial decisions that reflect cognitive processes rather than doctrinal reasoning—patterns that judges themselves may not recognize or acknowledge.

2.8. Patent Litigation and Pharmaceutical Patents

Patent litigation presents a distinctive context for computational legal analysis, shaped by specialized courts, technical subject matter, and regulatory frameworks that create structured, recurring patterns of dispute.
Foundational empirical work established baseline findings about patent litigation outcomes. Ref. [26] examined validity rates in litigated patents, finding that patents reaching judgment were held invalid at surprisingly high rates—much higher than the general patent population would suggest. This selection effect, where only patents of uncertain validity proceed to judgment while clearly valid and clearly invalid patents settle, complicates inference from litigated cases to patent quality generally. Ref. [25] extended this analysis to identify characteristics of ‘valuable patents’ that predict litigation and validity outcomes.
Ref. [14] examined institutional dimensions of patent adjudication, comparing judge and jury decision-making and documenting forum shopping patterns. Her work showed that procedural and institutional factors—not just legal merits—influence patent outcomes, suggesting that computational models should attend to court-level and judge-level variables alongside case features.
Pharmaceutical patent litigation operates within the distinctive framework of the Hatch-Waxman Act (1984), which established procedures for generic drug approval and patent challenges. The Paragraph IV certification process creates a structured pathway for challenging Orange Book-listed patents, generating recurring patterns of litigation around compound patents, formulation patents, and method-of-treatment claims [22]. Studies have examined evergreening strategies [23], follow-on biologics [24], and the dynamics of pay-for-delay settlements [27,29].
Computational approaches to pharmaceutical patent analysis have shown considerable promise. Ref. [28] applied big data analytics to pharmaceutical patent validity cases, employing Hadoop MapReduce architectures to process and analyze patent litigation data at scale. Their study demonstrated that machine learning models could effectively classify patent validity outcomes, establishing baseline performance metrics and identifying key features—including claim characteristics, prior art citations, and procedural variables—that correlated with validity determinations. This work provided essential groundwork demonstrating both the feasibility and challenges of applying computational methods to pharmaceutical patent analytics.

2.9. Research Gaps

Despite substantial progress across these research streams, significant gaps remain—both in understanding the pharmaceutical patent domain and in AI/ML methodologies for legal reasoning.
Domain gaps. First, pharmaceutical patent litigation has received less computational attention than domains like Supreme Court prediction or contract analysis, despite its economic importance and policy relevance. Second, existing computational studies of patent litigation typically examine validity or infringement as isolated outcomes, without systematic analysis of the doctrinal structure underlying judicial decisions—how obviousness arguments differ textually from enablement arguments, how Hatch-Waxman cases differ from inter partes review. Third, the relationship between textual patterns in pharmaceutical patent opinions and the behavioral factors identified in empirical legal studies remains unexplored: do opinions from different judges exhibit distinctive textual signatures? Do outcomes correlate with extra-legal factors after controlling for doctrinal content? Fourth, temporal dynamics receive limited attention: how have pharmaceutical patent opinions evolved since KSR v. Teleflex (2007) and the America Invents Act (2011)?
Methodological gaps. First, most legal ML research emphasizes prediction over interpretation; we lack systematic understanding of what models learn and whether learned patterns correspond to legal reasoning or merely statistical regularities. Second, unsupervised methods such as clustering and topic modeling are rarely integrated with supervised outcome prediction, so whether the structure they discover in legal text bears on case outcomes remains largely unexamined.
Our research addresses these gaps. We focus on pharmaceutical patent litigation as a domain of economic importance and doctrinal coherence. We employ a multi-layer analytical pipeline that combines unsupervised methods (clustering, topic modeling) with supervised classification, producing interpretable intermediate outputs alongside predictive accuracy. We bridge behavioral and text-analytic traditions by examining whether computationally identified patterns correspond to recognized doctrinal categories. We extend prior big data analytics work on pharma patents [28] with contemporary unsupervised learning and neural network architectures. And we explicitly examine what our methods reveal and what they cannot capture—resisting the temptation to overclaim while demonstrating what contemporary ML can illuminate about legal reasoning.

3. Data and Methods

This section describes the corpus construction, preprocessing pipeline, and analytical methods employed in our investigation. We adopt a multi-layer approach that combines unsupervised learning (clustering, topic modeling) with supervised classification, designed to produce interpretable intermediate outputs alongside predictive accuracy. Figure 1 illustrates the overall analytical pipeline.
To clarify data flow through the pipeline: raw opinion text first undergoes preprocessing (cleaning, tokenization, stopword removal, stemming) to produce normalized documents. These documents are then transformed into TF-IDF vector representations, which serve as input to both the clustering and topic modeling stages. K-Means clustering operates on the full TF-IDF vectors to identify document groupings, while Latent Dirichlet Allocation (LDA) topic modeling works with term-frequency matrices derived from the same preprocessed text to discover latent thematic structure. The outputs from these unsupervised stages—cluster assignments and topic distributions—provide interpretable characterizations of the corpus that can be validated against known doctrinal categories. For supervised classification, the TF-IDF features (augmented optionally with cluster and topic features) are combined with outcome labels to train predictive models. This sequential-yet-integrated design ensures that insights from unsupervised analysis inform our interpretation of classification results.
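To make this flow concrete, the following minimal sketch outlines the pipeline using scikit-learn; the loader functions (load_opinions, load_outcome_labels) are hypothetical placeholders, and the parameter values mirror choices described later rather than a verbatim reproduction of our implementation.

```python
# Minimal end-to-end sketch of the analytical pipeline (assumes scikit-learn).
# load_opinions / load_outcome_labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = load_opinions()          # preprocessed opinion strings (hypothetical loader)
labels = load_outcome_labels()   # 0 = patent holder, 1 = challenger, 2 = mixed (hypothetical)

# TF-IDF vectors feed clustering and classification
X_tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2").fit_transform(texts)

# Raw term counts feed LDA topic modeling
X_counts = CountVectorizer().fit_transform(texts)

clusters = KMeans(n_clusters=6, init="k-means++", random_state=0).fit_predict(X_tfidf)
topics = LatentDirichletAllocation(n_components=15, random_state=0).fit_transform(X_counts)

clf = LogisticRegression(max_iter=1000)
print("Mean CV accuracy:", cross_val_score(clf, X_tfidf, labels, cv=5).mean())
```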

3.1. Data Collection and Corpus Construction

The corpus comprises 698 pharmaceutical patent cases decided in United States federal courts between January 2016 and December 2018. Cases were retrieved using a systematic search protocol designed to capture the complete population of published opinions involving pharmaceutical patent disputes during this period.
Search strategy. We queried the Westlaw and LexisNexis databases using Boolean search strings combining pharmaceutical terminology (e.g., ‘drug,’ ‘pharmaceutical,’ ‘ANDA,’ ‘generic,’ ‘FDA,’ ‘Orange Book’) with patent-related terms (e.g., ‘patent,’ ‘infringement,’ ‘validity,’ ‘obviousness,’ ‘35 U.S.C.’). The search was restricted to federal district court and Federal Circuit opinions. We supplemented database searches with targeted retrieval of cases citing key Hatch-Waxman provisions (21 U.S.C. § 355) and major pharmaceutical patent precedents. Duplicate removal and manual verification ensured corpus integrity.
Temporal boundaries. The 2016–2018 window was selected for several methodological reasons. First, it postdates the Supreme Court’s Alice Corp. v. CLS Bank (2014) decision and full implementation of the America Invents Act (2013), ensuring doctrinal coherence within a stable legal framework. Second, it predates COVID-19 pandemic disruptions that affected court operations and pharmaceutical litigation patterns beginning in 2020. Third, the three-year window provides sufficient temporal depth to capture variation while maintaining coherence. Fourth, the temporal distance ensures case completeness—appeals resolved, related proceedings concluded—enabling accurate outcome coding.
Inclusion criteria. We included published opinions addressing substantive patent issues (validity, infringement, claim construction, preliminary injunction) in pharmaceutical contexts. We excluded purely procedural orders, discovery disputes without substantive patent analysis, and cases where pharmaceutical patents were mentioned incidentally rather than centrally litigated. Cases involving combination products (drug-device) were included if patent claims related primarily to the pharmaceutical component.
Outcome coding. Each case was coded for outcome using a three-category scheme: patent holder favorable (validity upheld and/or infringement found), challenger favorable (invalidity found and/or non-infringement), and mixed (split outcomes across claims or issues). We adopted this three-class scheme rather than a binary win/lose classification for substantive and methodological reasons. Substantively, pharmaceutical patent litigation frequently produces genuinely mixed outcomes—courts may invalidate some claims while upholding others, or find validity but no infringement—and collapsing these into a binary classification would obscure legally meaningful distinctions. Methodologically, the three-class approach preserves information about outcome heterogeneity while maintaining sufficient observations per class for reliable model training; finer-grained schemes (e.g., separate categories for each combination of validity, infringement, and remedy determinations) would create sparse classes unsuitable for machine learning. Two coders independently classified outcomes; disagreements were resolved through discussion with reference to the opinion text. Inter-rater reliability was substantial (Cohen’s kappa = 0.84). The three-category scheme sacrifices some nuance but enables meaningful classification while avoiding the sparsity problems that would arise from finer-grained coding.
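As a brief illustration of the inter-rater reliability computation, the sketch below uses scikit-learn's cohen_kappa_score on invented toy codes rather than the actual coding data.

```python
# Toy illustration of Cohen's kappa for two coders (assumes scikit-learn);
# the arrays below are invented, not the study's coding sheets.
from sklearn.metrics import cohen_kappa_score

# 0 = patent holder favorable, 1 = challenger favorable, 2 = mixed
coder_a = [0, 1, 1, 2, 0, 2, 1, 0]
coder_b = [0, 1, 1, 2, 0, 1, 1, 0]

print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")
```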
Data availability and reproducibility. The corpus consists of publicly available federal court opinions accessible through Westlaw, LexisNexis, and PACER. We can provide case citation lists and outcome labels as supplementary material to enable replication if requested. Full opinion texts cannot be redistributed due to database licensing restrictions, but researchers can reconstruct the corpus using the provided citations. Preprocessing code and model specifications are available at [repository to be provided upon acceptance]. We acknowledge that exact replication may yield minor variations due to stochastic elements in neural network training and k-means initialization; we report results averaged across multiple runs with standard deviations to characterize this variability.

3.2. Text Preprocessing

Raw opinion text underwent a standardized preprocessing pipeline to prepare it for computational analysis. Preprocessing choices involve tradeoffs between information preservation and noise reduction; we document our choices to enable replication and sensitivity analysis.
Text extraction and cleaning. Opinions were extracted from database formats and converted to plain text. Headers, footers, page numbers, and Westlaw/LexisNexis metadata were removed. Citations were normalized to reduce vocabulary inflation from minor formatting variations. Tables and figures were excluded as their structured format resists bag-of-words representation. To clarify treatment of potentially informative structural elements: legal citations (case names and reporter references) were retained but normalized to a standardized format (e.g., “Smith v. Jones, 123 F.3d 456”) to preserve citation patterns while reducing vocabulary inflation from formatting variations. Opinion section headers (e.g., “Background,” “Discussion,” “Obviousness Analysis”) were retained as they may carry predictive signal about opinion structure and doctrinal focus. Statutory citations (e.g., “35 U.S.C. § 103”) were preserved and normalized. This decision to retain legal citations and headers reflects our exploratory orientation: excluding them would assume they lack predictive value, whereas retention allows the models to determine their utility empirically.
Tokenization. Text was tokenized using NLTK’s word tokenizer, which handles contractions, hyphenation, and punctuation according to Penn Treebank conventions. We preserved case information during initial tokenization, then lowercased all tokens to reduce vocabulary size while acknowledging that case sometimes carries meaning in legal text (e.g., proper nouns, statutory references).
Stopword removal. We applied a two-stage stopword removal process. First, standard English stopwords (articles, prepositions, common verbs) were removed using the NLTK stopword list. Second, we developed a domain-specific stopword list for legal text, removing high-frequency terms that appear across virtually all opinions but carry little discriminative information: ‘court,’ ‘plaintiff,’ ‘defendant,’ ‘claim,’ ‘patent’ (when used generically), ‘case,’ ‘evidence,’ ‘argues,’ ‘contends.’ The domain-specific list was developed iteratively by examining term frequencies and removing terms appearing in more than 80% of documents with low variance across outcome categories.
Stemming. We applied Porter stemming to reduce inflectional variants to common roots (e.g., ‘infringing,’ ‘infringed,’ ‘infringement’ to ‘infring’). Stemming reduces vocabulary size and groups semantically related terms, though it occasionally conflates distinct concepts. We chose Porter stemming over lemmatization for computational efficiency and consistency with prior legal NLP research, while acknowledging that lemmatization might better preserve legal terminology distinctions.
N-gram extraction. Beyond unigrams, we extracted bigrams to capture multi-word legal concepts: ‘prior art,’ ‘written description,’ ‘claim construction,’ ‘Orange Book,’ ‘skilled artisan.’ Bigrams were identified using pointwise mutual information (PMI) scores, retaining pairs with PMI > 5.0 that appeared in at least 10 documents. This threshold balances capturing meaningful collocations against vocabulary explosion.
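The sketch below illustrates these preprocessing steps with NLTK; the abbreviated domain stopword set is illustrative, and the frequency filter is only a rough stand-in for the minimum-document constraint on bigrams.

```python
# Sketch of the preprocessing pipeline (assumes NLTK); the domain stopword
# list is abbreviated and the PMI threshold follows the text.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
english_stops = set(stopwords.words("english"))
domain_stops = {"court", "plaintiff", "defendant", "claim", "case"}  # abbreviated illustration

def preprocess(text):
    tokens = [t.lower() for t in word_tokenize(text)]
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in english_stops and t not in domain_stops]
    return [stemmer.stem(t) for t in tokens]

def pmi_bigrams(token_lists, min_freq=10, pmi_threshold=5.0):
    # Score candidate bigrams by pointwise mutual information across the corpus;
    # the frequency filter is a rough proxy for the minimum-document constraint.
    finder = BigramCollocationFinder.from_documents(token_lists)
    finder.apply_freq_filter(min_freq)
    measures = BigramAssocMeasures()
    return [bg for bg, score in finder.score_ngrams(measures.pmi) if score > pmi_threshold]
```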

3.3. Feature Extraction and Representation

Preprocessed text was converted to numerical representations suitable for machine learning algorithms. We employed term frequency-inverse document frequency (TF-IDF) vectorization as our primary representation, supplemented by document embeddings for neural network models.
TF-IDF vectorization. Each document was represented as a vector in term space, with TF-IDF weighting that upweights terms frequent in a document but rare across the corpus. Formally, for a term t in document d within corpus D, the TF-IDF weight is computed as:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
We used sublinear term frequency scaling to dampen the influence of very high-frequency terms:
tf_sublinear(t, d) = 1 + log(tf(t, d))
The inverse document frequency with smoothing is defined as:
idf(t, D) = log((N + 1)/(df(t) + 1)) + 1
where N = |D| is the total number of documents in the corpus and df(t) is the document frequency of term t. The resulting vocabulary comprised 10,493 unique terms after preprocessing. We applied L2 normalization to document vectors to control for variation in opinion length, yielding unit-length document vectors in R^10,493.
Dimensionality considerations. The high dimensionality of TF-IDF vectors (10,493 features for 698 documents) raises concerns about overfitting and computational efficiency. For clustering and topic modeling, we worked with the full TF-IDF matrix, as these unsupervised methods benefit from the complete feature space. For classification, we explored dimensionality reduction via truncated SVD (retaining components explaining 95% of variance) and feature selection (chi-squared test, top 2000 features), comparing performance against full-dimensional models.
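A sketch of the vectorization and both reduction strategies, assuming scikit-learn, whose sublinear_tf and smooth_idf options implement the weighting defined above; docs and y are hypothetical stand-ins for the preprocessed opinions and outcome labels, and the SVD component count is illustrative (the variance retained must be checked after fitting).

```python
# TF-IDF representation plus the two reduction strategies (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # tf = 1 + log(tf), as in the text
    smooth_idf=True,     # idf = log((N + 1)/(df + 1)) + 1
    norm="l2",           # unit-length document vectors
    ngram_range=(1, 2),  # simple proxy for the PMI-filtered bigrams described earlier
)
X = vectorizer.fit_transform(docs)   # docs: hypothetical list of preprocessed opinions

# Option 1: truncated SVD; component count is illustrative, check variance after fitting
svd = TruncatedSVD(n_components=300, random_state=0)
X_svd = svd.fit_transform(X)
print("Variance explained:", svd.explained_variance_ratio_.sum())

# Option 2: chi-squared feature selection of the top 2000 terms
X_chi2 = SelectKBest(chi2, k=2000).fit_transform(X, y)
```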
Sequence representations. For recurrent neural network models (LSTM), we preserved sequential structure rather than using bag-of-words representations. Documents were represented as sequences of token indices, padded or truncated to a maximum length of 5000 tokens (approximately the mean document length plus one standard deviation). Token embeddings were initialized randomly and learned during training, with embedding dimension 128.

3.4. Unsupervised Learning: Clustering Methods

We applied multiple clustering algorithms to identify natural groupings in the corpus, examining whether unsupervised methods would recover doctrinally meaningful categories without supervision.
K-Means clustering. K-Means partitions documents into k clusters by minimizing within-cluster sum of squared distances to cluster centroids. Formally, given a set of documents X = {x1, x2, …, xn}, the algorithm seeks to minimize the objective function:
J = Σ(i = 1 to k) Σ(x∈Ci) ||x − μi||²
where Ci denotes the set of documents assigned to cluster i, and μi is the centroid of cluster i, computed as the mean of all documents in that cluster:
μi = (1/|Ci|) Σ(x∈Ci) x
We used the k-means++ initialization to improve convergence and reduce sensitivity to initial centroid placement. The number of clusters (k = 6) was selected using the elbow method and silhouette analysis: we computed silhouette scores for k in {2, 3, …, 12} and selected k = 6 as the point where silhouette score plateaued while maintaining interpretable cluster sizes. Each cluster was characterized by its centroid terms—the highest-weighted terms in the centroid vector—enabling semantic interpretation. To assess cluster stability, we performed 100 runs with different random initializations and computed the Adjusted Rand Index (ARI) between each pair of clustering solutions. The mean pairwise ARI was 0.89 (SD = 0.04), indicating high stability across runs. We also applied bootstrap resampling (1000 iterations), finding that cluster assignments remained stable for 87% of documents across resamples. The choice of k = 6 was further validated by domain knowledge: pharmaceutical patent litigation naturally partitions into recognizable categories (ANDA procedural, obviousness, written description, claim construction, infringement, and mixed-issue cases), and our six clusters aligned well with these doctrinal groupings, as detailed in the Results section (Section 4).
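The following sketch illustrates the k selection and stability checks with scikit-learn; X denotes the TF-IDF matrix from Section 3.3, and the number of repeated runs is reduced for brevity.

```python
# K-Means: silhouette scan over k and pairwise-ARI stability check (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Silhouette scores across candidate cluster counts
for k in range(2, 13):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))

# Stability: pairwise ARI across repeated runs with different initializations
runs = [KMeans(n_clusters=6, init="k-means++", n_init=1, random_state=s).fit_predict(X)
        for s in range(20)]   # 100 runs in the study; 20 here for brevity
aris = [adjusted_rand_score(runs[i], runs[j])
        for i in range(len(runs)) for j in range(i + 1, len(runs))]
print("Mean pairwise ARI:", np.mean(aris))
```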
Silhouette coefficient. The silhouette score provides a measure of cluster quality by comparing intra-cluster cohesion to inter-cluster separation. For each document xi, let a(i) be the mean distance to other documents in the same cluster, and b(i) be the minimum mean distance to documents in any other cluster. The silhouette coefficient for document i is:
s(i) = (b(i) − a(i))/max(a(i), b(i))
The overall silhouette score is the mean of s(i) across all documents, ranging from −1 (poor clustering) to +1 (dense, well-separated clusters). Our score of 0.63 indicates moderately well-defined clusters.
Hierarchical clustering. Agglomerative hierarchical clustering builds a tree of nested clusters by iteratively merging the most similar pairs. We used Ward’s linkage criterion, which minimizes the increase in total within-cluster variance at each merge. Ward’s method tends to produce compact, spherical clusters and is relatively robust to outliers. The resulting dendrogram visualizes hierarchical relationships among cases, revealing both coarse-grained and fine-grained structure. We cut the dendrogram at heights corresponding to 6 and 12 clusters for comparison with K-Means results.
Affinity Propagation. Unlike K-Means and hierarchical clustering, Affinity Propagation does not require specifying the number of clusters a priori. The algorithm identifies ‘exemplars’—representative documents for each cluster—by passing messages between data points until convergence. We used cosine similarity as the affinity measure and set the preference parameter to the median similarity, yielding 47 fine-grained clusters. This granularity complements the coarser K-Means partition, potentially revealing subcategories within major doctrinal areas.
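Both alternative clusterings can be sketched as follows, assuming SciPy and scikit-learn; densifying the TF-IDF matrix is workable at this corpus size but would not scale to much larger collections, and the median-preference setting follows the description above.

```python
# Ward hierarchical clustering and Affinity Propagation (assumes SciPy and scikit-learn).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

X_dense = X.toarray()   # feasible for ~700 documents

# Agglomerative clustering with Ward's linkage, cut at 6 and 12 clusters
Z = linkage(X_dense, method="ward")
ward_6 = fcluster(Z, t=6, criterion="maxclust")
ward_12 = fcluster(Z, t=12, criterion="maxclust")

# Affinity Propagation on a cosine-similarity matrix, preference = median similarity
S = cosine_similarity(X_dense)
ap = AffinityPropagation(affinity="precomputed", preference=np.median(S), random_state=0)
ap_labels = ap.fit_predict(S)
print("Affinity Propagation clusters:", len(set(ap_labels)))
```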
Cluster evaluation. We evaluated clustering quality using both internal metrics (silhouette score, within-cluster sum of squares) and external interpretation (examining whether clusters corresponded to recognized legal categories). We also generated word clouds for each cluster to facilitate semantic interpretation.

3.5. Topic Modeling

Topic modeling provides an alternative unsupervised approach that represents documents as mixtures of latent topics, where each topic is a distribution over terms. Unlike hard clustering, topic models allow documents to exhibit multiple themes in varying proportions.
Non-negative Matrix Factorization (NMF). NMF factorizes the document-term matrix V in R^(n × m) into two non-negative matrices: W in R^(n × k) (document-topic weights) and H in R^(k × m) (topic-term weights), such that:
V ≈ WH
The factorization is obtained by minimizing the reconstruction error, typically using the Frobenius norm:
min_(W,H ≥ 0) ||V − WH||²_F + α||W||1 + β||H||1
where n is the number of documents, m is the vocabulary size, k is the number of topics, and α, β are L1 regularization parameters promoting sparsity. The non-negativity constraint yields additive, parts-based representations that are often more interpretable than methods allowing negative weights. We set the number of topics to 5 based on coherence score optimization across k in {3, 4, …, 10}. NMF was implemented using coordinate descent optimization.
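The following sketch shows an NMF configuration of this kind in scikit-learn (version 1.0 or later, where the alpha_W/alpha_H regularization parameters are available); the regularization strengths are illustrative, and vectorizer refers to the fitted TF-IDF vectorizer from Section 3.3.

```python
# NMF topic model with L1-regularized factors (assumes scikit-learn >= 1.0).
from sklearn.decomposition import NMF

nmf = NMF(
    n_components=5,        # number of topics selected via coherence
    init="nndsvd",
    solver="cd",           # coordinate descent, as described above
    l1_ratio=1.0,          # pure L1 penalty on W and H
    alpha_W=0.1,           # illustrative regularization strengths
    alpha_H=0.1,
    max_iter=500,
    random_state=0,
)
W = nmf.fit_transform(X)   # document-topic weights (n_docs x k)
H = nmf.components_        # topic-term weights (k x vocab)

# Top terms per topic for interpretation
terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(topic_idx, top)
```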
Latent Dirichlet Allocation (LDA). For comparison, we also applied LDA, a probabilistic generative model that assumes documents are mixtures of topics and topics are distributions over words. The generative process assumes:
For each topic k in {1, …, K}: Draw topic-word distribution φk ~ Dirichlet(β)
For each document d in {1, …, D}: Draw topic proportions θd ~ Dirichlet(α)
For each word position n in document d: Draw topic assignment zd,n ~ Multinomial(θd); Draw word wd,n ~ Multinomial(φ_zd,n)
The joint probability of the corpus and latent variables is:
P(W, Z, θ, φ | α, β) = Π_k P(φk|β) Π_d P(θd|α) Π_n P(zd,n|θd) P(wd,n|φ_zd,n)
LDA was implemented using variational Bayes inference with symmetric Dirichlet priors (α = 0.1, β = 0.01). We compared LDA and NMF results to assess robustness of identified topics across different modeling assumptions. To select the number of topics and validate interpretability, we computed topic coherence scores (C_v measure) for models with 5 to 25 topics. Coherence peaked at 15 topics (C_v = 0.58) before declining, suggesting diminishing interpretability with finer granularity. We selected 15 topics for primary analysis, with 10- and 20-topic models examined for robustness. Qualitative validation involved two domain experts (the author and an independent patent attorney) independently labeling each topic based on its top-20 terms; inter-rater agreement was 87% (13/15 topics received identical labels). Topics that both raters found interpretable included “Obviousness/Prior Art,” “Written Description/Enablement,” “ANDA Procedure,” “Claim Construction,” and “Damages/Remedy.” Two topics were labeled as “mixed/unclear” by at least one rater, reflecting the inherent noise in unsupervised topic discovery.
Topic coherence. Topic quality was evaluated using coherence scores, which measure the semantic similarity among a topic’s top terms. We computed C_v coherence, which combines normalized pointwise mutual information with cosine similarity of word vectors, providing a measure that correlates with human judgments of topic interpretability. Topics were also evaluated qualitatively by examining top terms and representative documents.
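For concreteness, the LDA fit and the C_v coherence scan could be implemented along the following lines with gensim, assuming tokenized_docs holds the preprocessed token lists; the pass count and the step size of the topic-number scan are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

coherence_by_k = {}
for k in range(5, 26, 5):                       # scan 5-25 topics (step size illustrative)
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=k,
                   alpha=0.1, eta=0.01,         # symmetric Dirichlet priors
                   passes=10, random_state=42)  # online variational Bayes
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    coherence_by_k[k] = cm.get_coherence()

best_k = max(coherence_by_k, key=coherence_by_k.get)  # k with peak coherence
```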

3.6. Supervised Classification Models

We evaluated multiple classification algorithms to predict case outcomes from textual features, spanning traditional machine learning and deep learning approaches.
Baseline classifiers. Naive Bayes (multinomial) served as a baseline, assuming conditional independence of features given the class label. Despite its simplifying assumptions, Naive Bayes often performs competitively on text classification and provides a reference point for more complex models. We also implemented Logistic Regression with L2 regularization, which models class probabilities as a logistic function of a linear combination of features. The regularization parameter was tuned via cross-validation (C in {0.01, 0.1, 1, 10, 100}).
The logistic regression model estimates the probability of class y given features x as:
P(y = c | x) = exp(wc^T x + bc) / Σc′ exp(wc′^T x + bc′)
The L2-regularized objective function minimizes:
L = −Σi log P(yi | xi) + λ||w||_2^2
Tree-based methods. Decision Trees partition the feature space through recursive binary splits, selecting features and thresholds that maximize information gain (or minimize Gini impurity). While interpretable, single decision trees tend to overfit. Random Forest addresses this by training an ensemble of trees on bootstrapped samples with random feature subsets, aggregating predictions through voting. We used 100 trees with maximum depth tuned via cross-validation.
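A scikit-learn sketch of these baseline and tree-based classifiers, evaluated with the nested stratified cross-validation described in Section 3.8, is shown below; X and y denote the TF-IDF features and outcome labels, and the Random Forest depth grid is an assumed, illustrative choice.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": GridSearchCV(
        LogisticRegression(penalty="l2", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        param_grid={"max_depth": [10, 20, 40, None]}, cv=5),   # grid assumed
}

for name, model in models.items():
    # Hyperparameters are tuned inside each outer fold (nested cross-validation).
    scores = cross_val_score(model, X, y, cv=outer_cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```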
Dense neural network. We implemented a feedforward neural network with two hidden layers (256 and 128 units, respectively), ReLU activation, dropout regularization (rate = 0.3), and softmax output. The network was trained using Adam optimizer with learning rate 0.001, categorical cross-entropy loss, and early stopping based on validation loss (patience = 10 epochs). Input features were TF-IDF vectors, optionally reduced via truncated SVD.
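A Keras sketch of this feedforward network follows; the placement of dropout after each hidden layer, the SVD-reduced input X_svd, and the epoch cap are illustrative assumptions consistent with, but not dictated by, the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dense_model(input_dim, n_classes=3):
    """Two hidden layers (256, 128), ReLU, dropout 0.3, softmax output."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)
# model = build_dense_model(input_dim=X_svd.shape[1])
# model.fit(X_svd, y_onehot, validation_split=0.1,
#           epochs=200, batch_size=32, callbacks=[early_stop])
```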

3.7. Recurrent Neural Network Architecture

To capture sequential structure in legal opinions (how arguments build, how facts relate to analysis, how conclusions follow from reasoning), we implemented a Long Short-Term Memory (LSTM) recurrent neural network.
LSTM architecture. The LSTM cell maintains a cell state ct and hidden state ht that are updated at each time step through gating mechanisms. The gates control information flow as follows:
Forget gate: Determines what information to discard from the cell state:
ft = σ(Wf · [ht−1, xt] + bf)
Input gate: Determines what new information to store:
it = σ(Wi · [ht−1, xt] + bi)
Candidate values: Creates candidate values for updating the cell state:
c̃t = tanh(Wc · [ht−1, xt] + bc)
Cell state update: Combines old state (gated by forget) with new candidates (gated by input):
ct = ft ⊙ ct−1 + it ⊙ c̃t
Output gate: Determines the output based on the cell state:
ot = σ(Wo · [ht−1, xt] + bo)
Hidden state: Final output of the cell:
ht = ot ⊙ tanh(ct)
where σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, and W and b are learned weight matrices and bias vectors, respectively.
The LSTM architecture comprised: (1) an embedding layer mapping token indices to 128-dimensional dense vectors, learned during training; (2) a bidirectional LSTM layer with 64 units in each direction, capturing both forward and backward context; (3) a global max pooling layer to aggregate the sequence representation; (4) a dense layer with 64 units and ReLU activation; (5) dropout (rate = 0.5) for regularization; and (6) a softmax output layer for three-class classification.
Training procedure. The LSTM was trained using Adam optimizer (learning rate = 0.001) with categorical cross-entropy loss. We used batch size 32, maximum 50 epochs, and early stopping with patience 5 based on validation accuracy. Class weights were applied to address outcome imbalance. Training used 80% of data, with 10% held out for validation during training and 10% for final testing, in addition to cross-validation evaluation. To clarify the evaluation strategy: we employ two complementary approaches serving distinct purposes. The 80/10/10 train/validation/test split is used during neural network model development, where the validation set guides hyperparameter tuning and early stopping decisions within a single training run. The 10-fold stratified cross-validation is used for all models (including neural networks) to obtain robust, comparable performance estimates across methods. For neural networks in cross-validation, each fold’s training portion is further split 90/10 for training/validation to enable early stopping. The reported metrics in Table 1 derive from 10-fold cross-validation, ensuring fair comparison across model architectures. The 80/10/10 split was used for development and ablation studies but is not the basis for reported performance figures.
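The LSTM architecture and training configuration described above can be sketched in Keras as follows; vocab_size, max_len, the padded sequence arrays, and class_weights are assumed to come from earlier tokenization and label-processing steps.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_model(vocab_size, max_len, n_classes=3):
    """Embedding (128-d), bidirectional LSTM (64 units/direction), max pooling, softmax."""
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(input_dim=vocab_size, output_dim=128),   # embeddings learned in training
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.GlobalMaxPooling1D(),                               # aggregate the sequence
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                           restore_best_weights=True)
# model = build_lstm_model(vocab_size, max_len)
# model.fit(X_train_seq, y_train_onehot,
#           validation_data=(X_val_seq, y_val_onehot),
#           epochs=50, batch_size=32,
#           class_weight=class_weights, callbacks=[early_stop])
```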

3.8. Evaluation Framework

All classification models were evaluated using 10-fold stratified cross-validation to provide robust performance estimates and enable comparison across methods. Regarding the omission of transformer-based models (e.g., BERT, Legal-BERT): while transformers represent the current state-of-the-art in many NLP benchmarks, we elected not to include them for several reasons. First, our corpus of 698 documents is relatively small for fine-tuning large pretrained models, risking overfitting despite regularization. Second, transformer models’ computational requirements (GPU memory, training time) would have necessitated truncating documents to ∼512 tokens, discarding substantial portions of lengthy patent opinions. Third, our emphasis on interpretability aligns poorly with transformers’ opaque attention patterns; while attention weights are sometimes interpreted as importance scores, this interpretation remains contested. Fourth, the LSTM already captures sequential dependencies, allowing us to assess the value of sequence modeling without the confounds of pretraining on external legal corpora. We acknowledge this as a limitation: future work should evaluate Legal-BERT, Longformer, or hierarchical transformers designed for long documents, which may yield performance gains that outweigh interpretability costs for prediction-focused applications.
Performance metrics. We report accuracy (proportion of correct predictions), macro-averaged precision, recall, and F1 score (averaging across classes to account for imbalance), and area under the ROC curve (AUC) using one-vs-rest formulation for multiclass problems. The formal definitions are:
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1 = 2 × (Precision × Recall)/(Precision + Recall)
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The baseline accuracy (42.3%, the majority class proportion) provides context for interpreting model performance.
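These metrics can be computed per cross-validation fold with scikit-learn, as in the sketch below, where y_true, y_pred, and y_proba denote the held-out labels, predicted labels, and predicted class probabilities for one fold.

```python
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def fold_metrics(y_true, y_pred, y_proba):
    """Accuracy, macro-averaged precision/recall/F1, and one-vs-rest AUC for one fold."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"),
    }
```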
Cross-validation procedure. Stratified 10-fold cross-validation preserves the class distribution in each fold, ensuring that performance estimates are not biased by unrepresentative splits. For neural network models, we performed nested cross-validation: the outer loop estimates generalization performance while the inner loop tunes hyperparameters. We report mean and standard deviation of metrics across folds.
Statistical comparison. Differences between models were assessed using paired t-tests on cross-validation fold results, with Bonferroni correction for multiple comparisons. For comparing model A versus model B across k folds with performance differences d1, …, dk, the test statistic is:
t = d̄ / (s_d / √k)
where d̄ is the mean difference and s_d is the standard deviation of the differences. We consider differences significant at α = 0.05 after correction.
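A minimal implementation of this comparison, assuming per-fold accuracies from two models evaluated on identical folds, might look as follows; n_comparisons is the number of pairwise tests being corrected for.

```python
from scipy import stats

def compare_models(scores_a, scores_b, n_comparisons, alpha=0.05):
    """Paired t-test on fold-wise scores with Bonferroni correction."""
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    p_corrected = min(p_value * n_comparisons, 1.0)   # Bonferroni adjustment
    return t_stat, p_corrected, p_corrected < alpha
```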

3.9. Analytical Pipeline Integration

The multi-layer analytical pipeline integrates these methods to produce both predictive models and interpretable intermediate representations. The pipeline proceeds through four stages: (1) feature extraction transforms raw opinions into TF-IDF vectors and token sequences; (2) unsupervised analysis via clustering and topic modeling reveals corpus structure without using outcome labels; (3) supervised classification predicts outcomes from textual features; (4) interpretation examines whether computationally identified patterns correspond to legal categories and whether clusters/topics help explain classification decisions.
This pipeline embodies a deliberate tradeoff. We could likely achieve higher prediction accuracy using transformer architectures (BERT, Legal-BERT) with end-to-end fine-tuning. However, such models provide limited insight into what features drive predictions or whether learned representations correspond to legal reasoning. Our multi-layer approach sacrifices some predictive performance for interpretability—a tradeoff we consider appropriate for exploratory research aimed at understanding, not just prediction.

4. Results and Analysis

This section presents results from each stage of our analytical pipeline: corpus characteristics, clustering analysis, topic modeling, and classification performance. We emphasize interpretation alongside metrics, examining whether computationally identified patterns correspond to recognized legal categories and what they reveal about the structure of pharmaceutical patent litigation.

4.1. Corpus Characteristics and Descriptive Statistics

The corpus comprises 698 pharmaceutical patent opinions spanning January 2016 through December 2018. After preprocessing, the vocabulary contained 10,493 unique terms (including bigrams), providing a rich feature space for computational analysis. Document lengths varied substantially, with mean length of 4847 tokens (SD = 2156), ranging from relatively brief claim construction orders (~1200 tokens) to comprehensive validity opinions exceeding 12,000 tokens. This variation reflects the heterogeneity of pharmaceutical patent disputes: some cases turn on narrow claim construction questions, while others involve extensive factual development and multi-issue validity challenges.
Outcome distribution showed rough balance across categories: 42.3% of cases were coded as patent holder favorable (validity upheld and/or infringement found), 38.1% as challenger favorable (invalidity and/or non-infringement), and 19.6% as mixed outcomes (split results across claims or issues). The near balance between patent holder and challenger favorable outcomes is consistent with selection effects documented in patent litigation research: cases with clear outcomes tend to settle, leaving adjudicated disputes concentrated around the margin of uncertainty [15]. The substantial proportion of mixed outcomes reflects the multi-claim, multi-issue nature of pharmaceutical patent litigation, where courts may find some claims valid but others invalid, or find validity but no infringement.
Figure 2 displays the term frequency distribution for the top 30 terms after preprocessing. The distribution exhibits the expected Zipfian pattern, with a small number of high-frequency terms and a long tail of rare terms. Notably, the most frequent terms reflect core patent doctrines and pharmaceutical-specific concepts: ‘obviousness’ and its variants appear prominently, as do ‘prior art,’ ‘formulation,’ ‘compound,’ and ‘therapeutic.’ The term ‘ANDA’ (Abbreviated New Drug Application) ranks highly, confirming the corpus’s focus on Hatch-Waxman litigation. These frequency patterns suggest that pharmaceutical patent opinions share a distinctive vocabulary that computational methods can exploit for classification and clustering.
Figure 3 presents bigram frequencies as a treemap visualization, revealing multi-word concepts that structure pharmaceutical patent discourse. The dominance of ‘prior art’ reflects the centrality of novelty and obviousness analysis in patent validity disputes. ‘Written description’ and ‘claim construction’ appear prominently, corresponding to key doctrinal inquiries under 35 U.S.C. § 112 and Markman hearings. The presence of ‘Orange Book’—referring to the FDA’s list of approved drug products with therapeutic equivalence evaluations—confirms the Hatch-Waxman character of the corpus. Other notable bigrams include ‘person skilled’ (referencing the person having ordinary skill in the art, or PHOSITA, central to obviousness analysis), ‘clinical trials,’ ‘effective amount,’ and ‘dosage form,’ reflecting the pharmaceutical subject matter. The bigram analysis demonstrates that meaningful legal concepts span multiple words, justifying our decision to include bigrams in the feature space.

4.2. Clustering Analysis

A central question motivating our unsupervised analysis is whether computational methods can recover meaningful legal structure without supervision—that is, whether clustering algorithms applied to opinion text will identify groupings that correspond to recognized doctrinal categories. The results provide affirmative evidence, though with important qualifications.
K-Means clustering with k = 6 achieved a silhouette score of 0.63, indicating moderately well-defined clusters with reasonable separation. The silhouette score ranges from −1 to +1, where values above 0.5 generally indicate reasonable clustering structure. Our score of 0.63 suggests that pharmaceutical patent opinions do exhibit natural groupings in the TF-IDF feature space, though clusters are not perfectly separated—consistent with the expectation that legal opinions often address multiple issues and thus may exhibit characteristics of several clusters.
Examination of cluster centroids and member documents revealed doctrinally coherent groupings:
Cluster 1 (ANDA/Hatch-Waxman Procedural, n = 186) contained cases focused on Hatch-Waxman procedural issues: 30-month stays, Paragraph IV certification requirements, FDA approval timing, and Orange Book listing disputes. Top terms included ‘ANDA,’ ‘FDA,’ ‘Orange Book,’ ‘approval,’ ‘generic,’ and ‘paragraph IV.’ These cases often involve threshold questions about whether litigation is ripe or whether procedural prerequisites for suit have been satisfied, rather than substantive patent validity or infringement.
Cluster 2 (Obviousness, n = 142) grouped cases centered on obviousness challenges under 35 U.S.C. § 103. Top terms included ‘obvious,’ ‘prior art,’ ‘motivation,’ ‘combine,’ ‘teaching,’ ‘PHOSITA,’ and ‘secondary considerations.’ These opinions typically apply the Graham v. John Deere framework and address whether a person having ordinary skill would have been motivated to combine prior art references to arrive at the claimed invention. The prominence of ‘secondary considerations’ (also known as objective indicia of nonobviousness) reflects the pharmaceutical industry’s reliance on commercial success, long-felt need, and unexpected results to rebut obviousness challenges.
Cluster 3 (Written Description/Enablement, n = 118) contained cases addressing disclosure requirements under 35 U.S.C. § 112. Top terms included ‘written description,’ ‘enablement,’ ‘specification,’ ‘disclose,’ ‘possession,’ ‘undue experimentation,’ and ‘scope.’ These opinions examine whether the patent specification adequately describes the claimed invention and enables a skilled artisan to make and use it without undue experimentation—issues particularly salient for pharmaceutical patents covering genera of compounds or claiming broad therapeutic applications.
Cluster 4 (Claim Construction, n = 97) grouped cases focused on interpreting claim language. Top terms included ‘claim construction,’ ‘term,’ ‘meaning,’ ‘specification,’ ‘ordinary,’ ‘intrinsic,’ and ‘extrinsic.’ These opinions, often arising from Markman hearings, determine the scope of patent claims by construing disputed terms. Claim construction is often dispositive in pharmaceutical patent cases, as small differences in claim scope can determine whether a generic product infringes or falls outside the claims.
Cluster 5 (Preliminary Injunction, n = 89) contained cases addressing motions for preliminary relief. Top terms included ‘preliminary injunction,’ ‘irreparable harm,’ ‘likelihood success,’ ‘balance hardships,’ ‘public interest,’ and ‘stay.’ These opinions apply the four-factor preliminary injunction test, assessing whether patent holders are entitled to enjoin generic entry pending full adjudication. The pharmaceutical context adds distinctive considerations, including public interest in drug access and the economic significance of market exclusivity.
Cluster 6 (Inequitable Conduct/Unenforceability, n = 66) grouped cases involving allegations that patent holders engaged in misconduct before the Patent Office. Top terms included ‘inequitable conduct,’ ‘materiality,’ ‘intent,’ ‘deceive,’ ‘disclosure,’ ‘PTO,’ and ‘duty candor.’ These cases examine whether applicants withheld material information or made misrepresentations during prosecution, potentially rendering patents unenforceable. The relatively small size of this cluster reflects the high bar for proving inequitable conduct following Therasense (2011).
Figure 4 visualizes the clustering results using t-SNE dimensionality reduction, projecting the high-dimensional TF-IDF vectors onto two dimensions for visualization while approximately preserving local structure. The visualization reveals moderately separated clusters with some overlap at boundaries, consistent with the silhouette score and the expectation that opinions often address multiple doctrinal issues. Cases at cluster boundaries may involve, for example, both obviousness and written description challenges, placing them between the corresponding clusters.
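A projection of this kind can be generated roughly as sketched below, assuming the TF-IDF matrix X and the K-Means labels km_labels; the intermediate SVD dimensionality and the t-SNE perplexity are illustrative defaults rather than reported settings.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Reduce the sparse TF-IDF matrix before t-SNE, then project to two dimensions.
X_reduced = TruncatedSVD(n_components=50, random_state=42).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_reduced)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km_labels, cmap="tab10", s=12)
plt.title("t-SNE projection of opinion TF-IDF vectors by K-Means cluster")
plt.show()
```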
Figure 5 presents word cloud visualizations for each K-Means cluster, with word size proportional to term weight in the cluster centroid. These visualizations confirm the doctrinal coherence identified through centroid analysis: the Obviousness cluster is dominated by ‘prior art,’ ‘obvious,’ and ‘motivation’; the Written Description cluster by ‘enablement,’ ‘specification,’ and ‘disclose’; and so on. The word clouds provide an accessible representation of cluster content that aligns with how patent practitioners would categorize these disputes.
Hierarchical clustering using Ward’s linkage provided complementary perspective on corpus structure. Figure 6 displays the dendrogram, showing how cases agglomerate into successively larger groups. The primary bifurcation separates procedural/preliminary matters (Hatch-Waxman procedural, preliminary injunction) from substantive validity/infringement disputes. At finer granularity, the validity branch subdivides into obviousness versus § 112 issues (written description, enablement), while claim construction forms a distinct sub-branch often bridging validity and infringement analysis. This hierarchical structure corresponds to how patent litigators conceptualize case categories, providing validation for the unsupervised results.
Affinity Propagation, which determines the number of clusters automatically, identified 47 clusters with distinct exemplar cases (Figure 7). This finer-grained partition reveals subcategories within the major doctrinal areas. For example, within obviousness cases, separate clusters emerged for compound obviousness (challenging the patentability of the molecule itself), formulation obviousness (challenging dosage forms, delivery mechanisms), and dosage regimen obviousness (challenging dosing schedules). Within written description cases, separate clusters addressed genus-species issues, functional claiming, and post-filing developments. This granularity suggests that pharmaceutical patent doctrine has internally differentiated subspecialties that computational methods can detect.

4.3. Topic Modeling Results

Topic modeling provides a complementary unsupervised perspective, representing each document as a mixture of latent topics rather than assigning it to a single cluster. This mixed-membership approach may better capture the multi-issue nature of patent opinions.
Non-negative Matrix Factorization (NMF) with five topics achieved a mean coherence score of 0.62, indicating reasonable topic quality. Coherence scores above 0.5 generally correspond to topics that humans find interpretable; our score suggests the extracted topics capture meaningful thematic structure. The five topics, characterized by their top terms and representative documents, were:
Topic 1 (Obviousness/Prior Art): Top terms included ‘obvious,’ ‘prior,’ ‘art,’ ‘motivation,’ ‘combine,’ ‘teaching,’ ‘reference,’ ‘skilled,’ ‘secondary.’ This topic captures the core vocabulary of obviousness analysis under § 103, including the motivation-to-combine inquiry central to pharmaceutical compound and formulation challenges. Documents loading highly on this topic typically involve detailed analysis of prior art references and expert testimony regarding what a PHOSITA would have understood.
Topic 2 (Hatch-Waxman/Regulatory): Top terms included ‘ANDA,’ ‘FDA,’ ‘approval,’ ‘generic,’ ‘Orange,’ ‘Book,’ ‘paragraph,’ ‘NDA,’ ‘bioequivalent.’ This topic captures the regulatory vocabulary distinctive to pharmaceutical patent litigation, reflecting the Hatch-Waxman framework that structures the relationship between patent rights and generic drug approval. High-loading documents address ANDA filing requirements, FDA approval timing, and the procedural prerequisites for Paragraph IV litigation.
Topic 3 (§ 112 Validity/Disclosure): Top terms included ‘written,’ ‘description,’ ‘enablement,’ ‘specification,’ ‘disclose,’ ‘claim,’ ‘scope,’ ‘possession,’ ‘undue.’ This topic addresses disclosure requirements, examining whether patent specifications adequately describe and enable the claimed invention. The prominence of ‘scope’ reflects the particular importance of claim scope in written description analysis—whether the specification supports the full breadth of what the claims cover.
Topic 4 (Formulation/Pharmaceutical Science): Top terms included ‘formulation,’ ‘dosage,’ ‘release,’ ‘tablet,’ ‘dissolution,’ ‘bioavailability,’ ‘excipient,’ ‘stability,’ ‘concentration.’ Unlike the doctrinal topics above, this topic captures pharmaceutical science vocabulary—the technical subject matter of drug formulation patents. High-loading documents address controlled-release formulations, dosage optimization, and the formulation science underlying pharmaceutical product development.
Topic 5 (Appellate/Procedural): Top terms included ‘Federal,’ ‘Circuit,’ ‘appeal,’ ‘district,’ ‘court,’ ‘affirm,’ ‘reverse,’ ‘review,’ ‘abuse,’ ‘discretion.’ This topic captures the vocabulary of appellate review, including standards of review and procedural posture. High-loading documents are Federal Circuit opinions reviewing district court decisions, addressing whether lower courts erred in claim construction, fact-finding, or applying legal standards.
Figure 8 displays a document similarity heatmap based on topic distributions. Documents are ordered by hierarchical clustering of their topic vectors, revealing block structure corresponding to cases with similar topic profiles. The diagonal blocks indicate groups of cases sharing dominant topics, while off-diagonal patterns reveal relationships between doctrinally distinct cases that nonetheless share thematic elements. For example, obviousness cases and § 112 cases show moderate similarity, reflecting that both involve validity analysis and share vocabulary about claims and prior disclosures.
We examined the relationship between clustering and topic modeling results by computing correlations between cluster membership and topic loadings. Substantial correspondence emerged: Cluster 2 (Obviousness) correlated strongly with Topic 1 (r = 0.71); Cluster 1 (ANDA/Hatch-Waxman) with Topic 2 (r = 0.68); Cluster 3 (Written Description) with Topic 3 (r = 0.64). These correlations, ranging from r = 0.52 to r = 0.71, indicate that different unsupervised methods converge on related structures, providing modest evidence that identified patterns reflect genuine corpus properties rather than algorithmic artifacts.
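The correspondence check amounts to a point-biserial correlation between a binary cluster-membership indicator and the document-topic loadings, as in the following sketch; km_labels and W denote the cluster assignments and NMF document-topic matrix from the earlier stages.

```python
import numpy as np

def cluster_topic_correlation(km_labels, W, cluster_id, topic_id):
    """Correlation between membership in one cluster and loading on one topic."""
    indicator = (km_labels == cluster_id).astype(float)   # 1 if document is in the cluster
    return np.corrcoef(indicator, W[:, topic_id])[0, 1]    # point-biserial r

# e.g., the Obviousness cluster versus the Obviousness/Prior Art topic
# (array indices are illustrative and depend on fitting order)
# r = cluster_topic_correlation(km_labels, W, cluster_id=1, topic_id=0)
```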

4.4. Classification Performance

Moving from unsupervised exploration to supervised prediction, we evaluated multiple classifiers on the task of predicting case outcomes from textual features. Table 1 summarizes performance metrics across models.
Summary of key findings: The results reveal a clear performance hierarchy across model architectures. The LSTM achieves the highest accuracy (89.0%) and AUC (0.90), followed by Random Forest (85.0%, 0.87), indicating that both sequential modeling and ensemble methods provide substantial advantages for legal text classification. Traditional approaches—Naive Bayes (67.5%), Decision Tree (72.1%), and Logistic Regression (78.2%)—perform adequately but leave considerable room for improvement. Critically, all models substantially outperform the 42.3% majority-class baseline, confirming that pharmaceutical patent opinion text contains extractable signals predictive of case outcomes. The relatively narrow gap between F1 and Recall scores across models suggests balanced performance across outcome categories rather than systematic bias toward particular predictions.
All models substantially exceeded the baseline accuracy of 42.3% (the majority class proportion), demonstrating that textual features carry meaningful signal about case outcomes. Performance varied considerably across model architectures, with a 21.5 percentage point spread between the weakest (Naive Bayes, 67.5%) and strongest (LSTM, 89.0%) classifiers.
Naive Bayes, despite its simplifying conditional independence assumption, achieved 67.5% accuracy—substantially above baseline but well below more sophisticated methods. The model’s relatively weak performance likely reflects the inadequacy of the independence assumption for legal text, where the meaning of terms depends heavily on context. ‘Obvious’ means something different when preceded by ‘not’ than when used affirmatively; bag-of-words Naive Bayes cannot capture such dependencies.
Logistic Regression (78.2%) and Decision Trees (72.1%) performed moderately, with Logistic Regression benefiting from L2 regularization that prevents overfitting to the high-dimensional feature space. Random Forest achieved 85.0% accuracy, demonstrating the value of ensemble methods that aggregate predictions across multiple trees trained on bootstrapped samples with random feature subsets. The ensemble approach reduces variance and improves generalization.
The dense neural network (80.0%) performed comparably to Random Forest, suggesting that the additional representational capacity of neural architectures provides limited benefit when working with bag-of-words features. However, the LSTM achieved substantially higher accuracy (89.0%), with the difference statistically significant (paired t-test, p < 0.01). The LSTM’s advantage indicates that sequential structure matters: how arguments build across an opinion, how facts relate to analysis, how conclusions follow from reasoning—these sequential patterns carry predictive signal that bag-of-words representations discard.
The 89% accuracy is noteworthy in context. Legal prediction is inherently uncertain: cases that reach judgment are selected precisely because outcomes are difficult to predict (if outcomes were obvious, parties would settle). Achieving 89% accuracy on this selected, difficult subset suggests that textual features capture substantial information about judicial decision-making—though whether this reflects genuine legal reasoning or merely statistical regularities remains an open question we address in the Discussion.
Examining performance by outcome category reveals that models performed best on patent holder favorable outcomes (F1 = 0.91 for LSTM), somewhat worse on challenger favorable outcomes (F1 = 0.86), and worst on mixed outcomes (F1 = 0.78). The lower performance on mixed outcomes is unsurprising: mixed cases involve multiple claims or issues with different results, presenting heterogeneous textual patterns that resist clean classification. The asymmetry between patent holder and challenger favorable predictions may reflect linguistic patterns in how courts frame holdings—a possibility worth exploring in future work.
Feature importance analysis. To link predictions to interpretable textual signals, we examined feature importances from Random Forest and coefficient magnitudes from Logistic Regression. For patent holder favorable predictions, the most predictive terms included “not obvious,” “secondary considerations,” “commercial success,” “unexpected results,” and “long-felt need”—precisely the vocabulary associated with successful validity defenses. For challenger favorable predictions, high-importance terms included “prima facie obvious,” “motivation to combine,” “reasonable expectation,” and “lacks written description.” These patterns align with doctrinal expectations: the models appear to learn associations between legal terminology and outcomes that correspond to how patent law operates. However, we caution against over-interpreting feature weights as revealing causal mechanisms; the models may exploit surface correlations (e.g., judges who find for patentees may systematically use different language) rather than capturing the reasoning that drives decisions.
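This inspection amounts to ranking vocabulary terms by the fitted models’ weights, roughly as sketched below; rf, logreg, and vectorizer denote the fitted Random Forest, Logistic Regression, and TF-IDF vectorizer sharing a single vocabulary.

```python
import numpy as np

terms = np.array(vectorizer.get_feature_names_out())

# Random Forest: impurity-based importances, aggregated over all classes.
rf_top_terms = terms[np.argsort(rf.feature_importances_)[::-1][:20]]

# Logistic Regression: per-class coefficients; large positive weights push
# predictions toward the corresponding outcome class.
for row, label in zip(logreg.coef_, logreg.classes_):
    class_top_terms = terms[np.argsort(row)[::-1][:20]]
    print(label, ":", ", ".join(class_top_terms))
```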
Error analysis: Mixed outcome misclassifications. Given the lower performance on the mixed category, we examined the 30 cases (22% of mixed outcomes) that the LSTM misclassified. The most common error pattern (n = 18) was misclassifying mixed outcomes as patent holder favorable; these cases typically involved multiple validity challenges where most claims survived but one or two were invalidated. The textual emphasis on successful validity defenses apparently overwhelmed the signal from partial invalidity. The reverse error—misclassifying mixed as challenger favorable (n = 9)—occurred in cases where the patentee won on validity but lost on non-infringement, with the infringement analysis dominating the opinion text. Three cases were misclassified due to unusual procedural postures (e.g., preliminary injunction denials that did not clearly signal ultimate outcomes). These patterns suggest that the “mixed” category is genuinely heterogeneous, and that models struggle when textual emphasis does not align with the ultimate classification. Future work might benefit from multi-label classification or hierarchical outcome coding to better capture this complexity.

5. Discussion

The results raise important questions about what computational methods can and cannot reveal about legal reasoning, interpreted in light of the acknowledged limitations.
Positioning relative to existing traditions. This work occupies a deliberate middle ground between two established traditions in computational legal research. The prediction-oriented tradition, exemplified by the Supreme Court forecasting of Katz et al. [44] and the European Court of Human Rights predictions of Aletras et al. [45], emphasizes maximizing predictive accuracy, often treating the model as a black box whose internal representations need not correspond to legal concepts. The legal-analytic tradition, including doctrinal analysis and qualitative case synthesis, prioritizes understanding legal reasoning but typically lacks scalability and reproducibility. Our approach attempts to bridge these traditions: we employ machine learning for scalable pattern detection while insisting that intermediate representations (clusters, topics) be interpretable in doctrinal terms. This positioning entails tradeoffs—we likely sacrifice predictive accuracy achievable with transformer models, while our interpretations remain more tentative than close doctrinal reading would provide. We view this middle ground as appropriate for exploratory research: demonstrating what computational methods can reveal about legal reasoning structure, while acknowledging what remains beyond their reach.

5.1. Interpreting the Findings

At the descriptive level, pharmaceutical patent opinions exhibit substantial textual regularity. Cases involving similar legal issues share vocabulary and structural features that algorithms detect. The clusters correspond to categories patent attorneys would immediately recognize. That unsupervised algorithms recover these categories suggests doctrinal structure leaves measurable traces in text. At the predictive level, the 89% LSTM accuracy substantially exceeds random baseline. However, high accuracy does not establish that models capture legal reasoning substantively—like a barometer predicting weather without ‘understanding’ meteorology, classifiers might exploit surface correlations without engaging doctrinal analysis.

5.2. The Prediction-Understanding Gap

A central tension pervades AI and Law research: the gap between predicting outcomes and understanding reasoning. The LSTM achieves 89% accuracy but offers no explanation of why particular outcomes obtain. This opacity distinguishes our approach from knowledge-based systems like HYPO and CATO, which represented legal knowledge explicitly. The interpretable intermediate outputs—clusters and topics—partially bridge this gap but remain descriptive rather than explanatory.

6. Scope and Limitations

Scientific integrity requires explicit acknowledgment of a study’s boundaries. This section delineates the scope of our investigation and candidly assesses limitations.

6.1. Study Scope

Domain scope. We analyze pharmaceutical patent cases exclusively, focusing on disputes arising under the Hatch-Waxman framework governing generic drug approval and patent challenges. Pharmaceutical patents present distinctive characteristics—technical complexity, regulatory overlay, concentrated litigation venues—that may not generalize to other patent domains. Software patents, for example, involve different claim structures, different prior art landscapes, and different invalidity doctrines (particularly §101 subject matter eligibility). Mechanical and electrical patents present yet other patterns. Extrapolation of our findings to these domains would require separate validation studies. Similarly, findings should not be extended to other areas of intellectual property law—trademark, copyright, trade secret—or to legal domains beyond intellectual property without empirical confirmation.
Temporal scope. The corpus spans January 2016 through December 2018, a period selected for doctrinal coherence and data completeness. This window postdates both the Supreme Court’s Alice Corp. v. CLS Bank decision (2014), which transformed patent eligibility analysis, and full implementation of the America Invents Act (2013), which restructured patent prosecution and administrative review. The window predates COVID-19 pandemic disruptions beginning in 2020 that affected court operations and pharmaceutical litigation patterns. However, doctrinal evolution continues: Federal Circuit jurisprudence on obviousness, written description, and claim construction has developed since 2018, and new pharmaceutical modalities (gene therapies, cell therapies, personalized medicine) raise novel patent issues not well represented in our corpus. Findings derived from 2016–2018 cases may not fully capture contemporary judicial reasoning.
Jurisdictional scope. We examine United States federal district court and Federal Circuit opinions only. State courts, which occasionally address pharmaceutical patent issues in parallel proceedings, fall outside our analysis. The Patent Trial and Appeal Board (PTAB), which adjudicates inter partes review and post-grant review proceedings that significantly affect pharmaceutical patent validity, is excluded—an important limitation given the substantial overlap between district court and PTAB challenges to the same patents. Foreign jurisdictions with different patent standards, claim construction approaches, and litigation procedures are similarly excluded.

6.2. Data and Methodological Limitations

Corpus composition. The 698 cases represent published judicial opinions only—a selected subset of pharmaceutical patent disputes. This selection excludes several important categories. First, settlements comprise the vast majority of filed cases; the literature estimates that fewer than 5% of patent cases proceed to judgment. Settled cases may differ systematically from adjudicated cases in ways that affect generalizability. Second, unpublished dispositions—orders granting summary judgment without opinion, bench rulings, and other judicial actions not resulting in published opinions—are excluded, potentially biasing the corpus toward more complex or contested matters. Third, PTAB proceedings, which now resolve a substantial proportion of pharmaceutical patent validity challenges, operate under different procedural rules and evidentiary standards than district court litigation. Fourth, consent judgments, stipulated dismissals, and abandoned cases leave no textual record amenable to analysis.
Outcome coding. Our three-category coding scheme—patent holder favorable, challenger favorable, mixed—sacrifices considerable nuance for analytical tractability. Actual case outcomes involve multiple claims with different results, partial findings on different issues, and varying levels of certainty. A case might find two claims valid but one invalid, find non-infringement of valid claims, or resolve on procedural grounds without reaching merits. Our aggregation to three categories loses this complexity. Alternative coding schemes—claim-level outcome coding, issue-level analysis, continuous measures of outcome favorability—might reveal different patterns than our case-level approach.
Preprocessing and parameters. Natural language processing involves numerous methodological choices, each affecting results. Our stopword list excluded common terms that might nonetheless carry doctrinal significance in particular contexts. Porter stemming conflates some legally distinct concepts (e.g., “obvious” and “obviously” might have different implications). The selection of k = 6 for K-Means clustering and 5 topics for NMF, while supported by silhouette scores and coherence metrics, reflects tradeoffs between granularity and interpretability; alternative parameter choices might reveal different corpus structure. TF-IDF weighting emphasizes distinctive terms but may underweight common but important legal vocabulary.
Model selection. We did not implement state-of-the-art transformer architectures (BERT, Legal-BERT) that achieve higher accuracy on legal classification tasks, nor did we conduct exhaustive hyperparameter optimization. Our focus on interpretability led us toward simpler models whose decision processes could be examined, but this choice limits performance comparison with contemporary deep learning approaches.

6.3. Validation and Interpretive Limitations

Validation. Although 10-fold stratified cross-validation provides robust performance estimates within the analyzed corpus, it does not guarantee generalization to out-of-sample data. External validation—testing on cases from different time periods, different courts, or different doctrinal areas—would strengthen confidence in findings but was not conducted. Moreover, we lack formal expert evaluation of cluster coherence and topic interpretability. While we characterized clusters as corresponding to recognized doctrinal categories based on centroid terms and sample documents, systematic validation by patent litigation practitioners or legal scholars might reveal characterizations that domain experts would contest or refine.
Correlation versus causation. Our analyses identify statistical associations—between textual features and outcomes, between cluster membership and doctrinal categories—but cannot establish causal relationships. If cases using certain vocabulary tend toward particular outcomes, we cannot determine whether the vocabulary choice causes the outcome, reflects underlying case strength, or correlates with both through confounding factors such as court, judge, or litigation strategy. Causal inference would require experimental or quasi-experimental designs beyond our observational framework.
Prediction versus understanding. The 89% classification accuracy demonstrates that textual features carry substantial information about case outcomes, but high accuracy does not imply that our models capture legal reasoning in any meaningful sense. A model might exploit surface correlations—longer opinions favor challengers because complex cases require extensive analysis—without engaging the substantive legal arguments that judges actually weigh. The gap between prediction and understanding remains fundamental: we know what outcomes will likely be, not why those outcomes are legally correct or how judicial reasoning actually proceeds. Critically, our reported accuracy reflects in-sample prediction—classifying opinions after they have been written—not real-world forecasting of case outcomes before decisions are rendered. The practical forecasting task would require predicting from pre-decision materials (complaints, briefs, motions) rather than from judicial opinions that already contain the outcome. Our models learn associations between how judges write about cases and how they rule, which differs fundamentally from predicting how judges will rule based on case characteristics available ex ante. For litigation support applications, models would need training on party submissions with outcome labels, a substantially different task that may yield lower accuracy. We emphasize this distinction to prevent misinterpretation: demonstrating that opinion text predicts outcomes does not establish that outcomes are predictable before judges write their opinions.
Text versus context. Judicial opinions represent only one component of the litigation ecosystem. Our analysis abstracts from party briefs that frame issues and present arguments, from evidentiary records developed through discovery and expert testimony, from oral arguments that sometimes shift judicial thinking, and from strategic considerations—venue selection, claim framing, settlement behavior—that shape which cases reach judgment and how they are presented. Opinions record judicial reasoning post hoc, not the full deliberative process.
Normative abstention. Our analysis is purely descriptive: we characterize patterns in judicial decision-making without evaluating whether those patterns are legally correct, consistent with precedent, or socially desirable. A finding that obviousness challenges succeed more frequently in certain linguistic contexts says nothing about whether those outcomes properly apply the Graham factors or implement sound patent policy.

7. Conclusions

This paper set out to explore how AI and ML methods might illuminate aspects of legal reasoning. Using pharmaceutical patent litigation as an illustrative domain, we developed and evaluated a multi-layer analytical pipeline. The investigation was deliberately exploratory, seeking to understand what contemporary computational methods can and cannot reveal about judicial decision-making.

7.1. Principal Conclusions

First, pharmaceutical patent opinions exhibit substantial and recoverable textual structure. Unsupervised clustering identified case groupings corresponding to recognized doctrinal categories; topic modeling extracted interpretable thematic dimensions. These findings suggest that doctrinal organization leaves measurable traces in judicial writing.
Second, textual features predict case outcomes with substantial accuracy. Classification models achieved 67.5–89.0% accuracy, substantially exceeding the 42.3% baseline. The LSTM’s advantage indicates that sequential structure—how arguments unfold—carries information beyond word frequencies alone.
Third, different analytical methods converge on related structures. Convergence across clustering, topic modeling, and classification provides modest evidence that identified patterns reflect genuine corpus properties rather than algorithmic artifacts.
Fourth, a substantial gap remains between prediction and understanding. High accuracy does not establish that models capture legal reasoning in any deep sense. We predict outcomes; we do not explain them in the way a legal brief explains a conclusion.
Fifth, the value of computational legal analysis lies in augmentation rather than replacement. ML methods can supplement human legal judgment—identifying precedents, revealing framing strategies, estimating outcome probabilities—without substituting for normative reasoning.

7.2. Implications

For AI and Law scholarship: The prediction-understanding gap may reflect something fundamental about the difference between statistical pattern recognition and normative reasoning. Our multi-method approach offers a template for interpretable results alongside predictive accuracy. For legal practice: Practitioners might cautiously incorporate computational tools for research, framing analysis, and risk assessment, while understanding their limitations. For judicial administration: Courts might use topic modeling and clustering for docket tracking and case management without outcome prediction concerns. For legal education: Law schools should help students understand what ML methods can and cannot do. For policy: Regulatory frameworks should distinguish between assistive tools supporting human decision-makers and autonomous systems purporting to replace them.

7.3. Future Research Directions

Domain extension: Apply methods to other legal domains to test generalizability. Temporal extension: Track how computational patterns evolve alongside doctrinal change. Behavioral integration: Incorporate judge-level, court-level, and litigant-level features. Expert validation: Systematically evaluate whether clusters and topics align with practitioner judgment. Causal inference: Move beyond correlation using experimental or quasi-experimental designs. Architectural advances: Implement transformers, hierarchical attention networks, and graph neural networks. Multimodal analysis: Integrate briefs, transcripts, and prosecution histories. Normative analysis: Examine whether computationally identified patterns are normatively desirable.

7.4. Closing Reflections

Legal reasoning occupies a distinctive place in human cognition, combining factual analysis with normative judgment, precedent with principle, rule-following with equity. Our findings suggest both possibility and limitation. The possibility: contemporary ML methods can extract meaningful patterns from legal text that correspond to recognized categories, predict outcomes with useful accuracy, and provide interpretable intermediate representations. The limitation: these methods do not—and perhaps cannot—replicate the justificatory work that distinguishes legal reasoning from mere prediction.
This distinction matters for deployment. Systems that augment human judgment pose different questions than systems purporting to replace it. The former may be valuable despite the prediction-understanding gap; the latter would be troubling precisely because of it.
We have pursued this research with appropriate humility. Legal reasoning is among humanity’s most sophisticated cognitive and institutional achievements. We do not claim to have replicated it computationally; we claim only to have illuminated some patterns in one corner of legal practice. The gap between statistical prediction and legal understanding remains wide, and intellectual honesty requires acknowledging what we do not know alongside what we have learned.
The journey toward understanding how AI might engage with legal reasoning is long. This paper marks one step—an exploration of what machine learning methods can reveal when applied to pharmaceutical patent litigation, offered as a contribution to an evolving research agenda rather than a definitive answer. We hope others will build on this work, extending it to new domains, refining its methods, and addressing its limitations. The conversation between artificial intelligence and legal reasoning has just begun.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Raghupathi, W.; Schkade, L.L. Designing artificial intelligence applications in law: A systemic view. Syst. Pract. 1992, 5, 61–78. [Google Scholar] [CrossRef]
  2. Llewellyn, K.N. The Common Law Tradition: Deciding Appeals; Little, Brown: Boston, MA, USA, 1960. [Google Scholar]
  3. Hart, H.L.A. The Concept of Law; Oxford University Press: Oxford, UK, 1961. [Google Scholar]
  4. Eisenberg, T.; Lanvers, C. What is the settlement rate and why should we care? J. Empir. Leg. Stud. 2009, 6, 111–146. [Google Scholar] [CrossRef]
  5. Merges, R.P. Commercial success and patent standards: Economic perspectives on innovation. Calif. Law Rev. 1988, 76, 803–876. [Google Scholar] [CrossRef]
  6. Mandel, G.N. Patently non-obvious: Empirical demonstration that the hindsight bias renders patent decisions irrational. Ohio State Law J. 2006, 67, 1391–1463. [Google Scholar] [CrossRef]
  7. Love, B.J.; Yoon, J. Predictably expensive: A critical look at patent litigation in the Eastern District of Texas. Stanf. Technol. Law Rev. 2017, 20, 1. [Google Scholar] [CrossRef]
  8. Moore, K.A. Forum shopping in patent cases: Does geographic choice affect innovation? N. C. Law Rev. 2001, 79, 889–938. [Google Scholar]
  9. Segal, J.A.; Spaeth, H.J. The Supreme Court and the Attitudinal Model Revisited; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
  10. Epstein, L.; Landes, W.M.; Posner, R.A. The Behavior of Federal Judges: A Theoretical and Empirical Study of Rational Choice; Harvard University Press: Cambridge, MA, USA, 2013. [Google Scholar]
  11. Englich, B.; Mussweiler, T.; Strack, F. Playing dice with criminal sentences: The influence of irrelevant anchors on experts’ judicial decision making. Personal. Soc. Psychol. Bull. 2006, 32, 188–200. [Google Scholar] [CrossRef] [PubMed]
  12. Simon, D. A third view of the black box: Cognitive coherence in legal decision making. Univ. Chic. Law Rev. 2004, 71, 511–586. [Google Scholar]
  13. Danziger, S.; Levav, J.; Avnaim-Pesso, L. Extraneous factors in judicial decisions. Proc. Natl. Acad. Sci. USA 2011, 108, 6889–6892. [Google Scholar] [CrossRef] [PubMed]
  14. Moore, K.A. Judges, juries, and patent cases: An empirical peek inside the black box. Mich. Law Rev. 2000, 99, 365–409. [Google Scholar] [CrossRef]
  15. Priest, G.L.; Klein, B. The selection of disputes for litigation. J. Leg. Stud. 1984, 13, 1–55. [Google Scholar] [CrossRef]
  16. Sunstein, C.R.; Schkade, D.; Ellman, L.M.; Sawicki, A. Are Judges Political? An Empirical Analysis of the Federal Judiciary; Brookings Institution Press: Washington, DC, USA, 2006. [Google Scholar]
  17. Kim, P.T. Deliberation and strategy on the United States Courts of Appeals: An empirical exploration of panel effects. Univ. Pa. Law Rev. 2009, 157, 1319–1381. [Google Scholar] [CrossRef]
  18. Turtle, H. Text retrieval in the legal world. Artif. Intell. Law 1995, 3, 5–54. [Google Scholar] [CrossRef]
  19. Hafner, C.D.; Wise, V.J. SmartLaw: Adapting ‘classic’ expert system techniques for the legal research domain. In Proceedings of the 4th International Conference on Artificial Intelligence and Law; ACM: New York, NY, USA, 1993; pp. 133–141. [Google Scholar]
  20. Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Online, 2020; pp. 2898–2904. [Google Scholar]
  21. Zheng, L.; Guha, N.; Anderson, B.R.; Henderson, P.; Ho, D.E. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law; Association for Computing Machinery: New York, NY, USA, 2021; pp. 159–168. [Google Scholar]
  22. Hemphill, C.S.; Sampat, B.N. Evergreening, patent challenges, and effective market life in pharmaceuticals. J. Health Econ. 2012, 31, 327–339. [Google Scholar] [CrossRef] [PubMed]
  23. Bulow, J. The gaming of pharmaceutical patents. Innov. Policy Econ. 2004, 4, 145–187. [Google Scholar] [CrossRef]
  24. Grabowski, H.G. Follow-on biologics: Data exclusivity and the balance between innovation and competition. Nat. Rev. Drug Discov. 2008, 7, 479–488. [Google Scholar] [CrossRef]
  25. Allison, J.R.; Lemley, M.A.; Moore, K.A.; Trunkey, R.D. Valuable patents. Georget. Law J. 2004, 92, 435–479. [Google Scholar] [CrossRef]
  26. Allison, J.R.; Lemley, M.A. Empirical evidence on the validity of litigated patents. AIPLA Q. J. 1998, 26, 185–275. [Google Scholar] [CrossRef]
  27. Hemphill, C.S. Paying for delay: Pharmaceutical patent settlement as a regulatory design problem. NYU Law Rev. 2006, 81, 1553–1623. [Google Scholar]
  28. Raghupathi, V.; Zhou, Y.; Raghupathi, W. Legal decision support: Exploring big data analytics approach to modeling pharma patent validity cases. IEEE Access 2018, 6, 41518–41528. [Google Scholar] [CrossRef]
  29. Carrier, M.A. Two puzzles resolved: Of the Hatch-Waxman Act and pharmaceutical innovation. Antitrust Law J. 2008, 75, 409–450. [Google Scholar]
  30. Dworkin, R. Taking Rights Seriously; Harvard University Press: Cambridge, MA, USA, 1977. [Google Scholar]
  31. Matheson, A. Applied artificial intelligence for law: The use of expert systems and decision trees in legal practice and the potential of artificial neural networks. Leg. Issues J. 2023, 9, 2. Available online: https://www.legalissuesjournal.com/articles/i18-032023/ (accessed on 3 January 2026).
  32. Sergot, M.J.; Sadri, F.; Kowalski, R.A.; Kriwaczek, F.; Hammond, P.; Cory, H.T. The British Nationality Act as a logic program. Commun. ACM 1986, 29, 370–386. [Google Scholar] [CrossRef]
  33. Susskind, R.E. Expert Systems in Law: A Jurisprudential Inquiry; Oxford University Press: Oxford, UK, 1987. [Google Scholar]
  34. Gardner, A.v.d.L. An Artificial Intelligence Approach to Legal Reasoning; MIT Press: Cambridge, MA, USA, 1987. [Google Scholar]
  35. Raghupathi, W.; Schkade, L.L. The SKADE LITorSET expert system for corporate litigate or settle decisions. Int. J. Intell. Syst. Account. Financ. Manag. 1992, 1, 247–259. [Google Scholar] [CrossRef]
  36. Raghupathi, W.; Mykytyn, P.P.; Harbison-Briggs, K. A blackboard model of reasoning in product liability claims evaluation. Appl. Intell. 1993, 3, 249–261. [Google Scholar] [CrossRef]
  37. Ashley, K.D. Modeling Legal Argument: Reasoning with Cases and Hypotheticals; MIT Press: Cambridge, MA, USA, 1990. [Google Scholar]
  38. Aleven, V. Teaching Case-Based Argumentation Through a Model and Examples. Ph.D. Thesis, University of Pittsburgh, Pittsburgh, PA, USA, 1997. [Google Scholar]
  39. Rissland, E.L.; Ashley, K.D.; Loui, R.P. AI and Law: A fruitful synergy. Artif. Intell. 2003, 150, 1–15. [Google Scholar] [CrossRef]
  40. Bench-Capon, T.J.M.; Sartor, G. A model of legal reasoning with cases incorporating theories and values. Artif. Intell. 2003, 150, 97–143. [Google Scholar] [CrossRef]
  41. Prakken, H.; Sartor, G. Modelling reasoning with precedents in a formal dialogue game. Artif. Intell. Law 1998, 6, 231–287. [Google Scholar] [CrossRef]
  42. Bruninghaus, S.; Ashley, K.D. Predicting outcomes of case-based legal arguments. In Proceedings of the 9th International Conference on Artificial Intelligence and Law; ACM: New York, NY, USA, 2003; pp. 233–242. [Google Scholar]
  43. Raghupathi, W.; Schkade, L.L.; Bapi, R.S.; Levine, D.S. Exploring connectionist approaches to legal decision making. Behav. Sci. 1991, 36, 133–139. [Google Scholar] [CrossRef]
  44. Katz, D.M.; Bommarito, M.J., II; Blackman, J. A general approach for predicting the behavior of the Supreme Court. PLoS ONE 2017, 12, e0174698. [Google Scholar] [CrossRef]
  45. Aletras, N.; Tsarapatsanis, D.; Preotiuc-Pietro, D.; Lampos, V. Predicting judicial decisions of the European Court of Human Rights. PeerJ Comput. Sci. 2016, 2, e93. [Google Scholar] [CrossRef]
  46. Chalkidis, I.; Androutsopoulos, I. A deep learning approach to contract element extraction. In Proceedings of JURIX 2017; IOS Press: Amsterdam, The Netherlands, 2017; pp. 155–164. [Google Scholar]
  47. Nay, J.J. Natural language processing and machine learning for law and policy texts. In Legal Informatics; SSRN Working Paper; Cambridge University Press: Cambridge, UK, 2018. [Google Scholar]
  48. Wei, F.; Qin, H.; Ye, S.; Zhao, H. Empirical study of deep learning for text classification in legal document review. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3317–3320. [Google Scholar]
  49. Chalkidis, I.; Jana, A.; Hartung, D.; Bommarito, M.; Androutsopoulos, I.; Katz, D.; Aletras, N. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4317–4330. [Google Scholar]
  50. Katz, D.M.; Bommarito, M.J.; Gao, S.; Arredondo, P. GPT-4 passes the bar exam. Philos. Trans. R. Soc. A 2024, 382, 20230254. [Google Scholar] [CrossRef] [PubMed]
  51. Guthrie, C.; Rachlinski, J.J.; Wistrich, A.J. Blinking on the bench: How judges decide cases. Cornell Law Rev. 2007, 93, 1. [Google Scholar]
Figure 1. Text-analytics pipeline for legal case processing. Multi-layer analytical pipeline integrating feature extraction, clustering, topic modeling, and classification. Key takeaway: The pipeline produces interpretable intermediate outputs (clusters, topics) alongside predictive classifications, enabling validation of learned representations against domain knowledge before relying on outcome predictions.
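To make the layered design in Figure 1 concrete, the sketch below shows, using scikit-learn, how a shared feature layer can feed clustering, topic-modeling, and classification layers. It is a minimal illustration on a toy corpus: the variable names (docs, outcomes) and all parameter values are assumptions for exposition, not the study's actual configuration.

```python
# Minimal sketch of the multi-layer pipeline in Figure 1 (toy data only).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["obviousness prior art combination",
        "written description support specification",
        "anda paragraph iv certification orange book",
        "claim construction disputed term"] * 5          # placeholder corpus
outcomes = [0, 1, 0, 1] * 5                              # placeholder labels

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                            # feature layer

clusters = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)   # clustering layer
topics = LatentDirichletAllocation(n_components=4, random_state=42).fit_transform(
    CountVectorizer(stop_words="english").fit_transform(docs))               # topic layer
classifier = LogisticRegression(max_iter=1000).fit(X, outcomes)              # classification layer
```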
Figure 2. Term frequency distribution for top 30 terms after preprocessing. The distribution follows a Zipfian pattern with patent doctrine and pharmaceutical terminology dominating.
Figure 3. Bigram frequency treemap. Core patent doctrines (‘prior art,’ ‘written description,’ ‘claim construction’) and Hatch-Waxman terminology (‘Orange Book,’ ‘ANDA’) dominate the multi-word vocabulary. Key takeaway: Multi-word legal phrases capture doctrinal concepts that single words cannot; these bigrams correspond directly to the validity doctrines (obviousness, written description) that structure pharmaceutical patent analysis.
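As an illustration of how the frequency counts behind Figures 2 and 3 can be produced, the following sketch counts unigrams and bigrams with scikit-learn's CountVectorizer. The two-document corpus is a toy placeholder; the actual analysis would run over the full set of preprocessed opinions.

```python
# Minimal sketch of unigram/bigram frequency counting (cf. Figures 2 and 3).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the prior art renders the asserted claims obvious",
    "the written description supports the claim construction",
]  # toy placeholder for the preprocessed opinions

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(docs)

# Sum counts across documents and list the most frequent terms and bigrams.
totals = np.asarray(counts.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
for term, freq in sorted(zip(terms, totals), key=lambda t: -t[1])[:10]:
    print(term, freq)
```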
Figure 4. K-Means clustering visualization (k = 6) with dimensionality reduction via t-SNE. Cluster membership is indicated by color and shape. Clusters exhibit moderate separation with some overlap, reflecting the multi-issue nature of patent opinions. Key takeaway: The visible cluster structure demonstrates that pharmaceutical patent opinions naturally group by doctrinal focus; the partial overlap is expected and appropriate given that many cases address multiple legal issues simultaneously.
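One plausible way to produce the grouping and two-dimensional layout shown in Figure 4 is K-Means on TF-IDF features followed by a t-SNE projection. The sketch below illustrates this on a toy corpus; apart from k = 6, which matches the figure, the parameter values are illustrative assumptions.

```python
# Minimal sketch of K-Means (k = 6) plus t-SNE projection (cf. Figure 4).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

docs = [
    "prior art obviousness motivation to combine",
    "written description enablement specification",
    "claim construction intrinsic evidence",
    "anda paragraph iv certification orange book",
    "infringement doctrine of equivalents",
    "preliminary injunction irreparable harm",
] * 6  # 36 toy documents so the sketch executes

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)
labels = km.labels_

# t-SNE needs a dense array; perplexity must stay below the number of documents.
coords = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X.toarray())
# Plot `coords` colored by `labels` to obtain a view analogous to Figure 4.
```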
Figure 5. Word cloud visualizations for each K-Means cluster, with term size proportional to centroid weight. Clusters exhibit thematic coherence corresponding to recognized doctrinal categories.
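The word clouds in Figure 5 visualize the heaviest-weighted terms of each cluster centroid. A compact, self-contained sketch of how such per-cluster term lists can be extracted is shown below, again on a toy corpus with assumed parameters.

```python
# Minimal sketch of listing the top-weighted centroid terms per cluster
# (the quantities visualized in Figure 5), using a toy corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "prior art obviousness motivation to combine",
    "written description enablement specification",
    "anda paragraph iv certification orange book",
] * 4  # toy corpus so the sketch executes

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

terms = vectorizer.get_feature_names_out()
for cid, centroid in enumerate(km.cluster_centers_):
    top = np.argsort(centroid)[::-1][:5]          # five heaviest terms
    print(f"Cluster {cid}:", ", ".join(terms[i] for i in top))
```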
Figure 6. Hierarchical clustering dendrogram (Ward’s method). The primary bifurcation separates procedural matters from substantive patent disputes; further subdivisions reveal doctrinal structure within each branch.
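For the hierarchical structure in Figure 6, Ward's method can be applied with SciPy's agglomerative clustering routines. The sketch below is a minimal illustration in which a small random matrix stands in for the document-term features.

```python
# Minimal sketch of Ward hierarchical clustering and a dendrogram (cf. Figure 6).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X_dense = rng.random((20, 50))        # toy stand-in for a dense TF-IDF matrix

Z = linkage(X_dense, method="ward")   # Ward's method minimizes within-cluster variance
dendrogram(Z)
plt.title("Hierarchical clustering (Ward's method)")
plt.show()
```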
Figure 7. Affinity Propagation clustering results showing 47 fine-grained clusters with exemplar cases. The automatic cluster determination reveals subcategories within major doctrinal areas.
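Unlike K-Means, Affinity Propagation selects the number of clusters and an exemplar document for each cluster on its own, which is how the 47 fine-grained clusters in Figure 7 arise. A minimal sketch, with a random matrix standing in for the document features, is shown below.

```python
# Minimal sketch of Affinity Propagation with automatic cluster count (cf. Figure 7).
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.random((30, 40))                       # toy stand-in for document features

ap = AffinityPropagation(random_state=0).fit(X)
print("clusters found:", len(ap.cluster_centers_indices_))
print("exemplar row indices:", ap.cluster_centers_indices_)
```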
Figure 8. Document similarity matrix based on topic distributions, ordered by hierarchical clustering. Block structure along the diagonal indicates groups of cases with similar topic profiles; off-diagonal patterns reveal cross-cutting thematic relationships.
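The similarity matrix in Figure 8 compares documents by their topic mixtures rather than their raw terms. The sketch below illustrates one way to compute such a matrix: fit LDA on term counts, take each document's topic distribution, and compute pairwise cosine similarity. The corpus and topic count are toy assumptions.

```python
# Minimal sketch of a topic-based document similarity matrix (cf. Figure 8).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "prior art renders the claims obvious",
    "written description fails to support the claims",
    "anda paragraph iv certification orange book listing",
    "claim construction of the disputed term",
] * 3  # toy corpus so the sketch executes

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=42)
theta = lda.fit_transform(counts)          # document-by-topic distributions

similarity = cosine_similarity(theta)      # document-by-document similarity matrix
print(similarity.shape)
```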
Table 1. Classification performance across models (10-fold stratified cross-validation).
Model                 Accuracy   AUC    F1     Recall
Naive Bayes           67.5%      0.71   0.65   0.64
Logistic Regression   78.2%      0.81   0.77   0.76
Decision Tree         72.1%      0.74   0.71   0.70
Random Forest         85.0%      0.87   0.84   0.83
Dense Neural Net      80.0%      0.82   0.79   0.78
LSTM                  89.0%      0.90   0.88   0.88
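The figures in Table 1 come from 10-fold stratified cross-validation. The sketch below reproduces that evaluation protocol with scikit-learn on a toy corpus; the Random Forest pipeline loosely mirrors one row of the table, and all data, names, and hyperparameters shown are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch of 10-fold stratified cross-validation for a text classifier
# (the evaluation protocol of Table 1), using toy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

docs = ["claims held invalid as obvious", "claims not proven obvious"] * 20
labels = np.array([0, 1] * 20)             # toy outcome labels

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=300, random_state=42),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, docs, labels, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```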