Article

Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research

1 Interdisciplinary Ph.D. Program in Neuroscience, University of Nevada, Las Vegas, NV 89154, USA
2 Department of Kinesiology and Nutrition Sciences, University of Nevada, Las Vegas, NV 89154, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(5), 296; https://doi.org/10.3390/a18050296
Submission received: 21 March 2025 / Revised: 8 May 2025 / Accepted: 13 May 2025 / Published: 20 May 2025
(This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing (3rd Edition))

Abstract

Large language models (LLMs) show promise for automating evidence synthesis, yet head-to-head evaluations remain scarce. We benchmarked five state-of-the-art LLMs—openai/o1-mini, x-ai/grok-2-1212, meta-llama/Llama-3.3-70B-Instruct, google/Gemini-Flash-1.5-8B, and deepseek/DeepSeek-R1-70B-Distill—on extracting protocol details from transcranial direct-current stimulation (tDCS) trials enrolling older adults. A multi-LLM ensemble pipeline ingested ClinicalTrials.gov records, applied a structured JSON schema, and generated comparable outputs from unstructured text. The pipeline retrieved 83 aging-related tDCS trials—roughly double the yield of a conventional keyword search. Across models, agreement was almost perfect for the binary field “brain stimulation used” (Fleiss’ κ ≈ 0.92) and substantial for the categorical field “primary target” (κ ≈ 0.71). Numeric parameters such as stimulation intensity and session duration showed excellent consistency when explicitly reported (ICC 0.95–0.96); secondary targets and free-text duration phrases remained challenging (κ ≈ 0.61; ICC ≈ 0.35). An ensemble consensus (majority vote or averaging) resolved most disagreements and delivered near-perfect reliability on core stimulation attributes (κ = 0.94). These results demonstrate that multi-LLM ensembles can markedly expand trial coverage and reach expert-level accuracy on well-defined fields while still requiring human oversight for nuanced or sparsely reported details. The benchmark and open-source workflow set a solid baseline for future advances in prompt engineering, model specialization, and ensemble strategies aimed at fully automated evidence synthesis in neurostimulation research involving aging populations. Overall, the five-model ensemble doubled the number of eligible aging-related tDCS trials retrieved versus keyword searching and achieved near-perfect agreement on core stimulation parameters (κ ≈ 0.94), demonstrating expert-level extraction accuracy.

1. Introduction

Continual advances in medicine rely on evidence-based knowledge [1], which continues to grow at an unprecedented pace. Clinical trials are a cornerstone of modern medical research and inform best practices by providing crucial data on the safety and efficacy of interventions [2]. Although platforms such as ClinicalTrials.gov aggregate hundreds of thousands of trials worldwide [3], systematically extracting and synthesizing their information remains a formidable challenge [4]. Traditional review methods—keyword-based searches and manual data extraction—can take months to complete and often fail to capture critical details found in unstructured protocol text [5,6,7].
Large language models (LLMs) offer a promising avenue for automating this labor-intensive process [8]. Recent models—including openai/o1-mini [9], x-ai/grok-2-1212, meta-llama/llama-3.3-70b-instruct [10], google/gemini-flash-1.5-8b [11], and deepseek/deepseek-r1-distill-llama-70b [12]—exhibit advanced language understanding that surpasses simple keyword matching. Nonetheless, concerns about accuracy, reliability, and “hallucinations” persist [13,14,15,16,17]. As a result, multi-agent systems, in which several specialized LLMs collaborate to tackle different parts of the extraction workflow, have emerged as a compelling approach [18,19,20,21,22]. While single LLMs are commonly benchmarked on general tasks [23,24], their combined performance for domain-specific use cases, such as analyzing aging-related protocols, remains poorly understood and often requires additional strategies like retrieval-augmented generation (RAG).
To shed light on these challenges, we systematically compared the outputs of multiple LLMs for data extraction in clinical trials examining transcranial direct current stimulation (tDCS) in aged populations [25], a domain featuring evolving terminology and varied interventions. In brief, tDCS is a non-invasive brain stimulation technique that can modify cortical excitability and enhance motor skill acquisition in healthy young [26,27,28,29,30] and older adults when delivered to brain regions such as the cerebellum [31], supplementary motor area [32], dorsolateral prefrontal cortex [33], and primary motor cortex. Although a nontrivial minority of studies do not show these positive effects [29,34,35,36,37,38], the balance of the research in young and older adults demonstrates motor skill enhancements of about 10–15% when tDCS is delivered at an intensity of 1–2 mA for 20 min before or during motor task performance. Despite growing interest, the heterogeneity of study protocols, including stimulation parameters, targeted brain regions, and outcome measures, makes systematic synthesis of this field particularly challenging. For instance, a simple keyword search such as “Aged AND tDCS” often retrieves only a fraction of relevant studies, overlooking important trials due to variations in terminology and inconsistent reporting. Moreover, vital protocol details frequently reside within unstructured text, requiring extensive human curation for accurate extraction [39,40]. Consequently, aging-related tDCS trials serve as an ideal model topic for testing automated data extraction methods, providing valuable insights into the capabilities and limitations of LLM-based approaches within a clinically relevant and complex research area.
In the current study, we automated and standardized significant portions of this data extraction process using a multi-agent LLM pipeline [41]. After downloading and consolidating clinical trial data, each model independently parsed the BriefSummary and DetailedDescription fields before outputting a structured, machine-readable JSON summary. Comparing outputs revealed areas of consensus (e.g., yes/no presence of tDCS interventions) as well as points of ambiguity (e.g., complex parameter descriptions or anatomical targets). This approach not only captured more relevant trials than conventional searches but also demonstrated how multi-agent LLMs can streamline systematic reviews [8,42,43,44,45]. While the findings highlight the benefits of automated pipelines—quicker extraction and reduced manual workload—they also underscore the need for human oversight to reconcile ambiguous information and address the variability of LLM performance. Furthermore, by enabling a rapid iterative refinement of the search criteria and extraction parameters, multi-agent systems can significantly enhance the responsiveness and adaptability of systematic reviews [46] in dynamic research domains. Illustrating both the potential and limitations of this multi-agent paradigm, this work advances aging-related tDCS research and lays the groundwork for applying similar methods to other domains, such as electronic medical records (EMRs), clinical trial guideline synthesis, and broader medical data processing.
This study is among the first to systematically benchmark multiple state-of-the-art LLMs specifically for extracting structured clinical trial data from aging-related research protocols involving neuromodulatory interventions such as tDCS. Unlike previous research, which predominantly assesses single-model performance or focuses on general clinical texts, this investigation uniquely evaluates a multi-agent ensemble approach. By demonstrating the strengths and limitations of individual LLMs and the added value of cross-model consensus methods, this work introduces a novel, rigorous methodology for automated data extraction tailored to the nuanced and heterogeneous domain of aging research.
The primary research question (RQ1) asked how consistently five state-of-the-art LLMs—openai/o1-mini, x-ai/grok-2-1212, Llama-3.3-70B-Instruct, Gemini-Flash-1.5-8B, and DeepSeek-R1-70B-Distill—extract key methodological fields from aging-related tDCS trial protocols. We hypothesized substantial, systematic performance differences and expected instruction-tuned, high-parameter models (e.g., Llama-3.3-70B-Instruct, DeepSeek-R1-70B-Distill) to outperform smaller, general-purpose models (e.g., openai/o1-mini) on complex, multi-field extractions involving numeric and multi-arm design details.
The secondary research question (RQ2) investigated whether a simple ensemble consensus method using majority voting for categorical fields and mean or median values for numeric fields could achieve greater accuracy and inter-rater reliability compared to the best-performing individual model. We anticipated that the ensemble method would yield significantly higher agreement with expert-validated reference labels (κ > 0.90 for categorical fields, ICC > 0.95 for numeric fields) than any single model, thus demonstrating the practical value of multi-agent pipelines.
The final research question (RQ3) assessed which attribute classes (binary, categorical, or continuous numeric) pose the greatest challenge for automated extraction. We predicted that binary attributes, such as “brain-stimulation used,” would show near-perfect agreement (κ ≥ 0.90), whereas continuous numeric parameters (e.g., stimulation intensity, session duration) would exhibit notably lower concordance (ICC ≤ 0.60), which would highlight the inherent difficulties models encounter when normalizing heterogeneous quantitative descriptions.

2. Literature Review

LLMs have recently emerged as powerful tools for automating the extraction of structured information from unstructured clinical texts. Traditional manual data extraction for evidence synthesis remains labor-intensive and prone to human error, creating significant workflow bottlenecks [47]. Recent studies highlight the promising capabilities of LLMs in streamlining clinical trial data extraction. For example, Lai et al. (2025) demonstrated that advanced models such as Claude-3.5 and Moonshot-v1 achieve approximately 95% accuracy in extracting data from complementary medicine trials, with accuracy surpassing 97% when supplemented with expert oversight [47]. Similarly, Jensen et al. (2025) showcased the effectiveness of ChatGPT-4o as a second reviewer in systematic reviews of exercise interventions, which achieved 92.4% accuracy with minimal false-positive outputs [48]. Furthermore, Liu et al. (2025) utilized structured prompting methods that achieved 94.8% accuracy while significantly reducing extraction times to approximately 88 s per trial, which underscores the practical advantages of structured prompts in systematic review contexts [49].
Despite these encouraging findings, the literature on LLM-driven clinical trial data extraction remains relatively underdeveloped, often limited to single-model evaluations or narrow clinical contexts. Multi-model evaluations have predominantly targeted clinical narratives and electronic health records (EHRs), achieving accuracy rates exceeding 98% with models such as GPT-4 and Claude variants [49]. Stuhlmiller et al. (2025) further confirmed the potential of LLMs in clinical documentation, notably enhancing medication data completeness from EHRs, though their study did not explicitly address clinical trial protocols [50]. Furthermore, Khan et al. (2024) introduced a collaborative two-LLM workflow that demonstrated substantial improvements in accuracy and reliability over single-model approaches [51]. However, systematic benchmarking across multiple LLMs specifically targeting structured clinical trial data extraction remains sparse. Additionally, specialized domains such as aging research and neuromodulatory interventions have received minimal attention in the current literature. Sun et al. (2024) notably highlighted a critical limitation of LLMs—low performance (approximately 39% accuracy) when extracting complex continuous outcome data—which underscores the ongoing challenges inherent in these specialized research areas [52].
Thus, there is a significant gap in comprehensive, multi-LLM benchmarking studies tailored explicitly to nuanced and domain-specific clinical trial contexts. To date, no research has systematically evaluated multiple state-of-the-art LLMs specifically for extracting structured data from clinical trial documents involving aging populations or neuromodulatory techniques, such as brain stimulation. Furthermore, the accuracy of LLMs in handling continuous numeric outcomes and the benefits of cross-model consensus workflows have been virtually unexplored. The present study directly addresses these gaps by systematically benchmarking multiple LLMs for structured data extraction from aging-related tDCS and other neuromodulatory intervention trials.

3. Materials and Methods

We developed a systematic pipeline for retrieving and analyzing aging-related clinical trial data from ClinicalTrials.gov using its v2 Application Programming Interface (API). The pipeline, implemented in Python, interacts with the API through custom functions designed for efficient pagination, comprehensive coverage, and minimal rate limit errors (see GitHub link for source code).
Initial API queries were structured to capture study design, intervention details, enrollment numbers, and outcome measures (Appendix A). We combined search terms such as “Aged” and “tDCS” and applied regex-based filtering (e.g., “tdcs”), which identified a pool of trials potentially relevant to noninvasive brain stimulation in older adults [53,54,55,56].
A verification system checked the unique National Clinical Trial (NCT) identifiers retrieved against the total count reported by the API in each data collection cycle. Any discrepancy automatically triggered additional retrieval attempts, ensuring that the final dataset comprehensively represents the available trials.
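To illustrate how the retrieval and verification steps fit together, the following Python sketch (using the Requests library) pages through the v2 API and checks the unique NCT identifiers against the reported total. The endpoint parameters (query.term, pageSize, pageToken, countTotal) and JSON field paths follow the public v2 API, but the exact values and structure used in the released pipeline may differ.

import requests

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"  # ClinicalTrials.gov v2 API

def fetch_trials(query_term: str, page_size: int = 100) -> list[dict]:
    """Page through the v2 API and return all study records matching a query."""
    studies, page_token, expected_total = [], None, None
    while True:
        params = {"query.term": query_term, "pageSize": page_size, "countTotal": "true"}
        if page_token:
            params["pageToken"] = page_token
        resp = requests.get(BASE_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        if expected_total is None:
            expected_total = payload.get("totalCount")
        studies.extend(payload.get("studies", []))
        page_token = payload.get("nextPageToken")
        if not page_token:
            break
    # Verification step: unique NCT identifiers must match the total reported by the API
    nct_ids = {s["protocolSection"]["identificationModule"]["nctId"] for s in studies}
    if expected_total is not None and len(nct_ids) != expected_total:
        raise RuntimeError(f"Retrieved {len(nct_ids)} unique NCT IDs, API reported {expected_total}")
    return studies

trials = fetch_trials("Aged AND tDCS")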
All retrieved records were stored in Parquet format to facilitate quick access and seamless transfer between Jupyter notebooks. For broader compatibility and easier manual review, we also maintained copies in CSV and Excel formats. During data cleaning, we standardized placeholders for missing values (e.g., replacing various empty fields with a consistent NULL), converted enrollment counts into numeric fields, and normalized metadata such as FDA regulatory status and healthy-volunteer indicators.
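A minimal pandas sketch of this cleaning stage is shown below; the placeholder strings and boolean mappings are illustrative assumptions, while the column names (EnrollmentCount, IsFDARegulatedDevice, HealthyVolunteers) correspond to fields listed in Appendix A.

import numpy as np
import pandas as pd

def clean_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize missing values, coerce numeric fields, and normalize flag columns."""
    df = df.copy()
    # Replace assorted empty-value placeholders with a single NULL marker
    df = df.replace(["", "N/A", "NA", "None", "null"], np.nan)
    # Convert enrollment counts to numeric, leaving unparseable values as NaN
    df["EnrollmentCount"] = pd.to_numeric(df["EnrollmentCount"], errors="coerce")
    # Normalize yes/no metadata such as FDA-device status and healthy-volunteer indicators
    for col in ["IsFDARegulatedDevice", "HealthyVolunteers"]:
        df[col] = df[col].astype(str).str.strip().str.lower().map(
            {"yes": True, "true": True, "no": False, "false": False}
        )
    return df

# raw_df: consolidated trial records from the API (assumed to exist)
cleaned = clean_trials(raw_df)
cleaned.to_parquet("trials.parquet", index=False)  # fast columnar storage for notebooks
cleaned.to_csv("trials.csv", index=False)          # copy for manual review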
All core data-processing scripts were written in Python 3.12, leveraging several key libraries:
  • pandas [1], for data manipulation;
  • Requests, for handling API interactions;
  • NumPy [2], for numerical operations;
  • OpenRouter, for coordinating multi-LLM (large language model) requests.
Version control was maintained for all scripts, configuration files, and environment details (including Python packages and random seeds), thereby ensuring reproducibility of results.
Table 1 provides an overview of the ClinicalTrials.gov API data extraction for data source transparency, including query parameters, inclusion/exclusion criteria, and the final dataset used for analysis.

3.1. Natural Language Processing and Analysis

Our analysis framework incorporated five large language models (LLMs), each accessed through the OpenRouter API infrastructure:
  • openai/o1-mini;
  • x-ai/grok-2-1212;
  • meta-llama/llama-3.3-70b-instruct;
  • google/gemini-flash-1.5-8b;
  • deepseek/deepseek-r1-distill-llama-70b.
Each model was configured with a maximum context window (up to 120,000 tokens) and a temperature of 0.5 to balance creativity and consistency. Although all LLMs received the same prompt, their architectural differences and unique training corpora led to distinctive extraction tendencies—an advantage that enabled a comparative evaluation of their strengths and weaknesses in identifying clinical trial attributes.
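As a point of reference, the sketch below shows how the same extraction prompt could be dispatched to each model through OpenRouter's OpenAI-compatible chat-completions endpoint; the trial_prompt variable and the environment variable holding the API key are illustrative assumptions rather than excerpts from the released scripts.

import os
import requests

MODELS = [
    "openai/o1-mini",
    "x-ai/grok-2-1212",
    "meta-llama/llama-3.3-70b-instruct",
    "google/gemini-flash-1.5-8b",
    "deepseek/deepseek-r1-distill-llama-70b",
]

def query_model(model: str, prompt: str) -> str:
    """Send the extraction prompt to one model via OpenRouter's chat endpoint."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.5,  # the temperature used in this study
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Each trial's prompt is sent to all five models independently
# trial_prompt: assumed string built from the protocol text and the directive
outputs = {model: query_model(model, trial_prompt) for model in MODELS}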
Figure 1 illustrates the multi-LLM workflow for clinical trial data review: each trial’s text is processed by all five models, and the results are aggregated for cross-validation.

Prompt Architecture

To achieve structured, consistent outputs from these LLMs, we developed a “final” prompt specifying the following:
1. Contextual Trial Information: Key protocol text, including intervention descriptions and outcome measures.
2. Directive: “Analyze whether brain stimulation was used in this trial. If so, provide details”.
Code Box 1 (json):
{
  "brain_stimulation_used": "Yes" or "No",
  "stimulation_details": {
    "primary_type": "e.g., tDCS, TMS, tACS, DBS, etc." or null,
    "is_noninvasive": true or false,
    "primary_target": "Primary brain region" or null,
    "secondary_targets": ["List of secondary regions"] or [],
    "stimulation_parameters": {
      "intensity": "e.g., 2mA" or null,
      "duration": "e.g., 20 min" or null
    }
  },
  "confidence_level": "High", "Medium", or "Low",
  "relevant_quotes": ["Direct quotes supporting the analysis"]
}
3. Strict Output Formatting Rules: Each LLM must return only a single JSON object conforming to the schema, without additional explanatory text.
This strict schema enables automated parsing, validation, and comparison of outputs across multiple LLMs. When relevant quotes are provided, they further ground the extracted data in the original protocol text, enhancing transparency and reproducibility. The prompt was iteratively tested across multiple LLM versions to ensure both syntactic and semantic accuracy. Specifically, format checks verified strict JSON syntax and the exact schema fields, while content checks ensured that each field (e.g., “primary_type”, “is_noninvasive”) was properly mapped to the trial text.
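The sketch below illustrates one way such format and content checks could be implemented in Python; the specific rules are assumptions consistent with the schema above, not a copy of the released validator.

import json

REQUIRED_TOP_LEVEL = {"brain_stimulation_used", "stimulation_details",
                      "confidence_level", "relevant_quotes"}
REQUIRED_DETAILS = {"primary_type", "is_noninvasive", "primary_target",
                    "secondary_targets", "stimulation_parameters"}

def validate_output(raw: str) -> dict | None:
    """Return the parsed JSON object if it conforms to the schema, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: flag for manual review
    if not REQUIRED_TOP_LEVEL.issubset(obj):
        return None  # missing top-level fields
    if obj["brain_stimulation_used"] not in ("Yes", "No"):
        return None  # content check: binary field must be Yes/No
    details = obj.get("stimulation_details") or {}
    if obj["brain_stimulation_used"] == "Yes" and not REQUIRED_DETAILS.issubset(details):
        return None  # stimulation trials must report the nested detail fields
    if obj["confidence_level"] not in ("High", "Medium", "Low"):
        return None
    return obj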

3.2. Statistical Analysis of Multi-LLM Reliability

We treated the five LLMs as independent “raters” of each trial’s key attributes. Depending on the data type, categorical fields (e.g., presence/absence of brain stimulation) were evaluated using Fleiss’ kappa (κ_F) or Krippendorff’s alpha; continuous fields (e.g., stimulation intensity, duration) were evaluated using the intraclass correlation coefficient (ICC).
For nominal outcomes such as the specific stimulation modality (e.g., tDCS, TMS), we computed Fleiss’ kappa to assess overall agreement among all five models. For numeric parameters, we employed ICC(2,1) (a single measure) or ICC(2,k) (average measures), in line with standard multi-rater reliability approaches in clinical and behavioral research.
We derived consensus (“ensemble”) fields in two ways:
  • Majority voting for categorical variables (e.g., “Yes” vs. “No” for brain_stimulation_used).
  • The mean or median for numeric fields (e.g., stimulation intensity or session duration). These consensus fields serve as the “best estimates” in downstream analyses, including systematic reviews and meta-analyses.
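For concreteness, the following sketch shows how these reliability and consensus metrics could be computed with statsmodels and pingouin; the DataFrame layouts (categorical with one column per model, numeric in long format) are illustrative assumptions.

import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# categorical: DataFrame of shape (n_trials, 5 models) holding labels such as "Yes"/"No"
counts, _ = aggregate_raters(categorical.to_numpy())  # trials x category-count matrix
kappa = fleiss_kappa(counts)                          # overall multi-rater agreement

# numeric: long-format DataFrame with columns trial_id, model, value (e.g., intensity in mA)
icc = pg.intraclass_corr(data=numeric, targets="trial_id", raters="model",
                         ratings="value", nan_policy="omit")
icc2_1 = icc.loc[icc["Type"] == "ICC2", "ICC"].item()  # single-measure ICC(2,1)

# Ensemble consensus: majority vote for categorical fields, mean for numeric fields
consensus_label = categorical.mode(axis=1).iloc[:, 0]
consensus_value = (numeric.pivot(index="trial_id", columns="model", values="value")
                   .mean(axis=1))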

3.3. Analysis Pipeline

Following data cleaning and initial filtering, each trial’s BriefSummary or DetailedDescription text was processed in parallel by the five LLMs. This multi-model approach leverages the unique strengths of each model while minimizing the impact of omissions or misinterpretations by any single LLM.
A specialized subsystem then performs the following steps:
  • Parses the returned JSON outputs.
  • Validates the JSON structure (e.g., required fields and data types).
  • Flags malformed outputs for potential manual review.
  • Stores validated results in standardized columns, allowing for direct comparison across models.
Comprehensive logging includes request parameters, response times, and errors to facilitate auditing. Automatic retry logic handles transient network or rate limit failures, enabling stable large-scale operation.
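A minimal sketch of the retry-with-backoff and logging behavior is given below; the retry count, backoff schedule, and log format are illustrative assumptions rather than the exact settings used in the released pipeline.

import logging
import time
import requests

logging.basicConfig(filename="pipeline.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def post_with_retries(url: str, payload: dict, headers: dict,
                      max_retries: int = 5) -> requests.Response:
    """POST with exponential backoff on transient network or rate-limit failures."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=120)
            logging.info("POST %s attempt %d -> %d in %.2fs", url, attempt,
                         resp.status_code, resp.elapsed.total_seconds())
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:
            logging.warning("Attempt %d failed: %s", attempt, exc)
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying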

3.4. Reproducibility

Several rigorous measures were employed to ensure the reproducibility of this study. Fixed random seeds (seed: 10031975) were consistently used across all steps involving stochastic processes. Detailed logging captured comprehensive records of all API interactions, including request URLs, response codes, and timestamps. Additionally, all scripts and pipeline configurations were maintained under version control, enabling exact replication of computational environments and procedures. Collectively, these strategies guarantee consistent outcomes across repeated analyses and over time, thereby strengthening the reliability and validity of our multi-LLM framework for clinical trial data extraction.

3.5. Alignment of Research Questions with Analytical Approaches

RQ1: 
How consistently do five state-of-the-art LLMs (openai/o1-mini, x-ai/grok-2-1212, Llama-3.3-70B-Instruct, Gemini-Flash-1.5-8B, DeepSeek-R1-70B-Distill) extract key methodological fields from aging-related tDCS trial protocols?
Analysis Method and Justification: Fleiss’ Kappa (κ), for categorical data, and intraclass correlation coefficient (ICC), for numeric data, were selected.
Justification: Fleiss’ Kappa is ideal for assessing agreement among multiple raters (LLMs) when classifying categorical variables. ICC is appropriate for measuring reliability in extracting numeric data, as it assesses the consistency or conformity of quantitative measurements made by different observers (LLMs).
RQ2: 
Does a simple ensemble consensus (majority vote for categorical fields; mean/median for numeric fields) outperform the best individual model in accuracy and inter-rater reliability?
Analysis Method and Justification: Ensemble consensus metrics (majority vote for categorical and averaging methods for numeric variables) were compared directly to individual LLM performance using κ and ICC.
Justification: Ensemble methods typically outperform individual classifiers due to a reduction in individual biases and errors. Majority voting for categorical fields and averaging numeric fields are robust consensus-building methods, and comparing these to the best-performing individual model helps clarify the practical advantage of multi-model ensemble approaches.
RQ3: 
Which attribute classes (binary, categorical, and continuous numeric) pose the most significant challenge to automated extraction?
Analysis Method and Justification: Performance metrics (Fleiss’ Kappa and ICC) were calculated separately for binary, categorical, and numeric fields, facilitating direct comparison across attribute classes.
Justification: Analyzing attribute-specific extraction difficulties provides targeted insights for future methodological refinements. By differentiating attribute classes, the analysis explicitly identifies where LLM-driven pipelines might require additional human oversight or prompt engineering improvements.
The selected analytical framework systematically evaluates model performance, captures inter-model variability, and clearly distinguishes individual versus ensemble outcomes. This structured approach ensures transparency and reproducibility, while explicitly aligning each research question with an appropriate, well-established statistical measure.

Code Availability

The Python code that generated and processed these data—including API calls, data cleaning routines, and multi-LLM evaluation scripts—is available in Appendix B and online via GitHub at https://github.com/ricyoung/LLM-Pipeline-for-Clinical-Trial-Data-Extraction (accessed on 2 March 2025).

4. Results

The multi-model pipeline analyzed a total of 83 aging-related clinical trials that potentially involved noninvasive brain stimulation, yielding systematically extracted data from five large language models (LLMs). To contextualize the consensus across models, Table 2 (Brain Stimulation Used) illustrates the results for the first 10 trials. Here, each row indicates how many models returned a “Yes” or “No” for the presence of brain stimulation, along with the final consensus and the number of discrepant models. Notably, all trials except one (NCT05511259) elicited unanimous agreement, underlining the relative clarity of references to stimulation methods in protocol descriptions. Table 3 then presents each LLM’s parsing of the critical attributes—primary type, intensity, and confidence—for the first five unique NCT identifiers. In most instances, the models converge on “tDCS,” though some records incorporate variations (e.g., “ctDCS,” “TMS”). The intensity column often remains blank and is populated (e.g., “2 mA”) only when the text provides a clear numeric detail. Confidence levels trend strongly toward “High” (Figure 2), but the variability across models becomes more evident as additional trials are analyzed.
Table 2 presents the two sample trials (NCT05511259 and NCT06501755) where the five models did not unanimously vote “Yes” or “No”.
Table 3 presents each model’s extracted Primary Type of brain stimulation, its intensity (e.g., 2 mA), and a confidence rating for the first five unique NCT IDs. It demonstrates how different large language models parse complex clinical trial descriptions.
Turning to overall inter-model concordance, Table 4 (High-Level Inter-Model Reliability Summary, Debug Edition) lists Fleiss’ kappa (κ) and the mean percent agreement for six fields:
  • brain_stimulation_used: near-perfect κ (~0.94) and ~99% mean agreement, underscoring how readily the models identify transcranial stimulation;
  • primary_type: substantial κ (~0.71) and >93% mean agreement, reflecting minor confusion around acronyms (e.g., “tCS,” “TPS”);
  • is_noninvasive: despite ~98% raw agreement, κ (~−0.003) hovers near zero due to class imbalance—all but a few interventions were classified as noninvasive;
  • primary_target: moderate κ (~0.53) with ~75% agreement, driven by the variety of anatomical terms and the partial mention of target sites;
  • parameters_intensity and parameters_duration: both hovered around κ ≈ 0.50, with 75–78% agreement, reflecting variability in numeric references (“2 mA,” “very weak current,” “20 to 60 min,” etc.).
In this analysis, we evaluated categorical and numeric fields extracted by the five LLMs. We added two fields beyond the original set: confidence_level (categorical: High, Medium, or Low) and frequency (numeric, e.g., 140 from “140 Hz” or 2 from “2 Hz”).
Table 5 (Inter-model reliability and agreement statistics) provides a more detailed breakdown, splitting the reliability metrics into (A) Fleiss’ kappa for categorical fields, with bootstrap confidence intervals, and (B) ICC (intraclass correlation coefficient) for numeric data. “brain_stimulation_used” achieves an Almost Perfect κ (0.90); “primary_type” (κ ≈ 0.71) remains Substantial; and “is_noninvasive” (κ ≈ 0.59) stands at Moderate.
For numeric parameters (intensity, duration, frequency), ICC values range from 0.95 to 1.0 (Excellent), indicating strong consistency when the text clearly presents numerical values. At the same time, an ICC of 1.0 may reflect limited variability in how frequency is reported—many trials either did not provide frequency or gave a simple measure (e.g., “2 Hz”).
Table 6 (Final “Consensus” Table) illustrates how each trial’s fields look after majority voting for categorical variables and mean (or most common) values for numeric variables. Trials like NCT03814304 show a clear consensus (“tDCS,” 2.0 mA, 20.0 min), whereas others display blank or partial columns when protocols lack explicit numeric details.
Table 7 (Per-field agreement vs. majority) presents how often each LLM’s output matches the consensus label for each field. Binary fields like “Brain Stim.” and “Is Noninvasive” achieve ≥90% agreement in nearly all cases, with some models (x.ai Grok and DeepSeek Distill 70B) reaching 100%. Numeric fields such as “Intensity” and “Duration” show lower alignment (68–94%), reflecting inconsistencies in how dosage or session length is described. “Primary Target” also remains moderately challenging: while most models exceed 75%, DeepSeek Distill 70B dips near 50%, possibly indicating heightened sensitivity to ambiguous text or multiple target mentions within a single protocol. Overall, these results confirm that the multi-agent LLM pipeline reliably classifies fundamental trial features, though textual ambiguity still introduces moderate variation, especially in multi-arm or highly detailed protocols.
Figure 3 provides detailed pairwise comparisons among models for the binary field ‘Brain Stimulation Used’. The high Cohen’s kappa values (≥0.90) indicate excellent inter-model reliability, further validating the robustness of the LLM-based extraction process. Confusion matrices reveal minor discrepancies, which predominantly occurred in only a few isolated cases.
This heatmap displays a symmetrical 5 × 5 matrix representing the pairwise agreement (e.g., Cohen’s kappa) among the five large language models (LLMs). Each cell indicates how similarly two models classified or extracted data across all trials. Darker shades (closer to 1.0) reflect near-perfect consistency, while lighter shades reveal greater discrepancies. The diagonal entries, set to 1.0, represent each model’s agreement with itself.

5. Discussion

The primary findings of this study demonstrate that individual large language models (LLMs) exhibit distinct strengths and weaknesses when extracting key parameters from clinical trial protocols involving transcranial direct current stimulation (tDCS) in aging research [57,58,59,60,61]. Consistent with previous research [11,12,62], all tested models reliably identified fundamental binary attributes, such as the presence or absence of brain stimulation and whether interventions were invasive or noninvasive, achieving near-perfect to substantial agreement. However, considerable variability was observed in extracting more nuanced or numeric protocol features, such as secondary anatomical targets or session durations, reflecting inherent differences in model architectures and training strategies [63,64].
The observed differences in performance across large language models (LLMs) likely stem from their underlying architectures, parameter sizes, and training corpora. Larger, instruction-tuned models such as Meta Llama 3.3–70B and DeepSeek-Distill 70B consistently performed better in extracting complex numeric details and nuanced descriptions due to their extensive training on large, domain-specific datasets, which likely provide them with richer contextual embeddings. In contrast, smaller, general-purpose models (e.g., openai/o1-mini, Google Gemini-Flash-1.5-8B) struggled with ambiguous language or imprecise numeric references, likely because they rely heavily on general linguistic patterns without sufficient domain-specific exposure. Additionally, the moderate agreement seen in identifying anatomical targets appears rooted in protocol ambiguity, where indirect references (such as “regions associated with cognitive function”) create inherent interpretative challenges. The strength of ensemble methods observed here can be attributed to their ability to compensate for individual model biases and inconsistencies, highlighting their advantage in reducing errors through consensus. Finally, persistent variability and occasional inaccuracies among all models underscore inherent limitations in current LLM technology, emphasizing the critical role of continued human oversight, particularly for ensuring reliability in complex, clinically nuanced data extraction.
As demonstrated in Table 6, majority voting and consensus strategies effectively distilled individual LLM outputs into a coherent record, facilitating rapid identification and appraisal of relevant trial data. Furthermore, the consistent patterns observed in Table 7, where inter-model agreement varied by field complexity, underscore the importance of careful curation and targeted prompt engineering to improve extraction accuracy, especially for complex protocol elements (Figure 4). The κ and ICC scores reported here translate directly into practical guidance for implementing automated extraction pipelines in systematic reviews. High agreement metrics affirm the capability of multi-LLM ensembles to substantially streamline the extraction of clearly defined protocol details. However, the lower reliability observed for ambiguous or sparsely documented attributes underscores the continued necessity of structured human oversight. Thus, researchers should leverage these findings to strategically integrate human validation points, especially for nuanced protocol elements, ensuring the robust applicability of LLM-based systems in real-world settings.
The significant variation among models on complex attributes highlights a critical consideration for future multi-agent system (MAS) design [65,66,67]: selecting complementary models to leverage their individual strengths effectively. For example, models like DeepSeek-Distill [68] showed high accuracy in extracting numeric parameters such as intensity and duration, whereas models such as Meta Llama excelled in identifying multi-arm trial designs or detailed anatomical descriptions [69,70]. These insights underscore the value of a combined approach in future MAS development, tailored to specific data extraction tasks to optimize accuracy and consistency.
Leveraging these model capabilities can significantly advance aging research by providing clearer insights into current clinical trials, enabling medical professionals to better understand aging processes and variability across different populations [71,72,73]. Furthermore, optimized LLM-based systems could greatly improve patient education [74,75], providing access not only to foundational knowledge but also to current and promising clinical trials. This approach could facilitate better communication among patients, researchers, and medical providers, thereby enhancing the overall effectiveness of aging interventions [76].
A practical implication for aging research is that multi-agent LLM pipelines can significantly enhance the current understanding of aging processes by systematically identifying the current state of clinical trials and highlighting variations across different populations [77]. Clinically tuned LLM systems could greatly improve patient education, as they can incorporate up-to-date clinical trial data and provide accessible, detailed information to patients, researchers, and medical providers alike.
This comparative analysis not only establishes benchmarks for LLM performance in clinical trial extraction but also emphasizes the ongoing necessity of human oversight to reconcile ambiguous outputs and address limitations inherent in current LLM technology [78], including susceptibility to misinterpretation and hallucination [14,79,80]. While RAG strategies offer promising enhancements for accuracy, implementing RAG with extensive clinical trial datasets is not always practical or feasible. Consequently, future research should focus on optimizing foundational LLMs through specialized prompt engineering and domain-specific ontologies [81]. A deeper understanding of foundational model performance will guide effective RAG design choices, ensuring that the multi-agent systems developed are precise, practical, and well suited to advanced clinical applications and precision medicine.
Recognizing these ethical and practical imperatives, we implemented an explicit human-in-the-loop framework within our multi-LLM ensemble pipeline. Our approach is not a true multi-agent system in which agents dynamically collaborate but rather an ensemble of independently operated LLMs; even so, structured human oversight remains integral. This framework included three essential components: (1) periodic manual audits, conducted regularly on randomly selected subsets of extracted data to identify systematic errors or misinterpretations by individual models; (2) defined human-in-the-loop checkpoints, strategically positioned after initial data extraction and prior to consensus formation, ensuring reviewers could systematically review and resolve ambiguous or conflicting outputs; and (3) automated triggers for human intervention, activated by predefined thresholds such as low-confidence scores, significant discrepancies among model outputs, or parsing anomalies. Our pipeline was intentionally structured into three distinct processing stages (trial retrieval, initial filtering, and structured data extraction) with automated halts designed explicitly to facilitate careful human validation before advancing. This structured integration of human oversight addresses critical ethical considerations, enhances data reliability, and bolsters the practical validity of the high κ/ICC scores reported, underscoring the real-world applicability of our results in systematic reviews and clinical decision-making.

6. Conclusions

This study provides a systematic comparative analysis of several large language models (LLMs) for clinical trial data extraction in aging research, revealing distinct performance profiles that can inform future multi-agent system (MAS) development. Although all evaluated models performed exceptionally well on straightforward attributes, considerable variability was noted for complex or numeric parameters. Recognizing these differences enables researchers to strategically select complementary models for MAS implementation, ensuring that each model’s strengths are optimally leveraged.
The clinical value of such optimized MAS frameworks is substantial, as they can rapidly synthesize extensive clinical trial data, ultimately enhancing patient care and provider education. Future work should extend this comparative framework by integrating external data sources (e.g., PubMed, NIH Reporter) to enhance context and validation and by developing targeted MAS pipelines where models specialize in different extraction subtasks. Such refined systems could significantly expedite systematic reviews and meta-analyses in clinical aging research, benefiting researchers through improved efficiency and precision. However, careful attention must be given to ethical considerations, transparency, and rigorous human oversight to ensure reliability, particularly as these tools become more integrated into clinical decision-making processes. As LLM capabilities advance, structured guidelines and rigorous validation standards will become essential to maintain data quality, safeguard patient privacy, and promote the responsible adoption of AI-driven methodologies in medical research [82].

Future Research Directions

Future research should focus on designing true multi-agent agentic systems, where individual LLM agents dynamically collaborate and refine their extraction tasks in real-time, significantly enhancing extraction accuracy and efficiency. Integrating additional data sources, such as PubMed publications and other scholarly databases, can provide richer contextual insights, especially when individual clinical trials yield multiple related publications. Moreover, evaluating more advanced and larger models, including GPT-4 variants and Claude 3.7, could offer further improvements in accuracy and reliability, particularly for handling complex, nuanced, or ambiguous clinical data. Such advancements would move the field closer to fully automated, precise, and contextually aware clinical data extraction workflows.

Author Contributions

Conceptualization, R.J.Y. and B.P.; methodology, R.J.Y. and B.P.; software, R.J.Y.; validation, A.M.M., R.J.Y. and B.P.; formal analysis, R.J.Y. and B.P.; investigation, A.M.M., R.J.Y. and B.P.; resources, B.P.; data curation, R.J.Y., A.M.M. and B.P.; writing—original draft preparation, A.M.M., R.J.Y. and B.P.; writing—review and editing, A.M.M., R.J.Y. and B.P.; visualization, A.M.M., R.J.Y. and B.P.; supervision, B.P.; project administration, A.M.M., R.J.Y. and B.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data used in this study are publicly available through ClinicalTrials.gov at https://clinicaltrials.gov accessed on 1 January 2025. The source code for the data extraction and analysis pipeline developed in this study is openly available on GitHub at https://github.com/ricyoung/LLM-Pipeline-for-Clinical-Trial-Data-Extraction (accessed on 2 May 2025).

Conflicts of Interest

Richard J. Young reports a relationship with UnitedHealth Group Inc. that includes employment. Richard J. Young previously received research credits from OpenAI for an unrelated project. These credits were not used for the work presented in this manuscript, and the authors have no other interests or activities to disclose that could be perceived as a conflict of interest. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
CI: Confidence Interval
CSV: Comma-Separated Values
DLPFC: Dorsolateral Prefrontal Cortex
EMR: Electronic Medical Records
ICC: Intraclass Correlation Coefficient
JSON: JavaScript Object Notation
LLM: Large Language Model
MAS: Multi-Agent System
NCT: National Clinical Trial
NIH: National Institutes of Health
NLP: Natural Language Processing
Parquet: A columnar storage format for efficient data handling
PubMed: A free resource for biomedical and life sciences literature
RAG: Retrieval-Augmented Generation
tDCS: Transcranial Direct Current Stimulation
TMS: Transcranial Magnetic Stimulation

Appendix A

“NCTId”, “LeadSponsorClass”, “LeadSponsorName”, “Condition”, “OfficialTitle”, “BriefTitle”, “Acronym”, “StudyType”, “InterventionType”, “InterventionName”, “InterventionOtherName”, “InterventionDescription”, “Phase”, “StudyFirstSubmitDate”, “LastUpdateSubmitDate”, “CompletionDate”, “OverallStatus”, “BriefSummary”, “IsFDARegulatedDevice”, “StartDate”, “DetailedDescription”, “ConditionMeshTerm”, “PrimaryOutcomeDescription”, “SecondaryOutcomeDescription”, “EnrollmentCount”, “EnrollmentType”, “BaselineCategoryTitle”, “BaselinePopulationDescription”, “BaselineTypeUnitsAnalyzed”, “OtherOutcomeDescription”, “EligibilityCriteria”, “StudyPopulation”, “HealthyVolunteers”, “ReferencePMID”, “LocationCountry”, “PrimaryOutcomeTimeFrame”, “BaselineMeasureTitle”, “BaselineMeasureUnitOfMeasure”, “BaselineMeasurementValue”

Appendix B

References

  1. Ricco, J.B.; Guetarni, F.; Kolh, P. Learning from artificial intelligence and big data in health care. Eur. J. Vasc. Endovasc. Surg. 2020, 59, 868–869. [Google Scholar] [CrossRef] [PubMed]
  2. Zarin, D.A.; Tse, T.; Williams, R.J.; Carr, S. Trial reporting in clinicaltrials.Gov—The final rule. N. Engl. J. Med. 2016, 375, 1998–2004. [Google Scholar] [CrossRef]
  3. Zarin, D.A.; Fain, K.M.; Dobbins, H.D.; Tse, T.; Williams, R.J. 10-year update on study results submitted to clinicaltrials.Gov. N. Engl. J. Med. 2019, 381, 1966–1974. [Google Scholar] [CrossRef]
  4. Chaturvedi, N.; Mehrotra, B.; Kumari, S.; Gupta, S.; Subramanya, H.S.; Saberwal, G. Some data quality issues at clinicaltrials.Gov. Trials 2019, 20, 378. [Google Scholar] [CrossRef]
  5. Pradhan, R.; Hoaglin, D.C.; Cornell, M.; Liu, W.; Wang, V.; Yu, H. Automatic extraction of quantitative data from clinicaltrials.Gov to conduct meta-analyses. J. Clin. Epidemiol. 2019, 105, 92–100. [Google Scholar] [CrossRef] [PubMed]
  6. Tasneem, A.; Aberle, L.; Ananth, H.; Chakraborty, S.; Chiswell, K.; McCourt, B.J.; Pietrobon, R. The database for aggregate analysis of clinicaltrials.Gov (aact) and subsequent regrouping by clinical specialty. PLoS ONE 2012, 7, e33677. [Google Scholar] [CrossRef]
  7. Nye, B.; Jessy Li, J.; Patel, R.; Yang, Y.; Marshall, I.J.; Nenkova, A.; Wallace, B.C. A Corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. Proc. Conf. Assoc. Comput. Linguist. Meet. 2018, 2018, 197–207. [Google Scholar] [CrossRef] [PubMed]
  8. Jonnalagadda, S.; Petitti, D. A new iterative method to reduce workload in systematic review process. Int. J. Comput. Biol. Drug Des. 2013, 6, 5–17. [Google Scholar] [CrossRef]
  9. Contributors, F.; El-Kishky, A.; Selsam, D.; Song, F.; Parascandolo, G.; Ren, H.; Lightman, H.; Won, H.; Akkaya, I.; Sutskever, I.; et al. Openai o1 system card. arXiv 2024, arXiv:2412.16720. [Google Scholar] [CrossRef]
  10. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:abs/2407.21783. [Google Scholar]
  11. Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.-B.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  12. Guo, D.; Yang, D.; Zhang, H.; Song, J.-M.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  13. Alkaissi, H.; McFarlane, S.I. Artificial hallucinations in chatgpt: Implications in scientific writing. Cureus 2023, 15, e35179. [Google Scholar] [CrossRef]
  14. Azamfirei, R.; Kudchadkar, S.R.; Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 2023, 27, 120. [Google Scholar] [CrossRef] [PubMed]
  15. Liu, F.; Liu, Y.; Shi, L.; Huang, H.; Wang, R.; Yang, Z.; Zhang, L. Exploring and evaluating hallucinations in llm-powered code generation. arXiv 2024, arXiv:2404.00971. [Google Scholar] [CrossRef]
  16. Abdelghafour, M.A.M.; Mabrouk, M.; Taha, Z. Hallucination mitigation techniques in large language models. Int. J. Intell. Comput. Inf. Sci. 2024, 24, 73–81. [Google Scholar] [CrossRef]
  17. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  18. Gartlehner, G.; Kahwati, L.; Hilscher, R.; Thomas, I.; Kugley, S.; Crotty, K.; Viswanathan, M.; Nussbaumer-Streit, B.; Booth, G.; Erskine, N.; et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Res. Synth. Methods 2023, 15, 576–589. [Google Scholar] [CrossRef]
  19. Valentina, P.; Aneta, H. Breaking down the metrics: A comparative analysis of llm benchmarks. Int. J. Sci. Res. Arch. 2024, 13, 777–788. [Google Scholar] [CrossRef]
  20. Erdengasileng, A.; Han, Q.; Zhao, T.; Tian, S.; Sui, X.; Li, K.; Wang, W.; Wang, J.; Hu, T.; Pan, F.; et al. Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification. Database 2022, 2022, baac066. [Google Scholar] [CrossRef]
  21. Jia, Y.; Wang, H.; Yuan, Z.; Zhu, L.; Xiang, Z.L. Biomedical relation extraction method based on ensemble learning and attention mechanism. BMC Bioinform. 2024, 25, 333. [Google Scholar] [CrossRef] [PubMed]
  22. Li, Z.; Wei, Q.; Huang, L.C.; Li, J.; Hu, Y.; Chuang, Y.S.; He, J.; Das, A.; Keloth, V.K.; Yang, Y.; et al. Ensemble pretrained language models to extract biomedical knowledge from literature. J. Am. Med. Inform. Assoc. 2024, 31, 1904–1911. [Google Scholar] [CrossRef] [PubMed]
  23. Banerjee, S.; Agarwal, A.; Singh, E. The Vulnerability of language model benchmarks: Do they accurately reflect true llm performance? arXiv 2024, arXiv:2412.03597. [Google Scholar] [CrossRef]
  24. Bowman, S.R.; Dahl, G.E. What will it take to fix benchmarking in natural language understanding? arXiv 2021, arXiv:abs/2104.02145. [Google Scholar]
  25. Yosefi, M.H.; Yagedi, Z.; Ahmadizadeh, Z.; Ehsani, F. Effect of transcranial direct current stimulation on learning and motor skill in healthy older adults: A systematic review. J. Maz. Univ. Med. Sci. 2017, 26, 221–231. [Google Scholar]
  26. Meek, A.W.; Greenwell, D.R.; Nishio, H.; Poston, B.; Riley, Z.A. Anodal m1 tdcs enhances online learning of rhythmic timing videogame skill. PLoS ONE 2024, 19, e0295373. [Google Scholar] [CrossRef]
  27. Pantovic, M.; Albuquerque, L.L.; Mastrantonio, S.; Pomerantz, A.S.; Wilkins, E.W.; Riley, Z.A.; Guadagnoli, M.A.; Poston, B. transcranial direct current stimulation of primary motor cortex over multiple days improves motor learning of a complex overhand throwing Task. Brain Sci. 2023, 13, 1441. [Google Scholar] [CrossRef] [PubMed]
  28. Wilson, M.A.; Greenwell, D.; Meek, A.W.; Poston, B.; Riley, Z.A. neuroenhancement of a dexterous motor task with anodal tdcs. Brain Res. 2022, 1790, 147993. [Google Scholar] [CrossRef]
  29. Buch, E.R.; Santarnecchi, E.; Antal, A.; Born, J.; Celnik, P.A.; Classen, J.; Gerloff, C.; Hallett, M.; Hummel, F.C.; Nitsche, M.A.; et al. Effects of tdcs on motor learning and memory formation: A consensus and critical position paper. Clin. Neurophysiol. 2017, 128, 589–603. [Google Scholar] [CrossRef]
  30. Meek, A.W.; Greenwell, D.; Poston, B.; Riley, Z.A. Anodal tdcs accelerates on-line learning of dart throwing. Neurosci. Lett. 2021, 764, 136211. [Google Scholar] [CrossRef]
  31. Hardwick, R.M.; Celnik, P.A. Cerebellar direct current stimulation enhances motor learning in older adults. Neurobiol. Aging 2014, 35, 2217–2221. [Google Scholar] [CrossRef]
  32. Nomura, T.; Kirimoto, H. Anodal transcranial direct current stimulation over the supplementary motor area improves anticipatory postural adjustments in older adults. Front. Hum. Neurosci. 2018, 12, 317. [Google Scholar] [CrossRef]
  33. Ljubisavljevic, M.R.; Oommen, J.; Filipovic, S.; Bjekic, J.; Szolics, M.; Nagelkerke, N. Effects of tdcs of dorsolateral prefrontal cortex on dual-task performance involving manual dexterity and cognitive task in healthy older adults. Front. Aging Neurosci. 2019, 11, 144. [Google Scholar] [CrossRef]
  34. Jiang, Y.; Ramasawmy, P.; Antal, A. Uncorking the limitation-improving dual tasking using transcranial electrical stimulation and task training in the elderly: A systematic review. Front. Aging Neurosci. 2024, 16, 1267307. [Google Scholar] [CrossRef] [PubMed]
  35. Siew-Pin Leuk, J.; Yow, K.E.; Zi-Xin Tan, C.; Hendy, A.M.; Kar-Wing Tan, M.; Hock-Beng Ng, T.; Teo, W.P. A meta-analytical review of transcranial direct current stimulation parameters on upper limb motor learning in healthy older adults and people with parkinson’s disease. Rev. Neurosci. 2023, 34, 325–348. [Google Scholar] [CrossRef]
  36. Pantovic, M.; Macak, D.; Cokorilo, N.; Moonie, S.; Riley, Z.A.; Madic, D.M.; Poston, B. The influence of transcranial direct current stimulation on shooting performance in elite deaflympic athletes: A case series. J. Funct. Morphol. Kinesiol. 2022, 7, 42. [Google Scholar] [CrossRef]
  37. Pantovic, M.; Lidstone, D.E.; de Albuquerque, L.L.; Wilkins, E.W.; Munoz, I.A.; Aynlender, D.G.; Morris, D.; Dufek, J.S.; Poston, B. cerebellar transcranial direct current stimulation applied over multiple days does not enhance motor learning of a complex overhand throwing task in young adults. Bioengineering 2023, 10, 1265. [Google Scholar] [CrossRef]
  38. Pino-Esteban, A.; Megía-García, Á.; Álvarez, D.M.-C.; Beltran-Alacreu, H.; Avendaño-Coy, J.; Gómez-Soriano, J.; Serrano-Muñoz, D. Can transcranial direct current stimulation enhance functionality in older adults? A systematic review. J. Clin. Med. 2021, 10, 2981. [Google Scholar] [CrossRef] [PubMed]
  39. Marshall, I.J.; Noel-Storr, A.; Kuiper, J.; Thomas, J.; Wallace, B.C. machine learning for identifying randomized controlled trials: An evaluation and practitioner’s guide. Res. Synth. Methods 2018, 9, 602–614. [Google Scholar] [CrossRef] [PubMed]
  40. Tsafnat, G.; Glasziou, P.; Choong, M.K.; Dunn, A.; Galgani, F.; Coiera, E. Systematic review automation technologies. Syst. Rev. 2014, 3, 74. [Google Scholar] [CrossRef]
  41. He, C.; Zou, B.; Li, X.; Chen, J.; Xing, J.; Ma, H. Enhancing llm reasoning with multi-path collaborative reactive and reflection agents. arXiv 2024, arXiv:2501.00430. [Google Scholar] [CrossRef]
  42. Marshall, I.J.; Wallace, B.C. Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Syst. Rev. 2019, 8, 163. [Google Scholar] [CrossRef] [PubMed]
  43. O’Mara-Eves, A.; Thomas, J.; McNaught, J.; Miwa, M.; Ananiadou, S. Using text mining for study identification in systematic reviews: A systematic review of current approaches. Syst. Rev. 2015, 4, 5. [Google Scholar] [CrossRef] [PubMed]
Figure 1. High-level workflow of the multi-agent (LLM) pipeline. After trials are retrieved from ClinicalTrials.gov and cleaned, the text fields are sent in parallel to five different LLMs together with a strict JSON schema. The returned outputs are validated and then subjected to statistical reliability analyses (e.g., Fleiss’ kappa, ICC). Finally, a consensus record is generated by majority voting or averaging.
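To make the workflow in Figure 1 concrete, the sketch below shows one way to fan a trial's text out to several models in parallel and collect schema-validated JSON. It is a minimal illustration, not the authors' implementation: the query_llm helper, the model identifiers, and the required schema keys are assumptions standing in for whatever API client and prompt the pipeline actually uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Illustrative model identifiers in the OpenRouter-style naming used in the paper.
MODELS = [
    "openai/o1-mini",
    "x-ai/grok-2-1212",
    "meta-llama/llama-3.3-70b-instruct",
    "google/gemini-flash-1.5-8b",
    "deepseek/deepseek-r1-distill-llama-70b",
]

# Keys every model must return under the strict JSON schema (assumed names).
REQUIRED_KEYS = {"brain_stimulation_used", "primary_type", "is_noninvasive",
                 "primary_target", "parameters_intensity", "parameters_duration",
                 "confidence"}

def query_llm(model: str, trial_text: str) -> str:
    """Placeholder for the real API call; must return a JSON string."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def extract_one(model: str, trial_text: str) -> dict | None:
    """Ask one model for a structured record and validate it against the schema."""
    try:
        record = json.loads(query_llm(model, trial_text))
    except ValueError:
        return None  # malformed JSON is treated as a failed extraction
    return record if isinstance(record, dict) and REQUIRED_KEYS <= record.keys() else None

def extract_trial(trial_text: str) -> dict[str, dict | None]:
    """Send the same trial text to all five models in parallel."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(extract_one, m, trial_text) for m in MODELS}
        return {m: f.result() for m, f in futures.items()}
```

Downstream steps (reliability statistics and consensus) then operate on the per-model dictionaries returned by extract_trial.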
Figure 2. Distribution of confidence ratings across five large language models (LLMs). Results demonstrate a strong predominance of “High” confidence ratings, with minor variations in the frequency of “Medium” or “Low” ratings. Notably, GPT-o1-mini occasionally assigns “Medium” or “Low” confidence, whereas DeepSeek Distill 70B consistently maintains “High” confidence.
Figure 3. Pairwise comparison analysis for “Brain Stimulation Used”: (a) Exact-match agreement matrix, showing the percentage of trials in which models give identical responses. (b) Cohen’s kappa scores between each pair of LLMs, showing agreement adjusted for chance. (c) Confusion matrices for selected model pairs; each matrix shows how often one model classifies a trial as “Yes” or “No” given the other model’s classification.
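The pairwise panels in Figure 3 can be computed as sketched below, assuming the per-model labels sit in a pandas DataFrame with one column per model; the column names and toy data are illustrative, and scikit-learn is used here simply as a convenient source of Cohen's kappa.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Illustrative data: one row per trial, one column per model ("yes"/"no").
labels = pd.DataFrame({
    "o1_mini":  ["yes", "yes", "no", "yes"],
    "grok_2":   ["yes", "yes", "no", "yes"],
    "llama_33": ["yes", "no",  "no", "yes"],
})

models = list(labels.columns)
exact = pd.DataFrame(1.0, index=models, columns=models)   # panel (a)
kappa = pd.DataFrame(1.0, index=models, columns=models)   # panel (b)

for a, b in combinations(models, 2):
    agree = (labels[a] == labels[b]).mean()          # fraction of identical responses
    k = cohen_kappa_score(labels[a], labels[b])      # chance-corrected agreement
    exact.loc[a, b] = exact.loc[b, a] = agree
    kappa.loc[a, b] = kappa.loc[b, a] = k

# Panel (c): confusion matrix for one model pair.
confusion = pd.crosstab(labels["o1_mini"], labels["grok_2"])
print(exact, kappa, confusion, sep="\n\n")
```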
Figure 4. Heatmap of pairwise model agreement. (a) Binary field (brain stimulation used). This panel shows the agreement between models for the binary field “brain_stimulation_used”. The consistently high agreement scores (0.99–1.00) indicate excellent reliability across all model pairs when identifying the presence or absence of brain stimulation in clinical trials. (b) Categorical field (primary type). This panel displays agreement for the categorical field “primary_type”, which classifies the type of brain stimulation used. Agreement scores range from 0.82 to 0.95, demonstrating good but lower agreement compared with the binary field and reflecting the increased complexity of categorical classification. (c) Numeric field (intensity). This panel illustrates agreement for the numeric field “intensity” (stimulation strength). The notably lower agreement scores (0.62–0.77) highlight the challenge models face in consistently extracting precise numeric values from clinical trial descriptions.
Table 1. Data Source Transparency.

Description | Details
Data Source | ClinicalTrials.gov
API Endpoint URL | https://clinicaltrials.gov/data-api/api
Query Terms/Keywords | “Aged”
Number of Trials Retrieved | 10,030 trials
Inclusion Criteria | Trials explicitly mentioning tDCS, targeting older adults (aged ≥ 65), and involving brain stimulation methods
Exclusion Criteria | Trials not explicitly mentioning tDCS, animal studies, duplicates, and unrelated interventions
Final Number of Trials Analyzed | 83 trials
Date of Data Extraction | 1 January 2025
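For reproducibility, the snippet below sketches how the retrieval summarized in Table 1 could be scripted. It assumes the public ClinicalTrials.gov v2 REST endpoint (/api/v2/studies) with its query.term and pageToken parameters; the exact endpoint, parameters, and JSON field paths are assumptions that should be checked against the current API documentation before use.

```python
import requests

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

def fetch_trials(keyword: str = "Aged", page_size: int = 100) -> list[dict]:
    """Page through study records matching a keyword and return the raw JSON studies."""
    studies, token = [], None
    while True:
        params = {"query.term": keyword, "pageSize": page_size}
        if token:
            params["pageToken"] = token
        resp = requests.get(BASE_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:
            return studies

if __name__ == "__main__":
    # Keep only records whose protocol text mentions tDCS before LLM extraction.
    records = fetch_trials("Aged")
    tdcs = [s for s in records
            if "tdcs" in str(s.get("protocolSection", {})).lower()
            or "transcranial direct" in str(s.get("protocolSection", {})).lower()]
    print(f"retrieved {len(records)} trials, {len(tdcs)} mention tDCS")
```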
Table 2. Type of brain stimulation used. # Trials: The number of trials that had non-missing data for that field.

NCTId | # Models Saying Yes | # Models Saying No | Consensus
NCT05511259 | 3 | 2 | Yes
NCT06501755 | 4 | 0 | Yes
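The consensus column in Table 2 is a simple majority vote over the per-model answers. A minimal sketch follows, using illustrative vote counts rather than the study data:

```python
from collections import Counter

def vote(answers: list[str | None]) -> str:
    """Majority vote over model answers; ties are flagged for human review."""
    counts = Counter(a.lower() for a in answers if a)  # ignore missing answers
    if not counts:
        return "missing"
    (top, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:
        return "tie - needs review"
    return top

# Example resembling the first row of Table 2: three "yes" vs. two "no".
print(vote(["yes", "yes", "yes", "no", "no"]))  # -> "yes"
```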
Table 3. Primary type, intensity, and confidence.

NCTId | Model | Primary Type | Intensity | Confidence
NCT06658795 | OpenAI_o1_mini | tDCS | None | High
NCT06658795 | x_ai_grok_2_1212 | tDCS | None | High
NCT06658795 | Meta_LLaMA_3.3_70B | tDCS | None | High
NCT06658795 | Google_Gemini_Flash_1.5_8b | tDCS | None | High
NCT06658795 | DeepSeek_Distill_70b | tDCS | None | High
NCT03814304 | OpenAI_o1_mini | tDCS | None | High
NCT03814304 | x_ai_grok_2_1212 | tDCS | None | High
NCT03814304 | Meta_LLaMA_3.3_70B | tDCS | None | High
NCT03814304 | Google_Gemini_Flash_1.5_8b | tDCS | None | High
NCT03814304 | DeepSeek_Distill_70b | tDCS | 2 mA | High
NCT02436915 | OpenAI_o1_mini | tDCS | None | High
NCT02436915 | x_ai_grok_2_1212 | tDCS | None | High
NCT02436915 | Meta_LLaMA_3.3_70B | tDCS | None | High
NCT02436915 | Google_Gemini_Flash_1.5_8b | tDCS | None | High
NCT02436915 | DeepSeek_Distill_70b | tDCS | 2 mA | High
Table 4. This table compares how consistently five large language models (LLMs) label key fields in clinical trial text. Field: the attribute being extracted (e.g., whether or not brain stimulation is used, the primary stimulation type, etc.). Fleiss’ kappa: a statistical measure of how well the models agree beyond chance; values typically range from −1.0 to 1.0. Mean % Agreement: the average fraction of model outputs that match the majority label in each trial. # Trials: the number of trials that had non-missing data for that field. Categories: the distinct labels (e.g., ‘yes’, ‘no’, ‘tDCS’) that appeared in the model outputs for that field.

Field | Fleiss Kappa | Mean % Agreement | # Trials | Categories
brain_stimulation_used | 0.941 | 0.993 | 83 | yes, no
primary_type | 0.709 | 0.938 | 78 | ctdcs, dtms, dtms, tdcs, hd-tdcs, missing, non-invasive brain stimulation, tacs, tcs, tdcs, tdcs and tacs, tdcs, tacs, tms, tps, tps (transcranial pulse stimulation), tps (transcranial pulse stimulation, also known as low-intensity extracorporeal shock wave therapy (li-eswt))
is_noninvasive | −0.003 | 0.978 | 81 | true, false
primary_target | 0.53 | 0.747 | 76 | anodal, anodal transcranial direct current stimulation (atdcs), audio-visual associative memory areas, bifrontal, brain regions associated with cognitive function, brain regions associated with sleep spindles, brain regions involved in active cognitive function, brain regions underneath the neocortex, brain underneath the neocortex, center electrode, central nervous system, cerebellum, cerebral cortex, cognitive control network (ccn), cortical and deep brain structures, cortical areas, corticothalamic and corticospinal projections, dlpfc, dorso-lateral prefrontal cortex, dorso-lateral prefrontal cortices, dorsolateral prefrontal cortex, dorsolateral prefrontal cortex (dlpfc), dorsolateral prefrontal cortex (f3), dorsolateral prefrontal cortex, areas of the memory and language network, dorsolateral prefrontal cortex, memory and language network, f3, frontal brain regions, frontal circuits, fronto-central, fronto-central alpha, fronto-central alpha region, fronto-central region, frontopolar cortex, inferior frontal lobe, inferior frontal lobe and superior parietal lobe, knee oa-related areas, language areas, language areas of the brain, lateral prefrontal cortex (lpfc), lateral prefrontal cortex (lpfc) and default mode network (dmn), left dlpfc, left dorsal lateral prefrontal cortex (dlpfc), left dorsolateral prefrontal cortex, left dorsolateral prefrontal cortex (dlpfc), left m1, left prefrontal cortex, m1, m1 (primary motor cortex), memory-related brain regions, mid-cingulate cortex (mcc), missing, motor cortex, neocortex, not explicitly stated, not specified, null, pain-related brain regions, pf areas, pfc, pre-frontal (pf) areas, pre-frontal (pf) brain areas, pre-frontal areas, precuneus, prefrontal circuits, prefrontal cortex, prefrontal cortex and angular gyrus, prefrontal cortex, angular gyrus, prefrontal cortex, specifically the dlpfc, prefrontal cortical activity, prefrontal regions f3/f4, prefrontal right region, primary motor cortex, primary motor cortex (m1), primary motor cortex (m1) contralateral to the moving leg (cm1), primary motor cortex (m1), posterior parietal cortex (ppc), and cerebellar cortex (cbm), right dlpfc and right ppc, right dorsolateral prefrontal cortex (dlpfc), right dorsolateral prefrontal cortex (dlpfc) and right posterior parietal cortex (ppc), right inferior frontal lobe, right temporoparietal junction, right temporoparietal junction (rtpj), right temporoparietal junction (rtpj) or dorsomedial prefrontal cortex (dmpfc), scalp, subcortical areas, theta frequency, unspecified
parameters_intensity | 0.498 | 0.786 | 28 | 1 ma, 1–2 ma, 1–2 ma (tdcs), 1–2 milliampere (ma), 1.5 ma, 1.5 ma, 1 ma, 2 ma, 2 milliampere (ma), 2 ma, 3 ma, 3 ma, anodal stimulation, low-intensity, missing, not specified, null, very low energy, very weak electrical current, very weak electrical current from a 9-volt battery
parameters_duration | 0.498 | 0.765 | 40 | 1 h/week, 1 h/week for 10 weeks, 1 min, 1 min, 1 min on, 5 s off, 10 sessions over 2 weeks, 13 min, 13 min, 1 min with 5 s breaks, 2 weeks, 20 min, 20 to 60 min, 20–30 min, 10 days, and 2 weeks, 3 sessions per week for 2–4 weeks, 3 sessions per week for 3 weeks (9 sessions total), 3 sessions per week, 6000 pulses each, for 2–4 weeks, 30 min, 30 min for 5 successive days for temporal cortex stimulation, 30 min per session, 30 min per session, 10 sessions over two weeks, 4 weeks, 4-week (3 times per week) treatment, 4000 pulses each, every 6 months for an average of two to four years; 3 tps sessions (6000 pulses each) per week for 2–4 weeks, 5 consecutive days, 5 consecutive days, 30 min per session, 5 sessions/day over 4 days, 5 sessions/day over 4 days only, 6 sessions over 2 weeks, 8 sessions of anodal tdcs completed twice a week for 4 weeks, 8 sessions, twice a week for 4 weeks, 9 sessions (3 sessions per week for 3 weeks), acute (one-time), applied during training sessions, approximately 20 min, daily over 4 weeks, during slow wave sleep, missing, null, repeated sessions, two weeks, two-week, varied (e.g., 3 times per week for 2–4 weeks)
Table 5. Inter-model reliability and agreement statistics.

(a)
Field | Fleiss Kappa | 95% CI | Interpretation
brain_stimulation_used | 0.904 | 0.68–1.0 | Almost Perfect
primary_type | 0.709 | 0.6–0.82 | Substantial
is_noninvasive | 0.59 | 0.4–0.71 | Moderate

(b)
Parameter | ICC(2,1) | 95% CI | Interpretation
intensity | 0.95 | 0.9–0.98 | Excellent
duration | 0.964 | 0.94–0.98 | Excellent
frequency | 1 | 1.0–1.0 | Excellent

(a) Fleiss’ kappa for categorical fields (bootstrap CI): each categorical field was converted into a contingency matrix (N × k), and Fleiss’ κ was computed using a naive bootstrap (1000 resamples) to derive approximate 95% confidence intervals. (b) ICC for numeric fields (Pingouin, ICC2): numeric fields were parsed as floats (e.g., “140 Hz” → 140.0) and processed via pingouin’s intraclass_corr to obtain the ICC(2) (two-way random, single measure). We report the ICC estimate, 95% CI, and a qualitative interpretation (Excellent, Good, Moderate, or Poor).
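The footnote above describes two computations: a bootstrapped Fleiss’ κ for categorical fields and an ICC(2,1) for numeric fields via pingouin. The sketch below shows one plausible implementation consistent with those descriptions; the column names and the use of statsmodels’ aggregate_raters/fleiss_kappa helpers are assumptions, not the authors’ exact code.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def fleiss_kappa_ci(labels: pd.DataFrame, n_boot: int = 1000, seed: int = 0):
    """Fleiss' kappa with a naive bootstrap 95% CI.

    `labels`: one row per trial, one column per model (categorical labels)."""
    counts, _ = aggregate_raters(labels.to_numpy())        # N x k count matrix
    point = fleiss_kappa(counts, method="fleiss")
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))    # resample trials with replacement
        resampled = aggregate_raters(labels.iloc[idx].to_numpy())[0]
        boots.append(fleiss_kappa(resampled, method="fleiss"))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)

def icc2(long_df: pd.DataFrame) -> pd.Series:
    """ICC(2,1) via pingouin; expects long-format columns: trial, model, value (float)."""
    table = pg.intraclass_corr(data=long_df, targets="trial",
                               raters="model", ratings="value")
    return table.set_index("Type").loc["ICC2"]             # two-way random, single measure
```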
Table 6. Final consensus table—first.

NCT ID | Brain Stimulation | Stimulation Type | Intensity (mA) | Duration (Minutes) | Primary Target | Confidence | Unanimous? | Comments
NCT06658795 | Yes | tDCS | | | lateral prefrontal cortex (LPFC) | High | No | Intensity (mA) note: no numeric parse; duration (minutes) note: no numeric parse
NCT05511259 | Yes | TMS | | | None | High | No | Intensity (mA) note: no numeric parse; duration (minutes) note: no numeric parse
NCT04154397 | Yes | ctDCS | | | cerebellum | High | No | Intensity (mA) note: no numeric parse; duration (minutes) note: no numeric parse
NCT03814304 | Yes | tDCS | 2.00 | 20.00 | left dorsolateral prefrontal cortex (dlPFC) | High | No |
NCT02436915 | Yes | tDCS | 2.00 | 20.00 | left prefrontal cortex | High | No |
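The “no numeric parse” comments in Table 6 arise when free-text values such as “2 mA” or “20 min per session” cannot be reduced to a single number. A minimal sketch of the parsing and averaging step, assuming the per-model strings are already collected (the regex and example values are illustrative):

```python
import re
from statistics import mean

NUM = re.compile(r"(\d+(?:\.\d+)?)")

def parse_number(text: str | None) -> float | None:
    """Pull the first numeric token out of a free-text value, if any."""
    if not text:
        return None
    m = NUM.search(text)
    return float(m.group(1)) if m else None

def consensus_numeric(values: list[str | None]) -> tuple[float | None, str]:
    """Average the parsable values across models; note when nothing could be parsed."""
    nums = [v for v in (parse_number(x) for x in values) if v is not None]
    if not nums:
        return None, "no numeric parse"
    return round(mean(nums), 2), ""

# Example resembling NCT03814304 in Table 3: four models report None, one reports "2mA".
print(consensus_numeric([None, None, None, None, "2mA"]))           # (2.0, "")
print(consensus_numeric([None, "twice weekly", None, None, None]))  # (None, "no numeric parse")
```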
Table 7. Per-field agreement vs. majority.

Model Suffix | Brain Stim. | Primary Type | Is Noninvasive | Intensity | Duration | Primary Target | Avg. Agreement
GPT-o1-mini | 98.8 | 90.2 | 95.2 | 77.8 | 68.8 | 87.3 | 86.3
x.ai Grok | 100 | 96.3 | 100 | 77.8 | 90.6 | 75.9 | 90.1
Meta LLaMA 3.3 70B | 98.8 | 92.7 | 98.8 | 77.8 | 90.6 | 86.1 | 90.8
Google Gemini Flash 1.5 8B | 98.8 | 96.3 | 97.6 | 77.8 | 78.1 | 83.5 | 88.7
DeepSeek Distill 70B | 100 | 98.8 | 97.6 | 94.4 | 90.6 | 50.6 | 88.7
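Table 7 scores each model by how often its output matches the per-trial majority label. A short sketch of that computation, again with illustrative column names and toy data:

```python
import pandas as pd

def agreement_vs_majority(labels: pd.DataFrame) -> pd.Series:
    """Percentage of trials where each model matches the row-wise majority label.

    `labels`: one row per trial, one column per model."""
    majority = labels.mode(axis=1)[0]        # most frequent label per trial
    matches = labels.eq(majority, axis=0)    # True where a model equals the majority
    return (matches.mean() * 100).round(1)   # per-model agreement in percent

# Toy example with three models and four trials.
toy = pd.DataFrame({
    "model_a": ["tdcs", "tdcs", "tms",  "tdcs"],
    "model_b": ["tdcs", "tacs", "tms",  "tdcs"],
    "model_c": ["tdcs", "tdcs", "tdcs", "tdcs"],
})
print(agreement_vs_majority(toy))
```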