Understanding Tradeoffs in Clinical Text Extraction: Prompting, Retrieval-Augmented Generation, and Supervised Learning on Electronic Health Records

Yadav, Tanya; Tekale, Aditya; Chong, Jeff; Masum, Mohammad

doi:10.3390/a19030215

Department of Applied Data Science, San Jose State University, San Jose, CA 95192, USA

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(3), 215; https://doi.org/10.3390/a19030215

Submission received: 1 February 2026 / Revised: 6 March 2026 / Accepted: 10 March 2026 / Published: 13 March 2026

(This article belongs to the Special Issue Advanced Algorithms for Biomedical Data Analysis)

Abstract

Clinical discharge summaries contain rich patient information but remain difficult to convert into structured representations for downstream analysis. Recent advances in large language models (LLMs) have introduced new approaches for clinical text extraction, yet their relative strengths compared with supervised methods remain unclear. This study presents a controlled evaluation of three dominant strategies for structured clinical information extraction from electronic health records: prompting-based extraction using LLMs, retrieval-augmented generation for terminology canonicalization, and supervised fine-tuning of domain-specific transformer models. Using discharge summaries from the MIMIC-IV dataset, we compare zero-shot, few-shot, and verification-based prompting across closed-source and open-source LLMs, evaluate retrieval-augmented canonicalization as a post-processing mechanism, and benchmark these methods against a fine-tuned BioClinicalBERT model. Performance is assessed using a multi-level evaluation framework that combines exact matching, fuzzy lexical matching, and semantic assessment via an LLM-based judge. The results reveal clear tradeoffs across approaches: prompting achieves strong semantic correctness with minimal supervision, retrieval augmentation improves terminology consistency without expanding extraction coverage, and supervised fine-tuning yields the highest overall accuracy when labeled data are available. Across all methods, we observe a consistent 40–50% gap between exact-match and semantic correctness, highlighting the limitations of string-based metrics for clinical natural language processing (NLP). These findings provide practical guidance for selecting extraction strategies under varying resource constraints and emphasize the importance of evaluation methodologies that reflect clinical equivalence rather than surface-form similarity.

Keywords: clinical information extraction; large language models; retrieval-augmented generation; BioClinicalBERT; MIMIC-IV; semantic evaluation; electronic health records

1. Introduction

Unstructured clinical notes contain a substantial portion of clinically relevant patient information, yet their free-text format poses persistent challenges for automated extraction and structuring. Reliable identification of diagnoses, procedures, and medications from clinical narratives is essential for biomedical research, clinical decision support, and the development of data-driven healthcare systems [1,2]. However, clinical information extraction remains a challenging task in clinical natural language processing (NLP) due to high lexical variability, large and heterogeneous label spaces, frequent abbreviations, and domain-specific writing styles that often diverge from standardized terminologies [3].

Early work in clinical named entity recognition and information extraction relied on rule-based systems and statistical models such as Conditional Random Fields, which required extensive feature engineering and exhibited limited robustness across institutions and documentation styles. The introduction of transformer-based language models, particularly Bidirectional Encoder Representations from Transformers (BERT) [4], marked a turning point by enabling contextualized representations of clinical language. Domain-adapted variants including BioBERT [5], ClinicalBERT, and BlueBERT [6], trained on biomedical literature and clinical notes, consistently outperformed general-domain models on biomedical and clinical NLP tasks. Subsequent transformer-based clinical models, such as RoBERTa-MIMIC, reported strong performance on selected clinical concept extraction benchmarks, with F1-scores exceeding 0.89 in controlled settings [7]. These advances established domain-specific pretraining as a foundational component of high-performing supervised clinical information extraction systems. In parallel, multi-label classification formulations have been shown to be effective when ground truth is provided as sets of clinical concepts rather than entity spans, particularly in settings characterized by large clinical vocabularies.

More recently, large language models (LLMs) have introduced an alternative paradigm for clinical text extraction that reduces reliance on task-specific supervision. LLMs demonstrate strong natural language understanding and generation capabilities [8], enabling task adaptation through prompting rather than explicit fine-tuning. Few-shot prompting has been successfully applied to a range of clinical NLP tasks, allowing models to leverage limited in-context examples [9]. However, recent empirical studies report nuanced and sometimes counterintuitive findings. Prompting performance is highly sensitive to prompt structure, with reasoning-oriented strategies such as chain-of-thought and verification-based prompting outperforming simple example-based approaches in certain clinical contexts [10]. Moreover, evidence that zero-shot prompting can outperform few-shot configurations in biomedical tasks suggests that pretrained LLMs may already encode substantial domain knowledge, which is not always enhanced by in-context demonstrations [11]. These observations raise important questions about when prompting alone is sufficient for structured clinical extraction and when supervised learning remains necessary.

Retrieval-augmented generation (RAG) has emerged as a complementary strategy for grounding LLM outputs in external knowledge sources, a consideration of particular importance in healthcare applications where accuracy and consistency are critical [12]. RAG-based systems have demonstrated improvements in diagnostic reasoning, clinical decision support, and biomedical information extraction by incorporating evidence from structured knowledge bases and clinical literature [13]. In the context of clinical entity extraction, RAG can be applied as a post-processing canonicalization mechanism to align extracted surface forms with standardized terminology. Such canonicalization can mitigate variability arising from abbreviations, brand–generic differences, and heterogeneous documentation practices, thereby improving semantic consistency and interpretability. However, when predicted entities diverge substantially from reference terminology, canonicalization alone is unlikely to recover missing extractions or significantly improve strict string-based performance.

Alongside advances in modeling, growing attention has been directed toward limitations of traditional evaluation metrics for clinical NLP. Exact string matching often underestimates extraction quality by penalizing clinically equivalent expressions that differ lexically [14]. Fuzzy matching partially addresses this limitation but remains insufficient for capturing true clinical equivalence across diverse surface forms. Recent work demonstrates that large language model-based semantic evaluation, including LLM-as-Judge approaches, better reflects clinical correctness by accounting for synonymy, abbreviation expansion, and partial matches [15,16]. Comparative analyses consistently reveal large discrepancies between exact-match and semantically informed evaluation, underscoring the need for assessment frameworks that align more closely with clinical interpretation rather than surface-form similarity alone.

Despite substantial progress across supervised learning, prompting-based extraction, and retrieval-augmented approaches, there remains limited systematic understanding of their relative tradeoffs when applied to the same clinical task under controlled conditions. Existing studies often evaluate these paradigms in isolation, rely on heterogeneous datasets or metrics, or prioritize performance improvements without examining methodological implications. This study addresses these gaps by conducting a unified evaluation of three complementary strategies for structured clinical information extraction from MIMIC-IV discharge summaries: (1) prompting-based extraction using zero-shot, few-shot, and Chain-of-Verification strategies; (2) retrieval-augmented generation for entity canonicalization; and (3) supervised fine-tuning of a domain-specific transformer model. By combining exact, fuzzy, and semantic evaluation metrics, we aim to clarify algorithmic tradeoffs across approaches, quantify the discrepancy between string-based and clinically informed evaluation, and provide practical guidance for selecting extraction strategies under varying data availability and resource constraints.

2. Methodological Framework and Study Design

Figure 1 provides a high-level overview of the proposed methodological framework.

The framework is designed to enable a controlled comparison of three dominant paradigms for structured clinical entity extraction from free-text discharge summaries: prompting-based extraction using large language models, retrieval-augmented canonicalization, and supervised fine-tuning of a domain-specific transformer model. Discharge summaries from the MIMIC-IV dataset are first preprocessed and partitioned into training, validation, and test sets. The pipeline then branches into two primary extraction pathways: (1) a prompting-based pathway that applies large language models to directly generate structured outputs, and (2) a supervised learning pathway based on BioClinicalBERT.

Outputs generated by the prompting-based pathway may optionally pass through a retrieval-augmented generation (RAG)-based canonicalization stage. This stage is applied as a post-processing step and is designed to improve terminology consistency without altering extraction coverage. Outputs from all pathways are subsequently evaluated using a unified assessment protocol. By keeping data splits, entity definitions, and evaluation criteria fixed across all experiments, the framework isolates methodological differences between prompting, retrieval augmentation, and supervised learning, allowing their respective tradeoffs to be examined under comparable conditions.

Within this framework, prompting-based extraction is evaluated across multiple in-context learning strategies using both closed-source and open-source LLMs. Zero-shot, few-shot, and Chain-of-Verification prompting are applied to the same extraction task and output schema, enabling assessment of how prompt structure and model choice influence extraction behavior. This design facilitates comparison of prompting strategies without conflating performance differences with variations in task formulation or output constraints.

Retrieval-augmented canonicalization is incorporated as an entity-type-specific post-processing component. Predicted entities are embedded using biomedical sentence representations and compared against separate canonical label sets for diagnoses, procedures, and medications. Canonicalization is applied selectively based on similarity thresholds, ensuring that surface-form normalization improves interpretability and consistency while avoiding aggressive over-normalization. Treating retrieval augmentation as a modular refinement step allows its effects to be evaluated independently from the underlying extraction process.

The supervised learning pathway employs BioClinicalBERT, a transformer model pre-trained on clinical text, and formulates entity extraction as a multi-label classification problem. This formulation reflects the structure of the available ground truth, which is provided as sets of clinical concepts per admission rather than token-level annotations. Model training and inference are conducted using the same data partitions as the prompting-based experiments, enabling direct comparison between supervision-intensive and supervision-light approaches under identical conditions.

Finally, all extraction outputs are assessed using a unified evaluation framework that combines exact string matching, fuzzy lexical matching, and semantic evaluation via an LLM-based judge. This multi-level evaluation is intended to capture both surface-level accuracy and clinically meaningful correctness, and to highlight discrepancies between string-based and semantic assessment. Integrating all methods within a single experimental and evaluation framework enables systematic analysis of tradeoffs among prompting, retrieval-augmented, and supervised approaches, particularly with respect to data availability, computational cost, and evaluation sensitivity.

3. Clinical Entity Extraction Approaches

We evaluate three complementary pathways for structured clinical entity extraction: prompting-based extraction using large language models, retrieval-augmented canonicalization as a post-processing step, and supervised fine-tuning of a domain-specific transformer model.

The prompting-based pathway applies multiple in-context learning strategies using GPT-4o-mini and Qwen-2.5-7B. Three prompting paradigms commonly used in LLM-based information extraction are evaluated. In the zero-shot setting, models receive only task instructions and a fixed JSON output schema specifying three entity lists: diagnoses, procedures, and medications [17]. Few-shot prompting is evaluated using two-shot and four-shot configurations, with annotated examples selected via stratified sampling based on entity count percentiles (25th and 75th percentiles for two-shot; 20th, 40th, 60th, and 80th percentiles for four-shot) [8]. Chain-of-Verification (CoVE) prompting is also evaluated as a multi-stage procedure involving initial extraction, generation of verification questions, re-examination of the clinical note, error correction, and production of a final verified JSON output [18,19]. All prompting strategies enforce a stable output schema to ensure comparability across configurations. Prompt templates and example constructions are provided in Appendix A.
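The percentile-based selection of few-shot examples can be illustrated with a minimal Python sketch. It assumes a nearest-rank percentile definition and a hypothetical record layout with an `n_entities` field; the study does not state which percentile convention or data structure was used, and the toy pool below is invented for illustration only.

```python
def pick_by_percentile(records, percentiles):
    """Return one record per requested percentile of entity count.

    Uses the nearest-rank definition (an assumption, not taken from the paper).
    """
    ranked = sorted(records, key=lambda r: r["n_entities"])
    picks = []
    for p in percentiles:
        # Integer ceiling of p% of the pool size, clamped to at least rank 1.
        rank = max(1, (p * len(ranked) + 99) // 100)
        picks.append(ranked[rank - 1])
    return picks


# Hypothetical training pool: toy entity counts per annotated summary.
counts = [14, 22, 31, 38, 45, 52, 60, 67, 73, 80]
pool = [{"id": i, "n_entities": n} for i, n in enumerate(counts)]

two_shot = pick_by_percentile(pool, [25, 75])           # 25th and 75th percentiles
four_shot = pick_by_percentile(pool, [20, 40, 60, 80])  # 20th/40th/60th/80th percentiles
```

Selecting examples at spread-out percentiles exposes the model to both sparse and dense summaries rather than only typical ones.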

To address surface-form variability and clinical synonymy in prompting-based outputs, retrieval-augmented generation (RAG) is applied as an optional post-processing canonicalization step. Predicted entities are embedded using S-PubMedBERT and compared against canonical label sets derived from ground-truth annotations [12]. Separate knowledge bases are maintained for diagnoses, procedures, and medications to prevent cross-category mismatches. For each predicted entity, the top-k candidates retrieved from a FAISS IndexFlatIP index are examined. If the highest cosine similarity score exceeds a predefined threshold, the predicted surface form is replaced with the corresponding canonical label; otherwise, the original prediction is retained. This selective, entity-type-specific canonicalization improves terminology consistency while avoiding incorrect over-normalization and does not alter extraction coverage.

The supervised extraction pathway employs BioClinicalBERT, a transformer model pre-trained on clinical notes [20]. Because ground truth annotations are provided as sets of entity labels per admission rather than token-level spans, extraction is formulated as a multi-label classification problem with three independent sigmoid output heads corresponding to diagnoses (1078 labels), procedures (344 labels), and medications (555 labels). Weighted binary cross-entropy loss is used to mitigate class imbalance [21]. Model optimization follows a two-stage strategy consisting of initial training with frozen encoder layers, followed by progressive unfreezing [22]. Thresholds for converting predicted probabilities to binary outputs, along with top-K output caps for each entity type, are tuned using validation data.
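As a rough illustration of the weighted binary cross-entropy objective, the following pure-Python sketch computes the loss for one output head. The weighting scheme is not specified in the text; a per-label positive weight (for example, the ratio of negative to positive training examples) is assumed here, and the function name and toy values are hypothetical.

```python
import math


def weighted_bce(probs, targets, pos_weight):
    """Mean weighted binary cross-entropy over the labels of one output head.

    probs:      sigmoid outputs in (0, 1), one per label
    targets:    0/1 ground-truth indicators, one per label
    pos_weight: per-label weight applied to the positive term (assumed scheme)
    """
    eps = 1e-7  # clamp to avoid log(0)
    total = 0.0
    for p, y, w in zip(probs, targets, pos_weight):
        p = min(max(p, eps), 1.0 - eps)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy example: two labels, the first one positive.
unweighted = weighted_bce([0.5, 0.5], [1, 0], [1.0, 1.0])
upweighted = weighted_bce([0.5, 0.5], [1, 0], [3.0, 1.0])
```

Raising the weight on rare positive labels increases the penalty for missing them, which is the usual motivation for this loss when most of the 1078, 344, or 555 labels are absent from any given admission.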

4. Data and Preprocessing

This study uses the publicly available MIMIC-IV dataset, which contains 331,793 discharge summaries corresponding to more than 360,000 patients treated at Beth Israel Deaconess Medical Center between 2008 and 2022 [23]. Discharge summaries serve as the unstructured text source, while structured ground-truth information is obtained from linked clinical tables, including diagnoses (diagnoses_icd.csv), procedures (procedures_icd.csv), and medications (prescriptions.csv). Discharge summaries are linked to these tables using the hospital admission identifier (hadm_id), enabling admission-level alignment between unstructured text and structured entity references.

To ensure consistent coverage across all entity types, the dataset is restricted to admissions for which diagnoses, procedures, and medication records are all available, resulting in 194,530 eligible admissions. Additional filtering is applied to remove atypical or incomplete cases and to focus the analysis on clinically representative discharge summaries. Specifically, admissions are retained if they contain 3–20 diagnoses per admission (median: 12), 1–8 procedures (median: 2), and 10–50 medications (median: 37). These criteria exclude extreme outliers while preserving typical inpatient complexity. After filtering, the final working dataset consists of 449 discharge summaries from 300 unique patients.
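The inclusion criteria above can be expressed as a small filter. This is a sketch only: it assumes each table has already been reduced to deduplicated `(hadm_id, label)` pairs, and the function and variable names are hypothetical rather than taken from the study's code.

```python
from collections import Counter


def eligible_admissions(dx_rows, proc_rows, med_rows):
    """Return hadm_ids that have all three entity types within the stated ranges."""
    dx = Counter(hadm_id for hadm_id, _ in dx_rows)
    proc = Counter(hadm_id for hadm_id, _ in proc_rows)
    med = Counter(hadm_id for hadm_id, _ in med_rows)
    # An admission must appear in all three tables to be considered at all.
    shared = set(dx) & set(proc) & set(med)
    return sorted(
        h for h in shared
        if 3 <= dx[h] <= 20 and 1 <= proc[h] <= 8 and 10 <= med[h] <= 50
    )


# Toy rows: admission 1 qualifies, 2 has too few diagnoses, 3 has no procedures.
dx_rows = [(1, f"dx{i}") for i in range(3)] + [(2, "dx0"), (2, "dx1")] \
    + [(3, f"dx{i}") for i in range(5)]
proc_rows = [(1, "p0"), (2, "p0")]
med_rows = [(h, f"m{i}") for h in (1, 2, 3) for i in range(10)]

kept = eligible_admissions(dx_rows, proc_rows, med_rows)
```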

The filtering criteria were applied intentionally to enable controlled comparison across extraction paradigms under consistent entity-type availability and to prevent patient-level information leakage across splits. However, narrowing the cohort from 194,530 eligible admissions to 449 discharge summaries (300 unique patients) substantially reduces the sample size relative to the source corpus and may limit representativeness. Accordingly, the results should be interpreted as a controlled methodological comparison rather than a population-level performance estimate across the full MIMIC-IV dataset.

It is also important to note that ground-truth entity labels were derived from admission-level structured coding tables (diagnoses_icd, procedures_icd, prescriptions) rather than manually annotated mentions within discharge summaries. In practice, narrative documentation and billing or structured coding data may not perfectly align. Certain diagnoses or historical conditions may appear in structured tables but not be explicitly described in the discharge summary, while clinically relevant narrative mentions may not correspond to structured codes. As a result, the reference labels used in this study may contain both omissions and inclusions relative to the narrative text, introducing potential label noise.

To prevent information leakage and ensure independence across experimental splits, patient-level stratified sampling is employed such that no patient appears in more than one partition [24]. The resulting splits include 276 discharge summaries from 180 patients for training (61.5%), 84 summaries from 60 patients for validation (18.7%), and 89 summaries from 60 patients for testing (19.8%). Ground-truth entity labels are extracted via hadm_id joins with the structured tables and stored in JSON format to ensure consistency across extraction pathways and evaluation procedures.
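A patient-level split of this kind might be sketched as follows. The sketch shuffles patients and assigns them 60/20/20, which reproduces the 180/60/60 patient counts reported above; it does not implement the stratification itself, because the stratification variable is not stated. The seed, function names, and toy identifiers are assumptions.

```python
import random


def patient_level_split(patient_ids, seed=42, fractions=(0.6, 0.2, 0.2)):
    """Assign each patient to exactly one of train/val/test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # local RNG, so global state is untouched
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


def split_of(subject_id, splits):
    """Look up which partition a patient (and hence all their admissions) belongs to."""
    return next(name for name, members in splits.items() if subject_id in members)


# Toy cohort of 300 hypothetical patient identifiers.
splits = patient_level_split(range(1000, 1300))
```

Because admissions inherit the partition of their patient, a patient with several discharge summaries contributes all of them to a single split, which is why the summary counts (276/84/89) are not an exact 60/20/20 division.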

5. Evaluation Framework

To assess extraction performance comprehensively, we employ a three-level evaluation framework that captures surface-level accuracy as well as clinically meaningful semantic correctness. The framework combines exact string matching, fuzzy lexical matching, and semantic evaluation using a large language model-based judge. Together, these metrics enable analysis of how different extraction approaches perform under increasingly permissive notions of correctness.

5.1. Exact Match

Exact Match evaluates strict string equivalence between predicted and ground-truth entities after normalization. This metric reflects the most conservative evaluation setting and penalizes any lexical deviation, including differences in word order, abbreviation usage, or synonymous expressions. Precision, recall, and F1-score are defined as follows:

\text{Precision} = \frac{\text{Matches}}{\text{Total Predicted}}

\text{Recall} = \frac{\text{Matches}}{\text{Total Ground Truth}}

\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Exact Match provides a lower bound on extraction performance and is commonly used in clinical NLP benchmarks, but it may underestimate clinically valid predictions when surface forms differ. Because the reference labels are derived from structured admission-level coding tables rather than token-level manual annotations, exact string matching may penalize predictions that are clinically correct but absent from billing codes, or may reward matches that reflect coding artifacts rather than explicit narrative documentation. The inclusion of fuzzy and semantic evaluation metrics partially mitigates this limitation by assessing clinical equivalence rather than strict code-level alignment.
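The exact-match computation can be sketched in a few lines of Python. The normalization step is an assumption (lowercasing and whitespace collapsing); the text says only that entities are compared "after normalization", and the function names are hypothetical.

```python
def normalize(term):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def exact_prf(predicted, gold):
    """Precision, recall, and F1 under strict string equality of normalized terms."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    matches = len(pred & ref)
    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# "heart failure" does not match "Chronic heart failure" under exact matching.
p, r, f1 = exact_prf(
    ["Aspirin", "heart failure", "metformin"],
    ["aspirin", "Chronic heart failure"],
)
```

In this toy case only "aspirin" matches, giving a precision of 1/3, a recall of 1/2, and an F1 of 0.4, even though "heart failure" is clinically close to the reference term.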

5.2. Fuzzy Match

Fuzzy Match relaxes strict string equivalence by accounting for lexical variation using Levenshtein similarity. The similarity between two strings is defined as follows:

\text{Similarity}(s_{1}, s_{2}) = 100 \left(1 - \frac{\text{edit\_dist}(s_{1}, s_{2})}{\max(\text{len}(s_{1}), \text{len}(s_{2}))}\right)

Predicted and ground-truth entity pairs with similarity scores of at least 60% are considered matches. This metric partially mitigates penalties arising from minor spelling differences, abbreviations, or morphological variants, but it remains limited in capturing deeper semantic equivalence.
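The similarity definition and the 60% cut-off can be sketched directly in Python. The code below covers only the pair-level decision; how predicted and reference entities are paired within an admission (for instance, greedy one-to-one assignment) is not described in the text and is left out. Function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s1, s2):
    """Similarity in percent, following the formula above."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100 * (1 - edit_distance(s1, s2) / longest)


def is_fuzzy_match(pred, gold, threshold=60.0):
    """A pair counts as a match when similarity is at least the threshold."""
    return similarity(pred, gold) >= threshold


sim = similarity("hypertension", "hypertensive")            # 2 edits over 12 characters
hf = is_fuzzy_match("heart failure", "chronic heart failure")
```

With these definitions, "heart failure" and "chronic heart failure" differ by 8 insertions over 21 characters (about 61.9% similarity), so the pair narrowly passes the 60% threshold.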

5.3. Semantic Evaluation Using an LLM-as-Judge

To assess semantic correctness beyond surface-form similarity, we employ a large language model-based evaluation strategy using DeepSeek-v3 (December 2025) [15] as an automated judge. The judge compares predicted and ground-truth entities and assigns one of three correctness labels: correct (1.0), partial (0.5), or incorrect (0.0). This grading scheme allows partial credit for predictions that are clinically related but differ in specificity or phrasing.

The evaluation prompt instructs the model to consider clinical equivalence rather than lexical identity, explicitly accounting for factors such as synonymy, abbreviation expansion, reordered phrases, and brand–generic medication equivalence. For example, predictions that differ only in specificity (e.g., “heart failure” vs. “chronic heart failure”) are treated as partial rather than incorrect matches.

Using the judge-assigned scores, precision, recall, and F1-score are computed as follows:

\text{Precision} = \frac{\text{correct} + 0.5 \times \text{partial}}{\text{total predicted}}

\text{Recall} = \frac{\text{correct} + 0.5 \times \text{partial}}{\text{total ground truth}}

\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
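A minimal sketch of these formulas in Python, assuming the judge returns one label per predicted entity and the number of reference entities is known. The function name and the toy labels are hypothetical; the score mapping (1.0, 0.5, 0.0) is the one given in the text.

```python
SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def judge_prf(judge_labels, n_ground_truth):
    """Precision, recall, and F1 from per-prediction judge labels."""
    credit = sum(SCORES[label] for label in judge_labels)  # correct + 0.5 * partial
    precision = credit / len(judge_labels) if judge_labels else 0.0
    recall = credit / n_ground_truth if n_ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Four predictions judged against five reference entities.
jp, jr, jf1 = judge_prf(["correct", "correct", "partial", "incorrect"], 5)
```

Here the total credit is 2.5, so precision is 2.5/4 = 0.625, recall is 2.5/5 = 0.5, and F1 is roughly 0.556.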

This semantic evaluation is intended to approximate clinically informed judgment at scale and to complement string-based metrics rather than replace them. While LLM-as-Judge evaluation introduces model-dependent biases and does not substitute for expert human adjudication, prior work has shown that it provides a more faithful estimate of clinical correctness than exact or fuzzy matching alone [15]. In this study, semantic evaluation is used to highlight discrepancies between surface-level accuracy and clinically meaningful extraction quality across methods.

The semantic evaluation relies on a single LLM judge (DeepSeek-v3) operating under a fixed evaluation prompt. No multi-judge ensemble, cross-model agreement analysis, or calibration against a manually adjudicated gold standard was performed, so semantic scores may reflect model-specific evaluation tendencies. To partially assess clinical validity beyond automated scoring, a clinician evaluation study was conducted on a subset of test summaries (Section 7.6).

6. Experimental Design

All experiments are conducted using the training, validation, and test splits described in Section 4. Prompting-based extraction, retrieval-augmented prompting, and supervised fine-tuning are evaluated on the identical test set to ensure direct comparability across approaches. The experimental workflow consists of applying prompting strategies to validation and test discharge summaries, optionally post-processing prompting outputs using retrieval-augmented canonicalization, training and validating the supervised model on the training split, and evaluating all approaches using a unified metric suite.

Prompting experiments are performed using GPT-4o-mini via the OpenAI API and Qwen-2.5-7B with local inference. To ensure reproducibility and minimize stochastic variation, all prompting runs use a temperature of 0.0, a maximum token length of 2048, batch size of 1, and a fixed random seed of 42. Qwen-2.5-7B inference is executed on a Google Colab A100 GPU with 40 GB memory. Semantic evaluation is performed using DeepSeek, and all prompting templates and example constructions are provided in Appendix A.

Each prompting configuration, including zero-shot, two-shot, four-shot, and Chain-of-Verification prompting, is applied to all 84 validation discharge summaries and 89 test summaries for both model backbones. This results in eight prompting conditions in total (four prompting strategies across two models). Latency and API usage costs are recorded for all prompting runs to support comparison across configurations.

6.1. RAG Canonicalization Experiments

RAG was applied strictly as a post-processing step to prompting outputs. The configuration used in all RAG experiments is summarized in Table 1.

Entity embeddings were computed using S-PubMedBERT; retrieval used FAISS IndexFlatIP with top-k = 3 and a similarity threshold of 0.7. Knowledge bases were constructed from all unique ground-truth labels in the dataset. RAG-enhanced outputs were evaluated against raw prompting outputs to isolate the effect of canonicalization.
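The thresholded canonicalization rule can be sketched without external libraries. In this sketch a brute-force inner product over L2-normalized vectors stands in for the FAISS IndexFlatIP search, and small hand-written vectors stand in for S-PubMedBERT embeddings; both substitutions, along with the function names and the toy knowledge base, are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def canonicalize(pred_text, pred_vec, kb, top_k=3, threshold=0.7):
    """Replace pred_text with the best canonical label if similarity exceeds threshold.

    kb is a list of (canonical_label, embedding) pairs for one entity type.
    """
    query = l2_normalize(pred_vec)
    scored = sorted(
        ((sum(a * b for a, b in zip(query, l2_normalize(vec))), label)
         for label, vec in kb),
        reverse=True,
    )[:top_k]  # top-k candidates, highest similarity first
    if scored and scored[0][0] > threshold:
        return scored[0][1]
    return pred_text  # below threshold: keep the original surface form


# Toy diagnosis knowledge base with hypothetical 3-dimensional embeddings.
diagnosis_kb = [
    ("hypertension", [1.0, 0.0, 0.0]),
    ("diabetes mellitus", [0.0, 1.0, 0.0]),
    ("atrial fibrillation", [0.0, 0.0, 1.0]),
]

mapped = canonicalize("HTN", [0.9, 0.1, 0.0], diagnosis_kb)
kept_as_is = canonicalize("rash", [0.5, 0.5, 0.5], diagnosis_kb)
```

Because the rule only rewrites existing predictions, it can change which surface form is emitted but never adds or removes an entity, consistent with the observation that canonicalization does not alter extraction coverage.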

6.2. Fine-Tuning Experiments

BioClinicalBERT was fine-tuned on the training data using the hyperparameters shown in Table 2.

The model was trained for up to 50 epochs with early stopping based on validation loss (patience = 10). A 10% linear warmup schedule was used, and optimizer settings followed standard AdamW configurations. Thresholds for converting sigmoid probabilities to binary predictions were tuned on the validation set via grid search. Top-K caps (20 for diagnoses, 8 for procedures, 50 for medications) were applied during inference to limit over-prediction. The best-performing checkpoint was evaluated on the test set.
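The inference-time decoding step (thresholding followed by a top-K cap) might look like the sketch below. The caps are the values stated above; the threshold and the toy probabilities are hypothetical, since the tuned thresholds are not reported in this section.

```python
TOP_K_CAPS = {"diagnoses": 20, "procedures": 8, "medications": 50}  # values from the text


def decode_labels(probs, labels, threshold, top_k):
    """Keep labels with probability >= threshold, then cap at the top_k most probable."""
    kept = [(p, lab) for p, lab in zip(probs, labels) if p >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [lab for _, lab in kept[:top_k]]


# Toy procedure head with four labels and a hypothetical threshold of 0.5.
toy_labels = ["proc_a", "proc_b", "proc_c", "proc_d"]
toy_probs = [0.9, 0.2, 0.6, 0.7]

capped = decode_labels(toy_probs, toy_labels, threshold=0.5, top_k=2)
uncapped = decode_labels(toy_probs, toy_labels, threshold=0.5,
                         top_k=TOP_K_CAPS["procedures"])
```

The cap only matters when many labels clear the threshold; otherwise the threshold alone determines the output set.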

Figure 2 illustrates the training and validation loss trajectories across all epochs for BioClinicalBERT fine-tuning. Both losses decrease rapidly during the first 10 epochs, with training loss falling from 2.06 to 0.28 and validation loss from 1.96 to 0.26.

The curves converge closely until epoch 12, after which validation loss begins to plateau while training loss continues to decline. The minimum validation loss (0.19) occurs at epoch 15, which was selected as the optimal checkpoint under the early stopping criterion. Subsequent divergence indicates the onset of overfitting, validating the checkpoint selection procedure.

7. Experimental Results

7.1. Prompting-Based Extraction Performance

Tables 3 and 4 summarize the performance of GPT-4o-mini and Qwen-2.5-7B across prompting strategies and entity types. Across both models, semantic evaluation consistently yields substantially higher scores than exact or fuzzy matching, underscoring the extent to which string-based metrics underestimate clinically meaningful correctness in free-text extraction tasks.

For GPT-4o-mini (Table 3), zero-shot prompting achieves the strongest overall semantic performance on the validation set (Judge F1 = 0.447), while few-shot prompting provides limited and inconsistent gains. Chain-of-Verification prompting does not yield systematic improvements over simpler prompting strategies. Across all prompting configurations, medication extraction consistently outperforms diagnosis and procedure extraction, achieving Judge F1 scores above 0.50. This pattern reflects lower lexical variability and clearer surface forms for medications relative to diagnoses and procedures in discharge summaries. Results for Qwen-2.5-7B (Table 4) exhibit similar trends. CoVE prompting achieves the highest overall semantic performance (Judge F1 = 0.419), although differences across prompting strategies remain modest. As with GPT-4o-mini, medication extraction substantially outperforms other entity types. Despite running entirely locally, Qwen-2.5-7B attains approximately 94% of GPT-4o-mini’s zero-shot semantic performance, highlighting the competitiveness of open-source models under resource-constrained settings.

Across both models, the consistent gap between exact, fuzzy, and semantic scores reinforces the importance of evaluation frameworks that move beyond surface-form matching when assessing clinical extraction quality.

7.2. Effects of Retrieval-Augmented Canonicalization

Table 5 reports test-set performance for GPT-4o-mini with and without retrieval-augmented canonicalization. RAG yields localized improvements in fuzzy matching, particularly for diagnoses, where Fuzzy F1 increases from 0.123 to 0.162. Medication performance remains largely stable, while procedure extraction exhibits mixed behavior.

Overall performance differences between RAG-enhanced and non-RAG configurations are modest, indicating that retrieval-based canonicalization primarily improves output consistency rather than extraction coverage. These results suggest that RAG is most effective as a normalization mechanism for surface-form variability, rather than as a substitute for improved entity detection. It is important to emphasize that retrieval in this study is restricted to post-processing canonicalization against ground-truth-derived label sets. This configuration refines surface-form consistency but does not introduce new contextual knowledge or reasoning support during extraction. Therefore, the limited gains observed here reflect the behavior of this specific normalization-oriented implementation rather than a general limitation of retrieval-augmented generation in clinical NLP tasks.

7.3. Cost and Latency Analysis

In addition to extraction accuracy, practical deployment considerations are critical for clinical NLP systems. Tables 6 and 7 summarize the monetary cost, token usage, and latency associated with each prompting configuration, enabling direct comparison of performance–efficiency tradeoffs across model families and prompting strategies.

Evaluating all prompting strategies on the validation set requires a total cost of $0.366 and approximately 53 min of compute time. These results highlight the practical tradeoffs between supervision-free prompting strategies and resource consumption, particularly in large-scale or cost-sensitive deployment scenarios.

Zero-shot prompting consistently provides the strongest performance-to-cost ratio, achieving competitive semantic accuracy with the lowest token usage and inference cost. Few-shot prompting substantially increases input token counts without delivering consistent performance gains, resulting in diminishing returns under cost-sensitive settings. GPT-4o-mini exhibits lower per-request latency but incurs API cost, whereas Qwen-2.5-7B eliminates monetary cost at the expense of higher inference latency and local hardware requirements. These findings highlight tradeoffs between supervision level, monetary cost, and computational overhead that must be considered in real-world deployment scenarios.

7.4. Supervised Fine-Tuning Results

Table 8 presents test-set performance for the fine-tuned BioClinicalBERT model. Compared with prompting-based approaches, supervised fine-tuning achieves substantially higher string-based performance, with an overall Exact F1 of 0.201 and Fuzzy F1 of 0.608. This improvement reflects the model’s ability to leverage labeled data to capture domain-specific terminology, even when surface forms vary. Performance gains are most pronounced for diagnoses and medications, while procedure extraction remains more challenging. The large discrepancy between Exact and Fuzzy scores further illustrates the extent of lexical variation present in discharge summaries and the limitations of strict string matching for evaluating supervised clinical NLP systems.
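The relationship between the per-entity and overall figures can be checked with a short calculation. The sketch below recomputes the Fuzzy F1 values from the precision and recall columns of Table 8 and averages them without weighting; that the overall score is an unweighted (macro) average is an inference from the reported numbers, not something stated explicitly in this section.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Fuzzy-match (precision, recall) per entity type, from Table 8.
FUZZY = {
    "Diagnosis": (0.466, 0.973),
    "Procedures": (0.250, 0.994),
    "Medications": (0.880, 0.723),
}

per_entity = {name: f1(p, r) for name, (p, r) in FUZZY.items()}
macro_f1 = sum(per_entity.values()) / len(per_entity)

for name, score in per_entity.items():
    print(f"{name}: {score:.3f}")
print(f"Macro average: {macro_f1:.3f}")
```

The recomputed values agree with Table 8 to three decimals. They also show why procedures remain difficult: recall is near 1.0, but precision of 0.250 pulls the F1 down to about 0.40.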

7.5. Comparative Analysis Across Extraction Paradigms

Table 9 directly compares the best-performing prompting configuration, its RAG-enhanced variant, and the supervised fine-tuned model. Supervised fine-tuning achieves the strongest performance across string-based metrics, improving Exact F1 from 0.125 to 0.201 and Fuzzy F1 from 0.236 to 0.608 relative to zero-shot prompting. RAG-enhanced prompting maintains similar fuzzy performance to prompting alone but does not yield consistent gains in overall extraction accuracy.

These results highlight a clear tradeoff between supervision requirements and performance: prompting-based methods provide competitive semantic correctness with minimal labeled data, while supervised fine-tuning delivers higher string-level accuracy when labeled data are available.

7.6. Clinician Evaluation of Extraction Quality

To complement automated evaluation metrics, a board-certified physician independently evaluated 46 discharge summaries sampled from the test set. For each note, the clinician was provided with the full discharge summary and the corresponding extracted diagnoses, procedures, and medications. Each extraction was rated on a 1–5 Likert scale across three dimensions: (1) extraction completeness, (2) normalization correctness, and (3) clinical plausibility.

As shown in Table 10, across the 46 evaluated notes, mean ratings were 3.73 ± 0.54 for extraction completeness, 3.71 ± 0.51 for normalization correctness, and 3.60 ± 0.54 for clinical plausibility. Additionally, 68.9% of notes were rated ≥4 for completeness and normalization, and 57.8% were rated ≥4 for clinical plausibility. These findings provide independent expert validation of extraction quality and partially support the trends observed under semantic evaluation, although only a single clinician participated in the assessment.

7.7. Benchmark Contextualization

To clarify methodological positioning relative to prior EHR-LLM benchmarks, Table 11 contrasts the present study with representative systems across dataset domain, task formulation, model families evaluated, retrieval usage, supervision strategy, evaluation methodology, and deployment analysis. This comparison emphasizes differences in experimental design rather than direct performance ranking.

Table 12 positions the reported results within the broader landscape of clinical information extraction. As expected, domain-specific transformer models fine-tuned on established benchmarks, such as RoBERTa-MIMIC, continue to achieve strong performance on curated datasets. In this context, the fine-tuned BioClinicalBERT model attains a Fuzzy F1 of 0.608 on MIMIC-IV discharge summaries, reflecting the increased lexical variability and complexity of this corpus relative to more homogeneous benchmarks.

Prompting-based LLM approaches perform competitively in zero-shot settings, with GPT-4o-mini achieving a Judge F1 of 0.447 on the proposed task. Retrieval-augmented prompting yields more limited benefits in this setting compared with prior work such as DiRAG, likely due to dataset-specific terminology patterns and differences in knowledge base construction. Together, these results emphasize that extraction performance is highly sensitive to dataset characteristics, evaluation methodology, and the availability of labeled supervision.

8. Discussion and Limitations

This study provides a systematic comparison of prompting-based extraction, retrieval-augmented canonicalization, and supervised fine-tuning for structured clinical information extraction, highlighting tradeoffs that depend on model capacity, supervision availability, entity type, and evaluation methodology.

Across both GPT-4o-mini and Qwen 2.5-7B, zero-shot prompting consistently achieves performance comparable to, and in some cases exceeding, few-shot configurations. This behavior contrasts with conventional expectations from in-context learning and suggests that pretrained LLMs may already encode sufficient medical knowledge for structured extraction tasks. In this setting, additional examples can introduce noise or distract from adherence to a fixed output schema, particularly when strict JSON formatting is required. Chain-of-Verification prompting exhibits more consistent benefits for the smaller open-source model, indicating that explicit verification and reasoning scaffolds may compensate for more limited parametric knowledge, whereas larger models appear less sensitive to prompt structure [25].

Performance varies substantially across entity types, reflecting underlying linguistic characteristics of clinical documentation. Medication extraction is consistently strongest across all methods, likely due to more standardized naming conventions and reduced contextual ambiguity. Diagnoses exhibit greater lexical variability, leading to low exact-match scores but substantially higher fuzzy and semantic performance. Procedures remain the most challenging category, with highly variable phrasing and implicit references that complicate both extraction and normalization. These patterns align with prior clinical NLP findings and suggest that entity-type-specific modeling and evaluation strategies remain necessary.

The comparison between closed-source and open-source models indicates that locally deployable LLMs are increasingly viable for clinical extraction tasks. Under zero-shot prompting, Qwen 2.5-7B reaches approximately 91% of GPT-4o-mini’s semantic performance (overall Judge F1 of 0.406 versus 0.447; Tables 3 and 4), rising to approximately 94% with CoVE prompting (0.419), while operating entirely without API costs, albeit with higher latency. This tradeoff highlights the potential of open-source models in privacy-sensitive or resource-constrained environments, particularly when paired with structured prompting strategies. From a deployment perspective, zero-shot prompting emerges as the most efficient supervision-light strategy, while supervised fine-tuning requires labeled data and substantial upfront training cost but offers low marginal inference cost once deployed.

Evaluation methodology strongly influences observed performance. Large gaps between Exact Match and semantic scores demonstrate that string-based metrics substantially underestimate clinically meaningful correctness in free-text extraction. Fuzzy matching partially mitigates this issue but remains insufficient to capture true clinical equivalence. Semantic evaluation using an LLM-as-Judge provides a more informative assessment of extraction quality, though it introduces model-dependent biases and does not replace expert human adjudication. In this work, semantic evaluation is used to complement, rather than supersede, string-based metrics.
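The gap between exact and fuzzy scoring can be illustrated with a small sketch. The matching rule used here (lowercasing, whitespace collapsing, and a `difflib.SequenceMatcher` ratio with a 0.8 cutoff) and the toy entity lists are assumptions chosen for illustration; the study's actual fuzzy-matching implementation may differ.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def is_match(pred: str, gold: str, threshold: float) -> bool:
    """Exact comparison at threshold 1.0, similarity-ratio comparison otherwise."""
    a, b = normalize(pred), normalize(gold)
    if threshold >= 1.0:
        return a == b
    return SequenceMatcher(None, a, b).ratio() >= threshold


def f1_score(predicted: list[str], gold: list[str], threshold: float) -> float:
    """Greedy one-to-one matching of predictions to gold entities, then F1."""
    used: set[int] = set()
    true_positives = 0
    for pred in predicted:
        for i, reference in enumerate(gold):
            if i not in used and is_match(pred, reference, threshold):
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


gold = ["Hypertension", "Urinary tract infection", "Anemia, unspecified"]
predicted = ["hypertension", "Urinary tract infections", "Anemia"]
print("Exact F1:", round(f1_score(predicted, gold, 1.0), 3))
print("Fuzzy F1:", round(f1_score(predicted, gold, 0.8), 3))
```

In this toy case the plural form is recovered by fuzzy matching, but "Anemia" versus "Anemia, unspecified" still fails at the 0.8 cutoff even though the two are clinically equivalent. This is the kind of residual error that motivates a semantic judge.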

Retrieval-augmented canonicalization improves terminology consistency and yields localized gains, particularly for diagnoses, but does not substantially improve overall extraction accuracy [26]. This outcome is expected given that canonicalization refines only already-extracted entities and cannot recover missed predictions. Its effectiveness depends on alignment between predicted phrases and the canonical vocabulary, and mismatches may improve semantic interpretability at the expense of strict string matching. These findings suggest that RAG is better suited for post-processing and harmonization than as a standalone accuracy-enhancing mechanism [27].

Supervised fine-tuning of BioClinicalBERT achieves the strongest string-based performance, particularly under fuzzy evaluation, reflecting the benefits of labeled supervision for capturing domain-specific terminology. However, performance varies by entity type and exhibits recall-heavy tendencies, especially for diagnoses and procedures. These patterns likely reflect class imbalance, large label spaces, and limited annotated data, underscoring the continued importance of domain-specific pretraining, careful threshold calibration, and improved learning objectives for clinical extraction.

Several limitations should be noted. The analysis is conducted on discharge summaries from a single institution, which may limit generalizability to settings with different documentation practices. The filtered dataset, while suitable for controlled comparison, is small for supervised learning and may not capture rare conditions or procedural descriptions. Temporal variation in clinical terminology across the 2008–2022 period is not explicitly modeled. The RAG knowledge base is derived from the same dataset used for evaluation, raising potential concerns about circularity, and retrieval parameters are fixed rather than optimized per entity type. Finally, cost and resource constraints prevent full semantic evaluation of the fine-tuned model and limit latency measurements to a single hardware configuration.

Several additional limitations warrant emphasis. The analytic dataset comprised 449 discharge summaries from 300 unique patients, representing a small filtered subset of the original MIMIC-IV cohort. While this design enables controlled comparison across methods, it limits representativeness and statistical generalizability. Moreover, ground-truth labels were derived from admission-level structured coding tables rather than manually annotated discharge summaries. As a result, the reference standard does not constitute a curated gold-standard dataset at the narrative entity level and may introduce label noise due to documentation–coding mismatches. The semantic evaluation depends on a single LLM judge without ensemble verification or systematic calibration against multiple human raters. Although a clinician evaluation was conducted on a subset of notes, only one clinician participated and inter-rater reliability was not assessed. Future work should incorporate larger cohorts and multi-rater manual annotation to strengthen evaluation robustness.

Taken together, these findings emphasize that no single extraction paradigm dominates across all conditions. Prompting-based methods offer strong semantic performance with minimal supervision, supervised fine-tuning yields higher string-level accuracy when labeled data are available, and retrieval-based canonicalization improves consistency rather than coverage. The choice of extraction strategy should therefore be guided by data availability, resource constraints, and evaluation objectives, particularly in clinical settings where semantic correctness is often more relevant than surface-form matching.

9. Conclusions

This study systematically compares prompting-based extraction, retrieval-augmented canonicalization, and supervised fine-tuning for structured clinical information extraction from MIMIC-IV discharge summaries under a unified experimental and evaluation framework. The results clarify how these approaches differ in accuracy, consistency, cost, and practical applicability. Prompting-based extraction performs strongly in zero-shot settings, with both GPT-4o-mini and Qwen 2.5-7B achieving meaningful semantic accuracy without task-specific supervision. Few-shot prompting does not yield consistent gains, suggesting that pretrained language models already encode substantial clinical knowledge and benefit more from structured reasoning mechanisms, such as Chain-of-Verification, than from additional examples. Retrieval-augmented generation improves terminology consistency and interpretability but does not materially increase extraction coverage, indicating that its primary role is post-processing rather than recall enhancement. Supervised fine-tuning with BioClinicalBERT achieves the strongest string-based performance, particularly under fuzzy evaluation, though effectiveness varies by entity type due to differences in lexical variability, label space size, and data sparsity.

From a deployment perspective, open-source models show increasing practical viability. Qwen 2.5-7B approaches the semantic performance of GPT-4o-mini while operating locally and without inference cost, making it suitable for privacy-sensitive or resource-constrained clinical settings. Across all approaches, evaluation results are highly sensitive to metric choice: exact string matching substantially underestimates clinically meaningful correctness, while semantic evaluation provides a more informative, though imperfect, assessment. These findings primarily reflect controlled comparative behavior across extraction paradigms within a filtered subset of MIMIC-IV and should be validated at a larger scale in future work.

Author Contributions

Conceptualization, T.Y., A.T., J.C. and M.M.; methodology, T.Y., A.T. and M.M.; software, T.Y. and A.T.; validation, T.Y. and M.M.; formal analysis, T.Y., A.T. and J.C.; investigation, T.Y. and M.M.; resources, T.Y. and A.T.; data curation, T.Y.; writing—original draft preparation, T.Y., A.T. and M.M.; writing—review and editing, M.M.; visualization, T.Y.; supervision, M.M.; project administration, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The study uses the publicly available MIMIC-IV dataset. MIMIC-IV is hosted on PhysioNet and is available after completing the required credentialing and data use agreement. Information on access is available at: https://physionet.org/content/mimiciv/ (accessed on 1 March 2026). The discharge summaries and linked structured tables (diagnoses_icd, procedures_icd, prescriptions) used in this work are derived from MIMIC-IV; no new primary datasets were generated. Processed splits and analysis code are available from the corresponding author upon reasonable request.

Acknowledgments

The authors sincerely thank Vishalakshi Tekale for generously contributing their time and expertise to independently evaluate a subset of discharge summaries and assess extraction completeness, normalization correctness, and clinical plausibility. Tekale’s thoughtful clinical review provided valuable domain-informed validation of the automated evaluation framework and strengthened the overall rigor of this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
RAG	Retrieval-Augmented Generation
EHR	Electronic Health Record
NLP	Natural Language Processing
NER	Named Entity Recognition
CoVE	Chain-of-Verification

Appendix A

Appendix A.1. System Prompt

All experiments used the following system prompt:

“You are a medical NLP expert specializing in extracting structured information from unstructured clinical discharge notes. Your task is to identify and extract all diagnoses, procedures, and medications mentioned in the note text.”

Appendix A.2. Zero-Shot Prompt Template

“Extract all diagnoses, procedures, and medications from this clinical discharge note.

Clinical Note: {note_text}

Return your response as valid JSON in this exact format:

{“diagnoses”: [“diagnosis 1”, “diagnosis 2”, …],

“procedures”: [“procedure 1”, “procedure 2”, …],

“medications”: [“medication 1”, “medication 2”, …]}

Rules:

-Always include all three keys even if a list is empty

-For medications: extract only the drug name (no dosage, route, or frequency)

-Do not include any text outside the JSON structure”
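Because the template asks for JSON with three fixed keys, downstream code typically needs to parse and validate the reply. The following is a minimal sketch of such a parser. The function name and the fallback behavior (empty lists when parsing fails or a key is missing) are assumptions for illustration, not the study's implementation.

```python
import json

REQUIRED_KEYS = ("diagnoses", "procedures", "medications")


def parse_extraction(raw: str) -> dict[str, list[str]]:
    """Parse a model reply into the three-key schema, tolerating text around the JSON."""
    empty = {key: [] for key in REQUIRED_KEYS}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    result = {}
    for key in REQUIRED_KEYS:
        values = payload.get(key, [])
        if not isinstance(values, list):
            values = []
        # Keep only non-empty strings, trimmed of surrounding whitespace.
        result[key] = [v.strip() for v in values if isinstance(v, str) and v.strip()]
    return result


reply = '```json\n{"diagnoses": ["Anemia"], "medications": ["Furosemide", ""]}\n```'
print(parse_extraction(reply))
```

Slicing from the first `{` to the last `}` handles replies wrapped in code fences or prefixed with commentary, and filling in missing keys keeps every record in the same shape for scoring.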

Appendix A.3. Few-Shot Prompt Template (2-Shot)

“You are a clinical information extraction system. Extract all diagnoses, procedures, and medications from discharge notes.

Instructions:

-Extract exact entity names as they appear

-For medications: extract only the drug name (no dosage)

-Always return valid JSON with all three keys

EXAMPLE 1:

INPUT: Clinical Note: {example_1_text}

OUTPUT: {example_1_ground_truth_json}

EXAMPLE 2:

INPUT: Clinical Note: {example_2_text}

OUTPUT: {example_2_ground_truth_json}

NOW EXTRACT FROM THIS NEW NOTE:

INPUT: Clinical Note: {target_note_text}

OUTPUT:”

Appendix A.4. Chain-of-Verification (CoVE) Prompt Template

“Extract all diagnoses, procedures, and medications from this clinical note using a verification process.

STEP 1: Initial Extraction

First, carefully read the note and extract all clinical entities.

STEP 2: Generate Verification Questions

Create specific questions to verify your extraction:

-Did I miss any diagnoses mentioned in the note?

-Did I miss any procedures or interventions?

-Did I miss any medications (on admission or discharge)?

-Did I misclassify any entity (e.g., a procedure as a diagnosis)?

-Did I include anything not actually present in the note?

STEP 3: Re-examine and Answer

Go back to the clinical note and answer each verification question.

STEP 4: Correct Errors

Based on your verification, make corrections to your initial extraction.

STEP 5: Final Output

Return ONLY the final verified JSON (no reasoning, no intermediate steps):

{“diagnoses”: [“diagnosis 1”, “diagnosis 2”, …],

“procedures”: [“procedure 1”, “procedure 2”, …],

“medications”: [“medication 1”, “medication 2”, …]}

Clinical Note: {note_text}

Final Verified JSON Response:”

Appendix A.5. LLM as Judge Prompt

“You are evaluating a clinical information extraction system.

Here is the ground truth (human-annotated) structured data:

Here is the model’s predicted structured data: {data}

Evaluate the overlap between them for three categories: diagnoses, procedures, medications.

Rules:

-correct → same or equivalent meaning (ignore capitalization, word order, abbreviations, plural/singular)

-partial → closely related (e.g., ‘heart failure’ vs. ‘chronic heart failure’)

-incorrect → unrelated or wrong

-Ignore medication dosage, formulation, or brand/generic differences

-Consider brand-generic names equivalent (e.g., ‘Wellbutrin’ == ‘Bupropion’)

-Treat reordered phrases as equivalent (‘bone marrow biopsy’ == ‘biopsy of bone marrow’)

-Consider more specific diagnosis as partial, not incorrect

Return ONLY valid JSON exactly in this structure:

{{“diagnoses”: {{“correct”: X, “partial”: Y, “incorrect”: Z, “total_gt”: N}},

“procedures”: {{“correct”: X, “partial”: Y, “incorrect”: Z, “total_gt”: N}},

“medications”: {{“correct”: X, “partial”: Y, “incorrect”: Z, “total_gt”: N}}}}

Do not include explanations, text, or code fences.”
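The doubled braces in the template are consistent with escaping literal braces for Python's `str.format`. How the returned counts are converted into a Judge F1 is not specified in this appendix, so the sketch below shows one plausible mapping under explicit assumptions: partial matches receive half credit, precision is computed over all judged predictions, and recall over `total_gt`. These choices are illustrative and may differ from the study's scoring.

```python
import json

PARTIAL_WEIGHT = 0.5  # assumption: a partial match earns half credit


def judge_f1(counts: dict, partial_weight: float = PARTIAL_WEIGHT) -> float:
    """Convert one category's judge counts into an F1 score (assumed scoring rule)."""
    credited = counts["correct"] + partial_weight * counts["partial"]
    n_predicted = counts["correct"] + counts["partial"] + counts["incorrect"]
    precision = credited / n_predicted if n_predicted else 0.0
    recall = credited / counts["total_gt"] if counts["total_gt"] else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Hypothetical judge reply in the structure requested by the prompt.
judge_reply = json.loads(
    '{"diagnoses": {"correct": 2, "partial": 2, "incorrect": 1, "total_gt": 6},'
    ' "procedures": {"correct": 1, "partial": 0, "incorrect": 0, "total_gt": 1},'
    ' "medications": {"correct": 4, "partial": 0, "incorrect": 4, "total_gt": 8}}'
)
scores = {category: judge_f1(counts) for category, counts in judge_reply.items()}
print({category: round(score, 3) for category, score in scores.items()})
```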

Appendix A.6. Example Input and Output for LLM Experiment

Appendix A.6.1. Input: Discharge Note Excerpt (Truncated)

“History of Present Illness:

___ yo M with h/o ___ who presented to his PCP with generalized weakness, decreased appetite, and pallor. Initial labs showed HCT 14.7, WBC 7.9, and platelets 35. He was admitted to ___ for transfusion and received 3 units of PRBCs, with repeat HCT of 20.9 and platelets of 30. He was subsequently transferred to […].

Past Medical History:

___ disease, diverticulitis, prior MVA with loss of license.

Major Surgical or Invasive Procedures:

Bone marrow biopsy.

Brief Hospital Course:

___ year old man with history of ___ disease presented with weakness, decreased appetite, weight loss, and pallor, and was found to have anemia and thrombocytopenia. He received a total of 4 units of PRBCs (3 at ___ and 1 at ___). Peripheral smear was consistent with high-grade MDS (myelodysplastic syndrome) with concern for leukemic transformation due to […].

Medications on Admission:

ASA, Colace, Flomax, HCTZ, Lactulose, Lasix […].

Discharge Medications:

Docusate sodium, tamsulosin, lactulose, furosemide, […].

Discharge Diagnosis:

Anemia, thrombocytopenia, ___ disease, chronic heart failure.”

Appendix A.6.2. Output: Structured JSON Generated by LLM

“{‘diagnoses’: [‘Anemia’, ‘Thrombocytopenia’, ‘___ Disease’, ‘Chronic heart failure’],

‘procedures’: [‘bone marrow biopsy’],

‘medications’: [‘Docusate Sodium’, ‘Tamsulosin’, ‘Lactulose’, ‘Furosemide’, ‘Senna’, ‘Carbidopa-Levodopa’, ‘Bupropion HCl’, ‘Metolazone’, ‘Acetaminophen’, ‘Uroxatral’, ‘Magnesium Oxide’]}”

References

1. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1.
2. Kundeti, S.R.; Vijayananda, J.; Mujjiga, S.; Kalyan, M. Clinical named entity recognition: Challenges and opportunities. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016.
3. Hripcsak, G.; Rothschild, A.S. Agreement, the f-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc. 2005, 12, 296–298.
4. Koroteev, M.V. BERT: A review of applications in natural language processing and understanding. arXiv 2021, arXiv:2103.11943.
5. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
6. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 2021, 3, 2.
7. Yang, X.; Bian, J.; Hogan, W.R.; Wu, Y. Clinical concept extraction using transformers. J. Am. Med. Inform. Assoc. 2020, 27, 1935–1942.
8. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
9. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682.
10. Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study. JMIR Med. Inform. 2024, 12, e55318.
11. Monajatipoor, M.; Yang, J.; Stremmel, J.; Emami, M.; Mohaghegh, F.; Rouhsedaghat, M.; Chang, K.W. LLMs in biomedicine: A study on clinical named entity recognition. arXiv 2024, arXiv:2404.07376.
12. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
13. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177.
14. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv 2023, arXiv:2303.16634.
15. Gera, A.; Boni, O.; Perlitz, Y.; Bar-Haim, R.; Eden, L.; Yehudai, A. JuStRank: Benchmarking LLM judges for system ranking. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 682–712.
16. Guo, Y.; Ge, Y.; Sarker, A. Detection of medication mentions and medication change events in clinical notes using transformer-based models. Stud. Health Technol. Inform. 2024, 310, 685.
17. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
18. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.; et al. Least-to-most prompting enables complex reasoning in large language models. arXiv 2022, arXiv:2205.10625.
19. Dhuliawala, S.; Komeili, M.; Xu, J.; Raileanu, R.; Li, X.; Celikyilmaz, A.; Weston, J. Chain-of-verification reduces hallucination in large language models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 3563–3578. Available online: https://aclanthology.org/2024.findings-acl.212.pdf (accessed on 1 March 2026).
20. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jindi, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 72–78.
21. Rezaei-Dastjerdehei, M.R.; Mijani, A.; Fatemizadeh, E. Addressing imbalance in multi-label classification using weighted cross-entropy loss function. In Proceedings of the 2020 27th National and 5th International Iranian Conference on Biomedical Engineering (ICBME), Tehran, Iran, 26–27 November 2020; IEEE: New York, NY, USA, 2020; pp. 333–338.
22. Mosbach, M.; Andriushchenko, M.; Klakow, D. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv 2020, arXiv:2006.04884.
23. Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.-K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220.
24. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
25. White, J. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv 2023, arXiv:2302.11382.
26. Xu, K.; Feng, Y.; Li, Q.; Dong, Z.; Wei, J. Survey on terminology extraction from texts. J. Big Data 2025, 12, 29.
27. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval-augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877.

Figure 1. Overview of the Proposed Experimental Architecture.

Figure 2. Training and validation loss curves over 50 epochs for BioClinicalBERT fine-tuning. The model learns rapidly at first and converges around epoch 10. The best checkpoint is selected at epoch 15 (minimum validation loss), before overfitting begins.

Table 1. RAG Configuration Parameters.

Parameter	Detail
Dense Model	S-PubMedBERT (Sentence-BERT, MS-MARCO)
Retrieval top-k	3
Similarity threshold	0.7
Index type	FAISS IndexFlatIP (cosine similarity)

Table 2. Fine-Tuning Hyperparameters.

Parameter	Value
Model	Bio ClinicalBERT
Max Input Length	1024 tokens
Batch Size	4
Optimizer	AdamW
Classification Head Learning Rate	1 × 10⁻³
Encoder Learning Rate	2 × 10⁻⁵
Maximum Epochs	50
Early Stopping	Validation loss (patience: 10 epochs)
Scheduler	Linear warmup (10%)
Threshold Tuning	Grid search on validation set
Top-K Caps	Diagnoses: 20, Procedures: 8, Medications: 50
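The last two rows of Table 2 describe how classifier probabilities are turned into label sets. A minimal sketch of that decoding step follows, assuming a multi-label classifier that emits one probability per label. The grid values, the use of micro-F1 as the selection criterion, the toy probabilities, and all function names are hypothetical; only the top-K caps come from Table 2.

```python
TOP_K_CAPS = {"diagnoses": 20, "procedures": 8, "medications": 50}  # Table 2


def decode(probabilities: dict[str, float], threshold: float, cap: int) -> list[str]:
    """Keep labels at or above the threshold, then retain at most `cap` top-scoring ones."""
    kept = sorted(((p, label) for label, p in probabilities.items() if p >= threshold), reverse=True)
    return [label for _, label in kept[:cap]]


def micro_f1(predictions: list[list[str]], gold: list[set[str]]) -> float:
    """Micro-averaged F1 over all notes."""
    true_positives = sum(len(set(p) & g) for p, g in zip(predictions, gold))
    total = sum(len(p) for p in predictions) + sum(len(g) for g in gold)
    return 2 * true_positives / total if total else 0.0


def tune_threshold(val_probs, val_gold, cap: int, grid: list[float]) -> float:
    """Grid search: return the threshold with the best validation micro-F1."""
    return max(grid, key=lambda t: micro_f1([decode(p, t, cap) for p in val_probs], val_gold))


# Toy validation data for the medication head (hypothetical values).
val_probs = [
    {"furosemide": 0.91, "lactulose": 0.62, "warfarin": 0.35},
    {"furosemide": 0.20, "tamsulosin": 0.74, "warfarin": 0.41},
]
val_gold = [{"furosemide", "lactulose"}, {"tamsulosin"}]
best = tune_threshold(val_probs, val_gold, TOP_K_CAPS["medications"], [0.1, 0.3, 0.5, 0.7, 0.9])
print("Best threshold:", best)
```

A low threshold combined with a generous cap favors recall over precision, which is one possible explanation for the recall-heavy pattern in Table 8.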

Table 3. Prompting-based extraction performance of GPT-4o-mini on the validation set. Results are reported by entity type and aggregated overall using Exact Match, Fuzzy Match, and semantic (LLM-as-Judge) F1 scores, illustrating differences between surface-form and semantically informed evaluation.

Strategy	Entity	Exact F1	Fuzzy F1	Judge F1
Zero-shot	Diagnoses	0.019	0.123	0.419
	Procedures	0.005	0.108	0.420
	Medications	0.352	0.477	0.502
	Overall	0.125	0.236	0.447
Few-shot-2	Diagnoses	0.023	0.171	0.405
	Procedures	0.012	0.165	0.402
	Medications	0.386	0.501	0.498
	Overall	0.141	0.279	0.435
Few-shot-4	Diagnoses	0.037	0.199	0.413
	Procedures	0.019	0.181	0.420
	Medications	0.364	0.493	0.500
	Overall	0.140	0.291	0.444
CoVE	Diagnoses	0.018	0.117	0.392
	Procedures	0.009	0.123	0.414
	Medications	0.344	0.474	0.500
	Overall	0.124	0.238	0.435

Table 4. Prompting-based extraction performance of Qwen-2.5-7B on the validation set (n = 84). Performance is reported across prompting strategies and entity types using Exact Match, Fuzzy Match, and semantic (LLM-as-Judge) F1 scores, enabling comparison with closed-source models under identical evaluation conditions.

Strategy	Entity	Exact F1	Fuzzy F1	Judge F1
Zero-shot	Diagnoses	0.018	0.098	0.379
	Procedures	0.005	0.104	0.385
	Medications	0.330	0.434	0.455
	Overall	0.118	0.212	0.406
Few-shot-2	Diagnoses	0.022	0.169	0.367
	Procedures	0.006	0.136	0.354
	Medications	0.362	0.457	0.460
	Overall	0.130	0.254	0.394
Few-shot-4	Diagnoses	0.029	0.210	0.362
	Procedures	0.023	0.169	0.372
	Medications	0.369	0.466	0.459
	Overall	0.140	0.282	0.398
CoVE	Diagnoses	0.019	0.107	0.370
	Procedures	0.005	0.100	0.404
	Medications	0.334	0.441	0.462
	Overall	0.119	0.216	0.419

Table 5. Effect of retrieval-augmented canonicalization on GPT-4o-mini performance on the test set (n = 89). Results compare prompting-only outputs with RAG-enhanced outputs across entity types, highlighting the impact of post-processing canonicalization on consistency and matching behavior.

Entity Type	Exact-F1	Fuzzy-F1	Judge-F1
Diagnosis	0.022	0.162	0.405
Procedures	0.018	0.099	0.378
Medications	0.291	0.445	0.499
Overall	0.110	0.235	0.428

Table 6. Cost analysis of prompting strategies on the validation set (n = 84). Monetary costs are reported for extraction and semantic evaluation across prompting strategies and model backbones, illustrating tradeoffs between performance and inference cost.

Model	Strategy	Extract ($)	Judge ($)	Total ($)
GPT-4o-mini	Zero-shot	0.024	0.022	0.046
	Few-shot-2	0.051	0.021	0.072
	Few-shot-4	0.098	0.021	0.120
	CoVE	0.026	0.022	0.047
	Subtotal	0.199	0.086	0.286
Qwen 2.5-7B	Zero-shot	0.000	0.021	0.021
	Few-shot-2	0.000	0.020	0.020
	Few-shot-4	0.000	0.020	0.020
	CoVE	0.000	0.020	0.020
	Subtotal	0.000	0.080	0.080
Grand Total		0.199	0.167	0.366

Table 7. Latency and token usage across prompting strategies on the validation set (n = 84). Average latency and input/output token counts are reported for each model and prompting configuration, reflecting computational efficiency under deterministic inference settings.

Model	Strategy	Latency (s)	Input Tokens	Output Tokens
GPT-4o-mini	Zero-shot	3.1	1269	164
	Few-shot-2	3.7	3389	163
	Few-shot-4	3.5	7151	162
	CoVE	3.5	1450	152
	Average	3.5	3315	160
Qwen 2.5-7B	Zero-shot	5.2	1352	156
	Few-shot-2	6.6	3309	197
	Few-shot-4	7.0	7049	202
	CoVE	5.1	1455	157
	Average	6.0	3291	178
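The token counts above can be related to the extraction costs in Table 6 with a short calculation. The per-token prices used below (USD 0.15 per million input tokens and USD 0.60 per million output tokens for GPT-4o-mini) are an assumption, since the article does not state them; the note count (n = 84) and token averages come from Table 7.

```python
N_NOTES = 84                       # validation set size (Table 7)
PRICE_IN_PER_M = 0.15              # assumed USD per 1M input tokens
PRICE_OUT_PER_M = 0.60             # assumed USD per 1M output tokens

# Average (input, output) tokens per note for GPT-4o-mini, from Table 7.
TOKENS = {
    "Zero-shot": (1269, 164),
    "Few-shot-2": (3389, 163),
    "Few-shot-4": (7151, 162),
    "CoVE": (1450, 152),
}


def extraction_cost(input_tokens: int, output_tokens: int, n_notes: int = N_NOTES) -> float:
    """Estimated extraction cost in USD for `n_notes` requests."""
    per_note = (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000
    return per_note * n_notes


costs = {strategy: extraction_cost(*tokens) for strategy, tokens in TOKENS.items()}
for strategy, cost in costs.items():
    print(f"{strategy}: ${cost:.3f}")
```

Under these assumed prices, the estimates land on the Extract column of Table 6 to three decimals, and they make clear that the few-shot cost increase is driven almost entirely by input tokens.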

Table 8. Test-set performance of the fine-tuned BioClinicalBERT model (n = 89). Results are reported by entity type and overall using Exact and Fuzzy Match precision, recall, and F1 scores, reflecting supervised extraction performance under string-based evaluation.

Entity Type	Exact (F1/Prec/Recall)	Fuzzy (F1/Prec/Recall)
Diagnosis	0.130/0.096/0.200	0.630/0.466/0.973
Procedures	0.050/0.031/0.126	0.400/0.250/0.994
Medications	0.424/0.471/0.386	0.794/0.880/0.723
Overall	0.201/0.199/0.237	0.608/0.532/0.897

Table 9. Comparison of best-performing extraction approaches on the test set (n = 89). The table contrasts zero-shot prompting, RAG-enhanced prompting, and supervised fine-tuning using Exact, Fuzzy, and semantic evaluation metrics where applicable.

Approach	F1_Exact	F1_Fuzzy	F1_Judge
GPT-4o-mini (Zero-shot)	0.125	0.236	0.447
GPT-4o-mini + RAG	0.110	0.235	0.428
Fine-Tuned Bio_ClinicalBERT	0.201	0.608	-

F1_Judge was not evaluated for the fine-tuned Bio_ClinicalBERT approach.

Table 10. Clinician evaluation of structured extraction quality (n = 45 discharge summaries).

Dimension	Mean ± SD	% Rated ≥ 4
Extraction Completeness	3.73 ± 0.54	68.9%
Normalization Correctness	3.71 ± 0.51	68.9%
Clinical Plausibility	3.60 ± 0.54	57.8%

Table 11. Methodological positioning relative to representative EHR-LLM benchmarks. Abbreviations: QA = Question Answering; NER = Named Entity Recognition; RAG = Retrieval-Augmented Generation; SFT = Supervised Fine-Tuning; SemEval = Semantic Evaluation; Deploy = Deployment/Cost Analysis.

Study/System	Dataset	Task	Models	RAG	SFT	SemEval	Deploy
EHRNoteQA	MIMIC-III	QA	Single LLM	No	No	No	No
DiRAG	Biomedical	NER	LLM + RAG	Yes	No	Limited	No
RadGraph	MIMIC-CXR	Entity + Relation	Transformer	No	Yes	No	No
RoBERTa-MIMIC	i2b2	NER	Transformer	No	Yes	No	No
This Study	MIMIC-IV	Structured Extraction	LLM (closed + open) + Transformer	Yes *	Yes	Yes	Yes

* RAG used as post hoc canonicalization only.

Table 12. Contextual comparison with representative clinical information extraction benchmarks. Reported scores are drawn from prior studies using heterogeneous datasets, task formulations, and evaluation protocols. These values are provided for qualitative context and are not intended as direct head-to-head performance comparisons.

Approach/System	F1 Score	Dataset	Notes
Traditional/Legacy Systems
MetaMap (rule-based)	0.70	i2b2 Obesity	Baseline NER tool
cTAKES (rule-based)	0.65	i2b2 Obesity	Clinical pipeline
CLAMP (rule-based)	0.68	i2b2 Obesity	Multi-institutional
BERT-Based Fine-Tuned Models
BioBERT (general biomedical)	0.62	PubMed/MEDLINE	Biomedical domain
RoBERTa-MIMIC	0.89	i2b2 2010	State-of-the-art BERT
BioClinicalBERT (our approach)	0.60	MIMIC-IV	Fuzzy F1 on our data
Large Language Model Approaches
GPT-3.5 (few-shot)	0.75	Biomedical NER	DiRAG study benchmark
LLaMA instruction-tuned	0.87	NCBI Disease	Multi-dataset tuning
GPT-4o-mini (zero-shot, our)	0.44	MIMIC-IV	LLM-as-Judge metric
Qwen 2.5-7B (CoVE, our)	0.41	MIMIC-IV	LLM-as-Judge metric
Retrieval-Augmented and Hybrid Systems
DiRAG (RAG-enhanced LLM)	0.80–0.87	Biomedical NER	Zero-shot with retrieval
RadGraph (entity + relation)	0.82	MIMIC-CXR	Radiology reports
Our GPT-4o-mini + RAG	0.23	MIMIC-IV	Fuzzy F1 on our data
