1. Introduction
Clinical notes are a vital source of information in healthcare, capturing details about patients’ symptoms, medications, and medical procedures. However, these notes are often unstructured and filled with abbreviations, shorthand, and inconsistent language, making it difficult to extract reliable and consistent information [1]. Recent studies have highlighted that the usability and complexity of electronic health record (EHR) systems contribute significantly to physician dissatisfaction and burnout. For example, a recent study reported that EHR usability issues negatively affect physician satisfaction and are linked to burnout in primary care [2]. Similarly, Muhiyaddin et al. conducted a scoping review and found that EHR-related documentation burdens contribute to high levels of stress among healthcare providers [3].
Named Entity Recognition (NER) is a natural language processing (NLP) technique that identifies and categorizes key medical terms such as diseases, medications, symptoms, and procedures in unstructured clinical text. By transforming free-text narratives into structured data, NER facilitates information retrieval, clinical decision support, cohort selection, and medical knowledge discovery, and it plays a critical role in extracting meaningful insights from EHR systems to support clinical analytics and research [4]. Earlier approaches to clinical NER relied heavily on rule-based systems and traditional machine learning algorithms, such as Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), which used handcrafted features and domain lexicons [5,6]. Although these methods achieved some success, they often struggled to generalize to new datasets and failed to handle the variability and complexity of clinical language. This challenge is compounded in modern healthcare systems by inconsistent documentation practices and burdensome EHR interfaces that have been linked to physician distress and burnout.
In recent years, the introduction of transformer-based language models such as BERT and its biomedical adaptations (BioBERT and ClinicalBERT) has significantly advanced the field of clinical NER [7]. These models leverage transformer architectures and self-supervised pretraining on large text corpora to capture nuanced semantic relationships, resulting in superior performance compared to earlier deep learning models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks [8]. Although RNN-based models demonstrated improved performance over traditional machine learning approaches, they often required large annotated datasets and still struggled with the complexity of clinical language. By contrast, transformer-based models, trained on large biomedical corpora, offer improved generalization and context awareness that are crucial for clinical NER tasks. Despite these improvements, transformer-based language models, which belong to the broader category of large language models (LLMs), are not without limitations. For example, they may produce inconsistent or ambiguous outputs when faced with abbreviations like “SOB” (shortness of breath) or domain-specific jargon, and they sometimes generate hallucinated or irrelevant entities that reduce reliability.
To address these challenges, recent work has explored integrating retrieval-based methods with generative language models through Retrieval-Augmented Generation (RAG) frameworks [9,10]. RAG methods combine the generative power of LLMs with external knowledge retrieval, enabling models to ground their outputs in relevant domain-specific information. Various strategies have also been proposed to further enhance clinical NER performance. A recent study showed that prompt-based methods can improve LLM performance on clinical NER tasks [11]. In line with this, an evaluation study confirmed that BERT-based models consistently outperform earlier models, though challenges remain for domain adaptation [12]. Lee et al. introduced Clinical ModernBERT, a transformer model pre-trained on large clinical corpora with support for long-context scenarios, achieving state-of-the-art results [13]. RAG-based entity linking methods have also been explored for biomedical concept normalization, showing promise in resolving ambiguous abbreviations and aligning outputs with external knowledge bases [14]. Moreover, a hybrid BiLSTM-BERT-RAG architecture demonstrated near-human performance in clinical NER tasks, highlighting the practical benefits of combining generative and retrieval-based methods [15]. While RAG has shown promise across NLP tasks, and initial studies have begun to explore its role in clinical NER (particularly for resolving ambiguous abbreviations and normalizing terms), its full potential remains underexplored and warrants further investigation.
Building on these advancements, we aim to further explore the potential of combining transformer-based models with retrieval-augmented generation techniques for clinical NER. Specifically, our goal is to develop a hybrid system that not only improves entity extraction performance but also enhances the clarity and clinical relevance of the outputs. To this end, we investigate the following research questions:
RQ1:
How effectively can a fine-tuned general-purpose transformer model (BERT) extract clinically relevant entities from unstructured clinical notes compared to domain-specific models?
RQ2:
Can a retrieval-augmented generation (RAG) framework improve the clarity and standardization of ambiguous or abbreviated medical terms in NER outputs?
RQ3:
How does combining semantic similarity with lexical re-ranking impact the accuracy of medical concept retrieval and normalization?
RQ4:
What role does prompt engineering play in ensuring the reliability and interpretability of LLM-generated medical term enhancements?
To address the research questions outlined above, we propose a hybrid framework that enhances clinical NER by integrating a fine-tuned BERT model with a dictionary-infused Retrieval-Augmented Generation (DiRAG) pipeline. Our system begins with entity extraction using a BERT-base model fine-tuned on the MACCROBAT biomedical NER dataset. Despite being a general-purpose model, our fine-tuned BERT achieved an F1 score of 0.708, outperforming several domain-specific models such as BioBERT and ClinicalBERT. This result highlights the effectiveness of task-specific fine-tuning, even with limited clinical data, and directly addresses RQ1. To address RQ2 and RQ3, we developed the DiRAG module, which selectively enhances only ambiguous or abbreviated entities rather than rewriting all extracted outputs. This module incorporates a two-step normalization process: (1) semantic retrieval of candidate medical concepts from a UMLS-based vector database using embedding similarity, and (2) a re-ranking mechanism that combines semantic similarity with lexical overlap to improve retrieval precision. This hybrid strategy enables more accurate and context-aware mapping of extracted entities to their standardized medical equivalents.
In response to RQ4, we designed domain-specific prompt templates to guide the large language model during the normalization process. These prompts are carefully crafted to ensure that the model relies solely on retrieved medical knowledge, thereby reducing hallucinations and improving the interpretability and reliability of the outputs. Together, these components form a robust, interpretable, and scalable solution for clinical NER. Our results demonstrate that this hybrid approach significantly improves the accuracy, clarity, and clinical relevance of extracted entities, making it a strong candidate for real-world deployment in healthcare NLP systems.
2. Literature Review
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including healthcare. These models leverage vast amounts of data to perform complex language tasks, such as Named Entity Recognition (NER) and text classification. In the medical field, LLMs have been applied to clinical information extraction, significantly improving the identification of entities in electronic health records (EHRs) by effectively capturing contextual information and relationships among medical terms [16,17,18]. Li et al. [19] explored few-shot learning for clinical NER, highlighting the challenges of limited annotated data and demonstrating how LLMs can generalize from small datasets.
Building on this foundation, recent advancements have introduced context-aware embeddings and domain-specific medical lexicons to further enhance biomedical NER. ClinicalBERT, for instance, leverages pre-training on clinical notes to improve contextual understanding of medical terminology, facilitating more accurate entity extraction [18]. In parallel, prompt-based fine-tuning has emerged as a promising strategy to bridge the gap between general-purpose LLMs and clinical applications. Xu et al. [20] introduced PromptNER, a method that utilizes prompt engineering to effectively adapt LLMs to limited-data clinical NER tasks, achieving improved performance compared to traditional fine-tuning methods. A significant focus has also been placed on phenotype recognition in clinical notes, an area critical for understanding disease progression and patient subtyping. Yang et al. [21] proposed PhenoBCBERT and PhenoGPT, which demonstrated superior performance in identifying phenotype terms compared to existing approaches, thereby facilitating more precise extraction of disease-related information. Complementing this work, Agrawal et al. [22] benchmarked LLMs for few-shot clinical information extraction, revealing their strong ability to generalize to new tasks with minimal examples, a critical capability in domains with limited labeled data.
The potential of LLMs to handle rare diseases and their associated phenotypes has also garnered attention. Shyr et al. [23] leveraged LLMs for zero-shot and few-shot identification of rare disease entities, demonstrating their ability to capture complex and infrequent medical concepts. Similarly, Andrew et al. [24] focused on temporal entity extraction in pediatric rare diseases, highlighting the importance of temporal information for accurate patient monitoring. These studies underscore the versatility of LLMs in capturing nuanced clinical information beyond standard diagnostic terms [25]. In addition to NER tasks, advancements have been made in other areas of clinical text processing. Lopez et al. [26] proposed Clinical Entity Augmented Retrieval, a method that improves information extraction by integrating retrieval-augmented strategies, thereby enhancing the relevance of extracted medical entities. On the privacy front, Pissarra et al. [27] focused on anonymizing clinical text using LLMs, ensuring that patient privacy is preserved while maintaining the utility of medical data for research and analysis.
LLMs have also shown potential in accelerating annotation processes and scalable information extraction. Ghali et al. [28] introduced GAMEDX, an LLM-based system for precise and scalable medical entity extraction, showcasing its applicability in large-scale clinical data. Similarly, Goel et al. [29] demonstrated that LLMs can significantly reduce manual annotation efforts by automating parts of the data labeling process, which is a critical step for training and evaluating clinical NLP systems.
Prompt engineering has emerged as a key strategy for optimizing clinical NER. Hu et al. [30] developed a prompt-based approach that enhances LLM performance in clinical contexts, emphasizing the importance of prompt design for domain-specific tasks. Monajatipoor et al. [25] conducted a comprehensive analysis of LLMs in biomedical applications, underscoring their adaptability and the need for targeted fine-tuning strategies to maximize their utility across different medical subdomains. Synthetic data generation has also been explored as a means to augment training data for clinical text mining. Tang et al. [31] investigated the effectiveness of synthetic data generated by LLMs, demonstrating improved NER performance through fine-tuning on these artificial samples. Despite these gains, Lu et al. [32] highlighted persistent challenges at the token-level NER stage, suggesting that model architecture advancements remain necessary to fully address fine-grained entity extraction.
A comparative analysis by Obeidat et al. [33] revealed that LLMs can surpass traditional encoders in biomedical NER, showcasing their superior text representation capabilities and underscoring their potential for complex medical text processing. Ghali et al. [28] further demonstrated the scalability of LLMs in handling large clinical datasets, reinforcing their suitability for real-world healthcare applications.
Recent research has also emphasized the importance of integrating domain-specific knowledge into LLM outputs. Yang et al. [25] proposed DiRAG-Med, a dictionary-infused retrieval-augmented generation model that enhances medical text normalization by incorporating medical lexicons, significantly improving the accuracy of entity normalization and reducing ambiguities. Such approaches are vital for ensuring that extracted information aligns with clinical terminologies and standards.
Finally, del Moral-González et al. [34] presented a comprehensive benchmark for generative LLMs in clinical text processing, focusing specifically on zero-shot NER. Their systematic comparison of fine-tuned generative models, including LLaMA 2 and Mistral, across multiple evaluation strategies provides a critical framework for assessing the effectiveness of generative approaches in clinical NLP tasks.
Overall, the literature demonstrates that LLMs have revolutionized the field of clinical text processing, offering significant improvements in entity recognition, text anonymization, and information extraction. Despite these advancements, challenges remain in areas such as privacy, data quality, model generalization, and domain adaptation, warranting further investigation to ensure the reliable deployment of LLMs in healthcare settings.
3. Dataset Description and Pre-Processing
3.1. Dataset for Fine-Tuning: MACCROBAT Clinical Corpus
This study employs the MACCROBAT dataset, a comprehensively annotated corpus of clinical narratives specifically designed for biomedical named entity recognition (NER) tasks in healthcare informatics. The dataset is publicly accessible through Hugging Face [35] and represents a carefully curated collection of 200 real clinical notes sourced from multiple hospital departments, ensuring representative coverage of medical terminology and clinical documentation patterns across diverse healthcare specialties including cardiology, oncology, neurology, emergency medicine, and internal medicine.
The MACCROBAT corpus provides a rich resource for clinical NLP with more than 80 annotated entity types, organized into categories such as conditions, therapeutic interventions, anatomical references, laboratory data, temporal information, demographics, and family history. Annotations are marked with character-level boundaries, supporting precise span detection and evaluation at both token and character levels. The annotation process followed established clinical NLP guidelines and achieved an inter-annotator agreement above 0.85 (Cohen’s kappa), ensuring reliable ground truth labels.
The dataset combines structured content (35%), including medication lists, laboratory results, procedural reports with CPT codes, and template-based discharge summaries, with unstructured narratives (65%) such as physician notes, patient histories, and clinical assessments. This mix reflects real-world EHR documentation and supports model robustness across varied text types. In total, the corpus contains 200 notes with about 125,000 tokens. Notes average 625 tokens (±280), with an entity density of 15.2 per 100 tokens. After preprocessing, the vocabulary includes 18,500 unique terms. Entity distribution is balanced across categories, with conditions (28%), medications (22%), procedures (18%), and anatomical terms (15%) most common. Together, these features make the corpus well suited for developing and evaluating large language models for clinical entity recognition.
3.2. Knowledge Base Integration: Unified Medical Language System (UMLS) for RAG
To improve entity normalization and semantic consistency, this study incorporates the Unified Medical Language System (UMLS) Metathesaurus [36], developed and maintained by the U.S. National Library of Medicine. The UMLS is a comprehensive biomedical vocabulary resource that contains more than four million concepts drawn from over 200 source vocabularies.
The integrated knowledge base is structured around several key components. Each medical concept is assigned a unique identifier (CUI) that provides a standardized reference point. Concepts are accompanied by definitions from authoritative medical sources, extensive sets of synonyms and alternative term variants to support robust matching, and preferred term designations that align with established medical nomenclature standards.
Integration with the UMLS enables entity linking and concept normalization through multiple mechanisms. Extracted entities are automatically mapped to standardized concepts, classified into semantic types according to the UMLS semantic network, and enriched with synonyms to improve recognition recall. In addition, hierarchical relationships between concepts are leveraged to support semantic reasoning. Together, these capabilities ensure that entities are not only detected but also represented in a consistent and clinically meaningful way.
By combining richly annotated clinical text with a standardized biomedical vocabulary, this study establishes a strong foundation for developing and evaluating clinical NER systems. The integration provides methodological rigor and enhances practical applicability, supporting the creation of NLP models capable of accurate clinical entity recognition and normalization in real-world healthcare environments.
4. Methodology
4.1. Overview
Our methodology employs a two-stage hybrid framework that systematically addresses the challenges of medical text processing through supervised fine-tuning followed by knowledge-enhanced normalization, as illustrated in Figure 1. The framework begins with supervised fine-tuning of a transformer-based language model for accurate medical entity extraction from clinical narratives. The resulting identified medical entities are then passed through a sophisticated Retrieval-Augmented Generation (RAG) pipeline designed specifically for medical terminology normalization and standardization.
This sequential approach is designed to address two fundamental challenges in biomedical text processing: first, the accurate identification of medical entities from diverse clinical documentation styles and formats, and second, the standardization and disambiguation of these extracted entities using domain-specific medical knowledge bases. The overall system architecture demonstrates a seamless flow from raw clinical text through precise entity extraction to comprehensive normalized medical terminology, with each stage building upon the previous to achieve robust medical text understanding.
The methodology is grounded in the principle that while transformer models demonstrate exceptional capability in pattern recognition for entity identification, they require integration with structured medical knowledge for effective disambiguation and normalization tasks. This is particularly critical when processing abbreviated, colloquial, or highly specialized medical terminology commonly found in clinical documentation, where context and domain expertise are essential for accurate interpretation. The framework addresses the inherent limitations of purely neural approaches by incorporating authoritative medical knowledge sources and large language model reasoning capabilities to enhance both accuracy and clinical interpretability of extracted entities.
4.2. Data Preprocessing and Tokenization
4.2.1. Dataset Preparation
We utilize the MACCROBAT (Medical Abbreviations and Clinical Case Reports for Biomedical Analysis and Translation) dataset, a specialized biomedical NER corpus consisting of 200 manually annotated clinical case reports. The dataset encompasses five primary entity categories: symptoms, diseases, medications, procedures, and laboratory values, with each entity annotated using the BIO (Beginning-Inside-Outside) tagging scheme where ‘B-’ denotes the beginning of an entity, ‘I-’ indicates continuation tokens within the same entity, and ‘O’ represents tokens outside any entity. The preprocessing pipeline begins with comprehensive text normalization to ensure compatibility with transformer tokenization while preserving the integrity of medical terminology. Clinical texts undergo standardized cleaning procedures including removal of extraneous whitespace, correction of encoding artifacts, and careful preservation of medical abbreviations and alphanumeric identifiers that are crucial for accurate medical entity recognition. This step is particularly important in clinical texts where formatting inconsistencies and special characters are common due to varying documentation practices across healthcare institutions.
4.2.2. Tokenization and Label Alignment
The tokenization process employs the BERT tokenizer (bert-base-uncased) for subword tokenization, which presents unique challenges when processing medical terminology. Medical terms frequently contain hyphens, periods, and alphanumeric combinations that may be split across multiple subtokens during the tokenization process. To address this challenge, we implement a sophisticated label alignment strategy that ensures consistent mapping between original BIO labels and tokenized sequences. The alignment process assigns the original BIO label to the first subtoken of each word, while subsequent subtokens within the same word receive a special “X” label to indicate their secondary status. During training, the loss computation specifically masks these “X” tokens to prevent gradient updates from misaligned labels, ensuring that the model learns from correctly aligned token-label pairs. This approach maintains the semantic integrity of medical entities while accommodating the subword tokenization requirements of transformer models. Input sequences are managed through a careful truncation and padding strategy, with sequences truncated or padded to a maximum length of 512 tokens. For longer clinical documents that exceed this limit, we implement a sliding window approach with appropriate overlap to prevent information loss while maintaining computational efficiency. This ensures that comprehensive clinical narratives are processed in their entirety without losing critical medical information.
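To make the alignment concrete, the following is a minimal sketch using the Hugging Face tokenizers API. The secondary “X” label described above is realized here as the ignore index -100, which PyTorch’s cross-entropy loss masks automatically; the label set and example sentence are illustrative rather than taken from MACCROBAT.

```python
# Minimal sketch of BIO label alignment over BERT subword tokenization.
# Non-initial subtokens (the "X" labels described above) receive the ignore
# index -100 so they are masked out of the loss computation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align(words, bio_labels, label2id, max_length=512):
    """Tokenize pre-split words and align one BIO label per subtoken."""
    encoding = tokenizer(words, is_split_into_words=True,
                         truncation=True, max_length=max_length)
    aligned, previous_word = [], None
    for word_idx in encoding.word_ids():
        if word_idx is None:              # special tokens: [CLS], [SEP]
            aligned.append(-100)
        elif word_idx != previous_word:   # first subtoken keeps the BIO label
            aligned.append(label2id[bio_labels[word_idx]])
        else:                             # continuation subtoken -> masked ("X")
            aligned.append(-100)
        previous_word = word_idx
    encoding["labels"] = aligned
    return encoding

# Example: "SOB on exertion" with a single symptom entity.
label2id = {"O": 0, "B-SIGN_SYMPTOM": 1, "I-SIGN_SYMPTOM": 2}
enc = tokenize_and_align(["SOB", "on", "exertion"],
                         ["B-SIGN_SYMPTOM", "O", "O"], label2id)
```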
4.2.3. Data Splitting and Validation
The dataset undergoes stratified partitioning to ensure balanced representation of entity types across training (80%) and test (20%) sets. We implement a robust validation strategy that includes 5-fold cross-validation for comprehensive performance estimation. Early stopping mechanisms based on validation F1-score are employed to prevent overfitting and ensure optimal model generalization to unseen clinical texts.
4.3. Supervised Fine-Tuning Approach
4.3.1. Model Architecture and Training Strategy
We implement full parameter supervised fine-tuning of the BERT-base-uncased model as our primary approach for medical named entity recognition. The base model, containing 110 million parameters across 12 transformer layers, provides a robust foundation for understanding biomedical text through its pre-trained representations. Each transformer layer incorporates multi-head self-attention mechanisms and feed-forward networks that enable the model to capture complex linguistic patterns inherent in clinical documentation.
BERT was specifically chosen as the foundational model for medical named entity recognition due to its strong capability to capture contextual relationships in text through deep bidirectional encoding. Its transformer-based architecture enables it to understand complex dependencies between tokens, which is essential in clinical narratives where meaning often depends on surrounding context. Compared to traditional sequence models, BERT offers superior generalization across varied syntactic structures and medical terminologies. The base version of BERT strikes a practical balance between model capacity and computational efficiency, making it suitable for fine-tuning on specialized biomedical datasets without overfitting.
The fine-tuning process involves adding a linear classification head on top of the final hidden states to predict BIO tags for each token in the input sequence. This classification head is randomly initialized and trained jointly with the fine-tuned BERT parameters to learn task-specific representations for medical entity recognition. The training objective utilizes cross-entropy loss with carefully designed class weights to address label imbalance commonly observed in NER datasets, where ‘O’ tags significantly outnumber entity tags.
The supervised fine-tuning approach updates all model parameters during training, allowing the model to adapt its internal representations to the specific characteristics of medical text. This comprehensive adaptation is particularly beneficial for capturing domain-specific linguistic patterns, medical terminology variations, and contextual relationships that are crucial for accurate entity identification in clinical narratives.
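As a concrete illustration, the sketch below adds a token-classification head to BERT and applies class-weighted cross-entropy; the weight value for the ‘O’ class is an illustrative placeholder, not the tuned setting from our experiments, and label2id follows the alignment sketch in Section 4.2.2.

```python
# Sketch of the fine-tuning setup: BERT plus a linear token-classification
# head, trained with class-weighted cross-entropy to counteract the dominant
# 'O' tag. The weight values here are illustrative, not tuned settings.
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification

num_labels = len(label2id)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels)

class_weights = torch.ones(num_labels)
class_weights[label2id["O"]] = 0.2        # down-weight the frequent 'O' class
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

def compute_loss(batch):
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    # Flatten (batch, seq_len, num_labels) for a token-level loss.
    return loss_fn(logits.view(-1, num_labels), batch["labels"].view(-1))
```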
4.3.2. Training and Optimization
The training process employs the AdamW optimizer with carefully tuned hyperparameters to ensure stable convergence and optimal performance. The learning rate schedule incorporates a linear warmup phase followed by linear decay, which has been shown to improve training stability and final model performance in transformer fine-tuning scenarios. Gradient accumulation techniques are employed to achieve effective larger batch sizes while working within memory constraints, enabling more stable gradient estimates during training.
Regularization techniques including weight decay and dropout are applied to prevent overfitting, particularly important given the limited dataset size. The training process incorporates early stopping mechanisms based on validation performance to identify the optimal stopping point and prevent degradation due to overfitting. Multiple training runs with different random seeds are conducted to ensure reproducibility and robustness of the reported results.
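A minimal sketch of this optimization scheme follows, assuming the model and compute_loss from the previous sketch and a standard PyTorch train_loader; the epoch count, learning rate, and warmup ratio are representative defaults rather than the exact tuned hyperparameters.

```python
# Sketch of the training loop: AdamW with weight decay, linear warmup then
# linear decay, and gradient accumulation for a larger effective batch size.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs, accum_steps = 10, 4
updates_per_epoch = len(train_loader) // accum_steps
optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * updates_per_epoch * epochs),  # 10% warmup
    num_training_steps=updates_per_epoch * epochs)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        loss = compute_loss(batch) / accum_steps   # scale for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```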
4.4. Retrieval-Augmented Generation Pipeline
4.4.1. Knowledge Base Construction and Indexing
The RAG pipeline addresses the limitation of fine-tuned models in providing standardized and interpretable entity descriptions, particularly for abbreviated or ambiguous medical terms commonly found in clinical documentation. Our knowledge base is constructed using the Unified Medical Language System (UMLS), a comprehensive metathesaurus containing over 4 million medical concepts from more than 200 biomedical vocabularies. This extensive resource provides the foundation for accurate entity normalization and disambiguation. The knowledge base preprocessing involves systematic extraction of medical concepts along with their preferred terms, synonyms, and clinical definitions. Each concept undergoes embedding generation using a domain-adapted sentence transformer specifically trained on biomedical text, creating dense vector representations that capture semantic relationships between medical concepts. These embeddings are then indexed using a high-performance vector database to enable efficient approximate nearest neighbor search during the retrieval phase.
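The sketch below illustrates this construction, with FAISS standing in for the vector database and two illustrative UMLS-style records; loading BioBERT through sentence-transformers adds a default mean-pooling layer, and the exact encoder and index configuration used in our system may differ.

```python
# Sketch of knowledge-base construction: embed UMLS concept strings with a
# biomedical encoder and index them for approximate nearest-neighbor search.
# FAISS stands in for the vector database; the concept records are illustrative.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("dmis-lab/biobert-base-cased-v1.1")  # mean pooling added

concepts = [  # (CUI, preferred term, synonyms) in UMLS style
    ("C0013404", "Dyspnea", ["shortness of breath", "SOB", "breathlessness"]),
    ("C0011860", "Diabetes mellitus type 2", ["T2DM", "type 2 diabetes"]),
]
texts = [f"{name}; synonyms: {', '.join(syns)}" for _, name, syns in concepts]

embeddings = encoder.encode(texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(embeddings, dtype="float32"))
```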
4.4.2. Entity Retrieval and Semantic Matching
The retrieval process begins by grouping entities predicted by the fine-tuned BERT model according to their semantic categories, such as symptoms, diseases, medications, procedures, and laboratory values. Each entity group is then processed through the retrieval pipeline to identify relevant medical concepts from the knowledge base. The input entities are embedded using the same sentence transformer employed for knowledge base construction, ensuring consistent representation spaces for accurate similarity computation.
Semantic retrieval is performed using cosine similarity between entity embeddings and knowledge base concept embeddings, with the top-k most similar concepts retrieved as candidates for normalization. This semantic matching approach captures conceptual relationships that may not be apparent through exact string matching, enabling identification of relevant medical concepts even when terminology varies between the input text and the knowledge base.
4.4.3. Dictionary-Based Re-Ranking Strategy
Our novel Dictionary-RAG approach enhances traditional semantic retrieval through a sophisticated re-ranking mechanism that combines semantic similarity with lexical matching, as shown in Figure 2. This hybrid approach addresses limitations of purely semantic approaches by incorporating explicit keyword overlap analysis, which is particularly valuable for medical terminology where specific terms and abbreviations carry precise clinical meanings.
The re-ranking process operates as an enhancement layer on top of initial semantic retrieval, refining the list of retrieved medical concepts based on their lexical alignment with the input entity. After extracting top-k candidate concepts using embedding-based cosine similarity, each candidate is evaluated for terminological closeness to the original entity string through a dictionary-driven lexical scoring mechanism. While semantic retrieval using embedding similarity enables the discovery of conceptually related medical terms, it often lacks the precision required for accurate clinical entity normalization. The top-k retrieved concepts, although semantically similar, may represent broader or tangential meanings, especially when the input entity is abbreviated, ambiguous, or informally phrased. For example, an input entity like “T2DM” may yield concepts such as “glucose regulation disorder” or “metabolic disease,” which are related but not terminologically exact.
To improve specificity, we introduce a dictionary-based re-ranking strategy that explicitly evaluates lexical similarity between the input entity and candidate concept descriptions. This step is crucial in clinical NLP, where high precision is necessary and small terminological differences can alter medical interpretation. Keyword-based matching allows the model to determine whether a candidate concept includes a standardized equivalent of the ambiguous term, particularly within its known aliases or definitions. In effect, the keyword match acts as a mechanism to identify the exact standardized term that corresponds to the given entity, resolving ambiguity by linking the extracted mention to a canonical medical concept. The overlap between processed tokens is then measured to compute a lexical similarity score. To integrate both conceptual relevance and terminological precision, we define a composite re-ranking score:
s_final = α · s_sem + β · s_lex,
where s_sem is the original semantic similarity score from embedding retrieval and s_lex represents the token-level overlap score. The weights α and β, constrained such that α + β = 1, control the trade-off between abstraction and specificity. Candidates are then re-ordered based on this composite score.
Concepts that are both semantically similar and contain a standardized version of the ambiguous entity are ranked highest. This hybrid strategy significantly improves the quality of retrieved results by ensuring they are not only contextually related but also terminologically grounded. The re-ranked concepts are then passed to the large language model for further expansion and normalization.
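A minimal sketch of this re-ranking step follows; the stopword list and tokenization are illustrative simplifications of the dictionary-driven scoring, with the weights defaulting to the values found optimal in Section 6.3.

```python
# Sketch of dictionary-based re-ranking: combine the semantic similarity from
# retrieval with a token-overlap (Jaccard) score over candidate descriptions.
import re

STOPWORDS = {"number", "level", "scan", "of", "the", "a"}  # illustrative

def _tokens(text):
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def lexical_overlap(entity, candidate_text):
    a, b = _tokens(entity), _tokens(candidate_text)
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(entity, candidates, alpha=0.6, beta=0.4):
    """candidates: list of (concept_text, semantic_similarity) pairs."""
    scored = [(text, alpha * s_sem + beta * lexical_overlap(entity, text))
              for text, s_sem in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# The terminologically exact concept now outranks the broader but
# semantically closer candidate.
top = rerank("T2DM", [
    ("glucose regulation disorder", 0.81),
    ("Diabetes mellitus type 2; synonyms: T2DM, type 2 diabetes", 0.78),
])
```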
4.4.4. Large Language Model Integration and Prompt Engineering
The final and crucial stage of the RAG pipeline leverages a large language model for sophisticated contextual entity enhancement and medical terminology normalization. This stage represents the convergence of retrieved medical knowledge with advanced language understanding capabilities to produce clinically accurate and interpretable entity descriptions. The integration process transforms the raw entities identified by the fine-tuned transformer into standardized, expanded medical terminology that maintains clinical precision while enhancing readability for healthcare professionals.
The prompt engineering component constitutes a critical methodological innovation, where carefully crafted prompts serve as the interface between retrieved medical knowledge and the language model’s reasoning capabilities. Each prompt is systematically constructed to combine the original entity extracted from clinical text with its corresponding retrieved knowledge context from the UMLS database. The prompt structure follows a standardized template that instructs the model to perform specific normalization tasks while maintaining strict adherence to the provided medical context. The prompt design incorporates several key methodological principles to ensure clinical accuracy and minimize potential hallucinations. First, the prompts explicitly emphasize reliance on the provided UMLS context rather than the model’s pre-training knowledge, creating a knowledge-grounded generation process that prioritizes established medical terminology over potentially outdated or inaccurate information. Second, the prompts include specific instructions for handling uncertainty, directing the model to indicate when insufficient information is available rather than generating potentially incorrect medical information.
The prompt template structure guides the language model to generate comprehensive entity descriptions that include multiple components: full expanded forms of abbreviated terms, concise clinical definitions that provide medical context, alternative common terms or synonyms that may be encountered in different clinical settings, and relevant clinical significance where appropriate. This multi-faceted approach ensures that the normalized entities provide maximum utility for clinical applications while maintaining accuracy and consistency.
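An illustrative template embodying these principles is shown below; the exact wording used in our system is not reproduced here, and the placeholder fields and retrieved-context format are hypothetical.

```python
# Illustrative prompt template following the design principles above:
# knowledge grounding, explicit uncertainty handling, and structured output.
PROMPT_TEMPLATE = """You are a clinical terminology assistant.
Entity extracted from a clinical note: "{entity}" (category: {category})

Retrieved UMLS context:
{retrieved_concepts}

Using ONLY the retrieved context above, not your prior knowledge:
1. Give the full expanded form of the entity if it is abbreviated.
2. Provide a one-sentence clinical definition.
3. List common synonyms present in the context, if any.
4. If the context is insufficient, reply exactly "INSUFFICIENT CONTEXT"
   instead of guessing.
Keep the answer under 80 words."""

prompt = PROMPT_TEMPLATE.format(
    entity="SOB", category="SIGN_SYMPTOM",
    retrieved_concepts="C0013404 | Dyspnea | synonyms: shortness of breath, SOB")
```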
The language model integration process employs carefully optimized generation parameters to balance consistency with appropriate clinical variability. The temperature setting is maintained at low values to ensure reproducible and consistent outputs, while other parameters such as top-p sampling and presence penalties are tuned to prevent repetitive or overly generic responses. The generation process is constrained to produce outputs within specified length limits to ensure conciseness while maintaining completeness of medical information.
Post-processing mechanisms are implemented to extract structured information from the language model responses and ensure consistency in formatting across different entity types and categories. This includes standardization of terminology presentation, validation of medical accuracy against known standards, and formatting consistency that enables seamless integration with downstream clinical applications. The post-processing stage also implements quality control measures to identify and handle potential generation errors or inconsistencies that may arise during the normalization process.
4.5. Pipeline Integration and Entity Flow Management
The complete methodology integrates the supervised fine-tuning and RAG components through a systematically designed pipeline that manages the flow of medical entities from initial extraction to final normalization. The pipeline architecture ensures seamless transfer of identified entities from the fine-tuned transformer model to the RAG system, maintaining entity integrity and contextual information throughout the processing stages. This integration addresses the critical requirement of preserving entity relationships and clinical context while transforming raw medical terminology into standardized, interpretable descriptions. The entity flow management system begins with the output from the supervised fine-tuning stage, where identified medical entities are systematically organized and categorized according to their semantic types. Each entity retains its original textual context and positional information from the source clinical document, enabling the RAG pipeline to make informed normalization decisions based on both the entity itself and its surrounding clinical context. This contextual preservation is particularly important for disambiguating entities that may have different meanings depending on their clinical usage.
The transition from entity extraction to normalization involves sophisticated data structuring mechanisms that prepare entities for optimal retrieval and processing. Entities are grouped by semantic categories and processed in batches to optimize computational efficiency while maintaining accuracy. The pipeline implements robust error handling and fallback mechanisms to manage edge cases, ensuring reliable performance across diverse clinical documentation styles and varying entity complexity levels. Quality assurance mechanisms are embedded throughout the pipeline integration process to maintain clinical accuracy and consistency. These include validation checkpoints that verify entity integrity during transfers between pipeline stages, consistency checks that ensure uniform processing across entity categories, and accuracy validation that compares normalized entities against established medical standards. The methodology ensures that the integrated system maintains high precision in entity identification while providing clinically meaningful and interpretable normalization results.
5. Experimental Design
To assess the effectiveness of the proposed Dictionary-Infused Retrieval-Augmented Generation (DiRAG) framework, we designed experiments that cover dataset selection, baseline comparisons, evaluation metrics, and implementation settings. All datasets used in this study, including MACCROBAT and AGBONNET, are described in Section 3.
We compared DiRAG against four baselines: (a) transformer-based NER models (BERT, BioBERT, ClinicalBERT) fine-tuned on MACCROBAT, (b) LLM-only normalization using Gemini without retrieval, (c) semantic retrieval with LLM normalization but without dictionary-based re-ranking, and (d) dictionary look-up via direct string matching against UMLS terms. These baselines allow us to isolate the individual contributions of fine-tuning, retrieval, and dictionary infusion.
Performance was measured using Accuracy, Precision, Recall, and F1 Score at the concept level, computed against gold-standard mappings. Two evaluation criteria were applied: exact match, which required strict lexical identity, and relaxed match, which allowed clinically equivalent predictions differing only in case, substring containment, or non-essential token variations (e.g., “MRI” vs. “mri”, “pBC1 plasmid” vs. “pBC1”).
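The relaxed criterion can be expressed as a simple predicate, sketched below; the non-essential word list is an illustrative subset rather than the exhaustive list used in our evaluation.

```python
# Sketch of the relaxed-match criterion: case-insensitive equality, substring
# containment, or meaningful token overlap after dropping trivial words.
import re

NON_ESSENTIAL = {"number", "level", "scan"}  # illustrative subset

def _tokens(text):
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in NON_ESSENTIAL}

def relaxed_match(pred, gold):
    p, g = pred.lower().strip(), gold.lower().strip()
    if p == g:                    # "MRI" vs "mri"
        return True
    if p in g or g in p:          # "pBC1" vs "pBC1 plasmid"
        return True
    return bool(_tokens(pred) & _tokens(gold))

assert relaxed_match("MRI", "mri")
assert relaxed_match("Plasmid pBC1", "pBC1")
```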
Key experimental settings are summarized in Table 1.
6. Experimental Results
We evaluate our approach across two dimensions: (1) model performance on medical NER using fine-tuning strategies (SFT and LoRA), and (2) post-processing effectiveness of our Dictionary-Infused Retrieval-Augmented Generation (DiRAG) pipeline.
6.1. Comparison with Existing Fine-Tuned Models
We compare our results with several benchmark pre-trained transformer models fine-tuned for medical NER, as reported in the benchmark paper [34]. Table 2 presents their reported F1 scores, along with our own results after fine-tuning BERT on the MACCROBAT dataset.
Additionally, our model achieved a token-level accuracy of 82.8% on the test set, indicating strong generalization capability when applied to biomedical text. This performance was attained despite the relatively small size of the MACCROBAT dataset and the use of a base BERT model, rather than a domain-specific variant like BioBERT or ClinicalBERT. These results underscore the strength of task-specific fine-tuning, showing that even general-purpose transformer models can yield superior performance when trained on high-quality, domain-labeled data. This is particularly relevant for resource-constrained environments where access to large pre-trained medical models may be limited.
Compared to the benchmark models from existing literature, our fine-tuned BERT model demonstrates a notable improvement in F1 score, achieving 0.708—significantly higher than the previously reported best of 0.579 for BioBERT. This margin validates the effectiveness of our training pipeline and indicates that targeted fine-tuning with curated data can outperform larger or more specialized models. Such findings encourage the adoption of practical, efficient, and replicable training strategies over reliance on increasingly large-scale models, especially in specialized domains like clinical NLP.
6.2. Supervised Fine-Tuning vs. Parameter-Efficient Fine-Tuning (PEFT)
We evaluated both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA for adapting BERT to clinical NER. Although PEFT offers computational efficiency by updating only 0.59% of parameters (653,651 trainable out of 109 million), it achieved poor performance on the MACCROBAT dataset (F1 = 0.05). In contrast, SFT substantially outperformed PEFT, reaching an F1 score of 0.708 (Table 3). We attribute LoRA’s underperformance to the limited size of the dataset and the need for precise domain adaptation, which favors full-model fine-tuning. Given its superior accuracy and ability to capture the nuances of biomedical language, SFT was adopted as our primary training strategy.
6.3. Sensitivity Analysis of Composite Ranking Weights
We examined the effect of weighting semantic similarity (s_sem) and lexical overlap (s_lex) in the re-ranking stage by varying the parameters α and β, constrained such that α + β = 1, across the range [0.0, 1.0] in increments of 0.1. The results showed clear trade-offs. When semantic similarity dominated (α ≥ 0.7), recall improved for entities with paraphrased or contextually diverse terminology, but precision declined due to the inclusion of loosely related concepts. Conversely, higher lexical weight (β ≥ 0.6) enhanced precision for abbreviation-heavy terms such as “ASA” and “MIBI,” but recall decreased as broader semantic variants were underrepresented.
The optimal balance was observed at α = 0.6 and β = 0.4, which yielded the highest F1 score of 0.708. This weighting preserved semantic breadth while grounding ambiguous entities in terminological precision. The analysis underscores the complementary roles of semantic similarity and lexical overlap, demonstrating that their integration is critical for robust clinical entity normalization.
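The sweep itself is straightforward, as sketched below; evaluate_f1 and dev_set are hypothetical stand-ins for the evaluation harness and development data used in our experiments.

```python
# Sketch of the weight sweep over alpha in [0.0, 1.0] with beta = 1 - alpha.
# evaluate_f1() and dev_set are hypothetical stand-ins for the evaluation
# pipeline (retrieval + re-ranking scored against gold mappings).
results = {}
for step in range(11):
    alpha = step / 10
    results[alpha] = evaluate_f1(dev_set, alpha=alpha, beta=1.0 - alpha)

best_alpha = max(results, key=results.get)  # 0.6 in our experiments
```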
6.4. Case Study: Entity Enhancement with RAG
To further demonstrate the effectiveness of our Dictionary-Infused RAG pipeline, we conducted a case study on a real clinical note. The note was first passed through the fine-tuned BERT model to extract medical entities. The same entities were then enhanced using our RAG pipeline by retrieving semantically similar UMLS concepts and prompting Gemini for normalization.
As shown in Figure 3, a clinical note containing several abbreviations and vague medical terms, such as “SOB”, “CKD”, “TTE”, and “MIBI”, is first processed through our fine-tuned BERT model trained on the MACCROBAT dataset. This model segments the note into structured entity categories (e.g., AGE, SEX, SYMPTOMS, DISEASE DISORDER, DIAGNOSTIC PROCEDURE), but the extracted entities often retain ambiguous or shorthand terms from the original text.
To address this, the NER output is passed into our Dictionary-Infused RAG pipeline. Specifically, only potentially ambiguous terms are embedded using BioBERT and queried against a Pinecone vector database populated with UMLS-derived concepts. The retrieved results are re-ranked based on keyword overlap with the original term, and the top matches are used to construct a targeted prompt for Gemini. The LLM is then instructed to normalize the input entity using only the retrieved context, ensuring that vague or abbreviated terms are enhanced while unambiguous entries remain unchanged. This selective correction mechanism preserves the structural integrity of the original NER categories while improving their clarity. For instance, “SOB” becomes “Shortness of breath”, “TTE” is expanded to “Transthoracic echocardiogram”, and “Furosemide IV” is reformatted as “Intravenous furosemide”. This transformation leads to a final output that is more interpretable, medically precise, and suitable for downstream clinical applications.
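Putting the pieces together, the sketch below shows the selective enhancement step end to end, reusing the encoder, index, texts, rerank, and PROMPT_TEMPLATE sketched in Section 4. The ambiguity heuristic is a simplified stand-in for our actual detection logic, and the Gemini call uses the public google-generativeai client as one possible integration.

```python
# End-to-end sketch of selective entity enhancement. Unambiguous entities pass
# through unchanged; ambiguous ones are retrieved, re-ranked, and normalized.
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
llm = genai.GenerativeModel("gemini-1.5-flash")

def looks_ambiguous(entity):
    # Simplified heuristic: short all-caps abbreviations ("SOB", "TTE")
    # or tokens containing digits ("T2DM") are flagged for enhancement.
    return (entity.isupper() and len(entity) <= 5) or any(c.isdigit() for c in entity)

def enhance(entity, category, top_k=5):
    if not looks_ambiguous(entity):
        return entity
    query = encoder.encode([entity], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query, min(top_k, index.ntotal))
    candidates = [(texts[i], float(s)) for i, s in zip(ids[0], scores[0])]
    top3 = [text for text, _ in rerank(entity, candidates)[:3]]
    prompt = PROMPT_TEMPLATE.format(entity=entity, category=category,
                                    retrieved_concepts="\n".join(top3))
    return llm.generate_content(prompt).text.strip()

print(enhance("SOB", "SIGN_SYMPTOM"))  # e.g., "Shortness of breath (dyspnea) ..."
```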
6.5. Qualitative Comparison: LLM vs. RAG-Enhanced Output
To understand the impact of our RAG pipeline, we evaluated its ability to clarify ambiguous entities produced by the NER model. Figure 4 shows a sample of entities across different categories, comparing outputs generated by the LLM alone versus the same LLM with RAG enhancement. In the table, a check mark (✓) indicates that RAG improved the output, while a cross (✗) indicates no improvement.
RAG-enhanced outputs yield more interpretable and clinically precise terms, such as expanding “TTE” to “Transthoracic echocardiogram,” “MIBI scan” to “Myocardial perfusion imaging with sestamibi,” and clarifying “Furosemide IV” as “Intravenous furosemide” (Figure 4). By leveraging UMLS knowledge, the system reduces hallucinations and provides standardized terminology critical for downstream clinical applications.
6.6. RAG Enhancement Evaluation
To evaluate the effectiveness of our RAG pipeline, we manually inspected outputs from fine-tuned BERT and compared them against RAG-enhanced versions. Our focus was on resolving ambiguous, abbreviated, or incomplete entities. From Figure 5, we observe that the RAG pipeline effectively enhanced the clarity of several ambiguous entities extracted by the fine-tuned BERT model. Based on initial inspection, 10 ambiguous terms were expected to require normalization. However, the RAG process identified 6 additional ambiguous or unclear entities that were not initially flagged. This led to a total of 12 entities being enhanced through the dictionary-infused RAG pipeline. The results underscore the pipeline’s utility not only in addressing known ambiguities but also in uncovering and resolving overlooked terminology through contextual retrieval and prompting.
6.7. Quantitative Evaluation of the DiRAG Framework
We conducted a large-scale evaluation of the proposed DiRAG framework on two benchmark datasets: 200 clinical notes from MACCROBAT and 100 notes from AGBONNET. For both datasets, Accuracy, Precision, Recall, and F1 Score were computed by comparing predicted normalized concepts against gold-standard mappings. Results are reported under both exact-match and relaxed-match criteria to capture clinically valid equivalence beyond surface-level lexical variation.
Relaxed matching was introduced to account for clinically equivalent predictions that differ only in form. A prediction was considered correct if it satisfied case-insensitive equality (e.g., “MRI” vs. “mri”), substring containment (e.g., “Plasmid pBC1” vs. “pBC1”), or meaningful token overlap after removing trivial words such as “number,” “level,” or “scan” (e.g., “Etoposide + Cisplatin” vs. “EC Regimen (Cyclophosphamide and Etoposide)”). This ensured that the evaluation captured semantic validity while avoiding inflated scores from superficial overlaps.
The results show that DiRAG achieves consistently high precision across both datasets, highlighting its robustness against false positives (Table 4). However, strict exact-match criteria constrained recall, producing moderate accuracy (72.16% for MACCROBAT and 63.64% for AGBONNET). Under relaxed match, recall improved significantly, with accuracies exceeding 86% and F1 Scores above 0.92. These results confirm that the Dictionary-Infused re-ranking mechanism effectively captures clinically meaningful equivalence, mitigating the limitations of purely semantic retrieval or surface-level matching.
7. Limitations
While our approach shows promising results, several limitations should be noted. First, the manual evaluation of RAG-enhanced outputs was conducted on only 10 clinical notes, which may limit the scope of qualitative insights. Larger-scale validation, such as on 500 notes or more, would provide stronger evidence of generalizability but would require scalable annotation infrastructure and domain-specific oversight. Second, the evaluators were not trained medical professionals, raising the possibility of errors in assessing the clinical appropriateness of substitutions. Expert validation would provide a more reliable measure of clinical relevance. Finally, the current evaluation is based on the MACCROBAT and AGBONNET corpora, which together comprise 300 notes from limited specialties. Testing on broader and more diverse corpora, such as MIMIC-III or multi-institution datasets, is needed to fully establish generalizability.
8. Contribution to the Literature
We introduce a novel form of Dictionary-Infused Retrieval-Augmented Generation (DiRAG) for medical entity normalization. Unlike traditional RAG pipelines that rewrite or regenerate all extracted entities using retrieved knowledge, our approach is selective: we replace only ambiguous, unclear, or abbreviated terms with standardized equivalents drawn from a UMLS-grounded dictionary. This allows us to retain the structure and correctness of confidently predicted entities while resolving ambiguous terms like “ASA”, “SOB”, or “MIBI” into their full, clinically meaningful forms. As a result, our method enhances interpretability without risking the distortion or hallucination of already accurate entities.
This approach is particularly well-suited to real-world clinical notes, where raw NER outputs often contain a mix of standard terms and vague shorthand. By applying RAG only to those entities that require it, we maintain high precision and improve clarity without overcorrecting. Our use of a domain-specific encoder (BioBERT), a UMLS-based vector store (Pinecone), and a prompting strategy that instructs the LLM (Gemini 1.5 Flash) to rely solely on retrieved knowledge ensures that corrections are grounded and trustworthy. This combination of precision, efficiency, and domain grounding sets our DiRAG method apart from existing generative or rewriting-based approaches.
9. Conclusions
This study presents a robust and practical framework for improving Named Entity Recognition (NER) in clinical text by combining the strengths of supervised fine-tuning and retrieval-augmented generation. By fine-tuning a general-purpose BERT model on the MACCROBAT dataset, we achieved superior performance compared to several domain-specific models, demonstrating the effectiveness of targeted adaptation. To enhance clarity and standardization, we introduced a dictionary-infused RAG module that selectively normalizes ambiguous or abbreviated medical entities using knowledge from the UMLS and guided prompting of a large language model. Our hybrid approach not only improves entity extraction accuracy but also ensures the outputs are clinically meaningful and interpretable. Through rigorous evaluation and case studies, we showed that integrating semantic retrieval, lexical re-ranking, and prompt engineering produces reliable and context-aware entity normalization. This solution addresses key limitations in current clinical NER systems, offering a scalable and explainable method that aligns with real-world healthcare needs.
Overall, this work contributes a significant advancement in clinical NLP by demonstrating how combining transformer-based models with knowledge-grounded generation can lead to more accurate, transparent, and usable results in medical information extraction.