Systematic Review

What We Know About the Role of Large Language Models for Medical Synthetic Dataset Generation

by Larissa Montenegro 1,*,†, Luis M. Gomes 2,† and José M. Machado 1,†
1 Centro ALGORITMI/LASI, University of Minho, 4704-553 Braga, Portugal
2 Centro ALGORITMI/LASI, University of Azores, 9500-321 Ponta Delgada, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
AI 2025, 6(6), 109; https://doi.org/10.3390/ai6060109
Submission received: 14 March 2025 / Revised: 27 April 2025 / Accepted: 23 May 2025 / Published: 27 May 2025
(This article belongs to the Section Medical & Healthcare AI)

Abstract

Synthetic medical text generation has emerged as a solution to data scarcity and privacy constraints in clinical NLP. This review systematically evaluates the use of Large Language Models (LLMs) for structured medical text generation, examining techniques such as retrieval-augmented generation (RAG), structured fine-tuning, and domain-specific adaptation. Four search queries were applied following the PRISMA methodology to identify and extract data from 153 studies. Key benchmarking metrics, such as performance measures, and qualitative insights, including methodological trends and challenges, were documented. The results show that while LLM-generated text improves fluency, hallucinations and factual inconsistencies persist. Structured consultation models, such as SOAP and Calgary–Cambridge, enhance coherence but do not fully prevent errors. Hybrid techniques that combine retrieval-based grounding with domain-specific fine-tuning improve factual accuracy and task performance. Conventional evaluation metrics (e.g., ROUGE, BLEU) are insufficient for medical validation, highlighting the need for domain-specific benchmarks. Privacy-preserving strategies, including differential privacy and PHI de-identification, support regulatory compliance but may reduce linguistic quality. These findings are relevant for clinical NLP applications, such as AI-powered scribe systems, where structured synthetic datasets can improve transcription accuracy and documentation reliability. The conclusions highlight the need for balanced approaches that integrate medical structure, factual control, and privacy to enhance the usability of synthetic medical text.

1. Introduction

In Natural Language Processing (NLP), datasets play a crucial role in model development and evaluation. However, in biomedical applications, data availability is often restricted due to privacy concerns and security regulations, limiting access to real clinical corpora. These challenges are particularly evident in the context of automated clinical documentation systems, where physician–patient conversations are difficult to obtain due to compliance constraints. Consequently, synthetic data generation has emerged as a viable solution for training and validating clinical NLP models, enabling the development of privacy-preserving AI systems for medical documentation [1,2]. Earlier methods, including rule-based approaches and weak supervision frameworks such as Snorkel [3], have automated parts of the labeling process but often struggle with adaptability and semantic precision. These limitations are critical in biomedical NLP, where terminology accuracy, contextual coherence, and adherence to ethical standards are essential factors [4,5].
With the emergence of new Large Language Models (LLMs), such as ChatGPT from OpenAI, Llama from Meta, and Gemini from Google, new possibilities for synthetic dataset generation have arisen [6,7,8]. These LLMs can address the shortage of annotated medical data through prompt engineering, thereby reducing manual labeling costs [9,10]. However, challenges remain in ensuring factual accuracy, maintaining structured clinical documentation, and aligning with standardized medical reporting frameworks. Some domain-specific augmentation tools, such as MedAug [11], DataSynth [12], and ClinicalBERT-based methods [13], incorporate medical knowledge but often lack strict adherence to well-established documentation structures, such as SOAP (Subjective, Objective, Assessment, and Plan) [14] or the Calgary–Cambridge Guide for medical communication [15].
Our previous work [16] proposed an AI-based medical scribe system to automate clinical documentation using Automatic Speech Recognition (ASR) and NLP techniques for European Portuguese. We identified key challenges in real-time physician–patient transcription, including domain adaptation failures, limited availability of training data, and inconsistencies in structured text generation. In light of these constraints, alternative data augmentation strategies, including synthetic dataset generation, could help mitigate the limitations associated with the availability of real clinical data. LLMs have demonstrated potential in generating structured synthetic medical text, which could be leveraged to create more diverse and representative training datasets for medical scribe AI models.
This review aims to systematically evaluate the methodologies, benchmarking strategies, and challenges associated with synthetic medical dataset generation for clinical LLMs. Synthetic medical text generation approaches are assessed through comparative benchmarking of dataset reliability, generalization capacity, and regulatory compliance. The findings of this study may be relevant for advancing AI-based medical scribe systems designed to automate clinical documentation, as synthetic datasets can enhance transcription accuracy, improve domain adaptation, and support multilingual processing in clinical applications [16].
This review follows the PRISMA guidelines [17] and includes four search queries to ensure methodological breadth. Results are synthesized based on accuracy, hallucination risk, domain alignment, and privacy performance.
The remainder of this paper is organized as follows. Section 2 outlines the PRISMA-based systematic review methodology. Section 3 examines synthetic medical text generation, emphasizing approaches, model architectures, and performance metrics. Section 4 addresses key challenges, such as hallucination and dataset reliability, and discusses how structured frameworks can mitigate these challenges. Section 5 presents the conclusions and directions for future research.

2. Methodology

We conducted a systematic review that follows the guidelines provided by PRISMA [17], encompassing three phases: (i) the literature search strategy, (ii) the inclusion and exclusion criteria, and (iii) data extraction.

2.1. Literature Search Strategy

The Scopus database was selected for the literature search due to its extensive coverage of indexed publications [18]. The last search was conducted on 11 February 2025. Multiple search queries were employed to obtain a broad range of studies and minimize selection bias. This approach aligns with the PRISMA emphasis on systematic and reproducible search methodologies. Only publications from the last five years were considered, focusing on peer-reviewed journal articles and conference papers in Computer Science and Engineering, to reflect recent developments. The PRISMA framework guided the selection process, ensuring a structured approach to study identification, screening, and inclusion.
Instead of relying on a single search query, four distinct queries were designed to capture different aspects of synthetic dataset generation in biomedical NLP, progressively expanding the search scope while maintaining relevance.
(“ChatGPT*” OR “Chat-GPT*” OR “genAI*” OR “llms*” OR “large language model*” OR “generative artificial intelligence*” OR “GAI” OR “GPT-based model*” OR “transformer model*”) AND (“synthetic data*” OR “synthetic text*” OR “text data*” OR “medical conversation*” OR “text dataset augmentation*” OR “clinical text*” OR “medical text*”) AND (“gener*” OR “creat*”) AND (“AI” OR “NLP” OR “natural language processing*”) AND (“biomed*” OR “medic*” OR “health*” OR “clinic*” OR “hospital*” OR “clinical documentation*” OR “clinical annotation frameworks*” OR “SOAP*” OR “Calgary-Cambridge”) AND (“ethics*” OR “bias*” OR “privacy*” OR “HIPAA*” OR “GDPR” OR “data compliance*” OR “AI hallucination*”)
The first query aimed to identify research on synthetic datasets generated by LLMs for clinical applications. Particular emphasis was placed on structured annotation frameworks, such as SOAP and the Calgary–Cambridge guide, given their importance in structured clinical documentation. This query provided a baseline understanding of the existing work in structured synthetic text generation but resulted in only 17 papers, indicating the need to broaden the scope.
(“ChatGPT*” OR “Chat-GPT*” OR “genAI*” OR “llms*” OR “large language model*” OR “GPT-based model*” OR “generative artificial intelligence*” OR “GAI”) AND (“text data*” OR “text synthetic*” OR “text gener*” OR “synthetic data*” OR “clinical text*” OR “medical text*”) AND (“AI” OR “NLP” OR “natural language processing*”) AND (“biomed*” OR “medic*” OR “health*” OR “clinic*” OR “hospital*”)
Since the first query yielded limited results, the second query broadened the search to include LLM applications in biomedical and clinical text generation. Ethical considerations and data security terms were excluded to increase recall, focusing instead on technical approaches and dataset enhancement strategies. This significantly expanded the results to 124 papers.
(“ChatGPT*” OR “genAI*” OR “llms*” OR “large language model*” OR “generative artificial intelligence*” OR “GPT-based model*”) AND (“synthetic data*” OR “synthetic text*” OR “text data*” OR “synthetic datasets for medic*” OR “biomedical text generation*”) AND (“ethic*” OR “bias*” OR “privacy*” OR “HIPAA*” OR “GDPR” OR “data compliance*” OR “AI hallucination*” OR “responsible AI*”) AND (“AI” OR “NLP” OR “biomedical NLP*” OR “healthcare NLP*”)
Given the growing concerns regarding AI hallucinations, bias, and PHI de-identification, the third query specifically targeted the ethical, privacy, and fairness aspects of synthetic dataset generation in healthcare NLP. This query aimed to ensure that regulatory compliance (e.g., HIPAA, GDPR) and responsible AI development were adequately covered in this review. It resulted in 33 papers.
(“ChatGPT*” OR “Chat-GPT*” OR “genAI*” OR “llms*” OR “large language model*” OR “generative artificial intelligence*” OR “GAI” OR “GPT-based model*” OR “Claude*” OR “Gemini*” OR “GatorTron*” OR “BioGPT*” OR “transformer model*”) AND (“synthetic data*” OR “synthetic text*” OR “text data*” OR “structured annotation datasets*” OR “clinical text*”) AND (“gener*” OR “creat*” OR “augment*”) AND (“AI” OR “NLP” OR “biomedical NLP*” OR “healthcare NLP*”) AND (“comparison*” OR “evaluation*” OR “benchmarks*” OR “framework*”)
The fourth query aimed to assess the quality of synthetic datasets by collecting studies focused on frameworks, performance metrics, and structured annotation. Given the importance of validating synthetic datasets, this query addressed how LLM-generated clinical texts are evaluated against gold-standard benchmarks. The search retrieved 38 papers, with some overlapping with those identified in earlier queries.
In total, 212 papers were retrieved across the four queries. After removing duplicates, 153 unique articles remained for the final analysis. This systematic search minimized selection bias, ensured coverage of diverse approaches, and enhanced methodological rigor, thereby forming a solid basis for evaluating state-of-the-art LLM-driven biomedical NLP.
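As a small illustration of the merging and deduplication step, the following Python sketch combines exported records from the four queries and removes duplicates by DOI or normalized title. The file names and column labels are assumptions for illustration and do not correspond to the actual Scopus export used.

import csv

def load_records(path):
    # Each export file is assumed to be a CSV with "Title" and "DOI" columns.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        # Prefer the DOI as the key; fall back to a normalized title.
        key = (rec.get("DOI") or rec.get("Title", "")).strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

all_records = []
for path in ["query1.csv", "query2.csv", "query3.csv", "query4.csv"]:
    all_records.extend(load_records(path))

print(len(all_records), "retrieved;", len(deduplicate(all_records)), "unique")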

2.2. Screening Process and Data Extraction

Figure 1 shows the four phases of the systematic screening process, including the number of records identified, screened, excluded, and retained for final analysis. Studies were included if they met the following criteria: (i) proposed methodologies for LLM-based synthetic dataset generation, (ii) presented frameworks or techniques addressing challenges such as data scarcity, annotation efficiency, or domain-specific adaptation, and (iii) made practical contributions to biomedical NLP workflows, such as dataset pipeline development, model fine-tuning strategies, or architectural innovations.
The lead reviewer conducted all stages of the screening process, including initial title and abstract screening, full-text validation, and data extraction. No automation tools were used during any stage of selection or data collection.
Studies were excluded if they met any of the following criteria: (i) focused on non-biomedical applications of LLMs, (ii) provided no empirical evaluation of synthetic datasets, or (iii) were limited to opinion pieces, editorials, or high-level discussions lacking methodological detail.
During the screening process, papers unrelated to LLMs or synthetic text generation were excluded during the first stage. The remaining papers underwent full-text review to confirm they met the inclusion and exclusion criteria. Emphasis was placed on studies demonstrating progress in synthetic dataset methodologies, structured clinical text processing, and benchmarking techniques, ensuring a focused selection of the relevant literature. After this process, 39 articles were retained, forming the basis for examining the role of LLMs in biomedical applications.
A structured data extraction framework was employed to maintain consistency and analytical depth across the reviewed studies. A standardized table documented study attributes, including bibliographic details for traceability, and summarized key findings such as objectives, technical contributions, and advances in synthetic dataset generation. The four-query search strategy provided essential quantitative and qualitative data regarding datasets, model architectures, fine-tuning strategies, and evaluation metrics, including F1-score, BLEU, ROUGE, and BERTScore. Furthermore, biomedical NLP applications and the technical challenges addressed by each study were identified.
The four distinct search queries ensured that a comprehensive and unbiased review was conducted iteratively. The PRISMA methodology facilitated the systematic capture of both quantitative measures, such as performance metrics, and qualitative insights, such as methodological trends and challenges, across different subdomains of synthetic medical text generation.
The synthesis method adopted was narrative, organizing results by methodological groupings: structured generation models, retrieval-augmented generation, privacy-preserving frameworks, and benchmarking strategies. No meta-analysis was conducted. Effect measures were not applicable, and no formal assessment of reporting bias or evidence certainty was performed. These decisions are consistent with the scope and objectives of this review. This systematic review protocol was registered with the Open Science Framework (OSF) on 22 May 2025, under the registration DOI https://doi.org/10.17605/OSF.IO/SF2ED.

3. Developments in Synthetic Medical Text Generation

Privacy regulations and the limited availability of high-quality, real-world clinical datasets have prompted a shift toward synthetic data generation in medical NLP. Large Language Models (LLMs), including the GPT series from OpenAI and locally deployed models, now generate structured medical text that resembles clinical interactions. Although LLMs are a leading approach for unstructured medical text generation, other generative AI models—such as Generative Adversarial Networks (GANs) and Variational Autoencoders—have also been studied to enhance biological plausibility, linguistic coherence, and structured text generation [19,20].
Synthetic medical text generation has leveraged various methodologies, including LLM-based structured text generation, synthetic dialogue augmentation, data-to-text transformation, retrieval-augmented generation (RAG) [21], multi-agent architectures, and privacy-preserving synthetic data techniques [22]. The foundation of modern LLMs, including GPT-4, BioGPT, and Qwen-7B, is the Transformer architecture, which replaces recurrence with self-attention mechanisms to improve efficiency and scalability [8,10,23,24,25,26,27,28]. These approaches have been applied in domains such as emergency medicine, electronic medical records (EHRs), medical abstraction, and privacy-preserving text generation to address data sparsity and patient confidentiality constraints in clinical NLP [10,24,25,27,28,29].
All included studies are cited in the main text and summarized in Table 1, which details model architectures, applications, optimization techniques, and evaluation metrics. We narratively synthesized the findings by methodological theme (e.g., structured generation, retrieval-augmented generation, privacy-preserving frameworks, and benchmarking techniques) to identify common trends and technical challenges. No meta-analysis was conducted due to the heterogeneity of study designs and evaluation metrics. Accordingly, effect measures, reporting bias assessments, and certainty of evidence grading were not performed.

3.1. Approaches to Synthetic Medical Text Generation

These methodologies build on the Transformer architecture, which replaces recurrence with self-attention mechanisms to improve efficiency and scalability in text generation tasks [8,10,23,24,25,26,27,28], and have been applied to emergency medicine, electronic medical records (EHRs), medical abstraction, and privacy-preserving text generation to address data sparsity and patient confidentiality constraints in clinical NLP [10,24,25,27,28,29].
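As a minimal illustration of the self-attention mechanism referenced above, the following sketch implements scaled dot-product attention in plain NumPy; the sequence length, embedding size, and random inputs are illustrative only and do not correspond to any specific model discussed in this review.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the key dimension
    return weights @ V                                 # attention-weighted mixture of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                           # 5 tokens, 16-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q = K = V
print(out.shape)                                       # (5, 16)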
Table 1 presents a comparative analysis of synthetic medical text generation models, detailing their architectures, applications, optimization techniques, and evaluation metrics. The table summarizes key methodologies, ranging from structured text generation in emergency medicine and EHR-based applications to data augmentation strategies for clinical summarization.
LLM-based structured text generation in emergency medicine and EHR-based applications has demonstrated the potential of synthetic data in medical NLP. Moser et al. (2024) propose a multi-stage synthetic dialogue generation pipeline for emergency medicine, leveraging Zephyr-7b-beta to generate structured ambulance–patient interactions. The framework consists of four sequential phases: (i) initial ambulance interaction, (ii) triage and vital signs assessment, (iii) medication administration, and (iv) hospital arrival, each modeled using task-specific prompt templates based on real-world clinical protocols. To improve linguistic fluency, the dialogues undergo automated refinement and translation into German using GPT-4 Turbo, which improves coherence but reduces accuracy from 94% to 87% due to minor semantic inconsistencies. This trade-off underscores a key challenge in synthetic medical text generation: balancing linguistic fluency with medical information retention [24].
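The phase-specific prompting described above can be sketched as follows; the template wording, phase names, and case fields are hypothetical and are not the templates used by Moser et al., but they illustrate how a four-stage dialogue can be driven by task-specific prompts.

PHASES = {
    "initial_interaction": "Write an ambulance crew's first exchange with a patient presenting with {complaint}.",
    "triage": "Continue the dialogue: the crew performs triage and records vital signs ({vitals}).",
    "medication": "Continue the dialogue: the crew administers {medication} according to protocol.",
    "hospital_arrival": "Conclude the dialogue with a structured handover to the emergency department.",
}

def build_prompts(case):
    # Fill each phase template with case-specific details to obtain a four-stage dialogue.
    return [template.format(**case) for template in PHASES.values()]

case = {"complaint": "chest pain", "vitals": "BP 150/90, HR 110", "medication": "aspirin 300 mg"}
for prompt in build_prompts(case):
    print(prompt)  # each prompt would be sent to the generation model in sequence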
Latif et al. (2024) assess LLM-based text augmentation for improving dataset diversity and robustness in low-resource clinical NLP tasks. Their study applies ChatGPT, BART, and T5 to expand the Clinical Health-Aware Reasoning across Dimensions (CHARDAT) dataset, which includes structured annotations for treatment, risk factors, and prevention categories. The ChatGPT-based augmentation method generates paraphrases of training instances to increase dataset coverage while maintaining medical relevance. ChatGPT-based augmentation outperforms both back-translation and traditional augmentation techniques such as Easy Data Augmentation (EDA) and Automated Easy Data Augmentation (AEDA). Benchmarking results show that datasets augmented using BART achieve ROUGE-1 (52.35), ROUGE-2 (41.59), and ROUGE-L (50.71), demonstrating the effectiveness of this strategy in improving downstream NLP model performance [30]. Prompt engineering has also been applied to generate synthetic patient communication messages in healthcare portals, expanding the applicability of LLM-based augmentation techniques to real-world clinical interactions [35].
Synthetic dataset generation has primarily focused on English-language medical corpora, but recent studies have explored its use in low-resource medical NLP tasks. Frei and Kramer (2023) propose an LLM-based approach for generating annotated medical text in German clinical NLP. Their method uses GPT-NeoX (20B), a decoder-only Transformer model, to create structured medical text with Named Entity Recognition (NER) labels through few-shot prompting. Unlike fine-tuned models, GPT-NeoX operates in a zero-shot or few-shot setting, generating synthetic datasets without additional gradient-based updates. The study introduces a markup-based prompt design with custom XML-like tags to ensure structured NER annotations. Post-processing and data filtering remove invalid or duplicate entries to refine dataset quality. The resulting synthetic corpus trains GPTNERMED, a domain-specific NER model fine-tuned on BERT-based architectures, including GBERT, GottBERT, and German-MedBERT. Benchmarking results show that models trained on synthetic datasets achieve F1-scores of 0.918 (GBERT), 0.910 (GottBERT), and 0.883 (German-MedBERT), confirming that synthetic annotations can effectively train medical NLP models [31].
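The markup-based prompt design can be illustrated with a small parsing sketch: the tag names and the example sentence below are invented, but the idea of converting custom XML-like tags in generated text into NER annotations follows the approach described.

import re

# Hypothetical generated sentence with inline XML-like entity tags.
generated = "Der Patient erhielt <MEDICATION>Metformin</MEDICATION> bei <DIAGNOSIS>Diabetes mellitus Typ 2</DIAGNOSIS>."

def extract_entities(text):
    # Convert inline tags into (label, surface form, character offsets) over the untagged sentence.
    entities, plain, cursor = [], "", 0
    for match in re.finditer(r"<(\w+)>(.*?)</\1>", text):
        plain += text[cursor:match.start()]
        start = len(plain)
        plain += match.group(2)
        entities.append({"label": match.group(1), "text": match.group(2), "start": start, "end": len(plain)})
        cursor = match.end()
    plain += text[cursor:]
    return plain, entities

sentence, annotations = extract_entities(generated)
print(sentence)
print(annotations)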
Abdel-Khalek et al. (2024) developed SEHRG-DLD, a framework that utilizes ChatGPT to generate synthetic electronic health records (EHRs) with structured patient attributes, including BMI, glucose levels, and blood pressure. The synthetic records were processed using a Deep Belief Network (DBN) classifier, optimized with Harris Hawks Optimization and Golden Jackal Optimization. The model achieved 97% classification accuracy, demonstrating that synthetically generated structured medical records can effectively train disease prediction models while addressing privacy concerns associated with real patient data [29].
Li et al. (2024) introduced LlamaCare, an instruction-tuned model designed for clinical NLP. The model employs self-instruction techniques, where GPT-4 generates diverse reworded instructions to simulate clinical prompts. Fine-tuning is performed using Low-Rank Adaptation (LoRA), allowing Llama 2 to be optimized while reducing computational costs. LlamaCare is benchmarked against Llama 2, PMC-LLaMA, and domain-adapted text generation and classification baselines, achieving ROUGE-L (27.2) and BLEU-4 (18.8) for discharge summary generation and outperforming these baselines. In clinical text classification tasks, including in-hospital mortality prediction, length of stay estimation, and diagnosis classification, LlamaCare improves AUROC scores by 2–5 points. These results demonstrate that instruction fine-tuning and domain-specific adaptation enhance synthetic dataset quality, improving coherence and predictive performance in clinical NLP [32].
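A minimal sketch of Low-Rank Adaptation with the Hugging Face peft library is shown below, assuming access to a Llama 2 checkpoint; the base model name, rank, and target modules are illustrative choices rather than the exact LlamaCare configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                     # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA injects small trainable low-rank matrices into the attention projections,
# so only a small fraction of the parameters is updated during instruction tuning.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()                    # typically well under 1% of the full model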
Tian et al. (2024) introduce ChiMed-GPT, a Chinese medical LLM optimized for synthetic text generation using a multi-stage training framework. The methodology includes pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). The pre-training phase involves continual learning on 369,800 medical encyclopedia documents and 8475 textbook articles, using byte-pair encoding (BPE) and Megatron-LM distributed training for scalability. The SFT phase fine-tunes the model on 1.2 million structured medical QA pairs and dialogues from ChiMed, CMD-SFT, MC (Medical Consultation), and MedDialog, with data filtering and PHI removal applied to preserve clinical accuracy. The RLHF phase applies reward-based preference modeling, with human evaluators ranking responses, followed by rejection sampling fine-tuning to improve factual consistency and domain relevance. Benchmarking results show that ChiMed-GPT outperforms general-purpose LLMs, achieving F1-scores of 40.82–41.04 in NER and multi-choice QA accuracy of 68.29% in C-Eval, 52.92% in CMMLU, and 44.50% in MedQA, approaching the performance of GPT-4 [9].
Furthermore, synthetic dialogue augmentation has been investigated to enhance clinical NLP summarization. Schlegel et al. (2023) introduced PULSAR, a domain-adaptive fine-tuning framework that utilizes synthetic dialogues generated by GPT-3.5 to improve medical summarization models. The framework was evaluated in the MEDIQA-Sum 2023 challenge, where the integration of synthetic training data led to notable improvements in ROUGE and BERTScore metrics, positioning the model among the top-performing submissions [25].
Similarly, Chen et al. (2023) [8] proposed DoPaCos, a synthetic doctor-patient conversation generation pipeline to enhance pre-training for clinical note summarization models. Their approach fine-tuned DialoGPT on 80,000 real doctor-patient conversations to generate 160,000 synthetic dialogues, which were then used to pre-train a Longformer–Encoder–Decoder (LED) model. Pre-training with synthetic dialogues improved summarization performance, but models trained solely on real interactions performed better, highlighting the need for contextual grounding in clinical text generation [8].
Zhang et al. (2025) investigated data-to-text generation and task-specific optimization by fine-tuning Qwen-7B-Chat for medical data-to-text generation (D2T) and medical text summarization (MTS). The D2T model converted structured clinical data into history of present illness (HPI) narratives, while the MTS model transformed progress notes into discharge summaries. They introduced a Faithfulness Score to measure semantic alignment between synthetic and human-generated medical narratives. Their results showed a 19.72% improvement in D2T and a 19.33% improvement in MTS over baseline models, demonstrating that structured fine-tuning improves accuracy and reliability in LLM-generated medical narratives [10].
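The exact formulation of the Faithfulness Score is not reproduced here; as a rough proxy for the semantic alignment it measures, one can compute embedding similarity between generated and reference narratives, as in the hedged sketch below using the sentence-transformers library and an illustrative general-purpose encoder.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative general-purpose encoder

def faithfulness_proxy(generated: str, reference: str) -> float:
    # Cosine similarity between sentence embeddings as a coarse semantic-alignment signal;
    # the Faithfulness Score proposed by Zhang et al. may be defined differently.
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

generated = "Patient reports intermittent chest pain radiating to the left arm for two days."
reference = "History of present illness: two days of intermittent chest pain with radiation to the left arm."
print(round(faithfulness_proxy(generated, reference), 3))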
In psychiatric applications, Wu et al. (2024) [34] developed CALLM (Clinical Interview Data Augmentation via LLM), a framework designed to generate synthetic transcripts for PTSD diagnosis. CALLM simulated patient–doctor interactions using a role-playing paradigm. Textbook–Assignment–Application (T-A-A) dataset partitioning was employed to categorize synthetic dialogues into deterministic, semi-structured, and real-world samples. Additionally, Response–Reason prompting guided LLMs in generating clinically relevant psychiatric transcripts. The model achieved a balanced accuracy of 77%, an F1-score of 0.70, and an AUC of 0.78, outperforming traditional data augmentation techniques. Synthetic data generation has also been explored in mental health for tasks such as suicidal ideation detection, highlighting both the potential and the ethical considerations of using LLMs for sensitive clinical narratives [36].
Beyond LLM-based generation, multi-agent architectures have been proposed to enhance linguistic and domain accuracy in multilingual synthetic medical text. Almutairi et al. (2024) [26] introduced a multi-agent synthetic dialogue generation system in which dedicated generation, refinement, and evaluation agents were tasked with optimizing Arabic medical dialogues for dialect-specific accuracy. This approach resulted in a BERTScore of 0.834 and was rated 90% acceptable in terms of medical realism and engagement by human evaluators.
Ensuring data security and patient confidentiality in synthetic medical text has been a critical area of research. Lund et al. (2024) [27] developed a GPT-4-based synthetic dataset containing 1200 de-identified Norwegian clinical discharge summaries with annotated Protected Health Information (PHI). Their study evaluated automated de-identification methods and found that GPT-4-based annotations achieved an F1-score of 0.983, closely aligning with human-labeled de-identification benchmarks. However, the model exhibited limitations in detecting complex PHI patterns, with rule-based NER methods outperforming GPT-4 in multi-class Named Entity Recognition tasks.
Meanwhile, Zecevic et al. (2024) [28] introduced a differentially private synthetic text generation framework, fine-tuning BioGPT on 90,000 de-identified endoscopy reports while incorporating differential privacy (DP) mechanisms. Their approach generated 10,000 synthetic reports, ensuring that individual patient records were not replicated while maintaining linguistic coherence and medical relevance. Their evaluation framework assessed textual similarity, privacy protection, and downstream NLP utility, demonstrating that DP-enhanced synthetic datasets improved privacy while slightly reducing text fluency.
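The differential privacy mechanism itself is not detailed above; the hedged sketch below shows generic DP-SGD training with the Opacus library on a toy classifier, illustrating the per-sample gradient clipping and Gaussian noise injection involved (the model, noise multiplier, and clipping norm are illustrative, not the BioGPT configuration used by Zecevic et al.).

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in model: in practice this would be the language model being fine-tuned.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0,   # Gaussian noise added to clipped per-sample gradients
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

criterion = torch.nn.CrossEntropyLoss()
for x, y in loader:         # one private epoch
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
print("epsilon after one epoch:", privacy_engine.get_epsilon(delta=1e-5))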

3.2. Benchmarking Synthetic Medical Text Generation

Benchmarking synthetic medical text generation is essential for evaluating faithfulness, factual consistency, domain adaptation, and privacy compliance. While LLMs demonstrate strong performance in text generation, persistent challenges such as hallucination, domain shifts, and factual inconsistencies require rigorous benchmarking methodologies to ensure synthetic datasets remain clinically viable [10,21,37,38].
An emerging methodology, retrieval-augmented generation (RAG), has shown potential to ground synthetic outputs against external knowledge sources to improve factuality [21]. However, retrieval-based approaches sometimes prioritize factual density at the cost of fluency, resulting in text that appears rigid or formulaic. Therefore, a balance between retrieval accuracy and linguistic naturalness remains an open challenge.
In parallel, privacy-preserving methods, such as differential privacy through foundation model APIs (AUG-PE) [22], have been benchmarked to address data protection requirements in synthetic medical text. These methods show promise in minimizing the risk of re-identification without severely degrading text utility. Finally, the diversity and bias in synthetic training datasets have been increasingly scrutinized. Studies such as Yu et al. (2023) [39] highlight that LLMs trained or fine-tuned on synthetic datasets may introduce subtle biases depending on the attribute distribution of the generated data. Managing diversity without compromising factual alignment is critical for ensuring equitable clinical NLP systems.
Table 2 presents a comparative analysis of benchmarking methodologies across the reviewed studies, summarizing key benchmarking techniques, the models evaluated, the corresponding evaluation metrics, and the main findings of each study. The benchmarking methodologies focus on three core areas: retrieval-augmented evaluation, knowledge-infused benchmarking, and privacy-preserving assessments.
Ensuring that synthetic medical text generalizes to real-world clinical datasets remains a persistent challenge. Serbetçi et al. (2023) [37] evaluated GPTNERMED, a synthetic German medical text dataset designed for Named Entity Recognition (NER), and found that models trained on synthetic data exhibited poor generalization when tested on real-world datasets such as BRONCO150 and CARDIO:DE. They introduced domain adaptation loss functions that quantify how much model performance degrades when transitioning from synthetic to real-world data. Their evaluation included multiple benchmarking metrics. They measured Maximum Mean Discrepancy (MMD), a statistical distance that quantifies the distributional difference between synthetic and real-world text embeddings, and Domain Adversarial Training (DAT) Loss, which evaluates the discrepancy between the synthetic (source) and real-world (target) distributions using adversarial domain adaptation techniques. They also introduced an Entity Consistency Score, which determines whether synthetically generated medical entities align with real-world clinical entity distributions. Their findings showed that models trained only on synthetic datasets exhibited high MMD values and increased DAT loss, indicating significant distributional differences and reinforcing concerns that synthetic datasets often fail to capture the variability of real clinical narratives. A potential improvement is adversarial fine-tuning, in which synthetic medical text is iteratively refined against real-world validation sets to minimize distributional shifts and enhance generalization.
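For readers less familiar with MMD, the sketch below computes a Gaussian-kernel Maximum Mean Discrepancy between two sets of embeddings; the random vectors stand in for synthetic and real clinical-text embeddings, and the kernel bandwidth is an illustrative choice.

import numpy as np

def gaussian_mmd(X, Y, sigma=1.0):
    # Biased MMD^2 estimate between samples X and Y using an RBF (Gaussian) kernel.
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(200, 64))   # stand-in for synthetic-text embeddings
real = rng.normal(0.5, 1.0, size=(200, 64))        # stand-in for real clinical-text embeddings
print("MMD^2:", round(gaussian_mmd(synthetic, real), 4))  # larger values indicate greater domain mismatch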
Li et al. (2024) evaluate the discharge summary generation from LlamaCare, benchmarking it against PMC-LLaMA and Llama 2 using BLEU-4, ROUGE-L, and AUROC metrics. In clinical text classification tasks, including in-hospital mortality prediction, length-of-stay estimation, diagnosis classification, and procedures prediction, LlamaCare outperforms baseline models with AUROC improvements ranging from 2 to 5 points. These results demonstrate that instruction fine-tuning with domain-specific supervision enhances the quality of synthetic medical datasets, improving both linguistic fluency and predictive performance in clinical NLP applications [32].
Tian et al. (2024) benchmarked ChiMed-GPT, a domain-specific Chinese medical LLM, evaluating its performance on Named Entity Recognition (NER) and medical question-answering (QA) tasks. ChiMed-GPT achieves NER F1-scores ranging from 40.82 to 41.04, demonstrating improved entity recognition capabilities over general-purpose LLMs. In multi-choice QA tasks, the model attains 68.29% accuracy on C-Eval, 52.92% on CMMLU, and 44.50% on MedQA, approaching the performance of GPT-4. The study applies a multi-stage training framework, incorporating pre-training on 369,800 medical encyclopedia documents, supervised fine-tuning (SFT) on 1.2 million structured QA pairs, and reinforcement learning from human feedback (RLHF) with rejection sampling fine-tuning, optimizing factual alignment and domain specificity. These results highlight the impact of structured fine-tuning and human-in-the-loop training in improving factual consistency, reinforcing the effectiveness of domain-adapted LLMs for synthetic medical text generation in specialized clinical settings [9].
Retrieval-augmented generation (RAG) has been explored to improve the factual correctness of synthetic medical text by integrating external structured knowledge sources before text generation. Since LLMs generate synthetic datasets without real-time access to clinical databases, retrieval-based techniques have been introduced to ground synthetic content in verified knowledge sources and reduce hallucination risks.
Guo et al. (2024) [21] developed Retrieval-Augmented Lay Language (RALL) Generation. This framework enhances the creation of LLM-based synthetic datasets by retrieving external sources such as Wikipedia and Unified Medical Language System (UMLS) before generating medical summaries. This approach strengthens factual grounding in synthetic datasets by ensuring alignment with external clinical references, addressing data sparsity and factual inconsistencies in generated medical text. The researchers introduced retrieval ranking precision metrics, which assess the model’s ability to retrieve factually relevant documents before generating synthetic text. Their study applied several evaluation techniques. They measured precision at rank one, determining the percentage of cases where the highest-ranked retrieved document was factually relevant before being used for text generation. Moreover, they calculated recall within the top five retrieved results, evaluating how many relevant references appeared in the first five retrieved documents. This ensured that retrieval-based knowledge provides sufficient grounding for synthetic dataset creation. Their study demonstrated that RALL improved recall within the top five retrieved documents by 18% and mean reciprocal rank by 12%, compared to baseline LLM-generated summaries, indicating that retrieval augmentation strengthens factual precision in synthetic medical datasets. However, retrieval-based models exhibited fluency challenges, often generating incoherent, overly factual text prioritizing correctness over readability. This trade-off highlights the need for hybrid benchmarking frameworks that integrate retrieval ranking with coherence assessment metrics, ensuring that synthetic datasets maintain factual grounding while preserving linguistic fluency.
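To make these retrieval metrics concrete, the sketch below computes precision at rank one, recall within the top five, and mean reciprocal rank over a handful of toy queries; the document IDs and relevance judgments are invented for illustration.

def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    # ranked_ids: document IDs in retrieval order; relevant_ids: set of gold-relevant IDs.
    p_at_1 = 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0
    recall_at_k = len(set(ranked_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)
    rr = next((1.0 / (i + 1) for i, d in enumerate(ranked_ids) if d in relevant_ids), 0.0)
    return p_at_1, recall_at_k, rr

# Toy example: three queries with their retrieved rankings and gold-relevant documents.
runs = [
    (["d3", "d7", "d1", "d9", "d2"], {"d3", "d9"}),
    (["d5", "d8", "d4", "d6", "d0"], {"d4"}),
    (["d2", "d1", "d3", "d7", "d5"], {"d6"}),
]
scores = [retrieval_metrics(ranking, relevant) for ranking, relevant in runs]
p1, r5, mrr = (sum(col) / len(scores) for col in zip(*scores))
print(f"P@1={p1:.2f}  Recall@5={r5:.2f}  MRR={mrr:.2f}")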
Latif et al. (2024) evaluate the impact of LLM-based text augmentation on synthetic medical dataset diversity and robustness, comparing ChatGPT-generated rephrasings to traditional augmentation techniques such as Easy Data Augmentation (EDA) and Automated Easy Data Augmentation (AEDA). Their benchmarking results show that datasets augmented with ChatGPT improve clinical NLP model performance, with BART achieving ROUGE-1 (52.35), ROUGE-2 (41.59), and ROUGE-L (50.71). The study also analyzes the effect of different prompting strategies, finding that zero-shot prompting produces higher-quality augmented text than few-shot and one-shot approaches. These findings demonstrate that LLM-driven text augmentation enhances dataset coverage while maintaining medical relevance, improving synthetic dataset utility for clinical NLP applications [30].
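Since ROUGE variants are the main quantitative yardstick in several of these studies, the short example below shows how such scores are typically obtained with the rouge_score package; the reference and candidate sentences are invented.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The patient was started on metformin for newly diagnosed type 2 diabetes."
candidate = "Metformin was initiated for the patient's newly diagnosed type 2 diabetes."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # Each result holds precision, recall, and F-measure for that ROUGE variant.
    print(f"{name}: F1={result.fmeasure:.3f}")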
Frei et al. (2023) evaluate the effectiveness of LLM-generated synthetic medical datasets for training domain-specific NLP models, specifically in Named Entity Recognition (NER). Their study benchmarks NER models trained on GPT-NeoX (20B)-generated synthetic text against real-world datasets, demonstrating that synthetic data can serve as viable training material for medical NLP tasks. The models trained on synthetic data achieved high F1-scores of 0.918 (GBERT), 0.910 (GottBERT), and 0.883 (German-MedBERT), indicating that LLM-generated annotated datasets can closely approximate the performance of real clinical corpora. However, their findings highlight bias in the generated text, with an over-representation of standard medical terms and limited coverage of rare clinical entities, requiring additional data filtering and cleaning to improve dataset quality. These results underscore the potential of synthetic medical dataset generation for low-resource languages, demonstrating that LLMs can be effectively leveraged for privacy-preserving and scalable medical text annotation [31].
Beyond retrieval-based methods, knowledge-infused prompting has been introduced to improve factual grounding in synthetic datasets. Zafar et al. (2024) [40] proposed KI-MAG (Knowledge-Infused Medical Abstractive Generator), a framework that embeds medical knowledge graphs (KGs) into synthetic QA datasets to improve semantic correctness and clinical consistency. Unlike RAG, which retrieves external information before generating synthetic text, KI-MAG pre-integrates domain-specific medical knowledge into the synthetic dataset creation process. The model generated BioASQ-SYN, a synthetic QA dataset created using knowledge-infused prompting, effectively addressing data sparsity issues in biomedical NLP. The KI-MAG framework was evaluated using BLEU-based text similarity metrics to benchmark the effectiveness of knowledge infusion in synthetic dataset generation. The model demonstrated a 15% improvement across BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores compared to traditional abstractive QA models, confirming that embedding structured medical knowledge reduces hallucination rates in synthetic medical text. Human evaluation confirmed that while ChatGPT exhibited greater fluency, KI-MAG significantly outperformed general-purpose LLMs in the relevance and factual correctness of medical entities. These findings reinforce the importance of knowledge-grounded benchmarking in synthetic medical text generation, demonstrating that pre-integrated domain expertise enhances factual alignment and dataset reliability.
Both retrieval-augmented generation and knowledge-infused prompting represent complementary approaches to ensuring that synthetically generated medical datasets maintain factual consistency. While RAG dynamically integrates external knowledge sources, knowledge-infused prompting embeds structured medical expertise within the dataset, reducing reliance on real-time retrieval. Benchmarking studies indicate that hybrid synthetic dataset generation frameworks combining RAG and knowledge infusion may offer a more robust solution to hallucination mitigation and domain-specific consistency in synthetic clinical NLP applications.
Fine-tuning synthetic text generation models for clinical accuracy and coherence requires rigorous optimization techniques. Zhang et al. (2025) [10] examined faithfulness in synthetic medical narratives and proposed a structured fine-tuning approach incorporating a Faithfulness Score metric. Their study utilized multiple fine-tuning techniques to benchmark model performance. They applied Contrastive Learning Fine-Tuning, which optimizes embeddings by ensuring that factually correct sequences are more similar to reference EHR narratives while penalizing hallucinated outputs. They also implemented Gradient-Based Adversarial Fine-Tuning (GAFT), which applies small perturbations to generated text embeddings and fine-tunes the model to correct discrepancies, effectively reducing hallucination rates. In addition, they employed Curriculum Learning Optimization, introducing synthetic text with increasing factual complexity to help the model gradually improve its ability to generate long-form structured medical narratives without introducing contradictions. Their study demonstrated that gradient-based adversarial fine-tuning reduced hallucination rates by 23%, while curriculum learning significantly improved faithfulness metrics in long-form synthetic discharge summaries.
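A hedged sketch of the contrastive objective described above is given below: embeddings of generated narratives are pulled toward the matching reference EHR embeddings and pushed away from non-matching ones using an InfoNCE-style loss. The temperature, batch size, and embedding dimension are illustrative, and the exact loss used by Zhang et al. may differ.

import torch
import torch.nn.functional as F

def contrastive_loss(gen_emb, ref_emb, temperature=0.07):
    # gen_emb, ref_emb: (batch, dim) embeddings of generated and reference narratives,
    # where row i of each matrix corresponds to the same clinical case.
    gen = F.normalize(gen_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    logits = gen @ ref.T / temperature            # similarity of every generation to every reference
    targets = torch.arange(gen.size(0))           # the matching reference is the positive pair
    return F.cross_entropy(logits, targets)

gen = torch.randn(8, 256, requires_grad=True)     # stand-in generator embeddings
ref = torch.randn(8, 256)                         # stand-in reference EHR embeddings
loss = contrastive_loss(gen, ref)
loss.backward()                                   # gradients would update the generator's encoder
print(float(loss))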
Xu et al. (2024) [38] also integrated domain-specific knowledge into fine-tuning through CLINGEN, optimizing LLM-generated synthetic text for clinical decision support applications. Their evaluation showed a 7.7% to 8.7% improvement across domain-specific biomedical NLP tasks, reinforcing the effectiveness of structured fine-tuning techniques in medical text generation benchmarking.
Serbetçi et al. (2023) assess the generalization capabilities of GPTNERMED, a synthetic German medical text dataset for Named Entity Recognition (NER). Benchmarking against real-world datasets (BRONCO150, CARDIO:DE), their study finds that synthetic-trained models exhibit domain adaptation challenges, reflected in high Maximum Mean Discrepancy (MMD) values and increased Domain Adversarial Training (DAT) Loss. Additionally, the Entity Consistency Score reveals that synthetic datasets fail to capture rare clinical entities, underscoring the need for hybrid training approaches that integrate real-world validation data [37].
Chen et al. (2023) evaluate GatorTronGPT, a domain-specific LLM trained on 82 billion words of de-identified clinical text, assessing its realism through a Turing-style test with physicians. Medical professionals were asked to differentiate between synthetic and real clinical notes, and the results showed no significant distinction, confirming that GatorTronGPT produces text with high linguistic fluency and clinical coherence. However, the study highlights that while fluency is strong, subtle factual inconsistencies may arise in complex cases, reinforcing the need for physician-in-the-loop validation in synthetic dataset generation [8].
Ensuring privacy compliance in synthetic medical text generation is critical for maintaining regulatory standards such as HIPAA and GDPR. Lund et al. (2024) [27] developed a GPT-4-based synthetic dataset containing 1200 de-identified Norwegian clinical discharge summaries annotated with Protected Health Information (PHI). Their study evaluated automated PHI de-identification methods using precision-recall metrics to benchmark privacy performance. Their findings showed that GPT-4-based annotations achieved an F1-score of 0.983, aligning closely with human-labeled de-identification benchmarks. However, their study also revealed that GPT-4 struggled with complex PHI pattern recognition, with rule-based Named Entity Recognition (NER) methods outperforming LLM-based de-identification in multi-class entity detection tasks. Beyond PHI removal, differential privacy (DP) techniques have been benchmarked to assess privacy-preserving synthetic dataset generation.
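The precision-recall benchmarking of PHI de-identification can be illustrated with a small span-matching sketch; the gold and predicted spans below are invented, and real evaluations typically also report per-category and partial-match results.

def phi_metrics(gold_spans, predicted_spans):
    # Spans are (start, end, label) tuples; exact-match evaluation of PHI detection.
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 10, "NAME"), (25, 35, "DATE"), (50, 61, "PHONE")]
pred = [(0, 10, "NAME"), (25, 35, "DATE")]           # misses the phone number
print("P=%.2f R=%.2f F1=%.2f" % phi_metrics(gold, pred))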
Xie et al. (2024) evaluate AUG-PE, an API-based privacy-preserving synthetic text generation framework that removes the need for differential privacy (DP)-specific fine-tuning while maintaining data utility. Unlike DP-SGD-based fine-tuning, which modifies model weights and injects Gaussian noise, AUG-PE applies post-processing privacy masking, selectively removing sensitive PHI patterns before dataset deployment. Benchmarking results show that AUG-PE improves privacy–utility trade-offs, ensuring synthetic datasets remain informative while mitigating re-identification risks [22].

4. Discussion

Advances in Large Language Models (LLMs), including GPT-4 and BioGPT, have improved the scalability and fluency of synthetic medical text generation. However, results show a trade-off between coherence and factual accuracy. LLMs often generate hallucinations, introducing medical details not present in source inputs [10]. Retrieval-augmented generation (RAG) [21] and knowledge-infused prompting [40] help mitigate factual inconsistencies by incorporating structured knowledge sources. However, integrating domain-specific medical frameworks, such as SOAP or Calgary–Cambridge, remains critical for preserving clinical coherence and structured reasoning [24,25]. Models trained with structured consultation formats better adhere to real-world medical questioning sequences, reinforcing the need for domain-aware constraints in synthetic dataset generation [7,10,35,38].
The role of structured medical frameworks in improving LLM-generated synthetic medical text is summarized in Table 3. This table highlights the key challenges associated with unstructured synthetic datasets, including faithfulness, hallucinations, and generalization issues. It outlines how embedding consultation models such as SOAP, SBAR, and Calgary–Cambridge into training pipelines can enhance medical text reliability, enforce logical questioning sequences, and improve dataset usability for clinical applications.

4.1. Applications of LLMs in Clinical NLP

The findings suggest hybrid LLM architectures combining retrieval grounding, structured fine-tuning, and domain adaptation can enhance factual consistency and medical relevance. These methods apply to clinical documentation, decision support, and healthcare automation, particularly in settings where patient privacy regulations limit access to real-world clinical datasets. While LLM-based synthetic EHR generation has demonstrated feasibility [35], privacy concerns persist due to the absence of built-in differential privacy mechanisms. Alternative approaches, such as API-based privacy-preserving models (e.g., AUG-PE) [22], provide a viable strategy to balance data utility with privacy requirements.
Beyond privacy, benchmarking methodologies must evolve to assess LLM-generated medical text using domain-specific evaluation metrics. Current assessments often rely on linguistic metrics such as ROUGE and BLEU, which fail to capture medical correctness [10,41]. Integrating fact-verification mechanisms, retrieval-grounded evaluation, and physician-in-the-loop assessments could improve dataset validation and ensure synthetic medical text aligns with clinical documentation standards.
The implications of LLM-generated synthetic datasets in clinical settings are outlined in Table 4. This table categorizes key applications, such as dialogue generation, EHR structuring, and medical summarization, while addressing challenges of factual consistency, privacy, and faithfulness.
A key advantage of LLM-generated synthetic datasets is their ability to increase data availability while maintaining patient confidentiality, supporting the development of NLP models for tasks such as clinical summarization and Named Entity Recognition [27,28]. Techniques such as contrastive learning and Gradient-Based Adversarial Fine-Tuning (GAFT) have effectively reduced hallucination rates and improved dataset faithfulness [10,38].
However, retrieval-augmented generation methods often constrain output flexibility, leading to rigid text structures that may not align with natural patient–provider interactions [40]. Furthermore, the absence of structured medical frameworks in training pipelines frequently results in synthetic datasets that lack realistic clinical progression, reducing their applicability for real-world deployment [24,25].

4.2. Limitations in Medical LLMs

The results indicate that retrieval-augmented generation (RAG) and knowledge-infused prompting (KI-MAG) do not always yield the expected improvements in factual accuracy. In some studies, integrating structured knowledge sources constrained LLMs to predetermined factual outputs, reducing the variability necessary for natural clinical dialogue generation. This trade-off was particularly evident in specialized medical domains where rigid knowledge bases did not account for nuanced variations in treatment decisions and diagnostic reasoning.
Structured medical consultation models like SOAP help improve coherence but do not eliminate hallucinations. This suggests that relying solely on structured frameworks is not enough to prevent factual inconsistencies. A more effective approach would be hybrid benchmarking strategies that combine retrieval-based fact-checking with domain-specific fine-tuning.
Although this review followed the PRISMA guidelines to ensure methodological transparency, several limitations in the review process should be noted. The screening and data extraction were conducted by a single reviewer, which may introduce subjective bias despite adherence to structured protocols. In addition, the literature search was restricted to the Scopus database, possibly omitting relevant studies indexed elsewhere. Finally, this review was not pre-registered before the search was conducted; the protocol was registered with the OSF only at a later stage, limiting the traceability of methodological decisions over time. These limitations may have influenced the comprehensiveness of the included evidence and should be addressed in future systematic reviews through prospective protocol registration, multi-database searches, and multi-reviewer validation workflows.

5. Conclusions

This research highlights the potential of LLM-based synthetic data in expanding training datasets for clinical NLP, improving data accessibility, and supporting privacy-preserving AI applications in healthcare. However, maintaining factual consistency and complying with regulatory standards remain challenges. While differential privacy techniques and automated PHI de-identification offer solutions, balancing data utility with privacy requirements remains a limitation. Existing benchmarking methods rely on basic linguistic metrics, emphasizing the need for more robust domain-specific validation, including adversarial training loss functions and physician-in-the-loop assessments.
We examined recent developments in synthetic medical dataset generation with LLMs, evaluating models such as ChatGPT, BioGPT, ChiMed-GPT, and LlamaCare, which improve data scalability and fluency. However, key challenges persist, including hallucination risks, domain adaptation gaps, and the partial integration of structured frameworks, which impact the reliability of synthetic clinical text [9,20,33].
One of the primary applications of this work is in enhancing AI-based medical scribe systems for automated clinical documentation using ASR and NLP. LLMs can improve ASR transcription robustness, optimize NLP-based summarization models, and enhance structured medical documentation workflows by leveraging synthetic datasets for physician–patient dialogue generation. Generating high-fidelity, privacy-preserving synthetic conversations can help bridge data scarcity gaps, ensuring multilingual adaptability and improved domain-specific performance in clinical environments.
The findings from this study indicate that hybrid approaches, combining retrieval-augmented generation, domain-specific fine-tuning, and privacy-preserving mechanisms, enhance factual accuracy while maintaining linguistic coherence. However, benchmarking inconsistencies persist, as commonly used metrics such as ROUGE and BLEU fail to capture medical accuracy and clinical relevance. Ensuring domain-specific validation and integrating structured consultation frameworks remain critical in achieving reliable synthetic datasets for clinical NLP.
Future advancements in LLM-driven synthetic data generation will be essential for developing AI-based medical scribe systems capable of accurately transcribing, summarizing, and documenting physician–patient interactions while ensuring privacy compliance and clinical reliability. This study addresses key questions on integrating synthetic data into clinical NLP systems, showing that while LLM-based solutions improve scalability and domain adaptability, structured frameworks and refined evaluation methodologies remain crucial for ensuring clinical accuracy and usability.
Our future research should incorporate structured medical consultation models such as SOAP, SBAR, and the Calgary–Cambridge framework into LLM training, improving the coherence and usability of synthetic datasets for real applications. Empirical studies that assess framework-guided synthetic data in clinical dialogue summarization, ASR enhancement, and Named Entity Recognition will provide deeper insights into how LLMs can optimize downstream tasks in AI-based medical scribe systems. Refining benchmarking methods is also essential for factual grounding, domain adaptation, and privacy compliance. Future approaches should develop hybrid evaluation techniques integrating retrieval augmentation with knowledge-infused prompting, ensuring that synthetic medical text maintains domain accuracy and real-world applicability. Moreover, expanding multilingual synthetic datasets remains a priority to enhance LLM adaptability for non-English healthcare systems, which is particularly relevant for developing medical scribe solutions in underrepresented languages like European Portuguese. This guidance aligns with the objectives of this study, which sought to identify advancements and limitations in LLM-driven synthetic dataset generation for biomedical NLP.

Author Contributions

Conceptualization, L.M., L.M.G. and J.M.M.; methodology, L.M.; formal analysis, L.M.; investigation, L.M.; resources, L.M.; writing—original draft preparation, L.M.; writing—review and editing, L.M., L.M.G. and J.M.M.; supervision, L.M.G. and J.M.M.; funding acquisition, J.M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by FCT–Fundação para a Ciência e Tecnologia within the R&D Unit Project Scope UID/00319/Centro ALGORITMI (ALGORITMI/UM).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chlap, P.; Bagheri, A.P.; Owens, S.P.; Dagan, D.M.; Nguyen, Q.; Drummond, T. A survey of synthetic data generation methods for medical imaging. Comput. Med. Imaging Graph. 2021, 94, 101997. [Google Scholar] [CrossRef]
  2. Goncalves, A.; Ray, P.; Soper, B.; Stevens, J.; Coyle, L.; Sales, A.P. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. 2020, 20, 108. [Google Scholar] [CrossRef]
  3. Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid Training Data Creation with Weak Supervision. Proc. Vldb Endow. 2020, 13, 1971–1983. [Google Scholar] [CrossRef]
  4. Gilardi, F.; Kühner, M.; Grass, L. Large Language Models for Synthetic Dataset Generation: A Comparative Analysis. J. Artif. Intell. Res. 2023, 67, 145–162. [Google Scholar]
  5. Sun, Y.; Yang, L.; Tang, J. Evaluating LLM-Generated Synthetic Datasets for NLP: Challenges and Opportunities. Trans. Assoc. Comput. Linguist. 2023, 11, 202–219. [Google Scholar] [CrossRef]
  6. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P. Language Models are Few-Shot Learners: Applications in Synthetic Dataset Generation. Adv. Neural Inf. Process. Syst. (Neurips) 2023, 34, 1877–1890. [Google Scholar] [CrossRef]
  7. Liu, Y.; Ju, S.; Wang, J. Exploring the Potential of ChatGPT in Medical Dialogue Summarization: A Study on Consistency with Human Preferences. BMC Med. Inform. Decis. Mak. 2024, 24, 75. [Google Scholar] [CrossRef]
  8. Chen, Q.; Sun, H.; Liu, H.; Jiang, Y.; Ran, T.; Jin, X.; Xiao, X.; Lin, Z.; Chen, H.; Niu, Z. Benchmarking ChatGPT for Biomedical Text Generation. Bioinformatics 2023, 39, btad557. [Google Scholar] [CrossRef]
  9. Tian, Y.; Gan, R.; Song, Y.; Zhang, J.; Zhang, Y. CHIMED-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences. arXiv 2024, arXiv:2311.06025. [Google Scholar]
  10. Zhang, X.; Zhao, G.; Ren, Y.; Wang, W.; Cai, W.; Zhao, Y.; Liu, J. Data-Augmented Large Language Models for Medical Record Generation. Appl. Intell. 2025, 55, 88. [Google Scholar] [CrossRef]
  11. Zhu, Z.; Liu, S.; Zhang, R. Examining the Persuasive Effects of Health Communication in Short Videos: Systematic Review. J. Med. Internet Res. 2023, 25, e48508. [Google Scholar] [CrossRef] [PubMed]
  12. Johnson, R.; Madden, S.; Cafarella, M. DataSynth: Generating Synthetic Data Using Declarative Constraints. Proc. VLDB Endow. 2022, 13, 2071–2083. [Google Scholar] [CrossRef]
  13. Huang, H.; Wang, Y.; Tang, J. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. J. Biomed. Inform. 2022, 127, 103994. [Google Scholar] [CrossRef]
  14. Weed, L. Medical Records, Medical Education, and Patient Care: The Problem-Oriented Record as a Basic Tool. J. Am. Med. Assoc. 2004, 292, 1066–1070. [Google Scholar]
  15. Silverman, J.; Kurtz, S.; Draper, J. The Calgary-Cambridge Guide to the Medical Interview: Communication for Clinical Practice. Med. Educ. 2008, 42, 673–679. [Google Scholar] [CrossRef]
  16. Montenegro, L.; Gomes, L.M.; Machado, J.M. AI-Based Medical Scribe to Support Clinical Consultations: A Proposed System Architecture. In EPIA Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2023; pp. 1–12. [Google Scholar] [CrossRef]
  17. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; The PRISMA Group. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. 2009, 6, e1000097. [Google Scholar] [CrossRef]
  18. Falagas, M.E.; Pitsouni, E.I.; Malietzis, G.A.; Pappas, G. Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and weaknesses. FASEB J. 2008, 22, 338–342. [Google Scholar] [CrossRef]
  19. Goyal, M.; Mahmoud, Q. Synthetic Data Generation: Trends, Challenges, and Future Directions. Expert Syst. Appl. 2024, 229, 120462. [Google Scholar] [CrossRef]
  20. Ghebrehiwet, I.; Zaki, N.; Damseh, R.; Mohamad, M.S. Revolutionizing personalized medicine with generative AI: A systematic review. Artif. Intell. Rev. 2024, 57, 128. [Google Scholar] [CrossRef]
  21. Guo, Y.; Qiu, W.; Leroy, G.; Wang, S.; Cohen, T. Retrieval augmentation of large language models for lay language generation. J. Biomed. Inform. 2024, 149, 104580. [Google Scholar] [CrossRef]
  22. Xie, C.; Lin, Z.; Backurs, A.; Gopi, S.; Yu, D.; Inan, H.; Nori, H.; Jiang, H.; Zhang, H.; Lee, Y.T.; et al. Differentially Private Synthetic Data via Foundation Model APIs. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: San Francisco, CA, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  24. Moser, D.; Bender, M.; Sariyar, M. Generating synthetic healthcare dialogues in emergency medicine using large language models. Stud. Health Technol. Inform. 2024, 321, 235–239. [Google Scholar] [PubMed]
  25. Schlegel, V.; Li, H.; Wu, Y.; Subramanian, A.; Nguyen, T.T.; Kashyap, A.R.; Beck, D.; Zeng, X.; Batista-Navarro, R.T.; Winkler, S.; et al. PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records. arXiv 2023, arXiv:2307.02006. [Google Scholar]
  26. Almutairi, M.; Alghamdi, A.; Alkadi, S. Multi-Agent Large Language Models for Arabic Medical Dialogue Generation: A Culturally Adapted Approach. J. Artif. Intell. Healthc. 2024, 18, 112–134. [Google Scholar]
  27. Lund, J.A.; Burman, J.; Woldaregay, A.Z.; Jenssen, R.; Mikalsen, K. Instruction-Guided Deidentification with Synthetic Test Cases for Norwegian Clinical Text. In Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), Tromsø, Norway, 9–11 January 2024. [Google Scholar]
  28. Zecevic, A.; Haug, C.; Schenk, L.; Zaveri, K.; Heuss, L.T.; Raemy, E.; Tzovara, A.; Oezcan, I.M. Privacy-Preserving Synthetic Text Generation Using Differentially Private Fine-Tuning of BioGPT. Artif. Intell. Med. 2024, 150, 102649. [Google Scholar]
  29. Abdel-Khalek, S.; Algarni, A.D.; Amoudi, G.; Alkhalaf, S.; Alhomayani, F.M.; Kathiresan, S. Leveraging AI-Generated Content for Synthetic Electronic Health Record Generation with Deep Learning-Based Diagnosis Model. IEEE Trans. Consum. Electron. 2024. [CrossRef]
  30. Latif, S.; Kim, Y.B. Augmenting Clinical Text with Large Language Models: A Comparative Study on Synthetic Data Generation for Medical NLP. J. Biomed. Inform. 2024, 145, 104567. [Google Scholar]
  31. Frei, J.; Kramer, F. Annotated Dataset Creation through General Purpose Language Models for Non-English Medical NLP. arXiv 2023, arXiv:2308.14493. [Google Scholar]
  32. Li, R.; Zhang, H.; Zhang, J.; Wu, Y. LlamaCare: Instruction-Tuned Large Language Models for Clinical Text Generation and Prediction. arXiv 2024, arXiv:2401.12345. [Google Scholar]
  33. Tian, S.; Jin, Q.; Yeganova, L.; Lai, P.T.; Zhu, Q.; Chen, X.; Yang, Y.; Chen, Q.; Kim, W.; Comeau, D.C.; et al. Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health. Briefings Bioinform. 2024, 25, bbad493. [Google Scholar] [CrossRef] [PubMed]
  34. Wu, Y.; Mao, K.; Zhang, Y.; Chen, J. CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation with Large Language Models. IEEE J. Biomed. Health Inform. 2024, 28, 7531–7542. [Google Scholar] [CrossRef]
  35. Wang, N.; Lee, H.; Patel, R.; Johnson, M. Taxonomy-Based Prompt Engineering for Synthetic Patient Portal Messages. J. Biomed. Inform. 2024, 160, 104752. [Google Scholar] [CrossRef] [PubMed]
  36. Ghanadian, H.; Nejadgholi, I.; Al Osman, H. Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models. IEEE Access 2024, 12, 3358206. [Google Scholar] [CrossRef]
  37. Serbetçi, G.; Leser, U. Challenges and Solutions in Multilingual Clinical Data Analysis. J. Glob. Health Inform. 2023, 15, 210–222. [Google Scholar]
  38. Xu, R.; Cui, H.; Yu, Y.; Kan, X.; Shi, W.; Zhuang, Y.; Wang, M.D.; Jin, W.; Ho, J.C.; Yang, C. Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 321–339. [Google Scholar]
  39. Yu, Y.; Zhuang, Y.; Zhang, J.; Meng, Y.; Ratner, A.; Krishna, R.; Shen, J.; Zhang, C. Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias. Adv. Neural Inf. Process. Syst. 2023, 36, 55734–55784. [Google Scholar]
  40. Zafar, A.; Sahoo, S.K.; Bhardawaj, H.; Das, A.; Ekbal, A. KI-MAG: A Knowledge-Infused Abstractive Question Answering System in the Medical Domain. Neurocomputing 2024, 571, 127141. [Google Scholar] [CrossRef]
  41. Liu, J.; Zhang, K.; Huang, Y.; Wang, L. Performance Evaluation of LLMs in Biomedical Text Generation. J. Med. Inform. 2024, 48, 102–118. [Google Scholar]
Figure 1. The four phases of article selection following the PRISMA-ScR statement.
Table 1. Comparison of synthetic medical text generation models and performance metrics.

Study | Model Used | Application | Optimization | Evaluation Metrics
Moser et al. (2024) [24] | Zephyr-7b-beta, GPT-4 Turbo | Emergency medicine dialogues | Multi-stage pipeline | Accuracy (94% → 87%)
Latif et al. (2024) [30] | ChatGPT, BART, T5 | Clinical text augmentation | Zero-shot prompting, LLM-based rephrasing | ROUGE-1, ROUGE-2, ROUGE-L
Frei et al. (2023) [31] | GPT-NeoX (20B) | Synthetic NER-annotated text | Few-shot prompting, XML-based entity tagging | F1-score
Abdel-Khalek et al. (2024) [29] | ChatGPT (SEHRG-DLD) | Synthetic EHRs | Deep Belief Network (DBN), HHO, GJO | Classification accuracy (97%)
Li et al. (2024) [32] | LlamaCare (Llama 2-7B) | Discharge summary generation, clinical text classification | Instruction tuning, self-instruction, LoRA | ROUGE-L, BLEU-4, AUROC
Tian et al. (2024) [33] | ChiMed-GPT | Structured dialogue, NER, QA | Pre-training, RLHF, rejection sampling | Accuracy, BLEU-1, ROUGE-L
Schlegel et al. (2023) [25] | GPT-3.5 (PULSAR) | Medical summarization | Domain-specific fine-tuning | ROUGE, BERTScore
Chen et al. (2023) [8] | DialoGPT (DoPaCos) | Doctor–patient conversations | Pre-training on synthetic dialogues | ROUGE, BERTScore
Zhang et al. (2025) [10] | Qwen-7B-Chat | Data-to-text, summarization | Faithfulness Score for fine-tuning | D2T (+19.72%), MTS (+19.33%)
Wu et al. (2024) [34] | CALLM | PTSD transcript generation | T-A-A partitioning, Response–Reason prompting | bACC (77%), F1 (0.70), AUC (0.78)
Almutairi et al. (2024) [26] | Multi-Agent System | Arabic medical dialogues | Multi-agent refinement | BERTScore (0.834), human acceptability (90%)
Lund et al. (2024) [27] | GPT-4 | PHI de-identification | Automated de-identification | F1-score (0.983)
Zecevic et al. (2024) [28] | BioGPT | Privacy-preserving text generation | Differential Privacy (DP) | Textual similarity, privacy protection
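Most studies in Table 1 report surface-overlap metrics such as ROUGE and BLEU. For readers reproducing this kind of benchmark, the minimal sketch below computes ROUGE-1/2/L and smoothed BLEU-4 for one synthetic sentence; the rouge_score and NLTK packages and the example sentences are our assumptions, not the exact implementations used in the cited works.

```python
# Minimal sketch of the surface-overlap metrics reported in Table 1.
# Package choices are illustrative; the reviewed studies may differ.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Patient reports chest pain radiating to the left arm for two hours."
generated = "The patient describes two hours of chest pain spreading to the left arm."

# ROUGE-1, ROUGE-2, and ROUGE-L F1 between a reference and a synthetic sentence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = {name: score.fmeasure for name, score in scorer.score(reference, generated).items()}

# BLEU-4 with smoothing, since short clinical sentences often lack 4-gram overlap.
bleu4 = sentence_bleu(
    [reference.split()],
    generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(rouge, round(bleu4, 3))
```

As the review argues, such overlap scores should be complemented by domain-specific checks, since a fluent sentence can score well while remaining clinically wrong.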
Table 2. Comparative benchmarking metrics for synthetic medical text generation.

Study | Benchmarking Method | Model(s) Evaluated | Evaluation Metric | Key Findings
Serbetçi et al. (2023) [37] | Generalization | GPTNERMED | Maximum Mean Discrepancy (MMD), Domain Adversarial Training Loss (DAT), Entity Consistency Score | Limited generalization to real-world datasets
Li et al. (2024) [32] | Instruction-Tuned LLM Evaluation | LlamaCare (Llama 2-7B) | ROUGE-L, BLEU-4, AUROC | Improved domain adaptation and coherence
Tian et al. (2024) [33] | Domain-Specific Benchmarking | ChiMed-GPT | Multi-choice QA Accuracy, NER F1-score, BLEU | Domain adaptation improved factual consistency
Guo et al. (2024) [21] | Retrieval-Augmented Evaluation | RALL (Wikipedia, UMLS retrieval) | Precision at Rank 1, Recall in Top 5, Mean Reciprocal Rank | Higher factual accuracy, reduced fluency
Latif et al. (2024) [30] | Augmentation Performance | ChatGPT, BART, T5 | ROUGE-1, ROUGE-2, ROUGE-L | BART-augmented datasets outperformed back-translation
Frei et al. (2023) [31] | NER Model Evaluation | GPT-NeoX (20B) | F1-score (GBERT, GottBERT, German-MedBERT) | Synthetic annotations improved NER performance
Zafar et al. (2024) [40] | Knowledge-Infused Benchmarking | KI-MAG | BLEU-1, BLEU-2, BLEU-3, BLEU-4 | Improved factual consistency in synthetic QA datasets
Zhang et al. (2025) [10] | Fine-Tuning and Faithfulness | Qwen-7B-Chat | Faithfulness Score, Contrastive Learning, Gradient-Based Adversarial Fine-Tuning | Reduced hallucination, improved factual consistency
Xu et al. (2024) [38] | Domain-Specific Faithfulness | CLINGEN | Faithfulness Score, Domain-Specific Accuracy | Hallucination reduction in clinical NLP
Chen et al. (2023) [8] | Realism Evaluation | GatorTronGPT | Physicians’ Turing Test | Synthetic text indistinguishable from real data
Lund et al. (2024) [27] | PHI De-Identification | GPT-4 | PHI Removal F1-score | High F1, limitations in complex PHI detection
Xie et al. (2024) [22] | Privacy-Preserving Evaluation | AUG-PE (API-based DP) | Privacy–Utility Trade-off | Improved privacy–utility balance
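The retrieval-augmented evaluation row in Table 2 relies on standard ranking metrics. The following minimal sketch shows how Precision at Rank 1, Recall in Top 5, and per-query Reciprocal Rank can be computed; the toy identifiers are illustrative and are not drawn from the cited study.

```python
# Minimal sketch of the ranking metrics used for retrieval-augmented evaluation
# in Table 2. Mean Reciprocal Rank is the average of reciprocal_rank over all
# evaluation queries; a single query is shown here for brevity.
from typing import List, Set

def precision_at_1(ranked: List[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked retrieved item is relevant, else 0.0."""
    return float(bool(ranked) and ranked[0] in relevant)

def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# One query: documents retrieved to ground a synthetic lay-language summary.
retrieved = ["umls:C0011849", "wiki:Insulin", "umls:C0020538"]
gold = {"wiki:Insulin", "umls:C0011849"}

print(precision_at_1(retrieved, gold), recall_at_k(retrieved, gold), reciprocal_rank(retrieved, gold))
```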
Table 3. Role of structured medical frameworks in enhancing LLM-generated synthetic data.

Challenge | Issue in LLM-Generated Synthetic Text | Solution via Structured Medical Frameworks
Faithfulness and Clinical Relevance | LLM-generated dialogues often lack structured clinical progression and deviate from real-world medical interactions. | Training LLMs with structured models (e.g., Calgary–Cambridge, SOAP) ensures logical medical questioning and progression, improving clinical training and research dataset usability.
Hallucinations in Clinical Conversations | LLMs frequently introduce fabricated symptoms, test results, or diagnoses that were not present in the input data. Lack of structured constraints leads to unpredictable factual inconsistencies. | Embedding structured consultation formats constrains LLM outputs to follow expected medical interactions, reducing the risk of hallucinated symptoms and fabricated patient histories.
Dataset Generalization | Synthetic medical text lacks adaptability across clinical settings, specialties, and languages. Models trained on domain-specific, unstructured synthetic data struggle with real-world clinical tasks. | Structured LLM fine-tuning with consultation frameworks (SOAP, SBAR (Situation, Background, Assessment, and Recommendation), and Calgary–Cambridge) improves dataset standardization and enhances cross-domain generalization for multiple medical specialties.
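One lightweight way to operationalize the constraints described in Table 3 is to verify that generated notes actually contain the expected framework sections. The sketch below checks SOAP section presence and order; it is our illustration of the idea, not a procedure reported in the reviewed studies.

```python
# Minimal sketch of a structural check for framework adherence: verify that a
# generated note contains the four SOAP sections in the expected order.
import re
from typing import List

SOAP_SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def missing_or_misordered_sections(note: str) -> List[str]:
    """Return a list of problems: absent SOAP headers or headers out of order."""
    positions = []
    problems = []
    for header in SOAP_SECTIONS:
        match = re.search(rf"^\s*{header}\s*:", note, flags=re.IGNORECASE | re.MULTILINE)
        if match is None:
            problems.append(f"missing: {header}")
        else:
            positions.append((match.start(), header))
    # Headers that were found must appear in canonical SOAP order.
    found_order = [header for _, header in sorted(positions)]
    expected_order = [header for header in SOAP_SECTIONS if header in found_order]
    if found_order != expected_order:
        problems.append("sections out of order: " + " -> ".join(found_order))
    return problems

note = "Subjective: cough for two days.\nObjective: temperature 38.1 C.\nAssessment: likely viral.\nPlan: rest and fluids."
print(missing_or_misordered_sections(note) or "note follows SOAP structure")
```

A filter of this kind catches only structural drift; factual hallucinations within a correctly structured note still require the grounding and faithfulness checks discussed in Table 2.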
Table 4. Applications and challenges of LLM-generated synthetic medical text.

Category | Applications of LLM-Generated Synthetic Text | Challenges
Synthetic Medical Dialogues | Generates structured doctor–patient conversations, enhancing fluency in clinical interactions. | Often introduces fabricated symptoms and medical details, reducing factual consistency.
Synthetic Medical Dialogues | Fine-tuned to support clinical summarization models. | Lacks structured medical reasoning, failing to follow consultation models like SOAP or Calgary–Cambridge.
Synthetic EHR and Medical Reports | Generates structured EHRs incorporating clinical attributes (e.g., BMI, blood pressure) for training disease prediction models. | Lacks built-in privacy mechanisms, raising concerns about patient data protection.
Synthetic EHR and Medical Reports | Used in clinical decision support systems for data augmentation. | Alternative privacy-preserving models (e.g., AUG-PE) outperform LLMs in privacy–utility balance.
Medical Summarization and Abstraction | Produces fluent medical summaries, improving documentation efficiency. | Performs worse than specialized models (e.g., BART) in factual accuracy (ROUGE-1: −14.94%, ROUGE-2: −53.48%).
Medical Summarization and Abstraction | Automates discharge summary and progress note generation. | Requires retrieval-augmented generation (RAG) and knowledge-infused prompting to improve factual consistency.
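Because Table 4 flags the lack of built-in privacy mechanisms, pipelines that seed synthetic text from real records typically add a de-identification pass. The minimal rule-based sketch below masks a few direct identifiers with typed placeholders; the regular expressions are illustrative assumptions and are far simpler than the GPT-4- and differential-privacy-based approaches surveyed above.

```python
# Minimal rule-based sketch of PHI masking for synthetic-text pipelines.
# The patterns only illustrate the idea of replacing direct identifiers with
# typed placeholders; real de-identification systems are far more involved.
import re

PHI_PATTERNS = {
    "DATE": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "MRN": r"\bMRN[:\s]*\d{6,10}\b",
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.\w{2,}\b",
}

def mask_phi(text: str) -> str:
    """Replace pattern matches with bracketed placeholders such as [DATE]."""
    for label, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text, flags=re.IGNORECASE)
    return text

example = "Seen on 03/14/2024, MRN: 00481572, callback 555-214-9087, jdoe@example.com."
print(mask_phi(example))
```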
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
