Automated Clinical Trial Data Analysis and Report Generation by Integrating Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) Technologies
Abstract
1. Introduction
1.1. Research Background and Motivation
1.2. Research Questions and Objectives
- Scale and complexity of data processing: Randomized controlled trials (RCTs) typically yield millions of mixed structured and unstructured records. These datasets’ sheer volume and heterogeneity far exceed the single-pass processing capacity of conventional Large Language Models (LLMs).
- High reliability and traceability of results: Medical decision-making demands rigorous factual accuracy because any informational error can lead to serious consequences. Building a complete and verifiable chain of evidence is therefore indispensable [6].
- Efficiency gains and cost optimization: Clinical trial teams are constantly under pressure to shorten study cycles while reducing time and human-resource expenditures.
1.3. Expected Contributions and Impact
- Methodological Innovation: We introduce a RAG-LLM workflow capable of supporting multi-site datasets. The proposed framework offers an in-house–deployable implementation example that enables heterogeneous data retrieval and empirically verifies its effect on report accuracy and drafting speed.
- Practical Value: This study provides initial evidence for the feasibility of multimodal RAG-LLM–driven automation in clinical trial reporting. The system allows physicians to retrieve trial evidence in real time based on patient characteristics and to generate treatment recommendations, thereby shortening report delivery cycles [9].
- Theoretical Significance: The work establishes structured evaluation metrics for RAG-LLM applications in the healthcare domain and lays a foundation for subsequent research in clinical data analytics [7].
2. Previous Research
2.1. Challenges in the Traditional Clinical-Trial Reporting Workflow
2.2. Advances and Limitations of Large Language Models in Medical Text Understanding
2.3. Retrieval-Augmented Generation (RAG) and Cross-Modal Applications
2.4. Parameter-Efficient Fine-Tuning and Study Positioning
3. Materials and Methods
3.1. Data Sources and Preprocessing Workflow
3.2. Vector Database Construction
- Data extraction and standardization: An ETL workflow consolidates structured fields from electronic health records (HL7 CDA/FHIR), National Health Insurance billing codes (ICD-10, LOINC), and DICOM imaging reports into unified JSON objects, which are batch-loaded daily into a staging area. Existing HIS tables remain untouched, preserving routine clinical operations.
- Semantic chunking and embedding:
- Text: A small BioClinical-E5 embedding model—bilingually fine-tuned for Traditional Chinese and English—splits discharge summaries and laboratory results into 256-token chunks and converts each into a 768-dimensional vector.
- Images: Only the Findings/Impression sections of radiology reports are extracted, avoiding the storage of large raw image vectors and reducing storage and network overhead.
- Vector indexing—A FAISS Flat + IVF hybrid index accommodates several hundred thousand vectors, keeping query latency at ~50 ms, which is sufficient for real-time retrieval in outpatient and research scenarios.
- Service Integration—
- RESTful API: A /search endpoint accepts natural-language queries and returns the top five summaries with source links.
- Single Sign-On (SSO): Access is gated through the in-house application-server ACL, which enforces role-based privileges and meets regulatory “least-privilege” requirements.
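The chunking and top-five retrieval steps above can be sketched with the standard library alone. This is a minimal illustration, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the BioClinical-E5 model, whitespace tokens stand in for model tokens, `FlatIndex` performs exhaustive search rather than the FAISS Flat + IVF hybrid, and the `doc://` source links are invented placeholders.

```python
import hashlib
import math

CHUNK_TOKENS = 256  # chunk size stated in the text
EMBED_DIM = 768     # vector dimensionality stated in the text


def chunk_text(text, size=CHUNK_TOKENS):
    """Split text into consecutive chunks of at most `size` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def toy_embed(text, dim=EMBED_DIM):
    """Hypothetical stand-in for the embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class FlatIndex:
    """Exhaustive inner-product search over normalized vectors (cosine similarity)."""

    def __init__(self):
        self.entries = []  # tuples of (vector, chunk text, source link)

    def add(self, text, source):
        for chunk in chunk_text(text):
            self.entries.append((toy_embed(chunk), chunk, source))

    def search(self, query, k=5):
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk, source)
            for vec, chunk, source in self.entries
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [
            {"score": score, "summary": chunk[:80], "source": source}
            for score, chunk, source in scored[:k]
        ]


index = FlatIndex()
index.add("chest CT shows a small pulmonary nodule in the right upper lobe", "doc://radiology/1")
index.add("serum creatinine within normal range at discharge", "doc://lab/2")
hits = index.search("pulmonary nodule on CT")
```

An inverted-file (IVF) index would replace the exhaustive loop in `search` with a lookup restricted to the nearest clusters, trading a small amount of recall for lower latency on larger collections.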
3.3. Hierarchical Retrieval-Augmented Generation (RAG)
- Semantic Query Parsing: The system first interprets the user’s raw question and rewrites it as machine-optimized instructions, laying the groundwork for efficient downstream retrieval.
- Hierarchical Retrieval: A multi-tier strategy conducts an initial broad sweep, then incrementally narrows the search space, dramatically boosting retrieval efficiency and accuracy as data granularity increases.
- Evidence Fusion: Textual and imaging artifacts retrieved from disparate sources are merged, ranked, and weighted according to their relevance and importance to the original query, yielding a logically coherent and well-structured context window.
- LLM-Based Generation: The fused context is passed to a purpose-tuned LLM [23], which produces the final, highly accurate response—complete with trial summaries, risk analyses, and decision recommendations.
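A coarse-to-fine retrieval pass followed by evidence fusion might look like the sketch below. Everything in it is an assumption made for illustration: the term-overlap relevance score, the two-tier split (document summaries, then chunks), the modality weights, and the toy documents. The text does not specify how the tiers or weights are actually defined.

```python
def overlap(query, text):
    """Toy relevance score: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def hierarchical_retrieve(query, documents, top_docs=2, top_chunks=3):
    """Tier 1 ranks whole documents by summary; tier 2 ranks chunks of the survivors."""
    ranked_docs = sorted(
        documents, key=lambda d: overlap(query, d["summary"]), reverse=True
    )
    candidates = []
    for doc in ranked_docs[:top_docs]:
        for chunk in doc["chunks"]:
            candidates.append({
                "text": chunk,
                "modality": doc["modality"],
                "source": doc["id"],
                "score": overlap(query, chunk),
            })
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:top_chunks]


def fuse_evidence(candidates, modality_weight):
    """Re-weight by modality, re-rank, and format a context window with source tags."""
    weighted = [
        dict(c, score=c["score"] * modality_weight.get(c["modality"], 1.0))
        for c in candidates
    ]
    weighted.sort(key=lambda c: c["score"], reverse=True)
    return "\n".join(f"[{c['source']}] {c['text']}" for c in weighted)


# Toy corpus; identifiers and contents are invented for the example.
documents = [
    {"id": "trial-A", "modality": "text",
     "summary": "adverse events in pulmonary trial",
     "chunks": ["serious adverse events were reported in the treatment arm",
                "enrollment completed on schedule"]},
    {"id": "ct-17", "modality": "imaging",
     "summary": "pulmonary nodule findings",
     "chunks": ["findings show stable pulmonary nodule"]},
    {"id": "billing-9", "modality": "text",
     "summary": "insurance billing codes",
     "chunks": ["claims summary for outpatient visits"]},
]

query = "pulmonary adverse events"
evidence = hierarchical_retrieve(query, documents)
context = fuse_evidence(evidence, modality_weight={"text": 1.0, "imaging": 0.8})
```

The resulting `context` string is what would be placed in the prompt of the generation model; the bracketed source tags keep each passage traceable to its origin.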
3.4. Parameter-Efficient Fine-Tuning and Model Governance
- Tier 1—Public corpus adaptation
- Tier 2—Campus-specific refinement
- Factual cross-check against a medical knowledge graph.
- Rule-based validation of medication and laboratory advice using clinical logic.
- Random double-masked review of generated reports by two attending physicians.
3.5. Report Generation and Clinical Workflow Integration
4. Experimental Design
4.1. Research Objectives and Hypotheses
- Retrieval-layer evidence recall and precision—ensuring that clinical decisions are grounded in complete, loss-free evidence.
- Generation-layer factual consistency and linguistic quality—assessing whether the LLM attains publication-level narrative quality in Traditional Chinese medical contexts.
- End-to-end latency and throughput—verifying real-time responsiveness on the hardware profile typical of regional hospitals.
- Module contribution and safety—using ablation studies to quantify how the imaging branch, LoRA fine-tuning, and RLAIF alignment affect hallucination rates and expert scores.
- H1 The text-retrieval component will achieve an evidence recall rate ≥ 0.85 in a top-20 setting.
- H2 Adding the PACS imaging branch will raise the average factual consistency score from three clinical experts by at least five percentage points.
- H3 Compared with a prompt-only Llama-3 baseline, the whole pipeline will reduce the hallucination rate by ≥40% (baseline 20%).
- H4 With the in-house inference environment and a multi-million-vector index, end-to-end query latency will remain within clinically acceptable limits.
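The thresholds in H1 and H3 can be made concrete with a short calculation. The sketch below assumes that recall is measured against a labeled set of relevant evidence items and that the 40% reduction in H3 is relative to the 20% baseline; the identifiers and the example rates are invented.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of relevant items that appear among the first k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = relevant & set(retrieved_ids[:k])
    return len(found) / len(relevant)


def relative_reduction(baseline_rate, new_rate):
    """Relative drop of new_rate with respect to baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


H1_MIN_RECALL = 0.85     # H1 threshold from the text (top-20 setting)
H3_BASELINE = 0.20       # baseline hallucination rate from the text
H3_MIN_REDUCTION = 0.40  # required relative reduction from the text

# Under a relative reading, H3 is met when the rate is at most about 0.12.
h3_max_rate = H3_BASELINE * (1 - H3_MIN_REDUCTION)

# Invented example: 30 retrieved items, 4 relevant ones, 3 of them in the top 20.
retrieved = [f"e{i}" for i in range(30)]
relevant = ["e0", "e5", "e19", "e25"]
recall = recall_at_k(retrieved, relevant, k=20)
h1_supported = recall >= H1_MIN_RECALL

# Invented example: an observed hallucination rate of 11%.
h3_supported = relative_reduction(H3_BASELINE, 0.11) >= H3_MIN_REDUCTION
```

If the 40% were instead meant as an absolute difference in percentage points, the hypothesis could not be satisfied from a 20% baseline, which is why the relative reading is assumed here.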
4.2. Training Pipeline Configuration
- 1. Base model and data stream
- 2. Sequence truncation and chunk-level embedding
- 3. LoRA-based PEFT fine-tuning
- 4. GRPO reinforcement-learning loop
  - After each generation step, a rule-mixed feedback signal is computed from:
    - Vocabulary coverage (alignment with trial-standard terminology)
    - Citation accuracy (consistency between RAG-retrieved passages and the model’s answer)
    - Paragraph-structure score (presence of the four-level heading schema)
  - A weighted composite reward is computed and fed into the GRPO configuration for policy updating. Recent studies from AWS and HF-TRL confirm that PPO/GRPO significantly improves instruction adherence without degrading the base-model knowledge.
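One way to combine the three feedback signals into a single scalar is sketched below. The scoring rules are simplified proxies chosen for illustration: citation accuracy is approximated by checking bracketed passage identifiers against the retrieved set, the heading names are hypothetical, and the weights (0.3, 0.4, 0.3) are placeholders, since the text does not report the actual values.

```python
import re


def vocabulary_coverage(answer, standard_terms):
    """Share of standard terms that occur in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for term in standard_terms if term.lower() in text)
    return hits / len(standard_terms) if standard_terms else 0.0


def citation_accuracy(answer, retrieved_ids):
    """Share of bracketed citations, e.g. [p1], that point to a retrieved passage."""
    cited = re.findall(r"\[(\w+)\]", answer)
    if not cited:
        return 0.0
    return sum(1 for c in cited if c in retrieved_ids) / len(cited)


def structure_score(answer, required_headings):
    """Share of required headings that are present in the answer."""
    if not required_headings:
        return 0.0
    present = sum(1 for h in required_headings if h in answer)
    return present / len(required_headings)


def composite_reward(answer, standard_terms, retrieved_ids, required_headings,
                     weights=(0.3, 0.4, 0.3)):  # placeholder weights
    """Weighted sum of the three component scores, each in [0, 1]."""
    parts = (
        vocabulary_coverage(answer, standard_terms),
        citation_accuracy(answer, retrieved_ids),
        structure_score(answer, required_headings),
    )
    return sum(w * p for w, p in zip(weights, parts))


# Invented example answer with two of four hypothetical headings.
answer = (
    "## Summary\nHazard ratio favoured treatment [p1].\n"
    "## Risk\nAdverse event rate was low [p9]."
)
reward = composite_reward(
    answer,
    standard_terms=["hazard ratio", "adverse event", "confidence interval"],
    retrieved_ids={"p1", "p2"},
    required_headings=["## Summary", "## Risk", "## Recommendation", "## References"],
)
```

Here the coverage is 2/3, the citation accuracy 1/2, and the structure score 1/2, so the composite reward is about 0.55 under the placeholder weights.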
4.3. Training Parameters and Validation
4.3.1. Hyperparameter Configuration Rationale
4.3.2. Reinforcement-Learning Fine-Tuning (GRPO)
- Optimizer and learning rate: paged_adamw_8bit reduces the memory footprint; the learning rate is 5 × 10⁻⁶ with a cosine-annealing schedule and a warm-up ratio of 0.1.
- Precision and batch size: bf16 or fp16 mixed precision is selected automatically depending on hardware; per_device_train_batch_size = 1; gradient_accumulation_steps = 1 (adjustable, e.g., to 4, for added stability).
- Generation and reward computation: at each training step, the model generates six candidate responses (num_generations = 6) from which rewards are computed. The prompt length is capped at 256 tokens and the completion length at 200 tokens.
- Training stability: a weight decay of 0.1 limits overfitting, and gradient clipping with max_grad_norm = 0.1 guards against gradient explosions.
- Loop control and checkpointing: max_steps = 10 for initial testing (to be adjusted as convergence dictates); checkpoints are saved every 10 steps; logs are written every step (logging_steps = 1); reporting to Weights & Biases is disabled (report_to = “none”). All artifacts are stored in the outputs/ directory.
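The settings listed above can be collected into a single configuration mapping, as in the sketch below. The key names follow the parameter names quoted in the text where they are given; the remaining keys (`optim`, `lr_scheduler_type`, `warmup_ratio`, `save_steps`, `output_dir`, and so on) are assumed to match the Hugging Face trainer argument names, which the text does not confirm. The `lr_at_step` helper shows one common formulation of linear warm-up followed by cosine decay and is not necessarily the exact schedule used.

```python
import math

# Values taken from the text; key names partly assumed (see lead-in).
grpo_settings = {
    "optim": "paged_adamw_8bit",
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "num_generations": 6,
    "max_prompt_length": 256,
    "max_completion_length": 200,
    "weight_decay": 0.1,
    "max_grad_norm": 0.1,
    "max_steps": 10,
    "save_steps": 10,
    "logging_steps": 1,
    "report_to": "none",
    "output_dir": "outputs",
}


def lr_at_step(step, cfg=grpo_settings):
    """Linear warm-up to the peak rate, then cosine decay to zero."""
    total = cfg["max_steps"]
    warmup = math.ceil(cfg["warmup_ratio"] * total)
    peak = cfg["learning_rate"]
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))


# Upper bound on tokens generated for one prompt in one step.
max_generated_tokens = (
    grpo_settings["num_generations"] * grpo_settings["max_completion_length"]
)

schedule = [lr_at_step(s) for s in range(grpo_settings["max_steps"] + 1)]
```

With `max_steps = 10` and a warm-up ratio of 0.1, the warm-up lasts a single step, so the peak rate is reached at step 1 and decays from there.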
4.3.3. Validation Protocol
4.4. Evaluation Metrics and Statistical Methods
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence): This metric evaluates lexical overlap between the generated text and the reference by computing the length of their longest common subsequence (LCS).
- BERTScore (Bidirectional Encoder Representations from Transformers Score): This semantic metric compares contextual embeddings of the candidate and the reference produced by a pre-trained language model.
- Med-Concept F1 (Medical Concept-Level F1 Score): This metric maps extracted entities to UMLS and ICD-10 codes and computes precision, recall, and F1 score at the concept level.
- FactCC-Med (Factual Consistency Classifier, Medical Adaptation): A binary classifier that detects sentence-level factual inconsistencies. The output is the proportion of sentences flagged as factual errors.
- Composite Quality Index (CQI): A weighted aggregate of the four metrics above that facilitates holistic comparison.
- Text-level overlap—ROUGE-L measures the longest common subsequence, reflecting key semantics and word-order fidelity.
- Semantic similarity—BERTScore, based on bidirectional contextual embeddings, captures meaning beyond surface tokens and has been shown to correlate well with human judgment in medical text.
- Clinical-concept coverage—Med-Concept F1 computes recall and precision after mapping predictions to UMLS and ICD-10 concepts, directly linking the score to downstream decision correctness.
- Factual consistency—FactCC-Med detects sentence-level factual errors and is widely adopted in recent healthcare-NLP studies to flag hallucinations.
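For readers who want to see how the overlap-based metrics are computed, the sketch below implements the standard LCS-based ROUGE-L F-score and a set-based concept-level F1. The `composite_quality_index` function is only an illustration of a weighted aggregate: the equal weights and the inversion of the error rate are assumptions, so it is not expected to reproduce the CQI values reported in the results table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def concept_f1(predicted_codes, gold_codes):
    """F1 over sets of concept codes after mapping to a shared vocabulary."""
    predicted, gold = set(predicted_codes), set(gold_codes)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def composite_quality_index(rouge_l_f, bertscore_f, concept_f, error_rate,
                            weights=(0.25, 0.25, 0.25, 0.25)):  # assumed weights
    """Weighted aggregate on a 0-100 scale; the error rate is inverted so higher is better."""
    parts = (rouge_l_f, bertscore_f, concept_f, 1.0 - error_rate)
    return 100 * sum(w * p for w, p in zip(weights, parts))


score = rouge_l(
    "the patient was discharged home",
    "the patient was safely discharged home",
)
f1 = concept_f1({"J18.9", "E11.9"}, {"J18.9", "I10"})
cqi = composite_quality_index(0.4, 0.9, 0.8, 0.1)
```

In the example, the candidate is a subsequence of the reference, so precision is 1, recall is 5/6, and the ROUGE-L F-score is 10/11.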
4.5. Baselines and Ablation Studies
- Layer-specific LoRA adaptation: removing the LoRA adapters on gate_proj and o_proj (labeled −Gate and O Proj) raises the FactCC-Med hallucination rate by 0.9 pp, demonstrating that parameter-efficient fine-tuning on these two projections is critical for factual consistency.
- PPO reinforcement learning: eliminating the PPO multi-reward mechanism (−PPO RL) lowers CQI by 2.6 pp, highlighting PPO’s role in balancing fluency, clinical-concept coverage, and factual accuracy.
- RAG retrieval: disabling RAG (−RAG Retrieval) causes a 6.1 pp decline in CQI and degrades every sub-metric, underscoring the necessity of external knowledge integration for accurate and complete reports.
- LoRA overall: most notably, removing LoRA entirely (−LoRA (revert to FT)) cuts CQI by 9.8 pp, bringing performance nearly back to the baseline B-0 level (68.1 vs. ~68.5) and confirming LoRA’s centrality: without it, the benefits of the other modules cannot be fully realized.
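The arithmetic behind the comparison with the B-0 baseline can be checked directly from the reported numbers. The snippet below takes the full-pipeline CQI, the B-0 CQI, and the ΔCQI values from the results and ablation tables and derives the CQI of each ablated variant; it adds no new measurements.

```python
FULL_CQI = 78.3  # Full (A-2 + PPO multi-reward), results table
B0_CQI = 68.1    # B-0 (text-only fine-tune), results table

# ΔCQI per removed module, ablation table.
delta_cqi = {
    "-PPO RL": -2.6,
    "-RAG Retrieval": -6.1,
    "-LoRA (revert to FT)": -9.8,
    "-Gate and O Proj": -3.4,
}

# CQI of each ablated variant = full CQI plus the (negative) delta.
ablated_cqi = {name: round(FULL_CQI + delta, 1) for name, delta in delta_cqi.items()}

# Remaining margin over the B-0 baseline after removing LoRA entirely.
margin_over_b0 = round(ablated_cqi["-LoRA (revert to FT)"] - B0_CQI, 1)

# Modules ordered from largest to smallest CQI loss.
ranking = sorted(delta_cqi, key=delta_cqi.get)
```

Removing LoRA leaves a CQI of 68.5, only 0.4 points above B-0, which is the basis for the "nearly reverting" statement above.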
5. Discussion
- The experiments involve a single healthcare system; the impact of cross-institution data heterogeneity on RAG retrieval effectiveness remains untested.
- While CQI aggregates multiple quantitative metrics, it cannot fully substitute for qualitative expert judgment. Incorporating newly released hallucination datasets such as MedHal could further strengthen the fact-checking module.
- The study did not examine vector-database refresh cycles; maintaining real-time retrieval performance as new cases are ingested will be a key optimization target.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Abbreviation | Full Term |
---|---|
AI | Artificial Intelligence |
ATC | Anatomical Therapeutic Chemical Classification |
CQI | Composite Quality Index |
DICOM | Digital Imaging and Communications in Medicine |
EHR | Electronic Health Record |
FHIR | Fast Healthcare Interoperability Resources |
GRPO | Group Relative Policy Optimization |
HIS | Hospital Information System |
ICD-10 | International Classification of Diseases, 10th Revision |
LLM | Large Language Model |
LoRA | Low-Rank Adaptation |
Med-F1 | UMLS-based Medical Concept F1 Score |
NER | Named Entity Recognition |
NHI | National Health Insurance (Taiwan) |
PACS | Picture Archiving and Communication System |
PEFT | Parameter-Efficient Fine-Tuning |
PPO | Proximal Policy Optimization |
QLoRA | Quantized LoRA |
RAG | Retrieval-Augmented Generation |
RCT | Randomized Controlled Trial |
TFDA | Taiwan Food and Drug Administration |
UMLS | Unified Medical Language System |
References
- Hutson, M. How AI is being used to accelerate clinical trials. Nature 2024, 627, S2–S5. Available online: https://www.nature.com/articles/d41586-024-00753-x (accessed on 22 June 2025).
- Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models. arXiv 2024, arXiv:2401.11817. Available online: https://arxiv.org/abs/2401.11817 (accessed on 22 June 2025).
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J. A review on large language models: Architectures, applications, taxonomies, open issues, and challenges. IEEE Access 2024, 12, 26839–26874.
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. Available online: https://arxiv.org/abs/2312.10997 (accessed on 22 June 2025).
- Ng, K.K.Y.; Matsuba, I.; Zhang, P.C. RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI 2025, 2, AIra2400380.
- Ocampo, T.S.C.; Silva, T.P.; Alencar-Palha, C.; Haiter-Neto, F.; Oliveira, M.L. ChatGPT and scientific writing: A reflection on the ethical boundaries. Imaging Sci. Dent. 2023, 53, 175–176.
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023, arXiv:2305.14314. Available online: https://arxiv.org/abs/2305.14314 (accessed on 22 June 2025).
- Huang, D.; Hu, Z.; Wang, Z. Performance Analysis of Llama 2 Among Other LLMs. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, 25–27 June 2024; pp. 1081–1085.
- Tanno, R.; Barrett, D.G.T.; Sellergren, A.; Ghaisas, S.; Dathathri, S.; See, A.; Welbl, J.; Lau, C.; Tu, T.; Azizi, S.; et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat. Med. 2024, 31, 599–608.
- Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarization. npj Digit. Med. 2025, 8, 274.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620, 172–180.
- Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Tseng, V.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
- Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inf. Assoc. 2025, 32, 605–615.
- Izacard, G.; Grave, E. Leveraging passage retrieval with generative models for open domain question answering. arXiv 2020, arXiv:2007.01282.
- Alkhalaf, M.; Yu, P.; Yin, M.; Deng, C. Applying generative AI with retrieval-augmented generation to summarize and extract key clinical information from electronic health records. J. Biomed. Inform. 2024, 156, 104662.
- Malkov, Y.A.; Yashunin, D.A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv 2018, arXiv:1603.09320.
- Gao, Y.; Li, R.; Croxford, E.; Tesch, S.; To, D.; Caskey, J.; Patterson, B.W.; Churpek, M.M.; Miller, T.; Dligach, D.; et al. Large Language Models and Medical Knowledge Grounding for Diagnosis Prediction. medRxiv 2023, 2023, 24.23298641.
- Qin, C.; Jiang, K.; Wang, Y.; Zhu, T.; Wu, Y.; Zhang, D. Event-triggered H∞ control for unknown constrained nonlinear systems with application to robot arm. Appl. Math. Model. 2025, 144, 116089.
- Zhang, D.; Hao, X.; Liang, L.; Liu, W.; Qin, C. A novel deep convolutional neural network algorithm for surface defect detection. J. Comput. Des. Eng. 2022, 9, 1616–1632.
- Hu, E.; Shen, Y.; Wallis, C.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. ICML Proc. 2021, 139, 132–152.
- Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A. Benchmarking Retrieval-Augmented Generation for Medicine. arXiv 2024, arXiv:2402.13178.
- Ye, C. Exploring a learning-to-rank approach to enhance the Retrieval Augmented Generation (RAG)-based electronic medical records search engines. Inform. Health 2024, 1, 93–99.
Model | ROUGE-L ↑ | BERTScore ↑ | Med-Concept F1 ↑ | FactCC-Med ↓ (%) | CQI ↑ |
---|---|---|---|---|---|
B-0 (text-only fine-tune) | 37.5 ± 0.8 | 0.874 ± 0.005 | 0.712 ± 0.011 | 9.8 ± 0.4 | 68.1 ± 0.9 |
B-1 (B-0 + RAG) | 40.2 ± 0.7 | 0.890 ± 0.004 | 0.748 ± 0.010 | 7.3 ± 0.3 | 72.9 ± 0.8 * |
A-1 (LoRA-16) | 41.0 ± 0.6 | 0.893 ± 0.004 | 0.762 ± 0.009 | 7.0 ± 0.3 | 74.2 ± 0.7 * |
A-2 (LoRA-32) | 42.6 ± 0.6 | 0.901 ± 0.003 | 0.781 ± 0.009 | 6.6 ± 0.2 | 77.0 ± 0.6 * |
A-3 (LoRA-64) | 43.0 ± 0.6 | 0.902 ± 0.003 | 0.784 ± 0.009 | 6.5 ± 0.2 | 77.7 ± 0.6 *† |
Full (A-2 + PPO multi-reward) | 43.1 ± 0.5 | 0.904 ± 0.003 | 0.791 ± 0.008 | 6.2 ± 0.2 | 78.3 ± 0.5 * |
Module Removed | ΔROUGE-L | ΔBERTScore | ΔMed-Concept F1 | ΔFactCC-Med | ΔCQI |
---|---|---|---|---|---|
None (Full) | 0 | 0 | 0 | 0 | 0 |
−PPO RL | −0.9 ★ | −0.3 | −0.5 | +0.4 ★ | −2.6 ★ |
−RAG Retrieval | −2.3 ★ | −1.1 ★ | −1.6 ★ | +1.8 ★ | −6.1 ★ |
−LoRA (revert to FT) | −5.6 ★ | −3.0 ★ | −2.9 ★ | +2.7 ★ | −9.8 ★ |
−Gate and O Proj | −1.5 ★ | −0.7 | −1.2 | +0.9 | −3.4 ★ |
System/Model | RAG Design | Fine-Tuning Method | ROUGE-L ↑ | Med-F1 ↑ | Latency ↓ (s) | Deployment Feasibility |
---|---|---|---|---|---|---|
This Study (Full) | Hierarchical (multi-modal) | LoRA-32 + GRPO | 43.1 | 0.791 | <5 | On-premise feasible; low compute demand |
Med-PaLM 2 | None (prompt-only) | Internal SFT | ~41.2 | ~0.76 | >10 | Cloud only; regulatory friction |
PMC-LLaMA (13B) | None | Prompt tuning | ~40.5 | ~0.74 | ~8 | Partially open source; no image support |
BioGPT + RAG-lite | Flat retrieval | Full fine-tuning | ~38.7 | ~0.71 | ~7 | High GPU cost; no LoRA optimizations |
BioBERT + Templates | None | Rule-based | ~35.0 | ~0.69 | >15 | Easy to implement, poor accuracy |
Redaction Level (%) | ROUGE-L | Med-F1 |
---|---|---|
0% (Baseline) | 43.1 | 0.791 |
10% (Light) | 41.6 | 0.771 |
25% (Moderate) | 39.3 | 0.742 |
40% (Aggressive) | 36.2 | 0.701 |
Setting | Task Domain | ROUGE-L | Med-F1 |
---|---|---|---|
GRPO (Adult CT, original weights) | Pulmonary | 43.1 | 0.791 |
GRPO (Direct applied to DDH) | Pediatric DDH | 40.4 | 0.751 |
GRPO (with adaptive weights) | Pediatric DDH | 42.1 | 0.773 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kuo, S.-M.; Tai, S.-K.; Lin, H.-Y.; Chen, R.-C. Automated Clinical Trial Data Analysis and Report Generation by Integrating Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) Technologies. AI 2025, 6, 188. https://doi.org/10.3390/ai6080188