Deep Reinforcement Learning-Driven Adaptive Prompting for Robust Medical LLM Evaluation
Abstract
1. Introduction
- Formalizing prompt selection in medical LLM evaluation as an MDP with sample-wise context features and a multi-objective reward structure.
- Proposing a DQN-based agent for optimizing adaptive multi-prompt selection across multiple performance objectives.
- Designing a comprehensive multi-objective reward function reflecting critical metrics including accuracy, safety, medical terminology coverage, and dialogue relevance.
- Demonstrating consistent and significant improvements in composite reward, robust enhancements in safety, and substantial gains in medical terminology coverage across three diverse medical evaluation tasks when compared to fixed and random baselines.
2. Related Work
2.1. LLMs in Medicine
2.2. Prompt Engineering and Adaptation
2.3. Reinforcement Learning for LLMs
3. Methods
3.1. Dataset Preparation
3.1.1. Medical Multiple-Choice Question (MCQ) Dataset
- Parsing and Extraction: Each raw sample was first validated for format compliance, ensuring the entry was a two-element list: (content, answer). Regular expressions were used to extract the question stem, the option label-value mapping (e.g., (A), (B), …), and the correct answer key. Entries with parsing errors, missing fields, or inconsistent answer keys were excluded.
- Canonicalization: The validated data was then normalized into a uniform format with three fields: question (text), options (dictionary mapping label to text), and answer (single correct label). This facilitates downstream prompt generation and evaluation.
- Sampling and Splitting: From all successfully parsed entries, a fixed number was sampled uniformly at random to mitigate selection bias. The sampled entries were then shuffled and split into a training set and a test set at a 10:1 ratio, ensuring mutual exclusivity.
- Storage: The processed MCQ data for each split was stored in JSON Lines format for ease of integration with LLM pipelines (a preprocessing sketch follows this list).
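To make the pipeline concrete, the following minimal Python sketch illustrates the parsing, canonicalization, sampling, and 10:1 splitting steps described above. The option-matching regular expression, the random seed, and the helper names (`parse_mcq`, `build_splits`, `save_jsonl`) are illustrative assumptions, not the authors' exact implementation.

```python
import json
import random
import re

# Matches option items like "(A) some text"; the exact pattern is an
# illustrative assumption about the raw format.
OPTION_RE = re.compile(r"\(([A-E])\)\s*([^()]+)")

def parse_mcq(raw):
    """Validate one (content, answer) pair; return a canonical record or None."""
    if not (isinstance(raw, (list, tuple)) and len(raw) == 2):
        return None                                    # format non-compliant
    content, answer = raw
    options = {lab: txt.strip() for lab, txt in OPTION_RE.findall(content)}
    stem = content.split("(A)")[0].strip()             # text before the first option
    answer = str(answer).strip().upper()
    if not stem or answer not in options:
        return None                                    # missing field / bad answer key
    return {"question": stem, "options": options, "answer": answer}

def build_splits(raw_samples, n=1540, seed=0):
    parsed = [p for p in (parse_mcq(r) for r in raw_samples) if p]
    rng = random.Random(seed)
    sampled = rng.sample(parsed, min(n, len(parsed)))  # uniform, without replacement
    rng.shuffle(sampled)
    n_test = len(sampled) // 11                        # 10:1 train/test split
    return sampled[n_test:], sampled[:n_test]

def save_jsonl(items, path):
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```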
3.1.2. Medical Knowledge Question-Answering (MKQA) Dataset
- Format Handling: The original dataset files varied in structure, appearing either as standard JSON arrays of question-answer objects or as JSON Lines with one dictionary per line containing question and answer keys. To accommodate this, a dedicated preprocessing module was engineered to detect the file type (e.g., by inspecting the initial bytes or lines) and apply the appropriate parsing logic, ensuring resilient data extraction.
- Deduplication and Validation: Following format normalization, this custom data loader performed deduplication and validation to ensure data quality. It rigorously checked each entry for completeness, ensuring that both the question and answer fields were non-empty strings. Furthermore, it identified and removed duplicate question-answer pairs to prevent redundancy.
- Sampling and Splitting: Following data validation and deduplication, the final benchmark was constructed by randomly selecting a fixed number of unique entries without replacement. This sampling step aimed to ensure a diverse and unbiased representation of the available medical questions. The resulting dataset was then randomly shuffled and partitioned into training and test sets with a 10:1 ratio.
- Export: Each split was exported as a JSON Lines file, facilitating batch evaluation (a loader sketch follows this list).
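The format sniffing and deduplication described above can be sketched as follows. The `question`/`answer` field names follow the description; everything else (function names, the 1 KB sniff window) is an assumption.

```python
import json

def load_qa_records(path):
    """Detect JSON-array vs. JSON Lines input and parse accordingly."""
    with open(path, "r", encoding="utf-8") as f:
        head = f.read(1024).lstrip()
        f.seek(0)
        if head.startswith("["):                      # standard JSON array
            return json.load(f)
        return [json.loads(line) for line in f if line.strip()]  # JSON Lines

def validate_and_dedup(records):
    """Keep entries with non-empty question/answer strings; drop duplicates."""
    seen, clean = set(), []
    for rec in records:
        q, a = rec.get("question"), rec.get("answer")
        if not (isinstance(q, str) and q.strip() and isinstance(a, str) and a.strip()):
            continue                                  # drop incomplete entries
        key = (q.strip(), a.strip())
        if key in seen:
            continue                                  # drop duplicate QA pairs
        seen.add(key)
        clean.append({"question": key[0], "answer": key[1]})
    return clean
```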
3.1.3. Doctor-Patient Dialogue Dataset
- Initial Loading: The raw data, consisting of doctor-patient dialogue sessions, was initially loaded. This dataset was structured as a JSON array, where each element represented a complete multi-turn conversation. During the loading process, automated validation checks were performed to ensure the structural integrity of the JSON format and the proper parsing of each dialogue session.
- Random Sampling: All loaded dialogue sessions were first shuffled to ensure randomness, and a fixed number of samples was then drawn at random from the shuffled data.
- Division: The selected subset of dialogues was then deterministically partitioned into a training set and a testing set in a training-to-testing ratio of 10:1. This split ensures that the test set remains entirely unseen during the model’s training phase, allowing for an unbiased evaluation of generalization performance.
- Saving: Both sets were saved as JSON Lines files with the full dialogue structure preserved (a split sketch follows this list).
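A minimal sketch of the deterministic session-level split follows; whole sessions are kept intact so no turns leak between train and test. The counts mirror Section 4.1, while the seed value is an assumption introduced here for reproducibility.

```python
import json
import random

def split_dialogues(path, n=660, seed=0):
    """Shuffle whole sessions and split 10:1, keeping each dialogue intact."""
    with open(path, "r", encoding="utf-8") as f:
        sessions = json.load(f)          # JSON array; one element per conversation
    rng = random.Random(seed)
    rng.shuffle(sessions)
    picked = sessions[:n]
    n_test = n // 11                     # 660 -> 600 train / 60 test
    return picked[n_test:], picked[:n_test]
```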
3.2. Problem Statement
- Role-based/General Instruction: Prompts designed to establish an expert persona for the LLM (e.g., “As a physician, answer…”).
- Chain-of-Thought (CoT) Reasoning: Strategies that encourage step-by-step thinking for complex problems.
- Safety-focused Prompting: Prompts explicitly incorporating safety disclaimers and emphasizing risk avoidance.
- Terminology-rich Communication: Prompts guiding the LLM to use and explain medical terminologies.
- Patient-centric/Layperson Explanation: Strategies focused on generating responses in simple, understandable language for non-medical audiences (the full pool is illustrated in the sketch after this list).
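As an illustration of how such a strategy pool can be encoded as a discrete action space, the sketch below pairs each strategy with a prompt template. The Chinese template wording is a paraphrase for illustration, not the exact prompt text used in the study.

```python
# Index in this list doubles as the discrete action id for the RL agent.
PROMPT_POOL = [
    ("role_based",  "你是一名经验丰富的医生，请回答：{input}"),
    ("cot",         "请一步一步推理，然后给出最终答案：{input}"),
    ("safety",      "请注意医疗安全，避免给出未经医生确认的用药或剂量建议：{input}"),
    ("terminology", "请在回答中使用并解释相关的医学术语：{input}"),
    ("layperson",   "请用通俗易懂的语言向非专业人士解释：{input}"),
]

def build_prompt(action_id: int, user_input: str) -> str:
    _, template = PROMPT_POOL[action_id]
    return template.format(input=user_input)
```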
3.3. Formalization of Prompting Limitations
3.4. Reinforcement Learning Framework
3.4.1. MDP Formulation
- States $s \in \mathcal{S}$: The state vector represents the current context and historical performance. It encodes features such as the characteristics of the current medical input (e.g., question type, length, density of medical terms in the prompt) and the rolling average of the LLM’s performance metrics (e.g., accuracy, safety, relevance, terminology coverage) from previous interactions within the current episode. For multi-turn dialogues, it also incorporates aspects of the dialogue history.
- Actions $a \in \mathcal{A}$: The action space is a discrete set $\mathcal{A} = \{a_1, \dots, a_K\}$ representing the selection of a specific prompting strategy, where $K$ is the total number of predefined strategies available in the prompt pool.
- Transition $P$: The transition function is deterministic. Upon selecting an action (prompt) and observing the LLM’s response, the environment progresses to the next relevant state (a code sketch of this rule follows the list). Specifically:
  - For single-turn tasks (MCQ, MKQA), the environment moves to the initial state $s'$ of the next medical task input in the dataset.
  - For multi-turn dialogue tasks, the environment transitions to the subsequent turn within the current dialogue; it only moves to the initial state $s'$ of the next dialogue sample once all turns of the current dialogue have concluded.
- Reward $R(s, a)$: The reward is a real-valued scalar quantifying the quality of the LLM’s response generated using the chosen prompt $a$ in state $s$. It is computed as a weighted sum of multi-objective assessment metrics, which include accuracy, safety, relevance, and medical terminology coverage.
- Discount Factor $\gamma$: This factor determines the present value of future rewards. A value of $\gamma$ closer to 1 emphasizes long-term rewards, encouraging the agent to consider the cumulative impact of its actions across an episode; a smaller value prioritizes immediate rewards.
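The deterministic transition rule can be written compactly. The sketch below assumes each dialogue sample exposes its turns as a `turns` list (an assumption about the data structure) and returns the next (sample, turn) position together with an episode-termination flag.

```python
def next_position(idx, turn, samples, multi_turn):
    """Deterministic transition: returns (next_sample_idx, next_turn, done)."""
    if multi_turn and turn + 1 < len(samples[idx]["turns"]):
        return idx, turn + 1, False      # next turn of the same dialogue
    if idx + 1 < len(samples):
        return idx + 1, 0, False         # initial state of the next sample
    return idx, turn, True               # dataset exhausted: episode ends
```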
3.4.2. DQN-Based Prompt Policy Learning
- Experience Replay: Storing the agent’s experiences (state, action, reward, next state) in a replay buffer and sampling mini-batches for training. This decorrelates consecutive samples and improves learning stability.
- Target Network: Employing a separate target Q-network, which periodically copies the weights from the main Q-network, to provide stable targets for the Q-value updates. This mitigates oscillation and divergence issues (a compact training sketch follows).
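A compact PyTorch sketch of this training loop, combining an epsilon-greedy policy, a replay buffer, and a periodically synchronized target network, is given below. Network width, learning rate, buffer size, and sync interval are illustrative hyperparameters, not the values used in the paper.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3,
                 buffer_size=10_000, batch_size=64, sync_every=200):
        self.q = QNet(state_dim, n_actions)
        self.target = QNet(state_dim, n_actions)
        self.target.load_state_dict(self.q.state_dict())
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.buffer = deque(maxlen=buffer_size)        # experience replay
        self.gamma, self.batch_size = gamma, batch_size
        self.sync_every, self.n_actions, self.steps = sync_every, n_actions, 0

    def act(self, state, eps=0.1):
        if random.random() < eps:                      # epsilon-greedy exploration
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q(torch.as_tensor(state).unsqueeze(0)).argmax())

    def update(self, transition):                      # (s, a, r, s2, done)
        self.buffer.append(transition)
        if len(self.buffer) < self.batch_size:
            return
        batch = random.sample(self.buffer, self.batch_size)  # decorrelated minibatch
        s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        q = self.q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                          # stable targets from target net
            y = r + self.gamma * (1 - d) * self.target(s2).max(1).values
        loss = nn.functional.mse_loss(q, y)
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        self.steps += 1
        if self.steps % self.sync_every == 0:          # periodic hard sync
            self.target.load_state_dict(self.q.state_dict())
```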
- State Feature Engineering
- Input Length: Normalized character count of the current user input (question or turn).
- Medical Terminology Density: Normalized count of identified medical terms (based on a predefined medical vocabulary) within the current user input.
- Recent Performance Statistics: Rolling average values of previously achieved rewards and individual metric scores (accuracy, safety, relevance, and terminology coverage) within the current episode, reflecting the agent’s recent interaction performance.
- Dialogue Context Indicators (for multi-turn tasks): Features such as the current turn number in a dialogue provide insight into the conversation’s depth (the full featurization is sketched below).
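Under the assumption of a simple substring-lookup vocabulary and a five-step rolling window (both illustrative choices rather than the reported configuration), the state features listed above can be assembled as:

```python
import numpy as np

def featurize(text, metric_history, turn=None, vocab=frozenset(), max_len=500):
    """Assemble the state vector from the features listed above."""
    n_terms = sum(1 for term in vocab if term in text)  # vocabulary lookup
    feats = [min(len(text) / max_len, 1.0),             # normalized input length
             min(n_terms / 10.0, 1.0)]                  # normalized term density
    # Rolling averages of (reward, acc, safe, rel, term) within the episode.
    recent = (np.mean(metric_history[-5:], axis=0)
              if metric_history else np.zeros(5))
    feats.extend(recent.tolist())
    if turn is not None:                                # dialogue depth indicator
        feats.append(min(turn / 10.0, 1.0))
    return np.asarray(feats, dtype=np.float32)
```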
- Prompt Strategy Pool
3.5. Reward Function
- $R_{\text{acc}}$ (Accuracy): Quantifies the semantic similarity and factual correctness of the LLM’s response compared to the gold reference.
  - For open-ended QA tasks, it is measured as a weighted average: $R_{\text{acc}} = w_1 \cdot \text{ROUGE-L F1}(G, R) + w_2 \cdot \text{CosineSim}(E(G), E(R))$ (7), where $G$ is the gold reference, $R$ is the LLM’s response, and $E(\cdot)$ denotes sentence embeddings. ROUGE-L F1 evaluates the overlap of the longest common subsequence between $G$ and $R$, serving as a measure of content similarity. CosineSim calculates the cosine of the angle between the sentence embeddings of $G$ and $R$, thereby capturing their semantic similarity. The weights are empirically determined such that $w_1 + w_2 = 1$.
  - For MCQs, it is a binary score: 1 is assigned if the LLM’s predicted option matches the correct option, and 0 otherwise.
- $R_{\text{safe}}$ (Safety): A binary score (1 for safe, 0 for unsafe) indicating whether the LLM’s response adheres to medical safety guidelines, identified by the absence of predefined harmful patterns (e.g., self-medication advice, dose recommendations without context, dismissal of professional medical consultation).
- $R_{\text{term}}$ (Medical Terminology Coverage): Measures the F1-score of relevant medical terms present in the LLM’s response compared to those identified in the user’s input/dialogue history and a predefined medical vocabulary. The F1-score is the harmonic mean of precision and recall, providing a balanced measure that considers both the accuracy of the retrieved items (precision) and the completeness of the retrieval (recall). It encourages the use and explanation of appropriate clinical terminology.
- $R_{\text{rel}}$ (Contextual Relevance): Applicable primarily in multi-turn dialogue settings. It assesses how well the LLM’s response addresses the most recent user query and aligns with the overall dialogue context. It is evaluated as a weighted combination of keyword overlap and semantic similarity: $R_{\text{rel}} = w_3 \cdot \text{KeywordOverlap}(U, R) + w_4 \cdot \text{CosineSim}(E(U), E(R))$ (8), where $U$ is the most recent user utterance, $R$ is the LLM’s response, and $w_3, w_4$ are weights such that $w_3 + w_4 = 1$. KeywordOverlap is computed by tokenizing and filtering stop words from both $U$ and $R$, then calculating the ratio of common words to the words in $U$. CosineSim is applied between the sentence embeddings of $U$ and $R$. Reference implementations of these components are sketched below.
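For reference, a minimal sketch of these metric components is given below. The LCS-based ROUGE-L F1 and the set-based term F1 follow their standard definitions; `embed` stands for any sentence-embedding function (an assumption, e.g., a sentence-transformer encode call), and the default weight values shown are placeholders rather than the tuned weights.

```python
import numpy as np

def rouge_l_f1(gold: str, resp: str) -> float:
    """Character-level ROUGE-L F1 via longest common subsequence."""
    m, n = len(gold), len(resp)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if gold[i] == resp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

def cosine_sim(e1, e2) -> float:
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def term_f1(input_terms: set, response_terms: set) -> float:
    """F1 between terms expected from the input/history and terms produced."""
    tp = len(input_terms & response_terms)
    if tp == 0:
        return 0.0
    p, r = tp / len(response_terms), tp / len(input_terms)
    return 2 * p * r / (p + r)

def keyword_overlap(user_tokens, resp_tokens, stopwords) -> float:
    """Share of (stop-word-filtered) user keywords echoed in the response."""
    u = [t for t in user_tokens if t not in stopwords]
    if not u:
        return 0.0
    r = set(resp_tokens) - set(stopwords)
    return sum(t in r for t in u) / len(u)

def accuracy(gold, resp, embed, w1=0.5, w2=0.5):   # w1 + w2 = 1; values are placeholders
    return w1 * rouge_l_f1(gold, resp) + w2 * cosine_sim(embed(gold), embed(resp))
```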
4. Experiments
4.1. Datasets Employed
- MKQA Dataset: Used for open-ended QA tasks. This dataset was derived from the medicalBook_zh_qa subset of ApolloCorpus [21], a large-scale Chinese corpus for medical foundation models. It comprises 880 meticulously processed QA pairs, divided into a training set of 800 and a test set of 80.
- MCQ Dataset: Employed for MCQ tasks. This dataset was constructed by processing the medicalExam_zh_clean subset from ApolloCorpus. It consists of 1540 board-exam style questions, partitioned into a training set of 1400 and a test set of 140.
- Doctor-Patient Dialogue Dataset: Dedicated to multi-turn dialogue tasks, this dataset was curated from the CMtMedQA dataset [22], a collection of medical question-answering dialogues. It encompasses 660 curated conversations, each averaging approximately 8.8 turns, systematically split into a training set of 600 conversations and a test set of 60 conversations.
4.2. Auxiliary Resources
- Medical Terminology Vocabulary: A comprehensive list of medical terms, compiled by merging publicly available medical word lists from two sources: QASystemOnMedicalGraph [23] and Chinese Medical Words [24]. This merged vocabulary contains 132,751 unique terms. It serves as a lookup dictionary to identify and count specific medical terms in both user inputs and LLM responses for the Terminology Coverage metric.
- Chinese Stopwords List: A predefined list of common Chinese stop words, obtained from a publicly available collection at GitHub repository “33211/stopwords” [25]. These stopwords are utilized to filter out common and uninformative words during the calculation of keyword overlap in the Contextual Relevance metric.
4.3. Experimental Environment
4.4. Metrics
- Accuracy (Acc): Quantifies the semantic similarity and factual correctness of the LLM’s response compared to the gold reference. For open-ended QA tasks, it is measured as a weighted average (as defined in Equation (7)). For MCQs, it is a binary score. This metric directly contributes to the immediate reward, and its average across all evaluated samples/turns is reported as the overall factual performance.
- Safety (Safe): A binary score (as defined in Section 3.5) indicating whether the LLM’s response adheres to medical safety guidelines. This score is a direct component of the immediate reward. For the final evaluation, we report the average Safety score.
- Medical Terminology Coverage (Term): Measures the F1-score of the relevant medical terms present in the LLM’s response (as defined in Section 3.5). This metric contributes to the immediate reward, and its average across all evaluated samples/turns reflects the overall appropriate use of medical vocabulary.
- Contextual Relevance (Rel): This metric assesses how well the LLM’s response addresses the most recent user query and aligns with the dialogue’s overall context. It is primarily applicable in multi-turn dialogue settings. It is evaluated as a weighted combination of keyword overlap and semantic similarity (as defined in Equation (8)). The Contextual Relevance score contributes to the immediate reward, and its average across all evaluated dialogue turns is reported for overall conversational coherence.
4.5. RL Training Details
- Main Reward Weights: For multi-turn dialogue tasks, the composite reward combines $\lambda_{\text{safe}}$ (Safety), $\lambda_{\text{acc}}$ (Accuracy), $\lambda_{\text{rel}}$ (Contextual Relevance), and $\lambda_{\text{term}}$ (Medical Terminology Coverage), chosen to place a balanced emphasis on safe and accurate interactions along with professional terminology, recognizing the conversational and sensitive nature of doctor-patient exchanges. For MCQ tasks, only $\lambda_{\text{safe}}$, $\lambda_{\text{acc}}$, and $\lambda_{\text{term}}$ were used, with Medical Terminology Coverage receiving the highest weight, reflecting the paramount importance of precise domain knowledge in board-exam-style questions, where accurate term recognition and utilization are critical. For MKQA tasks, $\lambda_{\text{safe}}$, $\lambda_{\text{acc}}$, and $\lambda_{\text{term}}$ were likewise used, with accuracy assigned the highest weight, as factual correctness and direct answer fidelity are the primary objectives in single-turn factual question answering. Note: $\lambda_{\text{rel}}$ is not applicable for single-turn tasks (a configuration sketch with placeholder values follows this list).
- Accuracy Metric Weights (for open-ended QA, Equation (7)): $w_1$ (ROUGE-L F1) and $w_2$ (CosineSim).
- Contextual Relevance Metric Weights (for multi-turn dialogue, Equation (8)): $w_3$ (KeywordOverlap) and $w_4$ (CosineSim).
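The task-specific weighting can be captured in a small configuration table, sketched below. The numeric weight values from the original article are not reproduced in this text, so the numbers in this sketch are purely hypothetical placeholders that only respect the stated ordering (Safety/Accuracy emphasized for dialogue, Term highest for MCQ, Acc highest for MKQA).

```python
# Hypothetical placeholder weights; each task's weights sum to 1.
REWARD_WEIGHTS = {
    "dialogue": {"safe": 0.3, "acc": 0.3, "rel": 0.2, "term": 0.2},
    "mcq":      {"safe": 0.2, "acc": 0.3, "term": 0.5},   # no Rel for single-turn
    "mkqa":     {"safe": 0.2, "acc": 0.5, "term": 0.3},
}

def composite_reward(metrics: dict, task: str) -> float:
    """Weighted sum of per-metric scores for the given task."""
    w = REWARD_WEIGHTS[task]
    return sum(w[k] * metrics[k] for k in w)
```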
4.6. Evaluation Results and Analysis
- Fixed Strategy (Baseline): This method consistently employs a single, predefined prompt strategy (the “Role-based/General Instruction” prompt) for all LLM interactions, serving as a static baseline representing a common, non-adaptive prompting approach.
- Random Strategy (Random): This method randomly selects a prompt strategy from the predefined pool for each LLM interaction, representing a non-intelligent, purely stochastic approach.
- Rule-based Strategy (Rule-based): This adaptive baseline employs a set of hand-crafted heuristic rules to select a prompt strategy for each LLM interaction. The rules prioritize the prompt selection based on observable features of the current input, including its length and medical terminology density. For example, in MKQA tasks, the strategy might prioritize a “Safety-focused” prompt for potentially risky queries, a “Chain-of-Thought” prompt for complex, terminology-rich questions, or a “Patient-centric” prompt for very simple, non-technical queries.
  - A CoT prompt is chosen for questions with high complexity (normalized question length > 0.6 and normalized term density > 0.5).
  - A Terminology-rich Communication prompt is selected if the term density is notably high (normalized term density > 0.7).
  - A Patient-centric/Layperson Explanation prompt is applied for very simple, non-technical queries (normalized question length < 0.3 and normalized term density < 0.2).
  - Otherwise, a default Role-based/General Instruction prompt is used (these rules are transcribed in the code sketch after this list).
- RL Agent Strategy (RL): This is our proposed method, where a trained DQN agent dynamically selects the optimal prompt strategy based on the current state (task features and historical performance) for each LLM interaction.
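The heuristic selector can be transcribed directly from the rules above. The thresholds are the ones stated in the list; the action identifiers and function name are assumptions.

```python
def rule_based_action(q_len: float, term_density: float) -> str:
    """Select a prompt strategy; both inputs are normalized to [0, 1]."""
    if q_len > 0.6 and term_density > 0.5:
        return "cot"             # complex, terminology-rich question
    if term_density > 0.7:
        return "terminology"     # notably high term density
    if q_len < 0.3 and term_density < 0.2:
        return "layperson"       # very simple, non-technical query
    return "role_based"          # default strategy
```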
4.7. Cross-Model Applicability
4.8. Comparison with All Static Prompt Strategies
4.9. Performance Comparison of Diverse Chinese LLMs Under Learned RL Policy
4.9.1. Evaluated LLM Models
4.9.2. Evaluation Protocol
4.9.3. Results on Diverse LLMs
5. Discussion
5.1. Strengths and Contributions
- Novel Reinforcement Learning Approach for LLM Evaluation: To our knowledge, this study represents a pioneering effort in leveraging reinforcement learning for automated and adaptive prompt selection, particularly within the rigorous domain of medical Large Language Model evaluation. This adaptive mechanism facilitates the dynamic optimization of LLM prompting strategies, demonstrating its potential to move beyond static or manually tuned approaches.
- Model-Agnostic and Broadly Applicable Framework: The proposed RL-based prompt policy learning framework exhibits strong model-agnostic properties [40]. As evidenced by our LLM evaluation under the learned RL policy, a policy trained on the DeepSeek-V3-0324 model can be applied effectively to the GPT-4.1 model, allowing us to observe its performance under an optimized prompting regime. This highlights the framework’s applicability and potential transferability across different underlying LLM architectures.
- Enhanced Performance in Key Medical Metrics: The RL-driven prompt selection consistently optimizes the composite reward function, achieving higher overall scores compared to baseline and random strategies. Notably, the framework significantly enhances the safety of LLM responses across all tasks, frequently reaching near-perfect or perfect scores, which is critical in healthcare applications. It also results in substantial gains in Medical Terminology Coverage across all tasks, contributing to the informativeness of the generated content. While varying across tasks, an improvement in general Accuracy is observed in the MKQA task.
5.2. Limitations
- Fixed Prompt Pool and Expressivity: The current prompt pool consists of a predefined, fixed set of five manually engineered prompting strategies. This inherent limitation restricts the expressive power of the agent’s actions, as it cannot explore or generate novel prompting techniques beyond these pre-engineered templates, nor can it dynamically combine elements from different strategies. Such a reliance on a constrained, pre-determined action space inevitably places an upper bound on the achievable performance, particularly in highly variable and complex medical scenarios where the ideal prompt might lie outside the current range. A more expressive action space, potentially involving prompt generation or dynamic prompt composition, could further unlock LLM capabilities.
- Reward Proxy and Human Judgment Alignment: The composite reward function, while multi-faceted, acts as a proxy for true human judgment of LLM response quality. This presents a significant challenge, as automated metrics inherently lack the capacity to fully capture the subjective, ethical, and highly contextual nuances of human-LLM interaction in healthcare. Despite its comprehensive design, there remains a potential for misalignment between the computationally derived metrics and nuanced human perception, especially concerning empathy, ethical considerations, and the appropriateness of multi-turn dialogue flow. In a high-stakes domain like medicine, where suboptimal or inappropriate responses can have severe consequences, ensuring robust alignment with expert human judgment is a critical concern that current automated proxies can only partially address.
- Generalization to Unseen/Rare Contexts: While cross-model applicability of our prompt policy has been demonstrated, the policy’s ability to adapt to entirely new or rare medical contexts, unseen during dataset curation, requires further rigorous validation. This limitation is particularly relevant in dynamic clinical environments, where highly specialized or nuanced scenarios often involve sparse data, posing challenges not only for RL policy adaptation but also for the underlying LLMs themselves. The effectiveness in such situations needs dedicated investigation.
5.3. Future Work
- Advanced RL Architectures: Exploring more sophisticated reinforcement learning architectures, such as Multi-Agent reinforcement learning (MARL) [41] or Retrieval-Augmented reinforcement learning [42], could enable the agent to learn more complex, multi-level prompting strategies or to quickly adapt to new LLMs and tasks with minimal retraining.
- Dynamic Prompt Generation and Expansion: Moving beyond a fixed prompt pool, future work could investigate methods for the agent to dynamically construct or expand its prompt space. This may involve techniques like prompt mutation [43], prompt distillation [44], or integrating a meta-LLM to generate novel prompts [45], significantly enhancing the expressivity of the learned policy.
- Human-in-the-Loop Reward Mechanisms: To address the potential misalignment of automated reward proxies, we propose prioritizing the integration of human feedback directly into the reward signal in future work. This could involve methods such as reinforcement learning from human feedback (RLHF), which is particularly well-suited for capturing nuanced human judgments and ethical considerations in complex domains like medical dialogues, thereby leading to policies better aligned with these subtle and critical aspects.
- Cross-Lingual and Multimodal Generalization: Expanding the framework to support cross-lingual medical LLMs and multimodal inputs (e.g., incorporating medical images or patient physiological data) would significantly broaden its applicability and impact. This extension would enable the development of more comprehensive and holistic medical AI systems, and simultaneously facilitate their robust evaluation across diverse languages and data modalities, addressing real-world complexities in global healthcare.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 2024, 630, 625–630.
2. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274.
3. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X.; et al. The application of large language models in medicine: A scoping review. iScience 2024, 27, 109713.
4. Bonaca, M.P.; Lang, N.N.; Chen, A.; Amiri-Kordestani, L.; Lipka, L.; Zwiewka, M.; Strnadova, C.; Klaar, S.; Dent, S.; Janicijevic, T.K.; et al. Cardiovascular safety in oncology clinical trials: JACC: CardioOncology Primer. Cardio Oncol. 2025, 7, 83–95.
5. Zamorano, J.L.; Gottfridsson, C.; Asteggiano, R.; Atar, D.; Badimon, L.; Bax, J.J.; Cardinale, D.; Cardone, A.; Feijen, E.A.M.; Ferdinandy, P.; et al. The cancer patient and cardiology. Eur. J. Heart Fail. 2020, 22, 2290–2309.
6. Zaghir, J.; Naguib, M.; Bjelogrlic, M.; Névéol, A.; Tannier, X.; Lovis, C. Prompt engineering paradigms for medical applications: Scoping review. J. Med. Internet Res. 2024, 26, e60501.
7. Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study. JMIR Med. Inform. 2024, 12, e55318.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
9. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950.
10. Liu, X.; Liu, H.; Yang, G.; Jiang, Z.; Cui, S.; Zhang, Z.; Wang, H.; Tao, L.; Sun, Y.; Song, Z.; et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 2025, 31, 932–942.
11. Michalopoulos, G.; Williams, K.; Singh, G.; Lin, T. MedicalSum: A guided clinical abstractive summarization model for generating medical reports from patient-doctor conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4741–4749.
12. Görtz, M.; Baumgärtner, K.; Schmid, T.; Muschko, M.; Woessner, P.; Gerlach, A.; Byczkowski, M.; Sültmann, H.; Duensing, S.; Hohenfellner, M. An artificial intelligence-based chatbot for prostate cancer education: Design and patient evaluation study. Digit. Health 2023, 9, 20552076231173304.
13. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 453–465.
14. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
15. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
16. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
17. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
18. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. Adv. Neural Inf. Process. Syst. 2017, 30.
19. Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI feedback. arXiv 2022, arXiv:2212.08073.
20. Wang, X.; Li, C.; Wang, Z.; Bai, F.; Luo, H.; Zhang, J.; Jojic, N.; Xing, E.P.; Hu, Z. PromptAgent: Strategic planning with language models enables expert-level prompt optimization. arXiv 2023, arXiv:2310.16427.
21. Wang, X.; Chen, N.; Chen, J.; Hu, Y.; Wang, Y.; Wu, X.; Gao, A.; Wan, X.; Li, H.; Wang, B. Apollo: Lightweight multilingual medical LLMs towards democratizing medical AI to 6B people. arXiv 2024, arXiv:2403.03640.
22. Yang, S.; Zhao, H.; Zhu, S.; Zhou, G.; Xu, H.; Jia, Y.; Zan, H. Zhongjing: Enhancing the Chinese medical capabilities of large language models through expert feedback and real-world multi-turn dialogue. arXiv 2023, arXiv:2308.03549.
23. Chen, Z. QASystemOnMedicalGraph: A Medical Knowledge Graph Based Question Answering System. 2018. Available online: https://github.com/zhihao-chen/QASystemOnMedicalGraph (accessed on 2 December 2025).
24. Xtea. Chinese Medical Words. 2020. Available online: https://github.com/xtea/chinese_medical_words (accessed on 2 December 2025).
25. 33211. Chinese Stopwords List. 2018. Available online: https://github.com/33211/stopwords (accessed on 2 December 2025).
26. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
27. OpenAI. Introducing GPT-4.1 in the API. 2025. Available online: https://openai.com/index/gpt-4-1/ (accessed on 2 December 2025).
28. Buitrago, P.A.; Nystrom, N.A. Open Compass: Accelerating the adoption of AI in open research. In Proceedings of the Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (Learning), Chicago, IL, USA, 28 July–1 August 2019; pp. 1–9.
29. Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Fu, Y.; et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Adv. Neural Inf. Process. Syst. 2023, 36, 62991–63010.
30. Zhang, N.; Chen, M.; Bi, Z.; Liang, X.; Li, L.; Shang, X.; Yin, K.; Tan, C.; Xu, J.; Huang, F.; et al. CBLUE: A Chinese biomedical language understanding evaluation benchmark. arXiv 2021, arXiv:2106.08087.
31. Xu, L.; Li, A.; Zhu, L.; Xue, H.; Zhu, C.; Zhao, K.; He, H.; Zhang, X.; Kang, Q.; Lan, Z. SuperCLUE: A comprehensive Chinese large language model benchmark. arXiv 2023, arXiv:2307.15020.
32. Liao, Y.; Jiang, S.; Wang, Y.; Wang, Y. MING-MOE: Enhancing medical multi-task learning in large language models with sparse mixture of low-rank adapter experts. arXiv 2024, arXiv:2404.09027.
33. Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv 2024, arXiv:2412.18925.
34. Luo, L.; Ning, J.; Zhao, Y.; Wang, Z.; Ding, Z.; Chen, P.; Fu, W.; Han, Q.; Xu, G.; Qiu, Y.; et al. Taiyi: A bilingual fine-tuned large language model for diverse biomedical tasks. J. Am. Med. Inform. Assoc. 2024, 31, 1865–1874.
35. Zhang, X.; Xue, K.; Zhang, S. PULSE: Pretrained and Unified Language Service Engine. 2023. Available online: https://github.com/openmedlab/PULSE (accessed on 2 December 2025).
36. Hu, S.; Tu, Y.; Han, X.; He, C.; Cui, G.; Long, X.; Zheng, Z.; Fang, Y.; Huang, Y.; Zhao, W.; et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv 2024, arXiv:2404.06395.
37. Young, A.; Chen, B.; Li, C.; Huang, C.; Zhang, G.; Zhang, G.; Wang, G.; Li, H.; Zhu, J.; Chen, J.; et al. Yi: Open foundation models by 01.AI. arXiv 2024, arXiv:2403.04652.
38. DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948.
39. Cui, Y.; Yang, Z.; Yao, X. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv 2023, arXiv:2304.08177.
40. Ribeiro, M.T.; Singh, S.; Guestrin, C. Model-agnostic interpretability of machine learning. arXiv 2016, arXiv:1606.05386.
41. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Berlin/Heidelberg, Germany, 2021; pp. 321–384.
42. Goyal, A.; Friesen, A.; Banino, A.; Weber, T.; Ke, N.R.; Badia, A.P.; Guez, A.; Mirza, M.; Humphreys, P.C.; Konyushova, K.; et al. Retrieval-augmented reinforcement learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 7740–7765.
43. Fernando, C.; Banarse, D.; Michalewski, H.; Osindero, S.; Rocktäschel, T. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv 2023, arXiv:2309.16797.
44. Li, L.; Zhang, Y.; Chen, L. Prompt distillation for efficient LLM-based recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 1348–1357.
45. Zhang, Y.; Yuan, Y.; Yao, A.C. Meta prompting for AI systems. arXiv 2023, arXiv:2311.11482.



[Tables: per-task comparisons of the Baseline, Random, Rule-based, and RL strategies, reporting composite Reward (with Std) and per-metric scores (Acc, Safe, Term, Rel; Rel applies to the Dialogue task only) across MKQA, MCQ, and Dialogue; the numeric cell values were not preserved in this version of the text.]
[Table: composite reward of each static prompt strategy (Role-based, Safety-focused, Terminology-rich, CoT Reasoning, Patient-centric) and the RL Agent Strategy on MKQA, MCQ, and Dialogue, for DeepSeek-V3-0324, GPT-4.1, and Gemini-2.0-flash; numeric values not preserved.]
[Table: Reward, Acc, Safe, Term (plus Rel for Dialogue) on MKQA, MCQ, and Dialogue under the learned RL policy, for medical LLMs (MING-7B, HuatuoGPT-o1-7B, Taiyi-LLM, PULSE-7bv5) and general LLMs (MiniCPM3-4B, Yi-1.5-9B-Chat, DeepSeek-R1-0528-Qwen3-8B, llama-3-chinese-8b-instruct-v3); numeric values not preserved.]