Deep Reinforcement Learning-Driven Adaptive Prompting for Robust Medical LLM Evaluation
Abstract
1. Introduction
- Formalizing prompt selection in medical LLM evaluation as an MDP with sample-wise context features and a multi-objective reward structure.
- Proposing a DQN-based agent for optimizing adaptive multi-prompt selection across multiple performance objectives.
- Designing a comprehensive multi-objective reward function reflecting critical metrics including accuracy, safety, medical terminology coverage, and dialogue relevance.
- Demonstrating consistent and significant improvements in composite reward, robust enhancements in safety, and substantial gains in medical terminology coverage across three diverse medical evaluation tasks when compared to fixed and random baselines.
2. Related Work
2.1. LLMs in Medicine
2.2. Prompt Engineering and Adaptation
2.3. Reinforcement Learning for LLMs
3. Methods
3.1. Dataset Preparation
3.1.1. Medical Multiple-Choice Question (MCQ) Dataset
- Parsing and Extraction: Each raw sample was first validated for format compliance, ensuring the entry was a two-element list: (content, answer). Regular expressions were used to extract the question stem, the option label-value mapping (e.g., (A), (B), …), and the correct answer key. Entries with parsing errors, missing fields, or inconsistent answer keys were excluded.
- Canonicalization: The validated data was then normalized into a uniform format with three fields: question (text), options (dictionary mapping label to text), and answer (single correct label). This facilitates downstream prompt generation and evaluation.
- Sampling and Splitting: From all successfully parsed entries, a fixed number was sampled uniformly at random to mitigate selection bias. The sampled entries were then shuffled and split into a training set and a test set at a 10:1 ratio, ensuring mutual exclusivity.
- Storage: The processed MCQ data for each split was stored in JSON Lines format for ease of integration with LLM pipelines (a preprocessing sketch follows this list).
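To make the pipeline concrete, the following minimal Python sketch illustrates the parsing, canonicalization, sampling, and 10:1 splitting steps described above. The option-matching regular expression, the random seed, and the helper names (`parse_mcq`, `build_splits`, `save_jsonl`) are illustrative assumptions, not the authors' exact implementation.

```python
import json
import random
import re

# Matches option items like "(A) some text"; the exact pattern is an
# illustrative assumption about the raw format.
OPTION_RE = re.compile(r"\(([A-E])\)\s*([^()]+)")

def parse_mcq(raw):
    """Validate one (content, answer) pair; return a canonical record or None."""
    if not (isinstance(raw, (list, tuple)) and len(raw) == 2):
        return None                                    # format non-compliant
    content, answer = raw
    options = {lab: txt.strip() for lab, txt in OPTION_RE.findall(content)}
    stem = content.split("(A)")[0].strip()             # text before the first option
    answer = str(answer).strip().upper()
    if not stem or answer not in options:
        return None                                    # missing field / bad answer key
    return {"question": stem, "options": options, "answer": answer}

def build_splits(raw_samples, n=1540, seed=0):
    parsed = [p for p in (parse_mcq(r) for r in raw_samples) if p]
    rng = random.Random(seed)
    sampled = rng.sample(parsed, min(n, len(parsed)))  # uniform, without replacement
    rng.shuffle(sampled)
    n_test = len(sampled) // 11                        # 10:1 train/test split
    return sampled[n_test:], sampled[:n_test]

def save_jsonl(items, path):
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```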
3.1.2. Medical Knowledge Question-Answering (MKQA) Dataset
- Format Handling: The original dataset files varied in structure, appearing either as standard JSON arrays of question-answer objects or as JSON Lines with one dictionary per line containing question and answer keys. To accommodate this, a dedicated preprocessing module was engineered to detect the file type (e.g., by inspecting the initial bytes or lines) and apply the appropriate parsing logic, ensuring resilient data extraction.
- Deduplication and Validation: Following format normalization, this custom data loader performed deduplication and validation to ensure data quality. It rigorously checked each entry for completeness, ensuring that both the question and answer fields were non-empty strings. Furthermore, it identified and removed duplicate question-answer pairs to prevent redundancy.
- Sampling and Splitting: Following data validation and deduplication, the final benchmark was constructed by randomly selecting a fixed number of unique entries without replacement. This sampling step aimed to ensure a diverse and unbiased representation of the available medical questions. The resulting dataset was then randomly shuffled and partitioned into training and test sets with a 10:1 ratio.
- Export: Each split was exported as a JSON Lines file, facilitating batch evaluation (a loader sketch follows this list).
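The format sniffing and deduplication described above can be sketched as follows. The `question`/`answer` field names follow the description; everything else (function names, the 1 KB sniff window) is an assumption.

```python
import json

def load_qa_records(path):
    """Detect JSON-array vs. JSON Lines input and parse accordingly."""
    with open(path, "r", encoding="utf-8") as f:
        head = f.read(1024).lstrip()
        f.seek(0)
        if head.startswith("["):                      # standard JSON array
            return json.load(f)
        return [json.loads(line) for line in f if line.strip()]  # JSON Lines

def validate_and_dedup(records):
    """Keep entries with non-empty question/answer strings; drop duplicates."""
    seen, clean = set(), []
    for rec in records:
        q, a = rec.get("question"), rec.get("answer")
        if not (isinstance(q, str) and q.strip() and isinstance(a, str) and a.strip()):
            continue                                  # drop incomplete entries
        key = (q.strip(), a.strip())
        if key in seen:
            continue                                  # drop duplicate QA pairs
        seen.add(key)
        clean.append({"question": key[0], "answer": key[1]})
    return clean
```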
3.1.3. Doctor-Patient Dialogue Dataset
- Initial Loading: The raw data, consisting of doctor-patient dialogue sessions, was initially loaded. This dataset was structured as a JSON array, where each element represented a complete multi-turn conversation. During the loading process, automated validation checks were performed to ensure the structural integrity of the JSON format and the proper parsing of each dialogue session.
- Random Sampling: All loaded dialogue sessions were first shuffled to ensure randomness, and a fixed number of samples was then drawn at random from the shuffled data.
- Division: The selected subset of dialogues was then deterministically partitioned into a training set and a testing set in a training-to-testing ratio of 10:1. This split ensures that the test set remains entirely unseen during the model’s training phase, allowing for an unbiased evaluation of generalization performance.
- Saving: Both sets were saved as JSON Lines files with the full dialogue structure preserved (a split sketch follows this list).
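A minimal sketch of the deterministic session-level split follows; whole sessions are kept intact so no turns leak between train and test. The counts mirror Section 4.1, while the seed value is an assumption introduced here for reproducibility.

```python
import json
import random

def split_dialogues(path, n=660, seed=0):
    """Shuffle whole sessions and split 10:1, keeping each dialogue intact."""
    with open(path, "r", encoding="utf-8") as f:
        sessions = json.load(f)          # JSON array; one element per conversation
    rng = random.Random(seed)
    rng.shuffle(sessions)
    picked = sessions[:n]
    n_test = n // 11                     # 660 -> 600 train / 60 test
    return picked[n_test:], picked[:n_test]
```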
3.2. Problem Statement
- Role-based/General Instruction: Prompts designed to establish an expert persona for the LLM (e.g., “As a physician, answer…”).
- Chain-of-Thought (CoT) Reasoning: Strategies that encourage step-by-step thinking for complex problems.
- Safety-focused Prompting: Prompts explicitly incorporating safety disclaimers and emphasizing risk avoidance.
- Terminology-rich Communication: Prompts guiding the LLM to use and explain medical terminologies.
- Patient-centric/Layperson Explanation: Strategies focused on generating responses in simple, understandable language for non-medical audiences (the full pool is illustrated in the sketch after this list).
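As an illustration of how such a strategy pool can be encoded as a discrete action space, the sketch below pairs each strategy with a prompt template. The Chinese template wording is a paraphrase for illustration, not the exact prompt text used in the study.

```python
# Index in this list doubles as the discrete action id for the RL agent.
PROMPT_POOL = [
    ("role_based",  "你是一名经验丰富的医生，请回答：{input}"),
    ("cot",         "请一步一步推理，然后给出最终答案：{input}"),
    ("safety",      "请注意医疗安全，避免给出未经医生确认的用药或剂量建议：{input}"),
    ("terminology", "请在回答中使用并解释相关的医学术语：{input}"),
    ("layperson",   "请用通俗易懂的语言向非专业人士解释：{input}"),
]

def build_prompt(action_id: int, user_input: str) -> str:
    _, template = PROMPT_POOL[action_id]
    return template.format(input=user_input)
```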
3.3. Formalization of Prompting Limitations
3.4. Reinforcement Learning Framework
3.4.1. MDP Formulation
- States $s \in \mathcal{S}$: The state vector represents the current context and historical performance. It encodes features such as the characteristics of the current medical input (e.g., question type, length, density of medical terms in the prompt) and the rolling average of the LLM’s performance metrics (e.g., accuracy, safety, relevance, terminology coverage) from previous interactions within the current episode. For multi-turn dialogues, it also incorporates aspects of the dialogue history.
- Actions $a \in \mathcal{A}$: The action space is a discrete set $\mathcal{A} = \{a_1, \dots, a_K\}$ representing the selection of a specific prompting strategy, where $K$ is the total number of predefined strategies available in the prompt pool.
- Transition $P$: The transition function is deterministic. Upon selecting an action (prompt) and observing the LLM’s response, the environment progresses to the next relevant state (a code sketch of this rule follows the list). Specifically:
  - For single-turn tasks (MCQ, MKQA), the environment moves to the initial state $s'$ of the next medical task input in the dataset.
  - For multi-turn dialogue tasks, the environment transitions to the subsequent turn within the current dialogue; it only moves to the initial state $s'$ of the next dialogue sample once all turns of the current dialogue have concluded.
- Reward $R(s, a)$: The reward is a real-valued scalar quantifying the quality of the LLM’s response generated using the chosen prompt $a$ in state $s$. It is computed as a weighted sum of multi-objective assessment metrics, which include accuracy, safety, relevance, and medical terminology coverage.
- Discount Factor $\gamma$: This factor determines the present value of future rewards. A value of $\gamma$ closer to 1 emphasizes long-term rewards, encouraging the agent to consider the cumulative impact of its actions across an episode; a smaller value prioritizes immediate rewards.
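The deterministic transition rule can be written compactly. The sketch below assumes each dialogue sample exposes its turns as a `turns` list (an assumption about the data structure) and returns the next (sample, turn) position together with an episode-termination flag.

```python
def next_position(idx, turn, samples, multi_turn):
    """Deterministic transition: returns (next_sample_idx, next_turn, done)."""
    if multi_turn and turn + 1 < len(samples[idx]["turns"]):
        return idx, turn + 1, False      # next turn of the same dialogue
    if idx + 1 < len(samples):
        return idx + 1, 0, False         # initial state of the next sample
    return idx, turn, True               # dataset exhausted: episode ends
```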
3.4.2. DQN-Based Prompt Policy Learning
- Experience Replay: Storing the agent’s experiences (state, action, reward, next state) in a replay buffer and sampling mini-batches for training. This decorrelates consecutive samples and improves learning stability.
- Target Network: Employing a separate target Q-network, which periodically copies the weights from the main Q-network, to provide stable targets for the Q-value updates. This mitigates oscillation and divergence issues (a compact training sketch follows).
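A compact PyTorch sketch of this training loop, combining an epsilon-greedy policy, a replay buffer, and a periodically synchronized target network, is given below. Network width, learning rate, buffer size, and sync interval are illustrative hyperparameters, not the values used in the paper.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3,
                 buffer_size=10_000, batch_size=64, sync_every=200):
        self.q = QNet(state_dim, n_actions)
        self.target = QNet(state_dim, n_actions)
        self.target.load_state_dict(self.q.state_dict())
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.buffer = deque(maxlen=buffer_size)        # experience replay
        self.gamma, self.batch_size = gamma, batch_size
        self.sync_every, self.n_actions, self.steps = sync_every, n_actions, 0

    def act(self, state, eps=0.1):
        if random.random() < eps:                      # epsilon-greedy exploration
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q(torch.as_tensor(state).unsqueeze(0)).argmax())

    def update(self, transition):                      # (s, a, r, s2, done)
        self.buffer.append(transition)
        if len(self.buffer) < self.batch_size:
            return
        batch = random.sample(self.buffer, self.batch_size)  # decorrelated minibatch
        s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        q = self.q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                          # stable targets from target net
            y = r + self.gamma * (1 - d) * self.target(s2).max(1).values
        loss = nn.functional.mse_loss(q, y)
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        self.steps += 1
        if self.steps % self.sync_every == 0:          # periodic hard sync
            self.target.load_state_dict(self.q.state_dict())
```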
- State Feature Engineering
- Input Length: Normalized character count of the current user input (question or turn).
- Medical Terminology Density: Normalized count of identified medical terms (based on a predefined medical vocabulary) within the current user input.
- Recent Performance Statistics: Rolling average values of previously achieved rewards and individual metric scores (accuracy, safety, relevance, and terminology coverage) within the current episode, reflecting the agent’s recent interaction performance.
- Dialogue Context Indicators (for multi-turn tasks): Features such as the current turn number in a dialogue provide insight into the conversation’s depth (the full featurization is sketched below).
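Under the assumption of a simple substring-lookup vocabulary and a five-step rolling window (both illustrative choices rather than the reported configuration), the state features listed above can be assembled as:

```python
import numpy as np

def featurize(text, metric_history, turn=None, vocab=frozenset(), max_len=500):
    """Assemble the state vector from the features listed above."""
    n_terms = sum(1 for term in vocab if term in text)  # vocabulary lookup
    feats = [min(len(text) / max_len, 1.0),             # normalized input length
             min(n_terms / 10.0, 1.0)]                  # normalized term density
    # Rolling averages of (reward, acc, safe, rel, term) within the episode.
    recent = (np.mean(metric_history[-5:], axis=0)
              if metric_history else np.zeros(5))
    feats.extend(recent.tolist())
    if turn is not None:                                # dialogue depth indicator
        feats.append(min(turn / 10.0, 1.0))
    return np.asarray(feats, dtype=np.float32)
```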
- Prompt Strategy Pool
3.5. Reward Function
- $R_{\text{acc}}$ (Accuracy): Quantifies the semantic similarity and factual correctness of the LLM’s response compared to the gold reference.
  - For open-ended QA tasks, it is measured as a weighted average: $R_{\text{acc}} = w_1 \cdot \text{ROUGE-L F1}(G, R) + w_2 \cdot \text{CosineSim}(E(G), E(R))$ (7), where $G$ is the gold reference, $R$ is the LLM’s response, and $E(\cdot)$ denotes sentence embeddings. ROUGE-L F1 evaluates the overlap of the longest common subsequence between $G$ and $R$, serving as a measure of content similarity. CosineSim calculates the cosine of the angle between the sentence embeddings of $G$ and $R$, thereby capturing their semantic similarity. The weights are empirically determined such that $w_1 + w_2 = 1$.
  - For MCQs, it is a binary score: 1 is assigned if the LLM’s predicted option matches the correct option, and 0 otherwise.
- $R_{\text{safe}}$ (Safety): A binary score (1 for safe, 0 for unsafe) indicating whether the LLM’s response adheres to medical safety guidelines, identified by the absence of predefined harmful patterns (e.g., self-medication advice, dose recommendations without context, dismissal of professional medical consultation).
- $R_{\text{term}}$ (Medical Terminology Coverage): Measures the F1-score of relevant medical terms present in the LLM’s response compared to those identified in the user’s input/dialogue history and a predefined medical vocabulary. The F1-score is the harmonic mean of precision and recall, providing a balanced measure that considers both the accuracy of the retrieved items (precision) and the completeness of the retrieval (recall). It encourages the use and explanation of appropriate clinical terminology.
- $R_{\text{rel}}$ (Contextual Relevance): Applicable primarily in multi-turn dialogue settings. It assesses how well the LLM’s response addresses the most recent user query and aligns with the overall dialogue context. It is evaluated as a weighted combination of keyword overlap and semantic similarity: $R_{\text{rel}} = w_3 \cdot \text{KeywordOverlap}(U, R) + w_4 \cdot \text{CosineSim}(E(U), E(R))$ (8), where $U$ is the most recent user utterance, $R$ is the LLM’s response, and $w_3, w_4$ are weights such that $w_3 + w_4 = 1$. KeywordOverlap is computed by tokenizing and filtering stop words from both $U$ and $R$, then calculating the ratio of common words to the words in $U$. CosineSim is applied between the sentence embeddings of $U$ and $R$. Reference implementations of these components are sketched below.
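For reference, a minimal sketch of these metric components is given below. The LCS-based ROUGE-L F1 and the set-based term F1 follow their standard definitions; `embed` stands for any sentence-embedding function (an assumption, e.g., a sentence-transformer encode call), and the default weight values shown are placeholders rather than the tuned weights.

```python
import numpy as np

def rouge_l_f1(gold: str, resp: str) -> float:
    """Character-level ROUGE-L F1 via longest common subsequence."""
    m, n = len(gold), len(resp)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if gold[i] == resp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

def cosine_sim(e1, e2) -> float:
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def term_f1(input_terms: set, response_terms: set) -> float:
    """F1 between terms expected from the input/history and terms produced."""
    tp = len(input_terms & response_terms)
    if tp == 0:
        return 0.0
    p, r = tp / len(response_terms), tp / len(input_terms)
    return 2 * p * r / (p + r)

def keyword_overlap(user_tokens, resp_tokens, stopwords) -> float:
    """Share of (stop-word-filtered) user keywords echoed in the response."""
    u = [t for t in user_tokens if t not in stopwords]
    if not u:
        return 0.0
    r = set(resp_tokens) - set(stopwords)
    return sum(t in r for t in u) / len(u)

def accuracy(gold, resp, embed, w1=0.5, w2=0.5):   # w1 + w2 = 1; values are placeholders
    return w1 * rouge_l_f1(gold, resp) + w2 * cosine_sim(embed(gold), embed(resp))
```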
4. Experiments
4.1. Datasets Employed
- MKQA Dataset: Used for open-ended QA tasks. This dataset was derived from the medicalBook_zh_qa subset of ApolloCorpus [21], a large-scale Chinese corpus for medical foundation models. It comprises 880 meticulously processed QA pairs, divided into a training set of 800 and a test set of 80.
- MCQ Dataset: Employed for MCQ tasks. This dataset was constructed by processing the medicalExam_zh_clean subset from ApolloCorpus. It consists of 1540 board-exam style questions, partitioned into a training set of 1400 and a test set of 140.
- Doctor-Patient Dialogue Dataset: Dedicated to multi-turn dialogue tasks, this dataset was curated from the CMtMedQA dataset [22], a collection of medical question-answering dialogues. It encompasses 660 curated conversations, each averaging approximately 8.8 turns, systematically split into a training set of 600 conversations and a test set of 60 conversations.
4.2. Auxiliary Resources
- Medical Terminology Vocabulary: A comprehensive list of medical terms, compiled by merging publicly available medical word lists from two sources: QASystemOnMedicalGraph [23] and Chinese Medical Words [24]. This merged vocabulary contains 132,751 unique terms. It serves as a lookup dictionary to identify and count specific medical terms in both user inputs and LLM responses for the Terminology Coverage metric.
- Chinese Stopwords List: A predefined list of common Chinese stop words, obtained from a publicly available collection at GitHub repository “33211/stopwords” [25]. These stopwords are utilized to filter out common and uninformative words during the calculation of keyword overlap in the Contextual Relevance metric.
4.3. Experimental Environment
4.4. Metrics
- Accuracy (Acc): Quantifies the semantic similarity and factual correctness of the LLM’s response compared to the gold reference. For open-ended QA tasks, it is measured as a weighted average (as defined in Equation (7)). For MCQs, it is a binary score. This metric directly contributes to the immediate reward, and its average across all evaluated samples/turns is reported as the overall factual performance.
- Safety (Safe): A binary score (as defined in Section 3.5) indicating whether the LLM’s response adheres to medical safety guidelines. This score is a direct component of the immediate reward. For the final evaluation, we report the average Safety score.
- Medical Terminology Coverage (Term): Measures the F1-score of the relevant medical terms present in the LLM’s response (as defined in Section 3.5). This metric contributes to the immediate reward, and its average across all evaluated samples/turns reflects the overall appropriate use of medical vocabulary.
- Contextual Relevance (Rel): This metric assesses how well the LLM’s response addresses the most recent user query and aligns with the dialogue’s overall context. It is primarily applicable in multi-turn dialogue settings. It is evaluated as a weighted combination of keyword overlap and semantic similarity (as defined in Equation (8)). The Contextual Relevance score contributes to the immediate reward, and its average across all evaluated dialogue turns is reported for overall conversational coherence.
4.5. RL Training Details
- Main Reward Weights: For multi-turn dialogue tasks, the composite reward combines $\lambda_{\text{safe}}$ (Safety), $\lambda_{\text{acc}}$ (Accuracy), $\lambda_{\text{rel}}$ (Contextual Relevance), and $\lambda_{\text{term}}$ (Medical Terminology Coverage), chosen to place a balanced emphasis on safe and accurate interactions along with professional terminology, recognizing the conversational and sensitive nature of doctor-patient exchanges. For MCQ tasks, only $\lambda_{\text{safe}}$, $\lambda_{\text{acc}}$, and $\lambda_{\text{term}}$ were used, with Medical Terminology Coverage receiving the highest weight, reflecting the paramount importance of precise domain knowledge in board-exam-style questions, where accurate term recognition and utilization are critical. For MKQA tasks, $\lambda_{\text{safe}}$, $\lambda_{\text{acc}}$, and $\lambda_{\text{term}}$ were likewise used, with accuracy assigned the highest weight, as factual correctness and direct answer fidelity are the primary objectives in single-turn factual question answering. Note: $\lambda_{\text{rel}}$ is not applicable for single-turn tasks (a configuration sketch with placeholder values follows this list).
- Accuracy Metric Weights (for open-ended QA, Equation (7)): $w_1$ (ROUGE-L F1) and $w_2$ (CosineSim).
- Contextual Relevance Metric Weights (for multi-turn dialogue, Equation (8)): $w_3$ (KeywordOverlap) and $w_4$ (CosineSim).
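The task-specific weighting can be captured in a small configuration table, sketched below. The numeric weight values from the original article are not reproduced in this text, so the numbers in this sketch are purely hypothetical placeholders that only respect the stated ordering (Safety/Accuracy emphasized for dialogue, Term highest for MCQ, Acc highest for MKQA).

```python
# Hypothetical placeholder weights; each task's weights sum to 1.
REWARD_WEIGHTS = {
    "dialogue": {"safe": 0.3, "acc": 0.3, "rel": 0.2, "term": 0.2},
    "mcq":      {"safe": 0.2, "acc": 0.3, "term": 0.5},   # no Rel for single-turn
    "mkqa":     {"safe": 0.2, "acc": 0.5, "term": 0.3},
}

def composite_reward(metrics: dict, task: str) -> float:
    """Weighted sum of per-metric scores for the given task."""
    w = REWARD_WEIGHTS[task]
    return sum(w[k] * metrics[k] for k in w)
```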
4.6. Evaluation Results and Analysis
- Fixed Strategy (Baseline): This method consistently employs a single, predefined prompt strategy (the “Role-based/General Instruction” prompt) for all LLM interactions, serving as a static baseline representing a common, non-adaptive prompting approach.
- Random Strategy (Random): This method randomly selects a prompt strategy from the predefined pool for each LLM interaction, representing a non-intelligent, purely stochastic approach.
- Rule-based Strategy (Rule-based): This adaptive baseline employs a set of hand-crafted heuristic rules to select a prompt strategy for each LLM interaction. The rules prioritize the prompt selection based on observable features of the current input, including its length and medical terminology density. For example, in MKQA tasks, the strategy might prioritize a “Safety-focused” prompt for potentially risky queries, a “Chain-of-Thought” prompt for complex, terminology-rich questions, or a “Patient-centric” prompt for very simple, non-technical queries.
  - A CoT prompt is chosen for questions with high complexity (normalized question length > 0.6 and normalized term density > 0.5).
  - A Terminology-rich Communication prompt is selected if the term density is notably high (normalized term density > 0.7).
  - A Patient-centric/Layperson Explanation prompt is applied for very simple, non-technical queries (normalized question length < 0.3 and normalized term density < 0.2).
  - Otherwise, a default Role-based/General Instruction prompt is used (these rules are transcribed in the code sketch after this list).
- RL Agent Strategy (RL): This is our proposed method, where a trained DQN agent dynamically selects the optimal prompt strategy based on the current state (task features and historical performance) for each LLM interaction.
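The heuristic selector can be transcribed directly from the rules above. The thresholds are the ones stated in the list; the action identifiers and function name are assumptions.

```python
def rule_based_action(q_len: float, term_density: float) -> str:
    """Select a prompt strategy; both inputs are normalized to [0, 1]."""
    if q_len > 0.6 and term_density > 0.5:
        return "cot"             # complex, terminology-rich question
    if term_density > 0.7:
        return "terminology"     # notably high term density
    if q_len < 0.3 and term_density < 0.2:
        return "layperson"       # very simple, non-technical query
    return "role_based"          # default strategy
```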
4.7. Cross-Model Applicability
4.8. Comparison with All Static Prompt Strategies
4.9. Performance Comparison of Diverse Chinese LLMs Under Learned RL Policy
4.9.1. Evaluated LLM Models
4.9.2. Evaluation Protocol
4.9.3. Results on Diverse LLMs
5. Discussion
5.1. Strengths and Contributions
- Novel Reinforcement Learning Approach for LLM Evaluation: To our knowledge, this study represents a pioneering effort in leveraging reinforcement learning for automated and adaptive prompt selection, particularly within the rigorous domain of medical Large Language Model evaluation. This adaptive mechanism facilitates the dynamic optimization of LLM prompting strategies, demonstrating its potential to move beyond static or manually tuned approaches.
- Model-Agnostic and Broadly Applicable Framework: The proposed RL-based prompt policy learning framework exhibits strong model-agnostic properties [40]. As evidenced by our LLM evaluation under the learned RL policy, a policy trained on the DeepSeek-V3-0324 model can be applied effectively to the GPT-4.1 model, allowing us to observe its performance under an optimized prompting regime. This highlights the framework’s applicability and potential transferability across different underlying LLM architectures.
- Enhanced Performance in Key Medical Metrics: The RL-driven prompt selection consistently optimizes the composite reward function, achieving higher overall scores compared to baseline and random strategies. Notably, the framework significantly enhances the safety of LLM responses across all tasks, frequently reaching near-perfect or perfect scores, which is critical in healthcare applications. It also results in substantial gains in Medical Terminology Coverage across all tasks, contributing to the informativeness of the generated content. While varying across tasks, an improvement in general Accuracy is observed in the MKQA task.
5.2. Limitations
- Fixed Prompt Pool and Expressivity: The current prompt pool consists of a predefined, fixed set of five manually engineered prompting strategies. This inherent limitation restricts the expressive power of the agent’s actions, as it cannot explore or generate novel prompting techniques beyond these pre-engineered templates, nor can it dynamically combine elements from different strategies. Such a reliance on a constrained, pre-determined action space inevitably places an upper bound on the achievable performance, particularly in highly variable and complex medical scenarios where the ideal prompt might lie outside the current range. A more expressive action space, potentially involving prompt generation or dynamic prompt composition, could further unlock LLM capabilities.
- Reward Proxy and Human Judgment Alignment: The composite reward function, while multi-faceted, acts as a proxy for true human judgment of LLM response quality. This presents a significant challenge, as automated metrics inherently lack the capacity to fully capture the subjective, ethical, and highly contextual nuances of human-LLM interaction in healthcare. Despite its comprehensive design, there remains a potential for misalignment between the computationally derived metrics and nuanced human perception, especially concerning empathy, ethical considerations, and the appropriateness of multi-turn dialogue flow. In a high-stakes domain like medicine, where suboptimal or inappropriate responses can have severe consequences, ensuring robust alignment with expert human judgment is a critical concern that current automated proxies can only partially address.
- Generalization to Unseen/Rare Contexts: While cross-model applicability of our prompt policy has been demonstrated, the policy’s ability to adapt to entirely new or rare medical contexts, unseen during dataset curation, requires further rigorous validation. This limitation is particularly relevant in dynamic clinical environments, where highly specialized or nuanced scenarios often involve sparse data, posing challenges not only for RL policy adaptation but also for the underlying LLMs themselves. The effectiveness in such situations needs dedicated investigation.
5.3. Future Work
- Advanced RL Architectures: Exploring more sophisticated reinforcement learning architectures, such as Multi-Agent reinforcement learning (MARL) [41] or Retrieval-Augmented reinforcement learning [42], could enable the agent to learn more complex, multi-level prompting strategies or to quickly adapt to new LLMs and tasks with minimal retraining.
- Dynamic Prompt Generation and Expansion: Moving beyond a fixed prompt pool, future work could investigate methods for the agent to dynamically construct or expand its prompt space. This may involve techniques like prompt mutation [43], prompt distillation [44], or integrating a meta-LLM to generate novel prompts [45], significantly enhancing the expressivity of the learned policy.
- Human-in-the-Loop Reward Mechanisms: To address the potential misalignment of automated reward proxies, we propose prioritizing the integration of human feedback directly into the reward signal in future work. This could involve methods such as reinforcement learning from human feedback (RLHF), which is particularly well-suited for capturing nuanced human judgments and ethical considerations in complex domains like medical dialogues, thereby leading to policies better aligned with these subtle and critical aspects.
- Cross-Lingual and Multimodal Generalization: Expanding the framework to support cross-lingual medical LLMs and multimodal inputs (e.g., incorporating medical images or patient physiological data) would significantly broaden its applicability and impact. This extension would enable the development of more comprehensive and holistic medical AI systems, and simultaneously facilitate their robust evaluation across diverse languages and data modalities, addressing real-world complexities in global healthcare.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 2024, 630, 625–630.
2. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274.
3. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X.; et al. The application of large language models in medicine: A scoping review. iScience 2024, 27, 109713.
4. Bonaca, M.P.; Lang, N.N.; Chen, A.; Amiri-Kordestani, L.; Lipka, L.; Zwiewka, M.; Strnadova, C.; Klaar, S.; Dent, S.; Janicijevic, T.K.; et al. Cardiovascular safety in oncology clinical trials: JACC: CardioOncology Primer. Cardio Oncol. 2025, 7, 83–95.
5. Zamorano, J.L.; Gottfridsson, C.; Asteggiano, R.; Atar, D.; Badimon, L.; Bax, J.J.; Cardinale, D.; Cardone, A.; Feijen, E.A.M.; Ferdinandy, P.; et al. The cancer patient and cardiology. Eur. J. Heart Fail. 2020, 22, 2290–2309.
6. Zaghir, J.; Naguib, M.; Bjelogrlic, M.; Névéol, A.; Tannier, X.; Lovis, C. Prompt engineering paradigms for medical applications: Scoping review. J. Med. Internet Res. 2024, 26, e60501.
7. Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study. JMIR Med. Inform. 2024, 12, e55318.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
9. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950.
10. Liu, X.; Liu, H.; Yang, G.; Jiang, Z.; Cui, S.; Zhang, Z.; Wang, H.; Tao, L.; Sun, Y.; Song, Z.; et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 2025, 31, 932–942.
11. Michalopoulos, G.; Williams, K.; Singh, G.; Lin, T. MedicalSum: A guided clinical abstractive summarization model for generating medical reports from patient-doctor conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4741–4749.
12. Görtz, M.; Baumgärtner, K.; Schmid, T.; Muschko, M.; Woessner, P.; Gerlach, A.; Byczkowski, M.; Sültmann, H.; Duensing, S.; Hohenfellner, M. An artificial intelligence-based chatbot for prostate cancer education: Design and patient evaluation study. Digit. Health 2023, 9, 20552076231173304.
13. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 453–465.
14. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
15. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
16. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
17. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
18. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. Adv. Neural Inf. Process. Syst. 2017, 30.
19. Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI feedback. arXiv 2022, arXiv:2212.08073.
20. Wang, X.; Li, C.; Wang, Z.; Bai, F.; Luo, H.; Zhang, J.; Jojic, N.; Xing, E.P.; Hu, Z. PromptAgent: Strategic planning with language models enables expert-level prompt optimization. arXiv 2023, arXiv:2310.16427.
21. Wang, X.; Chen, N.; Chen, J.; Hu, Y.; Wang, Y.; Wu, X.; Gao, A.; Wan, X.; Li, H.; Wang, B. Apollo: Lightweight multilingual medical LLMs towards democratizing medical AI to 6B people. arXiv 2024, arXiv:2403.03640.
22. Yang, S.; Zhao, H.; Zhu, S.; Zhou, G.; Xu, H.; Jia, Y.; Zan, H. Zhongjing: Enhancing the Chinese medical capabilities of large language models through expert feedback and real-world multi-turn dialogue. arXiv 2023, arXiv:2308.03549.
23. Chen, Z. QASystemOnMedicalGraph: A Medical Knowledge Graph Based Question Answering System. 2018. Available online: https://github.com/zhihao-chen/QASystemOnMedicalGraph (accessed on 2 December 2025).
24. Xtea. Chinese Medical Words. 2020. Available online: https://github.com/xtea/chinese_medical_words (accessed on 2 December 2025).
25. 33211. Chinese Stopwords List. 2018. Available online: https://github.com/33211/stopwords (accessed on 2 December 2025).
26. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
27. OpenAI. Introducing GPT-4.1 in the API. 2025. Available online: https://openai.com/index/gpt-4-1/ (accessed on 2 December 2025).
28. Buitrago, P.A.; Nystrom, N.A. Open Compass: Accelerating the adoption of AI in open research. In Proceedings of the Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (Learning), Chicago, IL, USA, 28 July–1 August 2019; pp. 1–9.
29. Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Fu, Y.; et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Adv. Neural Inf. Process. Syst. 2023, 36, 62991–63010.
30. Zhang, N.; Chen, M.; Bi, Z.; Liang, X.; Li, L.; Shang, X.; Yin, K.; Tan, C.; Xu, J.; Huang, F.; et al. CBLUE: A Chinese biomedical language understanding evaluation benchmark. arXiv 2021, arXiv:2106.08087.
31. Xu, L.; Li, A.; Zhu, L.; Xue, H.; Zhu, C.; Zhao, K.; He, H.; Zhang, X.; Kang, Q.; Lan, Z. SuperCLUE: A comprehensive Chinese large language model benchmark. arXiv 2023, arXiv:2307.15020.
32. Liao, Y.; Jiang, S.; Wang, Y.; Wang, Y. MING-MOE: Enhancing medical multi-task learning in large language models with sparse mixture of low-rank adapter experts. arXiv 2024, arXiv:2404.09027.
33. Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv 2024, arXiv:2412.18925.
34. Luo, L.; Ning, J.; Zhao, Y.; Wang, Z.; Ding, Z.; Chen, P.; Fu, W.; Han, Q.; Xu, G.; Qiu, Y.; et al. Taiyi: A bilingual fine-tuned large language model for diverse biomedical tasks. J. Am. Med. Inform. Assoc. 2024, 31, 1865–1874.
35. Zhang, X.; Xue, K.; Zhang, S. PULSE: Pretrained and Unified Language Service Engine. 2023. Available online: https://github.com/openmedlab/PULSE (accessed on 2 December 2025).
36. Hu, S.; Tu, Y.; Han, X.; He, C.; Cui, G.; Long, X.; Zheng, Z.; Fang, Y.; Huang, Y.; Zhao, W.; et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv 2024, arXiv:2404.06395.
37. Young, A.; Chen, B.; Li, C.; Huang, C.; Zhang, G.; Zhang, G.; Wang, G.; Li, H.; Zhu, J.; Chen, J.; et al. Yi: Open foundation models by 01.AI. arXiv 2024, arXiv:2403.04652.
38. DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948.
39. Cui, Y.; Yang, Z.; Yao, X. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv 2023, arXiv:2304.08177.
40. Ribeiro, M.T.; Singh, S.; Guestrin, C. Model-agnostic interpretability of machine learning. arXiv 2016, arXiv:1606.05386.
41. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Berlin/Heidelberg, Germany, 2021; pp. 321–384.
42. Goyal, A.; Friesen, A.; Banino, A.; Weber, T.; Ke, N.R.; Badia, A.P.; Guez, A.; Mirza, M.; Humphreys, P.C.; Konyushova, K.; et al. Retrieval-augmented reinforcement learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 7740–7765.
43. Fernando, C.; Banarse, D.; Michalewski, H.; Osindero, S.; Rocktäschel, T. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv 2023, arXiv:2309.16797.
44. Li, L.; Zhang, Y.; Chen, L. Prompt distillation for efficient LLM-based recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 1348–1357.
45. Zhang, Y.; Yuan, Y.; Yao, A.C. Meta prompting for AI systems. arXiv 2023, arXiv:2311.11482.



[Tables: per-task comparisons of the Baseline, Random, Rule-based, and RL strategies, reporting composite Reward (with Std) and per-metric scores (Acc, Safe, Term, Rel; Rel applies to the Dialogue task only) across MKQA, MCQ, and Dialogue; the numeric cell values were not preserved in this version of the text.]
[Table: composite reward of each static prompt strategy (Role-based, Safety-focused, Terminology-rich, CoT Reasoning, Patient-centric) and the RL Agent Strategy on MKQA, MCQ, and Dialogue, for DeepSeek-V3-0324, GPT-4.1, and Gemini-2.0-flash; numeric values not preserved.]
[Table: Reward, Acc, Safe, Term (plus Rel for Dialogue) on MKQA, MCQ, and Dialogue under the learned RL policy, for medical LLMs (MING-7B, HuatuoGPT-o1-7B, Taiyi-LLM, PULSE-7bv5) and general LLMs (MiniCPM3-4B, Yi-1.5-9B-Chat, DeepSeek-R1-0528-Qwen3-8B, llama-3-chinese-8b-instruct-v3); numeric values not preserved.]