1. Introduction
Prompt engineering has become crucial for effectively adapting large language models to specific downstream tasks without extensive model retraining [1,2,3]. However, manually crafting suitable prompts is highly dependent on domain expertise and often requires significant trial and error, making this approach inefficient and unscalable [4,5]. Thus, automating prompt optimization has emerged as a critical research direction to overcome these limitations.
Existing automated prompt optimization approaches mainly involve gradient-based continuous optimization, heuristic discrete search, and reinforcement learning (RL) [6,7,8]. Gradient-based methods optimize prompts efficiently but require access to internal model gradients or probability distributions, restricting their application to open-source or white-box LLMs [9,10]. Conversely, heuristic discrete search methods, such as evolutionary algorithms, do not rely on gradients and thus support black-box models; however, these methods suffer from limited exploration and frequently converge to suboptimal solutions [11]. Reinforcement learning provides a promising direction by naturally balancing exploration and exploitation, yet existing RL-based approaches rarely leverage domain-specific constraints, resulting in potential semantic incoherence [9,12]. Additionally, the high-dimensional discrete search space poses significant challenges for effective prompt optimization.
To address these challenges, we propose a Domain-Aware Reinforcement Learning framework for Prompt Optimization (DA-RLPO). We model discrete prompt editing as a Markov Decision Process (MDP). Our method leverages deep Q-learning combined with structured domain knowledge to systematically constrain the search space of candidate edits, thereby improving both search efficiency and optimization quality. An entropy-regularized reward function ensures balanced exploration and exploitation during prompt refinement.
The main contributions of our paper are summarized as follows. First, we propose DA-RLPO, a reinforcement learning framework for discrete prompt optimization. The framework incorporates domain-aware constraints and uses entropy-based reward regularization, enabling effective optimization without gradient information. Second, we introduce a structured knowledge base to guide prompt editing, ensuring semantic coherence and domain relevance. Finally, our extensive experiments demonstrate that DA-RLPO consistently outperforms baselines on text classification tasks, maintains robust performance under limited query budgets, and generalizes effectively to text-to-image generation and reasoning tasks.
3. Methods
3.1. Overview
We propose a reinforcement learning framework for prompt optimization, in which the rewriting process is formalized as a Markov Decision Process with discrete phrase-level actions. The agent iteratively refines an initial prompt via edit operations, guided by task performance and structural constraints. The framework consists of five core components: (i) a structured, domain-aware knowledge base that provides candidate phrases for editing, filtered by syntactic, semantic, and contextual criteria; (ii) an action execution module that applies four discrete edit operations (add, delete, substitute, and swap), with the set of feasible actions at each step determined by the current prompt state; (iii) a state representation module that encodes both semantic and statistical features of recent task performance; (iv) a Deep Q-Network that learns the editing policy by minimizing the temporal-difference error, employing experience replay and periodic target network synchronization; and (v) a composite reward function that balances accuracy improvement with the entropy-based measurement of action diversity.
At each step, the agent encodes the current state, determines the set of valid actions, selects an action according to an $\epsilon$-greedy policy, and applies the corresponding edit. The new prompt and resulting state are evaluated by the reward function, and the transition is stored in the experience replay buffer. Mini-batches are sampled uniformly from the buffer for Q-network updates, and the parameters of the target network are synchronized with the online network at regular intervals. The overall optimization proceeds until a maximum number of episodes is reached. The complete procedure is summarized in Algorithm 1.
Algorithm 1 Domain-Aware RL-based Prompt Optimization Framework
Require: Initial prompt $p_0$, editing environment $\mathcal{E}$, knowledge base $\mathcal{K}$, reward function $R$, number of episodes $N$, max steps per episode $T$
Ensure: Optimized prompt $p^{*}$
1: Initialize Q-network with random weights $\theta$
2: Initialize target network ($\theta^{-} \leftarrow \theta$)
3: Initialize experience replay buffer $\mathcal{D}$
4: for episode $= 1$ to $N$ do
5:    $p \leftarrow p_0$
6:    for step $= 1$ to $T$ do
7:        Encode normalized state $s$ from $p$
8:        Determine valid action set $\mathcal{A}_{\mathrm{valid}}(s)$
9:        Select $a$ via $\epsilon$-greedy policy over $\mathcal{A}_{\mathrm{valid}}(s)$
10:       Apply $a$ to $p$ using $\mathcal{E}$ and $\mathcal{K}$ to obtain $p'$, $s'$
11:       Compute reward $r = R(s, a)$
12:       Store $(s, a, r, s')$ in buffer $\mathcal{D}$
13:       Sample mini-batch $\mathcal{B} \subset \mathcal{D}$
14:       Update $\theta$ using TD loss on $\mathcal{B}$
15:       if step $\bmod C = 0$ then
16:           Update target network: $\theta^{-} \leftarrow \theta$
17:       end if
18:       $p \leftarrow p'$, $s \leftarrow s'$
19:   end for
20: end for
21: return final optimized prompt $p^{*}$
3.2. Domain-Aware Decision Flow
This section presents a detailed description of the domain-aware decision flow underlying DA-RLPO, illustrating the integration of domain-awareness into the prompt optimization process. As illustrated in Figure 1, the approach centers on a domain-aware engine. This engine processes the initial task prompt and incorporates both semantic and rule-based information from a structured knowledge base, enabling accurate recognition of domain relevance and guiding subsequent editing decisions. Specifically, the input prompt is first processed by a domain identification module, which determines the specific domain associated with the current task. Based on the identified domain, the candidate action space for editing is further constrained by domain knowledge, which effectively reduces the search space and improves the efficiency of the editing process. The domain-aware engine interacts closely with the knowledge base to retrieve and generate candidate phrases that are contextually appropriate for the task, taking into account both semantic similarity and domain-specific rules and metadata. These domain-relevant phrases are then used by the editing module for insertion or substitution, resulting in an optimized prompt with explicit domain characteristics. This decision flow highlights the core feature of the proposed method: prompt editing is achieved through systematic incorporation of domain knowledge and semantic constraints, allowing the framework to produce prompts that are both efficient and contextually appropriate for the identified task domain.
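To make this flow concrete, the following Python sketch shows one minimal way such a domain-aware engine could be organized. The keyword-based domain identification, the DOMAIN_KEYWORDS table, and the knowledge-base layout are illustrative assumptions rather than the paper's exact implementation, which also leverages semantic similarity and rule-based metadata.

```python
# Illustrative sketch of the domain-aware decision flow. The keyword table,
# function names, and knowledge-base layout are hypothetical; the actual
# engine also uses semantic similarity and rule-based metadata.
from typing import Dict, List

DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "medical":   ["patient", "diagnosis", "symptom", "treatment"],
    "financial": ["loan", "interest", "portfolio", "revenue"],
    "general":   [],  # fallback when no domain-specific keyword matches
}

def identify_domain(prompt: str) -> str:
    """Pick the domain whose keyword list overlaps most with the prompt."""
    tokens = set(prompt.lower().split())
    scores = {d: len(tokens & set(kw)) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def constrain_candidates(knowledge_base: Dict[str, List[str]],
                         prompt: str) -> List[str]:
    """Restrict the candidate phrase pool to the identified domain."""
    domain = identify_domain(prompt)
    return knowledge_base.get(domain, []) + knowledge_base.get("general", [])
```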
3.3. Knowledge Base
As a foundational component of the domain-aware decision flow described above, we construct a structured knowledge base $\mathcal{K}$ to enable phrase-level prompt editing under the reinforcement learning framework. Each element in $\mathcal{K}$ represents a syntactically coherent phrase segment that can be added to or substituted into a prompt, ensuring semantic consistency and domain appropriateness.
The structure of $\mathcal{K}$ represents domain-specific knowledge and syntactic granularity. As shown in Table 1, phrases are organized by high-level application domains such as general, medical, and financial. Each phrase is further annotated with its syntactic type, including noun phrases (NPs), verb phrases (VPs), and conditional clauses. This organization enables controlled and flexible phrase-level modifications during editing, while preserving fluency.
Each phrase in $\mathcal{K}$ is associated with structured metadata describing its properties. These attributes include a numerical formality score reflecting stylistic tone, a syntactic category label that specifies whether the phrase is a noun phrase, verb phrase, or another type, and a set of context tags that indicate appropriate usage scenarios. Additional metadata fields may capture information such as supporting literature references, severity levels, or legal applicability. This rich annotation enables the editing policy to systematically filter or prioritize candidate phrases based on contextual requirements and linguistic constraints.
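As a rough illustration of how such annotated entries might be represented, the following Python sketch defines a phrase record with the metadata fields described above; the field names and the example phrases are hypothetical, not taken from the actual knowledge base.

```python
# Minimal sketch of a knowledge-base entry; field names and example phrases
# are illustrative, not the paper's exact schema or contents.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhraseEntry:
    text: str                 # the phrase segment itself
    domain: str               # e.g., "general", "medical", "financial"
    syntax_type: str          # e.g., "NP", "VP", "conditional_clause"
    formality: float          # numerical formality score (stylistic tone)
    context_tags: List[str]   # appropriate usage scenarios
    extra: Dict[str, str] = field(default_factory=dict)  # references, severity, legal applicability, ...

# Example entries, organized by domain as in Table 1 (contents are invented).
KNOWLEDGE_BASE = [
    PhraseEntry("based on the patient's reported symptoms", "medical",
                "conditional_clause", 0.8, ["clinical", "diagnosis"]),
    PhraseEntry("taking current market volatility into account", "financial",
                "VP", 0.7, ["risk", "forecast"]),
]
```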
Candidate retrieval is governed by a set of rule-based criteria defined over $\mathcal{K}$. These include syntactic matching constraints and fuzzy context alignment. For substitution operations, phrases must match in syntactic type, and contextual relevance is estimated by comparing metadata tags with the current prompt window.
To guarantee semantic appropriateness and mitigate redundancy, candidate phrases retrieved from $\mathcal{K}$ are further filtered according to both semantic similarity and contextual alignment. For each candidate phrase $c$ from the knowledge base, its cosine similarity to an existing phrase $p_i$ in the current prompt is computed as
$$\mathrm{sim}(c, p_i) = \frac{\mathbf{e}(c) \cdot \mathbf{e}(p_i)}{\lVert \mathbf{e}(c) \rVert \, \lVert \mathbf{e}(p_i) \rVert},$$
where $\mathbf{e}(p)$ denotes the embedding vector of phrase $p$, obtained using the sentence encoder. Only candidate phrases satisfying $\mathrm{sim}(c, p_i) < \tau$ for every existing phrase $p_i$, where $\tau$ is a similarity threshold, are retained for editing, ensuring sufficient dissimilarity from existing prompt content. This mechanism discourages the insertion of near-duplicate phrases and enhances diversity among generated prompt variants. Additionally, contextual tags and keyword overlap are leveraged to further prioritize phrases that best match the intended usage scenario of the current prompt window.
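A minimal sketch of this similarity filter is shown below, assuming the sentence-transformers library with the 384-dimensional all-MiniLM-L6-v2 encoder and an illustrative threshold value; the paper's exact encoder, threshold, and tag-matching logic may differ.

```python
# Sketch of the similarity-based candidate filter; the encoder model and the
# threshold tau are assumptions for illustration.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_candidates(candidates, prompt_phrases, tau=0.8):
    """Keep candidates sufficiently dissimilar to every phrase already in the prompt."""
    cand_emb = encoder.encode(candidates)
    prompt_emb = encoder.encode(prompt_phrases)
    kept = []
    for cand, ce in zip(candidates, cand_emb):
        if all(cosine(ce, pe) < tau for pe in prompt_emb):
            kept.append(cand)
    return kept
```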
Figure 2 illustrates the overall architecture of $\mathcal{K}$. During training, $\mathcal{K}$ enables the agent to explore phrase-level variations within a constrained yet diverse action space. The integration of this structured knowledge base with reinforcement learning is further detailed in the following section.
3.4. Reinforcement Learning for Prompt Editing
This section details the reinforcement learning framework adopted for prompt editing. The optimization procedure is formulated as a Markov Decision Process. The state space integrates semantic and statistical features. The action space consists of discrete phrase-level editing operations, which are restricted by rule-based validity constraints. The reward function incorporates both task-specific performance and the entropy of the recent action distribution. Candidate edits are retrieved from the structured knowledge base $\mathcal{K}$ according to syntactic, semantic, and contextual relevance. Model training is conducted with a Deep Q-Network utilizing experience replay and periodic target network updates. An overview of the entire process is presented in Figure 3. The subsequent subsections formally define each component and provide implementation details.
3.4.1. Formulation as Markov Decision Process
We formulate the prompt optimization problem as a discrete-time finite Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where each element is explicitly characterized as follows:
State space ($\mathcal{S}$). The state $s_t$ at each time step $t$ encapsulates both semantic and statistical features of the current prompt. Specifically, $s_t$ consists of a semantic embedding vector obtained from Sentence-BERT, combined with statistical features reflecting recent task performance, including the running mean, standard deviation, and accuracy trend.
Action space ($\mathcal{A}$). The action space is defined as a discrete set of phrase-level editing operations applicable to the current prompt, namely,
$$\mathcal{A} = \{\texttt{add}, \texttt{delete}, \texttt{substitute}, \texttt{swap}\}.$$
Each action modifies the prompt by altering its constituent phrases based on predefined syntactic and semantic constraints detailed in the structured knowledge base $\mathcal{K}$.
State transition function ($P$). The state transition probability $P(s_{t+1} \mid s_t, a_t)$ characterizes the probability of reaching a subsequent state $s_{t+1}$ from state $s_t$ by applying action $a_t$. Given the deterministic nature of our editing operations, state transitions are implicitly defined by
$$s_{t+1} = f_{\mathcal{E}}(s_t, a_t),$$
where $f_{\mathcal{E}}$ denotes the phrase-level editing function executed within the editing environment $\mathcal{E}$, ensuring grammatical correctness and semantic consistency via rule-based validity checks.
Reward function ($R$). The reward function $R(s_t, a_t)$ quantitatively evaluates the effectiveness of prompt edits based on improvements in task-specific performance and the diversity of selected actions. Specifically, it is computed as
$$r_t = \Delta \mathrm{Acc}_t + \lambda \, H_t,$$
where $\Delta \mathrm{Acc}_t$ denotes the incremental change in accuracy due to action $a_t$, $\lambda$ is the entropy weight, and $H_t$ is the entropy of the action distribution, encouraging diverse and explorative editing behavior.
Discount factor ($\gamma$). The discount factor $\gamma$ is introduced to balance immediate and future rewards. In this study, $\gamma$ was set empirically, assigning considerable importance to future prompt refinement outcomes.
This MDP formulation allows us to utilize deep reinforcement learning techniques, specifically deep Q-learning, to learn an optimal policy for prompt editing by maximizing the expected cumulative reward over the editing horizon.
3.4.2. Hybrid State Encoding
The state representation is composed of the semantic embedding of the current prompt and statistical features derived from task performance. The semantic embedding $\mathbf{e}_t$ is obtained using Sentence-BERT, producing a 384-dimensional vector that captures the overall meaning of the prompt. The statistical features include the running mean of task accuracy over the last $t$ time-steps ($\mu_t$), the standard deviation of task accuracy ($\sigma_t$), and the trend, calculated as the difference in task accuracy over the most recent steps ($\delta_t$). The complete state vector is obtained by concatenating the semantic embedding and the statistical features:
$$s_t = \left[\, \mathbf{e}_t \,;\, \mu_t \,;\, \sigma_t \,;\, \delta_t \,\right].$$
To ensure stable training, the state features are normalized dynamically, using running statistics. For each dimension of $s_t$, the running mean $\bar{\mu}$ and running standard deviation $\bar{\sigma}$ are recursively updated at every step according to
$$\bar{\mu} \leftarrow (1 - \alpha)\,\bar{\mu} + \alpha\, s_t, \qquad \bar{\sigma}^{2} \leftarrow (1 - \alpha)\,\bar{\sigma}^{2} + \alpha\,(s_t - \bar{\mu})^{2},$$
where $\alpha$ is the update rate, and all operations are performed element-wise. The normalized state vector is then computed as
$$\tilde{s}_t = \frac{s_t - \bar{\mu}}{\bar{\sigma}}.$$
This adaptive normalization strategy enables the model to account for changes in the distribution of state features over time and maintain consistent input scaling throughout training.
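The following sketch illustrates one possible implementation of the hybrid state encoding with running normalization; the encoder model, window length, and update rate are assumptions for illustration rather than the paper's reported settings.

```python
# Minimal sketch of the hybrid state encoder with running normalization
# (Section 3.4.2). Model choice, alpha, and window size are assumed values.
from collections import deque
from sentence_transformers import SentenceTransformer
import numpy as np

class StateEncoder:
    def __init__(self, alpha: float = 0.01, window: int = 10):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim Sentence-BERT
        self.acc_history = deque(maxlen=window)                 # recent task accuracies
        self.alpha = alpha
        self.mu = None       # running mean per state dimension
        self.var = None      # running variance per state dimension

    def encode(self, prompt: str, accuracy: float) -> np.ndarray:
        self.acc_history.append(accuracy)
        accs = np.array(self.acc_history)
        stats = np.array([accs.mean(), accs.std(), accs[-1] - accs[0]])  # mean, std, trend
        s = np.concatenate([self.encoder.encode(prompt), stats])

        # Element-wise running normalization via exponential moving estimates.
        if self.mu is None:
            self.mu, self.var = s.copy(), np.ones_like(s)
        self.mu = (1 - self.alpha) * self.mu + self.alpha * s
        self.var = (1 - self.alpha) * self.var + self.alpha * (s - self.mu) ** 2
        return (s - self.mu) / (np.sqrt(self.var) + 1e-8)
```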
3.4.3. Action Selection Policy
At each time-step, the agent selects an editing action from the set of feasible actions, which consists of add, substitute, delete, and swap operations. Figure 4 presents representative candidate prompts generated by applying each editing operation to a common initial prompt. These operations are designed to produce minimal but semantically meaningful changes to the prompt. The add action inserts a new phrase into the prompt, drawn either from the knowledge base $\mathcal{K}$ or from the deleted sequence. The delete action removes an existing phrase from the prompt, while the substitute action replaces an existing phrase with a new one from $\mathcal{K}$. Finally, the swap action exchanges two phrases within the prompt. Action selection is driven by a Deep Q-Network, which learns the action-value function $Q(s, a; \theta)$. The action selection is guided by an $\epsilon$-greedy policy, in which a random valid action is chosen with probability $\epsilon$, and the action maximizing the current Q-value is selected with probability $1 - \epsilon$.
The set of available actions at each time-step is not fixed, but rather determined by the current state. This restriction is implemented by a validity function that encodes the prompt structure and linguistic constraints. Formally, the set of valid actions is defined as
$$\mathcal{A}_{\mathrm{valid}}(s) = \{\, a \in \mathcal{A} \;:\; V(s, a) = 1 \,\},$$
where $V(s, a)$ is a binary function that returns 1 if action $a$ is valid in state $s$ and 0 otherwise. The function $V$ is implemented as a set of rule-based checks, including phrase presence, syntactic compatibility, entity type filtering, and grammar verification. For example, a delete action is allowed only if a non-essential phrase exists, while a swap action requires at least two eligible phrases. At each decision step, the agent samples actions exclusively from $\mathcal{A}_{\mathrm{valid}}(s)$ under the current policy. This state-dependent constraint on the action space ensures that only feasible and linguistically appropriate edits are considered, thereby reducing the risk of ungrammatical or semantically inconsistent prompt generation.
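The sketch below illustrates the general idea of state-dependent action masking combined with $\epsilon$-greedy selection; the specific validity rules shown are simplified stand-ins for the paper's rule-based checks (phrase presence, syntactic compatibility, entity filtering, grammar verification).

```python
# Sketch of state-dependent action masking and epsilon-greedy selection.
# The validity rules are simplified placeholders, not the paper's full rule set.
import random

ACTIONS = ["add", "delete", "substitute", "swap"]

def valid_actions(prompt_phrases, candidates):
    """Return the subset of actions that is feasible for the current prompt."""
    valid = ["add"] if candidates else []
    if len(prompt_phrases) > 1:        # keep at least one phrase; allow delete
        valid.append("delete")
    if prompt_phrases and candidates:  # something to replace and a replacement available
        valid.append("substitute")
    if len(prompt_phrases) >= 2:       # swap needs two eligible phrases
        valid.append("swap")
    return valid

def select_action(q_values, valid, epsilon):
    """Epsilon-greedy over valid actions only (q_values indexed like ACTIONS)."""
    if random.random() < epsilon:
        return random.choice(valid)
    valid_idx = [ACTIONS.index(a) for a in valid]
    return ACTIONS[max(valid_idx, key=lambda i: q_values[i])]
```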
3.4.4. Reward Shaping
The reward function is constructed to guide the agent toward prompt edits that improve task performance. The primary component is the change in task accuracy $\Delta \mathrm{Acc}_t$, which quantifies the improvement achieved after each edit. To further encourage exploration and avoid premature convergence to repetitive behaviors, a diversity term based on the entropy of the action distribution is introduced. The total reward at time-step $t$ is given by
$$r_t = \Delta \mathrm{Acc}_t + \lambda \, H_t,$$
where $\Delta \mathrm{Acc}_t$ denotes the change in task accuracy, $\lambda$ is the entropy weight, and $H_t$ is the entropy of the empirical action distribution at time $t$.
The entropy regularization term $H_t$ promotes exploration by discouraging the policy from repeatedly selecting the same actions. Specifically, the entropy is computed as
$$H_t = - \sum_{a \in \mathcal{A}} p_t(a) \log p_t(a),$$
where $p_t(a)$ is the empirical probability of selecting action $a$ at time-step $t$. This probability is estimated from the action frequencies within a fixed-length sliding window up to time $t$:
$$p_t(a) = \frac{n_t(a)}{\sum_{a' \in \mathcal{A}} n_t(a')},$$
where $n_t(a)$ denotes the number of times action $a$ was selected in the recent window. The entropy term thereby regularizes the agent's behavior, encouraging broader exploration of the action space and preventing the learned policy from stagnating at suboptimal, repetitive strategies.
3.4.5. Training Mechanism
The DQN is trained using standard reinforcement learning techniques, including experience replay and periodic target network updates. At each step, the agent stores its interaction as a tuple $(s_t, a_t, r_t, s_{t+1})$ in an experience replay buffer $\mathcal{D}$:
$$\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1})\},$$
where $s_t$ is the state, $a_t$ the action, $r_t$ the reward, and $s_{t+1}$ the next state. During learning, mini-batches $\mathcal{B}$ are randomly and uniformly sampled from $\mathcal{D}$:
$$\mathcal{B} = \{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{B} \sim \mathrm{Uniform}(\mathcal{D}),$$
where $B$ denotes the batch size.
The loss function for training is the temporal difference (TD) error,
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big],$$
where $r$ is the observed reward, $\gamma$ is the discount factor, and $\theta^{-}$ denotes the parameters of the target Q-network.
To stabilize training, the parameters of the target network are periodically synchronized with those of the online network according to
$$\theta^{-} \leftarrow \theta \quad \text{every } C \text{ steps},$$
where $C$ is the target update period, $\theta$ denotes the parameters of the online Q-network, and $\theta^{-}$ those of the target network.
Additionally, the exploration rate $\epsilon$ decays exponentially with the training steps,
$$\epsilon_t = \max\big(\epsilon_{\min},\; \epsilon_0 \cdot \kappa^{t}\big),$$
where $\epsilon_0$ is the initial exploration rate, $\kappa$ is the decay factor, and $\epsilon_{\min}$ is the minimum value. This schedule ensures that the agent gradually shifts from exploration to exploitation as learning proceeds.
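The following PyTorch sketch ties the training mechanism together (experience replay, TD loss against a target network, periodic synchronization, and exponential epsilon decay). The network architecture, state dimension, and all hyperparameter values are illustrative assumptions, not the paper's reported settings.

```python
# Illustrative sketch of the DQN training mechanism in Section 3.4.5.
# Architecture and hyperparameters are assumed values for demonstration.
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim: int = 387, n_actions: int = 4):  # 384-dim embedding + 3 statistics (assumed)
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, x):
        return self.net(x)

q, q_target = QNet(), QNet()
q_target.load_state_dict(q.state_dict())
optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)  # stores (s, a, r, s') as tensors: float[387], long scalar, float scalar, float[387]
gamma, eps, eps_min, decay, C = 0.99, 1.0, 0.05, 0.995, 100  # illustrative values

def train_step(step: int, batch_size: int = 32) -> None:
    global eps
    if len(buffer) >= batch_size:
        s, a, r, s2 = map(torch.stack, zip(*random.sample(buffer, batch_size)))
        with torch.no_grad():
            target = r + gamma * q_target(s2).max(dim=1).values  # TD target via target network
        pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s, a; theta)
        loss = nn.functional.mse_loss(pred, target)              # temporal-difference loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step % C == 0:
            q_target.load_state_dict(q.state_dict())             # periodic target synchronization
    eps = max(eps_min, eps * decay)                              # exponential epsilon decay
```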
3.5. Complexity Analysis
We analyze the complexity of the proposed reinforcement learning framework. Let $T$ denote the maximum number of editing steps per episode, and let $F$ represent the cost of a forward pass through the Q-network at each step. At each editing step, the agent retrieves $k$ candidate phrases from the knowledge base of size $|\mathcal{K}|$, which requires $O(|\mathcal{K}|)$ time. The total complexity per episode is, therefore, $O\big(T\,(F + |\mathcal{K}|)\big)$. During training, the Q-network is updated via experience replay, with each update processing a mini-batch of size $B$, yielding an additional $O(BF)$ cost per update. For $N$ episodes, the overall training complexity is $O\big(N\,T\,(F + |\mathcal{K}| + BF)\big)$. Notably, the use of domain-aware constraints substantially reduces the effective action space and retrieval costs, thereby improving the practical efficiency of the method.
4. Experiments
In this section, we present experiments conducted to evaluate the performance of our proposed prompt optimization framework. Our evaluation consisted of the following main parts: (1) an evaluation on eight binary classification tasks, using the Natural-Instructions dataset v2.6 [41]; (2) an evaluation under the constraint of limited API calls; (3) additional experiments on text-to-image generation tasks, using Stable Diffusion 2.1 and PickScore version 1.0 [42,43]; (4) an evaluation of reasoning capability on representative tasks from the GSM8K, ASDiv, AQuA, CSQA, and StrategyQA datasets [44,45,46,47,48]; (5) an ablation study to understand the contribution of the key components of our framework; and (6) hyperparameter sensitivity analyses to verify the robustness of the method. The results from these experiments demonstrate the effectiveness, efficiency, and generalization ability of our approach across various practical scenarios.
4.1. Experimental Setup
This section describes the experimental setup used to evaluate the effectiveness of our proposed method. The setup included information about the dataset, baselines, and experimental parameters, ensuring a fair comparison between the various methods.
4.1.1. Dataset
Our experiments were conducted on a subset of the Natural-Instructions dataset v2.6, which includes a wide range of tasks designed to evaluate instruction-following capabilities. We focused on eight binary classification tasks: task019, task021, task022, task050, task069, task137, task139, and task195. These tasks are well-suited for testing prompt optimization strategies, as they involve interpreting and responding to diverse instructions. Each task includes a set of input–output examples that the models are expected to handle effectively.
In addition, to evaluate the generalization of our method, we utilized Stable-Diffusion-2-1 for text-to-image generation tasks, with PickScore used to assess prompt effectiveness. We further conducted experiments on representative reasoning tasks from the GSM8K, ASDiv, AQuA, CSQA, and StrategyQA datasets.
4.1.2. Baselines
To comprehensively evaluate the effectiveness of our proposed reinforcement learning framework, we compared it to five state-of-the-art prompt optimization methods. Specifically, Plum employs metaheuristic algorithms for discrete prompt optimization, leveraging black-box model feedback to iteratively select and refine candidate prompts [49]; BDPL utilizes a variance-reduced policy gradient strategy to optimize discrete prompts, effectively addressing high-variance issues in gradient estimation [27]; GrIPS conducts discrete prompt generation through phrase-level editing actions, selecting optimal prompts based on empirical task performance [8]; RLPrompt formulates prompt optimization as a reinforcement learning problem, iteratively refining prompts via token-level edits guided by task performance rewards [9]; and StablePrompt formulates prompt optimization as an online reinforcement learning problem and employs adaptive proximal policy optimization for stable training [7]. These representative baselines encompass diverse prompt optimization paradigms, providing a comprehensive benchmark to evaluate the advantages of our proposed approach.
4.1.3. Experimental Parameters
To ensure fairness across all the methods, we used the following standardized settings:
Backbone model: GPT-4o was used for all tasks as the underlying language model.
Batch size: Set to 1 for all methods to focus on individual prompt editing.
Time limit: Each method was given 45 min of runtime per task, ensuring an equitable computational budget.
Task-agnostic instruction: Each method was provided with a task-agnostic instruction: "You will be given a task. Read and understand the task carefully, and appropriately answer [list_of_labels]" [49]. The placeholder [list_of_labels] is replaced with the actual task-specific labels for each experiment.
These experimental parameters were designed to ensure fair and consistent evaluation of the different prompt learning methods. The time limit ensured that all the methods were constrained by similar computational resources, and the task-agnostic instruction provided a general template for all the tasks.
4.2. Prompt Optimization Performance
We evaluated the effectiveness of the proposed method on eight binary classification tasks from the Natural-Instructions v2.6 dataset [41], comparing it against five baseline prompt optimization methods. Table 2 summarizes the results in terms of average task accuracy with standard deviation across multiple runs.
Across all the tasks, our method achieved the highest average accuracy of 61.13%, consistently outperforming all the baselines. Specifically, compared to GrIPS, BDPL, RLPrompt, and StablePrompt, our reinforcement learning-based framework demonstrated superior editing efficiency and generalization capability. For instance, on task 3 our approach significantly improved performance by more than 9 points compared to StablePrompt and by approximately 15 points compared to RLPrompt, highlighting its ability to effectively refine suboptimal initial prompts and achieve substantial accuracy gains. Further analysis indicated that RLPrompt and GrIPS achieved competitive performance on task 5, but their performance deteriorated on more complex tasks due to their limited flexibility and lack of domain-aware constraints. In contrast, our method, with its structured action constraints and dynamic domain-awareness integration, consistently maintained stable and improved performance across all the tested tasks.
These results validate the effectiveness of modeling prompt optimization as a sequential decision-making process guided by domain knowledge and structured reinforcement learning, confirming its suitability and robustness for practical prompt optimization scenarios.
4.3. Prompt Optimization with Limited API Calls
To address concerns regarding computational resources, we further evaluated prompt optimization methods under a limited API call budget. As shown in Table 3, the proposed method achieved the highest average accuracy across all tasks under identical query constraints. Specifically, our approach consistently outperformed all comparative algorithms, demonstrating greater efficiency in utilizing limited queries. In particular, for tasks 3 and 7, our method exhibited substantial improvements, highlighting its effectiveness even in constrained resource scenarios. Compared with other reinforcement learning-based methods such as RLPrompt and StablePrompt, our method attained superior accuracy by leveraging structured domain knowledge and entropy-guided exploration to efficiently navigate the prompt space. These results confirm the practicality and efficiency of our framework, making it particularly suitable for real-world scenarios where API calls are limited.
4.4. Prompt Optimization for Text-to-Image Generation
In this experiment, all the images were generated using the Stable-Diffusion-2-1 model [42]. To objectively evaluate the effectiveness of prompt optimization, we employed PickScore, an offline CLIP-based evaluator trained on a large-scale, high-quality image–text pair dataset [43]. PickScore quantitatively measures the semantic relevance between generated images and their textual prompts.
Figure 5 and Figure 6 show the visual comparisons between images generated using the original and the optimized prompts for ten distinct topics (Sunken Palace, Golden Petals, Mountain Dawn, Misty Mirror, Starlit Sky, Library Dog, Flaming Flight, Morning Peony, Urban Sunrise, and Blazing Tower). It is evident that the images generated using prompts optimized by the proposed method demonstrate notable improvements in visual quality and thematic coherence compared to those generated by the initial prompts.
Quantitatively, the optimized prompts achieved consistently higher PickScores across all ten image topics, especially in cases like "Morning Peony" and "Urban Sunrise", where the relative improvements reached approximately 5.5% and 6.5%, respectively. These results suggest that the reinforcement learning-based editing strategy proposed in this paper effectively guides the model toward generating images with improved thematic fidelity and enhanced visual detail.
From the perspective of specific prompt-editing operations, the optimization primarily employed three strategies: (i) substituting vague or abstract descriptions with concrete visual elements, for example, replacing general phrases such as "delicate pink hues" with more explicit details like "deep pink, intricately layered petals", which enhanced the visual richness of the generated images; (ii) adding descriptive details or lighting elements, as in "Urban Sunrise", where explicitly mentioning "orange light" and "crisp shadows" emphasized the depiction of lighting and shadow effects; and (iii) deleting ambiguous or uncertain phrases, such as removing "faintly visible" from "Starlit Sky", which significantly enhanced clarity and visual strength.
The optimized prompts generally yielded images with more focused themes, richer visual detail, and more accurate representation of intended atmospheres, particularly in visually complex scenarios such as “Sunken Palace” and “Flaming Flight”. These findings demonstrate the effectiveness and generalizability of the proposed method in prompt optimization tasks involving large language models, highlighting its practical value for real-world applications.
4.5. Prompt Optimization for Reasoning Tasks
In addition to quantitative evaluation, we present representative examples from five reasoning benchmarks to demonstrate the refinement of prompts by the proposed method via knowledge-guided editing. As shown in Table 4, our method consistently enhances initial prompts by clarifying key concepts, supplementing missing definitions, or completing logical reasoning steps, thereby improving the interpretability and effectiveness of each prompt for large language models.
Experimental analysis shows that our method consistently enriches initial prompts with explicit definitions, contextual information, and instructive annotations across diverse reasoning benchmarks. For instance, in the CSQA example, the optimized prompt clearly defines the term “choker” and guides consideration of alternative storage locations, thereby prompting more comprehensive reasoning. In the GSM8K and AQuA examples, the refined prompts explicitly outline calculation steps or clarify mathematical relationships. Similarly, for StrategyQA, the edited prompt introduces practical considerations regarding food preparation, thus enabling the model to reason about implicit factors more effectively.
To further demonstrate the domain-awareness capability of the proposed method, we present additional examples in Table 5. Specifically, these examples illustrate that the domain-aware editing mechanism refines prompts by incorporating precise domain-specific terms and clarifications. For instance, the initial prompt mentioning "lawyers" is enhanced to "attorneys (legal professionals specializing in divorce cases)", clearly reflecting the legal domain context. Similarly, the term "blood" is modified to "bruises or scrapes" in a sports-related prompt to better align with typical domain-specific language used in athletic contexts. Furthermore, a prompt referencing "electric motor" is refined to explicitly indicate that it is an "electromechanical device", providing a more accurate domain-specific description. These refined prompts clearly indicate that our method systematically leverages structured domain knowledge to enhance the clarity and contextual appropriateness of the generated prompts.
4.6. Ablation Study
Table 6 presents the ablation results. Removing either the knowledge base constraint or the diversity reward term led to a drop in accuracy, demonstrating the importance of both knowledge-guided candidate selection and exploration encouragement in prompt optimization.
4.7. Hyperparameter Sensitivity Analysis
Following the ablation study of core components, we further investigated the sensitivity of our approach to key hyperparameter settings. To better understand the robustness and practical performance of our reinforcement learning-based prompt optimization framework, we conducted a comprehensive sensitivity analysis on four key hyperparameters: replay buffer size, maximum editing steps per episode, epsilon decay rate, and entropy reward weight. All the sensitivity experiments were performed under the same experimental setup described previously, and the reported accuracy was averaged across all the benchmark tasks.
We examined buffer sizes of 1000, 5000, 10,000, and 20,000. As shown in Figure 7 (top-left), model accuracy initially improved with increasing buffer size, reaching the best result at 10,000 (61.13%), while further growth yielded diminishing returns. This trend reflects the benefit of enhanced sample diversity for stable training, but also indicates that excessively large buffers may introduce unnecessary delay or overfitting to outdated transitions.
We varied the maximum number of allowed editing steps per episode among 1, 3, 5, and 7. Figure 7 (top-right) shows accuracy rising from 59.81% to 61.13% as the number of steps increased to 5, after which it plateaued. This suggests that moderately longer editing trajectories enable more flexible prompt optimization, while excessively long episodes provide limited additional benefit.
The epsilon decay rate governs the exploration–exploitation trade-off in policy learning. We compared decay values of 0.98, 0.99, 0.995, and 0.997. Figure 7 (bottom-left) demonstrates that an appropriate balance between exploration and exploitation yields the highest accuracy, while too-rapid or too-slow decay reduces overall performance.
We assessed the impact of the entropy regularization weight on the agent's reward function. As shown in Figure 7 (bottom-right), setting this weight to 0.2 achieved the highest accuracy, indicating that moderate encouragement of diverse editing actions is beneficial. The absence of diversity regularization (weight = 0) resulted in noticeably worse performance, while overly high weights also degraded the results.
Overall, these results indicate that the proposed method is robust to hyperparameter settings within reasonable ranges, and that the best performance is consistently achieved when balancing diversity, exploration, and efficient trajectory length.