Prompt Optimization with Two Gradients for Classification in Large Language Models

Lieander, Anthony Jethro; Wang, Hui; Rafferty, Karen

doi:10.3390/ai6080182

by Anthony Jethro Lieander *, Hui Wang and Karen Rafferty

School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK

* Author to whom correspondence should be addressed.

AI 2025, 6(8), 182; https://doi.org/10.3390/ai6080182

Submission received: 2 June 2025 / Revised: 30 July 2025 / Accepted: 6 August 2025 / Published: 8 August 2025

(This article belongs to the Special Issue Large Language Models and Retrieval-Augmented Generation in Natural Language Processing, Human–Robot Interaction and Quantum Computing)

Abstract

Large language models (LLMs) generally perform well on common tasks, yet they remain error-prone in sophisticated natural language processing (NLP) classification applications. Prompt engineering has emerged as a strategy to enhance their performance. Because manual prompt optimization demands considerable effort, recent work highlights the need for automation to reduce human involvement. We introduce the PO2G (prompt optimization with two gradients) framework to improve the efficiency of optimizing prompts for classification tasks. PO2G reaches almost 89% accuracy after just three iterations, whereas ProTeGi requires six iterations to achieve a comparable level. We evaluated PO2G and ProTeGi on a benchmark of nine NLP tasks (three tasks from the original ProTeGi study and six non-domain-specific tasks) and on seven legal-domain classification tasks. These results provide broader insights into the efficiency and effectiveness of prompt optimization frameworks for classification across diverse NLP scenarios.

Keywords:

LLM; NLP; prompt optimization; gradient descent; legal

1. Introduction

Large language models (LLMs) are widely used in natural language processing tasks and have demonstrated strong performance in many use cases [1]. However, they often struggle with sophisticated problems [2]. Prompt engineering, the craft of designing targeted prompts to guide model outputs, has emerged as a strategy for improving LLM performance [3]. In this research, we focus on using LLMs to annotate legal documents. LLMs have shown promising results in the legal domain: GPT-4 passed the bar exam [4], and other studies demonstrated its potential as a legal assistant, achieving roughly a C+ grade on law school exams [5,6]. However, LLM performance remains inconsistent on sophisticated tasks such as legal text annotation, indicating the need for better techniques.

Two widely used prompt-engineering techniques are chain-of-thought prompting and few-shot prompting. Chain-of-thought (CoT) prompting involves guiding the model to produce step-by-step reasoning before giving a final answer, effectively simulating human-like reasoning [3]. Few-shot prompting improves performance by providing a few example input-output pairs within the prompt to help the model infer the task format [7]. While both methods can improve accuracy, they require substantial human expertise to craft effective prompts and often involve collaboration with domain experts. These prompts are also computationally expensive at inference time due to the high token counts in the model’s output reasoning. Therefore, there is a need for automated prompt optimization frameworks that can generate concise, direct prompts with better accuracy and consistency, without extensive human effort.

1.1. Literature Review

1.1.1. Survey Paper

Prompt engineering has been the subject of several comprehensive surveys. Table 1 highlights a few of these works [8,9,10]. Collectively, these surveys review the paradigm shift from the traditional pretrain–fine-tune approach to a pretrain–prompt–predict paradigm, and catalog a wide range of prompt design techniques. Following this literature, we can broadly classify prompt optimization methods into two families based on the behavior of the optimized prompt: (i) analysis-driven prompts, which encourage the LLM to produce reasoning or explanations, and (ii) direct-label prompts, which aim for the LLM to output a final answer or class label directly without reasoning.

1.1.2. Foundation Inspiration

Table 2 lists foundational prompt-engineering methods that inspired our work, grouped by whether they use an analysis-driven approach or a direct-label approach. In the analysis-driven category, the chain-of-thought (CoT) method uses manually crafted multi-step reasoning in the LLM’s output [3]. This approach can improve performance but requires carefully curated prompts and incurs a higher inference cost due to the longer reasoning output. Auto-CoT [11] attempts to automate this process by using the LLM itself to generate reasoning chains. However, its outputs can be inconsistent across multiple runs, and it still requires reasoning output at inference time. AutoHint [12] iteratively adds hint phrases to a single evolving prompt. These hint-based prompts may encourage the model to analyze and reason through the problem, but the method still focuses on expanding the reasoning at inference time.

In the direct-label category, the goal is to produce the correct label or decision with minimal reasoning at inference time. ProTeGi [13] incrementally applies gradient-inspired textual edits to prompts, using a bandit or minibatch evaluation strategy. This method demonstrated that LLM prompts can be optimized by treating prompt edits as gradient steps; however, it does not explicitly distinguish between different types of errors (false positives and false negatives) when scoring edits. CLAPS employs an evolutionary algorithm: it maintains a pool of candidate prompts and iteratively selects, mutates, and recombines them, guided by their performance [14]. CLAPS can produce robust prompts for document annotation tasks, but it lacks a mechanism to use specific error feedback. PACE uses an actor–critic model to refine a single prompt in an iterative loop [15]. PACE can handle direct labeling tasks, but most demonstrations of PACE have been on non-classification tasks, and it optimizes one prompt at a time without explicitly handling different error types. TextGrad generalizes the idea of gradient-guided prompt editing by formulating prompt optimization as an end-to-end differentiable problem [16]. TextGrad is an elaborate framework that can handle complex reasoning prompts. Importantly, none of these automatic prompt optimization methods explicitly separates false positive (FP) and false negative (FN) errors when guiding prompt refinements. Our work builds on the direct-label paradigm with a focus on efficiency and introduces a novel way to incorporate error-specific feedback into prompt optimization.

1.1.3. Related Work

Beyond the foundational methods, several recent works have proposed new frameworks for prompt optimization, as summarized in Table 3. AMPO [17] expands and maintains a diverse pool of prompt candidates (a multi-branch optimization strategy), which can improve robustness but also increases the number of LLM calls needed per iteration. STRAGO [18] introduces guided edits by adding reasoning tokens to prompts; this approach steers the LLM’s reasoning process, but similarly to CoT, it results in longer prompts and higher inference costs focused on analytic reasoning tasks.

Another line of work tries to leverage performance feedback more directly. APO-CF (automatic prompt optimization with confusion matrix feedback) [19] uses the model’s confusion matrix (on a validation set) to guide prompt edits. This method provides a form of error feedback, but it uses aggregated signals (like overall false positive/negative counts) rather than pinpointing individual instances and their mistakes. Other approaches focus on the structure of the prompt itself: SCULPT [20] systematically tunes sections of a long prompt (assuming the prompt has a predefined structure with segments), which is useful for very structured prompts but may miss more free-form changes. GREATER [21] proposes using gradients over the model’s reasoning process (requiring access to token-level probabilities) to refine prompts, an approach that is not feasible with closed-source LLMs where internal logits are inaccessible.

Some methods aim for fine-grained edits or low-resource scenarios. LPO (local prompt optimization) [22] looks at token-level replacements in a prompt based on local utility, which can improve a prompt iteratively; however, focusing only on local edits might overlook the need for larger structural or semantic changes in the prompt. PROPEL (prompt optimization with expert priors for LLMs) [23] uses an agent-based loop that incorporates expert knowledge or heuristic rules during prompt refinement; it has mainly been tested on small and medium-sized LLMs, and the additional agent overhead can increase optimization time. AutoMedPrompt [24] adapts textual gradient methods specifically for medical question-answering, refining a single prompt candidate through multiple iterations. This domain-specific approach shows the versatility of prompt optimization techniques, though it was demonstrated on specialized medical tasks. Notably, none of the above methods explicitly uses FP and FN errors as separate signals in its optimization loop, and, to our knowledge, none has been applied to legal document annotation scenarios. The frameworks discussed in this section were published recently and were not available when PO2G was developed in Q3 of 2024.

1.2. Framework with Two Gradients

Inspired by the strengths and limitations of ProTeGi, CLAPS, AutoHint, and PACE, we propose the PO2G (prompt optimization with two gradients) framework. PO2G uses a gradient-descent-like iterative approach to refine prompts. The key innovation is that it categorizes the model’s misclassifications into two types, false positives (FP) and false negatives (FN), and treats each as a “distinct gradient” or direction for improvement. In each iteration, the framework gathers the examples from the LLM’s incorrect predictions and splits them into an FP set (data that should be excluded) and an FN set (data that should be included). This separation allows PO2G to make targeted adjustments to the prompt: the prompt can be edited to be more restrictive in response to FPs and more inclusive or instructive in response to FNs. This mirrors how a human would update a prompt after finding an incorrect prediction. Additionally, to avoid overwhelming the prompt with too much similar feedback, we employ a clustering step on the incorrect examples. By clustering the FP examples and FN examples separately, PO2G samples representative error cases from each cluster. The prompt is then updated with focused guidance addressing those representative cases, which reduces redundancy in the feedback and keeps the prompt efficient.

PO2G is a data-driven framework designed to optimize prompts more efficiently than ProTeGi. It takes advantage of modern LLMs’ larger context windows. Whereas GPT-3.5 Turbo’s output is limited to 4096 tokens, newer models such as GPT-4o-mini can handle significantly longer prompts. This expanded context capacity allows the LLM to include more detailed feedback and instructions in each iteration step. Figure 1 illustrates an example of prompt refinement with PO2G. A simple initial prompt (iteration 0) is gradually transformed into a more detailed and effective prompt by iteration 3, with additional instructions and clarifications incorporated at each step. In this way, PO2G iteratively builds and edits a prompt that captures the optimal instructions for the task, without requiring a human to write out long reasoning chains or numerous examples manually. The PO2G framework supports a different LLM at each stage: a more powerful LLM can be used for feedback and prompt generation, while a cheaper model handles scoring.

The ProTeGi framework was able to optimize prompts for domain-specific tasks. However, its authors identified issues such as overfitting, limited task diversity, and inefficiency. PO2G is an alternative to ProTeGi that addresses inefficiency with a different prompt optimization strategy. We evaluate prompt optimization across nine NLP tasks, with an additional seven legal NLP tasks, thereby adding diversity to the NLP task set. To address overfitting, PO2G processes the whole training dataset instead of mini-batches, which mitigates local bias and ensures comprehensive learning. Clustering the errors from the whole dataset helps identify the most influential prompts, which improves efficiency.

1.3. Legal LLM

A major motivation for PO2G is its application to legal document analysis. In practice, legal professionals often need to extract obligations, requirements, or other information from lengthy documents. We integrate legal cases into our evaluation for behavioral analysis and compare our method with ProTeGi for tax and obligation document extraction. However, due to data limitations, we conduct the testing with publicly available data to compare the performance of PO2G and ProTeGi. The legal-domain results are therefore presented as an exploratory analysis.

Our proposed framework reached a given level of accuracy in fewer iterations than ProTeGi in these legal NLP tasks. For instance, after three optimization iterations, the prompt generated by PO2G attained an accuracy comparable to what ProTeGi achieved after six iterations. This suggests that separating error types and categorizing feedback helps accelerate the learning of an effective prompt. For all tasks (both general and legal), we report accuracy as the primary evaluation metric to stay consistent with the metrics used by ProTeGi’s authors. However, we recognize that in the legal extraction tasks, the data are imbalanced: the items to be extracted are far fewer than the unrelated text. Therefore, in addition to accuracy, we also report precision, recall, and F1 score for the positive class (the elements that need to be extracted) to better capture the model’s performance on the minority class. All metrics are computed by comparing the LLM’s predicted labels to the ground truth.
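To illustrate why accuracy alone can mislead on imbalanced extraction data, the following minimal sketch computes accuracy together with positive-class precision, recall, and F1 from predicted and true labels. The function name and the toy labels are hypothetical and are not taken from the experiments.

```python
def positive_class_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced toy data: 2 positives among 20 items.
y_true = [1, 1] + [0] * 18
# A classifier that always answers "negative" never finds a positive item.
always_negative = [0] * 20
print(positive_class_metrics(y_true, always_negative))
```

On this toy data the always-negative classifier scores 0.9 accuracy while its positive-class recall and F1 are both 0, which is the situation the additional metrics are meant to expose.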

1.4. Research Contributions

In summary, this study advances prompt engineering research in several complementary ways:

Revisiting a baseline: We replicate the ProTeGi prompt optimization framework using a newer model, GPT-4o-mini. This updated baseline is important because larger-context models can incorporate longer feedback chains and generally exhibit stronger performance on complex tasks, providing a realistic comparison for our approach.

Broadening empirical coverage: We significantly expand the evaluation suite beyond the original ProTeGi paper’s four tasks. In our experiments, PO2G is tested on nine diverse public NLP tasks, and we further explore its application on seven legal document annotation tasks. This broad evaluation demonstrates the generalizability of our method and provides insight into its performance in a specialized domain (legal NLP) that was not covered in prior work.

Introducing the PO2G (P-O-Two-G) framework: We propose PO2G (prompt optimization with two gradients), a novel automatic prompt optimization framework. PO2G leverages a gradient-descent-inspired prompt editing strategy with support for large prompt contexts, treats false positives and false negatives as two distinct textual “gradients” for targeted feedback, and uses clustering of misclassified examples to reduce redundant information during optimization. This approach is particularly useful for classification tasks.

2. Methodology

This section introduces the PO2G Framework for prompt optimization (Figure 2). It begins with the hard prompt optimization process, followed by an explanation of the data sampling and clustering methods used to generate feedback for prompt refinement. Next, it discusses prompt expansion and selection strategies and concludes with a comparison between the PO2G Framework and ProTeGi.

2.1. Hard Prompt Tuning in Textual Gradient Descent

Hard Prompt Tuning, also known as discrete prompt optimization, is a process that refines prompts using human-readable words or tokens. Hard prompt tuning in textual gradient descent is an iterative process that adjusts a discrete prompt by navigating a defined textual space, guided by gradients derived from feedback on incorrect predictions. Section 2.6 illustrates this process. In our research, we extend this approach with PO2G prompt optimization, which simultaneously utilizes two feedback signals to improve the prompt.

2.2. Two Distinct Gradients

Algorithm 1 and Figure 3 outline the two distinct gradients in the PO2G framework. The process starts with an initial prompt $P_{0}$ and a labeled training dataset $D_{text, label}$. First, $LLM_{1}$ evaluates $P_{0}$ on $D_{text}$ to generate predictions. Incorrect predictions are identified as either false positives (FP) or false negatives (FN), forming two distinct loss signals.

Algorithm 1 Two gradients in the PO2G framework.

Require: $P_{0}$: initial prompt; $D_{text, label}$: training dataset (text, label); $LLM_{1}$: LLM to classify; $LLM_{2FP}$: LLM to generate FP feedback; $LLM_{2FN}$: LLM to generate FN feedback; $LLM_{3}$: LLM to generate a new prompt
Ensure: $P_{FP}^{'}, P_{FN}^{'}$

Step 1: Evaluate Initial Prompt
$LLM_{1Prediction} \leftarrow LLM_{1}(P_{0}, D_{text})$
if $LLM_{1Prediction} \neq D_{label}$ then
    $D_{(FP)}$: $LLM_{1Prediction} = 1$ and $D_{label} = 0$
    $D_{(FN)}$: $LLM_{1Prediction} = 0$ and $D_{label} = 1$
end if
Step 2: Loss Signal Sampling
$Feedback_{FP} \leftarrow LLM_{2FP}(P_{0}, \text{Sample } D_{(FP)})$
$Feedback_{FN} \leftarrow LLM_{2FN}(P_{0}, \text{Sample } D_{(FN)})$
Step 3: Generate a New Prompt
$P_{FP}^{'} \leftarrow LLM_{3}(P_{0}, Feedback_{FP})$
$P_{FN}^{'} \leftarrow LLM_{3}(P_{0}, Feedback_{FN})$
return $P_{FP}^{'}, P_{FN}^{'}$
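The three steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch only: every callable (`classify`, `feedback_fp`, `feedback_fn`, `rewrite`, `sample`) is a hypothetical stand-in for the corresponding LLM or sampling strategy, no real model is called, and the sketch passes all sampled errors to a single feedback call rather than producing one candidate per sample.

```python
from typing import Sequence, Tuple

Example = Tuple[str, int]  # (text, label); label 1 = positive, 0 = negative


def split_errors(predictions: Sequence[int], data: Sequence[Example]):
    """Step 1: separate misclassified examples into FP and FN sets."""
    d_fp = [ex for pred, ex in zip(predictions, data) if pred == 1 and ex[1] == 0]
    d_fn = [ex for pred, ex in zip(predictions, data) if pred == 0 and ex[1] == 1]
    return d_fp, d_fn


def two_gradients(prompt, data, classify, feedback_fp, feedback_fn, rewrite, sample):
    """Return (P'_FP, P'_FN). Every callable is a hypothetical stand-in:
    classify ~ LLM_1, feedback_fp ~ LLM_2FP, feedback_fn ~ LLM_2FN, rewrite ~ LLM_3."""
    predictions = [classify(prompt, text) for text, _ in data]
    d_fp, d_fn = split_errors(predictions, data)
    fb_fp = feedback_fp(prompt, sample(d_fp))  # Step 2: FP loss signal
    fb_fn = feedback_fn(prompt, sample(d_fn))  # Step 2: FN loss signal
    return rewrite(prompt, fb_fp), rewrite(prompt, fb_fn)  # Step 3


# Toy stand-ins (assumptions for illustration; no real LLM is involved).
data = [("You must file a return", 1), ("You must be joking", 0),
        ("Filing is required", 1), ("Nice weather today", 0)]
classify = lambda prompt, text: 1 if "must" in text else 0
feedback_fp = lambda prompt, errs: "exclude cases like: " + "; ".join(t for t, _ in errs)
feedback_fn = lambda prompt, errs: "include cases like: " + "; ".join(t for t, _ in errs)
rewrite = lambda prompt, feedback: prompt + " (" + feedback + ")"
sample = lambda errs: errs[:5]  # mirrors the default sample size of five

p_fp, p_fn = two_gradients("Is this an obligation?", data, classify,
                           feedback_fp, feedback_fn, rewrite, sample)
print(p_fp)
print(p_fn)
```

The sketch does not handle the case where one of the error sets is empty; in the framework as described, expansion stops when no FP/FN data remain.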

In Step 2, feedback is generated by sampling from the $D_{(FP)}$ and $D_{(FN)}$ subsets using $LLM_{2FP}$ and $LLM_{2FN}$, respectively. This feedback guides the prompt refinement process via $LLM_{3}$, producing two new prompts: $P_{FP}^{'}$ and $P_{FN}^{'}$. Section 2.6 illustrates an example of FP data and the corresponding feedback generation.

In Steps 2 and 3, the default sample size is set to five for each distinct gradient. The selection process may be conducted either randomly or through a clustering method, which will be detailed in the subsequent subsection. From the five samples, the prompt yielding the highest accuracy within each false positive (FP) and false negative (FN) subset is selected to represent the corresponding FP and FN gradients.

By leveraging both FP and FN data as two distinct gradient signals, our method provides targeted feedback to refine the prompt. Specifically, $LLM_{2FP}$ generates feedback instructing how to modify the prompt to exclude unnecessary data that were incorrectly included (FP), while $LLM_{2FN}$ produces feedback to incorporate data that should have been classified as positive but were missed (FN).

2.3. Loss Signal Selection: Cluster Sampling

The clustering process begins after evaluating $P_{0}$ using $LLM_{1}$, as illustrated in Algorithm 2 and Figure 4. Both the $D_{(FP)}$ and $D_{(FN)}$ datasets undergo clustering separately, with the default number of clusters set to 5. Initially, the $D_{(FP)}$ and $D_{(FN)}$ data are embedded into a vector space using the “text-embedding-3-small” model from OpenAI. This embedding method is chosen for its simplicity and cloud-based capabilities.

We use the K-means algorithm for clustering, as in the clustering and pruning framework CLAPS [14], due to its simplicity, speed, and effective centroid-based grouping. Notably, the authors of that framework also indicate that variations in the clustering method, and even the embedding technique, have minimal impact on the overall framework, reinforcing our choice of these methods. The sample closest to each cluster’s centroid is selected to represent that cluster. This approach reduces redundancy and ensures greater consistency compared to randomly selecting data points as loss signals.

Although the embedding and clustering processes require computational time, these steps are expected to reduce the overall computational load by minimizing duplicate feedback and prompts, which would otherwise lead to more API calls to $LLM_{1}$, $LLM_{2FP}$, $LLM_{2FN}$, and $LLM_{3}$.

Typically, the number of prompts generated after the clustering process is twice the number of clusters. The most influential prompt across all clusters, namely the prompt with the highest accuracy, is selected. The pair ($P_{FP}^{'}, P_{FN}^{'}$) is based on this selection. These two new prompts are then compared and ranked against other prompts in the same iteration, which will be explained in the next subsection.

Algorithm 2 Clustering to find an influential sample.

Require: $D$: input dataset (FN or FP data); num_clusters: number of clusters to form; embed: text embedding function; KMeans: K-means clustering function
Ensure: representative_data: set of data points representing each cluster

Step 1: Embed the input data
$E \leftarrow \{ embed(x) \mid x \in D \}$
Step 2: Apply K-means clustering to the embedded data
$clusters, centroids \leftarrow KMeans(E, num\_clusters)$
Step 3: Select the closest point to each cluster’s centroid
for each cluster $k \in \{1, 2, \dots, num\_clusters\}$ do
    $x^{*} \leftarrow \arg\min_{x \in D_{k}} \| embed(x) - centroids[k] \|_{2}$
    Add $x^{*}$ to representative_data
end for
return representative_data
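Algorithm 2 can be sketched with the Python standard library alone. The tiny K-means below is a stand-in written for illustration (centroids are initialised from the first `k` points, which is an assumption), and the two-dimensional `toy` vectors are hypothetical substitutes for real text embeddings.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny Lloyd's K-means; centroids start at the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(items, embed, k):
    """Return one item per non-empty cluster: the item closest to its centroid."""
    vectors = [embed(x) for x in items]
    labels, centroids = kmeans(vectors, k)
    reps = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: math.dist(vectors[i], centroids[c]))
            reps.append(items[best])
    return reps


# Hypothetical misclassified texts with made-up 2-D "embeddings".
items = ["late fee", "late charge", "late payment",
         "tax refund", "tax rebate", "tax return"]
toy = {"late fee": (0.0, 0.0), "late charge": (0.0, 2.0), "late payment": (0.0, 1.0),
       "tax refund": (10.0, 10.0), "tax rebate": (10.0, 12.0), "tax return": (10.0, 11.0)}

print(representatives(items, toy.__getitem__, 2))  # ['late payment', 'tax return']
```

In practice the `embed` callable would wrap an embedding model and `kmeans` would be replaced by a library implementation; the selection of the point nearest each centroid stays the same.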

2.4. Prompt Expansion and Selection

Algorithm 3 and Figure 5 show how the binary tree structure emerges naturally due to the presence of two new prompts derived from the initial prompt. In each iteration, this process generates $P_{FP}$ and $P_{FN}$ from $P_{0}$. As the process repeats, the tree expands exponentially, with the number of nodes growing in powers of two: 1, 2, 4, 8, etc., making the algorithm highly computationally intensive and expensive in the long run.

Due to the high computational cost of the framework, a limit is placed on the expansion of nodes. The parameter max_prompt_expanded controls the maximum number of nodes that can be expanded (default = 2). As a result, if more than two new prompts are generated from a previous iteration, only max_prompt_expanded prompts will be expanded. This limitation is enforced by marking nodes as either active or inactive.
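The effect of this cap can be illustrated by counting how many new prompts each iteration generates, with and without a limit on expanded nodes. This is a small counting sketch; the function name is hypothetical.

```python
def prompts_per_iteration(iterations, max_prompt_expanded=None):
    """Count newly generated prompts at each iteration of the binary expansion."""
    active = 1  # only the initial prompt P_0 is expanded first
    counts = []
    for _ in range(iterations):
        generated = 2 * active  # each expanded prompt yields one FP and one FN child
        counts.append(generated)
        # Without a cap every new prompt is expanded; with a cap only the top ones are.
        active = generated if max_prompt_expanded is None else min(generated, max_prompt_expanded)
    return counts


print(prompts_per_iteration(6))     # [2, 4, 8, 16, 32, 64]
print(prompts_per_iteration(6, 2))  # [2, 4, 4, 4, 4, 4]
```

With the default cap of two, the number of prompts to generate and score per iteration stays constant after the second iteration instead of doubling.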

Only active nodes are expanded. An algorithm is used to identify the top-performing prompts at each level (iteration) of the tree. At each level, the algorithm selects the nodes with the highest accuracy, and the top-n prompts (where $n$ = max_prompt_expanded) are expanded. This selection strategy helps minimize computational effort by terminating low-performing nodes. The node termination process is illustrated in Section 2.6.

Algorithm 3 Prompt expansion and selection.

Require: $P_{0}$: initial prompt; $D_{text, label}$: training dataset (text, label); max_depth: maximum iteration of expansion; max_nodes_per_level: maximum expanded nodes per level (default = 2); $G()$: two gradients function
Ensure: Optimized prompt $P^{*}$

Initialize: Active node and iteration
active_node_set $\leftarrow \{P_{0}\}$
$iteration \leftarrow 0$
while $iteration \leq$ max_depth do
    for each $P_{i} \in$ active_node_set do
        Step 1: Generate new prompts using $G()$
        $P_{FP}, P_{FN} \leftarrow G(P_{i}, D_{text, label})$
        Step 2: Collect the training accuracy of the newly generated nodes
        $A_{FP} \leftarrow$ Train accuracy of $P_{FP}$
        $A_{FN} \leftarrow$ Train accuracy of $P_{FN}$
        Append $(P_{FP}, A_{FP})$ and $(P_{FN}, A_{FN})$ to accuracy_list[iteration]
    end for
    Step 3: Update active nodes for the next iteration
    Sort accuracy_list[iteration] by $A_{i}$ in descending order
    active_node_set $\leftarrow$ accuracy_list[iteration][0 : max_nodes_per_level]
    $iteration \leftarrow iteration + 1$
end while
Final Step: Select the prompt with the highest accuracy across all levels
$P^{*} \leftarrow \arg\max_{(P_{i}, A_{i}) \in accuracy\_list} A_{i}$
return $P^{*}$
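The expansion and selection loop of Algorithm 3 can be sketched as a beam-limited search. In this sketch `two_gradients` and `accuracy` are hypothetical callables supplied by the caller, prompts are plain strings, and the loop runs exactly `max_depth` expansion rounds, which is one interpretation of the loop bound.

```python
def optimize(p0, two_gradients, accuracy, max_depth=3, max_nodes_per_level=2):
    """Beam-limited expansion of the prompt tree; returns (best_prompt, best_accuracy).

    two_gradients(prompt) returns the pair (P_FP, P_FN); accuracy(prompt) returns
    the training accuracy. Both are hypothetical stand-ins for LLM-backed steps.
    """
    active = [p0]
    scored = []  # (prompt, accuracy) for every generated prompt across all levels
    for _ in range(max_depth):
        level = []
        for prompt in active:
            for child in two_gradients(prompt):
                level.append((child, accuracy(child)))
        # Keep only the top-scoring prompts of this level active for the next round.
        level.sort(key=lambda pa: pa[1], reverse=True)
        active = [p for p, _ in level[:max_nodes_per_level]]
        scored.extend(level)
    return max(scored, key=lambda pa: pa[1])


# Toy setup: an FP edit appends "R" (restrict), an FN edit appends "I" (include).
# The made-up accuracy rewards agreement with a hidden target edit sequence.
target = "RIR"
toy_gradients = lambda p: (p + "R", p + "I")
toy_accuracy = lambda p: sum(a == b for a, b in zip(p, target)) / len(target)

print(optimize("", toy_gradients, toy_accuracy))  # ('RIR', 1.0)
```

Low-scoring branches such as "IR" are never expanded, which is how the cap keeps the number of scored prompts per level bounded.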

2.5. Selecting the Final Prompt

Once the maximum iteration is reached or no further FP/FN data are available, the expansion process stops. The training accuracy of all prompts across all iterations is recorded. The final step is comparing these accuracies. The prompt with the highest accuracy, when evaluated by $LLM_{1}$, is selected as the final prompt. If the initial prompt outperforms the newly generated prompts, the initial prompt will be chosen as the final prompt ($P^{*}$).
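The final selection, including the fallback to the initial prompt, can be sketched as a single maximum over all recorded (prompt, accuracy) pairs. The function name and the accuracy values are hypothetical; resolving ties in favour of the initial prompt is an assumption of this sketch, since the text does not specify tie-breaking.

```python
def select_final(initial, candidates):
    """Pick the highest-accuracy (prompt, accuracy) pair.

    The initial prompt is placed first, and max() returns the first maximal
    element, so ties are resolved in favour of the initial prompt (assumption).
    """
    pool = [initial] + list(candidates)
    return max(pool, key=lambda pa: pa[1])


print(select_final(("P0", 0.60), [("P1.1", 0.70), ("P3.4", 0.91)]))
print(select_final(("P0", 0.80), [("P1.1", 0.70), ("P1.2", 0.80)]))
```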

This final prompt constitutes the output of the proposed PO2G framework. In the following section, we illustrate the complete workflow of the PO2G framework, from receiving the initial prompt to producing the final optimized prompt.

2.6. PO2G Framework Illustration

This subsection details the process of refining a prompt from its initial version until reaching the final optimized prompt. Figure 6 and Figure 7 illustrate the PO2G framework over three iterations.

The PO2G framework begins by applying the initial prompt to the training data using $LLM_{1}$. The response generated by $LLM_{1}$ is collected as a prediction and subsequently used to generate feedback via either $LLM_{2FP}$ or $LLM_{2FN}$. The feedback is used to generate a new prompt using $LLM_{3}$. This iterative process of using $LLM_{1}$, $LLM_{2FP}$ or $LLM_{2FN}$, and $LLM_{3}$ is part of Algorithm 1 and is visually represented in Figure 6.

The process of expanding $P_{0}$ and selecting the final prompt is illustrated in Figure 7. The procedure for prompt expansion and selection was presented in Algorithm 3. After setting an initial prompt $P_{0}$, Algorithm 1 generates two new prompts: $P_{1.1}$ and $P_{1.2}$. These prompts are then evaluated using Algorithm 3, which determines whether a node should be marked as active or inactive. The iterative expansion of active nodes continues until a stopping criterion is met, resulting in the selection of the final prompt. As illustrated in Figure 7, the process concludes at iteration 3, with prompt $P_{3.4}$ selected as the final prompt, which achieves the highest training accuracy of 91%.

2.7. Comparing the PO2G Framework with ProTeGi

The PO2G framework evaluates the initial prompt on the entire training dataset to obtain loss signals. In contrast, ProTeGi divides the data into random mini-batches and aggregates loss signals from each batch to refine the prompt. Additionally, while ProTeGi relies solely on random mini-batching, PO2G allows for either random sampling or clustering to select feedback.

A key difference is in how incorrect predictions are handled. The PO2G framework separates them into two distinct loss signals (one for false positives and one for false negatives) and clusters these signals into groups. In contrast, ProTeGi generates local loss signals by aggregating gradients from a minibatch’s data.

LLMs typically perform best with clear, targeted instructions; combining all feedback in one request can lead to poor responses or hallucinations. PO2G mitigates this by providing separate instructions for FP and FN data to minimize conflicting signals, ensuring more accurate feedback.

To refine the prompt further, ProTeGi generates paraphrases of the improved prompt to produce slight variations. In contrast, PO2G selects different samples, either randomly or via clustering, to explore alternative textual gradient directions rather than merely paraphrasing the text.

ProTeGi assesses each improved prompt on a mini-batch, whereas PO2G assesses it on the entire training dataset. This comprehensive evaluation helps avoid local optima that may arise from mini-batch sampling.

A potential drawback of using the entire training dataset is the risk of overfitting, as the prompt might be tailored too closely to the training data distribution. Although this approach increases computational time compared to mini-batch sampling, it reduces redundancy by avoiding repeated evaluations of the same prompt on identical data.

Each framework has its pros and cons, which we evaluate experimentally in terms of accuracy and computational efficiency. The results are discussed in the next section.

3. Experimental Setup

This section describes our experiment to evaluate the PO2G framework across 16 NLP tasks. These tasks fall into three categories: (i) the three tasks used in ProTeGi, (ii) six additional publicly available (non-domain) tasks, and (iii) seven domain-specific legal tasks.

Each non-legal task uses separate, balanced sets of 200 training and 200 test samples. The legal tasks are smaller (fewer than 200 samples) and imbalanced; for these, the training and test sets are identical (train=test) to study optimization behavior. Accuracy is evaluated on all tasks. For the imbalanced legal setting, we additionally report positive-class precision, recall, and F₁.

3.1. Tasks and Data

The NLP classification tasks used in this study are listed below.

Tasks used in ProTeGi research [13]:
– Ethos [25]
– Liar [26]
– Ar Sarcasm [27]
Publicly available data (Non-Domain-Specific Tasks):
– Financial Sentiment [28]
– Amazon Review [29]
– Tweet Airline [30]
– Tweet Hate Speech [31]
– Tweet Offensive Language [31]
– Clickbait [32]
Legal document data (Domain-Specific Tasks):
– Italian FTT documents (150 samples):
  ∗ Tax Product
  ∗ Tax Transaction
  ∗ Tax Exemption
  ∗ Tax Calculation
  ∗ Tax Subject
  ∗ Tax Process
– FCA Regulation Document:
  ∗ Obligation Extraction

3.2. Language Models and API Configuration

All of the main experiments use GPT-4o Mini (gpt-4o-mini-2024-07-18) across frameworks to ensure fairness. PO2G relies on parallel LLM calls; running comparable open-source models would require hardware we did not have available, so we restrict evaluation to OpenAI-served models provided by our sponsor. In the ablation (Section 4.3), we swap the feedback/prompt-generation components ($LLM_{2FP}$, $LLM_{2FN}$, and $LLM_{3}$) to GPT-4o, while keeping the classifier $LLM_{1}$ as GPT-4o Mini; this isolates the effect of a stronger LLM under a fixed classifier.

3.3. Prompt Initialization

A consistent initial prompt is used throughout all experiments. For the three ProTeGi tasks, we reuse the original ProTeGi prompts; for the remaining tasks, we construct comparable, concise instructions aligned with the ProTeGi style (Appendix Table A3). While most experiments initialize from these prompts, the ablation study also evaluates an empty initialization to test the framework’s ability to discover prompts from scratch.

3.4. Systems Compared

Our goal is a faster, sample-efficient prompt optimizer, with special attention to legal annotation. We evaluate the following:

ProTeGi (baseline): We reimplement ProTeGi [13] with identical algorithmic settings (e.g., beam size $= 5$ ) under GPT-4o Mini.
PO2G+C (proposed with clustering): PO2G uses two loss signals (one from false positives (FP) and one from false negatives (FN)) to propose and score edits during each iteration. We cluster the FP and FN error sets separately (default clusters $k = 5$ per side) and select the most influential instances within clusters to form gradients. We then expand each gradient into candidate edits (max expansion $= 2$ ) and score them.
PO2G (no clustering): This is identical to PO2G+C but without clustering. FP and FN examples are sampled randomly (default random samples $= 5$ per side), expanded (max expansion $= 2$ ), and scored.
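
The difference between the two PO2G variants lies only in how error examples are picked before feedback is generated. The sketch below illustrates that difference under stated assumptions: error examples are assumed to be already embedded as numeric vectors, "most influential" is approximated by the example closest to each cluster centroid, and the function names are hypothetical. It uses a small hand-written k-means and is not the authors' implementation.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def cluster_representatives(vectors, k=5, iters=20, seed=0):
    """PO2G+C-style pick (assumed): one example per cluster, nearest to its centroid."""
    k = min(k, len(vectors))
    if k == 0:
        return []
    rng = random.Random(seed)
    centroids = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: _dist(vec, centroids[c]))
            groups[nearest].append(idx)
        # Empty clusters keep their previous centroid.
        centroids = [
            _mean([vectors[i] for i in g]) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    reps = [
        min(g, key=lambda i: _dist(vectors[i], centroids[c]))
        for c, g in enumerate(groups)
        if g
    ]
    return sorted(reps)


def random_representatives(vectors, n=5, seed=0):
    """PO2G-style pick: up to n error examples sampled uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(vectors)), min(n, len(vectors))))


# Toy FP embeddings forming two well-separated groups (illustrative values only).
fp_vectors = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
fp_reps = cluster_representatives(fp_vectors, k=2)
```

The same routine would be run separately on the FP and FN sets. Both functions return fewer indices when fewer errors exist than requested, so the number of gradients shrinks as the error pool does.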

3.5. Evaluation Organization

Non-domain performance: Accuracy and cumulative API calls across nine tasks (Section 1.1); robustness checks on three tasks (LIAR, AR_Sarcasm, and Clickbait) with five independent runs each, reporting SE and pairwise t-tests.
Legal (domain) behavioral analysis: Accuracy, API calls, and positive-class precision/recall/F₁ across seven tasks with train=test; robustness on Obligation with five runs, reporting SE and pairwise t-tests.
Ablations: Clustering (PO2G+C vs. PO2G), empty vs. initial prompt, and LLM choice, compared mainly on accuracy and API calls.

3.6. Iteration Budget, Metrics, and Statistics

Optimization proceeds for six self-refinement iterations ($i \in \{0, \dots, 6\}$), where $i = 0$ denotes the evaluation of the initial prompt. We report on the following:

Accuracy and cumulative API calls per iteration.
Positive-class precision/recall/F₁ for legal imbalance data.
Robustness: five independent runs for selected tasks; we report the mean and standard error ($\mathrm{SE} = \mathrm{SD}/\sqrt{5}$).
Pairwise t-test: compares the frameworks’ accuracy under the conditions specified in the Results and Discussion section. Exact p-values are reported.
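
For readers who want to reproduce the statistics, the following is a minimal standard-library sketch of the two quantities listed above. It assumes that the pairwise t-test is a paired, two-sided Student's t-test over matched observations (runs or datasets); the function names are hypothetical, and the accuracies in the usage lines are illustrative numbers, not values from this paper. The p-value is obtained by numerically integrating the t density, so it is an approximation.

```python
import math
from statistics import mean, stdev


def standard_error(values):
    """SE = SD / sqrt(n), using the sample standard deviation."""
    return stdev(values) / math.sqrt(len(values))


def _t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; `steps` must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = _t_pdf(0.0, df) + _t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * _t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-|t| <= T <= |t|)
    return max(0.0, 1.0 - central_mass)


def paired_t_test(a, b):
    """Return (t statistic, two-sided p) for matched samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, two_sided_p(t_stat, n - 1)


# Illustrative accuracies for five matched runs (not results from the paper).
runs_a = [0.80, 0.82, 0.81, 0.83, 0.79]
runs_b = [0.78, 0.79, 0.80, 0.80, 0.78]
se_a = standard_error(runs_a)
t_stat, p_value = paired_t_test(runs_a, runs_b)
```

With five runs the test has only four degrees of freedom, which is one reason small accuracy gaps can fail to reach significance.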

Prompts are provided in the Appendix. All non-domain experiments use separate train and test sets of 200 examples each. Legal tasks are smaller and imbalanced; we use train=test to study optimization behavior due to data limitations. Robustness runs repeat the identical configuration five times; differences in performance between runs are attributed to LLM stochasticity.

4. Results and Discussion

This section presents accuracy and cumulative API calls. We separate the analysis into non-domain datasets for the main performance comparison, a legal-domain behavioral analysis, and three ablation studies. Following prior work on ProTeGi, accuracy is the primary metric. For the exploratory study on imbalanced legal data, we additionally report precision, recall, and F₁ for the positive class, as explained in the previous section.

4.1. Main Performance on Non-Domain Tasks

To compare prompt-optimization strategies, we evaluate both the achieved accuracy and API calls used during optimization. We first compare iterations 0, 3, and 6 across nine datasets, then analyze the accuracy and cumulative API calls over all iterations. Finally, we report robustness (SE) and per-dataset pairwise t-test on three datasets with five repeated runs.

4.1.1. Overview

Initial Prompt (Iteration 0)

Table 4 reports on the accuracy and API calls using the same initial prompt and identical LLM/decoding settings. As expected, both frameworks require 200 API calls (one per training instance) and achieve similar accuracy; small differences remain despite identical configurations, reflecting LLM stochasticity.

Iteration 3

After three refinements (Table 5), the average accuracy across nine datasets diverges: PO2G+C attains a higher mean accuracy than ProTeGi, typically with a higher API-call budget, though several high-accuracy tasks (e.g., Ethos, Sentiment, and Amazon Review) show fewer calls for PO2G+C. A pairwise t-test across datasets at Iteration 3 yields $p = 0.0256$ (two-sided), indicating a statistically significant mean difference in accuracy under our protocol.

Iteration 6

By six iterations (Table 6), the accuracy gap narrows and both frameworks approach a plateau, suggesting diminishing returns and a possible LLM ceiling. Mean accuracy for PO2G+C remains slightly higher, but the pairwise t-test at iteration 6 is not significant ($p = 0.7110$). API cost continues to diverge: ProTeGi averages ∼14–15 thousand calls, whereas PO2G+C exceeds 17,000 on average at iteration 6, reflecting its more expensive approach per iteration.

A pairwise t-test comparing ProTeGi@6 with PO2G@3 yields $p = 0.6451$, which is not statistically significant. The average accuracy of PO2G after three iterations is still higher than that of ProTeGi after six iterations. Overall, this suggests that PO2G@3 performs similarly to ProTeGi@6 while requiring fewer API calls to reach that level of performance.

4.1.2. Accuracy

Figure 8 and Table 7 show iteration-wise averages on the non-domain data, for which accuracy is the primary metric. Accuracy generally increases from iteration 0 to 6. ProTeGi exhibits mild fluctuations (iterations 2–4), consistent with minibatch-based selection and bandit scoring; PO2G+C shows smoother gains. The fluctuation may be attributed to the use of minibatches: during iterations 3 to 4, ProTeGi may be overfitting to a specific local minibatch, so that a prompt scores higher on that minibatch than other prompts tested across multiple minibatches. Further research is needed to confirm the effect of minibatching.

4.1.3. API Call Efficiency and Trade-Off

Cumulative API calls increase with iteration depth for both methods (Figure 9 and Table 7). Under the default settings, PO2G+C typically incurs a higher call budget than ProTeGi at the same iteration on the non-domain average (e.g., 17,194.9 vs. 14,563.2 at iteration 6). The gap is explained by the optimization process: PO2G+C harvests feedback from both false positives and false negatives and evaluates a larger pool of candidate edits, whereas ProTeGi’s bandit/minibatch scoring restricts the evaluations in each iteration to minibatches. When accuracy becomes high and incorrect predictions become scarce, fewer feedback edits to the prompt can be proposed. This is why some tasks exhibit fewer calls for PO2G+C at iteration 3, despite its usually higher costs.

4.1.4. Statistical Significance on Three Public Datasets (5 runs)

We run five independent trials on LIAR, AR_Sarcasm, and Clickbait, reporting the per-dataset SE (standard error) and the pairwise t-test for iterations 3 and 6. On LIAR, PO2G+C significantly outperforms ProTeGi at $i = 3$ ($p = 0.0280$), but not at $i = 6$ ($p = 0.7833$). For AR_Sarcasm, the difference is not significant at $i = 3$ ($p = 0.1254$) but becomes significant at $i = 6$ ($p = 0.0173$). For Clickbait, differences are not significant at either iteration ($p = 0.2796$, $p = 0.5734$). Full run-level results are presented in Table 8, Table 9 and Table 10.

Taken together, the pairwise t-test and SE results comparing ProTeGi and PO2G are inconclusive. The next subsection presents their performance on legal data.

4.2. Legal Data Behavioral Analysis

4.2.1. Overview

As the primary purpose of the proposed framework is document annotation, we include an exploratory analysis on imbalanced legal datasets provided by our sponsor. These data are reused for optimization and evaluation (train=test), so the goal is to study behavior. The legal datasets are annotated by domain experts. Tax-related data involve element annotation from Italian FTT documents; the Obligation data derive from FCA regulation annotations.

Initial Prompt (Iteration 0)

Using the same prompt and LLM, accuracies are very similar (Table 11); small deviations reflect LLM stochasticity.

Iteration 3

Both frameworks improve substantially by iteration 3; PO2G+C attains a higher average accuracy and a slightly lower average API budget (Table 12), indicating a favorable accuracy–cost trade-off.

Iteration 6

By iteration 6, both frameworks continue to improve, with PO2G+C maintaining an accuracy advantage and a lower average API budget overall (Table 13).

Overall, by iteration 6, PO2G+C has higher mean accuracy and, on average, is more efficient (fewer cumulative API calls) than ProTeGi in these legal annotation tasks. The per-task costs are heterogeneous and API savings concentrate where PO2G+C converges quickly (e.g., Tax Product, Tax Exemption, and Tax Calculation), while other tasks still incur higher call budgets. Because the datasets are imbalanced, we further report positive-class precision, recall, and F₁ in Section 4.2.4 to validate these performance gains.

4.2.2. Accuracy

Figure 10 and Table 14 show iteration-wise accuracy concerning legal tasks. Both frameworks improve substantially from the initial iteration, confirming the benefit of iterative prompt refinement. PO2G+C consistently outperforms ProTeGi from iteration 1 onward and reaches its peak performance faster, indicating faster optimization of the observed training pool.

4.2.3. API Call Efficiency and Trade-Off

Figure 11 and Table 14 illustrate average cumulative API calls per iteration concerning the legal tasks. Call counts increase monotonically for both methods, but PO2G+C consistently requires fewer calls than ProTeGi during the same iteration (e.g., 13,053 vs. 13,681 at $i = 6$). This pattern reflects optimization dynamics: as PO2G+C reaches higher accuracy earlier, false positives/negatives become scarce, yielding fewer actionable feedback items and thus fewer candidate edits to generate and evaluate. In contrast, ProTeGi’s bandit/minibatch scoring continues to evaluate a broader set of candidates per round.

4.2.4. Precision, Recall, and F1 Score

Table 15 reports positive-class precision, recall, and F₁ across iterations for the legal tasks. Iteration 0 uses identical prompts but shows slight performance differences, which we attribute to LLM stochasticity. We focus on the positive label when carrying out document annotation. Both frameworks improve steadily with each iteration; however, from iteration 3 onward, PO2G+C exhibits a more rapid and consistent rise in all three metrics. By the final iteration, PO2G+C attains the highest F₁ (0.8401), along with the highest precision and recall, indicating more reliable detection of positive instances. These results indicate that PO2G+C provides a sustained advantage in positive-class recall and precision, which translates into higher F₁ over the six iteration steps.
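
The positive-class metrics follow the standard definitions and can be computed from the same FP and FN counts that drive the optimizer. Below is a minimal sketch, assuming Yes/No string labels as in the classifier output format; the function name and the toy labels are illustrative and not taken from the paper.

```python
def positive_class_metrics(y_true, y_pred, positive="Yes"):
    """Precision, recall, and F1 for the positive label, plus raw FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fp": fp, "fn": fn}


# Toy imbalanced pool: 2 positives, 8 negatives, and a classifier that always says "No".
y_true = ["Yes"] * 2 + ["No"] * 8
y_pred = ["No"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = positive_class_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.8 while positive-class F₁ is 0, which is why accuracy alone is not a sufficient summary on imbalanced annotation data.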

4.2.5. Statistical Significance

Table 16 compares ProTeGi and PO2G+C on the Obligation dataset over five independent runs at iterations 3 and 6. Both methods improve between iterations, and PO2G+C has a higher mean accuracy at each stage. The pairwise t-test indicates that the difference at iteration 3 is not significant ($p = 0.2199$), whereas by iteration 6, it becomes highly significant ($p = 0.0002$). Standard errors are lower for PO2G+C at both iterations, suggesting slightly better run-to-run stability under this condition.

While most of the performance metrics favor PO2G+C in the legal domain test, note that the legal setting here uses train=test for behavioral analysis; all results in this subsection are an exploration to characterize optimization on the observed pool and do not establish external generalization. Stronger claims in the legal domain will require held-out evaluations on valid data. In the following subsection, we present an ablation study.

4.3. Ablation Study

This section presents an ablation study conducted to assess the influence of three design choices on the proposed framework: clustering, the presence/absence of an initial prompt, and LLM choice.

4.3.1. Clustering

We employ clustering to reduce redundancy by selecting representative centroids and prioritizing influential samples, aiming to enhance efficiency. The following tables compare the two PO2G variants (with and without clustering).

Across the nine non-domain datasets (Table 17), clustering (+C) generally reduces the long-run API budget while maintaining very similar accuracy. On average, at iteration 6, PO2G+C lowers cumulative API calls by ≈1.8 thousand relative to PO2G, with a minimal performance difference that is likely attributable to LLM stochasticity.

On the legal (domain) data where train=test, adding clustering improves average accuracy across all seven tasks and tends to reduce calls during later iterations (Table 18).

Across the domain tasks in Table 19, PO2G+C consistently outperforms PO2G on positive-class precision, recall, and F₁ once refinement is underway. Differences are negligible during iteration 0 and mixed during iteration 1, after which PO2G+C yields clear improvements and remains positive throughout iteration 6. This pattern suggests that clustering strengthens minority-class detection by raising both recall and precision.

Concerning non-domain data, clustering delivers modest accuracy changes but a better long-run accuracy–cost trade-off; during the (train=test) domain analysis, it improves both accuracy and, throughout later iterations, the call budget. Effects likely depend on the batch size and pool size; we do not evaluate clustering at larger training scales, leaving this to future studies.

4.3.2. Empty Prompt

Here, we remove the initial prompt entirely, using an empty starting point. Experiments are run on Obligation, LIAR, AR_Sarcasm, and Clickbait (reusing the datasets from Section 4.1.4), with performance shown in Table 20. At iteration 0, predictions from an empty prompt are expected to be close to random. Adding a good initial prompt acts as a strong prior and typically accelerates optimization; however, we observe mixed outcomes. Clickbait and Obligation benefit from the initial prompt by iterations 3 and 6, while LIAR and AR_Sarcasm perform better from an empty start, suggesting the provided initial prompt was suboptimal for these datasets and may have caused the optimizer to converge on less helpful gradients.

We do not pursue further analysis here, but the results illustrate that the optimizer can recover useful prompts even from an empty start, though potentially with more iterations than when a proper initial prompt is provided.

4.3.3. LLM Comparison

In this subsection, we analyze the impact of different LLM configurations on model performance as part of the ablation study. We compare GPT-4o and the cheaper GPT-4o Mini across four datasets: LIAR, AR_Sarcasm, Clickbait, and Obligation. Table 21 presents the results, focusing on test accuracy. Switching $LLM_{2FP}$, $LLM_{2FN}$, and $LLM_{3}$ to the larger LLM increases accuracy on some datasets and decreases it on others. There is a considerable increase in performance on the Clickbait dataset; however, there is a decrease in performance on the LIAR and Obligation datasets.

We select a sample of prompts and examine the length of the final optimized prompts at $i = 6$ in Table 22. GPT-4o Mini generally produces shorter and less detailed prompts than GPT-4o. Comparing performance against prompt length shows that a longer prompt with numerous details or examples does not guarantee a better result; it does, however, increase the inference cost.

Our results are mixed. Using GPT-4o for feedback/prompt generation improves Clickbait but reduces LIAR and Obligation relative to GPT-4o Mini, possibly due to a mismatch between the feedback/prompt-generation LLM (GPT-4o) and the classifier LLM ($LLM_{1}$ = GPT-4o Mini): the instructions generated by GPT-4o may not be fully understood by GPT-4o Mini, which is a smaller and less capable model. Longer, detailed prompts from GPT-4o do not guarantee better accuracy and will increase inference cost. A full test on other LLM configurations is left to future work due to budget and hardware limitations.

5. Conclusions

In this paper, we introduced PO2G+C, an automatic and discrete prompt optimization framework. Our experiments demonstrated that PO2G outperforms the ProTeGi framework at the same iteration step. Importantly, the PO2G framework reaches the performance of ProTeGi@6 by iteration 3 on average across the non-domain datasets, as shown in Section 4.1. Although PO2G+C typically incurs a higher per-iteration call budget, it attains a comparable accuracy in fewer iterations, so the total accumulated calls to reach a target accuracy can be lower (e.g., PO2G+C@3 vs. ProTeGi@6). We also report pairwise t-tests and SE to check statistical significance across the nine datasets. The difference between PO2G@3 and ProTeGi@3 was statistically significant, whereas the difference between PO2G@3 and ProTeGi@6 was not.

On legal (domain) data, where train = test, PO2G+C shows higher accuracy, precision, recall, and F1 than ProTeGi while using fewer or comparable cumulative API calls by later iterations. Because these experiments evaluate optimization behavior rather than external generalization, we treat them as a behavioral analysis instead of claiming performance improvement.

Our ablation studies yield three takeaways.

Clustering generally reduces long-run API calls with minimal accuracy change on non-domain data, and it improves accuracy and reduces cost on legal tasks; we hypothesize larger gains with bigger training pools for future work.
Initial prompt quality matters: a good seed can accelerate optimization, whereas a suboptimal seed can mislead the search; nevertheless, PO2G can generate a prompt from an empty prompt, albeit typically with more iterations.
LLM choice matters: using GPT-4o for feedback/prompt generation produces longer, more detailed prompts and mixed accuracy changes when the classifier remains 4o-mini, suggesting a mismatch effect; fully matched upgrades (generator and classifier) remain to be assessed and are left for future work.

6. Limitations

Legal domain

Legal experiments use small, imbalanced datasets and train on the same data as the test set. These settings are useful for behavioral analysis of optimization dynamics, but are not valid as a proper benchmark to claim performance improvement. Reported gains should therefore be read as behavioral analysis only; a proper legal benchmark with held-out data is needed to validate them.

LLM scope

Due to resource constraints and the framework’s reliance on parallel API calls, we primarily evaluated it with OpenAI GPT-4o-mini, conducting ablations that utilize GPT-4o for feedback/prompt generation, while maintaining the classifier as 4o-mini. This generator–classifier mismatch can influence results and was not exhaustively explored. We did not test open-source LLMs due to hardware limitations; performance and cost may vary under different providers, model sizes, or settings.

Data size

Following ProTeGi [13], we conducted experiments with 200 train/200 test examples per dataset. While the PO2G framework helps avoid local overfitting in minibatches, it still overfits to the training data over deeper iterations. Increasing the dataset size could further mitigate overfitting, though at the expense of additional testing time. Notably, expanding the optimization iterations from three to six doubled the computational effort but only yielded marginal performance improvements, suggesting that the optimal iteration count depends on both dataset size and the specific NLP task before diminishing returns.

Cost measurement

We primarily reported API call counts as a proxy for the computational cost. In-depth accounting (input/output token usage, monetary price, etc.) was out of scope. Such accounting may alter the perceived efficiency trade-offs, especially when prompts become longer or more expensive LLMs are used.

Statistics and robustness

Robustness analyses used n = 5 runs per dataset. Pairwise t-tests were reported per dataset and iteration, but no clear conclusion was drawn regarding robustness or statistical significance across use cases. Despite identical initial prompts and settings, repeated PO2G runs exhibited variation due to LLM stochasticity, which affects robustness and reproducibility.

Classification task

Since the framework was designed for document annotation, it was evaluated as a classification task using FP and FN. For non-classification tasks, alternative suitable metrics (other than FP and FN) are therefore needed to guide prompt optimization. Further exploration of loss signal categorization may also enhance this process.

Author Contributions

Conceptualization, A.J.L. and H.W.; methodology, A.J.L. and H.W.; software, A.J.L.; validation, A.J.L. and H.W.; formal analysis, A.J.L.; investigation, A.J.L.; resources, H.W. and K.R.; writing—original draft preparation, A.J.L.; writing—review and editing, A.J.L. and H.W.; visualization, A.J.L.; supervision, H.W. and K.R.; project administration, H.W. and K.R.; funding acquisition, H.W. and K.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from Invest NI for the ARC (Advanced Research Engineering Centre) project. The project is part-financed by the European Regional Development Fund under the Investment for Growth and Jobs Programme 2014–2020. The APC was funded by Invest NI.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are partially available in a publicly accessible repository. Nine of the datasets are openly available in the GitHub repository (https://github.com/AnthonyJethro/PO2G). The remaining seven datasets are third-party data and are available from the authors with the permission of the respective third parties.

Acknowledgments

This research is supported by the ARC (Advanced Research Engineering Centre) project. PwC (*PricewaterhouseCoopers LLP, a limited liability partnership incorporated in England with its registered office at 1 Embankment Place, London WC2N 6RH) is in receipt of Grant for R&D support from Invest NI for ARC. This project is part-financed by the European Regional Development Fund under the Investment for Growth and Jobs Programme 2014–2020.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
PO2G	Prompt Optimization with Two Gradients
PO2G+C	Prompt Optimization with Two Gradients with Cluster
NLP	Natural Language Processing
GPT	Generative Pre-trained Transformer
FP	False Positive
FN	False Negative
ProTeGi	Prompt Optimization with Textual Gradients
SE	Standard Error
CLAPS	Clustering and Pruning for Efficient Black-box Prompt Search
PACE	Prompt with Actor–Critic Editing
AMPO	Automatic Multi-Branched Prompt Optimization
STRAGO	Strategic-Guided Optimization
APO-CF	Automatic Prompt Optimization via Confusion Matrix Feedback
SCULPT	Systematic Tuning of Long Prompts
GREATER	Gradient Over Reasoning makes Smaller Language Models Strong Prompt Optimizers
LPO	Local Prompt Optimization
PROPEL	PRompt OPtimization with Expert priors for LLMs
p	p-Value (p)
$Δ$	Delta

Appendix A

Appendix A.1. Prompt in LLM

The table below lists the prompts used by the LLM components introduced in the Methodology section.

Table A1. Prompts used by each LLM component (classifier, feedback generators, and prompt generator).

LLM	Prompt
$LLM_{1}$	#Task {prompt} #Output format Answer Yes or No as the label # PredictionText: {text} Classification:
$LLM_{2FP}$	The prompt is the modified version from ProTeGi [13]. The modified prompt was built to generate false positive Feedback and optimized for a larger context window.
$LLM_{2FN}$	The prompt is the modified version from ProTeGi [13]. The modified prompt was built to generate false negative Feedback and optimized for a larger context window.
$LLM_{3}$	The prompt was built to generate the prompt by incorporating feedback generated by $LLM_{2FP}$ and $LLM_{2FN}$.
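
To make the classifier template concrete, the sketch below fills it with a task prompt and an input text and maps the model's reply back to a Yes/No label. The line breaks in `LLM1_TEMPLATE` are an assumption, since the table cell above is flattened; the reading of the last section as a "# Prediction" header followed by "Text: {text}" is likewise a guess. The helper names are hypothetical, and no API call is made.

```python
import re

# Assumed layout of the classifier prompt; only the wording comes from Table A1.
LLM1_TEMPLATE = (
    "# Task\n{prompt}\n\n"
    "# Output format\nAnswer Yes or No as the label\n\n"
    "# Prediction\nText: {text}\nClassification:"
)


def build_classifier_input(prompt, text):
    """Insert the current task prompt and one input text into the template."""
    return LLM1_TEMPLATE.format(prompt=prompt, text=text)


def parse_label(completion):
    """Return 'Yes' or 'No' if the reply starts with that word, else None."""
    match = re.match(r"\W*(yes|no)\b", completion, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


message = build_classifier_input(
    "Is this text an obligation?",            # initial prompt from Table A3
    "Firms must report the transaction.",     # made-up input text
)
```

Returning `None` for unparseable replies keeps malformed outputs visible instead of silently counting them as one of the two classes.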

Appendix A.2. List Obligation Prompt from Prompt Expansion Sample

The table below lists the prompts displayed in Figure 7 in the Methodology section.

Table A2. Prompt-optimization examples.

Prompt ID	Prompt Text
0	Does the following text contain an obligation? (Train Accuracy: 74.50%) (Test Accuracy: 74.00%) [Root]
1.1	Does this text contain an explicit or implied reference to obligations, such as legal, regulatory, or procedural responsibilities? Consider both direct language (e.g., “must,” “required,” “obligations”) and indirect references (e.g., discussions of responsibilities, standards, or guidance relevant to obligations). (Train Accuracy: 64.00%) [FP]
1.2	Does this text explicitly or implicitly suggest an obligation or responsibility? This includes obligations stated outright (e.g., using words like “must,” “should,” or “required to”) as well as implied responsibilities inferred from context, such as situations where potential harm or expected action suggests an obligation. (Train Accuracy: 82.50%) (Test Accuracy: 82.50%) [FN]
2.1	Does this text contain an explicit or implied reference to actionable obligations, such as legal, regulatory, or procedural responsibilities? Focus on identifying texts that impose, discuss, or describe specific responsibilities that need to be fulfilled. Consider both direct language (e.g., “must,” “required,” “obligations”) and indirect references (e.g., discussions of standards, guidance, or responsibilities relevant to actionable obligations). Exclude texts that merely mention the existence of obligations without describing or imposing specific actionable responsibilities. (Train Accuracy: 58.50%) [FP]
2.2	Does this text contain an explicit or implied reference to obligations, such as legal, regulatory, or procedural responsibilities? Consider both: - Direct language that clearly indicates obligations (e.g., “must,” “required,” “obligations”). - Indirect references that imply expectations, responsibilities, or strong recommendations, such as guidance, suggestions, or calls to action (e.g., “should,” “expected to,” “encouraged to”). When analyzing the text, evaluate whether the context suggests an expectation of action that aligns with obligations, even if the language is not explicitly directive. (Train Accuracy: 85.00%) [FN]
2.3	Does this text explicitly or implicitly suggest an obligation or responsibility for an identifiable subject (e.g., a person, organization, or group) to take a specific action? This includes obligations stated outright (e.g., using words like “must,” “should,” or “required to”) as well as implied responsibilities inferred from context, such as situations where potential harm, expected action, or directives suggest an obligation.Avoid classifying general descriptive statements or informational content as obligations unless they assign a clear responsibility or actionable directive to an identifiable subject. Focus on identifying obligations tied to actionable directives or expectations, rather than statements that merely describe the nature of a situation or provide information. (Train Accuracy: 89.00%) (Test Accuracy: 90.00%) [FP]
3.1	Does this text contain an explicit or implied reference to obligations, such as legal, regulatory, or procedural responsibilities? When analyzing the text, consider the following: 1. Direct Language: Look for clear and explicit indicators of obligations, such as words like “must,” “required,” “obligations,” or other directive language that imposes a necessary action. 2. Implied References: Evaluate whether the text contains indirect suggestions of expectations, responsibilities, or strong recommendations using terms like “should,” “expected to,” “encouraged to,” or similar phrasing. Implied obligations often suggest an expectation of action but may not explicitly command it. 3. Context of Action: Assess whether the text actively directs or implies that the reader or audience must take action to fulfill an obligation. - Instructive Context: Texts that guide, recommend, or imply that a specific action should or must be taken are likely referencing obligations. - Informational Context: Exclude texts that solely describe or explain existing legal, regulatory, or procedural frameworks without suggesting or implying that the reader must take action. Key Distinction: A reference to legal, regulatory, or procedural terms (e.g., “Article 6 of the GDPR”) does not necessarily constitute an obligation unless the text ties it to an expectation or directive for the reader to act. Examples: - Implied Obligation: “Organizations are expected to demonstrate compliance with GDPR requirements.” - Neutral Description: “The GDPR outlines requirements for data processing under Article 6.” Final Evaluation: Determine whether the text is intended to inform the reader about obligations (neutral description) or actively direct them to fulfill an obligation (instructive context). (Train Accuracy: 85.50%) [FP]
3.2	Does this text contain an explicit or implied reference to obligations, such as legal, regulatory, or procedural responsibilities? Consider both: - Direct language that clearly indicates obligations (e.g., “must,” “required,” “obligations”). - Indirect references that imply expectations, responsibilities, or strong recommendations, such as guidance, suggestions, or calls to action (e.g., “should,” “expected to,” “encouraged to”). When analyzing the text: - Evaluate whether the context suggests an expectation of action that aligns with obligations, even if the language is not explicitly directive. - Pay attention to soft guidance or recommendations (e.g., “could,” “may”) if they are tied to formal entities, guidelines, or decision-making processes that suggest procedural or regulatory expectations. - Consider references to external authorities (e.g., “The ICO provides more information”) as potential indicators of implied obligations. - Analyze whether the text references resources, guidelines, or tools provided by formal entities, as these could imply procedural or regulatory responsibilities. Example of Indirect Obligations: - “Firms could use a legitimate interests assessment to help determine whether they should use it as a basis for processing. The ICO provides more information on legitimate interests in their guide to data protection.” In this example, the phrase “should use it” implies an expectation, and the mention of the ICO suggests a regulatory guideline, both of which are relevant to obligations. (Train Accuracy: 85.00%) [FN]
3.3	Does this text explicitly or implicitly suggest an obligation or responsibility for an identifiable subject (e.g., a person, organization, or group) to take a specific action? This includes obligations stated outright (e.g., using words like “must,” “should,” or “required to”) as well as implied responsibilities inferred from context, such as situations where potential harm, expected action, or directives suggest an obligation.Avoid classifying general descriptive statements, explanations of terms, or informational content as obligations unless they directly assign a clear responsibility or actionable directive to an identifiable subject within the context of the statement itself. For example: - Definitions or explanations of how terms like “Must” or “Should” are used in a document are not obligations unless the text actively imposes a duty or responsibility. - Statements that describe a situation or provide general information without assigning responsibility to an identifiable subject are not obligations. Focus on identifying obligations tied to actionable directives or expectations, rather than statements that merely describe the nature of a situation, explain terminology, or provide general information. Confirm that the text imposes or implies a responsibility directly, rather than referencing obligations indirectly or explaining their framework. (Train Accuracy: 84.50%) [FP]
3.4	Does this text explicitly or implicitly suggest an obligation or responsibility for an identifiable subject (e.g., a person, organization, or group) to take a specific action? This includes obligations stated outright (e.g., using words like “must,” “should,” or “required to”) as well as implied responsibilities inferred from context. Implied obligations may arise when the text identifies a necessity, expectation, or requirement for action through phrases like “requires,” “calls for,” “necessitates,” or similar terms, even if explicit obligation-related words are absent. Additionally, consider whether the text suggests a responsibility by highlighting potential harm, expected actions, or directives that imply an obligation.Avoid classifying general descriptive statements or informational content as obligations unless they assign a clear responsibility or actionable directive to an identifiable subject. Focus on identifying obligations tied to actionable directives or expectations, including those implied through context, necessity, or the framing of required actions, rather than statements that merely describe the nature of a situation or provide information. (Train Accuracy: 91.00%) (Test Accuracy: 89.50%) [FN]
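
In Table A2, the rows that also report a test accuracy (1.2, 2.3, and 3.4) are the rows with the highest train accuracy within their iteration. This is consistent with keeping the top-scoring candidate at each step, although the table alone does not confirm the exact selection rule. Under that assumption, the selection can be sketched as follows, with train accuracies copied from the table and a hypothetical function name.

```python
# Train accuracies (%) copied from Table A2, grouped by iteration.
CANDIDATES = {
    1: {"1.1": 64.00, "1.2": 82.50},
    2: {"2.1": 58.50, "2.2": 85.00, "2.3": 89.00},
    3: {"3.1": 85.50, "3.2": 85.00, "3.3": 84.50, "3.4": 91.00},
}


def best_per_iteration(candidates):
    """Pick, for each iteration, the prompt ID with the highest train accuracy."""
    return {
        iteration: max(scores, key=scores.get)
        for iteration, scores in candidates.items()
    }


selected = best_per_iteration(CANDIDATES)
# selected == {1: "1.2", 2: "2.3", 3: "3.4"}
```

Note that the selected path alternates between FN-derived (1.2, 3.4) and FP-derived (2.3) edits, so both feedback signals contribute to the final prompt in this example.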

Appendix A.3. List Initial Prompt

Table A3 lists the initial prompt used for each task at iteration 0.

Table A3. Initial prompts used at iteration 0.

Task ID	Task	Initial Prompt
Category I
9.1	ethos	Is the following text hate speech?
10.1	liar	Determine whether the statement is a lie (Yes) or not (No) based on the context and other information.
11.1	ArSarcasm	Is this tweet sarcastic?
Category II
3.1	Financial Sentiment	Is this text expressing a positive overall sentiment?
4.1	Amazon_review	Is this text a positive review?
5.1	Tweet_airline	Is this text containing a positive meaning?
6.1	HateSpeech	Is this text hate speech?
7.1	Offensive_Language	Is this text offensive language?
8.1	ClickBait	Is this text clickbait?
Category III
1.1	tax_product	Is this text about a tax product that is taxed by the authority?
1.2	tax_transaction	Is this text about a tax transaction?
1.3	tax_exemption	Is this a tax exemption?
1.4	tax_calculation	Is this a method or formula for calculating tax?
1.5	tax_subject	Is this text about someone who is liable to pay or report tax?
1.6	tax_process	Is this a process of paying tax to the authority?
2.1	Obligation	Is this text an obligation?

Appendix A.4. Performance Results

Complete per-dataset, per-iteration results, including accuracy, precision, recall, F₁ score, cumulative API calls, and time, are provided in our public GitHub repository (https://github.com/AnthonyJethro/PO2G). The repository contains all test results, the public datasets, and the initial prompts.

References

1. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
2. Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhang, X.; et al. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey. arXiv 2024, arXiv:2305.18703.
3. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903.
4. Katz, D.M.; Bommarito, M.J.; Gao, S.; Arredondo, P. GPT-4 Passes the Bar Exam. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2024, 382, 20230254.
5. Savelka, J.; Ashley, K.D. The Unreasonable Effectiveness of Large Language Models in Zero-Shot Semantic Annotation of Legal Texts. Front. Artif. Intell. 2023, 6, 1279794.
6. Choi, J.H.; Hickman, K.E.; Monahan, A.B.; Schwarcz, D. ChatGPT Goes to Law School. J. Leg. Educ. 2022, 71, 387–400.
7. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901.
8. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023, arXiv:2302.11382.
9. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35.
10. Li, W.; Wang, X.; Li, W.; Jin, B. A Survey of Automatic Prompt Engineering: An Optimization Perspective. arXiv 2025, arXiv:2502.11560.
11. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
12. Sun, H.; Li, X.; Xu, Y.; Homma, Y.; Cao, Q.; Wu, M.; Jiao, J.; Charles, D. AutoHint: Automatic Prompt Optimization with Hint Generation. arXiv 2023, arXiv:2307.07415.
13. Pryzant, R.; Iter, D.; Li, J.; Lee, Y.; Zhu, C.; Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search; Association for Computational Linguistics: Singapore, 2023.
14. Zhou, H.; Wan, X.; Vulić, I.; Korhonen, A. Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 13064–13077.
15. Dong, Y.; Luo, K.; Jiang, X.; Jin, Z.; Li, G. PACE: Improving Prompt with Actor-Critic Editing for Large Language Model. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 7304–7323.
16. Yuksekgonul, M.; Bianchi, F.; Boen, J.; Liu, S.; Huang, Z.; Guestrin, C.; Zou, J. TextGrad: Automatic “Differentiation” via Text. arXiv 2024, arXiv:2406.07496.
17. Yang, S.; Wu, Y.; Gao, Y.; Zhou, Z.; Zhu, B.B.; Sun, X.; Lou, J.G.; Ding, Z.; Hu, A.; Fang, Y.; et al. AMPO: Automatic Multi-Branched Prompt Optimization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 20267–20279.
18. Wu, Y.; Gao, Y.; Zhu, B.B.; Zhou, Z.; Sun, X.; Yang, S.; Lou, J.G.; Ding, Z.; Yang, L. StraGo: Harnessing Strategic Guidance for Prompt Optimization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 10043–10061.
19. Choi, J. Efficient Prompt Optimization for Relevance Evaluation via LLM-Based Confusion Matrix Feedback. Appl. Sci. 2025, 15, 5198.
20. Kumar, S.; Venkata, A.Y.; Khandelwal, S.; Santra, B.; Agrawal, P.; Gupta, M. SCULPT: Systematic Tuning of Long Prompts. arXiv 2025, arXiv:2410.20788.
21. Das, S.S.S.; Kamoi, R.; Pang, B.; Zhang, Y.; Xiong, C.; Zhang, R. GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers. arXiv 2025, arXiv:2412.09722.
22. Jain, Y.; Chowdhary, V. Local Prompt Optimization. arXiv 2025, arXiv:2504.20355.
23. Mayilvaghanan, K.; Nathan, V.; Kumar, A. PROPEL: Prompt Optimization with Expert Priors for Small and Medium-sized LLMs. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, Albuquerque, NM, USA, 3 May 2025; Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 272–302.
24. Wu, S.; Koo, M.; Scalzo, F.; Kurtz, I. AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients. arXiv 2025, arXiv:2502.15944.
25. Mollas, I.; Chrysopoulou, Z.; Karlos, S.; Tsoumakas, G. ETHOS: A multi-label hate speech detection dataset. Complex Intell. Syst. 2022, 8, 4663–4678.
26. Wang, W.Y. Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection; Association for Computational Linguistics: Vancouver, BC, Canada, 2017.
27. Abu Farha, I.; Magdy, W. From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France, 11–16 May 2020; pp. 32–39.
28. Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P. Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. J. Assoc. Inf. Sci. Technol. 2014, 65, 782–796.
29. Hou, Y.; Li, J.; He, Z.; Yan, A.; Chen, X.; McAuley, J. Bridging Language and Items for Retrieval and Recommendation. arXiv 2024, arXiv:2403.03952.
30. Crowdflower. Twitter Airline Sentiment. Kaggle Dataset. 2015. Available online: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment (accessed on 29 January 2025).
31. Davidson, T.; Warmsley, D.; Macy, M.; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM), Montreal, QC, Canada, 15–18 May 2017; pp. 512–515.
32. Chakraborty, A.; Paranjape, B.; Kakarla, S.; Ganguly, N. Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 18–21 August 2016; pp. 9–16.

Figure 1. Prompt improvement from initial to iteration 3: a simple prompt transformed into a prompt with detailed instructions.

Figure 2. PO2G framework flowchart. The prompt optimization algorithm (red) is detailed in Algorithm 1, cluster sampling (green) in Algorithm 2, and prompt expansion and selection (yellow) in Algorithm 3.

Figure 3. Algorithm 1: Two-gradient prompt optimization.

Figure 4. Algorithm 2: Clustering to find an influential sample.

Figure 5. Algorithm 3: Prompt expansion and selection.

Figure 6. Gradient-based prompt optimization overview. This figure illustrates the process described in Algorithm 1.

Figure 7. Prompt expansion sample for 3 iterations. The expansion step iteratively generates a binary-tree structure. The full prompt details can be found in Table A2 in Appendix A.

Figure 8. Model accuracy per iteration (non-domain average) from Table 7.

Figure 9. Cumulative API calls per iteration (non-domain average) from Table 7.

Figure 10. Model accuracy per iteration (legal domain average) from Table 14.

Figure 11. Cumulative API calls per iteration (legal domain average) from Table 14.

Table 1. Survey papers on prompting and automatic prompt engineering.

Reference	Method	Limitation
White et al. [8]	Prompt Pattern Catalog (survey)	—
Liu et al. [9]	Pretrain–prompt–predict survey	—
Li et al. [10]	Optimization-based prompting survey	—

Table 2. Foundational prompt-engineering methods that inspired our work. We group them as analysis-driven (reasoning) vs. direct-label (produces labels without explicit reasoning).

Reference	Method	Limitation
Analysis-driven methods
Wei et al. [3]	Chain-of-Thought	Requires curated prompt; higher inference cost due to longer reasoning chains/tokens
Zhang et al. [11]	Auto-CoT	LLM-generated prompt; inconsistent across multiple runs; higher inference cost due to longer reasoning chains/tokens
Sun et al. [12]	AutoHint	Actor–critic loop over a single evolving prompt; hint may encourage prompt to analyze and reason
Direct-label methods
Pryzant et al. [13]	ProTeGi	Bandit/minibatch scoring; no explicit separation of FP vs. FN errors
Zhou et al. [14]	CLAPS	Evolutionary selection without explicit error feedback
Dong et al. [15]	PACE	Optimizes one candidate prompt per step, not tailored to classification; no explicit separation of FP vs. FN errors
Yuksekgonul et al. [16]	TextGrad	Complex framework; better suited to complex reasoning than to classification

Table 3. Recent studies related to our framework.

Reference	Method	Limitation/Notes vs. Our Framework
Yang et al. [17]	AMPO	Expands and keeps a pool of candidates; evaluated mainly on instruction-following/reasoning tasks.
Wu et al. [18]	STRAGO	Generates guided edits with additional “thinking” tokens; analysis-oriented, increasing token cost.
Choi [19]	APO-CF	Uses aggregate confusion-matrix signals (not per-instance FP/FN); demonstrated mainly for relevance evaluation; limited search depth per round.
Kumar et al. [20]	SCULPT	Provides a systematic way to edit long prompts; may miss prompt structures outside those it defines.
Das et al. [21]	GREATER	Requires token-level logits/probabilities, which are unavailable in closed LLMs.
Jain and Chowdhary [22]	LPO	Token-level/local edits can miss larger structural changes.
Mayilvaghanan et al. [23]	PROPEL	Agent-loop with cost overhead; evaluated mainly on small/medium LLMs.
Wu et al. [24]	AutoMedPrompt	Medical QA-centric; single-candidate refinement.

Table 4. Non-domain initial performance comparison of ProTeGi and PO2G+C across multiple tasks.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
9.1	Ethos	0.9350	200	0.9350	200
10.1	Liar	0.5020	200	0.5280	200
11.1	ARSarcasm	0.8270	200	0.8410	200
Average_1		0.7547	200	0.7680	200
3.1	Sentiment	0.9350	200	0.9400	200
4.1	Amazon Review	0.9300	200	0.9300	200
5.1	Tweet Airline	0.9400	200	0.9400	200
6.1	Hate Speech	0.9000	200	0.9000	200
7.1	Offensive Language	0.8550	200	0.8550	200
8.1	Clickbait	0.8630	200	0.8890	200
Average_2		0.9038	200	0.9090	200
Average_all		0.8541	200	0.8620	200

Table 5. Non-domain performance comparison of ProTeGi and PO2G+C across multiple tasks for iteration 3.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
9.1	Ethos	0.9350	7481	0.9300	6462
10.1	Liar	0.5820	7344	0.6130	10,300
11.1	ARSarcasm	0.8280	7510	0.8670	10,138
Average_1		0.7817	7445	0.8033	8967
3.1	Sentiment	0.9850	6989	0.9950	4240
4.1	Amazon Review	0.9450	7080	0.9400	3230
5.1	Tweet Airline	0.9450	7118	0.9550	8886
6.1	Hate Speech	0.8850	7055	0.9000	10,098
7.1	Offensive Language	0.8600	6822	0.8650	10,098
8.1	Clickbait	0.8980	6779	0.9270	9613
Average_2		0.9197	6974	0.9303	7694
Average_all		0.8737	7131	0.8880	8118

Table 6. Non-domain performance comparison of ProTeGi and PO2G+C across multiple tasks during iteration 6.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
9.1	Ethos	0.9500	15,070	0.9250	13,128
10.1	Liar	0.6020	14,998	0.6010	22,420
11.1	ARSarcasm	0.8440	15,029	0.8650	22,258
Average_1		0.7987	15,032	0.7970	19,269
3.1	Sentiment	0.9900	14,401	0.9400	7876
4.1	Amazon Review	0.9400	14,478	0.9500	6462
5.1	Tweet Airline	0.9250	14,469	0.9400	19,794
6.1	Hate Speech	0.9100	14,468	0.9150	22,016
7.1	Offensive Language	0.8650	14,012	0.9350	20,400
8.1	Clickbait	0.9420	14,144	0.9350	20,400
Average_2		0.9287	14,329	0.9358	16,158
Average_all		0.8853	14,563	0.8896	17,195

Table 7. Average performance of the non-domain dataset per iteration.

Iteration	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
0	0.8541	200	0.8620	200
1	0.8694	2347	0.8678	1843
2	0.8773	4681	0.8858	5039
3	0.8737	7131	0.8880	8118
4	0.8686	9599	0.8921	11,171
5	0.8777	12,084	0.8902	14,165
6	0.8853	14,563	0.8896	17,195
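
The efficiency comparison in Table 7 can be read as "how many iterations, and how many cumulative API calls, does each method need to reach a given accuracy?" The sketch below answers that question for the averages copied from Table 7. The helper names and the 0.885 threshold are illustrative assumptions, not values or code from the paper.

```python
from typing import Optional, Sequence

# Average accuracy and cumulative API calls per iteration (0..6), copied from Table 7.
PROTEGI_ACC = [0.8541, 0.8694, 0.8773, 0.8737, 0.8686, 0.8777, 0.8853]
PROTEGI_CALLS = [200, 2347, 4681, 7131, 9599, 12084, 14563]
PO2G_C_ACC = [0.8620, 0.8678, 0.8858, 0.8880, 0.8921, 0.8902, 0.8896]
PO2G_C_CALLS = [200, 1843, 5039, 8118, 11171, 14165, 17195]


def first_iteration_reaching(accuracies: Sequence[float], target: float) -> Optional[int]:
    """Return the first iteration whose accuracy is >= target, or None if never reached."""
    for iteration, accuracy in enumerate(accuracies):
        if accuracy >= target:
            return iteration
    return None


def summarize(name: str, accuracies: Sequence[float], calls: Sequence[int], target: float) -> str:
    """Describe when a method first reaches the target and at what cumulative cost."""
    iteration = first_iteration_reaching(accuracies, target)
    if iteration is None:
        return f"{name}: never reaches {target:.3f}"
    return f"{name}: iteration {iteration}, {calls[iteration]} cumulative API calls"


if __name__ == "__main__":
    target = 0.885  # illustrative threshold (assumption), not taken from the paper
    print(summarize("ProTeGi", PROTEGI_ACC, PROTEGI_CALLS, target))
    print(summarize("PO2G+C", PO2G_C_ACC, PO2G_C_CALLS, target))
```

With this threshold, the Table 7 averages place ProTeGi at iteration 6 (14,563 calls) and PO2G+C at iteration 2 (5039 calls). A different threshold can change the picture, so the choice should be reported alongside the result.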

Table 8. Performance comparison of ProTeGi and PO2G+C on the LIAR dataset during iterations 3 and 6.

Run	ProTeGi (Iteration 3)	PO2G+C (Iteration 3)	ProTeGi (Iteration 6)	PO2G+C (Iteration 6)
1	0.5900	0.6250	0.6150	0.6250
2	0.5900	0.6250	0.6150	0.6250
3	0.6000	0.6300	0.6200	0.6400
4	0.5900	0.6050	0.5900	0.5600
5	0.5800	0.5800	0.6100	0.5850
Average	0.5900	0.6130	0.6100	0.6070
SE	0.0032	0.0093	0.0052	0.0149
Pairwise t-test (p-value)	0.0280		0.7833
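
The SE and pairwise t-test rows of Table 8 can be reproduced from the five per-run accuracies. The sketch below is one way to do so using only the Python standard library; it assumes SE is the sample standard deviation divided by √n and that the test is a two-sided paired t-test, which is consistent with the reported numbers but is not spelled out in this section. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

# Per-run test accuracies at iteration 3 on LIAR, copied from Table 8.
PROTEGI = [0.5900, 0.5900, 0.6000, 0.5900, 0.5800]
PO2G_C = [0.6250, 0.6250, 0.6300, 0.6050, 0.5800]


def standard_error(values):
    """Sample standard deviation divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # probability that |T| <= |t|
    return 1 - central_mass


def paired_t_test(a, b):
    """Return (t statistic, two-sided p-value) for paired samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [y - x for x, y in zip(a, b)]
    t_stat = mean(diffs) / standard_error(diffs)
    return t_stat, two_sided_p(t_stat, len(diffs) - 1)


if __name__ == "__main__":
    t_stat, p_value = paired_t_test(PROTEGI, PO2G_C)
    print(f"SE ProTeGi = {standard_error(PROTEGI):.4f}")
    print(f"SE PO2G+C  = {standard_error(PO2G_C):.4f}")
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Under these assumptions the script gives SE values of 0.0032 and 0.0093 and a p-value of about 0.028, in line with the iteration 3 columns of Table 8. With only five runs per method, the test has limited power, so non-significant rows should not be read as evidence of equal performance.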

Table 9. Performance comparison of ProTeGi and PO2G+C on ARSARCASM dataset during iterations 3 and 6.

Run	ProTeGi (Iteration 3)	PO2G+C (Iteration 3)	ProTeGi (Iteration 6)	PO2G+C (Iteration 6)
1	0.8550	0.8550	0.8450	0.8650
2	0.8550	0.8550	0.8450	0.8650
3	0.7900	0.8800	0.8100	0.8500
4	0.8250	0.8650	0.8500	0.8550
5	0.8500	0.8800	0.8500	0.8750
Average	0.8350	0.8670	0.8400	0.8620
SE	0.0125	0.0056	0.0076	0.0044
Pairwise t-test (p-value)	0.1254		0.0173

Table 10. Performance comparison of ProTeGi and PO2G+C on the CLICKBAIT dataset during iterations 3 and 6.

Run	ProTeGi (Iteration 3)	PO2G+C (Iteration 3)	ProTeGi (Iteration 6)	PO2G+C (Iteration 6)
1	0.8250	0.9750	0.9400	0.9550
2	0.8250	0.9750	0.9400	0.9550
3	0.9200	0.9200	0.9500	0.9150
4	0.9550	0.9200	0.9450	0.9400
5	0.9300	0.9200	0.9550	0.9350
Average	0.8910	0.9420	0.9460	0.9400
SE	0.0275	0.0135	0.0029	0.0074
Pairwise t-test (p-value)	0.2796		0.5734

Table 11. Legal domain initial performance comparison of ProTeGi and PO2G+C across multiple tasks.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
1.1	Tax Product	0.5200	150	0.5867	150
1.2	Tax Transaction	0.3600	150	0.3600	150
1.3	Tax Exemption	0.9600	150	0.9667	150
1.4	Tax Calculation	0.9667	150	0.9667	150
1.5	Tax Subject	0.5800	150	0.6000	150
1.6	Tax Process	0.5667	150	0.5800	150
2.1	Obligation	0.4670	200	0.4670	200
Average		0.6315	157	0.6467	157

Table 12. Legal domain performance comparison of ProTeGi and PO2G+C across multiple tasks during iteration 3.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
1.1	Tax Product	0.9400	6366	0.9800	6382
1.2	Tax Transaction	0.8533	6440	0.9267	5926
1.3	Tax Exemption	0.9733	6068	0.9800	1974
1.4	Tax Calculation	0.9933	6417	1.0000	3646
1.5	Tax Subject	0.8000	6438	0.8733	6838
1.6	Tax Process	0.8467	6383	0.8600	7598
2.1	Obligation	0.8690	7308	0.8950	9452
Average		0.8965	6489	0.9307	5974

Table 13. Legal domain performance comparison of ProTeGi and PO2G+C across multiple tasks during iteration 6.

Task ID	Task	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
1.1	Tax Product	0.9533	13,575	0.9800	11,398
1.2	Tax Transaction	0.8933	13,640	0.9267	13,982
1.3	Tax Exemption	0.9867	13,201	0.9800	3950
1.4	Tax Calculation	0.9933	13,593	1.0000	7142
1.5	Tax Subject	0.8133	13,565	0.9000	15,958
1.6	Tax Process	0.8467	13,504	0.8733	18,300
2.1	Obligation	0.8740	14,688	0.9240	20,642
Average		0.9087	13,681	0.9406	13,053

Table 14. Average performance of the legal domain dataset per iteration.

Iteration	ProTeGi Accuracy	ProTeGi API Calls	PO2G+C Accuracy	PO2G+C API Calls
0	0.6315	157	0.6467	157
1	0.8659	1847	0.8929	1165
2	0.8803	4094	0.9154	3532
3	0.8965	6489	0.9307	5974
4	0.9013	8889	0.9337	8876
5	0.9061	11,283	0.9373	11,025
6	0.9087	13,681	0.9406	13,053

Table 15. Precision, recall, and F₁ per iteration on the legal domain task.

Iter.	ProTeGi Prec.	ProTeGi Rec.	ProTeGi F₁	PO2G+C Prec.	PO2G+C Rec.	PO2G+C F₁
0	0.5248	0.8078	0.5389	0.5575	0.7747	0.5422
1	0.6353	0.5332	0.5612	0.7995	0.6379	0.6961
2	0.6595	0.6364	0.6439	0.8740	0.7217	0.7643
3	0.7771	0.7520	0.7516	0.9174	0.7950	0.8137
4	0.7998	0.7440	0.7693	0.8839	0.8172	0.8367
5	0.8039	0.7484	0.7587	0.8960	0.8194	0.8409
6	0.8265	0.7591	0.7754	0.9218	0.7945	0.8401

Table 16. Performance comparison of ProTeGi and PO2G+C on the Obligation dataset during iterations 3 and 6.

Run	ProTeGi (Iteration 3)	PO2G+C (Iteration 3)	ProTeGi (Iteration 6)	PO2G+C (Iteration 6)
1	0.8600	0.8700	0.8750	0.9200
2	0.8600	0.8700	0.8750	0.9200
3	0.8900	0.9000	0.8850	0.9250
4	0.9050	0.9000	0.8500	0.9100
5	0.8350	0.9050	0.8850	0.9400
Average	0.8700	0.8890	0.8740	0.9230
SE	0.0123	0.0078	0.0064	0.0049
Pairwise t-test (p-value)	0.2199		0.0002

Table 17. Non-domain tasks: PO2G baseline and differences for PO2G+C relative to PO2G across iterations. Positive Δ accuracy favors PO2G+C; negative Δ API calls indicate fewer calls for PO2G+C.

Iteration	PO2G Accuracy	PO2G API Calls	Δ Accuracy (PO2G+C − PO2G)	Δ API Calls (PO2G+C − PO2G)
0	0.8552	200	0.0068	0
1	0.8712	1888	−0.0034	−45
2	0.8760	5035	0.0098	4
3	0.8892	8006	−0.0012	112
4	0.8961	11,216	−0.0040	−45
5	0.8932	14,398	−0.0030	−233
6	0.8853	19,062	0.0042	−1867
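
Because Table 17 reports PO2G+C only as a difference from PO2G, the absolute PO2G+C values can be recovered by adding each Δ to the baseline and then checked against Table 7. The sketch below does this arithmetic; all numbers are copied from the two tables, and the helper names are illustrative assumptions.

```python
# PO2G baseline and deltas per iteration (0..6), copied from Table 17.
PO2G_ACC = [0.8552, 0.8712, 0.8760, 0.8892, 0.8961, 0.8932, 0.8853]
PO2G_CALLS = [200, 1888, 5035, 8006, 11216, 14398, 19062]
DELTA_ACC = [0.0068, -0.0034, 0.0098, -0.0012, -0.0040, -0.0030, 0.0042]
DELTA_CALLS = [0, -45, 4, 112, -45, -233, -1867]

# PO2G+C averages per iteration, copied from Table 7.
TABLE7_ACC = [0.8620, 0.8678, 0.8858, 0.8880, 0.8921, 0.8902, 0.8896]
TABLE7_CALLS = [200, 1843, 5039, 8118, 11171, 14165, 17195]


def reconstruct(baseline, delta):
    """Add each delta to its baseline value (PO2G+C = PO2G + delta)."""
    return [b + d for b, d in zip(baseline, delta)]


def max_abs_gap(left, right):
    """Largest absolute element-wise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(left, right))


if __name__ == "__main__":
    acc = reconstruct(PO2G_ACC, DELTA_ACC)
    calls = reconstruct(PO2G_CALLS, DELTA_CALLS)
    print("max accuracy gap vs Table 7:", round(max_abs_gap(acc, TABLE7_ACC), 4))
    print("max API-call gap vs Table 7:", max_abs_gap(calls, TABLE7_CALLS))
```

The reconstructed API calls match Table 7 exactly, and the accuracies agree to within 0.0001, which is what four-decimal rounding of the averages would produce.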

Table 18. Legal domain tasks: PO2G baseline and differences for PO2G+C relative to PO2G across iterations. Positive Δ accuracy favors PO2G+C; negative Δ API calls indicate fewer calls for PO2G+C.

Iteration	PO2G Accuracy	PO2G API Calls	Δ Accuracy (PO2G+C − PO2G)	Δ API Calls (PO2G+C − PO2G)
0	0.6423	157	0.0044	0
1	0.8781	1199	0.0149	−33
2	0.8957	3484	0.0197	48
3	0.8977	6073	0.0330	−100
4	0.9033	8041	0.0304	835
5	0.9113	11,776	0.0259	−751
6	0.9201	14,599	0.0205	−1546

Table 19. Legal domain tasks: PO2G baseline precision, recall, and F₁, and differences for PO2G+C relative to PO2G across iterations. Positive Δ favors PO2G+C.

Iteration	PO2G Precision	PO2G Recall	PO2G F₁	Δ Prec. (PO2G+C − PO2G)	Δ Rec. (PO2G+C − PO2G)	Δ F₁ (PO2G+C − PO2G)
0	0.5623	0.7725	0.5396	−0.0049	0.0023	0.0026
1	0.7455	0.6541	0.6897	0.0541	−0.0162	0.0064
2	0.7502	0.6576	0.6625	0.1239	0.0641	0.1018
3	0.7581	0.6492	0.6627	0.1593	0.1459	0.1511
4	0.8281	0.7225	0.7331	0.0557	0.0948	0.1036
5	0.8735	0.7192	0.7504	0.0225	0.1002	0.0905
6	0.8840	0.7373	0.7656	0.0379	0.0571	0.0744

Table 20. Empty vs. Initial prompt at iterations i = 0, 3, 6. Δ Acc = Initial − Empty (positive value favors the Initial prompt). Initial-prompt accuracies are reported in Table 14.

Dataset	Empty Prompt Test Accuracy (i = 0)	Empty Prompt Test Accuracy (i = 3)	Empty Prompt Test Accuracy (i = 6)	Δ Accuracy, Initial − Empty (i = 0)	Δ Accuracy, Initial − Empty (i = 3)	Δ Accuracy, Initial − Empty (i = 6)
LIAR	0.3650	0.6800	0.6200	0.1630	−0.0670	−0.0190
AR_Sarcasm	0.3800	0.8950	0.9000	0.4610	−0.0280	−0.0350
Clickbait	0.4250	0.6250	0.6600	0.4640	0.3020	0.2750
Obligation	0.6300	0.7900	0.8000	−0.1630	0.1050	0.1240
Average	0.4500	0.7475	0.7450	0.2313	0.0780	0.0863

Table 21. GPT-4o accuracy and the difference between 4o-mini and GPT-4o (Δ = 4o-mini − GPT-4o) at key iterations (i = 0, i = 3, and i = 6). Positive Δ favors 4o-mini.

Dataset	GPT-4o (i = 0)	GPT-4o (i = 3)	GPT-4o (i = 6)	Δ 4o-mini − GPT-4o (i = 0)	Δ 4o-mini − GPT-4o (i = 3)	Δ 4o-mini − GPT-4o (i = 6)
LIAR	0.5050	0.6000	0.5800	0.0230	0.0130	0.0210
AR_Sarcasm	0.8500	0.8700	0.8650	−0.0090	−0.0030	0.0000
Clickbait	0.8850	0.9800	0.9900	0.0040	−0.0530	−0.0550
Obligation	0.4650	0.8750	0.8950	0.0020	0.0200	0.0290

Table 22. Sample prompt length (word count) at iteration 6 for two LLMs.

Dataset	GPT-4o	4o Mini
LIAR	360	222
AR_Sarcasm	466	67
Clickbait	459	302
Obligation	440	31

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lieander, A.J.; Wang, H.; Rafferty, K. Prompt Optimization with Two Gradients for Classification in Large Language Models. AI 2025, 6, 182. https://doi.org/10.3390/ai6080182

