Article | Open Access

9 December 2025

Chain-of-Thought Prompt Optimization via Adversarial Learning

1 School of Computer Science, Wuhan University, Wuhan 430072, China
2 Institute of Power Transmission and Transformation Technology, State Grid Zhejiang Electric Power Co., Ltd., Research Institute, Hangzhou 310014, China
* Author to whom correspondence should be addressed.

Abstract

Chain-of-Thought (CoT) prompting has demonstrated strong effectiveness in improving the reasoning capabilities of Large Language Models (LLMs). However, existing CoT optimization approaches still lack systematic mechanisms for evaluating and refining prompts. To address this gap, we propose Adversarial Chain-of-Thought (adv-CoT), a framework that introduces adversarial learning into prompt optimization. Adv-CoT iteratively refines an initial prompt through generator–discriminator interactions and integrates both feedback and verification mechanisms. This process enables more targeted and interpretable improvements to CoT instructions and demonstrations. We evaluate adv-CoT on twelve datasets spanning commonsense, factual, symbolic, and arithmetic reasoning, where it yields an average improvement of 4.44% on GPT-3.5-turbo and 1.08% on GPT-4o-mini, with both gains being statistically significant (paired t-test, p < 0.05). The experimental results show that the framework yields consistent but task-dependent gains, particularly on numerical and factual reasoning tasks, and maintains competitive performance on symbolic and commonsense benchmarks. Paired significance tests further indicate that improvements are statistically reliable on high-capacity proprietary models, while results on smaller open-source models exhibit greater variance. Although these findings demonstrate the promise of adversarial refinement for CoT prompting, the conclusions remain preliminary: the effectiveness of adv-CoT depends on the base model's reasoning capability, and the current evaluation is limited to four major categories of reasoning tasks. We release the full implementation and prompts to support further investigation into broader applications and more generalizable prompt optimization strategies.

1. Introduction

Large Language Models (LLMs) have achieved significant advances in Natural Language Processing (NLP) tasks [1,2,3,4,5]. To better guide their reasoning process, researchers have introduced In-Context Learning (ICL) [6,7,8]. Moreover, Chain-of-Thought (CoT) [9,10,11,12,13] prompting has demonstrated significant effectiveness in enhancing the performance of LLMs. Unlike traditional prompts that require the model to produce an answer directly, which often results in abrupt reasoning, hallucinations [14,15,16,17], and logical errors, CoT encourages the model to articulate intermediate reasoning steps. By decomposing complex problems into smaller subtasks, this approach facilitates error tracing and substantially improves output accuracy, particularly in domains such as mathematical and symbolic reasoning.
In traditional CoT, the reasoning process generated by LLMs is often difficult to verify for authenticity [18]. Moreover, CoT prompts typically require task-specific adjustments for different application scenarios, lacking generality and scalability [13,19,20]. More broadly, despite the rapid progress of CoT prompting, existing approaches still lack a principled and scalable mechanism for optimizing and verifying CoT prompts. Current methods typically rely on heuristic refinement, manual prompt editing, or sampling-based Self-Consistency, all of which provide limited control over error propagation and lack explicit criteria for assessing the quality of intermediate reasoning steps. Furthermore, most prior work focuses on improving final-answer accuracy rather than systematically evaluating whether the reasoning process itself becomes more reliable, diverse, or robust. Existing approaches [13,21,22,23,24] still rely heavily on costly downstream evaluations and lack a principled optimization framework for prompt design. These limitations motivate us to explore a more efficient and theoretically grounded approach.
Generative Adversarial Networks (GANs) and adversarial learning [25] have achieved remarkable success in various domains, including image generation [25,26] and model robustness enhancement [27,28]. At their core, GANs consist of a generator-discriminator framework where the generator aims to increase the likelihood of the discriminator making errors by producing data indistinguishable from real samples, while the discriminator seeks to correctly differentiate between generated and authentic data. Throughout training, both components evolve adversarially.
Inspired by GANs and Adversarial In-Context Learning (adv-ICL) [29], we introduce adversarial learning into the evaluation and optimization of CoT prompts. We propose the Adversarial Chain-of-Thought (adv-CoT) framework, which takes an initial prompt as input and, through several rounds of iterative refinement, produces two fully developed sets of CoT-based instructions and demonstrations, with one serving as the generator and the other as the discriminator. Without modifying model parameters, adv-CoT enhances LLM performance by optimizing prompts adversarially. Compared with existing optimization approaches, our method introduces an adversarially guided refinement loop that exposes model-specific reasoning failures in a targeted manner. adv-CoT actively modifies the underlying instructions and demonstrations. In contrast to verification-based approaches that evaluate final answers, our framework incorporates verification into the optimization loop itself, providing a more principled mechanism for improving both stability and reliability. The comparison between the traditional CoT method and our adv-CoT is illustrated in Figure 1.
Figure 1. Comparison between standard CoT prompting and adv-CoT. adv-CoT performs adversarial learning to optimize the prompts of a generator and a discriminator. The trained discriminator acts as a verifier that selects the highest-confidence answer from multiple outputs generated with the refined generator prompt, improving reliability without parameter updates.
To guide the development of our adversarial Chain-of-Thought framework, we focus on three core research questions:
RQ1.
How can we design a principled and scalable mechanism to refine CoT prompts without model fine-tuning, enabling iterative improvement through structured adversarial feedback?
RQ2.
Can adversarial interactions between generator and discriminator expose weaknesses in reasoning chains and drive more accurate, stable, and error-resistant CoT reasoning across diverse datasets and task types?
RQ3.
How can explicit verification signals be integrated into the optimization loop to evaluate and enhance the reliability of the generated reasoning paths and final answers?
These RQs directly shape the design of our adv-CoT framework and the corresponding analyses in later sections, including robustness evaluation (RQ1), feedback-driven refinement and error reduction (RQ2), and verification-based reliability assessment (RQ3).
We evaluate adv-CoT on 12 datasets spanning diverse reasoning types, including commonsense, factual, symbolic, and arithmetic tasks. Results demonstrate that adv-CoT effectively optimizes prompts and improves model output accuracy. Furthermore, combining Self-Consistency (SC) [18] with adv-CoT yields strong performance across multiple datasets. The framework is also highly extensible and can integrate with other prompt-related methods for further improvement.
To provide a clear roadmap, the remainder of this paper is organized as follows. Section 2 reviews related work on prompt optimization and adversarial learning. Section 3 introduces the key components and terminology that underpin the adv-CoT framework. Section 4 presents the full methodology, including the adversarial learning setup, feedback mechanism, optimization algorithm, and verifier design. Section 5 describes the experimental setup, covering datasets, backbone models, baselines, evaluation metrics, and implementation details. Section 6 reports and discusses the main empirical results, including significance analysis, ablation studies, and cross-model generalization. Finally, Section 7 concludes the paper by summarizing the findings, outlining limitations, and highlighting directions for future research.

2. Related Work

2.1. Prompt Optimization for Chain-of-Thought Reasoning

CoT prompting has been shown to significantly improve reasoning performance in LLMs [9,10]. However, whether prompts are manually created or automatically generated, their effectiveness is still mainly assessed through computationally intensive downstream tasks, with costs increasing as the number of candidate prompts, sampled reasoning paths, and task instances grows [13,30,31]. Recent research has proposed more systematic optimization methods, including black-box optimization [32] and Bayesian optimization combined with Hyperband [33]. However, these methods still provide limited theoretical guarantees for reasoning trajectories. Methods focusing on process-level evaluation, such as process supervision and PRM800K [34], offer promising alternatives to evaluation based solely on final task outcomes, but they remain in early development. Meanwhile, studies on robust prompt optimization [35] and uncertainty-aware evaluation [36] highlight the need for a unified framework that can simultaneously improve efficiency, robustness, and reasoning quality in CoT prompting.

2.2. Adversarial Training

Generative Adversarial Networks (GANs) were first introduced by Goodfellow et al. [25], establishing a minimax game framework in which the generator is incentivized to produce high-fidelity outputs. This adversarial paradigm has achieved remarkable success in tasks such as image generation [25,26], super-resolution [37], and domain adaptation [38,39]. While GANs have been widely adopted in the field of computer vision [40], applying them to discrete domains like text generation remains challenging due to the non-differentiability of sampling operations [41,42]. To overcome these difficulties, various enhancements [43,44] have been developed to stabilize the training process and reduce exposure bias.
Beyond stability, diversification and robustness have also become key objectives in adversarial text generation. To alleviate mode collapse and enhance output variety, recent work incorporates diversity-promoting objectives and entropy-based regularization [45]. Robustness remains challenging due to noisy rewards and discriminator drift; adversarial regularization [46] and robust discriminator designs [47] help improve reliability under unstable adversarial feedback. These advances shift adversarial text generation toward more diverse and dependable modeling.
When applying GANs to LLMs, direct parameter updates are often infeasible due to the high computational cost. In this context, leveraging adversarial learning for prompt optimization presents a more practical and efficient alternative. adv-ICL [29] was the first to leverage adversarial learning to optimize prompts for in-context learning, establishing a formal theoretical foundation for applying GAN-style minimax optimization to prompt refinement and demonstrating that adversarial learning is a valid and effective mechanism for prompt optimization. While motivated by similar principles, adv-ICL and our method differ in several key aspects. Specifically, compared to adv-ICL, adv-CoT not only considers the outputs of the generator during training but also incorporates the discriminator's loss to filter and refine generated prompts. This joint optimization enables both the generator and the discriminator to improve simultaneously, thereby maximizing the effectiveness of the training framework. Moreover, adv-ICL suffers from training instability due to the inherent variability in prompt candidates generated by LLMs, resulting in fluctuating loss values and additional training iterations that ultimately raise computational costs. In contrast, adv-CoT addresses this issue by introducing a feedback mechanism during the discriminator phase: it identifies errors and provides revision suggestions to guide the generation of more effective prompt variants. This targeted guidance substantially reduces the number of required training iterations, improving efficiency while preserving performance.

3. Key Components and Terminology

To establish a clear conceptual foundation for our adversarial reasoning framework, we summarize the core modules involved in the optimization process. These definitions clarify the roles and interactions among components before presenting the full methodology.
Generator. The generator produces an initial reasoning chain and a candidate answer for a given input. It acts as the source of model-generated outputs that will later be examined by the discriminator. In the adversarial setting, the generator aims to produce answers that are indistinguishable from human-written responses.
Discriminator. The discriminator evaluates whether an answer is human-generated or produced by the generator. It plays the adversarial counterpart to the generator: while the generator attempts to craft outputs that appear human-authored, the discriminator attempts to correctly identify synthetic content. This adversarial tension drives the refinement loop.
Proposer. The proposer analyzes the generator’s output and identifies potential shortcomings in the reasoning process. It provides structured feedback by explaining possible failure points and suggesting targeted corrections. This feedback serves as a reflective signal used in subsequent refinement.
Modifier. The modifier incorporates the proposer’s feedback to revise and refine the reasoning chain. It performs targeted rewriting—strengthening logic, correcting errors, or restructuring reasoning—to improve the likelihood that the updated answer can pass the discriminator’s evaluation.
Verifier (trained discriminator). The verifier is the final, trained form of the discriminator, used exclusively at the answer-validation stage. Rather than guiding training, it functions as a judging module to determine whether the final reasoning chain yields a correct solution.
Adversarial learning. Adversarial learning refers to the competitive optimization dynamic between the generator and the discriminator. The generator aims to minimize the discriminator’s ability to detect generated content, while the discriminator attempts to maximize its classification accuracy, thereby improving reasoning quality through adversarial refinement.
Minimax game. This adversarial interaction can be formalized as a minimax game. The generator selects strategies that maximize its chance of fooling the discriminator, whereas the discriminator selects strategies that minimize the generator’s success. This mutually opposing objective structure provides a principled modeling perspective for the optimization process.
Self-Consistency. Self-Consistency is a decoding strategy that aggregates multiple sampled reasoning paths and selects the final answer based on agreement (e.g., majority voting). It increases reliability by reducing variance across individual reasoning samples.
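As a minimal illustration of this aggregation step, the following Python sketch implements majority voting over sampled answers; the sample_reasoning_path callable is a hypothetical stand-in for a single CoT generation call and is not part of our released implementation.

```python
from collections import Counter

def self_consistency(question, sample_reasoning_path, n_paths=10):
    """Aggregate n sampled reasoning paths by majority vote over final answers.

    sample_reasoning_path is a hypothetical callable that runs one CoT
    generation and returns (reasoning_text, final_answer).
    """
    answers = [sample_reasoning_path(question)[1] for _ in range(n_paths)]
    # Majority voting: the most frequent final answer wins.
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer
```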

4. Methods

4.1. Adversarial Learning

Adversarial learning serves as the core component of the adv-CoT framework, which consists of two main modules: a generator and a discriminator. Both modules are powered by an LLM and utilize CoT prompting. The generator produces answers based on the given instruction and CoT demonstrations, while the discriminator evaluates whether the input answer is correct or generated by the generator, also using the instruction and CoT demonstrations for guidance.
In addition, a modifier module refines the prompts based on the discriminator’s judgments and revision suggestions. It generates new instruction-example pairs, and the system selects the optimal pair based on the computed loss value to update both the generator and the discriminator. The detailed training procedure is illustrated in Figure 2.
Figure 2. The training process of adv-CoT is formulated as a minimax game. The generator produces outputs, while the discriminator determines whether the input is a real sample or a generated sample. The Proposer provides suggestions for prompt revision, and the Modifier applies concrete modifications until the loss value meets the desired criterion. The colors of the arrows correspond to the information transmitted by the respective modules.
Generator: The generator G is guided by a prompt Prompt_G, which consists of an instruction Instruction_G and a set of input–output demonstrations Demonstration_G = {(I_1^G, O_1^G), ..., (I_k^G, O_k^G)}, where I_i^G denotes an example input and O_i^G is the corresponding output. The generator takes a user query as input and produces an answer as output.
Discriminator: The discriminator D is guided by a prompt Prompt_D, which consists of an instruction Instruction_D and a set of demonstrations Demonstration_D = {(I_1^D, O_1^D, D_1^D), ..., (I_k^D, O_k^D, D_k^D)}, where I_i^D is an example input, O_i^D is the corresponding output, and D_i^D represents the discriminator's decision and rationale for that example. D takes as input the user's question along with the answer generated by the generator and outputs a binary decision indicating whether the answer is likely to be genuine or generated.
Modifier: The modifier M, driven by an LLM, is guided by a prompt Prompt_M that consists of a single instruction. This instruction includes two parts: Instruction_i^M, which specifies how to revise the generator's and discriminator's instructions (Instruction_G and Instruction_D), and Instruction_e^M, which guides the modification of the demonstrations (Demonstration_G and Demonstration_D). The modifier operates under a zero-shot prompting setting. Based on this instruction, the modifier generates multiple prompt variants Prompt_V^G and Prompt_V^D for the generator and discriminator, respectively. These variants are then used to recompute the loss function, enabling iterative updates to both the generator and the discriminator.
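To make these prompt structures concrete, the following sketch (an illustration of the data layout described above, not code from the released implementation) represents a prompt as an instruction plus a list of demonstrations, with an optional decision field used only by the discriminator.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Demonstration:
    input_text: str                 # example input I_i
    output_text: str                # example output O_i
    decision: Optional[str] = None  # discriminator-only decision/rationale D_i

@dataclass
class Prompt:
    instruction: str
    demonstrations: List[Demonstration] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the prompt as the instruction followed by k demonstrations."""
        parts = [self.instruction]
        for d in self.demonstrations:
            block = f"Input: {d.input_text}\nOutput: {d.output_text}"
            if d.decision is not None:
                block += f"\nDecision: {d.decision}"
            parts.append(block)
        return "\n\n".join(parts)
```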

4.2. Feedback

In adv-ICL, the discriminator merely identifies correctness, offering limited actionable feedback for subsequent optimization. This limitation becomes more significant in the context of LLMs, where the parameters updated by the modifier M are expressed as explicit textual prompts. Due to the inherent randomness in LLM outputs, the direction of modification becomes uncertain, leading to instability in the loss function and increased computational cost. To address this issue, adv-CoT records the discriminator’s decisions, collecting input–output pairs that the generator fails to handle, as well as those that the discriminator misclassifies or cannot confidently judge. These challenging cases are then passed to the Proposer, an LLM-driven module that generates targeted revision suggestions. This process offers explicit guidance for the modifier, effectively directing the prompt updates and improving training efficiency.
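The feedback step can be summarized by the following sketch; run_generator, run_discriminator, and run_proposer are hypothetical LLM wrappers introduced only for illustration, and the 0.5 confidence threshold is an assumed placeholder rather than a value from our implementation.

```python
def collect_feedback(examples, run_generator, run_discriminator, run_proposer):
    """Gather cases the generator or discriminator handles poorly and ask the
    Proposer for targeted revision suggestions.

    Hypothetical wrappers:
      run_generator(question) -> generated answer
      run_discriminator(question, answer) -> (label, confidence), label in {"A", "B"}
      run_proposer(hard_cases) -> textual revision suggestion
    """
    hard_cases = []
    for question, gold_answer in examples:
        generated = run_generator(question)
        label, confidence = run_discriminator(question, generated)
        # Keep cases the generator fails on ("B" = detected as generated)
        # or that the discriminator judges with low confidence.
        if label == "B" or confidence < 0.5:
            hard_cases.append({"question": question,
                               "generated": generated,
                               "gold": gold_answer,
                               "judgement": (label, confidence)})
    return run_proposer(hard_cases)
```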

4.3. Adversarial Learning Algorithm

Algorithm 1 provides a detailed description of our adv-CoT. Inspired by GANs and adv-ICL, the training objective loss function of adv-CoT is defined as follows:
$$J(D, G) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\big[\log D(x, y)\big] + \mathbb{E}_{x \sim p_{\text{data}}}\big[\log\big(1 - D(x, G(x))\big)\big],$$
where p_data denotes the real data distribution and E denotes the expectation with respect to the indicated distribution. The discriminator D outputs one of two options: "(A) Correct ground truth" or "(B) Generated output." The confidence scores associated with these discriminator outputs are treated as expectation values during iterative training. For the discriminator D, the objective is to maximize the loss (noting that the confidence scores, being log-probabilities, are negative) in order to better distinguish generated answers from real ones. Conversely, the generator G aims to minimize the loss so that its outputs are not detected as generated by the discriminator. Thus, the entire training process can be formulated as a minimax optimization problem:
$$\min_G \max_D J(D, G).$$
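The following sketch illustrates how this objective can be estimated in practice from discriminator confidence scores; discriminator_logprob and generator are hypothetical helpers (the former returns the log-probability the discriminator assigns to "(A) correct ground truth"), and the clipping constant is an assumption for numerical safety.

```python
import math

def estimate_adversarial_loss(batch, generator, discriminator_logprob):
    """Monte-Carlo estimate of J(D, G) over a small demonstration batch.

    batch: list of (question, ground_truth_answer) pairs.
    generator(question) -> generated answer (hypothetical LLM wrapper).
    discriminator_logprob(question, answer) -> log P_D("(A) correct ground truth").
    """
    total = 0.0
    for question, gold in batch:
        log_p_real = discriminator_logprob(question, gold)            # log D(x, y)
        p_fake_real = math.exp(discriminator_logprob(question, generator(question)))
        log_one_minus = math.log(max(1.0 - p_fake_real, 1e-12))       # log(1 - D(x, G(x)))
        total += log_p_real + log_one_minus
    return total / len(batch)
```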
Algorithm 1 Adversarial Chain-of-Thought Optimization
 1: Input: Generator G, Discriminator D, Proposer P, Modifier M
 2: Input: Prompts Prompt_G, Prompt_D, Prompt_P, Prompt_M = Instruction_i^M / Instruction_d^M
 3: Input: Number of iterations Num_I, demonstrations Num_D, max candidates Num_C
 4: Input: Demonstration set S
 5: for i = 1 to Num_I do
 6:     Sample Num_D examples from S and compute initial loss J
 7:     for all target ∈ {instruction, demonstration} do
 8:         P generates Feedback_P from Prompt_P and D's outputs
 9:         for j = 1 to Num_C do
10:             M generates new candidate C_new using Prompt_M, Feedback_P, and J
11:             Compute loss J_new for C_new
12:             if J_new > J then
13:                 Update candidate and loss: C ← C_new, J ← J_new
14:                 break
15:             end if
16:         end for
17:     end for
18:     // Similarly optimize Prompt_G for G to minimize J
19:     …
20: end for
21: Output: Optimized prompts Prompt_G, Prompt_D
The final outcome of the training yields a new set of prompts, Prompt_G^new and Prompt_D^new, corresponding to an updated generator G_new and discriminator D_new.
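For readers who prefer executable pseudocode, the discriminator-side update of Algorithm 1 can be sketched as follows; propose_feedback, modify_prompt, and compute_loss are hypothetical wrappers around the Proposer, the Modifier, and the objective J(D, G), and the greedy acceptance rule mirrors lines 9–16 of the algorithm.

```python
def optimize_discriminator_prompt(prompt_d, batch, propose_feedback,
                                  modify_prompt, compute_loss, num_candidates=5):
    """One adversarial refinement step for the discriminator prompt.

    The discriminator seeks to maximize J(D, G); candidates produced by the
    Modifier are accepted greedily as soon as the loss improves.
    """
    loss = compute_loss(prompt_d, batch)
    for target in ("instruction", "demonstration"):
        feedback = propose_feedback(prompt_d, batch)           # Proposer suggestions
        for _ in range(num_candidates):
            candidate = modify_prompt(prompt_d, target, feedback, loss)
            candidate_loss = compute_loss(candidate, batch)
            if candidate_loss > loss:                          # maximize J for D
                prompt_d, loss = candidate, candidate_loss
                break                                          # accept first improvement
    return prompt_d, loss
```

The generator-side update follows the same structure with the acceptance test reversed, since G seeks to minimize J.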

4.4. Verifier

The Verifier V refers to the trained discriminator D new . Given a user query, the updated generator G new produces an answer, which is then evaluated by the verifier. The verifier performs r rounds of evaluation, and the answer with the highest confidence score (i.e., the one associated with the greatest loss value) is selected as the final output. While prior prompting methods have explored ways to verify LLM-generated outputs, most rely on the LLMs to make simple judgments without a clear evaluation criterion or standard. In contrast, during adversarial training, both the generator and discriminator evolve together. This co-training allows the generator to produce more accurate outputs and the discriminator to serve as a reliable verifier. Traditional applications of GANs often focus solely on the generator after training, overlooking the discriminator. However, in the adv-CoT framework, the discriminator offers explicit verification signals, effectively enhancing the reliability and accuracy of the final answers. A more detailed illustration can be found in Figure 3.
Figure 3. Verification Process. The generator produces multiple reasoning paths, and the Verifier evaluates each path based on its log-probability scores. The path with the highest log-probability is selected as the final output.
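A minimal sketch of this verification stage is shown below; generate_answer and verifier_score are hypothetical wrappers around the refined generator and the trained discriminator, with verifier_score returning the log-probability assigned to "(A) correct ground truth".

```python
def verify_and_select(question, generate_answer, verifier_score, r=3):
    """Generate r candidate reasoning paths and keep the one the trained
    discriminator (verifier) scores as most likely to be correct.

    generate_answer(question) -> (reasoning, answer)   # hypothetical generator call
    verifier_score(question, answer) -> log-probability of "(A) correct ground truth"
    """
    candidates = [generate_answer(question) for _ in range(r)]
    best_reasoning, best_answer = max(
        candidates, key=lambda c: verifier_score(question, c[1])
    )
    return best_answer
```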

5. Experimental Setup

5.1. Datasets

Following previous work [48], we selected three types of datasets in our experiments. The details are as follows: (1) Commonsense & factual reasoning. This category includes CommonsenseQA (CSQA) [49], StrategyQA [50], OpenBookQA [51], the AI2 Reasoning Challenge (ARC-c) [52], a sports understanding task from the BIG-Bench benchmark [53], and BoolQ [54]. These datasets are used to evaluate performance on commonsense and factual reasoning capabilities. (2) Symbolic reasoning. We evaluate two symbolic reasoning tasks: Last Letter Concatenation and Coin Flip [9]. (3) Arithmetic reasoning. This category includes grade-school math problems from GSM8K [55], the challenging math word problem dataset SVAMP [56], as well as AQuA [57] and MultiArith [58], which are all used for assessing mathematical problem-solving abilities. The detailed descriptions of the datasets can be found in Appendix A.

5.2. Backbone Models

We evaluated both open-source and closed-source models across a broad range of datasets. For the open-source model, we selected Llama-3-8B-Instruct [59]. Regarding closed-source models, we employed GPT-3.5-turbo (specifically the gpt-3.5-turbo-0125 version) and GPT-4o-mini (specifically the gpt-4o-mini-2024-07-18 version). Additionally, in cross-model experiments, we included deepseek-chat, which operates in the non-thinking mode of DeepSeek-V3.2-Exp. To ensure the reproducibility of our experiments, we fixed the random seed to 42. During training, prompt optimization relies on the generator producing sufficiently diverse responses. During testing, the generator is also required to generate multiple distinct answers; therefore, no fixed random seed is imposed on the generator. During the experiments, the generator, discriminator, proposer, and modifier modules were all powered by the same backbone model, unless otherwise specified.

5.3. Baselines

We begin our experiments with CoT [9] prompting and enhance it with Self-Consistency [18], using 10 sampled reasoning paths per input. To comprehensively validate the effectiveness of our proposed framework, we introduce two additional baselines: (1) AutoCoT [13], which acquires diverse questions via clustering-based sampling and automatically generates reasoning chains to construct demonstrations using Zero-Shot-CoT, eliminating the need for manual design of task-specific demonstrations and significantly reducing human effort; (2) adv-ICL [29], a representative method of adversarial in-context learning, used to compare the performance differences in adversarial optimization ideas.

5.4. Evaluation Metrics

In all experiments, we report accuracy as the evaluation metric. All datasets used in this work are single-label reasoning tasks—either multiple-choice (e.g., CSQA, ARC-c, AQuA), yes/no classification (e.g., BoolQ, StrategyQA), or exact numeric prediction (e.g., GSM8K, MultiArith). In these settings, accuracy directly measures whether the model outputs the correct answer, and it is also the predominant evaluation metric in prior CoT research. Other metrics such as precision, recall, or F1-score are less informative because each sample contains only one ground-truth label and no class-imbalance or multi-label structure is involved.
Formally, accuracy is defined as:
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(\hat{y}_i = y_i\right),$$
where ŷ_i denotes the model prediction, y_i is the ground-truth label, and N is the total number of instances.
We follow a unified evaluation pipeline for all models: given an input question, the model produces either a single CoT sample or 10 samples in the Self-Consistency setting. Answers are extracted via regex-based parsing and mapped to the official labels. The full evaluation script, including extraction, SC aggregation, and error-handling rules, is released in our repository.
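The following simplified sketch illustrates the shape of this pipeline; the regex pattern and helper names are our own simplifications rather than the released evaluation script.

```python
import re
from collections import Counter

# Simplified pattern; the released script uses dataset-specific rules.
ANSWER_PATTERN = re.compile(
    r"answer is\s*:?\s*\(?([A-E]\b|yes\b|no\b|-?\d+(?:\.\d+)?)", re.IGNORECASE
)

def extract_answer(completion: str):
    """Pull the final answer out of a CoT completion (lower-cased for matching)."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1).strip().lower() if match else None

def sc_aggregate(completions):
    """Self-Consistency aggregation: majority vote over extracted answers."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

def accuracy(predictions, labels):
    """Accuracy = (1/N) * sum of 1(y_hat_i == y_i); labels assumed lower-cased."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```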

5.5. Implementation Details

To encourage diverse outputs, we set the temperature to 0.6 for the generator, proposer, and modifier, while the discriminator operated with a temperature of 0 for deterministic judgment. Considering the training cost, we set the number of reasoning paths evaluated by the verifier to 3. All datasets are configured with 5 samples, except for Letter, which is set to 4. More details about the parameters can be found in Appendix B. During the experiments, the log-probabilities of the output tokens are used as a reference signal for computing the loss. For each dataset, we initialized the generator prompt P r o m p t G and discriminator prompt P r o m p t D as follows. The instruction for the generator I n s t r u c t i o n G was set to “Let’s think step by step” [10]. We adopted initial demonstrations for the generator based on prior work [48]. For the discriminator I n s t r u c t i o n D , we extended the adv-ICL [29] discriminator prompt by incorporating an explicit reasoning requirement: “Judge whether the answer is the correct ground truth or a generated fake answer, and provide the reasoning process. You should make a choice from the following two options: (A) correct ground truth, (B) generated fake answer.” The initial demonstrations for the discriminator used input–output pairs from the generator as input, while the output was constructed by directly appending I n s t r u c t i o n D to the input under a zero-shot prompting setting using GPT-4, to elicit the model’s reasoning process. The detailed prompts can be found in Appendix C.
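For reference, the main hyperparameters described in this section can be collected into a single configuration sketch; the values are those reported above, while the field names are our own.

```python
ADV_COT_CONFIG = {
    "random_seed": 42,                 # fixed for reproducibility (not applied to the generator)
    "temperature": {
        "generator": 0.6,
        "proposer": 0.6,
        "modifier": 0.6,
        "discriminator": 0.0,          # deterministic judgments
    },
    "verifier_reasoning_paths": 3,     # reasoning paths evaluated by the verifier
    "num_demonstration_samples": 5,    # 4 for the Letter dataset
    "generator_instruction": "Let's think step by step",
    "discriminator_instruction": (
        "Judge whether the answer is the correct ground truth or a generated fake answer, "
        "and provide the reasoning process. You should make a choice from the following two "
        "options: (A) correct ground truth, (B) generated fake answer."
    ),
}
```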

6. Results and Discussion

6.1. Results

Table 1 reports the performance of different reasoning methods across commonsense, symbolic, and arithmetic datasets. Overall, adv-CoT provides consistent improvements on most benchmarks. Although several gains remain moderate, repeated-run statistics show that the variance of adv-CoT is generally small, indicating that the method is stable despite introducing adversarial perturbations. These empirical results primarily address RQ2, as they evaluate whether adversarial interactions between the generator and discriminator improve the stability, accuracy, and error resistance of CoT reasoning across diverse tasks.
Table 1. Accuracy (%) of GPT-3.5-turbo, GPT-4o-mini, and Llama-3-8B-Instruct across commonsense, symbolic, and arithmetic reasoning benchmarks. The red color indicates a decrease, while the green color indicates an increase. The arrows represent downward and upward trends, respectively.
For GPT-3.5-turbo, adv-CoT achieves notable improvements on Sports (+4.5), GSM8K (+3.7), AQuA (+3.9), SVAMP (+1.2), and BoolQ (+0.4). These results suggest that adversarially generated intermediate steps help correct shallow or shortcut reasoning behaviors, particularly in numerical and factual reasoning tasks. Moderate gains are also observed on OpenBookQA and ARC-c. However, performance remains nearly unchanged on StrategyQA and slightly decreases on Coin (−1.6), indicating that tasks with limited reasoning depth or strong lexical biases may not benefit from adversarial reasoning signals.
For GPT-4o-mini, adv-CoT again yields stable improvements, though with smaller margins overall. The largest gains appear on StrategyQA (+1.9), Letter (+2.0), SVAMP (+0.3), and AQuA (+0.8). Results on commonsense and arithmetic benchmarks remain largely unchanged, reflecting GPT-4o-mini’s already strong inherent reasoning capabilities. Notably, adv-CoT does not reduce performance on any arithmetic dataset and maintains perfect accuracy on Coin across all runs.
For Llama-3-8B-Instruct, adv-CoT leads to meaningful gains on CSQA, StrategyQA, ARC-c, BoolQ, GSM8K, SVAMP, and AQuA. The method is particularly beneficial for natural-language reasoning tasks requiring multi-step justification. Slight performance drops occur on OpenBookQA, Sports, Letter, and Coin, suggesting that smaller open-source models may be more sensitive to adversarial perturbations, especially in symbolic tasks.
Across datasets, the magnitude of improvement varies in ways that align with their underlying reasoning characteristics. Arithmetic and factual tasks exhibit the largest gains, consistent with the observation that their reasoning errors tend to follow structured and recurring patterns, such as omitted intermediate steps or unsupported factual links, that can be effectively suppressed under adversarial refinement. Symbolic reasoning tasks show smaller or less stable improvements, likely because their discrete step transitions make correctness harder to infer from textual surface forms, reducing the discriminator's ability to provide consistent feedback signals. Slight decreases observed in Letter and certain commonsense benchmarks appear to be associated with tasks requiring flexible associative reasoning, where overly constrained refinement may occasionally limit beneficial variability.
Qualitative examination further indicates that adversarial optimization tends to produce more concise and coherent reasoning trajectories in tasks with well-defined solution structures, while tasks with inherently diverse reasoning strategies display a broader range of outcomes. These trends collectively suggest that the effectiveness of adv-CoT is closely related to the degree of structural regularity present in the target reasoning process.
Across all models, arithmetic datasets such as GSM8K, SVAMP, and MultiArith show improved or comparable results, demonstrating that the introduced adversarial reasoning steps do not destabilize numeric reasoning. CoT+SC is included only as a reference baseline, and details regarding Self-Consistency are discussed in Section 6.4.1.

6.2. Significance Analysis

All statistical analyses were conducted in Python (version 3.12.5). We used the scipy.stats package for paired t-tests, numpy for vectorized operations, and a custom bootstrap implementation based on 10,000 resampling iterations. Cohen’s d for dependent samples was computed following the standard formula using the mean and standard deviation of paired differences. When controlling for multiple comparisons, we applied the Benjamini–Hochberg FDR procedure implemented in statsmodels.
To examine whether the improvements in adv-CoT are reliable, we conducted paired statistical tests on the twelve benchmark datasets. For each model, we calculated the accuracy difference between adv-CoT and CoT on every dataset. These differences were treated as paired observations. We then applied three analyses: a paired t-test, a 10,000-sample bootstrap procedure, and Cohen’s d for dependent samples.
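The analysis can be reproduced with standard tooling along the following lines; the helper below is an illustrative sketch of the procedure described above rather than our exact analysis script.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def paired_significance(adv_cot_acc, cot_acc, n_boot=10_000, seed=0):
    """Paired t-test, bootstrap CI of the mean difference, and Cohen's d for
    dependent samples, computed over per-dataset accuracy pairs."""
    adv_cot_acc, cot_acc = np.asarray(adv_cot_acc), np.asarray(cot_acc)
    diff = adv_cot_acc - cot_acc

    t_stat, p_value = stats.ttest_rel(adv_cot_acc, cot_acc)

    rng = np.random.default_rng(seed)
    boot_means = [rng.choice(diff, size=diff.size, replace=True).mean()
                  for _ in range(n_boot)]
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    cohens_d = diff.mean() / diff.std(ddof=1)   # d for dependent samples
    return {"mean_diff": diff.mean(), "t": t_stat, "p": p_value,
            "ci95": (ci_low, ci_high), "d": cohens_d}

# Benjamini–Hochberg FDR correction across multiple comparisons:
# reject, p_corrected, _, _ = multipletests(p_values, method="fdr_bh")
```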
Table 2 reports the results. For GPT-3.5-turbo, adv-CoT shows a significant gain of +4.44% (p = 0.0266; 95% CI = [1.76, 8.04]). The effect size is medium to large (d = 0.739). For GPT-4o-mini, the gain is smaller but still significant (+1.08%, p = 0.0134; 95% CI = [0.47, 1.83]). Its effect size is large (d = 0.849). For Llama-3-8B-Instruct, the improvement is modest (+0.98%) and not significant (p = 0.189; 95% CI = [−0.32, 2.27]). The effect size is small to medium (d = 0.404).
Table 2. Significance test results comparing adv-CoT against CoT across models.
These results show a clear pattern. adv-CoT provides stable and meaningful improvements on high-capacity proprietary models. The gains on smaller open-source models are less consistent and vary more across datasets. The combined evidence from significance testing and effect sizes supports this conclusion.
Since each backbone model is evaluated over twelve datasets, the paired significance tests involve multiple comparisons. To control the increased risk of Type I errors, we applied the Benjamini–Hochberg False Discovery Rate (FDR) correction to all p-values. The corrected values do not change any of our main conclusions: the improvements remain significant for GPT-3.5-turbo and GPT-4o-mini, while the results for Llama-3-8B-Instruct remain non-significant.

6.3. Ablation Study

The ablation analyses provide additional evidence for RQ1 and RQ3, by disentangling how feedback, verification, and different prompt components jointly shape the optimization dynamics and the reliability of the resulting reasoning process.

6.3.1. Ablation Study on Feedback and Verification Modules

To assess the individual contributions of the feedback and verification modules in our framework, we conduct ablation experiments on five datasets—ARC-C, Sports, BoolQ, Letter, and AQuA—using gpt-4o-mini. We compare three configurations: the full adv-CoT framework, adv-CoT without the feedback module (w/o feedback), and adv-CoT without the verification module (w/o verification). For a controlled comparison, the full and w/o-verification settings share the same optimized prompts, allowing us to isolate the effect of the verification stage. The results are summarized in Figure 4.
Figure 4. Ablation study results: (a) accuracy when different components are removed; (b) training steps with and without feedback.
Across the five datasets, the full adv-CoT achieves accuracies of 95.3%, 82.3%, 81.0%, 90.5%, and 79.5%. Removing feedback yields 95.8%, 78.6%, 80.0%, 84.8%, and 76.8%, whereas removing verification results in 94.1%, 74.0%, 79.5%, 84.3%, and 76.1%. With the exception of ARC-C—where omitting feedback produces a marginal improvement of only 0.5 percentage points—both ablations lead to consistent performance drops on all remaining datasets. The most substantial decreases appear when verification is removed on Sports (−8.3) and Letter (−6.2), and when feedback is removed on Letter (−5.7) and Sports (−3.7). These patterns indicate that the verification module plays a disproportionately strong role in stabilizing performance on high-variance tasks such as Sports and Letter, while the feedback module is essential for ensuring reliable improvements across different reasoning categories.
Training dynamics exhibit the same trend. The full adv-CoT shows lower final training loss (13.3, 10.3, 10.3, 10.0, 11.6) compared with the w/o-feedback variant (15.0, 11.9, 14.6, 16.0, 17.33), demonstrating faster convergence and more stable optimization. In contrast, removing feedback yields noticeably higher loss and slower optimization, suggesting that feedback provides more useful refinement signals. Verification, meanwhile, primarily improves the robustness of the final predictions rather than the speed of convergence.
These results highlight the complementary roles of the two modules: feedback enhances the efficiency and consistency of the optimization process, while verification ensures the reliability of the final reasoning outputs. The degradation in both accuracy and optimization stability when either module is removed indicates that both mechanisms are indispensable components of the adv-CoT framework.

6.3.2. Ablation on Instruction–Demonstration Contributions

The optimized prompt used in adv-CoT consists of two key components: a task-level instruction that specifies the reasoning procedure, and a set of demonstrations that provide concrete exemplars of the desired chain-of-thought format. Although both components are jointly refined during the adversarial optimization process, their individual contributions to model performance remain unclear. To better understand their respective roles, we conduct an ablation study that isolates the effects of each component.
We evaluate three optimization configurations: (i) instruction-only, where the demonstrations are fixed and only the task directive is refined; (ii) demonstration-only, where the instruction remains unchanged while the exemplars are optimized; and (iii) joint optimization, corresponding to the full adv-CoT setting. Experiments were carried out using gpt-4o-mini on three representative benchmarks—AQuA (arithmetic reasoning), Letter (symbolic pattern reasoning), and ARC-c (commonsense abstraction). These datasets span distinct cognitive domains and thus provide complementary perspectives on the behavior of the two prompt components.
Figure 5 presents the results. Optimizing only the instruction yields accuracies of 77.9% (AQuA), 90.4% (Letter), and 94.3% (ARC-c). While these gains indicate that task-level guidance plays a meaningful role in encouraging more structured reasoning, the resulting improvement is consistently lower than that obtained by optimizing demonstrations. Refining only the exemplars leads to stronger performance on two datasets (79.1% on AQuA and 94.6% on ARC-c), suggesting that example-driven corrections may exert a more direct influence on the model's reasoning format. However, on the Letter dataset, demonstration-only optimization results in a slight decrease (89.2%) compared with the instruction-only setting, implying that fine-grained symbolic tasks may be more sensitive to instruction clarity.
Figure 5. Ablation on Instruction vs. Demonstration Optimization.
The joint optimization setting achieves the highest accuracy across all datasets—79.9%, 91.0%, and 95.3%—indicating that the two components are complementary rather than redundant. The interaction between refined instructions and optimized exemplars appears to stabilize reasoning trajectories and reduce the likelihood of error propagation. In particular, the improvements observed on the ARC-c dataset suggest that commonsense abstraction benefits from both explicit procedural guidance and carefully constructed demonstrations, reflecting a synergy between high-level directive refinement and example-driven reasoning corrections.
Overall, this ablation study shows that neither instructions nor demonstrations alone can account for the full performance of adv-CoT. The most substantial improvements arise from their coordinated optimization, underscoring the importance of treating the prompt as a multi-component structure whose elements interact to guide reasoning quality.

6.4. Discussion

6.4.1. Difference Between Verification and Self-Consistency

Although both the verifier mechanism and Self-Consistency (SC) involve generating multiple reasoning paths, their selection principles differ fundamentally. The verifier evaluates each reasoning chain using a learned confidence-based criterion and selects the answer with the highest predicted reliability. In contrast, SC adopts a majority-voting scheme, selecting the most frequently occurring answer across sampled reasoning paths. To better understand the distinct and complementary roles of these two mechanisms, we evaluate four configurations on four datasets (OpenBookQA, Coin, Letter, and AQuA): (1) optimized prompts only, (2) optimized prompts with verifier, (3) optimized prompts with SC, and (4) optimized prompts with both verifier and SC.
Figure 6 presents the empirical results. The optimized-prompt baseline achieves accuracies of 84.2%, 94.3%, 75.4%, and 64.5%, respectively. Incorporating the verifier yields improvements to 87.4%, 95.1%, 75.6%, and 70.0%. Although the gains are modest on Letter (0.2 percentage points), the improvements on OpenBookQA (+3.2) and AQuA (+5.5) indicate that confidence-based filtering is especially beneficial for tasks involving noisy or unstable intermediate reasoning. This aligns with the intended role of the verifier: eliminating low-confidence or logically inconsistent reasoning chains.
Figure 6. Accuracy comparison between Verifier and Self-Consistency.
SC produces stronger improvements on three of the four datasets, reaching 86.0%, 98.2%, 78.4%, and 71.2%. The substantial gain on Coin (+3.9 percentage points over the optimized-prompt baseline) suggests that SC more effectively mitigates stochastic variation by aggregating diverse reasoning trajectories, thereby reducing exposure to single-path failures. The robustness brought by majority voting appears particularly advantageous for tasks sensitive to sampling noise, such as symbolic reasoning in Letter and arithmetic reasoning in AQuA.
When combining both mechanisms, performance further improves to 86.8%, 99.2%, 80.0%, and 71.6%, achieving the highest accuracy on all datasets. This pattern indicates that verification and SC address different failure modes: verification suppresses low-quality reasoning paths by confidence calibration, while SC benefits from reasoning diversity and reduces variance across candidates. The complementary effects of these mechanisms suggest that combining confidence-based selection with diversity aggregation leads to more reliable and robust inference.
Overall, the performance degradation observed when removing the verification component reinforces its contribution to answer reliability, while the consistently stronger gains offered by SC highlight its importance for stabilizing inference under sampling variability. These findings provide a more nuanced understanding of the roles played by confidence calibration and diversity consolidation within the adv-CoT framework.
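As an illustration, the combined configuration can be sketched as follows; the filtering ratio and helper names are assumptions for exposition, since the exact composition used in our experiments follows the released implementation.

```python
from collections import Counter

def verifier_plus_sc(question, generate_answer, verifier_score,
                     n_paths=10, keep_top=5):
    """Combine confidence-based filtering (verifier) with majority voting (SC)."""
    candidates = [generate_answer(question) for _ in range(n_paths)]  # (reasoning, answer)
    # Verifier: keep the reasoning paths with the highest confidence scores.
    ranked = sorted(candidates,
                    key=lambda c: verifier_score(question, c[1]),
                    reverse=True)[:keep_top]
    # Self-Consistency: majority vote over the retained answers.
    return Counter(answer for _, answer in ranked).most_common(1)[0][0]
```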

6.4.2. Cross-Model Generalization Analysis

To examine whether adversarial refinement overfits to a single model family, we conduct a cross-model generalization study using GPT-4o-mini and DeepSeek-V3.2-Exp. Each model is used both as the optimization (train) model and the evaluation (test) model, producing six train–test combinations. Experiments are performed on AQuA, Letter, and Sports.
The results in Table 3 reveal several nuanced trends. First, within-model evaluation shows clear and consistent gains: adv-CoT improves performance when GPT-4o-mini trains and evaluates on itself, and similarly when DeepSeek trains and evaluates on itself. These results verify that adversarial refinement is effective when the optimization and inference models share the same reasoning structure.
Table 3. Cross-model generalization results of adv-CoT across GPT-4o-mini and DeepSeek-V3.2-Exp. Each cell shows accuracy on AQuA/Letter/Sports.
Second, cross-family transfer is feasible but task-dependent. Prompts optimized using DeepSeek transfer positively to GPT-4o-mini on AQuA and Sports, while showing a slight decrease on Letter. Conversely, GPT-optimized prompts transfer to DeepSeek with improvements on AQuA and Letter, but produce a small decline on Sports. These mixed outcomes indicate that part of the refined reasoning structure generalizes across model families, while another part remains tied to model-specific inductive biases.
Third, the magnitude and direction of transfer effects vary across tasks. Arithmetic reasoning (AQuA) benefits the most from cross-model refinement, whereas factual reasoning (Sports) is more sensitive to model-specific characteristics, leading to occasional performance degradation.
Overall, the cross-model analysis suggests that adv-CoT does not rely solely on a single model family; it learns partially transferable reasoning structures. At the same time, the variability observed across tasks and model pairs highlights the importance of future work on architecture-agnostic adversarial prompt optimization.

7. Conclusions

In this work, we introduced adv-CoT, an adversarial framework for evaluating and refining Chain-of-Thought prompts without modifying model parameters. Through generator–discriminator interactions combined with feedback and verification mechanisms, adv-CoT enables structured prompt revision and improves accuracy across a range of reasoning tasks. Our experiments on twelve datasets demonstrate that adv-CoT provides consistent yet task-dependent performance gains, particularly for numerical and factual reasoning. Significance analyses further confirm that these improvements are statistically reliable for high-capacity proprietary models, while the gains on smaller open-source models remain less stable.
Despite these promising results, several limitations must be acknowledged. First, the effectiveness of adv-CoT is inherently constrained by the underlying LLM’s reasoning capability. When applied to smaller or less reliable models, the feedback and refinement processes may be misled by incorrect intermediate reasoning, thereby reducing optimization quality. Second, the generality of adv-CoT remains limited. Our evaluation focuses on four categories of reasoning tasks and does not extend to more heterogeneous domains such as code generation, scientific problem-solving, long-context inference, or multimodal tasks. The cross-domain robustness of adversarial prompt refinement therefore warrants further investigation. Third, the framework depends on access to token-level log probabilities, which are required for discriminator scoring and verification. Many proprietary LLMs do not expose log-probability outputs, which restricts the applicability of adv-CoT in settings where such information is unavailable. Broader adoption of the framework may thus be constrained by API design and model transparency. Fourth, although feedback and verification help reduce variability, the iterative adversarial optimization introduces non-trivial computational overhead. The framework requires multiple rounds of model interaction during both training and validation, which may increase latency and cost—especially when relying on commercial APIs. Careful hyperparameter selection is required to balance optimization quality with computational efficiency.
Future work will explore broader task domains, more scalable feedback mechanisms, and improved control over prompt modification to achieve greater generality and robustness. We hope that the released implementation and prompts will facilitate deeper investigation into principled and transferable prompt optimization strategies.

Author Contributions

Conceptualization, G.Y., X.C., S.W. and J.L.; methodology, G.Y., X.C., S.W. and J.L.; software, G.Y.; validation, G.Y., X.C. and S.W.; formal analysis, G.Y. and X.C.; investigation, G.Y. and X.C.; resources, X.C., S.W. and J.L.; data curation, G.Y.; writing—original draft preparation, G.Y.; writing—review and editing, G.Y., X.C., S.W. and J.L.; visualization, G.Y.; supervision, X.C. and J.L.; project administration, J.L.; funding acquisition, S.W. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the State Grid Corporation Headquarters Science and Technology Project: Research on equipment operation and inspection disposal reasoning technology based on knowledge-enhanced generative model and intelligent agent and demonstration application (5700-202458333A-2-1-ZX).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The implementation details of our code and the experiment logs are available at https://github.com/fatsunshineboy/adv_cot (accessed on 3 December 2025).

Acknowledgments

During the preparation of this study, the authors used ChatGPT (GPT-4o, OpenAI) for the purposes of polishing parts of the manuscript. The authors have reviewed and edited the output and take full responsibility for the content of this publication. The numerical calculations in this paper were conducted on the supercomputing system at the Supercomputing Center of Wuhan University.

Conflicts of Interest

Author Shaohe Wang was employed by the company State Grid Zhejiang Electric Power Co., Ltd., Research Institute Hangzhou, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
adv-CoT  Adversarial Chain-of-Thought
adv-ICL  Adversarial In-Context Learning
CoT  Chain-of-Thought
GANs  Generative Adversarial Networks
LLMs  Large Language Models
NLP  Natural Language Processing
SC  Self-Consistency

Appendix A. Statistics of Datasets

Following prior work, we conduct evaluations on twelve datasets covering arithmetic reasoning, commonsense reasoning, symbolic reasoning, and natural language understanding. The statistics of these datasets are presented in Table A1, and detailed information for each dataset is provided below.
Table A1. Dataset Descriptions.

| Dataset | Number of Samples | Average Words | Answer Format | Licence |
|---|---|---|---|---|
| CSQA | 1221 | 27.8 | Multi-choice | Unspecified |
| StrategyQA | 2290 | 9.6 | Yes or No | Apache-2.0 |
| OpenBookQA | 500 | 27.6 | Multi-choice | Unspecified |
| ARC-c | 1172 | 47.5 | Multi-choice | CC BY-SA 4.0 |
| Sports | 1000 | 7.0 | Yes or No | Apache-2.0 |
| BoolQ | 3270 | 8.7 | Yes or No | CC BY-SA 3.0 |
| Last Letters | 500 | 15.0 | String | Unspecified |
| Coin Flip | 500 | 37.0 | Yes or No | Unspecified |
| GSM8K | 1319 | 46.9 | Number | MIT License |
| SVAMP | 1000 | 31.8 | Number | MIT License |
| AQuA | 254 | 51.9 | Multi-choice | Apache-2.0 |
| MultiArith | 600 | 31.8 | Number | CC BY-SA 4.0 |

Appendix B. Extended Experiments

Appendix B.1. Temperature of Proposer and Modifier

To further examine the sensitivity of adv-CoT to sampling diversity, we conduct an ablation study on the temperature parameters used in the proposer and modifier modules. Unlike the generator and discriminator—whose temperatures remain fixed to ensure stable answer generation and evaluation—the proposer and modifier rely on stochastic sampling to construct alternative reasoning paths and feedback signals. Temperature therefore governs the trade-off between exploration (higher temperature) and determinism (lower temperature), directly influencing the diversity and usefulness of the generated feedback.
We evaluate five temperature settings (0.0, 0.2, 0.4, 0.6, 0.8) on two representative datasets, AQuA and ARC-C. The results are summarized in Figure A1. For AQuA, accuracy fluctuates significantly across temperatures: it reaches 80.3% at T = 0.0, decreases to 74.4% at T = 0.2, then partially recovers to 78.3% and 79.1% at T = 0.4 and T = 0.6, before declining again to 73.6% at T = 0.8. These results indicate that extremely low temperatures limit exploration, producing overly conservative feedback, while moderate temperatures (0.4–0.6) provide a more effective balance between diversity and logical coherence. However, excessively high temperatures introduce noisy or inconsistent revisions, thereby weakening optimization effectiveness.
In contrast, ARC-C exhibits notable robustness across the entire temperature range: accuracy remains within a narrow band of 94.6–95.3% for all settings. This stability suggests that commonsense reasoning requires less fine-grained control over stochasticity—likely due to its shorter reasoning chains and reduced sensitivity to perturbations—rendering ARC-C more tolerant to variations in exploration level.
Overall, these results demonstrate that temperature acts as a task-dependent factor in adv-CoT. Numerical reasoning tasks benefit from carefully calibrated exploration within the feedback modules, whereas tasks with simpler logical structures remain comparatively insensitive to temperature variation. Effective temperature selection can therefore enhance prompt refinement by encouraging meaningful exploration while avoiding excessive randomness.
Figure A1. Accuracy on AQuA and ARC-C under different temperature settings for the Proposer and Modifier modules.

Appendix B.2. Ablation Studies on Number of Iterations and Data Samples

To further validate the robustness of adv-CoT and investigate the impact of its key hyperparameters, we conducted additional ablation experiments with different combinations of the number of iterations (Num_I) and the number of prompt samples (Num_D). The evaluation was carried out on three benchmark datasets: MultiArith, ARC-C, and Sports. For each Num_D ∈ {1, 3, 5, 7}, the iteration number Num_I was varied from 1 to 5, and the results are reported in Table A2. All scores are presented in percentages with one decimal place for clarity and consistency.
Table A2. Ablation results of adv-CoT with different Num_I and Num_D values.

| Num_D | Num_I | MultiArith (%) | ARC-C (%) | Sports (%) |
|---|---|---|---|---|
| 1 | 1 | 97.6 | 72.7 | 60.1 |
| 1 | 2 | 95.0 | 76.6 | 51.1 |
| 1 | 3 | 98.1 | 80.8 | 50.8 |
| 1 | 4 | 97.8 | 81.9 | 65.6 |
| 1 | 5 | 96.6 | 78.4 | 64.1 |
| 3 | 1 | 98.3 | 78.7 | 61.9 |
| 3 | 2 | 98.3 | 80.5 | 66.1 |
| 3 | 3 | 98.8 | 80.7 | 78.9 |
| 3 | 4 | 98.6 | 79.9 | 71.3 |
| 3 | 5 | 98.0 | 79.7 | 58.0 |
| 5 | 1 | 98.1 | 71.9 | 82.1 |
| 5 | 2 | 97.6 | 75.1 | 76.9 |
| 5 | 3 | 98.3 | 79.9 | 88.7 |
| 5 | 4 | 97.6 | 78.8 | 57.5 |
| 5 | 5 | 97.8 | 77.8 | 76.4 |
| 7 | 1 | 97.6 | 73.8 | 85.6 |
| 7 | 2 | 98.0 | 75.4 | 89.5 |
| 7 | 3 | 98.0 | 82.4 | 80.7 |
| 7 | 4 | 97.5 | 79.9 | 77.8 |
| 7 | 5 | 97.5 | 80.7 | 84.8 |
Overall, these results demonstrate that both hyperparameters exert significant influence on model performance across datasets. For MultiArith, which primarily measures numerical reasoning ability, accuracy remains consistently high under most configurations, ranging from 95.0% to 98.8%. The best performance is obtained when N u m D = 3 and N u m I = 3 , indicating that moderate iteration depth and an appropriate number of prompt samples are beneficial for stable arithmetic reasoning. However, excessive iteration ( N u m I > 4 ) or too few prompts tends to cause slight performance drops, possibly due to optimization saturation or insufficient contextual guidance.
In the ARC-C dataset, which emphasizes commonsense reasoning, the results show greater variation. When N u m D = 1 , performance peaks at 81.9% with N u m I = 4 , while most other settings remain below 80%. As the prompt number increases to N u m D = 3 and N u m D = 5 , performance becomes more stable, reaching best scores of 80.7% and 79.9%, respectively. When N u m D = 7 , the model achieves its overall best performance of 82.4% at N u m I = 3 , suggesting that moderate prompt diversity and iteration balance enhance generalization in abstract reasoning tasks.
For the Sports dataset, which relies more on factual and contextual reasoning, the model exhibits strong sensitivity to prompt diversity. With N u m D = 1 , accuracy increases gradually from 50.8% to 65.6% as the iteration deepens. When more prompts are introduced, performance improves substantially—reaching 78.9% at N u m D = 3 , 88.7% at N u m D = 5 , and peaking at 89.5% when N u m D = 7 and N u m I = 2 . This indicates that rich prompt contexts significantly benefit knowledge-intensive reasoning.
Although Table A2 reports results for each combination of Num_D and Num_I, we further analyze their interaction to better understand the optimization dynamics of adv-CoT. Increasing Num_D beyond 1 substantially stabilizes the refinement process, whereas Num_D = 1 yields markedly lower and less consistent scores across datasets, particularly on Sports, which supports the role of sampling-based diversity in improving reasoning robustness. However, a very large Num_D (e.g., 7) does not guarantee additional gains on every task; on MultiArith it brings no further improvement, suggesting that over-sampling may introduce noisy or low-quality candidate reasoning paths.
For Num_I, we find that a moderate iteration depth (Num_I = 3) consistently achieves the best or near-best results across datasets, while deeper refinement (Num_I = 5) tends to produce diminishing returns or even slight degradation, indicating that excessive adversarial refinement may over-correct the reasoning trajectory. Importantly, the optimal settings of Num_D and Num_I are not independent: Num_I = 3 performs well when Num_D = 3 or 5, whereas Num_D = 1 benefits far less reliably from additional iterations, revealing a clear interaction between sampling diversity and refinement depth.
These observations highlight that adv-CoT benefits from a balanced combination of sampling and iterative refinement, rather than simply increasing either dimension, providing methodological insights aligned with prior findings in adversarial prompt optimization.
In summary, these supplementary results confirm that adv-CoT performs best when both the iteration number and prompt diversity are moderately balanced. Extremely small or large hyperparameter values tend to either under-refine or over-correct the reasoning process. The optimal configuration, typically around Num_I = 3∼4 and Num_D = 5∼7, provides a consistent trade-off between stability and adaptability, reinforcing the parameter choices used in the main experiments.
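To make the sweep procedure concrete, the sketch below outlines how such a grid over Num_D and Num_I could be driven; run_adv_cot and evaluate_accuracy are hypothetical placeholders for the adversarial refinement loop and the dataset-specific scorer, not functions from a released implementation.

```python
from itertools import product

# Hypothetical driver for the ablation in Table A2: for every combination of the
# number of prompt samples (Num_D) and refinement iterations (Num_I), run adv-CoT
# and record the resulting accuracy on each benchmark.
NUM_D_VALUES = [1, 3, 5, 7]
NUM_I_VALUES = [1, 2, 3, 4, 5]
DATASETS = ["MultiArith", "ARC-C", "Sports"]

def ablation_sweep(run_adv_cot, evaluate_accuracy):
    """run_adv_cot(dataset, num_d, num_i) -> refined prompt;
    evaluate_accuracy(dataset, prompt) -> accuracy in [0, 100]."""
    results = {}
    for dataset, num_d, num_i in product(DATASETS, NUM_D_VALUES, NUM_I_VALUES):
        prompt = run_adv_cot(dataset, num_d=num_d, num_i=num_i)
        results[(dataset, num_d, num_i)] = evaluate_accuracy(dataset, prompt)
    return results
```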

Appendix C. Prompt

We provide the prompts used in our experiments. The prompts for both the Generator and the Discriminator consist of an instruction and demonstrations. The demonstrations for the Generator are adapted from prior work, while those for the Discriminator are constructed by concatenating the Discriminator’s instruction with the input–output pairs of the Generator’s demonstrations and providing them as zero-shot prompts to GPT-4o-mini to obtain the corresponding outputs. Furthermore, the Proposer provides revision suggestions and the Modifier produces concrete modifications for the instructions and demonstrations of both the Generator and the Discriminator.
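A minimal sketch of this construction step is shown below; it assumes the OpenAI Python client and a hypothetical generator_demos list of (question, answer) pairs, and is intended only to make the concatenation explicit rather than to reproduce our exact implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_discriminator_demos(discriminator_instruction: str, generator_demos):
    """generator_demos: list of (question, answer) pairs taken from the Generator's
    demonstrations. Each pair is concatenated with the Discriminator's instruction
    and sent zero-shot to GPT-4o-mini; the model's reply becomes the demonstration's
    [Discriminator_Output]."""
    demos = []
    for question, answer in generator_demos:
        prompt = (
            f"{discriminator_instruction}\n"
            f"<input> {question} </input>\n"
            f"<output> {answer} </output>"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        demos.append((question, answer, response.choices[0].message.content))
    return demos
```

The detailed prompts used by each component are listed in Appendices C.1–C.10 below.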

Appendix C.1. Generator Initial Prompt

Let’s think step by step [10].

Appendix C.2. Discriminator Initial Prompt

Judge the answer is correct ground truth or generated fake answer and give the reasoning process. You should make a choice from the following two answers. (A) correct ground truth (B) generated fake answer [29].
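To illustrate how this initial instruction is applied, the sketch below assembles a single discriminator query and parses the (A)/(B) verdict; the llm_call callable and the first-label-wins parsing rule are illustrative assumptions, not the exact logic used in our experiments.

```python
DISCRIMINATOR_INSTRUCTION = (
    "Judge the answer is correct ground truth or generated fake answer and give "
    "the reasoning process. You should make a choice from the following two "
    "answers. (A) correct ground truth (B) generated fake answer"
)

def discriminator_verdict(llm_call, question: str, candidate_answer: str) -> bool:
    """Return True if the discriminator labels the candidate answer as (A).
    llm_call(prompt) -> model text; the parsing rule below is a simple
    illustrative heuristic."""
    prompt = (
        f"{DISCRIMINATOR_INSTRUCTION}\n"
        f"<input> {question} </input>\n"
        f"<output> {candidate_answer} </output>"
    )
    reply = llm_call(prompt)
    # Take whichever option label appears first in the reply as the verdict.
    pos_a, pos_b = reply.find("(A)"), reply.find("(B)")
    if pos_a == -1:
        return False
    return pos_b == -1 or pos_a < pos_b
```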

Appendix C.3. Proposer Prompt for Generator’s Instruction

Below are examples in this format: <input> [Question] </input> <output> [Answer] </output> <answer> [Discriminator_Output] </answer>. In each case the discriminator marked the generator’s output as incorrect. In one sentence, give a precise, actionable modification to the generator’s instruction that targets common failure patterns to prevent similar mistakes. Do not mention the specific details of the errors, names, items or data in the examples. In conclusion, the words that appear in the examples must not be included in the suggestions. Old generator’s instruction: {instruction_generator}. Start the output with ‘New generator’s instruction should…’.
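For reference, the sketch below shows one way this template (and the analogous Proposer prompts that follow) could be populated with failure cases and the current instruction; the abbreviated template string, the build_proposer_prompt helper, and the (question, answer, discriminator_output) case layout are assumptions for illustration rather than our released implementation.

```python
# Sketch of how the Proposer prompt for the generator's instruction might be
# assembled. The template is abbreviated (its full wording is listed above).
PROPOSER_TEMPLATE = (
    "Below are examples in this format: <input> [Question] </input> "
    "<output> [Answer] </output> <answer> [Discriminator_Output] </answer>.\n"
    "{cases}\n"
    "...\n"  # remaining Proposer instruction, abbreviated here
    "Old generator's instruction: {instruction_generator}. "
    "Start the output with 'New generator's instruction should...'."
)

def build_proposer_prompt(failure_cases, instruction_generator):
    """failure_cases: iterable of (question, answer, discriminator_output)
    triples collected when the Discriminator flagged the Generator's output."""
    cases = "\n".join(
        f"<input> {q} </input> <output> {a} </output> <answer> {d} </answer>"
        for q, a, d in failure_cases
    )
    return PROPOSER_TEMPLATE.format(
        cases=cases, instruction_generator=instruction_generator
    )
```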

Appendix C.4. Proposer Prompt for Generator’s Demonstrations

Below are examples in this format: <input> [Question] </input> <output> [Answer] </output> <answer> [Discriminator_Output] </answer>. In each case the discriminator marked the generator’s output as incorrect. In up to one sentence, provide precise, actionable modifications to the generator’s examples that address these common failure patterns and steer its outputs away from similar errors. Do not mention the specific details of the errors, names, items or data in the examples. In conclusion, the words that appear in the examples must not be included in the suggestions. Old generator’s instruction: {instruction_generator}. Start the output with ‘New generator’s examples should…’.

Appendix C.5. Proposer Prompt for Discriminator’s Instruction

Below are examples in this format: <input> [Question] </input> <output> [Answer] </output> <answer> [Discriminator_Output] </answer>. In each case, the discriminator failed to detect errors in the generator’s output or identified the correct answer as incorrect. Please suggest concise, one-sentence edits to the discriminator’s instruction prompt so it more effectively detects generator output errors, focusing on common mistake patterns and how to reword the prompt to avoid them. Do not mention the specific details of the errors, names, items or data in the examples. In conclusion, the words that appear in the examples must not be included in the suggestions. Old discriminator instruction: {instruction_discriminator}. Start the output with ‘New discriminator’s instruction should…’.

Appendix C.6. Proposer Prompt for Discriminator’s Demonstrations

Below are examples in this format: <input> [Question] </input> <output> [Answer] </output> <answer> [Discriminator_Output] </answer>. In each case, the discriminator failed to detect errors in the generator’s output or identified the correct answer as incorrect. In up to one sentence, provide concise, actionable suggestions for designing new discriminator examples that more effectively expose common generator mistakes. Do not mention the specific details of the errors, names, items or data in the examples. In conclusion, the words that appear in the examples must not be included in the suggestions. Discriminator instruction: {instruction_discriminator}. Start the output with ‘New discriminator’s examples should…’.

Appendix C.7. Modifier Prompt for Generator’s Instruction

Try to generate a new generator instruction to improve the correctness of the generator’s output. Keep the task instruction as declarative. The instruction should be precise and representative to inspire the generator to think. Extra suggestions are only for reference and can be ignored when they conflict with the overall situation.

Appendix C.8. Modifier Prompt for Generator’s Demonstrations

Please generate a new example that polish the following example to improve the correctness of the generator’s output. The example should be challenging, precise and representative to inspire the generator to think. The format of the examples should not be changed. Generator’s instruction is {instruction_generator_old}. Example must follow this XML format: <example> <input> [Question] </input> <output> [Answer] </output> </example>. Replace the content in [] with your output. Provide only the revised ‘<example>…</example>’ blocks.

Appendix C.9. Modifier Prompt for Discriminator’s Instruction

Try to generate a new instruction to make LLM discriminator more precise to find LLM generator’s errors. Keep the task instruction as declarative. The instruction should be precise and representative to inspire the discriminator to think. Extra suggestions are only for reference and can be ignored when they conflict with the overall situation.

Appendix C.10. Modifier Prompt for Discriminator’s Demonstrations

Please generate a new example to make LLM discriminator more precise to find LLM generator’s errors. The example should be challenging, precise and representative to inspire the discriminator to think. The format of the examples should not be changed. Discriminator’s instruction is {instruction_discriminator_old}. Example must follow this XML format: <example> <input> [Question] </input> <output> [Answer] </output> <answer> [Discriminator_Output] </answer> </example>. Replace the content in [] with your output. Provide only the revised ‘<example>…</example>’ blocks.
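Because the Modifier is instructed to return only ‘<example>…</example>’ blocks, its output has to be parsed before the revised demonstrations can be reinserted into the prompt. The sketch below shows one plausible extraction step; the regular expression is an assumption rather than the parser used in our implementation, and it does not validate the inner XML fields.

```python
import re

EXAMPLE_BLOCK = re.compile(r"<example>\s*(.*?)\s*</example>", re.DOTALL)

def extract_example_blocks(modifier_response: str):
    """Return the body of every <example>...</example> block in the Modifier's
    response, in order of appearance. Inner tags such as <input>, <output>, and
    <answer> are left untouched for downstream parsing."""
    return EXAMPLE_BLOCK.findall(modifier_response)

# Example usage:
# new_demos = extract_example_blocks(modifier_reply_text)
```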

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  2. Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv 2021, arXiv:2112.11446. [Google Scholar]
  3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  4. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  5. Xiao, T.; Zhu, J. Foundations of large language models. arXiv 2025, arXiv:2501.09223. [Google Scholar]
  6. Min, S.; Lewis, M.; Zettlemoyer, L.; Hajishirzi, H. Metaicl: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 2791–2809. [Google Scholar]
  7. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Chang, B.; et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 1107–1128. [Google Scholar]
  8. Dherin, B.; Munn, M.; Mazzawi, H.; Wunder, M.; Gonzalvo, J. Learning without training: The implicit dynamics of in-context learning. arXiv 2025, arXiv:2507.16003. [Google Scholar] [CrossRef]
  9. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  10. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  11. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar]
  12. Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large language models are better reasoners with self-verification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 2550–2575. [Google Scholar]
  13. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  14. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  15. Li, J.; Chen, J.; Ren, R.; Cheng, X.; Zhao, W.X.; Nie, J.Y.; Wen, J.R. The dawn after the dark: An empirical study on factuality hallucination in large language models. arXiv 2024, arXiv:2401.03205. [Google Scholar] [CrossRef]
  16. Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models. arXiv 2024, arXiv:2401.11817. [Google Scholar] [CrossRef]
  17. Sun, Y.; Yin, Z.; Guo, Q.; Wu, J.; Qiu, X.; Zhao, H. Benchmarking hallucination in large language models based on unanswerable math word problem. arXiv 2024, arXiv:2403.03558. [Google Scholar] [CrossRef]
  18. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  19. Zhang, X.; Du, C.; Pang, T.; Liu, Q.; Gao, W.; Lin, M. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 333–356. [Google Scholar]
  20. Diao, S.; Wang, P.; Lin, Y.; Pan, R.; Liu, X.; Zhang, T. Active prompting with chain-of-thought for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 1330–1350. [Google Scholar]
  21. Ye, J.; Gong, S.; Chen, L.; Zheng, L.; Gao, J.; Shi, H.; Wu, C.; Jiang, X.; Li, Z.; Bi, W.; et al. Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 105345–105374. [Google Scholar]
  22. Luo, W.; Wang, W.; Li, X.; Zhou, W.; Jia, P.; Zhao, X. TAPO: Task-Referenced Adaptation for Prompt Optimization. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  23. Larionov, D.; Eger, S. Promptoptme: Error-aware prompt compression for llm-based mt evaluation metrics. arXiv 2024, arXiv:2412.16120. [Google Scholar]
  24. Guo, P.F.; Tsai, Y.D.; Lin, S.D. Benchmarking large language model uncertainty for prompt optimization. arXiv 2024, arXiv:2409.10044. [Google Scholar] [CrossRef]
  25. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html (accessed on 3 December 2025).
  26. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  27. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
  28. Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; Roli, F. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, 22–26 September 2013; pp. 387–402. [Google Scholar]
  29. Long, D.; Zhao, Y.; Brown, H.; Xie, Y.; Zhao, J.; Chen, N.; Kawaguchi, K.; Shieh, M.; He, J. Prompt optimization via adversarial in-context learning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 7308–7327. [Google Scholar]
  30. Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large language models are human-level prompt engineers. In Proceedings of the Eleventh International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  31. Pryzant, R.; Iter, D.; Li, J.; Lee, Y.T.; Zhu, C.; Zeng, M. Automatic prompt optimization with “gradient descent” and beam search. arXiv 2023, arXiv:2305.03495. [Google Scholar] [CrossRef]
  32. Cheng, J.; Liu, X.; Zheng, K.; Ke, P.; Wang, H.; Dong, Y.; Tang, J.; Huang, M. Black-box prompt optimization: Aligning large language models without model training. arXiv 2023, arXiv:2311.04155. [Google Scholar] [CrossRef]
  33. Schneider, L.; Wistuba, M.; Klein, A.; Golebiowski, J.; Zappella, G.; Merra, F.A. Hyperband-based Bayesian optimization for black-box prompt selection. arXiv 2024, arXiv:2412.07820. [Google Scholar]
  34. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s verify step by step. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  35. Li, M.; Wang, W.; Feng, F.; Cao, Y.; Zhang, J.; Chua, T.S. Robust prompt optimization for large language models against distribution shifts. arXiv 2023, arXiv:2305.13954. [Google Scholar]
  36. Ashok, D.; May, J. Language models can predict their own behavior. arXiv 2025, arXiv:2502.13329. [Google Scholar] [CrossRef]
  37. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  38. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  39. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176. [Google Scholar]
  40. Iglesias, G.; Talavera, E.; Díaz-Álvarez, A. A survey on GANs for computer vision: Recent research, analysis and taxonomy. Comput. Sci. Rev. 2023, 48, 100553. [Google Scholar] [CrossRef]
  41. Ren, D.; Cai, Y.; Li, Q. Unlocking the Power of GANs in Non-Autoregressive Text Generation. arXiv 2023, arXiv:2305.03977. [Google Scholar]
  42. Maus, N.; Chao, P.; Wong, E.; Gardner, J. Black box adversarial prompting for foundation models. arXiv 2023, arXiv:2302.04237. [Google Scholar] [CrossRef]
  43. Yu, L.; Zhang, W.; Wang, J.; Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  44. Nie, W.; Narodytska, N.; Patel, A. Relgan: Relational generative adversarial networks for text generation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  45. Guo, J.; Lu, S.; Cai, H.; Zhang, W.; Yu, Y.; Wang, J. Long text generation via adversarial training with leaked information. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  46. Zhang, R.; Chen, C.; Gan, Z.; Wang, W.; Shen, D.; Wang, G.; Wen, Z.; Carin, L. Improving Adversarial Text Generation by Modeling the Distant Future. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2516–2531. [Google Scholar]
  47. Zhang, S.; Qian, Z.; Huang, K.; Zhang, R.; Xiao, J.; He, Y.; Lu, C. Robust generative adversarial network. Mach. Learn. 2023, 112, 5135–5161. [Google Scholar] [CrossRef]
  48. Wang, J.; Sun, Q.; Li, X.; Gao, M. Boosting language models reasoning with chain-of-knowledge prompting. arXiv 2023, arXiv:2306.06427. [Google Scholar]
  49. Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4149–4158. [Google Scholar]
  50. Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguist. 2021, 9, 346–361. [Google Scholar] [CrossRef]
  51. Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv 2018, arXiv:1809.02789. [Google Scholar] [CrossRef]
  52. Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have solved question answering? Try arc, the ai2 reasoning challenge. arXiv 2018, arXiv:1803.05457. [Google Scholar] [CrossRef]
  53. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. 2023, 2023, 1–95. [Google Scholar]
  54. Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv 2019, arXiv:1905.10044. [Google Scholar]
  55. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
  56. Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP models really able to solve simple math word problems? arXiv 2021, arXiv:2103.07191. [Google Scholar] [CrossRef]
  57. Ling, W.; Yogatama, D.; Dyer, C.; Blunsom, P. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv 2017, arXiv:1705.04146. [Google Scholar] [CrossRef]
  58. Roy, S.; Roth, D. Solving general arithmetic word problems. arXiv 2016, arXiv:1608.01413. [Google Scholar] [CrossRef]
  59. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
