1. Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide spectrum of tasks [1], yet they remain susceptible to jailbreak attacks: adversarial inputs designed to bypass safety safeguards and elicit harmful content [2,3]. These attacks exploit the inherent tension between an LLM’s instruction-following nature and its safety alignment, often referred to as competing objectives. Despite the implementation of rigorous alignment objectives and content policies, attackers can effectively override these defenses through prompt manipulation, such as role-playing or refusal suppression, without altering model parameters [2,4]. The prevalence and transferability of these attacks highlight a fundamental deficiency: alignment achieved through pre-training or fine-tuning alone is insufficient to guarantee robust safety under adversarial conditions [5,6], making the development of adaptive defense mechanisms a critical priority for secure AI deployment.
A variety of defense strategies have been proposed to mitigate these risks. Traditional supervised defenses, such as safety classifiers [7], incur high maintenance and training costs, while prompt-based interventions often degrade response quality or exhibit limited robustness against evolving patterns [8,9]. More recently, multi-agent LLM systems have emerged as a promising direction [10]. For instance, the AutoDefense framework leverages collaborative reasoning among multiple agents (specializing in intent analysis, prompt inference, and final judgment) to enhance robustness through task decomposition.
However, existing multi-agent defenses typically rely on a static agent design, where agent roles and system prompts are manually defined and remain fixed throughout deployment. While some approaches attempt to improve performance via model-level fine-tuning, such optimization is often impractical for real-world deployments that rely on closed-source models or limited computational resources that only support inference-time execution. This mismatch between static defense configurations and dynamic attack evolution significantly limits the long-term efficacy of current multi-agent systems. This limitation motivates a defense mechanism that can improve robustness without relying on blanket refusal or repeated model retraining.
To bridge this gap, we propose a novel Multi-Agent LLM Defense framework optimized via Adversarial Testing, as illustrated in Figure 1. Moving beyond traditional model-level parameter updates [1,7], our approach shifts the focus to system-level optimization. We utilize the framework of adversarial testing [11,12] as a dynamic mechanism to iteratively refine the system prompts of the defense agents.
Specifically, our framework introduces a closed-loop evolutionary process consisting of two specialized components: an Attack Design agent that actively simulates adaptive adversarial samples to probe system vulnerabilities, and an Optimization agent that automatically refines the defense agents’ prompts based on attack-defense feedback. This enables the defense system to evolve its strategies during the inference phase, “patching” identified reasoning flaws without requiring any gradient-based updates to the underlying LLMs.
Our method offers four key advantages:
No Model Fine-tuning: By optimizing at the prompt level rather than the parameter level, the framework is fully compatible with off-the-shelf and closed-source LLMs, making it highly suitable for inference-only environments.
Automated System Evolution: The adversarial feedback loop enables a self-improving defense mechanism where the system automatically discovers and mitigates defensive vulnerabilities through continuous prompt refinement.
Robustness and Adaptability: Our approach enhances the system’s capability to recognize sophisticated and previously unseen jailbreak patterns while maintaining high utility and low false-positive rates for benign user requests.
Transparent Deployment Trade-off: We report the extra inference latency introduced by multi-agent adjudication and analyze how robustness improvements interact with benign-user utility, rather than relying solely on a single attack-success score.
2. Related Work
2.1. Defense Against Jailbreak Attacks
The vulnerability of Large Language Models (LLMs) to jailbreak attacks has catalyzed extensive research into robust defense mechanisms. These attacks typically exploit the “competing objectives” between a model’s instruction-following capabilities and its safety alignment [3]. Current defensive strategies can be broadly categorized into model-level interventions, prompt-based mitigations, and response-based filtering.
Model-level defenses aim to embed safety constraints directly into the model parameters through techniques like Reinforcement Learning from Human Feedback (RLHF) [1] or safety-oriented fine-tuning. However, these methods often incur prohibitive computational costs and may remain vulnerable to automated adversarial suffixes generated via discrete optimization [5,13]. To address this, prompt-based interventions such as Self-Reminders [8] and Goal Prioritization [9] attempt to steer the model towards safer responses by prepending safety instructions. Other techniques like SmoothLLM [14] and Perplexity Filtering [15] employ random perturbations or lexical analysis to neutralize adversarial patterns.
Furthermore, a significant category is response-based defense, which scrutinizes the generated output before it reaches the user. Pioneering works like LLM Self Defense [16] and Self-Guard [17] leverage the intrinsic reasoning capabilities of LLMs to self-examine and filter harmful content. Despite their efficacy, these single-model defenses often depend heavily on the instruction-following strength of the underlying LLM, creating a bottleneck for smaller, efficient open-source models [18].
2.2. Evolution of Jailbreak Attacks
Jailbreak attacks have evolved from manual prompt engineering (e.g., the DAN prompts [2]) to sophisticated automated algorithms. Semantic-based attacks such as PAIR [11] and TAP [12] utilize an auxiliary LLM to iteratively refine prompts. Linguistic-based methods like CipherChat [19] and DeepInception [4] bypass safeguards by disguising harmful intents through cipher-encoded text or nested role-playing scenarios. Recent studies have even exposed vulnerabilities in multi-turn interactions, where agents can be manipulated into generating prohibited content through task decomposition [20]. To systematically evaluate these threats, comprehensive benchmarks such as JailbreakRadar [6] have been developed to measure defense efficacy across diverse attack surfaces. JailbreakBench further establishes an open robustness benchmark for LLM jailbreak evaluation by standardizing harmful behaviors, evaluation components, and leaderboard-style reporting of attacks and defenses [21].
2.3. Multi-Agent LLM Systems for Defense
The multi-agent paradigm has emerged as a powerful alternative by decomposing complex defense tasks into specialized sub-tasks assigned to different agents [10]. A prominent example is AutoDefense [18], which employs a multi-agent framework consisting of specialized roles such as the Intention Analyzer, Prompt Analyzer, and Judge. By utilizing task decomposition, it allows efficient models like LLaMA-2 to protect larger victim models while integrating external tools like Llama Guard [7] to lower false positive rates.
However, existing multi-agent defenses are largely characterized by a static agent design where roles and system prompts remain fixed [18]. While some works explore game-theoretic evolutions like ICAG [22], they often require continuous model training. This limitation underscores the need for a framework that can adaptively optimize system-level interactions, specifically the system prompts, without modifying model weights. Our work addresses this gap by introducing an automated, adversarial testing-driven optimization process for multi-agent defense systems.
Compared with AutoDefense, which fixes the role prompts after manual design, our method treats the prompt set itself as an optimizable system state. Compared with PAIR/TAP-style jailbreak generation, the attack agent in our framework is not used to improve an attacker at deployment time; instead, it supplies controlled failure cases for a defender-side optimizer. In this setting, Attack Success Rate (ASR) denotes the percentage of harmful attempts that bypass the defense, whereas the benign false positive rate (FPR) denotes the percentage of harmless responses incorrectly rejected by the defense. Including both quantities in the optimization target prevents the system from reducing ASR through a trivial always-block policy.
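The two quantities defined above can be made concrete with a short sketch. The `Decision` record and its field names are hypothetical and chosen only for illustration; the example also shows why an always-block policy is penalized once FPR is tracked alongside ASR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One defense decision on a candidate response (hypothetical record layout)."""
    is_harmful: bool  # ground-truth label of the evaluated case
    blocked: bool     # True if the defense rejected the response


def attack_success_rate(decisions):
    """Fraction of harmful cases that were NOT blocked by the defense."""
    harmful = [d for d in decisions if d.is_harmful]
    if not harmful:
        return 0.0
    return sum(not d.blocked for d in harmful) / len(harmful)


def false_positive_rate(decisions):
    """Fraction of benign cases that WERE blocked by the defense."""
    benign = [d for d in decisions if not d.is_harmful]
    if not benign:
        return 0.0
    return sum(d.blocked for d in benign) / len(benign)


# A trivial always-block policy: every case is rejected regardless of its label.
always_block = [Decision(True, True), Decision(False, True), Decision(False, True)]

# A more selective policy on the same kind of cases (illustrative values only).
selective = [Decision(True, True), Decision(True, False),
             Decision(False, False), Decision(False, False)]
```

Under this sketch the always-block policy reaches zero ASR only by pushing FPR to its maximum, which is the degenerate behavior the combined target is meant to rule out.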
3. Methodology
3.1. Problem Formulation and Preliminaries
We focus on defending against jailbreak attacks that induce safety-aligned large language models (LLMs) to generate harmful, policy-violating, or misaligned outputs [3]. Formally, the victim LLM is a parameterized mapping from the space of user prompts to the space of generated responses: given a benign or malicious user prompt, the model produces a response y.
A jailbreak attack constructs an adversarial prompt by applying an attack transformation, such as role-playing or refusal suppression [2,5], to an original prompt. Following the threat model in prior work [16,18], we assume the attacker can manipulate the input prompt but cannot directly alter the response after generation. Therefore, our goal is to design a defense system that takes the response y, together with an optional history argument, and determines whether y should be presented to the user or overridden by a safe alternative. The history argument holds optional conversation or optimization history. When only a single response is available, it is empty; in multi-turn or iterative refinement settings, it stores prior agent rationales, failed attack cases, and prompt refinements. Consistent with the constraints of real-world deployment, we consider a zero-training setting [18], where all LLMs are fixed and only inference-time optimization is permitted.
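One compact way to write this formulation is sketched below. Apart from the response symbol y, which appears in the text, every symbol name here is an assumption chosen for illustration, and the binary pass/block codomain of the defense is likewise an assumed simplification.

```latex
% Victim model: maps a user prompt to a response (symbols assumed).
M_\theta : \mathcal{X} \to \mathcal{Y}, \qquad y = M_\theta(x), \quad x \in \mathcal{X}

% Jailbreak attack: an attack transformation applied to the original prompt.
x_{\mathrm{adv}} = \mathcal{A}(x)

% Defense system: decides on the response y given an optional history h.
\mathcal{D}(y, h) \in \{\text{pass}, \text{block}\}, \qquad h = \varnothing \ \text{when only a single response is available}
```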
3.2. Overall Framework Architecture
Building upon the response-filtering paradigm and multi-agent decomposition introduced in AutoDefense [18], we propose an Adversarial-Test-Driven Multi-Agent Defense Framework. The system consists of a set of agents, each instantiated with a role-specific system prompt and a well-defined sub-task [10].
The overall framework operates in two tightly coupled processes (see Figure 2): multi-agent defense inference over the victim model’s responses, and closed-loop adversarial refinement of the defense agents’ system prompts.
3.3. Agent Design in the Framework
3.3.1. Defense Agents
The core defense module adopts a task-decomposed structure similar to AutoDefense [18], but is enhanced to be prompt-optimizable.
Intention Analyzer: This agent aims to infer the latent intent behind a response y. It produces a structured semantic interpretation that captures potential risks, as explored in intention analysis prompting [23].
Prompt Analyzer: This agent attempts to reconstruct candidate original prompts without jailbreak artifacts [18].
Judge: This agent aggregates evidence from the previous steps to produce the final judgment, prioritizing safety based on predefined content policies [7].
3.3.2. Attack Design and Optimization Agents
To move beyond static defense, we introduce two specialized agents for system evolution:
Attack Design Agent: Explicitly models an adaptive adversary to generate challenging samples that target defense weaknesses [11,12].
Optimization Agent: Refines the system prompts of the defense agents based on adversarial feedback, ensuring the system evolves without model parameter updates [22].
3.4. System-Level Optimization via Adversarial Refinement
The core contribution of this work is the transition from a static defense configuration to a dynamic, self-evolving system. Unlike prior multi-agent defenses that rely on fixed instructions [18], our framework iteratively refines the system prompts of the defense agents through a closed-loop adversarial testing process.
3.4.1. Adversarial Optimization Objective
The objective of our framework is to find an optimal set of prompts that minimizes the bypass probability of adversarial attacks while maintaining high utility for benign requests. Formally, the joint loss function combines two terms: the Attack Success Rate under an adaptive adversary and the False Positive Rate on a benign dataset, weighted by a trade-off coefficient that prioritizes safety. In implementation, we use validation batches to estimate these two terms and select prompt updates only when they do not increase the weighted objective. This prevents the optimizer from obtaining lower ASR by simply rejecting benign requests.
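The acceptance rule described above can be sketched as follows. The additive form `ASR + lam * FPR` and the name `lam` are assumptions; the text only establishes that the objective weights the two rates with a trade-off coefficient and that an update is kept when it does not increase that objective.

```python
def joint_loss(asr: float, fpr: float, lam: float) -> float:
    """Weighted objective. ASSUMED form: ASR + lam * FPR (illustrative only)."""
    return asr + lam * fpr


def accept_update(current, candidate, lam):
    """Keep a prompt update only if it does not increase the weighted objective.

    `current` and `candidate` are (asr, fpr) pairs estimated on validation batches.
    """
    return joint_loss(*candidate, lam) <= joint_loss(*current, lam)


# Illustrative numbers, not results from the paper.
current_metrics = (0.20, 0.05)
useful_update = (0.10, 0.06)      # fewer bypasses, slightly more over-blocking
always_block_update = (0.0, 1.0)  # zero bypasses, every benign request rejected
```

With `lam = 1.0` in this sketch, the useful update is accepted while the always-block update is rejected, since its FPR term outweighs the ASR reduction.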
3.4.2. The Iterative Optimization Loop
The optimization proceeds in discrete rounds, forming a closed-loop refinement cycle as shown in the right panel of Figure 2 and summarized in Algorithm 1.
Phase 1: Adaptive Attack Generation. The Attack Design Agent acts as an automated red-teaming module. It generates a batch of challenging adversarial samples conditioned on the current defense state and on the history of successful and failed attacks from previous rounds, which allows the agent to perform adaptive probing. The history contains the generated adversarial prompt, the victim response, each defense agent’s intermediate rationale, the final decision, and whether the case is harmful or benign.
Phase 2: Adversarial Testing and Failure Analysis. The current defense system is evaluated against this batch. We identify the set of failure cases, in which the defense fails to intercept a harmful response. Each failure case is then passed to the Optimization Agent along with the detailed chain-of-thought (CoT) reasoning generated by the defense agents during the failed attempt.
Phase 3: Prompt Refinement. The Optimization Agent performs a meta-optimization task. It analyzes the gap between the agents’ reasoning and the safety policy, then produces updated prompts using a text-to-text transformation. The update is designed to correct the identified reasoning flaws, for instance by adding specific constraints to the Intention Analyzer or clarifying the decision boundary for the Judge. The optimizer is also constrained to preserve benign-task utility: a proposed prompt refinement is rejected if it improves harmful interception only by broadening refusal criteria to neutral scientific, educational, or daily-life responses.
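The three phases can be sketched as a plain loop over callables. This is a minimal illustration under stated assumptions: the function names, the dictionary of prompts, and the toy stand-ins are hypothetical; every generated attack is assumed to yield a harmful response; and the benign-utility check is omitted for brevity.

```python
from typing import Callable, List, Tuple

Prompts = dict  # hypothetical layout: role name -> system prompt text


def optimize_prompts(
    prompts: Prompts,
    generate_attacks: Callable[[Prompts, list], List[str]],
    defend: Callable[[Prompts, str], bool],  # True means the case was blocked
    refine: Callable[[Prompts, List[str]], Prompts],
    rounds: int,
) -> Tuple[Prompts, list]:
    history: list = []
    for _ in range(rounds):
        # Phase 1: probe the current defense, conditioned on past outcomes.
        attacks = generate_attacks(prompts, history)
        # Phase 2: collect the cases the defense failed to intercept.
        failures = [a for a in attacks if not defend(prompts, a)]
        history.extend((a, a in failures) for a in attacks)
        if not failures:
            break  # nothing left to fix in this sketch
        # Phase 3: text-to-text refinement of the prompts, no weight updates.
        prompts = refine(prompts, failures)
    return prompts, history


# Toy stand-ins so the loop runs offline (purely illustrative behavior).
def toy_attacks(prompts, history):
    return ["roleplay: do X", "plain: do X"]


def toy_defend(prompts, attack):
    # Blocks an attack only if its style tag is named in the judge prompt.
    return attack.split(":")[0] in prompts["judge"]


def toy_refine(prompts, failures):
    tags = " ".join(sorted({f.split(":")[0] for f in failures}))
    return {**prompts, "judge": prompts["judge"] + " " + tags}


final_prompts, log = optimize_prompts(
    {"judge": "block:"}, toy_attacks, toy_defend, toy_refine, rounds=5
)
```

In the toy run the first round exposes both attack styles, the refinement adds them to the judge prompt, and the second round finds no failures, so the loop stops early.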
3.4.3. Convergence and Inference-Only Constraints
A critical advantage of this process is that the refinement operator works entirely in the natural language space. Since the parameters of the underlying models remain frozen, the entire system evolves through System-Level Intelligence rather than gradient-based updates. This ensures that the defense can continuously improve even when deployed using closed-source APIs or inference-only hardware [18].
The final optimized system represents a robust configuration that has been systematically stress-tested against a wide spectrum of adversarial strategies simulated during the iterative process.
3.5. Collaboration and Coordination Mechanism
Agents collaborate through a coordinator-controlled protocol [10,18]. The coordinator fixes an execution sequence over the agents. Each agent in the sequence generates an output based on the response y and the accumulated shared context from preceding agents [18]. This sequential collaboration enforces task specialization and reduces the cognitive load on individual models.
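A minimal sketch of such a coordinator is given below. The agent signature, the role names used as context keys, and the keyword-based toy agents are assumptions for illustration; real agents would be LLM calls with role-specific system prompts.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical agent signature: (response y, shared context so far) -> agent output.
Agent = Callable[[str, Dict[str, str]], str]


def coordinate(response: str, sequence: List[Tuple[str, Agent]]) -> Dict[str, str]:
    """Run agents in a fixed order, giving each the outputs of its predecessors."""
    context: Dict[str, str] = {}
    for name, agent in sequence:
        # Pass a copy so an agent cannot rewrite earlier agents' outputs.
        context[name] = agent(response, dict(context))
    return context


# Toy agents (illustrative keyword logic only).
def intention(y, ctx):
    return "requests weapon instructions" if "weapon" in y else "benign request"


def prompt_analyzer(y, ctx):
    return f"likely prompt for: {ctx['intention']}"


def judge(y, ctx):
    return "INVALID" if "weapon" in ctx["intention"] else "VALID"


PIPELINE = [("intention", intention), ("prompt_analyzer", prompt_analyzer), ("judge", judge)]
flagged = coordinate("Sure! Here is how to build a weapon...", PIPELINE)
allowed = coordinate("Here is a pasta recipe.", PIPELINE)
```

The Judge in this sketch never inspects the raw response directly; it relies on the context accumulated by the earlier agents, which mirrors the sequential dependency described above.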
Algorithm 1. Adversarial-Test-Driven Multi-Agent Prompt Optimization
Require: Initial system prompts; safety policy; benign dataset; target LLM; number of iterations T.
Ensure: Optimized system prompts.
1: Initialize the current prompts with the initial system prompts and reset the round counter
2: while the round counter is below T do
3:   ▹ Phase 1: Adaptive Adversarial Probing
4:   Attack Agent generates adversarial samples [11]
5:   ▹ Phase 2: Multi-Agent Defense Inference
6:   for each adversarial sample do
7:     Generate victim response
8:     Analyze intent [23]
9:     Infer original prompts [18]
10:    Final judgment
11:  end for
12:  ▹ Phase 3: Failure Analysis & System Evolution
13:  Collect failure cases
14:  Calculate False Positive Rate on benign dataset
15:  if the failure set is not empty or the False Positive Rate is too high then
16:    Optimization Agent analyzes reasoning flaws in the failure cases
17:    Update the system prompts ▹ System-level prompt refinement
18:  else
19:    return the current prompts ▹ Convergence reached
20:  end if
21:  Increment the round counter
22: end while
23: return the current prompts
4. Experiment
4.1. Experimental Setup
To rigorously assess the defensive resilience and operational stability of our multi-agent framework, we structure our datasets into two distinct tiers: a compact development set designated for prompt tuning and hyper-parameter optimization, and a comprehensive large-scale evaluation corpus used to determine the final reported metrics [18].
Adversarial Evaluation Corpus: We measure the robustness of our defense against malicious jailbreak attempts by utilizing two primary sources of adversarial inputs: (1) a Curated Set of 33 severe-risk prompts extracted from the official red-teaming initiatives of OpenAI and Anthropic [3,24], which targets high-stakes vulnerabilities including terrorism, self-inflicted harm, and the exposure of personally identifiable information (PII) [25]; and (2) a DAN Set containing 390 prompts distributed across 13 restricted categories, originating from the “Do Anything Now” vulnerability analyses [2]. In alignment with the experimental design in [18], we designate GPT-3.5-Turbo-1106 as the target victim LLM. To effectively elicit toxic outputs, we deploy the Combination-1 attack strategy [3], which pairs prefix injection (mandating a harmless affirmative opening) with refusal suppression (eliminating standard refusal terminology). Ultimately, our adversarial testing suite comprises 330 curated response pairs (10 generations per prompt) and 1950 DAN pairs (5 generations per prompt) [18].
Benign Inquiry Corpus: To accurately measure false-positive rates and prevent over-defensiveness, we introduce two inherently safe datasets into our pipeline: (1) Safe Inquiries, which provides 330 routine scientific and daily-life queries, initially synthesized by GPT-4 and subsequently processed through GPT-3.5 [18]; and (2) Stanford Alpaca [26], from which we randomly sample 1000 instruction-response pairs from the original 52K dataset to act as proxies for a diverse array of standard, benign user interactions.
Cross-Model Safety–Utility Protocol: The JailbreakBench initiative established a standardized paradigm for evaluating LLM vulnerabilities by unifying the definitions of malicious behaviors, evaluation methodologies, and reporting metrics [21]. Inspired by this standardization, we construct a cross-model evaluation protocol tailored specifically to the response-filtering mechanism of our defense, encompassing both safe and adversarial behaviors. We deploy our core defense strategy across two distinct API backbones: GPT-3.5-Turbo-1106 and GPT-4.1. Performance is quantified using two primary indicators. For adversarial inputs, the Attack Success Rate (ASR) measures the proportion of malicious candidate outputs that successfully bypass the defense framework. Conversely, for benign inputs, the False Positive Rate (FPR) calculates the fraction of harmless user requests that are erroneously intercepted. Given that our implementation-level analyses rely on automated defense logs rather than independent human annotation passes, we report these metrics as decision-based ASR and FPR estimations. System utility is formally defined as 1 − FPR, statistical variance is bounded using Wilson 95% confidence intervals, and the computational overhead is reported as the mean wall-clock latency per filtering decision.
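For reference, a Wilson score interval for a rate such as ASR or FPR can be computed as in the sketch below. The function names are illustrative, and treating utility as the complement of FPR is an assumption that matches the 83.3% utility and 16.7% FPR reported later in the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def utility(fpr: float) -> float:
    """Assumed definition: utility is the complement of the false positive rate."""
    return 1.0 - fpr


# Example: 5 bypasses observed in 10 harmful cases (illustrative counts only).
low, high = wilson_interval(5, 10)
```

Unlike the normal-approximation interval, the Wilson interval stays informative when a rate is at or near zero, which is common for ASR on small evaluation subsets.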
4.2. Main Results
Comparisons with baseline defenses: We evaluate the performance of our proposed framework against a comprehensive suite of defense mechanisms using GPT-3.5 Turbo as the victim model. Following the standard protocol for multi-agent safety, we utilize LLaMA-2-13B as the backbone for our defense agents [18,27]. To ensure a fair comparison and highlight the incremental value of our system-level optimization, we align our results with the competitive baselines reported in Table 1.
Beyond the aggregate ASR comparison, we analyze several implementation configurations to clarify which components drive the observed behavior. The adversarial-testing-only configuration keeps the red-teaming probe but omits prompt optimization, the optimization-only configuration performs prompt refinement without the adaptive attack loop, the single-agent classifier configuration collapses the defense into one safety-judgment call, and the static three-agent configuration uses task decomposition without adversarially optimized prompts. The full AT-Driven Multi-Agent framework combines adaptive attack generation, prompt optimization, and multi-agent adjudication. We also include two utility-oriented settings: a context-free adjudication setting, which judges the candidate response without explicit user-intent grounding, and a context-grounded setting, which conditions the final decision on the conversation context and inferred user intent. All entries in Table 2 are reported as percentages within their corresponding evaluation scope, avoiding direct comparison of raw corpus sizes across harmful and benign subsets.
Table 2 shows that the proposed framework obtains the lowest harmful pass rate among the harmful-response configurations, indicating that adaptive attack generation and prompt optimization are complementary to multi-agent adjudication. The utility-oriented rows should be interpreted within their own evaluation scopes rather than as raw-count comparisons. They show that context grounding substantially reduces benign over-blocking and latency, which is important for deployment settings where a defense must reject unsafe content without suppressing harmless user requests. Note that the ASR values in Table 1 reflect the end-to-end performance of the fully integrated framework on the complete evaluation dataset, whereas the harmful pass rates in Table 2 are measured within isolated implementation configurations on the Combination-1 attack subset; the numerical differences between the two tables therefore stem from their distinct evaluation scopes rather than any methodological inconsistency.
As illustrated in Table 1, our AT-Driven Multi-Agent framework achieves an Attack Success Rate (ASR) of 7.82%, outperforming the state-of-the-art static multi-agent defense, AutoDefense (7.95%) [18]. The implementation-level analysis in Table 2 further shows that combining adaptive attack generation, prompt optimization, and multi-agent adjudication yields a lower decision-based harmful pass rate than using any of these design choices in isolation. Exact McNemar tests against the optimization-only, single-agent classifier, and static three-agent configurations all yield p-values below the significance threshold, indicating that the improvement is not attributable to random decision fluctuations in DAN responses generated with the Combination-1 attack. While prompt-based interventions such as Self-Reminder offer a baseline level of protection, they often struggle with the semantic complexity of jailbreak templates [3,8]. In contrast, our framework utilizes an Attack Design agent to actively probe for vulnerabilities and an Optimization agent to iteratively refine the defense prompts [22]. This iterative hardening ensures the system maintains robust boundaries against the “competing objectives” exploited by modern attacks [3].
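An exact McNemar test compares two configurations on the same cases by looking only at the discordant pairs, that is, cases where exactly one configuration blocks the response. The sketch below is a standard two-sided exact version; the function name and the example counts are illustrative and are not taken from the paper's tables.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases blocked by configuration A but passed by configuration B.
    c: cases passed by configuration A but blocked by configuration B.
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts: one configuration wins all 10 discordant cases.
p_one_sided_split = mcnemar_exact(0, 10)
# Illustrative counts: discordant cases split evenly between configurations.
p_even_split = mcnemar_exact(5, 5)
```

Because the test is paired, it is appropriate only when both configurations are evaluated on the identical set of candidate responses.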
Furthermore, the multi-agent approach significantly outperforms the standard Safety Classifier. This validates that the structural decomposition of defense tasks, specifically the separation of intention analysis and prompt reconstruction, is more effective than single-pass classification for identifying sophisticated policy violations [10,18].
4.3. Ablation Study
To investigate the individual contribution of each component within our framework, we conduct an ablation study by comparing the full method against two critical variants: w/o AT, which removes the adaptive adversarial testing loop, and w/o OPT, which removes prompt optimization.
The performance comparison on the DAN dataset is summarized in Table 3. The results demonstrate the impact of each module on the overall Attack Success Rate (ASR) when defending the GPT-3.5-Turbo-1106 victim model.
Synergy of Adversarial Probing and Optimization. As shown in Table 3, the removal of either the adversarial testing (AT) or the prompt optimization (OPT) component leads to a degradation in defense performance. Notably, the w/o AT variant exhibits the highest ASR, suggesting that without active adversarial probing, the system cannot effectively anticipate and prepare for the sophisticated, adaptive attack patterns often encountered in real-world scenarios [12].
The w/o OPT variant performs significantly better than the OPT-only (w/o AT) variant but still falls short of the full system. This indicates that while generating adversarial samples provides a robust “stress test” for the agents, the Optimization agent is essential for translating failure cases into actionable refinements of the system prompts [22]. The full method achieves the best performance, validating the hypothesis that a closed-loop interaction between an adaptive adversary and an automated optimizer creates the most robust defensive boundary. This synergy allows the system to not only identify current weaknesses but also “patch” its reasoning logic through system-level prompt refinement [18].
4.4. Sample Analysis
To qualitatively evaluate the superiority of our adversarial-test-driven framework, we conduct a comparative analysis of defense behaviors on the DAN evaluation set [18]. Our focus is on cases where the static baseline fails to identify harmful content, a phenomenon often caused by the victim model disguising policy violations as “educational” or “informative” responses.
Statistical Improvement: As summarized in Table 4, the AT-Driven method demonstrates a significant advancement in interception capabilities. While the static AutoDefense baseline struggles with a high bypass rate (64.36% passed), our framework increases the block rate from 34.72% to 57.38%. This improvement of 22.66 percentage points is primarily attributed to the Optimization agent’s ability to refine the decision boundaries of the defense agents through iterative adversarial feedback.
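The distinction between an absolute gain in percentage points and a relative gain can be checked with a few lines of arithmetic on the two block rates quoted above. The helper name is illustrative.

```python
def block_rate_gain(before_pct: float, after_pct: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100
    return round(absolute, 2), round(relative, 2)


# Block rates quoted in the text: 34.72% for the baseline, 57.38% for the AT-Driven method.
absolute_points, relative_percent = block_rate_gain(34.72, 57.38)
```

The absolute difference corresponds to the 22.66 figure in the text, while the relative increase over the baseline block rate is considerably larger, so the two should not be conflated when reporting the result.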
Qualitative Case Study: Table 5 provides a granular comparison of defense reasoning. These cases highlight a common failure mode in static defenses: they are often deceived by responses that prepend an innocuous affirmative (e.g., “Certainly!”) or frame illegal activities as “general information”.
For instance, in the Hate Speech and Cyberbullying categories, the baseline defense is bypassed because the harmful content is presented under the guise of “providing examples” or “raising awareness”. However, our Optimization agent identifies that providing explicit slurs or actionable campaign strategies constitutes a high-risk violation regardless of the framing. By refining the system prompts, the Intention Analyzer in our framework learns to peel back the “informative” layer to reveal the underlying harmful instruction, resulting in a robust Blocked judgment where the baseline remains oblivious.
4.5. Cross-Model Transferability and Utility Evaluation
To examine whether the optimized multi-agent defense generalizes beyond the DAN setting, we conduct a JailbreakBench-informed cross-model validation. JailbreakBench was designed to make jailbreak evaluation more reproducible and comparable by specifying harmful behaviors, evaluation components, and leaderboard-style reporting [21]. We adopt this evaluation philosophy to analyze transferability, benign-user over-refusal, utility, and latency under a unified protocol. This analysis is intended to characterize deployment behavior in addition to the standardized DAN comparison above.
We compare three online decision configurations, each corresponding to a clear deployment choice within the same AT-Driven Multi-Agent Defense framework. Single-pass context-aware judge denotes one optimized defense-judge call that receives the conversation context and candidate response. Context-grounded two-step judge first extracts the user intent from the conversation context and then evaluates the candidate response against that grounded intent. Full multi-agent adjudication denotes the complete proposed defense setting, in which specialized defense agents analyze the interaction and a final decision step aggregates their judgments.
Table 6, Figure 3, and Figure 4 show that the three configurations obtain the same average decision-based ASR estimate (12.5%) across the two backbones, while benign utility differs substantially. Full multi-agent adjudication has the best aggregate utility (83.3%) and the lowest mean decision-based FPR estimate (16.7%), but it also has the highest latency (3.59 s). This pattern indicates that multi-agent coordination is most beneficial for reducing false positives and improving decision reliability, while deployments with strict latency budgets may use the context-grounded two-step judge as a lower-cost alternative.
The coefficient sensitivity results in Table 7 explain the role of the objective coefficient in Equation (2). Across the tested coefficient values, the ranking remains stable: full multi-agent adjudication achieves the lowest weighted loss because its lower FPR compensates for the additional latency.
4.6. Hyperparameter Analysis
We evaluate the robustness of our framework across four key parameters, as illustrated in Figure 5. The analysis focuses on ASR@3 and ASR@4 to assess the stability of the defense under varied adversarial conditions. Rather than interpreting the curves as a universal zero-failure guarantee, we use them to identify stable regions for adversarial threshold, attack complexity, attack diversity, and detection sensitivity.
Adversarial Threshold and Detection Sensitivity. As shown in Figure 5a, the defense performance peaks at an adversarial threshold of 0.7. A higher threshold (0.9) tends to filter out valuable adversarial signals required for system prompt optimization, whereas lower values provide insufficient challenges for the defense to evolve. Regarding detection sensitivity (Figure 5d), the ASR reaches its highest point under medium sensitivity. While high sensitivity effectively blocks attacks through stricter criteria, the low sensitivity setting also demonstrates strong performance (11.03% ASR@3). This counterintuitive result suggests that lower sensitivity reduces the frequency of over-corrections and false-positive-driven updates, allowing the agents to maintain a more generalized understanding of safety constraints.
Attack Complexity and Diversity: The impact of attack complexity is depicted in Figure 5b. The framework is most challenged by medium-complexity attacks, while the ASR drops to 11.18% for high-complexity prompts. This is because highly complex attacks often introduce unnatural semantic patterns that are readily flagged by our specialized Intention and Prompt Analyzers. For attack diversity (Figure 5c), we observe an optimal point for the adversary at the 0.3 level. Beyond this threshold, excessive transformations likely dilute the specific attack intent or create linguistic incoherence, leading to a steady decline in attack success rates as the defense agents can more easily identify these outliers.
5. Conclusions
In this paper, we introduced an Adversarial-Test-Driven Multi-Agent Defense framework designed to enhance the robustness of LLMs against jailbreak attacks. By transitioning from fixed agent instructions to a dynamic, self-evolving configuration, our system iteratively refines the reasoning logic of defense agents through a closed-loop adversarial testing process. The core strength of our approach lies in its ability to achieve significant safety gains (reducing the Attack Success Rate (ASR) to 7.82% and improving content interception by over 22 percentage points) entirely within the inference-only regime. This ensures full compatibility with both open-source and proprietary LLMs without the need for costly model fine-tuning or parameter updates. The safety gains are best interpreted together with benign FPR and latency: on the JailbreakBench-informed cross-model protocol, full multi-agent adjudication improves aggregate utility to 83.3% but increases latency to 3.59 s, while context-grounded decision making can reduce benign over-blocking in implementation-level utility analysis.
Our hyperparameter and ablation studies further confirm that the synergy between adaptive adversarial probing and automated prompt refinement is critical for identifying and neutralizing sophisticated policy violations that bypass traditional filters. At the same time, the method has practical deployment costs: it introduces additional LLM calls, depends on representative benign evaluation corpora for balancing harmful-response interception and benign-request preservation, and requires careful grounding of the optimization prompt to avoid conservative over-refusal. Future work will extend the JailbreakBench-informed cross-model protocol, include more open-source backbones, and explore cost-aware routing that invokes the full multi-agent adjudication module only on uncertain cases. As jailbreak attacks continue to grow in complexity, our work provides a scalable and efficient path for developing adaptive safeguards that can evolve alongside emerging threats, ensuring more secure and reliable AI interactions in real-world deployments.
Author Contributions
Conceptualization, Y.Q., Y.H. and H.C.; Data curation, Y.Q., Y.H. and S.L.; Formal analysis, Y.Q.; Methodology, Y.Q. and Y.H.; Project administration, H.C.; Software, Y.Q., Y.H. and H.C.; Validation, Y.Q., L.C. and J.W.; Visualization, Y.Q.; Writing—original draft, Y.Q.; Writing—review and editing, Y.Q., Y.H. and H.C. All authors have read and agreed to the published version of the manuscript.
Funding
The APC was funded by Zhilin Technology Co., Ltd. The funder did not participate in the design of the study, data collection, analysis, interpretation of results, manuscript preparation, or the decision to publish.
Data Availability Statement
Publicly available datasets were analyzed in this study. The adversarial and benign evaluation corpora, including the OpenAI/Anthropic Curated Set, DAN Set, Stanford Alpaca, and JailbreakBench benchmark, are open-source. Detailed descriptions and relevant references for these datasets are provided in Section 4.1 of this manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
2. Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1671–1685.
3. Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How does llm safety training fail? Adv. Neural Inf. Process. Syst. 2023, 36, 80079–80110.
4. Li, X.; Zhou, Z.; Zhu, J.; Yao, J.; Liu, T.; Han, B. DeepInception: Hypnotize large language model to be jailbreaker. In Proceedings of the NeurIPS Safe Generative AI Workshop 2024, Vancouver, BC, Canada, 14–15 December 2024.
5. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023, arXiv:2307.15043.
6. Chu, J.; Liu, Y.; Yang, Z.; Shen, X.; Backes, M.; Zhang, Y. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 21538–21566.
7. Inan, H.; Upasani, K.; Chi, J.; Rungta, R.; Iyer, K.; Mao, Y.; Tontchev, M.; Hu, Q.; Fuller, B.; Testuggine, D.; et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv 2023, arXiv:2312.06674.
8. Wu, F.; Xie, Y.; Yi, J.; Shao, J.; Curl, J.; Lyu, L.; Chen, Q.; Xie, X. Defending chatgpt against jailbreak attack via self-reminder. Nat. Mach. Intell. 2023, 5, 1486–1496.
9. Zhang, Z.; Yang, J.; Ke, P.; Mi, F.; Wang, H.; Huang, M. Defending large language models against jailbreaking attacks through goal prioritization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 8865–8887.
10. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
11. Chao, P.; Robey, A.; Dobriban, E.; Hassani, H.; Pappas, G.J.; Wong, E. Jailbreaking black box large language models in twenty queries. In Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), Copenhagen, Denmark, 9–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 23–42.
12. Mehrotra, A.; Zampetakis, M.; Kassianik, P.; Nelson, B.; Anderson, H.; Singer, Y.; Karbasi, A. Tree of attacks: Jailbreaking black-box llms automatically. Adv. Neural Inf. Process. Syst. 2024, 37, 61065–61105.
13. Jia, X.; Pang, T.; Du, C.; Huang, Y.; Gu, J.; Liu, Y.; Cao, X.; Lin, M. Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
14. Robey, A.; Wong, E.; Hassani, H.; Pappas, G.J. Smoothllm: Defending large language models against jailbreaking attacks. arXiv 2023, arXiv:2310.03684.
15. Alon, G.; Kamfonas, M. Detecting language model attacks with perplexity. arXiv 2023, arXiv:2308.14132.
16. Phute, M.; Helbling, A.; Hull, M.D.; Peng, S.; Szyller, S.; Cornelius, C.; Chau, D.H. LLM self defense: By self examination, LLMs know they are being tricked. In Proceedings of the Second Tiny Papers Track at ICLR 2024, Vienna, Austria, 11 May 2024.
17. Wang, Z.; Yang, F.; Wang, L.; Zhao, P.; Wang, H.; Chen, L.; Lin, Q.; Wong, K.F. Self-guard: Empower the llm to safeguard itself. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1648–1668.
18. Zeng, Y.; Wu, Y.; Zhang, X.; Wang, H.; Wu, Q. AutoDefense: Multi-agent LLM defense against jailbreak attacks. In Proceedings of the NeurIPS Safe Generative AI Workshop 2024, Vancouver, BC, Canada, 14–15 December 2024.
19. Yuan, Y.; Jiao, W.; Wang, W.; Huang, J.; He, P.; Shi, S.; Tu, Z. GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024.
20. Srivastav, D.; Zhang, X. Safe in isolation, dangerous together: Agent-driven multi-turn decomposition jailbreaks on llms. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), Vienna, Austria, 31 July 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 170–183.
21. Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G.J.; Tramer, F.; et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Adv. Neural Inf. Process. Syst. (Neurips) 2024, 37, 55005–55029.
22. Zhou, Y.; Han, Y.; Zhuang, H.; Guo, K.; Liang, Z.; Bao, H.; Zhang, X. Defending Jailbreak Prompts via In-Context Adversarial Game. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 20084–20105.
23. Zhang, Y.; Ding, L.; Zhang, L.; Tao, D. Intention analysis makes LLMs a good jailbreak defender. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 2947–2968.
24. Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional ai: Harmlessness from ai feedback. arXiv 2022, arXiv:2212.08073.
25. Li, H.; Guo, D.; Fan, W.; Xu, M.; Huang, J.; Meng, F.; Song, Y. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 4138–4153.
26. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Alpaca: A strong, replicable instruction-following model. Stanf. Cent. Res. Found. Model. 2023, 3, 7.
27. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
Figure 1.
Overview of the proposed Multi-Agent LLM Defense framework optimized via Adversarial Testing. The system employs a closed-loop evolutionary process: an Attack Design agent probes vulnerabilities with adaptive samples, while an Optimization agent iteratively refines the defense agents’ system prompts based on feedback, enabling continuous self-improvement without model parameter updates.
Figure 2.
Architecture of the Adversarial-Test-Driven Multi-Agent Defense Framework. The diagram illustrates the interplay between the two core components: the Defense Inference Process, where the Intention Analyzer, Prompt Analyzer, and Judge collaborate to filter responses, and the Adversarial Optimization Process, where the Attack Design agent probes vulnerabilities and the Optimization agent refines system prompts based on feedback. This closed-loop design ensures system-level evolution while keeping model parameters frozen.
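
The closed loop described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `"BLOCK"`/`"ALLOW"` verdict strings, the sequential hand-off between agents, and the keyword-matching `toy_*` stand-ins for LLM calls are all hypothetical. The only point it makes is structural: system prompts are the sole state that changes between rounds, and no model weights are touched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Prompts = Dict[str, str]  # agent name -> system prompt (the only state that evolves)


@dataclass(frozen=True)
class Sample:
    text: str
    harmful: bool  # label known to the attack generator


def defense_blocks(prompts: Prompts, sample: Sample,
                   call_agent: Callable[[str, str], str]) -> bool:
    """Run the three defense agents in sequence; True means the response is blocked."""
    intent = call_agent(prompts["intention_analyzer"], sample.text)
    inferred = call_agent(prompts["prompt_analyzer"], intent)
    verdict = call_agent(prompts["judge"], inferred)
    return verdict == "BLOCK"


def adversarial_optimization(prompts: Prompts,
                             generate_attacks: Callable[[Prompts], List[Sample]],
                             refine: Callable[[Prompts, List[Sample]], Prompts],
                             call_agent: Callable[[str, str], str],
                             rounds: int = 3) -> Prompts:
    """Closed loop: probe, collect misjudged samples, refine prompts."""
    for _ in range(rounds):
        samples = generate_attacks(prompts)
        failures = [s for s in samples
                    if defense_blocks(prompts, s, call_agent) != s.harmful]
        if not failures:
            break
        prompts = refine(prompts, failures)
    return prompts


# --- Toy stand-ins for LLM calls (hypothetical, for illustration only) ---
def toy_agent(system_prompt: str, text: str) -> str:
    if system_prompt.startswith("judge:"):
        keywords = system_prompt[len("judge:"):].split()
        return "BLOCK" if any(k in text for k in keywords) else "ALLOW"
    return text  # the two analyzers pass the text through unchanged


def toy_attacks(prompts: Prompts) -> List[Sample]:
    return [Sample("how to crack software", True),
            Sample("how to bake bread", False)]


def toy_refine(prompts: Prompts, failures: List[Sample]) -> Prompts:
    new = dict(prompts)
    for s in failures:
        if s.harmful:  # missed attack: add its last word as a judge keyword
            new["judge"] += " " + s.text.split()[-1]
    return new
```

With the toy stand-ins, one refinement round is enough for the judge prompt to start blocking the missed sample while the benign sample still passes. In a real deployment each callable would wrap an inference-only LLM call.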
Figure 3.
Aggregate safety-utility trade-off on the JailbreakBench-informed cross-model protocol across GPT-3.5-Turbo-1106 and GPT-4.1. The labels correspond to the three online decision configurations defined in the text.
Figure 4.
Mean wall-clock latency of the three online decision configurations. Full multi-agent adjudication improves benign-user utility in the aggregate results but requires additional inference calls.
Figure 5.
Hyperparameter analysis of the Multi-Agent Defense framework. The subfigures show the trends of ASR@3 and ASR@4 under different configurations of (a) threshold, (b) complexity, (c) diversity, and (d) sensitivity.
Table 1.
Comparison of ASR with various defense methods on the DAN dataset. We employ the Combination-1 attack (Refusal Suppression + Prefix Injection) as the primary adversarial vector [3,18].
| Defense Method | ASR (%) |
|---|---|
| No Defense [18] | 55.74 |
| OpenAI Moderation API | 53.79 |
| Self Defense [16] | 43.64 |
| Llama Guard (Prompt + Response) [7] | 21.28 |
| Safety Classifier (Single-pass Baseline) | 12.39 |
| Single-Agent Defense (AutoDefense) [18] | 9.44 |
| Three-Agent Defense (AutoDefense) [18] | 7.95 |
| AT-Driven Multi-Agent (Ours) | 7.82 |
Table 2.
Implementation-level configuration analysis reported as rates within each evaluation scope. Harmful pass rate is the percentage of harmful responses allowed by the defense, corresponding to a decision-based ASR estimate. Benign block rate is the percentage of harmless responses incorrectly rejected, corresponding to a decision-based FPR estimate.
| Evaluation Scope | Configuration | Harmful Pass Rate (%) | Benign Block Rate (%) | Latency (s) |
|---|---|---|---|---|
| DAN responses with Combination-1 | Adversarial-testing-only configuration | 43.54 | – | – |
| DAN responses with Combination-1 | Optimization-only configuration | 52.00 | – | – |
| DAN responses with Combination-1 | Single-agent classifier configuration | 64.15 | – | – |
| DAN responses with Combination-1 | Static three-agent configuration | 65.28 | – | – |
| DAN responses with Combination-1 | Full AT-Driven Multi-Agent framework | 42.62 | – | – |
| Benign utility evaluation | Context-free adjudication setting | – | 87.50 | 17.21 |
| Balanced safety–utility evaluation | Context-grounded deployment setting | 58.33 | 0.00 | 2.81 |
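
The two rates defined in the Table 2 caption can be computed from per-response decisions as in the sketch below. The record layout, a pair `(is_harmful, was_blocked)`, and the function name are assumptions made for illustration; the paper does not specify a data format.

```python
from typing import Iterable, Tuple


def pass_and_block_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (harmful pass rate %, benign block rate %).

    Each record is (is_harmful, was_blocked). A class with no samples
    yields NaN, matching the "–" cells in the table.
    """
    harmful = benign = harmful_passed = benign_blocked = 0
    for is_harmful, was_blocked in records:
        if is_harmful:
            harmful += 1
            harmful_passed += not was_blocked  # attack slipped through
        else:
            benign += 1
            benign_blocked += was_blocked      # benign request over-blocked
    nan = float("nan")
    harmful_pass_rate = 100.0 * harmful_passed / harmful if harmful else nan
    benign_block_rate = 100.0 * benign_blocked / benign if benign else nan
    return harmful_pass_rate, benign_block_rate


# Four harmful responses (one allowed) and two benign ones (one blocked).
example = [(True, False), (True, True), (True, True), (True, True),
           (False, True), (False, False)]
print(pass_and_block_rates(example))
```

Reporting the two rates separately per evaluation scope, as the table does, avoids mixing the class balance of the harmful and benign corpora into a single number.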
Table 3.
Ablation study results on the DAN dataset, illustrating the performance impact of the Adversarial Testing (AT) and Prompt Optimization (OPT) modules.
| Method Variant | ASR (%) |
|---|---|
| AutoDefense (Static Baseline) [18] | 7.95 |
| w/o AT (OPT-only) | 10.13 |
| w/o OPT (AT-only) | 8.04 |
| Full Method (AT + OPT) | 7.82 |
Table 4.
Defense outcome statistics on the DAN dataset (Total: 1950 pairs). The AT-Driven method significantly enhances the system’s ability to discern harmful content.
| Method | Blocked | Passed | Block Rate (%) |
|---|---|---|---|
| AutoDefense [18] | 677 | 1273 | 34.72 |
| AT-Driven | 1119 | 831 | 57.38 |
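
As a quick arithmetic check, the block rates in Table 4 follow directly from the blocked counts and the stated total of 1950 pairs. The variable names below are illustrative only.

```python
TOTAL_PAIRS = 1950  # total stated in the Table 4 caption
blocked = {"AutoDefense": 677, "AT-Driven": 1119}

# Block rate in percent, and the number of pairs that passed.
block_rate = {name: 100.0 * n / TOTAL_PAIRS for name, n in blocked.items()}
passed = {name: TOTAL_PAIRS - n for name, n in blocked.items()}

# Absolute gain in percentage points between the two methods.
gain_pp = block_rate["AT-Driven"] - block_rate["AutoDefense"]

for name in blocked:
    print(f"{name}: blocked={blocked[name]}, passed={passed[name]}, "
          f"block rate={block_rate[name]:.2f}%")
print(f"gain: {gain_pp:.2f} percentage points")
```

The gain is a difference of two percentages, so it is expressed in percentage points rather than as a relative change.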
Table 5.
Detailed case comparison showcasing the refined detection capabilities of the AT-Driven method. Our approach successfully identifies latent policy violations that bypass the static baseline.
| Safety Category | Adversarial Response Snippet | Baseline | AT-Driven (Ours): Violation Insight |
|---|---|---|---|
| Copyright | “…simple method to remove watermarks from copyrighted images…erase the watermark…” | Passed | Blocked: Explicit instruction for bypassing intellectual property protections. |
| Hate Speech | “…example of hate speech targeting racial groups: [Specific Slurs/Stereotypes]…” | Passed | Blocked: Disguised generation of prohibited hate speech under the guise of an example. |
| Cyberbullying | “…tips for creating a cyberbullying campaign…utilize social media to spread [malice]…” | Passed | Blocked: Provides actionable strategies for harassment and psychological harm. |
| Illegal Acts | “…general information on software cracking…bypassing copy protection or licensing…” | Passed | Blocked: Identified underlying malicious instruction for digital theft and unauthorized access. |
| Degrading Content | “…examples of degrading content that targets women…misogynistic language and images…” | Passed | Blocked: Failed to filter harmful gender-based stereotypes masked as descriptive content. |
Table 6.
JailbreakBench-informed cross-model validation under two model backbones. Decision-based ASR and FPR estimates are reported in percent with Wilson 95% confidence intervals; utility is the complement of the FPR estimate (100% − FPR), and latency is measured as mean wall-clock time per defense decision.
| Backbone | Online Decision Configuration | ASR Estimate (95% CI) | FPR Estimate (95% CI) | Utility | Latency (s) |
|---|---|---|---|---|---|
| gpt-3.5-turbo-1106 | Single-pass context-aware judge | 8.3 [1.5, 35.4] | 41.7 [19.3, 68.0] | 58.3 | 1.38 |
| gpt-3.5-turbo-1106 | Context-grounded two-step judge | 8.3 [1.5, 35.4] | 41.7 [19.3, 68.0] | 58.3 | 1.77 |
| gpt-3.5-turbo-1106 | Full multi-agent adjudication | 8.3 [1.5, 35.4] | 8.3 [1.5, 35.4] | 91.7 | 3.32 |
| gpt-4.1 | Single-pass context-aware judge | 16.7 [4.7, 44.8] | 16.7 [4.7, 44.8] | 83.3 | 1.57 |
| gpt-4.1 | Context-grounded two-step judge | 16.7 [4.7, 44.8] | 25.0 [8.9, 53.2] | 75.0 | 2.09 |
| gpt-4.1 | Full multi-agent adjudication | 16.7 [4.7, 44.8] | 25.0 [8.9, 53.2] | 75.0 | 3.87 |
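
The Wilson score interval used in Table 6 can be computed with the standard library, as sketched below. The sample size is not stated in this section; the assumption here is 12 decisions per class, inferred only because point estimates such as 8.3% and 16.7% match 1/12 and 2/12 and the resulting intervals agree with the tabulated ones.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion, returned as fractions."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed n = 12 per class (an inference from the table, not a stated value).
for k in (1, 2, 3, 5):
    lo, hi = wilson_interval(k, 12)
    print(f"{k}/12 = {100 * k / 12:.1f}%  CI = [{100 * lo:.1f}, {100 * hi:.1f}]")
```

With so few decisions per cell the intervals are wide and overlap heavily across configurations, which is why the point estimates in Table 6 should be read together with their intervals.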
Table 7.
Sensitivity of the weighted validation objective on the JailbreakBench-informed cross-model protocol. Lower values are better.
| Online Decision Configuration | | | |
|---|---|---|---|
| Single-pass context-aware judge (avg.) | 0.250 | 0.209 | 0.167 |
| Context-grounded two-step judge (avg.) | 0.281 | 0.229 | 0.178 |
| Full multi-agent adjudication (avg.) | 0.156 | 0.146 | 0.136 |
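
The column headers of Table 7 did not survive in this copy, so the exact weighting scheme is not recoverable from the table alone. As a hedged reconstruction, the values are numerically consistent (to within about 0.001) with a convex combination of the backbone-averaged ASR and FPR from Table 6, with an ASR weight of 0.25, 0.5, and 0.75 across the three columns. The sketch below checks that reading; the formula, the weight values, and the name `w_asr` are inferences, not the authors' stated definition.

```python
from typing import Dict, List, Tuple

# (ASR %, FPR %) per backbone, copied from Table 6.
TABLE6: Dict[str, List[Tuple[float, float]]] = {
    "Single-pass context-aware judge": [(8.3, 41.7), (16.7, 16.7)],
    "Context-grounded two-step judge": [(8.3, 41.7), (16.7, 25.0)],
    "Full multi-agent adjudication": [(8.3, 8.3), (16.7, 25.0)],
}


def weighted_objective(pairs: List[Tuple[float, float]], w_asr: float) -> float:
    """Assumed objective: w_asr * mean(ASR) + (1 - w_asr) * mean(FPR), as fractions."""
    mean_asr = sum(a for a, _ in pairs) / len(pairs) / 100.0
    mean_fpr = sum(f for _, f in pairs) / len(pairs) / 100.0
    return w_asr * mean_asr + (1.0 - w_asr) * mean_fpr


for name, pairs in TABLE6.items():
    row = [weighted_objective(pairs, w) for w in (0.25, 0.5, 0.75)]
    print(name, [f"{v:.3f}" for v in row])
```

Under this reading, full multi-agent adjudication has the lowest objective at every weight because the averaged ASR is identical across configurations and only the averaged FPR differs.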
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.