1. Introduction
Large Language Models (LLMs) have emerged as powerful tools, demonstrating impressive capabilities in translation [1], summarization [2], question answering, and creative writing [3]. Millions of users interact with LLMs to access helpful information [4]. However, malicious attackers can also exploit LLMs to extract sensitive or restricted information, a process known as “jailbreaking”. To address these security concerns, simulating attackers’ behaviors becomes essential, as it enables researchers to systematically uncover potential weaknesses in LLMs.
Initially, attackers relied on manually constructing jailbreak prompts [5], a process that is time-consuming and scales poorly. To overcome these challenges, researchers have developed automated jailbreak techniques, which can be broadly categorized into two approaches: optimization-based and obfuscation-based. Optimization-based methods utilize gradient-based optimization techniques [6,7] to craft adversarial prompt suffixes, forcing the LLM to produce compliant responses (e.g., “Sure, let’s...”). Meanwhile, obfuscation-based methods employ strategies such as word substitution [8] or character encoding [9] to obscure malicious keywords within the prompt. While effective, these methods exhibit identifiable prompt patterns and can be defended against by feature detection [10].
To address the above limitations, researchers have explored persuasion strategies drawn from everyday communication, which are less detectable. These methods employ persuasion techniques to jailbreak LLMs [11], with the fundamental approach of integrating plausible justifications for malicious actions [12,13]. Furthermore, recent studies have investigated multi-turn dialogues to conceal malicious intent [14,15,16,17]. In these attacks, adversaries design progressive questions, aiming to obtain harmful answers through a series of step-by-step interactions. Through a psychological attribution analysis of existing jailbreak methods, we find that these methods treat LLMs as passive persuasion targets. The core challenge in persuasion-based jailbreaking lies in constructing plausible justifications that convince the model to execute harmful actions. However, existing persuasion methods rely on externally imposed justifications, which may conflict with the internal reasoning patterns of LLMs. Our goal is to investigate how self-persuasion mechanisms, in which LLMs autonomously generate justifications for harmful actions, affect the effectiveness of jailbreaking.
Inspired by Greenwald’s Cognitive Response Theory [18], we propose Persu-Agent, a multi-turn dialogue jailbreaking framework. This cognitive theory posits that persuasion is determined not by the message itself, but by the recipient’s cognitive processing of it. Building on this theory, we hypothesize that engaging the target LLM in generating its own justifications can increase the likelihood of persuasion. Therefore, we employ self-persuasion as the central mechanism of our framework. Unlike direct persuasion, where justifications are provided explicitly, self-persuasion [19] uses open-ended questions to encourage the target model to generate justifications and thereby persuade itself. This process positions the LLM as both the persuader and the target of persuasion, increasing the likelihood of compliance. After eliciting justifications that support the target action, we progressively request the LLM to execute that action. To illustrate the motivation behind our approach, Figure 1 compares an example of our method with direct persuasion approaches.
To validate the proposed method, we conducted extensive experiments across multiple advanced LLMs. Our results demonstrate that Persu-Agent achieves an average attack success rate (ASR) of 84%, surpassing existing state-of-the-art (SOTA) methods, whose best ASR is 73%. Moreover, our method also effectively circumvents existing defense strategies. These findings highlight the effectiveness of our method and reveal new insights into LLM vulnerabilities.
In conclusion, our contributions are as follows:
Psychological Analysis of Jailbreak Methods: We analyze existing jailbreak methods from a psychological perspective and reveal that, despite employing diverse persuasion techniques, they rely on external persuasion strategies that treat LLMs as passive targets.
Novel Jailbreak Framework: We apply the self-persuasion technique to LLM jailbreaking and propose an automated framework, Persu-Agent, which persuades the LLM with its own generated justifications.
Effective Performance: Experiments demonstrate that our method effectively guides LLMs to answer harmful queries, outperforming existing SOTA approaches.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 presents the framework design; Section 4 details the experimental setup and results; Section 5 discusses the limitations of our approach; and Section 6 concludes the paper.
4. Experiment
Our experimental evaluation aims to comprehensively assess the effectiveness of the proposed Persu-Agent framework across multiple dimensions. Specifically, we conducted experiments to (1) validate the overall performance of our multi-turn jailbreaking approach compared to existing methods, (2) evaluate our method’s capability in circumventing various defense mechanisms, (3) examine the impact of different dialogue scenarios on jailbreaking success rates, and (4) analyze the individual contribution of each component with ablation studies. Through these systematic evaluations, we demonstrate the superiority of our approach and provide insights into the factors that influence jailbreaking effectiveness.
4.1. Experimental Setup
Target LLMs. We evaluated five advanced LLMs [50] in this experiment (specific versions in parentheses): Llama-3.1 (Llama3.1-8B-instruct), Qwen-2.5 (Qwen-2.5-7B-instruct), GLM-4 (ChatGLM-4-9B), GPT-4o (GPT-4o-mini), and Claude-3 (Claude-3-haiku-20240307). These models include both commercial and open-source implementations, all widely adopted and broadly representative. In the experiments, we set the temperature of all models to 0.7 and max_tokens to 512. A temperature of 0.7 represents an empirically validated balance between creativity and factual accuracy [51], and the max_tokens limit of 512 is based on empirical observations to ensure complete responses to most queries [52].
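For reproducibility, the shared decoding configuration can be expressed as a thin client wrapper. The sketch below is illustrative only: it assumes an OpenAI-compatible chat-completion endpoint (open-source models would need to expose one via a local serving layer), and the client setup and function name are our own placeholders rather than our exact evaluation harness.

```python
# Minimal sketch of the shared decoding configuration (illustrative, not the exact harness).
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; set base_url for a local server

GENERATION_CONFIG = {
    "temperature": 0.7,  # empirically validated balance between creativity and accuracy [51]
    "max_tokens": 512,   # long enough for complete responses to most queries [52]
}

def query_target(model: str, messages: list[dict]) -> str:
    """Send a (possibly multi-turn) dialogue to a target model with the shared config."""
    response = client.chat.completions.create(model=model, messages=messages, **GENERATION_CONFIG)
    return response.choices[0].message.content
```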
Datasets. Our test dataset is JailbreakBench [53], which consists of 100 harmful queries. These queries are categorized into ten types: harassment/discrimination, malware/hacking, physical harm, economic harm, fraud/deception, disinformation, sexual/adult content, privacy, expert advice, and government decision-making, curated in accordance with OpenAI’s usage policies [54]. In particular, no test questions overlap with the training set.
Baselines. We compared Persu-Agent with SOTA persuasion-based attack methods, including PAIR, PAP, Chain of Attack, ActorAttack, and Red Queen. To ensure a fair comparison, we conducted six test attempts for each baseline and for our proposed method: since our approach uses six different dialogue scenarios, we performed the same number of attempts for all other methods. For multi-turn jailbreak methods, we limit the maximum number of dialogue turns to six. These methods are introduced as follows.
Prompt Automatic Iterative Refinement (PAIR) [28]. PAIR refines jailbreak prompts based on feedback from LLMs, continuously improving them with persuasive techniques to achieve a successful jailbreak.
Persuasive Adversarial Prompts (PAP) [11]. PAP leverages human persuasion techniques, packaging harmful queries as seemingly harmless requests to circumvent LLM security restrictions.
Chain of Attack (CoA) [17]. CoA uses seed attack chains as examples to generate multi-turn jailbreaking prompts.
ActorAttack [16]. ActorAttack draws on network relation theory and begins by asking about elements related to the target, obtaining progressively more harmful answers.
Red Queen [15]. Red Queen manually constructs 40 types of multi-turn dialogue contexts and integrates harmful queries into them. We selected the six most effective scenarios.
Metrics. We assess the efficacy of jailbreak attack methods using two primary metrics. (1) Attack Success Rate (ASR): The content evaluation modules determine the criteria for a successful jailbreak. A query is considered successfully jailbroken only when both HarmBench models return “Yes” (indicating the model answered the malicious question) and JailJudge assigns a score above 2 (following the judgment criteria of the model’s demo [55]). For a given query, multiple jailbreak dialogues are conducted; if any of them succeeds, the query is counted as successfully jailbroken. (2) Harmfulness Score (HS): We also evaluate the harmfulness of model responses using JailJudge scores, which range from 1 (complete refusal) to 10 (complete compliance). This metric offers a more nuanced assessment of the severity of harmful content generated by different jailbreak methods, complementing the binary ASR metric.
4.2. Overall Evaluation
From Table 3, we observe that Persu-Agent attains an average ASR of 84%, exceeding all baseline methods and confirming that the self-persuasion strategy is more effective at eliciting harmful outputs. Table 4 shows that Persu-Agent reaches the highest harmfulness score of 6.02, indicating that its successful jailbreaks also yield content that is more explicitly harmful than that produced by baseline approaches. Statistical analysis further confirms the improvement over the strongest baseline, Red Queen (84% vs. 73%), demonstrating that the observed performance gains are statistically reliable.
Meanwhile, Red Queen yields high ASR and harmfulness scores on most models; however, it collapses on Claude-3: the assistant refuses to give harmful instructions and instead returns safety advice, so no specific malicious content is produced and its harmfulness score drops to 1.00. Chain of Attack and ActorAttack perform worse, with average ASRs of only 56% and 52%, respectively. Their multi-turn strategies fail to keep the dialogue focused on the harmful request; conversations drift to benign topics, and the model neutralizes the attack. In contrast, our approach achieves consistently high ASR and harmfulness scores across different models, demonstrating strong generalization capabilities.
4.3. Defense Evaluation
We compared the performance of Persu-Agent and Red Queen in breaching widely adopted jailbreak defense strategies [53]. The results, shown in Table 5, were obtained on Llama-3.1. The specific mechanisms of each defense method are as follows:
Self-reminder [44]: This approach encapsulates user input with prompt templates that remind LLMs not to generate malicious content. The system adds explicit instructions before processing the query to reinforce safety guidelines (the prompt is detailed in Appendix A.3).
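For illustration, a self-reminder wrapper has roughly the shape shown below. The reminder wording here is a placeholder; the exact prompt used in our experiments is the one given in Appendix A.3.

```python
# Illustrative self-reminder wrapper; the reminder wording is a placeholder,
# the actual prompt is given in Appendix A.3.
SELF_REMINDER_TEMPLATE = (
    "You should be a responsible assistant and must not generate harmful or misleading content.\n\n"
    "{user_query}\n\n"
    "Remember: respond responsibly and refuse unsafe requests."
)

def apply_self_reminder(user_query: str) -> str:
    """Encapsulate the user input with safety instructions before it reaches the model."""
    return SELF_REMINDER_TEMPLATE.format(user_query=user_query)
```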
Llama Guard 3 [46]: This defense mechanism analyzes the entire jailbreak dialogue and classifies it as either “safe” or one of several detailed unsafe categories. If the classification result is “safe”, the dialogue has successfully bypassed the security defense.
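The classification step can be reproduced locally with the open Llama Guard 3 weights. The sketch below follows the usage pattern published on the model card and is an assumption about tooling rather than a description of our exact pipeline; the model identifier and generation settings are illustrative.

```python
# Illustrative dialogue moderation with Llama Guard 3 (assumed Hugging Face usage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # assumed model identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def dialogue_bypasses_guard(dialogue: list[dict]) -> bool:
    """Return True if Llama Guard 3 labels the full dialogue as "safe",
    i.e., the jailbreak dialogue evades this defense."""
    input_ids = tokenizer.apply_chat_template(dialogue, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")
```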
SmoothLLM [56]: This defense selects a certain percentage of characters in the input prompt and randomly replaces them with other printable characters. In this experiment, we randomly modified 10% of the characters in all attack prompts, which is the optimal setting reported in the corresponding paper.
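For concreteness, a minimal version of this random character replacement is sketched below. The 10% rate matches the setting above; the helper name and the use of Python's string.printable are our own illustrative choices, and SmoothLLM's response-aggregation step over multiple perturbed copies is omitted.

```python
import random
import string

def smoothllm_perturb(prompt: str, rate: float = 0.10) -> str:
    """Randomly replace a fraction of the prompt's characters with other printable characters."""
    if not prompt:
        return prompt
    chars = list(prompt)
    num_to_swap = max(1, int(len(chars) * rate))
    for idx in random.sample(range(len(chars)), num_to_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)
```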
The results demonstrate Persu-Agent’s superior performance against various defense mechanisms compared to Red Queen. (1) Self-reminder: Persu-Agent significantly outperforms Red Queen in penetrating this defense; Red Queen’s ASR plummeted from 85% to 5%, while Persu-Agent’s ASR decreased by only 9%. (2) Llama Guard 3 classification: We utilized Llama Guard 3 to classify the jailbroken dialogues, and 76% of Persu-Agent’s dialogues still constituted successful attacks after this inspection. (3) SmoothLLM: This method randomly modifies a certain percentage of characters within attack prompts; it reduced success rates by only 17% for Persu-Agent, compared with 27% for Red Queen. Overall, these results demonstrate that our method can bypass multiple defense mechanisms, highlighting the limitations of existing defense strategies.
4.4. Category and Scenario Evaluation
Figure 3 shows the number of successful jailbreak dialogues across different harmful query categories and dialogue scenarios. The category of each query is taken from the JailbreakBench dataset labels. Among the dialogue scenarios, the ASR is highest for fiction writing. This can be attributed to the imaginative and unconstrained nature of fictional contexts, which offer more flexibility and ambiguity. Such settings are particularly susceptible to jailbreaking methods, as they allow models to bypass ethical constraints under the guise of creativity or narrative role-play.
Jailbreaking attacks struggle with harassment, adult content, and physical harm, likely because their very language betrays any covert intent. The vocabulary in these areas is highly salient (profanity and slurs, explicit sexual terminology, or precise technical jargon), so even minimal use of banned terms immediately triggers keyword filters. In contrast, queries involving fraud, privacy, government decisions, and malware often contain technical language or ambiguous phrasing, which makes it easier to frame them as legitimate, educational, or hypothetical. This professional complexity provides plausible deniability, allowing jailbreaking attempts to be masked as harmless inquiries (e.g., for research, testing, or awareness). These findings reveal disparities in LLMs’ defensive effectiveness across query categories and dialogue scenarios, highlighting the need to defend against sophisticated harmful queries.
Meanwhile, we employed Azure AI’s Content Safety API [57] to evaluate the maliciousness scores of the dialogues. This API assigns scores ranging from 0 to 7 across four harm dimensions for each dialogue. As shown in Table 6, our analysis reveals that the Sex/Adult, Physical Harm, and Harassment categories consistently exhibit the highest total maliciousness scores. These results correspond to their lower jailbreak success rates, suggesting that such highly malicious content types are more effectively detected and defended against by safety mechanisms. Notably, the expert advice category exhibits the highest SelfHarm score (0.163), which helps explain why the target models successfully defend against most attacks in this category.
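A per-dialogue scoring pass could be wired up roughly as follows. This is a sketch assuming the azure-ai-contentsafety Python SDK; the endpoint and key variables and the request for eight-level severities are assumptions, and exact response fields may differ across SDK versions.

```python
# Sketch of per-dialogue harm scoring with Azure AI Content Safety (assumed SDK usage).
import os
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],  # assumed environment variables
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

def score_dialogue(dialogue_text: str) -> dict[str, int]:
    """Return the 0-7 severity for each of the four harm dimensions of one dialogue."""
    result = client.analyze_text(
        AnalyzeTextOptions(text=dialogue_text, output_type="EightSeverityLevels")
    )
    return {item.category: item.severity for item in result.categories_analysis}
```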
4.5. Ablation Study
In the ablation experiment, we evaluated the impact of different dialogue script designs on the full dataset. The detailed experimental design is as follows:
Without Initiation Stage: Removes the self-persuasion guiding process from the first stage, eliminating the initial justification step that establishes self-persuasion.
Without Foundation Stage: Eliminates the Foundation Stage from the dialogue framework, which removes the intermediate questioning process designed to encourage the target LLM to engage in thinking related to the target action.
Without Multi-stage Design: Removes the multi-turn dialogue structure entirely, replacing the staged approach with a single-turn direct request.
As presented in Table 7, the ablation results show a consistent decrease in ASR when specific script stages are removed. In contrast, the complete design achieves the highest ASR, underscoring the importance of combining these stages in sequence to maximize jailbreak effectiveness.
4.6. Failure Analysis
To understand the limitations of our approach, we analyzed the distribution of jailbreak failures across different harmful query categories and target models, as illustrated in Figure 4. In our framework, a jailbreak attempt is classified as a failure when the attack does not achieve a successful jailbreak within the maximum number of dialogue turns.
The results demonstrate significant variations in both model resistance and query-type susceptibility. Claude-3 exhibits substantially higher failure rates (48–60 per category), showing stronger safety alignment, while GLM-4 and Qwen-2.5 show lower resistance (3–33 failures), making them more vulnerable to self-persuasion techniques. Across all models, certain harmful categories—particularly physical harm, sexual content, and harassment—consistently trigger stronger safety mechanisms. This pattern suggests that current safety training prioritizes preventing direct, immediate harm over abstract threats, highlighting how both intrinsic model safety and query characteristics fundamentally shape jailbreak success rates.
5. Limitation and Discussion
Our analysis concentrates on semantically coherent jailbreak prompts, leaving nonsensical or unintelligible prompts beyond the present scope. Because most users interact with LLMs through natural language, we believe that prioritizing semantically meaningful jailbreaks yields the highest practical value.
The dialogue scenarios we study do not cover every conceivable jailbreak pattern. Instead, they represent the most common and recurring patterns observed in practice, thereby providing a robust foundation for analysis. Crucially, the proposed framework remains flexible and can be extended to incorporate newly discovered scenarios.
Disclosing the technical details of our methods inevitably raises ethical considerations. We contend that transparent reporting advances defensive research, enabling the community to design more effective safeguards. Accordingly, we adhere to responsible-disclosure principles and strongly encourage the community to use our findings solely for constructive purposes.
Future defenses could benefit from several avenues: (1) developing self-persuasion-aware detection systems capable of identifying the gradual persuasion patterns characteristic of our attack method, thus addressing multi-turn manipulation that single-turn defenses might overlook; (2) exploring counter-persuasion mechanisms that allow systems to recognize and resist incremental drift toward harmful outputs; and (3) incorporating self-persuasion attack patterns into adversarial training pipelines to enhance model robustness against this class of attacks.