Abstract
Large Language Models (LLMs) can adopt various roles through prompt guidance, a capability enabled by pretraining on diverse corpora and instruction-following alignment. Safety alignment mechanisms attempt to instill a helpful, honest, and harmless assistant persona to guide model behavior. However, specific role settings can bypass these safeguards, inducing LLMs to respond to harmful queries. In this study, we identify the role settings that lead LLMs to generate such harmful responses, contributing to the development of more reliable LLMs. We design an automated jailbreak framework, RoleBreaker, which optimizes role-play prompts through representation analysis and adaptive search. Experiments on 7 open-source LLMs show that RoleBreaker achieves an average jailbreak success rate of 87.3% using an average of 4.0 attempts, outperforming state-of-the-art methods. Furthermore, by summarizing the jailbreak experience gained on these models and applying it to closed-source commercial models (GPT-4.1, GLM-4, Gemini-2.0), we achieve an average jailbreak success rate of 84.3% using an average of 4.3 attempts. These results reveal vulnerabilities in current alignment mechanisms and demonstrate the transferability of our approach.