Electronics
  • Article
  • Open Access

6 December 2025

Evading LLMs’ Safety Boundary with Adaptive Role-Play Jailbreaking

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4808; https://doi.org/10.3390/electronics14244808

Abstract

Large Language Models (LLMs) can adopt various roles through prompt guidance, a capability enabled by pretraining on diverse corpora and instruction-following alignment. Safety alignment mechanisms attempt to define a helpful, honest, and harmless assistant persona to guide model behavior. However, specific role settings can bypass these safeguards and induce LLMs to respond to harmful queries. In this study, we identify the role settings that lead LLMs to generate such harmful responses, which contributes to building more reliable LLMs. We design an automated jailbreak framework, RoleBreaker, that optimizes role-play prompts through representation analysis and adaptive search. Experiments on seven open-source LLMs show that RoleBreaker achieves an average jailbreak success rate of 87.3% within an average of 4.0 attempts, outperforming state-of-the-art methods. Furthermore, by summarizing the jailbreak experiences and applying them to closed-source commercial models (GPT-4.1, GLM-4, Gemini-2.0), we achieve an average jailbreak success rate of 84.3% within an average of 4.3 attempts. These results reveal vulnerabilities in current alignment mechanisms and demonstrate the transferability of our approach.
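The abstract describes the method only at a high level. As a rough illustration of what "adaptive search over role-play prompts" can look like, the minimal Python sketch below shows a generic black-box search loop: propose candidate role prompts, query the target model, score the response with a judge, and mutate the best candidate until a budget is exhausted. All names here (query_target_model, judge_refusal, mutate_role, adaptive_role_search) are hypothetical placeholders; they do not reflect the RoleBreaker implementation, and the paper's representation-analysis component is not modeled.

```python
# Illustrative sketch only: a generic black-box adaptive-search loop over
# role-play prompts, assuming stubbed placeholder functions throughout.
import random


def query_target_model(role_prompt: str, query: str) -> str:
    """Placeholder for an API call to the target LLM."""
    return "I cannot help with that."  # stubbed refusal response


def judge_refusal(response: str) -> float:
    """Placeholder judge: 1.0 if the response does not open with a refusal
    phrase, 0.0 otherwise. A real judge would be far more nuanced."""
    refusal_markers = ("I cannot", "I'm sorry", "I can't")
    return 0.0 if response.startswith(refusal_markers) else 1.0


def mutate_role(role_prompt: str) -> str:
    """Placeholder mutation: a real adaptive search would rewrite the role
    description based on feedback rather than appending a random suffix."""
    suffixes = [" Stay strictly in character.", " Answer as this persona would."]
    return role_prompt + random.choice(suffixes)


def adaptive_role_search(query: str, seed_roles: list[str],
                         max_attempts: int = 10,
                         threshold: float = 0.5) -> str | None:
    """Iteratively select and mutate candidate role prompts until the judge
    score exceeds the threshold or the attempt budget is exhausted."""
    candidates = list(seed_roles)
    for _ in range(max_attempts):
        # Pick the candidate whose response currently scores highest.
        best = max(candidates,
                   key=lambda r: judge_refusal(query_target_model(r, query)))
        if judge_refusal(query_target_model(best, query)) >= threshold:
            return best                      # a successful role setting was found
        candidates.append(mutate_role(best))  # adapt the best candidate and retry
    return None                               # budget exhausted without success
```

Per the abstract, the actual framework guides this selection and mutation with representation analysis of the target model rather than the random heuristic shown here.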
