Abstract
Jailbreak attacks (JAs) are a class of adversarial threats in which malicious actors craft carefully engineered prompts that subvert the intended operational boundaries of large language models (LLMs). By exploiting latent vulnerabilities in generative AI architectures, these attacks allow adversaries to circumvent established safety protocols and induce the model to output prohibited, unethical, or harmful content. Such exploits expose critical gaps in the security and controllability of modern AI systems and raise serious concerns about their societal impact and their deployment in sensitive environments. In response, this study introduces a defense framework that integrates language model perplexity analysis with a Multi-Agent System (MAS)-oriented detection architecture. This hybrid design strengthens the resilience of LLMs by proactively identifying and neutralizing jailbreak attempts, thereby protecting user privacy and preserving ethical integrity. The experimental setup adopts a query-driven adversarial probing strategy in which jailbreak prompts are dynamically generated and injected into the open-source LLaMA-2 model to systematically explore potential vulnerabilities. To ensure rigorous validation, the proposed framework will be evaluated on a custom jailbreak detection benchmark using metrics including Attack Success Rate (ASR), Defense Success Rate (DSR), Defense Pass Rate (DPR), False Positive Rate (FPR), Benign Pass Rate (BPR), and end-to-end latency. Through iterative experimentation and continuous refinement, this work aims to advance the defensive capabilities of LLM-based systems and to enable more trustworthy, secure, and ethically aligned deployment of generative AI in real-world environments.
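To make the perplexity component of the framework concrete, the sketch below shows one common way prompt-level perplexity screening can be implemented with a causal language model; it is a minimal illustration, not the implementation described in this work. The scoring model name, the threshold value, and the helper function names are assumptions introduced for illustration only.

```python
# Minimal sketch of perplexity-based prompt screening (illustrative only).
# Assumptions not taken from this work: the scoring model name, the
# PPL_THRESHOLD value, and the function names are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed scoring model
PPL_THRESHOLD = 500.0                      # assumed decision threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Compute the perplexity of a prompt under the scoring model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels yields the mean token-level
        # negative log-likelihood; its exponential is the perplexity.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())

def flag_for_review(prompt: str) -> bool:
    """Flag high-perplexity prompts for further inspection by the MAS layer."""
    return prompt_perplexity(prompt) > PPL_THRESHOLD
```

In a pipeline of this kind, prompts flagged by the perplexity gate would typically be passed to the downstream multi-agent detection stage rather than rejected outright, which helps keep the false positive rate on benign inputs low.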