Applied Sciences
  • This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

16 December 2025

A Hybrid Perplexity-MAS Framework for Proactive Jailbreak Attack Detection in Large Language Models

1 Department of Information Management, Kun Shan University, Tainan 710303, Taiwan
2 Department of Information Management, National Chin-Yi University of Technology, Taichung 411030, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Section Computing and Artificial Intelligence

Abstract

Jailbreak attacks (JAs) are a sophisticated subclass of adversarial threats in which malicious actors craft strategically engineered prompts that subvert the intended operational boundaries of large language models (LLMs). These attacks exploit latent vulnerabilities in generative AI architectures, allowing adversaries to circumvent established safety protocols and induce the model to output prohibited, unethical, or harmful content. The emergence of such exploits exposes critical gaps in the security and controllability of modern AI systems and raises serious concerns about their societal impact and their deployment in sensitive environments. In response, this study introduces a defense framework that integrates language-model perplexity analysis with a Multi-Agent System (MAS)-oriented detection architecture. This hybrid design strengthens the resilience of LLMs by proactively identifying and neutralizing jailbreak attempts, thereby protecting user privacy and ethical integrity. The experimental setup adopts a query-driven adversarial probing strategy in which jailbreak prompts are dynamically generated and injected into the open-source LLaMA-2 model to systematically explore potential vulnerabilities. For rigorous validation, the proposed framework is evaluated on a custom jailbreak detection benchmark covering Attack Success Rate (ASR), Defense Success Rate (DSR), Defense Pass Rate (DPR), False Positive Rate (FPR), Benign Pass Rate (BPR), and end-to-end latency. Through iterative experimentation and continuous refinement, this work advances the defensive capabilities of LLM-based systems, enabling more trustworthy, secure, and ethically aligned deployment of generative AI in real-world settings.
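To make the perplexity-screening idea concrete, the sketch below shows one common way such a pre-filter can be built. It is an illustration only, not the paper's implementation: the abstract does not specify the scoring model, threshold, or code, so the choice of GPT-2 via Hugging Face transformers, the cutoff value of 1000.0, and the function names `prompt_perplexity` and `flag_prompt` are all assumptions.

```python
# Minimal sketch of a perplexity-based jailbreak pre-filter (assumed
# design; the paper's own scoring model and threshold are not given).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy over the predicted tokens as `loss`.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag a prompt as a possible jailbreak when its perplexity
    exceeds a calibrated threshold: optimized adversarial suffixes
    tend to look like unnatural token soup and score far above
    fluent natural-language text."""
    return prompt_perplexity(prompt) > threshold

if __name__ == "__main__":
    print(flag_prompt("What is the capital of France?"))  # likely False
```

In a hybrid pipeline of the kind the abstract describes, prompts that pass this cheap statistical gate would still be forwarded to the MAS layer for intent-level inspection before reaching the protected LLaMA-2 model; the listed metrics then quantify the resulting trade-off, e.g., ASR as the fraction of attack prompts that elicit prohibited output and FPR as the fraction of benign prompts wrongly blocked.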
