Abstract
Jailbreak attacks (JAs) are a class of adversarial threats in which malicious actors craft carefully engineered prompts that subvert the intended operational boundaries of large language models (LLMs). By exploiting latent vulnerabilities in generative AI architectures, these attacks allow adversaries to circumvent established safety protocols and induce the model to output prohibited, unethical, or harmful content. Such exploits expose critical gaps in the security and controllability of modern AI systems and raise serious concerns about their societal impact and their deployment in sensitive environments. In response, this study introduces a defense framework that integrates language model perplexity analysis with a Multi-Agent System (MAS)-oriented detection architecture. This hybrid design strengthens the resilience of LLMs by proactively identifying and neutralizing jailbreak attempts, thereby protecting user privacy and preserving ethical integrity. The experimental setup adopts a query-driven adversarial probing strategy in which jailbreak prompts are dynamically generated and injected into the open-source LLaMA-2 model to systematically explore potential vulnerabilities. To ensure rigorous validation, the proposed framework will be evaluated on a custom jailbreak detection benchmark using metrics including Attack Success Rate (ASR), Defense Success Rate (DSR), Defense Pass Rate (DPR), False Positive Rate (FPR), Benign Pass Rate (BPR), and end-to-end latency. Through iterative experimentation and continuous refinement, this work aims to advance the defensive capabilities of LLM-based systems and to enable more trustworthy, secure, and ethically aligned deployment of generative AI in real-world environments.
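To make the perplexity component of the framework concrete, the sketch below shows one common way prompt-level perplexity screening can be implemented with a causal language model; it is a minimal illustration, not the implementation described in this work. The scoring model name, the threshold value, and the helper function names are assumptions introduced for illustration only.

```python
# Minimal sketch of perplexity-based prompt screening (illustrative only).
# Assumptions not taken from this work: the scoring model name, the
# PPL_THRESHOLD value, and the function names are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed scoring model
PPL_THRESHOLD = 500.0                      # assumed decision threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Compute the perplexity of a prompt under the scoring model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels yields the mean token-level
        # negative log-likelihood; its exponential is the perplexity.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())

def flag_for_review(prompt: str) -> bool:
    """Flag high-perplexity prompts for further inspection by the MAS layer."""
    return prompt_perplexity(prompt) > PPL_THRESHOLD
```

In a pipeline of this kind, prompts flagged by the perplexity gate would typically be passed to the downstream multi-agent detection stage rather than rejected outright, which helps keep the false positive rate on benign inputs low.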