Abstract
The pervasive deployment of large language models (LLMs) has raised mounting concerns about the safety and security of the content these models generate. However, the absence of comprehensive evaluation methods remains a substantial obstacle to assessing and improving LLM safety and security. In this paper, we develop the Safety and Security (S&S) Benchmark, which integrates multi-source data to ensure comprehensive evaluation. The benchmark comprises 44,872 questions covering ten major risk categories and 76 fine-grained risk points, including high-risk dimensions such as malicious content generation and jailbreak attacks. We further introduce an automated evaluation framework based on multi-model judgment. Experimental results demonstrate that this mechanism significantly improves both accuracy and reliability: compared with single-model judgment (GPT-4o, 0.973 accuracy), the proposed multi-model framework achieves 0.986 accuracy at a similar evaluation time (approximately 1 h) and shows strong consistency with expert annotations. Adversarial robustness experiments further show that our synthesized attack data substantially increases the attack success rate across multiple LLMs, e.g., from 14.76% to 27.60% on GPT-4o and from 18.24% to 30.35% on Qwen2.5-7B-Instruct, indicating improved sensitivity to security risks. The proposed unified scoring system enables comprehensive model comparison: summarized ranking results show that GPT-4o achieves consistently high scores across the ten safety and security dimensions (e.g., 96.26 in ELR and 97.63 in PSI), while competitive open-source models such as Qwen2.5-72B-Instruct and DeepSeek-V3 also perform strongly (e.g., 96.70 and 97.63 in PSI, respectively). Although all models demonstrate strong alignment on the safety dimension, they exhibit pronounced weaknesses in security, particularly against jailbreak and adversarial attacks, highlighting critical vulnerabilities and providing actionable directions for future model hardening. This work provides a comprehensive, scalable solution and high-quality data support for the automated evaluation of LLMs.