A Safety and Security-Centered Evaluation Framework for Large Language Models via Multi-Model Judgment

Jinxin Zhang; Yunhao Xia; Hong Zhong; Weichen Lu; Qingwei Deng; Changsheng Wan

doi:10.3390/math14010090

,

and

¹

School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China

²

ZTE Corporation, Shenzhen 518057, China

³

The State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China

^*

Author to whom correspondence should be addressed.

Mathematics2026, 14(1), 90;https://doi.org/10.3390/math14010090
(registering DOI)

This article belongs to the Special Issue Mathematical Foundations and Optimization Techniques for Large Language Models

Version Notes

Order Reprints

Abstract

The pervasive deployment of large language models (LLMs) has given rise to mounting concerns regarding the safety and security of the content generated by these models. Nevertheless, the absence of comprehensive evaluation methods constitutes a substantial obstacle to the effective assessment and enhancement of the safety and security of LLMs. In this paper, we develop the Safety and Security (S&S) Benchmark, integrating multi-source data to ensure comprehensive evaluation. The benchmark comprises 44,872 questions covering ten major risk categories and 76 fine-grained risk points, including high-risk dimensions such as malicious content generation and jailbreak attacks. In addition, this paper introduces an automated evaluation framework based on multi-model judgment. Experimental results demonstrate that this mechanism significantly improves both accuracy and reliability: compared with single-model judgment (GPT-4o, 0.973 accuracy), the proposed multi-model framework achieves 0.986 accuracy while maintaining a similar evaluation time (~1 h) and exhibits strong consistency with expert annotations. Furthermore, adversarial robustness experiments show that our synthesized attack data effectively increases the attack success rate across multiple LLMs, such as from 14.76% to 27.60% on GPT-4o and from 18.24% to 30.35% on Qwen-2.5-7B-Instruct, indicating improved sensitivity to security risks. The proposed unified scoring metric system enables comprehensive model comparison; summarized ranking results show that GPT-4o achieves consistently high scores across ten safety and security dimensions (e.g., 96.26 in ELR, 97.63 in PSI), while competitive open-source models such as Qwen2.5-72B-Instruct and DeepSeek-V3 also achieve strong performance (e.g., 96.70 and 97.63 in PSI, respectively). Although all models demonstrate strong alignment in the safety dimension, they exhibit pronounced weaknesses in security—particularly against jailbreak and adversarial attacks—highlighting critical vulnerabilities and providing actionable direction for future model hardening. This work provides a comprehensive, scalable solution and high-quality data support for automated evaluation of LLMs.

Keywords:

large language models; safety and security evaluation; benchmark; automated evaluation; LLM-judge

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.