SafeCodeRL: Security-Constrained Multi-Agent Reinforcement Learning for Trustworthy LLM-Generated IoT/CPS Software
Abstract
1. Introduction
- We formulate secure LLM code generation as a trustworthy IoT/CPS software problem. SafeCodeRL targets sensor telemetry APIs, edge-gateway utilities, configuration handlers, and secure data-aggregation routines where generated vulnerabilities can directly affect sensing data integrity and CPS reliability.
- We propose a multi-agent closed-loop framework that unifies functional testing and security auditing. The Planner, Code, Security, Test, and Critic agents exchange structured feedback so that generated software is iteratively optimized at both the logical and vulnerability levels.
- We develop a constraint-aware reinforcement learning algorithm for secure code generation. The Lagrangian mechanism dynamically balances functionality and security penalties, while action shielding blocks catastrophic code patterns before sandbox execution.
- We validate SafeCodeRL on established code/security benchmarks and a scenario-level IoT/CPS software suite. On HumanEval, SWE-bench, SecurityEval, and CyberSecEval, SafeCodeRL improves functional metrics over the same base model used alone while reducing vulnerability rates (with GPT-4o, HumanEval Pass@1 rises from 85.1% to 91.4% and the SecurityEval vulnerability rate falls from 29.8% to 5.5%); on the 25-task IoT/CPS case study, it improves Secure Pass@1 for telemetry APIs, gateway diagnostics, configuration handling, secure aggregation, and sensor parsing tasks.
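The Lagrangian balancing named in the third contribution can be illustrated with a generic dual-ascent update. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the function names `shaped_reward` and `update_multiplier`, the cost budget, and the learning rate are all hypothetical choices made for illustration.

```python
def shaped_reward(func_reward: float, sec_cost: float, lam: float) -> float:
    """Lagrangian-relaxed reward: functional reward minus the weighted security cost."""
    return func_reward - lam * sec_cost


def update_multiplier(lam: float, avg_sec_cost: float, budget: float, lr: float = 0.05) -> float:
    """Dual ascent step (assumed form): raise lam when the average security cost
    exceeds the budget, lower it otherwise, and project back onto lam >= 0."""
    return max(0.0, lam + lr * (avg_sec_cost - budget))


# Illustrative trajectory: the multiplier grows while the cost is above the
# budget and shrinks once the policy becomes safer than required.
lam = 0.0
history = []
for avg_cost in [0.40, 0.30, 0.20, 0.05, 0.02]:
    lam = update_multiplier(lam, avg_cost, budget=0.10)
    history.append(lam)
```

The projection onto non-negative values keeps the security term a penalty: the multiplier can fall to zero when the constraint is satisfied, but it never turns vulnerabilities into a bonus.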
2. Related Work
2.1. LLM-Generated Code and Software Vulnerabilities
2.2. IoT/CPS Security and Trustworthy Sensing Software
2.3. Secure Alignment and Reinforcement Learning for Code
2.4. Multi-Agent Software Engineering
3. Methods
3.1. IoT/CPS-Oriented System Formulation
3.2. Five-Agent Architecture
3.2.1. Planner Agent
- [{"task_id": i, "desc": "...", "in_out_types": "..."}].
3.2.2. Code Agent
3.2.3. Security Agent
3.2.4. Test Agent
3.2.5. Critic Agent
3.3. Reinforcement Learning Modeling
Algorithm 1. SafeCodeRL: Reinforcement Learning with Functional and Security-Aware Rewards. [Algorithm listing: image not available]
3.4. Security Constraint Modeling
3.5. Synergy and Training Pipeline
4. Experiments
4.1. Datasets, Evaluation Metrics, and Base Models
4.2. Implementation Details
4.3. Main Results
Evaluation of Base Models Under SafeCodeRL
4.4. IoT/CPS Software Security Case Study
4.5. Efficiency and TTS Scalability
4.6. Ablation Study
4.7. Sensitivity Analysis
4.7.1. Critic Base Model Capability Sensitivity
4.7.2. Shielding Trigger Distribution
5. Ethics Statement
6. Discussion
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Representative IoT/CPS Case-Study Tasks
| Task Category | Representative Requirement | Complexity Descriptor | Dominant Security Oracle | Validation Signal |
|---|---|---|---|---|
| Sensor telemetry API | Store/query mocked sensor records. | 2 interfaces; 4–6 fields; 3–5 boundary tests. | CWE-89 SQL injection. | Correct records; reject tainted telemetry. |
| Edge-gateway diagnostics | Parse diagnostic parameters. | 1 wrapper; 3–4 parameters; command-like cases. | CWE-78 command injection. | Bounded output; no shell expansion. |
| Configuration or firmware path handler | Normalize virtual file paths. | 1 path API; nested paths; traversal payloads. | CWE-22 path traversal. | Access stays inside virtual device tree. |
| Secure aggregation and authentication | Aggregate readings after credential checks. | Aggregation logic; token check; 4–8 fields. | Hard-coded credential or insecure auth. | Aggregate correctly; reject invalid credentials. |
| Networked sensor parser | Parse malformed sensor packets. | Parser state logic; malformed packets; DoS boundaries. | Unsafe deserialization or DoS. | Bounded errors; no unsafe deserialization. |
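The path-handler row above (CWE-22, "access stays inside virtual device tree") can be made concrete with a small lexical normalizer. This is a sketch under assumptions rather than a task from the case-study suite: `DEVICE_ROOT` and `resolve_in_device_tree` are hypothetical names, and the check is purely lexical, which is adequate only for a mocked tree without symbolic links.

```python
from pathlib import PurePosixPath

DEVICE_ROOT = PurePosixPath("/device")  # hypothetical root of the virtual device tree


def resolve_in_device_tree(user_path: str) -> PurePosixPath:
    """Normalize user_path lexically and reject any path that climbs above DEVICE_ROOT."""
    if "\x00" in user_path:
        raise ValueError("NUL byte in path")
    parts = []
    for part in user_path.split("/"):
        if part in ("", "."):
            continue  # empty segments (leading or doubled slashes) and '.' are no-ops
        if part == "..":
            if not parts:
                raise ValueError(f"path escapes device tree: {user_path!r}")
            parts.pop()
        else:
            parts.append(part)
    # Every request, including an absolute one, is re-rooted under DEVICE_ROOT.
    return DEVICE_ROOT.joinpath(*parts)
```

A traversal payload such as `a/../../b` raises `ValueError`, while an absolute request such as `/etc/passwd` is re-rooted under the device tree instead of reaching the host filesystem.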
Appendix B. Prompt Design for Agents
| Prompt Component | Content |
|---|---|
| System Prompt | You are an elite Software Architect. Your primary task is to break down complex user requirements into a Directed Acyclic Graph (DAG) of atomic, executable coding subtasks. |
| | You must strictly output your plan in JSON format like this: |
| | [{"task_id": "T1", "dependency": [], "description": "...", "in_out_types": "...", ...}] |
| | Do not output any markdown code blocks, explanations, or conversational text outside the JSON array. |
| User Prompt Template | User Requirement: {user_instruction} |
| | Action: Decompose this requirement into logical, sequential steps. Ensure variables and functional states naturally flow between dependencies. Output the JSON array now. |
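The planner prompt asks for a JSON array of subtasks whose `dependency` fields form a DAG. A consumer of that plan could check the structure and derive an execution order as sketched below. The function name `order_subtasks` is hypothetical, and only the `task_id` and `dependency` keys shown in the prompt are assumed to be present.

```python
import json
from graphlib import TopologicalSorter


def order_subtasks(plan_json: str) -> list[str]:
    """Parse a Planner-style JSON array and return task IDs in dependency order."""
    plan = json.loads(plan_json)
    # Map each task to the set of tasks it depends on (its predecessors).
    graph = {task["task_id"]: set(task.get("dependency", [])) for task in plan}
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError (a ValueError) if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())
```

Rejecting unknown dependency IDs before sorting catches plans that reference a subtask the model never emitted, which would otherwise surface only when a downstream agent runs.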

| Prompt Component | Content |
|---|---|
| System Prompt | You are an expert Developer. Your goal is to write highly efficient, functionally correct, and strictly secure code based on the provided subtask. |
| | If you are provided with critique/feedback from previous testing or security iterations, you must carefully modify the existing code to resolve the reported errors or vulnerabilities. |
| | Only output the raw, executable Python code within a `` ```python `` block. No conversational text. |
| User Prompt Template | Subtask Description: {task_description} |
| | Current Code (if refining): {current_code} |
| | Critic Feedback (if any): {critic_feedback} |
| | Action: Generate or refine the code to fulfill the subtask. Explicitly address the Critic’s feedback if provided. |
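Because the Code agent is told to answer with a single fenced Python block, whatever consumes its output has to strip the fence before sandbox execution. One possible extraction step is sketched below; `extract_python_block` is a hypothetical helper, not part of the published framework.

```python
import re

# The fence marker is built at runtime so that this example can itself be
# shown inside a fenced block without terminating it early.
TICKS = "`" * 3
BLOCK_RE = re.compile(TICKS + r"python[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_python_block(response: str) -> str:
    """Return the body of the first fenced python block in an agent response."""
    match = BLOCK_RE.search(response)
    if match is None:
        raise ValueError("response contains no fenced python block")
    return match.group(1).strip()
```

Raising on a missing fence, rather than falling back to the raw response, avoids executing conversational text that the prompt forbids but a model may still produce.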

| Prompt Component | Content |
|---|---|
| System Prompt | You are a strict Cybersecurity Auditor. You will be provided with a generated code snippet and a preliminary Static Analysis Report (e.g., from AST parsing or CodeQL). |
| | Your task is to filter out false positives from the static report, confirm real logic vulnerabilities, and output a structured vulnerability report. Think adversarial. |
| User Prompt Template | Code Snippet: {code_snippet} |
| | Static Analysis Raw Report: {sast_report} |
| | Action: Analyze the data flow. If a vulnerability is confirmed, output exactly in this format: |
| | CWE-ID: [e.g., CWE-89] |
| | Severity: [Low/Medium/High] |
| | Vulnerable Lines: [line numbers] |
| | Mitigation Hint: [Concrete instructions to fix it securely] |
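The four-field report format requested from the Security agent is regular enough to parse mechanically. The sketch below assumes one finding per report and that line numbers appear as plain integers; the `Finding` dataclass and `parse_finding` function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass

FIELD_RE = re.compile(
    r"^(CWE-ID|Severity|Vulnerable Lines|Mitigation Hint):\s*(.*)$", re.MULTILINE
)
REQUIRED = {"CWE-ID", "Severity", "Vulnerable Lines", "Mitigation Hint"}


@dataclass
class Finding:
    cwe_id: str
    severity: str
    lines: list[int]
    mitigation: str


def parse_finding(report: str) -> Finding:
    """Parse one auditor report in the four-field format into a Finding."""
    fields = {key: value.strip() for key, value in FIELD_RE.findall(report)}
    missing = REQUIRED - set(fields)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not re.fullmatch(r"CWE-\d+", fields["CWE-ID"]):
        raise ValueError(f"malformed CWE-ID: {fields['CWE-ID']!r}")
    severity = fields["Severity"].capitalize()
    if severity not in {"Low", "Medium", "High"}:
        raise ValueError(f"unknown severity: {fields['Severity']!r}")
    # Collect every integer in the field; a range such as "12-14" is NOT expanded.
    lines = [int(n) for n in re.findall(r"\d+", fields["Vulnerable Lines"])]
    return Finding(fields["CWE-ID"], severity, lines, fields["Mitigation Hint"])
```

Validating the CWE identifier and severity at parse time means a malformed auditor response fails loudly instead of being passed on as a usable signal.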

| Prompt Component | Content |
|---|---|
| System Prompt | You are an expert Software QA Engineer. Your task is to generate comprehensive edge-case unit tests (in Python unittest or pytest) for the provided code to verify functional correctness and robustness. Ensure coverage of boundary conditions and failure points. |
| User Prompt Template | Target Code Snippet: {current_code} |
| | Action: Generate 3 to 5 edge-case test functions to stress-test the above code. Output ONLY the raw Python test code within a `` ```python `` block. Do not write test data requiring external file reads unless mocked. |

| Prompt Component | Content |
|---|---|
| System Prompt | You are a rigorous Code Reviewer and System Router. Aggregate execution logs from the Sandbox Test and the vulnerability report from the Security Auditor. |
| | Generate a concise, actionable Critique for the Developer. Output a routing tag (<ROUTE: Code\|Test\|Security\|End>) to determine the next agent. |
| User Prompt Template | Current Source Code: {current_code} |
| | Test Execution Logs: {test_logs} |
| | Security Auditor Report: {security_report} |
| | Action: Summarize critical failures. Pinpoint logic flaws and enforce secure coding practices. Provide a unified [CRITIQUE] section, followed immediately by the [ROUTING] section. |
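The routing tag in the Critic prompt gives the orchestration loop a machine-readable decision. A minimal sketch of how such a tag might be read follows; `parse_route` is a hypothetical helper, and the choice to take the last tag in the output and to raise on unknown targets are assumptions.

```python
import re

ROUTES = {"Code", "Test", "Security", "End"}
ROUTE_RE = re.compile(r"<ROUTE:\s*(\w+)\s*>")


def parse_route(critic_output: str) -> str:
    """Return the next agent named by the last routing tag in the Critic output."""
    matches = ROUTE_RE.findall(critic_output)
    if not matches:
        raise ValueError("no routing tag found")
    route = matches[-1]  # the [ROUTING] section follows the critique, so take the last tag
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route!r}")
    return route
```

Taking the last match tolerates a critique that quotes the tag syntax earlier in its text while still honoring the final decision.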
References
1. Wang, X.; Wan, Z.; Hekmati, A.; Zong, M.; Alam, S.; Zhang, M.; Krishnamachari, B. The Internet of Things in the Era of Generative AI: Vision and Challenges. IEEE Internet Comput. 2024, 28, 57–64.
2. Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2022; pp. 754–768.
3. Asare, O.; Nagappan, M.; Asokan, N. Is GitHub’s Copilot as bad as humans at introducing vulnerabilities in code? Empir. Softw. Eng. 2023, 28, 129.
4. Bhatt, M.; Chennabasappa, S.; Nikolaidis, C.; Wan, S.; Evtimov, I.; Gabi, D.; Song, D.; Ahmad, F.; Aschermann, C.; Fontana, L.; et al. Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models. arXiv 2023, arXiv:2312.04724.
5. Sandoval, G.; Pearce, H.; Nys, T.; Karri, R.; Garg, S.; Dolan-Gavitt, B. Lost at C: A user study on the security implications of large language model code assistants. In Proceedings of the 32nd USENIX Security Symposium, Anaheim, CA, USA, 9–11 August 2023.
6. Tony, C.; Mutas, M.; Díaz Ferreyra, N.E.; Scandariato, R. LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations. In Proceedings of the IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia, 15–16 May 2023; pp. 588–592.
7. Chen, Y.; Sun, W.; Fang, C.; Chen, Z.; Ge, Y.; Han, T.; Zhang, Q.; Liu, Y.; Chen, Z.; Xu, B. Security of Language Models for Code: A Systematic Literature Review. arXiv 2024, arXiv:2410.15631.
8. Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv 2022, arXiv:2204.05862.
9. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
10. An, S.; Ma, Z.; Lin, Z.; Zheng, N.; Lou, J.-G.; Chen, W. Learning from Mistakes Makes LLM Better Reasoner. arXiv 2023, arXiv:2310.20689.
11. Qian, C.; Liu, W.; Liu, H.; Chen, N.; Dang, Y.; Li, J.; Yang, C.; Chen, W.; Su, Y.; Cong, X.; et al. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 15174–15186.
12. Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv 2023, arXiv:2308.00352.
13. Huang, D.; Zhang, J.M.; Luck, M.; Bu, Q.; Qing, Y.; Cui, H. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv 2024, arXiv:2312.13010.
14. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 22–31.
15. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374.
16. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097.
17. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950.
18. Hassija, V.; Chamola, V.; Saxena, V.; Jain, D.; Goyal, P.; Sikdar, B. A Survey on IoT Security: Application Areas, Security Threats, and Solution Architectures. IEEE Access 2019, 7, 82721–82743.
19. Khan, M.A.; Salah, K. IoT Security: Review, Blockchain Solutions, and Open Challenges. Future Gener. Comput. Syst. 2018, 82, 395–411.
20. Deng, L.; Zhong, Q.; Song, J.; Lei, H.; Li, W. LLM-Based Unknown Function Automated Modeling in Sensor-Driven Systems for Multi-Language Software Security Verification. Sensors 2025, 25, 2683.
21. Al-Safi, H.; Ibrahim, H.; Steenson, P. Vega: LLM-Driven Intelligent Chatbot Platform for Internet of Things Control and Development. Sensors 2025, 25, 3809.
22. Athanasopoulou, A.; Fotiou, N.; Chatzopoulos, A. Interacting with IoT Data Spaces Using LLMs and the Model Context Protocol. Sensors 2026, 26, 1193.
23. Hajipour, H.; Schönherr, L.; Holz, T.; Fritz, M. HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data. arXiv 2024, arXiv:2409.06446.
24. He, J.; Vechev, M. Large Language Models for Code: Security Hardening and Adversarial Testing (SVEN). In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS), Copenhagen, Denmark, 26–30 November 2023.
25. He, J.; Vero, M.; Krasnopolska, G.; Vechev, M. Instruction Tuning for Secure Code Generation (SafeCoder). In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 18043–18062.
26. Weyssow, M.; Kamanda, A.; Sahraoui, H. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences. arXiv 2024, arXiv:2403.09032.
27. Le, H.; Wang, Y.; Gotmare, A.D.; Savarese, S.; Hoi, S.C.H. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2022, 35, 21314–21328.
28. Shojaee, P.; Jain, A.; Tipirneni, S.; Reddy, C.K. Execution-based Code Generation using Deep Reinforcement Learning (PPOCoder). Trans. Mach. Learn. Res. (TMLR) 2023, arXiv:2301.13816.
29. Islam, M.A.; Ali, M.E.; Parvez, M.R. MapCoder: Multi-Agent Code Generation for Competitive Problem Solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 4912–4944.
30. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2022, 35, 24824–24837.
31. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. Adv. Neural Inf. Process. Syst. (NeurIPS) 2023, 36, 46534–46594.
32. Avgustinov, P.; de Moor, O.; Jones, M.P.; Schäfer, M. QL: Object-oriented Queries on Relational Data. In Proceedings of the 30th European Conference on Object-Oriented Programming (ECOOP), Rome, Italy, 17–22 July 2016.
33. Newsome, J.; Song, D. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceedings of the 12th Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 3–4 February 2005.
34. Merkel, D. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J. 2014, 239, 2.
35. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.I.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2015, arXiv:1506.02438.
36. Stooke, A.; Achiam, J.; Abbeel, P. Responsive Safety in Reinforcement Learning by PID Lagrangian Methods. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 9133–9143.
37. Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; Topcu, U. Safe Reinforcement Learning via Shielding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018.
38. Hendrycks, D.; Basart, S.; Kadavath, S.; Mazeika, M.; Arora, A.; Guo, E.; Burns, C.; Puranik, S.; He, H.; Song, D.; et al. Measuring Coding Challenge Competence with APPS. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, Virtual, 6–14 December 2021.
39. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
40. Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
41. Siddiq, M.L.; Santos, J.C.S. SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security; ACM: New York, NY, USA, 2022; pp. 29–33.
42. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
43. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
44. OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 13 May 2024).
45. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Wang, P.; Zhu, Q.; Xu, R.; Zhang, R.; Ma, S.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
46. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.E.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), Koblenz, Germany, 23–26 October 2023; pp. 611–626.
47. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv 2023, arXiv:2308.08155.
48. Snell, C.; Lee, J.; Xu, K.; Kumar, A. Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR) Workshop, Vienna, Austria, 11 May 2024.
49. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s Verify Step by Step. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
50. Hui, B.; Yang, J.; Cui, Z.; Yang, J.; Liu, D.; Zhang, L.; Liu, T.; Zhang, J.; Yu, B.; Lu, K.; et al. Qwen2.5-Coder Technical Report. arXiv 2024, arXiv:2409.12186.








| IoT/CPS Context | Generated Artifact | Security Risks Constrained by SafeCodeRL | Functional Validation Signal |
|---|---|---|---|
| Sensor telemetry service | HTTP or database access logic for storing sensing streams | SQL injection, unsafe query construction, missing input validation | Correct insertion, retrieval, and aggregation over mocked telemetry records |
| Edge gateway diagnostics | Python or shell-wrapper utility for local device status checks | OS command injection, unsafe subprocess calls, unrestricted external commands | Correct parsing of diagnostic requests under sandboxed execution |
| Firmware/config. handler | File upload, update, and log-access logic | Path traversal, unrestricted filesystem access, hard-coded secrets | Correct file selection and update behavior within a restricted virtual device tree |
| Secure data aggregation | Local aggregation or filtering function for multi-sensor inputs | Insecure authentication, unsafe trust assumptions, leakage of device identifiers | Accurate aggregation under normal and boundary sensor streams |
| Networked sensor parser | Parser for MQTT/HTTP-style payloads or edge messages | Denial-of-service style parsing errors, unsafe deserialization, malformed-input crashes | Robust parsing of valid, missing, malformed, and adversarial payloads |

| Method | Agent Base Model | HumanEval Pass@1 ↑ | SWE-bench Resolution ↑ | SecurityEval VR ↓ | CyberSecEval VR ↓ |
|---|---|---|---|---|---|
| Single-Agent Vanilla [42] | Llama-3-8B | 62.4% | 3.1% | 42.5% | 38.7% |
| Single-Agent Vanilla [43] | DeepSeek-V3 | 75.2% | 8.2% | 32.1% | 29.5% |
| Single-Agent Vanilla [44] | GPT-4o | 85.1% | 8.7% | 29.8% | 26.5% |
| Single-Agent Reasoning [45] | DeepSeek-R1 | 82.5% | 12.4% | 15.6% | 12.8% |
| Self-Refine [31] | GPT-4o | 88.2% | 11.2% | 24.3% | 22.1% |
| ChatDev [11] | GPT-4o | 88.6% | 11.6% | 21.9% | 19.8% |
| MetaGPT [12] | GPT-4o | 89.1% | 12.0% | 20.4% | 18.7% |
| MapCoder [29] | GPT-4o | 89.8% | 12.7% | 18.9% | 17.3% |
| AgentCoder [13] | GPT-4o | 90.2% | 13.1% | 17.6% | 16.4% |
| SafeCodeRL (Ours) | Llama-3-8B | 78.5% (+16.1) | 8.5% (+5.4) | 12.2% (−30.3) | 14.8% (−23.9) |
| SafeCodeRL (Ours) | DeepSeek-V3 | 86.8% (+11.6) | 15.2% (+7.0) | 4.2% (−27.9) | 5.8% (−23.7) |
| SafeCodeRL (Ours) | GPT-4o | 91.4% (+6.3) | 14.8% (+6.1) | 5.5% (−24.3) | 6.2% (−20.3) |

| Method | Base Model | Functional Pass@1 ↑ | Vulnerability Rate ↓ | Secure Pass@1 ↑ | Avg. Iterations ↓ |
|---|---|---|---|---|---|
| Single-Agent Vanilla [42] | Llama-3-8B | 64.5% | 44.3% | 35.8% | 1.0 |
| Single-Agent Vanilla [43] | DeepSeek-V3 | 80.3% | 32.5% | 55.8% | 1.0 |
| Self-Refine [31] | GPT-4o | 88.4% | 24.3% | 68.2% | 2.8 |
| ChatDev [11] | GPT-4o | 89.0% | 21.7% | 71.9% | 3.7 |
| MetaGPT [12] | GPT-4o | 90.2% | 18.9% | 74.8% | 3.5 |
| MapCoder [29] | GPT-4o | 90.7% | 17.4% | 77.3% | 3.2 |
| AgentCoder [13] | GPT-4o | 91.3% | 15.9% | 79.7% | 3.1 |
| SafeCodeRL (Ours) | Llama-3-8B | 81.4% | 12.5% | 72.9% | 2.4 |
| SafeCodeRL (Ours) | DeepSeek-V3 | 89.2% | 4.9% | 84.7% | 2.1 |
| SafeCodeRL (Ours) | GPT-4o | 92.9% | 3.7% | 89.3% | 1.9 |

| Model | Scaling Paradigm | Avg. Tokens per Problem ↓ | Inference Latency (s) ↓ | Pass@1 (%) ↑ |
|---|---|---|---|---|
| Qwen2.5-Coder [50] (Single) | No Scaling | 1100 | 3.2 | 83.4 |
| GPT-4o [44] (Single) | External Sampling Best-of-N (N = 8) | 18,500 | 25.4 | 86.1 |
| DeepSeek-R1 [45] (Single) | Internal Long-Chain Reasoning | 35,200 | 45.2 | 82.5 |
| SafeCodeRL-Qwen (1 Iteration) | External Multi-Agent Collaboration | 3800 | 7.9 | 85.2 |
| SafeCodeRL-Qwen (3 Iterations) | External Multi-Agent Collaboration | 8200 | 16.5 | 88.5 |
| SafeCodeRL-Qwen (4 Iterations) | External Multi-Agent Collaboration | 10,500 | 21.8 | 89.2 |

| Variant Configuration | SWE-bench Resolution Rate ↑ | SecurityEval VR ↓ | Average Iterations ↓ |
|---|---|---|---|
| SafeCodeRL Full | 14.1% | 4.6% | 2.3 |
| w/o Five-Agent Decomposition (3-Agent Code-Security-Test) | 10.9% | 14.3% | 4.1 |
| w/o SFT Warm-start | 10.7% | 8.9% | 3.6 |
| w/o Router Distillation | 12.2% | 6.8% | 3.5 |
| w/o Progressive (Binary Functional Reward) | 10.4% | 9.6% | 3.9 |
| w/o Critic Routing (Fixed Pipeline) | 11.5% | 5.2% | 4.0 |
| w/o Lagrangian Constraint (Static Weight) | 12.8% | 17.5% | 2.5 |
| w/o Shielding (Removed Hard Constraint) | 13.9% | 11.8% | 2.4 |

| Generator | Critic | HumanEval (Pass@1) ↑ | SecurityEval (VR) ↓ | Critique Quality ↑ |
|---|---|---|---|---|
| Qwen2.5-Coder | No Critic (Vanilla Zero-shot) | 83.4% | 34.5% | N/A |
| Qwen2.5-Coder | Llama-3-8B [42] | 84.5% | 11.2% | 3.2/5.0 |
| Qwen2.5-Coder | DeepSeek-V3 [43] | 88.7% | 4.6% | 4.5/5.0 |
| Qwen2.5-Coder | GPT-4o [44] | 89.5% | 3.8% | 4.8/5.0 |

| CWE-ID | Vulnerability Category | Count | Share (%) |
|---|---|---|---|
| CWE-89 | SQL injection | 660 | 46.5 |
| CWE-22 | Path traversal | 440 | 31.0 |
| CWE-78 | OS command injection | 320 | 22.5 |
| Total | — | 1420 | 100.0 |
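The shield triggers in the table above fall into three CWE classes. A pre-execution check of this kind could be approximated with Python's `ast` module, as sketched below. This is an assumption-laden illustration, not the shield used in the paper: the rules cover only obvious CWE-78 and CWE-89 patterns, CWE-22 is omitted, and `shield_triggers` is a hypothetical name.

```python
import ast


def _dotted_name(func: ast.expr) -> str:
    """Return 'module.attr' or a bare name for simple call targets, else ''."""
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    if isinstance(func, ast.Name):
        return func.id
    return ""


def shield_triggers(source: str) -> list[str]:
    """Statically flag a few catastrophic call patterns without running the code."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = _dotted_name(node.func)
        shell_true = any(
            kw.arg == "shell"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        )
        if name == "os.system" or (name.startswith("subprocess.") and shell_true):
            hits.append("CWE-78")  # command string handed to a shell
        elif (
            name.endswith(".execute")
            and node.args
            and isinstance(node.args[0], (ast.JoinedStr, ast.BinOp))
        ):
            hits.append("CWE-89")  # query built by f-string or concatenation
    return hits
```

Because the check only parses the candidate program, it can run before any sandbox is started; the trade-off is that purely syntactic rules like these miss indirect calls such as `conn.cursor().execute(...)` and aliased imports.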
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, Z.; Chen, J.; Wei, Z.; Lin, L.; Tong, G. SafeCodeRL: Security-Constrained Multi-Agent Reinforcement Learning for Trustworthy LLM-Generated IoT/CPS Software. Sensors 2026, 26, 3502. https://doi.org/10.3390/s26113502


