Abstract
Large language models (LLMs) have shown remarkable potential for automatic code generation. Yet, these models share a weakness with their human counterparts: inadvertently generating code with security vulnerabilities that could allow unauthorized attackers to access sensitive data or systems. In this work, we propose Feedback-Driven Security Patching (FDSP), wherein LLMs automatically refine vulnerable generated code. The key to our approach is a unique framework that leverages automatic static code analysis to enable the LLM to create and implement potential solutions to code vulnerabilities. Further, we curate a novel benchmark, PythonSecurityEval, that can accelerate progress in the field of code generation by covering diverse, real-world applications, including databases, websites, and operating systems. Our proposed FDSP approach achieves the strongest improvements, reducing vulnerabilities by up to 33% when evaluated with Bandit and 12% with CodeQL and outperforming baseline refinement methods.