Leveraging Static Analysis for Feedback-Driven Security Patching in LLM-Generated Code
Abstract
1. Introduction
- Our contributions are as follows:
- We propose FDSP, a method for improving LLM-generated code security by incorporating feedback from static analyzers such as Bandit.
- We develop a novel benchmark, PythonSecurityEval, to evaluate how well language-model-based approaches produce secure code, stratified by diverse and common types of security vulnerabilities.
- Across three benchmarks, including PythonSecurityEval, and using GPT-4, GPT-3.5, and CodeLlama, we find that FDSP improves patch success rates by up to 17.6% over the baselines.
2. Related Work
2.1. Language Models for Code
2.2. Refinement of LLMs
3. Background
3.1. LLMs in Software Engineering Applications
3.2. LLM Refinement
3.3. Code Vulnerabilities
4. Our Approach
- Code generation: An LLM generates candidate code for a given task.
- Code testing: The code is passed through a static analyzer to identify security issues and produce structured feedback reports.
- Multi-solution generation: Using this feedback, the LLM generates multiple candidate patches aimed at refining the detected vulnerabilities.
- Iterative refinement: Each candidate solution is fed back into the LLM along with the vulnerable code to further refine and resolve any remaining vulnerabilities.
- Solution diversity: Different security vulnerabilities may require different fix strategies. Generating multiple solutions increases coverage of potential remediation approaches.
- Implementation reliability: LLMs may not perfectly implement a solution on the first attempt. Multiple iterations allow for refinement and correction of partial implementations.
Algorithm 1: Feedback-Driven Security Patching (FDSP).
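The four steps of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not our actual implementation: `llm(prompt) -> str` and `analyze(code) -> str` are hypothetical callables standing in for the language model and the static analyzer (e.g., Bandit), and `n_solutions`/`n_iters` correspond to the number of generated solutions and refinement iterations.

```python
def fdsp(task, llm, analyze, n_solutions=3, n_iters=3):
    """Sketch of the FDSP loop. `analyze` returns a feedback report,
    or an empty string when no security issues are detected."""
    # Step 1: code generation.
    code = llm(f"Task: {task}\nWrite a Python function.")
    for _ in range(n_iters):
        # Step 2: code testing with a static analyzer.
        report = analyze(code)
        if not report:
            return code  # terminate early once the code is clean
        # Step 3: multi-solution generation from the feedback.
        strategies = [llm(f"Fix idea {i}: {report}") for i in range(n_solutions)]
        # Step 4: iterative refinement; keep the first candidate patch
        # that passes the analyzer.
        for strategy in strategies:
            patched = llm(f"Apply: {strategy}\nTo:\n{code}")
            if not analyze(patched):
                code = patched
                break
        else:
            code = patched  # carry the last attempt into the next round
    return code
```

The early return reflects the termination condition: refinement stops as soon as the analyzer reports no remaining issues.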
4.1. Code Generation
4.2. Code Testing
4.3. Multi-Solution Generation
4.4. Iterative Refinement
5. Experimental Settings
5.1. Benchmarks
- LLMSecEval: This dataset contains natural language prompts for evaluating LLMs on generating secure source code [17]. LLMSecEval comprises 150 prompts (natural language descriptions of code), covering the majority of the top 25 Common Weakness Enumeration (CWE) categories.
- SecurityEval: This dataset evaluates LLMs on their ability to generate secure Python programs [20]. SecurityEval comprises 121 natural language prompts covering 75 types of vulnerabilities. Each prompt includes the header of a Python function along with comments describing the function.
- PythonSecurityEval (Ours): We collected a new benchmark from Stack Overflow to address the limitations of existing datasets, which are too small and too narrow to adequately evaluate whether LLMs can generate code free of security vulnerabilities. PythonSecurityEval includes natural language prompts intended to generate Python functions that cover diverse real-world applications. With 470 prompts, it is roughly three times larger than LLMSecEval and SecurityEval.
| Domain | PythonSecurityEval (Ours) | LLMSecEval | SecurityEval |
|---|---|---|---|
| Computation | 168 (35.7%) | 44 (29.5%) | 32 (26.4%) |
| System | 313 (66.6%) | 94 (63.1%) | 68 (56.2%) |
| Network | 147 (31.3%) | 63 (42.3%) | 29 (24.0%) |
| Cryptography | 29 (6.2%) | 8 (5.4%) | 16 (13.2%) |
| General | 414 (88.1%) | 128 (85.9%) | 118 (97.5%) |
| Database | 114 (24.3%) | 23 (15.4%) | 6 (5.0%) |
| Web Frameworks | 43 (9.1%) | 46 (30.9%) | 8 (6.6%) |
| Total | 470 | 150 | 121 |
5.2. Baselines
- Direct prompting: This approach sends the generated code back to an LLM with the instruction: "Does the provided function have a security issue? If yes, please refine the issue." If the LLM detects any security issues in the code, it refines them and generates secure code.
- Self-Debugging: The initial step in self-debugging is for the LLM to generate the code. The generated code is then sent back to the same LLM to produce feedback. Finally, both the generated code and the feedback are fed back to the LLM to correct any existing bugs.
- Bandit feedback: We developed this baseline, which uses Bandit to produce a report whenever there are security issues in the code, as shown in Figure 2. We use this report as feedback to enable the LLM to refine the vulnerable code. This strategy is similar to prior approaches in which external tools provide feedback to the LLM to refine its outputs [48,49,50]. Bandit feedback does not propose a solution; it simply highlights the problematic line and the type of issue.
- Verbalization: We verbalize the feedback from Bandit, via an LLM, to produce intelligible and actionable feedback for resolving security issues. The verbalized feedback provides a detailed natural-language explanation of Bandit's specialized output. This expanded explanation offers deeper insight into the security issues and may suggest solutions to address the vulnerabilities.
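Both feedback baselines turn the analyzer's report into a textual prompt. A minimal sketch, assuming Bandit was run with `-f json` and its output parsed into a dict (the field names `results`, `line_number`, `test_id`, `issue_severity`, and `issue_text` come from Bandit's JSON report format):

```python
def format_bandit_feedback(report):
    """Render a parsed Bandit JSON report into the kind of plain-text
    feedback that is passed back to the LLM for refinement."""
    lines = []
    for issue in report.get("results", []):
        lines.append(
            f"Line {issue['line_number']} [{issue['test_id']}, "
            f"{issue['issue_severity']} severity]: {issue['issue_text']}"
        )
    return "\n".join(lines) or "No security issues found."
```

For the Bandit-feedback baseline this text is sent to the LLM as-is; for verbalization it is first expanded by the LLM into a natural-language explanation.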
5.3. Evaluation Metrics
- Bandit: Bandit is an open-source static analysis tool developed by the OpenStack Security Project to identify security vulnerabilities in Python source code. Its core mechanism involves parsing the Abstract Syntax Tree (AST) of Python programs and systematically inspecting it to detect known security anti-patterns and vulnerable code constructs. Bandit’s rule set covers a broad range of common security issues in Python, including hardcoded credentials, weak cryptographic algorithms, and unsafe subprocess management. The tool automatically generates reports highlighting potential vulnerabilities, their severity, and their precise locations within the code. In this study, we leverage Bandit both as a vulnerability detection tool for providing external feedback to large language models (LLMs) during code refinement and as an evaluation metric for measuring the effectiveness of our approach.
- CodeQL: CodeQL is an open-source static analysis framework developed by GitHub (version 2.23.1) for detecting vulnerabilities and code patterns in source code. The CodeQL workflow begins by parsing the source code into a database representation that captures its syntax, structure, and semantics, including abstract syntax trees (ASTs), control flow graphs, and data flow information. Custom queries can then be executed against this database to identify specific issues, such as insecure API usage or potential SQL injection vulnerabilities. CodeQL supports multiple programming languages, including Python. In this study, we use CodeQL as an external evaluation metric to assess the security of generated code and to evaluate how effectively refinement techniques mitigate identified vulnerabilities [51].
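The AST-based detection mechanism described above for Bandit can be illustrated with a toy checker that flags calls passing `shell=True`, loosely in the spirit of Bandit's B602 (subprocess-with-shell) rule. This is a simplified illustration, not Bandit's actual implementation:

```python
import ast

def find_shell_true(source):
    """Toy AST checker: walk the parsed tree and report the line
    number of every call that passes the keyword shell=True."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg == "shell"
                        and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    findings.append(node.lineno)
    return findings
```

Bandit's real rule set works the same way at a larger scale: each plugin inspects AST nodes for a known insecure construct and emits an issue with its location and severity.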
5.4. Models
- GPT-4: GPT-4 is a Generative Pre-trained Transformer model developed by OpenAI. Trained on massive text corpora using unsupervised learning, GPT-4 leverages the Transformer architecture to excel at a wide range of language tasks, including code generation, summarization, translation, and bug fixing. Notably, GPT-4 is a closed-source model.
- GPT-3.5: GPT-3.5 is also part of the GPT family developed by OpenAI. With 175 billion parameters, it was trained on a general-purpose dataset. Among the various GPT-3.5 versions, we utilize “gpt-3.5-turbo-instruct”, which is specifically instruction-tuned to follow user prompts and generate responses aligned with user intent.
- CodeLlama: CodeLlama is an advanced, open-source LLM developed by Meta AI, trained primarily on code datasets. It is available in three model sizes—7B, 13B, and 34B parameters. In this study, we employ CodeLlama-Instruct-34B, an instruction-tuned variant optimized for understanding and following user instructions, making it well-suited for both code generation and refinement tasks.
5.5. Research Questions
- RQ1. What is the fundamental capability of LLMs in refining security vulnerabilities? This question aims to determine how effectively LLMs can inherently correct insecure code, and to highlight their limitations without incorporating external feedback.
- RQ2. How does Bandit feedback affect the ability of LLMs to refine code vulnerabilities? This question examines how effectively LLMs incorporate the feedback provided by Bandit, a static code analysis tool.
- RQ3. How does FDSP improve LLM performance in refining code vulnerabilities? This question assesses how well LLMs generate multiple potential solutions and iterate over each one to refine vulnerabilities.
- RQ4. How important are the multiple generated solutions and iterations of FDSP? We conduct ablation studies that isolate these factors by restricting FDSP to a single solution or a single iteration. This analysis reveals whether solution diversity and iterative refinement contribute to FDSP's effectiveness.
6. Experimental Results
6.1. RQ1: LLMs Are Somewhat Effective at Refining Vulnerable Code on Their Own
6.2. RQ2: Bandit-Based Feedback Is Beneficial Towards Correcting Security Vulnerabilities in Generated Code
| Dataset | Approach | GPT-4 (Bandit) | GPT-4 (CodeQL) | GPT-3.5 (Bandit) | GPT-3.5 (CodeQL) | CodeLlama (Bandit) | CodeLlama (CodeQL) |
|---|---|---|---|---|---|---|---|
| LLMSecEval | Generated code | 38.2% | 10.1% | 34.2% | 18.1% | 28.6% | 20.7% |
| | Direct prompting | 35.3% | 4.7% | 28.0% | 7.4% | 24.0% | 11.6% |
| | Self-debugging | 24.0% | 7.4% | 28.0% | 8.7% | 24.6% | 15.7% |
| | Bandit feedback | 8.0% | 5.4% | 18.6% | 8.7% | 18.0% | 13.2% |
| | Verbalization | 7.3% | 5.4% | 18.0% | 6.7% | 16.6% | 10.7% |
| | FDSP (Ours) | 6.0% | 6.7% | 12.6% | 8.1% | 14.6% | 9.1% |
| SecurityEval | Generated code | 34.7% | 12.4% | 38.0% | 9.1% | 46.2% | 32.2% |
| | Direct prompting | 21.4% | 5.8% | 25.6% | 8.3% | 38.0% | 14.1% |
| | Self-debugging | 16.5% | 9.1% | 27.2% | 9.1% | 38.8% | 17.4% |
| | Bandit feedback | 4.1% | 6.6% | 13.2% | 5.8% | 21.4% | 13.4% |
| | Verbalization | 4.9% | 6.6% | 13.2% | 5.8% | 17.3% | 13.4% |
| | FDSP (Ours) | 4.1% | 8.3% | 5.7% | 2.5% | 8.2% | 12.1% |
| PythonSecurityEval | Generated code | 40.2% | 17.9% | 48.5% | 13.2% | 42.3% | 13.2% |
| | Direct prompting | 25.1% | 9.6% | 42.5% | 8.5% | 31.0% | 6.6% |
| | Self-debugging | 24.8% | 8.7% | 43.4% | 8.9% | 33.1% | 7.9% |
| | Bandit feedback | 9.3% | 9.1% | 26.3% | 6.4% | 20.0% | 6.2% |
| | Verbalization | 8.7% | 8.5% | 23.6% | 7.4% | 19.5% | 6.0% |
| | FDSP (Ours) | 7.4% | 7.7% | 15.7% | 5.7% | 8.7% | 5.7% |
6.3. RQ3: FDSP Shows Consistent Improvement over the Baseline
```python
# Generated code containing a vulnerability.
def show_user(username):
    connection = get_connection()
    cursor = connection.cursor()

    # building and executing SQL select query
    select_query = f"SELECT * FROM users WHERE username = '{username}'"
    cursor.execute(select_query)

    # getting all rows from the executed query
    user_info = cursor.fetchall()

    # close the cursor and connection
    cursor.close()
    connection.close()
    return user_info


# Refined the generated code using FDSP.
def show_user(username):
    connection = get_connection()
    cursor = connection.cursor()

    # building and executing SQL select query using parameterization
    select_query = "SELECT * FROM users WHERE username = %s"
    cursor.execute(select_query, (username,))

    # getting all rows from the executed SQL query
    user_info = cursor.fetchall()

    # close the cursor and connection
    cursor.close()
    connection.close()
    return user_info
```
| Code Snippet 1: An example of LLM-generated code containing a vulnerability (top), and the corresponding code refined and fixed by FDSP (bottom). |
6.4. RQ4: Ablation Study
6.5. Qualitative Analysis
6.6. What Are the Most Frequent, Unresolved Coding Vulnerabilities Produced by LLMs?
6.7. Evaluating Functional Correctness in Code Refinement


6.8. Cross-Tool Evaluation of FDSP with Semgrep
6.9. Statistical Analysis of Refinement Effectiveness
6.10. Computational Cost
6.11. Comparative Performance Analysis: FDSP vs. INDICT in Multi-Round Vulnerability Reduction
6.12. Static Analysis Limitations
6.13. Beyond Function-Level Repair: Multi-Function Evaluation
6.14. Controlled Vulnerability Injection and Qualitative Evaluation
7. Threats to Validity
7.1. Internal Validity
7.2. Construct Validity
7.3. External Validity
8. Reproducibility Statement
8.1. Model Configuration
8.2. Dataset Collection and Preprocessing
8.3. Security Labeling and Verification
8.4. Static Analysis Configuration
9. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1. Comparative Analysis of CoT and FDSP Performance
| Dataset | Approach | GPT-4 (Bandit) | GPT-4 (CodeQL) | CodeLlama (Bandit) | CodeLlama (CodeQL) |
|---|---|---|---|---|---|
| LLMSecEval | CoT | 26.8% | 7.38% | 24.8% | 12.0% |
| | FDSP (Ours) | 6.0% | 6.7% | 14.6% | 9.1% |
| SecurityEval | CoT | 18.0% | 5.8% | 38.5% | 17.5% |
| | FDSP (Ours) | 4.1% | 8.3% | 8.2% | 12.1% |
| PythonSecurityEval | CoT | 22.7% | 8.4% | 36.4% | 9.14% |
| | FDSP (Ours) | 7.4% | 7.7% | 8.7% | 5.7% |
Appendix A.2. Parameter Sensitivity Analysis
| Configuration | Vulnerability Rate | Mean Time (SD) | Observation |
|---|---|---|---|
| – | 9.5% | 7.40 (±2.5) | Fewer iterations, higher vulnerability |
| – | 7.4% | 22.8 (±11.4) | Performance improves with iteration depth |
| – | 7.4% | 29.9 (±21.3) | Balanced trade-off (default) |
| – | 7.2% | 59.0 (±72.0) | Marginal gain, high cost |
| – | 10.0% | 8.3 (±5.8) | Limited diversity, weaker coverage |
| – | 8.7% | 15.7 (±7.4) | Moderate improvement |
| – | 7.4% | 29.9 (±21.3) | Optimal balance of diversity and cost |
| – | 7.4% | 57.1 (±46.0) | Diminishing returns beyond |

| CWE ID | Description |
|---|---|
| CWE-20 | Improper Input Validation |
| CWE-22 | Improper Limitation of a Pathname to a Restricted Directory (’Path Traversal’) |
| CWE-78 | Improper Neutralization of Special Elements used in an OS Command (’OS Command Injection’) |
| CWE-79 | Improper Neutralization of Input During Web Page Generation (’Cross-site Scripting’) |
| CWE-89 | Improper Neutralization of Special Elements used in an SQL Command (’SQL Injection’) |
| CWE-94 | Improper Control of Generation of Code (’Code Injection’) |
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-200 | Exposure of Sensitive Information to an Unauthorized Actor |
| CWE-284 | Improper Access Control |
| CWE-287 | Improper Authentication |
| CWE-306 | Missing Authentication for Critical Function |
| CWE-352 | Cross-Site Request Forgery (CSRF) |
| CWE-400 | Uncontrolled Resource Consumption |
| CWE-502 | Deserialization of Untrusted Data |
| CWE-611 | Improper Restriction of XML External Entity Reference (XXE) |
| CWE-703 | Improper Handling of Exceptional Conditions |
| CWE-798 | Use of Hard-coded Credentials |
| CWE-120 | Buffer Copy without Checking Size of Input (’Classic Buffer Overflow’) |
| CWE-125 | Out-of-bounds Read |
| CWE-190 | Integer Overflow or Wraparound |
| CWE-295 | Improper Certificate Validation |
| CWE-416 | Use After Free |
| CWE-434 | Unrestricted Upload of File with Dangerous Type |
| CWE-476 | NULL Pointer Dereference |
| CWE-732 | Incorrect Permission Assignment for Critical Resource |
| CWE-787 | Out-of-bounds Write |
| CWE-862 | Missing Authorization |
| CWE-918 | Server-Side Request Forgery (SSRF) |
| CWE-1021 | Improper Restriction of Rendered UI Layers or Frames |
| CWE-1295 | Debug Features Enabled in Production |
| Examples | Generated Potential Solutions |
|---|---|
| Example 1 | (1) Use Parameterized Queries: The most effective way to prevent SQL injection is to use parameterized queries, also known as prepared statements. This strategy doesn’t form SQL commands by using string operations. Instead, it designates placeholders for each parameter, and the database engine will fill these placeholders securely. However, SQLite doesn’t allow placeholders for table names, so it might not be possible to bind table names to placeholders in this case. (2) Input Validation: Another strategy is to validate the user inputs or in this case, the table_name variable. We can ensure that it only contains valid table name characters and does not contain dangerous SQL syntax. This can be done using a whitelist of allowed characters or a regular expression (regex) to match valid table names. (3) Devise a Whitelisting Mechanism: This would involve creating a list of tables that are allowed to be accessed via the "total_rows" function. By checking if the table name provided to the function is present in the whitelisted table names, the code can ensure that only legitimate queries are executed, proving effective against SQL injection. |
| Example 2 | (1) Parameterized Query: The main solution for preventing SQL injection is to use parameterized queries. This entails using placeholders for data in your SQL statements and then providing the data to the SQL interpreter separately so it’s correctly and safely quoted. Unfortunately, SQLite in Python doesn’t directly support parameterization of table or column names. So, the programmer needs to ensure they sanitize their table or column name inputs. (2) Use of Sanitizing function: To handle not directly supported parameterization, we can also create a sanitizing function to ensure the validity of the table name provided to prevent SQL injection though still possible to think about the limitation of sanitizing process. (3) Use of Database API: Instead of allowing the program user to provide arbitrary table or column names, consider using a Database API to generate these names. This will provide an interface for the user to view and interact with the data, but not change the structure of the SQL queries themselves. This significantly reduces the risk of SQL injection attacks because the user may only affect the value in a field, not the structure of the SQL queries since the table names are not user inputs. |
| Example 3 | (1) Use Parameterized Queries: This involves using placeholders instead of directly inserting values into the statements. Parameters are used to help differentiate between the SQL code and the data passed through. When executed, the SQL statement compiles together with the parameters as one command which prevents attackers from interfering with the structure of SQL queries. (2) Use Object-Relational Mapping (ORM) Libraries: Technologies such as SQLAlchemy, Django ORM, can be used to handle operations with databases. These libraries usually come with secure features that protect from SQL injection attacks. (3) Use Database Views: Business logic could be implemented in the database side to decrease the complexity of SQL built in the application, and to encapsulate the database structure. Views could be used to give simplified perspective of the database and to restrict access to sensitive parts of the database schema. Reducing database control and visibility minimizes the potential damage of SQL Injection attacks. |
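A recurring suggestion in the generated solutions above is to whitelist table names, since DB-API placeholders can bind values but not identifiers. A minimal sketch of that strategy (the table set and function name are illustrative, not from our benchmark): the identifier is validated against a fixed whitelist before interpolation, while values would still go through parameters.

```python
import sqlite3

ALLOWED_TABLES = {"users", "orders"}  # illustrative whitelist

def total_rows(conn, table_name):
    """Count rows in a table whose name passed whitelist validation.
    Placeholders (?) cannot bind identifiers such as table names,
    so validation must happen before the name is interpolated."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {table_name!r}")
    cur = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"')
    return cur.fetchone()[0]
```

Any injected string such as `"users; DROP TABLE users"` fails the membership check and never reaches the query text.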



References
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv 2018, arXiv:1809.08887. [Google Scholar]
- Lachaux, M.A.; Roziere, B.; Chanussot, L.; Lample, G. Unsupervised translation of programming languages. arXiv 2020, arXiv:2006.03511. [Google Scholar] [CrossRef]
- Shypula, A.; Madaan, A.; Zeng, Y.; Alon, U.; Gardner, J.; Hashemi, M.; Neubig, G.; Ranganathan, P.; Bastani, O.; Yazdanbakhsh, A. Learning performance-improving code edits. arXiv 2023, arXiv:2302.07867. [Google Scholar]
- Pearce, H.; Tan, B.; Ahmad, B.; Karri, R.; Dolan-Gavitt, B. Examining Zero-Shot Vulnerability Repair with Large Language Models. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–25 May 2023. [Google Scholar]
- Wong, M.F.; Guo, S.; Hang, C.N.; Ho, S.W.; Tan, C.W. Natural language generation and understanding of big code for ai-assisted programming: A review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef]
- Hermann, K.; Peldszus, S.; Steghöfer, J.P.; Berger, T. An Exploratory Study on the Engineering of Security Features. In Proceedings of the International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025. [Google Scholar]
- Spiess, C.; Gros, D.; Pai, K.S.; Pradel, M.; Rabin, M.R.I.; Alipour, A.; Jha, S.; Devanbu, P.; Ahmed, T. Calibration and correctness of language models for code. In Proceedings of the International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024. [Google Scholar]
- Zhang, T.; Yu, Y.; Mao, X.; Wang, S.; Yang, K.; Lu, Y.; Zhang, Z.; Zhao, Y. Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering. In Proceedings of the International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024. [Google Scholar]
- Chen, X.; Lin, M.; Schärli, N.; Zhou, D. Teaching large language models to self-debug. arXiv 2023, arXiv:2304.05128. [Google Scholar] [CrossRef]
- Athiwaratkun, B.; Gouda, S.K.; Wang, Z. Multi-lingual Evaluation of Code Generation Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Siddiq, M.L.; Casey, B.; Santos, J.C.S. A Lightweight Framework for High-Quality Code Generation. arXiv 2023, arXiv:2307.08220. [Google Scholar] [CrossRef]
- Le, H.; Sahoo, D.; Zhou, Y.; Xiong, C.; Savarese, S. INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Tony, C.; Mutas, M.; Díaz Ferreyra, N.; Scandariato, R. LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations. In Proceedings of the 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia, 15–16 May 2023. [Google Scholar]
- Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s verify step by step. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Siddiq, M.; Santos, J. SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P S22), Virtually, 18 November 2022. [Google Scholar]
- Zheng, Q.; Xia, X.; Zou, X.; Dong, Y.; Wang, S.; Xue, Y.; Wang, Z.; Shen, L.; Wang, A.; Li, Y.; et al. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv 2023, arXiv:2303.17568. [Google Scholar]
- Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program synthesis with large language models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
- Zhou, S.; Alon, U.; Xu, F.F.; Wang, Z.; Jiang, Z.; Neubig, G. DocPrompting: Generating Code by Retrieving the Docs. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Nijkamp, E.; Hayashi, H.; Xiong, C.; Savarese, S.; Zhou, Y. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Allamanis, M.; Jackson-Flux, H.; Brockschmidt, M. Self-supervised bug detection and repair. Adv. Neural Inf. Process. Syst. 2021, 34, 27865–27876. [Google Scholar]
- Rasooli, M.S.; Tetreault, J.R. Yara Parser: A Fast and Accurate Dependency Parser. arXiv 2015, arXiv:1503.06733. [Google Scholar] [CrossRef]
- Nam, D.; Macvean, A.; Hellendoorn, V.; Vasilescu, B.; Myers, B. Using an LLM to Help With Code Understanding. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024; IEEE Computer Society: Piscataway, NJ, USA, 2024; pp. 1–13. [Google Scholar]
- Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software testing with large language models: Survey, landscape, and vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
- Aggarwal, P.; Madaan, A.; Yang, Y.; Mausam. Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12375–12396. [Google Scholar]
- Alrashedy, K.; Hellendoorn, V.J.; Orso, A. Learning Defect Prediction from Unrealistic Data. arXiv 2023, arXiv:2311.00931. [Google Scholar]
- Chakraborty, S.; Krishna, R.; Ding, Y.; Ray, B. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 2021, 48, 3280–3296. [Google Scholar] [CrossRef]
- Andrew, G.; Gao, J. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 33–40. [Google Scholar]
- Ando, R.K.; Zhang, T. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. J. Mach. Learn. Res. 2005, 6, 1817–1853. [Google Scholar]
- Bhatt, M.; Chennabasappa, S.; Nikolaidis, C.; Wan, S.; Evtimov, I.; Gabi, D.; Song, D.; Ahmad, F.; Aschermann, C.; Fontana, L.; et al. Purple llama cyberseceval: A secure coding benchmark for language models. arXiv 2023, arXiv:2312.04724. [Google Scholar]
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-refine: Iterative refinement with self-feedback. In Proceedings of the Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Olausson, T.X.; Inala, J.P.; Wang, C.; Gao, J.; Solar-Lezama, A. Is Self-Repair a Silver Bullet for Code Generation? In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Gou, Z.; Shao, Z.; Gong, Y.; Shen, Y.; Yang, Y.; Duan, N.; Chen, W. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv 2023, arXiv:2305.11738. [Google Scholar]
- Elgohary, A.; Meek, C.; Richardson, M.; Fourney, A.; Ramos, G.; Awadallah, A.H. NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Online, 6–11 June 2021. [Google Scholar]
- Bai, Y.; Jones, A.; Ndousse, K. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv 2023, arXiv:2204.05862. [Google Scholar]
- Yasunaga, M.; Liang, P. Graph-based, self-supervised program repair from diagnostic feedback. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 10799–10808. [Google Scholar]
- Bafatakis, N.; Boecker, N.; Boon, W.; Salazar, M.C.; Krinke, J.; Oznacar, G.; White, R. Python coding style compliance on stack overflow. In Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 25–31 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 210–214. [Google Scholar]
- Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W.; et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv 2023, arXiv:2302.12813. [Google Scholar] [CrossRef]
- Yang, K.; Tian, Y.; Peng, N.; Klein, D. Re3: Generating longer stories with recursive reprompting and revision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
- Wang, B.; Shin, R.; Liu, X.; Polozov, O.; Richardson, M. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7567–7578. [Google Scholar]
- Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
- Gusfield, D. Algorithms on Strings, Trees and Sequences; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
- Zhuo, T.Y.; Vu, M.C.; Chim, J.; Hu, H.; Yu, W.; Widyasari, R.; Yusuf, I.N.B.; Zhan, H.; He, J.; Paul, I.; et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. arXiv 2024, arXiv:2406.15877. [Google Scholar]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: New York, NY, USA, 2023; pp. 10764–10799. [Google Scholar]
- Akyürek, A.F.; Akyürek, E.; Kalyan, A.; Clark, P.; Wijaya, D.; Tandon, N. RL4F: Generating natural language feedback with reinforcement learning for repairing model outputs. In Proceedings of the Annual Meeting of the Association of Computational Linguistics 2023, Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2023; pp. 7716–7733. [Google Scholar]
- Aho, A.V.; Ullman, J.D. The Theory of Parsing, Translation and Compiling; Prentice-Hall: Englewood Cliffs, NJ, USA, 1972; Volume 1. [Google Scholar]
- CodeQL. Available online: https://codeql.github.com (accessed on 4 March 2025).





| Parameter | Value | Description |
|---|---|---|
| J | 3 | Number of diverse solution strategies generated for each detected vulnerability |
| K | 3 | Maximum refinement iterations applied per solution (terminates early upon successful vulnerability remediation) |
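Under the assumption of one LLM call per generated strategy and one per applied patch (plus the initial generation), the worst-case per-task LLM call budget implied by these parameters is easy to bound. This accounting is our own illustration of the cost trade-off, not a measurement:

```python
def worst_case_llm_calls(J=3, K=3):
    """Upper bound on LLM calls per task: 1 initial generation, plus,
    in each of K iterations, J strategy generations and up to J patch
    applications. Early termination usually keeps the real count lower."""
    return 1 + K * 2 * J
```

With the default J = 3, K = 3, the bound is 19 calls per task, which is consistent with FDSP being the most expensive method in the computational cost comparison.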
| Domain | Library |
|---|---|
| Computation | os, pandas, numpy, sklearn, scipy, math, nltk, statistics, cv2, statsmodels, tensorflow, sympy, textblob, skimage |
| System | os, json, csv, shutil, glob, subprocess, pathlib, io, zipfile, sys, logging, pickle, struct, psutil |
| Network | requests, urllib, bs4, socket, django, flask, ipaddress, smtplib, http, flask_mail, cgi, ssl, email, mechanize, url |
| Cryptography | hashlib, base64, binascii, codecs, rsa, cryptography, hmac, blake3, secrets, Crypto |
| General | random, re, collections, itertools, string, operator, heapq, ast, functools, regex, bisect, inspect, unicodedata |
| Database | sqlite3, mysql, psycopg2, sqlalchemy, pymongo, sql |
| Web Frameworks | Django, Flask, FastAPI, Tornado, Pyramid, Bottle |
| Ablation Experiment | Bandit | CodeQL |
|---|---|---|
| Generated code | 40.2% | 17.9% |
| FDSP with single solution | 10.0% (+2.6%) | 8.7% (+1.0%) |
| FDSP with single iteration | 9.5% (+2.1%) | 7.9% (+0.2%) |
| FDSP | 7.4% | 7.7% |
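
The percentages above are the fraction of generated programs that each analyzer still flags. As an illustration of how such a metric is computed from Bandit's machine-readable output (not the authors' evaluation harness; the helper names are ours), each program's JSON report carries a `results` list with one entry per finding:

```python
import json

# Per-file reports can be produced with Bandit's CLI, e.g.:
#   bandit -q -f json generated_program.py
# The JSON document's "results" list holds one entry per finding.

def is_flagged(bandit_json: str) -> bool:
    """True if a Bandit JSON report contains at least one finding."""
    return bool(json.loads(bandit_json).get("results"))

def insecure_rate(reports: list[str]) -> float:
    """Percentage of programs with at least one reported issue."""
    flagged = sum(is_flagged(r) for r in reports)
    return 100.0 * flagged / len(reports)
```

Under this reading, the "Generated code" row (40.2% for Bandit) is the rate over the unpatched outputs, and each FDSP row is the rate over the corresponding patched outputs.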
| Method | Mean ± SD (s) | API Cost (USD) |
|---|---|---|
| Direct Prompting | 11.27 ± 4.00 | 1.05 |
| Self-Debugging | 22.29 ± 6.00 | 12.43 |
| Direct Bandit Feedback | 8.54 ± 3.35 | 5.32 |
| Verbalization | 13.73 ± 4.57 | 8.06 |
| FDSP (ours) | 41.64 ± 25.06 | 25.32 |
| Dataset | Round | Bandit | CodeQL |
|---|---|---|---|
| INDICT | Round 1 | 27.9% | 14.0% |
| | Round 2 | 23.4% | 11.7% |
| | Round 3 | 19.4% | 12.1% |
| | Round 4 | 20.6% | 15.3% |
| | Round 5 | 22.8% | 14.6% |
| FDSP (Ours) | – | 7.4% | 7.7% |
| CWE Category | Example CWE | Static Analysis | Dynamic/Hybrid Analysis |
|---|---|---|---|
| Input Validation | CWE-20, CWE-79 | ✓ | – |
| SQL Injection | CWE-89 | ✓ | – |
| Command Injection | CWE-78 | ✓ | – |
| Hard-coded Secrets | CWE-259, CWE-798 | ✓ | – |
| Path Traversal | CWE-22 | ✓ | – |
| Denial of Service | CWE-400 | – | ✓ |
| Insecure Deserialization | CWE-502 | ✓ | – |
| Improper Authentication | CWE-287, CWE-306 | – | ✓ |
| Race Condition/Resource Contention | CWE-362 | – | ✓ |
| Cryptographic Weaknesses | CWE-327 | ✓ | ✓ |
| Cross-Site Request Forgery (CSRF) | CWE-352 | – | ✓ |
| Memory Errors/Buffer Overflow | CWE-119, CWE-125 | ✓ | ✓ |
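
Among the statically detectable categories above, OS command injection (CWE-78) is a representative case: it arises when untrusted input reaches a shell, and Bandit flags `subprocess` calls with `shell=True` (check B602). The following sketch shows the vulnerable pattern and its standard remediation; the `ping` wrapper is our illustrative example, not code from the paper:

```python
import shlex
import subprocess

def ping_vulnerable(host: str) -> int:
    # CWE-78: `host` is interpolated into a shell command line, so an
    # input such as "8.8.8.8; rm -rf /" executes a second command.
    # Bandit reports this shell=True call as B602.
    return subprocess.call(f"ping -c 1 {host}", shell=True)

def ping_patched(host: str) -> int:
    # Remediation: pass an argument vector and no shell, so any
    # metacharacters in `host` reach ping as literal text.
    return subprocess.call(["ping", "-c", "1", host])

def quote_for_shell(arg: str) -> str:
    # If a shell is genuinely required, quote untrusted input first.
    return shlex.quote(arg)
```

The dynamic-only categories in the table (e.g., race conditions, CWE-362) have no such syntactic signature, which is why a purely static pipeline like the one evaluated here cannot report them.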
| CWE Type (Name) | CWE ID | Total | Percentage (%) |
|---|---|---|---|
| OS Command Injection | CWE-78 | 22 | 44.0 |
| SQL Injection | CWE-89 | 9 | 18.0 |
| Hard-coded Password | CWE-259 | 5 | 10.0 |
| Insecure Deserialization | CWE-502 | 4 | 8.0 |
| Race Condition/Improper Synchronization | CWE-362 | 1 | 2.0 |
| Multiple Binds to Same Port | CWE-605 | 2 | 4.0 |
| Weak Random Number Generation | CWE-330 | 1 | 2.0 |
| Improper Input Validation | CWE-20 | 1 | 2.0 |
| Insecure Temporary File Creation | CWE-377 | 1 | 2.0 |
| Open Redirect | CWE-601 | 1 | 2.0 |
| Path Traversal | CWE-22 | 1 | 2.0 |
| Total | – | 50 | 100.0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Alrashedy, K.; Aljasser, A.; Tambwekar, P.; Gombolay, M. Leveraging Static Analysis for Feedback-Driven Security Patching in LLM-Generated Code. J. Cybersecur. Priv. 2025, 5, 110. https://doi.org/10.3390/jcp5040110