A Survey on Large Language Models in Software Security: Opportunities and Threats
Abstract
1. Introduction
- RQ1: How does the integration of LLMs alter traditional secure development practices across the Secure Development Lifecycle, including vulnerability detection, code review, DevSecOps, and threat modeling?
- RQ2: What latent security risks arise from widespread LLM usage, particularly regarding insecure code propagation, flawed automated repairs, and sensitive-data leakage?
- RQ3: How can explainability, traceability, and auditability of LLM-assisted security decisions be strengthened to satisfy assurance and regulatory requirements?
2. Methodology
2.1. Search Strategy and Databases
- ACM Digital Library (conference and journal articles on software engineering, security, and programming languages).
- IEEE Xplore (software engineering, dependable systems, and security venues).
- SpringerLink (journals such as Empirical Software Engineering and Software and Systems Modeling).
- ScienceDirect (Elsevier journals, including Journal of Systems and Software and High-Confidence Computing).
- Scopus (used as a broad multidisciplinary index to cross-check coverage and retrieve additional peer-reviewed articles).
- arXiv (preprints; treated as a supplementary source and explicitly distinguished from indexed venues).
Search Strings
2.2. Study Selection and Screening
- ACM Digital Library: 132 records.
- IEEE Xplore: 155 records.
- SpringerLink: 102 records.
- ScienceDirect: 88 records.
- Scopus: 444 records.
- arXiv (preprints): 196 records.
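
As a quick arithmetic check, the per-database counts above can be summed to give the size of the initial pool before deduplication. This is a minimal sketch: the variable names are illustrative, and the total is derived here from the listed counts rather than quoted from the review protocol. Because Scopus is used as a broad cross-checking index, the raw total likely overstates the number of unique records.

```python
# Records retrieved per source, as listed above (before deduplication).
records = {
    "ACM Digital Library": 132,
    "IEEE Xplore": 155,
    "SpringerLink": 102,
    "ScienceDirect": 88,
    "Scopus": 444,
    "arXiv": 196,
}

# Size of the initial pool and the fraction contributed by preprints.
total = sum(records.values())
preprint_share = records["arXiv"] / total

print(f"Total records retrieved: {total}")  # 1117
print(f"arXiv share of initial pool: {preprint_share:.1%}")
```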
2.2.1. Deduplication and Consistency Checks
2.2.2. Title and Abstract Screening
- Not LLM-related or not focused on code (e.g., general NLP security, social media analysis): 241 records.
- No substantive security-relevant outcome (e.g., productivity-only studies of code completion): 126 records.
2.3. Full-Text Assessment and Eligibility
- No empirical evaluation (e.g., conceptual position papers, essays, or vision pieces): 46 articles.
- Insufficient security focus (e.g., developer productivity studies with only incidental mention of security): 26 articles.
- Incomplete or non-reproducible results (e.g., missing evaluation setup or datasets): 12 articles.
2.4. Inclusion and Exclusion Criteria
2.4.1. Inclusion Criteria
- Explicitly involved the use, evaluation, or analysis of an LLM (or a closely related large code model) in software engineering.
- Addressed security-relevant tasks such as vulnerability detection, patch generation, secure code generation, threat modeling, data leakage, or security/privacy assessment.
- Reported empirical evidence, such as quantitative benchmarks, experiments, or structured case studies (including industrial case reports).
2.4.2. Exclusion Criteria
- Did not involve LLMs or large code models (e.g., traditional static or ML-only methods without comparison to LLMs).
- Focused on NLP or security topics unrelated to source code or software systems.
- Lacked empirical evaluation (conceptual essays, tutorials, or purely theoretical discussions).
- Provided no security-relevant outcomes (e.g., productivity-only evaluations of code completion tooling).
- Were duplicates, near-duplicates, or inaccessible in full text.
2.5. Treatment of Preprints and Indexed Versions
- Peer-reviewed publications (journals, conferences, and workshops indexed in Scopus, ACM, IEEE, SpringerLink, or ScienceDirect) were prioritized in the narrative synthesis.
- Non-academic sources (blog posts, vendor white papers, and ChatGPT-generated content) were not treated as primary evidence and were used only to contextualize tooling or practice when strictly necessary.
2.6. Data Extraction and Coding
- Bibliographic metadata (year, venue, publication type: journal, conference, workshop, preprint, or technical report).
- LLM or model family evaluated (e.g., GPT-3/3.5/4, Codex, Code Llama, LLaMA 2, DeepSeek-R1, CodeBERT, GraphCodeBERT).
- Security tasks and scenarios (e.g., vulnerability detection, secure code generation, patch synthesis, malware generation, data leakage analysis, threat modeling).
- Evaluation metrics (e.g., accuracy, precision, recall, F1, repair success rate, exploitability of generated code).
- Key findings, mapped to the three research questions RQ1–RQ3.
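
The detection metrics named above follow the standard confusion-matrix definitions. The sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. The function name, the result type, and the example counts are hypothetical and are not drawn from any reviewed study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionMetrics:
    precision: float
    recall: float
    f1: float


def detection_metrics(tp: int, fp: int, fn: int) -> DetectionMetrics:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0, a common
    convention; individual studies may handle this case differently.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return DetectionMetrics(precision, recall, f1)


# Hypothetical detector: 40 true positives, 10 false positives, 20 false negatives.
example = detection_metrics(tp=40, fp=10, fn=20)
print(example)
```

Values computed this way are comparable across studies only when the positive class, the dataset, and the granularity (function, file, or commit) match, which is one reason heterogeneous benchmarks limit direct comparison.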
2.7. Quality and Bias Assessment
2.8. Limitations of the Review Process
- Screening capacity. Screening and coding were carried out by a limited number of reviewers, which may introduce selection bias despite the use of a shared protocol and consensus discussions.
- Preprint variability. Preprints (especially from arXiv) vary in rigor. While they are important for tracking the rapidly evolving LLM landscape, conclusions based on preprints are treated as provisional.
- Heterogeneous benchmarks. Studies employ different datasets, vulnerability taxonomies, and metrics, which limits direct comparability of quantitative results across papers.
- Rapid model evolution. New model releases, fine-tuned variants, and updated APIs may outpace published evaluations, meaning that some empirical findings may become outdated quickly.
3. Results and Discussion
3.1. RQ1: LLM Integration and Autonomy in Secure Development
3.2. RQ2: Latent Security Risks of LLM Code Generation
3.3. RQ3: Explainability, Auditability, and Compliance
4. Our Findings
4.1. F1: LLMs Function as High-Bandwidth Security Assistants Rather than Autonomous Engineers
4.2. F2: Benefits Concentrate on Visibility, Coverage, and Cognitive Support
4.3. F3: Security Risks Are Structural and Persistent Across Models
4.4. F4: Evaluation Practices Limit the Strength of Generalization
4.5. F5: Governance, Explainability, and Human Oversight Are Structural Requirements
Synthesis Across F1–F5
5. Future Work
- Rigorously reproducible evaluation protocols, not one-off benchmarks. A recurring barrier to comparability is that many studies differ in dataset curation, vulnerability taxonomies, prompt templates, decoding parameters, and reporting granularity. Future research should adopt reproducible evaluation protocols with versioned model identifiers, recorded inference settings, and standardized reporting of context-window usage and toolchain dependencies. Replication should be treated as a first-class research product, especially for assistant tools whose behavior can shift across model updates [18,22,30]. In addition, evaluations should include robustness checks that vary prompt phrasing, codebase context, and language choice to capture the instability observed in grounded-context studies of LLM understanding [43,50].
- Human–AI collaboration as a measurable security control. Results across RQ1 and RQ2 suggest that the security impact of LLMs is mediated by human acceptance and review practices. Future studies should move beyond productivity metrics and explicitly measure security outcomes in realistic workflows: acceptance rates of insecure suggestions, time-to-detection of subtle flaws, and the effectiveness of structured review checkpoints. Controlled trials should test interventions such as uncertainty signaling, escalation triggers, and “secure-by-policy” templates that force developers to verify properties (e.g., input validation, authorization checks) before merging [30,40,51]. Particular attention is needed for automation bias and over-trust, which can be amplified by persuasive but incomplete explanations [50,51].
- Security-aware dataset governance and provenance at scale. Multiple strands of evidence indicate that training data composition and contamination shape both vulnerability propagation and leakage risk. Future work must establish dataset governance standards for code corpora: provenance tracking, license auditing, deduplication, vulnerability filtering, and continuous maintenance. This includes empirically validating whether removing known-vulnerable snippets and duplicated repositories reduces insecure-by-default generation without harming functional correctness. Broader ML security surveys highlight the need to treat leakage and memorization as structural risks when sensitive code is present in training or fine-tuning pipelines [32,33]. Dataset releases should provide security-centered documentation (e.g., known CVE contamination, duplication metrics, licensing risk flags) to enable reliable downstream evaluation [14,73].
- Semantics-first patch evaluation and program–repair integration. Repair remains a bottleneck: many generated patches are syntactically plausible yet semantically incomplete, especially for vulnerabilities that require cross-function invariants or precise state reasoning. Future research should integrate neural program repair insights with formal and test-based validation to evaluate patches for semantic correctness, exploitability reduction, and regression safety [69,70]. This implies moving beyond pass/fail compilation metrics toward property-based tests, differential testing, and proof-carrying evidence where feasible. Comparative studies should also benchmark hybrid pipelines that combine LLM patch proposals with deterministic analyzers and constraint solvers, aligning with broader secure-coding tool surveys that emphasize layered defenses [14,71].
- Robustness to adversarial manipulation, including backdoors and prompt injection. RQ2 and RQ3 findings show that adversarial manipulation is a realistic concern for code-generating models and agentic toolchains. Future work should systematically evaluate backdoor and data poisoning threats, including how malicious triggers survive instruction tuning and how they manifest in code suggestions [36,39]. Similarly, evaluations should cover prompt-injection and context-manipulation attacks in retrieval-augmented and tool-using assistants, where untrusted inputs can steer generation. Defensive research should develop and validate detection methods (behavioral probes, anomaly detection, provenance-based filtering) that operate under black-box constraints common in proprietary tools [39,45].
- Explainability that is faithful, auditable, and compliance-aligned. The field needs explanation mechanisms that are not only readable but also faithful and verifiable. Systematic reviews of explainable AI for security and empirical work on reasoning failures show that LLM explanations can diverge from actual model behavior, producing false confidence in fixes and threat assessments [47,50,51]. Future research should define evaluation criteria for explanation faithfulness in software security settings (e.g., alignment with data-flow evidence, consistency under perturbations, calibration to uncertainty). In parallel, compliance-driven contexts require traceability from model outputs to controls and evidence artifacts; governance-oriented work highlights accountability gaps if such traceability is absent [30,45].
- Threat modeling and DevSecOps integration with measured assurance outcomes. As LLMs are increasingly integrated into DevSecOps, future studies should test how AI affects threat modeling quality, security gate performance, and incident response readiness. AI-driven threat modeling offers opportunities (faster enumeration of attack surfaces) but also risks (shallow or biased coverage). Future work should benchmark LLM-assisted threat modeling against expert baselines and evaluate how outputs affect downstream security decisions [41]. DevSecOps-focused reviews suggest that integration success depends on process design and organizational controls, not only model capability [40].
- Governance, accountability, and socio-technical policy design. Ultimately, trustworthy deployment is constrained by governance: who is responsible for an AI-assisted change, what is logged, what is reviewed, and what is auditable. Governance and accountability challenges in LLMs remain open, particularly in enterprise settings where “shadow AI” usage, unclear data-handling policies, and vendor opacity can undermine assurance [30,45]. Future research should propose and empirically evaluate governance patterns (policy-as-code for AI usage, audit logs for AI-assisted diffs, model-risk registers, and red-team protocols) that can be adopted without prohibitive overhead.
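
The recommendation above on versioned model identifiers and recorded inference settings can be made concrete with a small run manifest. The sketch below is one possible shape, not a proposed standard: the field names, the model identifier string, and the prompt template are all hypothetical. It fingerprints the recorded settings so that two evaluation runs can be checked for configuration drift.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    model_id: str          # exact, versioned identifier (hypothetical value below)
    temperature: float
    top_p: float
    max_tokens: int
    seed: int
    prompt_template: str
    dataset_version: str

    def fingerprint(self) -> str:
        """Stable SHA-256 digest of all recorded settings."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = RunManifest(
    model_id="example-model-2024-01-01",
    temperature=0.0,
    top_p=1.0,
    max_tokens=512,
    seed=1234,
    prompt_template="Review the following function for vulnerabilities:\n{code}",
    dataset_version="v1.0",
)
# Same settings except for the sampling temperature.
rerun = RunManifest(**{**asdict(baseline), "temperature": 0.7})

print(baseline.fingerprint() == rerun.fingerprint())  # False: settings drifted
```

Publishing the manifest alongside results lets a replication confirm that it used the same configuration before comparing numbers.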
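
The call above for patch evaluation beyond compilation checks can be illustrated with a small differential test. The sketch compares an original and a patched function on randomly generated benign inputs (regression safety) and then checks that known attack inputs are rejected (exploitability reduction). Both functions, the base directory, and the attack strings are hypothetical examples chosen for illustration, not taken from any reviewed study.

```python
import posixpath
import random
import string


def resolve_original(base: str, name: str) -> str:
    """Vulnerable reference: joins without checking for directory escape."""
    return posixpath.normpath(posixpath.join(base, name))


def resolve_patched(base: str, name: str) -> str:
    """Candidate patch: rejects any path that resolves outside `base`."""
    resolved = posixpath.normpath(posixpath.join(base, name))
    if resolved != base and not resolved.startswith(base + "/"):
        raise ValueError("path escapes base directory")
    return resolved


def check_patch(trials: int = 500, seed: int = 0) -> dict:
    rng = random.Random(seed)
    base = "/srv/data"

    # Differential step: on benign file names, both versions must agree.
    regressions = 0
    for _ in range(trials):
        name = "".join(rng.choices(string.ascii_lowercase + string.digits, k=8))
        if resolve_patched(base, name) != resolve_original(base, name):
            regressions += 1

    # Property step: every known traversal input must be rejected.
    attacks = ["../etc/passwd", "a/../../secret", "/etc/shadow", "../data-old/x"]
    blocked = 0
    for name in attacks:
        try:
            resolve_patched(base, name)
        except ValueError:
            blocked += 1

    return {"regressions": regressions, "blocked": blocked, "attacks": len(attacks)}


print(check_patch())  # {'regressions': 0, 'blocked': 4, 'attacks': 4}
```

A patch that compiles but fails either step would be flagged here, which a pass/fail compilation metric alone would miss.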
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
- Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open Foundation Models for Code. arXiv 2023, arXiv:2308.12950.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature 2025, 645, 633–638.
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the Findings of EMNLP 2020, Online, 16–20 November 2020; pp. 1536–1547.
- Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-Training Code Representations with Data Flow. arXiv 2021, arXiv:2009.08366.
- Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53.
- Bui, T.-D.; Vu, T.T.; Nguyen, T.-T.; Nguyen, S.; Vo, H.D. Correctness Assessment of Code Generated by Large Language Models Using Internal Representations. J. Syst. Softw. 2025, 224, 112570.
- Tamberg, K.; Bahsi, H. Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. IEEE Access 2025, 13, 29698–29717.
- Chen, Y.; Ding, Z.; Alowain, L.; Chen, X.; Wagner, D. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, 16–18 October 2023; pp. 654–668.
- Zibaeirad, A.; Vieira, M. VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching. arXiv 2024, arXiv:2409.10756.
- Wang, P.; Liu, X.; Xiao, C. CVE-Bench: Benchmarking LLM-Based Software Engineering Agent’s Ability to Repair Real-World CVE Vulnerabilities. In Proceedings of the NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 4207–4224.
- Shiri Harzevili, N.; Boaye Belle, A.; Wang, J.; Wang, S.; Jiang, Z.M.; Nagappan, N. A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning. ACM Comput. Surv. 2024, 57, 55.
- Ghaffarian, S.M.; Shahriari, H.R. Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. ACM Comput. Surv. 2017, 50, 56.
- Kumar, P. Large Language Models (LLMs): Survey, Technical Frameworks, and Future Challenges. Artif. Intell. Rev. 2024, 57, 260.
- Sheng, Z.; Chen, Z.; Gu, S.; Huang, H.; Gu, G.; Huang, J. LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights. ACM Comput. Surv. 2025, 58, 134.
- Negri-Ribalta, C.; Geraud-Stewart, R.; Sergeeva, A.; Lenzini, G. A Systematic Literature Review on the Impact of AI Models on the Security of Code Generation. Front. Big Data 2024, 7, 1386720.
- Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. In Proceedings of the 2022 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Singapore, 14–18 November 2022; pp. 1215–1227.
- Siddiq, M.L.; Santos, J.C.S. SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S), Pittsburgh, PA, USA, 17–18 November 2022; pp. 29–33.
- Nguyen, N.; Nadi, S. An Empirical Evaluation of GitHub Copilot’s Code Suggestions. In Proceedings of the 19th International Conference on Mining Software Repositories (MSR ’22), Pittsburgh, PA, USA, 23–24 May 2022; pp. 1–5.
- Tihanyi, N.; Bisztray, T.; Jain, R.; Ferrag, M.A.; Cordeiro, L.C.; Mavromatis, V. How Secure is AI-Generated Code: A Large-Scale Comparison of Large Language Models. Empir. Softw. Eng. 2024, 29, 138.
- Siddiq, M.; Gopinath, R.; Bhat, P. Empirical Evaluation of GitHub Copilot for Security Vulnerabilities. J. Syst. Softw. 2023, 203, 111915.
- Tambon, F.; Nikanjam, A.; An, L.; Khomh, F.; Antoniol, G. Bugs in Large Language Models Generated Code: An Empirical Study. Empir. Softw. Eng. 2025, 30, 80.
- Mastropaolo, A.; Ciniselli, M.; Cooper, N.; Palacio, D.N.; Poshyvanyk, D.; Oliveto, R.; Bavota, G. On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 2149–2160.
- Perry, N.; Srivastava, M.; Kumar, D.; Boneh, D. Do Users Write More Insecure Code with AI Assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23), Copenhagen, Denmark, 26–30 November 2023; pp. 2785–2799.
- Fu, M.; Tantithamthavorn, C.K.; Nguyen, V.; Le, T. ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We? In Proceedings of the 30th Asia-Pacific Software Engineering Conference (APSEC 2023), Seoul, Republic of Korea, 4–7 December 2023; pp. 632–636.
- Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. Commun. ACM 2025, 68, 96–105.
- Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NeurIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 2, pp. 2503–2511.
- Sandoval, G.; Pearce, H.; Nys, T.; Karri, R.; Garg, S.; Dolan-Gavitt, B. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security ’23), Anaheim, CA, USA, 9–11 August 2023; pp. 2205–2222.
- Sallou, J.; Durieux, T.; Panichella, A. Breaking the Silence: The Threats of Using LLMs in Software Engineering. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), Lisbon, Portugal, 12–21 April 2024; pp. 102–106.
- Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security ’21), Virtual, 11–13 August 2021; pp. 2633–2650.
- Schuster, R.; Song, C.; Tromer, E.; Shmatikov, V. You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion. In Proceedings of the 30th USENIX Security Symposium (USENIX Security ’21), Virtual, 11–13 August 2021; pp. 1559–1575.
- Rigaki, M.; Garcia, S. A Survey of Privacy Attacks in Machine Learning. ACM Comput. Surv. 2023, 56, 101.
- Jahanshahi, M.; Mockus, A. Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets. In Proceedings of the 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), Ottawa, ON, Canada, 3 May 2025; pp. 104–111.
- Zhou, X.; Weyssow, M.; Widyasari, R.; Zhang, T.; He, J.; Lyu, Y.; Chang, J.; Zhang, B.; Huang, D.; Lo, D. LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks. arXiv 2025, arXiv:2502.06215.
- Ramakrishnan, G.; Albarghouthi, A. Backdoors in Neural Models of Source Code. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2892–2899.
- Shi, J.; Liu, Y.; Zhou, P.; Sun, L. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. arXiv 2023, arXiv:2304.12298.
- Cheng, W.; Sun, K.; Zhang, X.; Wang, W. Security Attacks on LLM-Based Code Completion Tools. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 23669–23677.
- Apruzzese, G.; Anderson, H.S.; Dambra, S.; Freeman, D.; Pierazzi, F.; Roundy, K. “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice. In Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), Raleigh, NC, USA, 8–10 February 2023; pp. 339–364.
- Fu, M.; Pasuksmit, J.; Tantithamthavorn, C. AI for DevSecOps: A Landscape and Future Opportunities. ACM Trans. Softw. Eng. Methodol. 2024, 33, 197.
- Elsharef, I.; Zeng, Z.; Gu, Z. Facilitating Threat Modeling by Leveraging Large Language Models. In Proceedings of the Workshop on AI Systems with Confidential Computing (AISCC 2024), San Diego, CA, USA, 26 February 2024; pp. 1–8.
- Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-Empowered Automatic Penetration Testing Framework. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security ’24), Philadelphia, PA, USA, 14–16 August 2024; pp. 1279–1296.
- Barke, S.; James, M.B.; Polikarpova, N. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proc. ACM Program. Lang. 2023, 7, 85–111.
- Gao, J.; Gebreegziabher, S.A.; Choo, K.T.W.; Li, T.J.-J.; Perrault, S.T.; Malone, T.W. A Taxonomy for Human–LLM Interaction Modes: An Initial Exploration. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24), Honolulu, HI, USA, 11–16 May 2024; pp. 1–11.
- Mökander, J.; Schuett, J.; Kirk, H.R.; Floridi, L. Auditing Large Language Models: A Three-Layered Approach. AI Ethics 2023, 4, 1085–1115.
- Brundage, M.; Avin, S.; Wang, J.; Belfield, H.; Krueger, G.; Hadfield, G.; Khlaaf, H.; Yang, J.; Toner, H.; Fong, R.; et al. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. Patterns 2020, 1, 100089.
- Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf. Fusion 2020, 58, 82–115.
- Ding, W.; Abdel-Basset, M.; Hawash, H.; Ali, A.M. Explainability of Artificial Intelligence Methods, Applications and Challenges: A Comprehensive Survey. Inf. Sci. 2022, 615, 238–292.
- Liu, Y.; Tantithamthavorn, C.; Liu, Y.; Li, L. On the Reliability and Explainability of Language Models for Program Generation. ACM Trans. Softw. Eng. Methodol. 2024, 33, 126.
- Huang, J.; Chang, K.C.-C. Towards Reasoning in Large Language Models: A Survey. In Proceedings of the Findings of the Association for Computational Linguistics (ACL 2023), Toronto, ON, Canada, 9–14 July 2023; pp. 1049–1065.
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248.
- Klemmer, J.H.; Horstmann, S.A.; Patnaik, N.; Ludden, C.; Burton, C.; Powers, C.; Massacci, F.; Rahman, A.; Votipka, D.; Lipford, H.R.; et al. Using AI Assistants in Software Development: A Qualitative Study on Security Practices and Concerns. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS ’24), Salt Lake City, UT, USA, 14–18 October 2024; pp. 2726–2740.
- National Institute of Standards and Technology. NIST SP 800-218A: Secure Software Development Framework (SSDF) with AI System-Specific Practices; Technical Report; U.S. Department of Commerce: Washington, DC, USA, 2024. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-218A.pdf (accessed on 15 December 2025).
- Rzig, D.; Chakraborty, S.; Haiduc, S.; Shahriar, H. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead. ACM Trans. Softw. Eng. Methodol. 2025, 34, 145.
- Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report EBSE-2007-01; Software Engineering Group, School of Computer Science and Mathematics, Keele University: Keele, UK; Department of Computer Science, University of Durham: Durham, UK, 2007; Volume 2.3, pp. 1–57.
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
- Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly. High-Confid. Comput. 2024, 4, 100211.
- Mohsin, A.; Janicke, H.; Wood, A.; Sarker, I.H.; Maglaras, L.; Janjua, N. Can We Trust Large Language Models Generated Code? arXiv 2024, arXiv:2406.12513.
- Li, Y.; Li, X.; Wu, H.; Xu, M.; Zhang, Y.; Cheng, X.; Xu, F.; Zhong, S. Everything You Wanted to Know About LLM-Based Vulnerability Detection but Were Afraid to Ask. arXiv 2025, arXiv:2504.13474.
- Fakih, M.; Dharmaji, R.; Bouzidi, H.; Araya, G.Q.; Ogundare, O.; Faruque, M.A. LLM4CVE: Enabling Iterative Automated Vulnerability Repair with Large Language Models. arXiv 2025, arXiv:2501.03446.
- Fu, Y.; Liang, P.; Li, Z.; Shahin, M.; Yu, J.; Chen, J. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Trans. Softw. Eng. Methodol. 2025, 34, 218.
- Lyu, M.R.; Ray, B.; Roychoudhury, A.; Tan, S.H.; Thongtanunam, P. Automatic Programming: Large Language Models and Beyond. ACM Trans. Softw. Eng. Methodol. 2025, 34, 140.
- Ullah, S.; Han, M.; Pujar, S.; Pearce, H.; Coskun, A.; Stringhini, G. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; pp. 862–880.
- Majdinasab, V.; Bishop, M.J.; Rasheed, S.; Moradidakhel, A.; Tahir, A.; Khomh, F. Assessing the Security of GitHub Copilot’s Generated Code—a Targeted Replication Study. In Proceedings of the 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 12–15 March 2024; pp. 435–444. [Google Scholar] [CrossRef]
- Hossain, S.J. LLM-USED: Repository of LLM Framework Usage Data and Plots. 2025. Available online: https://github.com/shafayetjamilhossain205/LLM-USED (accessed on 28 September 2025).
- Chong, C.J.; Yao, Z.; Neamtiu, I. Artificial-Intelligence Generated Code Considered Harmful. arXiv 2024, arXiv:2409.19182. [Google Scholar] [CrossRef]
- Sultana, S.; Afreen, S.; Eisty, N. Code Vulnerability Detection: A Comparative Analysis of Emerging LLMs. arXiv 2024, arXiv:2409.10490. [Google Scholar]
- Li, Y.; Shezan, F.H.; Wei, B.; Wang, G.; Tian, Y. SoK: Towards Effective Automated Vulnerability Repair. In Proceedings of the 34th USENIX Security Symposium, Seattle, WA, USA, 13–15 August 2025. [Google Scholar]
- Zhong, W.; Hu, Q.; Zhu, Q.; Zhang, H. Neural Program Repair: Systems, Challenges and Solutions. In Proceedings of the 13th Asia-Pacific Symposium on Internetware (Internetware 2022), Beijing, China, 6 August 2022; pp. 1–10. [Google Scholar] [CrossRef]
- Bhandari, G.; Naseer, A.; Moonen, L. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE ’21), Athens, Greece, 20–21 August 2021; pp. 30–39. [Google Scholar] [CrossRef]
- Bouzid, R.; Khoury, R. Assessing the Effectiveness of ChatGPT in Secure Code Development: A Systematic Literature Review. ACM Comput. Surv. 2025, 57, 324. [Google Scholar] [CrossRef]
- Xia, C.S.; Wei, Y.; Zhang, L. Automated Program Repair in the Era of Large Pre-Trained Language Models. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023), Melbourne, Australia, 14–20 May 2023; pp. 1482–1494. [Google Scholar] [CrossRef]
- Khoury, R.; Avila, A.R.; Brunelle, J.; Camara, B.M. How Secure is Code Generated by ChatGPT? In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2445–2451. [Google Scholar] [CrossRef]
- Al-Kaswan, A.; Izadi, M.; van Deursen, A. Traces of Memorisation in Large Language Models for Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24), Lisbon, Portugal, 14–20 April 2024; pp. 1–12. [Google Scholar] [CrossRef]
- Al-Boghdady, A.; Wassif, K.; El-Ramly, M. The Presence, Trends, and Causes of Security Vulnerabilities in Operating Systems of IoT’s Low-End Devices. Sensors 2021, 21, 2329. [Google Scholar] [CrossRef] [PubMed]
- Tóth, R.; Bisztray, T.; Erdodi, L. LLMs in Web Development: Evaluating LLM-Generated PHP Code—Unveiling Vulnerabilities and Limitations. In Proceedings of the Computer Safety, Reliability, and Security. SAFECOMP 2024 Workshops (DECSoS, SASSUR, TOASTS, and WAISE) Florence, Italy, 17–20 September 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 14989, pp. 425–437. [Google Scholar] [CrossRef]
- Pa Pa, Y.M.; Tanizaki, S.; Kou, T.; van Eeten, M.; Yoshioka, K.; Matsumoto, T. An Attacker’s Dream? Exploring the Capabilities of ChatGPT for Developing Malware. In Proceedings of the 16th Cyber Security Experimentation and Test Workshop (CSET 2023), Marina del Rey, CA, USA, 7–8 August 2023; pp. 10–18.
- Cotroneo, D.; Foggia, A.; Improta, C.; Liguori, P.; Natella, R. Automating the Correctness Assessment of AI-Generated Code for Security Contexts. J. Syst. Softw. 2024, 216, 112113.
- Tomassi, A. Data Security and Privacy Concerns for Generative AI Platforms. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2024.
- Kumamoto, T.; Yoshida, Y.; Fujima, H. Evaluating Large Language Models in Ransomware Negotiation: A Comparative Analysis of ChatGPT and Claude. Res. Sq. 2023.
- Prompt Security Team. Securing AI for Cymulate: A Case Study in Controlled AI Adoption; Prompt Security Technical Reports; Prompt Security Ltd.: Tel Aviv, Israel, 2025; Available online: https://prompt.security/blog/case-study-securing-ai-for-cymulate-ensuring-safe-ai-adoption-across-teams (accessed on 12 December 2025).

| No. | Study (Abbrev.) | DC | BS | RT | Ref. |
|---|---|---|---|---|---|
| 1 | Systematic literature review on AI models and code-generation security | High | Moderate | High | [17] |
| 2 | Assessing the security of GitHub Copilot’s generated code (2023 version; replication study) | High | High | Moderate | [64] |
| 3 | DiverseVul (large-scale vulnerable code dataset) | High | Moderate | High | [10] |
| 4 | BadGPT (security vulnerabilities/backdoor attacks on LLMs) | Moderate | Low | Low | [37] |
| 5 | Cracks in The Stack (risks in The Stack v2 training dataset) | High | Moderate | Moderate | [34] |

| Framework/Practice | Tools Studied | Reported Effects | Limitations | Refs. |
|---|---|---|---|---|
| SDL and DevSecOps | GPT-4, Code Llama, security-tuned LLMs | Faster triage; cross-language reasoning; backlog summarization | High false positives; uneven precision; limited business-logic coverage | [16,17,40,63,67] |
| Code Review | Copilot (2023 version), ChatGPT (GPT-3.5/GPT-4, OpenAI, 2023), VulnLLMEval, assistant-style tools | Inline CWE detection; patch suggestions; natural language explanations | Insecure outputs persist; oversimplified or partial fixes; hallucinated rationales | [11,18,22,27,64,71] |
| Threat Modeling | GPT-4, Claude, other chat models | Attack-path sketches; misuse/abuse cases; STRIDE-style ideation | Limited depth; missing domain-specific threats; requires expert validation | [15,41,42,57] |
| Supply Chain and Datasets | The Stack v2, public code corpora | CVEs and licensing issues surfaced; data-quality analysis | Recurring vulnerable patterns; contamination and data leakage across benchmarks | [34,35,59] |

| Lifecycle Phase | Tools Studied | Autonomy Level | Observed Gaps | Refs. |
|---|---|---|---|---|
| Requirements | Prompt-based assistants; RAG systems | Assistive only | Needs human-driven prioritization and scope definition | [15,16] |
| Design | Chat-based LLMs with security patterns | Assistive only | No formal guarantees for architectural correctness; limited threat coverage | [41,42,44,62] |
| Coding | Copilot (2023 version), ChatGPT (GPT-3.5/GPT-4, OpenAI, 2023), other assistants | Partial automation | Insecure outputs persist; new vulnerabilities introduced; context gaps | [27,43,61,64] |
| Testing | VulnLLMEval, CVE-Bench, DiverseVul | Partial automation | Mislabeling of patched vs. vulnerable samples; benchmark leakage | [10,11,12,59] |
| Deployment and Repair | LLM-based auto-patch pipelines | Partial automation (∼10–30% of CVEs fixed) | Repair success remains low; generalization limited; semantic correctness hard to guarantee | [60,68,69,70,72] |
| Maintenance | Empirical vulnerability detection | Assistive only | Vulnerabilities persist across releases; risk of accumulated technical debt | [26,28,54] |

| STRIDE Category | Vulnerability Type | Dataset/Benchmark | Avg. Occurrence | Ref. |
|---|---|---|---|---|
| Information Disclosure | SQL Injection | CVE-Bench | 2.36% | [12] |
| Elevation of Privilege | Cross-Site Scripting (XSS) | CVE-Bench | 20.24% | [12] |
| Denial of Service | HTTP Response Splitting | CVE-Bench | 2.36% | [12] |

| Risk Category | Examples Observed | Affected Tools/Models | Reported Impact | References |
|---|---|---|---|---|
| Insecure Code Generation | Weak cryptography; unsafe memory ops; insecure handlers; unsafe file I/O | Copilot, Codex, GPT-style tools | Security weaknesses persist across prompts and contexts; high variance across benchmarks | [18,21,22,27,61,76,78] |
| Propagation of Vulnerable Patterns | Hardcoded secrets; injection; unsafe file handling; insecure reuse | Codex, Copilot, chat assistants | Automation bias amplifies risk when suggestions are accepted quickly; insecure reuse at scale | [13,17,29,57] |
| Incomplete/Flawed Fixes | Superficial CWE patches; non-compiling or semantics-breaking repairs | VulnLLMEval, CVE-Bench, LLM4CVE | Limited semantic repair success on real CVEs; evaluation constraints matter | [11,12,60,68,69,70] |
| Data Leakage and Memorization | Secrets; credentials; verbatim code resurfacing | Models trained on large code corpora | Small but non-trivial leakage rates; extraction/inversion risks under targeted prompts | [17,31,32,33,34,74] |
| Adversarial Misuse | Prompt injection; jailbreaks; malicious code generation; hidden behaviors | Chat assistants, code completion tools | Prompt integrity can be subverted; backdoors and misuse scenarios remain plausible | [36,37,38,39,66,77] |
| Developer Over-Reliance and Technical Debt | Automation bias; reduced manual review; deferred refactoring | Assistant-centric workflows | Higher acceptance of insecure code; accumulation of long-term technical debt | [28,29,30,52,54] |
| Organizational and Governance Risks | Shadow AI usage; unclear data-handling policies; weak accountability | Enterprise deployments | Policy violations and unclear responsibility for security failures | [45,46,79] |

| Aspect | Approach/Technique | Observed Benefits | Limitations/Gaps | Refs. |
|---|---|---|---|---|
| Explainability Mechanisms | Rule-based rationales; security-focused highlighting; step-wise prompting | Improved interpretability; supports quicker identification of obvious flaws | Traces may be unfaithful; instability under prompt/context shifts | [48,49] |
| Compliance Alignment | Mapping to ISO/IEC 27001, HIPAA, NIST SSDF-AI | Supports traceability between outputs and controls | No standardized automated scoring; heavy reliance on expert review | [53,57] |
| Assurance and Evidence | Security arguments; test-and-review pipelines; continuous monitoring | Reusable evidence for audits and certifications | Limited benchmarks for AI-enabled assurance; hard to compare tools | [45,46] |
| Model Transparency Tools | Behavioral probes; differential prompts; token-level inspection | Supports sanity checks and localized vulnerability analysis | Overhead in large projects; no agreed metrics for transparency sufficiency | [49,50] |
| Governance and Organizational Controls | Secure SDLC extensions; policy/audit logs; controlled AI adoption | Mitigates shadow AI; clarifies accountability in safety-critical systems | Implementation overhead; requires sustained operational discipline | [45,46,81] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Rashid, M.B.; Hossain, M.S.J.; Khan, M.I.; Tahora, S.; Siddika, A.; Prakash, M.I.; Yeasmin, S.; Shahriar, H. A Survey on Large Language Models in Software Security: Opportunities and Threats. Computers 2026, 15, 226. https://doi.org/10.3390/computers15040226

