A Survey on Large Language Models in Software Security: Opportunities and Threats

Rashid, Md Bajlur; Hossain, Mohammad Shafayet Jamil; Khan, Mohammad Ishtiaque; Tahora, Sharaban; Siddika, Aiasha; Prakash, Mahmudul Islam; Yeasmin, Sharmin; Shahriar, Hossain

doi:10.3390/computers15040226

by Md Bajlur Rashid ¹, Mohammad Shafayet Jamil Hossain ², Mohammad Ishtiaque Khan ³, Sharaban Tahora ³, Aiasha Siddika ⁴, Mahmudul Islam Prakash ⁵, Sharmin Yeasmin ⁶ and Hossain Shahriar ⁷,*

¹ Cybersecurity and Information Technology, University of West Florida, Pensacola, FL 32514, USA
² Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA
³ Department of Computer Science, Kennesaw State University, Marietta, GA 30060, USA
⁴ Information Technology, Georgia Gwinnett College, Lawrenceville, GA 30043, USA
⁵ Department of Mathematics & Statistics, University of West Florida, Pensacola, FL 32514, USA
⁶ Department of Computer Science, Southeast University, Dhaka 1208, Bangladesh
⁷ Center for Cybersecurity, University of West Florida, Pensacola, FL 32514, USA
* Author to whom correspondence should be addressed.

Computers 2026, 15(4), 226; https://doi.org/10.3390/computers15040226

Submission received: 23 October 2025 / Revised: 24 December 2025 / Accepted: 29 December 2025 / Published: 3 April 2026

(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (3rd Edition))

Abstract

The rise of large language models (LLMs) such as GPT-4, Codex, Code Llama, Claude 3, CodeGemma, and DeepSeek is changing the way software development is approached. These models provide strong support for tasks such as writing code, analyzing bugs, and automating development tasks. At the same time, their use in software development creates both opportunities and new risks. This survey reviews how LLMs are being used to improve security practices in software development, including vulnerability detection, secure code generation, threat analysis, and patch development. It also discusses how attackers may exploit LLMs for malicious purposes, such as writing malware, carrying out phishing campaigns, or bypassing defenses. We draw on case studies showing that LLMs can help uncover zero-day vulnerabilities and speed up secure coding, and we highlight cases where they have been misused to generate harmful code, sometimes unintentionally. The paper examines technical challenges such as bias in training data, the difficulty of interpreting model outputs, and the risks of adversarial attacks. It also considers ethical and regulatory issues related to accountability, compliance, and responsible use. By bringing together findings from recent research and industry practice, the survey outlines future directions for building safer models, developing stronger defensive frameworks, and shaping policies that balance innovation with security. Overall, the paper argues for a careful approach in which LLMs are used to strengthen software security while the risks they introduce are addressed through collaboration, oversight, and ongoing improvement.

Keywords: large language models (LLMs); software security; vulnerability detection; adversarial attacks; secure code generation; audit; ethical AI

1. Introduction

Large language models (LLMs) have rapidly transitioned from experimental natural language processing systems into foundational components of contemporary software engineering toolchains, with increasingly direct implications for software security. Early large-scale transformer-based models trained on mixed natural language and source code corpora demonstrated that a single architecture could generalize across program synthesis, translation, summarization, and bug explanation tasks [1]. These early results fundamentally altered assumptions about the separation between natural language understanding and program analysis. Subsequent generations of open and proprietary models, including Code Llama [2], LLaMA 2 [3], and reasoning-enhanced systems such as DeepSeek-R1 [4], expanded these capabilities across programming languages, paradigms, and increasingly complex reasoning tasks. In parallel, representation-learning approaches such as CodeBERT and GraphCodeBERT introduced structure-aware and data flow-aware pre-training strategies that significantly improved downstream performance on program analysis, vulnerability localization, and code understanding tasks [5,6].

These technical advances have catalyzed a rapidly growing body of empirical research examining the role of LLMs in software security. Multiple studies show that LLMs can assist with vulnerability detection, secure code generation, automated testing, and repair, often achieving higher recall and broader cross-language coverage than traditional static analysis or rule-based tools [7,8,9]. Dedicated datasets and benchmarks—including DiverseVul [10], VulnLLMEval [11], and CVE-Bench [12]—have enabled more systematic evaluation of detection accuracy, robustness, and repair correctness. Complementary surveys and systematization efforts situate these results within the broader evolution of automated secure coding, machine learning-based vulnerability detection, and AI-assisted software engineering [13,14,15,16,17].

At the same time, a substantial and growing body of peer-reviewed evidence demonstrates that LLM-assisted development introduces persistent and non-trivial security risks. Empirical studies of AI-based code assistants consistently report the generation of insecure patterns, including injection vulnerabilities, improper cryptographic usage, unsafe deserialization, flawed access control, and insufficient input validation [18,19,20,21,22,23,24,25]. Replication and empirical studies further indicate that these weaknesses are not isolated artifacts of early models but persist across versions, deployment contexts, and prompting strategies [26,27,28]. Such findings reinforce concerns about automation bias, over-trust in AI-generated suggestions, and the gradual accumulation of latent vulnerabilities in production systems [29,30].

Beyond code generation quality, LLMs raise deeper systemic security concerns related to data leakage, memorization, and adversarial manipulation. Prior work demonstrates that large models can unintentionally reveal sensitive or proprietary training data through extraction and model inversion attacks [31,32,33]. Analyses of widely used pre-training corpora reveal the presence of vulnerable, outdated, or mislicensed code, creating pathways for vulnerability propagation and legal risk at scale [34,35]. Adversarial studies further show that code-generating models are susceptible to backdoor insertion, prompt injection, and malicious steering, enabling the generation of harmful or policy-violating code even under nominal safeguards [36,37,38,39]. Collectively, these results suggest that LLM-related security issues are structural properties of probabilistic learning from large-scale corpora, rather than incidental defects that disappear with scale alone.

The increasing integration of LLMs into real-world development workflows amplifies both their benefits and their risks. AI-powered assistants are now embedded in DevSecOps pipelines, continuous integration systems, code review tools, and threat modeling activities [40,41,42]. Although such integration enables earlier vulnerability discovery and scalable reasoning across large codebases, it also complicates accountability, traceability, and governance. Empirical studies of human–AI collaboration show that developers often struggle to distinguish correct from flawed recommendations, particularly when explanations appear confident but are semantically incomplete [43,44]. These challenges raise fundamental questions about responsibility, auditability, and risk ownership in AI-augmented software engineering [45,46].

Consequently, explainability and auditability have emerged as first-order requirements for LLM deployment in security- and compliance-critical environments. Prior research explores explainable AI techniques for security, including structured reasoning, syntax-grounded explanations, and transparency mechanisms tailored to program analysis [47,48,49]. However, recent empirical evidence indicates that LLM-generated explanations can be misleading, incomplete, or internally inconsistent, particularly under adversarial or ambiguous conditions [50,51,52]. These limitations pose significant challenges for regulated domains that require verifiable evidence, reproducible audit trails, and formal assurance arguments, such as healthcare, finance, and critical infrastructure [53,54].

Figure 1 illustrates the rapid growth of research at the intersection of LLMs and software engineering since 2020. Although the overall publication volume has increased sharply, security-focused studies remain comparatively fewer and methodologically heterogeneous, spanning peer-reviewed empirical work, controlled experiments, and early-stage preprints. This heterogeneity complicates synthesis and underscores the need for a structured, evidence-driven assessment of current capabilities, limitations, and risks.

Motivated by these observations, this survey synthesizes existing research on LLMs in software security to clarify what is empirically established, where results diverge, and which challenges remain unresolved. The analysis is organized around three research questions:

RQ1: How does the integration of LLMs alter traditional secure development practices across the Secure Development Lifecycle, including vulnerability detection, code review, DevSecOps, and threat modeling?
RQ2: What latent security risks arise from widespread LLM usage, particularly regarding insecure code propagation, flawed automated repairs, and sensitive-data leakage?
RQ3: How can explainability, traceability, and auditability of LLM-assisted security decisions be strengthened to satisfy assurance and regulatory requirements?

Rather than advocating fully autonomous use, this survey emphasizes governed, human-in-the-loop integration grounded in reproducible evaluation, curated datasets, and transparent oversight. Such an approach is essential for transitioning LLMs from promising research prototypes into trustworthy components of security-critical software engineering ecosystems.

2. Methodology

To ensure transparency and reproducibility, this review follows established guidelines for systematic literature reviews in software engineering (e.g., [55]) and adopts a protocol-based approach. The methodology includes a documented search strategy, explicit database selection (including major indexing services such as Scopus), a stepwise screening process, a verifiable protocol for deduplication and preprint handling, and a PRISMA-style flow diagram summarizing study selection (Figure 2).

2.1. Search Strategy and Databases

A structured search was conducted across six major bibliographic sources commonly used in empirical software engineering and security research:

ACM Digital Library (conference and journal articles on software engineering, security, and programming languages).
IEEE Xplore (software engineering, dependable systems, and security venues).
SpringerLink (journals such as Empirical Software Engineering and Software and Systems Modeling).
ScienceDirect (Elsevier journals, including Journal of Systems and Software and High-Confidence Computing).
Scopus (used as a broad multidisciplinary index to cross-check coverage and retrieve additional peer-reviewed articles).
arXiv (preprints; treated as a supplementary source and explicitly distinguished from indexed venues).

The search covered the period from January 2020 to March 2025, capturing the emergence of transformer-based LLMs and their application to software security tasks such as vulnerability detection, secure code generation, and automated patching. Scopus was used to validate and extend coverage from ACM, IEEE, SpringerLink, and ScienceDirect; for example, running the finalized search string in Scopus alone returned more than 400 candidate documents, confirming that there is already a substantial peer-reviewed body of work on LLMs and software security.

Search Strings

Because terminology varies across studies, we designed compound Boolean queries around three conceptual groups: (1) large language models, (2) software security, and (3) code-related tasks (generation, analysis, or repair). The canonical search string (adapted to each database's syntax) was

(“large language model” OR “LLM” OR “code LLM” OR “foundation model”)
AND (“software security” OR “vulnerability” OR “secure coding” OR “security testing” OR “threat modeling”)
AND (“code generation” OR “automatic programming” OR “vulnerability detection” OR “patch generation”)

For Scopus and some publisher portals, controlled vocabulary terms (e.g., “software vulnerability”, “secure software engineering”) were additionally included where supported. We used AND between the LLM, security, and code-related concept groups to restrict results to studies that address all three dimensions, and OR within each group to capture lexical variation across venues and indexing systems.
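The AND/OR structure described above can be sketched in a few lines of Python. This is only an illustration of how the three concept groups combine. The helper names `or_group` and `build_query` are hypothetical, the sketch uses straight quotes, and it does not reproduce the database-specific syntax adaptations or controlled vocabulary terms mentioned in the text.

```python
# Concept groups copied from the canonical search string in the text.
LLM_TERMS = ["large language model", "LLM", "code LLM", "foundation model"]
SECURITY_TERMS = ["software security", "vulnerability", "secure coding",
                  "security testing", "threat modeling"]
CODE_TASK_TERMS = ["code generation", "automatic programming",
                   "vulnerability detection", "patch generation"]


def or_group(terms):
    """Quote each term, join with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """Join concept groups with AND so a hit must match every group."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(LLM_TERMS, SECURITY_TERMS, CODE_TASK_TERMS)
print(query)
```

Keeping the term lists separate from the joining logic makes it easy to add a lexical variant to one group without touching the others.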

2.2. Study Selection and Screening

The search results from all six sources were exported (in BibTeX or RIS where possible) and merged into a single reference library. Figure 2 summarizes the selection process using a PRISMA-style flow diagram customized for this review.

In total, the initial search yielded 1117 records before deduplication:

ACM Digital Library: 132 records.
IEEE Xplore: 155 records.
SpringerLink: 102 records.
ScienceDirect: 88 records.
Scopus: 444 records.
arXiv (preprints): 196 records.

2.2.1. Deduplication and Consistency Checks

We first removed exact and near-duplicate entries across databases using a combination of DOI matching, title and year matching, and fuzzy string similarity on titles (to account for minor formatting and capitalization differences). For records without DOIs (primarily some arXiv preprints and a small number of conference papers), we manually checked titles and authors to avoid double-counting preprint/published pairs.

This process removed 585 duplicate or near-duplicate records, leaving 532 unique records for screening. Where both an arXiv preprint and a peer-reviewed version existed (e.g., [1,2,3,56]), the peer-reviewed version was treated as the primary record, and the preprint was retained only as a pointer to the earlier version rather than as a separate study.
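The three matching rules (DOI, title plus year, fuzzy title similarity) could be implemented along the following lines. This is a minimal sketch, not the tooling used in the review: the record layout, the sample records, and the 0.95 similarity threshold are assumptions made for illustration, and the manual checks described for records without DOIs are not modeled.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a, b, threshold=0.95):
    """Return True if two records look like the same paper.

    The threshold is an illustrative assumption, not a value from the review.
    """
    # Rule 1: identical DOIs (compared case-insensitively).
    if a.get("doi") and b.get("doi") and a["doi"].lower() == b["doi"].lower():
        return True
    # Rules 2 and 3: same year plus an exact or fuzzy title match.
    if a.get("year") != b.get("year"):
        return False
    ta, tb = normalize_title(a["title"]), normalize_title(b["title"])
    return ta == tb or SequenceMatcher(None, ta, tb).ratio() >= threshold


def deduplicate(records):
    """Keep the first occurrence of each paper, in input order."""
    unique = []
    for rec in records:
        if not any(is_duplicate(rec, kept) for kept in unique):
            unique.append(rec)
    return unique


# Hypothetical records; titles and DOIs are placeholders.
records = [
    {"title": "Example Study on LLM Vulnerability Detection", "year": 2024,
     "doi": "10.0000/example.1"},
    {"title": "Example study on LLM vulnerability detection.", "year": 2024,
     "doi": None},
    {"title": "A Different Example Paper", "year": 2023,
     "doi": "10.0000/example.2"},
    {"title": "Retitled Camera-Ready Version", "year": 2025,
     "doi": "10.0000/EXAMPLE.1"},
]
print(len(deduplicate(records)))  # 2
```

A preprint and its published version often differ in year, so a rule set like this one would miss some of those pairs, which is one reason a manual check remains useful.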

2.2.2. Title and Abstract Screening

Two reviewers independently screened the titles and abstracts of the 532 unique records against the inclusion and exclusion criteria (defined below). Disagreements were resolved through discussion, and a third reviewer was consulted in a small number of borderline cases (e.g., papers on general LLM evaluation with limited but non-zero security content).

During this phase, 367 records were excluded for the following reasons:

Not LLM-related or not focused on code (e.g., general NLP security, social media analysis): 241 records.
No substantive security-relevant outcome (e.g., productivity-only studies of code completion): 126 records.

This left 165 records for full-text assessment.
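The counts reported so far can be checked with simple arithmetic. The short script below recomputes the identification, deduplication, and title/abstract screening stages from the figures given in Section 2.2. The variable names are illustrative.

```python
# Records identified per source, as reported in Section 2.2.
identified = {
    "ACM Digital Library": 132,
    "IEEE Xplore": 155,
    "SpringerLink": 102,
    "ScienceDirect": 88,
    "Scopus": 444,
    "arXiv": 196,
}
duplicates_removed = 585
screening_exclusions = {
    "not LLM-related or not focused on code": 241,
    "no substantive security-relevant outcome": 126,
}

total_identified = sum(identified.values())
unique_records = total_identified - duplicates_removed
excluded_at_screening = sum(screening_exclusions.values())
full_text_candidates = unique_records - excluded_at_screening

print(total_identified, unique_records, excluded_at_screening, full_text_candidates)
# 1117 532 367 165
```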

2.3. Full-Text Assessment and Eligibility

Full texts of the 165 remaining records were retrieved via institutional access or open access sources. Twelve records could not be obtained in full text (e.g., inaccessible theses or missing proceedings) and were therefore excluded.

The remaining 153 full-text articles were assessed for eligibility. We applied the inclusion and exclusion criteria to the full text, focusing on whether the study (i) used an LLM or closely related model for a software engineering task, (ii) produced security-relevant outcomes (e.g., vulnerability detection rates, secure code generation quality, data leakage analysis), and (iii) reported empirical results (benchmarks, experiments, or case studies).

In this step, 84 articles were excluded for the following main reasons:

No empirical evaluation (e.g., conceptual position papers, essays, or vision pieces): 46 articles.
Insufficient security focus (e.g., developer productivity studies with only incidental mention of security): 26 articles.
Incomplete or non-reproducible results (e.g., missing evaluation setup or datasets): 12 articles.

After these exclusions, 81 studies were included in the final synthesis.

2.4. Inclusion and Exclusion Criteria

2.4.1. Inclusion Criteria

A study was included if it met all of the following:

Explicitly involved the use, evaluation, or analysis of an LLM (or a closely related large code model) in software engineering.
Addressed security-relevant tasks such as vulnerability detection, patch generation, secure code generation, threat modeling, data leakage, or security/privacy assessment.
Reported empirical evidence, such as quantitative benchmarks, experiments, or structured case studies (including industrial case reports).
Was published in a peer-reviewed venue (e.g., [9,16,17,27,57]) or available as a preprint with sufficiently detailed methodology (e.g., [11,35,58,59,60]).

2.4.2. Exclusion Criteria

Studies were excluded if they

Did not involve LLMs or large code models (e.g., traditional static or ML-only methods without comparison to LLMs).
Focused on NLP or security topics unrelated to source code or software systems.
Lacked empirical evaluation (conceptual essays, tutorials, or purely theoretical discussions).
Provided no security-relevant outcomes (e.g., productivity-only evaluations of code completion tooling).
Were duplicates, near-duplicates, or inaccessible in full text.

2.5. Treatment of Preprints and Indexed Versions

Because the field evolves rapidly, preprints form a substantial portion of early evidence. To maintain rigor, we applied the following rules:

Peer-reviewed publications (journals, conferences, and workshops indexed in Scopus, ACM, IEEE, SpringerLink, or ScienceDirect) were prioritized in the narrative synthesis.
When a preprint later appeared in a peer-reviewed venue, the published version was used as the primary record (e.g., [1,2,3]).
Purely preprint studies were included only if they presented clear empirical methodology, datasets, and evaluation details (e.g., [35,37,56,58,59]).
Non-academic sources (blog posts, vendor white papers, and ChatGPT-generated content) were not treated as primary evidence and were used only to contextualize tooling or practice when strictly necessary.

Overall, of the 81 included studies, 68 are peer-reviewed journal, conference, or workshop papers (e.g., [9,16,17,27,42,61,62,63]) and 13 are preprints or technical reports that provide up-to-date empirical results on emerging models and benchmarks.

2.6. Data Extraction and Coding

For each included study, we extracted the following information using a structured data extraction form:

Bibliographic metadata (year, venue, publication type: journal, conference, workshop, preprint, or technical report).
LLM or model family evaluated (e.g., GPT-3/3.5/4, Codex, Code Llama, LLaMA 2, DeepSeek-R1, CodeBERT, GraphCodeBERT).
Security tasks and scenarios (e.g., vulnerability detection, secure code generation, patch synthesis, malware generation, data leakage analysis, threat modeling).
Datasets and benchmarks used (e.g., DiverseVul [10], CVE-Bench [12], VulnLLMEval [11], proprietary industry datasets).
Evaluation metrics (e.g., accuracy, precision, recall, F1, repair success rate, exploitability of generated code).
Key findings, mapped to the three research questions RQ1–RQ3.
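For reference, the classification metrics named in the extraction form follow the standard confusion-matrix definitions. The sketch below computes them for a vulnerability detector. The counts in the usage example are hypothetical and were chosen only to show a detector with high recall but weaker precision.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: vulnerable samples flagged as vulnerable
    fp: safe samples flagged as vulnerable
    fn: vulnerable samples missed
    tn: safe samples correctly left unflagged
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 100 vulnerable and 100 safe samples.
metrics = detection_metrics(tp=80, fp=40, fn=20, tn=60)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.7, 'precision': 0.667, 'recall': 0.8, 'f1': 0.727}
```

Reporting precision alongside recall matters here because a detector can reach high recall simply by flagging most inputs.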

Two reviewers independently coded each paper; disagreements were resolved through discussion. Where necessary, we revisited the full text to clarify ambiguous categorizations (e.g., studies that mixed productivity and security outcomes).

2.7. Quality and Bias Assessment

To strengthen the robustness of the synthesis, we conducted a structured quality assessment for a representative subset of included studies. We followed standard SLR practices and assessed each study across three dimensions:

Dataset Clarity and Transparency (DC): Are datasets, benchmarks, and pre-processing steps clearly described and, where possible, publicly available? (e.g., [10,11]).
Baseline Strength (BS): Does the study compare LLMs against strong, realistic baselines (e.g., static analyzers, existing ML models, or human experts) rather than only trivial or outdated baselines [9,17]?
Reproducibility and Reporting Quality (RT): Are prompts, hyperparameters, model versions, and evaluation protocols described in sufficient detail for replication (e.g., [27,63,64])?

Table 1 summarizes ratings for five representative studies chosen from the 81 included papers. Studies with low transparency or incomplete methodological reporting were flagged and treated cautiously in the synthesis. In our analysis, peer-reviewed studies with high DC and RT scores (e.g., [16,17,27]) formed the backbone of the argumentation, while preprints and lower-transparency studies were used primarily to highlight emerging trends or open issues.

2.8. Limitations of the Review Process

Several limitations should be acknowledged:

Screening capacity. Screening and coding were carried out by a limited number of reviewers, which may introduce selection bias despite the use of a shared protocol and consensus discussions.
Preprint variability. Preprints (especially from arXiv) vary in rigor. While they are important for tracking the rapidly evolving LLM landscape, conclusions based on preprints are treated as provisional.
Heterogeneous benchmarks. Studies employ different datasets, vulnerability taxonomies, and metrics, which limits direct comparability of quantitative results across papers.
Rapid model evolution. New model releases, fine-tuned variants, and updated APIs may outpace published evaluations, meaning that some empirical findings may become outdated quickly.

Despite these constraints, the review procedure, including the use of Scopus as a central index, cross-checking between databases and preprint servers, and explicit PRISMA-style tracking, provides a transparent and reproducible basis for synthesizing current knowledge on LLMs in secure software development.

3. Results and Discussion

This section presents the findings of our survey, structured around the three research questions (RQ1–RQ3). For each question, we synthesize evidence from recent literature, highlight trends and limitations, and connect them to the broader goals of secure software development. All figures and tables include explicit source citations, and interpretations avoid categorical claims where the underlying evidence is derived from heterogeneous or preliminary studies.

3.1. RQ1: LLM Integration and Autonomy in Secure Development

The integration of large language models (LLMs) into secure development workflows is reshaping how security tasks are performed across the Secure Development Lifecycle (SDL), DevSecOps pipelines, code review, and threat modeling. Traditional workflows have relied on static analysis, dynamic testing, and manual review, whereas LLMs introduce capabilities such as contextual reasoning, multi-language understanding, and rapid generalization across security patterns and vulnerability classes. Early code-focused LLMs like Codex, Code Llama, and LLaMA 2 showed that large models trained jointly on natural language and code can perform program synthesis, translation, and bug explanation [1,2,3]. Encoder-style models such as CodeBERT and GraphCodeBERT further improved semantic code understanding for tasks like vulnerability localization and code search [5,6]. Surveys and reviews position LLMs at the intersection of software engineering and security automation, while emphasizing uneven maturity across tasks, evaluation heterogeneity, and fragile generalization [14,15,16,17,30,57].

Figure 3 illustrates adoption and visibility of major LLM frameworks. Panel (a) highlights repository-level dependence, where foundational ML libraries dominate, while workflow/orchestration frameworks (e.g., LangChain, Transformers) show accelerating uptake as building blocks for LLM-driven coding assistants, retrieval-augmented generation (RAG) pipelines, and security triage tools [65]. This trend is consistent with evidence that LLM components are increasingly embedded into DevSecOps workflows to support prioritization, continuous assessment, and pipeline-level automation [16,40]. Niche retrieval frameworks remain smaller in scale but are relevant to security and compliance tasks involving policy lookup, evidence retrieval, and audit traceability. Panel (b) shows developer attention via GitHub “stars,” reinforcing that orchestration frameworks are becoming central to developer workflows and are likely to shape how secure SDL and code-review practices evolve.

Figure 4 indicates that vulnerability rates in LLM-generated code remain substantial and, for some tools/settings, appear to increase between 2023 and 2024. These observations align with multiple independent empirical evaluations of AI code assistants and code-generating models, including studies on GitHub Copilot and other tools that report recurring insecure patterns (e.g., injection, unsafe handling, and crypto misuse) even under security-oriented prompts [18,20,21,22]. At the same time, cross-study comparisons require caution because rates depend strongly on benchmark design, task selection, and model/tool versions [17,30].

A growing body of benchmarking work provides a more systematic picture of LLM performance on security-relevant tasks. Ullah et al. evaluate eight LLMs across 228 security scenarios using the SecLLMHolmes framework, finding that models handle simple vulnerability templates but exhibit non-deterministic responses and fail on complex multi-step reasoning, with even GPT-4 producing incorrect answers in 17% of cases from trivial variable renaming alone [63]. Tamberg and Bahsi show that LLM-based approaches can achieve strong recall for vulnerability detection but often suffer from false positives and inconsistent precision across languages [9]. These limitations are consistent with broader evidence on LLM reasoning failures and reliability gaps in complex settings [30,50]. Sultana et al. demonstrate that model rankings are highly sensitive to dataset design and labeling practices [67], while empirical studies observe gradual improvements across releases alongside persistent blind spots for complex vulnerabilities [26].
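The variable-renaming sensitivity mentioned above can be probed with a small consistency check: apply a behavior-preserving rename and compare the verdict before and after. The sketch below is an illustration only and is not the SecLLMHolmes implementation. `toy_detector` is a deliberately brittle, hypothetical stand-in for an LLM call, and the renamer handles only plain names and function parameters (keyword arguments at call sites are not rewritten). It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also rewrite names inside annotations
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_identifiers(source, mapping):
    """Return source code with identifiers renamed; behavior is unchanged."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def verdict_is_stable(detector, source, mapping):
    """True if the detector gives the same verdict before and after renaming."""
    return detector(source) == detector(rename_identifiers(source, mapping))


def toy_detector(source):
    """Hypothetical detector that keys on a variable name, so it is brittle."""
    return "user_input" in source


SNIPPET = (
    "def lookup(cursor, user_input):\n"
    "    cursor.execute(\"SELECT * FROM users WHERE name = '\" + user_input + \"'\")\n"
)

print(verdict_is_stable(toy_detector, SNIPPET, {"user_input": "x"}))  # False
```

In a real evaluation, `toy_detector` would be replaced by a call to the model under test, and the check would be repeated over many snippets and mappings.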

Empirical studies on code assistants reinforce this picture. Pearce et al. and Majdinasab et al. find that a non-trivial fraction of assistant-generated suggestions contain at least one security weakness, even when developers ask explicitly for secure code [27,64]. These findings complement earlier Copilot-focused evaluations and replications [18,20,22]. Studies of context and grounding further suggest that assistants can misunderstand project-specific constraints and cross-file context, which matters for secure code review and fixes [43]. User studies also indicate that developers may over-trust plausible-looking outputs, miss subtle flaws, and accept insecure suggestions at non-trivial rates [29,30,52]. Broader analyses of code generation quality similarly report that LLM outputs can introduce bugs and security-relevant defects across tasks and languages [23,24].

Beyond coding, LLMs are being explored for higher-level security activities. Sridhar et al. find that LLMs can assist in brainstorming attack paths and abuse cases, but often produce incomplete threat models that require expert curation [42]. This aligns with work that frames AI-driven threat modeling as promising but risk-prone, especially when organizations treat generated artifacts as complete or authoritative [41]. Lyu et al. argue that current models fall short of the guarantees needed for fully autonomous security-critical development, emphasizing hybrid pipelines with explicit human oversight [62]. In practice-oriented terms, the emerging consensus is that LLMs behave more like “junior reviewers” than autonomous security agents, providing useful drafts and hypotheses that must be checked systematically [30,44].

Dataset and supply-chain perspectives add another dimension. Reviews emphasize that vulnerabilities in training data, benchmark contamination, and inconsistent labeling can inflate apparent performance and weaken generalization claims [17,35]. Jahanshahi and Mockus identify vulnerabilities and licensing risks in large-scale pre-training corpora, raising both security and governance concerns for downstream systems [34]. Meanwhile, work on automated repair and neural program repair highlights that patch synthesis is difficult to evaluate robustly, and many settings omit constraints such as maintainability, test adequacy, and semantic preservation [68,69,70]. Overall, the evidence supports augmentation rather than autonomy as the dominant paradigm for LLMs in secure software development. Table 2 summarizes LLM integration across key security frameworks and practices.

Table 3 synthesizes autonomy across phases. Early phases benefit from drafting and pattern identification, but human validation remains indispensable [15,16,44]. During coding, partial automation is offset by insecure outputs, newly introduced weaknesses, and context/grounding failures [27,43,61,64]. Testing results show high recall but frequent misclassification and benchmark leakage [10,11,59]. Repair studies typically remain below 30% success under realistic constraints, consistent with broader program-repair surveys that emphasize semantic preservation and evaluation difficulty [12,60,68,69,70,72]. Maintenance studies indicate vulnerabilities can persist and accumulate as technical debt without systematic monitoring [26,28,54]. Overall, current evidence points to augmentation rather than autonomy as the dominant paradigm.

3.2. RQ2: Latent Security Risks of LLM Code Generation

While LLMs offer significant benefits for software development, they introduce latent risks stemming from probabilistic generation, exposure to uncurated training data, and susceptibility to adversarial manipulation. Across studies, models reproduce insecure, outdated, or unsafe patterns (e.g., weak cryptography, injection vulnerabilities, unsafe deserialization, and flawed validation) [17,57,73]. Empirical evaluations of GitHub Copilot and other AI tools repeatedly document insecure suggestions, including vulnerabilities that persist across prompts and languages [18,20,21,22,27]. More broadly, studies of LLM code generation report bugs and security defects that arise from shallow pattern-matching, limited constraint satisfaction, and fragile reasoning under context shifts [23,24,50].
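To make the injection pattern concrete, the sketch below contrasts a string-concatenated SQL query with a parameterized one, using Python's built-in `sqlite3` module and an in-memory database. The table, function names, and payload are hypothetical and are not taken from any of the surveyed studies.

```python
import sqlite3

# In-memory database with two hypothetical users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "user")])


def find_user_insecure(conn, name):
    # Insecure: attacker-controlled text becomes part of the SQL statement.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_parameterized(conn, name):
    # Safer: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_user_insecure(conn, payload)))   # 2 (every row is returned)
print(find_user_parameterized(conn, payload))   # []
```

The two functions differ by a single line, which helps explain why the insecure variant is easy to accept during review when it is presented as a plausible-looking suggestion.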

Table 4 maps common vulnerability classes in synthesized code to STRIDE categories. Although rates vary across benchmarks and tools, recurring web vulnerabilities (e.g., XSS) remain prominent, consistent with evidence that many assistants default to insecure patterns without explicit sanitization and context-aware constraints [12,17,30].
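As a small illustration of the sanitization point, the sketch below renders the same user-supplied comment with and without HTML escaping, using `html.escape` from the Python standard library. The function names and payload are hypothetical. Real applications would normally rely on a templating engine with context-aware auto-escaping rather than manual string building.

```python
import html


def render_comment_insecure(comment):
    # Insecure: user text is placed into the page as raw markup.
    return "<p>" + comment + "</p>"


def render_comment_escaped(comment):
    # Escapes <, >, &, and quotes so the text is displayed, not executed.
    return "<p>" + html.escape(comment) + "</p>"


payload = "<script>alert(1)</script>"
print(render_comment_insecure(payload))
# <p><script>alert(1)</script></p>
print(render_comment_escaped(payload))
# <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```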

Beyond insecure code, LLMs pose risks related to data exposure and memorization. Figure 5 shows that while most outputs contain no leaks, identifiable information and secrets still appear in a small fraction of generations [17]. This aligns with prior work on training-data extraction, model inversion, and broader surveys of ML data leakage that show large models can unintentionally reveal sensitive or proprietary information under targeted prompting or adversarial interaction [31,32,33,74]. Analyses of pre-training corpora also identify vulnerable and mislicensed code as plausible sources for vulnerability propagation and accidental leakage [34,35,59].
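One lightweight safeguard against this kind of leakage is to scan generated output for secret-like strings before it is shown or committed. The sketch below is a minimal illustration under stated assumptions: the three patterns are examples only (an AWS-style access key ID prefix, a PEM private key header, and an email address), the key in the sample is a made-up placeholder, and the pattern set is far from complete. It is not the method used in the studies cited above.

```python
import re

# Illustrative patterns only; a real scanner would need a much larger set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_output(text):
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in SECRET_PATTERNS.items()
                  if pattern.search(text))


# Hypothetical model outputs; the key below is a made-up placeholder.
leaky = 'client = connect(key="AKIAABCDEFGHIJKLMNOP", owner="dev@example.com")'
clean = "def add(a, b):\n    return a + b"

print(scan_output(leaky))  # ['aws_access_key_id', 'email_address']
print(scan_output(clean))  # []
```

Pattern matching of this kind catches only well-formed secrets, so it complements rather than replaces controls on what enters the training data or the prompt.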

Table 5 provides a detailed comparison of latent security risks observed across studies, organized by risk category. These risks are especially acute in constrained environments such as low-end IoT devices, where baseline vulnerability rates are already high and patch deployment is difficult [75]. In such settings, LLM-introduced weaknesses may be costly or hard to remediate. Overall, the evidence indicates that latent risks stem from three principal sources: (i) replication and amplification of insecure code, (ii) incomplete or incorrect vulnerability fixes, and (iii) leakage or manipulation of sensitive content and system behavior. Socio-technical studies suggest that without explicit governance and careful human oversight, these risks can accumulate and manifest as persistent technical debt and chronic exposure to known vulnerability patterns [28,30,54].

3.3. RQ3: Explainability, Auditability, and Compliance

As LLMs become embedded in secure software development workflows, ensuring explainability, auditability, and regulatory compliance has emerged as a central challenge. Unlike traditional static and dynamic analysis tools, LLMs generate probabilistic outputs that may lack interpretable rationale or semantic traceability. This introduces barriers for developers, auditors, and regulators who must validate correctness, safety, and accountability. Reliability analyses and user studies show that LLMs can produce persuasive explanations that may be incomplete, inconsistent, or unfaithful to actual model behavior, especially under ambiguity or adversarial pressure [48,49,50,51].

Explainability techniques attempt to mitigate these issues. Ding et al. survey interpretable AI approaches for secure software engineering, including rule-based rationales, localized highlighting of suspicious code regions, and templates aligned with CWE categories [48]. Systematic analyses of explainable AI highlight that effective explanations must go beyond surface-level interpretability, emphasizing properties such as faithfulness to the underlying model, robustness across inputs, and alignment with the decision context to ensure meaningful and trustworthy use [47]. However, empirical work also shows that explanation failure modes are common, including plausible but incorrect justifications and brittle chains of reasoning under prompt perturbations [50,51,52]. Structured prompting (e.g., self-checking and rule-aligned exemplars) can improve consistency, but remains sensitive to phrasing, context, and tool integration [30,57,58].
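
The sensitivity to phrasing noted above can be quantified in a simple way. As a minimal sketch, assume the same code snippet is submitted under several paraphrased prompts and each run returns one CWE label; the share of runs agreeing with the majority label then gives a crude consistency score. The function name and the example labels are hypothetical and do not come from the cited studies.

```python
from collections import Counter


def verdict_consistency(labels):
    """Return the majority label and the fraction of runs that agree with it."""
    if not labels:
        raise ValueError("need at least one label")
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)


if __name__ == "__main__":
    # Hypothetical labels from four paraphrases of one review prompt.
    runs = ["CWE-89", "CWE-89", "CWE-79", "CWE-89"]
    label, agreement = verdict_consistency(runs)
    print(f"majority={label} agreement={agreement:.2f}")
```

A high agreement score does not imply a faithful explanation; it only indicates that the verdict is stable under the tested perturbations.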

Compliance requirements amplify these challenges. In settings governed by frameworks such as HIPAA, ISO/IEC 27001, or sector-specific regulation, outputs must be explainable and auditable. NIST’s SSDF-AI extension calls for AI-system-specific secure development practices, including documentation of model behavior, testing, and risk management for AI-enabled components [53]. More general governance frameworks for trustworthy AI emphasize verifiable claims, reproducible pipelines, and transparent documentation [46]. Work on governance and accountability highlights the organizational need for role clarity, audit trails, and responsibility assignment when LLMs influence security decisions [45]. These concerns align with evidence from DevSecOps integration research: adding AI components can increase pipeline complexity and blur accountability without explicit controls and monitoring [40].

Table 6 summarizes the key findings for RQ3 across explainability mechanisms, compliance alignment, and governance approaches. Adversarial research adds further concerns for explainability and auditability. Backdoor attacks show that hidden behaviors can be introduced into code-generating models and may not be detectable through surface-level explanations alone [36,37]. Prompt-injection and environment-manipulation attacks on LLM-based tools demonstrate that adversaries can corrupt the decision context itself, leading to subtle but dangerous modifications in code suggestions [38,39,66]. LLM-assisted malware generation studies and security-misuse analyses similarly highlight that guardrails and governance are necessary for safe deployment in realistic environments [39,77,80].

Concerns about repair quality reinforce the need for explainability and strong assurance. Figure 6 shows that automated repair success rates remain low across languages, consistent with repair-focused empirical studies and broader program-repair surveys that emphasize semantic correctness and evaluation rigor [12,17,60,68,69,70,72]. Even when models produce convincing rationales, explanations can mask incomplete or incorrect fixes, making human-in-the-loop review essential for safety-critical code [30,51].

Overall, achieving explainability and auditability for LLMs in secure software development will require a combination of interpretable-by-design techniques, robust evaluation datasets, explicit governance frameworks, and sustained human oversight [45,46,47,48,49,53]. Longitudinal and socio-technical evidence suggests that adoption in regulated environments depends on verifiable audit trails, reproducible evidence, and transparent role definitions for AI components within the secure development lifecycle [30,54,62]. Without improved oversight, LLM outputs risk becoming opaque artifacts that undermine both security and compliance.

Taken together, the results across RQ1–RQ3 show a heterogeneous but converging picture. LLMs can improve the speed and reach of secure development activities, yet they also introduce systematic vulnerabilities, data-governance issues, and explainability gaps that traditional tools do not fully address. Because findings are distributed across datasets, tasks, and evaluation setups, strong conclusions should not be drawn from any single study in isolation. The next section therefore provides an integrated synthesis highlighting cross-cutting themes, trade-offs, and open gaps that emerge when the evidence is considered holistically.

4. Our Findings

This section synthesizes the collective evidence across the reviewed literature to answer the three research questions (RQ1–RQ3) at a conceptual level, rather than reiterating task-level results or benchmark-specific observations already discussed in Section 3. We deliberately avoid re-listing Secure Development Lifecycle (SDL) phases, DevSecOps components, or individual tools. Instead, we abstract recurring patterns, constraints, and tensions that cut across models, datasets, evaluation settings, and organizational contexts.

We organize the synthesis around five cross-cutting findings: (F1) how LLMs are actually used in secure development practice, (F2) where their benefits are most robust, (F3) where risks systematically arise, (F4) why current evaluation practice limits strong generalization, and (F5) why governance, explainability, and human oversight emerge as structural requirements rather than optional enhancements. Throughout, we prioritize peer-reviewed evidence, while incorporating preprints where they provide unique empirical insight, consistent with established SLR guidance [15,16,55,56,57,71].

4.1. F1: LLMs Function as High-Bandwidth Security Assistants Rather than Autonomous Engineers

Across studies addressing RQ1, a consistent and robust pattern emerges: LLMs are not replacing human security engineers or traditional assurance mechanisms, but instead function as high-bandwidth assistants embedded within existing socio-technical systems. Regardless of model family or deployment context, LLMs are integrated into IDEs, CI/CD pipelines, code review tooling, and security triage workflows, where they generate candidates, explanations, or hypotheses that are subsequently filtered and validated by humans.

This assistant-oriented role holds across multiple generations of models. Early large-scale code models demonstrated strong generative and explanatory capacity but lacked the determinism and guarantees required for autonomous security decision-making [1,2,3]. Representation-learning approaches such as CodeBERT and GraphCodeBERT improved vulnerability localization and semantic understanding, yet remained decision-support tools rather than replacements for expert judgment [5,6]. More recent reasoning-enhanced models, including DeepSeek-R1 and agentic pipelines, extend analysis across files and repositories, but empirical evaluations still situate them within assistive, not authoritative, roles [4,59,62].

Evidence from repository-level telemetry reinforces this interpretation. Dependency analysis (e.g., Figure 3) shows widespread adoption of LLM frameworks as infrastructural components rather than standalone decision engines, indicating integration into broader development and security ecosystems [65]. Empirical studies, including studies of safety-critical settings, further show that organizations treating LLM outputs as authoritative tend to accumulate unresolved vulnerabilities and technical debt over time, whereas teams that frame LLMs as advisory tools achieve more stable outcomes [26,28,54]. Collectively, these findings suggest that the dominant and realistic role of LLMs today is that of *capability amplifiers*, not autonomous security engineers.

4.2. F2: Benefits Concentrate on Visibility, Coverage, and Cognitive Support

Across RQ1 and RQ3, the strongest and most reproducible benefits of LLM integration are not improvements in correctness guarantees, but expansions in visibility, coverage, and human cognitive support.

First, LLMs consistently improve the *breadth* of vulnerability detection. Across multiple datasets and benchmarks, LLM-based approaches achieve higher recall than traditional static analysis or classical ML models, particularly for medium-complexity vulnerabilities and cross-language patterns [9,10,16,17,67]. While precision varies substantially, high recall is often operationally acceptable when LLM outputs are treated as candidates for downstream triage rather than final verdicts [15,27,63]. This aligns with DevSecOps practices that emphasize early, wide-net detection combined with later-stage validation.
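
The trade-off between recall and precision described above can be made concrete with a small calculation. In the sketch below, the file names and the two sets of flagged findings are invented for illustration and do not reproduce any cited benchmark; precision is the share of flagged items that are truly vulnerable, and recall is the share of truly vulnerable items that were flagged.

```python
def precision_recall(flagged, vulnerable):
    """Compute (precision, recall) for a set of flagged items."""
    flagged, vulnerable = set(flagged), set(vulnerable)
    true_positives = len(flagged & vulnerable)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(vulnerable) if vulnerable else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical ground truth: four vulnerable files out of eight.
    vulnerable = {"f1", "f2", "f3", "f4"}
    # A wide-net detector flags everything; a conservative one flags little.
    wide_net = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
    conservative = {"f1", "f2"}
    print("wide net:", precision_recall(wide_net, vulnerable))
    print("conservative:", precision_recall(conservative, vulnerable))
```

In this invented case the wide-net detector misses nothing but doubles the triage workload, which is acceptable only if a later validation stage filters the candidates.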

Second, LLMs provide cross-language and cross-layer abstraction that is difficult to replicate with rule-based tools. Joint natural language–code representations enable transfer of vulnerability concepts across programming languages, frameworks, and system layers [5,6]. Longitudinal studies indicate that this abstraction is particularly valuable in large, polyglot codebases, where security expertise is unevenly distributed [16]. Reasoning-oriented and agentic systems extend this capability to repository-level and multi-step analysis, though without eliminating the need for expert oversight [4,62].

Third, LLMs provide substantial value as cognitive scaffolding. By translating low-level code findings into natural-language explanations, mitigation suggestions, and references to standards, LLMs help developers understand and contextualize security issues [44,48]. Even when generated fixes are incomplete, accompanying explanations can improve developers’ mental models of vulnerability classes and defensive patterns [49,63,71]. This supports collaboration models in which humans retain decision authority while relying on LLMs to accelerate comprehension and exploration [44,54].

4.3. F3: Security Risks Are Structural and Persistent Across Models

Findings related to RQ2 converge on the conclusion that LLM-associated security risks are structural rather than accidental. Figure 7 aggregates latent risks reported across studies and illustrates that insecure-by-default code generation dominates observed failures, followed by propagation of vulnerable patterns and data-related risks.

Empirical evaluations consistently report that 25–40% of LLM-generated code contains at least one security weakness under typical usage conditions, with substantially higher rates under adversarial prompting [27,58,61,64,78]. Automated repair studies further show that LLM-generated patches frequently address surface symptoms while failing to preserve deeper semantic correctness or long-term maintainability [12,60,68,72]. These failure modes persist across model families and benchmark designs, indicating that they are not artifacts of specific implementations.
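
To illustrate what a surface-level patch can look like, the sketch below compares two hypothetical fixes for a path-traversal weakness: one that strips `../` sequences in a single pass, and one that normalizes the path and then checks that it stays under a base directory. Both functions, the base path, and the test inputs are assumptions made for this example; the check is purely lexical and ignores symbolic links.

```python
import posixpath

BASE = "/srv/files"  # hypothetical base directory


def resolve_shallow_patch(user_path):
    # Surface fix: removes "../" once, which crafted input can survive.
    cleaned = user_path.replace("../", "")
    return posixpath.normpath(posixpath.join(BASE, cleaned))


def resolve_semantic_patch(user_path):
    # Normalizes first, then enforces the containment property itself.
    candidate = posixpath.normpath(posixpath.join(BASE, user_path))
    if candidate != BASE and not candidate.startswith(BASE + "/"):
        raise ValueError("path escapes base directory")
    return candidate


def escapes_base(resolver, user_path):
    """True if the resolver returns a path outside BASE without raising."""
    try:
        resolved = resolver(user_path)
    except ValueError:
        return False
    return not (resolved == BASE or resolved.startswith(BASE + "/"))


def evaluate_patch(resolver, benign_expected, adversarial):
    """Check regression safety on benign inputs and containment on hostile ones."""
    regressions = [p for p, want in benign_expected.items() if resolver(p) != want]
    escapes = [p for p in adversarial if escapes_base(resolver, p)]
    return {"regressions": regressions, "escapes": escapes}


if __name__ == "__main__":
    benign = {
        "docs/a.txt": "/srv/files/docs/a.txt",
        "a/./b.txt": "/srv/files/a/b.txt",
    }
    hostile = ["../etc/passwd", "....//....//etc/passwd", "/etc/passwd"]
    print("shallow:", evaluate_patch(resolve_shallow_patch, benign, hostile))
    print("semantic:", evaluate_patch(resolve_semantic_patch, benign, hostile))
```

Both patches pass the benign cases, so a functional test suite alone would not separate them; only the adversarial inputs expose the difference.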

Training data plays a central role in these risks. Analyses of large pre-training corpora reveal embedded CVEs, outdated practices, and licensing violations that can be reproduced downstream [34,59]. This inheritance effect is especially concerning in long-lived systems such as IoT and embedded software, where remediation is costly or infeasible [75]. Adversarial research further demonstrates that prompt injection, jailbreaks, and backdoor attacks can manipulate LLM behavior without access to model weights [37,38]. Together, these findings indicate that LLM-related risks arise from probabilistic generation, dataset curation, and socio-technical deployment practices rather than isolated bugs.

4.4. F4: Evaluation Practices Limit the Strength of Generalization

A central meta-finding across RQ1 and RQ2 is that heterogeneous evaluation practices constrain how confidently results can be generalized to production settings. Although many studies follow established SLR principles, differences in search scope, inclusion criteria, datasets, and metrics yield partially incompatible evidence bases [15,16,17,55,57,71].

Empirical results span controlled lab tasks, real-world repositories, user studies, and proprietary datasets, often using incomparable success criteria [9,10,12,27,78]. Benchmark contamination and data leakage further inflate reported performance in some settings [34,35]. Studies that observe LLM usage over time partially mitigate these issues by revealing how it reshapes security work, including automation bias and shifting review practices [26,28,54]. As a result, our synthesis treats single-benchmark claims as indicative rather than definitive.
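
A rough contamination check can be sketched as token n-gram overlap between a benchmark item and a reference corpus. This is a simplified stand-in and not the procedure used in [34,35]: whitespace tokenization, the window size `n=5`, and the function names are all assumptions of this example.

```python
def token_ngrams(code, n=5):
    """Return the set of whitespace-token n-grams in a code string."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(benchmark_item, corpus_docs, n=5):
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = token_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= token_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "int main ( ) { return 0 ; }"
    seen = ["/* copied */ int main ( ) { return 0 ; }"]
    unseen = ["def f ( x ) : return x + 1"]
    print("seen corpus:", contamination_score(item, seen))
    print("unseen corpus:", contamination_score(item, unseen))
```

A score near 1.0 suggests the item may have been available during training, in which case its benchmark result says little about generalization.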

4.5. F5: Governance, Explainability, and Human Oversight Are Structural Requirements

Across RQ3 and socio-technical analyses, governance and explainability emerge as first-order requirements for responsible LLM adoption. Because LLM outputs are probabilistic and persuasive, post hoc explanation and auditability are essential for detecting hallucinations, shallow fixes, and adversarial manipulation.

Work on interpretable AI demonstrates that structured explanations, syntax-grounded reasoning, and security-aligned narratives help developers interrogate LLM outputs, though they do not guarantee faithfulness [48,49]. Compliance-oriented approaches show how LLMs can support audit-ready documentation and multi-hop reasoning for regulatory frameworks, provided final judgments remain human [53]. Organizational case studies emphasize the need for access controls, logging, dataset governance, and explicit accountability structures [46,79,81].
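
The logging and accountability requirements above could be supported by a tamper-evident record of AI-assisted changes. The following is a minimal sketch under stated assumptions: the field names, the hash-chaining scheme, and the functions `append_entry` and `verify_chain` are hypothetical and are not drawn from the cited frameworks.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body):
    # Canonical JSON so the same content always yields the same hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, *, model_id, prompt, diff, reviewer):
    """Append a record that commits to the previous record's hash."""
    body = {
        "model_id": model_id,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "diff_sha256": hashlib.sha256(diff.encode("utf-8")).hexdigest(),
        "reviewer": reviewer,  # the accountable human
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    log.append({**body, "entry_hash": _digest(body)})
    return log[-1]


def verify_chain(log):
    """Return True if no entry has been altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, model_id="example-model-v1", prompt="fix the bug",
                 diff="- a\n+ b", reviewer="alice")
    append_entry(log, model_id="example-model-v1", prompt="add a test",
                 diff="+ test", reviewer="bob")
    print("intact:", verify_chain(log))
    log[0]["reviewer"] = "mallory"  # simulate tampering
    print("after tampering:", verify_chain(log))
```

Storing hashes rather than raw prompts and diffs keeps sensitive code out of the log while still allowing an auditor to match a record to the artifact it describes.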

Figure 8 synthesizes these governance trade-offs by framing LLM adoption as a balance between opportunity, risk, and unresolved gaps.

Synthesis Across F1–F5

Taken together, the evidence shows that LLMs offer substantial value by expanding visibility, accelerating reasoning, and supporting human understanding of security issues, while simultaneously introducing persistent risks rooted in probabilistic generation, dataset inheritance, and socio-technical deployment. The most robust use cases are those that treat LLMs as governed, explainable assistants with humans firmly in the loop. Fully autonomous, compliance-grade secure software engineering remains aspirational, and progress toward that goal depends as much on advances in evaluation methodology and governance as on improvements in model architecture or scale.

5. Future Work

The evidence synthesized in Section 3 and Section 4 indicates that LLMs are already reshaping security work, but not in a uniform or reliably “secure-by-default” way. Empirical evaluations of AI-assisted code generation repeatedly show that vulnerability rates can remain substantial across tools and tasks, and that security outcomes are sensitive to prompt framing, context, and developer behavior [18,19,20,21,22,23,24,25]. At the same time, studies on LLM-enabled detection and repair suggest promising gains in recall and cross-language transfer, yet expose persistent limits in multi-step reasoning, semantic faithfulness of explanations, and patch correctness [7,30,50,69]. Consequently, future research must focus on methodological rigor, longitudinal validation, and governance-aware system design, rather than treating security improvements as an automatic consequence of model scaling.

Rigorously reproducible evaluation protocols, not one-off benchmarks. A recurring barrier to comparability is that many studies differ in dataset curation, vulnerability taxonomies, prompt templates, decoding parameters, and reporting granularity. Future research should adopt reproducible evaluation protocols with versioned model identifiers, recorded inference settings, and standardized reporting of context-window usage and toolchain dependencies. Replication should be treated as a first-class research product, especially for assistant tools whose behavior can shift across model updates [18,22,30]. In addition, evaluations should include robustness checks that vary prompt phrasing, codebase context, and language choice to capture the instability observed in grounded-context studies of LLM understanding [43,50].
Human–AI collaboration as a measurable security control. Results across RQ1 and RQ2 suggest that the security impact of LLMs is mediated by human acceptance and review practices. Future studies should move beyond productivity metrics and explicitly measure security outcomes in realistic workflows: acceptance rates of insecure suggestions, time-to-detection of subtle flaws, and the effectiveness of structured review checkpoints. Controlled trials should test interventions such as uncertainty signaling, escalation triggers, and “secure-by-policy” templates that force developers to verify properties (e.g., input validation, authorization checks) before merging [30,40,51]. Particular attention is needed for automation bias and over-trust, which can be amplified by persuasive but incomplete explanations [50,51].
Security-aware dataset governance and provenance at scale. Multiple strands of evidence indicate that training data composition and contamination shape both vulnerability propagation and leakage risk. Future work must establish dataset governance standards for code corpora: provenance tracking, license auditing, deduplication, vulnerability filtering, and continuous maintenance. This includes empirically validating whether removing known-vulnerable snippets and duplicated repositories reduces insecure-by-default generation without harming functional correctness. Broader ML security surveys highlight the need to treat leakage and memorization as structural risks when sensitive code is present in training or fine-tuning pipelines [32,33]. Dataset releases should provide security-centered documentation (e.g., known CVE contamination, duplication metrics, licensing risk flags) to enable reliable downstream evaluation [14,73].
Semantics-first patch evaluation and program-repair integration. Repair remains a bottleneck: many generated patches are syntactically plausible yet semantically incomplete, especially for vulnerabilities that require cross-function invariants or precise state reasoning. Future research should integrate neural program repair insights with formal and test-based validation to evaluate patches for semantic correctness, exploitability reduction, and regression safety [69,70]. This implies moving beyond pass/fail compilation metrics toward property-based tests, differential testing, and proof-carrying evidence where feasible. Comparative studies should also benchmark hybrid pipelines that combine LLM patch proposals with deterministic analyzers and constraint solvers, aligning with broader secure-coding tool surveys that emphasize layered defenses [14,71].
Robustness to adversarial manipulation, including backdoors and prompt injection. RQ2 and RQ3 findings show that adversarial manipulation is a realistic concern for code-generating models and agentic toolchains. Future work should systematically evaluate backdoor and data poisoning threats, including how malicious triggers survive instruction tuning and how they manifest in code suggestions [36,39]. Similarly, evaluations should cover prompt-injection and context-manipulation attacks in retrieval-augmented and tool-using assistants, where untrusted inputs can steer generation. Defensive research should develop and validate detection methods (behavioral probes, anomaly detection, provenance-based filtering) that operate under black-box constraints common in proprietary tools [39,45].
Explainability that is faithful, auditable, and compliance-aligned. The field needs explanation mechanisms that are not only readable but also faithful and verifiable. Systematic reviews of explainable AI for security and empirical work on reasoning failures show that LLM explanations can diverge from actual model behavior, producing false confidence in fixes and threat assessments [47,50,51]. Future research should define evaluation criteria for explanation faithfulness in software security settings (e.g., alignment with data-flow evidence, consistency under perturbations, calibration to uncertainty). In parallel, compliance-driven contexts require traceability from model outputs to controls and evidence artifacts; governance-oriented work highlights accountability gaps if such traceability is absent [30,45].
Threat modeling and DevSecOps integration with measured assurance outcomes. As LLMs are increasingly integrated into DevSecOps, future studies should test how AI affects threat modeling quality, security gate performance, and incident response readiness. AI-driven threat modeling offers opportunities (faster enumeration of attack surfaces) but also risks (shallow or biased coverage). Future work should benchmark LLM-assisted threat modeling against expert baselines and evaluate how outputs affect downstream security decisions [41]. DevSecOps-focused reviews suggest that integration success depends on process design and organizational controls, not only model capability [40].
Governance, accountability, and socio-technical policy design. Ultimately, trustworthy deployment is constrained by governance: who is responsible for an AI-assisted change, what is logged, what is reviewed, and what is auditable. Governance and accountability challenges in LLMs remain open, particularly in enterprise settings where “shadow AI” usage, unclear data-handling policies, and vendor opacity can undermine assurance [30,45]. Future research should propose and empirically evaluate governance patterns (policy-as-code for AI usage, audit logs for AI-assisted diffs, model-risk registers, and red-team protocols) that can be adopted without prohibitive overhead.
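
The first direction above calls for versioned model identifiers and recorded inference settings. One way this might be captured is an evaluation manifest whose fingerprint changes whenever any setting changes. The sketch below is an assumption-laden illustration: the field set, the class name `EvalManifest`, and the example values are hypothetical rather than a proposed standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalManifest:
    """Settings needed to rerun one evaluation (illustrative field set)."""
    model_id: str          # versioned identifier, not a moving alias
    temperature: float
    top_p: float
    max_tokens: int
    seed: int
    prompt_template: str
    dataset_version: str

    def fingerprint(self):
        # Canonical JSON keeps the hash stable across runs and machines.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical configuration for a vulnerability-detection run.
base = EvalManifest(
    model_id="example-code-model-2025-01-15",
    temperature=0.0,
    top_p=1.0,
    max_tokens=512,
    seed=1234,
    prompt_template="Review the following function for CWE issues:\n{code}",
    dataset_version="example-benchmark-v1.2",
)
hotter = replace(base, temperature=0.8)  # same run, one setting changed

if __name__ == "__main__":
    print("base:  ", base.fingerprint())
    print("hotter:", hotter.fingerprint())
```

Reporting the fingerprint alongside each result lets a replication study confirm that it used the same configuration before comparing numbers.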

Across these directions, a consistent implication emerges: progress toward trustworthy LLM-assisted secure development will depend as much on methodology, evaluation discipline, and governance engineering as on model architecture improvements. The most valuable near-term advances are likely to come from reproducible, longitudinal studies and verifiable hybrid pipelines that explicitly measure security outcomes and organizational assurance, rather than from isolated demonstrations on narrow benchmarks [18,30,73].

6. Conclusions

This survey synthesized the current state of research on large language models (LLMs) in software security, integrating evidence from studies published between 2020 and 2025. We organized the review around three research questions: (RQ1) how LLM integration alters secure development practices across the lifecycle; (RQ2) which latent risks arise from widespread LLM usage; and (RQ3) how explainability, auditability, and governance can be strengthened to satisfy assurance and regulatory needs.

For RQ1, the literature converges on a practical characterization: LLMs presently function as high-bandwidth security assistants embedded in development workflows rather than autonomous security engineers. Studies examining AI-assisted coding tools and code-generation systems show that LLMs can accelerate security-relevant activities—including vulnerability triage, explanation, and candidate repair generation—and can generalize across languages more readily than many traditional approaches [7,14]. However, evidence also shows that these gains are uneven and highly context-sensitive: models often struggle with implicit invariants, cross-file dependencies, and multi-step reasoning, and they can misunderstand context even when provided with surrounding code [43,50]. Consequently, the strongest empirical support favors augmentation with structured oversight: LLM outputs are most defensible when layered with deterministic analysis, testing, and human review.

For RQ2, the surveyed evidence indicates that LLM-related security risks are systematic rather than incidental. Multiple independent empirical studies report that AI-based code assistants frequently produce insecure patterns (e.g., injection vulnerabilities, unsafe validation, weak cryptographic choices), including when developers explicitly request secure implementations [18,19,20,21,22,25]. Complementary work shows that LLM-generated code can introduce bugs with security implications and that the correctness of generated fixes is often fragile, particularly under real-world constraints [23,24,69]. Beyond code quality, broader ML security research highlights the risks of memorization, leakage, and inversion, which become especially salient when proprietary or sensitive code appears in training or fine-tuning pipelines [32,33]. Adversarial research further shows that code-generating models can be manipulated through backdoors and other attacks, reinforcing that robustness must be evaluated against adaptive threats and not only benign prompting [36,39].

For RQ3, explainability and auditability emerge as decisive barriers to high-assurance deployment. Although explainable AI for security is an active area, empirical studies emphasize that LLM explanations can be persuasive yet unfaithful, inconsistent, or incomplete, which can amplify automation bias and create false assurance in security decisions [47,50,51]. This limitation matters most in compliance-driven environments where outputs must be traceable to evidence and controls. Governance-oriented work highlights that accountability gaps can widen when LLMs are embedded into pipelines without clear logging, review policies, and responsibility boundaries [30,45]. Similarly, AI-driven threat modeling illustrates both opportunity (expanded enumeration) and risk (shallow coverage) unless outputs are systematically validated and owned by accountable roles [41].

Taken together, the surveyed literature supports a balanced conclusion: LLMs can materially improve the reach and speed of security work, but they also introduce structural risks that require deliberate controls. A credible path toward trustworthy integration demands (1) reproducible and contamination-aware evaluation frameworks, (2) curated and ethically governed datasets with provenance and leakage safeguards, (3) explanation mechanisms evaluated for faithfulness and audit utility, and (4) governance structures that define accountability, monitoring, and safe operational boundaries [30,33,45,55,73]. As the field matures, progress should be assessed not only by model capability, but by the degree to which LLM-assisted security workflows produce verifiable, longitudinally stable improvements in real systems under realistic organizational and adversarial conditions [39,40].

Author Contributions

Conceptualization, M.B.R. and S.T.; methodology, M.B.R. and S.T.; software, M.B.R. and M.S.J.H.; validation, M.B.R. and M.I.K.; formal analysis, M.B.R. and M.S.J.H.; investigation, M.B.R. and M.I.K.; resources, M.B.R.; data curation, M.B.R. and M.I.K.; writing—original draft preparation, M.B.R.; writing—review and editing, M.B.R., M.S.J.H., M.I.K., S.T., A.S., M.I.P., S.Y., and H.S.; visualization, M.B.R. and M.S.J.H.; supervision, S.T. and H.S.; project administration, H.S.; funding acquisition, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation (NSF) under Award Numbers 2433800 (ML4CS), 2421324 (ALAMOSE), and 1946442 (ACES), and by the National Institutes of Health (NIH) under Grant Number 5R42LM014356-03. The APC was funded by the National Science Foundation. Any opinions, findings, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or NIH.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open Foundation Models for Code. arXiv 2023, arXiv:2308.12950.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature 2025, 645, 633–638.
Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the Findings of EMNLP 2020, Online, 16–20 November 2020; pp. 1536–1547.
Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-Training Code Representations with Data Flow. arXiv 2021, arXiv:2009.08366.
Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53.
Bui, T.-D.; Vu, T.T.; Nguyen, T.-T.; Nguyen, S.; Vo, H.D. Correctness Assessment of Code Generated by Large Language Models Using Internal Representations. J. Syst. Softw. 2025, 224, 112570.
Tamberg, K.; Bahsi, H. Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. IEEE Access 2025, 13, 29698–29717.
Chen, Y.; Ding, Z.; Alowain, L.; Chen, X.; Wagner, D. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, 16–18 October 2023; pp. 654–668.
Zibaeirad, A.; Vieira, M. VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching. arXiv 2024, arXiv:2409.10756.
Wang, P.; Liu, X.; Xiao, C. CVE-Bench: Benchmarking LLM-Based Software Engineering Agent’s Ability to Repair Real-World CVE Vulnerabilities. In Proceedings of the NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 4207–4224.
Shiri Harzevili, N.; Boaye Belle, A.; Wang, J.; Wang, S.; Jiang, Z.M.; Nagappan, N. A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning. ACM Comput. Surv. 2024, 57, 55.
Ghaffarian, S.M.; Shahriari, H.R. Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey. ACM Comput. Surv. 2017, 50, 56.
Kumar, P. Large Language Models (LLMs): Survey, Technical Frameworks, and Future Challenges. Artif. Intell. Rev. 2024, 57, 260.
Sheng, Z.; Chen, Z.; Gu, S.; Huang, H.; Gu, G.; Huang, J. LLMs in software security: A survey of vulnerability detection techniques and insights. ACM Comput. Surv. 2025, 58, 134.
Negri-Ribalta, C.; Geraud-Stewart, R.; Sergeeva, A.; Lenzini, G. A Systematic Literature Review on the Impact of AI Models on the Security of Code Generation. Front. Big Data 2024, 7, 1386720.
Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. In Proceedings of the 2022 ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Singapore, 14–18 November 2022; pp. 1215–1227.
Siddiq, M.L.; Santos, J.C.S. SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S), Pittsburgh, PA, USA, 17–18 November 2022; pp. 29–33.
Nguyen, N.; Nadi, S. An Empirical Evaluation of GitHub Copilot’s Code Suggestions. In Proceedings of the 19th International Conference on Mining Software Repositories (MSR ’22), Pittsburgh, PA, USA, 23–24 May 2022; pp. 1–5.
Tihanyi, N.; Bisztray, T.; Jain, R.; Ferrag, M.A.; Cordeiro, L.C.; Mavromatis, V. How Secure is AI-Generated Code: A Large-Scale Comparison of Large Language Models. Empir. Softw. Eng. 2024, 29, 138.
Siddiq, M.; Gopinath, R.; Bhat, P. Empirical Evaluation of GitHub Copilot for Security Vulnerabilities. J. Syst. Softw. 2023, 203, 111915. [Google Scholar]
Tambon, F.; Nikanjam, A.; An, L.; Khomh, F.; Antoniol, G. Bugs in Large Language Models Generated Code: An Empirical Study. Empir. Softw. Eng. 2025, 30, 80. [Google Scholar] [CrossRef]
Mastropaolo, A.; Ciniselli, M.; Cooper, N.; Palacio, D.N.; Poshyvanyk, D.; Oliveto, R.; Bavota, G. On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 2149–2160. [Google Scholar] [CrossRef]
Perry, N.; Srivastava, M.; Kumar, D.; Boneh, D. Do Users Write More Insecure Code with AI Assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23), Copenhagen, Denmark, 26–30 November 2023; pp. 2785–2799. [Google Scholar] [CrossRef]
Fu, M.; Tantithamthavorn, C.K.; Nguyen, V.; Le, T. ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We? In Proceedings of the 30th Asia-Pacific Software Engineering Conference (APSEC 2023), Seoul, Republic of Korea, 4–7 December 2023; pp. 632–636. [Google Scholar] [CrossRef]
Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. Commun. ACM 2025, 68, 96–105.
Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NeurIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 2, pp. 2503–2511.
Sandoval, G.; Pearce, H.; Nys, T.; Karri, R.; Garg, S.; Dolan-Gavitt, B. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security ’23), Anaheim, CA, USA, 9–11 August 2023; pp. 2205–2222.
Sallou, J.; Durieux, T.; Panichella, A. Breaking the Silence: The Threats of Using LLMs in Software Engineering. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), Lisbon, Portugal, 12–21 April 2024; pp. 102–106.
Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security ’21), Virtual, 11–13 August 2021; pp. 2633–2650.
Schuster, R.; Song, C.; Tromer, E.; Shmatikov, V. You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion. In Proceedings of the 30th USENIX Security Symposium (USENIX Security ’21), Virtual, 11–13 August 2021; pp. 1559–1575.
Rigaki, M.; Garcia, S. A Survey of Privacy Attacks in Machine Learning. ACM Comput. Surv. 2023, 56, 101.
Jahanshahi, M.; Mockus, A. Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets. In Proceedings of the 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), Ottawa, ON, Canada, 3 May 2025; pp. 104–111.
Zhou, X.; Weyssow, M.; Widyasari, R.; Zhang, T.; He, J.; Lyu, Y.; Chang, J.; Zhang, B.; Huang, D.; Lo, D. LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks. arXiv 2025, arXiv:2502.06215.
Ramakrishnan, G.; Albarghouthi, A. Backdoors in Neural Models of Source Code. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2892–2899.
Shi, J.; Liu, Y.; Zhou, P.; Sun, L. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. arXiv 2023, arXiv:2304.12298.
Cheng, W.; Sun, K.; Zhang, X.; Wang, W. Security Attacks on LLM-Based Code Completion Tools. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 23669–23677.
Apruzzese, G.; Anderson, H.S.; Dambra, S.; Freeman, D.; Pierazzi, F.; Roundy, K. “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice. In Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), Raleigh, NC, USA, 8–10 February 2023; pp. 339–364.
Fu, M.; Pasuksmit, J.; Tantithamthavorn, C. AI for DevSecOps: A Landscape and Future Opportunities. ACM Trans. Softw. Eng. Methodol. 2024, 33, 197.
Elsharef, I.; Zeng, Z.; Gu, Z. Facilitating Threat Modeling by Leveraging Large Language Models. In Proceedings of the Workshop on AI Systems with Confidential Computing (AISCC 2024), San Diego, CA, USA, 26 February 2024; pp. 1–8.
Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-Empowered Automatic Penetration Testing Framework. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security ’24), Philadelphia, PA, USA, 14–16 August 2024; pp. 1279–1296.
Barke, S.; James, M.B.; Polikarpova, N. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proc. ACM Program. Lang. 2023, 7, 85–111.
Gao, J.; Gebreegziabher, S.A.; Choo, K.T.W.; Li, T.J.-J.; Perrault, S.T.; Malone, T.W. A Taxonomy for Human–LLM Interaction Modes: An Initial Exploration. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24), Honolulu, HI, USA, 11–16 May 2024; pp. 1–11.
Mökander, J.; Schuett, J.; Kirk, H.R.; Floridi, L. Auditing Large Language Models: A Three-Layered Approach. AI Ethics 2023, 4, 1085–1115.
Brundage, M.; Avin, S.; Wang, J.; Belfield, H.; Krueger, G.; Hadfield, G.; Khlaaf, H.; Yang, J.; Toner, H.; Fong, R.; et al. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. Patterns 2020, 1, 100089.
Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf. Fusion 2020, 58, 82–115.
Ding, W.; Abdel-Basset, M.; Hawash, H.; Ali, A.M. Explainability of Artificial Intelligence Methods, Applications and Challenges: A Comprehensive Survey. Inf. Sci. 2022, 615, 238–292.
Liu, Y.; Tantithamthavorn, C.; Liu, Y.; Li, L. On the Reliability and Explainability of Language Models for Program Generation. ACM Trans. Softw. Eng. Methodol. 2024, 33, 126.
Huang, J.; Chang, K.C.-C. Towards Reasoning in Large Language Models: A Survey. In Proceedings of the Findings of the Association for Computational Linguistics (ACL 2023), Toronto, ON, Canada, 9–14 July 2023; pp. 1049–1065.
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248.
Klemmer, J.H.; Horstmann, S.A.; Patnaik, N.; Ludden, C.; Burton, C.; Powers, C.; Massacci, F.; Rahman, A.; Votipka, D.; Lipford, H.R.; et al. Using AI Assistants in Software Development: A Qualitative Study on Security Practices and Concerns. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS ’24), Salt Lake City, UT, USA, 14–18 October 2024; pp. 2726–2740.
National Institute of Standards and Technology. NIST SP 800-218A: Secure Software Development Framework (SSDF) with AI System-Specific Practices; Technical Report; U.S. Department of Commerce: Washington, DC, USA, 2024. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-218A.pdf (accessed on 15 December 2025).
Rzig, D.; Chakraborty, S.; Haiduc, S.; Shahriar, H. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead. ACM Trans. Softw. Eng. Methodol. 2025, 34, 145.
Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report EBSE-2007-01; Software Engineering Group, School of Computer Science and Mathematics, Keele University: Keele, UK; Department of Computer Science, University of Durham: Durham, UK, 2007; Volume 2.3, pp. 1–57.
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly. High-Confid. Comput. 2024, 4, 100211.
Mohsin, A.; Janicke, H.; Wood, A.; Sarker, I.H.; Maglaras, L.; Janjua, N. Can We Trust Large Language Models Generated Code? arXiv 2024, arXiv:2406.12513.
Li, Y.; Li, X.; Wu, H.; Xu, M.; Zhang, Y.; Cheng, X.; Xu, F.; Zhong, S. Everything You Wanted to Know About LLM-Based Vulnerability Detection but Were Afraid to Ask. arXiv 2025, arXiv:2504.13474.
Fakih, M.; Dharmaji, R.; Bouzidi, H.; Araya, G.Q.; Ogundare, O.; Faruque, M.A. LLM4CVE: Enabling Iterative Automated Vulnerability Repair with Large Language Models. arXiv 2025, arXiv:2501.03446.
Fu, Y.; Liang, P.; Li, Z.; Shahin, M.; Yu, J.; Chen, J. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Trans. Softw. Eng. Methodol. 2025, 34, 218.
Lyu, M.R.; Ray, B.; Roychoudhury, A.; Tan, S.H.; Thongtanunam, P. Automatic Programming: Large Language Models and Beyond. ACM Trans. Softw. Eng. Methodol. 2025, 34, 140.
Ullah, S.; Han, M.; Pujar, S.; Pearce, H.; Coskun, A.; Stringhini, G. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; pp. 862–880.
Majdinasab, V.; Bishop, M.J.; Rasheed, S.; Moradidakhel, A.; Tahir, A.; Khomh, F. Assessing the Security of GitHub Copilot’s Generated Code—a Targeted Replication Study. In Proceedings of the 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 12–15 March 2024; pp. 435–444.
Hossain, S.J. LLM-USED: Repository of LLM Framework Usage Data and Plots. 2025. Available online: https://github.com/shafayetjamilhossain205/LLM-USED (accessed on 28 September 2025).
Chong, C.J.; Yao, Z.; Neamtiu, I. Artificial-Intelligence Generated Code Considered Harmful. arXiv 2024, arXiv:2409.19182.
Sultana, S.; Afreen, S.; Eisty, N. Code Vulnerability Detection: A Comparative Analysis of Emerging LLMs. arXiv 2024, arXiv:2409.10490.
Li, Y.; Shezan, F.H.; Wei, B.; Wang, G.; Tian, Y. SoK: Towards Effective Automated Vulnerability Repair. In Proceedings of the 34th USENIX Security Symposium, Seattle, WA, USA, 13–15 August 2025.
Zhong, W.; Hu, Q.; Zhu, Q.; Zhang, H. Neural Program Repair: Systems, Challenges and Solutions. In Proceedings of the 13th Asia-Pacific Symposium on Internetware (Internetware 2022), Beijing, China, 6 August 2022; pp. 1–10.
Bhandari, G.; Naseer, A.; Moonen, L. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE ’21), Athens, Greece, 20–21 August 2021; pp. 30–39.
Bouzid, R.; Khoury, R. Assessing the Effectiveness of ChatGPT in Secure Code Development: A Systematic Literature Review. ACM Comput. Surv. 2025, 57, 324.
Xia, C.S.; Wei, Y.; Zhang, L. Automated Program Repair in the Era of Large Pre-Trained Language Models. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023), Melbourne, Australia, 14–20 May 2023; pp. 1482–1494.
Khoury, R.; Avila, A.R.; Brunelle, J.; Camara, B.M. How Secure is Code Generated by ChatGPT? In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2445–2451.
Al-Kaswan, A.; Izadi, M.; van Deursen, A. Traces of Memorisation in Large Language Models for Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24), Lisbon, Portugal, 14–20 April 2024; pp. 1–12.
Al-Boghdady, A.; Wassif, K.; El-Ramly, M. The Presence, Trends, and Causes of Security Vulnerabilities in Operating Systems of IoT’s Low-End Devices. Sensors 2021, 21, 2329.
Tóth, R.; Bisztray, T.; Erdodi, L. LLMs in Web Development: Evaluating LLM-Generated PHP Code—Unveiling Vulnerabilities and Limitations. In Proceedings of the Computer Safety, Reliability, and Security. SAFECOMP 2024 Workshops (DECSoS, SASSUR, TOASTS, and WAISE), Florence, Italy, 17–20 September 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 14989, pp. 425–437.
Pa Pa, Y.M.; Tanizaki, S.; Kou, T.; van Eeten, M.; Yoshioka, K.; Matsumoto, T. An Attacker’s Dream? Exploring the Capabilities of ChatGPT for Developing Malware. In Proceedings of the 16th Cyber Security Experimentation and Test Workshop (CSET 2023), Marina del Rey, CA, USA, 7–8 August 2023; pp. 10–18.
Cotroneo, D.; Foggia, A.; Improta, C.; Liguori, P.; Natella, R. Automating the Correctness Assessment of AI-Generated Code for Security Contexts. J. Syst. Softw. 2024, 216, 112113.
Tomassi, A. Data Security and Privacy Concerns for Generative AI Platforms. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2024.
Kumamoto, T.; Yoshida, Y.; Fujima, H. Evaluating Large Language Models in Ransomware Negotiation: A Comparative Analysis of ChatGPT and Claude. Res. Sq. 2023.
Prompt Security Team. Securing AI for Cymulate: A Case Study in Controlled AI Adoption; Prompt Security Technical Reports; Prompt Security Ltd.: Tel Aviv, Israel, 2025. Available online: https://prompt.security/blog/case-study-securing-ai-for-cymulate-ensuring-safe-ai-adoption-across-teams (accessed on 12 December 2025).

Figure 1. Cumulative growth of published studies on LLMs in software security from 2020 to 2024. Data compiled from ACM, IEEE Xplore, SpringerLink, ScienceDirect, Scopus, and arXiv (accessed August 2025).

Figure 2. PRISMA-style flow diagram for study selection in this survey. The diagram summarizes the identification, screening, eligibility assessment, and inclusion of 81 studies on large language models (LLMs) in software security, based on records retrieved from ACM Digital Library, IEEE Xplore, SpringerLink, ScienceDirect, Scopus, and arXiv (January 2020–March 2025).

Figure 3. Examples of LLM framework use as of July 2025: (a) adoption, measured by GitHub “Used by” dependency counts; (b) developer endorsement, measured by GitHub stars per framework. Data source: Hossain (2025) [65].

Figure 4. Comparison of LLM-generated code vulnerability rates (2023 vs. 2024). Data sources: Majdinasab et al. (2024) [64]; Fu et al. (2025) [61]; Cheng et al. (2025) [38]; Chong et al. (2024) [66].

Figure 5. Leakage of identifiable, private, and secret information in Codex/Copilot outputs. Data source: Negri-Ribalta et al. (2024) [17].

Figure 6. LLM automated vulnerability-repair success rates across benchmarks and languages. Data sources: Li et al. (2025) [68]; Negri-Ribalta et al. (2024) [17]; Wang et al. (2025) [12]; Fakih et al. (2025) [60]; and Xia et al. (2023) [72].

Figure 7. Latent risks of LLM-based code generation. Data compiled from reviewed studies.

Figure 8. Triadic framework of LLM adoption in software security. Data compiled from reviewed studies.

Table 1. Summary of the quality assessment for five representative studies selected from the 81 papers reviewed in this work. Ratings: High, Moderate, Low. Criteria: DC = dataset clarity and availability; BS = baseline strength; RT = reproducibility and transparency.

No.	Study (Abbrev.)	DC	BS	RT	Ref.
1	Systematic literature review on AI models and code-generation security	High	Moderate	High	[17]
2	Assessing the security of GitHub Copilot’s (2023 version) generated code (replication study)	High	High	Moderate	[64]
3	DiverseVul (large-scale vulnerable code dataset)	High	Moderate	High	[10]
4	BadGPT (security vulnerabilities/backdoor attacks on LLMs)	Moderate	Low	Low	[37]
5	Cracks in The Stack (risks in The Stack v2 training dataset)	High	Moderate	Moderate	[34]

Table 2. LLM integration with security frameworks. Data sources: Sheng et al. (2025) [16]; Negri-Ribalta et al. (2024) [17]; Pearce et al. (2025) [27]; Majdinasab et al. (2024) [64]; Ullah et al. (2024) [63]; Deng et al. (2024) [42]; Jahanshahi and Mockus (2025) [34]; Zhou et al. (2025) [35]; Sultana et al. (2024) [67]; Bouzid and Khoury (2025) [71]; and Copilot security evaluations [18,20,22].

Framework/Practice	Tools Studied	Reported Effects	Limitations	Refs.
SDL and DevSecOps	GPT-4, Code Llama, security-tuned LLMs	Faster triage; cross-language reasoning; backlog summarization	High false positives; uneven precision; limited business-logic coverage	[16,17,40,63,67]
Code Review	Copilot (2023 version), ChatGPT (GPT-3.5/GPT-4, OpenAI, 2023), VulnLLMEval, assistant-style tools	Inline CWE detection; patch suggestions; natural language explanations	Insecure outputs persist; oversimplified or partial fixes; hallucinated rationales	[11,18,22,27,64,71]
Threat Modeling	GPT-4, Claude, other chat models	Attack-path sketches; misuse/abuse cases; STRIDE-style ideation	Limited depth; missing domain-specific threats; requires expert validation	[15,41,42,57]
Supply Chain and Datasets	The Stack v2, public code corpora	CVEs and licensing issues surfaced; data-quality analysis	Recurring vulnerable patterns; contamination and data leakage across benchmarks	[34,35,59]

Table 3. LLM autonomy across secure development phases. Data sources: Sheng et al. (2025) [16]; Kumar (2024) [15]; Pearce et al. (2025) [27]; Majdinasab et al. (2024) [64]; Fu et al. (2025) [61]; Wang et al. (2025) [12]; Li et al. (2025) [68]; Fakih et al. (2025) [60]; Fu et al. (2023) [26]; Lyu et al. (2025) [62]; Gao et al. (2024) [44]; Rzig et al. (2025) [54]; and vulnerability-repair and program-repair perspectives [69,70].

Lifecycle Phase	Tools Studied	Autonomy Level	Observed Gaps	Refs.
Requirements	Prompt-based assistants; RAG systems	Assistive only	Needs human-driven prioritization and scope definition	[15,16]
Design	Chat-based LLMs with security patterns	Assistive only	No formal guarantees for architectural correctness; limited threat coverage	[41,42,44,62]
Coding	Copilot (2023 version), ChatGPT (GPT-3.5/GPT-4, OpenAI, 2023), other assistants	Partial automation	Insecure outputs persist; new vulnerabilities introduced; context gaps	[27,43,61,64]
Testing	VulnLLMEval, CVE-Bench, DiverseVul	Partial automation	Mislabeling of patched vs. vulnerable samples; benchmark leakage	[10,11,12,59]
Deployment and Repair	LLM-based auto-patch pipelines	Partial automation (∼10–30% CVEs fixed)	Repair success remains low; generalization limited; semantic correctness hard to guarantee	[60,68,69,70,72]
Maintenance	Empirical vulnerability detection	Assistive only	Vulnerabilities persist across releases; risk of accumulated technical debt	[26,28,54]

Table 4. Mapping of vulnerability types in LLM-generated code to STRIDE threat categories. Data source: CVE-Bench (Wang et al. 2025) [12].

STRIDE Category	Vulnerability Type	Dataset/Benchmark	Avg. Occurrence (%)	Reference
Information Disclosure	SQL Injection	CVE-Bench	2.36%	[12]
Elevation of Privilege	Cross-Site Scripting (XSS)	CVE-Bench	20.24%	[12]
Denial of Service	HTTP Response Splitting	CVE-Bench	2.36%	[12]
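
The mapping in Table 4 can also be read as a small lookup structure. The sketch below is illustrative only and is not part of CVE-Bench or of the surveyed tooling: it transcribes the three rows of Table 4 and shows how a vulnerability type can be resolved to its listed STRIDE category and how occurrence rates can be totalled per category. The names `TABLE_4`, `stride_category`, and `occurrence_by_stride` are hypothetical and were introduced for this example.

```python
from collections import defaultdict

# Rows transcribed from Table 4 (CVE-Bench, Wang et al. 2025 [12]):
# (STRIDE category, vulnerability type, average occurrence in percent).
TABLE_4 = [
    ("Information Disclosure", "SQL Injection", 2.36),
    ("Elevation of Privilege", "Cross-Site Scripting (XSS)", 20.24),
    ("Denial of Service", "HTTP Response Splitting", 2.36),
]


def stride_category(vuln_type, rows=TABLE_4):
    """Return the STRIDE category listed for a vulnerability type, or None."""
    for category, name, _pct in rows:
        if name == vuln_type:
            return category
    return None


def occurrence_by_stride(rows=TABLE_4):
    """Sum the average occurrence (%) of all vulnerability types per category."""
    totals = defaultdict(float)
    for category, _name, pct in rows:
        totals[category] += pct
    return dict(totals)


if __name__ == "__main__":
    print(stride_category("SQL Injection"))
    for category, pct in occurrence_by_stride().items():
        print(f"{category}: {pct:.2f}%")
```

Keeping the rows as plain tuples means additional vulnerability types from other benchmarks could be appended without changing either function; any such rows would need their own sources.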

Table 5. Detailed comparison of latent security risks in LLM-generated code. Data sources: Pearce et al. (2025) [27]; Fu et al. (2025) [61]; Tóth et al. (2024) [76]; Negri-Ribalta et al. (2024) [17]; Sandoval et al. (2023) [29]; Li et al. (2025) [68]; Jahanshahi and Mockus (2025) [34]; Cheng et al. (2025) [38]; Pa Pa et al. (2023) [77]; Chong et al. (2024) [66]; Ji et al. (2023) [51]; Rzig et al. (2025) [54]; and Copilot security and misuse literature [18,20,22,36,39].

Risk Category	Examples Observed	Affected Tools/Models	Reported Impact	References
Insecure Code Generation	Weak cryptography; unsafe memory ops; insecure handlers; unsafe file I/O	Copilot, Codex, GPT-style tools	Security weaknesses persist across prompts and contexts; high variance across benchmarks	[18,21,22,27,61,76,78]
Propagation of Vulnerable Patterns	Hardcoded secrets; injection; unsafe file handling; insecure reuse	Codex, Copilot, chat assistants	Automation bias amplifies risk when suggestions are accepted quickly; insecure reuse at scale	[13,17,29,57]
Incomplete/Flawed Fixes	Superficial CWE patches; non-compiling or semantics-breaking repairs	VulnLLMEval, CVE-Bench, LLM4CVE	Limited semantic repair success on real CVEs; evaluation constraints matter	[11,12,60,68,69,70]
Data Leakage and Memorization	Secrets; credentials; verbatim code resurfacing	Models trained on large code corpora	Small but non-trivial leakage rates; extraction/inversion risks under targeted prompts	[17,31,32,33,34,74]
Adversarial Misuse	Prompt injection; jailbreaks; malicious code generation; hidden behaviors	Chat assistants, code completion tools	Prompt integrity can be subverted; backdoors and misuse scenarios remain plausible	[36,37,38,39,66,77]
Developer Over-Reliance and Technical Debt	Automation bias; reduced manual review; deferred refactoring	Assistant-centric workflows	Higher acceptance of insecure code; accumulation of long-term technical debt	[28,29,30,52,54]
Organizational and Governance Risks	Shadow AI usage; unclear data-handling policies; weak accountability	Enterprise deployments	Policy violations and unclear responsibility for security failures	[45,46,79]

Table 6. Summary of findings for RQ3: explainability and compliance of LLMs in secure development. Data sources: Ding et al. (2022) [48]; Liu et al. (2024) [49]; Arrieta et al. (2020) [47]; Huang and Chang (2023) [50]; Ji et al. (2023) [51]; NIST (2024) [53]; Brundage et al. (2020) [46]; Mökander et al. (2023) [45]; Yao et al. (2024) [57]; and Prompt Security Team (2025) [81].

Aspect	Approach/Technique	Observed Benefits	Limitations/Gaps	Refs.
Explainability Mechanisms	Rule-based rationales; security-focused highlighting; step-wise prompting	Improved interpretability; supports quicker identification of obvious flaws	Traces may be unfaithful; instability under prompt/context shifts	[48,49]
Compliance Alignment	Mapping to ISO/IEC 27001, HIPAA, NIST SSDF-AI	Supports traceability between outputs and controls	No standardized automated scoring; heavy reliance on expert review	[53,57]
Assurance and Evidence	Security arguments; test-and-review pipelines; continuous monitoring	Reusable evidence for audits and certifications	Limited benchmarks for AI-enabled assurance; hard to compare tools	[45,46]
Model Transparency Tools	Behavioral probes; differential prompts; token-level inspection	Supports sanity checks and localized vulnerability analysis	Overhead in large projects; no agreed metrics for transparency sufficiency	[49,50]
Governance and Organizational Controls	Secure SDLC extensions; policy/audit logs; controlled AI adoption	Mitigates shadow AI; clarifies accountability in safety-critical systems	Implementation overhead; requires sustained operational discipline	[45,46,81]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

