A Zero-Touch Vulnerability Remediation Framework Based on OpenVAS, Threat Intelligence, and RAG-Enhanced Large Language Models
Abstract
1. Introduction
- We propose an end-to-end Zero-Touch remediation architecture integrating OpenVAS, NVD/Cyber Threat Intelligence (CTI), RAG, and LLMs under ZSM principles. As Table 1 shows, no prior peer-reviewed work, to the best of our knowledge, combines all six capabilities in a single closed-loop pipeline: open-scanner integration, RAG-based threat intelligence retrieval, dual-LLM verification, confidence-based routing, automated CI/CD deployment with rollback, and multi-model statistical validation.
- We design a dual-LLM verification and confidence-based routing mechanism that extends evidence-grounded approaches such as ProveRAG and reduces the risk of hallucinated or erroneous patches.
- We systematically evaluate how prompt engineering, input segmentation, and RAG each affect patching accuracy and latency through an empirical study on 350 real-world vulnerabilities, quantifying improvements in remediation quality, manual effort reduction, and time efficiency.
- We implement system-level defenses and auditing mechanisms aligned with the OWASP LLM Top 10 and RAG security best practices. Reproducible implementation artifacts—including the RAG pipeline, evaluation scripts, and prompt templates—will be released upon acceptance; anonymized materials are available to reviewers upon request during the review period.
2. Related Work
2.1. Automated Vulnerability Management and Zero-Touch Security
2.2. Vulnerability Scanning and the OpenVAS Ecosystem
2.3. Applications of Large Language Models in Cybersecurity and Vulnerability Remediation
2.4. The Role of Retrieval-Augmented Generation (RAG) in Security and Vulnerability Management
2.5. Threat Intelligence Integration and Prioritization Strategy
2.6. Prompt Engineering and LLM Security Governance
3. System Architecture
| Algorithm 1. Zero-Touch Vulnerability Remediation Pipeline |
| Input: Target network assets T, Threat intelligence sources TI, Confidence thresholds θ_high and θ_low, Knowledge base KB, Top-K parameter K = 3, RRF smoothing constant c = 60 |
| Output: Remediation execution reports R |
| // === SCANNING LAYER === |
| schedule OpenVAS periodic scan on T |
| xml_report ← OpenVAS.scan(T) |
| vulns ← ParseXML(xml_report) // Extract CVE, CVSS, description, etc. |
| for each vuln v in vulns do |
|     v.json ← Normalize(v) // Convert to unified JSON schema |
| end for |
| delta ← DiffWithHistory(vulns, historical_DB) // New, recurring, resolved |
| // === AI DECISION LAYER === |
| for each vulnerability v in vulns do |
|     // Step 1: RAG Retrieval |
|     query ← BuildQuery(v.cve_id, v.description, v.service) |
|     docs_vec ← FAISS.search(query, top_K = 2*K) |
|     docs_kw ← BM25.search(query, top_K = 2*K) |
|     evidence ← RRF_Merge(docs_vec, docs_kw, top_K = K) |
|     evidence ← ApplyTemporalBoost(evidence, days = 30) |
|     // Step 2: LLM Generation with Structured Prompt |
|     prompt ← BuildPrompt(SystemRole, TaskInstruction, FewShotExamples, Scratchpad, v.json, evidence) |
|     response ← PrimaryLLM.generate(prompt) // GPT-4-class model |
|     recommendation ← ParseJSON(response) |
|     // Step 3: Dual-LLM Verification (AuxLLM = gpt-4o) |
|     verification ← AuxLLM.verify(recommendation, evidence) |
|     confidence ← ComputeConfidence(verification) |
|     // Step 4: Confidence-Based Routing |
|     if confidence >= θ_high then |
|         route v to ORCHESTRATION_LAYER (auto-execute) |
|     else if confidence >= θ_low then |
|         route v to HUMAN_REVIEW_QUEUE (semi-automated) |
|     else |
|         route v to MANUAL_HANDLING |
|     end if |
| end for |
| // === ORCHESTRATION LAYER === |
| for each auto-approved recommendation rec do |
|     playbook ← SelectAnsibleTemplate(rec.action_type) |
|     script ← FillTemplate(playbook, rec.details) |
|     snapshot ← CreateSnapshot(rec.target_host) |
|     result ← Jenkins.deploy(script, environment = "staging") |
|     if ValidateDeployment(result) == SUCCESS then |
|         Jenkins.promote(script, environment = "production") |
|     else |
|         Rollback(snapshot) |
|         FlagForManualInvestigation(rec) |
|     end if |
|     // Feedback loop |
|     UpdateKB(KB, rec, result) // Reinforce or penalize strategy |
|     R.append(GenerateReport(rec, result)) |
| end for |
| return R |
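To make Step 4 of Algorithm 1 concrete, the following Python sketch implements the three-tier routing rule. The `Route` enum, the function name, and the threshold values 0.9 and 0.6 are illustrative assumptions; this section does not state the values of θ_high and θ_low.

```python
from enum import Enum


class Route(Enum):
    """Destinations named in Step 4 of Algorithm 1."""
    AUTO_EXECUTE = "ORCHESTRATION_LAYER"
    HUMAN_REVIEW = "HUMAN_REVIEW_QUEUE"
    MANUAL = "MANUAL_HANDLING"


def route_by_confidence(confidence: float, theta_high: float, theta_low: float) -> Route:
    """Map a verifier confidence score in [0, 1] to one of three handling tiers."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if theta_low > theta_high:
        raise ValueError("theta_low must not exceed theta_high")
    if confidence >= theta_high:
        return Route.AUTO_EXECUTE
    if confidence >= theta_low:
        return Route.HUMAN_REVIEW
    return Route.MANUAL


# Hypothetical thresholds, chosen only for illustration.
THETA_HIGH, THETA_LOW = 0.9, 0.6

for score in (0.95, 0.70, 0.30):
    print(score, route_by_confidence(score, THETA_HIGH, THETA_LOW).name)
```

Both comparisons are inclusive, matching the `>=` tests in the algorithm, so a score exactly equal to θ_high is auto-executed.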
3.1. Scanning Layer
3.2. AI Decision Layer
| Algorithm 2. Hybrid RAG Retrieval with Temporal Boosting |
| Input: Vulnerability record v, Vector index VI (FAISS), Keyword index KI (BM25), Knowledge base KB, Top-K parameter K = 3, RRF smoothing constant c = 60, Temporal window W = 30 days, Source boost factors α_auth = 1.3, α_recent = 1.5, α_patch = 1.2 |
| Output: Ranked evidence documents E |
| query_text ← Concat(v.cve_id, v.description, v.service_name) |
| query_vec ← EmbeddingModel.encode(query_text) |
| // Parallel retrieval from both indices |
| results_vec ← VI.search(query_vec, top_K = 2*K) |
| results_kw ← KI.search(query_text, top_K = 2*K) |
| // Reciprocal Rank Fusion (RRF) |
| merged ← {} |
| for each doc d in results_vec ∪ results_kw do |
|     rank_v ← rank of d in results_vec (or infinity if absent) |
|     rank_k ← rank of d in results_kw (or infinity if absent) |
|     merged[d].score ← 1/(c + rank_v) + 1/(c + rank_k) |
| end for |
| // Temporal relevance boosting |
| for each doc d in merged do |
|     if d.source in {"CISA_KEV", "ExploitDB"} then |
|         if d.last_exploited_date >= (today − W) then |
|             merged[d].score ← merged[d].score × α_recent // α_recent = 1.5 |
|         end if |
|     end if |
|     if d.has_official_patch == true then |
|         merged[d].score ← merged[d].score × α_patch // α_patch = 1.2 |
|     end if |
| end for |
| // Source trustworthiness boosting |
| for each doc d in merged do |
|     if d.source_tier == "authoritative" then |
|         merged[d].score ← merged[d].score × α_auth // α_auth = 1.3 |
|     end if |
| end for |
| E ← TopK(merged, K) |
| return E |
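The fusion and boosting steps of Algorithm 2 can be sketched in Python as follows. The sketch takes the two ranked lists as given (it does not call FAISS or BM25), and the `Doc` dataclass, the function name, and the toy document identifiers are illustrative assumptions. The constants c = 60, W = 30 days, and the three boost factors are taken from the algorithm's input line.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical metadata record for one knowledge-base document."""
    doc_id: str
    source: str = ""
    source_tier: str = ""
    has_official_patch: bool = False
    last_exploited_date: Optional[date] = None


def rrf_with_boosts(results_vec: list[str], results_kw: list[str], docs: dict[str, Doc],
                    today: date, k: int = 3, c: int = 60, window_days: int = 30,
                    a_auth: float = 1.3, a_recent: float = 1.5, a_patch: float = 1.2):
    """Fuse two rankings with RRF, apply multiplicative boosts, return the top-k (id, score) pairs."""
    scores: dict[str, float] = {}
    # A document absent from a list has rank infinity and so contributes 0 for that list.
    for ranking in (results_vec, results_kw):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)

    cutoff = today - timedelta(days=window_days)
    for doc_id in scores:
        d = docs[doc_id]
        recently_exploited = d.last_exploited_date is not None and d.last_exploited_date >= cutoff
        if d.source in {"CISA_KEV", "ExploitDB"} and recently_exploited:
            scores[doc_id] *= a_recent
        if d.has_official_patch:
            scores[doc_id] *= a_patch
        if d.source_tier == "authoritative":
            scores[doc_id] *= a_auth

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data: document "d" appears in only one list but earns all three boosts.
today = date(2026, 1, 15)
docs = {
    "a": Doc("a"),
    "b": Doc("b"),
    "c": Doc("c"),
    "d": Doc("d", source="CISA_KEV", source_tier="authoritative",
             has_official_patch=True, last_exploited_date=date(2026, 1, 10)),
}
ranked = rrf_with_boosts(["a", "b", "c"], ["b", "d", "a"], docs, today)
print([doc_id for doc_id, _ in ranked])
```

In this toy input the boosted document overtakes documents that appear in both lists, which illustrates that the combined multiplier (1.5 × 1.2 × 1.3 ≈ 2.34) can outweigh rank agreement between the two retrievers.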
| Algorithm 3. Dual-LLM Verification Procedure |
| Input: recommendation R from primary LLM, retrieved evidence E, knowledge base KB |
| Output: confidence score c ∈ [0, 1], verification verdict v ∈ {PASS, FAIL} |
| function DUAL_LLM_VERIFY(R, E, KB) |
|     // Check 1: Version existence |
|     v_exists ← 1 if R.patch_version ∈ KB.known_versions else 0 |
|     // Check 2: Evidence alignment |
|     e_aligned ← 1 if SEMANTIC_SIM(R.rationale, E) > 0.7 else 0 |
|     // Check 3: Action feasibility |
|     a_feasible ← 1 if R.action_type ∈ ALLOWED_ACTIONS else 0 |
|     // Check 4: Format validity |
|     f_valid ← 1 if VALIDATE_JSON_SCHEMA(R) else 0 |
|     // Weighted confidence computation |
|     c ← 0.3 · v_exists + 0.4 · e_aligned + 0.2 · a_feasible + 0.1 · f_valid |
|     v ← PASS if (c ≥ θ_high ∧ f_valid = 1) else FAIL |
|     return (c, v) |
| end function |
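The scoring step of Algorithm 3 reduces to a weighted sum of four binary checks. The Python sketch below reproduces that arithmetic; the `Checks` dataclass, the function name, and the threshold value 0.9 are illustrative assumptions, and the four check outcomes are supplied directly rather than computed from a knowledge base or a similarity model.

```python
from dataclasses import dataclass

# Weights from Algorithm 3, stored in tenths so the sum is exact integer arithmetic.
WEIGHT_TENTHS = {
    "version_exists": 3,    # 0.3
    "evidence_aligned": 4,  # 0.4
    "action_feasible": 2,   # 0.2
    "format_valid": 1,      # 0.1
}


@dataclass(frozen=True)
class Checks:
    """Outcomes of the four verification checks (hypothetical container)."""
    version_exists: bool
    evidence_aligned: bool
    action_feasible: bool
    format_valid: bool


def confidence_and_verdict(checks: Checks, theta_high: float) -> tuple[float, str]:
    """Return (c, verdict) where PASS requires c >= theta_high and a valid format."""
    tenths = sum(w for name, w in WEIGHT_TENTHS.items() if getattr(checks, name))
    c = tenths / 10
    verdict = "PASS" if (c >= theta_high and checks.format_valid) else "FAIL"
    return c, verdict


THETA_HIGH = 0.9  # hypothetical value for illustration
print(confidence_and_verdict(Checks(True, True, True, True), THETA_HIGH))
print(confidence_and_verdict(Checks(False, True, True, True), THETA_HIGH))
print(confidence_and_verdict(Checks(True, True, True, False), THETA_HIGH))
```

Because every check is binary, c can take only a small set of discrete values. Under the assumed threshold of 0.9, a recommendation that fails only the format check still scores 0.9 but is rejected by the explicit f_valid condition.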
3.3. Orchestration Layer
- Successful cases reinforce the reliability score of the corresponding remediation strategy.
- Failed or rolled-back cases trigger updates to the knowledge base, for example, documenting compatibility issues between a specific patch and OS version.
3.4. Prompt Engineering and Knowledge Fusion Techniques
- System Role: Instructs the model to act as a senior vulnerability analyst, ensuring a professional and rigorous tone. It also explicitly directs the model to follow a specific output format (i.e., a JSON structure) and not to reveal the reasoning process to the end user.
- Task Instruction: Describes the definition of input and output fields. The model is instructed to parse an <Input_XML> containing OpenVAS report data and extract fields such as id, description, root_cause, attack_vector, cvss_score, severity, and remediation, outputting them in JSON format.
- Few-Shot Examples: We manually crafted several representative examples, including common vulnerabilities and cases with missing information. These examples are embedded in the prompt, showcasing both input samples and expected output formats.
- Output Constraints: We added an [Output Constraints] section that explicitly enforces schema compliance: all fields are required (with null for insufficient evidence), and the model is forbidden from fabricating CVE identifiers, package versions, or URLs. This section serves as a guardrail against hallucinated artifacts—the most dangerous error category in automated remediation.
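The output constraints described above lend themselves to a mechanical post-check on each model response. The following Python sketch is one possible form of such a check, not the implementation used in this work: it verifies that all seven fields named in the Task Instruction are present (a null value is accepted) and flags any CVE identifier that does not occur in the input or retrieved evidence. The function name, the regular expression, and the sample strings are assumptions.

```python
import json
import re

# Field names listed in the Task Instruction layer.
REQUIRED_FIELDS = ("id", "description", "root_cause", "attack_vector",
                   "cvss_score", "severity", "remediation")
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def check_output(raw_response: str, allowed_text: str) -> list[str]:
    """Return a list of constraint violations; an empty list means the response passes."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]

    # Every field must be present; None (JSON null) is allowed for insufficient evidence.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    # Any CVE identifier in the response must also occur in the scan input or evidence.
    known_cves = set(CVE_PATTERN.findall(allowed_text))
    for cve in sorted(set(CVE_PATTERN.findall(raw_response)) - known_cves):
        problems.append(f"CVE not present in input or evidence: {cve}")
    return problems


allowed = "OpenVAS finding for CVE-2021-44228 (Apache Log4j2)"
good = json.dumps({"id": "CVE-2021-44228", "description": "Log4j2 JNDI lookup flaw",
                   "root_cause": None, "attack_vector": "network", "cvss_score": 10.0,
                   "severity": "critical", "remediation": None})
bad = json.dumps({"id": "CVE-2021-44228", "description": "related to CVE-2099-0001"})

print(check_output(good, allowed))
print(check_output(bad, allowed))
```

A check of this kind catches only identifiers that are syntactically recognizable; fabricated package versions or URLs would need comparison against the knowledge base, as in Check 1 of Algorithm 3.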
4. Experimental Design and Results
4.1. Dataset and Experimental Setup
- Baseline: Basic prompt only. The model receives the raw OpenVAS report with a minimal prompt, without advanced prompt engineering or RAG-based retrieval. This configuration represents unoptimized LLM performance and serves as the starting point for comparison.
- Prompt: Prompt-optimized only. This configuration uses our four-layer prompt template along with field extraction and long-text segmentation, but excludes RAG retrieval. It isolates the performance gain attributable to prompt engineering alone.
- Prompt + RAG (Final): Full system configuration. This activates all components: prompt optimization, long-text segmentation, and RAG-based threat intelligence retrieval. It represents the fully realized version of the proposed solution, and throughout the remainder of this paper we use “Final” and “Prompt + RAG” interchangeably to refer to it. Note that for gpt-3.5-turbo, all three configurations in the comparative performance results reported in Section 4.2 were evaluated using the fine-tuned variant (see Section 3.4). For this model, the Baseline and Prompt columns therefore isolate the effects of prompt engineering and RAG retrieval within the fine-tuned model, rather than comparing against the original (non-fine-tuned) base model. All other models were evaluated without weight modification.
- OpenVAS-Hint Baseline: For each case in the 50-case evaluation subset (see the non-LLM baseline comparison in Section 4.2), the openvas_remediation_hint field from the scanner output (see Figure 2) is extracted and formatted into the same JSON output schema used by the LLM configurations. Fields that the hint does not cover (e.g., root_cause, attack_vector) are set to null. The resulting outputs are scored against the gold standard using the identical evaluation rubric and dual-rater protocol. This baseline establishes the scanner-only lower bound—i.e., what an organization would achieve by acting solely on scanner-provided guidance without any LLM or retrieval augmentation.
- Retrieval-Only Baseline: For each of the same 50 evaluation-subset cases, the hybrid RAG retrieval engine (Algorithm 2) is executed identically to the Prompt + RAG configuration, returning the top-K (K = 3) evidence documents. Instead of passing these to an LLM, the system concatenates the retrieved documents and applies a deterministic template-based extractor to populate the output JSON schema: the remediation field is filled with the first actionable sentence from the highest-ranked advisory; root_cause and attack_vector are extracted via regex patterns matching CVSS vector strings and CWE descriptions; remaining fields are populated from the vulnerability record metadata. Outputs are truncated to match the LLM output length and scored using the same rubric. This baseline isolates the value of retrieval infrastructure from generative synthesis, directly addressing whether the accuracy gains are attributable to RAG retrieval alone or to the LLM’s ability to synthesize cross-document evidence into coherent remediation plans.
- Accuracy (primary metric): The proportion of cases in which the model’s recommended remediation actions were judged correct by both expert raters. A recommendation was rated correct if it satisfied all of the following criteria: (a) the primary remediation action type matched the gold standard (e.g., package upgrade, configuration change, service restart); (b) the target software component was correctly identified; and (c) the recommended version or patch identifier was either an exact match or a functionally equivalent alternative (e.g., recommending version 2.12.7 when the gold standard specifies ≥ 2.12.6). Multi-step remediations were scored as correct only if all critical steps were present. Cases where the two raters disagreed were resolved through discussion until consensus was reached. We note two additional disambiguation rules: (i) when multiple valid remediation paths existed for a single vulnerability (e.g., upgrading to version A or applying vendor workaround B), the recommendation was scored correct if it matched any accepted alternative documented by both raters during gold-standard construction; (ii) recommendations that proposed a strictly newer patch version than the gold standard were accepted as correct provided the raters confirmed the newer version also addresses the target CVE, reflecting the practical reality that vendor-recommended versions may evolve between gold-standard creation and evaluation.
- Hallucination rate: The proportion of recommendations containing at least one factually unsupported or verifiably incorrect claim, including fabricated CVE identifiers, non-existent software versions, or remediation steps contradicted by vendor advisories. Each recommendation was independently assessed by both raters against the knowledge base and vendor documentation.
- Knowledge coverage rate: The proportion of recommendations that cite at least one retrieved evidence segment relevant to the target CVE, as judged by rater consensus.
- Latency: Average time from input submission to response generation.
- Token usage and accuracy per 1K tokens: A cost-effectiveness metric, defined as acc/1kTok = Accuracy × 1000 / (average tokens per query).
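The metrics listed above can be computed from per-case records as in the following Python sketch. The `CaseResult` fields and the four toy cases are hypothetical, and the sketch assumes that Accuracy enters the acc/1kTok formula in percentage points; the text does not state whether a percentage or a fraction is intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One evaluated case (hypothetical record layout)."""
    correct: bool                  # consensus judgment of both raters
    hallucinated: bool             # at least one unsupported or incorrect claim
    cites_relevant_evidence: bool  # cites a retrieved segment relevant to the CVE
    latency_s: float
    tokens: int


def summarize(results: list[CaseResult]) -> dict[str, float]:
    """Aggregate the five evaluation metrics over a list of cases."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    accuracy_pct = 100 * sum(r.correct for r in results) / n
    avg_tokens = sum(r.tokens for r in results) / n
    return {
        "accuracy_pct": accuracy_pct,
        "hallucination_rate_pct": 100 * sum(r.hallucinated for r in results) / n,
        "knowledge_coverage_pct": 100 * sum(r.cites_relevant_evidence for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "avg_tokens": avg_tokens,
        # Assumption: accuracy expressed in percentage points.
        "acc_per_1k_tokens": accuracy_pct * 1000 / avg_tokens,
    }


toy = [
    CaseResult(True, False, True, 2.0, 1000),
    CaseResult(True, False, True, 4.0, 2000),
    CaseResult(False, True, False, 3.0, 1500),
    CaseResult(True, False, True, 3.0, 1500),
]
print(summarize(toy))
```

For the toy data, three of four cases are correct and the average query uses 1500 tokens, so the cost-effectiveness value is 75 × 1000 / 1500 = 50.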
4.2. Experimental Results and Analysis
- Knowledge recency: LLM training corpora have fixed knowledge cutoffs; vulnerabilities disclosed after the cutoff date are effectively invisible to the model. RAG bridges this temporal gap by injecting up-to-date vendor advisories and exploit intelligence at inference time, directly addressing the knowledge staleness problem [10,11].
- Evidence grounding: Without external evidence, models must rely entirely on parametric memory, which is prone to plausible-sounding but factually incorrect outputs (hallucinations). By conditioning generation on retrieved documents with explicit provenance, the model’s outputs become verifiable against cited sources, reducing the system-level hallucination rate from 23.4% (Baseline) to 7.8% (Final, with dual-LLM verification; see the ablation results below for the verifier’s marginal contribution) [9].
- Input compression: Counterintuitively, RAG did not increase token consumption for most models (Table 3). This is because the retrieval step replaces verbose raw vulnerability descriptions with concise, pre-filtered evidence snippets, effectively compressing the input while preserving—and often enhancing—informational density.
5. Discussion
5.1. Prompt Engineering vs. Model Fine-Tuning
5.2. The Critical Role of RAG
5.3. Trade-Offs Between Cost and Latency
5.4. Risk Management in Automated Remediation
Automated vulnerability remediation improves operational efficiency but carries risks. Incorrect patches can disrupt services, create dependency conflicts, or introduce new vulnerabilities [22]. The system addresses these through four layered safety mechanisms:
- Action Whitelisting (prevents unauthorized change types): The system enforces a predefined list of change types allowed for automated execution. Package updates and configuration hardening are permitted on development environments and container images; changes to production-critical services (databases, authentication systems, payment gateways) require explicit manual approval. This policy is version-controlled and auditable.
- Version Verification (prevents misapplication): Prior to deployment, the orchestration layer cross-references the recommended patch version against the target system’s installed software inventory. If a version mismatch is detected (e.g., recommending a patch for Apache 2.4.x on a system running 2.2.x), the deployment is blocked and the case is escalated.
- Rollback Strategy (limits blast radius): Every automated deployment is preceded by a system snapshot. If post-deployment validation (service health checks, functional regression tests, or re-scanning) detects any anomaly within a configurable observation window, the system automatically rolls back to the pre-change state. In our pilot deployment, rollback was observed in a small minority of auto-executed cases, all successfully restored within minutes.
- Confidence Threshold (filters unreliable recommendations): The three-tier routing mechanism (Section 3.2) ensures that only high-confidence recommendations reach automated execution, providing a probabilistic safety gate upstream of all deployment actions.
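The first two mechanisms, action whitelisting and version verification, act as deterministic gates before any deployment. The Python sketch below shows one way such a gate could be expressed. All names (`Recommendation`, `gate`, the set contents, the inventory layout) are illustrative assumptions rather than the system's actual policy format, and the branch comparison is a deliberately simple prefix match.

```python
from dataclasses import dataclass

# Hypothetical policy sets mirroring the examples given in the text.
ALLOWED_ACTIONS = {"package_update", "config_hardening"}
AUTO_ENVIRONMENTS = {"development", "container_image"}
CRITICAL_SERVICES = {"database", "authentication", "payment_gateway"}


@dataclass(frozen=True)
class Recommendation:
    action_type: str
    environment: str
    service_class: str
    component: str
    target_branch: str  # release branch the patch is intended for, e.g. "2.4"


def gate(rec: Recommendation, installed_versions: dict[str, str]) -> tuple[bool, str]:
    """Return (allowed, reason); any failed check blocks automated execution."""
    if rec.action_type not in ALLOWED_ACTIONS:
        return False, "action type not whitelisted"
    if rec.service_class in CRITICAL_SERVICES:
        return False, "production-critical service requires manual approval"
    if rec.environment not in AUTO_ENVIRONMENTS:
        return False, "environment not eligible for automated execution"
    installed = installed_versions.get(rec.component)
    if installed is None:
        return False, "component not found in software inventory"
    # Prefix match on the branch: "2.4.58" matches "2.4", while "2.2.34" and "2.40.1" do not.
    if installed != rec.target_branch and not installed.startswith(rec.target_branch + "."):
        return False, f"version mismatch: installed {installed}, patch targets {rec.target_branch}.x"
    return True, "eligible for automated execution"


rec = Recommendation("package_update", "development", "web_server", "apache2", "2.4")
print(gate(rec, {"apache2": "2.2.34"}))  # blocked, as in the Apache 2.4.x vs. 2.2.x example
print(gate(rec, {"apache2": "2.4.58"}))  # allowed
```

Keeping the gate free of model calls means its decisions are reproducible and can be audited from the version-controlled policy alone.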
5.5. Limitations
- External Validity of Data: Our experimental dataset primarily comes from a single enterprise environment, which may not cover all industry domains and diverse software/hardware ecosystems. The applicability of this system in sectors such as healthcare, finance, or critical infrastructure requires further validation and adaptation, as these domains impose additional regulatory constraints and operate specialized software stacks.
- Expert Annotation Variability: Cybersecurity experts vary in their writing styles when crafting remediation plans, which may affect the consistency of the gold standard. Although we employed dual-reviewer verification (Cohen’s κ = 0.87) to improve annotation reliability, potential biases cannot be entirely eliminated. Future work should explore inter-organizational annotation campaigns to diversify the gold standard.
- RAG Knowledge Update Frequency: If the knowledge base fails to incorporate the latest vulnerability information in a timely manner (e.g., newly disclosed zero-day vulnerabilities with no available data), the model may still lack sufficient grounding evidence. Our failure mode analysis (Section 5.2) confirmed that retrieval misses account for approximately 40% of residual errors. More frequent automated ingestion pipelines with freshness monitoring are needed to ensure knowledge coverage.
- Execution Risk and Compliance: Automated execution of remediation actions must be built upon strict access controls and approval processes. Organizations must also address compliance questions—such as which systems permit automated changes, how change records are documented, and how automated decisions satisfy audit requirements under frameworks such as ISO/IEC 27001 [61] and SOC 2 Trust Services Criteria [62]—to meet internal risk governance and external regulatory requirements.
- Model Vendor Dependency: The current evaluation is based exclusively on OpenAI GPT-family models accessed via commercial APIs. This introduces vendor lock-in risks including pricing changes, API deprecation, data residency concerns, and limited transparency into model behavior. Extending the evaluation to open-weight models (e.g., LLaMA, Mistral, Qwen) and on-premises deployments would improve generalizability and address data sovereignty requirements common in regulated industries.
- Baseline Scope: Two non-LLM baselines (OpenVAS-Hint and Retrieval-Only) were evaluated on a 50-case subset and provide preliminary numerical evidence consistent with an accuracy ladder of 66.0% → 78.0% → 82.6% (Table 4). This comparison has two weaknesses. First, the 50-case sample limits statistical power, and the Retrieval-Only vs. Prompt + RAG gap is small. Second, the Prompt + RAG figure derives from the larger 350-case sample, so the cross-sample-size comparison limits direct statistical testing, and same-subset replication is needed to confirm the incremental LLM contribution. A full 350-case replication of the non-LLM baselines and finer-grained RAG ablations (BM25-only vs. FAISS-only vs. RRF, and disabling temporal boosting) would strengthen the evidence.
- Adversarial Robustness: While the framework incorporates multiple safety mechanisms (input filtering, dual-LLM verification, action whitelisting, confidence routing), we have not yet conducted systematic adversarial testing such as prompt injection attacks (e.g., malicious instructions embedded in scan descriptions) or knowledge base poisoning experiments (e.g., injecting fabricated advisories). Preliminary manual inspection suggests that the structured JSON output schema and the dual-verifier provide partial defense, but formal red-team evaluation is needed to quantify resilience and identify remaining attack surfaces.
- Domain-Specific Applicability: The current evaluation targets web-facing enterprise IT systems (Linux/Windows servers, web applications, network appliances). Extending the framework to non-web domains introduces additional constraints. For SCADA/ICS environments, automated patching must respect real-time control loops and may require vendor-certified firmware updates with mandatory downtime windows; the confidence routing mechanism would need domain-specific thresholds (θ_high ≥ 0.95) and integration with OT change-management workflows. For healthcare/medical device environments, FDA premarket clearance requirements restrict which software components may be modified without re-certification, necessitating a policy layer that distinguishes patchable infrastructure from regulated device software. The framework’s modular architecture—separating scanning, intelligence retrieval, reasoning, and execution—accommodates such extensions by allowing domain-specific policy modules to be inserted at the routing stage without modifying the core RAG or verification pipelines.
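The statistical-power caveat in the Baseline Scope item can be made concrete with confidence intervals. The Python sketch below computes 95% Wilson score intervals for the three reported accuracies. The Wilson interval is our choice for illustration and is not stated as the paper's method, and the success counts (33/50, 39/50, 289/350) are back-computed from the reported percentages, which is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts back-computed from the reported percentages (assumption).
configs = {
    "OpenVAS-Hint (50 cases)": (33, 50),     # 66.0%
    "Retrieval-Only (50 cases)": (39, 50),   # 78.0%
    "Prompt + RAG (350 cases)": (289, 350),  # about 82.6%
}
for name, (k, n) in configs.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k / n:.3f}  [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions, the Retrieval-Only interval spans roughly 0.65 to 0.87 and fully contains the narrower Prompt + RAG interval, which is consistent with the statement that the 50-case subset cannot by itself confirm the incremental LLM contribution.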
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| API | Application Programming Interface |
| BM25 | Best Matching 25 (keyword ranking function) |
| CI/CD | Continuous Integration/Continuous Deployment |
| CPE | Common Platform Enumeration |
| CTI | Cyber Threat Intelligence |
| CVE | Common Vulnerabilities and Exposures |
| CVSS | Common Vulnerability Scoring System |
| CWE | Common Weakness Enumeration |
| FAISS | Facebook AI Similarity Search |
| JSON | JavaScript Object Notation |
| KB | Knowledge Base |
| KEV | Known Exploited Vulnerabilities |
| LLM | Large Language Model |
| MISP | Malware Information Sharing Platform |
| NVD | National Vulnerability Database |
| NVT | Network Vulnerability Test |
| OWASP | Open Worldwide Application Security Project |
| RAG | Retrieval-Augmented Generation |
| RRF | Reciprocal Rank Fusion |
| SOC | Security Operations Center |
| ZSM | Zero-touch Network and Service Management |
References
- NIST. National Vulnerability Database (NVD). 2024. Available online: https://nvd.nist.gov (accessed on 15 January 2026).
- Verizon. 2024 Data Breach Investigations Report (DBIR). 2024, 100p. Available online: https://www.verizon.com/business/resources/reports/2024-dbir-data-breach-investigations-report.pdf (accessed on 15 January 2026).
- (ISC)2. ISC2 2024 Cybersecurity Workforce Study. 2024. Available online: https://www.isc2.org/Insights/2024/10/ISC2-2024-Cybersecurity-Workforce-Study (accessed on 15 January 2026).
- ETSI ISG ZSM. Zero-Touch Network and Service Management (ZSM); Reference Architecture; ETSI GS ZSM 002; ETSI: Valbonne, France, 2019.
- Fu, M.; Tantithamthavorn, C.; Nguyen, V.; Le, T. ChatGPT for vulnerability detection, classification, and repair: How far are we? In 2023 30th Asia-Pacific Software Engineering Conference (APSEC); IEEE: New York, NY, USA, 2023; pp. 632–636.
- Hasanov, I.; Virtanen, S.; Hakkala, A.; Isoaho, J. Application of large language models in cybersecurity: A systematic literature review. IEEE Access 2024, 12, 176751–176778.
- Divakaran, D.M.; Peddinti, S.T. Large language models for cybersecurity: New opportunities. IEEE Secur. Priv. 2025, 23, 38–45.
- Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; USENIX Association: Berkeley, CA, USA, 2024; pp. 847–864.
- Fayyazi, R.; Trueba, S.H.; Zuzak, M.; Yang, S.J. ProveRAG: Provenance-driven vulnerability analysis with automated retrieval-augmented LLMs. arXiv 2024, arXiv:2410.17406.
- Mend.io. All About RAG: What It Is and How to Keep It Secure. 2024. Available online: https://www.mend.io/blog/all-about-rag-what-it-is-and-how-to-keep-it-secure/ (accessed on 15 January 2026).
- NVIDIA. What Is Retrieval-Augmented Generation (RAG)? NVIDIA Blog, 31 January 2025. Available online: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/ (accessed on 15 January 2026).
- Yang, L.; Naser, S.; Shami, A.; Muhaidat, S.; Ong, L.; Debbah, M. Toward zero touch networks: Cross-layer automated security solutions for 6G wireless networks. IEEE Trans. Commun. 2025, 73, 7650–7679.
- Gallego-Madrid, J.; Sanchez-Iborra, R.; Ruiz, P.M.; Skarmeta, A.F. Machine learning-based zero-touch network and service management: A survey. Digit. Commun. Netw. 2022, 8, 105–123.
- Liyanage, M.; Pham, Q.-V.; Dev, K.; Bhattacharya, S.; Maddikunta, P.K.R.; Gadekallu, T.R.; Yenduri, G. A survey on zero touch network and service management (ZSM) for 5G and beyond networks. J. Netw. Comput. Appl. 2022, 203, 103362.
- Coronado, E.; Behravesh, R.; Subramanya, T.; Fernandez-Fernandez, A.; Siddiqui, M.S.; Costa-Perez, X.; Riggio, R. Zero touch management: A survey of network automation solutions for 5G and 6G networks. IEEE Commun. Surv. Tutor. 2022, 24, 2535–2578.
- Lu, G.; Ju, X.; Chen, X.; Pei, W.; Cai, Z. GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning. J. Syst. Softw. 2024, 212, 112031.
- Wei, Z.; Sun, J.; Sun, Y.; Liu, Y.; Wu, D.; Zhang, Z.; Zhang, X.; Li, M.; Liu, Y.; Li, C.; et al. Advanced smart contract vulnerability detection via LLM-powered multi-agent systems. IEEE Trans. Softw. Eng. 2025, 51, 2830–2846.
- Nong, Y.; Yang, H.; Cheng, L.; Hu, H.; Cai, H. APPATCH: Automated adaptive prompting large language models for real-world software vulnerability patching. In 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; USENIX Association: Berkeley, CA, USA, 2025; pp. 4481–4500.
- Li, Y.; Wang, S.; Nguyen, T.N. DLFix: Context-based code transformation learning for automated program repair. In ACM/IEEE 42nd International Conference on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2020; pp. 602–614.
- Bhandari, G.; Gavric, N.; Shalaginov, A. Generating vulnerability security fixes with code language models. Inf. Softw. Technol. 2025, 185, 107786.
- Wang, P.; Liu, X.; Xiao, C. CVE-Bench: Benchmarking LLM-based software engineering agents’ ability to repair real-world CVE vulnerabilities. In 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 4207–4224.
- Hu, Y.; Li, Z.; Shu, K.; Guan, S.; Zou, D.; Xu, S.; Yuan, B.; Jin, H. SoK: Automated vulnerability repair: Methods, tools, and assessments. In 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; USENIX Association: Berkeley, CA, USA, 2025; pp. 4421–4440.
- Yildiz, A.; Teo, S.G.; Lou, Y.; Feng, Y.; Wang, C.; Divakaran, D.M. Benchmarking LLMs and LLM-based agents in practical vulnerability detection for code repositories. In 63rd Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 30848–30865.
- Chopra, S.; Ahmad, H.; Goel, D.; Szabo, C. ChatNVD: Advancing cybersecurity vulnerability assessment with large language models. arXiv 2024, arXiv:2412.04756.
- Benzaid, C.; Taleb, T. AI-driven zero-touch network and service management in 5G and beyond: Challenges and research directions. IEEE Netw. 2020, 34, 186–194.
- Benzaid, C.; Taleb, T. ZSM security: Threat surface and best practices. IEEE Netw. 2020, 34, 124–133.
- El Rajab, M.; Yang, L.; Shami, A. Zero-touch networks: Towards next-generation network automation. Comput. Netw. 2024, 243, 110294.
- Hazra, A.; Kalita, A.; Gurusamy, M.; Sah, D.K. Potential of zero-touch network management in Industry 5.0: A future prospect. IEEE Internet Comput. 2024, 28, 45–52.
- Jiang, J.; Jin, S.; Li, X.; Zhang, K.; Sun, B. A zero-touch dynamic configuration management framework for time-sensitive networking (TSN). Entropy 2025, 27, 584.
- Yang, L.; El Rajab, M.; Shami, A.; Muhaidat, S. Enabling AutoML for zero-touch network security: Use-case driven analysis. IEEE Trans. Netw. Serv. Manag. 2024, 21, 3555–3582.
- Lira, O.G.; Caicedo, O.M.; da Fonseca, N.L.S. Large language models for zero touch network configuration management. IEEE Commun. Mag. 2025, 63, 146–153.
- Aksu, M.U.; Altuncu, E.; Bicakci, K. A first look at the usability of OpenVAS vulnerability scanner. In Workshop on Usable Security (USEC) 2019, San Diego, CA, USA, 24 February 2019; NDSS Symposium: San Diego, CA, USA, 2019; pp. 1–11.
- Vimala, K.; Fugkeaw, S. VAPE-BRIDGE: Bridging OpenVAS results for automating Metasploit framework. In 2022 14th International Conference on Knowledge and Smart Technology (KST), Chon Buri, Thailand, 26–29 January 2022; IEEE: New York, NY, USA, 2022; pp. 69–74.
- GitHub. Responsible Use of Copilot Autofix for Code Scanning. GitHub Docs. Available online: https://docs.github.com/en/code-security/code-scanning/managing-code-scanning-alerts/responsible-use-autofix-code-scanning (accessed on 15 January 2026).
- Xu, H.; Wang, S.; Li, N.; Wang, K.; Zhao, Y.; Chen, K.; Yu, T.; Liu, Y.; Wang, H. Large language models for cyber security: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 2025.
- Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model security and privacy: The good, the bad, and the ugly. High-Confid. Comput. 2024, 4, 100211.
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874.
- Jaffal, N.O.; Alkhanafseh, M.; Mohaisen, D. Large language models in cybersecurity: A survey of applications, vulnerabilities, and defense techniques. AI 2025, 6, 216.
- Karras, A.; Theodorakopoulos, L.; Karras, C.; Theodoropoulou, A.; Kalliampakou, I.; Kalogeratos, G. LLMs for cybersecurity in the big data era: A comprehensive review of applications, challenges, and future directions. Information 2025, 16, 957.
- Sagodi, Z.; Antal, G.; Bogenfurst, B.; Isztin, M.; Hegedus, P.; Ferenc, R. Reality check: Assessing GPT-4 in fixing real-world software vulnerabilities. In 28th International Conference on Evaluation and Assessment in Software Engineering; Association for Computing Machinery: New York, NY, USA, 2024; pp. 252–261.
- Zhou, X.; Cao, S.; Sun, X.; Lo, D. Large language model for vulnerability detection and repair: Literature review and the road ahead. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–31.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Kuttler, H.; Lewis, M.; Yih, W.-T.; Rocktaschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 9459–9474.
- OpenText. RAG and Agentic AI: Revolutionizing Cybersecurity Analysis. OpenText Blogs, 21 April 2025. Available online: https://blogs.opentext.com/rag-and-agentic-ai-revolutionizing-cybersecurity-analysis/ (accessed on 15 January 2026).
- Alam, M.T.; Bhusal, D.; Nguyen, L.; Rastogi, N. CTIBench: A benchmark for evaluating large language models in cyber threat intelligence tasks. arXiv 2024, arXiv:2406.07599.
- Rajapaksha, S.; Rani, R.; Karafili, E. A RAG-based question-answering solution for cyber-attack investigation and attribution. In Computer Security. ESORICS 2024 International Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; pp. 238–256.
- Bornea, A.-L.; Ayed, F.; De Domenico, A.; Piovesan, N.; Maatouk, A. Telco-RAG: Navigating the challenges of retrieval-augmented language models for telecommunications. In GLOBECOM 2024-2024 IEEE Global Communications Conference; IEEE: New York, NY, USA, 2024; pp. 2359–2364.
- Shafran, A.; Schuster, R.; Shmatikov, V. Machine against the RAG: Jamming retrieval-augmented generation with blocker documents. In 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; USENIX Association: Berkeley, CA, USA, 2025; pp. 3787–3806.
- OWASP Foundation. OWASP Top 10 for Large Language Model Applications, Version 2025; OWASP Foundation: Wakefield, MA, USA, 2025. Available online: https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed on 15 January 2026).
- Liu, Y.; Jia, Y.; Jia, J.; Song, D.; Gong, N.Z. DataSentinel: A game-theoretic detection of prompt injection attacks. In 2025 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2025; pp. 2190–2208.
- Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927.
- Chen, B.; Zhang, Z.; Langrene, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260.
- Debnath, T.; Siddiky, M.N.A.; Rahman, M.E.; Das, P.; Guha, A.K.; Rahman, M.R.; Kabir, H.M.D. A comprehensive survey of prompt engineering techniques for large language models. TechRxiv 2025, preprint.
- Sampaio, J.P.B.; Duarte, B.K.; Almeida, P.S.; Dantas, M.G. Prompt engineering for large language models: A systematic review and future directions. Res. Sq. 2025, preprint.
- Liu, Y.; Tao, S.; Meng, W.; Yao, F.; Zhao, X.; Yang, H. LogPrompt: Prompt engineering towards zero-shot and interpretable log analysis. In 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings; Association for Computing Machinery: New York, NY, USA, 2024; pp. 364–365.
- Zhang, T.; Huang, X.; Zhao, W.; Bian, S.; Du, P. LogPrompt: A log-based anomaly detection framework using prompts. In 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–8.
- Young, D.L.; Larson, E.C.; Thornton, M.A. Prompt engineering for detecting phishing. In SPIE Assurance and Security for AI-enabled Systems 2025; SPIE: Bellingham, WA, USA, 2025; Volume 13476, pp. 71–79.
- Edemacu, K.; Wu, X. Privacy preserving prompt engineering: A survey. arXiv 2024, arXiv:2404.06001.
- Derner, E.; Batistic, K.; Zahalka, J.; Babuska, R. A security risk taxonomy for prompt-based interaction with large language models. IEEE Access 2024, 12, 126176–126187.
- Rodriguez, A.D.; Dearstyne, K.R.; Cleland-Huang, J. Prompts matter: Insights and strategies for prompt engineering in automated software traceability. In 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW); IEEE: New York, NY, USA, 2023; pp. 455–464.
- Cormack, G.V.; Clarke, C.L.A.; Buettcher, S. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 19–23 July 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 758–759.
- ISO/IEC 27001:2022; Information Security, Cybersecurity and Privacy Protection—Information Security Management Systems—Requirements. International Organization for Standardization: Geneva, Switzerland, 2022.
- AICPA. 2022 Trust Services Criteria for Security, Availability, Processing Integrity, Confidentiality, and Privacy; American Institute of Certified Public Accountants: New York, NY, USA, 2022.

| Feature | PentestGPT [8] | ProveRAG [9] | CVE-Bench [21] | Ours |
|---|---|---|---|---|
| Scanner integration (OpenVAS) | No | No | No | Yes |
| RAG with threat intelligence | No | Yes | No | Yes |
| Dual-LLM verification | No | Partial (self-check) | No | Yes |
| Confidence-based routing | No | No | No | Yes |
| Automated patch deployment | No | No | Compilation test | CI/CD + rollback |
| Multi-model evaluation | GPT-4 only | GPT-4 only | Multiple agents | 5 models |
| Statistical validation | No | No | No | z-test; McNemar; Cohen’s h |
| End-to-end closed-loop | No | No | No | Yes |

| Model | Accuracy (%) | Latency (s) | Token Usage | acc/1kTok |
|---|---|---|---|---|
| gpt-4o-mini | 82.6 | 2.90 | 871 | 94.8 |
| gpt-4.1 | 82.0 | 1.30 | 870 | 94.3 |
| gpt-3.5-turbo † | 81.6 | 1.98 | 908 | 89.9 |
| gpt-4o | 81.6 | 3.04 | 923 | 88.4 |
| gpt-o4-mini | 76.7 | 16.78 | 2483 | 30.9 |
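
The acc/1kTok column appears to be accuracy (in percentage points) divided by token usage and scaled to 1000 tokens. That definition is inferred from the reported values, not stated alongside the table, so the following Python sketch should be read as a consistency check under that assumption. The function name `acc_per_1k_tokens` is illustrative only.

```python
# Assumed definition (inferred from the table, not stated by the authors):
#   acc/1kTok = accuracy_percent / token_usage * 1000
rows = {
    "gpt-4o-mini": (82.6, 871),
    "gpt-4.1": (82.0, 870),
    "gpt-3.5-turbo": (81.6, 908),
    "gpt-4o": (81.6, 923),
    "gpt-o4-mini": (76.7, 2483),
}


def acc_per_1k_tokens(accuracy_percent: float, tokens: int) -> float:
    """Accuracy points obtained per 1000 tokens consumed."""
    return accuracy_percent / tokens * 1000.0


# Recompute the column, rounded to one decimal as in the table.
efficiency = {
    model: round(acc_per_1k_tokens(acc, tok), 1)
    for model, (acc, tok) in rows.items()
}

for model, value in efficiency.items():
    print(f"{model:15s} {value:5.1f}")
```
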
| Model | Baseline (%) | Prompt (%) | Prompt + RAG (%) | Delta, Final − Baseline (pp) | Delta Token |
|---|---|---|---|---|---|
| gpt-4o-mini | 60.0 | 61.7 | 82.6 | +22.6 | −56 |
| gpt-4.1 | 53.3 | 55.0 | 82.0 | +28.7 | −92 |
| gpt-3.5-turbo † | 50.0 | 53.7 | 81.6 | +31.6 | +178 |
| gpt-4o | 51.7 | 55.7 | 81.6 | +29.9 | −277 |
| gpt-o4-mini | 45.0 | 47.7 | 76.7 | +31.7 | +912 |

| Method | Actionable Rate (%) | Accuracy (%) | Accuracy given Actionable (%) | Invalid Rate (%) |
|---|---|---|---|---|
| OpenVAS-Hint | 100.0 | 66.0 | 66.0 | 0.0 |
| Retrieval-Only | 100.0 | 78.0 | 78.0 | 0.0 |
| Prompt + RAG | 100.0 | 82.6 | 82.6 | 7.8 |

| Model | Baseline (%) | Prompt + RAG (%) | Delta (pp) | z-Statistic | p-Value | Cohen’s h |
|---|---|---|---|---|---|---|
| gpt-4o-mini | 60.0 | 82.6 | +22.6 | 6.83 | <0.001 | 0.51 |
| gpt-4.1 | 53.3 | 82.0 | +28.7 | 8.53 | <0.001 | 0.63 |
| gpt-3.5-turbo † | 50.0 | 81.6 | +31.6 | 9.35 | <0.001 | 0.68 |
| gpt-4o | 51.7 | 81.6 | +29.9 | 8.85 | <0.001 | 0.65 |
| gpt-o4-mini | 45.0 | 76.7 | +31.7 | 9.08 | <0.001 | 0.66 |
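
The z-statistics and Cohen's h values in this table can be recomputed with a two-proportion z-test and the arcsine-transformed effect size. The sketch below makes two assumptions that the table itself does not state: n = 350 cases per condition (the study size given in the Introduction) and an unpooled standard error. Under those assumptions the results agree with the reported values to rounding. The function names are illustrative.

```python
import math

N = 350  # assumption: 350 evaluated cases in each condition


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z-statistic for p1 - p2 using an unpooled standard error (assumed form)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# (baseline accuracy, Prompt + RAG accuracy) as proportions, from the table.
table = {
    "gpt-4o-mini": (0.600, 0.826),
    "gpt-4.1": (0.533, 0.820),
    "gpt-3.5-turbo": (0.500, 0.816),
    "gpt-4o": (0.517, 0.816),
    "gpt-o4-mini": (0.450, 0.767),
}

results = {}
for model, (base, rag) in table.items():
    z = two_proportion_z(rag, base, N, N)
    results[model] = (z, two_sided_p(z), cohens_h(rag, base))
    print(f"{model:15s} z={z:5.2f}  h={results[model][2]:.2f}")
```
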
| Metric | Without Verifier | With Verifier | Delta |
|---|---|---|---|
| Accuracy (%) | 80.3 | 82.6 | +2.3 pp |
| Hallucination rate (%) | 14.6 | 7.8 | −6.8 pp |
| Invalid JSON output (%) | 5.1 | 1.4 | −3.7 pp |
| Version mismatch errors (%) | 8.9 | 3.2 | −5.7 pp |

| Metric | Value |
|---|---|
| Cases reaching auto-execution (high-confidence) | 122/350 (34.9%) |
| Cases blocked by policy whitelist | 18/122 (14.8%) |
| Cases blocked by version mismatch check | 7/122 (5.7%) |
| Rollback triggered post-deployment | 4/97 (4.1%) |
| Rollback cause: transient health check timeout | 3/4 |
| Rollback cause: dependency conflict (non-critical) | 1/4 |
| Service outages caused by automated execution | 0 |
| Data loss events | 0 |
| Unintended security regressions | 0 |
| Median scan-to-recommendation time (RAG + LLM) | 4.2 s |
| Median recommendation-to-deployment time | 2.8 min |
| Median deployment-to-verification time | 6.1 min |
| Median end-to-end time (high-confidence, auto-executed) | 9.7 min |
| Median rollback completion time | <3 min |
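
The counts in this table are mutually consistent if the 97 executed deployments are the 122 high-confidence cases minus those stopped by the policy whitelist and the version mismatch check. The short Python sketch below works through that arithmetic; it assumes the two block reasons are disjoint, which the table does not state explicitly.

```python
# Counts taken from the table; disjoint block reasons are an assumption.
total_cases = 350
auto_candidates = 122      # high-confidence cases reaching auto-execution
blocked_whitelist = 18     # stopped by the policy whitelist
blocked_version = 7        # stopped by the version mismatch check
rollbacks = 4              # rollbacks triggered after deployment

# Deployments that actually ran after both safety gates.
executed = auto_candidates - blocked_whitelist - blocked_version


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal, as reported in the table."""
    return round(100.0 * part / whole, 1)


summary = {
    "auto_candidate_rate": pct(auto_candidates, total_cases),
    "whitelist_block_rate": pct(blocked_whitelist, auto_candidates),
    "version_block_rate": pct(blocked_version, auto_candidates),
    "rollback_rate": pct(rollbacks, executed),
    "successful_execution_rate": pct(executed - rollbacks, executed),
}

print(f"executed deployments: {executed}")
for name, value in summary.items():
    print(f"{name:28s} {value:5.1f}%")
```
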
| θ_high | Auto-Candidate (%) | Human Review (%) | Manual (%) | Executed | Rollback Rate (%) | Successful Execution Rate (%) |
|---|---|---|---|---|---|---|
| 0.80 | 37.7 | 42.3 | 20.0 | 105 | 6.1 | 93.9 |
| 0.85 * | 34.9 | 45.1 | 20.0 | 97 | 4.1 | 95.9 |
| 0.90 | 34.9 | 45.1 | 20.0 | 97 | 2.7 | 97.3 |

| Threat Category | Attack Surface | Implemented Mitigation | Residual Risk |
|---|---|---|---|
| Prompt Injection (LLM01) | Scan descriptions, KB documents | Structured JSON output schema; input sanitization; dual-LLM verification cross-checks | Adversarial payloads in scanner XML not yet red-teamed |
| Sensitive Information Disclosure (LLM06) | LLM outputs, audit logs | Output filtered to JSON schema fields only; no free-text passthrough; role-based KB access | Enterprise host/network metadata in prompts requires data minimization review |
| Insecure Output Handling (LLM02) | Ansible playbooks generated from LLM output | Action whitelisting restricts allowed change types; version verification blocks mismatched patches | Template injection via crafted remediation strings not formally tested |
| Training Data Poisoning (LLM03) | Fine-tuning dataset | Expert-curated Q&A pairs with dual review; CVE-level deduplication against eval set | Community KB entries (Exploit-DB, MISP feeds) not individually verified |
| Excessive Agency (LLM08) | Orchestration Layer auto-execution | Confidence thresholds gate automation; staged deployment (staging to production); snapshot + rollback | Threshold sensitivity to distribution shift not yet analyzed |
| Model Denial of Service (LLM04) | API rate limits, retrieval latency | 60 s timeout with exponential backoff; retry cap (3 attempts); graceful degradation to manual queue | Sustained adversarial load on RAG index not stress tested |
| Over-Reliance (LLM09) | Analyst trust in auto-generated recommendations | Confidence scores displayed; provenance citations in output; semi-auto tier requires human sign-off | No formal user study on analyst calibration or automation bias |
| Supply Chain Vulnerabilities (LLM05) | Model vendor API, KB data feeds | Date-pinned API deployments; source trustworthiness tagging (3-tier); stale entry flagging | Vendor model behavior changes between pinned versions not monitored |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Hsieh, C.-H.; Cheng, C.-Y.; Wang, Y.-C. A Zero-Touch Vulnerability Remediation Framework Based on OpenVAS, Threat Intelligence, and RAG-Enhanced Large Language Models. Mathematics 2026, 14, 1072. https://doi.org/10.3390/math14061072
