Large Language Models for Recovery Plan Generation in Internet-Connected Critical Infrastructures: Architectures, Applications, Limitations, and Research Directions

Tsochev, Georgi; Gergov, Ivo

doi:10.3390/fi18060295


by Georgi Tsochev ¹,* and Ivo Gergov ²

¹ Department of Intelligent Technologies in Industry, Faculty of Computer Systems and Technology, Technical University of Sofia, 1000 Sofia, Bulgaria
² Evo Tech Development Ltd., 1000 Sofia, Bulgaria
* Author to whom correspondence should be addressed.

Future Internet 2026, 18(6), 295; https://doi.org/10.3390/fi18060295

Submission received: 11 April 2026 / Revised: 24 May 2026 / Accepted: 26 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Generative Artificial Intelligence: Systems, Technologies and Applications)


Abstract

Critical infrastructures are increasingly Internet-connected cyber–physical systems whose recovery after cyber incidents must satisfy safety, timing, regulatory, and interdependency constraints. Yet, the use of large language models (LLMs) for generating recovery plans remains fragmented across cybersecurity, industrial control, digital twins, and AI assurance research. This review synthesizes that emerging field through a structured critical survey of studies on LLMs in incident response, OT/ICS resilience, and cyber–physical recovery, with a focused perspective on grounding, trust, and assurance mechanisms relevant to recovery-plan generation. It develops an architecture-centric taxonomy spanning prompt-only assistants, retrieval-augmented copilots, graph-aware planners, multi-agent systems, and hybrid verification/simulation pipelines; maps realistic applications across energy, water, manufacturing, transportation, healthcare, and telecommunications; and organizes limitations into technical, security, governance, and human-factor categories. Based on this synthesis, the paper proposes the Grounded Recovery Planning Stack as a reference architecture and outlines a staged roadmap from human-in-the-loop copilots to bounded orchestration. The main conclusion is that near-term value lies in grounded, auditable, compliance-aware copilots, whereas autonomous recovery execution remains premature without stronger validation, state-aware grounding, sector-specific benchmarks, and formal safeguards.

Keywords: large language models; generative artificial intelligence; critical infrastructures; recovery planning; operational technology; industrial control systems; cyber resilience; retrieval-augmented generation

1. Introduction

Critical infrastructures are not merely collections of isolated assets. They are interdependent socio-technical systems in which failures propagate across cyber, physical, organizational, and geographic layers [1,2,3,4]. In such environments, post-incident recovery is rarely a linear checklist exercise. A restoration decision made for one subsystem can degrade another subsystem’s availability, violate operational safety constraints, or delay the recovery of a more critical public service. This challenge is particularly acute in infrastructures that combine legacy field devices, supervisory control, IT services, vendor dependencies, and human operators.

Operational technology (OT) and industrial control environments intensify the problem. Compared with enterprise IT, critical infrastructure recovery must account for process safety, deterministic timing, device heterogeneity, maintenance windows, fallback modes, and the possibility that the “secure” action is not the “safe” action [5,6,7,8,9,10]. Recent regulatory and standards developments also place stronger emphasis on resilience, preparedness, continuity, and recovery, which means that organizations increasingly need response capabilities that are not only technically sound but also explainable, documented, and auditable.

From a Future Internet perspective, the problem is becoming more pressing because critical infrastructures are no longer isolated control islands. They increasingly rely on remote connectivity, cloud and edge services, industrial IoT gateways, vendor portals, and cross-organizational data exchange, which expand both operational visibility and recovery complexity [5,6,7,8,9,10]. Recovery planning therefore has to reason not only about local device restoration but also about networked dependencies across digital platforms, service providers, and interdependent infrastructures.

At the same time, LLMs have matured rapidly. The transformer architecture enabled the scale-up of contemporary language modeling [11], while successive model generations, from BERT [12] and GPT-3 [13] to instruction-following systems [14] and GPT-4-class models [15], demonstrated that language models can summarize, classify, reason over heterogeneous text, generate code, and interact with external tools. The broader “foundation model” perspective further clarified that large pre-trained models can be adapted across tasks and domains, including high-consequence ones, although adaptation and governance remain essential [16].

Recent advances in prompting and agent design make LLMs especially relevant to planning-oriented tasks. Chain-of-thought prompting, zero-shot reasoning, and plan-and-solve prompting improve decomposition of multi-step problems [17,18,19]. ReAct and Toolformer connect reasoning with external actions and tools [20,21]. Retrieval-augmented generation (RAG) and its later variants allow models to ground outputs in external knowledge stores rather than parametric memory alone [22,23,24]. Multi-agent and agentic LLM paradigms further distribute reasoning across specialized roles, which is attractive for incident command, safety checking, and documentation workflows [25,26].

Model and agent capabilities have continued to advance after these foundational works. Recent frontier releases and evaluations report stronger performance in coding, tool use, long-context reasoning, and cybersecurity-relevant tasks, while also emphasizing the need for stronger deployment safeguards and safety evaluation [27,28,29,30]. At the same time, memory-augmented and self-evolving agent systems are emerging in the broader agent ecosystem, including long-term memory mechanisms, skill libraries, and iterative self-improvement loops [31,32,33,34,35]. These developments motivate the present article’s architecture-centric focus: in critical infrastructures, rapid model progress changes the capability frontier, but dependable recovery still depends on grounding, validation, authorization, and auditability.

However, the literature remains dispersed. Existing reviews examine critical infrastructure protection broadly [36,37,38,39] or survey LLMs in cybersecurity at large [40,41,42]. These works are valuable, but they do not center the specific problem of generating recovery plans for critical infrastructures after disruptive incidents. Recovery-plan generation sits at the intersection of playbook engineering, incident response, infrastructure dependency modeling, OT/ICS safety, AI assurance, and human decision making. Treating it as merely another chatbot use case obscures the most consequential design requirements.

Unlike prior surveys on LLMs in cybersecurity or generative AI for critical infrastructure protection, this article focuses specifically on the generation of recovery plans after disruptive events. Its novelty lies in framing recovery-plan generation as a socio-technical planning problem; organizing the field around architecture patterns for grounded planning and plan assurance; introducing a reference stack for trustworthy recovery-plan generation; and proposing a staged research roadmap from grounded copilots to verified, sector-aware recovery planners.

The review addresses four questions. First, what architecture patterns are emerging for LLM-enabled recovery-plan generation in critical infrastructures? Second, which application scenarios are realistic across major sectors? Third, what technical, organizational, and governance limitations currently constrain adoption? Fourth, which research directions are most likely to move the field from promising prototypes toward dependable operational capability? To answer these questions, the remainder of the article first defines the problem space, then explains the review methodology, synthesizes the literature by architecture and application, analyzes limitations and assurance requirements, and finally proposes a research roadmap and future directions.

The paper's contributions can be summarized as follows:

It formulates recovery-plan generation as a distinct high-consequence planning problem in critical infrastructures, rather than treating it as a generic chatbot or summarization task.
It develops an architecture-centric synthesis spanning prompt-only assistants, RAG-grounded copilots, graph-aware planners, multi-agent planners, and hybrid pipelines coupled with verification, simulation, or digital twins.
It proposes the Grounded Recovery Planning Stack (GRPS) as a reference design for trustworthy deployment in operationally sensitive environments.
It maps sectoral applications across energy, water, manufacturing, transportation, healthcare, and telecommunications, while explicitly distinguishing evidence maturity for decision support from readiness for unattended autonomy.
It organizes the field’s limitations into an assurance-oriented framework and translates that synthesis into a staged research roadmap and practical deployment guidance for operators, vendors, and regulators.

The remainder of the paper is organized as follows. Section 2 defines recovery planning as a socio-technical task and explains why LLM-assisted recovery is harder in critical infrastructures than in enterprise IT. Section 3 presents the structured critical review methodology. Section 4 reviews related survey literature and identifies the specific gap addressed by this article. Section 5 synthesizes LLM-enabled recovery-planning architectures and introduces the Grounded Recovery Planning Stack (GRPS). Section 6 maps sectoral applications across critical infrastructure domains. Section 7 analyzes limitations, risks, and assurance requirements. Section 8 presents the staged research roadmap and cross-cutting future directions. Section 9 discusses theoretical and practical implications for operators, vendors, and regulators. Section 10 concludes the article.

2. Background and Problem Framing

2.1. Recovery Planning as a Distinct Socio-Technical Task

Recovery planning should be distinguished from detection, triage, or post-incident reporting. In this review, recovery-plan generation denotes the production of an ordered and justified set of restoration actions, decision points, approvals, rollback conditions, and monitoring checks intended to restore a target service state while respecting safety, business priority, and regulatory constraints. This conception is consistent with the broader evolution from static intrusion response taxonomies toward playbook-oriented and resilience-aware response systems [43,44,45,46,47,48].

Traditional playbooks remain indispensable because they encode organizational knowledge, responsibilities, and approved response patterns. Government and standards organizations have also pushed playbooks toward more formal and machine-readable forms, such as the CISA response playbooks and OASIS CACAO [49,50]. Yet, even well-authored playbooks are necessarily incomplete. They cannot anticipate every variant of asset state, telemetry inconsistency, operator availability, third-party dependency, or cross-sector cascade that arises during a real incident. As a result, human teams still spend substantial effort contextualizing playbooks, reconciling conflicting evidence, and drafting recovery sequences under time pressure.

The core problem can therefore be represented as a planning task over a partially observed, safety-constrained state space. Let the input contain incident evidence, asset and service dependencies, operational constraints, recovery priorities, available resources, and validated organizational knowledge. The output is not just a text answer, but a plan bundle consisting of ordered actions, justifications, uncertainty flags, required approvals, and measurable completion criteria. A useful recovery-planning system must transform heterogeneous evidence into action recommendations that are grounded in current context and can survive operational scrutiny.

This formulation is materially different from generic question answering. A plausible-sounding recommendation is insufficient if it references the wrong asset, ignores a maintenance interlock, violates process safety, or omits the need for a rollback condition. In high-consequence infrastructures, the relevant success criterion is not linguistic fluency but whether the plan is safe, feasible, auditable, and timely enough to support human decision making.
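As a minimal sketch of the plan bundle described above, the following Python fragment encodes ordered actions with justifications, uncertainty flags, required approvals, and completion criteria, and then runs a few structural checks (known asset, rollback condition present, approver assigned). All class names, field names, and asset identifiers are illustrative assumptions and are not taken from any reviewed system. Passing these checks indicates only that the bundle is well formed; it says nothing about process safety, which requires domain-specific validators.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryAction:
    """One ordered restoration step (field names are hypothetical)."""
    step: int
    asset_id: str
    description: str
    justification: str
    rollback_condition: str = ""  # empty string means "not specified"
    required_approvals: list[str] = field(default_factory=list)
    uncertainty_flags: list[str] = field(default_factory=list)


@dataclass
class PlanBundle:
    """Ordered actions plus measurable completion criteria."""
    target_service: str
    actions: list[RecoveryAction]
    completion_criteria: list[str]


def structural_issues(plan: PlanBundle, known_assets: set[str]) -> list[str]:
    """Return human-readable problems found by purely structural checks."""
    issues = []
    expected_steps = list(range(1, len(plan.actions) + 1))
    if [a.step for a in plan.actions] != expected_steps:
        issues.append("steps are not numbered 1..n in order")
    for a in plan.actions:
        if a.asset_id not in known_assets:
            issues.append(f"step {a.step}: unknown asset '{a.asset_id}'")
        if not a.rollback_condition.strip():
            issues.append(f"step {a.step}: missing rollback condition")
        if not a.required_approvals:
            issues.append(f"step {a.step}: no approver assigned")
    if not plan.completion_criteria:
        issues.append("no measurable completion criteria")
    return issues
```

In such a design, a non-empty issue list would send the draft back for revision before any human reviewer spends time on it.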

2.2. Why LLMs Are Attractive for Recovery-Plan Generation

The attraction of LLMs lies in their ability to unify activities that are usually separated across tools: interpreting unstructured documents, summarizing alerts, extracting procedures from manuals, reconciling naming mismatches, generating operator-facing explanations, and proposing action sequences. Early work has already explored this potential in incident response planning and autonomous defense contexts. Hays and White examined the use of LLMs for incident response planning and review [51]. Hammar et al. proposed a lightweight pipeline that combines fine-tuning, retrieval, and planning to reduce hallucination in incident response planning [52]. IRCopilot used collaborative LLM components to emulate the dynamic phases of an incident response team [53]. In parallel, autonomous cyber defense research has studied LLM agents in simulated defensive environments and hybrid architectures [54,55,56,57].

These studies suggest that LLMs are especially promising when recovery work is information-heavy and coordination-heavy rather than purely closed-loop control. They can draft candidate plans; explain why a given recovery order is preferable; translate technical recommendations for different stakeholders; and accelerate knowledge access across logs, tickets, runbooks, and manuals. In other words, LLMs are strongest where recovery depends on integrating fragmented knowledge and weaker where safe actuation requires deterministic guarantees.

2.3. Why Critical Infrastructures Are Harder than Enterprise IT

Critical infrastructures add constraints that sharply distinguish them from general enterprise IT. First, the cost of a wrong action can be physical harm, not just data loss. Second, restoration often involves cyber–physical dependencies: a recovered server may be operationally useless if a field controller, sensor chain, or communications link remains impaired [1,2,38]. Third, OT environments often contain legacy systems, proprietary protocols, incomplete asset inventories, and limited logging [5]. Fourth, recovery authority is distributed across engineers, operators, safety staff, contractors, and sector regulators. Fifth, many infrastructures must plan for degraded modes rather than full restoration, which means “best” plans are multi-objective and context dependent.

These constraints imply that critical infrastructure recovery cannot rely on raw language generation alone. It needs state-aware grounding, explicit constraint handling, uncertainty management, and role-appropriate human authorization. The question is therefore not whether an LLM can propose a recovery plan, but what surrounding architecture can make such a proposal dependable enough to be useful.

3. Review Methodology

This article follows a structured critical review methodology informed by PRISMA 2020 reporting principles and software-engineering guidance for systematic reviews [58,59]. It should therefore be read as a structured critical/narrative review rather than as a formal systematic review or scoping review. The aim is analytical depth rather than exhaustive bibliometric enumeration. In particular, the review does not claim a meta-analysis, nor does it report screening counts that would imply a closed corpus. Instead, it prioritizes representative and influential sources up to March 2026 across five intersecting areas: critical infrastructure resilience, OT/ICS security, incident response and playbooks, LLMs in cybersecurity, and trustworthy AI for high-consequence use.

Search and selection emphasized peer-reviewed papers, standards, official guidance, and high-impact preprints from venues where the field currently moves fastest. Because recovery-plan generation is still emergent, excluding preprints would omit important technical developments in agentic response systems and domain-grounded planning. To reduce duplication, papers were coded by problem focus, architecture pattern, application sector, validation strategy, human role, and limitation profile.

Table 1 summarizes the structured review protocol used in this article, including the search scope, screening logic, and synthesis strategy.

A deliberate choice in this review is to analyze literature through the lens of recovery-plan generation, even when individual papers focus on adjacent tasks such as PLC code generation, log understanding, CTI reasoning, or autonomous cyber defense. This is justified because real recovery planning draws on precisely these adjacent capabilities: telemetry interpretation, state estimation, procedural recall, dependency reasoning, action sequencing, and operator communication. The synthesis therefore privileges functional relevance over narrow keyword matching.

For transparency, the use of generative AI assistance in manuscript preparation is disclosed in the Acknowledgments in line with journal policy; however, source selection, factual verification, interpretation, and final editorial decisions remained the responsibility of the authors.

4. Related Review Literature and Identified Gap

Prior review literature can be divided into three clusters. The first cluster surveys critical infrastructure protection and cyber-resilience more broadly, often with emphasis on dependency modeling, CPS resilience, and protection strategies [36,37,38]. The second cluster surveys generative AI and LLMs for critical infrastructure protection at a macro level [39]. The third cluster reviews LLMs across the cybersecurity domain, covering tasks such as vulnerability analysis, malware understanding, log analysis, CTI extraction, and offensive security [40,41,42].

The present article differs from all three clusters. It does not ask simply whether LLMs are useful in cybersecurity, nor does it review critical infrastructure protection in general. Instead, it asks how LLMs can generate, justify, and validate recovery plans for critical infrastructures after disruptive incidents. This narrower focus reveals requirements that broader surveys typically treat only indirectly: live-state grounding, infrastructure dependencies, rollback logic, human authorization, safety barriers, and sector-specific assurance.

Table 2 positions the present article against representative review literature and makes the article’s distinct gap and contribution explicit.

This gap matters because recovery is where cyber defense meets real-world service continuity. Detection without recovery remains incomplete. Likewise, autonomy without assurance is unacceptable in critical infrastructures. A review centered on recovery-plan generation therefore contributes a problem formulation that is both operationally concrete and scientifically underdeveloped.

5. Architectural Patterns for LLM-Enabled Recovery-Plan Generation

The reviewed literature suggests that LLM-enabled recovery planning is best understood through architecture patterns rather than model names. The same foundation model can behave very differently depending on whether it is used as a prompt-only assistant, a RAG-grounded copilot, a graph-aware planner, or part of a verified hybrid pipeline. Architecture determines what context the model receives, what tools it can invoke, how it is checked, and how much autonomy it is allowed.

Table 3 synthesizes the main architectural patterns that recur in the literature on LLM-enabled recovery-plan generation.

5.1. Prompt-Only Assistants

The simplest pattern uses an LLM directly through carefully engineered prompts. In this mode, the model acts as a drafting assistant for candidate recovery steps, communication templates, or after-action rationales. Prompt-only systems are attractive because they require minimal integration effort and can be deployed rapidly for tabletop exercises, analyst note writing, and first-pass procedure drafting [51]. They are also useful in low-maturity environments where structured knowledge bases do not yet exist.

However, prompt-only assistants are poorly suited to high-consequence recovery decisions. Without live grounding, they can hallucinate non-existent assets, miss organization-specific nomenclature, or recommend actions that are technically coherent but operationally invalid. In recovery contexts, these are not marginal defects. They directly undermine trust and can create unsafe recommendations. Prompt-only systems should therefore be treated as ideation tools or documentation aids rather than decision engines for live restoration.

5.2. Retrieval-Augmented Copilots

RAG-grounded systems are currently the most credible near-term architecture for recovery support. Here, the model retrieves relevant material from curated knowledge sources—runbooks, vendor manuals, incident tickets, asset inventories, change logs, safety procedures, and CTI reports—and then generates a plan conditioned on those sources [22,23]. This architecture lowers reliance on parametric memory and increases auditability because recommendations can be traced to retrieved evidence.

The incident response planning pipeline proposed by Hammar et al. is particularly instructive because it combines fine-tuning, retrieval, and planning in a lightweight model stack designed to reduce hallucination [52]. Similar logic appears in domain assistants such as ChatIoT, which integrates heterogeneous IoT security information using RAG [60], and in CTI-focused work such as SEvenLLM, which emphasizes domain-tailored instruction data and evaluation [66]. Log understanding and event template extraction studies further suggest that LLM pipelines can provide operational value once they are anchored to the right evidence sources [67,68].

For recovery planning, the most important design decision in RAG systems is not the retriever alone but the composition of the knowledge base. A recovery copilot must fuse several evidence classes: static procedural knowledge, near-real-time incident evidence, current asset and topology state, and organizational constraints such as approvals or regulatory reporting obligations. If any of these classes is missing or stale, retrieval may still return apparently relevant context while the final plan remains infeasible.
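The point about missing or stale evidence classes can be illustrated with a small pre-generation check. The sketch below assumes that each retrieved item carries a class label and an "as of" timestamp. The four class labels mirror the evidence classes named above, while the freshness budgets and the dictionary layout are hypothetical values chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets per evidence class (assumed values).
MAX_AGE = {
    "procedural": timedelta(days=365),      # runbooks, manuals
    "incident": timedelta(minutes=15),      # near-real-time incident evidence
    "asset_state": timedelta(hours=1),      # current asset and topology state
    "org_constraints": timedelta(days=30),  # approvals, reporting obligations
}


def evidence_gaps(retrieved, now):
    """Map each evidence class to 'missing' or 'stale' if it cannot be trusted.

    `retrieved` is a list of dicts with keys 'cls' and 'as_of' (aware datetime).
    An empty result means every class has at least one fresh item.
    """
    newest = {}
    for doc in retrieved:
        cls = doc["cls"]
        if cls in MAX_AGE and (cls not in newest or doc["as_of"] > newest[cls]):
            newest[cls] = doc["as_of"]
    gaps = {}
    for cls, budget in MAX_AGE.items():
        if cls not in newest:
            gaps[cls] = "missing"
        elif now - newest[cls] > budget:
            gaps[cls] = "stale"
    return gaps


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
retrieved = [
    {"cls": "procedural", "as_of": datetime(2025, 6, 1, tzinfo=timezone.utc)},
    {"cls": "incident", "as_of": now - timedelta(minutes=5)},
    {"cls": "asset_state", "as_of": now - timedelta(hours=3)},
]
gaps = evidence_gaps(retrieved, now)
```

A copilot built along these lines could refuse to draft a plan, or attach an explicit uncertainty flag, whenever the returned mapping is non-empty.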

5.3. Knowledge-Graph and GraphRAG Planners

Critical infrastructure recovery depends heavily on dependency reasoning. Which services depend on which controllers? Which remote terminal units depend on which communications paths? Which safety functions must remain online before a controller reboot is attempted? Knowledge graphs and GraphRAG-like methods are therefore natural extensions of plain-text RAG [24]. They allow the planning system to reason over typed entities and relations rather than relying only on semantically similar document chunks.

Recent work points in this direction. Webb et al. use LLMs and retrieval for cyber knowledge completion, bridging gaps between attack-pattern taxonomies and cyber–physical risk reasoning [61]. Grangel-González et al. show how cyber–physical systems can be represented within knowledge graphs [69]. Dagnas et al. demonstrate that graph-based CPS modeling can support resilience quantification and the identification of critical points [62]. Combined, these strands suggest a path toward dependency-aware recovery planning in which the LLM no longer “guesses” relationships from prose but queries a structured representation of the infrastructure.

From a recovery perspective, graph-aware grounding offers three major advantages. First, it can support service-centric planning rather than asset-centric planning. Second, it can surface indirect effects and restoration prerequisites. Third, it can make explanations more actionable by linking recommendations to explicit dependency paths. The main obstacle is that most organizations do not yet maintain sufficiently complete and timely knowledge graphs of their infrastructures, which makes graph-aware recovery planning as much a data-engineering challenge as an LLM challenge.
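To make service-centric planning and restoration prerequisites concrete, the following sketch derives a dependency-respecting restoration order for one target service from a small typed-free dependency map. The graph content and node names are invented for illustration, and the approach (transitive prerequisite collection followed by a topological sort from Python's standard `graphlib` module) is a generic technique rather than a method proposed in the reviewed papers.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the nodes in its set.
DEPENDS_ON = {
    "water_distribution_service": {"scada_server", "pump_plc"},
    "scada_server": {"ot_network_switch"},
    "pump_plc": {"ot_network_switch", "safety_interlock"},
    "ot_network_switch": set(),
    "safety_interlock": set(),
    "billing_portal": {"it_database"},
    "it_database": set(),
}


def prerequisites(target, depends_on):
    """Collect every node that must be healthy before `target` (transitively)."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def restoration_order(target, depends_on):
    """Return the nodes relevant to `target`, dependencies first.

    TopologicalSorter raises graphlib.CycleError if the map contains a cycle.
    """
    needed = prerequisites(target, depends_on) | {target}
    subgraph = {node: depends_on.get(node, set()) & needed for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


order = restoration_order("water_distribution_service", DEPENDS_ON)
```

The relative order of independent nodes (for example, the switch and the interlock) is not fixed by the sort, so a planner would still need priorities or operator input to break such ties. Nodes unrelated to the target service, such as the billing portal here, are excluded from the sequence.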

5.4. Multi-Agent Planners

Multi-agent architectures distribute planning across specialized roles such as incident commander, OT engineer, safety officer, compliance reviewer, and communications lead [25,26]. Conceptually, this mirrors how real recovery decisions are made: no single person owns all relevant knowledge, and acceptable plans usually emerge through review, critique, and approval. LLM-based multi-agent systems therefore offer a compelling abstraction for recovery-plan generation.

IRCopilot exemplifies this design by organizing automated incident response into collaborative session components with differentiated responsibilities [53]. Autonomous cyber defense studies also suggest that LLM-based teams can be evaluated as interacting defenders rather than isolated models [54,55]. Even offensive-security studies such as PentestGPT are relevant by analogy because they show the value of decomposing complex tasks into cooperating modules to reduce context loss and improve persistence across long workflows [70].

For critical infrastructure recovery, multi-agent designs offer a practical path to bounded autonomy. The “planner” agent can draft candidate actions, a “safety” agent can reject actions that violate process rules, a “topology” agent can fetch state or dependency data, and a “documentation” agent can produce operator-facing rationale. Yet multi-agent systems do not eliminate the need for grounding and verification. They mainly improve division of cognitive labor. Without strong arbitration and evidence control, they can still amplify error through persuasive but mutually reinforcing mistakes.
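One way to picture arbitration with evidence control is a veto rule in which an action is accepted only if every reviewing agent approves and each verdict cites evidence that actually exists in a shared store. The sketch below uses plain functions as stand-ins for LLM-backed agents; the agent names, the string-matching safety rule, and the evidence identifiers are all hypothetical.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    agent: str
    approved: bool
    evidence_id: str  # expected to exist in the shared evidence store


Reviewer = Callable[[str], Verdict]


def safety_agent(action: str) -> Verdict:
    # Stub rule standing in for an LLM-backed safety reviewer.
    unsafe = "reboot" in action and "plc" in action
    return Verdict("safety", not unsafe, "SOP-12")


def topology_agent(action: str) -> Verdict:
    # Stub: always approves and cites a (hypothetical) graph snapshot.
    return Verdict("topology", True, "GRAPH-SNAPSHOT-1")


def ungrounded_agent(action: str) -> Verdict:
    # Stub: agrees with everything but cites no evidence.
    return Verdict("critic", True, "")


def arbitrate(action, reviewers, evidence_store):
    """Accept only if all reviewers approve AND cite evidence that exists."""
    blocking = []
    for review in reviewers:
        v = review(action)
        if v.evidence_id not in evidence_store:
            blocking.append(f"{v.agent}: cites unknown evidence '{v.evidence_id}'")
        elif not v.approved:
            blocking.append(f"{v.agent}: rejected (see {v.evidence_id})")
    return (not blocking), blocking


store = {"SOP-12", "GRAPH-SNAPSHOT-1"}
accepted, reasons = arbitrate(
    "restart historian service",
    [safety_agent, topology_agent, ungrounded_agent],
    store,
)
```

In this example the action is blocked even though all three agents agree, because one approval is not backed by verifiable evidence. That is the behavior intended to counter mutually reinforcing mistakes: agreement alone does not count as support.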

5.5. Hybrid Planners with Verification, Simulation, and Digital Twins

The highest-assurance pattern combines LLMs with symbolic constraints, simulators, optimization routines, digital twins, or formal verification. This pattern is especially promising for recovery tasks that touch industrial logic, control sequences, or safety-critical reconfiguration. LLM4PLC is an important example because it pairs LLM generation with compilers, grammar checking, and verification tools to improve correctness of PLC programs [63]. In adjacent work, LLMs have been explored for industrial control and HVAC management [64] and for end-to-end control of industrial automation systems through agentic frameworks [65].

Digital twin research strengthens this pattern by providing a virtual environment in which recovery candidates can be validated before execution [71,72,73]. In a mature architecture, the LLM would not directly approve a restoration step. It would instead propose candidate sequences whose feasibility is tested against rules, simulation models, or twin-based state projections. This is conceptually close to earlier work on response and recovery engines [48], but with the LLM handling knowledge synthesis and explanation while symbolic components enforce hard constraints.

The practical implication is clear: the more consequential the actuation, the more the architecture must shift from pure language generation toward hybrid verification. In other words, the path to trustworthy recovery automation in critical infrastructures is not bigger models alone; it is better coupling between language models, structured infrastructure knowledge, and domain-specific validators.
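The division of labor described above, in which the language model proposes and symbolic components enforce hard constraints, can be sketched with a toy precondition and effect table that is replayed against a projected state. The action names, state variables, and rules below are invented for illustration; a real deployment would replace this table with a simulator, a digital twin, or formally verified logic.

```python
# Hypothetical rule table: action -> (preconditions, effects) over boolean state.
RULES = {
    "isolate_segment": ({}, {"segment_isolated": True}),
    "restore_plc_image": ({"segment_isolated": True}, {"plc_clean": True}),
    "enable_interlock": ({"plc_clean": True}, {"interlock_on": True}),
    "restart_pump": (
        {"plc_clean": True, "interlock_on": True},
        {"pump_running": True},
    ),
}


def validate_sequence(candidate, initial_state, rules=RULES):
    """Replay a candidate sequence on a projected state.

    Returns (ok, message). Variables absent from the state are treated as False.
    The check stops at the first violated hard constraint.
    """
    state = dict(initial_state)
    for i, action in enumerate(candidate, start=1):
        if action not in rules:
            return False, f"step {i}: '{action}' is not in the approved rule table"
        preconditions, effects = rules[action]
        unmet = [k for k, v in preconditions.items() if state.get(k, False) != v]
        if unmet:
            return False, (
                f"step {i}: '{action}' blocked, unmet precondition(s): "
                + ", ".join(sorted(unmet))
            )
        state.update(effects)
    return True, "all steps satisfy their preconditions"


# A candidate as an LLM might draft it, with the interlock step omitted.
llm_candidate = ["isolate_segment", "restore_plc_image", "restart_pump"]
ok, message = validate_sequence(llm_candidate, initial_state={})
```

Here the validator rejects the draft at the third step because the interlock has not been enabled. The rejection message can be returned to the model as feedback for a revised candidate, while the decision about what counts as permissible stays with the rule table rather than with the model.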

5.6. A Reference Stack for Trustworthy Recovery-Plan Generation

Based on the reviewed literature, this article proposes a conceptual Grounded Recovery Planning Stack (GRPS) for critical infrastructures. The stack is not a product architecture but a synthesis intended to clarify design requirements:

Operational context assembly collects incident evidence, asset state, service priorities, dependencies, and human-role information.
Grounding and dependency fusion merges text retrieval, structured asset data, and knowledge-graph links.
Plan generation drafts one or more candidate recovery sequences, including rollback logic and justification.
Assurance gates apply policy checks, safety constraints, simulation or digital-twin validation, and uncertainty assessment.
Human authorization and orchestration routes the plan to appropriate approvers and preserves an audit trail.
Post-incident learning updates runbooks, retrieval corpora, and evaluation sets from the outcome.

The main benefit of the GRPS view is that it prevents organizations from over-focusing on the language model while under-engineering the surrounding system. In recovery planning, dependable behavior emerges from the full stack: curated operational knowledge, explicit dependency models, constrained tool use, verification, and human approval. Figure 1 summarizes this layered reference architecture for trustworthy deployment design.
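The ordering of the six GRPS layers can be expressed as a small pipeline skeleton. This is only a sketch under strong assumptions: every stage is a stub, the "plan generation" stage returns a fixed list instead of calling a model, and the function names and data layout are hypothetical. Its purpose is to show two structural properties of the stack: no plan reaches an authorized status without passing the assurance gate and a human decision, and every run leaves an audit trail and a learning record.

```python
def assemble_context(d: dict) -> dict:
    # Layer 1: gather incident evidence and service priorities (stubbed).
    return {**d, "evidence": ["alert-1"], "priority": "restore_flow"}


def fuse_grounding(d: dict) -> dict:
    # Layer 2: attach dependency facts from structured sources (stubbed).
    return {**d, "dependencies": {"pump_plc": ["ot_switch"]}}


def generate_plan(d: dict) -> dict:
    # Layer 3: stand-in for the LLM call; returns a fixed candidate plan.
    return {**d, "plan": ["restore ot_switch", "restore pump_plc"]}


def assurance_gate(d: dict) -> dict:
    # Layer 4: toy policy check; a real gate would run validators or a twin.
    ok = all(step.startswith("restore ") for step in d["plan"])
    return {**d, "assured": ok}


def run_grps(request, approver, learn):
    """Run the stubbed layers in order and return (status, audit_trail)."""
    audit = []
    data = dict(request)
    for stage in (assemble_context, fuse_grounding, generate_plan, assurance_gate):
        data = stage(data)
        audit.append(stage.__name__)
    if not data["assured"]:
        status = "rejected_by_assurance"
    else:
        # Layer 5: a human decision is always required before release.
        audit.append("human_authorization")
        status = "authorized" if approver(data["plan"]) else "declined_by_human"
    # Layer 6: feed the outcome back into corpora and evaluation sets.
    learn({"status": status, "plan": data.get("plan")})
    audit.append("post_incident_learning")
    return status, audit


lessons = []
status, audit = run_grps(
    {"incident": "INC-1"},
    approver=lambda plan: True,  # stands in for a human approval step
    learn=lessons.append,
)
```

The `approver` callback is deliberately outside the pipeline's control, which reflects the stack's premise that authorization belongs to a human role rather than to the model.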

6. Applications in Critical Infrastructure Sectors

Evidence for fully automated recovery remains limited, but the literature already supports a meaningful map of where LLM-enabled recovery planning is most and least plausible. The strongest near-term use cases are those that combine heterogeneous documentation, rapidly evolving telemetry, and significant coordination overhead, while still leaving final authority with human operators. Sector maturity is therefore best interpreted as evidence maturity for decision support, not as readiness for unattended autonomy.

Table 4 compares sectoral application scenarios and indicates the current maturity of the available evidence.

The maturity labels in Table 4 are qualitative coding judgments rather than quantitative performance scores. They are based on three criteria: (i) the availability of domain-specific LLM, recovery, or closely related CI/OT prototypes; (ii) the level of validation reported in the literature, ranging from conceptual discussion to simulation, testbed validation, or operational deployment; and (iii) the degree of integration between LLM-based planning and operational evidence such as topology, telemetry, runbooks, or validators. In this coding, “Low” denotes mostly conceptual or adjacent evidence with no recovery-specific validation; “Low-to-moderate” denotes mixed evidence or partial validation of relevant components; and “Moderate” denotes testbed, simulation, or domain-specific prototype evidence, while still falling short of validated unattended operational autonomy.

6.1. Energy and Utility Infrastructures

In energy and utility settings, recovery planning must coordinate cyber restoration with physical service priorities and grid or process constraints. Although the literature contains more conceptual discussions than field deployments, the combination of dependency modeling, digital twins, and LLM planning makes the sector a leading candidate for graph-grounded recovery copilots [39,61,71,72]. A practical example would be drafting a service-prioritized restoration sequence that explicitly differentiates between actions requiring field dispatch and those that can be completed from the control room.

6.2. Water and Wastewater Infrastructures

Water and wastewater systems are especially relevant because testbeds such as SWaT have been used to study CPS resilience and graph-based assessment [62]. In such systems, a recovery planner must reason about treatment stages, sensor trustworthiness, chemical safety, and degraded operation. The value of an LLM here is less in direct control and more in integrating technical manuals, process dependencies, and operator procedures into a coherent restoration narrative that can be reviewed and adapted in real time.

6.3. Manufacturing and Process Industries

Manufacturing and process industries currently provide the clearest technical bridge from language models to recovery-oriented actuation support. LLM4PLC shows that language models can be embedded in a verification loop for industrial logic generation [63]. Song et al. and Xia et al. extend the discussion toward broader industrial control using LLMs [64,65]. These studies do not yet solve infrastructure recovery as a whole, but they demonstrate a critical building block: LLM outputs can be made operationally relevant when paired with compilers, validators, and domain-specific execution context.
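
The building block described above, in which model output is accepted only after it passes external checks, can be sketched as a generate-and-validate loop. This is an illustrative sketch and not the LLM4PLC pipeline: `draft_logic` is a hypothetical stand-in for a model call that returns scripted drafts, and `validate` is a toy rule check standing in for a compiler or formal verifier.

```python
def validate(steps):
    """Toy validator: return a list of rule violations (empty means accepted)."""
    errors = []
    if "check_interlock" not in steps:
        errors.append("missing check_interlock")
    elif "start_pump" in steps and steps.index("start_pump") < steps.index("check_interlock"):
        errors.append("start_pump precedes check_interlock")
    return errors


def draft_logic(attempt, feedback):
    """Hypothetical stand-in for an LLM call; drafts are scripted for the demo.

    A real generator would condition on `feedback` from the validator.
    """
    scripted = [
        ["start_pump"],
        ["start_pump", "check_interlock"],
        ["check_interlock", "start_pump"],
    ]
    return scripted[min(attempt, len(scripted) - 1)]


def generate_with_verification(generate, check, max_attempts=3):
    """Return (accepted_candidate_or_None, attempts_used)."""
    feedback = []
    for attempt in range(max_attempts):
        candidate = generate(attempt, feedback)
        feedback = check(candidate)
        if not feedback:
            return candidate, attempt + 1
    # No candidate passed: surface the failure instead of returning a guess.
    return None, max_attempts


accepted, attempts = generate_with_verification(draft_logic, validate)
print(accepted, attempts)
```

The design choice worth noting is the final branch: when the attempt budget is exhausted, the loop returns nothing rather than the last unverified draft.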

6.4. Transportation and Logistics

Transportation and logistics systems present a profile in which safe recovery is deeply dependent on interlocking procedures, scheduling dependencies, and cross-organizational communication. In such environments, the immediate value of LLMs is likely to lie in assembling procedure-aware recovery options, clarifying decision prerequisites, and translating technical recovery choices into operator-facing actions rather than in direct actuation.

6.5. Healthcare and Hospital Infrastructures

Healthcare infrastructures are similarly coordination intensive. A recovery copilot could assemble downtime procedures, dependency-aware restoration priorities, clinical escalation rules, and stakeholder communication templates. The main benefit would be speed and coherence in cross-team sensemaking, especially when IT restoration decisions must be sequenced against patient-safety priorities.

6.6. Telecommunications and Cloud-Backed Services

Telecommunications and cloud-backed infrastructures are the most immediately adjacent to today’s enterprise incident-response tooling. LLMs can already assist with SOC and CTI workflows, log understanding, and threat-intelligence reasoning [66,67,68,74,75]. These capabilities are not identical to recovery planning, but they reduce the information bottleneck that often delays restoration after an incident and therefore provide a realistic entry point for recovery-oriented copilots.

7. Limitations, Risks, and Assurance Requirements

The central limitation of LLM-enabled recovery planning is that a convincing plan is not necessarily a correct plan. In high-consequence infrastructures, this gap is decisive. Dependability requires that every recommendation be evaluated not only for linguistic plausibility but also for factual grounding, state consistency, operational feasibility, safety, and accountability.

7.1. Technical Limitations

Hallucination remains the most visible technical risk. Surveys consistently show that LLMs can generate outputs that appear authoritative yet are unsupported or false [76,77]. For recovery planning, hallucination can take several domain-specific forms: inventing non-existent assets, misremembering procedural prerequisites, confusing vendor terminology, or extrapolating from outdated context. Detection and mitigation methods such as hallucination checkers and abstention mechanisms are improving [78,79], but they should be treated as partial safeguards rather than guarantees.

A second limitation is state mismatch. Even a factually grounded plan may be wrong for the current infrastructure state if retrieval uses stale inventories, outdated topology, or incomplete incident evidence. This is one reason why plain-text RAG alone is insufficient in OT and critical infrastructure settings. Recovery planners need live or near-live synchronization with asset state, change history, and dependency models.
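
One way to make the state-mismatch concern concrete is to set aside retrieved evidence that predates the latest recorded change to the asset it describes. The sketch below assumes hypothetical record fields (`asset`, `captured_at`) and a hypothetical change log; it illustrates the idea and is not drawn from a specific system in the reviewed literature.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    asset: str
    captured_at: datetime
    text: str


def split_fresh_and_stale(evidence, last_change):
    """Mark evidence as stale if its asset changed after the evidence was captured."""
    fresh, stale = [], []
    for item in evidence:
        changed_at = last_change.get(item.asset)
        if changed_at is not None and changed_at > item.captured_at:
            stale.append(item)
        else:
            fresh.append(item)
    return fresh, stale


# Hypothetical change log: asset -> time of the most recent recorded change.
LAST_CHANGE = {"plc-7": datetime(2026, 3, 2, 14, 0)}

RETRIEVED = [
    Evidence("plc-7", datetime(2026, 1, 10), "PLC-7 firmware 2.1, VLAN 30"),
    Evidence("plc-7", datetime(2026, 3, 5), "PLC-7 firmware 2.4, VLAN 42"),
    Evidence("hmi-2", datetime(2025, 11, 1), "HMI-2 restore procedure"),
]

fresh, stale = split_fresh_and_stale(RETRIEVED, LAST_CHANGE)
print([e.text for e in stale])
```

In this sketch an asset with no change record is treated as fresh. A stricter policy could instead flag such evidence as unverified and route it to an operator.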

A third limitation is long-horizon planning under uncertainty. Recovery procedures often branch, require conditional rollback, and evolve as new evidence arrives. Prompting strategies such as chain-of-thought and plan-and-solve help LLMs decompose such tasks more effectively than earlier approaches [17,19], but the models still struggle with persistent world models and calibrated uncertainty. Benchmarks such as TruthfulQA show that larger models are not automatically more truthful or reliable [80].

7.2. Security and Misuse Risks

Because recovery planners will likely be integrated with enterprise search, ticketing systems, documentation stores, and orchestration tools, they inherit a broad attack surface. Prompt injection can manipulate retrieved context or tool use [81]. Adversarial jailbreaks can subvert aligned behavior [82]. Training-data extraction and data leakage are also relevant where sensitive operational content is processed by general-purpose models [83]. In critical infrastructures, these are not abstract concerns: a compromised planning assistant could actively recommend unsafe or strategically harmful recovery actions.

The broader critique that LLMs behave as “stochastic parrots” is especially salient in this domain [84]. Recovery planning requires more than eloquent pattern completion. It requires grounded and accountable reasoning about specific infrastructures. Likewise, ethical and social risk analyses of language models are directly relevant because infrastructure recovery decisions affect public safety, service equity, and trust in institutions [85].

7.3. Human and Organizational Limitations

Many recovery failures are organizational before they are algorithmic. Playbook usability research shows that response artifacts often fail because they are difficult to interpret or poorly matched to analyst workflows [45,46]. Incident response research also shows that external actors such as insurers and legal advisors can reshape recovery priorities in ways that do not always maximize learning or resilience [47]. Introducing LLMs into this environment can either reduce friction or amplify ambiguity, depending on how responsibilities and approval paths are designed.

Recent in-the-wild evidence from security operations centers is instructive. Analysts appear to use LLMs primarily as cognitive aids for sensemaking and context building rather than for final high-stakes decisions [74]. This pattern is likely to persist in critical infrastructure recovery. It suggests that the most realistic adoption model is augmentation with retained human decision authority, not replacement of incident commanders or control-room staff.

7.4. Governance and Assurance Requirements

Governance frameworks such as the NIST AI RMF and the NIST Generative AI Profile provide a useful vocabulary for trustworthy deployment [86,87]. For recovery-plan generation, these frameworks imply several concrete requirements: provenance of retrieved evidence, segregation of trusted and untrusted sources, systematic red teaming, uncertainty-aware abstention, strong audit logging, and explicit role-based authorization for any action with operational consequence.
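
Three of these requirements (evidence provenance, role-based authorization, and audit logging) can be combined in a single approval gate. The sketch below is a minimal illustration under stated assumptions: the role names, action classes, and the `APPROVAL_POLICY` table are hypothetical and are not prescribed by the NIST documents.

```python
import json
from datetime import datetime, timezone

# Hypothetical policy: which roles may approve which classes of action.
APPROVAL_POLICY = {
    "read_only": {"analyst", "operator", "incident_commander"},
    "config_change": {"operator", "incident_commander"},
    "process_actuation": {"incident_commander"},
}

AUDIT_LOG = []  # In practice this would be an append-only, tamper-evident store.


def authorize(action, action_class, role, evidence_ids):
    """Approve only if the role is permitted and the action cites evidence."""
    allowed = role in APPROVAL_POLICY.get(action_class, set())
    has_provenance = len(evidence_ids) > 0
    decision = allowed and has_provenance
    # Every request is logged, whether or not it is approved.
    AUDIT_LOG.append(json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "class": action_class,
        "role": role,
        "evidence": list(evidence_ids),
        "approved": decision,
    }))
    return decision


ok = authorize("restart historian service", "config_change", "operator", ["runbook-12"])
blocked = authorize("open valve V3", "process_actuation", "operator", ["sop-4"])
print(ok, blocked, len(AUDIT_LOG))
```

Unknown action classes fall through to an empty role set, so the gate denies by default.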

Evaluation must also move beyond generic accuracy. Cybersecurity-specific benchmark suites are improving [75,88], but they still underrepresent the recovery tasks most relevant to critical infrastructures: degraded-mode restoration, service-priority trade-offs, interdependency reasoning, rollback planning, and sector-specific compliance checks. Until recovery-oriented benchmarks exist, organizations should validate candidate systems in sandboxes, digital twins, or testbeds before considering live deployment.

Recent system-level guidance sharpens these requirements into deployable engineering practices. The NCSC Guidelines for Secure AI System Development organize safeguards across secure design, secure development, secure deployment, and secure operation and maintenance [89]. In parallel, OWASP’s 2025 guidance on LLM application security and prompt injection shows that recovery assistants must be engineered as secure applications with explicit defenses against indirect prompt injection, unsafe tool mediation, and overreliance on generated content [90,91].

Threat-informed assurance should also be grounded in attacker knowledge bases and lifecycle controls. MITRE ATLAS provides a living knowledge base of tactics and techniques against AI-enabled systems that can support threat modeling and red teaming of recovery-planning stacks [92]. NIST’s adversarial machine learning taxonomy and SP 800-218A extend this perspective by providing a common vocabulary for attacks and mitigations and by augmenting secure software development practices for generative AI and dual-use foundation models [93,94]. For infrastructures that move closer to operational deployment, joint guidance on secure AI deployment and the newer OT-specific guidance on secure AI integration are especially relevant because they translate AI risk management into operational constraints, operator accountability, and defensive deployment patterns for critical environments [95,96].

Table 5 consolidates the main limitations and maps them to corresponding assurance mechanisms suitable for critical infrastructures.

8. Research Roadmap and Future Directions

A realistic research roadmap should recognize that the field is not moving from “no AI” to “full autonomy” in one step. The more plausible trajectory is staged: first grounded copilots, then validated planners, and only later bounded orchestration in well-instrumented environments. Each stage requires advances in data, architecture, assurance, and human factors.

Table 6 summarizes the staged research roadmap proposed in this review.

The horizons in Table 6 are indicative research and deployment-readiness horizons, not deterministic forecasts. They are derived from the synthesis of current evidence maturity, the engineering effort required for trustworthy grounding and validation, and the slower certification and governance cycles typical of critical infrastructures. The staged ordering therefore reflects increasing assurance burden: read-only copilots can be piloted with curated corpora and human review, dependency-aware planners require live-state and graph integration, bounded orchestration requires validated execution limits and approval policies, and verified ecosystems require interoperable artifacts and sector-level assurance cases.

Figure 2 complements Table 6 by visualizing the staged transition from grounded copilots toward bounded orchestration and by highlighting the cross-cutting requirements that must mature in parallel.

The first stage should focus on machine-readable operational knowledge. Many organizations still store the information needed for recovery in fragmented PDFs, tickets, spreadsheets, and tribal knowledge. Without disciplined curation, LLMs cannot reliably support recovery planning. Research on playbooks, CTI corpora, and domain assistants already suggests that knowledge engineering is as decisive as model choice [44,60,66].

The second stage should emphasize state-aware grounding and benchmark development. Recovery planning is not a generic security benchmark problem. It requires datasets and evaluation tasks that encode service dependencies, operational constraints, and evolving incident state. Existing benchmarks for cybersecurity LLMs are valuable foundations [75,88], but they do not yet capture infrastructure restoration under partial observability. Recovery-specific benchmarks should include tasks such as selecting safe restoration order, identifying missing prerequisites, drafting rollback conditions, and explaining trade-offs among availability, safety, and compliance.
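
Two of the benchmark tasks named above lend themselves to simple automatic scoring. The sketch below is a hypothetical illustration, not an existing benchmark: it checks whether a proposed restoration order respects a given dependency map and scores predicted missing prerequisites against a reference set. All step and prerequisite names are invented for the example.

```python
def order_respects_dependencies(order, depends_on):
    """True if every step appears after all of its declared prerequisites."""
    position = {step: i for i, step in enumerate(order)}
    return all(
        dep in position and position[dep] < position[step]
        for step in order
        for dep in depends_on.get(step, [])
    )


def score_missing_prerequisites(predicted, gold):
    """Precision, recall, and F1 for predicted vs. reference missing prerequisites."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical benchmark item.
DEPENDS_ON = {"restore_hmi": ["restore_plc"], "restore_plc": ["isolate_segment"]}

good = order_respects_dependencies(["isolate_segment", "restore_plc", "restore_hmi"], DEPENDS_ON)
bad = order_respects_dependencies(["restore_hmi", "isolate_segment", "restore_plc"], DEPENDS_ON)
scores = score_missing_prerequisites(
    predicted=["backup_verified", "vendor_signoff"],
    gold=["backup_verified", "network_isolated"],
)
print(good, bad, scores)
```

Tasks such as explaining trade-offs would still require human or rubric-based assessment, since they have no single reference answer.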

The third stage should center on verification and uncertainty. Hybrid architectures should become the default for any scenario that approaches actuation. The model may generate hypotheses or candidate plans, but symbolic checks, twin-based simulation, or domain verifiers should determine whether these candidates are admissible [62,63,71]. In parallel, abstention and uncertainty signaling must be treated as first-class capabilities rather than optional UX features [79].
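
The division of labor described above, in which the model proposes and a checker decides, can be sketched with a small symbolic admissibility test that also abstains when it has no model of a proposed step. The precondition and effect tables are hypothetical, and the sketch stands in for the far richer verifiers and twin-based simulations discussed in the cited work.

```python
# Hypothetical preconditions that must hold before a step may run.
PRECONDITIONS = {
    "restore_plc_config": {"network_isolated", "backup_verified"},
    "reconnect_to_scada": {"plc_config_restored"},
}

# Hypothetical effects that each known step establishes.
EFFECTS = {
    "isolate_network": {"network_isolated"},
    "verify_backup": {"backup_verified"},
    "restore_plc_config": {"plc_config_restored"},
    "reconnect_to_scada": {"scada_connected"},
}


def review_plan(steps, known_facts):
    """Return (verdict, reason), where verdict is 'admit', 'reject', or 'abstain'."""
    unknown = [s for s in steps if s not in EFFECTS]
    if unknown:
        # The checker cannot reason about unmodeled steps, so it declines to judge.
        return "abstain", f"no model for steps: {unknown}"
    facts = set(known_facts)
    for step in steps:
        missing = PRECONDITIONS.get(step, set()) - facts
        if missing:
            return "reject", f"{step} lacks {sorted(missing)}"
        facts |= EFFECTS[step]
    return "admit", "all preconditions satisfied"


print(review_plan(["isolate_network", "verify_backup", "restore_plc_config"], []))
print(review_plan(["restore_plc_config"], []))
print(review_plan(["reboot_everything"], []))
```

Keeping "abstain" distinct from "reject" lets an operator tell a plan that is known to be unsafe apart from a plan the checker is not equipped to assess.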

A fourth workstream concerns human factors and organizational integration. Recovery planners will be adopted only if they fit existing incident command structures and improve, rather than degrade, operator workload and accountability. This requires longitudinal field studies similar to emerging SOC collaboration research [74] but tailored to OT, utilities, and other critical sectors. Questions of trust calibration, review burden, explanation design, and post-incident learning deserve as much attention as model performance.

Finally, the roadmap should include deployment engineering for constrained environments. Critical infrastructures may require on-premises inference, low-latency models, offline fallback, and strong data residency controls. The lightweight orientation in recent incident-response planning work [52] is therefore strategically important. Smaller, well-grounded models may be preferable to frontier models when confidentiality, cost, or air-gapped deployment dominates.

Very recent research also supports the staged roadmap proposed here. Hammar et al. propose an iterative verification-and-abstention loop in which candidate actions are checked against lookahead consistency and refined through external feedback [97]. Gao et al. push the agenda toward end-to-end incident-response agents that integrate perception, reasoning, planning, and action in a single lightweight model [98]. Related work on large-language-model agents for automated cyber tactics planning shows continued progress in multi-step cyber decision support, but also reinforces the importance of simulation-backed evaluation [99]. These trajectories align with national-level threat assessments, which argue that AI is likely to amplify existing cyber intrusion workflows in the near term rather than eliminate operational constraints [100].

Cross-Cutting Future Directions

Several future directions appear especially promising and, taken together, define a research agenda that is more specific than the generic “apply LLMs to cybersecurity” narrative.

First, recovery-specific multimodal models deserve attention. Critical infrastructure recovery depends not only on text but also on diagrams, P&IDs, HMI screenshots, alarm tables, topology views, and maintenance records. Future recovery planners should integrate multimodal evidence rather than treating all context as plain text.

Second, graph- and twin-coupled planning is likely to become a central architecture. Plain RAG can retrieve relevant passages, but recovery planning needs explicit reasoning over dependencies, prerequisites, and service impacts. Knowledge graphs and digital twins provide the substrate for this transition [62,69,71,72].

Third, verifiable and abstaining planners should replace “always-answering” assistants. A trustworthy planner must know when available evidence is inadequate and must surface uncertainty in an operator-usable form [78,79]. This principle is especially important when incident evidence is conflicting, incomplete, or adversarially manipulated.

Fourth, small and private deployment patterns are likely to matter more than raw benchmark leadership in many infrastructures. On-premises or edge-deployable models, combined with retrieval from locally curated knowledge stores, may offer a better trade-off among confidentiality, availability, and controllability than larger external APIs [52,87].

Fifth, interoperable recovery artifacts should be standardized. Machine-readable playbooks such as CACAO are an important start [50], but critical infrastructure recovery will also need structured representations for service dependencies, operational constraints, fallback modes, and approval logic. Standardization in this area would make benchmarking, auditing, and cross-tool integration substantially easier.
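
To indicate what such a structured representation might contain, the sketch below defines a machine-readable recovery plan with dependencies, constraints, fallback modes, and approval roles, and serializes it to JSON. The field names are hypothetical and do not follow the CACAO schema or any published standard.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class RecoveryStep:
    step_id: str
    action: str
    depends_on: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    fallback_mode: str = ""
    approver_role: str = "operator"


@dataclass
class RecoveryPlan:
    plan_id: str
    service: str
    steps: List[RecoveryStep]

    def to_json(self) -> str:
        """Serialize the plan, including nested steps, to stable JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def undeclared_dependencies(self) -> List[str]:
        """List dependency identifiers that no step in the plan defines."""
        known = {s.step_id for s in self.steps}
        return sorted({d for s in self.steps for d in s.depends_on} - known)


# Hypothetical plan; "s9" is deliberately undefined to show the audit check.
plan = RecoveryPlan(
    plan_id="demo-001",
    service="water-treatment-stage-2",
    steps=[
        RecoveryStep("s1", "verify sensor calibration", constraints=["dosing pumps off"]),
        RecoveryStep("s2", "restore PLC logic from backup", depends_on=["s1"],
                     fallback_mode="manual dosing", approver_role="shift_supervisor"),
        RecoveryStep("s3", "resume automatic dosing", depends_on=["s2", "s9"]),
    ],
)

print(plan.undeclared_dependencies())
print(plan.to_json())
```

Even a minimal schema of this kind makes simple audits possible, such as detecting references to steps that the plan never defines.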

Sixth, evaluation must become sector-aware. The field needs recovery benchmarks that are aligned with energy, water, manufacturing, healthcare, transportation, and telecom realities rather than generic cyber tasks. Sector-specific community profiles, testbeds, and benchmark suites would provide the missing bridge between academic prototypes and operational adoption.

Seventh, compliance-aware and privacy-aware recovery planning will become a deployment prerequisite. Future systems should generate not only restoration steps but also the supporting artifacts needed for approval, notification, evidence retention, and post-incident accountability. The EU Artificial Intelligence Act and the EDPB report on privacy risks in LLM systems, together with OT-specific joint guidance on secure AI integration, imply that deployment viability will increasingly depend on governance-by-design, data minimization, and auditable human oversight rather than technical accuracy alone [96,101,102].

Eighth, memory-augmented and self-evolving recovery agents should be studied cautiously. Persistent memory can help maintain incident context across handovers, preserve lessons learned from prior exercises, and support longer recovery timelines. Self-evolving mechanisms, skill libraries, and reflective improvement loops may also improve repeated planning tasks [31,32,34,35]. In critical infrastructures, however, such capabilities must be bounded by change-control rules, memory provenance, decay or revocation policies, and safety review because an agent that learns from incorrect or adversarial experience could institutionalize unsafe recovery behavior.
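
The bounding mechanisms named above (provenance, decay, and revocation) can be illustrated with a small memory store that recalls a lesson only if its source is trusted, it has not been revoked, and it is recent enough. The class and field names are hypothetical, and the sketch is not based on a specific agent framework from the cited surveys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryEntry:
    lesson: str
    source: str            # provenance, e.g., an exercise or incident identifier
    recorded_at: datetime
    revoked: bool = False


class BoundedMemory:
    """Agent memory with provenance filtering, age-based decay, and revocation."""

    def __init__(self, max_age, trusted_sources):
        self.max_age = max_age
        self.trusted_sources = set(trusted_sources)
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def revoke_source(self, source):
        """Revoke every lesson learned from a source later found unreliable."""
        for entry in self.entries:
            if entry.source == source:
                entry.revoked = True

    def recall(self, now):
        return [
            e.lesson for e in self.entries
            if not e.revoked
            and e.source in self.trusted_sources
            and now - e.recorded_at <= self.max_age
        ]


now = datetime(2026, 5, 1)
memory = BoundedMemory(timedelta(days=365), {"exercise-2025", "incident-0412"})
memory.add(MemoryEntry("Verify backups before PLC restore", "exercise-2025", datetime(2025, 9, 1)))
memory.add(MemoryEntry("Skip interlock test to save time", "incident-0412", datetime(2026, 4, 12)))
memory.add(MemoryEntry("Old vendor default password works", "exercise-2021", datetime(2021, 3, 1)))
memory.revoke_source("incident-0412")
print(memory.recall(now))
```

Revocation by source reflects the concern raised above: if an incident record turns out to be incorrect or adversarially shaped, everything learned from it can be withdrawn in one change-controlled step.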

Ninth, the impact of frontier model releases should be evaluated through recovery-specific tasks rather than generic leaderboard scores. GPT-5.5 and Claude Mythos Preview illustrate rapid progress in coding, tool use, cyber tasks, and multi-step agentic workflows [27,28,29,30]. Nevertheless, stronger model capability does not remove the need for state-aware grounding, trusted evidence, simulation-backed validation, and human authorization. For this reason, the roadmap in this article is intentionally architecture- and assurance-centered rather than tied to a single model generation.

9. Discussion

The literature reviewed here suggests a consistent conclusion: LLMs are already useful for recovery-related cognition, but not yet sufficient for recovery autonomy in critical infrastructures. Their strongest current contributions lie in information synthesis, plan drafting, explanation, and coordination support. Their weakest aspects appear when plans depend on hidden state, exact infrastructure context, or hard safety constraints that cannot be inferred reliably from text alone.

This observation has two implications. The first is theoretical. Recovery-plan generation should be studied as a layered planning-and-assurance problem rather than as a monolithic language-model task. The main scientific challenge is not just better prompting; it is the integration of language reasoning with state estimation, dependency modeling, validation, and human governance. The second implication is practical. Organizations can obtain value now from bounded, evidence-grounded copilots without waiting for full autonomy. Conversely, attempts to skip directly to autonomous recovery orchestration are likely to overestimate current model dependability.

A further insight from the literature is that architecture maturity and data maturity are inseparable. Many barriers blamed on “LLM unreliability” are actually consequences of incomplete runbooks, stale inventories, ambiguous naming, or undocumented dependencies. In this sense, LLM adoption may expose long-standing resilience deficits that predate AI. This is useful. It means recovery-planning research can catalyze better knowledge engineering, process documentation, and dependency visibility even before high-assurance automation becomes feasible.

Practical Implications for Operators, Vendors, and Regulators

One lesson from comparison with adjacent architecture papers is that conceptual clarity alone is not enough; deployability must be made visible. For operators, this means that the first successful deployments should probably be read-only or recommendation-only systems integrated with existing approval chains. For vendors, the priority is not merely better prompting but better interfaces to inventories, historian data, ticketing systems, policy stores, and simulation environments. For regulators and auditors, the central question is whether a recovery recommendation can be traced back to authoritative evidence, reviewed by accountable humans, and reproduced after the fact. In Future Internet settings, deployable solutions must also handle edge-cloud partitioning, controlled remote connectivity, and auditable exchange of recovery evidence across organizational boundaries.

Table 7 translates the discussion into a practical deployment checklist for pilot recovery copilots.

The strategic implication is straightforward: organizations should treat LLM-enabled recovery planning as an assurance engineering program rather than as a stand-alone AI feature. The strongest near-term pilots will therefore look deliberately conservative—narrowly scoped, heavily grounded, evidence-citing, and embedded in existing recovery governance.

This review has limitations. It is a structured critical synthesis rather than an exhaustive bibliometric census. The evidence base remains uneven: manufacturing and cyber-defense simulations are better represented than real-world utility, transport, or healthcare recovery deployments, and many agentic planning approaches are still evaluated primarily in simulations or early-stage studies. Accordingly, the roadmap should be interpreted as a research agenda and deployment-readiness framework rather than as a deterministic forecast of adoption.

Recent literature published after much of the core survey corpus confirms, but does not close, the gap that this review addresses. Broad cyber-resilience reviews continue to aggregate LLM applications across detection, analysis, and response [103]. Disaster-management surveys show a parallel movement toward phase-aware AI support across preparedness, response, and recovery [104]. A 2026 Applied Sciences article addresses LLMs for IIoT protection and incident triage in critical infrastructure [105]. At the more operational end of the spectrum, agentic industrial automation work couples LLM planning with simulation-backed validation for fault recovery [106], while AIR reframes incident response as a first-class safety mechanism for autonomous agent systems [107]. Collectively, these studies confirm the field’s momentum, but they still do not provide an architecture-centric, assurance-driven synthesis focused specifically on recovery-plan generation under OT safety, interdependency, and authorization constraints.

10. Conclusions

LLM-enabled recovery-plan generation for critical infrastructures is an emerging but strategically important research area. The field should not be reduced to generic chatbots or broad claims about “AI for cybersecurity.” Recovery planning is a distinct high-consequence task that requires grounded reasoning over infrastructure state, dependencies, safety rules, and organizational responsibilities. The reviewed literature shows that meaningful progress is already being made through RAG-grounded copilots, multi-agent decompositions, knowledge-graph integration, and hybrid architectures that combine language models with verification or simulation.

At the same time, the limitations are substantial. Hallucination, stale context, dependency blindness, prompt injection, data leakage, governance gaps, and human-factor risks all constrain safe deployment. Accordingly, the most defensible near-term trajectory is the development of grounded, auditable, human-in-the-loop recovery copilots, followed by validated dependency-aware planners in sector testbeds and only then bounded orchestration in carefully controlled environments.

The main contribution of this article is to reposition recovery-plan generation as a central design problem for trustworthy AI in critical infrastructures. By synthesizing the literature around architecture patterns, sectoral applications, limitations, assurance mechanisms, practical deployment requirements, and a staged roadmap, the review provides a foundation for future work that is technically rigorous, operationally relevant, and attentive to the realities of networked, cloud-edge, and safety-critical infrastructure resilience.

Author Contributions

Conceptualization, I.G. and G.T.; methodology, I.G. and G.T.; software, I.G.; validation, I.G. and G.T.; formal analysis, I.G. and G.T.; investigation, I.G.; resources, I.G.; data curation, I.G.; writing—original draft preparation, I.G.; writing—review and editing, G.T.; visualization, I.G.; supervision, G.T.; project administration, G.T.; funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been accomplished with financial support from the European Regional Development Fund within the Operational Program “Bulgarian National Recovery and Resilience Plan”, the procedure for the direct provision of grants “Establishing a network of research higher education institutions in Bulgaria”, and under Project BG-RRP-2.004-0005 “Improving the research capacity and quality to achieve international recognition and resilience of TU-Sofia (IDEAS)”.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

Author Ivo Gergov was employed by Evo Tech Development Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ACD	autonomous cyber defense
AI	artificial intelligence
CACAO	Collaborative Automated Course of Action Operations
CER	Critical Entities Resilience
CI	critical infrastructure
CPS	cyber–physical system
CTI	cyber threat intelligence
CSF	Cybersecurity Framework
GRPS	Grounded Recovery Planning Stack
HMI	human–machine interface
ICS	industrial control system
LLM	large language model
OT	operational technology
P&ID	piping and instrumentation diagram
PLC	programmable logic controller
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RAG	retrieval-augmented generation
SOC	security operations center
SWaT	Secure Water Treatment

References

Rinaldi, S.M.; Peerenboom, J.P.; Kelly, T.K. Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Syst. Mag. 2001, 21, 11–25.
Ouyang, M. Review on modeling and simulation of interdependent critical infrastructure systems. Reliab. Eng. Syst. Saf. 2014, 121, 43–60.
Linkov, I.; Eisenberg, D.A.; Plourde, K.; Seager, T.P.; Allen, J.; Kott, A. Resilience metrics for cyber systems. Environ. Syst. Decis. 2013, 33, 471–476.
Woods, D.D. Four concepts for resilience and the implications for the future of resilience engineering. Reliab. Eng. Syst. Saf. 2015, 141, 5–9.
Stouffer, K.; Pillitteri, V.; Lightman, S.; Abrams, M.; Hahn, A. Guide to Operational Technology (OT) Security; NIST Special Publication 800-82 Rev. 3; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023.
National Institute of Standards and Technology. The NIST Cybersecurity Framework (CSF) 2.0; NIST Cybersecurity White Paper 29; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
European Parliament and Council of the European Union. Directive (EU) 2022/2555 on measures for a high common level of cybersecurity across the Union (NIS 2 Directive). Off. J. Eur. Union 2022, L333, 80–152.
European Parliament and Council of the European Union. Directive (EU) 2022/2557 on the resilience of critical entities. Off. J. Eur. Union 2022, L333, 164–198.
Swanson, M.; Bowen, P.; Phillips, A.; Gallup, D.; Lynes, D. Contingency Planning Guide for Federal Information Systems; NIST Special Publication 800-34 Rev. 1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2010.
Nelson, A.; Rekhi, S.; Souppaya, M.; Scarfone, K. Incident Response Recommendations and Considerations for Cybersecurity Risk Management: A CSF 2.0 Community Profile; NIST Special Publication 800-61 Rev. 3; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2025.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019; Association for Computational Linguistics: Vienna, Austria, 2019; pp. 4171–4186.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 1877–1901.
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 27730–27744.
OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 24824–24837.
Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 22199–22213.
Wang, L.; Xu, W.; Lan, Y.; Hu, Z.; Lan, Y.; Lee, R.K.W.; Lim, E.P. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of ACL 2023 Findings; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 2603–2619.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations (ICLR 2023); Curran Associates, Inc.: Red Hook, NY, USA, 2023.
Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023); Curran Associates, Inc.: Red Hook, NY, USA, 2023.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020.
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Chen, H.; Xiao, Y.; Zhou, C.; Dong, J.; et al. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv 2025, arXiv:2501.13958.
Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, Republic of Korea, 3–9 August 2024; pp. 8048–8057.
Plaat, A.; de Bruin, B.; van der Wees, M.; van Veen, B.; Madoerin, T.; de Heer, H. Agentic Large Language Models: A Survey. J. Artif. Intell. Res. 2025, 84, 1–74.
OpenAI. Introducing GPT-5.5; OpenAI: San Francisco, CA, USA, 2026; Available online: https://openai.com/index/introducing-gpt-5-5/ (accessed on 7 May 2026).
OpenAI. GPT-5.5 System Card; OpenAI Deployment Safety Hub: San Francisco, CA, USA, 2026; Available online: https://deploymentsafety.openai.com/gpt-5-5 (accessed on 7 May 2026).
Anthropic. Alignment Risk Update: Claude Mythos Preview; Anthropic: San Francisco, CA, USA, 2026; Available online: https://anthropic.com/claude-mythos-preview-risk-report (accessed on 7 May 2026).
UK AI Security Institute. Our Evaluation of Claude Mythos Preview’s Cyber Capabilities; Department for Science, Innovation and Technology: London, UK, 2026. Available online: https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities (accessed on 7 May 2026).
Zhang, Z.; Dai, Q.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Zhu, J.; Dong, Z.; Wen, J.R. A Survey on the Memory Mechanism of Large Language Model-based Agents. ACM Trans. Inf. Syst. 2025, 43, 155.
Gao, H.; Geng, J.; Hua, W.; Hu, M.; Juan, X.; Liu, H.; Liu, S.; Qiu, J.; Qi, X.; Ren, Q.; et al. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence. Trans. Mach. Learn. Res. 2026; accepted.
OpenClaw. OpenClaw—Personal AI Assistant. GitHub Repository. 2026. Available online: https://github.com/openclaw/openclaw (accessed on 7 May 2026).
Nous Research. Hermes Agent: The Agent That Grows with You. GitHub Repository. 2026. Available online: https://github.com/NousResearch/hermes-agent (accessed on 7 May 2026).
Nous Research. Hermes Agent Self-Evolution. GitHub Repository. 2026. Available online: https://github.com/NousResearch/hermes-agent-self-evolution (accessed on 7 May 2026).
Ani, U.D.; Watson, J.D.M.; Nurse, J.R.C.; Cook, A.; Maple, C. A review of critical infrastructure protection approaches: Improving security through responsiveness to the dynamic modelling landscape. In Living in the Internet of Things (IoT 2019); IET: London, UK, 2019; pp. 1–15.
Segovia-Ferreira, J.; Navarro, J.; Alcaraz, C.; Zeadally, S. A survey on cyber-resilience approaches for cyber-physical systems. ACM Comput. Surv. 2024, 56, 202.
Humayed, A.; Lin, J.; Li, F.; Luo, B. Cyber-Physical Systems Security—A Survey. IEEE Internet Things J. 2017, 4, 1802–1831.
Yigit, Y.; Ferrag, M.A.; Ghanem, M.C.; Sarker, I.H.; Maglaras, L.A.; Chrysoulas, C.; Moradpoor, N.; Tihanyi, N.; Janicke, H. Generative AI and LLMs for Critical Infrastructure Protection: Evaluation Benchmarks, Agentic AI, Challenges, and Opportunities. Sensors 2025, 25, 1666.
Coelho da Silva, T.; Westphall, C.B. A Survey of Large Language Models in Cybersecurity. arXiv 2024, arXiv:2402.16968.
Motlagh, F.N.; Hajizadeh, M.; Majd, M.; Najafi, P.; Cheng, F.; Meinel, C. Large Language Models in Cybersecurity: State-of-the-Art. arXiv 2024, arXiv:2402.00891.
Xu, H.; Wang, S.; Li, N.; Wang, K.; Zhao, Y.; Chen, K.; Yu, T.; Liu, Y.; Wang, H. Large Language Models for Cyber Security: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 2025, in press/online first.
Stakhanova, N.; Basu, S.; Wong, J. A taxonomy of intrusion response systems. Int. J. Inf. Comput. Secur. 2007, 1, 169–184.
Applebaum, A.; Johnson, S.; Limiero, M.; Smith, M. Playbook Oriented Cyber Response. In Proceedings of the 2018 National Cyber Summit (NCS), Huntsville, AL, USA, 5–7 June 2018; pp. 8–15.
Stevens, R.; Votipka, D.; Dykstra, J.; Tomlinson, F.; Quartararo, E.; Ahern, C.; Mazurek, M.L. How Ready Is Your Ready? Assessing the Usability of Incident Response Playbook Frameworks. In CHI Conference on Human Factors in Computing Systems (CHI 2022); Association for Computing Machinery: New York, NY, USA, 2022; pp. 1–18.
Schlette, D.; Empl, P.; Caselli, M.; Schreck, T.; Pernul, G. Do You Play It by the Books? A Study on Incident Response Playbooks and Influencing Factors. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP 2024); IEEE: New York, NY, USA, 2024; pp. 3625–3643.
Woods, D.W.; Böhme, R.; Wolff, J.; Schwarcz, D. Lessons Lost: Incident Response in the Age of Cyber Insurance and Breach Attorneys. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 2023); USENIX Association: Berkeley, CA, USA, 2023; pp. 2259–2273.
Zonouz, S.A.; Khurana, H.; Sanders, W.H.; Yardley, T.M. RRE: A game-theoretic intrusion response and recovery engine. In Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009); IEEE: New York, NY, USA, 2009; pp. 439–448.
Cybersecurity and Infrastructure Security Agency. Cybersecurity Incident and Vulnerability Response Playbooks; CISA: Washington, DC, USA, 2021.
OASIS Open. Collaborative Automated Course of Action Operations (CACAO) Security Playbooks Version 2.0; OASIS Committee Specification 02; OASIS Open: Burlington, MA, USA, 2023.
Hays, C.; White, C. Employing LLMs for Incident Response Planning and Review. arXiv 2024, arXiv:2403.01271.
Hammar, K.; Alpcan, T.; Lupu, E.C. Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination. In Proceedings of the Network and Distributed System Security (NDSS) Symposium 2026, San Diego, CA, USA, 23–27 February 2026.
Lin, X.; Zhang, J.; Deng, G.; Liu, T.; Liu, X.; Yang, C.; Zhang, T.; Guo, Q.; Chen, R. IRCopilot: Automated Incident Response with Large Language Models. arXiv 2025, arXiv:2505.20945.
Castro, S.R.; Campbell, R.; Lau, N.; Villalobos, O.; Duan, J.; Cardenas, A.A. Large Language Models Are Autonomous Cyber Defenders. In Proceedings of the IEEE Conference on Artificial Intelligence (CAI 2025); IEEE: New York, NY, USA, 2025; pp. 1125–1132.
Mohammadi, H.; Davis, J.J.; Kiely, M. Leveraging Large Language Models for Autonomous Cyber Defense: Insights from CAGE-2 Simulations. IEEE Intell. Syst. 2025, 40, 29–36.
Rigaki, M.; Lukáš, O.; Catania, C.A.; Garcia, S. Out of the Cage: How Stochastic Parrots Win in Cyber Security Environments. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024); SCITEPRESS: Rome, Italy, 2024; Volume 3, pp. 774–781.
Loevenich, J.F.; Adler, E.; Mercier, R.; Velazquez, A.; Lopes, R.R.F. Design of an Autonomous Cyber Defence Agent Using Hybrid AI Models. In 2024 International Conference on Military Communication and Information Systems (ICMCIS); IEEE: New York, NY, USA, 2024.
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report EBSE-2007-01; Keele University: Staffordshire, UK; Durham University: Durham, UK, 2007.
Dong, Y.; Aung, Y.L.; Chattopadhyay, S.; Zhou, J. ChatIoT: Large Language Model-Based Security Assistant for Internet of Things with Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.09896.
Webb, B.K.; Purohit, S.; Meyur, R. Cyber Knowledge Completion Using Large Language Models. arXiv 2024, arXiv:2409.16176.
Dagnas, R.; Barbeau, M.; Garcia-Alfaro, J.; Yaich, R. Graph Analytics for Cyber-Physical System Resilience Quantification. arXiv 2025, arXiv:2504.02120.
Fakih, M.; Dharmaji, R.; Moghaddas, Y.; Quiros Araya, G.; Ogundare, O.; Al Faruque, M.A. LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP 2024); Association for Computing Machinery: New York, NY, USA, 2024; pp. 192–203.
Song, L.; Zhang, C.; Zhao, L.; Bian, J. Pre-Trained Large Language Models for Industrial Control. arXiv 2023, arXiv:2308.03028.
Xia, Y.; Jazdi, N.; Zhang, J.; Shah, C.; Weyrich, M. Control Industrial Automation System with Large Language Models. arXiv 2024, arXiv:2409.18009.
Ji, H.; Yang, J.; Chai, L.; Wei, C.; Yang, L.; Duan, Y.; Wang, Y.; Sun, T.; Guo, H.; Li, T.; et al. SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence. arXiv 2024, arXiv:2405.03446.
Karlsen, E.; Luo, Y.; Nargesian, F.; Pu, C. Benchmarking Large Language Models for Log Analysis in the Context of Cyber Security. arXiv 2023, arXiv:2311.14519.
Vaarandi, R.; Bahşi, H. Using Large Language Models for Template Detection from Security Event Logs. Int. J. Inf. Secur. 2025, 24, 104.
Grangel-González, I.; Halilaj, L.; Coskun, G.; Auer, S.; Lohmann, S. Seamless Integration of Cyber-Physical Systems in Knowledge Graphs. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (SAC 2018); Association for Computing Machinery: New York, NY, USA, 2018; pp. 2000–2003.
Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-Empowered Automatic Penetration Testing Tool. arXiv 2023, arXiv:2308.06782.
Lu, Y.; Liu, C.; Wang, K.I.-K.; Huang, H.; Xu, X. Digital Twin-Driven Smart Manufacturing: Connotation, Reference Model, Applications and Research Issues. Robot. Comput.-Integr. Manuf. 2020, 61, 101837.
Wang, Y.; Su, Z.; Guo, S.; Dai, M.; Luan, T.H.; Liu, Y. A Survey on Digital Twins: Architecture, Enabling Technologies, Security and Privacy, and Future Prospects. arXiv 2023, arXiv:2301.13350.
Al Zami, M.B.; Shaon, S.; Quy, V.K.; Nguyen, D.C. Digital Twin in Industries: A Comprehensive Survey. IEEE Access, 2024; in press/early access.
Singh, R.; Tariq, S.; Jalalvand, F.; Chhetri, M.B.; Nepal, S.; Paris, C.; Lochner, M. LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres. arXiv 2025, arXiv:2508.18947.
Deason, L.; Bali, A.; Bejean, C.; Bolocan, D.; Crnkovich, J.; Croitoru, I.; Durai, K.; Midler, C.; Miron, C.; Molnar, D.; et al. CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning. arXiv 2025, arXiv:2509.20166.
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248.
Tonmoy, S.M.T.I.; Zaman, S.M.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313.
Sriramanan, G.; Bharti, S.; Sadasivan, V.S.; Saha, S.; Kattakinda, P.; Feizi, S. LLM-Check: Investigating Detection of Hallucinations in Large Language Models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024); Curran Associates, Inc.: Red Hook, NY, USA, 2024.
Yadkori, Y.A.; Kuzborskij, I.; Stutz, D.; György, A.; Fisch, A.; Doucet, A.; Beloshapka, I.; Weng, W.-H.; Yang, Y.-Y.; Szepesvári, C.; et al. Mitigating LLM Hallucinations via Conformal Abstention. arXiv 2024, arXiv:2405.01563.
Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022); Association for Computational Linguistics: Vienna, Austria, 2022; pp. 3214–3252.
Liu, Y.; Jia, Y.; Geng, R.; Jia, J.; Gong, N.Z. Prompt Injection Attacks and Defenses in LLM-Integrated Applications. arXiv 2023, arXiv:2310.12815.
Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv 2023, arXiv:2307.15043.
Carlini, N.; Tramèr, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, Ú.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 2021); USENIX Association: Berkeley, CA, USA, 2021; pp. 2633–2650.
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2021); Association for Computing Machinery: New York, NY, USA, 2021; pp. 610–623.
Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and Social Risks of Harm from Language Models. arXiv 2021, arXiv:2112.04359.
Tabassi, E. Artificial Intelligence Risk Management Framework (AI RMF 1.0); NIST AI 100-1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023.
Autio, C.; Schwartz, R.; Dunietz, J.; Jain, S.; Stanley, M.; Tabassi, E.; Hall, P.; Roberts, K. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile; NIST AI 600-1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
Bhatt, M.; Chennabasappa, S.; Li, Y.; Nikolaidis, C.; Song, D.; Wan, S.; Ahmad, F.; Aschermann, C.; Chen, Y.; Kapil, D.; et al. CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models. arXiv 2024, arXiv:2404.13161.
National Cyber Security Centre. Guidelines for Secure AI System Development; National Cyber Security Centre: London, UK, 2023.
OWASP Foundation. OWASP Top 10 for LLM Applications 2025; OWASP GenAI Security Project; OWASP Foundation: Wilmington, DE, USA, 2025.
OWASP Foundation. LLM01:2025 Prompt Injection; OWASP GenAI Security Project; OWASP Foundation: Wilmington, DE, USA, 2025.
The MITRE Corporation. MITRE ATLAS™: Adversarial Threat Landscape for Artificial-Intelligence Systems; MITRE: McLean, VA, USA, 2025.
National Institute of Standards and Technology. Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations; NIST AI 100-2e2025; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2025.
Booth, H.; Souppaya, M.; Vassilev, A.; Ogata, M.; Scarfone, K. Secure Software Development Practices for Generative AI and Dual-Use Foundation Models; NIST Special Publication 800-218A; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
National Security Agency; Cybersecurity and Infrastructure Security Agency; Federal Bureau of Investigation; Australian Signals Directorate’s Australian Cyber Security Centre; Canadian Centre for Cyber Security; New Zealand National Cyber Security Centre; United Kingdom National Cyber Security Centre. Deploying AI Systems Securely: Best Practices for Deploying Secure and Resilient AI Systems; Joint Cybersecurity Information Sheet; National Security Agency: Fort Meade, MD, USA, 2024.
Cybersecurity and Infrastructure Security Agency; Australian Signals Directorate’s Australian Cyber Security Centre; National Security Agency’s Artificial Intelligence Security Center; Federal Bureau of Investigation; Canadian Centre for Cyber Security; German Federal Office for Information Security; Netherlands National Cyber Security Centre; New Zealand National Cyber Security Centre; United Kingdom National Cyber Security Centre. Principles for the Secure Integration of Artificial Intelligence in Operational Technology; Joint Guidance; Joint Cybersecurity Information Sheet; Cybersecurity and Infrastructure Security Agency: Washington, DC, USA, 2025.
Hammar, K.; Alpcan, T.; Lupu, E.C. Hallucination-Resistant Security Planning with a Large Language Model. arXiv 2026, arXiv:2602.05279.
Gao, Y.; Hammar, K.; Li, T. In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach. arXiv 2026, arXiv:2602.13156.
Ren, Y.; Wang, J.; Zhao, Z.; Wen, H.; Li, H.; Zhu, H. Automated Tactics Planning for Cyber Attack and Defense Based on Large Language Model Agents. Neural Netw. 2025, 191, 107842.
National Cyber Security Centre. Impact of AI on Cyber Threat from Now to 2027; National Cyber Security Centre: London, UK, 2025.
European Parliament and Council of the European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Off. J. Eur. Union 2024, L2024/1689, 1–144.
European Data Protection Board. AI Privacy Risks & Mitigations Large Language Models (LLMs); European Data Protection Board: Brussels, Belgium, 2025.
Ding, W.; Abdel-Basset, M.; Ali, A.M.; Moustafa, N. Large language models for cyber resilience: A comprehensive review, challenges, and future perspectives. Appl. Soft Comput. 2025, 170, 112663.
Lei, Z.; Dong, Y.; Li, W.; Ding, R.; Wang, Q.R.; Li, J. Harnessing Large Language Models for Disaster Management: A Survey. In Findings of the Association for Computational Linguistics: ACL 2025; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 14528–14551.
Manowska, A.; Syta, J. Application of Large Language Models in the Protection of Industrial IoT Systems for Critical Infrastructure. Appl. Sci. 2026, 16, 730.
Vyas, J.; Mercangöz, M. Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation. arXiv 2025, arXiv:2507.07115.
Xiao, Z.; Sun, J.; Chen, J. AIR: Improving Agent Safety through Incident Response. arXiv 2026, arXiv:2602.11749.

Figure 1. Layered reference architecture for LLM-assisted recovery-plan generation in critical infrastructures. The figure synthesizes the proposed stack across context assembly, grounding, orchestration, validation, governance, and output packaging.

Figure 2. Staged roadmap for LLM-enabled recovery planning in critical infrastructures, moving from grounded copilots to validated planners and bounded orchestration. The figure also indicates increasing planning authority and the cross-cutting requirements that must mature across all stages.

Table 1. Structured review protocol used in this article.

Component	Description
Objective	Synthesize architectures, applications, limitations, and future research directions for LLM-enabled recovery-plan generation in critical infrastructures.
Source types	Journal articles, conference papers, standards, official guidance, and selected high-quality preprints.
Databases and repositories	Scopus, Web of Science, IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, MDPI, arXiv, NIST, CISA, OASIS, and EU policy sources.
Search themes	Combinations of “large language model”, “generative AI”, “agentic AI”, “critical infrastructure”, “OT”, “ICS”, “CPS”, “incident response”, “recovery planning”, “playbook”, “resilience”, “digital twin”, and “knowledge graph”.
Inclusion criteria	Sources addressing at least one of the following: (i) recovery/response planning, (ii) CI/OT/CPS resilience, (iii) LLM architectures relevant to planning, or (iv) assurance/evaluation for dependable deployment.
Exclusion criteria	Purely offensive studies without planning relevance, duplicate reports, low-substance opinion pieces, and papers whose contribution could not be connected to recovery-plan generation.
Coding dimensions	Sector, planning task, data/grounding source, architecture pattern, tool use, human role, verification strategy, evaluation metrics, and limitation profile.
Synthesis approach	Structured critical review with conceptual integration; no quantitative meta-analysis and no claim of an exhaustive census.

Table 2. Positioning of this article against representative review literature.

Review	Primary Scope	Critical-Infrastructure Focus	Recovery-Plan Generation Focus	Architecture-Centric Synthesis	Roadmap/Assurance Focus
Ani et al. [27]	Critical infrastructure protection approaches	Yes	No	Limited	Limited
Segovia-Ferreira et al. [28]	Cyber-resilience for CPS	Yes	No	Partial	Partial
Yigit et al. [30]	Generative AI and LLMs for CIP	Yes	Partial	Partial	Yes
Coelho da Silva and Westphall [31]	LLMs in cybersecurity	No	No	Yes	Partial
Motlagh et al. [32]	LLMs in cybersecurity state of the art	No	No	Yes	Partial
Xu et al. [33]	Systematic review of LLMs in cybersecurity	No	No	Yes	Yes
This article	LLM-enabled recovery-plan generation in CI	Yes	Yes	Yes	Yes

Table 3. Architectural patterns for LLM-enabled recovery-plan generation.

Pattern	Primary Grounding Sources	Strengths	Main Failure Modes	Representative Studies
Prompt-only assistant	User prompts and model parameters	Rapid drafting, brainstorming, after-action summaries	Hallucination, stale knowledge, low executability	Hays and White [51]
RAG-grounded copilot	Runbooks, manuals, tickets, inventories, logs, CTI	Traceability, contextualization, lower factual drift	Retrieval mismatch, context overflow, source poisoning	Hammar et al. [52]; Dong et al. [60]
Knowledge-graph/GraphRAG planner	Asset-service graphs, attack knowledge, dependency graphs	Dependency awareness, structured reasoning, impact tracing	Graph incompleteness, stale topology, integration cost	Webb et al. [61]; Dagnas et al. [62]
Multi-agent planner	Role-specific prompts, tools, retrieved context	Task decomposition, role separation, critique and review	Coordination overhead, cascading errors, opaque arbitration	IRCopilot [53]; Castro et al. [54]
Hybrid verified planner	RAG + rules + simulators + digital twins + formal checkers	Highest assurance potential, feasibility testing, bounded autonomy	Engineering complexity, tooling burden, limited portability	LLM4PLC [63]; Song et al. [64]; Xia et al. [65]
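
The dependency awareness attributed to graph-aware planners in Table 3 can be illustrated with a minimal Python sketch that derives a restoration order from an asset-service dependency graph. The asset names, the `restoration_order` helper, and the choice of a plain topological sort are assumptions made for illustration; they are not taken from the studies cited in the table, which use richer graph and retrieval machinery.

```python
from graphlib import CycleError, TopologicalSorter


def restoration_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every asset follows the assets it depends on.

    `depends_on` maps each asset to the set of assets that must be restored
    first (hypothetical data model, for illustration only).
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # A cycle means the graph is inconsistent; surface it rather than guess.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


# Hypothetical substation fragment: the HMI needs the historian and PLC network,
# and both of those need the core switch.
deps = {
    "scada-hmi": {"historian", "plc-network"},
    "historian": {"core-switch"},
    "plc-network": {"core-switch"},
    "core-switch": set(),
}

order = restoration_order(deps)
print(order)  # core-switch comes first and scada-hmi last
```

A sort of this kind only respects declared dependencies. The failure modes listed in the table (graph incompleteness, stale topology) still apply, because an edge that is missing from the graph cannot constrain the order.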

Table 4. Sectoral application scenarios for LLM-enabled recovery-plan generation.

Sector	Representative Inputs	Recovery Objective	Most Suitable Architecture Pattern	Current Evidence Maturity
Electric power and energy	SCADA alarms, switching procedures, topology models, maintenance constraints	Service restoration sequencing, substation isolation, operator communication	RAG + graph-aware grounding + validation	Low-to-moderate
Water and wastewater	Process alarms, historian data, P&IDs, treatment-stage dependencies	Safe staged restoration, contamination containment, degraded-mode operation	RAG + dependency graph + digital-twin validation	Moderate in testbeds
Manufacturing and process industries	PLC logic, control recipes, HMI states, vendor manuals	Restart sequencing, logic restoration, fallback recipe selection	Hybrid verified planner	Moderate
Transportation and logistics	Traffic/rail operations procedures, interlocking dependencies, scheduling constraints	Safe service recovery under partial outage	RAG + multi-agent coordination	Low
Healthcare and hospital infrastructures	Device inventories, clinical priority rules, downtime procedures, cyber incident tickets	Priority service restoration and cross-stakeholder coordination	RAG-grounded copilot	Low
Telecom and cloud-backed infrastructure	Logs, CTI, tickets, service maps, escalation policies	Triage-to-recovery coordination, configuration rollback, customer communication	RAG + multi-agent planner	Moderate

Table 5. Limitations and corresponding assurance mechanisms for recovery-plan generation.

Risk or Limitation	Why It Matters in Critical Infrastructures	Recommended Mitigation or Assurance Mechanism	Candidate Evaluation Signal
Hallucinated actions	Can produce unsafe or impossible restoration steps	RAG with provenance, source citation, hallucination detection, abstention	Unsupported-step rate; abstention quality; reviewer override rate
State mismatch	A technically correct action may be wrong for the current asset state	Live asset synchronization, topology validation, change-log integration	Plan executability against current state
Dependency blindness	Local optimization can trigger cross-service failures	Knowledge graphs, service maps, impact analysis, twin-based dependency checks	Service-level restoration score; cascade-risk score
Unsafe control recommendations	May violate process safety or maintenance constraints	Rule engines, formal checks, digital twins, human safety approval	Unsafe-action rate in simulation or testbed
Prompt injection/data poisoning	May manipulate plans or retrieved evidence	Trusted-source segregation, retrieval hardening, tool sandboxing, red teaming	Attack success rate under adversarial prompts
Data leakage	Sensitive operational data may leave controlled boundaries	On-prem or private deployment, redaction, access control, retention policies	Leakage tests; privacy incident count
Over-reliance and deskilling	Operators may accept plausible but wrong plans	Mandatory review gates, explanation design, training and tabletop exercises	Time-to-approval; trust calibration; post-exercise error rate
Lack of accountability	Unclear responsibility undermines safe adoption	Role-based approvals, immutable audit trails, policy-aligned escalation paths	Approval trace completeness; policy compliance rate
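
As an illustration of how two of the candidate evaluation signals in Table 5 might be operationalized, the following Python sketch computes an unsupported-step rate and a reviewer override rate for a single plan. The `PlanStep` structure, its field names, and the definition of "unsupported" as "cites no source" are assumptions introduced here; the table names the signals but does not define how they are computed.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """One recovery-plan step (hypothetical structure for illustration)."""
    action: str
    source_ids: list[str] = field(default_factory=list)  # cited evidence, if any
    overridden: bool = False  # True if a human reviewer rejected or edited the step


def unsupported_step_rate(steps: list[PlanStep]) -> float:
    """Fraction of steps that cite no supporting source (assumed definition)."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if not s.source_ids) / len(steps)


def reviewer_override_rate(steps: list[PlanStep]) -> float:
    """Fraction of steps that a reviewer overrode (assumed definition)."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s.overridden) / len(steps)


# Hypothetical four-step plan: one step lacks evidence, one was overridden.
plan = [
    PlanStep("Isolate HMI workstation", ["runbook-12"]),
    PlanStep("Restore PLC logic from backup", ["vendor-manual-3"], overridden=True),
    PlanStep("Restart pump P-101"),
    PlanStep("Notify shift supervisor", ["policy-7"]),
]

print(unsupported_step_rate(plan))   # 0.25
print(reviewer_override_rate(plan))  # 0.25
```

Counting citations measures only whether evidence is attached, not whether it actually supports the step. A stricter variant would require a reviewer or an automated check to confirm each citation.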

Table 6. Proposed research roadmap for LLM-enabled recovery-plan generation in critical infrastructures.

Horizon	Planning Stage	Readiness Basis and Deliverable
0–12 months	Grounded recovery copilots	Basis: curated corpora, provenance, role-aware prompting, and tabletop review are sufficient for read-only support. Deliverable: auditable RAG copilots for plan drafting and documentation.
1–3 years	Validated dependency-aware planners	Basis: asset-service graphs, live state feeds, recovery-specific benchmarks, and rollback/simulation checks become available. Deliverable: graph-grounded planners validated in sector sandboxes and testbeds.
3–5 years	Bounded operational orchestration	Basis: approval policies, digital twins, orchestration interfaces, uncertainty reporting, and operator-trust evidence are mature enough for constrained use. Deliverable: human-supervised planners with limited execution authority.
Beyond 5 years	Verified adaptive recovery ecosystems	Basis: machine-readable recovery artifacts, cross-organizational interoperability, certification criteria, and assurance cases are standardized. Deliverable: interoperable and certifiable recovery-planning ecosystems for selected CI domains.

Table 7. Practical deployment checklist for pilot LLM-based recovery copilots.

Deployment Dimension	Minimum Requirement for a Credible Pilot	Red Flag Indicating Premature Deployment
Knowledge base quality	Current runbooks, inventories, dependency maps, and role definitions are curated and versioned	Reliance on scattered PDFs, stale spreadsheets, or undocumented tribal knowledge
Grounding and provenance	Every recommendation cites retrieved evidence and retrieval timestamps	Recommendations cannot be traced back to authoritative sources
Validation workflow	Plans are checked against policy rules, simulations, or domain constraints before approval	Model outputs are accepted based only on linguistic plausibility
Human authorization	Explicit approvers, override rules, and escalation paths are defined	Ambiguous ownership of who may approve or reject recovery actions
Security and privacy	Access controls, prompt-injection defenses, and data-boundary policies are enforced	Sensitive recovery context is exposed to uncontrolled tools or connectors
Auditability	Inputs, retrieved sources, plan versions, approvals, and outcomes are logged	The organization cannot reconstruct why a recommendation was made
Evaluation	Pilot exercises measure plan quality, abstention behavior, operator workload, and time-to-decision	Success is judged only by user enthusiasm or anecdotal usefulness
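
Several rows of Table 7 lend themselves to an automated pre-approval gate. The Python sketch below checks one recommendation against the grounding, validation, and authorization rows and returns the red flags it finds. The `Evidence` and `Recommendation` structures, the flag wording, and the example identifiers are hypothetical and serve only to show the shape such a check could take.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Evidence:
    source_id: str
    retrieved_at: Optional[datetime]  # when the source was retrieved


@dataclass
class Recommendation:
    text: str
    evidence: list[Evidence] = field(default_factory=list)
    validated: bool = False          # passed policy, simulation, or constraint checks
    approver: Optional[str] = None   # named human approver, if assigned


def red_flags(rec: Recommendation) -> list[str]:
    """List the checklist red flags raised by one recommendation."""
    flags = []
    if not rec.evidence:
        flags.append("no cited evidence")
    elif any(e.retrieved_at is None for e in rec.evidence):
        flags.append("missing retrieval timestamp")
    if not rec.validated:
        flags.append("not validated")
    if rec.approver is None:
        flags.append("no named approver")
    return flags


ok = Recommendation(
    "Roll back firewall configuration to the last approved version",
    [Evidence("runbook-net-04", datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc))],
    validated=True,
    approver="ot-shift-lead",
)
bad = Recommendation("Re-enable remote access")

print(red_flags(ok))   # []
print(red_flags(bad))  # ['no cited evidence', 'not validated', 'no named approver']
```

A gate like this covers only the machine-checkable rows. Knowledge base quality, auditability, and evaluation design still need organizational review.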
