1. Introduction
Critical infrastructures are not merely collections of isolated assets. They are interdependent socio-technical systems in which failures propagate across cyber, physical, organizational, and geographic layers [1,2,3,4]. In such environments, post-incident recovery is rarely a linear checklist exercise. A restoration decision made for one subsystem can degrade another subsystem’s availability, violate operational safety constraints, or delay the recovery of a more critical public service. This challenge is particularly acute in infrastructures that combine legacy field devices, supervisory control, IT services, vendor dependencies, and human operators.
Operational technology (OT) and industrial control environments intensify the problem. Compared with enterprise IT, critical infrastructure recovery must account for process safety, deterministic timing, device heterogeneity, maintenance windows, fallback modes, and the possibility that the “secure” action is not the “safe” action [5,6,7,8,9,10]. Recent regulatory and standards developments also place stronger emphasis on resilience, preparedness, continuity, and recovery, which means that organizations increasingly need response capabilities that are not only technically sound but also explainable, documented, and auditable.
From a Future Internet perspective, the problem is becoming more pressing because critical infrastructures are no longer isolated control islands. They increasingly rely on remote connectivity, cloud and edge services, industrial IoT gateways, vendor portals, and cross-organizational data exchange, which expand both operational visibility and recovery complexity [5,6,7,8,9,10]. Recovery planning therefore has to reason not only about local device restoration but also about networked dependencies across digital platforms, service providers, and interdependent infrastructures.
At the same time, large language models (LLMs) have matured rapidly. The transformer architecture enabled the scale-up of contemporary language modeling [11], while successive model generations, from BERT [12] and GPT-3 [13] to instruction-following systems [14] and GPT-4-class models [15], demonstrated that language models can summarize, classify, reason over heterogeneous text, generate code, and interact with external tools. The broader “foundation model” perspective further clarified that large pre-trained models can be adapted across tasks and domains, including high-consequence ones, although adaptation and governance remain essential [16].
Recent advances in prompting and agent design make LLMs especially relevant to planning-oriented tasks. Chain-of-thought prompting, zero-shot reasoning, and plan-and-solve prompting improve decomposition of multi-step problems [17,18,19]. ReAct and Toolformer connect reasoning with external actions and tools [20,21]. Retrieval-augmented generation (RAG) and its later variants allow models to ground outputs in external knowledge stores rather than parametric memory alone [22,23,24]. Multi-agent and agentic LLM paradigms further distribute reasoning across specialized roles, which is attractive for incident command, safety checking, and documentation workflows [25,26].
Model and agent capabilities have continued to advance after these foundational works. Recent frontier releases and evaluations report stronger performance in coding, tool use, long-context reasoning, and cybersecurity-relevant tasks, while also emphasizing the need for stronger deployment safeguards and safety evaluation [27,28,29,30]. At the same time, memory-augmented and self-evolving agent systems are emerging in the broader agent ecosystem, including long-term memory mechanisms, skill libraries, and iterative self-improvement loops [31,32,33,34,35]. These developments motivate the present article’s architecture-centric focus: in critical infrastructures, rapid model progress changes the capability frontier, but dependable recovery still depends on grounding, validation, authorization, and auditability.
However, the literature remains dispersed. Existing reviews examine critical infrastructure protection broadly [36,37,38,39] or survey LLMs in cybersecurity at large [40,41,42]. These works are valuable, but they do not center the specific problem of generating recovery plans for critical infrastructures after disruptive incidents. Recovery-plan generation sits at the intersection of playbook engineering, incident response, infrastructure dependency modeling, OT/ICS safety, AI assurance, and human decision making. Treating it as merely another chatbot use case obscures the most consequential design requirements.
Unlike prior surveys on LLMs in cybersecurity or generative AI for critical infrastructure protection, this article focuses specifically on the generation of recovery plans after disruptive events. Its novelty lies in framing recovery-plan generation as a socio-technical planning problem; organizing the field around architecture patterns for grounded planning and plan assurance; introducing a reference stack for trustworthy recovery-plan generation; and proposing a staged research roadmap from grounded copilots to verified, sector-aware recovery planners.
The review addresses four questions. First, what architecture patterns are emerging for LLM-enabled recovery-plan generation in critical infrastructures? Second, which application scenarios are realistic across major sectors? Third, what technical, organizational, and governance limitations currently constrain adoption? Fourth, which research directions are most likely to move the field from promising prototypes toward dependable operational capability? To answer these questions, the remainder of the article first defines the problem space, then explains the review methodology, synthesizes the literature by architecture and application, analyzes limitations and assurance requirements, and finally proposes a research roadmap and future directions.
The paper’s contributions can be summarized as follows:
It formulates recovery-plan generation as a distinct high-consequence planning problem in critical infrastructures, rather than treating it as a generic chatbot or summarization task.
It develops an architecture-centric synthesis spanning prompt-only assistants, RAG-grounded copilots, graph-aware planners, multi-agent planners, and hybrid pipelines coupled with verification, simulation, or digital twins.
It proposes the Grounded Recovery Planning Stack (GRPS) as a reference design for trustworthy deployment in operationally sensitive environments.
It maps sectoral applications across energy, water, manufacturing, transportation, healthcare, and telecommunications, while explicitly distinguishing evidence maturity for decision support from readiness for unattended autonomy.
It organizes the field’s limitations into an assurance-oriented framework and translates that synthesis into a staged research roadmap and practical deployment guidance for operators, vendors, and regulators.
The remainder of the paper is organized as follows.
Section 2 defines recovery planning as a socio-technical task and explains why LLM-assisted recovery is harder in critical infrastructures than in enterprise IT.
Section 3 presents the structured critical review methodology.
Section 4 reviews related survey literature and identifies the specific gap addressed by this article.
Section 5 synthesizes LLM-enabled recovery-planning architectures and introduces the Grounded Recovery Planning Stack (GRPS).
Section 6 maps sectoral applications across critical infrastructure domains.
Section 7 analyzes limitations, risks, and assurance requirements.
Section 8 presents the staged research roadmap and cross-cutting future directions.
Section 9 discusses theoretical and practical implications for operators, vendors, and regulators.
Section 10 concludes the article.
2. Background and Problem Framing
2.1. Recovery Planning as a Distinct Socio-Technical Task
Recovery planning should be distinguished from detection, triage, or post-incident reporting. In this review, recovery-plan generation denotes the production of an ordered and justified set of restoration actions, decision points, approvals, rollback conditions, and monitoring checks intended to restore a target service state while respecting safety, business priority, and regulatory constraints. This conception is consistent with the broader evolution from static intrusion response taxonomies toward playbook-oriented and resilience-aware response systems [43,44,45,46,47,48].
Traditional playbooks remain indispensable because they encode organizational knowledge, responsibilities, and approved response patterns. Government and standards organizations have also pushed playbooks toward more formal and machine-readable forms, such as the CISA response playbooks and OASIS CACAO [49,50]. Yet, even well-authored playbooks are necessarily incomplete. They cannot anticipate every variant of asset state, telemetry inconsistency, operator availability, third-party dependency, or cross-sector cascade that arises during a real incident. As a result, human teams still spend substantial effort contextualizing playbooks, reconciling conflicting evidence, and drafting recovery sequences under time pressure.
The core problem can therefore be represented as a planning task over a partially observed, safety-constrained state space. Let the input contain incident evidence, asset and service dependencies, operational constraints, recovery priorities, available resources, and validated organizational knowledge. The output is not just a text answer, but a plan bundle consisting of ordered actions, justifications, uncertainty flags, required approvals, and measurable completion criteria. A useful recovery-planning system must transform heterogeneous evidence into action recommendations that are grounded in current context and can survive operational scrutiny.
This formulation is materially different from generic question answering. A plausible-sounding recommendation is insufficient if it references the wrong asset, ignores a maintenance interlock, violates process safety, or omits the need for a rollback condition. In high-consequence infrastructures, the relevant success criterion is not linguistic fluency but whether the plan is safe, feasible, auditable, and timely enough to support human decision making.
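The plan bundle described above can be made concrete as a small data structure. The following is a minimal sketch, not a schema taken from the reviewed literature: the class names `RecoveryAction` and `PlanBundle`, the field names, and the `review_findings` check are all hypothetical and chosen only to mirror the elements listed in the text (ordered actions, justifications, uncertainty flags, required approvals, rollback conditions, and completion criteria).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecoveryAction:
    """One ordered restoration step (hypothetical field names)."""
    order: int
    description: str
    justification: str
    required_approvals: List[str] = field(default_factory=list)
    rollback_condition: Optional[str] = None
    completion_criterion: Optional[str] = None
    uncertainty_flags: List[str] = field(default_factory=list)


@dataclass
class PlanBundle:
    """A candidate recovery plan plus the metadata reviewers need."""
    target_service_state: str
    actions: List[RecoveryAction]

    def review_findings(self) -> List[str]:
        """List structural omissions a human reviewer should see first."""
        findings = []
        for action in sorted(self.actions, key=lambda a: a.order):
            if not action.justification.strip():
                findings.append(f"step {action.order}: missing justification")
            if action.rollback_condition is None:
                findings.append(f"step {action.order}: missing rollback condition")
            if action.completion_criterion is None:
                findings.append(f"step {action.order}: missing completion criterion")
        return findings


# Illustrative data only; asset names and criteria are invented.
bundle = PlanBundle(
    target_service_state="historian online, controllers in monitored mode",
    actions=[
        RecoveryAction(
            order=1,
            description="Isolate the affected network segment",
            justification="Prevents re-infection during restoration",
            required_approvals=["OT engineer"],
            rollback_condition="Loss of telemetry from safety sensors",
            completion_criterion="No cross-segment traffic observed",
        ),
        RecoveryAction(
            order=2,
            description="Restore historian from last verified backup",
            justification="Required for post-restoration monitoring",
            completion_criterion="Historian ingesting live tags",
        ),
    ],
)
print(bundle.review_findings())
```

A structural check of this kind says nothing about whether the plan is safe or feasible; it only ensures that the elements needed for operational scrutiny are present before review begins.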
2.2. Why LLMs Are Attractive for Recovery-Plan Generation
The attraction of LLMs lies in their ability to unify activities that are usually separated across tools: interpreting unstructured documents, summarizing alerts, extracting procedures from manuals, reconciling naming mismatches, generating operator-facing explanations, and proposing action sequences. Early work has already explored this potential in incident response planning and autonomous defense contexts. Hays and White examined the use of LLMs for incident response planning and review [51]. Hammar et al. proposed a lightweight pipeline that combines fine-tuning, retrieval, and planning to reduce hallucination in incident response planning [52]. IRCopilot used collaborative LLM components to emulate the dynamic phases of an incident response team [53]. In parallel, autonomous cyber defense research has studied LLM agents in simulated defensive environments and hybrid architectures [54,55,56,57].
These studies suggest that LLMs are especially promising when recovery work is information-heavy and coordination-heavy rather than purely closed-loop control. They can draft candidate plans; explain why a given recovery order is preferable; translate technical recommendations for different stakeholders; and accelerate knowledge access across logs, tickets, runbooks, and manuals. In other words, LLMs are strongest where recovery depends on integrating fragmented knowledge and weaker where safe actuation requires deterministic guarantees.
2.3. Why Critical Infrastructures Are Harder than Enterprise IT
Critical infrastructures add constraints that sharply distinguish them from general enterprise IT. First, the cost of a wrong action can be physical harm, not just data loss. Second, restoration often involves cyber–physical dependencies: a recovered server may be operationally useless if a field controller, sensor chain, or communications link remains impaired [1,2,38]. Third, OT environments often contain legacy systems, proprietary protocols, incomplete asset inventories, and limited logging [5]. Fourth, recovery authority is distributed across engineers, operators, safety staff, contractors, and sector regulators. Fifth, many infrastructures must plan for degraded modes rather than full restoration, which means “best” plans are multi-objective and context dependent.
These constraints imply that critical infrastructure recovery cannot rely on raw language generation alone. It needs state-aware grounding, explicit constraint handling, uncertainty management, and role-appropriate human authorization. The question is therefore not whether an LLM can propose a recovery plan, but what surrounding architecture can make such a proposal dependable enough to be useful.
3. Review Methodology
This article follows a structured critical review methodology informed by PRISMA 2020 reporting principles and software-engineering guidance for systematic reviews [58,59]. It should therefore be read as a structured critical/narrative review rather than as a formal systematic review or scoping review. The aim is analytical depth rather than exhaustive bibliometric enumeration. In particular, the review does not claim a meta-analysis, nor does it report screening counts that would imply a closed corpus. Instead, it prioritizes representative and influential sources up to March 2026 across five intersecting areas: critical infrastructure resilience, OT/ICS security, incident response and playbooks, LLMs in cybersecurity, and trustworthy AI for high-consequence use.
Search and selection emphasized peer-reviewed papers, standards, official guidance, and high-impact preprints from venues where the field currently moves fastest. Because recovery-plan generation is still emergent, excluding preprints would omit important technical developments in agentic response systems and domain-grounded planning. To reduce duplication, papers were coded by problem focus, architecture pattern, application sector, validation strategy, human role, and limitation profile.
Table 1 summarizes the structured review protocol used in this article, including the search scope, screening logic, and synthesis strategy.
A deliberate choice in this review is to analyze literature through the lens of recovery-plan generation, even when individual papers focus on adjacent tasks such as PLC code generation, log understanding, CTI reasoning, or autonomous cyber defense. This is justified because real recovery planning draws on precisely these adjacent capabilities: telemetry interpretation, state estimation, procedural recall, dependency reasoning, action sequencing, and operator communication. The synthesis therefore privileges functional relevance over narrow keyword matching.
For transparency, the use of generative AI assistance in manuscript preparation is disclosed in the Acknowledgments in line with journal policy; however, source selection, factual verification, interpretation, and final editorial decisions remained the responsibility of the authors.
4. Related Review Literature and Identified Gap
Prior review literature can be divided into three clusters. The first cluster surveys critical infrastructure protection and cyber-resilience more broadly, often with emphasis on dependency modeling, CPS resilience, and protection strategies [36,37,38]. The second cluster surveys generative AI and LLMs for critical infrastructure protection at a macro level [39]. The third cluster reviews LLMs across the cybersecurity domain, covering tasks such as vulnerability analysis, malware understanding, log analysis, CTI extraction, and offensive security [40,41,42].
The present article differs from all three clusters. It does not ask simply whether LLMs are useful in cybersecurity, nor does it review critical infrastructure protection in general. Instead, it asks how LLMs can generate, justify, and validate recovery plans for critical infrastructures after disruptive incidents. This narrower focus reveals requirements that broader surveys typically treat only indirectly: live-state grounding, infrastructure dependencies, rollback logic, human authorization, safety barriers, and sector-specific assurance.
Table 2 positions the present article against representative review literature and makes the article’s distinct gap and contribution explicit.
This gap matters because recovery is where cyber defense meets real-world service continuity. Detection without recovery remains incomplete. Likewise, autonomy without assurance is unacceptable in critical infrastructures. A review centered on recovery-plan generation therefore contributes a problem formulation that is both operationally concrete and scientifically underdeveloped.
5. Architectural Patterns for LLM-Enabled Recovery-Plan Generation
The reviewed literature suggests that LLM-enabled recovery planning is best understood through architecture patterns rather than model names. The same foundation model can behave very differently depending on whether it is used as a prompt-only assistant, a RAG-grounded copilot, a graph-aware planner, or part of a verified hybrid pipeline. Architecture determines what context the model receives, what tools it can invoke, how it is checked, and how much autonomy it is allowed.
Table 3 synthesizes the main architectural patterns that recur in the literature on LLM-enabled recovery-plan generation.
5.1. Prompt-Only Assistants
The simplest pattern uses an LLM directly through carefully engineered prompts. In this mode, the model acts as a drafting assistant for candidate recovery steps, communication templates, or after-action rationales. Prompt-only systems are attractive because they require minimal integration effort and can be deployed rapidly for tabletop exercises, analyst note writing, and first-pass procedure drafting [51]. They are also useful in low-maturity environments where structured knowledge bases do not yet exist.
However, prompt-only assistants are poorly suited to high-consequence recovery decisions. Without live grounding, they can hallucinate non-existent assets, miss organization-specific nomenclature, or recommend actions that are technically coherent but operationally invalid. In recovery contexts, these are not marginal defects. They directly undermine trust and can create unsafe recommendations. Prompt-only systems should therefore be treated as ideation tools or documentation aids rather than decision engines for live restoration.
5.2. Retrieval-Augmented Copilots
RAG-grounded systems are currently the most credible near-term architecture for recovery support. Here, the model retrieves relevant material from curated knowledge sources (runbooks, vendor manuals, incident tickets, asset inventories, change logs, safety procedures, and CTI reports) and then generates a plan conditioned on those sources [22,23]. This architecture lowers reliance on parametric memory and increases auditability because recommendations can be traced to retrieved evidence.
The incident response planning pipeline proposed by Hammar et al. is particularly instructive because it combines fine-tuning, retrieval, and planning in a lightweight model stack designed to reduce hallucination [52]. Similar logic appears in domain assistants such as ChatIoT, which integrates heterogeneous IoT security information using RAG [60], and in CTI-focused work such as SEvenLLM, which emphasizes domain-tailored instruction data and evaluation [66]. Log understanding and event template extraction studies further suggest that LLM pipelines can provide operational value once they are anchored to the right evidence sources [67,68].
For recovery planning, the most important design decision in RAG systems is not the retriever alone but the composition of the knowledge base. A recovery copilot must fuse several evidence classes: static procedural knowledge, near-real-time incident evidence, current asset and topology state, and organizational constraints such as approvals or regulatory reporting obligations. If any of these classes is missing or stale, retrieval may still return apparently relevant context while the final plan remains infeasible.
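The point about missing or stale evidence classes can be illustrated with a pre-generation check. This is a minimal sketch under stated assumptions: the four class names follow the paragraph above, but the `MAX_AGE` thresholds and the `evidence_gaps` function are hypothetical and would need to be set per organization and sector.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness limits per evidence class (illustrative values only).
MAX_AGE = {
    "procedural_knowledge": timedelta(days=365),
    "incident_evidence": timedelta(minutes=15),
    "asset_topology_state": timedelta(hours=1),
    "organizational_constraints": timedelta(days=30),
}


def evidence_gaps(last_updated, now):
    """Return {evidence_class: "missing" | "stale"} for classes that fail the check.

    last_updated maps an evidence class to the timestamp of its latest refresh.
    """
    gaps = {}
    for evidence_class, max_age in MAX_AGE.items():
        timestamp = last_updated.get(evidence_class)
        if timestamp is None:
            gaps[evidence_class] = "missing"
        elif now - timestamp > max_age:
            gaps[evidence_class] = "stale"
    return gaps


# Illustrative usage with a fixed reference time.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "procedural_knowledge": now - timedelta(days=10),
    "incident_evidence": now - timedelta(minutes=5),
    "asset_topology_state": now - timedelta(hours=6),
}
print(evidence_gaps(last_updated, now))
```

One possible design choice is to surface these gaps as uncertainty flags on the generated plan rather than blocking generation outright, so that operators can see which parts of the plan rest on outdated context.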
5.3. Knowledge-Graph and GraphRAG Planners
Critical infrastructure recovery depends heavily on dependency reasoning. Which services depend on which controllers? Which remote terminal units depend on which communications paths? Which safety functions must remain online before a controller reboot is attempted? Knowledge graphs and GraphRAG-like methods are therefore natural extensions of plain-text RAG [24]. They allow the planning system to reason over typed entities and relations rather than relying only on semantically similar document chunks.
Recent work points in this direction. Webb et al. use LLMs and retrieval for cyber knowledge completion, bridging gaps between attack-pattern taxonomies and cyber–physical risk reasoning [61]. Grangel-González et al. show how cyber–physical systems can be represented within knowledge graphs [69]. Dagnas et al. demonstrate that graph-based CPS modeling can support resilience quantification and the identification of critical points [62]. Combined, these strands suggest a path toward dependency-aware recovery planning in which the LLM no longer “guesses” relationships from prose but queries a structured representation of the infrastructure.
From a recovery perspective, graph-aware grounding offers three major advantages. First, it can support service-centric planning rather than asset-centric planning. Second, it can surface indirect effects and restoration prerequisites. Third, it can make explanations more actionable by linking recommendations to explicit dependency paths. The main obstacle is that most organizations do not yet maintain sufficiently complete and timely knowledge graphs of their infrastructures, which makes graph-aware recovery planning as much a data-engineering challenge as an LLM challenge.
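Service-centric planning over explicit dependency paths can be sketched with a plain dependency map and a topological sort. The example below is illustrative only: the asset names and the `restoration_order` helper are hypothetical, and a production system would query a typed knowledge graph rather than a Python dictionary. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def restoration_order(depends_on, target_service):
    """Return the assets needed for target_service, prerequisites first.

    depends_on maps each asset to the assets it directly requires.
    Raises graphlib.CycleError if the dependency data contains a cycle.
    """
    # Collect the transitive prerequisites of the target service.
    needed = set()
    stack = [target_service]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(depends_on.get(node, ()))

    # Order only the relevant subgraph so unrelated assets are ignored.
    subgraph = {node: set(depends_on.get(node, ())) for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical dependency data for illustration.
depends_on = {
    "scada_hmi": ["plc_pump_station", "historian"],
    "plc_pump_station": ["comms_link_a", "safety_plc"],
    "historian": ["comms_link_a"],
    "billing_portal": ["historian"],
}
order = restoration_order(depends_on, "scada_hmi")
print(order)
```

Because the ordering is derived from explicit edges, each step in a generated plan can cite the dependency path that justifies it, and assets outside the target service's dependency closure (here `billing_portal`) are left out of the sequence.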
5.4. Multi-Agent Planners
Multi-agent architectures distribute planning across specialized roles such as incident commander, OT engineer, safety officer, compliance reviewer, and communications lead [25,26]. Conceptually, this mirrors how real recovery decisions are made: no single person owns all relevant knowledge, and acceptable plans usually emerge through review, critique, and approval. LLM-based multi-agent systems therefore offer a compelling abstraction for recovery-plan generation.
IRCopilot exemplifies this design by organizing automated incident response into collaborative session components with differentiated responsibilities [53]. Autonomous cyber defense studies also suggest that LLM-based teams can be evaluated as interacting defenders rather than isolated models [54,55]. Even seemingly offensive studies such as PentestGPT are relevant by analogy because they show the value of decomposing complex tasks into cooperating modules to reduce context loss and improve persistence across long workflows [70].
For critical infrastructure recovery, multi-agent designs offer a practical path to bounded autonomy. The “planner” agent can draft candidate actions, a “safety” agent can reject actions that violate process rules, a “topology” agent can fetch state or dependency data, and a “documentation” agent can produce operator-facing rationale. Yet multi-agent systems do not eliminate the need for grounding and verification. They mainly improve division of cognitive labor. Without strong arbitration and evidence control, they can still amplify error through persuasive but mutually reinforcing mistakes.
5.5. Hybrid Planners with Verification, Simulation, and Digital Twins
The highest-assurance pattern combines LLMs with symbolic constraints, simulators, optimization routines, digital twins, or formal verification. This pattern is especially promising for recovery tasks that touch industrial logic, control sequences, or safety-critical reconfiguration. LLM4PLC is an important example because it pairs LLM generation with compilers, grammar checking, and verification tools to improve correctness of PLC programs [63]. In adjacent work, LLMs have been explored for industrial control and HVAC management [64] and for end-to-end control of industrial automation systems through agentic frameworks [65].
Digital twin research strengthens this pattern by providing a virtual environment in which recovery candidates can be validated before execution [71,72,73]. In a mature architecture, the LLM would not directly approve a restoration step. It would instead propose candidate sequences whose feasibility is tested against rules, simulation models, or twin-based state projections. This is conceptually close to earlier work on response and recovery engines [48], but with the LLM handling knowledge synthesis and explanation while symbolic components enforce hard constraints.
The practical implication is clear: the more consequential the actuation, the more the architecture must shift from pure language generation toward hybrid verification. In other words, the path to trustworthy recovery automation in critical infrastructures is not bigger models alone; it is better coupling between language models, structured infrastructure knowledge, and domain-specific validators.
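The division of labor in this pattern, where the language model proposes and a symbolic component enforces hard constraints, can be sketched as a precondition check over a projected state. This is a deliberately simplified stand-in for the rule engines, simulators, or digital twins discussed above: the `RULES` table, the action names, and `validate_sequence` are hypothetical and do not reproduce any cited system.

```python
# Hypothetical rule table: each action lists the state facts it requires
# and the facts it adds once completed.
RULES = {
    "isolate_segment": {"requires": set(), "adds": {"segment_isolated"}},
    "restore_historian": {"requires": {"segment_isolated"}, "adds": {"historian_online"}},
    "reboot_controller": {
        "requires": {"safety_function_online", "operator_approval"},
        "adds": {"controller_online"},
    },
}


def validate_sequence(initial_state, sequence, rules):
    """Check a candidate action sequence against hard preconditions.

    Returns (True, message) if every step's preconditions hold in the
    projected state, otherwise (False, reason) at the first violation.
    """
    state = set(initial_state)
    for step, action in enumerate(sequence, start=1):
        rule = rules.get(action)
        if rule is None:
            return False, f"step {step}: unknown action '{action}'"
        missing = rule["requires"] - state
        if missing:
            return False, f"step {step}: '{action}' blocked, missing {sorted(missing)}"
        state |= rule["adds"]  # project the effect before checking the next step
    return True, "all steps satisfy preconditions"


# A candidate sequence such as an LLM might draft (illustrative).
candidate = ["isolate_segment", "reboot_controller"]
print(validate_sequence({"safety_function_online"}, candidate, RULES))
```

In this sketch the validator, not the model, decides whether a step may proceed; the rejection reason can be fed back to the planner for a revised candidate or shown to the operator as part of the plan's justification.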
5.6. A Reference Stack for Trustworthy Recovery-Plan Generation
Based on the reviewed literature, this article proposes a conceptual Grounded Recovery Planning Stack (GRPS) for critical infrastructures. The stack is not a product architecture but a synthesis intended to clarify design requirements:
Operational context assembly collects incident evidence, asset state, service priorities, dependencies, and human-role information.
Grounding and dependency fusion merges text retrieval, structured asset data, and knowledge-graph links.
Plan generation drafts one or more candidate recovery sequences, including rollback logic and justification.
Assurance gates apply policy checks, safety constraints, simulation or digital-twin validation, and uncertainty assessment.
Human authorization and orchestration routes the plan to appropriate approvers and preserves an audit trail.
Post-incident learning updates runbooks, retrieval corpora, and evaluation sets from the outcome.
The main benefit of the GRPS view is that it prevents organizations from over-focusing on the language model while under-engineering the surrounding system. In recovery planning, dependable behavior emerges from the full stack: curated operational knowledge, explicit dependency models, constrained tool use, verification, and human approval.
Figure 1 summarizes this layered reference architecture for trustworthy deployment design.
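The layer ordering of the GRPS can also be expressed as a staged pipeline in which a plan reaches human authorization only after passing the assurance gates, and every stage leaves an audit entry. The sketch below is an assumption-laden illustration rather than a reference implementation: the stage functions are trivial stubs, and names such as `run_stack`, `halt_reason`, `rollback_defined`, and `approved_by` are hypothetical.

```python
def assemble_context(ctx):
    ctx["evidence"] = ctx.get("evidence", [])
    return ctx


def fuse_grounding(ctx):
    ctx["grounded"] = True  # stub: retrieval and dependency fusion would go here
    return ctx


def generate_plan(ctx):
    ctx["plan"] = ["isolate segment", "restore historian"]  # stub for LLM drafting
    return ctx


def assurance_gates(ctx):
    if not ctx.get("rollback_defined"):
        ctx["halt_reason"] = "assurance gate: no rollback logic"
    return ctx


def human_authorization(ctx):
    if ctx.get("approved_by"):
        ctx["authorized"] = True
    else:
        ctx["halt_reason"] = "awaiting human authorization"
    return ctx


def post_incident_learning(ctx):
    ctx["lessons_recorded"] = True
    return ctx


GRPS_STAGES = [
    ("context_assembly", assemble_context),
    ("grounding_fusion", fuse_grounding),
    ("plan_generation", generate_plan),
    ("assurance_gates", assurance_gates),
    ("human_authorization", human_authorization),
    ("post_incident_learning", post_incident_learning),
]


def run_stack(stages, context):
    """Run stages in order, recording an audit trail; stop at the first halt."""
    ctx = dict(context)
    audit = []
    for name, stage in stages:
        ctx = stage(ctx)
        audit.append((name, ctx.get("halt_reason")))
        if ctx.get("halt_reason"):
            break
    return ctx, audit


result, audit = run_stack(GRPS_STAGES, {"rollback_defined": False})
print(audit[-1])
```

The point of the sketch is structural: a plan that fails an assurance gate never reaches the authorization stage, and the audit list records where and why processing stopped.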
6. Applications in Critical Infrastructure Sectors
Evidence for fully automated recovery remains limited, but the literature already supports a meaningful map of where LLM-enabled recovery planning is most and least plausible. The strongest near-term use cases are those that combine heterogeneous documentation, rapidly evolving telemetry, and significant coordination overhead, while still leaving final authority with human operators. Sector maturity is therefore best interpreted as evidence maturity for decision support, not as readiness for unattended autonomy.
Table 4 compares sectoral application scenarios and indicates the current maturity of the available evidence.
The maturity labels in Table 4 are qualitative coding judgments rather than quantitative performance scores. They are based on three criteria: (i) the availability of domain-specific LLM, recovery, or closely related CI/OT prototypes; (ii) the level of validation reported in the literature, ranging from conceptual discussion to simulation, testbed validation, or operational deployment; and (iii) the degree of integration between LLM-based planning and operational evidence such as topology, telemetry, runbooks, or validators. In this coding, “Low” denotes mostly conceptual or adjacent evidence with no recovery-specific validation; “Low-to-moderate” denotes mixed evidence or partial validation of relevant components; and “Moderate” denotes testbed, simulation, or domain-specific prototype evidence, while still falling short of validated unattended operational autonomy.
6.1. Energy and Utility Infrastructures
In energy and utility settings, recovery planning must coordinate cyber restoration with physical service priorities and grid or process constraints. Although the literature contains more conceptual discussions than field deployments, the combination of dependency modeling, digital twins, and LLM planning makes the sector a leading candidate for graph-grounded recovery copilots [39,61,71,72]. A practical example would be drafting a service-prioritized restoration sequence that explicitly differentiates between actions requiring field dispatch and those that can be completed from the control room.
6.2. Water and Wastewater Infrastructures
Water and wastewater systems are especially relevant because testbeds such as SWaT have been used to study CPS resilience and graph-based assessment [62]. In such systems, a recovery planner must reason about treatment stages, sensor trustworthiness, chemical safety, and degraded operation. The value of an LLM here is less in direct control and more in integrating technical manuals, process dependencies, and operator procedures into a coherent restoration narrative that can be reviewed and adapted in real time.
6.3. Manufacturing and Process Industries
Manufacturing and process industries currently provide the clearest technical bridge from language models to recovery-oriented actuation support. LLM4PLC shows that language models can be embedded in a verification loop for industrial logic generation [63]. Song et al. and Xia et al. extend the discussion toward broader industrial control using LLMs [64,65]. These studies do not yet solve infrastructure recovery as a whole, but they demonstrate a critical building block: LLM outputs can be made operationally relevant when paired with compilers, validators, and domain-specific execution context.
6.4. Transportation and Logistics
Transportation and logistics systems present a profile in which safe recovery is deeply dependent on interlocking procedures, scheduling dependencies, and cross-organizational communication. In such environments, the immediate value of LLMs is likely to lie in assembling procedure-aware recovery options, clarifying decision prerequisites, and translating technical recovery choices into operator-facing actions rather than in direct actuation.
6.5. Healthcare and Hospital Infrastructures
Healthcare infrastructures are similarly coordination intensive. A recovery copilot could assemble downtime procedures, dependency-aware restoration priorities, clinical escalation rules, and stakeholder communication templates. The main benefit would be speed and coherence in cross-team sensemaking, especially when IT restoration decisions must be sequenced against patient-safety priorities.
6.6. Telecommunications and Cloud-Backed Services
Telecommunications and cloud-backed infrastructures are the most immediately adjacent to today’s enterprise incident-response tooling. LLMs can already assist with SOC and CTI workflows, log understanding, and threat-intelligence reasoning [66,67,68,74,75]. These capabilities are not identical to recovery planning, but they reduce the information bottleneck that often delays restoration after an incident and therefore provide a realistic entry point for recovery-oriented copilots.
7. Limitations, Risks, and Assurance Requirements
The central limitation of LLM-enabled recovery planning is that a convincing plan is not necessarily a correct plan. In high-consequence infrastructures, this gap is decisive. Dependability requires that every recommendation be evaluated not only for linguistic plausibility but also for factual grounding, state consistency, operational feasibility, safety, and accountability.
7.1. Technical Limitations
Hallucination remains the most visible technical risk. Surveys consistently show that LLMs can generate outputs that appear authoritative yet are unsupported or false [
76,
77]. For recovery planning, hallucination can take several domain-specific forms: inventing non-existent assets, misremembering procedural prerequisites, confusing vendor terminology, or extrapolating from outdated context. Detection and mitigation methods such as hallucination checkers and abstention mechanisms are improving [
78,
79], but they should be treated as partial safeguards rather than guarantees.
A second limitation is state mismatch. Even a factually grounded plan may be wrong for the current infrastructure state if retrieval uses stale inventories, outdated topology, or incomplete incident evidence. This is one reason why plain-text RAG alone is insufficient in OT and critical infrastructure settings. Recovery planners need live or near-live synchronization with asset state, change history, and dependency models.
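The state-mismatch concern can be illustrated with a minimal sketch that separates fresh from stale retrieved evidence before it reaches the planner. The `Evidence` type, the source names, and the freshness budgets in `MAX_AGE` are assumptions introduced for illustration only; real budgets would be site-specific and are not taken from the reviewed literature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Evidence:
    """A retrieved item plus the time its source was last synchronized."""
    source: str          # hypothetical source label, e.g. "topology"
    content: str
    synced_at: datetime


# Hypothetical freshness budgets per source type (assumed values).
MAX_AGE = {
    "asset_inventory": timedelta(hours=24),
    "topology": timedelta(hours=6),
    "incident_evidence": timedelta(minutes=15),
}


def split_by_freshness(items, now):
    """Return (fresh, stale) lists; unknown sources are treated as stale."""
    fresh, stale = [], []
    for item in items:
        budget = MAX_AGE.get(item.source)
        if budget is not None and now - item.synced_at <= budget:
            fresh.append(item)
        else:
            stale.append(item)
    return fresh, stale
```

In such a design, stale items would not be silently dropped. They could be surfaced to the operator as a reason to refresh the inventory or to lower confidence in the resulting plan.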
A third limitation is long-horizon planning under uncertainty. Recovery procedures often branch, require conditional rollback, and evolve as new evidence arrives. Current LLMs decompose such tasks more effectively than earlier language models [
17,
19], but they still struggle with persistent world models and calibrated uncertainty. Benchmarks such as TruthfulQA remind us that larger models are not automatically more truthful or reliable [
80].
7.2. Security and Misuse Risks
Because recovery planners will likely be integrated with enterprise search, ticketing systems, documentation stores, and orchestration tools, they inherit a broad attack surface. Prompt injection can manipulate retrieved context or tool use [
81]. Adversarial jailbreaks can subvert aligned behavior [
82]. Training-data extraction and data leakage are also relevant where sensitive operational content is processed by general-purpose models [
83]. In critical infrastructures, these are not abstract concerns: a compromised planning assistant could actively recommend unsafe or strategically harmful recovery actions.
The broader critique that LLMs behave as “stochastic parrots” is especially salient in this domain [
84]. Recovery planning requires more than eloquent pattern completion. It requires grounded and accountable reasoning about specific infrastructures. Likewise, ethical and social risk analyses of language models are directly relevant because infrastructure recovery decisions affect public safety, service equity, and trust in institutions [
85].
7.3. Human and Organizational Limitations
Many recovery failures are organizational before they are algorithmic. Playbook usability research shows that response artifacts often fail because they are difficult to interpret or poorly matched to analyst workflows [
45,
46]. Incident response research also shows that external actors such as insurers and legal advisors can reshape recovery priorities in ways that do not always maximize learning or resilience [
47]. Introducing LLMs into this environment can either reduce friction or amplify ambiguity, depending on how responsibilities and approval paths are designed.
Recent in-the-wild evidence from security operations centers is instructive. Analysts appear to use LLMs primarily as cognitive aids for sensemaking and context building rather than for final high-stakes decisions [
74]. This pattern is likely to persist in critical infrastructure recovery. It suggests that the most realistic adoption model is augmentation with retained human decision authority, not replacement of incident commanders or control-room staff.
7.4. Governance and Assurance Requirements
Governance frameworks such as the NIST AI RMF and the NIST Generative AI Profile provide a useful vocabulary for trustworthy deployment [
86,
87]. For recovery-plan generation, these frameworks imply several concrete requirements: provenance of retrieved evidence, segregation of trusted and untrusted sources, systematic red teaming, uncertainty-aware abstention, strong audit logging, and explicit role-based authorization for any action with operational consequence.
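Three of these requirements (provenance, audit logging, and role-based authorization) can be sketched together in a few lines. The role names, action classes, and step fields below are hypothetical and serve only to show the shape of such a gate; they are not prescribed by the NIST frameworks.

```python
import json
from dataclasses import dataclass, field

# Hypothetical mapping from approver role to permitted action classes.
PERMISSIONS = {
    "analyst": {"read_only"},
    "ot_engineer": {"read_only", "config_change"},
    "incident_commander": {"read_only", "config_change", "service_restart"},
}


@dataclass
class AuditLog:
    """Append-only list of serialized authorization decisions."""
    entries: list = field(default_factory=list)

    def record(self, **event):
        self.entries.append(json.dumps(event, sort_keys=True))


def authorize(step, approver_role, log):
    """Approve a step only if the role covers its action class and the
    step cites at least one piece of evidence; log every decision."""
    allowed = step["action_class"] in PERMISSIONS.get(approver_role, set())
    has_provenance = bool(step.get("evidence_ids"))
    decision = allowed and has_provenance
    log.record(
        step=step["id"],
        role=approver_role,
        allowed=allowed,
        has_provenance=has_provenance,
        decision=decision,
    )
    return decision
```

Denied requests are logged as well as approved ones, so that the audit trail can later show which recommendations were rejected and why.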
Evaluation must also move beyond generic accuracy. Cybersecurity-specific benchmark suites are improving [
75,
88], but they still underrepresent the recovery tasks most relevant to critical infrastructures: degraded-mode restoration, service-priority trade-offs, interdependency reasoning, rollback planning, and sector-specific compliance checks. Until recovery-oriented benchmarks exist, organizations should validate candidate systems in sandboxes, digital twins, or testbeds before considering live deployment.
Recent system-level guidance sharpens these requirements into deployable engineering practices. The NCSC Guidelines for Secure AI System Development organize safeguards across secure design, secure development, secure deployment, and secure operation and maintenance [
89]. In parallel, OWASP’s 2025 guidance on LLM application security and prompt injection shows that recovery assistants must be engineered as secure applications with explicit defenses against indirect prompt injection, unsafe tool mediation, and overreliance on generated content [
90,
91].
Threat-informed assurance should also be grounded in attacker knowledge bases and lifecycle controls. MITRE ATLAS provides a living knowledge base of tactics and techniques against AI-enabled systems that can support threat modeling and red teaming of recovery-planning stacks [
92]. NIST’s adversarial machine learning taxonomy and SP 800-218A extend this perspective by providing a common vocabulary for attacks and mitigations and by augmenting secure software development practices for generative AI and dual-use foundation models [
93,
94]. For infrastructures that move closer to operational deployment, joint guidance on secure AI deployment and the newer OT-specific guidance on secure AI integration are especially relevant because they translate AI risk management into operational constraints, operator accountability, and defensive deployment patterns for critical environments [
95,
96].
Table 5 consolidates the main limitations and maps them to corresponding assurance mechanisms suitable for critical infrastructures.
8. Research Roadmap and Future Directions
A realistic research roadmap should recognize that the field is not moving from “no AI” to “full autonomy” in one step. The more plausible trajectory is staged: first grounded copilots, then validated planners, and only later bounded orchestration in well-instrumented environments. Each stage requires advances in data, architecture, assurance, and human factors.
Table 6 summarizes the staged research roadmap proposed in this review.
The horizons in Table 6 are indicative research and deployment-readiness horizons, not deterministic forecasts. They are derived from the synthesis of current evidence maturity, the engineering effort required for trustworthy grounding and validation, and the slower certification and governance cycles typical of critical infrastructures. The staged ordering therefore reflects increasing assurance burden: read-only copilots can be piloted with curated corpora and human review, dependency-aware planners require live-state and graph integration, bounded orchestration requires validated execution limits and approval policies, and verified ecosystems require interoperable artifacts and sector-level assurance cases.
Figure 2 complements Table 6 by visualizing the staged transition from grounded copilots toward bounded orchestration and by highlighting the cross-cutting requirements that must mature in parallel.
The first stage should focus on machine-readable operational knowledge. Many organizations still store the information needed for recovery in fragmented PDFs, tickets, spreadsheets, and tribal knowledge. Without disciplined curation, LLMs cannot reliably support recovery planning. Research on playbooks, CTI corpora, and domain assistants already suggests that knowledge engineering is as decisive as model choice [
44,
60,
66].
The second stage should emphasize state-aware grounding and benchmark development. Recovery planning is not a generic security benchmark problem. It requires datasets and evaluation tasks that encode service dependencies, operational constraints, and evolving incident state. Existing benchmarks for cybersecurity LLMs are valuable foundations [
75,
88], but they do not yet capture infrastructure restoration under partial observability. Recovery-specific benchmarks should include tasks such as selecting safe restoration order, identifying missing prerequisites, drafting rollback conditions, and explaining trade-offs among availability, safety, and compliance.
The third stage should center on verification and uncertainty. Hybrid architectures should become the default for any scenario that approaches actuation. The model may generate hypotheses or candidate plans, but symbolic checks, twin-based simulation, or domain verifiers should determine whether these candidates are admissible [
62,
63,
71]. In parallel, abstention and uncertainty signaling must be treated as first-class capabilities rather than optional UX features [
79].
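The generate-then-verify pattern with abstention can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any cited system: the `Candidate` type, the ordering rules in `must_precede`, and the `min_confidence` threshold are all hypothetical, and a real verifier would be a domain checker or twin-based simulation rather than a simple ordering test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    steps: tuple        # ordered action names proposed by the model
    confidence: float   # assumed externally estimated score in [0, 1]


def violates_interlock(steps, must_precede):
    """True if an action appears without its required predecessor
    occurring earlier in the plan."""
    pos = {name: i for i, name in enumerate(steps)}
    for before, after in must_precede:
        if after in pos and (before not in pos or pos[before] > pos[after]):
            return True
    return False


def select_plan(candidates, must_precede, min_confidence=0.7):
    """Return the best admissible candidate, or None to abstain."""
    admissible = [
        c for c in candidates
        if not violates_interlock(c.steps, must_precede)
        and c.confidence >= min_confidence
    ]
    if not admissible:
        return None  # abstain and escalate to a human operator
    return max(admissible, key=lambda c: c.confidence)
```

The point of the design is that the symbolic check, not the model's own confidence, decides admissibility: a high-confidence plan that breaks an ordering rule is rejected outright.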
A fourth workstream concerns human factors and organizational integration. Recovery planners will be adopted only if they fit existing incident command structures and improve, rather than degrade, operator workload and accountability. This requires longitudinal field studies similar to emerging SOC collaboration research [
74] but tailored to OT, utilities, and other critical sectors. Questions of trust calibration, review burden, explanation design, and post-incident learning deserve as much attention as model performance.
Finally, the roadmap should include deployment engineering for constrained environments. Critical infrastructures may require on-premises inference, low-latency models, offline fallback, and strong data residency controls. The lightweight orientation in recent incident-response planning work [
52] is therefore strategically important. Smaller, well-grounded models may be preferable to frontier models when confidentiality, cost, or air-gapped deployment dominates.
Very recent research also supports the staged roadmap proposed here. Hammar et al. propose an iterative verification-and-abstention loop in which candidate actions are checked against lookahead consistency and refined through external feedback [
97]. Gao et al. push the agenda toward end-to-end incident-response agents that integrate perception, reasoning, planning, and action in a single lightweight model [
98]. Related work on large-language-model agents for automated cyber tactics planning shows continued progress in multi-step cyber decision support, but also reinforces the importance of simulation-backed evaluation [
99]. These trajectories align with national-level threat assessments, which argue that AI is likely to amplify existing cyber intrusion workflows in the near term rather than eliminate operational constraints [
100].
Cross-Cutting Future Directions
Several future directions appear especially promising and, taken together, define a research agenda that is more specific than the generic “apply LLMs to cybersecurity” narrative.
First, recovery-specific multimodal models deserve attention. Critical infrastructure recovery depends not only on text but also on diagrams, P&IDs, HMI screenshots, alarm tables, topology views, and maintenance records. Future recovery planners should integrate multimodal evidence rather than treating all context as plain text.
Second, graph- and twin-coupled planning is likely to become a central architecture. Plain RAG can retrieve relevant passages, but recovery planning needs explicit reasoning over dependencies, prerequisites, and service impacts. Knowledge graphs and digital twins provide the substrate for this transition [
62,
69,
71,
72].
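As a minimal illustration of explicit reasoning over dependencies, a restoration order can be derived from a dependency graph by topological sorting. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the service names and edges in `DEPENDS_ON` are hypothetical and do not describe any real infrastructure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical graph: each service maps to the services it requires.
DEPENDS_ON = {
    "scada_hmi": {"historian", "control_network"},
    "historian": {"control_network", "domain_auth"},
    "control_network": {"power_supply"},
    "domain_auth": {"power_supply"},
    "power_supply": set(),
}


def restoration_order(depends_on):
    """Return an order in which every service follows its prerequisites."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # args[1] holds the detected cycle; such cases need manual review.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


def respects_dependencies(order, depends_on):
    """Check a proposed order (e.g. one drafted by an LLM) against the graph."""
    position = {name: i for i, name in enumerate(order)}
    return all(
        position[dep] < position[svc]
        for svc, deps in depends_on.items()
        for dep in deps
    )
```

The second function reflects the hybrid pattern discussed in this review: a language model may draft the sequence, while the graph serves as the authority that accepts or rejects it.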
Third, verifiable and abstaining planners should replace “always-answering” assistants. A trustworthy planner must know when available evidence is inadequate and must surface uncertainty in an operator-usable form [
78,
79]. This principle is especially important when incident evidence is conflicting, incomplete, or adversarially manipulated.
Fourth, small and private deployment patterns are likely to matter more than raw benchmark leadership in many infrastructures. On-premises or edge-deployable models, combined with retrieval from locally curated knowledge stores, may offer a better trade-off among confidentiality, availability, and controllability than larger external APIs [
52,
87].
Fifth, interoperable recovery artifacts should be standardized. Machine-readable playbooks such as CACAO are an important start [
50], but critical infrastructure recovery will also need structured representations for service dependencies, operational constraints, fallback modes, and approval logic. Standardization in this area would make benchmarking, auditing, and cross-tool integration substantially easier.
Sixth, evaluation must become sector-aware. The field needs recovery benchmarks that are aligned with energy, water, manufacturing, healthcare, transportation, and telecom realities rather than generic cyber tasks. Sector-specific community profiles, testbeds, and benchmark suites would provide the missing bridge between academic prototypes and operational adoption.
Seventh, compliance-aware and privacy-aware recovery planning will become a deployment prerequisite. Future systems should generate not only restoration steps but also the supporting artifacts needed for approval, notification, evidence retention, and post-incident accountability. The EU Artificial Intelligence Act and the EDPB report on privacy risks in LLM systems, together with OT-specific joint guidance on secure AI integration, imply that deployment viability will increasingly depend on governance-by-design, data minimization, and auditable human oversight rather than technical accuracy alone [
96,
101,
102].
Eighth, memory-augmented and self-evolving recovery agents should be studied cautiously. Persistent memory can help maintain incident context across handovers, preserve lessons learned from prior exercises, and support longer recovery timelines. Self-evolving mechanisms, skill libraries, and reflective improvement loops may also improve repeated planning tasks [
31,
32,
34,
35]. In critical infrastructures, however, such capabilities must be bounded by change-control rules, memory provenance, decay or revocation policies, and safety review because an agent that learns from incorrect or adversarial experience could institutionalize unsafe recovery behavior.
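The bounding controls named above (provenance, decay, revocation, and safety review) can be sketched as filters on an agent's memory store. The `MemoryEntry` fields and the one-year `max_age` default are assumptions made for illustration, not a design drawn from the cited work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryEntry:
    lesson: str
    source: str            # provenance, e.g. an exercise or incident ID
    reviewed: bool         # passed safety / change-control review
    created_at: datetime
    revoked: bool = False


def usable_memories(entries, now, max_age=timedelta(days=365)):
    """Keep only reviewed, unrevoked entries that have not decayed."""
    return [
        e for e in entries
        if e.reviewed and not e.revoked and now - e.created_at <= max_age
    ]


def revoke_by_source(entries, source):
    """Revoke every entry from a source later found to be unreliable;
    return the number of entries newly revoked."""
    count = 0
    for e in entries:
        if e.source == source and not e.revoked:
            e.revoked = True
            count += 1
    return count
```

Revocation by source matters because a single compromised exercise or manipulated incident record could otherwise keep influencing later plans.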
Ninth, the impact of frontier model releases should be evaluated through recovery-specific tasks rather than generic leaderboard scores. GPT-5.5 and Claude Mythos Preview illustrate rapid progress in coding, tool use, cyber tasks, and multi-step agentic workflows [
27,
28,
29,
30]. Nevertheless, stronger model capability does not remove the need for state-aware grounding, trusted evidence, simulation-backed validation, and human authorization. For this reason, the roadmap in this article is intentionally architecture- and assurance-centered rather than tied to a single model generation.
9. Discussion
The literature reviewed here suggests a consistent conclusion: LLMs are already useful for recovery-related cognition, but not yet sufficient for recovery autonomy in critical infrastructures. Their strongest current contributions lie in information synthesis, plan drafting, explanation, and coordination support. Their weakest aspects appear when plans depend on hidden state, exact infrastructure context, or hard safety constraints that cannot be inferred reliably from text alone.
This observation has two implications. The first is theoretical. Recovery-plan generation should be studied as a layered planning-and-assurance problem rather than as a monolithic language-model task. The main scientific challenge is not just better prompting; it is the integration of language reasoning with state estimation, dependency modeling, validation, and human governance. The second implication is practical. Organizations can obtain value now from bounded, evidence-grounded copilots without waiting for full autonomy. Conversely, attempts to skip directly to autonomous recovery orchestration are likely to overestimate current model dependability.
A further insight from the literature is that architecture maturity and data maturity are inseparable. Many barriers blamed on “LLM unreliability” are actually consequences of incomplete runbooks, stale inventories, ambiguous naming, or undocumented dependencies. In this sense, LLM adoption may expose long-standing resilience deficits that predate AI. This exposure is itself useful: recovery-planning research can catalyze better knowledge engineering, process documentation, and dependency visibility even before high-assurance automation becomes feasible.
Practical Implications for Operators, Vendors, and Regulators
One lesson from comparison with adjacent architecture papers is that conceptual clarity alone is not enough; deployability must be made visible. For operators, this means that the first successful deployments should probably be read-only or recommendation-only systems integrated with existing approval chains. For vendors, the priority is not merely better prompting but better interfaces to inventories, historian data, ticketing systems, policy stores, and simulation environments. For regulators and auditors, the central question is whether a recovery recommendation can be traced back to authoritative evidence, reviewed by accountable humans, and reproduced after the fact. In Future Internet settings, deployable solutions must also handle edge-cloud partitioning, controlled remote connectivity, and auditable exchange of recovery evidence across organizational boundaries.
Table 7 translates the discussion into a practical deployment checklist for pilot recovery copilots.
The strategic implication is straightforward: organizations should treat LLM-enabled recovery planning as an assurance engineering program rather than as a stand-alone AI feature. The strongest near-term pilots will therefore look deliberately conservative—narrowly scoped, heavily grounded, evidence-citing, and embedded in existing recovery governance.
This review has limitations. It is a structured critical synthesis rather than an exhaustive bibliometric census. The evidence base remains uneven: manufacturing and cyber-defense simulations are better represented than real-world utility, transport, or healthcare recovery deployments, and many agentic planning approaches are still evaluated primarily in simulations or early-stage studies. Accordingly, the roadmap should be interpreted as a research agenda and deployment-readiness framework rather than as a deterministic forecast of adoption.
Literature published after much of the core survey corpus was assembled confirms this evidence gap without closing it. Broad cyber-resilience reviews continue to aggregate LLM applications across detection, analysis, and response [
103]. Disaster-management surveys show a parallel movement toward phase-aware AI support across preparedness, response, and recovery [
104]. A 2026 Applied Sciences article addresses LLMs for IIoT protection and incident triage in critical infrastructure [
105]. At the more operational end of the spectrum, agentic industrial automation work couples LLM planning with simulation-backed validation for fault recovery [
106], while AIR reframes incident response as a first-class safety mechanism for autonomous agent systems [
107]. Collectively, these studies confirm the field’s momentum, but they still do not provide an architecture-centric, assurance-driven synthesis focused specifically on recovery-plan generation under OT safety, interdependency, and authorization constraints.
10. Conclusions
LLM-enabled recovery-plan generation for critical infrastructures is an emerging but strategically important research area. The field should not be reduced to generic chatbots or broad claims about “AI for cybersecurity.” Recovery planning is a distinct high-consequence task that requires grounded reasoning over infrastructure state, dependencies, safety rules, and organizational responsibilities. The reviewed literature shows that meaningful progress is already being made through RAG-grounded copilots, multi-agent decompositions, knowledge-graph integration, and hybrid architectures that combine language models with verification or simulation.
At the same time, the limitations are substantial. Hallucination, stale context, dependency blindness, prompt injection, data leakage, governance gaps, and human-factor risks all constrain safe deployment. Accordingly, the most defensible near-term trajectory is the development of grounded, auditable, human-in-the-loop recovery copilots, followed by validated dependency-aware planners in sector testbeds and only then bounded orchestration in carefully controlled environments.
The main contribution of this article is to reposition recovery-plan generation as a central design problem for trustworthy AI in critical infrastructures. By synthesizing the literature around architecture patterns, sectoral applications, limitations, assurance mechanisms, practical deployment requirements, and a staged roadmap, the review provides a foundation for future work that is technically rigorous, operationally relevant, and attentive to the realities of networked, cloud-edge, and safety-critical infrastructure resilience.