1. Introduction
Modern societies critically rely on the supply of a variety of goods and services, such as energy and government. Most societies agree on a similar set of goods and services that are regarded as critical for societal and economic well-being [1]. The infrastructures responsible for providing these goods and services are therefore identified as Critical Infrastructures (CI). Such infrastructures are found in different sectors of modern societies, e.g., energy or water supply, governmental institutions, or security and defense authorities [2]. Consequently, CI cannot be assigned to a specific type of system or organization. In general, the various Critical Infrastructure Systems and Organizations (CISOs) serve to securely provide the critical goods and services mentioned above [3,4].
Given the critical nature of these goods and services, it is of utmost importance to provide solutions that enable such CISOs to deal with the corresponding risks and vulnerabilities. The International Organization for Standardization (ISO) defines risk as the effect of uncertainty on objectives [5]. The Society for Risk Analysis (SRA) has developed a more comprehensive definition to give more attention to the diversity of uncertainties and objectives [6,7]. Risk concepts are now addressed in all CI sectors, for example in the commercial sector [8,9], the chemical sector [10,11], public health [12,13], transportation [14,15,16,17], information technology [18,19], and the water systems sector [20,21]. These examples also illustrate the great variety of approaches developed to assess and manage risk. It is therefore not surprising that, in the last decade, a risk science has emerged and become established that deals with the development of generic concepts, theories, principles, methods, frameworks, procedures, and models to enhance the understanding, assessment, modelling, control, and management of risks [7,22,23,24]. This development has been further promoted by the ongoing digitization of organizations and the increasing interconnectedness of technical and socio-technical systems. Increasing complexity and growing interdependencies lead, among other things, to uncertainties in the modeling, understanding, and assessment of risks [7,11,12,23,24,25,26]. It is now undisputed that uncertainties reflect the knowledge, or lack of knowledge, available in a risk analysis. The amount of uncertainty consequently determines the informative value of the risk assessment result that should be communicated to decision-makers. Uncertainty assessment and its consideration in risk-related decision-making are current fields of research and development that have become even more important due to the COVID-19 pandemic [23,24,26,27,28].
In recent decades, a complementary line of development has emerged that results from the rising interest in the concept of resilience. It focuses on practicable approaches that enable the effective handling of surprises originating from uncertainties induced by a lack of knowledge about interdependencies, changes, and prospects [29,30,31]. Analogous to risk science, research activities were either dedicated to fundamental questions, such as the definition, characteristics, concepts, and cornerstones of resilience [30,31,32,33,34], or aimed at providing methods and frameworks for analyzing or managing resilience [33,35,36,37,38,39,40,41,42]. The wide variety of fields of application and perspectives in which resilience in socio-technical systems is considered has led to different definitions. For example, the European Commission defines resilience as the “ability of an individual, a household, a community, a country or a region to withstand, to adapt, and to quickly recover from stresses and shocks” [43], reflecting a rather societal view. In contrast, the International Maritime Organization (IMO) took a more technological view on resilience and defined it as the “ability of a system to detect and compensate external and internal disturbances, malfunction and breakdowns in parts of the system”, preferably without loss of functionalities and any degradation of their performance [44]. Woods proposed that the label “resilience” should be reserved for the ability of a system to deal with disturbances and interruptions outside the range of the nominal system capabilities and nominal use conditions [45]. Hollnagel defined resilience as “the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances so that it can sustain required operations under both expected and unexpected conditions” [42]. In parallel, academic discussions started on whether resilience analysis and management should be seen as an extension of risk analysis and management or as an integral part of it [
29,46,47]. However, a unifying goal of risk and resilience science is to ensure the operationality and reliability of vital Critical Infrastructure Systems and Organizations (CISO), regardless of whether the focus is more on preventing or on averting adverse events and consequences. To achieve this goal effectively and efficiently, the close relationship between risk and resilience strategies in their many forms must be tapped, coordinated, and exploited [29]. For this, it is also necessary to consider all aspects of the CISO whose resilience is to be consolidated or increased. Häring et al. provided a framework and principles for generic resilience management that was derived from the standardized risk management process [38]. Within this framework, they identified nine iterative steps that enable resilience quantification and development. Additionally, the authors provided a comparison of several resilience assessment methods. They place strong emphasis on analyses and decisions during system design, while process steps during operation are unified in a single process responsible for monitoring and review. In addition, a variety of resilience frameworks have been developed in recent years that focus on either the analysis [33,35,37,41,48,49,50], assessment [51,52,53], or management [36,40] of resilience, or a combination thereof [39]. The solution approaches are often discussed rather theoretically at the algorithm and method level [33,52] or in the application context [35,36,37,39,42,48,49,50,51,53,54,55,56].
Driven by Hollnagel’s ability-oriented definition of the resilience concept, this paper introduces a framework for the operational management of the resilience capabilities of Critical Infrastructure Systems and Organizations (CISO). To the best of our knowledge, this is the first framework for operational resilience management that integrates the concept of digital twins and thus allows for comprehensive and timely preparation for, and response to, emerging threats.
The remainder of this work is organized as follows. Section 2 briefly presents background information on concepts and cornerstones related to resilience and specifies the requirements for operational resilience management. Section 3 introduces the proposed operational resilience management framework and its main components. Section 4 presents and discusses in detail the tasks within the proposed operational resilience management process. Section 5 provides application examples to illustrate the advantages of this approach, and Section 6 concludes this work.
3. Resilience Management Framework
This section introduces the framework for Operational Resilience Management (ORM) and presents its two basic components—a Data and Information Base (DIB) and a resilience management process.
3.1. Thematic Classification
The aim of resilience management is to coordinate the recognition, monitoring, anticipation, and learning as core competencies of resilient Critical Infrastructure Systems and Organizations (CISO) [
31]. Although Woods suggested that the label “resilience” should refer more to the management of disruptions and interruptions outside the nominal operational area [
34], it is hardly useful to consider nominal and abnormal conditions separately. Rather, it is important that the CISO is aware of its own competencies and capabilities in terms of detection, monitoring, and anticipation. The CISO must also be able to identify both systemic and environmental situations that deviate from the valid specification of nominal conditions. For each such deviation, it is necessary to assess whether the anomaly may cause additional risks. If so, appropriate measures must be identified to reduce the resulting risks and mitigate their negative consequences. Therefore, the framework proposed and discussed in this paper should facilitate operation under rated conditions as well as the detection of and adaptation to abnormal conditions.
Learning is represented in the framework rather indirectly. The design and operation of the CISO with all its functions, methods, and measures is only possible with qualified personnel whose qualification is the result of an ongoing learning process. Learning is also part of the recognition of anomalies, the evaluation of their criticality as well as identification and evaluation of potential mitigation measures, and thus, serves to build competencies. The effectiveness of what is learned is determined to a large extent by how these competencies can be maintained, adjusted, and retrieved. For this reason, it is necessary that the framework is also designed to manage and develop competencies efficiently.
3.2. Data and Information Base
The presented requirements on the ORM framework strongly motivate the availability of a Data and Information Base (DIB) that contains knowledge of the CISO’s ACTUAL and TARGET behavior, environment, and interdependencies. The TARGET behavior is defined by the performance requirements for the CISO services and the target conditions and influences assumed in the design phase. In contrast, the ACTUAL behavior is described using appropriate models parameterized with measured values.
Therefore, the CISO should be designed and operated based on state-of-the-art competence models, which cover the CISO, environmental conditions, and practicable resilience capabilities. Furthermore, the proposed framework requires that the model of the CISO is a digital representation of reality in terms of components, functions, and processes. These requirements match the digital twin (DT) concept presented in Section 2.3. Additionally, processable environmental models (EM) are needed to describe relevant environmental conditions in all their diversity and variability that can significantly affect the functionality and performance of the CISO.
In general, DT and EM are composed of descriptive and predictive models. Descriptive models ensure that the diversity of components and aspects is sufficiently depicted. These models describe the system’s behavior based on historical as well as actual data and information [
76]. Predictive models are able to explain changes based on causes, dependencies, and interactions. Hence, these models enable the investigation of “what will happen” and “what should be done to make or to avoid it happening” [
77]. However, predictive models are consequently significantly more complex than descriptive models. Descriptive as well as predictive models should reflect the current state and behavior of the CISO and environment, including intrinsic interactions, through appropriate model parameterization and real-time monitoring of the model parameters (MPRT). Furthermore, the nominal behavior should be characterized by nominal values for the model parameters (MPN).
The described features of the DT concept are useful for the implementation and realization of advanced resilience capabilities. In the simplest case, monitoring MPRT and comparing MPRT with MPN should enable the recognition of regular as well as irregular disruptions, disturbances, and changes. A more challenging approach analyses the changes within the CISO and environment in order to anticipate emerging risks in time. This requires constant monitoring and assessment of the observed and expected behavior of the CISO as well as its response to environmental changes, which provides the necessary lead time to decide on the appropriate use of adaptive measures and to implement them. The anticipation skills can be further improved if DT and EM are connected to a suitable simulation environment. This enables the combined consideration of systemic/organizational and environmental aspects in their complexity and in relation to the diversity of potential scenario developments. Such scenario developments are also an appropriate means to perform a predictive assessment of the effectiveness of potential decisions and adaptive measures.
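The simplest case described above, the comparison of MPRT with MPN, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the framework: nominal values are assumed to be given as intervals, and the parameter names as well as the `NominalRange` and `find_deviations` helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NominalRange:
    """Assumed form of a nominal value (MPN): an allowed interval."""
    low: float
    high: float


def find_deviations(mp_rt, mp_n):
    """Return {parameter: (value, nominal_range)} for every deviation.

    A parameter deviates if its real-time value lies outside the nominal
    interval or if no real-time value is available (value is None).
    """
    deviations = {}
    for name, nominal in mp_n.items():
        value = mp_rt.get(name)
        if value is None or not (nominal.low <= value <= nominal.high):
            deviations[name] = (value, nominal)
    return deviations


# Hypothetical nominal (MPN) and real-time (MPRT) parameter sets
mp_n = {
    "pump_pressure_bar": NominalRange(2.0, 6.0),
    "tank_level_m": NominalRange(1.5, 4.0),
}
mp_rt = {"pump_pressure_bar": 6.8, "tank_level_m": 3.1}

deviations = find_deviations(mp_rt, mp_n)
for name, (value, nominal) in deviations.items():
    print(f"{name}: {value} outside [{nominal.low}, {nominal.high}]")
```

Treating a missing real-time value as a deviation is a design choice of this sketch; a real deployment might instead route such cases to the sensor system checks.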
Core elements of the DT and the EM are modelling methods, ideally provided as executable software, whose corresponding properties are described in the model documentation (D). Furthermore, both models are parametrized by the model parameters (P). The model data of these parameters can be real-time data (RT) or nominal data (N). A DT or EM that is parametrized with RT reflects the current state of the CISO or environment. In contrast, a parametrization with N represents the intended or typical behavior. The features and model data of the DT and EM are stored in a respective data and information base (see
Figure 1a,b), which enables further analysis and processing.
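One possible way to hold the documentation (D), parameters (P), nominal data (N), and real-time data (RT) of a DT or EM in a data and information base is sketched below. The class layout, field names, and example values are assumptions made for illustration only; the paper does not prescribe a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    """One DT or EM entry in the DIB (hypothetical layout)."""
    name: str
    documentation: str                              # D: model properties
    parameters: list                                # P: parameter names
    nominal: dict = field(default_factory=dict)     # N: nominal data
    real_time: dict = field(default_factory=dict)   # RT: measured data

    def parametrization(self, source):
        """Return values for all parameters taken from 'RT' or 'N'."""
        if source not in ("RT", "N"):
            raise ValueError("source must be 'RT' or 'N'")
        data = self.real_time if source == "RT" else self.nominal
        missing = [p for p in self.parameters if p not in data]
        if missing:
            raise KeyError(f"no {source} data for parameters: {missing}")
        return {p: data[p] for p in self.parameters}


twin = ModelEntry(
    name="pumping_station_dt",
    documentation="Hydraulic model of a pumping station (illustrative).",
    parameters=["pump_pressure_bar", "flow_m3_per_h"],
    nominal={"pump_pressure_bar": 4.0, "flow_m3_per_h": 120.0},
)
# New measurements are written to RT, N stays untouched.
twin.real_time.update({"pump_pressure_bar": 4.3, "flow_m3_per_h": 112.5})

intended = twin.parametrization("N")   # intended or typical behavior
current = twin.parametrization("RT")   # current state
```

Keeping N and RT side by side for the same parameter list makes it straightforward to run the same model with either parametrization.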
The real-time data (RT) are gathered by the sensor systems (SeS), which are either part of the actual CISO or are operated by external service providers (Figure 1c). In this paper, “sensor system” is used as a synonym for any information source that is needed to collect information about the situation-related system status or environmental conditions. The specified sensor system ultimately determines which information is available for the monitoring, assessment, controlling, and decision-making processes within the system or organization. The SeS must be able to provide the required data in the desired quality and frequency in order to ensure that later analysis and processing are based on the actual state of the CISO. An SeS is characterized by its parameters (P) and the corresponding documentation. The parameters can be of different types, e.g., configuration data, control data, or performance data of the sensor system, and enable the evaluation of the usability of the sensor system. Similar to the DT and EM, these comprise real-time data (RT) or nominal data (N). Please note that the values actually measured and the information obtained by the SeS are not stored together with these parameters but as the RT of the DT or EM.
Morphological analysis [
78] is a promising approach for scenario modelling, e.g., as applied in [
79]. The resulting scenario space contains all scenarios and may be described by the available parameters (P) of DT and EM and their feasible characterizations. Analogously to a morphological space, P and their characteristics are brought into relation to each other. Thus, different combinations within the space enable the derivation of scenarios. Consequently, a scenario can be described by certain parameters (P) that should be documented (D). From these parameters, the so-called Key Impact Factors (KIF) can be derived, which enable the sufficient characterization of a scenario. That is, each scenario is described by a certain setup of KIF. Again, the setups can be based on real-time data (RT), nominal data (N), or already investigated scenarios (A). The corresponding data and information base (DIB) is depicted in Figure 1d.
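The idea of deriving scenarios as combinations within a morphological space can be sketched as follows. The KIFs, their characterizations, and the excluded pair are purely hypothetical; the exclusion step is an assumption of this sketch that mimics ruling out combinations judged inconsistent.

```python
from itertools import product

# Hypothetical KIFs and their feasible characterizations
kif_values = {
    "water_level": ["nominal", "high", "flood"],
    "power_supply": ["grid", "backup", "outage"],
    "staffing": ["full", "reduced"],
}

# Pairs of (KIF, characterization) assumed not to occur together
excluded_pairs = {
    (("power_supply", "outage"), ("staffing", "reduced")),
}


def scenario_space(kif_values, excluded_pairs=frozenset()):
    """Enumerate all KIF setups, skipping those with an excluded pair."""
    names = list(kif_values)
    scenarios = []
    for combination in product(*(kif_values[n] for n in names)):
        scenario = dict(zip(names, combination))
        items = set(scenario.items())
        if any(a in items and b in items for a, b in excluded_pairs):
            continue
        scenarios.append(scenario)
    return scenarios


full = scenario_space(kif_values)
filtered = scenario_space(kif_values, excluded_pairs)
print(len(full), len(filtered))
```

The size of the full space is the product of the numbers of characterizations per KIF, which is why the selection of a small set of key factors matters in practice.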
3.3. Resilience Management Process
The purpose of this section is to provide an overview of the resilience management process at a somewhat abstract level to illustrate the main tasks and their relations (
Figure 2). The subsequent
Section 4 details the main tasks and describes the required functions as well as inherent interactions.
The basic structure of the process follows classical risk management [
5] and the proposal of Häring et al. [
38]. However, the proposed framework puts a strong emphasis on the actual operability of resilience management processes and thus has to be considered as an additional contributor to the resilience of the CISO. As a consequence, the framework focuses on the implementation of the required process and its tasks into a resilience-enhancing operating procedure.
In this context, it is notable that the system boundaries of a CISO are primarily determined by the components and functions required to fulfill the core task of the CISO, the provision of one or more services. Events that lead to disruptions of these services should be detected predictively or reactively by the framework in order to be able to initiate appropriate recovery measures. The execution of these recovery actions only partially takes place within the system boundaries of CISO and is rather to be considered as an additional service that requires further resources and specific expertise. The proposed ORM framework can provide support for such a service but cannot replace it. The case discussed in
Section 5 illustrates the applicability of the framework to CISO operation as well as recovery.
The task “Context specification” provides the evaluation of the framework employed for the resilience management of the CISO under consideration of new or modified challenges on the CISO (blue circle in
Figure 2). It is executed whenever changes occur in the CISO, in the environment, or in the requirements of the CISO. Changes can be intentional, e.g., modernization, automation, or new standards and regulations, or unexpected, e.g., increased component wear, limited supplies, or economic embargoes. The implementation of such changes is executed in the task “Adjust DIB for resilience assessment”, which administers the adaptation of the employed models (DT, EM), the sensor system, and the applicable scenario spaces (orange circle in
Figure 2). This task, which is usually implemented with the help of external service providers, can be initiated by all other tasks with the exception of “Evaluation of Risk Mitigation Measures (RMM)” and “Anticipation of risk developments”.
“Data acquisition and management” and “Situation analysis” are recurring tasks that relate to the operation of the monitoring capability of the CISO (green circle in Figure 2). Both tasks are (re)started by the “Context specification”. “Data acquisition and management” continuously provides real-time data from the sensor system to the data and information base. In contrast, “Situation analysis” continuously monitors and analyses situational changes. It is a central element of resilience management as it enables the identification of critical situations and developments. Its main purpose is the detection and description of anomalies and the identification of how an anomaly relates to the scenario space. The results of this task are then investigated by the following resilience assessment tasks.
Similar to the generic resilience management process proposed by Häring et al. [
38], the resilience assessment is carried out with the help of the three consecutively executed tasks (beige circles in
Figure 2). However, in the proposed scheme, the emphasis of the tasks is again on the actual operability of resilience management. The task “Anticipation of risk developments” focuses on the anticipation of potential risk developments with the help of scenario analysis. A possible outcome of this analysis might be that the determined risk is tolerable, and therefore the risk assessment is terminated. If this is not the case, the task “Management of Risk Mitigation Measures (RMM)” is initiated, which identifies possible approaches for risk mitigation. To this end, the identified RMM are implemented in a simulation setup in order to determine corresponding performance and risk indicators [80]. This task can be interrupted when modifications of the data and information base are required, which is detailed in Section 4.5. The subsequent task “Evaluation of Risk Mitigation Measures (RMM)” evaluates the proposed RMM and supports decision-making considering efficiency, uncertainties, and risk–benefit ratios. The obtained results are then handed as decision support to the operator or other stakeholders. Proposed and chosen risk mitigation approaches should be stored to enable a posterior analysis of their effectiveness in practice and of the implementation quality.
The final task, “Extend/modify the CISO with the chosen Risk Mitigation Measures (RMM)”, relates to the actual implementation of the measures (purple circle in
Figure 2). It is important to note that any consequences of these measures must be integrated into the data and information base of the CISO via the task “Adjust DIB for resilience assessment”.
Nominal operation of the CISO may be assumed if “Data acquisition and management” and “Situation analysis” are performed as specified, and the analysis results prove the overall compliance with the specifications. This can be understood as the first stage of resilience management, which decides on the basis of the current situation whether the continued operation of the CISO in its previous form is reasonable and justifiable.
4. Processes of Resilience Management
This section details and discusses the tasks of the proposed resilience management process presented in
Section 3.3.
4.1. Context Specification
The purpose of the resilience management in CISO is to appropriately control and manage the CISO’s resilience capabilities in order to identify emerging known as well as unknown threats and to initiate appropriate countermeasures. A special challenge is the handling of emerging unknown threats as well as changes in the CISO or its purpose, which often requires an adaptation or extension of the resilience capabilities of the CISO. In such cases, it is necessary to verify whether the DIB still fulfills the requirements resulting from the new situation. For this purpose, the context of resilience management has to be (re)specified by first identifying and formulating the specific objectives and problems of the intended resilience management process (see
Figure 3). Next, it must be evaluated whether the DIB corresponds to the specified requirements. This evaluation should be undertaken considering the scenario space, the modelling (DT and EM), as well as the capabilities of the sensor system (see
Figure 4). If any of these components does not fulfil the requirements, a corresponding revision is requested, which might lead to the adjustment of the DIB. The realization of these adjustments is not further discussed in this paper, as they are considered an external service activity (see purple activity in
Figure 2).
4.1.1. Scenario Space Evaluation
Figure 4 details the subprocess for evaluating the usability of the available scenario space based on the concept of general morphological analysis [
78,
81]. A scenario space aggregates impact factors of threat scenarios in a multidimensional space, using a set of superior key factors for a qualitative and quantitative scenario description [
82]. Key factors can comprise factors of the environment as well as factors within the system itself and can thus be of a general as well as a system-specific nature. Therefore, existing scenario spaces reflect the already accumulated and usable competencies about the considered CISO as fields to be formed, about practicable resilience capabilities and their coordination, as well as about already known danger situations as impact factors and possibilities to deal with them. A specific scenario space shall be able to describe all conceivable threats and resulting scenarios, including those characterized as having high impact and low probability of occurrence (HILP) in relation to a considered CISO.
Initially, it must be examined whether one of the existing scenario spaces is, in principle, suitable for investigating the identified problems in relation to the considered CISO. If not, it is necessary to request the development and implementation of a new scenario space. If a scenario space is usable in principle, the next examination checks the coverage of the relevant impact factors, i.e., the points where threats or other aspects potentially affect the CISO under consideration. If the coverage is incomplete, an extension of the impact factors is requested.
Next, the key impact factors (KIF) are selected, e.g., by cross-impact analysis, if the available scenario space supports large numbers of interdependent impact factors of interest. In the following step, the relevant KIFs are checked for sufficient description of all scenarios within the scenario space, e.g., relevant changes in climate or weather conditions, terrorist attacks, or functional failures. Additionally, the KIFs should be sufficiently parametrized according to the related variables and parameters of the models used for the system and environment. The description should also include the time-dependent trend of the KIFs for the purpose of scenario development prognosis. An insufficient setup of KIFs leads to process termination and the request for a scenario space extension. Finally, it is checked whether a valid database of nominal KIF values (N) representing the nominal state of the system is already provided. If so, the subprocess finishes, and the task continues with the model evaluation of the used digital twin (DT) and environmental models (EM) (see Figure 5). Otherwise, new descriptor data are requested.
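The sequence of checks in this subprocess can be summarized as a simple decision chain in which the first failed check determines the request that is issued. The sketch below is only an illustration of that control flow; the `ScenarioSpaceStatus` fields and the request strings are hypothetical names, and the actual checks would involve expert assessment rather than boolean flags.

```python
from dataclasses import dataclass


@dataclass
class ScenarioSpaceStatus:
    """Outcome of the individual examinations (assumed to be known)."""
    suitable_in_principle: bool
    impact_factors_covered: bool
    kifs_sufficient: bool
    nominal_values_valid: bool


def evaluate_scenario_space(status):
    """Return the first required request, or the follow-up step."""
    checks = [
        (status.suitable_in_principle, "request new scenario space"),
        (status.impact_factors_covered, "request extension of impact factors"),
        (status.kifs_sufficient, "request scenario space extension"),
        (status.nominal_values_valid, "request new descriptor data"),
    ]
    for passed, request in checks:
        if not passed:
            return request
    return "continue with model evaluation"


print(evaluate_scenario_space(ScenarioSpaceStatus(True, True, False, True)))
```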
4.1.2. Sensor System Evaluation
The final subprocess of the task context specification investigates the capability of the sensor system (SeS) to provide the data and information needed for resilience management (
Figure 6). It should be emphasized again that “sensor system” is used as a synonym for all sources providing the required data and information.
Initially, a context-related specification of requirements for data and information is executed. The requirements are specifically adapted to the previously evaluated DT and EM models, as the sensor system shall enable the elicitation of all relevant data and information needed by the models to monitor the condition of the considered CISO and its environment as well as actual events and state changes. Thus, the sensor system acts as the bridge between the real system and DT as well as the real environment and EM. In general, the sensor system should provide physical measurement data as well as merged data (sensor fusion) with higher information levels or nonphysical information from other sources.
The fulfillment of requirements is evaluated by a two-step check of the sensor system. In the first step, the sensor system is analyzed with regard to its ability to provide the data and information as needed. In case of a negative outcome, the modification of the sensor system is requested. Next, the operability and capability of the sensor system, e.g., in terms of quality of data provision, are verified, and, if necessary, maintenance is requested.
4.2. Data Acquisition and Management
The task “Data acquisition and management” serves the monitoring of the considered CISO and its relevant environment in compliance with the specified requirements (
Figure 7). Its purpose is the retrieval of data and information needed for the parallel task “Situation analysis”. The task starts with the acquisition of data from the sensor system. Next, the formal requirements on the data provided, such as compliance with the data format, plausible data content, and availability of validity information, are verified. Failed data acquisitions, as well as violations of formal requirements, are reported to the control system, which ultimately decides whether data acquisition should continue or restoring measures are required for the sensor system.
Successfully retrieved actual data may be fused to enable the generation of higher-level information, e.g., geo-referenced as well as time-synchronized data or plausibility, consistency, and integrity information about the actual data. The details of this step strongly depend on the considered CISO, its context, and the models and parameters used. Insufficient data quality and unstable data processing can lead to the failure of data fusion and the inability to provide the intended higher-quality information. This must also be reported to the control system, which decides on the further procedure.
If the data acquisition and processing are performed successfully, the data are provided to the scenario spaces and the DT or EM of the DIB. The control system then informs the task “Situation analysis” that new data are available for further investigation. It also ensures that data acquisition and processing are continued as cyclical tasks under normal conditions.
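The verification of formal requirements (data format, plausible content, validity information) and the reporting of violations can be sketched as follows. The record layout, the field names, and the functions `check_formal_requirements` and `acquire` are assumptions for illustration; the list of reports stands in for the messages sent to the control system.

```python
from datetime import datetime

# Hypothetical record layout: field name -> required Python type
REQUIRED_FIELDS = {"sensor_id": str, "timestamp": str,
                   "value": float, "valid": bool}


def check_formal_requirements(record, plausible_range):
    """Return a list of violations; an empty list means the record passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # content checks need a well-formed record

    try:
        datetime.fromisoformat(record["timestamp"])
    except ValueError:
        problems.append("timestamp is not in ISO 8601 format")
    low, high = plausible_range
    if not low <= record["value"] <= high:
        problems.append("value outside plausible range")
    if not record["valid"]:
        problems.append("sensor marked the reading as invalid")
    return problems


def acquire(records, plausible_range):
    """Split records into accepted data and reports for the control system."""
    accepted, reports = [], []
    for record in records:
        problems = check_formal_requirements(record, plausible_range)
        if problems:
            reports.append((record.get("sensor_id"), problems))
        else:
            accepted.append(record)
    return accepted, reports


records = [
    {"sensor_id": "p1", "timestamp": "2024-05-01T12:00:00", "value": 4.2, "valid": True},
    {"sensor_id": "p2", "timestamp": "2024-05-01T12:00:00", "value": 42.0, "valid": True},
    {"sensor_id": "p3", "timestamp": "yesterday", "value": 4.0, "valid": False},
    {"sensor_id": "p4", "value": 4.1, "valid": True},
]
accepted, reports = acquire(records, plausible_range=(0.0, 10.0))
```

Only the accepted records would be passed on to data fusion and the DIB; whether acquisition continues after a report remains a decision of the control system.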
4.3. Situation Analysis
“Situation analysis” is considered a core task of resilience management (
Figure 8). Situation analysis is responsible for the identification of developing scenarios and is carried out by consecutively passing through various subprocesses. The current situation is represented by the state of DT and EM modelled with incoming (near) real-time data (RT), while the nominal situation is described via the nominal model data (N). Additionally, sequences of previously logged RT of DT and EM that are still present in the DIB may be used to describe prospective scenario developments.
Next, anomalies are detected by comparing the current and nominal situations. The comparison may also include the analysis of time series to determine parameters that describe trends as well as to detect abnormal changes. In principle, various methods are applicable for comparing and evaluating the situation in order to detect anomalies. For example, outlier detection may be a suitable means for the time-efficient detection of relevant deviations [
83,
84]. In comparison, the recognition of known and unknown patterns in incoming data indicating abnormal behavior may be conducted by AI-driven processes [
85,
86]. The choice of comparison methods depends on the type and complexity of the data used as well as the resilience management objectives. Additionally, the comparison results should provide the details needed for the following state assignment in the scenario space. If no anomaly is detected, i.e., the current situation of DT and EM is within the boundaries of the nominal behavior, the “Situation analysis” process switches back into an idle state, waiting to be triggered for a restart by new incoming data.
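As one concrete instance of outlier detection, the sketch below uses the modified z-score, which is based on the median and the median absolute deviation (MAD) of previously logged values. This particular method, the commonly used threshold of 3.5, and the example data are choices made for this illustration; the framework itself does not prescribe a specific detection method.

```python
from statistics import median


def modified_z_scores(history, values):
    """Score new values against the median and MAD of logged values."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        raise ValueError("history has zero spread; score is undefined")
    # 0.6745 scales the MAD so that scores are comparable to z-scores
    # for normally distributed data.
    return [0.6745 * (v - center) / mad for v in values]


def detect_outliers(history, values, threshold=3.5):
    """Return (index, value) pairs whose absolute score exceeds threshold."""
    scores = modified_z_scores(history, values)
    return [(i, v) for i, (v, s) in enumerate(zip(values, scores))
            if abs(s) > threshold]


# Hypothetical logged pressure readings and two new readings
history = [4.9, 5.0, 5.1, 5.0, 4.8, 5.2, 5.0]
outliers = detect_outliers(history, [5.1, 7.3])
print(outliers)
```

A median-based score is used here because a few extreme values in the logged history distort it far less than they would distort a mean and standard deviation.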
If anomalies are detected, they are characterized by a snapshot describing the deviations or found patterns based on the key impact factors of a scenario space. The next processing step tries to match the snapshot of a detected anomaly to any feasible states in the scenario space in order to identify those scenarios that are representative of the current situation and are, therefore, possible developments. This may also include very unlikely scenarios as well as scenarios with only partial consistency, thus resulting in uncertainties with regard to further scenario development.
Additionally, it is reviewed whether matching scenarios were already recognized in an earlier analysis cycle. If this is the case, it is not necessary to start the processes of risk anticipation, RMM identification, and RMM evaluation again. This helps to prevent unnecessary analysis, as compliant risk mitigation measures are already in place or currently under investigation. New scenarios have to be checked for sufficient depiction within the existing scenario space.
A failed depiction should be analyzed to obtain a well-formulated problem description, which is needed to request external solution processes, e.g., the extension of the scenario space with additional key impact factors or a changed parametrization. If a depiction is possible, the resulting fully specified new state-related scenarios and their further characteristics are added to the anomaly database of the scenario space for further analysis.
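The matching of an anomaly snapshot against the scenario space, including scenarios with partial consistency, can be sketched as follows. The sketch assumes that each scenario is described by discrete states of its key impact factors and that consistency is measured as the share of factors agreeing with the snapshot. The scenario names, factor states, and the minimum consistency of 0.5 are hypothetical.

```python
def match_scenarios(snapshot, scenario_space, min_consistency=0.5):
    """Rank scenarios by the share of their key impact factors
    whose state agrees with the anomaly snapshot."""
    matches = []
    for name, factors in scenario_space.items():
        agreeing = sum(1 for key, state in factors.items()
                       if snapshot.get(key) == state)
        consistency = agreeing / len(factors)
        if consistency >= min_consistency:
            matches.append((name, consistency))
    # Most consistent scenarios first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical scenario space with three key impact factors.
scenario_space = {
    "generator_outage_cold_spell": {
        "heat_generation": "reduced", "outdoor_temp": "very_low", "demand": "high"},
    "generator_outage_mild": {
        "heat_generation": "reduced", "outdoor_temp": "normal", "demand": "normal"},
    "demand_peak_only": {
        "heat_generation": "nominal", "outdoor_temp": "very_low", "demand": "high"},
}
snapshot = {"heat_generation": "reduced", "outdoor_temp": "very_low", "demand": "high"}

matches = match_scenarios(snapshot, scenario_space)
for name, consistency in matches:
    print(f"{name}: consistency {consistency:.2f}")
```

In this reading, an empty result would correspond to a failed depiction, i.e., a snapshot that the existing scenario space cannot represent.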
4.4. Anticipation of Risk Developments
The task, “Anticipation of risk developments”, depicted in
Figure 9, is triggered by new or changed scenarios identified by the “Situation analysis” (see
Section 4.3).
The process analyzes the scenarios regarding risk indicators (RIs) related to the criticality of scenario development for the performance indicators (PIs) of the system. To this end, possible developments of the scenarios are simulated with the help of the DT and EM. The required information for simulating the scenario development is taken from the descriptor database of the scenario space. This comprises the current anomaly data (ASnapshot, AScenarios) and the nominal values (N) and includes parameters, variables, and their feasible short-term development, as well as the corresponding likelihoods. The model-based simulations of all parametrized scenarios are conducted in consecutive runs. Each run starts with the composition of a DT- and EM-based toolkit that is sufficient for the simulation of the chosen scenario. This toolkit is an offline copy of parts of the DT or of the whole DT, using its nominal specification. On this basis, the scenarios are simulated repeatedly, e.g., by Monte Carlo simulation, in order to consider different feasible scenario developments and the respective possible outcomes and their likelihoods.
In the next step, all simulation results obtained are analyzed with regard to the resulting course of the PIs and the risks affecting it. Thus, RIs having an impact on the course of the PIs in different simulated developments are assessed and quantified. In the last step of each run, the PIs, RIs, and their quantified impact, as well as the likelihood of the impact severity, are aggregated for the various simulated scenario outcomes.
After the simulations have been completed, the results and associated PIs and RIs are stored in the anomaly database. Finally, a first assessment of the criticality based on the extracted PIs and RIs is carried out. The risk treatment is stopped if critical risk development within the analyzed scenarios can be ruled out with adequate confidence. The needed level of confidence should depend on individual criteria of the considered system, e.g., its criticality within a higher-level network. Otherwise, the following process of finding appropriate risk mitigation measures (RMM) is triggered.
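The repeated simulation of one scenario and the first criticality assessment can be sketched with a small Monte Carlo example. It is only an illustration under strong assumptions: the DT- and EM-based toolkit is replaced by a one-line stand-in model, the PI is a heat supply margin, the RI is the likelihood of a shortfall, and the acceptable likelihood of 0.05 is a hypothetical confidence criterion. None of these choices is prescribed by the framework.

```python
import random


def simulate_supply_margin(rng, capacity_mw, demand_mean_mw, demand_sd_mw):
    """Stand-in for one DT/EM simulation run: the PI is the remaining
    supply margin in MW for one randomly drawn demand development."""
    demand = rng.gauss(demand_mean_mw, demand_sd_mw)
    return capacity_mw - demand


def anticipate_risk(capacity_mw, demand_mean_mw, demand_sd_mw,
                    runs=10_000, seed=42):
    """Simulate one scenario repeatedly and aggregate PI and RI values."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    margins = [simulate_supply_margin(rng, capacity_mw,
                                      demand_mean_mw, demand_sd_mw)
               for _ in range(runs)]
    shortfalls = [-m for m in margins if m < 0]
    return {
        "mean_margin_mw": sum(margins) / runs,           # aggregated PI
        "shortfall_likelihood": len(shortfalls) / runs,  # RI: likelihood
        "mean_shortfall_mw": (sum(shortfalls) / len(shortfalls)
                              if shortfalls else 0.0),   # RI: severity
    }


result = anticipate_risk(capacity_mw=100.0, demand_mean_mw=95.0,
                         demand_sd_mw=10.0)
# Hypothetical criterion: treat the scenario as critical if a shortfall
# cannot be ruled out with at least 95% confidence.
critical = result["shortfall_likelihood"] > 0.05
print(result, "-> trigger RMM identification" if critical else "-> stop")
```

In a real implementation, the stand-in function would be replaced by runs of the offline DT copy, and the confidence criterion would be set according to the criticality of the considered system.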
4.5. Management of RMM
The goal of this task is the identification of RMM that are potentially suitable to reduce the risks in the anticipated risk developments as well as to mitigate resulting consequences. The effectiveness of the RMM has to be demonstrated with respect to the PIs and RIs determined by the preceding scenario-based simulations. In this way, this task contributes decisively to enhancing the ability of the supported system to react and adapt to emerging critical situations. As not only the mere suitability but also the added value of the measures to overall risk mitigation is relevant, the impact of these measures is also evaluated within this activity.
In the first step (see
Figure 10), potential risk mitigation measures (RMM) are identified. Various methods may contribute to the fulfillment of this rather complex task. These methods are understood as a subprocess, which is not discussed in detail in this paper. Feasible methods range from best-practice approaches through state-of-the-art analysis to (variance-based) sensitivity analysis, e.g., described in [
87]. The measures predefined in this way could be stored in a list, e.g., together with their suitability for influencing various RIs. This list of K approaches is thus generally suitable for the RIs identified in the previous simulations, with a single approach representing a single measure or a combination of measures. If no matching RMM is found (K = 0), the task is aborted and the risk treatment continues with the evaluation process.
The identified RMM are further evaluated for their usability and practicability in the current DT model (see
Table 2). Here, usability can be assumed if RMM can be simulated and analyzed directly with DT models. Non-usable RMM can still be considered practicable when timely integration into the DT models is feasible without interrupting the running evaluation process. This evaluation includes the analysis and comparison of links and interfaces between RMM and DT models suitable for the implementation of the measures.
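The identification of K candidate measures and their assessment for usability and practicability can be sketched as a simple filter and classification. The measure catalog, the field names, and the time budget are hypothetical illustrations and are not taken from Table 2; in particular, representing "timely integration" by an effort estimate in hours is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str
    targets: frozenset              # RIs the measure is expected to influence
    simulable_in_dt: bool           # can be simulated directly with DT models
    integration_hours: float = 0.0  # assumed effort to extend the DT models


def identify_rmm(catalog, risk_indicators):
    """Return the K measures that address at least one identified RI."""
    return [m for m in catalog if m.targets & risk_indicators]


def classify(measure, time_budget_hours):
    """Label a measure as usable, practicable, or not applicable."""
    if measure.simulable_in_dt:
        return "usable"
    if measure.integration_hours <= time_budget_hours:
        return "practicable"
    return "not applicable"


# Hypothetical catalog of predefined measures.
catalog = [
    Measure("mobile_heating_station", frozenset({"heat_shortfall"}), True),
    Measure("consumption_appeal", frozenset({"heat_shortfall"}), False, 2.0),
    Measure("network_resegmentation",
            frozenset({"heat_shortfall", "pressure_drop"}), False, 48.0),
    Measure("fire_suppression_upgrade", frozenset({"fire_spread"}), True),
]

candidates = identify_rmm(catalog, {"heat_shortfall"})
labels = {m.name: classify(m, time_budget_hours=6.0) for m in candidates}
if not candidates:
    print("K = 0: abort task")
else:
    print(labels)
```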
If at least one usable RMM has been identified, a simulation of the feasible scenario developments is prepared and conducted analogously to the “Anticipation of risk developments” activity (see
Section 4.4). The simulation setup now comprises the implemented measures and, if needed, the extended DT models. Furthermore, the simulation setup uses the anomaly database and the nominal values for the key factors for scenario description. The model-based simulation of the RMM-influenced situation developments is iteratively conducted for every feasible scenario. When completed, the activity finishes by storing the scenario developments and associated PIs and RIs (AScenario*, API*, ARI*) in the anomaly database of the scenario space.
4.6. Evaluation of RMM
The final task deals with the evaluation of the identified RMM and their simulated impact on situation developments (see
Figure 11). The goal is the provision of recommendations for response to developing situations, e.g., emerging threat scenarios, by an optimal choice of RMM.
The evaluation process itself is multidimensional, since different influences on decision-making have to be considered. Firstly, the effectiveness of the proposed RMM is analyzed using the information from the anomaly database, e.g., by comparing the simulated course of PIs and RIs in the original developing scenarios (AScenario) with the respective developments with the implemented RMM (AScenario*). Secondly, uncertainties arising from different sources within the framework are evaluated for their impact on decision validity. In this important part of decision analysis, a distinction between epistemic and deep uncertainty is reasonable. While the former is a result of imprecisions in modeling and information, the latter results from unknown scenario development due to incomplete information in the prognosis.
Finally, a Risk-Benefit-Analysis is conducted. Here, the variety of feasible scenario developments (AScenario, AScenario*) and RMM is considered in order to analyze tradeoffs between the resulting risks and the benefits of RMM implementation for the PIs, using the results of the simulations carried out (API, API*, ARI, ARI*). Thus, the Risk-Benefit-Analysis considers the uncertain anticipation of threats and the resulting course of scenario development due to incomplete information, as well as the uncertainty about the effectiveness of the measures. Note that diverging effects of RMM on different PIs or in different scenarios can lead to goal conflicts that can be solved, e.g., by Multi-Criteria Decision Analysis.
All evaluation steps have similar criteria for aborting the process and ending the risk treatment. Hence, ineffective RMM, too many uncertainties, or unacceptable Risk-Benefit-Ratios lead to the recommendation of no RMM implementation. In this case, the decision support of the framework stays neutral and waits for more information from the task “Situation Analysis” (see
Section 4.3). If all steps are conducted successfully, the RMM are ranked according to the evaluation results. These results may be stored to track the RMM effectiveness actually achieved. This tracking can be used to check the validity of the implementation recommendations by comparing the results achieved, based on real-time data (RT), with the expected results of the simulations. A further application of this tracking can support the learning of new best-practice RMM solutions.
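The effectiveness check, the Risk-Benefit-Ratio, and the final ranking can be illustrated with a weighted-sum score, which is one simple form of Multi-Criteria Decision Analysis. The sketch assumes that each RI is a likelihood to be reduced, that costs are normalized to a common scale, and that the weights and the minimum acceptable ratio are given by the decision maker. All names and numbers are hypothetical and do not stem from the simulations discussed in this paper.

```python
def evaluate_rmm(baseline_ri, rmm_results, weights, min_ratio=0.5):
    """Rank RMM by benefit-to-cost ratio. The benefit is the weighted
    relative reduction of each RI compared with the untreated scenario."""
    ranked = []
    for name, res in rmm_results.items():
        benefit = sum(
            w * (baseline_ri[ri] - res["ri"][ri]) / baseline_ri[ri]
            for ri, w in weights.items()
        )
        if benefit <= 0:
            continue  # ineffective measure, not recommended
        ratio = benefit / res["cost"]
        if ratio >= min_ratio:  # drop unacceptable ratios
            ranked.append((name, round(ratio, 3)))
    return sorted(ranked, key=lambda r: r[1], reverse=True)


# RIs without measures (ARI) and with each measure (ARI*), hypothetical.
baseline_ri = {"heat_shortfall": 0.30, "infection_risk": 0.10}
weights = {"heat_shortfall": 0.7, "infection_risk": 0.3}
rmm_results = {
    "mobile_heating": {
        "ri": {"heat_shortfall": 0.10, "infection_risk": 0.10}, "cost": 0.5},
    # Goal conflict: lowers one RI but raises the other.
    "alternative_housing": {
        "ri": {"heat_shortfall": 0.05, "infection_risk": 0.25}, "cost": 0.4},
    "consumption_appeal": {
        "ri": {"heat_shortfall": 0.24, "infection_risk": 0.10}, "cost": 0.1},
    "reduce_pump_speed": {
        "ri": {"heat_shortfall": 0.33, "infection_risk": 0.10}, "cost": 0.1},
}

ranking = evaluate_rmm(baseline_ri, rmm_results, weights)
if not ranking:
    print("recommend no RMM implementation")
else:
    for name, ratio in ranking:
        print(f"{name}: benefit/cost = {ratio}")
```

An empty ranking corresponds to the neutral outcome described above, in which no RMM implementation is recommended. The weighted sum hides the uncertainty evaluation, which would require additional steps in a fuller implementation.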
5. Conceptual Study
This section discusses the feasibility of using the framework for Operational Resilience Management (ORM) in collaboration with dynamic risk management by means of a qualitative analysis of a real event that occurred in a major city in Bavaria (Germany) in February 2021. It is important to note that we assume an implementation of the framework that complies with the requirements presented in the previous section.
5.1. Motivating Scenario Description
The scenario used here as an example was a major fire in a power plant induced by a technical defect that was not detected early enough by the employed condition monitoring system. Regardless of whether the fire could have been avoided or not, the risk management system failed with regard to fire prevention. However, the risk management system worked very well with regard to firefighting as well as the evacuation of employees. Consequently, no personal injuries or major destruction were reported, with the exception of the power plant unit responsible for the heat supply, which was damaged to such an extent that it ceased operation. A post-hazard risk analysis might be able to show to what extent such a fire can be avoided in the future. This analysis might reveal the need for improved monitoring and recognition capabilities of the power plant, i.e., higher resilience against the causes of the fire.
The fire damage to the power plant unit meant that one of several district heating generators in the regional network was no longer available. The power plant unit, which dates back to the 1930s, was primarily in operation to balance peak daytime consumption and contributed to the system margin. Thus, it was expected that at average Central European temperature conditions (~5 °C) and under typical consumption behavior, the failure of this power plant unit would not lead to a noticeable reduction in the district heating supply. However, at the time of the event, the region suffered from freezing cold weather conditions with temperatures constantly below −10 °C. Based on the available information, it is not possible to judge whether risk analyses already carried out by the district heating network operator classified the total failure of the power plant at extremely low temperatures as a rather unlikely event or as a short-term event without significant effects.
The assessment of the damage by the power plant operator led to the estimate that several weeks would be needed to restore and rebuild the unit for heat generation. From the point of view of the district heating network operator and considering the current weather conditions, the district heating supply was no longer guaranteed for two city districts and their critical infrastructures, including a hospital, two retirement and nursing homes, 15,000 households, schools, and businesses. Based on this information, the city declared a disaster situation, established a crisis team, and asked residents to reduce hot water and heating consumption to a minimum. This was undertaken via the catastrophe warning system KATWARN, which uses mobile communication for messaging. Unfortunately, a radio mast of a mobile phone provider on the roof of the power plant was also destroyed, resulting in an interruption of the mobile phone service. As a result, it was difficult to estimate how many of the affected residents would comply with this request. Complicating the situation, the pandemic regulations meant that a large proportion of the population worked from their home offices, resulting in an even higher demand for heat and energy. At the same time, the procurement and installation of mobile heating stations began but was delayed due to the unavailability of mobile heat generators on site. Although the risk management of the crisis team had worked quite well up to that point, the uncertainties in the situation impeded the assessment of whether the measures initiated would be sufficient to guarantee a minimum supply of heat. In addition, the pandemic regulations in force at the time were suspended in order to provide alternative housing options to the affected population in the event of a total loss of heat supply. This reduced the risk of freezing but increased the risk of being infected with COVID-19.
5.2. Elaboration of Application of the Proposed ORM Framework
In the following, we discuss the possible impact of the ORM framework as specified in
Section 3 and
Section 4 on the presented scenario in a qualitative manner. To this end, different levels of the scenario are discussed, starting with the power plant as the lowest level, followed by the district heating supply and the urban crisis management.
5.2.1. Power Plant Level
Fire is either caused by the presence of heat, flammable substances, and oxygen in a certain mixture or by physically, chemically, or biologically induced spontaneous ignition processes. The observed outcome indicates that the employed monitoring system and the situation assessment by the operating personnel were not able to perceive and interrupt the ignition process at an early stage. This may be due to several causes: (a) The area where the fire originated was not monitored. (b) The area was monitored, but the data collected were inadequate to detect the emerging fire. (c) The anomaly detected by the monitoring system was an indicator of the emerging fire but was not noticed by the operators. (d) The detected anomaly was perceived as an indicator of an emerging fire by the operating personnel, but the scenario was classified as rather unlikely. (e) The emerging fire was detected too late to be stopped.
In the case of cause (a), the proposed ORM framework could perform the monitoring as needed for risk detection of emerging fires. In contrast, cause (b) could be circumvented via enhanced monitoring capabilities due to improved data acquisition and assessment, which is mainly provided by the tasks “Data acquisition and management” (
Section 4.2), “Situation analysis” (
Section 4.3), and “Anticipation of risk developments” (
Section 4.4). The proposed ORM framework would be particularly helpful in reducing the risks related to causes (c) to (e). To this end, the framework would initiate a scenario analysis for each detected anomaly with respect to potentially evolving hazards (“Situation analysis” and “Anticipation of risk developments”) unless the data and information base already contains explanatory scenarios for the specific type of heat generator. Based on this analysis, adequate risk mitigation measures (RMM) could be proposed and evaluated, mainly via the tasks “Management of RMM” (
Section 4.5) and “Evaluation of RMM” (
Section 4.6).
5.2.2. District Heating Supply
Modern district heating power plants already optimize the use of local resources with the help of operational support software based on DT or simplified models of the power plant and setpoints for heat generation. For this purpose, a district heating supply network is often described by a locally resolved hydrothermal model to ensure that the consumer obtains district heating at the required temperature and pressure. However, the actual heat consumption at the final customer level is unknown within these models. In the exemplary scenario discussed above, the network operator used its operating software to recognize at an early stage that the district heating supply was at risk under the circumstances described above. However, the pandemic-related and extreme-weather-related deviations from the customer profiles, together with the lack of knowledge of the current consumption of individual customers, made it impossible to reliably estimate the emerging restriction of district heating provision. This also applies to the assessment of the extent to which voluntary restriction of heat consumption by customers could contribute to easing the critical situation.
The proposed ORM framework could notably improve this situation assessment. To this end, the applied data and information base (DIB) must provide models not only for heat generators and networks but also for customers in terms of heat consumption. Recent developments indicate the feasibility of such dynamic models [
88]. Using the DIB and real-time data also at the customer level, acquired via the task “Data acquisition and management” (
Section 4.2), an enhanced description of the current situation and its dynamics is possible and could be provided by the task “Situation analysis” (
Section 4.3). Consequently, potential scenarios can be identified and described based on perceived changes in the system and the heat consumption, using the tasks “Situation analysis” and “Anticipation of risk developments” (
Section 4.4). This would also offer the possibility of regulatory intervention in times of crisis to avoid consumer behavior that causes additional harm to the stressed system, mainly derived and evaluated within the tasks “Management of RMM” (
Section 4.5) and “Evaluation of RMM” (
Section 4.6). Other possible risk mitigation measures include the formulation and updating of requirements and the scheduling of mobile heat generation.
5.2.3. Urban Crisis Management
In general, it is quite possible that the pandemic regulations in force at the time were suspended too early, considering the given uncertainties of the developing situation. Enhanced situational awareness, as provided by the proposed ORM framework, could have helped in avoiding or at least delaying the suspension.
6. Conclusions
Recently published works indicate that current risk and resilience research puts high emphasis on uncertainties arising from the lack of knowledge about the behavior of complex networked Critical Infrastructure Systems and Organizations (CISOs) and their multifaceted interdependences. As a result, a variety of risk-related as well as resilience-related frameworks offering either method-driven or application-oriented approaches were developed. However, these frameworks either lack the focus on operational management, have a rather theoretical approach, or are designed for specific applications. This observation motivated the development of the presented framework for operational resilience management (ORM), which aims at implementing the four cornerstones of resilience, namely responding, monitoring, anticipating, and learning, as operational processes in a CISO. To this end, the proposed ORM framework provides a process that defines tasks for the proper coordination of the CISO’s inherent capabilities to identify and handle adverse as well as surprising events and developments.
Regardless of the context, resilience management requires a minimum of available knowledge. Ultimately, the available knowledge determines the possible performance of the implemented resilience-related capabilities. The framework reflects the importance of knowledge by building a living data and information base (DIB), which includes the digital twin (DT) concept. This DIB enables the description of the current situation and historical events in relation to CISO, environment, and evolving scenarios. Furthermore, the proposed ORM framework also defines operational measures for gathering, maintaining, and extending the available knowledge. This results in the following advantages:
When starting or resuming ORM, a context specification assesses whether the available knowledge is sufficient to perform the intended resilience management. An identified deficiency leads to the need to expand the DIB, e.g., by collecting additional data, improving DT and environmental models, or expanding methodological capabilities. This strengthens situational awareness with regard to known and unknown risks.
The ORM framework provides support for decision-making regarding the need for reassessing possible risk developments due to the current situation. For this purpose, the ACTUAL situation (DIB with real-time data) is compared with the TARGET situation (DIB with nominal data) in order to identify anomalies in the risk indicators used and to anticipate potential risk developments.
Exhaustive analysis and simulation with the help of DT and environmental models are performed to provide decision support by identification, evaluation, and selection of suitable and practicable risk mitigation measures (RMMs). In this context, novel as well as known RMMs are investigated with regard to their effectiveness. An additional feature is the capability to reduce the influence of uncertainties and to correct implemented measures successively.
All framework-dependent feedback loops between DT and CISO, data and models, as well as analyses and measures of risk mitigation, reflect the crucial resilience feature of a learning CISO. The results of scenario analyses and implementation of identified risk mitigation measures lead to an update of DT models in particular and of the DIB in general.
Using the example of a real hazard in a Bavarian district heating power plant, it was shown how the framework could have a positive impact on decision-making processes involving risk mitigation measures as well as measures to increase the resilience of the CISO. One should be aware that the added value of using a framework necessarily depends on the actual implementation of the individual tasks in the context of a specific entity. It should be noted that the enumerated benefits can only be fully achieved if the database is as comprehensive as possible. In practice, limitations and complex implementation may have to be expected here, as deficiencies in data collection can often be identified. Another possible limitation is the lack of knowledge about functional relationships of the complex CISO, which can lead to necessary cutbacks in the accuracy of the system models used.
Consequently, to evaluate the effects of these potential limitations in more detail, it is important to subject the proposed framework and its components to a more detailed implementation on a practical example in future work.
Hence, the main contribution of this work is the provision of a methodical approach that paves the way for future research and development on the merging of methods for risk and resilience management as well as the digital twin concept.