Metrics and Strategies Used in Power Grid Resilience

Abstract: This article provides a comprehensive review of power grid resilience, including current metrics and definitions, as well as the procedures used to ensure and improve the resilience of a system. We also describe the different strategies used by users to ensure their own resilience. Additionally, this article highlights areas for future research and opportunities for the integration of emerging technologies such as computer vision. The main objective of this study was to explore the metrics and strategies used in power grids and by users to improve and ensure resilience in case of events.


Introduction
The power grid is a critical infrastructure that provides essential services to society. However, it is vulnerable to interruptions caused by various natural events, technical failures, and cyber attacks. Therefore, power grids must be resilient and able to mitigate and recover quickly from disruptions. Microgrids are a strategy to improve the resilience of power grids by enabling the generation and distribution of power locally in the event of grid interruption.
Given the aforementioned challenges, resilience is a highly relevant issue for electric power grids. Therefore, it is necessary to develop appropriate techniques and metrics based on the current proposals in the literature. Researchers have recently published articles reviewing various aspects of resilience studies on power grids. Table 1 lists some of the main features typically covered by these studies, showing that the topic of resilience still requires further research and evolution.
Table 1. Review of power system resiliency characteristics.

Characteristics of the Review | Reference
Resiliency definitions | [1–5]
Resiliency enhancement | [1,2,6–8]
Resilience metrics or indices | [1,3,9,10]
Cyber-physical resilience | [9,11–13]
Extreme events | [2,5,7–9]

Most of the methods used in studies to define the states of the power grid are based on physical models and the analysis of its infrastructure. However, this approach is unsuitable for handling the increasing uncertainty in the new variables that add complexity to the system. Similarly, the departments responsible for maintaining the electrical grid in operation and under proper conditions, as well as the users, have developed a series of procedures before, during, and after events with the aim of addressing the event without being significantly affected. In this article, we review recent metrics and strategies used by the power grid to represent the phases of recovery from disasters.
In a broader context, the evolution toward smart grids encompasses advancements in detection, computing, and communications [14]. This transformation involves various aspects, such as the integration of renewable sources, electric vehicles, load forecasting techniques, dynamic pricing, demand response, and the development of microgrids, including the consideration of cyber threats to infrastructure [15,16]. All these elements contribute to optimizing the operation and scheduling of the power grid. These technologies, coupled with actively participating agents [17], aim to enhance the overall performance of electricity generation, distribution, and consumption. As operators prepare and evolve alongside technology, this document briefly touches upon the topic of cyber resilience, acknowledging its extensive and complex nature and suggesting that it may warrant more in-depth exploration in a different context.
This review was conducted using a real and active energy operator as a reference, specifically LUMA in Puerto Rico [18,19]. This choice is particularly relevant due to the island's susceptibility to various recurrent natural phenomena in the Caribbean, such as Hurricane Maria in 2017 [20]. The ability to reference the strategies and measures applied by both the network operators and the end users of LUMA provides valuable insights for enhancing resilience. It is crucial to emphasize that this work also serves as a foundation for advancing toward the integration of new technologies associated with smart grids and understanding how users would adapt to these constantly evolving changes.
This review is organized as follows: Section 2 explores various definitions and concepts related to resilience, extreme events, and metrics proposed in the literature. Section 3 discusses common strategies for resilience in the power grid, focusing on the state-of-the-art procedures established by the U.S. Department of Energy (DOE), the electric authority in Puerto Rico (LUMA), and the users. Section 4 presents the discussion, and Section 5 outlines future research directions. Finally, Section 6 presents the conclusions of the work.

Concept of Grid Resiliency
Resilience is a multifaceted concept that has been examined in various disciplines, including social sciences, natural sciences, psychology, organizational management, engineering, and others. In general, resilience refers to the ability of a system to recover quickly from a disturbance. The concept of grid or power system resilience emerged in parallel with the examination of critical infrastructure resilience. Below are several definitions of grid resilience from recognized institutions:
DOE: "The ability of a power system and its components to withstand and adapt to disruptions and rapidly recover from them" (www.energy.gov/eere/solar/solar-andresilience-basics (accessed on 2 May 2023)).
FERC: "The ability to withstand and reduce the magnitude and/or duration of disruptive events, which includes the capability to anticipate, absorb, adapt to, and/or rapidly recover from such an event" (The Federal Energy Regulatory Commission (FERC)).
In simple terms, grid resilience refers to the ability of a network to withstand disturbances and quickly recover with minimal downtime or disruption.

Resilience Framework
To evaluate the resilience of a power system, it is essential to identify the extreme events that the system may face, as resilience varies according to each of them. Subsequently, resilience metrics must be established, and appropriate assessment methodologies must be selected.
It is necessary to model the spatiotemporal influence of an event on the resilience of the electrical infrastructure and to calculate the consequences of failures. Finally, system resilience can be assessed in a single failure scenario to obtain a specific measured value or in multiple failure scenarios using expected values, probability distributions, or other forms of analysis. Figure 1 outlines the steps to follow.
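The multi-scenario assessment mentioned above can be illustrated with a minimal sketch. It assumes that each failure scenario has already been reduced to a probability and a single resilience value between 0 and 1; the scenario list, the function names, and the numbers are hypothetical and are not taken from the reviewed studies.

```python
def expected_resilience(scenarios):
    """Probability-weighted mean of a resilience metric.

    scenarios: list of (probability, resilience_value) pairs.
    The probabilities are assumed to sum to 1.
    """
    total_p = sum(p for p, _ in scenarios)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(p * r for p, r in scenarios)


def worst_case_resilience(scenarios):
    """Lowest resilience value among the scenarios considered."""
    return min(r for _, r in scenarios)


# Hypothetical scenarios: (probability, resilience value in [0, 1]).
failure_scenarios = [(0.6, 0.95), (0.3, 0.80), (0.1, 0.40)]

expected = expected_resilience(failure_scenarios)
worst = worst_case_resilience(failure_scenarios)
print(f"expected resilience: {expected:.2f}, worst case: {worst:.2f}")
```

Reporting the worst case next to the expected value is one simple way to keep rare but severe scenarios visible, since an average alone can hide them.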

Reliability and Resilience
Reliability and resilience are two concepts that are often used interchangeably in the context of electrical systems. However, there are some key differences between them. Reliability refers to the ability of an electrical system to withstand and recover from typical random fault outages. In this case, load restoration methods can be used to supply power to affected areas. On the other hand, resilience refers to the ability of a power system to recover from extreme events, such as natural catastrophes, which can cause multiple cascading outages and damage to transmission and distribution networks. The restoration process is much more complex in these situations, and conventional techniques may be less effective [6].
Therefore, it is essential to have specific assessment methods and resilience enhancement techniques to deal with these types of events. Table 2 provides a summary of some of the significant differences between these two concepts.

Threat Events
Any factor with the potential to cause damage, destruction, or interruption to an electrical system is considered a threat. These threats can manifest as natural, technological, or human-induced events and are generally outside of the control of electrical system planners and operators (www.oe.netl.doe.gov/OE417_annual_summary.aspx (accessed on 9 June 2023)). When we talk about extreme events or threats to a power grid, we often think of those caused by nature.
However, in addition to these extreme events, there are other types of threats to the network, such as cyber or technological attacks, which have increased with the advancement and growth of technology [12]. Last but not least, there are threats of human origin.
Figure 2 shows some of the most common threats, which can be grouped as follows [21]:
• Natural hazards are caused by events like severe weather conditions, floods, earthquakes, hurricanes, landslides, frosts, and electrical storms. Additionally, they may result from interactions with wildlife, such as squirrels, snakes, or birds, which can lead to short circuits in distribution lines.
• Technological hazards are caused by failures of systems and structures, for example, defects in materials or malfunctions.
• Human-caused hazards can arise from accidents, such as inadvertently cutting a transmission line, or from intentional actions by an adversary, including cyberattacks or acts of terrorism.

Reliability and Resilience Metrics
No standardized metrics are available to evaluate a grid's resilience, so the topic is still a matter of discussion. In this part of the article, we present some metrics proposed in the literature, grouped by attributes, characteristics, codes or standards, and other criteria.
The resilience triangle, first introduced by [22], has served as a foundational element in the evaluation of resilience.This idea has progressed into the resilience trapezoid, being adapted to include factors related to degraded states in scenarios where immediate restoration measures are not promptly implemented after disruptions.

Reliability Metrics
The three most important reliability indices defined by the IEEE in IEEE Standard 1366 [23] are the System Average Interruption Frequency Index (SAIFI), representing the average number of power outages for one customer in a year; the System Average Interruption Duration Index (SAIDI), representing the total number of minutes or hours of interruption for the average customer in one year; and the Customer Average Interruption Duration Index (CAIDI), representing the average time required to restore service. These three indices are calculated using the following equations:
$$\mathrm{SAIFI} = \frac{\sum \mathrm{CI}}{\mathrm{CS}} \qquad (1)$$
$$\mathrm{SAIDI} = \frac{\sum \mathrm{CMI}}{\mathrm{CS}} \qquad (2)$$
$$\mathrm{CAIDI} = \frac{\sum \mathrm{CMI}}{\sum \mathrm{CI}} = \frac{\mathrm{SAIDI}}{\mathrm{SAIFI}} \qquad (3)$$
where CI is the customers interrupted and is equal to the total number of customers affected by the event; CMI means customer minutes interrupted and is equal to CI times the duration of the event in minutes; and, finally, CS means customers served and is equal to the total number of customers connected to the affected area. The U.S. Energy Information Administration (EIA) publishes the reliability indices described above from 1990 to the present on its website (www.eia.gov/electricity/data/eia861/ (accessed on 10 June 2023)). The data include 960 utilities corresponding to the states and 4 utilities for the territories; approximately 75% of the utilities apply Equations (1)–(3) from the IEEE Standard, while the remaining 25% use other standards. Figure 3 shows the mean reliability indices for all events for the United States, divided into states and territories, including Puerto Rico.
[Figure 3. Reliability indexes for all events (with major event days).]
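As a worked illustration of the three indices, the following minimal sketch computes SAIFI, SAIDI, and CAIDI from a list of outage events using the CI, CMI, and CS definitions given above. The `Outage` record, the function name, and the sample numbers are hypothetical; real utility reporting also involves rules (for example, the treatment of major event days) that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    customers_interrupted: int  # CI for this event
    duration_minutes: float     # duration of the event in minutes


def reliability_indices(outages, customers_served):
    """Return (SAIFI, SAIDI, CAIDI) for a reporting period.

    SAIFI = sum(CI) / CS, SAIDI = sum(CMI) / CS, CAIDI = sum(CMI) / sum(CI),
    with CMI = CI * duration for each event.
    """
    ci = sum(o.customers_interrupted for o in outages)
    cmi = sum(o.customers_interrupted * o.duration_minutes for o in outages)
    saifi = ci / customers_served
    saidi = cmi / customers_served
    caidi = cmi / ci if ci else 0.0
    return saifi, saidi, caidi


# Hypothetical year with two events on a system serving 10,000 customers.
events = [Outage(1000, 90.0), Outage(500, 30.0)]
saifi, saidi, caidi = reliability_indices(events, customers_served=10_000)
print(f"SAIFI={saifi:.2f} interruptions/customer, "
      f"SAIDI={saidi:.1f} min/customer, CAIDI={caidi:.1f} min/interruption")
```

With these sample values, CAIDI equals SAIDI divided by SAIFI, which is a convenient consistency check when the indices come from different sources.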

The Resilience Triangle
Figure 4 is a graphical representation of the resilience triangle [22], which has been utilized in various studies to assess resilience following an extreme event. However, the triangle is only capable of conducting a resilience assessment in a single phase, particularly in evaluating the recovery performance of an infrastructure after the event.
In this context, $F_A(t)$ represents the actual performance level of the system, and $F_T(t)$ represents the target performance level. Here, $t_e$ denotes the time of the disaster or threat, while $t_r$ signifies the commencement of the restoration process, and $t_{pr}$ denotes the moment at which the recovery is successfully completed, fully restoring operational functionality. This time frame is regarded as the study period for assessing resilience.
The hypotenuse of the triangle can have different shapes; it is not necessarily linear. The shape depends on the recovery strategy or functions used, as demonstrated in [24], where linear, exponential, and trigonometric functions are employed and evaluated.
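To make the effect of the recovery shape concrete, the sketch below builds a triangle-style performance curve with three illustrative recovery functions. The specific functional forms, the rate parameter `k`, and the 40% initial loss are assumptions chosen for illustration; they are not necessarily the expressions evaluated in [24].

```python
import math


def recovery_fraction(s, shape="linear", k=5.0):
    """Fraction of the lost performance recovered at normalized time s in [0, 1]."""
    s = min(max(s, 0.0), 1.0)
    if shape == "linear":
        return s
    if shape == "exponential":
        # Fast early recovery, normalized so that the fraction reaches 1 at s = 1.
        return (1.0 - math.exp(-k * s)) / (1.0 - math.exp(-k))
    if shape == "trigonometric":
        # Slow start and slow finish (half-cosine ramp).
        return 0.5 * (1.0 - math.cos(math.pi * s))
    raise ValueError(f"unknown shape: {shape}")


def performance(t, t_e, t_pr, f_target=1.0, loss=0.4, shape="linear"):
    """Actual performance F_A(t) for a triangle with an instantaneous drop at t_e."""
    if t < t_e or t >= t_pr:
        return f_target
    s = (t - t_e) / (t_pr - t_e)
    return f_target - loss * (1.0 - recovery_fraction(s, shape))


for shape in ("linear", "exponential", "trigonometric"):
    midpoint = performance(5.0, t_e=0.0, t_pr=10.0, shape=shape)
    print(f"{shape:>13}: F_A at mid-recovery = {midpoint:.3f}")
```

All three curves start and end at the same levels, so any difference in a resilience index computed from them comes only from the shape of the recovery path.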
Continuing with the literature review, in [25], the authors propose a mathematical formulation in (4) to measure the impact (I) between normal and abnormal conditions during an event, calculating the difference in functionality of the grid. In other words, (4) represents the area of resilience.
In (4), $F_A(t)$ is the actual grid functionality at time t; $R_o$ is the total grid functionality in normal conditions; and $t_e$ and $t_{pr}$ are the time of the start of the event and the restoration time, respectively. It is essential to highlight that the triangle in Figure 4 can be used in this case to evaluate the resilience of infrastructure. For example, Ref. [26] used various performance indicators for infrastructure under both normal operating conditions and during extreme events. They relied on the concept of resilience to establish the resilience index in (5). In this expression, Q(t) represents the quality of the infrastructure (y-axis) in the triangle in Figure 4. In other words, the variable $R_o$ would be modified based on this quality. In addition, reference is made to the variables $t_e$ and $t_{pr}$, mentioned above.
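One plausible reading of the impact measure described above is the area between the normal functionality level and the actual functionality curve over the interval from the start of the event to the end of restoration. The sketch below approximates that area with the trapezoidal rule on sampled data. This interpretation, the function name, and the sample values are assumptions; the exact expression used in [25] may differ, for example in its normalization.

```python
def impact_area(times, f_actual, r_o):
    """Approximate the integral of (r_o - F_A(t)) over the sampled interval.

    times:    increasing sample times from t_e to t_pr
    f_actual: sampled actual functionality F_A(t) at those times
    r_o:      functionality under normal conditions
    """
    if len(times) != len(f_actual):
        raise ValueError("times and f_actual must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        loss_prev = r_o - f_actual[i - 1]
        loss_curr = r_o - f_actual[i]
        area += 0.5 * (loss_prev + loss_curr) * dt  # trapezoidal rule
    return area


# Hypothetical event: functionality drops to 50% and recovers linearly in 10 h.
hours = [0, 2, 4, 6, 8, 10]
functionality = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

impact = impact_area(hours, functionality, r_o=1.0)
normalized_impact = impact / (1.0 * (hours[-1] - hours[0]))
print(f"impact area = {impact:.2f}, normalized = {normalized_impact:.2f}")
```

Dividing by the area available under normal conditions gives a dimensionless value, which makes events of different duration easier to compare.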

Resilience Trapezoid
In [10,27], the authors introduced a collection of metrics designed to gauge the resilience level of a power system, emphasizing its temporal aspects following a disaster event. This concept can be graphically depicted using the resilience trapezoid, as exemplified in Figure 5.
These metrics are abbreviated as F.L.E.P, signifying how fast (F) and how low (L) the resilience falls in phase I (disturbance), for $t \in [t_e, t_{pe}]$; how extensive (E) the postdisturbance degraded state is in phase II (postdisturbance degradation), for $t \in [t_{pe}, t_r]$; and how promptly (P) the grid recovers in phase III (restoration), for $t \in [t_r, t_{pr}]$. Here, $F_A(t)$ represents the actual performance level of the system, and $F_T(t)$ represents the target performance level. Furthermore, the postrestoration phase of network operation is described for the interval $t \in [t_{pr}, t_{ir}]$. Notably, this phase alone does not encompass the complete functionality, emphasizing the significance of the restoration stage in the infrastructure to ultimately achieve the desired recovery goal, denoted as $F_T(t)$ for $t \in [t_{pr}, t_{ir}]$.
To calculate the resilience index (RI) for a system facing an extreme disaster, the authors of [10] analyzed the multiphase resilience trapezoid shown in Figure 5. They used time-dependent resilience metrics and indicators to quantify the operational and infrastructure resilience of a system, as represented in Equations (6)–(10), the last of which is
$$\mathrm{Area}_{op} = \int_{t_e}^{t_{pr}} F_A(t)\,dt \qquad (10)$$
In these equations, $RI_F$ denotes how fast the system is degraded, $RI_L$ indicates how low the system is degraded, $RI_E$ represents how extensive (in hours) the degraded state is, and $RI_P$ specifies how promptly the system recovers. In addition, $\mathrm{Area}_{op}$ represents the area of the trapezoid from the beginning of the event at $t_e$ until the start of the postrestoration state at $t_{pr}$. Specifically, while the system may have regained its pre-event operational state, demonstrating a certain degree of operational resilience, the infrastructure may require more time for complete recovery, highlighting the concept of infrastructure resilience.
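The four trapezoid metrics can be sketched for an idealized, piecewise-linear trapezoid. The formulas below follow a common way of expressing the "how fast, how low, how extensive, how promptly" idea (slope of the decline, depth of the drop, duration of the degraded state, slope of the recovery); they are an assumption made for illustration and may not match the exact expressions of [10]. The names `f_0` (pre-event level) and `f_pd` (postdisturbance level) are hypothetical.

```python
def trapezoid_metrics(t_e, t_pe, t_r, t_pr, f_0, f_pd):
    """Illustrative resilience trapezoid metrics for a piecewise-linear curve.

    t_e:  event start            t_pe: end of the disturbance phase
    t_r:  start of restoration   t_pr: end of restoration
    f_0:  pre-event performance  f_pd: postdisturbance (degraded) performance
    """
    ri_f = (f_pd - f_0) / (t_pe - t_e)   # how fast: slope of the decline (negative)
    ri_l = f_0 - f_pd                    # how low: depth of the drop
    ri_e = t_r - t_pe                    # how extensive: time in the degraded state
    ri_p = (f_0 - f_pd) / (t_pr - t_r)   # how promptly: slope of the recovery
    # Area under F_A(t) between t_e and t_pr for the piecewise-linear curve.
    area_op = (
        0.5 * (f_0 + f_pd) * (t_pe - t_e)
        + f_pd * (t_r - t_pe)
        + 0.5 * (f_pd + f_0) * (t_pr - t_r)
    )
    return {"RI_F": ri_f, "RI_L": ri_l, "RI_E": ri_e, "RI_P": ri_p, "Area_op": area_op}


# Hypothetical event: 100 MW served before the event, 40 MW in the degraded state.
metrics = trapezoid_metrics(t_e=0.0, t_pe=2.0, t_r=10.0, t_pr=20.0, f_0=100.0, f_pd=40.0)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Because the inputs are plain numbers, the same function can be applied to other performance measures, such as customers connected or lines in service.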
In [28], another resilience recovery index, $RI_R$, was defined in Equation (11). According to Figure 5 and Equation (11), if the recovered level is close to the target level, $RI_R$ is high. Metrics based on the resilience characteristics were used in [27], where, to evaluate the impact of these strategies on resilience, it was essential to isolate them from other factors that influence the indices. The time period of interest starts at $t_r$, the time when the restoration phase begins, and encompasses both the restoration state and the postrestoration state.
Utility power is assumed to be restored to serve critical loads at $t_{ir} - t_r = \tau$, where $\tau$ represents the total outage duration, which is illustrated in Figure 5 and expressed in Equation (12). Another metric proposed in [27], but focused on the duration and type of event, is presented in Equation (13), where F and R are the failure and recovery profiles; $t_e$ is the time of the incident; $\Delta T_f = t_e - t_r$ is the duration of the failure; and $\Delta T_r = t_{pr} - t_r$ is the recovery duration. The quantitative metrics described in Equations (6)–(10) can be adapted depending on what needs to be assessed. In other words, they can be tailored to cases where, for example, the number of affected customers, the number of down lines, affected sectors, affected substations, etc., are to be evaluated [29]. Figure 5 illustrates that the trapezoid is highly adaptive and flexible in each of its aforementioned stages. Moreover, it can be applied not only in the electrical sector but also in other fields.

Resilience Triangle versus Resilience Trapezoid
In this section, we compare the use of the resilience triangle with that of the previously mentioned resilience trapezoid. One key feature of the triangle is its limitation to a single phase, specifically employed to assess the restoration status of an operating system after a disaster. On the other hand, the trapezoid offers an advantage as it is applicable to any threat, regardless of its nature, and allows for evaluation at different phases of the event. For instance, in the case of a short-duration event like an earthquake lasting seconds to minutes, resulting in a significant decline in resilience, the trapezoid can capture its evolution.
In contrast, the resilience triangle fails to capture the progression of longer-lasting events, such as a hurricane, which may span hours to days. Table 3, proposed in [6], summarizes other significant differences between these two metrics.

Resilience Strategies in Power Grids
Resilience and reliability stand out as two of the most crucial parameters for electrical grids. Consequently, various companies and public entities have endeavored to enhance these aspects by formulating comprehensive plans, including emergency response plans (ERPs). An ERP is a company document that delineates the procedures and steps to be followed under specific events that impact the integrity or state of the electrical grid. It also outlines the requisite documents and parameters that must be submitted or reported during and after an event.
There exists a distinction between energy security and resilience, where the former assesses common or more probable risks and the latter evaluates larger and less likely risks [30], such as those induced by natural, human-caused, and technology-caused disasters, as illustrated in Figure 2. The risks evaluated by both energy security and resilience are handled in a similar manner; however, when dealing with a substantial event, the procedures, resources, and time involved become greater.
These procedures and the way in which they are evaluated and classified vary by country and location.

Resilience as Seen by Grid-Side Operators-Top-to-Bottom Approach
In the U.S., the Department of Energy (DOE) publishes the "Energy Emergency Response Playbook for States and Territories" [31], which provides a starting point for all states and helps them develop their own ERPs. For the disaster prevention phase (Figure 5), this document establishes threat levels depending on the severity of the incident, but it only considers the three most critical levels, where the smaller the number, the bigger the incident. These levels are shown in Table 4. The subsequent phases I, II, and III are also established in the document [31], which describes the corresponding procedures. After phase III, the DOE does not take any action, because these actions are the responsibility of each state's energy authority. Puerto Rico offers another example of how entities evaluate incidents differently in the disaster prevention phase (Figure 5): even though the DOE is the highest authority in the electricity sector, Puerto Rico is autonomous and establishes its own threat levels based on its experience and incident probability. These threat levels are shown in Table 5. For the different phases, LUMA establishes the procedures and activities, which are shown in Figure 6.

Disaster Prevention
For this phase, LUMA has two different ways to act. The first applies when there are no threats, in which case LUMA performs periodic drills and training. The second applies when a disaster is imminent, and LUMA has to start performing a series of steps that help it to be better prepared to face the disaster.

Disturbance Progress
This phase starts when the disaster is in progress. In this phase, LUMA can only monitor the event and the damage in order to assess the system's health and report it to the stakeholders. It can also draw up restoration plans with the information obtained during the monitoring in order to secure the resources needed for the restoration.

Postdisturbance Degraded State
This phase starts when the disaster has ended. As in the previous phase, LUMA has to keep the stakeholders informed about the damage and the restoration plans that will be implemented. These plans must be analyzed, re-evaluated, and approved in order to confirm or change the designated resources; the needed restoration activities also have to be scheduled.

Restorative State
This phase starts when the restoration activities begin. As in the previous two phases, LUMA has to keep the stakeholders informed, but this time about the restoration progress. Also, the grid operator has to monitor the system in order to connect or disconnect branches or equipment, as necessary. In this phase, LUMA establishes a priority order for the restoration activities; this priority order is shown in Figure 7.
An important aspect of this phase is that when restoration efforts reach 90% of the total damage caused by the event, LUMA considers this phase as completed.
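The 90% completion criterion can be expressed as a simple threshold check on a restoration progress series. The sketch below assumes that progress is tracked as the fraction of event damage repaired at each reporting time; the data, the function name, and the reporting interval are hypothetical and do not reflect LUMA's actual tooling.

```python
def restoration_complete_time(times, restored_fraction, threshold=0.9):
    """Return the first reporting time at which restoration reaches the threshold.

    Returns None if the threshold is never reached in the series.
    """
    for t, fraction in zip(times, restored_fraction):
        if fraction >= threshold:
            return t
    return None


# Hypothetical progress reports, in hours after the start of restoration.
report_hours = [0, 24, 48, 72, 96]
fraction_repaired = [0.0, 0.35, 0.70, 0.92, 1.0]

t_pr = restoration_complete_time(report_hours, fraction_repaired)
print(f"restorative state considered complete at t = {t_pr} h")
```

With this criterion, the remaining repairs (here, the last 8%) fall into the postrestoration state rather than the restorative state.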

Postrestoration State
This phase begins when the restoration efforts are completed. In this phase, LUMA evaluates the event and the restoration activities in order to document lessons learned, look for deficiencies, and improve its action plans.

Resilience Cyber Attacks
The number of Internet of Things (IoT) devices connected to the network worldwide is projected to increase by 12% annually on average, from nearly 27 billion in 2017 to 125 billion in 2030 (https://cdn.ihs.com/www/pdf/IoT_ebook.pdf (accessed on 10 June 2023)). This growing automation makes electrical systems, or smart grids (SGs), more susceptible to cyber attacks.
Cyber attacks can cause physical damage and grid power outages, resulting in significant utility costs. To mitigate these risks, developing a cyber resilience strategy that involves measures such as implementing security systems, identifying vulnerabilities, and cybersecurity training is essential. Cyber resilience involves the ability to prevent, withstand, recover from, and adapt to a cyber attack.
A basic definition is as follows: "Cyber resilience: Identifying and defending against various types of cyber attacks and maintaining secure performance during the occurrence of such an event". In [33], an analogy between the physical threat model and the cyber threat model is proposed, establishing a comparison among certain attributes.

Common Strategies Used by Users to Improve Resilience-Bottom-Up Approach
In this subsection, some of the most common strategies employed by users or communities to prepare or adapt before, during, and after extreme events are discussed. Figure 8 is divided into different levels: Level I, home energy storage systems (ESSs); Level II adds the use of generation through natural resources such as solar photovoltaic (PV) or wind, excluding ESSs; Level III represents the integration of Levels I and II. Finally, Level IV is considered the most comprehensive, functioning as a decentralized microgrid (MG) that incorporates the aforementioned lower levels [34].
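The four levels described above can be read as a simple classification rule. The sketch below encodes that reading; the boolean inputs, the function name, and the use of level 0 for a user with no local resources are assumptions added for illustration and are not part of the classification in [34].

```python
def user_resilience_level(has_ess, has_local_generation, is_microgrid):
    """Map a user's resources to the levels described in the text.

    Level I:   energy storage system (ESS) only
    Level II:  local generation (e.g., PV or wind) without an ESS
    Level III: ESS combined with local generation
    Level IV:  decentralized microgrid that incorporates the lower levels
    Level 0 (assumption): no local resources at all
    """
    if is_microgrid:
        return 4
    if has_ess and has_local_generation:
        return 3
    if has_local_generation:
        return 2
    if has_ess:
        return 1
    return 0


# Hypothetical households.
print(user_resilience_level(has_ess=True, has_local_generation=False, is_microgrid=False))
print(user_resilience_level(has_ess=True, has_local_generation=True, is_microgrid=False))
```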
A review of the literature related to the topic is presented below, which includes some examples of the use of these common strategies.
• Distributed Energy Resources (DERs): These are diverse small-scale energy generation and storage technologies that can be deployed in proximity to the point of consumption, such as residences, businesses, and communities. These technologies encompass renewable energy sources, fuel cells, and energy storage systems, including batteries, as illustrated in Figure 8, positioned at Levels I to IV. The adoption of DERs serves as a means to enhance the reliability and resilience of the electrical grid.
• Microgrid: This is a small-scale electrical grid that operates independently or in conjunction with the main electrical grid. It usually consists of DERs such as PV, wind turbines, ESSs, and backup generators that are interconnected to supply power to a localized area at Level IV. Different architectures for building microgrids, including centralized, decentralized, and distributed forms, also provide construction flexibility depending on user needs. Microgrids offer several advantages over traditional power grids, including greater energy efficiency, greater reliability and resiliency, and integration of renewable energy sources and other DERs into the grid [34].
For example, in [35,36], the authors used the Distributed Energy Resources Customer Adoption Model to supply energy to critical buildings, such as hospitals, with a flat load profile. The authors show that with a DER microgrid, the reliability for a 7-day outage improves from 45% using diesel generators to 100% using DERs.
The energy storage system (ESS) is an excellent option to improve a grid's resilience; for example, in [37], the authors used ESSs to minimize the investment and load shedding costs under disasters based on Equation (14), where $C_{inv}$ is the total investment, $N_{ave}$ is the annual frequency of extreme disasters, and $C_{loss}(S)$ is the system load shedding cost under disasters. The results indicate that load shedding decreases when the ESS is configured with hardened lines, but this requires greater investment.
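The trade-off between investment and load shedding can be illustrated with a minimal sketch. It assumes that the objective combines the three quantities defined above as investment plus disaster frequency times load shedding cost; this form, the candidate plans, and all cost figures are hypothetical, and the actual formulation in [37] may include additional terms or constraints.

```python
def total_cost(c_inv, n_ave, c_loss):
    """Assumed objective: investment plus expected annual load shedding cost."""
    return c_inv + n_ave * c_loss


def best_plan(plans, n_ave):
    """Return the name of the plan with the lowest assumed total cost."""
    return min(plans, key=lambda name: total_cost(plans[name][0], n_ave, plans[name][1]))


# Hypothetical plans: name -> (investment cost, load shedding cost per disaster).
candidate_plans = {
    "no_ess": (0.0, 900.0),
    "ess_only": (200.0, 500.0),
    "ess_plus_hardening": (450.0, 150.0),
}

frequent = best_plan(candidate_plans, n_ave=1.0)  # one extreme disaster per year
rare = best_plan(candidate_plans, n_ave=0.2)      # one extreme disaster every 5 years
print(f"frequent disasters: {frequent}; rare disasters: {rare}")
```

In this toy example, the plan with the highest investment only pays off when extreme disasters are frequent, which mirrors the observation that lower load shedding comes at the price of greater investment.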

Discussion
In this document, multiple approaches for assessing grid resilience were described, including the main representations that are employed. In addition, it was shown how the U.S. Department of Energy (DOE) and the Puerto Rico Department of Energy classify disasters, focusing mainly on the activities carried out by the latter to improve its resilience to disasters. We also evaluated how users prepare for these disasters by classifying them into levels according to their autonomy and complexity. Finally, we have to point out that much improvement is needed to maintain and increase grid reliability and resilience. The challenges encountered throughout this study are detailed below:
• Load Growth: The continuous growth in load is a major problem due to the large number of new users connecting to the electric grid, which can cause grid infrastructure and equipment to exceed their design values and fail without the need for an extreme event.
• Climatic Variability: The challenges posed by climatic variability, marked by the heightened frequency and intensity of weather events, are considerable. In the United States, there has been a 67% rise in power outages caused by weather events since the year 2000 (https://www.energy.gov/energysaver/articles/renewable-energy-and-energystorage-can-help-you-power-through-natural (accessed on 13 October 2023)). Additionally, a substantial 83% of all documented power outages in the U.S. are linked to weather-related incidents (https://fairtradefinder.com/the-benefits-of-portablepower-stations-for-natural-disasters/ (accessed on 13 October 2023)).
• Integration of Renewable Resources: While renewable energies are highly beneficial for increasing the reliability and resilience of users during power outages, when connected to the grid, they can cause damage due to the large amounts of injected energy, which can lead to failures.
• Supply Chain: Because many users rely on diesel or gas generation as a backup for the electrical grid, the fuel supply chain is a concern: generators can run out of fuel (https://www.datacenterknowledge.com/archives/2012/10/31/diesel-thelifeblood-of-the-recovery-effort#close-modal (accessed on 15 October 2023)) when the fuel reserve is insufficient or when trucks do not arrive at the generator site in time.
• Intelligence Control: Nowadays, grid operators and utilities are seeking to introduce more and more intelligent equipment to the grid, which can be controlled remotely in order to improve reliability by minimizing reconnection time after faults. However, this equipment is also a weak point and a target for cyber attacks like the one conducted on the Ukraine power grid in 2022 (https://www.bbc.com/news/technology-61085480 (accessed on 22 October 2023)).

Future Research Directions
The future will witness the interconnection of multiple microgrids and the refinement of energy management strategies to foster collaboration and optimize resource distribution during critical events [38]. Further exploration into dynamic boundary microgrids is crucial to augment adaptability and responsiveness in extreme conditions. The integration of AC/DC microgrids is paramount for achieving heightened stability and resilience against disruptions in a grid [39]. The development of coordinated control methods covering energy sources, the electrical grid, loads, and storage systems is imperative for comprehensive and effective management [40,41]. Additionally, it is essential to acknowledge the escalating influence of artificial intelligence algorithms, such as machine learning, in the prevention of events and the improvement of emergency response within the evolving landscape of smart grids [42]. These strategic research domains, coupled with the integration of artificial intelligence and smart grid technologies, collectively contribute to a fortified electrical infrastructure, poised to meet the challenges of the modern era.

Conclusions
In conclusion, this article presented detailed metrics and evaluation methods for energy infrastructure resilience planning. We also described the differences between disaster evaluation methods and progress as seen from two points of view: the grid-side operator and the user side. For the former, the resilience trapezoid was used to approximate and evaluate the different stages that occur during an emergency and the steps or strategies that LUMA employs in order to increase grid resilience and improve the way the energy operator faces the recurrent natural phenomena occurring in the Caribbean.
Moreover, it is important to acknowledge that there may be limitations in the practical application of these techniques in real-world scenarios, so further research is needed to address these challenges. Overall, this study highlights the importance of continuing to explore new approaches and techniques to ensure the resilience of power grids in the face of disruptive events.

Figure 6. Top-to-bottom view of LUMA disaster strategies. * LUMA considers that the time $t_{pr}$ is reached when 90% of the damage has been repaired.

Figure 7. Priority order for restoration activities.

Table 2. Comparison of reliability and resilience.