In this section, we first discuss the general notion of resilience and present our definition as applicable to the smart grid. Then, we discuss the motivation, advantages and requirements of using DRASR as a contingency reserve (also known as reliability response).
2.1. Smart Grid Resilience
Many definitions of resilience exist in the literature. Most describe resilience as the ability of a system or entity to avoid, absorb and recover from failures [9]. In this work, we adopt the following definition, based on the definition of resilience given by Laprie [9] and the definition of dependability given by Avizienis et al. [10]: resilience is the persistent ability of the smart grid to avoid service failures that are more frequent and more severe than are acceptable when facing changes in the environment, and to recover from failures whenever they occur. A number of factors such as cyber-attacks, internal system failures, policy changes, configuration changes, or deployment changes can result in adverse conditions and disrupt system operation. We are specifically interested in evaluating the resilience of the smart grid under cyber-physical threats.
In recent years, evaluating the resilience of the smart grid has been a topic of interest in different research disciplines. A combination of qualitative and quantitative approaches is used in this evaluation. In the cyber-physical security domain, researchers are interested in evaluating resilience in the presence of cyber-physical threats and/or after adding cyber-security components that should enhance resilience to those threats [11]. Researchers in this discipline rely on risk assessment methodologies to evaluate resilience, which is considered the goal of risk management; that is, risk management enhances the resilience of the system under study [14]. By definition, risk is the likelihood of an event multiplied by the potential impact of that event. In the cyber-security domain, risk is usually computed as risk = vulnerability × threat × impact.
While this type of assessment covers likely risks (because of the vulnerability assessment step), it marginalizes risks that are unlikely but still possible, and it does not cover unknown risks. Although more systematic approaches are being developed in this domain [15], most of the work has been done in an ad hoc fashion.
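As a small illustration of the multiplicative risk score above, the following sketch ranks two threat scenarios. The scenario names, the 0 to 1 scales and the `risk_score` helper are assumptions made for illustration only; they are not taken from the cited methodologies. The example also shows how this scoring marginalizes unlikely events: a low threat likelihood pulls the score down even when the impact is at its maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    vulnerability: float  # assumed scale: 0 (not exploitable) to 1 (trivially exploitable)
    threat: float         # assumed scale: 0 (never attempted) to 1 (certain to be attempted)
    impact: float         # assumed scale: 0 (negligible) to 1 (catastrophic)


def risk_score(s: Scenario) -> float:
    """risk = vulnerability x threat x impact, as stated in the text."""
    return s.vulnerability * s.threat * s.impact


# Hypothetical scenarios with made-up scores.
scenarios = [
    Scenario("phishing of an operator workstation", 0.6, 0.8, 0.3),
    Scenario("zero-day exploit on a DR controller", 0.9, 0.05, 1.0),
]

# Highest score first.
ranked = sorted(scenarios, key=risk_score, reverse=True)
for s in ranked:
    print(f"{s.name}: {risk_score(s):.3f}")
```

With these made-up inputs the zero-day scenario ranks last despite having the largest impact, which is the weakness of likelihood-driven scoring noted above.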
On the other hand, more systematic approaches have been proposed in the environmental hazards/socio-technical systems discipline to evaluate the resilience of smart grids (and critical infrastructures in general). In this discipline, resilience is evaluated for events such as natural disasters (e.g., earthquakes and hurricanes), component failure and human vandalism [16]. Because of the nature of these events, probabilistic approaches (statistical and stochastic) are used and generalized for the evaluation. The main problem with this type of analysis is that failure probability models are mainly built from statistical data for physical components in the system (e.g., transformers and generators in the presence of an earthquake), or from stochastic models of failures for those components. This requires estimates of the failure probabilities for these events in the system, which are non-trivial to compute [18].
There has been an attempt to use the same probabilistic approaches to analyze smart grids under cyber-attacks in both the cyber-physical security domain and the environmental hazards/socio-technical systems domain [13]. However, this may not be appropriate, because cyber-attacks are hard to represent with the probabilistic methods used to model failures caused by events such as earthquakes (e.g., what is the probability of a zero-day attack?). In addition, these methods do not capture the behavior of the attacker (the attack scenario), which results in unrealistic attack modeling and impact analysis. For example, assigning a random variable to represent the mean time to an attack that causes the failure of a single power component, such as a generator, neglects the attack scenario and leads to unrealistic impact analysis.
Cyber-physical attacks can have different impacts on the smart grid, such as loss of power, loss of load, loss of information, or damage to equipment [20]. These impacts may propagate and affect higher-level smart grid functions, causing high-level function failures. Figure 1 shows how the smart grid can be logically decomposed into a physical power layer, a monitoring and communication layer called Advanced Metering Infrastructure (AMI), and an application layer consisting of higher-level functions such as automated metering, outage management (OM) and DR. In addition to these essential functional layers, there is a need for an orthogonal cyber security (CS) layer that protects the system against failures and attacks and ensures the integrity, confidentiality and availability of the system. A resilient smart grid should be able to avoid function failures that are more severe or frequent than is acceptable.
Measuring the resilience of critical infrastructure in general has been a topic of interest for researchers [21]. Strigini [21] summarizes three main measures that can be used to quantify resilience:
Measures of dependability in the presence of disturbances.
Measures of the amount of disturbances that a system can tolerate.
Measures of the probability of correct service given that a disturbance occurred.
What is common between these three types of measures is that they all require identifying function failures and acceptable degradation levels of smart grid services, which is consistent with the resilience definition presented earlier. In this paper, we quantify resilience by correlating and combining the first two measures listed above.
Our approach to quantifying resilience relies on: (1) measuring the dependability of a smart grid function (DRASR in this case) in the presence of cyber-physical attacks [21], where a failure in this function is measured by power system frequency (Hz); and (2) quantifying the cyber-physical attack by the amount of load that actually responds to a DR event.
2.2. DR as Spinning Reserve
The primary function of the power system is to deliver continuous power. However, a large, complex system such as the power grid faces several threats to its stability in the form of disturbances and contingencies. The power system in North America operates stably at 60 Hz. Minor disturbances and contingencies such as generation loss cause the frequency to fluctuate, but as long as the system is able to prevent the frequency from going out of the optimal operation region (59.97 Hz–60.03 Hz) and quickly recover to 60 Hz, the system operates continuously [23].
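The operating band described above can be checked mechanically against a frequency trace. The sketch below is only an illustration: the band limits come from the text, while the function names, the fixed sampling interval and the sample values are assumptions.

```python
from typing import Sequence

NOMINAL_HZ = 60.0      # nominal North American frequency (from the text)
BAND_LOW_HZ = 59.97    # lower edge of the optimal operation region (from the text)
BAND_HIGH_HZ = 60.03   # upper edge of the optimal operation region (from the text)


def in_optimal_region(freq_hz: float) -> bool:
    """True if a single frequency sample lies inside the optimal band (inclusive)."""
    return BAND_LOW_HZ <= freq_hz <= BAND_HIGH_HZ


def seconds_outside_band(samples_hz: Sequence[float], dt_s: float) -> float:
    """Total time spent outside the band, assuming evenly spaced samples."""
    violations = sum(1 for f in samples_hz if not in_optimal_region(f))
    return violations * dt_s


# Hypothetical 1-second trace around a generation loss (made-up values).
trace_hz = [60.00, 59.99, 59.95, 59.96, 59.98, 60.00]
print(seconds_outside_band(trace_hz, dt_s=1.0))
```

In this made-up trace, the two samples below 59.97 Hz are the only ones counted as excursions.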
Power reserves are the primary mechanism to handle disturbances and contingencies and keep the system operating in its optimal operation region (59.97 Hz–60.03 Hz). Reserves are classified as spinning and non-spinning, where spinning refers to the unused but synchronized capacity of the system and non-spinning refers to the unconnected capacity. The reserves are used by various response mechanisms such as governor response and Automatic Generation Control (AGC) to balance the frequency of the system. Based on their function, power reserves are also classified as regulating reserves and contingency reserves. Mechanisms such as governor response and AGC use the regulating reserves to handle normal operational disturbances in the system. Contingency reserves (also referred to as reliability response) handle supply contingencies such as loss of generation [23].
In the future, automated DR mechanisms will be used as a spinning reserve by utilities to automatically manage load in the system during contingencies or during times of peak demand. For instance, during a contingency such as a generator trip, DR will enable an intelligent system controller (or an operator) to send control commands in the form of load reduction requests to selected customers (or customer appliances), who (or which) will comply by shutting off the requested amount of load. This provides a means to balance and stabilize the system without resorting to more expensive measures such as buying more energy. DR thus promises to be an efficient, low-cost option for utilities to ensure system stability.
There are several reasons that make DR suitable for this type of reliability response. First, it is needed infrequently (a few times a month) and only for a short time (usually 10–15 min), which makes DR less intrusive to customers’ daily lives. Second, with the right communication and control technologies, DR commands can be deployed automatically and produce fast responses. Third, DR responds faster than generation. Finally, using DRASR may reduce the cost of operating and maintaining typical spinning reserve (synchronized generation) [24].
DR can be considered resilient if the required amount of load is always curtailed within a bounded time, where the required load and time depend on utility-specific requirements. Using this definition, we can evaluate whether DR was successful in performing its function as a spinning reserve in the presence of a cyber-physical attack (i.e., whether DR was resilient to cyber-physical attacks). Studies have shown that DR signals can be sent from the utility to customers’ loads within about 70 s [3]. If DR was not successful in its function as a spinning reserve, then certain requirements were violated, system stability was not maintained, and additional actions (such as increasing generation) should be taken to stabilize the system.
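The resilience criterion for DR stated above (the required load is curtailed within a bounded time) can be expressed as a simple predicate. The following is a minimal sketch under stated assumptions: the function name, the 50 MW requirement, the 600 s deadline and both response traces are hypothetical, since the actual values depend on utility-specific requirements.

```python
from typing import Iterable, Tuple

# Each point is (seconds since the DR event was issued, cumulative load shed in MW).
ResponseTrace = Iterable[Tuple[float, float]]


def dr_met_requirement(response: ResponseTrace,
                       required_mw: float,
                       deadline_s: float) -> bool:
    """True if the cumulative curtailed load reached required_mw by deadline_s."""
    return any(t_s <= deadline_s and shed_mw >= required_mw
               for t_s, shed_mw in response)


# Hypothetical utility requirement: shed 50 MW within 600 s.
REQUIRED_MW = 50.0
DEADLINE_S = 600.0

# Made-up traces: a normal response, and one where an attack prevents
# part of the load from responding to the DR event.
normal = [(70, 10.0), (300, 35.0), (540, 52.0)]
attacked = [(70, 10.0), (300, 25.0), (540, 30.0), (900, 30.0)]

print(dr_met_requirement(normal, REQUIRED_MW, DEADLINE_S))
print(dr_met_requirement(attacked, REQUIRED_MW, DEADLINE_S))
```

In the attacked trace the curtailed load plateaus below the requirement, so the predicate fails and, as noted above, additional stabilizing actions would be needed.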