A Risk-Based Approach to Assess the Operational Resilience of Transmission Grids

Modern risk analysis studies of the power system increasingly rely on large datasets, whether synthesized, simulated, or real utility data. In the transmission system in particular, outage events have a strong influence on the reliability, resilience, and security of the overall energy delivery infrastructure. In this paper, we analyze historical outage data for transmission system components and discuss the implications of nearby overlapping outages for the resilience of the power system. We carry out a risk-based assessment using the North American Electric Reliability Corporation (NERC) Transmission Availability Data System (TADS) for the North American bulk power system (BPS). We find that quantifying clusters of nearby unscheduled outages would improve the response times for operators to readjust the system and provide better resilience while still meeting the standard definition of N-1 security. Finally, we propose future steps to investigate the relationship between clusters of outages and their electrical proximity, in order to improve operator actions in the operation horizon.


Introduction
Maintaining an adequate level of reliability and resilience in the planning and operation of the power grid is a challenging problem that operating entities face today due to frequent extreme events (e.g., failure of multiple physical components, natural disasters, cyber-attacks) and the increasing complexity of energy system infrastructure. Major catastrophic events, often called high-impact, low-probability (HILP) events, have led to a large number of cascading events and blackouts. These events, in most cases, result in interruptions to customers and inconvenience to residents in affected areas due to the loss of not just electricity but also water and communication. Therefore, power system resilience today is receiving more attention from regulators and the utility industry as a key factor in the defense against HILP events that have significant economic and societal impact. Reliability and resilience are two critical attributes of electric grids, as highlighted by relevant publications and policy guidelines [1][2][3][4][5][6][7]. Grid reliability, as a fundamental objective of electric utilities, covers two distinct attributes: adequacy and security, which are usually studied under HILP events. North American Electric Reliability Corporation (NERC) standards require entities to perform planning studies for their systems under extreme contingencies, but extreme disaster events are not considered [8]. On the other hand, grid resilience studies deal primarily with catastrophic HILP events and focus on a diverse range of issues, such as flexibility, hardening, security, and recovery. Standards for resilience studies in the planning and operation of power systems have not been developed yet. A decision approach, based on multi-level mixed integer programming (MIP), to make investment decisions against terrorist threats is proposed in [31].
The authors of [32,33] demonstrate that hardening is one of the most effective ways to increase power system resilience under extreme weather events.
Various methods on resilience assessment of the power system under cyber-attacks are presented in [34][35][36][37][38]. Guo et al. [34] present a reliability assessment of a cyber-physical power system considering cyber-attacks against monitoring functions. Al-Ammar and Fisher [35] perform resilience assessment of the power system to cyber and physical attacks, and they consider the degree of vulnerability to be a measure of the resilience of power system to attacks via cyber means or through physical means. The results of a survey of Information and Communication Technologies (ICT) vulnerabilities of a power system and relevant defense methodologies are presented in [36]. Huang et al. [37] present an integrated resilience response framework that links the situation awareness with resilience enhancement and provides effective and efficient responses in preventive and emergency states. A probabilistic risk-based methodology for security assessment of a power system by taking into account vulnerabilities of ICT systems that involve control and protection is presented in [38].
Different methodologies for assessing power system resilience are presented in [39][40][41][42][43][44][45][46]. Yan et al. [39] analyze grid resilience to False Data Injection (FDI) attacks with different magnitudes and numbers of false data inputs. Van Harte et al. [40] propose an approach to prioritize power system resilience capabilities, in order to contain the impact and restore the network quickly, with a framework for assessing different disaster scenarios. Panteli et al. [41] propose a sequential-simulation-based time-series model for evaluating the effect of wind on transmission lines and the entire power infrastructure. Zhang et al. [42] propose a toughness approach to quantify the robustness of a power system against potential disasters. Ciapesoni et al. [43,44] present a risk-based resilience assessment methodology in operation-planning mode to predict the riskiest contingencies, including threat intensity and component vulnerability, that will affect power system resilience. Chi et al. [45] present a literature survey on power distribution system resilience assessment.
A variety of studies have addressed the resilience solutions based on recovery and restoration to minimize the impact of extreme and catastrophic events [46][47][48][49][50]. Wang et al. [46] present and review the research towards methods and tools of forecasting natural disaster related power system disturbances, hardening and pre-storm operations, and restoration models. Arab et al. [47] present a significant change in power grid response and recovery schemes by developing a framework for proactive recovery of power assets to enhance the resilience. Van Harte et al. [48] present the different blackout recovery mechanisms available to the System Operator to respond to and recover from such an extreme event. Perrings et al. [49] propose the use of price functions to estimate people's willingness to pay for more resilience in the power supply. Ju et al. [50] present a reconfiguration model for load restoration in radial distributed systems that includes multiple energy services, including local combined heat and power (CHP) plants, to meet the demand of critical loads during post-disaster horizons.
Outage data obtained from bulk transmission equipment play an important role in BPS planning, operations, and maintenance practices. Outage data statistics are considered essential when evaluating past, present, and future grid resilience. NERC has been collecting continent-wide transmission outage and inventory data in the Transmission Availability Data System (TADS) since 2008 [51,52]. TADS has been used to (a) assess the root causes of outages on major BPS elements; (b) calculate typical reliability indices; and (c) identify reliability risks due to independent, common mode, and dependent outages [53][54][55][56][57][58]. The fundamental aspects of common mode and dependent outages in power systems are presented and reviewed in [59][60][61][62][63]. Another major mechanism of failure in the power grid is a cascading outage. A cascading outage is defined as a sequence of dependent outages that successively weaken or degrade the power transmission system [64]. The work in [65][66][67][68][69][70][71] shows how to assess cascading via a sequence of dependent outages, how to benchmark the proposed analysis methodologies, and how to evaluate cascading from historical outage data. The authors of [72][73][74][75][76][77][78][79][80] present a variety of methodologies for the assessment of BPS resilience using historical outage data for major BPS components such as lines and transformers. Eskandarpour et al. [72] present a multi-dimensional machine learning model to improve power grid resilience through predictive outage estimation. Kelly-Gorham et al. [73] present an approach to compute overall transmission grid resilience using historical utility outage data. Thomson et al. [74] evaluate transformer historical failure data to assess facility resilience and reliability.
Campbell [75] suggests solutions for reducing impacts from weather-related outages that include improved tree-trimming schedules to keep rights-of-way clear, placing distribution and some transmission lines underground, implementing Smart Grid improvements to enhance power system operations and control, inclusion of more distributed generation, and changing utility maintenance practices and metrics to focus on power system reliability. Eskandarpour and Khodaei [76] present a machine learning based prediction method to determine the potential outage of power grid components in response to hurricanes. Dagle [77] indicates that power system operators now have an unprecedented wealth of data, coming from a variety of sources, such as demand response, synchrophasors, supervisory control and data acquisition (SCADA) systems, which, if managed properly, can provide opportunities for the efficiency, reliability, and resilience of the power system. Duchesne et al. [78] propose an approach combining Monte Carlo simulation, machine learning, and variance reduction techniques in the context of operation planning to assess the expected performance of the system over a certain look-ahead horizon that can guide the operation planner in decision-making.
Quantifying and analyzing the impact of cascading outages is an important part of grid resilience assessment. The NERC State of Reliability (SOR) report [79] reviews past reliability performance of the BPS, examines the state of system design, planning, and operations, and the ongoing efforts by NERC and the industry to continually improve system reliability and resiliency. This independent report is based on an analysis of data and metrics, which enables NERC to examine trends, identify potential risks to reliability, establish priorities, and develop effective mitigation strategies. The state of reliability also provides guidance to industry asset owners and operators in the form of recommendations to enhance the resilience of the BPS.
In this paper, we discuss the power system resilience concept in operation planning by evaluating historical cluster outages of multiple transmission elements (e.g., lines, transformers) recorded within a 2-min time interval. This type of outage is a threat to each utility Transmission Operator (TOP) operating under a single-contingency reliability criterion. The paper further develops the methodology proposed in [80] that, for the first time, used the TADS data for assessing the resilience of the BPS under these nearby overlapping outages. To gain a better understanding of how clusters of nearby outages can impact system resilience in the future, this study examines both sustained and momentary outages. We perform a comprehensive analysis of the North American combined inventory and cluster outage data for both automatic sustained and momentary outages within a 2-min window. The analysis aims to identify actionable information from outage data statistics that could be helpful in preventing or mitigating the consequences of the newly studied overlapping outage clusters. In addition, this paper presents a methodology to evaluate the likelihood of clusters of different sizes, and of clusters overall, for a Transmission Owner (TO) based on its transmission inventory.

Operational Grid Resilience: Background and Definitions
The U.S. National Academies define resilience as "the ability to anticipate, prepare for, and adapt to changing conditions and withstand, respond to, and recover rapidly from disruptions" [6]. NERC defines power grid resilience as "the ability to reduce the magnitude and/or duration of disruptive events" [3]. Complementary definitions and power system resiliency metrics are presented in [81,82]: the power system is resilient if it operates reliably over a range of operating conditions and has the capability to deliver power and to absorb and adapt to events of low probability and high consequence. The CIGRE C4.47 Power System Resilience Working Group defines power system resilience as the ability to limit the extent, severity, and duration of system degradation following an extreme event [83].
Robust and resilient operation of a power grid requires anticipation of unplanned outages that could lead to cascading and blackouts. Planning and operation standards are designed so that the power grid shall always be operated such that instability, uncontrolled separation, cascading, or voltage collapse will not occur because of any single contingency or two sequential N-1 contingencies (N-1, time for readjustment, and another N-1). In addition, planning standards cover credible N-2 contingencies, such as double-circuit outages, circuits on common structures, or stuck-breaker conditions. The specific NERC reliability standards that relate to the BPS capability to withstand events in anticipation of potential outages, manage the system after an event, and prepare to restore or rebound after an event are TPL-001-4, TOP-002, EOP-004-3, EOP-005-2, EOP-006-2, EOP-011-1, CIP-014-2, PRC-006-3, PRC-016-1, and TPL-007-1 [39]. While these criteria fulfill, for example, the NERC requirements to meet performance standards and operate securely under the N-1 contingency criterion, they do not guarantee that the system is immune to multiple N-k outages. Detecting and preventing multiple outages is critical to maintaining power system reliability and resilience. Operation planning engineers, as well as control room operators, face complex situations resulting from these multiple events. When power grids host a high volume of renewable energy sources or are heavily stressed by large power transfers, keeping electricity grids efficient, reliable, and resilient becomes an increasingly challenging task.
A growing body of publications in recent years presents the concept of resilience by assessing the impact of, and mitigation measures for, major disturbances resulting from adverse weather, natural disasters, hurricanes, earthquakes, and cyberattacks. Reference [3] emphasizes, "To increase system resilience requires an understanding of a wide range of preparatory, preventive, and remedial actions, as well as how these impact planning, operation and restoration over the entire life cycle of different kinds of grid failures".

Risk-Based Methodology
A large body of work on power systems follows the classic definition of risk by Lowrance [84], whereby risk is defined as the impact of an event times the probability of occurrence of this event. Similarly to the data-driven approach in this paper, other works have used large datasets and statistical or probabilistic approaches to the analysis of power system events in terms of quantifying risk [44,[84][85][86][87]. With enough data available, one can also use these datasets as a source of features and train modern machine learning approaches to predict and quantify risk [78,86]. Machine learning and artificial intelligence approaches can also provide timely recommendations to the operator in charge of remedial actions [76,78]. Common sources of data that recent research increasingly incorporates into risk and reliability studies are those characterizing renewable resources [88,89].
There is no unique definition of resilience today, but the majority of published definitions focus on the power system's ability to anticipate, absorb, and rapidly recover from an external, high-impact, low-probability event. A conceptual framework of power system resilience covers the following steps [46,86]:
• Step 1: Threat/event characterization;
• Step 2: Vulnerability of the system's components;
• Step 3: System response; and
• Step 4: System restoration.
The key attributes of a resilient power system are robustness, resistance, resourcefulness, and redundancy. Due to the limitations mentioned above, our study does not cover all these attributes, and the results are primarily related to Step 1. To measure, for example, the robustness of a power system, a comprehensive study needs to be performed to establish a threshold value of the consequence beyond which the performance of the system is considered unacceptable.
It is important to note the similarities and differences between risk studies and resilience studies. Reference [90] defines the resilience of an infrastructure as "the ability to anticipate, prepare for, and adapt to changing conditions and withstand, respond to, and recover rapidly from disruption". It also states that resilience management goes beyond risk management to address the complexities of a power grid and the uncertainty of future threats, while including risk analysis as a central component. Risk analysis depends on the characterization of the threats, vulnerabilities, and consequences of adverse events to determine the expected loss of critical functionality [90]. Due to the scope of this paper and the datasets we are leveraging, the authors do not apply the traditional risk-based methodology presented in [91] but focus on the risk factors that come from the outages included in the analysis. Therefore, the authors have used the results on cluster statistics to evaluate an aggregated risk at the operating-entity level that could help TOs identify mitigation measures to prevent or minimize the impacts of those outages.
CIGRE WG C4.47 proposed a reliability framework under a broader perspective covering three components: adequacy, security, and resilience. A system that is resilient is not necessarily reliable. However, a reliable system must be resilient to extreme events; otherwise, it would not satisfy the general definition of reliability. This perspective is also supported by the Federal Energy Regulatory Commission (FERC), which indicates in [92] that "resilience is a component of reliability in relation to an event", and by NERC, which states in [79] that "a bulk power system that provides an adequate level of reliability is a resilient one".

Overview
NERC has been collecting North American automatic outage data for transmission elements of 200 kilovolts (kV) and above since 1 January 2008. An automatic outage is defined as an outage that results from the automatic operation of a switching device, causing an element to change from an in-service state to a not-in-service state. Single-pole tripping followed by successful AC single-pole (phase) reclosing is not an automatic outage [51]. Transmission elements of the BPS reportable in TADS are (1) alternating current (AC) circuits (overhead and underground), (2) transformers (no generator step-up units), (3) direct current (DC) circuits (a DC circuit element is a complete line, not just a single pole), and (4) AC/DC back-to-back converters [51]. In 2015, TADS reporting changed to align with the implementation of the Federal Energy Regulatory Commission (FERC)-approved bulk electric system (BES) definition [87]. Two additional voltage classes were added, namely, less than 100 kV and 100-199 kV. Sustained automatic outages are the only outages collected at voltage classes below 200 kV.

Analysis Dataset and Definitions
For this analysis, TADS automatic (momentary and sustained) outages of TADS elements of 200 kV and above for the years 2013-2019 were grouped by Transmission Owner (TO). These outages were sorted in chronological order, then examined to select groups of outages inside a TO with the starting times of two consecutive outages separated by at most 2 min. This process resulted in 4246 groups that contained 10,501 outages (or 32.6% of all TADS automatic outages over the 7 years). Next, these groups were examined to detect outages that do not overlap in time with at least one other outage in the group. (Overlapping outages are defined here as outages that overlap in time, for any period of time. Namely, if two outages start at the same time, they overlap; if one of the outages starts earlier, the second outage should start before the first one ends for them to overlap.) These outages were removed from the study, and the groups were redefined to contain only outages that overlap with one or more outages in the group. The resulting sets of outages are called clusters. Namely, a cluster is a set of automatic outages of transmission elements in the same company that satisfies the following conditions: (a) when sorted by their start times, the difference between the start times of any two consecutive outages does not exceed 2 min; (b) each outage in a cluster overlaps in time with at least one other outage in the cluster. Condition (b) implies that outages in each cluster are "continuous", i.e., at any moment from the earliest start of all outages in the cluster to the latest end of all outages, at least one outage continues. The size of a cluster is defined as the number of outages it contains. For any cluster of size 2 and greater, the operator has at least one N-2 contingency but, depending on the cluster size, may have multiple N-2, N-3, N-4, . . . contingencies.
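The two cluster conditions above can be sketched in code. This is a simplified, hypothetical illustration rather than the actual NERC processing pipeline: outages are modeled as plain (start, end) tuples, and condition (b) is applied in a single pass, although removing non-overlapping outages could in principle require re-checking condition (a).

```python
from datetime import datetime, timedelta

def _overlap(a, b):
    # Per the definition in the text: outages with the same start time overlap;
    # otherwise the later-starting outage must begin before the earlier one ends.
    if a[0] == b[0]:
        return True
    first, second = (a, b) if a[0] < b[0] else (b, a)
    return second[0] < first[1]

def find_clusters(outages, gap=timedelta(minutes=2)):
    """Group one TO's outages (start, end) into clusters:
    (a) consecutive start times at most `gap` apart;
    (b) each outage overlaps in time with at least one other."""
    outages = sorted(outages, key=lambda o: o[0])
    # Condition (a): split wherever the start-time gap exceeds 2 minutes.
    groups, current = [], []
    for o in outages:
        if current and o[0] - current[-1][0] > gap:
            groups.append(current)
            current = []
        current.append(o)
    if current:
        groups.append(current)
    # Condition (b): keep only outages overlapping another outage in the group;
    # a cluster must contain at least two outages.
    clusters = []
    for g in groups:
        kept = [a for a in g if any(a is not b and _overlap(a, b) for b in g)]
        if len(kept) >= 2:
            clusters.append(kept)
    return clusters
```

For example, two outages starting 1 min apart and overlapping in time form one cluster of size 2, while a third outage starting 9 min later is split into its own group and dropped for lacking a partner.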
The final data set processed for this study consists of 2918 clusters comprising 6942 automatic outages (or 21.6% of all 32,198 automatic outages of TADS elements 200 kV and above from 2013 to 2019). Table 1 provides a breakdown of the outages in clusters by transmission element type and by voltage class as reported in TADS. For transformers, the voltage class is the high-side voltage. Voltages are operating voltages.

Clusters by Year and Size Distribution
The outages listed in Table 1 are grouped into clusters as summarized in Table 2. The inclusion of automatic outages for all TADS elements allows the capture of more nearby overlapping outages and a better evaluation of their risks to the dynamic stability and resilience of the transmission system. As mentioned in Section 2.2.2, the clusters contain 21.6% of all automatic outages, indicating how common this type of outage and their clusters are for the North American BPS.

Table 2. Number of clusters by size (rows) and year (columns), 2013-2019 (sizes 7 and above not shown).

Size   2013  2014  2015  2016  2017  2018  2019  Total
2       290   249   340   320   336   334   349   2218
3        86    54    67    92    74    53    64    490
4        23    14    20    14    21    25    12    129
5         9     7     3     6     2     9     7     43
6         0     3     5     2     1     1     4     16

Table 2 indicates that, with the exception of the year 2014, the number of clusters in North America stayed consistent during the study period. In 2014 the number was significantly lower, and the largest cluster contained only seven outages. Overall, most clusters (76%) consist of two outages, with several outliers (clusters with sizes between 11 and 18). The average size of a cluster equals 2.4 outages. An empirical distribution of the cluster size is illustrated in Figure 1.
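The quoted summary statistics can be checked arithmetically from the totals in Section 2.2.2 and the size-2 row of Table 2:

```python
# Totals reported in the text and in Table 2.
total_clusters = 2918   # all clusters, 2013-2019
total_outages = 6942    # automatic outages contained in clusters
size2_clusters = 2218   # clusters of size 2 (Table 2)

# Share of two-outage clusters (~76%) and average cluster size (~2.4).
share_size2 = size2_clusters / total_clusters
avg_cluster_size = total_outages / total_clusters
```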

Initiating Causes of Outages
The 6942 outages in clusters are divided into 2007 momentary outages and 4945 sustained outages (i.e., outages lasting at least 1 min). The percentage of sustained outages in clusters is significantly higher than in the total population of automatic outages for years 2013-2019 (71% versus 58%). Figure 2 lists the outages by TADS initiating cause. Several of the smallest groups are not shown (together they contain less than 1% of outages in clusters). Lightning initiates the largest number of outages in clusters, but the majority of them are momentary. In contrast, Failed AC substation equipment is the leading cause of sustained outages in clusters, but it initiates a relatively small number of momentary outages. Power system condition is the third largest group. Next, we compare rankings of causes for outages in all clusters, in the largest clusters (size five and above), and for all TADS outages for the 7 years (Figure 3).
Lightning, the top cause of outages in clusters, is the second leading cause of all automatic outages in TADS, but it initiates only 8% of outages in large clusters. Unknown, the leading cause of TADS outages, ranks relatively low for clusters: it initiates 9% of outages in clusters and only 3% of outages in large clusters, because causes of larger transmission events tend to be better investigated and reported. Prominently, Power system condition causes 25% of outages in large clusters, while in TADS it ranks low (4% of TADS outages). This cause is reported for automatic outages caused by power system conditions such as instability, overload trip, out-of-step, abnormal voltage, abnormal frequency, or unique system configurations (e.g., an abnormal terminal configuration due to an existing condition with one breaker already out of service) [9].
Malicious acts (reported in TADS as Vandalism, Terrorism, or Malicious Acts) caused only five outages in the clusters over the 7 years and no outages within large clusters, which is why this cause is not shown in Figure 3.

Cluster Duration
Next, we define a cluster duration as the time elapsed between the earliest start time and the latest end time of all outages in the cluster and find that the average cluster duration is 61.4 h. The average cluster duration by cluster size is shown in Figure 4. Figure 4 confirms that there is no observable correlation between cluster size and duration (note that there are few data points for the largest cluster sizes). Further analysis reveals that sustained outages in clusters tend to be longer than sustained outages overall (the average outage duration is 51 h versus 40 h). The longest outages in clusters are initiated by Environmental, Failed AC/DC terminal equipment, and Failed AC circuit equipment (average durations of 435, 169, and 153 h, respectively). Overall, cluster duration depends more on the causes of the outages than on the cluster size.
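The cluster duration definition translates directly into a small helper, assuming outages are represented as hypothetical (start, end) datetime tuples:

```python
from datetime import datetime, timedelta

def cluster_duration_hours(cluster):
    # Duration = time elapsed between the earliest start time and the
    # latest end time of all outages in the cluster, in hours.
    start = min(o[0] for o in cluster)
    end = max(o[1] for o in cluster)
    return (end - start).total_seconds() / 3600.0
```

Note that a short outage nested inside a long one does not extend the duration, which is one reason duration tracks outage cause rather than cluster size.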


Analysis of Largest Clusters
Next, we investigate in more detail the largest clusters from 2013 to 2019. Table 3 provides a summary of the 10 largest clusters; their sizes are nine and greater. Table 3 shows that the majority of outages in the largest clusters are outages of AC circuits, with two exceptions: both clusters of size 11 had seven transformer and four AC circuit outages each. One of these clusters (2017) was initiated by human error and the second (2018) by lightning, but all remaining outages in these clusters were caused by power system conditions. These findings confirm the expectation that reported violations such as overloads and voltage problems usually trigger the operation of protection systems that trip additional system elements such as lines, transformers, generators, load, etc. Overall, Power system condition appears as an initiating cause in six of the 10 largest clusters. Additionally, six clusters contain weather-related outages (Weather, Lightning, Fire).
Another interesting observation is that all outages in each cluster of size 9-12 started simultaneously, and 11 outages in the cluster of size 14 started simultaneously 1 min after the first three outages. This observation helps explain the absence of correlation between cluster size and cluster duration: all clusters of size 9-16 are shorter than the average cluster duration of 61 h, and some of them are very short.

Distribution of Clusters by TO
The average annual number of clusters per company (TO) with TADS inventory above 200 kV was 2.4 from 2013 to 2019. It did not change significantly from year to year, similar to the total number of clusters. However, there is large variability in the number of clusters by TO; this variability primarily reflects company size. Figure 5 illustrates the distribution of the number of clusters per TO over the 7 years.

Company Risk Assessment
The cluster statistics presented in the previous sections can be used to evaluate the risk a company faces from clusters of overlapping outages. The impact I of a cluster can be defined, for example, as its size or, in a more sophisticated way, as the sum of the equivalent MVA values of the transmission elements involved in the cluster. The likelihood of a cluster can be estimated as follows. The expected number nk(7) of clusters of size k over 7 years for a company A is estimated by:

nk(7) = Nk(7) × Inv(A)/Inv(TADS) (1)

where Nk(7) is the number of clusters of size k in the TADS data from 2013 to 2019, listed in the last column of Table 2, and Inv(A) and Inv(TADS) are the numbers of TADS elements above 200 kV in company A's inventory and in the entire TADS inventory, respectively. The company risk over a period T is then:

R(T) = Σk nk(T) × Ik (2)

where k is the size of a cluster and Ik is the impact of a cluster of size k. The estimates nk(1) can further be used to calculate the likelihood of a cluster of a given size for an hour, a day, etc. The company risk, as defined by (2), is proportional to the company inventory and to the time period for which the risk is estimated. For example, assuming that the cluster impact is defined as the cluster size (Ik = k), for a company X with a transmission inventory of 62 elements at voltages above 200 kV, the 1-year cluster risk R(1 year) = 6.9 is calculated from the 1-year estimates of the numbers of clusters listed in Table 4. Thirty-nine entities (about 22% of companies with TADS elements of 200 kV and above) did not experience a single cluster from 2013 to 2019. At the other end of the distribution, 39 TOs had at least 20 clusters each; of these, four companies had more than 100 clusters each (two of them, more than 300 each) over the 7 years. These outliers are the TOs with a large inventory of TADS elements above 200 kV.
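Formulas (1) and (2) can be sketched in a few lines of Python. The system-wide counts Nk(7) and the inventory sizes below are hypothetical placeholders (the actual values come from Table 2 and the TADS inventory), so the printed risk is illustrative only:

```python
def expected_clusters(N_k, inv_company, inv_tads):
    """Formula (1): scale system-wide counts of clusters of size k to a
    company by its share of the >200 kV TADS element inventory."""
    return {k: N * inv_company / inv_tads for k, N in N_k.items()}

def cluster_risk(n_k, impact=lambda k: k):
    """Formula (2): risk = sum over cluster sizes k of n_k * I_k.
    The default uses the cluster size as its impact (I_k = k)."""
    return sum(n * impact(k) for k, n in n_k.items())

# Hypothetical 7-year system-wide counts N_k(7); real values are in Table 2.
N_k_7yr = {2: 700, 3: 150, 4: 40}
n_k = expected_clusters(N_k_7yr, inv_company=62, inv_tads=8000)
print(cluster_risk(n_k))
```

With Ik = k, the risk R can be read as the expected number of elements lost to overlapping-outage clusters over the chosen period.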

For two actual TOs that report to TADS, anonymized as companies A and B, each with an inventory of about 62 elements a year, the numbers of observed clusters from 2013 to 2019 were as follows: company A experienced three clusters of size 2; company B had 13 clusters of size 2, six clusters of size 3, and two clusters of size 4 (20 clusters in total).
These empirical data show that for company B the suggested methodology provides good estimates of the number of clusters based on the combined inventory alone; however, this is not the case for company A. A more general evaluation of the cluster risk estimates is illustrated in Figure 6. Figure 6A shows a histogram of the expected 7-year cluster risk for all TOs that reported in TADS from 2013 to 2019 and had at least one transmission element above 200 kV, with the cluster risk for each company calculated by formula (2) adjusted for 7 years. Figure 6B shows a histogram of the actual cluster impact for the same set of TOs; for these calculations, the likelihood of a cluster of a given size is replaced with the number of clusters the TO had in 2013-2019. The histograms (A) and (B) are reasonably close in the middle parts of the corresponding distributions. A predictable difference between them is that distribution (A) starts at a small positive value, since even a TO with a single reportable element has a positive expected cluster risk by formula (2).
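As a cross-check of the comparison above, the actual 7-year cluster impact (with Ik = k) for the two anonymized companies can be tallied directly from their observed cluster counts; a minimal sketch:

```python
# Observed 2013-2019 cluster counts by size for the two anonymized TOs.
clusters_A = {2: 3}
clusters_B = {2: 13, 3: 6, 4: 2}

def actual_impact(counts, impact=lambda k: k):
    """Actual cluster impact: formula (2) with the estimated likelihood
    replaced by the observed number of clusters of each size."""
    return sum(n * impact(k) for k, n in counts.items())

print(actual_impact(clusters_A))  # 6
print(actual_impact(clusters_B))  # 52
```

The order-of-magnitude gap between the two companies, despite near-identical inventories, is exactly why the inventory-only estimate works for B but not for A.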

Discussion
In addition to time-clustered outages, future work is needed to identify which of the time-clustered outages are also electrically close to each other. A good step in this direction is to examine each outage within a time cluster using existing Generation Shift Distribution Factor technology (Gen DFAX). TADS data identify each line by a unique line name identifier, including the from and to substation name identifiers of each terminal. Such TADS information could be refined to map each monitored transformer/line to a Gen DFAX table row. TADS substation identifiers could also be mapped to generation shift columns in the Gen DFAX table. The Gen DFAX table columns would need to include every 230 kV and above bus within each TO boundary; otherwise, some transformer/line buses in TADS could be missing from the Gen DFAX table. An analysis of such distribution factors could then be used to identify which of the time-clustered outages are electrically close.
Another way to obtain more precise cluster risk estimates is a detailed analysis by element type, which would require deriving count tables similar to Table 2 for all possible combinations of elements in a cluster of a given size. For example, for a cluster of size 2 there would be six possible combinations: (ac circuit, ac circuit), (ac circuit, dc circuit), (ac circuit, ac/dc back-to-back converter), etc. Moreover, if equivalent MVA is chosen as the cluster impact I, a further breakdown of element types by voltage class would be needed. This analysis is beyond the scope of this paper.
It is important to remember that due to a filtering procedure applied to the complete set of the 2013-2019 TADS automatic outages as a first step of data processing (described in Section 2.2.2), many overlapping outages are eliminated from the study (longer outages with starting times separated by greater than 2 min). Therefore, the presented statistics on clusters are intended to provide a lower estimate of the frequency of such transmission events on the system and inside a TO.
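The filtering step can be sketched as a greedy grouping on start times. This is an illustrative reconstruction, not the exact procedure of Section 2.2.2; the 2-min window is the only detail taken from the text:

```python
from datetime import datetime, timedelta

def group_by_start(outages, window=timedelta(minutes=2)):
    """Greedy grouping of outages by start time: an outage joins the current
    group if it starts within `window` of the previous outage in the group;
    groups containing a single outage are discarded (not clusters)."""
    ordered = sorted(outages, key=lambda o: o["start"])
    groups, current = [], [ordered[0]]
    for o in ordered[1:]:
        if o["start"] - current[-1]["start"] <= window:
            current.append(o)
        else:
            groups.append(current)
            current = [o]
    groups.append(current)
    return [g for g in groups if len(g) > 1]

t0 = datetime(2019, 1, 1)
outs = [{"start": t0},
        {"start": t0 + timedelta(minutes=1)},
        {"start": t0 + timedelta(hours=5)}]  # starts too late to join
clusters = group_by_start(outs)
print(len(clusters), len(clusters[0]))  # 1 2
```

Any outage starting more than 2 min after the previous one is cut off from the group, which is how longer, separately starting overlapping outages get excluded and why the reported cluster counts are a lower bound.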

Overlapping, electrically close forced outages are much more likely to challenge grid resilience. Overlapping unplanned N-2 (or N-3, etc.) outages that are electrically close are more likely to challenge the response time of the TOP or generation operators as they readjust the system prior to the final N-1 event in the cluster.
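As an illustration of how distribution factors might flag electrical proximity, the sketch below compares sensitivity rows of a hypothetical DFAX-style table using cosine similarity. This is one possible closeness metric under assumed data, not the actual Gen DFAX methodology:

```python
import math

# Hypothetical DFAX-style table: each monitored line's flow sensitivity to
# injections at a common ordered set of buses (illustrative numbers only).
dfax = {
    "line_1": [0.40, -0.25, 0.10, 0.05],
    "line_2": [0.38, -0.22, 0.12, 0.04],   # similar pattern to line_1
    "line_3": [-0.05, 0.02, 0.45, -0.30],  # very different pattern
}

def electrical_closeness(a, b):
    """Cosine similarity of two sensitivity rows: values near 1 indicate the
    lines react similarly to the same injections, i.e., are electrically close."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

close = electrical_closeness(dfax["line_1"], dfax["line_2"])
far = electrical_closeness(dfax["line_1"], dfax["line_3"])
print(close > far)  # True
```

A time cluster whose members all score highly against each other on such a metric would be a candidate for the electrically close, resilience-challenging events described above.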

Conclusions
The study reported in this paper provides insights into the quantification of power system resilience using historical outage data in TADS. We also discussed multiple challenges to grid resilience under overlapping nearby outages. The assessment of nearby outages in the operation horizon of the BPS goes beyond standard requirements. The comprehensive historical analysis of outage clusters provides an operating entity with a quantitative method to identify the outages with the highest risks. The knowledge gained from this study should help companies understand potential risks and identify mitigation measures to prevent or minimize the impacts of those outages. The approach presented here can help the industry monitor the risks of such time-clustered outages.
In addition to the TADS data analysis in this paper, other NERC-required event reports that analyze multiple outages should be cross-referenced to TADS-reported outages and noted in TADS. Based on these more in-depth, after-the-fact event reports, the associated TADS data should be updated as needed. Similarly, further analysis could include non-automatic outages at lower voltages-including manual switching errors-to quantify their contribution to the overall risk.
Future research is needed to identify better alternatives to the Gen DFAX method discussed above for identifying electrically close outages. In addition, future research on outage prediction based on machine learning algorithms is needed to proactively cope with overlapping electrically close outages and to improve grid resilience.