Classical Failure Modes and Effects Analysis in the Context of Smart Grid Cyber-Physical Systems

Abstract: Reliability assessment in traditional power distribution systems has played a key role in power system planning, design, and operation. Recently, new information and communication technologies have been introduced in power systems automation and asset management, making the distribution network even more complex. In order to achieve efficient energy management, the distribution grid has to adopt a new configuration and operational conditions that are changing the paradigm of the current electrical system. Therefore, the emergence of the cyber-physical systems concept to face future energy needs requires alternative approaches for evaluating the reliability of modern distribution systems, especially in the smart grid environment. In this paper, a reliability approach that makes use of the failure modes of the main power and cyber network components is proposed for risk analysis in smart electrical distribution systems. We introduce the application of the Failure Modes and Effects Analysis (FMEA) method in future smart grid systems in order to establish the impact of different failure modes on their performance. A smart grid test system is defined, and failure modes and their effects for both the power and the cyber components are presented. Preventive maintenance tasks are proposed and systematized to minimize the impact of high-risk failures and increase reliability.


Introduction
Electric energy plays a crucial role in today's society. It is the most versatile and easily controlled form of energy and it is involved in almost all aspects of society's daily routine.
In recent years, several new challenges have been emerging due to the expansion of renewable energy sources (intermittent sources) in the electrical grid, due to the electrification of new industrial sectors and due to the new huge volume of online data generated from electrical systems. Moreover, in the future smart grids, it is expected that energy becomes available everywhere from dispersed sources associated with the growth of mobile loads and the increasing number of energy storage equipment [1,2]. With this, new technological functionalities are required to provide energy management in a more reliable, effective and secure way.
The conventional electric grid is a passive and rigid grid characterized by predictable power flow directions, conventional energy sources and expected load profiles. In contrast, a smart grid can be described as an active grid with constant fluctuations due to the intermittent operation of renewable energy sources such as solar or wind, unexpected load profiles, and unpredictable power flow directions, all of which make the grid more dynamic. Consumers' participation in demand response and in electricity markets is also expected to play an important role in energy efficiency [2,3]. However, many new problems are arising such as: [...] FMEA procedure in a smart grid structure. As described in Section 8, the main conclusion is that maintenance tasks cannot be efficiently prioritized. The classical FMEA is successful in assembling failure modes and their causes for a smart grid. However, the classical FMEA needs to be modified to improve risk prioritization concerning the smart grid's reliability assessment and risk analysis.

FMEA applications in electrical power equipment: a brief overview
Most of the applications of FMEA in electrical power equipment were developed at the component level, that is, without considering the effect of equipment failures on systems' performance.
For example, concerning wind power technology, [26] presents a classical FMEA approach applied to assess the reliability of a 2 MW wind turbine using three commercial software packages: XFMEA from Reliasoft, Reliability Workbench from Isograph, and Relex Reliability Studio 2007 from Crimson Quality. The authors divided each of the three FMEA risk factors (Severity, Occurrence, and Detection) into four risk categories. Eight mechanical failure modes, five electrical failure modes, and three failure modes related to the turbine were identified. Results show that, when using the product of the Occurrence and Detection risk factors, FMEA underestimates the field failure rates of new turbine designs. The authors also propose that a procedure for failure prioritization using the risk priority number (RPN) could be a useful tool for designers to identify weaknesses in new wind turbine designs.
Another FMEA application and analysis on wind power is shown in [27], where onshore and offshore wind turbines were considered. The classical FMEA is compared with the authors' modified FMEA, which uses the probability of occurrence instead of an occurrence ranking as in [13], the cost of the failure mode instead of a severity rank, and a non-detection possibility based on failure data instead of a detection ranking. The paper also proposes a priority number called the cost-priority number (CPN), obtained by multiplying the three new risk factors [27]. Their results show that, in general, the priority numbers from both approaches, the RPN and the CPN, produce very similar prioritizations for most of the major components considered.
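As an illustration of how the two priority numbers described above relate, the following minimal Python sketch computes both for a few components. The component names and every numerical value are hypothetical and are not taken from [27]; the field names (`p_occ`, `cost`, `p_nondet`) are our own labels for the three modified risk factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    sev: int         # classical severity rank (1-10)
    occ: int         # classical occurrence rank (1-10)
    det: int         # classical detection rank (1-10)
    p_occ: float     # probability of occurrence (modified FMEA)
    cost: float      # cost of the failure mode (modified FMEA)
    p_nondet: float  # possibility of non-detection (modified FMEA)


def rpn(c: Component) -> int:
    """Classical risk priority number: product of the three ranks."""
    return c.sev * c.occ * c.det


def cpn(c: Component) -> float:
    """Cost-priority number: product of the three modified factors."""
    return c.p_occ * c.cost * c.p_nondet


def ranking(components, key):
    """Component names ordered from highest to lowest priority number."""
    return [c.name for c in sorted(components, key=key, reverse=True)]


# Hypothetical data, for illustration only.
components = [
    Component("gearbox", 8, 5, 6, 0.10, 250_000.0, 0.5),
    Component("generator", 7, 4, 5, 0.08, 150_000.0, 0.4),
    Component("pitch system", 5, 6, 3, 0.20, 20_000.0, 0.2),
]

print(ranking(components, rpn))  # ['gearbox', 'generator', 'pitch system']
print(ranking(components, cpn))  # ['gearbox', 'generator', 'pitch system']
```

With these made-up inputs both numbers give the same ordering; with other inputs the two rankings may of course differ.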
A non-electrical system is studied in [28], where an FMEA is conducted to assess the reliability of hydraulic turbines and to compare FMEA with the Fault Tree Analysis (FTA) method. Seven main hydraulic turbine components were considered for both analyses. This work indicates that FMEA and FTA are complementary risk analysis methodologies capable of identifying failures and tracking their possible consequences. While FMEA makes an exhaustive analysis of each failure mode, FTA provides a general view of the system and of the relations between different components.
In [29], the condition of the electrical and mechanical components of a hydropower plant (Angara-Yenisei hydropower station) is assessed. The FMECA method (FMEA plus criticality assessment) was applied to cope with the lack of statistical information about failures. The results show that FMECA allows the possible effects of the failure modes to be evaluated even when there is a gap in the failure statistics.
Another example of FMEA application is in photovoltaic (PV) systems. In [30], FMEA is applied in a simple test system composed of four PV strings, string combiner devices, inverter, cable system (aerial and underground), a three-phase transformer and also its connection to the power grid. Five risk categories were defined for each of the FMEA's risk factors ranking. The author clearly shows that FMEA can improve the early detection of some hidden failures that could not immediately affect the PV system, but would induce a degradation if no action was taken.
Another FMEA application in PV systems can be found in [31]. The authors used the criteria and practical experience of personnel working in a PV power plant instead of those of theoretical and office technicians. In total, 94 failure modes were identified, 16 of which had an RPN greater than 100 and were considered the most critical failure modes for prioritization. The authors' conclusions establish substantial differences between FMEA results obtained with criteria from practical personnel, such as maintenance operators, and results obtained with the criteria of office and management technicians, such as engineers.
A risk analysis of an energy storage system (ESS) was developed in [32], which reviews the failure modes that affect lead-acid batteries (LAB). The analysis focuses on three aspects: (i) positive active material degradation with loss of adherence to the metallic grid, and positive electrode grid corrosion; (ii) irreversible sulfating of the negative active material; and (iii) the electrolyte, separator, charge-discharge regime, and other elements that contribute to battery failure. This work shows the importance of identifying the failure modes and their associated mechanisms in lead-acid batteries and in lead-carbon batteries (LCB), because these technologies have great potential for innovation and extensive applications in solar power integration projects.
Another extensive analysis of failure modes in batteries, now for lithium-based batteries (LIB), is presented in [33]. Lithium batteries are one of the most popular energy storage technologies for several applications, including electric cars. This paper covers several experimental and simulation results to characterize different failure modes and their respective mechanisms in LIB technology. Most importantly, the authors stress the urgency of developing direct computational simulation techniques for LIB, based on chemo-mechanical models, to gain a better perspective on possible material failures [33].
FMEA has also been applied to electrical components of power systems. For example, in [34] an FMEA is conducted to assess the reliability of capacitor banks used in the power distribution system of the Sultanate of Oman. Four risk categories (catastrophic, critical, marginal and insignificant) were defined for each FMEA risk factor ranking, and seventeen main failure modes were identified and analyzed. The failure modes considered include capacitor element short circuit, open circuit, insulating liquid leakage, and leakage current in support insulators. In [35], FMEA was used to identify the main failure modes to be used as input for a probabilistic method that assesses the reliability of a 400 kV transmission system at the substation equipment level.
In [36] a modified FMEA based on Fuzzy Logic was developed. Three FMEA risk factor categories were represented by fuzzy sets and based on three continuity indexes: the loss of power in distribution transformers when a failure mode occurs, the frequency of interruption in each consumer unit, and the duration of interruption in each customer unit. Results show that the FMEA based on fuzzy logic achieves better prioritization results for the analyzed equipment.
Power transformer failures have been extensively analyzed through the FMEA method because of their high impact in terms of security and cost in electric power grids. Three recent applications are presented below. In [37], an FMEA including criticality analysis is performed on 92 power transformers, identifying three critical components: windings, with high criticality, and the on-load tap changer (OLTC) and bushings, with medium criticality. In [38], FMEA with criticality was applied to 384 non-failed distribution transformers in India. Results show that component insulation failures have the greatest RPN and are caused by corrosion, moisture, high acidity, hot spots due to overloading and/or low quantity of oil. The second priority corresponds to winding failures, which may be due to manufacturing defects, transient overvoltage, lightning, short circuits and faulty connections. The third example is described in [39], where a general FMECA is applied to assess the risk of failure of 220 kV in-service power transformers, considering the failures that can result in transformer service interruption. The authors classified the failures as minor or major and performed an FMECA for each of the two types; minor failures have no significant effect on transformer performance, while major failures are related to the degradation of the transformer's components and are irreversible. Results show that outages caused by overcurrent have the highest RPN in the minor failure analysis. Failures due to insulation deterioration have the highest RPN in the major failure analysis, followed by load tap changer failures.
In electric power distribution systems, three lines of research can be distinguished: 1) a "local" one represented by microgrids; 2) a "global and classical" one exemplified by power distribution systems; and 3) the smart grid, which is also "global" but incorporates the cyber-physical component. Some research can be pointed out. In [40], an FMEA is conducted to identify the failure modes in microgrid equipment, including different generation technologies. In [41], a classical FMECA was applied to a power distribution system located in the region of Relizane, in the northwest of Algeria; the authors conducted the FMECA according to the IEC 60812 standard [20]. Results show that it is necessary to replace most of the equipment, especially transformers and transmission lines; the analysis also made it possible to identify the critical components that must be taken into account to improve the maintenance plans. More recently, the authors presented in [25] an FMEA for a smart grid framework. A comparison with a modified FMEA that combines the classical FMEA with a fuzzy inference system was carried out to improve the prioritization of failure modes. Results clearly showed that the fuzzy-based FMEA yields better prioritization criteria for the analyzed failure modes than the classical FMEA applied to a smart grid framework.
Overall, several studies focused on RCM and alternative approaches to evaluating reliability assessment in smart grid systems, but none of them have considered FMEA as a reliable tool for risk assessment.

The classical Failure Modes and Effect Analysis (FMEA): main concept and procedure
FMEA is a systematic methodology designed to identify known and potential failure modes, their causes and effects on system performance [18,20,22,34]. It was originally used by the US Armed Forces in 1949 [42] to classify failures "according to their impact on mission success that was related to the personnel and equipment safety." Later, it gained momentum through its use in the Apollo program in the 1960s, following its application in the aerospace industry. As defined in [22], FMEA is a method designed to:
• Identify and fully understand potential failure modes and their causes, and the effects of failure on the system or end users, for a given product or process;
• Assess the risk associated with the identified failure modes, effects, and causes, and prioritize issues for corrective action; and
• Identify and carry out corrective actions to address the most serious concerns.
FMEA can be viewed as a proactive procedure for evaluating a process by identifying where and how it might fail and assessing the relative impact of different failures [43]. Although the primary objective of FMEA is to improve the system design, it can be applied at any stage of a project to mitigate potential future risks produced by failure modes. FMEA is conducted by a cross-functional team of subject matter experts that analyzes the system to identify weaknesses and propose corrective actions that prevent a negative impact on the system's performance [22]. At this point, it is important to note that the objective of FMEA is not to predict failures. Its aim is to identify existing and potential failures through a subjective and systematic assessment and to classify those failures according to a risk measure.
The three risk factors are:
• Severity (SEV): assesses the significance of the failure mode's effect on system operation;
• Frequency of Occurrence (OCC): represents the number of times the failure mode occurs. This risk factor is related to the failure rate;
• Detectability (DET): represents how detectable a certain failure can be before it happens.
The risk factors OCC, SEV, and DET are divided into categories. In the classical FMEA, each of these categories is rated by an integer, usually on a scale from 1 to 10 as in [18], or from 1 to 5 as in [25]. The categories and ratings for S, O, and D can be the same as those proposed in standards related to classical FMEA, such as IEC 60812:2006 [20], or can be specially defined depending on the characteristics of the problem.
The risk factors' categories and ratings used in this work are listed in Table 1 (Frequency of occurrence), Table 2 (Severity), and Table 3 (Detectability).

Table 1. Ratings for failure mode's frequency of occurrence (OCC)
Rating | Occurrence | Failure probability
9 | Very high | 1 in 3
8 | Repeated failures | 1 in 8
7 | High | 1 in 20
6 | Moderately high | 1 in 80
5 | Moderate | 1 in 400
4 | Relatively low | 1 in 2000
3 | Low | 1 in 15000
2 | Remote | 1 in 150000
1 | Nearly impossible | ≤1 in 150000

Table 2. Traditional ratings for failure mode's severity (SEV) [18]
Rating | Effect | Severity of effect
10 | Hazardous without warning | The highest severity ranking of a failure mode, occurring without warning and with the consequent hazard
9 | Hazardous with warning | Higher severity ranking of a failure mode, occurring with a warning and the consequent hazard
8 | Very high | Operation of the system is broken down without compromising safety
7 | High | Operation of the system may be continued, but its performance is affected
6 | Moderate | Operation of the system is continued, but its performance is degraded
5 | Low | Performance of the system is affected seriously, and maintenance is needed
4 | Very low | Performance of the system is less affected, and maintenance may not be needed
3 | Minor | System performance and satisfaction with minor effect
2 | Very minor | System performance and satisfaction with a slight effect
1 | None | No effect

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 20 February 2020 doi:10.20944/preprints202002.0295.v1

Table 3. Traditional ratings for failure mode's detection (DET) [18]
Rating | Detection | Criteria
10 | Absolutely impossible | Design control does not detect a potential cause of failure mode, or there is no design control
9 | Very remote | Very remote chance the design control will detect a potential cause of the failure or subsequent failure mode
8 | Remote | Remote chance the design control will detect a potential cause of the failure or subsequent failure mode
7 | Very low | Very low chance the design control will detect a potential cause of the failure or subsequent failure mode
6 | Low | Low chance the design control will detect a potential cause of the failure or subsequent failure mode
5 | Moderate | Moderate chance the design control will detect a potential cause of failure or subsequent failure mode
4 | Moderately high | Moderately high chance the design control will detect a potential cause of the failure or subsequent failure mode
3 | High | High chance the design control will detect a potential cause of the failure or subsequent failure mode
2 | Very high | Very high chance the design control will detect a potential cause of the failure or subsequent failure mode
1 | Almost certain | Design control will almost certainly detect a potential cause of failure or subsequent failure mode

Based on these three risk factors, a risk priority number (RPN) is calculated as the product of S, O, and D, i.e., RPN = S × O × D, and is used as a metric for evaluating each failure mode in the FMEA. Because the RPN in the classical FMEA approach is a single arithmetic product of three integers, its calculation involves no computational complexity. The higher the RPN of a failure mode, the greater the risk to system reliability. Hence, corrective actions should be taken preferentially on the high-risk failure modes so that the availability of the system increases. As will be shown in the discussion section, the RPN calculation is an important issue for FMEA.
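The RPN calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch that assumes the 1 to 10 scale for all three risk factors; the function name `rpn` and the example ratings are ours. It also shows a commonly noted property of the product: different rating combinations can lead to the same RPN, so the 1000 possible combinations map onto far fewer distinct values.

```python
from itertools import product

SCALE = range(1, 11)  # assumed rating scale: integers from 1 to 10


def rpn(sev: int, occ: int, det: int) -> int:
    """Risk priority number as the product of the three ratings."""
    for name, value in (("SEV", sev), ("OCC", occ), ("DET", det)):
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    return sev * occ * det


# Enumerate every possible combination of ratings on the 1-10 scale.
all_rpns = [rpn(s, o, d) for s, o, d in product(SCALE, repeat=3)]
distinct_rpns = sorted(set(all_rpns))

print(min(all_rpns), max(all_rpns))       # 1 1000
print(len(all_rpns), len(distinct_rpns))  # 1000 120

# Two very different (illustrative) risk profiles with the same RPN.
print(rpn(10, 2, 2), rpn(2, 5, 4))        # 40 40
```

In this sketch a hazardous but rare and easily detected failure (10, 2, 2) receives the same priority as a minor one (2, 5, 4), which illustrates why ties need expert judgment when ordering failure modes.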
However, classical FMEA is still an important tool for reliability and risk assessment in highly complex industries such as aerospace, nuclear and petrochemical. The flowchart in Figure 1(a) shows how the 10 steps for conducting a classical FMEA are linked [18,20]. Notice the main loop in the flowchart: once all RPNs are computed, the recommended corrective actions must be implemented, and the evaluation is then performed again to verify whether these corrective actions reduced the risk in the system.
The final FMEA report must contain all the failure modes ordered by their RPN ranking, registered in a worksheet whose columns (Figure 1(b)) contain information about the component, associated failure mode(s), causes, consequences, detection methods, recommended actions, and the corresponding ratings for the S, O and D risk factors. Every FMEA report should include a section detailing all assumptions considered in the FMEA performed.

The test system architecture for a smart grid
In this section, a smart grid test system is presented for studying how the cyber-power interdependencies coupled with different failure modes disturb the grid performance. Failure modes are identified for both the power and the cyber components, and a complete FMEA is applied. Figure 2 shows the cyber and power architecture of the smart grid test system. The 30 kV power network, depicted in black lines at the bottom of Figure 2, is a meshed grid consisting of four 30 kV substations. The 30 kV grid presents redundancy, i.e., there are different paths for energy transport between busbar No. 1 (B1), busbar No. 2 (B2), busbar No. 3 (B3) and busbar No. 4 (B4). [...] storage system (ES) connected to busbar B3. A total of four power transformers (TR1, TR2, TR3, and TR4) and fifteen circuit breakers (CB1, CB2, …, CB15) are also included in the power network. Consumers in Figure 2 are represented as three load points named LPB2, LPB3, and LPB4, connected to busbars B2, B3, and B4, respectively. Load LPB2 represents a 20 MW residential area, while LPB3 and LPB4 represent industrial and commercial areas with loads of 85 MW and 40 MW, respectively.
Regarding the power equipment, only busbars, power cables (aerial lines L1, L4), circuit breakers (CB) and power transformers are considered in this FMEA. The storage facility and generation stations were not included in this FMEA.
Failure rates for each component were collected from two main sources: statistical data obtained from the Portuguese electrical utility, and a set of specialized databases and manufacturer datasheets [45][46][47][48][49]. Table 4 lists the failure rates used in our research for each power component. Note that, for aerial cables and for simplification purposes, the substations were assumed to be equally spaced (about 2.5 km apart).

Description of the cyber network of the smart grid test system
Overlaid on the power network in Figure 2 is a cyber network that monitors, protects, and controls the power system. Among all possible cyber network topologies, a cyber-ring topology was selected because its elementary architecture provides an acceptable level of reliability with a redundant path for data transmission. The cyber-control network shown in Figure 2 is a bus topology LAN-Ethernet and WAN-optical fiber network consisting of human-machine interfaces (HMIs), Ethernet switches (SWs), servers (SVs), energy boxes (EBs), intelligent electronic devices (IEDs), and Ethernet and optical fiber links (all marked in blue, red and green lines in Figure 2).
The metering infrastructure is composed of smart meters, designated in Figure 2 as energy boxes (EBs), which are linked to load points in order to collect data about energy consumption. In practice, each customer is assumed to be connected to a single EB. However, for simplicity, in this work we consider only one main EB for all customers at each load point.
IEDs act as interface devices between the power and communication networks, including measuring units, protective relays, and controllers. Each IED is responsible for monitoring and for executing the commands received from the HMIs. Table 5 lists the cyber-power links between each IED controller in the Figure 2 network and its corresponding power elements (buses and circuit breakers).
As indicated in Figure 2, each IED or EB element is connected to an Ethernet switch (SW) through a LAN-Ethernet connection; the switch is then responsible for redirecting information through the corresponding communication links. The Ethernet switches are all connected in a ring topology through WAN-optical fiber network links (green lines in Figure 2). Finally, a central Ethernet device (MAIN SW) is responsible for gathering information from all points of the communication network and sending it to the corporate and control centers (blue blocks at the top of Figure 2).
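The redundant data path provided by a ring of switches can be checked with a small connectivity sketch. The switch names and the number of switches below are hypothetical (the exact ring membership of the test system is given only in Figure 2), and the code illustrates the general property of a ring, not the authors' model: the ring stays connected after any single link failure, but not after any two.

```python
from itertools import combinations


def ring_links(nodes):
    """Links of a ring: each node is connected to the next, and the last to the first."""
    return [frozenset((nodes[i], nodes[(i + 1) % len(nodes)])) for i in range(len(nodes))]


def is_connected(nodes, links):
    """Depth-first search over the surviving links."""
    adjacency = {node: set() for node in nodes}
    for link in links:
        a, b = tuple(link)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == len(nodes)


def survives(nodes, links, n_failures):
    """True if the network stays connected for every set of n_failures failed links."""
    return all(
        is_connected(nodes, [link for link in links if link not in failed])
        for failed in combinations(links, n_failures)
    )


# Hypothetical ring of five switches, for illustration only.
switches = ["MAIN SW", "SW1", "SW2", "SW3", "SW4"]
links = ring_links(switches)

print(survives(switches, links, 1))  # True
print(survives(switches, links, 2))  # False
```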
In the control center, all data concerning the power system status is available for monitoring, analysis, and decision-making. The control center is responsible for scheduling power generation to meet consumers' demand, and for managing major system problems by executing automatic procedures or manual instructions through the HMIs. Real-time data gathered from the power system are also displayed on the HMI, which allows real-time intelligent data handling and network status monitoring. As shown in red at the top left of Figure 2, an Inter-Control Center Communications Protocol server (ICCP server) is specified to provide data exchange over WANs between utility control centers and substations. As also indicated at the top of Figure 2, an APPLICATIONS SERVER and an ENGINEERING SERVER manage a large amount of data and information that is stored in an ENGINEERING DATABASE.
Table 5. Cyber-power links between power and cyber network
The CORPORATE CENTER (top right of Figure 2) is responsible for managing a high number of energy market players that will compete to provide the best power quality at the best price. Cost fluctuations in energy generation (due to different penetration levels of distributed generation and dynamic energy demand) are managed in the BUSINESS SERVER in order to achieve cost-effective operation and optimize the balance between energy demand, storage, and production. A CORPORATE DATABASE is responsible for collecting and storing all energy market information in the corporate center, while the E-MAIL SERVER, WEB APPS SERVER, and FILE TRANSFER PROTOCOL (FTP) servers make it accessible to all market stakeholders.
The reliability values of the cyber equipment described in the preceding paragraphs and used in this work are listed in Table 6. All values were obtained from datasheets and reliability statistics [32-37], and all were derived using reliability theory about failure rates [44]. For the Ethernet links, however, reliability data was not found explicitly in the literature. To overcome this, a very low failure rate value was assumed. Concerning the optical fiber links, a total length of 10 km was assumed in the communication network.
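As a reminder of how failure rates relate to reliability figures, the sketch below uses the standard constant-failure-rate (exponential) model, R(t) = exp(-λt) and MTBF = 1/λ, and scales a per-kilometre rate by the link length. The exponential model is our assumption for illustration; the per-kilometre rate is a hypothetical value and is not one of the values in Table 6. Only the 10 km fiber length comes from the text.

```python
import math

HOURS_PER_YEAR = 8760


def link_failure_rate(rate_per_km_year: float, length_km: float) -> float:
    """Failure rate of a link (failures/year), assumed proportional to its length."""
    return rate_per_km_year * length_km


def reliability(rate_per_year: float, years: float = 1.0) -> float:
    """Probability of no failure over the period, for a constant failure rate."""
    return math.exp(-rate_per_year * years)


def failure_probability(rate_per_year: float, years: float = 1.0) -> float:
    """Probability of at least one failure over the period."""
    return 1.0 - reliability(rate_per_year, years)


def mtbf_hours(rate_per_year: float) -> float:
    """Mean time between failures in hours (1 / failure rate)."""
    return HOURS_PER_YEAR / rate_per_year


# Hypothetical per-kilometre rate; the 10 km length is the one assumed in the text.
fiber_rate = link_failure_rate(rate_per_km_year=0.002, length_km=10.0)

print(f"{failure_probability(fiber_rate):.4f}")  # 0.0198
print(f"{mtbf_hours(fiber_rate):.0f}")           # 438000
```

A yearly failure probability obtained in this way is the kind of quantity that can then be matched against an occurrence rating table.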

Identifying potential failure modes in the smart grid test system
The potential failure modes that can occur in the smart grid test system in Figure 2 need to be evaluated in terms of their causes and their influence on the system. To this end, this section summarizes the potential failure modes of each piece of equipment considered in our smart grid test system. Each piece of equipment is first categorized according to its type and function in the system. Several failure modes are then defined and described for each one.
The assessment rests on two assumptions: 1. The analysis focuses on the identification of single failures of smart grid components; and 2. Complex interdependencies and cascading failures are out of scope for the current analysis.
The power equipment included in our analysis comprises four components: busbar, power cable, circuit breaker, and power transformer. For each one, a set of failure modes and associated criteria was identified, as listed in Table 7.

Table 7. Failure modes for power equipment considered for analysis.
Failure mode | Criteria
Busbar
Loss of structural integrity | The metallic strip can lose its mechanical integrity due to support insulator breakdown, cracking of welds and fracture of the copper bar.
Loss of electrical continuity | The occurrence of arc flashes degrades the copper bar.
Loss of electrical efficiency | Moisture and humidity can lead to short circuits.

The failure modes of the cyber-control equipment are listed in Table 8. The list covers the five cyber-control devices considered: Intelligent Electronic Device (IED), server (SV), Human-Machine Interface (HMI), Ethernet switch (SW), and Energy Box (EB).
Security failure and power failure were considered for all devices. Security failure is related to the susceptibility of cyber equipment to losing its integrity, while power failure is related to an interruption of its power supply that affects the normal operation of the cyber network.
The IED defective communication is the failure mode associated with damaged transducers or a poor signal causing intermittent communication between the IED and the rest of the cyber network.
The server (SV) data overload is the failure mode associated with insufficient storage capacity or an unexpectedly large amount of data to store, which can result in defective data storage. Hardware crash is another failure mode, related to physical damage caused by overheating or humidity that leads to a hard drive crash and thus to loss of data. Finally, any software error that corrupts stored data results in an operational failure mode.
An HMI data error is a failure mode generally associated with inherent problems in HMI operation that compromise its normal functioning.
Two failure modes attributed in Table 8 to an Ethernet switch (SW) are related to cyber-attacks: the Performance decrease and the Network/Cyber storm failure modes. Packet congestion and/or the uncontrolled broadcast of an excessive number of messages in a communication network can decrease the operational performance of the SW or even congest its operation. Finally, an SW Operational failure caused by a bad SW configuration or a module failure can black out its operation.
The Energy Box has a Catastrophic failure mode associated with temperature stresses that can severely damage the EB. Power consumption misreading and Operational failure are two failure modes related to incorrect data acquisition. Manual manipulation, significant measurement error, improper EB programming, and defective installation all result in incorrect data acquisition.
Regarding network links, two types were considered: optical fiber links for long-distance communications, and Ethernet links for short distances. Their inherent characteristics result in the different failure modes described in Table 9. Optical fiber links have a set of failure modes that are all related to their physical integrity: Fracture, Lead-bonds degradation, and Humidity induced failure modes. Ethernet link failures degrade network performance by decreasing the available capacity and disturbing IP-packet forwarding. Hardware or software failures can happen at the protocol network layers. Integrity defects such as manufacturing imperfections, incorrect connections or degradation of the RJ45 connectors, for example, may lead to a loss of physical connectivity in the network hardware or to Link breakdown. Signal superposition usually occurs when electromagnetic coupling between adjacent pairs of wires causes signal interference. This is referred to as Crosstalk and becomes more frequent as the signal frequency increases.

Manufacturing imperfection, incorrect installation or RJ45 connector degradation results in delays in data transmission, or even its interruption.

Link breakdown
Cable breakdown due to external physical damage.

FMEA analysis and its results
A complete FMEA was carried out on the smart grid test system in Figure 2, which represents a typical cyber-power network. Using the failure modes systematized in the previous section, we search for the causes and potential impacts of each power and cyber equipment failure on the smart grid. Our FMEA not only takes into account the main interdependencies between the power and cyber system topologies, but also proposes mechanisms that prevent the cause of each failure mode from occurring (current controls).
The three risk factors (Severity (SEV) in Table 1, Occurrence (OCC) in Table 2 and Detection (DET) in Table 3) were first assigned for each failure mode:
• For the Severity (SEV) rating, the seriousness of the failure and of its effects on the system is taken into consideration;
• For the Detection (DET) rating, the ability to detect the failure before it can affect the system is considered; and
• For the Occurrence (OCC) rating, the value is set according to the equipment's failure rates, as specified in Table 4 and Table 6.
All ratings are assigned according to the FMEA evaluators' expert criteria. Even the Occurrence (OCC) rating, which could in principle be determined accurately from failure rates, can be revised when a specific cause of failure seems more or less likely to occur according to the evaluators' criteria.
In general, a failure mode is expected to be assigned different Detection (DET) and Occurrence (OCC) ratings depending on the cause that triggered it, whereas its Severity (SEV) rating is unique. Since each failure mode's priority is evaluated by its RPN value, the product of the SEV, OCC and DET ratings, and each cause of failure has its own RPN, the same failure mode may end up with several different RPNs.
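The per-cause RPN calculation can be sketched in a few lines of Python. This is only an illustration: the 1 to 10 rating scale, the class and function names, and the ratings of the two causes are assumptions made for the example and are not taken from Tables 1 to 3 or Table 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCause:
    """One cause of a failure mode with its own OCC and DET ratings."""
    name: str
    occ: int  # Occurrence rating (assumed scale: 1 to 10)
    det: int  # Detection rating (assumed scale: 1 to 10)


def rpn(sev: int, occ: int, det: int) -> int:
    """Classical Risk Priority Number: SEV x OCC x DET."""
    for label, value in (("SEV", sev), ("OCC", occ), ("DET", det)):
        if not 1 <= value <= 10:
            raise ValueError(f"{label} rating {value} is outside the assumed 1-10 scale")
    return sev * occ * det


def rpn_per_cause(sev: int, causes: list[FailureCause]) -> dict[str, int]:
    """One SEV per failure mode, one RPN per cause of that failure mode."""
    return {cause.name: rpn(sev, cause.occ, cause.det) for cause in causes}


# Hypothetical failure mode with a single SEV rating and two causes.
causes = [
    FailureCause("cause A", occ=5, det=4),
    FailureCause("cause B", occ=3, det=8),
]
rpns = rpn_per_cause(6, causes)
print(rpns)  # {'cause A': 120, 'cause B': 144}
```

The same failure mode therefore appears once per cause in a ranking such as Table 10, each time with a different RPN.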
Our research identified and analyzed a total of 107 failure modes associated with the smart grid test system; the complete FMEA table can be found in [44]. For this paper, we selected the 42 highest-risk failure modes, listed in Table 10 in order from the most to the least risky. Table 10 also includes the potential failure cause and the recommended actions suggested to minimize the impact of those failure modes on the smart grid.
Examining the costs and causes of power and cyber incidents using the 42 highest risk failure modes in Table 10 leads to the following observations.
Power equipment incidents:
2) Bus bar failure modes were also identified as critical (ranks 5, 8 and 9), in the sense that their impact on the smart grid is significant, mainly due to several associated failure modes with high RPNs.
Cyber equipment incidents:
1) For cyber equipment, the failure modes with the highest RPNs are those related to operational failures in Human-Machine Interfaces (HMIs), with RPN = 400, shutdown of Ethernet switches (SWs), reaching RPN = 360, and control failures in Intelligent Electronic Devices (IEDs), achieving RPN = 392;
2) Ethernet links, optical fiber links and Energy Boxes (EBs) proved to be the least critical equipment in the cyber system, mainly due to their low failure rates;
3) Failure modes related to security, despite the enormous impact cyberattacks can cause, were not indicated by FMEA as high-risk failures. For example, servers (SV) achieved a security failure RPN of only 200. This is explained by low occurrence ratings: in spite of the expected increase in cyberattack attempts in future years, these attempts will not necessarily be successful;
4) Power outages in the power supply of cyber equipment are expected to be less frequent, and thus appear in Table 10 with lower RPN values.
An overall look at the Table 10 outcomes gives two important indications:
1) Although all ratings were treated as equal, the Occurrence (OCC) rating varies little between failure modes with high and low RPNs. Hence, it is not a decisive rating for high-risk failures;
2) Failure modes characterized by high levels of unpredictability are likely to be more critical. These modes occur without early warning and are difficult to prevent, and their strong negative impacts on smart grid operation are reflected in high Severity (SEV) ratings.
Finally, a conclusion regarding human interference in future smart grids must be pointed out. The HMI's operational failure due to human error proves to have negative impacts on the grid. This human error is unintentional, and its high probability of occurrence and unpredictability (as seen in Table 10) make it a high-risk failure cause. We therefore expect that one of the main weaknesses of future smart grids will be related to tasks that demand human intervention.
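The RPN-based ordering used for Table 10 amounts to a descending sort. The sketch below applies it only to the four cyber failure modes whose RPN values are quoted in this section; the function name and tuple layout are illustrative assumptions, and the resulting ranks hold within these four entries only, not within Table 10.

```python
# (equipment, failure mode, RPN) for the four cyber failure modes quoted in the text.
quoted = [
    ("SV", "Security failure", 200),
    ("SW", "Shutdown", 360),
    ("HMI", "Operational failure", 400),
    ("IED", "Control failure", 392),
]


def rank_by_rpn(entries, top_n=None):
    """Order entries from highest to lowest RPN and prepend a 1-based rank.

    Entries with equal RPN keep their input order (sorted() is stable).
    If top_n is given, only the top_n highest-risk entries are returned.
    """
    ordered = sorted(entries, key=lambda entry: entry[2], reverse=True)
    if top_n is not None:
        ordered = ordered[:top_n]
    return [(rank, *entry) for rank, entry in enumerate(ordered, start=1)]


for row in rank_by_rpn(quoted):
    print(row)
# (1, 'HMI', 'Operational failure', 400)
# (2, 'IED', 'Control failure', 392)
# (3, 'SW', 'Shutdown', 360)
# (4, 'SV', 'Security failure', 200)
```

Calling `rank_by_rpn(entries, top_n=42)` on the full list of failure mode-cause pairs would mirror the selection of the 42 highest-risk entries described above.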

Discussion
Before using the FMEA results obtained, it is important to note that significant information is lost during a classical FMEA procedure. This can compromise important conclusions concerning high-risk failure modes and their impact on the reliability of the system. Table 10 shows the FMEA result as a prioritization of high-risk failure modes (based on their RPN value) driven by their high-risk causes of failure. According to FMEA, maintenance strategies should therefore be prioritized from the highest RPN to the lowest in order to increase the smart grid's reliability. This implies that the causes of failure must receive special attention in any maintenance task. Addressing them will decrease or eliminate the risk of a failure in the system, thus reducing the impact of the failure mode on the smart grid. Such tasks should be established to decrease the number of times the respective failure manifests itself, so that system reliability increases as intended. However, this also means that numerous failure causes are overlooked whenever the high-risk causes of each failure mode are not taken into account in the final FMEA analysis. Some failure modes with critical causes sometimes have lower RPN values than certain less critical failure modes, which are nevertheless prioritized because of their higher RPN. In these situations, maintenance strategies for the failure modes with lower RPN values may be ignored if the FMEA approach is used. For instance, Table 11 contains selected failure modes extracted from Table 10 whose causes have equal Severity (SEV) ratings but different RPNs. The busbar's Electrical disturbances failure mode, for example, can be caused by short circuits between bars of different phases (RPN of 320) or by harmonics (RPN of 256). Although harmonics still have a high RPN, meaning that they are a high-risk cause of failure, their importance could be neglected because they are ranked 7th among the 10 failure mode-cause pairs shown in Table 11 and, therefore, maintenance strategies would not be recommended for this failure mode-cause pair [44].
Table 11. Selected failure modes for analysis and discussion.

Equipment | Failure Mode(s) | Failure Cause | OCC | DET | SEV | RPN | RANK
Similarly, different combinations of OCC, SEV and DET values may result in the same RPN, but with different hidden risk implications. For example, the wrong operation of the CB due to overloads and magnetic-core delamination in transformers have the same RPN (168), but their ratings are different. Their impacts on the system could be different, but FMEA cannot distinguish them. This clearly shows that FMEA is limited in the prioritization of maintenance tasks: it is not able to assign different weights to its ratings, which can lead to a misreading of the risk of a failure mode. For an adequate application of FMEA, it is of utmost importance to assemble subject experts with a high level of knowledge of smart grid operation, because failure modes and failure causes must be enumerated, exhaustively detailed and discussed in order to evaluate, as accurately as possible, the impacts of a failure on the smart grid.
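The ambiguity of a single RPN value can be made concrete by enumerating every rating combination that produces it. The sketch below does this for the RPN of 168 mentioned above, assuming each rating is an integer from 1 to 10 (an assumption about the rating scales, which are defined in Tables 1 to 3); the function name is illustrative.

```python
from itertools import product


def rating_triples(target_rpn, scale=range(1, 11)):
    """All (SEV, OCC, DET) triples on the given scale whose product is target_rpn."""
    return [
        (sev, occ, det)
        for sev, occ, det in product(scale, repeat=3)
        if sev * occ * det == target_rpn
    ]


triples = rating_triples(168)
print(len(triples))                        # 12
print(sorted({sev for sev, _, _ in triples}))  # [3, 4, 6, 7, 8]
```

Under this assumption, twelve different rating combinations share the RPN of 168, with Severity ranging from 3 to 8, so two entries with the same RPN can describe very different risks.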
In the literature, we found a lack of failure rate information broken down by failure mode, for both power and cyber equipment. Even the data found at the Portuguese electric energy utility (EDP Distribuição), a large company with an interest in cost-effective maintenance methodologies, was inconclusive. In our research, the failure rates of individual failure modes were subjectively derived from equipment failure rates, which may have led to some errors in the final RPN calculation, especially in the OCC rating, which, as obtained, appeared to have little impact on the RPN.
For FMEA to be correctly applied, experimental failure rates must be detailed for each failure mode. If possible, extensive research would be useful to obtain experimental rates for each cause of failure. For a deeper understanding of the criticality of a given failure, collecting data on the frequency of failure of each piece of power and cyber equipment, specifying failure rates for each failure mode and its causes, would therefore be valuable for reliability purposes. Knowing the frequency of a certain failure, while bearing in mind the real impact that the failure triggers in the smart grid, would make FMEA more efficient (a more reliable OCC rating) and maintenance strategies more precise (strategies based on maintenance frequency adjustments would be improved).
Finally, in order to ensure a high level of system reliability, a cost-effective maintenance strategy must be achieved by prioritizing failure modes from the most critical to the least critical, while taking into consideration the maintenance costs of each piece of equipment and each failure mode. Regarding the level of risk of the analyzed smart grid test system (the economic side is not evaluated in the present study), it is of utmost importance to establish maintenance strategies according to their risk number.
To increase the reliability of a smart grid topology, three kinds of strategies must all be pursued: (i) mitigating or eliminating failure modes in order to decrease the OCC rating, (ii) increasing failure detectability in order to lower the DET rating, and (iii) minimizing losses or negative impacts when a failure occurs in order to reduce the SEV rating.
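The effect of the three kinds of strategies on the RPN can be illustrated with a small what-if calculation. The ratings below are hypothetical, the 1 to 10 scale is assumed, and the function is only a sketch of how a one-point improvement in each rating propagates to the RPN.

```python
def rpn(sev, occ, det):
    """Classical Risk Priority Number: SEV x OCC x DET."""
    return sev * occ * det


def rpn_after_actions(sev, occ, det, occ_drop=0, det_drop=0, sev_drop=0):
    """RPN after lowering each rating by the given number of points.

    Ratings are clamped at 1, the lowest value on the assumed 1-10 scale.
    """
    return rpn(
        max(1, sev - sev_drop),
        max(1, occ - occ_drop),
        max(1, det - det_drop),
    )


# Hypothetical ratings for one failure mode-cause pair.
sev, occ, det = 8, 5, 8
print(rpn(sev, occ, det))                           # 320 (baseline)
print(rpn_after_actions(sev, occ, det, occ_drop=1))  # 256 (i: lower OCC)
print(rpn_after_actions(sev, occ, det, det_drop=1))  # 280 (ii: lower DET)
print(rpn_after_actions(sev, occ, det, sev_drop=1))  # 280 (iii: lower SEV)
print(rpn_after_actions(sev, occ, det, 1, 1, 1))     # 196 (all three)
```

In this example a single-point change in one rating moves the RPN by 40 to 64 points, which also illustrates how sensitive the RPN is to small variations in its factors.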

Conclusions
This paper analyses the application of classical FMEA in a smart grid environment. A simple smart grid test system with power and cyber components was defined. A qualitative reliability assessment was performed, and a critical analysis of the FMEA results was carried out. From the results and discussion presented, six critical conclusions can be drawn:
1) The ten highest-risk failure modes are related to servers, transformers, HMIs, IEDs, busbars, power cables and Ethernet switches;
2) Short circuits are the causes of the riskiest failure modes in power equipment;
3) For cyber equipment, human and software errors (associated with HMIs and servers) are considered causes of high-risk failure modes;
4) The RPN value is highly sensitive to small variations in the three risk factors SEV, OCC, and DET;
5) RPN-based prioritization of failure modes is not adequate for applications in smart grids, because it does not take into account the relative importance of the three risk factors, nor the fact that this importance is different for each piece of equipment. For example, the relative importance of the severity factor for a transformer is different from that of the same factor for an Ethernet switch; and
6) There is a lack of information regarding the failure rates associated with each analyzed failure mode, because the occurrence of a failure is recorded without identifying which failure mode it relates to.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 20 February 2020 doi:10.20944/preprints202002.0295.v1
It is important to highlight that classical FMEA is successful in assembling the failure modes and their causes in a given smart grid. However, for a better reliability assessment and risk analysis of a smart grid, FMEA needs to be modified to improve risk prioritization. Since power system reliability assessment is usually conducted considering component failures as a whole, that is, without differentiating the failure modes that drive the component failure, FMEA can be used first to identify the criticality of the failure modes; these critical failure modes can then be used as inputs for a quantitative reliability analysis instead of a single failure rate for each component.
The component failure rate used in reliability analysis is a composition of the failure probability functions of each of the failure modes identified for that component. This implies that reliability analysis would consider both the critical and non-critical failure modes. Therefore, by considering the most critical failure modes of each piece of equipment and using them as inputs for the quantitative reliability analysis, it would be possible to improve the understanding of the failure mechanisms that reduce the system's reliability, allowing maintenance efforts to focus on reducing the impact of those specific failure modes. For this reason, it is important to record failure statistics at the failure mode level and not only at the component level.
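One simple way to picture this composition is the textbook case of independent failure modes with constant failure rates, where any single mode is enough to fail the component. Under those assumptions, which are introduced here for illustration and are not a model proposed in this paper, the component failure rate is the sum of the failure mode rates and the reliability is R(t) = exp(-λt). The mode names and rates below are hypothetical.

```python
import math


def component_failure_rate(mode_rates):
    """Sum of per-mode rates (independent modes, constant rates assumed)."""
    return sum(mode_rates.values())


def component_reliability(mode_rates, t_years):
    """R(t) = exp(-lambda_total * t) under the same assumptions."""
    return math.exp(-component_failure_rate(mode_rates) * t_years)


def mode_shares(mode_rates):
    """Fraction of the component failure rate contributed by each failure mode."""
    total = component_failure_rate(mode_rates)
    return {mode: rate / total for mode, rate in mode_rates.items()}


# Hypothetical failure mode rates for one component, in failures per year.
mode_rates = {"mode A": 0.010, "mode B": 0.004, "mode C": 0.001}

print(round(component_failure_rate(mode_rates), 6))  # 0.015
print(component_reliability(mode_rates, t_years=10))
print(mode_shares(mode_rates))
```

With failure statistics recorded per failure mode, the shares returned by `mode_shares` would show directly which mode dominates the component failure rate and therefore where maintenance effort is best spent.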