Resilience Assessment: A Performance-Based Importance Measure

The resilience of a system can be considered a function of its reliability and recoverability. Hence, for effective resilience management, the reliability and recoverability of all components that build up the system need to be identified. After that, their importance should be quantified using an appropriate model for future resource allocation. Critical infrastructures are under dynamic stress due to operational conditions. Such stress can significantly affect the recoverability and reliability of a system's components, the system configuration, and, consequently, the importance of components. Hence, its effect on the developed importance measure needs to be identified and then quantified appropriately. Dynamic operational conditions can be modeled using risk factors. However, in most of the available importance measures, the effect of risk factors has not been addressed properly. In this paper, a reliability importance measure is used to determine the critical components considering the effect of risk factors. The application of the model is shown through a case study.


Introduction
Critical infrastructures are complex systems whose high performance requires proper interaction between hardware, software, and wetware (the humans involved in the design and operation of these systems). The external and internal working conditions of infrastructures are dynamic and can constantly change the performance characteristics of these systems. For example, dynamic operational conditions can affect equipment reliability and recoverability, two characteristics of infrastructure resilience (see Figure 1). A change in reliability can cause an unexpected breakdown. For example, the effects of ambient temperature on the reliability and recoverability of power distribution have a dynamic nature, and a sudden low temperature can cause an unexpected power outage. Such unexpected stoppages need to be considered in any contingency plan. Moreover, it is important to clearly understand the importance of each component that builds up a critical infrastructure and its sensitivity to any change in operational conditions.
Recently, different resilience metrics have been developed to assess the resilience of systems in different sectors. Figure 1 shows the concept of resilience and some key concepts used to represent the system state pre-disruption, during disruption, and post-disruption. The resilience of a technical system can be defined as the ability to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and with acceptable composite costs and risks [1,2].
In other words, resilience can be interpreted through the probability that system conditions exceed an irrevocable tipping point. This probability covers different areas that can be evaluated by different approaches and indices. Reliability (uptime) and recoverability (downtime) performance have been the most dominant probabilistic performance measurement tools. As Figure 1 shows, the system is in its original stable state before a disruption occurs, starting from time t0 (the normal or baseline state). Reliability is defined as the probability that a system can perform a required function under given conditions at a given instant of time or over a given time interval, assuming the required external resources are provided [3][4][5][6][7][8][9][10]. At time t1, the system is hit by a disruptive event (an overstress such as an earthquake). Depending on the inherent reliability and the effectiveness of the operation and maintenance program, the performance of the system degrades until t2. Some authors refer to this as the vulnerability state, where higher vulnerability means more severe failure. At time t2, the contingency plan (recovery actions) is put in place, and the system's recoverability decides when the system gets back to the normal performance level [11]. One important step in improving infrastructure resilience is identifying the critical components that contribute to its resilience. Component importance can be identified from a reliability and recoverability point of view. Different reliability importance measures have been developed, which may be used in resilience assessment as well. The importance measure denotes how each component affects the infrastructure performance. In general, the system's resilience results from its components' resilience through the cohesive configuration in which the components interact.
Therefore, a two-dimensional index that measures both a component's own resilience and its behavior in interaction with the other components and the system needs to be developed to analyze the system. Such indexes are known as importance measures and can be obtained from a reliability or recoverability point of view. Different importance measures have been developed in the reliability engineering discipline. For example, Birnbaum (1968) proposed a quantitative definition of structural importance for systems with coherent structures, assuming that only the complex system's structure is known. Chang and Hwang used structural Birnbaum importance in the component assignment problem to obtain the best system reliability [12][13][14]. Amrutkar and Kamalja provided an overview of structural, reliability, and lifetime importance measures for coherent systems from 1960 to 2017 [15]. A resilience-focused performance measure was offered through generated interdependent power-water networks by Almoghathawi and Barker [16]. Chacko introduced a joint reliability importance measure for two or more multistate components, joint performance achievement worth, joint performance reduction worth, and the joint performance Fussell-Vesely measure, using expected performance, reliability, availability, and risk as output performance measures of the multistate system [17]. Xu et al. used the values of component importance to investigate a time-dependent risk quantification model, as well as a common cause failure treatment model, in operation and maintenance management. The results showed that the absolute values and ranking order of the time-dependent importance reflected the effect of the cumulative state duration of a component on risk and comprehensively accounted for all possible situations of component unavailability [18]. Kamra and Pahuja analyzed substation communication network architectures using various reliability importance measures.
These component importance measures are applied to identify the components whose improvement most benefits system reliability [19]. Niu et al. extended component importance to generating capacity adequacy assessment, where the measurement index builds on traditional importance measures. It is demonstrated that a central component, the one with higher structural importance, can actually have less risk reduction worth than a branch component, the one with lower structural importance [20]. Furthermore, some authors proposed the availability importance measure (AIM), which determines the importance of items regarding the availability of mechanical systems and the smart grid (regarded as the next-generation electrical power grid).
A review of available studies reveals that in most of them, the reliability of components depends on a single independent variable, the time of operation or the time between failures (TBF). Moreover, these studies mostly assume that the data are homogeneous, i.e., collected under identical operational conditions, so the equipment experiences the same environmental, operational, and organizational stress. In reality, this is not a valid assumption. Studies show that most resilience data have a degree of heterogeneity that needs to be identified and quantified appropriately. In other words, in most real cases, operational conditions can significantly influence the infrastructure's reliability and recoverability characteristics.
In general, risk factors can be categorized into two groups, observable and unobservable, leading to observable and unobservable heterogeneity. Unobservable risk factors are factors that are unknown. Recent studies show that unobservable risk factors can significantly change the components' reliability and recoverability and, consequently, the resilience characterization of infrastructures. Hence, the effect of both observable and unobservable risk factors should be considered while the component importance measure is analyzed [21][22][23][24][25][26][27]. However, in most of the available importance measures, the effect of risk factors has not been addressed properly. Recently, different approaches have been used to analyze the effect of risk factors on system resilience, such as regression methods, neural networks, and classical statistics [28][29][30]. For example, the Cox regression and accelerated failure time (AFT) models are the two most widely applied regression models for modelling the effect of risk factors on the resilience of infrastructures [11,21,22,31]. In these models, reliability or recoverability is expressed through a baseline hazard/repair rate and a covariate function reflecting the effect of risk factors on the baseline hazard rate. The baseline hazard represents the hazard when the effects (coefficient values) of all risk factors (or predictors, or independent variables) are equal to zero [25].
Hence, the main motivation of this paper is to develop a risk factor-based reliability importance measure that isolates the effect of observable and unobservable risk factors. The paper is divided into three parts. Part 2 briefly presents the theoretical background of the risk factor-based reliability importance measure (RF-RIM) and discusses the methodology for implementing the model. Part 3 presents a case study featuring the reliability importance analysis of part of the loading fleet in an Iranian iron ore mine. Finally, part 4 provides the conclusion of the paper.

Methodology and Framework: Risk Factor-Based Reliability Importance Measure (RF-RIM)
Mathematically, the resilience measure can be defined as the sum of reliability and recoverability (restoration) as follows [32]:

Ψ(t) = R(t) + ρ(t) = R(t) + k·Λp·ΛD·(1 − R(t)),  (1)

where k, Λp, and ΛD are the conditional probabilities of mitigation/recovery action success, correct prognosis, and correct diagnosis. Equation (1) turns technical infrastructure resilience into a quantifiable property and provides essential information for managing infrastructures efficiently. Reliability is defined as the probability that a system can perform a required function under given conditions at a given instant of time, assuming the required external resources are provided [12]. Reliability can be modeled using a statistical approach such as a classical distribution. The restoration is considered as the joint probability of having an event, a correct prognosis and diagnosis, and a successful mitigation/recovery, as follows [33]:

ρ(t) = (1 − R(t))·P_Prognosis·P_Diagnosis·P_Recovery,  (2)

where P_Diagnosis is the probability of correct diagnosis, P_Prognosis is the probability of correct prognosis, and P_Recovery is the probability of successful recovery [32]. As mentioned, the importance measure shows how each component affects the system resilience. For example, in a series system, the component with the least reliability has the greatest effect on system resilience, whereas in a parallel system, the component with the greatest reliability has the greatest effect. Figure 2 shows a systematic guideline for RF-RIM. As this figure shows, the initial step involves collecting failure and repair data and their associated risk factors. The most important challenge in the first step is the quality and accuracy of the collected data set, which significantly affects the analysis results [28]. In the second step, based on the nature of the collected data and risk factors, statistical models are nominated to model the reliability of components. For example, in the presence of observable and unobservable risk factors, the frailty model can be used. This model was originally developed by Asha et al. [34] for load-share systems and describes the effect of observable and unobservable covariates on the reliability analysis. In later years, authors such as Xu and Li, Misra et al., and Giorgio et al. discussed the properties of the frailty model [35][36][37]. Moreover, recently this model has been used in spare part estimation, remaining useful life (RUL) estimation, recoverability analysis, failure data analysis, and resilience analysis [21][22][23][24][38][39]. According to this model, the reliability of each component, R(t; z, z(t)|α), can be modelled as follows [21][22][23][24]:

Ri(t; z, z(t)|α) = [Ri(t; z, z(t))]^α,  (3)

where α is the frailty, with probability density function g(α), mean equal to one, and variance θ, and Ri(t; z, z(t)) is the item's reliability function considering the existence of p1 time-independent and p2 time-dependent observable risk factors. It can be estimated by [21][22][23][24]:

Ri(t; z, z(t)) = exp( −∫₀ᵗ λ0(u)·exp( Σ_{k=1..p1} δk·zk + Σ_{l=1..p2} ηl·zl(u) ) du ),  (4)

where λ0 is the baseline hazard rate, and δ and η are the regression coefficients of the corresponding time-independent and time-dependent observable risk factors. For a gamma-distributed frailty with mean one and variance θ,

g(α) = α^{(1/θ)−1}·exp(−α/θ) / ( θ^{1/θ}·Γ(1/θ) ),  (5)

the unconditional reliability function of the i-th component, Rθi(t; z, z(t)), can be estimated as [21][22][23][24]:

Rθi(t; z, z(t)) = [ 1 − θ·ln Ri(t; z, z(t)) ]^{−1/θ},  (6)

where Rθi(t; z, z(t)) is the item's reliability function considering the existence of both observable and unobservable risk factors. If there is no effect from unobservable risk factors, then α = 1, and Equation (6) reduces to the Cox regression model, Ri(t; z, z(t)), as given by Equation (4) [22,31,39]. For a guideline on risk factor-based reliability model selection, see Figure 3.
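The frailty model above can be sketched in a few lines of Python. A Weibull baseline hazard is assumed here for concreteness, and all parameter values in the examples are hypothetical, chosen only to show how Equations (3)-(6) fit together:

```python
import math

def weibull_cum_hazard(t, lam, beta):
    """Baseline cumulative hazard for an assumed Weibull baseline: lam * t**beta."""
    return lam * t ** beta

def cox_reliability(t, lam, beta, delta, z):
    """Equation (4) with time-independent observable risk factors z and
    regression coefficients delta (the Cox/PHM case, i.e., frailty alpha = 1)."""
    risk_score = math.exp(sum(d * zi for d, zi in zip(delta, z)))
    return math.exp(-weibull_cum_hazard(t, lam, beta) * risk_score)

def frailty_reliability(t, lam, beta, delta, z, theta):
    """Equation (6): unconditional reliability under a gamma frailty with
    mean 1 and variance theta; theta = 0 recovers the Cox regression model."""
    r = cox_reliability(t, lam, beta, delta, z)
    if theta == 0.0:
        return r
    return (1.0 - theta * math.log(r)) ** (-1.0 / theta)
```

For theta > 0 the unconditional (population) reliability is higher than the Cox prediction at the same covariates, because frail units fail early and the surviving population is increasingly dominated by robust units.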
In step 3, the system reliability should be estimated. In the presence of observable and unobservable risk factors, for a series-parallel system with n series subsystems, each containing m parallel components, the system reliability can be calculated with Equation (7) [21,22,40]:

Rs(t; z, z(t)|α) = ∏_{i=1..n} [ 1 − ∏_{j=1..m} (1 − Rij(t; z, z(t)|αj)) ],  (7)

where Rs(t; z, z(t)|α) is the system reliability at time t, z is a row vector of the observable time-independent risk factors, z(t) is a row vector of the observable time-dependent risk factors, αj is a time-independent frailty for item j representing the cumulative effect of one or more unobservable risk factors [21,22,41], and Rij(t) is the component reliability at time t. Having the reliability model of the system, the reliability importance measure of a component working in a series-parallel system can be estimated by:

Iᵢᴿ = ∂Rs(t; z, z(t)|α) / ∂Ri(t; z, z(t)|α),  (8)

where Iᵢᴿ and Ri are the RF-RIM and the reliability of component i, both considering observable and unobservable risk factors.
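Equations (7) and (8) can be illustrated with a minimal numerical sketch. The component reliabilities below are hypothetical, and the importance is approximated by a central finite difference of the partial derivative:

```python
def series_parallel_reliability(R):
    """Equation (7): R[i][j] is the reliability of the j-th parallel component
    in the i-th series subsystem."""
    r_sys = 1.0
    for subsystem in R:
        all_fail = 1.0
        for r in subsystem:
            all_fail *= (1.0 - r)
        r_sys *= (1.0 - all_fail)
    return r_sys

def rf_rim(R, i, j, eps=1e-6):
    """Equation (8): numerical partial derivative of system reliability with
    respect to the reliability of component (i, j)."""
    up = [row[:] for row in R]
    down = [row[:] for row in R]
    up[i][j] += eps
    down[i][j] -= eps
    return (series_parallel_reliability(up) - series_parallel_reliability(down)) / (2.0 * eps)
```

Because Equation (7) is multilinear in the component reliabilities, the finite difference is exact up to rounding; for a purely parallel system it reduces to Iᵢᴿ = ∏_{j≠i}(1 − Rj).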

Case Study
Mining is an important industry that provides raw materials, an essential input for other industries. The Gol-Gohar iron ore mine is located in southern Iran, in the southwest of Kerman province. It contains six sections, each of which works independently. Mining in surface mines starts with drilling the rock, blasting, loading, and then transporting the rock to the production facility or a depot. Nowadays, the mining industry uses huge equipment to increase performance. Extraction equipment is very expensive, and any unplanned stoppage may cause tremendous costs.
Moreover, long stoppages may affect the ore processing facilities, which are downstream in the production chain. Recently, the resilience concept has been introduced in the mining industry to avoid any disturbance in the chain of production. A previous study showed that Gol-Gohar's operational conditions could significantly affect the resilience characteristics of mining equipment. To manage resilience effectively, it needs to be measured quantitatively, considering the operational conditions. Hence, in this study, we applied the RF-RIM developed in Section 2 to the loading fleet, including four Caterpillar (Caterpillar Inc., construction machinery and equipment company, Deerfield, IL, USA) excavators, model 390DL, in section No. 1.

Data Collection and Classification
According to the guideline developed in Figure 2, the first step is to collect the data. The data required for reliability analysis can be divided into two categories: failure data and risk factors. In this case, the failure data (times to failure) and their possible associated risk factors were collected from January 2016 to December 2018. These data were collected from various sources, including daily operation reports, sensors mounted on the machines, meteorology reports, geological specifications, interviews with experts, meetings, and archival documents (previous reports, machine catalogs). The collected risk factors include qualitative (categorical) and quantitative (continuous) risk factors. The continuous risk factors are precipitation (Z4), temperature (Z5), and humidity (Z6). The categorical risk factors are working shift (Z1), rock kind (Z2), and operation team (Z3). Table 1 shows the formulation of the categorical risk factors. For example, the table shows that the shift has three categories, morning, afternoon, and night, represented by 1, 2, and 3, respectively.
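The encoding in Table 1 can be mirrored directly in code. The category labels for rock kind and operation team below are assumptions made for illustration, since only the shift categories are spelled out in the text:

```python
# Encoding of the categorical risk factors (labels beyond the shift are assumed).
SHIFT = {"morning": 1, "afternoon": 2, "night": 3}   # Z1
ROCK_KIND = {"ore": 1, "waste": 2}                   # Z2 (assumed labels)
OPERATION_TEAM = {"A": 1, "B": 2, "C": 3}            # Z3 (assumed labels)

def encode_record(shift, rock, team, precipitation, temperature, humidity):
    """Build the covariate vector (Z1, ..., Z6) for one time-to-failure record."""
    return [SHIFT[shift], ROCK_KIND[rock], OPERATION_TEAM[team],
            precipitation, temperature, humidity]
```

A record collected on a night shift in ore with operation team B would then be encoded as a single numeric row ready for regression.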

Risk Factor Test
Two tests have been carried out on the collected risk factors in this part: a correlation test and a PH-assumption test. Correlation tests are performed to find out whether the identified risk factors are independent of each other. If some risk factors are correlated, they should be replaced by new risk factors built up from the correlated ones. Furthermore, time dependency is checked to find out whether the effect of a risk factor changes with time. Such tests are known as PH-assumption tests.
Here, the Pearson test is used to check the correlation between risk factors. The risk factor correlation test for the excavators showed no significant correlation between the identified risk factors. As an example, Table 2 shows the results for excavator A. As can be seen, there is no significant correlation between the identified risk factors at a 95% confidence level. According to Figure 3, the application of PHM in its original form is limited to modeling the effect of time-independent risk factors. Hence, the stratified Cox regression model and the extended Cox regression model have been developed to extend its application to time-dependent risk factors. To select the best model among these, the time dependency of the risk factors should be checked through the proportional hazard (PH) assumption. The PH assumption means the hazard ratio (HR) remains constant over time, or equivalently, the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time. The formula of the PH assumption for the HR comparing two different specifications z* and z of the risk factors is [42]:

HR = λ(t; z*) / λ(t; z) = exp( Σ_{k} βk·(z*k − zk) ),  (9)

which is constant over time. Different approaches, such as theoretical and graphical ones, have been used to determine whether the PH assumption fits a given data set. The graphical procedure, the goodness-of-fit testing procedure, and a procedure involving time-dependent variables have been the most widely used in PH assumption evaluations. For more information, see [32].
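The pairwise correlation screening can be sketched as follows. The significance check below uses the large-sample normal approximation (critical value 1.96) for the t statistic of the correlation coefficient, which is a simplifying assumption made here for brevity:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two risk factor series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_at_95(x, y):
    """True if the correlation is significant at the 95% level, using
    t = r * sqrt((n - 2) / (1 - r^2)) against the normal critical value 1.96."""
    r = pearson_r(x, y)
    if abs(r) >= 1.0:
        return True
    t = r * math.sqrt((len(x) - 2) / (1.0 - r * r))
    return abs(t) > 1.96
```

In the screening step, every pair of risk factor series would be passed through correlated_at_95; pairs flagged as correlated would be candidates for replacement by a combined risk factor.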
Here, the theoretical approach is used: the PH assumption of the risk factors is checked using the Schoenfeld residual test [43,44]. The result of this analysis for excavator A is shown in Table 3. As the table shows, the p-value for all risk factors is greater than 5%; hence, the null hypothesis (no time dependency of the risk factors) cannot be rejected, and all risk factors can be treated as time-independent.

RF-RIM Modeling
In step 2, based on the result of the risk factor test, possible models for the reliability analysis of the excavators should be nominated, and an appropriate statistical test for selecting the best-fit model needs to be considered. As found in step 1, all risk factors are time-independent; hence, based on the literature, the Weibull-based mixed proportional hazard model (Weibull-MPHM) and the exponential-based mixed proportional hazard model (Exponential-MPHM) are nominated for the reliability modeling. Here, it should be highlighted that both Weibull-MPHM and Exponential-MPHM can model the effect of unobservable risk factors (unobservable heterogeneity). The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are selected as the goodness-of-fit (GOF) tests by which the best model is selected [45,46]. AIC and BIC are information-based criteria and are classically used by comparing the maximum likelihood values to select the appropriate model. These two criteria are formulated as follows [45]:

AIC = −2 ln L + 2k,  (10)
BIC = −2 ln L + k ln N,  (11)

where k indicates the number of estimated parameters and N represents the number of observations (failures). The model with the smallest AIC and BIC values is selected as the most appropriate choice. Furthermore, the likelihood ratio (LR) test can be used to check for unobservable heterogeneity among the data [21,22,47], where the null hypothesis is that there is no unobservable heterogeneity. Table 4 presents the results of the AIC, BIC, and LR calculations. The last two columns show the LR calculation. For example, under the assumption of the Weibull-MPHM for excavator D, as highlighted in the second row, the LR test is performed as:

LR = 2 [ ln L(λ̂, β̂, η̂, θ̂) − ln L(λ̂0, β̂0, η̂0, 0) ] = 11.29,  (12)

where λ̂ and β̂ are the estimated parameters of the Weibull distribution, η̂ is the vector of regression coefficients for the observable risk factors, and θ̂ is the degree of heterogeneity due to the effect of unobservable risk factors. The p-value for LR = 11.29 is approximately zero, which leads to the rejection of the null hypothesis.
Hence, it can be concluded that unobservable risk factors do affect the reliability of excavator D. Table 5 shows the regression coefficients for excavator D, the associated Z tests, and p-values. According to Table 5, only the temperature and rock kind risk factors (as highlighted) significantly affect the reliability of excavator D. Hence, its reliability can be expressed using only these two risk factors in Equation (6). Also, in Table 5, if the exponential of a risk factor's regression coefficient is greater (smaller) than 1, the risk factor increases (decreases) the hazard rate. Table 6 shows the best-fit model and its related parameters for all identified subsystems of the analyzed system. As this table shows, for example, the best model for the reliability of excavator A is the Exponential-MPHM, where θ represents the degree of heterogeneity (the variance of the gamma distribution in Equation (5)). The reliability of the identified subsystems is shown in Figure 4.
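The model selection arithmetic above can be reproduced as follows. The log-likelihood values in the test are placeholders, and the p-value assumes a chi-square reference distribution with one degree of freedom (the single heterogeneity parameter θ), evaluated via the identity P(χ²₁ > x) = erfc(√(x/2)):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion; smaller is better."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion for n observed failures; smaller is better."""
    return -2.0 * log_lik + k * math.log(n)

def lr_statistic(log_lik_full, log_lik_reduced):
    """LR statistic for H0: no unobservable heterogeneity (theta = 0)."""
    return 2.0 * (log_lik_full - log_lik_reduced)

def chi2_sf_1dof(x):
    """Survival function of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))
```

For the LR value of 11.29 reported for excavator D, chi2_sf_1dof(11.29) is below 0.001, consistent with rejecting the null hypothesis of no unobservable heterogeneity at the 5% level.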

System RF-RIM
In this stage, firstly, the relationship between the components (or subsystems) needs to be understood, and then a suitable model should be selected to model this relationship. Here, all components work in parallel; hence, using Equation (7), the reliability of the system can be modelled as a parallel structure of the four excavators. Using the equations developed in Table 6 for the reliability of each piece of equipment, the system reliability can be written accordingly. In Figure 5, the reliability of the system is plotted under the assumption of three risk factor combinations, as follows:

•	Classical model: only time data are analyzed.
•	Winter: temperature = 10, night shift, waste rock, and operation team C.
•	Summer: temperature = 20, afternoon shift, ore, and operation team B.

Using Equation (8), the importance measures of the subsystems can be calculated. The results of this analysis are shown in Figures 6-8. According to these figures, the importance of the subsystems depends on the operational conditions. For example, in Figure 6, the classical model shows that all components have about the same criticality. In the risk factor setting for winter (Figure 7), excavator B has the maximum importance measure (first ranking), so excavator B is the critical subsystem. In the risk factor setting for summer (Figure 8), excavator D has the maximum importance measure (first ranking).
Excavator A has the highest reliability (see Figure 4), yet its reliability importance is ranked the lowest. The analysis shows that, as the importance of the components changes over time, resource allocation needs to have a dynamic nature as well and must be updated. For example, the maintenance program needs to be updated as operational conditions change. Table 9 shows how the reliability importance of the subsystems changes with operational condition and operating time. For example, in wintertime, excavator D is the priority; however, after 20 h, excavator B becomes the priority.
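The scenario comparison can be emulated with the closed-form parallel-system importance, Iᵢᴿ = ∏_{j≠i}(1 − Rj), which follows from Equations (7) and (8) for a purely parallel configuration. The reliability values below are hypothetical placeholders, not the fitted values from Table 6:

```python
def parallel_importance(reliabilities):
    """Importance of each unit in a purely parallel system:
    I_i = product over j != i of (1 - R_j)."""
    importances = []
    for i in range(len(reliabilities)):
        imp = 1.0
        for j, r in enumerate(reliabilities):
            if j != i:
                imp *= (1.0 - r)
        importances.append(imp)
    return importances

def rank_units(reliabilities, names):
    """Return (name, importance) pairs sorted from most to least important."""
    pairs = zip(names, parallel_importance(reliabilities))
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Hypothetical reliabilities of excavators A-D at a fixed mission time
# under two assumed risk factor settings.
names = ["A", "B", "C", "D"]
winter = [0.62, 0.80, 0.55, 0.51]
summer = [0.55, 0.60, 0.58, 0.78]
```

Under these assumed numbers, the most reliable unit ranks first in each scenario, which is the expected behavior of the Birnbaum measure for a parallel system; the paper's Figures 6-8 instead use the fitted, time-varying reliabilities, so the rankings there change with the operating time as well.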

Conclusions
Society's performance and well-being greatly rely on infrastructures such as communication, transportation, and power distribution systems. These systems are complex, large, and expensive, often working under dynamic environmental conditions. Resilience is an emerging concept that has been used to quantify the performance of infrastructures. Different concepts have been used to model the resilience of a system, such as reliability and recoverability. To obtain an effective resilience estimation, the effect of dynamic operational conditions on these resilience concepts needs to be modeled.
Moreover, it is important to know the importance of each component that builds up such complex systems. Such a ranking can serve as the basis for further improvement and resource allocation. However, in most of the available importance measures, the effect of risk factors has not been addressed properly. To this end, this paper has introduced a risk factor-based importance measure. The developed model enables the analyst to isolate the effects of observable and unobservable risk factors on the reliability importance of components. Moreover, a step-by-step guideline was developed to facilitate the application of the model.
In the case study, the importance of loading fleet number 1 at the Gol-Gohar iron ore mine, including four excavators working in parallel, was analyzed under three risk factor settings. The results showed that the system has different importance measures under different operating conditions. Moreover, they showed that unobservable risk factors significantly affect the reliability importance of some components. Hence, ignoring these effects may lead to unrealistic decisions.