Risk-Based Criticality Assessment for Smart Critical Infrastructures

: Today, critical infrastructure is more interconnected, which allows more vulnerabilities in the case of disasters. In addition, the effect of one infrastructure can lead to one or more cascading failures in another infrastructure due to the dependency complexity between them. This article introduces a holistic approach using network indicators and machine learning to better understand the geographical representation of critical infrastructure. Previous work on a similar model was based on a single measure; such as in fashion, this paper introduces four measures utilized to identify the most vital geographic zone in the city. The model aims to increase resilience, focusing on the preparedness phase by assessing the essential nodes of infrastructure, which allows more space to adopt risk mitigation strategies before any disturbance event. Holding in-depth knowledge of the vital zones of small scales and accordingly ranking them will positively improve the overall system resilience.


Introduction
Today, our countries, national safety, financial success, and social health are primarily reliant on a collection of highly interdependent critical infrastructures. Numerous cases containing these systems can be viewed, including the power networks system, natural gas infrastructure, communication infrastructure, water systems, and transportation systems. Understanding these infrastructures' behavior is crucial, particularly when stressed or under attack. Infrastructures attacks include a wide range of physical attacks and virtual attacks such as climate change and pandemics. One way to understand such behavior can be reflected by obtaining resilience. Resilience in this scenario can be defined as the ability of the system to prepare, absorb, and recover in a reasonable time. Metrics and models related to infrastructure resilience can exhibit signs that affect the interdependent critical infrastructures' performances and operational features. These models and measures require introducing interdependencies between infrastructures to present detailed descriptions of infrastructure resilience such as [1,2].
Resilience has also been defined as the capacity of a system to absorb disturbance, undergo change, and preserve approximately the equivalent function, structure, identity, and feedback [3]. Resilient infrastructure is different from sustainable infrastructure, which refers to designing, building, and operating the infrastructure through the day-to-day function in ways that do not reduce the social, economic, and ecological operations demanded to maintain human equity, diversity, and the functionality of natural systems. Resilience encompasses three primary stages based on the functionality timeline: preparedness, recovery, and restoration. Preparedness is a significant phase in promoting the function of interdependent critical infrastructures, and for any infrastructure to be prepared, an enhanced risk mitigation mechanism should be employed to the critical elements of that system. Nevertheless, determining the most critical components has been challenging, particularly in the state of multiple infrastructures. For example, based on the Department of Homeland Security's last report [4], there are more than 16 infrastructures that are considered critical; yet, these infrastructures are not formed in an isolated form but more in an interconnected fashion, and this raises the complexity of precisely locating the most critical nodes.
This paper introduces a holistic approach that uses an indicator-based approach, a network-based approach, and machine learning to better understand the geographical representation of the critical infrastructures. Understanding such a model will allow for the accurate measurement of infrastructures elements and determining which part has the most impact and then will help avoid any future disaster by applying the protection mechanism on that element.
The remainder of the paper is designed as follows. Section 2 reviews related work on critical infrastructure analysis to identify the most exposed or significant elements. Section 3 presents our methodology to identify and assess a metropolitan city's most vital geographic zones. Section 4 provides numerical results from a case study with our suggested approach, including a risk analysis. Lastly, Section 5 presents our conclusions.

Background and Related Work
One way of modeling infrastructure in literature has been conducted through the network modeling [5]. By modeling the infrastructure as a network, the implementation of centrality approaches to critical infrastructure protection and mitigation control can be applied. However, this approach has been applied to a single infrastructure, and there is no coherent framework apprehending other interdependent infrastructures as a comprehensive system. In [6], the authors manage to identify essential nodes of critical infrastructure based on the dependency risk graph by selecting a group of the most vital nodes and suggesting applying risk mitigation strategies for these nodes to enhance the overall resilience. Furthermore, they used graph centrality metrics to create and assess the effectiveness of alternative risk reduction plans by examining the relationship between dependency risk routes and network centrality properties. Nevertheless, several random graphs with randomly selected dependencies were carried out to verify mitigation strategies. Both data availability and modeling difficulties are the reason for generating random graphs.
Wang et al. [7] introduced a Node Topology Importance (NTI) method based on estimating the valuation of physical connectivity of a power grid communication graph after one node is damaged. The design employs several network measures to recognize the significance of a node considering the dependency within the power and communication networks. In a similar work, ref. [8] merged business factors (e.g., change in revenue/reputation, environmental cost, etc.) with graph metrics (e.g., betweenness centrality) to discover the most critical nodes on a network.
In [9], authors present a method of leveraging deep learning for the detection of threats to critical infrastructures before failures occur to enhance the overall system resilience. An automated review of the power infrastructure employing vehicle-mounted video acquisition devices is examined, and a machine learning algorithm (deep convolutional neural network) is formed, achieving high efficiency in identifying power-related infrastructures within images mostly familiar with rural environments. The imaged picture could be utilized to identify power-related infrastructure and possibly identify and flag infrastructure at risk of collapsing. Another relevant work seeks to measure the interdependence between climate change and COVID-19 by providing a combined method for food security [10]. Yet, there is still a gap in the literature to a unified approach covering the exact value of links between such infrastructures. All related work has been looked into the interdependence between one or at most two different infrastructures, such as power and water, as shown in the example above. In this work, the suggested approach is applicable to any number of different infrastructures located in the same geographical area, such as water, power, communication, and healthcare. That is achieved by initiating a connection (links) between them and assigning a weight to each connection. The weight of each link can be represented as the impact of each node on that link to the other (risk assessment matrix). In the following section, we present a deeper explanation of how to generate such links and assess and assign the corresponding weights.

Criticality Assessment Process Development
The criticality assessment in this work has been categorized into three hierarchical levels, and each group has several features, as shown in Figure 1. The criticality model contains three horizontal stages named level 1, level 2, and level 3. First, starting the assessment of each region with level 3 where the data collection phase occurs based on the metric type, for example, risk assessment for the examined area. Following that, level 2 includes grouping the collected data into more organized groups where all data are normalized and aggregated in this phase. Inside each group, we involve several stages, which are represented as follows: (1) Select measures; (2) weighting all features; (3) aggregate and rank measures; (4) feature selection.

Identify Measures and Data Acquisition
We select four primary measures to understand the geographic location we are endeavoring to assess. Every measure reflects a specific aspect related to the critical infrastructure. The measures and the description are centrality measures, criticality measures, interdependence measures, and community measures. All measures have been explained in detail as follows.

Centrality Measures
Employing centrality measures has been applied as a valuable mechanism to recognize the critical nodes in graph theory. Graph centrality measures are selected here to assess the relative importance of a node in a graph G. Several centrality methods exist; each has various features. Three network measures have favored being tested on each graph to determine the most critical nodes. The measure practiced here is the centrality degree, betweenness, and closeness. The adopted model was selected based on prior work in [11], where we used three centrality measures as an index to measure the importance of the nodes within a specific network and then aggregate the normalized value of each metric to result in an overall weight of each zone. Nevertheless, we formed each geographical site in this paper such as zip code and neighborhood as a graph G = (V, E), where V is the set of vertices or network nodes located in the examined geographical area and E is the set of edges or links or connections connecting the nodes. Nodes, in this case, contain all the critical infrastructure elements.
The geographic area is this work modeled Geo i based on the Nearest Neighbor Algorithm (NNA), which is one of the fundamental algorithms employed to resolve the traveling salesman problem, where the salesman begins at a random city and frequently visits the closest approaching city until all have been visited. The algorithm computes the Euclidean distance [12] from each point in a point pattern to its nearest neighbor (the nearest other points of the pattern). The nearest neighborhood algorithm helps meet a location with its nearest k neighbors in a multi-dimensional space. The aim behind using the NN algorithm is to connect the infrastructure nodes with a virtual link that matches the physical topology. According to [13], most power stations and cell tower position into the nearest hospital. In the power scenario, the cost of power transmission is expensive, and that leads to positioning the power station somewhere near the hospital. However, in cell towers, positioning is mainly based on increasing the coverage of users.
In this case, every CI node has a directed edge to the nearest CI node based on the real distance extracted from the geographic data. For example, CI Energy node has a straight edge to the nearest CI Healthcare node, as shown in Figure 2. Counting both one-way and two-way connection, constraints are added into the graphs as follows: 1. Healthcare nodes only receive a connection from all other infrastructures (water, energy, telecommunication); 2. healthcare cannot provide any outgoing connection to other infrastructure nodes; 3. water, energy, and telecommunication include a two-way connection between them; 4. every node receives only one connection from the nearest infrastructures; 6. centrality measures are calculated considering both in and out degrees (undirected graph) since every link has an impact on both sides. By forming the graph in this pattern, an actual graph is generated to accommodate the required centrality measure as follows. Degree Centrality: In a graph, the node degree can be described as the number of nodes to which a node is directly connected. Accordingly, the degree deg(i) of i ∈ V(G) is the number of edges attached to i. The degree centrality A d (i) of node i is given by: where d i is the cumulative number of edges from the node i and normalized by the maximum feasible degree (i.e., n − 1). The nodes with the highest A d (i) are recognized as more valuable.
Betweenness Centrality: It is another major benchmark for recognizing essential nodes in a complex network. The node and edge betweenness are described as the number of shortest paths passing through a node or an edge. The higher the betweenness of a node, the more critical the node is [14]. The betweenness centrality A b (i) of a vertex i ∈ V(i) can be measured as the percentage of shortest paths that cross over i. Hence, A b (i) can be formulated as: where σ st is the total number of shortest paths between node s and node t, with σ st(i) being the number of these paths that cross through node i. So, the larger A b (i), the more significant the node. Closeness Centrality: In a connected graph, the closeness centrality A c (i) of node i ∈ V(G) is the average range or closeness of the shortest path joining the node i and all other nodes in the graph. The closeness can be represented as: where dist(j, i) is the distance of the shortest path joining nodes i and j. The larger the A c (i), the more centrally positioned the node is in the graph, and the higher the value. The cumulative centrality measure A(i) of geographical location Geo i is yielded by:

Criticality Measures
The criticality measures B(i) of geographical location Geo i is given by: where B Number is the total critical infrastructure nodes located in the same geographical area. In addition, B Diversity is the total different critical infrastructures involved within the same zone.

Interdependence Measures
Modeling the interdependence among critical infrastructure has been a considerable challenge due to the complexity of the connections. There is relevant work in recognizing and modeling dependencies involves the use of sector-specific designs, e.g., gas lines, electric grid, or ICT, or more comprehensive methods that are relevant in different models of CIs [15][16][17].
An approach to model the interdependency between interconnected critical infrastructure is the dependency risk approach. This study applies the dependency risk methodology of Kotzanikolaou et al. [18] for analyzing first-order cascading failures by identifying direct relationships between pairs of critical infrastructures as assessed by critical infrastructure operators; however, in this study, we limit the order to the first order.
The interdependency between nodes in each area are formulated based on a risk dependency graph. The risk level of the dependency approach in this paper formulated by developing a similar approach in [18] with the difference in focusing on the first-order dependency. Dependencies are visualized in a graph G and applied in the graphs produced for each Geo i by using NNA algorithms in the previous section, where V is the set of nodes (or infrastructures or elements), and E represent the set of edges (dependencies). In addition, the weight of each edge is the level of the cascade failure emerging risk for the receiving infrastructure due to the dependency relies on a predefined risk range 0, . . ., 1. Each dependency represented as a straight edge from a node CI i to a node CI j assigned to an impact value I i,j and a likelihood value L i,j . The product of the impact and likelihood conditions show the dependency risk RD i,j directed to infrastructure CI j and caused infrastructure CI i as follows: Where L i,j is the likelihood that defined as the conditional degree of belief that CI j will become unavailable, due to the unavailability realized in CI i . The likelihood L i,j can take one of the following qualitative values: V L, L, M, H, V H (from very low to very high). To simplify the calculation as Table 1  ). In addition, I i,j outlines the impact that is set as the qualitative impact rate that the infrastructure CI j will experience if the link is unavailable due to a disruption failure in CI i . Table 2 displays a scale from one to nine indicating the direct correlated impact between i and j.   Note that those values can be assigned to ranges of economic damage or any different related loss such as the effect on system function or public trust. All these values are specified based on the knowledge we have, and in addition, the assumption of acquiring such a model will convince the related sector to release the demanded information appearing in an absolute result: For a better understanding of the dependencies in interconnected infrastructure, it can also be visualized through graphs, as shown in Figure 3. An infrastructure is denoted as a circle and its related risk dependencies representing in, i.e., an outgoing risk from the infrastructure CI X to the infrastructure CI Y . The quantity in each link points to the level of the incoming risk (cascading failure) for the end-node due to the dependency, based on a risk scale (1)(2)(3)(4)(5)(6)(7)(8)(9). For example, CI E has an incoming dependency risk R A,E = 3 from the infrastructure CI A . The risk assessment indicates the likelihood of a disrupted event from CI A to cascade to CI E (L A,E ), as well as the societal impact caused to CI E in the case of such failure at the source of the dependency (the infrastructure CI A ). After measuring all the risk dependency values for every edge, a total edge value is then calculated, ending in an overall risk value per graph (Geo i ) as follows:

Community Measures
Community measures are limited to include community-based indicators for every geographic zone Geo i , such as poverty level and population. Due to data availability, six various features have been selected in this measure and aggregated collectively, appearing in D i and serving the community rate for the defined Geo i as follows: where D R is the total risk value assigned based on the risk assessment matrix that is based on earlier work in [19]. The D R value conducted for the geographical region so that every Geo i has a specific risk value based on several terrestrial factors such as flooding type and frequency. D TEU is the total electricity use in million kWh. Flooding level, as well, describes D F based on the most advanced map given by DHS [4] and categorized into three classes (>0.1, 0.1 > Flood > 0.02, <0.02). D TEC is the total energy consumption in MMBtu) Million British thermal unit, which is the universal unit of heat and globally employed as a unit of estimating energy consumption. D Pop is the total population for each Geo i and likewise with D Pov representing the poverty percent.

Weighting and Ranking Critically Factors
At this point, it is addressed the determination of the features, measures, and levels. For each feature in level 3, a weight value based on feature scaling has been utilized [20] to deliver all values into a standard scale by employing the z-score normalization procedure based on the mean and standard deviation of each node metric over all nodes v [21] as determined for feature A d (v): Consequently, before aggregating the features into its corresponding measure, a normalization is first implemented, resulting in a normalized list of nodes (Vs) norm . Now, each feature in level 3 contains a normalized value and is ready to be summed to the correlated feature and then aggregated to its corresponding measure in level 2. We consider the use of the simple additive weighting, which is a well-known approach for scoring and ranking alternative options based on multiple attributes. Following that level 2 has four measures each representing different areas, an equivalent calculation from level 3 is employed for all four measures and aggregated to level 1, resulting in a C.A. value:

Feature Selection
After having the aggregate value of each measure, we now apply a machine learning method termed featured selection to distinguish the essential feature that holds the most influence on the total value. The future selection has been implemented in data mining research in current years, and it is one of the standard approaches to select the most important feature out of the data scientifically.
Automatic feature selection techniques can be applied to produce various models with varying subsets of a data set and recognize those attributes that are needed to develop an accurate model. The model was developed applying multiple linear regression algorithms [22], where the dependent variable in this case is the overall criticality assessment value C.A., and all other features are the independent variables. The typical machine learning rule has been practiced to carry the coefficient of each feature and then compare them as follows: (1) pre-processing the data set such as standardization, (2) splitting the data into training and testing data, (3) fitting the model into the test set, (4) performing model performance evaluation such as R 2 and t-value, and finally, (5) performing feature ranking based on the importance according to the model.
We apply the same model into each measure separately (community, criticality, interdependency, and centrality) to identify which measure has the most impact on the overall criticality assessment value and, similarly, to identify which feature has the highest impact on the corresponding measure.

Case Study
In order to illustrate the applicability of the proposed model, we present an application scenario, based on data provided by DHS [4] for the city of Pittsburgh. For simplicity goals, we narrowed our study to four central critical infrastructures: energy, telecommunication, water, and healthcare. The initial step is data collection for all four measures (centrality, critically, interdependence, and community) for greater Pittsburgh. The primary motivation behind utilizing the current data is data availability; however, the proposed holistic method in this work can be applied for similar data for other cities. Furthermore, more infrastructures can also be added once the data are available and there is a solid interdependence between them.
We begin by calculating the centrality measure that includes the degrees, betweenness, and closeness of each node. We model each zip code node as a graph to handle the required measures. Every critical infrastructure node implies a specific vertex in the graph connected to other nodes based on a well-known algorithm KNN, as described in Section 3. For instance, a power substation is connected to the nearest healthcare node in the network. While using a directed graph, many links represent a two-way link. For example, there is a two-way link between every power and telecommunication node; there is just a one-way link from the power network to a healthcare node. By creating such a model, we can exact node centrality measures efficiently, as shown in Table 3, which indicates the network centrality measures for 10 zip code areas in Pittsburgh, USA. B Number in the table indicates the total node number (entire infrastructures nodes) located in that zip code.
In addition, we collect the data related to the interdependence measures, which measure the cascading effects in case of the failure of one system to another, applying a risk dependency graph as described in Equation (7), and the risk caused and effective risk assessments were conducted for each Geo i (zipcodes, neighborhoods, and city districts). One example of the results after modeling the network for one zip code is shown in Figure 4, which uses 29 nodes from 4 different infrastructures. In addition, in Table 4, the 15106 zip code was selected to demonstrate the risk dependency matrix and used to estimate the overall risk value for the zip code. In community-based measures, the data were obtained through open source online city data [23]; for all other measures, we followed the data given by the DHS, which include 1290 critical infrastructure nodes. Table 5 indicates 10 randomly picked zip codes, each with the corresponding community-based values. For example, D TEC represents the total electricity use for the corresponding zip code.
The feature selection algorithm is then performed after building the model employing multiple linear regression. All features selected in level 3 confirmed to effect the criticality assessment value (C.A.), and based on that, we proceed in aggregating all the features into level 2 (measures) without eliminating any of them; however, some features achieved higher importance value than others, such as the number of nodes, interdependence, and degree centrality. Feature selection aims to show which feature impacts the overall model performance. Every ranking method has a different way of selecting the feature, such as comparing based on R 2 or standard coefficients. The features importance ranking algorithms are defined as follows: • Method LMG: Based on sequential R 2 but takes care of the dependence on orderings by averaging over orderings;   In addition to examining each feature, we explore the importance level of each feature corresponding to the measures using the same model. Table 6 shows the importance of each feature when compared with other features within the same category. For example, node numbers achieved a slightly higher value than node diversity in criticality measures. The table also shows the effects of each measure on the overall criticality value C.A.; for instance, interdependence seems to have the most impact on the C.A. value.

Result
After adding all the measures into the criticality assessment value, the C.A. of each Geo i location and each geographic area's importance and risk can be recognized by its corresponding value. An extended analysis has developed to a different geographical scale called neighborhood and city district and then compared with the Zip Code based graph. All three scenarios have been visualized with the ArcGIS tool produced by Environmental Systems Research Institute (ESRI) [24] following the same geographical scale, which is 9 km 2 to include the greater Pittsburgh area as follows.

Zip Code Based
A ZIP Code is a postal code used by the United States Postal Service (USPS). It was chosen to suggest that the mail travels more efficiently and quickly when senders use the code in the postal address. [25]. Figure 4 shows the visualization of the top 10 critical zip codes in the Pittsburgh area. Each geographical zip code is highlighted based on the corresponding C.A. values. We can outline that most zip codes are in downtown and uptown areas. Nevertheless, several zip codes were selected outside the city center, which reflects that some areas are not considered critical despite their importance. These zip codes have to consider risk mitigation strategies since they are all interconnected, and any cascade failure, if started in one of the critical zip codes, can develop very fast and impact the nearest critical zip code.

Neighborhood Based
Generally, neighborhood development followed ward boundaries, although the City Planning Commission has established some neighborhood areas in the post-industrial era [26]. We apply the model on the neighborhood boundaries to understand the most critical neighborhood in the city that needs the best risk strategies to reduce the impact resulting from any cascading failure. Figure 5 shows the visualization of city neighborhoods that are falling in the range of 2-7 in terms of C.A. value.

City Council District
The City Council districts are developed based on political boundaries and serve as the legislative body in many US cities. For this case study, Pittsburgh consists of nine districts, and each covers several neighborhoods [27]. Figure 6 shows the C.A. value of the most critical districts in Pittsburgh city. By comparing results from the above three geographical scale representations, we can conclude that there is an overlap of 50% between them, especially in the downtown area.
According to feature selection, the number of nodes plays a high factor in determining the overall C.A., and most of the downtown areas have large nodes. The exact reasons apply to other areas such as the Oakland and north shore areas.
By recognizing the critical geographical zones within the city scale, it becomes more accessible for the policymakers to leverage appropriate risk mitigating strategies such as security control for the critical region. Furthermore, this will intensify the overall resilience of interdependent critical infrastructure due to the complexity of the interdependent connectivity among them. Collecting the before-mentioned measures based on precise values such as critical nodes can develop in implementing the most reliable security control strategies to avoid future disaster events such as cascading failures. It can also assist in understanding the characteristics of the geographical territory, which presents us with the necessary information to enhance resilience in the critical infrastructures. Following the dependency and the complexity between essential nodes of infrastructure open the room for more tools to heighten all infrastructure resilience appearing in more sustained and resilient cities. However, just four central city-based infrastructures have been used in this work due to the data availability. The addition of more infrastructures will allow for more precise and accurate results. In addition, more measures will be added in the future to include other city aspects such as poverty and pandemics areas.

Conclusions
The criticality assessment methodology defined in this work extends the approach of Alqahtani et al. [19] by combining both graph centrality measures and dependency risk graphs as additional criteria for evaluating alternative critical geographical nodes strategies for interdependent critical infrastructures. The model aims to improve the preparedness phase resilience by assessing the critical zones of infrastructure, which allow for more space to adopt risk mitigation strategies. Future work can include applying more infrastructures and comparing the result with other cities or regions depending on the data availability and accuracy. Data Availability Statement: Most of the data presented in the paper, including an additional and complete dataset, can be found at City Data Bank research data basis at (https://www.city-data.com) (Accessed: 20 July 2021) and United State Census Bureau at (https://www.UnitedStatesZipCodes.org) (Accessed: 20 July 2021).

Conflicts of Interest:
The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.