A Belief Network Reasoning Framework for Fault Localization in Communication Networks

A small fault in a large communication network may trigger a sudden flood of alarms, making it difficult to localize the root cause of the failure. Traditionally, fault localization is carried out by an operator who works through alarm lists; however, the complexity of this process calls for more autonomous and intelligent approaches. Here, we present an overall framework that uses the message propagation mechanism of belief networks to address fault localization problems in communication networks. The proposed framework allows for knowledge storage, inference, and message transmission, and can identify a fault's root cause in an event-driven manner to improve the automation of the fault localization process. To avoid the computational complexity of inference in traditional Bayesian networks, we perform fault inference in polytrees with a noisy OR-gate model (PTNORgate). We also offer a solution to store parameters in a network parameter table, similar to a routing table in communication networks, with the aim of facilitating the development of the algorithm. Case studies and a performance evaluation show that the solution is suitable for fault localization in communication networks in terms of speed and reliability.


Introduction
In large enterprises, communication networks have become a fundamental infrastructure. Increasingly diverse applications, such as online electronic transactions, network synergetic work, high-security remote monitoring, and even mission-critical remote control and emergency call services, all run on top of these networks [1]. Networks are increasing in size and complexity and are moving toward heterogeneity. In such a network, maintaining a high level of performance and reliability is both a significant task and a challenging problem for fault management. Fault localization is the core component of network fault management. Its purpose is to quickly and accurately locate the root cause of a fault. A good fault localization scheme will reduce network maintenance time and improve the availability of network services [2]. Furthermore, future networks will be more intelligent and adaptive than the current ones. Therefore, their fault localization methods and techniques need to emphasize the following objectives: automation, accuracy, speed and reliability.
In communication networks, a single fault in one component can produce inconsistent outputs. These abnormal outputs may serve as inputs to other healthy parts of the networks [3]. This phenomenon often causes cascaded faults in a communication network and may cause a large for fault localization, and we present our framework and techniques for fault localization in Section 4. A fault scenario of the transmission network is studied in Section 5. We carried out a performance evaluation, and the discussion of the results is provided in Section 6. We provide a conclusion in Section 7.

Related Works
A consolidated taxonomy on various approaches and techniques for fault localization in computer networks has been presented in [6]. Generally, these approaches and techniques are broadly categorized as model-based approaches, rule-based approaches, case-based approaches, and emerging machine learning techniques. They aim to make fault diagnosis intelligent and automated. Some of the most preeminent examples will be briefly presented below.
Model-based approaches describe the behavior of the system as a mathematical model by means of expert knowledge. A profound understanding of the underlying structure and operating mechanism of the system is required [11,12]. In [13], a simple network management protocol (SNMP) based on a management model is proposed. The model can localize the root cause of the event and give advice to operators for solving problems; however, these models may be difficult to obtain and keep up to date.
In [14], a rule-based approach is proposed for communication network operation and management. Such a model generally consists of three parts: a rule base, a rule discovery engine, and an inference engine. The first two parts can be achieved by iterative and incremental algorithms. New rules are constantly added into the rule base by performing iteration algorithms in different conditions. The inference engine determines which rule is the most satisfied with the given situation [15]. Updating and enriching the knowledge base and carrying out the inference process are all more complex. Especially in a network in which the topology frequently changes, a large number of rules needs to be updated frequently. Therefore, this method is not well suited for such a network, although many emerging techniques have been proposed to automatically learn rules based on observed symptoms [16,17].
Case-based approaches rely on human experience obtained from past fault cases [18][19][20]. A new experience is stored in the case base when a problem has been solved. The new experience can then be retrieved and reused for future problems. In [2], the authors presented a hybrid approach that combines case-based reasoning and Bayesian networks (CBR-BN) to identify the root cause of faults. When a fault occurs, the approach carries out fault localization as follows: (1) the fault is viewed as a problem case; (2) the existing cases in the case base are searched for a match; (3) if there is a similar case, then that solution case is identified and applied; (4) if there are no similar cases, then Bayesian inference is carried out, and a new outcome is obtained; (5) finally, the outcome is saved to the case base as a new solution case and reused for future problems. Similar to the rule-based techniques, the main limitations of the case-based techniques come from the time required to update a large number of cases, match a case, and enrich the case base.
In [21], the authors utilize machine learning-based techniques for fault identification and localization in the communication network. The model takes into account the packet loss, end-to-end delay, and aggregate flow rate captured from the networks in normal working states and different fault scenarios. In [22], the authors proposed a solution that uses deep learning to deal with the link handover fault of 5G networks when the mobile device moves from one base station to another base station. Machine learning-based techniques and approaches are well known as a powerful solution for fault localization for complex communication networks [23][24][25][26]. These solutions require a long training period and a large amount of sample data in fault scenarios to train their learning models. Such work is not always feasible with high-reliability and high-security networks. In addition, these solutions lack correct causal representation and lack interpretation for results.
Although the various fault location methods have been widely used over the past few decades, we need a method that can deal with the complex causal relations between failures and symptoms. Several works propose the use of the belief network model. A belief network provides an intuitive Here, we propose applying a message propagation mechanism of belief networks to address fault localization problems in communication networks. In this schema, the impact of each event propagates through the network between neighboring nodes in a message-passing manner. This modeling schema exposes either the dependencies among the network entities or the causal relationship among events [29]. Relying on these mechanisms of the belief networks, the fault propagation models have attracted increasing attention and have been widely used in various systems for fault localization.
In a belief network, we view each node not merely as a variable that represents an event but also as a separate processor that maintains the network parameters (prior probability, posterior probability, and conditional probability). The belief network provides an overall framework for storing knowledge, transferring messages, and carrying out causal reasoning. When an event occurs, each node in the belief network will exchange messages with its neighboring nodes; that is, it updates its own beliefs by receiving messages from its neighbors and sends new beliefs to its neighboring nodes. This state will continue until the events disappear or a new equilibrium is reached in the network. In the new equilibrium state, each node will be reassigned a new probability value. The higher the probability value, the more probable the fault's root cause. Therefore, fault localization problems may be translated into probability calculation problems.
The belief network is a directed acyclic graph. A graphical representation of the belief network has many advantages in modeling the fault propagation model, as stated in [47]. First, the representation of this structure is transparent for causal reasoning, and it exposes information about the structure so that we can easily understand its semantics. In contrast, an opaque reasoning model easily gives us an unexplained or even undesirable answer; second, the graphical model represents a perceivable dependent relationship that can be used effectively for causal reasoning; third, this structure facilitates the development of a viable model of human reasoning. Whether the models produced by this structure are from human expert knowledge or by learning from data, they always provide a good approximation to human thinking. Much more surprising is the fact that they can sometimes reveal the hidden information in the networks and offer novel insight into the network system's underbellies.

The Definition and Notations of Belief Networks
Depending on the application in the communication networks, belief networks can be defined as follows.
The belief network is a directed acyclic graph (DAG) in which each node represents a {0, 1}-valued random variable. A directed edge between two nodes represents a causal relationship between the two variables. The strength of each causal relationship is measured by a conditional probability. The nodes (random variables) in the belief network are denoted by the capital letter $X$. The set of nodes is denoted by $X = \{X_1, X_2, \ldots, X_n\}$, where $X_i$ indicates the $i$-th node. $X_i \rightarrow X_j$ represents a directed edge from node $X_i$ to node $X_j$, where $X_j$ is the child of $X_i$ and $X_i$ is the parent of $X_j$. Let $\mathrm{Par}(X_i) = \{X_{i_1}, X_{i_2}, \ldots, X_{i_n}\}$ be the set of all parents of $X_i$. $P_i$ is the conditional probability matrix associated with the random variable $X_i$, and P x i , An evidence $e$ is an observed symptom.
Here, the set of nodes $X$ is divided into two categories, $X_F$ and $X_A$, which represent the set of all fault nodes and the set of all alarm nodes, respectively: $X = (X_F, X_A)$, $X_F = \{X_{F_1}, X_{F_2}, \ldots, X_{F_n}\}$ and $X_A = \{X_{A_1}, X_{A_2}, \ldots, X_{A_n}\}$. The state value of node $X_{F_i}$ ($X_{A_i}$) is 0 or 1, which indicates that the $i$-th fault (alarm) is absent (inhibited) or present (active), respectively. If the state value of a node is 1, we say that the variable is instantiated.

The Noisy OR-Gate Model
It should be pointed out that inference in general belief networks cannot avoid an exponential blowup with the number of nodes; the probability calculation is NP-hard [47]. To overcome this limitation, we propose a simplified reasoning model called the noisy OR-gate to reduce the complexity of the belief network's inference process while retaining the advantages of the belief network's inference technique. As a result, the reasoning model shows good performance in terms of speed, accuracy, and automation.
Each variable in the simplified noisy OR-gate model is a binary-valued variable in the belief network. Each variable consists of a causal factor and an inhibitory factor, as shown in Figure 1. The event R represents a consequence or prediction of the input. The input $X = (X_1, X_2, \ldots, X_n)$ represents explanations or conditions that may result in the occurrence of R, and $I = (I_1, I_2, \ldots, I_n)$ represents inhibitors that can prevent the occurrence of R. In the noisy OR-gate model [10], we assume that all potential causes of the same consequence are independent. This independence assumption is suitable for investigating probabilistic fault localization techniques and is ubiquitous in the area of fault localization. Instead of the conditional probability matrix of traditional belief networks, the noisy OR-gate model lets each alternative cause separately hold the weight associated with each of its likely consequences. In the reasoning process, the consequence is absent only if the inhibitors associated with all of the present causes are activated. In other words, if the consequence is present, at least one inhibitor associated with a present cause remains inactive. Any one condition in a set may cause a specific event on its own; when several of these conditions occur simultaneously, the occurrence probability of the event does not diminish. For example, network congestion, a failed connection, or a destroyed forwarding table may each individually cause a service disruption in a communication network. When a communication network suffers from several of these causal factors simultaneously, the occurrence probability of a service disruption will only be higher. A notable consequence of this refinement is that the conditional probability matrix need not be stored, and reasoning tasks are guaranteed to run in polynomial time.
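The noisy OR-gate semantics described above can be illustrated with a minimal Python sketch. It assumes the usual closed form in which the consequence R is absent only when every present cause is inhibited, so that P(R = 1) equals one minus the product of the inhibitor probabilities of the present causes. The function name `noisy_or` and the inhibitor values are hypothetical and chosen only for illustration.

```python
from math import prod


def noisy_or(present, q):
    """Return P(R = 1 | causes) under the noisy OR-gate assumption.

    present: list of 0/1 flags, one per cause X_i
    q:       list of inhibitor probabilities q_i, one per cause
    """
    if len(present) != len(q):
        raise ValueError("present and q must have the same length")
    # R is absent only if the inhibitor of every present cause is active.
    p_absent = prod(qi for xi, qi in zip(present, q) if xi)
    return 1.0 - p_absent


# Hypothetical inhibitor probabilities for three causes, for example
# congestion, a failed connection and a destroyed forwarding table.
q = [0.4, 0.3, 0.2]
single = noisy_or([1, 0, 0], q)
double = noisy_or([1, 1, 0], q)
triple = noisy_or([1, 1, 1], q)
```

Only one number per cause is stored, instead of a conditional probability table with one row per combination of causes, and adding a further present cause can only increase the probability of the consequence.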

Fault Localization Techniques
Here, the noisy OR-gate model performs causal reasoning tasks in polytrees. The polytree is a singly connected network with no more than one path between any two nodes. This structure helps to avoid the loops in the networks and facilitates the development of the fault propagation algorithm.
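Whether a given dependency graph satisfies this singly connected property can be checked before inference is attempted. The following sketch is one possible check, not part of the proposed framework: it ignores edge directions and uses a union-find structure to detect a second path between two nodes. The function name and the example node labels are hypothetical.

```python
def is_singly_connected(nodes, edges):
    """Return True if no two nodes are joined by more than one path.

    nodes: iterable of hashable node labels
    edges: iterable of (parent, child) pairs; directions are ignored here
    """
    root = {n: n for n in nodes}

    def find(n):
        # Follow links to the representative, halving the path on the way.
        while root[n] != n:
            root[n] = root[root[n]]
            n = root[n]
        return n

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            # u and v were already connected, so this edge closes a loop.
            return False
        root[ru] = rv
    return True


nodes = ["F1", "F2", "A1", "A2", "A3"]
tree_edges = [("F1", "A1"), ("F2", "A1"), ("A1", "A2"), ("A1", "A3")]
loop_edges = tree_edges + [("F1", "A2")]
ok = is_singly_connected(nodes, tree_edges)
has_loop = not is_singly_connected(nodes, loop_edges)
```

A graph that fails this check would need to be restructured before the polytree propagation rules can be applied.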

Messages Fuse and Propagate in Belief Networks
The noisy OR-gate model utilizes a message-passing mechanism to exchange messages in belief networks, as shown in Figure 2. Each node exchanges messages with its neighboring nodes in the reasoning process [10,48]. Initially, the network is in a stable state, no event occurs, and all nodes remain in their waiting state until messages are received. As soon as events arise in the network, the nodes associated with events are activated. The influences produced by activated nodes are spread to their neighboring nodes along the edges between them. Each node then (1) receives all messages from their neighboring nodes; (2) absorbs and produces new messages by the belief update algorithm that we introduce in Section 4.2; (3) sends these new messages to their neighboring nodes. This process continues until the abnormal events are removed or a new equilibrium is reached in the belief networks. Both the processes of absorbing and producing messages are detailed in Sections 4.2 and 4.3, respectively. In this fashion, we can track the changing environment in the network and provide a coherent interpretation.
As shown in Figure 2, we consider a general node $X$ that is neither a root node nor a leaf node of the belief network. The sets of nodes $(U_1, U_2, \ldots, U_m)$ and $(Y_1, Y_2, \ldots, Y_n)$ are the parents and children of $X$, respectively. Node $X$ receives the messages $(\pi_X(U_1), \pi_X(U_2), \ldots, \pi_X(U_m))$ and $(\lambda_{Y_1}(X), \lambda_{Y_2}(X), \ldots, \lambda_{Y_n}(X))$ from its parents and children, respectively, and sends $(\lambda_X(U_1), \lambda_X(U_2), \ldots, \lambda_X(U_m))$ and $(\pi_{Y_1}(X), \pi_{Y_2}(X), \ldots, \pi_{Y_n}(X))$ to its parents and children, respectively. Node $X$ triggers the calculation mechanism to update its own belief from the collected messages. It should be pointed out that the belief update can be carried out incrementally and need not wait until all the information has been collected.

The Belief Update in Belief Networks
In this section, we introduce the belief update process of nodes in the belief networks based on the message propagation mechanism. Each node receives $\pi(x)$ and $\lambda(x)$ messages from its parents and children, respectively. As shown in Figure 2, node $X$ obtains $\pi(x)$ and $\lambda(x)$ by Equations (1) and (2), respectively, and then calculates its own new belief $\mathrm{bel}(x)$ by Equation (3). In these equations, $\alpha$ is a normalizing constant that renders $\sum_x \mathrm{bel}(x) = 1$, and $q_i$ represents the probability that the $i$-th inhibitor is active, so that $c_i = 1 - q_i$ denotes the probability that the $i$-th potential cause endorses the event $X = \text{true}$. We let $\pi_{iX}$ represent the message that the $i$-th parent $U_i$ sends to $X$. $\lambda^0_{Y_i}(x)$ represents the inhibited evidence that $X$ receives from its $i$-th child, and $\lambda^1_{Y_i}(x)$ represents the active evidence that $X$ receives from its $i$-th child. $\lambda^0(x)$ and $\lambda^1(x)$ represent the inhibited and active evidence, respectively, that $X$ receives from all of its children. $x = 1$ and $x = 0$ represent the presence and absence of the event, respectively.
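The belief update can be sketched in Python as follows. The sketch assumes that the update takes the standard form for noisy OR-gates in polytrees (Pearl's formulation): the causal support for $x = 0$ is the product of $(1 - c_i \pi_{iX})$ over the parents, the diagnostic support is the product of the children's messages, and the belief is their normalized product. The function names are hypothetical, and every pair is ordered as (value for $x = 0$, value for $x = 1$), which is an ordering chosen for this sketch.

```python
from math import prod


def pi_message(c, pi_parents):
    """Causal support (pi0, pi1) that X collects from its parents.

    c[i]          : c_i = 1 - q_i, strength of the link U_i -> X
    pi_parents[i] : pi_iX, current probability that parent U_i is active
    """
    # X stays absent only if no parent succeeds in activating it.
    pi0 = prod(1.0 - ci * pi for ci, pi in zip(c, pi_parents))
    return pi0, 1.0 - pi0


def lambda_message(child_lambdas):
    """Diagnostic support (lam0, lam1): product of the children's messages."""
    lam0 = prod(l0 for l0, _ in child_lambdas)
    lam1 = prod(l1 for _, l1 in child_lambdas)
    return lam0, lam1


def belief(pi, lam):
    """Normalized belief (bel0, bel1) proportional to lambda(x) * pi(x)."""
    b0, b1 = lam[0] * pi[0], lam[1] * pi[1]
    total = b0 + b1
    if total == 0.0:
        raise ValueError("evidence is inconsistent with the model")
    return b0 / total, b1 / total


# Values taken from the worked example for node A2 later in the text.
bel0, bel1 = belief(pi=(0.5733, 0.4267), lam=(0.045, 0.595))
```

With the values of the worked example for node $A_2$, the sketch gives a belief of about 0.9078 for the active state, in agreement with the distribution reported there.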
Based on the description above, node $X$ receives $\pi(x)$ and $\lambda(x)$ messages from its parents and children, respectively. In turn, $X$ sends $\lambda$ and $\pi$ messages to its parents and children, respectively. We denote by $\lambda_X(u_i)$ the message that $X$ sends to its $i$-th parent and by $\pi_{Y_i}(x)$ the message that $X$ sends to its $i$-th child. They are calculated by Equations (4) and (5), respectively, where $\beta$ is an arbitrary constant, and $u_i = 1$ and $u_i = 0$ indicate that the parent $U_i$ is active and inhibited, respectively.
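The two outgoing messages can be sketched in the same way. The sketch again assumes the standard noisy OR-gate form for polytrees: the message to the $i$-th parent is $\lambda^1 - q_i^{u_i}(\lambda^1 - \lambda^0)\prod_{k \neq i}(1 - c_k \pi_{kX})$, with the arbitrary constant $\beta$ set to 1, and the message to the $j$-th child is the causal support multiplied by the messages of all other children. Function names and numerical values are hypothetical; pairs are ordered as (value for state 0, value for state 1).

```python
from math import prod


def lambda_to_parent(i, c, pi_parents, lam):
    """Message (lambda_X(u_i = 0), lambda_X(u_i = 1)) sent to the i-th parent.

    c, pi_parents : link strengths c_k and incoming pi_kX for all parents
    lam           : (lam0, lam1), evidence X has collected from its children
    """
    lam0, lam1 = lam
    # Probability that none of the *other* parents activates X.
    others = prod(
        1.0 - ck * pk
        for k, (ck, pk) in enumerate(zip(c, pi_parents))
        if k != i
    )
    q_i = 1.0 - c[i]
    msg0 = lam1 - (lam1 - lam0) * others        # u_i = 0, so q_i ** 0 = 1
    msg1 = lam1 - q_i * (lam1 - lam0) * others  # u_i = 1
    return msg0, msg1


def pi_to_child(j, pi, child_lambdas):
    """Normalized message (pi_Yj(x = 0), pi_Yj(x = 1)) sent to the j-th child."""
    others0 = prod(l0 for k, (l0, _) in enumerate(child_lambdas) if k != j)
    others1 = prod(l1 for k, (_, l1) in enumerate(child_lambdas) if k != j)
    m0, m1 = pi[0] * others0, pi[1] * others1
    total = m0 + m1
    if total == 0.0:
        raise ValueError("evidence is inconsistent with the model")
    return m0 / total, m1 / total


# X has two parents and is observed active, i.e. lam = (0, 1).
to_u0 = lambda_to_parent(0, c=[0.8, 0.5], pi_parents=[0.5, 0.2], lam=(0.0, 1.0))
# X has two children; the message to child 0 uses only child 1's evidence.
to_y0 = pi_to_child(0, pi=(0.54, 0.46), child_lambdas=[(1.0, 1.0), (0.2, 0.9)])
```

For a node with a single parent that is observed in its normal state, `lambda_to_parent` returns $(1, q_i)$, so the parent never receives a zero message for its active state.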

The Storage Mechanism of Belief Networks
As mentioned in the previous section, belief networks can be viewed not merely as a computational architecture but also as a memory for storing knowledge. Similar to the way in which each router in a network maintains a routing table, each node in the belief network maintains a network parameter table. In the parameter table, node $X$ records the information shown in Table 1. Node $X$ receives the $\lambda^0_{Y_i}$ and $\lambda^1_{Y_i}$ messages from each of its children by Equation (4), and $\lambda^0(x)$ and $\lambda^1(x)$ are calculated by Equation (2). Likewise, node $X$ receives the message $\pi^0_X(u_i)$ from each of its parents by Equation (5), and $\pi^0(x)$ and $\pi^1(x)$ are calculated by Equation (1). $p(x)$ is the prior probability of $X$, and $\mathrm{bel}(x)$ is its updated belief. It should be pointed out that node $X$ records only the $q_i$ of the edges incoming from its parents. In addition, the hidden variable $\pi^1(x)$ can be calculated by $\pi^1(x) = 1 - \pi^0(x)$.

Table 1 (column headers): Relationship, Name, Prior Probability, Belief.
With this table, it is convenient to calculate the updated belief $\mathrm{bel}(x)$ of each node in the belief network once the node has received $\pi(x)$ from each of its parents and $\lambda(x)$ from each of its children. This data storage solution is also useful for understanding and developing the message propagation algorithm. As an example, an application of the belief propagation algorithm to fault localization is given in Section 4.4.
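One way to picture such a per-node parameter table is a small record that stores the incoming messages and derives the belief from them. The following Python sketch is an assumption about how the table could be laid out, not the layout of Table 1: the class name `ParameterTable` and its field names are hypothetical, and the update formulas are the standard noisy OR-gate forms for polytrees, with pairs ordered as (value for $x = 0$, value for $x = 1$).

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class ParameterTable:
    """Per-node store, loosely analogous to a router's routing table."""
    name: str
    prior: float = 0.0                          # p(x), used by root (fault) nodes
    q: dict = field(default_factory=dict)       # parent name -> inhibitor q_i
    pi_in: dict = field(default_factory=dict)   # parent name -> pi_iX
    lam_in: dict = field(default_factory=dict)  # child name -> (lam0, lam1)

    def pi(self):
        if not self.q:
            # Root node: the causal support is the prior probability.
            return 1.0 - self.prior, self.prior
        pi0 = prod(1.0 - (1.0 - self.q[u]) * self.pi_in[u] for u in self.q)
        return pi0, 1.0 - pi0

    def lam(self):
        lam0 = prod(l0 for l0, _ in self.lam_in.values())
        lam1 = prod(l1 for _, l1 in self.lam_in.values())
        return lam0, lam1

    def bel(self):
        (p0, p1), (l0, l1) = self.pi(), self.lam()
        total = p0 * l0 + p1 * l1
        if total == 0.0:
            raise ValueError("evidence is inconsistent with the model")
        return p0 * l0 / total, p1 * l1 / total


# Hypothetical alarm node with two parents and one child.
alarm = ParameterTable(
    name="A1",
    q={"F1": 0.2, "F2": 0.5},
    pi_in={"F1": 0.5, "F2": 0.2},
    lam_in={"A2": (0.2, 0.9)},
)
alarm_bel = alarm.bel()
# Hypothetical fault node with no evidence yet: the belief equals the prior.
fault_bel = ParameterTable(name="F1", prior=0.1).bel()
```

Because every quantity needed for the update is held locally, a node can recompute its belief whenever a single entry of `pi_in` or `lam_in` changes.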

Application of the Belief Propagation Algorithm to Fault Localization
One example of a belief network model corresponding to a small communication network is depicted in Figure 3. A fault may result in a set of alarms; in the same way, one alarm can also trigger other alarms. Based on the knowledge gained from human experience or learned from data using machine learning algorithms, the prior probability of each fault is assigned to the corresponding fault node, and the conditional probability that measures the strength of dependency between neighboring nodes is recorded in the child nodes. In the application, the model establishes the following boundary conditions:
• Fault nodes. If node $X$ is a fault node, we set $\pi(x)$ to be equal to its prior probability.
• Leaf nodes. A leaf node is an alarm node $X$ with no children. If $X$ is instantiated, we set $\lambda^1 = 1$, $\lambda^0 = 0$. In contrast, if $X$ is in a normal state, we set $\lambda^1 = 0$, $\lambda^0 = 1$. In addition, if $X$ has only one parent, in order to prevent the parent from receiving $\lambda^0(x) = 0$, the message propagation between $X$ and its one parent is restricted to $(c_i, q_i)$. In other words, we assume that node $U$ is the only parent of $X$, and the conditional probability between them is $(c_i, q_i)$; if $X$ is in a normal state, then λ 0
• Instantiated nodes. If node $X$ is instantiated, we set $\mathrm{bel}(x) = 1$, $\lambda^1(x) = 1$ and $\lambda^0(x) = 0$ regardless of the other values in the expression. Node $X$ is therefore turned into a leaf node, and message propagation is blocked between $X$ and its children.
In order to perform the event-driven fault localization task in the communication network, we adopted a self-activated message propagation algorithm for fault localization, as follows. The fault localization process starts at any node in the belief network that is instantiated. As stated earlier, the status value of an instantiated node is assigned 1, that is, $\mathrm{bel}(x) = 1$, $\lambda^1(x) = 1$, $\lambda^0(x) = 0$. The algorithm then performs fault localization in an event-driven manner until the events disappear or a new equilibrium is reached in the network. The algorithm is executed by the following procedure:
Step 1. The current belief network is initialized. In this phase, no event occurs, and no evidence arises in the network. As a result, each node, except for the root nodes, receives $\pi$ messages from each of its parents, and each node, except for the leaf nodes, receives $\lambda$ messages from each of its children. Each alarm node is assigned a probability value by Equations (1) and (2).
Step 3. The neighboring nodes of $X$ calculate their new beliefs $\mathrm{bel}(x)$ based on the received messages and send new $\pi$ and $\lambda$ messages to their children and parents, respectively.
Step 4. Step 3 is repeated along the chain until a new equilibrium is reached or the abnormal events have been removed from the network.
Step 5. A group of faulty nodes is found, and the nodes are arranged in descending order of probability. The higher the probability value, the more probable it is that the corresponding fault has occurred.
Step 6. The fault's root cause is estimated. One fault, or a combination from the set of faults, that provides the best explanation to all present alarms is selected.
These processes are also detailed in Figure 4.

Figure 3 shows a network with three fault nodes and five alarm nodes. The dependency relationships of the entities in the network were mapped into a polytree. Each network event corresponds to a node in the belief network. For example, the network event alarm 1 corresponds to node $A_1$ in the belief network. The set of parameters of the network is labeled in Figure 3.
We assume that alarm 3, alarm 4 and alarm 5 arise in the network; the nodes $A_3$, $A_4$ and $A_5$ are then instantiated in the belief network, where $\lambda^1 = 1$ and $\lambda^0 = 0$ for each instantiated node. After the resulting messages have been propagated, the new belief distribution of $A_2$ can be calculated by Equation (3):
$$\mathrm{bel}(a_2) = \alpha\left(\lambda^1(a_2)\pi^1(a_2),\ \lambda^0(a_2)\pi^0(a_2)\right) = \alpha(0.595 \times 0.4267,\ 0.045 \times 0.5733) = \alpha(0.2539,\ 0.0258) = (0.9078,\ 0.0922)$$
Fault 3 then receives messages from alarm 2 and updates its belief in the same way; the updated belief distribution of fault 3 is (0.9346, 0.0654). Each node can calculate its own belief distribution by receiving the messages from its neighbors. Therefore, we obtain the final belief distribution of each variable in the same way, which is shown in Table 2. It is evident that alarm 3, alarm 4 and alarm 5 are caused by fault 3. Alarm 2, which has a high occurrence probability although no alarm annunciation was received, can be explained by one of two factors: either the uncertainty of the network led to the loss of the alarm, or alarm 2 was not triggered because of its higher performance threshold.

Case Study
In this section, a typical fault scenario in a communication network is studied. This study is supported by the Chinese government and the research foundation of a railway company. As a case example for applying and verifying the proposed approach, we selected a fault scenario in an optical transport network used in the railway business.
Railway companies often have a series of sites that are geographically separated along the railway line. To connect these sites to their service center, a high-quality channel for transferring data is necessary. An optical transport network provides the capacity to schedule and transmit various types of services with different granularities. In general, it consists of long-distance optical fiber cables and a large amount of switching equipment. Faults often arise in these components. In this study, the transmission network is a synchronous digital hierarchy (SDH) network. The railway line is approximately 628 km long, and there are 34 sites distributed along it. Figure 5 shows a schematic topology of the communication network of the railway company.
As the backbone of the communication network, transmission networks carry a large number of important services for train running, such as synchronous control for both locomotives and trains. Radio train dispatching communication controls the train tail device, sends and receives dispatching instructions to and from running trains, and identifies and checks the numbers of running trains, which are sensitive to network communication server quality. More importantly, those services are the basis for organizing railway transport, enhancing production efficiency, and protecting railway operation safety. Continuous monitoring of the performance of a transmission network and localizing root causes quickly and accurately after faults occur are of crucial importance for network managers to ensure the reliability and quality of the communication network. Any unexpected or prolonged downtime that leads to transportation interruption largely decreases the loyalty of customers and drastically affects the efficiency of transport. Alarms are our main information source for understanding the operating states of the equipment and performing fault localizing tasks in communication networks. In this case study, the alarm data were obtained from the alarm management system via the data interface. The fault information was obtained from the equipment specifications provided by the equipment vendors. There are several alarm attributes in every alarm message. Based on our approach, both alarm name and alarm position were selected for our case study. Figure 6 depicts a fault scenario consisting of four pieces of equipment; namely, NE-21, NE-22, NE-23, and NE-24, geographically separated over four sites, respectively. Here, NE refers to network entity. The number following the NE represents the geographical position code of the network entity. We assume that the fiber link between NE-22 and NE-23 breaks. 
NE-22 is positioned upstream of NE-23, so the service data flow from NE-22 to NE-23. The broken link leads to a communication break between NE-22 and NE-23. As a consequence, the sites near NE-23 may experience an abnormal condition and report a large number of alarms. For example, NE-23 may trigger the R-LOS (Receive Loss Of Signal) alarm due to the broken link. The multiplex section protection mechanism of the SDH network starts up simultaneously, and a changeover occurs. The MS-APS-INDI-EX (Multiplex Section-Protect switch indicate expand) and APS-INDI (Automatic Protection Switching State indicate) alarms at NE-23 are then reported. We also received R-LOS and ALM-GFP-dLFD (Generic Framing Procedure Loss of frame delineation) alarms at NE-22 and NE-23, respectively, due to the loss of signal and the loss of frame alignment. The alarms reported by NE-22, NE-23 and NE-24 are shown in Table 3. An experienced operator may suspect the following probable causes of these alarms: (1) a fault in the single board of an optical switch in NE-22, (2) a fault in the single board of an optical switch in NE-23, and (3) a broken or degraded link between the optical fiber and NE-22 or NE-23. Troubleshooting each of these candidates in turn costs time, during which the fault may evolve into a catastrophic event.

Table 3. Alarm data gathered from NE-22, NE-23 and NE-24.
Network entity: Alarms
NE-22: R-L, LTI, CLK
NE-23: BD, TU, APS, MS, T-A, R-L, ALM, E-L
NE-24: ALM
The detailed description of the alarm identifiers is shown in Figure 6.
In order to find the root cause among the three most probable causes in minimal time, we combined human expertise and the knowledge learned from the historical alarm log in the belief network shown in Figure 7, in which all of these alarms are mapped to the corresponding nodes. We take these alarms as evidence in the causal reasoning, and the fault localization process starts from these alarm nodes. Using the proposed inference algorithm, the root cause that is found is consistent with the assumed fault scenario. After three iterations, the probability distributions of the three potential causes are p(Board_22) = (0.0131, 0.9869), p(Fiber_Channal_22_23) = (0.9999, 0.0001) and p(Board_23) = (0.1410, 0.8589), respectively. It is evident that Fiber_Channal_22_23 is the root cause of these alarms, and the inference requires less than 0.0006 s.
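The final ranking step can be sketched in a few lines of Python. The sketch assumes that the first component of each reported distribution is the probability that the fault is present, which matches the conclusion drawn above. The function name `rank_faults` is hypothetical, and selecting only the single highest-ranked fault is a simplification of Step 6, which may also select a combination of faults.

```python
def rank_faults(presence):
    """Sort fault nodes by their probability of being present, highest first."""
    return sorted(presence.items(), key=lambda item: item[1], reverse=True)


# Presence probabilities reported for the three candidate causes.
presence = {
    "Board_22": 0.0131,
    "Fiber_Channal_22_23": 0.9999,
    "Board_23": 0.1410,
}
ranking = rank_faults(presence)
root_cause = ranking[0][0]
```

The ranking places Fiber_Channal_22_23 first, followed by Board_23 and Board_22.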
The message exchange in the inference process is a dynamic iterative process; nevertheless, the belief distribution of each node converges to a unique equilibrium state. Figure 8 depicts the dynamic convergence process of each variable.

Evaluation and Discussion
In this section, we present a series of experimental simulations to assess the performances of the proposed fault localization techniques according to four metrics: convergence speed, reliability, the ability to deal with multiple-source faults, and the capability of identifying faults in an uncertain environment.

Generation of the Belief Network
Based on the topology of the transmission network and the dependency relationships of its entities, a belief network was built by combining human expertise and knowledge learned from observed data. Building a belief network structure and estimating the conditional probability parameters are also important tasks, but they are beyond the scope of this study.

Experiment Settings
The fault scenarios and experimental data come from fault analysis reports. Here, 112 fault analysis reports from February 2016 to August 2019 were gathered. Each report records a large number of alarms and faults inferred from these alarms. In order to estimate the performance of the proposed approach, we selected 10 reports from the 112 fault analysis reports. In these 10 reports, we randomly added some noise alarm data.
In each experiment, one report is selected, and all the alarms in the report are viewed as symptoms. Their status values are set as 1. The fault localization process is then started from these alarms. According to the requirements of different evaluation indexes, corresponding evaluation results are obtained.
In addition, the data used for the experiments using support vector machine (SVM) and multi-layer perceptron (MLP) approaches are generated using a simulation environment (Simulation Laboratory). For example, we trained the MLP model with 1000 data samples for link failures, and the test data set includes 150 data samples. SVM and MLP are classic data analysis methods for the classification problem, and have been widely used in the literature for fault identification and localization [25,26,49,50].

Convergence Speed
The convergence speed is an important metric for assessing a dynamical system. In our model, alarms are viewed as perturbations that propagate through the network between neighboring nodes and act as the driving force of message propagation and fault inference. The network reaches a new equilibrium after several iterations. Table 4 shows the experimental results obtained for networks with sizes ranging from 100 to 2000 nodes. These results show that the message propagation approach has a good fault localization performance in terms of convergence speed. As the size of the network increases, the time required to reach an equilibrium state grows only moderately: the approach requires 0.0054 s to reach the equilibrium state in a 100-node network and 0.0868 s in a 2000-node network, that is, about a 16-fold increase in time for a 20-fold increase in network size. The faster the localization of the fault's root cause, the less substantial the impact of the fault on the network. In Table 5, we present the time taken by the different approaches to localize the fault's root cause in the network. The experiments were carried out in a belief network with 2000 nodes. Note that the localization times of the SVM and MLP approaches are not affected by the size of the network. Based on the experimental results, the PTNORgate approach significantly reduces the localization time compared with the BN. The SVM and MLP approaches do not perform well in terms of time because they require a long training period to train their learning models.

Reliability
Reliability is an important metric for assessing a fault localization system. It measures the trustworthiness of a system's judgments. We use the following metrics to estimate the reliability of the proposed approach.

• Precision: The ratio of the number of fault analysis reports correctly identified to the total number of fault analysis reports identifying faults. The higher the precision, the lower the misdiagnosis rate, and vice versa. The precision value is computed as follows:

$$P = \frac{TP}{TP + FP}$$

where $P$ is the precision value, $TP$ is the number of true positives, and $FP$ is the number of false positives.

• Recall: The ratio of the number of fault analysis reports correctly identified to the number of fault analysis reports that actually occurred. The higher the recall, the lower the rate of missed faults, and vice versa. The recall value is computed as follows:

$$R = \frac{TP}{TP + FN}$$

where $R$ is the recall value and $FN$ is the number of false negatives.

• F1-Score: The F1-Score is the harmonic mean of the precision and recall. The higher the F1-Score, the better the performance of the approach. The F1-Score value is computed as follows:

$$F_1 = \frac{2 P R}{P + R}$$

In Figure 9, we plot the precision, recall, and F1-Score values of the various approaches. The results show that the PTNORgate approach achieves 100% precision, closely followed by BN with a precision of 96.63%. This indicates that cause-effect inference is suitable for fault localization. The results also show that the PTNORgate approach localizes faults with minimal misdiagnosis. We obtained a recall of 96.07% for PTNORgate. This high recall value implies that PTNORgate has a low false negative rate. MLP attained only 86.3%, 82.1%, and 84.15% for precision, recall, and F1-Score, respectively. This may be due to overfitting. SVM has the worst performance among the four approaches; nevertheless, it achieved 83.06% precision, 76% recall, and 79.81% F1-Score. Among the four approaches, the PTNORgate approach clearly outperforms the others. This shows the reliability of the PTNORgate approach in fault localization.
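The three reliability metrics defined in this subsection can be computed with a few lines of Python. This is a minimal sketch: the function names are ours, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported faults that are real: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0  # assumed convention for empty denominator


def recall(tp: int, fn: int) -> float:
    """Fraction of real faults that are reported: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0  # assumed convention for empty denominator


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r: 2PR / (P + R)."""
    total = p + r
    return 2.0 * p * r / total if total else 0.0
```

As a consistency check against the reported figures, `f1_score(0.863, 0.821)` evaluates to approximately 0.8415, which matches the F1-Score of 84.15% reported for MLP.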

Capability to Deal with Multi-Source Faults
We have demonstrated that the proposed approach performs well in terms of convergence speed and reliability when dealing with a single fault. We now evaluate the capability of the approach to deal with multiple, simultaneous faults. A 2000-node network and 112 fault analysis reports are used.
The test process is as follows. Two fault analysis reports are randomly selected as the fault scenario of the test. The fault localization model is then run and the root causes are determined. The diagnosis results are checked for consistency with the selected fault analysis reports. Another two reports are then selected from the remaining fault analysis reports, and the entire procedure is repeated until all fault analysis reports have been tested. The results show that this method can effectively solve the problem of fault localization in multiple-fault scenarios.
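The pairwise test procedure described above can be sketched in Python as follows. The report layout (dictionaries with `root_cause` and `alarms` keys) and the `localize` callback are hypothetical stand-ins introduced for illustration; the paper does not specify its data format or the interface of the localization model.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple

Report = Dict[str, object]  # hypothetical: {"root_cause": str, "alarms": list of str}


def run_pairwise_trials(
    reports: List[Report],
    localize: Callable[[List[str]], Iterable[str]],
    rng: random.Random,
) -> Tuple[int, int]:
    """Draw reports two at a time without replacement and test each pair.

    Returns (number of correctly diagnosed pairs, number of pairs tested).
    """
    shuffled = list(reports)
    rng.shuffle(shuffled)  # random selection without replacement
    correct = 0
    trials = 0
    # A leftover report (odd count) is skipped; 112 reports give 56 pairs.
    for i in range(0, len(shuffled) - 1, 2):
        pair = shuffled[i:i + 2]
        expected = {r["root_cause"] for r in pair}
        # Merge the alarms of both faults into one simultaneous scenario.
        alarms = [a for r in pair for a in r["alarms"]]
        diagnosed = set(localize(alarms))
        trials += 1
        if diagnosed == expected:
            correct += 1
    return correct, trials
```

A pair counts as correct here only if the diagnosed set of root causes equals the set given by the two reports; a looser criterion, such as counting each fault separately, would be an equally plausible reading of the procedure.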
Taking the failure scenarios described in Section 5 as examples, Figure 10a,b show the iteration processes and localization results of single-fault scenarios, namely a fiber link break and a function board fault. Figure 10c shows the iteration processes and localization results when the two faults occur simultaneously. The results show that the method can accurately identify fiber link faults and function board faults.

The Ability to Identify Faults in Uncertain Environments
A communication network is a complex and dynamic system. Fault localization approaches need to be able to deal with the uncertainty of a network.
We took the fault scenario described in Section 5 as an example to consider fault inference under uncertain conditions. A total of 12 alarms were received in this fault scenario. Figure 11a-f show the iteration processes and results when one to six alarms are removed from the alarm list; the alarms were removed at random during the experiment. Although inference became more difficult, our approach still identified the root cause of the fault when six alarms were removed. Figure 12 shows the fault identification accuracy at different levels of uncertainty. Because uncertainty levels above 50% are implausible in real-life fault scenarios and make effective fault inference impossible, we used five configurations to generate different uncertainty levels: 10%, 20%, 30%, 40%, and 50% of alarms missing. Figure 12 shows that uncertainty levels below 20% barely influence the fault localization results, and that the fault identification accuracy decreases as the uncertainty level increases. For multiple, simultaneous fault scenarios, the fault identification accuracy for faults without overlapping alarms is higher than that for faults with overlapping alarms at the same uncertainty level (except at 10% and 20%). This phenomenon is consistent with the local operation of the polytree structure.
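The alarm-removal setup used to generate the uncertainty levels can be sketched in Python as follows. The function name `drop_alarms` and the rounding of the number of removed alarms to the nearest integer are assumptions made here; the paper does not state how fractional counts are handled.

```python
import random
from typing import List


def drop_alarms(alarms: List[str], missing_fraction: float, rng: random.Random) -> List[str]:
    """Return a copy of `alarms` with a random subset removed.

    `missing_fraction` is the uncertainty level, e.g. 0.5 for 50% missing.
    The number of removed alarms is rounded to the nearest integer (assumption).
    """
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must lie in [0, 1]")
    n_drop = round(len(alarms) * missing_fraction)
    # Choose the positions to remove uniformly at random, without replacement.
    dropped = set(rng.sample(range(len(alarms)), n_drop))
    # Keep the surviving alarms in their original order.
    return [a for i, a in enumerate(alarms) if i not in dropped]
```

For the 12-alarm scenario, a 50% uncertainty level leaves six alarms, which corresponds to the hardest case shown in Figure 11. Passing a seeded `random.Random` instance makes each configuration reproducible.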

Conclusions and Future Work
In this paper, we propose a framework for fault localization in communication networks. The clear structure of data storage, inference, and message transmission in the overall framework exposes information about the fault inference procedure and facilitates the development of a message propagation approach that is applicable to various fault localization problems. Performing fault localization in an event-driven manner increases the degree of automation and reduces human intervention in the fault localization process. The PTNORgate model was used to reduce the computational complexity of the inference process.
We assessed the proposed approach extensively in experiments, which show its benefits in comparison to other approaches. These results show that our approach provides an efficient framework for root cause localization in terms of convergence speed, reliability, automation, and the ability to deal with multiple-source faults in uncertain environments. In contrast, SVM and MLP performed poorly in our work due to multiple fault classifications and overfitting problems, respectively.
In the future, we plan to further investigate the accuracy of the dependency relationships between failures and alarms in networks. Indeed, the reliability of a fault localization model depends on accurate dependency relationships between variables. Discovering and identifying dependency relationships between failures and alarms are complex tasks for a large-scale network. Therefore, the next step is to propose a method that automatically learns the causal relationships among failures and alarms purely from data.