1. Introduction
Today’s real-world engineering systems are a product of careful design and manufacturing, and they usually undergo rigorous testing and validation before deployment. However, due to wear and tear from sustained operation, degradation and faults in system components still occur. In addition, unlikely events and unanticipated situations can also create faults. To avoid these negative effects, it is imperative to accurately track system behavior and to detect and isolate faults in a timely manner [1,2,3,4].
Model-based diagnosis techniques are frequently used to solve these problems. The key idea is to detect discrepancies between the actual system behavior and the predictions of a model [5]. Traditionally, two distinct scientific communities have employed different kinds of models to implement model-based diagnosis:
The Fault Detection and Isolation (FDI) methods capture system behavior using differential equation models, whose foundations are based on control and statistical decision theories.
The Diagnosis (DX) methods use qualitative models and logical approaches, with foundations in the fields of computer science and artificial intelligence.
More specifically, the FDI community has proposed three typical methods to track and diagnose system behavior using the system model: (1) parameter estimation methods that estimate the values of particular parameters [6]; (2) state estimation methods that use observers or filters to estimate unknown variables [7]; and (3) parity space methods that design a set of residuals by eliminating the unknown variables [5]. On the other hand, many researchers in the DX community assume that the system can be modeled as a Discrete Event System (DES) [8] at some level of abstraction. A DES model is characterized by a set of discrete states, a set of observable and unobservable events, and transitions between the discrete states. The dynamic behavior of a DES is described by partitioning time into discrete points at which events occur [9]. On the basis of this system model, the goal of diagnosis is to find unobservable fault events or discrete fault states. To date, this technique has been widely applied in many domains, such as power transmission lines [10] and telecommunication networks [11,12].
Over the years, researchers and practitioners in both communities have dedicated their efforts to understanding and bridging the FDI and DX approaches. Cordier et al. [13] gave a systematic comparison of analytical redundancy relations (ARRs) and conflicts, but their analysis applied only to the diagnosis of static systems. Bregon et al. [14] compared three structural fault isolation techniques from both communities for linear continuous dynamic systems. Meanwhile, Travé-Massuyès [15] further discussed the facets of diagnosis in the FDI and DX communities, and exemplified how theories from these areas can be synergistically combined to provide better diagnostic solutions and achieve improved fault management.
In this paper, we focus on the DX area. In this area, a typical diagnosis approach for DES is based on a diagnoser [16,17], which uses a deterministic finite state machine without emitted events. The diagnosis problem is addressed by compiling the original finite state machine into one that contains only observable transitions and produces the same language in terms of observations. The weakness of this approach is the feasibility of compiling a large-scale complex system model into a structure of reasonable size. Several approaches have been put forward to overcome this dilemma, such as the off-line compiler technique [18], distributed diagnosis [19] and hierarchical diagnosis [20].
The simulation-based approach [21,22] is another general method. In this approach, the temporal evolution of the system is decomposed into a set of state constraints that hold for the state at each time-step and a set of sequence constraints that restrict the possible transition sequences of such states, so diagnosis is performed by checking whether the state constraints and sequence constraints are consistent with the observations. The main problem of this approach is that the number of possible system trajectories becomes too large to process after only a few time-steps.
To address this problem, Williams [21] and Kurien [22] proposed k Best-First Trajectory Enumeration (BFTE). Unfortunately, trajectory probability is significantly underestimated in this approach, because it ignores the additional trajectories that lead to the same state. Considering this shortcoming, Martin [23] presented k Best-First Belief State Enumeration (BFBSE), which increases estimator accuracy and uses less memory and computation time. After that, Williams and Ragno [24] introduced a conflict-directed A* (CDA*) search algorithm into these methods, so the belief state search is further accelerated by eliminating subspaces around each state that are inconsistent with the observations. For these k best methods, choosing a suitable k value is the key issue: for real-time operation, a large value brings more computational complexity, while a small one loses estimator accuracy and may even result in misdiagnosis. Moreover, symbolic techniques are also feasible: the authors of [9] exploited Ordered Binary Decision Diagrams (OBDDs) to encode the system model and belief state, so that the complete belief state can be estimated. However, this still limits applicability to relatively simple systems.
This paper develops a novel approximate simulation-based approach that tracks both a variety of operational modes of the system and arbitrary combinations of fault conditions, which enables determining the most likely system states and trajectories for dynamic diagnosis. First of all, the LUG [25] is introduced to describe online monitoring and fault diagnosis of discrete systems. Moreover, the LUG scheme employs the Monte Carlo (MC) technique to sample the belief state distribution. The differences between our approach and the classical MC technique are twofold. First, the particles in our approach are used only for sampling, not for on-line filtering, and particles are assumed to have unit weight throughout the simulation. Second, every particle is tagged with a unique symbol, so the system evolution trajectories can easily be obtained, and numerous trajectories need not be preserved as the system evolves. Finally, since Monte Carlo techniques suffer from the sample impoverishment problem, observation information is combined with prior information to recursively generate the most likely belief states. Although this use of observation information has been employed in the literature [7,26], our approach does not need to exhaustively consider all possible successor modes, so it can cope with a large number of discrete modes.
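The tagged, unit-weight particle idea can be illustrated with a minimal sketch (the names and the toy relay model are ours, not the implementation evaluated later in the paper): each particle carries a unique tag, so the distinct tagged histories recover the system's evolution trajectories without storing them explicitly.

```python
import random
from collections import Counter

random.seed(0)

def step(particles, transition):
    """Advance every unit-weight particle one time-step.

    `transition` maps a mode to a list of (successor, probability) pairs;
    particles are only sampled, never re-weighted by observations.
    """
    out = {}
    for tag, history in particles.items():
        mode = history[-1]
        succs, probs = zip(*transition[mode])
        out[tag] = history + [random.choices(succs, weights=probs)[0]]
    return out

# Toy relay model: from "open", a close command leads to three possible successors.
transition = {"open": [("closed", 0.989), ("stuck_open", 0.01), ("unknown", 0.001)]}
particles = {tag: ["open"] for tag in range(1000)}   # tag = unique symbol
particles = step(particles, transition)

# The belief state is just the particle count per mode, and the evolution
# trajectories fall out of the tags for free.
belief = Counter(history[-1] for history in particles.values())
trajectories = Counter(tuple(history) for history in particles.values())
```

Because each tag is preserved across steps, repeated calls to `step` accumulate per-particle histories, from which the most likely trajectories can be read off directly.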
The rest of this paper is structured as follows: Section 2 gives basic definitions of the component and system model, simulation-based dynamic diagnosis, and classical belief state update. Section 3 introduces the LUG to describe the dynamic diagnosis process of discrete systems. In Section 4, our approach is formally illustrated and analyzed in detail. Section 5 describes the experimental results on a real-world model: a portion of the power supply control unit of a spacecraft. Finally, a conclusion is presented in the last section.
3. Exploitation of LUG for State Tracking and Fault Diagnosis
A planning graph represents a relaxed look-ahead of the belief state space that identifies propositions reachable at different depths. It can be described as a layered graph of vertices ⟨P_0, A_0, P_1, A_1, …⟩, where each level t consists of a proposition layer P_t and an action layer A_t. More specifically, the proposition layer P_t denotes the set of propositions at level t, while the action layer A_t includes all actions that have all of their precondition propositions in P_t. Edges between the layers describe the propositions in action preconditions (from P_t to A_t) and effects (from A_t to P_{t+1}) [29].
LUG, proposed by Bryce [25,29,30], extends the traditional planning graph in the following two ways. First, since uncertainty is considered in Bryce’s approach, an extra effect layer E_t is introduced into each level t, between A_t and P_{t+1}. An effect is in the effect layer E_t if its associated action is in the action layer A_t and every one of its antecedent propositions is in P_t. As a result, the uncertainty planning graph can be denoted as a sequence of layers ⟨P_0, A_0, E_0, P_1, A_1, E_1, …⟩. Second, LUG implicitly represents multiple uncertainty planning graphs by collapsing the graph connectivity into one uncertainty planning graph and uses annotations, called labels, to retain information about the multiple worlds. Bryce has successfully applied this data structure to solve the probabilistic conformant planning problem with actions whose effects are uncertain.
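The layer construction just described can be sketched as follows; this is a toy encoding under our own naming (`Level`, `expand`), not Bryce's actual data structures, but it captures the membership rules for A_t and E_t.

```python
from dataclasses import dataclass, field

@dataclass
class Level:
    """One level t of the (uncertainty) planning graph."""
    P: set                                 # proposition layer P_t
    A: set = field(default_factory=set)    # action layer A_t
    E: set = field(default_factory=set)    # effect layer E_t

def expand(level, actions, effects):
    """Build A_t and E_t from P_t.

    `actions` maps an action to its precondition set; `effects` maps an
    effect to (action, antecedent set). An action enters A_t when all its
    preconditions are in P_t; an effect enters E_t when its action is in
    A_t and all its antecedents are in P_t.
    """
    level.A = {a for a, pre in actions.items() if pre <= level.P}
    level.E = {e for e, (a, ante) in effects.items()
               if a in level.A and ante <= level.P}
    return level

# A relay-flavored toy level: the "close" action is applicable, and its
# "closed" effect has all antecedents present.
lvl = expand(Level(P={"open", "cmd_close"}),
             actions={"close": {"open", "cmd_close"}},
             effects={"closed": ("close", {"open"})})
```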
The probabilistic conformant planning problem is closely related to state tracking and fault diagnosis of discrete systems [31], because the task of generating the most likely belief states to match given observations can be viewed as a probabilistic plan generation problem. In particular, the two problems exhibit many similar features, such as a finite state space, uncertainty in the initial state and action effects, and reachable goals. Therefore, the LUG is introduced to represent the dynamic diagnosis process of discrete systems: the proposition layer, action layer and effect layer in LUG correspond, respectively, to the possible belief states, the transition events and the possible results of transitions in simulation-based dynamic diagnosis.
In order to address the exponential growth of the set of possible belief states, the Monte Carlo (MC) technique [32] is employed to sample belief states in our method. First, unlike traditional methods that use exact quantitative probabilities, we approximate probability by means of particles: the number of particles in a particular belief state represents the likelihood of that belief state. This approximate strategy allows our approach to focus on the highly probable belief states, without checking a prohibitively large number of unlikely ones. Second, the MC technique in this paper is used only for sampling, and particles are assumed to have unit weight throughout the simulation; unlike a generic particle filter, our work does not use observations to weight particles for re-sampling. For instance, if a belief state b_i has 100 particles and a transition occurs from b_i to a belief state b_j with probability 0.95, there will be 95 particles in b_j after performing this transition. Third, every particle is tagged with a unique symbol, which can be used to analyze the system’s possible evolution trajectories.
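The unit-weight bookkeeping in the example above amounts to distributing particles over successors proportionally to the transition probabilities. A sketch follows; the largest-remainder rounding is our assumption, since the paper does not specify how fractional particles are resolved.

```python
def split_particles(n, transition_probs):
    """Distribute n unit-weight particles over successor belief states
    proportionally to their transition probabilities.

    Largest-remainder rounding keeps the total exactly n (an assumption
    on our part; the paper's example is exact: 100 particles, p = 0.95
    gives 95 particles in the successor).
    """
    raw = [(succ, n * p) for succ, p in transition_probs.items()]
    counts = {succ: int(x) for succ, x in raw}
    # Hand out any remaining particles to the largest fractional parts.
    leftovers = sorted(raw, key=lambda sx: sx[1] - int(sx[1]), reverse=True)
    for succ, _ in leftovers[: n - sum(counts.values())]:
        counts[succ] += 1
    return counts

print(split_particles(100, {"b_j": 0.95, "b_k": 0.05}))  # {'b_j': 95, 'b_k': 5}
```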
A simple circuit, shown in Figure 1, is used as an illustrative example. The relay in this circuit is modeled as an automaton with five discrete modes: S1: open, S2: closed, S3: stuck_open, S4: stuck_closed and S5: unknown. The mode transitions are probabilistic: when the initial mode is S1: open and the command close is issued, the possible successor modes are S2: closed, S3: stuck_open and S5: unknown, with transition probabilities 0.989, 0.01 and 0.001, respectively.
Figure 1.
A simple circuit consisting of a battery, relay and a load.
Figure 2 reports the LUG for this relay. The number of particles is set to 1000. Proposition layers P_t and P_{t+1} represent the possible belief states at times t and t + 1. Action layer A_t contains the transition events at time t; in this layer, controlled events and autonomous events are denoted by propositional logic formulas and functions, while idle events are drawn as dashed lines. The effect layer E_t describes the possible transition effects after an infinitesimal time and depends on the proposition layer P_t and the action layer A_t. According to the formal definition of classical belief state update in Section 2.3, the Prediction step in LUG goes from the proposition layer P_t to the effect layer E_t, and the Update step is between the effect layer E_t and the proposition layer P_{t+1}. The label below each belief state contains its tagged particles. At time t, 1000 particles are initialized using the prior distribution of the current belief state; if the current mode of the relay is unknown, a uniform distribution is adopted. First, the Prediction step is executed: for each possible transition relation, its propositional logic formula is evaluated, and the corresponding particles are transitioned from the source belief state to the associated effect. After that, the observation at time-step t + 1 is taken into account: whenever the estimated belief state in an effect is consistent with the observation, all the particles in that effect are moved further into the corresponding belief state at t + 1. As can be seen from Figure 2, four possible belief states, S1, S3, S4 and S5 with 398, 200, 202 and 200 particles, are captured at time-step t + 1, and the possible system evolution trajectories can also be obtained from the tagged particles.
Figure 2.
Relay depicted by LUG.
5. Experimental Results
We apply our state tracking and fault diagnosis approach to a simulation model of a real-world system: a selected subset of the power supply control unit of a spacecraft. This subsystem, shown in Figure 5, consists of an input Sig_in from a battery and five outputs: (1) output Sig_out1, directly connected to Load A; (2) output Sig_out2, connected to Load B, which is controlled by relay K1; (3) output Sig_out3, connected to Load C, which is controlled by the hot backup DC/DC module (DC/DC_h); (4) output Sig_out4, connected to Load D, which is controlled by both the hot backup DC/DC module and relay K2; and (5) output Sig_out5, connected to Load E, which is controlled by the cool backup DC/DC module (DC/DC_c). An external actuator issues commands cmd1, cmd2, cmd3 and cmd4 to control relays K1 and K2 and the cool backup module. In our experiment, six sensors are used to collect observations: the system input Sig_in and the system outputs Sig_out1, Sig_out2, Sig_out3, Sig_out4 and Sig_out5.
The schematics of the hot backup and cool backup DC/DC modules are presented in Figure 6. Four components, main1, main2, spare1 and spare2, are voltage converting units. Figure 6a shows the hot backup DC/DC module, in which the selector component selects the higher of the voltages from main1 and spare1 as the output. In the cool backup DC/DC module (see Figure 6b), the external commands cmd3 and cmd4 switch relays K3 and K4 and determine the output voltage.
Figure 5.
Selected subset of the power supply control unit.
This selected subset of the power supply control unit involves nine components: four voltage converting units, four relays and one selector. More specifically, a voltage converting unit has five discrete modes: nominal (M1), overvoltage protection (M2), overvoltage protection failure (M3), voltage conversion failure (M4) and unknown (M5). Table 3 gives the mode transition matrix for this component. The relays and the selector also have five discrete modes each; for lack of space, their transition matrices are not shown in this paper. Therefore, the system can potentially operate in roughly 5^9 ≈ 2 × 10^6 distinct modes at each time-step, and the number of full system trajectories grows exponentially as the system evolves.
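The mode count above is simple combinatorics, assuming the nine components choose their modes independently; a quick check:

```python
# Nine components, five discrete modes each.
modes_per_component, components = 5, 9
system_modes = modes_per_component ** components
print(system_modes)  # 1953125, i.e. roughly 2 x 10^6 candidate system modes per step
# After t time-steps the number of candidate trajectories is (5**9)**t,
# which is why exhaustive enumeration quickly becomes hopeless.
```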
Table 3.
The transition matrix for voltage converting unit.
Source Mode | Transition Constraint | P(M1) | P(M2) | P(M3) | P(M4) | P(M5)
---|---|---|---|---|---|---
M1 | sig_in < 97 | 0.989 | 0 | 0 | 0.01 | 0.001
M1 | 97 ≤ sig_in ≤ 103 | 0.979 | 0 | 0 | 0.02 | 0.001
M1 | sig_in > 103 | 0 | 0.959 | 0.02 | 0.02 | 0.001
M2 | sig_in < 97 | 0.989 | 0 | 0 | 0.01 | 0.001
M2 | 97 ≤ sig_in ≤ 103 | 0.979 | 0 | 0 | 0.02 | 0.001
M2 | sig_in > 103 | 0 | 0.959 | 0.02 | 0.02 | 0.001
M3 | - | 0 | 0 | 1 | 0 | 0
M4 | - | 0 | 0 | 0 | 1 | 0
M5 | - | 0 | 0 | 0 | 0 | 1
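The guarded rows of Table 3 can be encoded and queried as below. This is a sketch; the names and the encoding are ours, and only the probabilities and guard boundaries come from the table (rows for source mode M1 shown).

```python
# (source mode, guard) -> [P(M1), P(M2), P(M3), P(M4), P(M5)]
VCU_MATRIX = {
    ("M1", "low"):  [0.989, 0, 0, 0.01, 0.001],
    ("M1", "mid"):  [0.979, 0, 0, 0.02, 0.001],
    ("M1", "high"): [0, 0.959, 0.02, 0.02, 0.001],
}

def guard(sig_in):
    """Map the input voltage onto the three transition constraints of Table 3."""
    if sig_in < 97:
        return "low"
    if sig_in <= 103:
        return "mid"
    return "high"

def successor_distribution(mode, sig_in):
    """Return the successor-mode distribution for the matching table row."""
    probs = VCU_MATRIX[(mode, guard(sig_in))]
    return dict(zip(["M1", "M2", "M3", "M4", "M5"], probs))

dist = successor_distribution("M1", 100)  # mid guard: the unit mostly stays nominal
```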
Several groups of simulations were conducted on a test set that includes the nominal scenario and the occurrence of faults in one, two or three components at the same time. The experimental results refer to a C++ implementation of the diagnostic algorithm running on a personal computer with an Intel(R) Core(TM) i3 CPU at 2.27 GHz and 4 GB RAM (Lenovo, Kunshan, China), and are presented in the following subsections.
5.1. Basic Results
The aim of these simulations is to evaluate the space and time performance results of our state tracking and fault diagnosis method. For these simulations, nominal, single fault, two faults and three faults are considered, and the number of particles is set to 500.
The good experimental time complexity is confirmed by looking at the computational cost in terms of CPU time. Table 4 reports the average and maximum CPU time for single-step mode estimation. The average time increases as more faults are injected; however, even with three faults the CPU time remains low enough that we claim the algorithm can run on-line.
For the belief state search problem, the number of expanded nodes is used to measure the space performance of the algorithms. Moreover, since the consistency function usually consumes substantial computing resources, the number of calls to the consistency function is also employed to qualitatively evaluate the time performance. On the basis of these considerations, the average and maximum values of both measures are reported in Table 5. As expected, the two values increase slowly as more faults are considered, and generally reach their maximum at the fault detection time, because a large number of nodes are expanded to check consistency with the observation at that time.
Table 4.
Time statistics with single-step mode estimation (confidence 95%).
Scenario | Average Time (ms) | Max Time (ms)
---|---|---
Nominal | 29.725 ± 0.634 | 85.46
Single Fault | 67.873 ± 1.770 | 143.68
Double Faults | 93.661 ± 5.198 | 328.65
Three Faults | 103.759 ± 6.866 | 423.57
Table 5.
The sizes of expanded nodes and the called times of consistency function per time step (confidence 95%).
Scenario | Expanded Nodes (Average) | Expanded Nodes (Max) | Consistency Calls (Average) | Consistency Calls (Max)
---|---|---|---|---
Nominal | 96.538 ± 1.6221 | 116 | 8.2000 ± 0.1384 | 18
Single Fault | 103.455 ± 2.8798 | 151 | 14.4000 ± 0.6728 | 46
Double Faults | 108.727 ± 3.0792 | 202 | 22.7000 ± 1.8675 | 110
Three Faults | 115.545 ± 5.3045 | 273 | 24.5000 ± 2.1935 | 128
5.2. Number of Particles
In this subsection, we conduct a set of simulations in the nominal scenario with 10 time-steps to test the sensitivity of our approach’s performance to the number of particles. The number of particles varies from 100 to 1000, and typical experimental results are shown in Figure 7. As can be seen from this figure, the performance of our method depends on the number of particles: as the number of particles increases, more belief states and trajectories are obtained, and the time consumption also goes up.
Figure 7.
Effect of the number of particles.
5.3. Comparison with Other Algorithms
We now compare the performance of our approach with two k best methods, the k best BFTE algorithm and the k best CDA* algorithm, with respect to the following aspects: (1) estimation accuracy; (2) the time consumed as the number of obtained belief states increases; and (3) the sensitivity of each approach’s performance to the number of estimation time-steps.
Figure 8.
Probability density maintained over time.
As discussed earlier, k best methods choose the k best trajectories or belief states to track the system dynamics, and the value of k determines their estimation accuracy and performance. Blackmore et al. [35] pointed out that the estimation accuracy depends on whether k is large enough for the real belief state distribution. In other words, when the distribution over belief states is relatively flat, k best methods may lose the solution. Compared to these methods, our approach is robust in this situation. Generally speaking, the number of particles directly determines the estimation accuracy of our approach: assuming 100 particles are used, the loss of belief state probability density is less than 1% at each time-step; if the number of particles increases to 500 or 1000, the loss is reduced to less than 0.2% or even 0.1%. Moreover, the number of belief states obtained at each time-step adapts dynamically to the current belief state distribution: for a relatively concentrated distribution, our algorithm only needs to calculate a small number of belief states, whereas more belief states are obtained when the desired distribution is relatively flat.
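One illustrative reading of the quoted accuracy figures (our interpretation, not the paper's derivation): with n unit-weight particles, a successor mode whose expected particle count rounds to zero contributes its probability mass to the lost density.

```python
def lost_mass(n, probs):
    """Probability mass lost to quantization when n unit-weight particles
    approximate a distribution: successors that round to zero particles
    are dropped entirely."""
    return sum(p for p in probs if round(n * p) == 0)

# The relay's 0.989 / 0.01 / 0.001 split:
print(lost_mass(100, [0.989, 0.01, 0.001]))   # 0.001 -> 0.1% of density lost
print(lost_mass(1000, [0.989, 0.01, 0.001]))  # 0     -> nothing lost
```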
Figure 8 shows the maintained belief state probability density over many cycles. Since the k best CDA* algorithm only improves computational performance, not estimation accuracy, compared to the k best BFTE algorithm, only the k best CDA* algorithm is shown in this figure. It is easy to see that the reduction in probability density is exponential in the number of time-steps for both LUG and the k best algorithm, but the rate of decay is clearly slower for our proposed method.
In the second experiment, we ran a set of simulations with 10 time-steps to show the time consumption of the different algorithms as the predefined parameter varies. In Table 6, N_p, N_b and N_t denote the number of particles, belief states and trajectories, respectively. It is easy to see that the k best CDA* algorithm has better time performance than the k best BFTE algorithm. Moreover, the difference between the proposed approach and the k best CDA* algorithm can be analyzed for cases in which the same number of belief states N_b or trajectories N_t is obtained. When the k value is smaller than 3, the time performance of the k best CDA* algorithm is good enough. However, when the k value is set to 10, the k best CDA* algorithm captures four belief states at a time cost of 4185.69 ms, whereas the proposed method with N_p = 100 captures eight belief states and consumes only 263.87 ms. Therefore, the proposed method achieves higher estimation accuracy with less time, and this advantage becomes increasingly apparent as the number of obtained belief states or trajectories grows.
Table 6.
The time consumption of different algorithms (confidence 95%).
N_p | N_b | N_t | LUG Time (ms) | k | N_b | BFTE Time (ms) | k | N_b | CDA* Time (ms)
---|---|---|---|---|---|---|---|---|---
100 | 8 | 35 | 263.87 ± 0.21 | 1 | 1 | 51.97 ± 0.08 | 1 | 1 | 27.38 ± 0.03
200 | 8 | 67 | 276.70 ± 0.25 | 2 | 2 | 156.89 ± 0.12 | 2 | 2 | 82.15 ± 0.05
300 | 8 | 117 | 277.38 ± 0.32 | 3 | 3 | 489.86 ± 0.43 | 3 | 3 | 194.76 ± 0.45
400 | 8 | 117 | 289.23 ± 0.47 | 4 | 3 | 809.56 ± 0.54 | 4 | 3 | 375.23 ± 0.49
500 | 8 | 152 | 292.08 ± 0.63 | 5 | 3 | 1352.88 ± 0.61 | 5 | 3 | 587.18 ± 0.58
600 | 9 | 174 | 541.17 ± 0.67 | 6 | 3 | 2307.51 ± 0.65 | 6 | 3 | 961.42 ± 0.69
700 | 13 | 280 | 559.83 ± 0.71 | 7 | 3 | 3573.87 ± 0.73 | 7 | 3 | 1276.36 ± 0.76
800 | 24 | 337 | 640.24 ± 0.77 | 8 | 3 | 4922.32 ± 0.82 | 8 | 3 | 2058.53 ± 0.71
900 | 25 | 408 | 638.71 ± 0.81 | 9 | 3 | 6214.18 ± 1.03 | 9 | 3 | 3468.74 ± 0.92
1000 | 28 | 419 | 692.72 ± 0.85 | 10 | 4 | 8446.02 ± 1.15 | 10 | 4 | 4185.69 ± 0.97
Figure 9.
The performance results for different time step.
Figure 9 shows the performance results as the estimated time-step increases in the third experiment. The number of particles in our approach is set to 100 and 500, while for both BFTE and CDA* the value of k is set to 1 and 5. As can be seen from Figure 9a, the time consumption of BFTE and CDA* with k = 5 increases sharply, while the other curves rise smoothly. Since Figure 9a cannot clearly show the differences among our approach and BFTE and CDA* with single estimation, Figure 9b zooms in on these curves. This figure shows that our approach with 500 particles consumes more time than with 100 particles, in line with our previous analysis in Section 5.2. Similarly, we can also see that the single-estimate results for BFTE and CDA* outperform our approach.
In summary, the k best BFTE and CDA* algorithms are well suited to systems with a relatively concentrated belief state distribution, while our approach can be applied to systems with either concentrated or flat distributions. Moreover, our approach has better estimation accuracy and outperforms the k best BFTE and CDA* algorithms when a sufficiently large number of belief states must be maintained.
6. Conclusions
In this paper, we propose a novel simulation-based fault diagnosis approach that models systems as concurrent probabilistic automata and applies LUG to their state tracking and fault diagnosis. Moreover, the MC technique is introduced into this scheme, so our algorithm is anytime and can balance accuracy against time efficiency by varying the number of particles. On the one hand, the particles control the breadth of the best-first A* search and maintain the most likely belief states; on the other hand, the tagged particles can be used to generate system evolution trajectories. Finally, this paper analyzes the sample impoverishment problem resulting from the MC technique, and employs a novel recursive one-step look-ahead strategy to mitigate it and improve estimation accuracy.
The method has been successfully applied to a non-trivial real-world example: a power supply control unit of a spacecraft. The experimental results show satisfactory performance in terms of estimation accuracy and time and space complexity. It is also possible to diagnose the system without making simplifying assumptions such as the single-fault assumption. In future work, we will introduce some variance into our predefined probability transition matrix, because the fixed transition probabilities in our experiments are relatively simple. Moreover, distributed diagnosis techniques can efficiently decrease the computational complexity for large-scale complex systems, so this is another research direction for the future.