1. Introduction
Maintenance aims to minimize or avoid the performance degradation and unplanned downtime caused by the inevitable degradation of machines over their service times and to keep them in good working condition. Lack of maintenance is one of the key reasons that has led to catastrophic consequences and significant economic losses in industry [
1]. Therefore, an appropriate maintenance strategy is essential for companies to minimize their maintenance expenses and maximize profit [
2].
Among the commonly used maintenance methods, risk-based inspection/maintenance (RBI/RBM), which aims to reduce the overall risks that may result in unexpected failures [
3], is attracting more attention. Many companies report that RBI/RBM’s systematic approach to inspection and maintenance can not only improve safety but also reduce operating cost [
4].
Although RBI/RBM can quantify the risk associated with a particular process activity, the quantitative risk assessment factor cannot be updated during the life of a process [
5]. To overcome this, a dynamic risk assessment method has been developed, which can capture the time-dependent behavior of the system risk profile.
Bayes theorem is a popular algorithm used for dynamic risk assessment. Meel and Seider [
6] applied Bayes theorem to dynamically update the estimates of accident probabilities, using near misses and incident data collected from similar systems. The hierarchical Bayesian model was applied to estimate the updated risk profile in [
7,
8]. In [
9], Bayes theorem was combined with a bow-tie model to assess and update the risk profile in a sugar refinery.
Most existing dynamic risk assessment methods, as reviewed above, only use statistical data, i.e., count data of accidents or near misses (precursors) from similar systems, to update the estimated risk profile. A major drawback of these methods is that one must wait until accidents or near misses (precursors) occur before updating the estimation of the risk profile. Besides, statistical data are collected from similar systems, reflecting population characteristics but not fully accounting for the individual features of the target system.
Industry today seeks maintenance solutions based on real-time health monitoring of assets [
10,
11]. Instead of collecting data from similar systems, the condition monitoring data give information on the individual degradation process of the target system, providing an opportunity to update the risk factor before actual failure occurs. Therefore, introducing condition monitoring data in dynamic risk assessment could be a beneficial complement to the statistical data, towards a condition monitoring-based dynamic risk assessment [
12].
There have been a few attempts in the direction of applying condition monitoring data into the dynamic risk assessment. Zadakbar et al. [
13] proposed a multivariate risk-based fault detection and diagnosis method using Kalman filter to estimate the degradation states based on condition monitoring data and calculated the residual between the measured data and estimated data. Similar works were carried out by the same research group using different condition monitoring techniques, such as the control chart technique [
14], principal component analysis (PCA) [
15], and the particle filter [
16]. Zeng and Zio [
17] developed a dynamic risk assessment model by combining statistical failure data and condition monitoring data. However, most of the aforementioned methods do not involve consequence analysis models when calculating the risk profiles. In reference [
13,
14,
15,
16], the “consequence” in the risk model was replaced by a severity score for a fault designed for the applied fault detection method, i.e., control chart technique, PCA. In [
17], the consequence analysis model was specially designed for a high-flow safety system in a tank.
Therefore, a literature review was carried out on the consequence analysis model, which is an important component of the dynamic risk assessment model. API RP 581 [
18], published by American Petroleum Institute, is a risk-based maintenance standard applied for petrochemical equipment, such as tanks, compressors, and pumps. The consequences calculation method in [
18] includes qualitive, semi-qualitive, and static quantitative models, lacking a fully quantitative consequence model that is applied on condition monitoring data. To quantify process losses, several researchers proposed different types of loss functions. Khan et al. [
19] carried out a comprehensive review of these loss functions and discussed their application in detail. In the paper, an inverted beta loss function was applied for units requiring both the upper and lower boundaries of an operating variable; inverted normal loss function was used for units only requiring the upper boundary of an operating variable; and multivariate inverted loss function was employed for units requiring multivariate monitoring. More works on applying loss function in risk assessment can be found in [
20,
21,
22].
However, irrespective of the good performance of the aforementioned works, some challenges remain in maintenance of machines in real industrial applications.
On one hand, dynamic risk-based maintenance research [
13,
14,
15,
16,
19,
20,
21,
22] focuses on lowering the overall risk of the system, without paying much attention to the early detection of faults. In the dynamic risk model, the fault/failure probability is closely related to fault detection. Therefore, the fault/failure probability calculation model needs to be improved by applying advanced fault detection methods. Meanwhile, integrating a loss function into the risk model is recommended, as it helps estimate the process economic risk and assists in effective operational decision-making.
On the other hand, fault detection research [
23,
24,
25,
26] mainly focused on the development of an advanced model to detect an incipient fault, without considering the optimum time for maintenance. Most of the models [
23,
24,
25] were tested on simulated or experimental data only, lacking the evaluation on real industrial data.
To address the aforementioned challenges, this paper proposes a system-wide health indicator using a dynamic risk profile, which takes into account both the financial loss and the fault probability based on condition monitoring data. In our methodology, the fault probability is calculated by robust Mahalanobis distance, measuring the difference between the condition monitoring data and the data under healthy conditions, and it is presented as a system-wide feature from a sparse autoencoder fault detection model, enabling early fault detection. The value of the health indicator is presented in financial cost, which assists in effective operational decision-making in a process system. The threshold setting is based on various literature reviews on anomaly detection [
27,
28,
29] and an industrial standard (API 581) [
18] by the American Petroleum Institute. The main benefits of the proposed method include early detection of faults, fault analysis, suggesting maintenance time, safety improvement, and minimum interruption of operation.
The rest of the paper is organized as follows. The proposed health indicator and the algorithms employed in this paper are explained in
Section 2. Based on the proposed methodology, the performance of the health indicator was evaluated on multivariate industrial data obtained from a pump and a compressor; see
Section 3 and
Section 4 respectively. Finally, the conclusion and the main contributions are highlighted in
Section 5.
2. Methodology
This section proposes a methodology to calculate a system health indicator, as presented in
Figure 1. It consists of two phases: (1) an offline phase to develop the fault detection model and risk model, and (2) an online phase to detect faults and make maintenance decisions. The detailed working mechanisms of these two phases are explained in the following two sub-sections.
In the proposed methodology, the system health indicator is represented by the dynamic risk profile of the system, which is calculated from the probability of fault and the consequence of fault. The probability of fault is derived from the offline training data under normal conditions, and the consequence of fault is measured by financial loss using an inverted normal loss function. The health indicator has two thresholds. The fault detection threshold indicates that a fault has been detected in the system and advises operators to take a priority response (i.e., schedule a proper maintenance time). The shutdown threshold indicates that a shutdown is required to ensure the safe running of the machine and avoid unplanned shutdown. The health indicator can be used for fault detection and for taking supervisory decisions to activate appropriate safety systems in real time.
This work is a further development of our previous work [1], where we developed a fault detection and isolation scheme based on a sparse autoencoder (SAE) but did not take financial factors into consideration. This paper merges the system-wide feature, which is obtained by the SAE and the Mahalanobis distance (MD), with the financial consequences into a comprehensive system health indicator. The proposed health indicator shows the health condition of the system to the operator in real time and informs the operator when an incipient fault is detected, how the system is degrading, what type of fault the machine suffers from, and when the deadline for maintenance is.
2.1. Offline Phase—Model Development and Threshold Calculation
The offline phase builds three models, namely the fault detection model, the probability of fault model, and the consequence of fault model, using historical data under normal conditions. Based on these models, the risk threshold of the health indicator is calculated.
2.1.1. Fault Detection Model Training
2.1.1.1. Calculation of a System-Wide Feature
In this methodology, the fault detection model is built using the SAE. The MD calculates the statistical difference of the residual outputs between actual inputs and reconstructed outputs. Then, the statistical difference is used as a system-wide feature for fault detection and estimation.
The SAE is a special type of feedforward unsupervised neural network. An advantage of an unsupervised neural network is that it can detect anomalies without labelled data. In contrast, for the conventional artificial neural network (ANN) and other supervised ANN methods, the inputs and outputs of the model must be identified by the operators. The SAE can be divided into an encoder and a decoder. The encoder extracts features from the input data, which is case-dependent and is defined in
Section 3 and
Section 4. The decoder reconstructs the autoencoder state back to the input data space. The structure of SAE can be found in
Figure 2.
The SAE reconstructs the input vectors in the output layer using Equation (1).
$$\hat{x} = f(x; W, b) \quad (1)$$
where $x$ is the input vector, and $\hat{x}$ is the reconstructed output. $f(\cdot)$ is the nonlinear function of the SAE, which predicts the output $\hat{x}$ based on the input $x$, using parameters $W$ and $b$. The details of the SAE mechanisms can be found in [30,31].
Then, the multivariate residuals $r$ between the input variables $x$ and the reconstructed outputs $\hat{x}$ can be calculated by Equation (2).
$$r = x - \hat{x} \quad (2)$$
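The reconstruction and residual steps can be illustrated with a minimal sketch. It assumes a single hidden layer with a sigmoid encoder and a linear decoder; the function names, layer sizes, and weight values are hypothetical and untrained, and the sparsity penalty (which only affects training) is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation used by the (assumed) encoder layer."""
    return 1.0 / (1.0 + math.exp(-z))


def reconstruct(x, w_enc, b_enc, w_dec, b_dec):
    """Forward pass of a one-hidden-layer autoencoder."""
    # Encoder: hidden activations from the input vector
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_enc, b_enc)]
    # Decoder: linear map from the hidden state back to the input space
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_dec, b_dec)]


def residual(x, x_hat):
    """Element-wise residual between measurement and reconstruction."""
    return [xi - xh for xi, xh in zip(x, x_hat)]


# Hypothetical, untrained parameters: 3 inputs -> 2 hidden units -> 3 outputs
w_enc = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b_enc = [0.0, 0.1]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0, 0.0]

x = [0.2, 0.8, 0.5]
x_hat = reconstruct(x, w_enc, b_enc, w_dec, b_dec)
r = residual(x, x_hat)
```

In practice the weights come from training on healthy data, so that the residuals stay small under normal conditions and grow when the process departs from them.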
To detect the existence of a fault, an integrated feature is developed based on the multivariate residuals. In our methodology, the Mahalanobis distance (MD) is applied. The MD is a unitless distance measure that accounts for the correlations among variables. It has been successfully applied to anomaly detection in wind turbine data [
27,
28,
29], and early fault detection in pumps [
30].
In this paper, a robust MD [29,32] is calculated using Equation (3). It measures the statistical difference of the residuals between the actual inputs and their estimates, and it is used as an integrated system-wide feature.
$$MD = \sqrt{(r - \mu)^{T} C^{-1} (r - \mu)} \quad (3)$$
where $MD$ is the Mahalanobis distance, $r$ is the multivariate residual, $\mu$ is the robust measure of central tendency (the median), and $C^{-1}$ is the inverse covariance matrix calculated from the sample population through the minimum covariance determinant estimator [29].
The MD considers all input variables of the process data. Any anomalies in the process data can cause changes in this integrated system-wide feature.
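A minimal sketch of this system-wide feature is given below. It is a simplification rather than the estimator used in the paper: the centre is the per-variable median, but the ordinary sample covariance stands in for the minimum covariance determinant estimate. The function names and the synthetic residuals are assumptions for illustration only.

```python
import numpy as np


def fit_reference(residuals):
    """Estimate the centre and inverse covariance from healthy residuals."""
    center = np.median(residuals, axis=0)  # robust centre (median)
    # Simplification: sample covariance in place of the MCD estimate
    cov = np.cov(residuals, rowvar=False)
    return center, np.linalg.inv(cov)


def mahalanobis(residuals, center, inv_cov):
    """Mahalanobis distance of each residual row from the reference centre."""
    diff = np.atleast_2d(residuals) - center
    # Quadratic form diff_i^T inv_cov diff_i for every row i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(d2)


rng = np.random.default_rng(0)
healthy = rng.normal(size=(500, 3))  # synthetic healthy residuals
center, inv_cov = fit_reference(healthy)

md_healthy = mahalanobis(healthy, center, inv_cov)
# A residual shifted strongly in one variable yields a much larger distance
md_faulty = mahalanobis(center + np.array([5.0, 0.0, 0.0]), center, inv_cov)
```

Because the distance is a single scalar per observation, a shift in any one variable, or a change in the correlation between variables, moves the same feature.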
2.1.1.2. Estimation of Probability Density Function (PDF)
The distribution of the system-wide feature (MD) influences the subsequent threshold and probability of fault calculations. Reference [27] used the Weibull distribution for fault detection threshold estimation, while reference [29] applied a kernel distribution. Reference [15] used the normal distribution for the probability of fault calculation, while reference [33] applied a kernel distribution. In this paper, to make our model more accurate, we add a procedure to select the best-fitting probability distribution model for the system-wide feature among four well-known candidate distributions. The candidate distributions and their probability density functions (PDF) are summarized in Table 1.
In Table 1, the generalized extreme value distribution (GEVD) includes three types of extreme value distributions (see Figure 3), which are type I (shape parameter $k = 0$), type II ($k > 0$), and type III ($k < 0$).
The parameters of the distributions can be estimated using maximum likelihood estimation. A goodness-of-fit test should then be performed to quantify how well the selected distribution matches the original data. Several statistical tests are commonly used for this purpose, including the Kolmogorov–Smirnov (K–S) test, the Anderson–Darling test, the Cramer–von Mises test, and the chi-squared test.
In this paper, the K–S test [34] is applied to determine whether a hypothesized distribution fits a data set. This test can be used for small and large sample sizes. The K–S test compares the sample's empirical cumulative distribution function (ECDF), $F_n(x)$, with the CDF of a selected distribution, $F(x)$. If $F_n(x)$ deviates too much from $F(x)$, the null hypothesis is rejected. In this paper, the null hypothesis is that the selected distribution fits the sample data. The outputs of the K–S test are $h$ and a $p$ value. $h$ is the decision on whether to reject the null hypothesis: $h = 1$ if the test rejects the null hypothesis at the 5% significance level, and $h = 0$ otherwise. The $p$ value is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. Its range is [0, 1]. Small values of $p$ cast doubt on the validity of the null hypothesis.
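The K–S decision can be sketched with the standard library alone. The example below computes the K–S statistic by hand and compares it with the large-sample 5% critical value of approximately 1.358 divided by the square root of the sample size. This is an illustrative assumption, not the routine used in the paper: only a normal candidate is fitted, the two data sets are synthetic, and estimating the parameters from the same sample makes this critical value conservative.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x: check both sides
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_decision(sample, cdf, c_alpha=1.358):
    """Return (h, D); h = 1 rejects the fit at roughly the 5% level."""
    d = ks_statistic(sample, cdf)
    return int(d > c_alpha / math.sqrt(len(sample))), d


n = 200
std_normal = NormalDist()
# Synthetic samples: evenly spaced quantiles of a normal and an exponential law
normal_like = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]

results = {}
for name, data in [("normal_like", normal_like), ("skewed", skewed)]:
    fitted = NormalDist(fmean(data), stdev(data))  # candidate distribution
    results[name] = ks_decision(data, fitted.cdf)
```

Repeating the same check for each candidate distribution and keeping the one with the smallest statistic (or largest $p$ value) is one way to select the best-fitting model.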
2.1.1.3. Calculation of Thresholds
After obtaining the PDF, the fault detection threshold $MD_{FD}$ can be obtained from the PDF of $MD$ for a given confidence level $\alpha$ ($\alpha = 0.99$ was applied [27]) by solving Equation (4):
$$\int_{-\infty}^{MD_{FD}} f(x)\,dx = \alpha \quad (4)$$
In Equation (4), $f(\cdot)$ is the PDF of the selected probability distribution of $MD$. The details of probability distribution selection can be found in Section 2.1.1.2.
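Finding the threshold amounts to inverting the CDF of the selected distribution at the chosen confidence level. The sketch below does this by bisection so that it works for any non-decreasing CDF; the function name, the search bracket, and the normal distribution used as the fitted model are assumptions for illustration.

```python
from statistics import NormalDist


def solve_threshold(cdf, alpha=0.99, lo=0.0, hi=1e3, tol=1e-9):
    """Find x with cdf(x) = alpha by bisection (cdf must be non-decreasing).

    The bracket [lo, hi] is assumed to contain the solution.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid  # the threshold lies above mid
        else:
            hi = mid  # the threshold lies at or below mid
    return 0.5 * (lo + hi)


# Hypothetical fitted distribution of the healthy MD values
fitted = NormalDist(mu=2.0, sigma=0.5)
md_fd = solve_threshold(fitted.cdf, alpha=0.99)
```

With a 99% confidence level, about 1% of healthy observations are expected to fall above the threshold, which is the false alarm rate implied by this choice.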
If no corrective action is taken after a fault is detected, the online MD value will eventually exceed the shutdown threshold. The value of this threshold is based on the acceptance criteria of the specific process system. Alternatively, according to [
20], when not enough information is available, the shutdown threshold of a measurement can be defined as 4 times larger than its maximum operating value.
where
is a combination matrix that covers the cases in which any variable in the training data becomes 4 times larger than its normal operating value. The value of
can also be given by the end users based on the acceptable criteria of the specific process system.
is the reconstructed output.
and
are adopted directly from the training phase as calculated in Equation (3).
2.1.2. Build Probability of Fault (PoF) Model
The probability of fault (PoF) indicates the probability that a fault has occurred in a piece of equipment. Hence, the PoF of the equipment should be calculated using measurements that reflect its health status and degradation process. Usually, only one key variable is selected to calculate the probability, and this variable is expected to be the most sensitive one in the system [35]. In our methodology, the system-wide feature (MD) is selected for the PoF calculation.
A visual depiction of the proposed probability of fault (PoF) model is presented in Figure 4. The PoF ranges from 0 to 1. When no fault is detected in the system, $PoF < 0.5$; when a fault or an anomaly is detected, $PoF \geq 0.5$; if no maintenance actions are taken after a fault is detected, the PoF keeps increasing towards 1.
Note that $PoF = 1$ means the system definitely has a fault, rather than a catastrophic failure. The fault detection threshold is an MD value for early fault detection. When the integrated system-wide feature (MD) is greater than this threshold, there should be an early fault in the system. However, considering that the fault detection method has a false alarm rate, we set the PoF at the fault detection threshold to 0.5. If the system suffers a fault, its performance degrades further, and the PoF of the system-wide feature increases towards 1.
In Figure 4, the blue curve is the cumulative distribution function (CDF) of a standard distribution, and it can be expressed as:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad (6)$$
where $f(\cdot)$ is the probability density function, which depends on the selected distribution.
The offline
model (black curve in
Figure 4) is obtained by moving the
to a new position, with a horizontal movement distance equal to
. The
can be calculated by solving Equation (7).
where
is the probability density function depending on the selected distribution for the training data. The details of probability distribution selection can be found in
Section 2.1.1.2.
Therefore, the
is expressed in Equation (8):
where
is the fault detection threshold of the fault detection model, which was calculated during the training stage using Equation (4).
is the Mahalanobis distance in Equation (3).
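One possible reading of this shifted-CDF construction is sketched below. It assumes that the CDF of the fitted distribution is moved horizontally so that the probability of fault equals 0.5 exactly at the fault detection threshold; the helper name `make_pof`, the normal distribution, and the numerical values are hypothetical, and the paper's own Equations (7) and (8) may define the shift differently.

```python
from statistics import NormalDist


def make_pof(fitted_dist, md_fd):
    """Build a probability-of-fault curve from a fitted distribution.

    Assumption: the CDF is shifted by d so that pof(md_fd) == 0.5.
    """
    d = md_fd - fitted_dist.inv_cdf(0.5)  # horizontal shift distance

    def pof(md):
        return fitted_dist.cdf(md - d)

    return pof, d


# Hypothetical fitted distribution of healthy MD values and its 99% threshold
fitted = NormalDist(mu=2.0, sigma=0.5)
md_fd = fitted.inv_cdf(0.99)

pof, shift = make_pof(fitted, md_fd)
pof_at_threshold = pof(md_fd)      # 0.5 by construction
pof_far_above = pof(md_fd + 5.0)   # approaches 1 as MD keeps growing
```

Under this reading, healthy MD values map to probabilities well below 0.5, and the curve saturates at 1 once the MD is far beyond the detection threshold.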
When the probability of fault reaches 1, a fault has occurred in the system; however, as the fault is still in its early stage, the financial consequence is low. To make full use of the remaining life of the machine, we calculate a shutdown threshold using the method proposed in reference [2] and the industrial standard API 581 [18].
2.1.3. Build Consequence of Fault (CoF) Model Using Loss Function
The consequence of fault (CoF) quantifies the potential consequence of the fault scenario, and a loss function is employed to calculate it. In condition monitoring, the process loss starts to increase when the high-contributing process variables exceed their normal operation thresholds, and the maximum process loss is reached after any of the identified variables breaches the shutdown threshold [22]. The loss function used to quantify the process loss is the inverted normal loss function, which is the most widely used pattern for describing random variables [19]. The loss function is given by Equation (9) [19,22].
In Equation (9),
represents the estimated maximum financial loss based on the worst conditions and it can be calculated as:
where
is downtime production loss,
is the material cost,
is the labor cost,
is the technical support cost,
is the economic consequence of asset loss,
is the economic consequence of human health loss, and
can include environmental clean-up cost, indirect costs that represent secondary effects, etc.
In Equation (9),
represents the Mahalanobis distance calculated by Equation (3),
is the target value of variable when financial loss is zero [
19]. Hence,
in this case.
is the shape parameter, which defines the value of the process variable at which the maximum loss is reached. The shape parameter
is calculated using Equation (11) by the least squares method.
where
is the financial loss based on the worst case conditions,
is the
value when MD reaches the shutdown threshold
. The shutdown threshold is set to ensure the safe running of the machine and to avoid unplanned shutdown and high financial loss. In our case, when the performance feature exceeds its shutdown threshold, the degradation of a machine can increase rapidly, or its performance becomes very difficult to predict. In this circumstance, the machine is in a very dangerous situation, but there is still some time before the faulty machine reaches its maximum financial loss. In this case, the financial loss is set to 50% [
20] of
when MD reaches the shutdown threshold
, which means
.
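The consequence model can be sketched as follows, assuming the usual form of the inverted normal loss function, in which the loss equals the maximum loss multiplied by one minus a Gaussian-shaped term centred on the target value. With the single condition that the loss reaches 50% of the maximum at the shutdown threshold, the shape parameter has a closed form, so no least-squares fit is needed in this simplified case. The target of zero, the function names, and the monetary figures are assumptions for illustration.

```python
import math


def shape_parameter(md_sd, target=0.0, fraction=0.5):
    """Shape parameter for which the loss equals `fraction` of the maximum
    loss when the MD reaches the shutdown threshold `md_sd`.

    Derived from fraction = 1 - exp(-(md_sd - target)^2 / (2 * gamma^2)).
    """
    return abs(md_sd - target) / math.sqrt(-2.0 * math.log(1.0 - fraction))


def consequence(md, max_loss, gamma, target=0.0):
    """Inverted normal loss: zero at the target, saturating at max_loss."""
    return max_loss * (1.0 - math.exp(-((md - target) ** 2) / (2.0 * gamma ** 2)))


max_loss = 100_000.0  # hypothetical estimated maximum financial loss
md_sd = 12.0          # hypothetical shutdown threshold in MD

gamma = shape_parameter(md_sd)
loss_at_shutdown = consequence(md_sd, max_loss, gamma)
loss_at_target = consequence(0.0, max_loss, gamma)
```

The loss grows slowly near the target and then accelerates, which matches the idea that an early fault carries a low financial consequence.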
In our case, we assume that critical faults can be detected by the condition monitoring and fault detection system. Therefore, when no fault has been detected, the estimated maximum financial loss is given an initial value, which is composed of the inspection cost and the production loss caused by a shutdown inspection. When a fault is detected, the fault analysis scheme analyzes the cause of the fault and infers the fault type; hence, the estimated maximum financial loss is updated according to the possible fault type.
2.1.4. Calculate Health Indicator and Threshold
The system health indicator is determined by risk, which depends on two factors: the probability of occurrence of a fault leading to an unwanted event ($PoF$) and the consequence of the event ($CoF$). The system health indicator $R$ is given by:
$$R = PoF \times CoF \quad (12)$$
The operation of complex industrial processes is often subjected to multiple constraints to prevent catastrophic failure. These constraints are set at fault detection threshold and shutdown threshold for critical process variables.
The fault detection threshold can be calculated at the offline phase using the initial value of
:
where
is the fault detection threshold of MD using Equation (4).
is the Mahalanobis distance calculated by Equation (7).
is the initial estimated maximum financial loss before a fault is detected.
is the shape parameter of the loss function, calculated by Equation (11).
2.2. Online Phase—Fault Detection and Decision Making
To monitor the system health in real time, the online phase calculates a health indicator based on the dynamic risk profile of the system. The two key parameters of the risk profile, the probability of fault and the consequence of fault, are updated by applying condition monitoring data to the offline models. In the online phase, after a fault is detected, the fault analysis system helps operators deduce the possible fault type. The estimated maximum financial loss and the shutdown threshold can then be updated for the specific fault. Comparing the real-time value of the health indicator with the fault detection threshold and the shutdown threshold provides guidance for operators on maintenance decision making.
2.2.1. Calculate Probability of Fault (PoF)
In the online fault detection phase, the monitoring data are fed to the SAE model trained offline. The MD at online monitoring stage is calculated as
where
and
are adopted directly from the training phase as calculated in Equation (3).
is the residual between the actual measurement values
and the reconstructed output
obtained using the trained SAE fault detection model.
The
at online monitoring stage can be calculated as:
where
is the MD at online monitoring stage.
is the fault detection threshold presented in MD, and the value of
is given in Equation (4).
is the Mahalanobis distance calculated by Equation (7).
2.2.2. Calculate Consequence of Fault (CoF)
The consequence of an unwanted event can be largely influenced by the fault type of the machine. In this section, the fault type is identified by fault analysis using a two-dimensional statistic contribution map, which stacks multiple observations (time points) into one image to clearly illustrate the contribution of the variables over the entire faulty data time series.
The $Q$ statistic (also called the squared prediction error, SPE) is widely used in process control for condition monitoring data [36,37,38]. The traditional $Q$ statistic contribution plot can be calculated by Equation (19) [39]:
The conventional contribution plot is a one-dimensional plot, which only examines the contributions at one time point (one observation), so multiple contribution plots are needed to illustrate multiple observations in time series data. In contrast, a two-dimensional contribution map [40] stacks multiple observations into one image to clearly illustrate the contribution of the variables over the entire faulty data time series, which enables the fast identification of faulty variables within large heterogeneous data sets. Therefore, in our methodology, a two-dimensional $Q$ statistic contribution map is applied for data analysis.
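A minimal sketch of such a map is shown below. It assumes the common definition in which each variable's contribution to the squared prediction error at one time point is its squared residual, so the contributions at a time point add up to the SPE for that observation. The function names and the small data set are hypothetical.

```python
def q_contributions(x, x_hat):
    """Per-variable contributions to the squared prediction error
    for a single observation (assumed to be the squared residuals)."""
    return [(xi - xh) ** 2 for xi, xh in zip(x, x_hat)]


def contribution_map(observations, reconstructions):
    """Stack per-observation contributions into a 2-D map.

    Rows correspond to variables and columns to time points.
    """
    per_time = [q_contributions(x, xh)
                for x, xh in zip(observations, reconstructions)]
    return [list(row) for row in zip(*per_time)]


# Three time points, two variables; the second variable drifts away
observations = [[1.0, 2.0], [1.0, 2.5], [1.0, 4.0]]
reconstructions = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]

cmap = contribution_map(observations, reconstructions)
# SPE per time point: the column sums of the map
q_per_time = [sum(col) for col in zip(*cmap)]
```

Plotting the map as an image makes the drifting variable stand out as a row whose intensity grows over time.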
The fault analysis result updates the value of estimated maximum financial loss via Equation (10).
Therefore, the consequence in the online phase is calculated as:
where the estimated maximum financial loss depends on the fault status. When no fault is detected, it retains its initial value, which is described in Section 2.1.3. When a fault is detected, it should be updated according to the inferred fault type.
2.2.3. Calculate System Health Indicator and Shutdown Threshold
Combining Equations (16) and (18), a general system-wide health indicator using dynamic risk profile at online monitoring stage is expressed as:
The shutdown threshold is calculated as
where
is the shutdown threshold of MD in Equation (5).
is the estimated maximum financial loss for a specific fault type. Alternatively, the shutdown threshold can be decided by operators according to the maximum risk that a company can take.
To keep the system risk within the acceptable range, two constraints are set on the health indicator: the fault detection threshold and the shutdown threshold. These constraints can be decided by operators or calculated using Equations (13) and (20). The health indicator can be used for fault detection and for taking supervisory decisions to activate appropriate safety systems in real time. If the health indicator at the monitoring stage is below the fault detection threshold, the system is healthy. If the health indicator exceeds the fault detection threshold and continues to do so, operators are advised to take a priority response, for example, ordering inspection equipment, buying maintenance materials, and scheduling a proper maintenance time. The fault analysis scheme calculates the contribution of each measurement, infers the possible fault types, and suggests the maintenance planning. If the health indicator exceeds the shutdown threshold, the system is in a dangerous condition, and the automatic safety system (i.e., the emergency shutdown system) is activated to avoid unplanned shutdown.
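The decision logic described above can be sketched in a few lines. The function names, the returned labels, and the treatment of values exactly at a threshold are assumptions for illustration; the numerical values are hypothetical.

```python
def health_indicator(pof, cof):
    """Risk-based health indicator: probability of fault times consequence."""
    return pof * cof


def maintenance_action(risk, risk_fd, risk_sd):
    """Map the current health indicator to a suggested response.

    Assumption: a value equal to a threshold counts as exceeding it.
    """
    if risk >= risk_sd:
        return "shutdown"              # dangerous condition
    if risk >= risk_fd:
        return "schedule maintenance"  # fault detected, priority response
    return "healthy"


# Hypothetical thresholds and readings, all in the same currency unit
risk_fd, risk_sd = 500.0, 50_000.0
readings = [(0.10, 1_000.0), (0.60, 5_000.0), (1.00, 60_000.0)]
actions = [maintenance_action(health_indicator(p, c), risk_fd, risk_sd)
           for p, c in readings]
```

A practical implementation would typically require the indicator to stay above the fault detection threshold for several consecutive samples before raising an alarm, in line with the requirement that the exceedance be continuous.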
The fault detection threshold of the system-wide feature is expressed in Mahalanobis distance and does not consider any financial factors. It is obtained at the 99% confidence level of the training data, which means that we assume 99% of the training data are healthy and 1% are anomalies [30]. The fault detection threshold of the health indicator is calculated from this MD threshold and takes financial factors into consideration.
The shutdown threshold of the system-wide feature is likewise expressed in Mahalanobis distance and can be adjusted by operators according to system requirements. The shutdown threshold of the health indicator is calculated from this MD shutdown threshold and takes financial factors into consideration to help operators make maintenance decisions.
5. Conclusions
In this paper, a system-wide health indicator is proposed using condition monitoring-based dynamic risk assessment, providing maintenance solutions based on real-time health monitoring of assets. The methodology combines two advanced fault detection methods, a sparse autoencoder and the robust Mahalanobis distance, which enables the health indicator to detect a fault at its incipient stage and estimate the financial loss. The proposed health indicator presents the system's risk in dollars, making it effective for operational decision-making in a process system. To evaluate the performance of the proposed indicator, two case studies were carried out with multivariate industrial data obtained from a pump and a compressor. The results show that the indicator was able to track the degradation of the system with a dynamically updated process risk at each sampling instant. In both cases, the health indicator identified the faults at their incipient stages, before the measured signals showed obvious changes. The fault analysis scheme can analyze the contribution of each measurement and infer the possible fault types to assist maintenance planning. In the second case in particular, after the indicator exceeded the suggested shutdown threshold, many other measurements (i.e., the vibration of the stage 1–2 DE bearing, vibration of the stage 3 NDE bearing) were influenced by the faulty bearing, which indicated that the system had deteriorated to a dangerous point. In this case, the proposed health indicator suggested an appropriate shutdown time before the system suffered severe damage.
The main benefits of the proposed method include early detection of faults, fault analysis, suggesting maintenance time, safety improvement, and minimum interruption of operation. In summary, the main contributions of this paper include:
- (1)
A system-wide health indicator has been developed using condition-based dynamic risk assessment. The proposed health indicator presents the system's risk in dollars, making it easier for operators to make maintenance decisions. In addition, the health indicator shows the health condition of the system to the operator in real time and informs the operator when an incipient fault is detected, how the system is degrading, what type of fault the machine suffers from, and when the deadline for maintenance is.
- (2)
The probability of a fault is calculated based on the application of state-of-the-art fault detection models, the SAE and the MD. To the authors' knowledge, this is the first time a study has obtained the fault probability from a single system-wide feature calculated as an MD value, instead of using multiple measurements of a system. Compared with other statistical measurements, such as Hotelling's $T^2$ and the Euclidean distance, the MD is a better way to calculate the probability of fault. The value of Hotelling's $T^2$ is much higher than the MD (nearly squared). When using Hotelling's $T^2$, the value can increase rapidly to a very high level after a fault appears. This makes a fault more obvious in a fault detection process; however, it is hard to convert such a rapidly changing, high-valued statistic into a fault probability. In contrast, the value of the Euclidean distance is much more moderate. However, it is not as sensitive as the MD for early fault detection in our cases.
- (3)
The proposed health indicator is evaluated on multivariate industrial data from a pump and a compressor. This methodology can also be applied to the health assessment of other types of machines, such as turbines and motors. In addition, our experience of processing the industrial data set can benefit relevant readers.