Optimal Condition-Based Maintenance Strategy for Multi-Component Systems under Degradation Failures

This paper proposes a condition-based maintenance strategy for multi-component systems under degradation failures. Maintenance decisions are based on two objectives, the minimum long-run average cost rate (LACR) and the maximum residual useful lifetime (RUL), respectively. The aim is to determine the optimal monitoring interval and critical level for multi-component systems under each optimization objective. A preventive maintenance (PM) action is triggered when the degradation of a component exceeds the corresponding critical level. The paper then discusses the relationship between the critical level and the monitoring interval with regard to the LACR and RUL. Methods are also proposed to determine the optimal monitoring interval and critical level under the two decision models. Finally, the impact of the maintenance decision variables on the LACR and RUL is discussed through a case study. A comparison with a conventional age-based maintenance policy shows that the new model achieves a noticeably lower LACR (31.6 versus 52.7).


Introduction
Maintenance decision-making is crucial to reducing cost and enhancing industrial productivity [1,2]. Longstanding practice has shown that condition-based maintenance (CBM) is more efficient than conventional scheduled maintenance, particularly for complex equipment [3]. CBM can effectively avoid high maintenance costs and potential hazards as well as improve the availability of sophisticated equipment [4,5]. So far, hundreds of CBM models have been reported in the literature [6][7][8][9][10], and most of them assume that the system can be completely restored to its original state. This assumption is not realistic in practice and needs to be relaxed in the CBM framework [11].
Basically, a maintenance policy relies on two main decisions: when to take preventive or corrective maintenance actions, and how to implement preventive maintenance actions [12]. In CBM modeling, the degradation process of most engineering systems can generally be divided into two phases [13]:
• The first phase is called the normal working phase, where no obvious deviation from the normal operating state is observed. The end point of this phase is called the critical level of the system.
• The second phase is referred to as the failure delay period, in which a defect may be initiated and progressively develop into a true failure. That means the system is in a defective stage but still working, and the end point of this phase is referred to as the failure level.
In most existing works, only the characteristics of the second phase are taken into account rather than those of the whole degradation process [14]. If the characteristics of the two phases were jointly considered in degradation modeling and maintenance optimization, unnecessary maintenance actions and potential failures could be effectively avoided [15]. The choice of the initiation point of the second phase (i.e., the critical level) affects decision making in a CBM policy, and choosing a level that is too high or too low is detrimental to the system's reliability and production [16]. In current industrial applications, the critical level and the maintenance cycle are usually determined from maintainer experience or through software simulation. Owing to a lack of supporting model analysis, the selected critical level is generally rather conservative, which leads to unnecessary maintenance actions and costly system operation [17].
Although some investigations [9,18] considered the impact of the critical level and the monitoring interval on the maintenance policy, most of them assumed perfect maintenance, or imperfect maintenance for single-component systems [19]. Those assumptions do not account for the uncertainty of maintenance actions and are not suited to practical applications [11].
Motivated by the above, this paper proposes a practical maintenance policy based on the deterioration characteristics of complex systems. We develop a model that explores the impact of the critical level and the monitoring interval on the maintenance policy, and we address the problem of determining the optimal decision variables. We also analyze the effect of different soft failure cost rates on the overall operation cost, and we discuss the effects of the critical level and monitoring interval on system reliability.
The remainder of the paper is organized as follows. Section 2 presents a brief system description, the maintenance model, and the derivation of the health quantities. Section 3 constructs two CBM models in which the maintenance decision is made based on the LACR and the RUL, respectively, and presents the methods for finding optimal solutions and estimating parameters. Section 4 gives a case example to illustrate the proposed method and analyzes the related CBM decision variables. Section 5 summarizes the paper.

System Description
The system consists of $m$ components, i.e., $\{1, \ldots, i, \ldots, m\}$, as shown in Figure 1. The overall system is subject to condition monitoring at discrete time points $\theta, 2\theta, \ldots, n\theta$, with $n = 1, 2, \ldots$ and $\theta \in \mathbb{R}^+$, due to the limitations of sensor techniques and system structures. The system fails if the degradation level of any component $i$ exceeds the corresponding level $L_i$, which is called the degradation failure level. Each component $i$ experiences a degradation failure denoted by $F_i$, and the degradation failures are numbered from $F_1$ to $F_m$. If the deterioration of a component has reached its failure level, the system will continue operating with lower performance and is likely to cause high quality loss in production.
The system is operational under varying working conditions and subject to a continuous accumulation of wear and tear. For a given component $i$, it is assumed that deterioration is a stochastic process described by a detectable stochastic scalar variable $f_i(t)$. Based on the actual deterioration behavior of component $i$, the random deterioration increments in a time interval are considered to be non-negative, so that the degradation level is monotonically increasing, as shown in Figure 2. Accordingly, the deterioration increment of component $i$ between $t_{j-1}$ and $t_j$ is described by an exponential random variable $\Delta_{(j-1,j)} f_i(t)$, $j \in \mathbb{R}^+$. The degradation path of component $i$ is expressed in terms of $b_i$ and $c_i$, which are constant parameters for component $i$, and $\alpha$, which is a random variable with sample space $M$. The degradation of the system has the following characteristic:
• The random variable parameter $\alpha$ follows an exponential probability density function.
Energies 2020, 13, 4346
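The increment assumption above can be illustrated with a short simulation. This is only a sketch under stated assumptions: increments are taken as independent exponential variables with a common mean per unit time step, and the names `simulate_path` and `mean_increment` are illustrative rather than part of the paper's model. The parametric path involving $b_i$, $c_i$ and $\alpha$ is not reproduced here.

```python
import random


def simulate_path(n_steps, mean_increment, seed=None):
    """Cumulative degradation built from i.i.d. exponential increments (assumed)."""
    rng = random.Random(seed)
    level = 0.0
    path = [level]
    for _ in range(n_steps):
        # expovariate takes the rate, i.e. 1 / mean; draws are non-negative
        level += rng.expovariate(1.0 / mean_increment)
        path.append(level)
    return path


# Hypothetical values, chosen only for illustration.
path = simulate_path(n_steps=10, mean_increment=0.5, seed=1)
print(path)
```

Because every increment is non-negative, the simulated level never decreases, which matches the monotone behavior described for Figure 2.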
A degradation failure cost rate $E$ is charged to the overall system if the system fails, and a costly corrective maintenance (CM) action is then performed. In practice, a maintenance action is unlikely to restore the components to their exact initial operating condition. Therefore, a practical model is presented in the following section to estimate the real status of the system.

Maintenance Model
Since the deterioration process of the system is a stochastic process, any maintenance action will take the system condition to an approximately initial operating state. It is shown in the literature that the deterioration level of the system after a maintenance action can be random [11]. We therefore assume that the actual degradation path of component $i$ observed via condition monitoring, $y_i(t)$, is related to $f_i(t)$ through an error term $\varepsilon$, where $\varepsilon$ is the combined measurement and maintenance restoration error, $\varepsilon \sim N(\mu, \sigma^2)$.
In practice, a failure leads to large economic losses and delays in product processing [20,21]. To reduce this high maintenance cost, for each component $i$ a critical level $R_i$ is introduced so that PM actions are conducted at a monitoring epoch before the degradation level exceeds $L_i$ (with $R_i < L_i$), as shown in Figure 3. In this paper, the following decision process of the maintenance policy is adopted.

• If the degradation level of a component increases rapidly and exceeds both the critical level and the failure level at the monitoring epoch $n\theta$, that is $f_i(t) > L_i$, a CM action is taken.
• If the degradation of a component increases gradually and the degradation level lies between $R_i$ and $L_i$ at the monitoring epoch $n\theta$, that is $R_i \le f_i(t) < L_i$, a PM action is taken.
• If the degradation level of a component remains below the critical level at the monitoring epoch $n\theta$, that is $f_i(t) < R_i$, no maintenance action is taken until it reaches the critical level $R_i$ or the failure level $L_i$ at a subsequent monitoring epoch.
From the above decision process, the monitoring interval and the critical level are the decision variables in our maintenance model and need to be optimized.
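The three-way decision rule above can be sketched as a small function. The function name `decide` and the numeric levels in the usage lines are hypothetical. The case where the level equals $L_i$ exactly is not covered by the rule as stated, so treating it as CM is an assumption of this sketch.

```python
def decide(level, critical, failure):
    """Return the action at a monitoring epoch for one component.

    level    -- observed degradation level at the epoch
    critical -- critical level R_i
    failure  -- failure level L_i (must exceed R_i)
    """
    if not critical < failure:
        raise ValueError("critical level must be below failure level")
    if level >= failure:
        # Assumption: equality with L_i is handled as a failure.
        return "CM"
    if level >= critical:
        return "PM"
    return "none"


# Hypothetical levels: R_i = 5.0, L_i = 10.0
print(decide(12.0, 5.0, 10.0))  # CM
print(decide(7.0, 5.0, 10.0))   # PM
print(decide(3.0, 5.0, 10.0))   # none
```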

Health Quantities Derivation
In this part, we focus on deriving the probability of a PM action or CM action within a monitoring interval. The starting point is the probability $P(t \le t_Y)$, where $t_Y$ is the passage time of the threshold $Y$. On the basis of the distribution function of $\varepsilon$, the distribution of $y_i(t)$ at time $t_Y$ for a given $\alpha$ can be written, following [14], in terms of $F_\varepsilon$, the distribution function of $\varepsilon$, and $\Phi(\cdot)$, the normal distribution function. Integrating over all possible $\alpha$, $P(t \le t_Y)$ is obtained as an integral weighted by $g(\alpha)$, the probability distribution function of $\alpha$, over the sample space $M$ of $\alpha$. According to Equation (3), the distribution of the time to failure can be expressed accordingly.
If a maintenance action is set up at monitoring point $n\theta$, this means that from $(n-1)\theta$ to $n\theta$ the actual degradation level $y_i(t)$ has increased to a value between the critical level $R_i$ and the failure level $L_i$ (in which case PM is performed) or has exceeded $L_i$ (in which case CM is performed). Therefore, the probability that CM or PM occurs at time point $t_k$ follows from Equation (3). Based on Equations (5) and (6), the incidence of CM or PM is obtained for a given $\alpha$. Integrating over all possible $\alpha$, Equation (8) can be derived from Equation (5). In this way, the probability of a PM or CM action being performed in a monitoring interval is obtained. Next, we turn to the optimal maintenance decision-making model.
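The idea of conditioning on $\alpha$, using the normal distribution of $\varepsilon$, and then integrating over $\alpha$ can be sketched numerically. The sketch rests on several assumptions that do not come from the text: the error is taken as additive, $y_i(t) = f_i(t) + \varepsilon$; the mean path `mean_path` has a hypothetical form $b + \alpha t^{c}$; and the probabilities are marginal at time $t$, ignoring whether maintenance occurred at earlier epochs. The integral over $\alpha$ is replaced by a Monte Carlo average, and all parameter values except the failure level 9.7 and the interval 48 quoted in the case study are invented for illustration.

```python
import math
import random


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mean_path(t, alpha, b, c):
    """Hypothetical degradation path; the paper's exact form is not assumed."""
    return b + alpha * t ** c


def exceed_prob(t, threshold, b, c, mu, sigma, alpha_rate,
                n_samples=5000, seed=0):
    """Monte Carlo estimate of P(y(t) >= threshold), averaged over alpha."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        alpha = rng.expovariate(alpha_rate)  # exponential alpha
        # Given alpha, y(t) = mean_path + eps with eps ~ N(mu, sigma^2)
        # (additive error is an assumption of this sketch).
        z = (threshold - mean_path(t, alpha, b, c) - mu) / sigma
        total += 1.0 - normal_cdf(z)
    return total / n_samples


def action_probs(t, critical, failure, **params):
    """Marginal probabilities that the level at time t falls in the PM or CM range."""
    p_cm = exceed_prob(t, failure, **params)
    p_pm = exceed_prob(t, critical, **params) - p_cm
    return p_pm, p_cm


params = dict(b=0.0, c=1.0, mu=0.0, sigma=0.5, alpha_rate=10.0)
p_pm, p_cm = action_probs(t=48, critical=5.0, failure=9.7, **params)
print(p_pm, p_cm)
```

Using the same seed for both thresholds means the same $\alpha$ samples are reused, so the PM probability obtained by subtraction cannot be negative.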

Maintenance Decision-Making
This section proposes two maintenance decision models, which are based on the minimum LACR and the maximum RUL, respectively. A method to determine the optimal monitoring interval and critical level is discussed and the unknown model parameters are also estimated.

Maintenance Decision Based on Minimum LACR
The LACR comprises the CM cost, the PM cost, and the soft failure cost. To begin with, the LACR for component $i$ is obtained by dividing the total expected operation cost by the expected cycle length. The total operation cost $TC$ is calculated by adding the costs for all components, using the mean PM and CM costs for component $i$, $C_{PM,i}$ and $C_{CM,i}$, together with the average soft failure cost for component $i$ in a monitoring interval. The expected cycle length for the system can be solved using Equation (9). The optimization model based on minimum LACR then yields $R^* = (R_1^*, \ldots, R_m^*)$, the set of optimal critical levels for each component, and $\theta^*$, the optimal monitoring interval under the minimum LACR model.
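The ratio of expected cost to expected cycle length can be approximated by simulation for a single component. The sketch below is not the paper's closed-form model. It assumes exponential increments per unit time step, restoration to level zero after every maintenance action, and a soft failure cost that accrues at rate `e_rate` for each step spent at or above the failure level before the next inspection. All function names and numeric values are hypothetical, apart from the failure level 9.7 taken from the case study.

```python
import random


def simulate_cycle(theta, critical, failure, mean_inc, c_pm, c_cm, e_rate, rng):
    """Simulate one renewal cycle; return (cycle cost, cycle length in steps)."""
    level, t, soft_time = 0.0, 0, 0
    while True:
        level += rng.expovariate(1.0 / mean_inc)
        t += 1
        if level >= failure:
            soft_time += 1  # step spent in the failed (degraded) state
        if t % theta == 0:  # monitoring epoch
            if level >= failure:
                return c_cm + e_rate * soft_time, t
            if level >= critical:
                return c_pm, t


def estimate_lacr(theta, critical, failure, mean_inc, c_pm, c_cm, e_rate,
                  n_cycles=2000, seed=0):
    """Long-run average cost rate as total cost divided by total time."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0
    for _ in range(n_cycles):
        cost, length = simulate_cycle(theta, critical, failure, mean_inc,
                                      c_pm, c_cm, e_rate, rng)
        total_cost += cost
        total_time += length
    return total_cost / total_time


common = dict(theta=20, critical=5.0, failure=9.7, mean_inc=0.2,
              c_pm=100.0, c_cm=500.0)
low = estimate_lacr(e_rate=1.0, **common)
high = estimate_lacr(e_rate=30.0, **common)
print(low, high)
```

Since both runs share the same seed, they see identical degradation paths, so any difference between `low` and `high` comes only from the soft failure cost rate.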

Maintenance Decision Based on Maximum RUL
An RUL-based maintenance policy is particularly effective in the context of CBM [22]. The aim of this policy is to establish an optimization model based on the maximum RUL of the system. RUL is defined as the time left before the system fails [23]. Let $f_i$ be the failure rate of component $i$, and let $S_i$ denote the reliability of component $i$ over the time interval $[0, n\theta]$. The mean RUL of component $i$ is then derived from these quantities. In a similar way to the LACR model, the RUL optimization model yields $R^* = (R_1^*, \ldots, R_m^*)$, the set of optimal critical levels under maximum RUL, and $\theta^*$, the optimal monitoring interval under maximum RUL.
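The chain from failure rate to reliability to mean RUL can be illustrated in discrete time. This is a generic sketch, not the paper's derivation: it assumes a per-interval conditional failure probability (called `hazards` here, a hypothetical name), builds the reliability at each monitoring epoch as a running product, and computes the mean residual life by summing the conditional survival probabilities. Survival beyond the last listed interval is treated as zero, so a short list underestimates the RUL.

```python
def reliability(hazards):
    """Return S where S[k] is the probability of surviving the first k intervals."""
    s = [1.0]
    for h in hazards:
        s.append(s[-1] * (1.0 - h))
    return s


def mean_rul(hazards, n, theta):
    """Mean residual life at epoch n, in time units, for interval length theta."""
    s = reliability(hazards)
    if s[n] == 0.0:
        return 0.0
    # E[N - n | N > n] = sum over k >= n of S[k] / S[n] for a discrete lifetime N
    return theta * sum(s[k] for k in range(n, len(s))) / s[n]


# Constant hazard 0.5 gives a geometric lifetime with mean 2 intervals.
print(mean_rul([0.5] * 60, n=0, theta=1.0))
```

With a constant hazard the lifetime is memoryless, so the same value is obtained at any epoch well before the truncation point.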

Optimal Solution for Monitoring Interval and Critical Level
The failure level can be obtained from maintenance records and is usually fixed [24]. The critical level and the monitoring interval in Equations (13) and (16) are the decision variables to be determined. Note that the monitoring interval is assumed to be constant, that is, all intervals are equal, so that the critical level can be obtained for each component. Solving the minimum LACR model gives the optimal values of $R^*$ and $\theta^*$ for the system.
In the same manner, the optimal values of $R^*$ and $\theta^*$ for each component based on the longest RUL can be determined by solving the maximum RUL model. This method is especially useful for equipment whose degradation increases only slightly over a long period of time [25]. Such equipment generally adopts fixed monitoring intervals for reasons of economy and convenience [26].
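One simple way to search for the decision variables is an exhaustive grid search over candidate monitoring intervals and critical levels. The text does not state which search method the authors used, so the sketch below is only one possible approach. The objective `toy_cost` is a made-up placeholder standing in for an LACR evaluation, and the grids are hypothetical.

```python
from itertools import product


def grid_search(objective, thetas, criticals):
    """Return the (theta, critical) pair minimizing objective, and its value."""
    best = min(product(thetas, criticals), key=lambda p: objective(*p))
    return best, objective(*best)


def toy_cost(theta, critical):
    """Placeholder cost surface (not the paper's LACR model)."""
    return 100.0 / theta + 0.02 * theta + (critical - 6.0) ** 2


thetas = range(10, 121, 10)
criticals = [4.0, 5.0, 6.0, 7.0, 8.0]
best, value = grid_search(toy_cost, thetas, criticals)
print(best, value)
```

For the maximum RUL model the same routine can be reused by passing the negated RUL as the objective.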

Application and Results
The purpose of this section is to describe how the proposed maintenance model can be used in the maintenance optimization of a wind turbine transmission system, through a simple example whose characteristics are described in Section 2. The transmission unit consists of six bearing components. We monitor the crack length and width of each bearing and use the product of the crack length and width as the measurement of bearing degradation. The measurement path of one bearing is illustrated in Figure 4.
Based on statistics from previous maintenance records, the model parameters are estimated using the maximum likelihood estimation method [27]. The fitted values are given in Table 1, and the maintenance data are given in Table 2. Using the approach presented in Section 3, for the minimum LACR model the optimal monitoring interval is computed as $\theta = 48$, and the optimal critical levels for each component are listed in Table 3. For the maximum RUL model, the optimal monitoring interval is computed as $\theta = 43$, and the corresponding optimal critical levels are presented in Table 4. The maximum RUL model thus gives a much more conservative control policy than the minimum LACR model.
The soft failure cost rate is an essential part of the maintenance cost and has a potential impact on the process of establishing a CBM model. In the proposed maintenance strategy, the soft failure cost rate of the system is not constant under different working conditions. Changes in the soft failure cost rate can affect the optimal decision variables (i.e., monitoring interval and critical level) in the CBM model. Therefore, the influence of the failure cost rate on the decision variables is analyzed in the following subsection.

Influence of Failure Cost Rate on Optimal LACR
This subsection analyzes the influence of the failure cost rate on the average operation cost and the related decision variables. Three situations are considered: (1) a system incurring a low failure cost rate; (2) a system incurring a medium failure cost rate; (3) a system incurring a high failure cost rate.
For the three failure cost rates (i.e., $E = 1, 15, 30$), Figure 5 shows the evolution of the cost surface as a function of the monitoring interval and the critical level. To find the optimal decision variables under the different policies, we investigated the minimum maintenance cost rate for different values of the monitoring interval and critical level using Equations (13) and (16). The optimal critical level and monitoring interval obtained for the transmission system are presented in Figure 5. The minimum operation cost rate obtained is denoted by $C_{opt}$.
As shown in Figure 5, all surfaces are concave shaped, which lends itself to an optimization procedure. When the soft failure cost rate is negligible or close to zero (Figure 5a), the optimal monitoring interval is 85 with $C_{opt} = 21.7$. When the soft failure cost rate increases to a medium range (Figure 5b), the optimal monitoring interval decreases to 55 and $C_{opt}$ rises to 33.2. When the soft failure cost rate reaches a high level (Figure 5c), $C_{opt}$ increases further to 55.6, and the optimal monitoring interval drops to 48.
It is clear that as the soft failure cost rate increases, the optimal monitoring interval and critical level both decrease, whereas the minimum LACR of the transmission system increases. For a system incurring a higher soft failure cost rate, preventive maintenance actions need to be set up at a lower degradation level. This means that earlier and more frequent maintenance actions are needed as the soft failure cost rate rises.

The Relationship between the RUL and Decision Variables
For the RUL-based maintenance policy, system reliability depends on the monitoring epoch and the degradation level. We investigated the relationship between the optimal RUL and the decision variables. Figure 6 illustrates the evolution of the reliability of the transmission system as a function of the monitoring interval and the critical level at the 25th monitoring epoch. A medium failure cost rate is considered to show how the reliability of the transmission system changes with the decision variables. The other maintenance costs remain the same as in the previous case.
As indicated in Figure 6, the monitoring interval and the critical level both have a strong negative influence on the reliability of the transmission system, and the monitoring interval has the more noticeable impact. From this, the optimal range of the decision variables can be obtained. As shown in Figure 6, higher system reliability is found at smaller values of the monitoring interval and critical level; that is, when more maintenance actions are set up, the system is more likely to maintain a higher level of reliability. However, according to the findings above, smaller values of the decision variables are associated with a very high failure cost rate. Consequently, it is especially important to balance LACR and RUL when both objectives are considered for complex production systems.
Compared with the typical age-based policy [22], which carries out repairs every 55 days, with a failure level of 9.7, a medium failure cost rate of 12, and the same parameter configuration as above, the strategy recommended by the proposed model yields an LACR of 31.6. This is noticeably lower than the 52.7 obtained under the classical age-based maintenance policy.
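The size of the improvement can be made explicit with a short calculation of the relative LACR reduction, using only the two values quoted above. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Fraction by which the proposed cost rate undercuts the baseline."""
    return (baseline - proposed) / baseline


reduction = relative_reduction(baseline=52.7, proposed=31.6)
print(f"{reduction:.1%}")  # prints 40.0%
```

On these figures the proposed policy lowers the LACR by roughly 40% relative to the age-based policy.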

Conclusions and Future Work
In this study, a CBM model is proposed in which the maintenance decision is made based on the minimum LACR and the maximum RUL. Preventive maintenance is triggered when the degradation level exceeds a critical level. Because a degradation failure incurs costly production loss, the critical level and the monitoring interval need to be optimized. The optimal critical level of each component and the optimal monitoring interval are determined by minimizing the LACR and by maximizing the RUL, and a method for finding these optimal values is presented. Lastly, the influence of the soft failure cost rate on the LACR, and the relationship between the RUL and the decision variables, are investigated through a case study of a multi-component system.
The model developed can be applied to a variety of engineering systems with a large number of non-identical components that have different random degradation characteristics (e.g., generator sets and turbine sets). Our future research will focus on competing failure modes under economic dependency among components. In that setting, a dynamic monitoring interval will also be investigated.